Data Lineage Analysis Based on MaxCompute InformationSchema

Published: 2024/8/23


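I. Scenario Analysis
When operating and managing a data platform in practice, the number of tables tends to grow very large as more business data is onboarded and more data applications are built. Data administrators therefore want to use metadata analysis to understand the lineage between tables and, from that, the upstream and downstream dependencies of the data.
This article describes how to derive the lineage of a given table from the input and output tables recorded for each job ID in MaxCompute InformationSchema.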
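II. Design Approach
MaxCompute Information_Schema provides tasks_history, which holds the detail records of jobs that access tables. Its job ID (inst_id), input_tables, and output_tables fields record the upstream and downstream dependencies of each table, and table lineage can be derived by analyzing these three fields.
1. Using the job history of a given day, read the input_tables, output_tables, and job ID fields from tasks_history, then compute the upstream and downstream dependencies of each table within that time range.
2. Infer the lineage from the upstream and downstream dependencies of each table.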
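The two steps above can be sketched outside SQL as follows. This is a minimal illustration, not the article's implementation: it assumes job records have already been fetched as (inst_id, input_tables, output_tables) tuples and that the table lists are bracketed, comma-separated strings such as "[p.a,p.b]", which is what the bracket-stripping in the SQL below implies. The names Edge, parse_table_list, and explode_edges are hypothetical. Unlike the SQL, the sketch also splits the output list and drops jobs that have no input tables.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Edge(NamedTuple):
    """One lineage edge: input_table --(inst_id)--> output_table."""
    input_table: str
    inst_id: str
    output_table: str


def parse_table_list(raw: str) -> List[str]:
    """Split a bracketed list such as "[p.a,p.b]" into table names.

    Assumption: the field is a comma-separated list wrapped in [ ];
    optional double quotes around each name are stripped as well.
    """
    inner = (raw or "").strip().strip("[]")
    names = [part.strip().strip('"') for part in inner.split(",")]
    return [name for name in names if name]


def explode_edges(jobs: Iterable[Tuple[str, str, str]]) -> List[Edge]:
    """Turn (inst_id, input_tables, output_tables) rows into lineage edges."""
    edges = []
    for inst_id, input_tables, output_tables in jobs:
        outputs = parse_table_list(output_tables)
        if not outputs:
            # Jobs that write no table contribute nothing to lineage.
            continue
        for src in parse_table_list(input_tables):
            for dst in outputs:
                edges.append(Edge(src, inst_id, dst))
    return edges
```

Each resulting edge corresponds to one row of the query result in the examples that follow: an input table, the job that read it, and the table that job wrote.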
三、方案實現方法
Example 1:
(1) The following SQL queries the upstream and downstream dependencies of tables by job ID:

select t2.input_table,
       t1.inst_id,
       replace(replace(t1.output_tables,"[",""),"]","") as output_table
from information_schema.tasks_history t1
left join (
    -- strip the leading and trailing [ ] from the table list, then split it into one row per input table
    select trans_array(1,",",inst_id,replace(replace(input_tables,"[",""),"]","")) as (inst_id,input_table)
    from information_schema.tasks_history
    where ds = '20190902'
) t2
on t1.inst_id = t2.inst_id
where (replace(replace(t1.output_tables,"[",""),"]","")) <> ""
order by t2.input_table
limit 1000;

The result is shown in the figure below:

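(2) From the result, the input tables, output tables, and connecting job ID of each table can be derived, which together form the lineage of that table.
The lineage diagram is shown in the figure below: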

Each connecting line represents a job ID. The line starts at the input table, and the arrow points to the output table.
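Read this way, the query result is a directed graph, and the full upstream lineage of a table can be found by walking the arrows backwards. The sketch below illustrates that idea under the assumption that the result rows are available as (input_table, inst_id, output_table) tuples. The function names are hypothetical and not part of MaxCompute.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple


def build_upstream_index(
    edges: Iterable[Tuple[str, str, str]]
) -> Dict[str, Set[str]]:
    """Map each output table to the set of tables it was built from."""
    upstream = defaultdict(set)
    for input_table, _inst_id, output_table in edges:
        upstream[output_table].add(input_table)
    return upstream


def all_upstream(table: str, upstream: Dict[str, Set[str]]) -> Set[str]:
    """Breadth-first walk against the arrows, collecting every ancestor."""
    seen: Set[str] = set()
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, ()):
            # Skip tables already visited so cyclic dependencies terminate.
            if parent not in seen and parent != table:
                seen.add(parent)
                queue.append(parent)
    return seen
```

Downstream lineage works the same way with the index reversed, mapping each input table to the tables written from it.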
Example 2:
The following approach analyzes lineage by partition, in combination with DataWorks:
(1) Design the schema of the result table

CREATE TABLE IF NOT EXISTS dim_meta_tasks_history_a (
    stat_date    STRING COMMENT 'Statistics date',
    project_name STRING COMMENT 'Project name',
    task_id      STRING COMMENT 'Job ID',
    start_time   STRING COMMENT 'Start time',
    end_time     STRING COMMENT 'End time',
    input_table  STRING COMMENT 'Input table',
    output_table STRING COMMENT 'Output table',
    etl_date     STRING COMMENT 'ETL run time'
);

(2) Key parsing SQL

SELECT '${yesterday}' AS stat_date
      ,'project_name' AS project_name
      ,a.inst_id AS task_id
      ,start_time AS start_time
      ,end_time AS end_time
      ,a.input_table AS input_table
      ,a.output_table AS output_table
      ,GETDATE() AS etl_date
FROM (
    SELECT t2.input_table
          ,t1.inst_id
          -- output tables of the job, with the surrounding [ ] removed
          ,replace(replace(t1.output_tables,"[",""),"]","") AS output_table
          ,start_time
          ,end_time
    FROM (
        -- keep only the most recent INSERT OVERWRITE job for each output table list
        SELECT *
              ,ROW_NUMBER() OVER(PARTITION BY output_tables ORDER BY end_time DESC) AS rows
        FROM information_schema.tasks_history
        WHERE operation_text LIKE 'INSERT OVERWRITE TABLE%'
        AND (start_time >= TO_DATE('${yesterday}','yyyy-mm-dd')
             AND end_time <= DATEADD(TO_DATE('${yesterday}','yyyy-mm-dd'),8,'hh'))
        AND (replace(replace(output_tables,"[",""),"]","")) <> ""
        AND ds = CONCAT(SUBSTR('${yesterday}',1,4),SUBSTR('${yesterday}',6,2),SUBSTR('${yesterday}',9,2))
    ) t1
    LEFT JOIN (
        -- split the input table list into one row per input table
        SELECT TRANS_ARRAY(1,",",inst_id,replace(replace(input_tables,"[",""),"]","")) AS (inst_id,input_table)
        FROM information_schema.tasks_history
        WHERE ds = CONCAT(SUBSTR('${yesterday}',1,4),SUBSTR('${yesterday}',6,2),SUBSTR('${yesterday}',9,2))
    ) t2
    ON t1.inst_id = t2.inst_id
    WHERE t1.rows = 1
) a
WHERE a.input_table IS NOT NULL;
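Two pieces of logic in this query are easy to misread: how the ds partition value and the time window are derived from '${yesterday}', and how ROW_NUMBER keeps one job per output table list. The sketch below restates both in Python for illustration only. It assumes '${yesterday}' is a yyyy-mm-dd string, as the SUBSTR offsets imply, and that rows are plain dictionaries. The function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple


def partition_and_window(yesterday: str) -> Tuple[str, datetime, datetime]:
    """Derive the ds partition and the [start, start + 8h] window.

    Mirrors CONCAT(SUBSTR(...)) for ds and DATEADD(..., 8, 'hh') for the
    upper bound; assumes `yesterday` looks like "2019-09-02".
    """
    day = datetime.strptime(yesterday, "%Y-%m-%d")
    return day.strftime("%Y%m%d"), day, day + timedelta(hours=8)


def latest_job_per_output(rows: Iterable[Dict]) -> List[Dict]:
    """Keep the row with the greatest end_time for each output_tables value.

    Equivalent in spirit to ROW_NUMBER() OVER (PARTITION BY output_tables
    ORDER BY end_time DESC) followed by a filter on row number 1.
    """
    latest: Dict[str, Dict] = {}
    for row in rows:
        key = row["output_tables"]
        if key not in latest or row["end_time"] > latest[key]["end_time"]:
            latest[key] = row
    return list(latest.values())
```

Keeping only the latest job per output means that a table rewritten several times in the window contributes a single set of lineage edges, taken from its most recent run.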

(3) Task dependencies

(4) Final lineage

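The lineage analysis above reflects my own approach and practice, and it still needs to be validated against real business scenarios. Please adapt the SQL to your own business requirements as needed. If you find anything handled incorrectly, feedback is welcome and I will make the corresponding adjustments.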
This article is original content from the Yunqi Community and may not be reproduced without permission.
