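[SQL Programming] Greenplum: tree structures, user-defined functions, avoiding repeated function calls, and handling the "function cannot execute on a QE slice" error (a full record of the optimization process)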
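1. Requirements
This is a POI application. The data was downloaded from 水經微圖 and is processed at the street (subdistrict) level, but the final POI records must also carry province, city, and county information, so the administrative division table is needed to fill in the missing levels.
2. Programming example
2.1 Implementing the tree structure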
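First, look at the tree-structured data: each row of data_divisions carries a "level", a parent_code, an area_code, and a "name".
WITH RECURSIVE table_name AS makes it possible to query tree-structured data recursively [pay special attention to the t0 and t1 tables here: t1 is the recursive CTE, and t0 is the base table joined back to it].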
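The recursive query described above can be sketched on a small throwaway table. This is only an illustration: the table name demo_divisions, the codes, and every name except 楓楊街道 are hypothetical sample values, and it assumes a Greenplum (or PostgreSQL) version with recursive CTE support, which the article itself relies on.

```sql
-- Hypothetical sample data: codes and ancestor names are made up for illustration.
CREATE TEMP TABLE demo_divisions (
    "level"     INT,
    parent_code TEXT,
    area_code   TEXT,
    "name"      TEXT
);

INSERT INTO demo_divisions ("level", parent_code, area_code, "name") VALUES
    (1, NULL,     '10',        'Province A'),
    (2, '10',     '1001',      'City B'),
    (3, '1001',   '100101',    'District C'),
    (4, '100101', '100101001', '楓楊街道'),
    (4, '100101', '100101002', 'Street E');  -- sibling, must not appear in the result

WITH RECURSIVE t1 AS (
    -- Anchor member: start at the street of interest.
    SELECT "level", parent_code, area_code, "name"
    FROM demo_divisions
    WHERE "name" = '楓楊街道'
    UNION ALL
    -- Recursive member: t0 is the base table, t1 holds the rows found so far;
    -- each step moves from a row to its parent.
    SELECT t0."level", t0.parent_code, t0.area_code, t0."name"
    FROM demo_divisions t0, t1
    WHERE t0.area_code = t1.parent_code
)
SELECT "level", "name"
FROM t1
ORDER BY "level";
```

With these sample rows the query returns four rows, levels 1 through 4 (Province A, City B, District C, 楓楊街道); the sibling Street E is not an ancestor and is left out. The recursion stops at the province because its parent_code is NULL and matches no area_code.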
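Verifying the result: starting from 楓楊街道, the query returns that row plus one row for each ancestor level above it.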
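2.2 User-defined function
Use STRING_AGG to concatenate the province, city, and county names into a single field [the function is the equivalent of MySQL's GROUP_CONCAT]: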
SELECT STRING_AGG ( "name", ',' ORDER BY "level" ) AS "divisions"
FROM (
    WITH RECURSIVE t1 AS (
        SELECT "level", parent_code, area_code, "name"
        FROM data_divisions
        WHERE "name" = '楓楊街道'
        UNION ALL
        SELECT t0."level", t0.parent_code, t0.area_code, t0."name"
        FROM data_divisions t0, t1
        WHERE t0.area_code = t1.parent_code
    )
    SELECT "level", "name" FROM t1
) t2;
Create the user-defined function:
-- Note: the function is declared IMMUTABLE as in the original post; since it reads a table,
-- STABLE is the volatility category the PostgreSQL documentation recommends.
CREATE OR REPLACE FUNCTION getdivisionsbyname ( TEXT ) RETURNS TEXT AS $BODY$
    SELECT STRING_AGG ( "name", ',' ORDER BY "level" ) AS "divisions"
    FROM (
        WITH RECURSIVE t1 AS (
            SELECT "level", parent_code, area_code, "name"
            FROM data_divisions
            WHERE "name" = $1  -- use the argument instead of a hard-coded name
            UNION ALL
            SELECT t0."level", t0.parent_code, t0.area_code, t0."name"
            FROM data_divisions t0, t1
            WHERE t0.area_code = t1.parent_code
        )
        SELECT "level", "name" FROM t1
    ) t2;
$BODY$ LANGUAGE SQL IMMUTABLE STRICT COST 100;
Test the function call:
SELECT getDivisionsByName('楓楊街道');
2.3 Using the function
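The data_address_point table has only 261 rows, yet the query took 119.451 s. The slowness clearly comes from calling the user-defined function several times per row (five times in the query below) 😢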
SELECT
    getdivisionsbyname(zone_name) || NAME AS "poi",
    SPLIT_PART( coordinates, ',', 1 ) AS "longitude",
    SPLIT_PART( coordinates, ',', 2 ) AS "latitude",
    NAME AS "address",
    SPLIT_PART( getdivisionsbyname(zone_name), ',', 1 ) AS "prov",
    SPLIT_PART( getdivisionsbyname(zone_name), ',', 2 ) AS "city",
    SPLIT_PART( getdivisionsbyname(zone_name), ',', 3 ) AS "district",
    SPLIT_PART( getdivisionsbyname(zone_name), ',', 4 ) AS "town"
FROM data_address_point;
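Calling the user-defined function only once per row and reusing its result brought the runtime down to 23.634 s, about one fifth of the original. The WITH query shown in section 3 has this shape: the function is evaluated once in the CTE, and the outer query only splits the resulting string.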
3. The error
In practice, using the function did not go smoothly. The first time the query was run, it failed with the error "function cannot execute on a QE slice because it accesses relation":
WITH t1 AS (
    SELECT getdivisionsbyname ( zone_name ) AS "divisions", coordinates, "name", poi_type
    FROM data_address_point
)
SELECT
    ROW_NUMBER ( ) OVER ( ORDER BY "name" ) AS "id",
    REPLACE ( divisions, ',', '' ) || "name" AS "poi",
    poi_type,
    SPLIT_PART( coordinates, ',', 1 ) AS "longitude",
    SPLIT_PART( coordinates, ',', 2 ) AS "latitude",
    NAME AS "address",
    SPLIT_PART( divisions, ',', 1 ) AS "prov",
    SPLIT_PART( divisions, ',', 2 ) AS "city",
    SPLIT_PART( divisions, ',', 3 ) AS "district",
    SPLIT_PART( divisions, ',', 4 ) AS "town"
FROM t1
> ERROR: function cannot execute on a QE slice because it accesses relation "public.data_divisions" (seg0 slice1 192.168.0.123:6000 pid=168995)
CONTEXT: SQL function "getdivisionsbyname" during startup
A UDF (user-defined function) cannot access any table when it runs on a segment. Because of the MPP architecture, each segment holds only part of the data, so a UDF executing on a segment is not allowed to read a table; otherwise it would compute its result from incomplete data. Greenplum supports another distribution policy for this case: replicated tables, where every segment holds a complete copy of the whole table. It can be set with the following command:
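ALTER TABLE table_name SET DISTRIBUTED REPLICATED;
Large tables are not suited to the replicated mode. Small tables that rarely change, such as code (lookup) tables, can use DISTRIBUTED REPLICATED, and query performance also improves noticeably.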
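Applied to this article's case, the table named in the error message is public.data_divisions, so that is the table to convert. The following is a sketch rather than a verified procedure: it assumes Greenplum 6 or later (where replicated tables are available) and that the gp_distribution_policy catalog exposes a policytype column in which 'r' marks a replicated table.

```sql
-- Convert the lookup table read by getdivisionsbyname into a replicated table.
ALTER TABLE public.data_divisions SET DISTRIBUTED REPLICATED;

-- Check the distribution policy.
-- Assumption: Greenplum 6+ catalog layout, policytype = 'r' means replicated.
SELECT localoid::regclass AS table_name, policytype
FROM gp_distribution_policy
WHERE localoid = 'public.data_divisions'::regclass;

-- Re-run a call that executes the function on the segments;
-- per the article, this should no longer raise the QE slice error.
SELECT getdivisionsbyname(zone_name) AS divisions, "name"
FROM data_address_point
LIMIT 5;
```

Converting the table redistributes a full copy to every segment, so it is best done on a small table such as this one and outside of peak load.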