
[Translation] The Cost of a Join, Illustrated with PostgreSQL

Published: 2025/3/21


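Are joins expensive?
It depends.
The cost of a join depends on the join conditions, on which indexes exist, on how large the tables are, on whether the relevant data is already cached, on the hardware, on the configuration parameters, on whether the statistics are up to date, and on what else is running at the same time...
Dizzy yet? Don't worry. We can still find some regularities by looking at how join performance changes along the following dimensions: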

as the number of joined tables grows
as the number of rows in those tables grows
with and without indexes

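This situation comes up often at work. Suppose there is a product table, product, and the business needs to record a status for each product, such as Active, Discontinued, or Recalled. There are three ways to do it:

Add a status_id column to product, and add a new status table.

Add a status_id column to product, and let the application define what each status_id means and how it is displayed.

Add a text column to product that describes the status.

We usually choose the first option. The usual arguments for the latter two come down to two things: join performance and developer ergonomics. The latter is largely a matter of personal preference, so let us set it aside and look at join performance.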
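To make the first option concrete, here is a minimal sketch of the normalized design and the join it implies. It assumes Python's built-in sqlite3 module as an in-memory stand-in for PostgreSQL, and the name columns and sample rows are hypothetical; only product, status, and status_id come from the text.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL here.
# The "name" columns and the sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE status (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE product (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        status_id INTEGER NOT NULL REFERENCES status (id)
    );
    INSERT INTO status (id, name) VALUES
        (1, 'Active'), (2, 'Discontinued'), (3, 'Recalled');
    INSERT INTO product (id, name, status_id) VALUES
        (1, 'Widget', 1), (2, 'Gadget', 3);
""")

# Showing a product's status now costs one equi-join on status_id.
rows = conn.execute("""
    SELECT p.name, s.name
    FROM product AS p
    INNER JOIN status AS s ON s.id = p.status_id
    ORDER BY p.id
""").fetchall()
print(rows)
```

The extra join in the last query is exactly the cost the rest of the article sets out to measure.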

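To keep the discussion concrete, we use PostgreSQL with generated test data and look at equi-joins. How does performance change when we run joins like the one above? We worry that things get slower, but how much slower exactly?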

The following statements define the functions that create the test tables.

DROP FUNCTION IF EXISTS create_tables(integer, integer, boolean);
CREATE FUNCTION create_tables(num_tables integer, num_rows integer, create_indexes boolean) RETURNS void AS $function_text$
BEGIN

-- There's no table before the first one, so this one's a little different. Create it here instead of in our loop.
DROP TABLE IF EXISTS table_1 CASCADE;
CREATE TABLE table_1 (
    id serial primary key
);

-- Populate the first table
INSERT INTO table_1 (id)
SELECT
    nextval('table_1_id_seq')
FROM
    generate_series(1, num_rows);

-- Create and populate all the other tables
FOR i IN 2..num_tables LOOP
    EXECUTE 'DROP TABLE IF EXISTS table_' || i || ' CASCADE;';

    EXECUTE format($$
        CREATE TABLE table_%1$s (
            id serial primary key,
            table_%2$s_id integer references table_%2$s (id)
        );

        INSERT INTO table_%1$s (table_%2$s_id)
        SELECT
            id
        FROM
            table_%2$s
        ORDER BY
            random();
    $$, i, i-1);

    IF create_indexes THEN
        EXECUTE 'CREATE INDEX ON table_' || i || ' (table_' || i - 1 || '_id);';
    END IF;
END LOOP;
END;
$function_text$ LANGUAGE plpgsql;

-- We'll want to make sure PostgreSQL has an idea of what's in these tables
DROP FUNCTION IF EXISTS analyze_tables(integer);
CREATE FUNCTION analyze_tables(num_tables integer) RETURNS void AS $function_text$
BEGIN

FOR i IN 1..num_tables LOOP
    EXECUTE 'ANALYZE table_' || i || ';';
END LOOP;
END;
$function_text$ LANGUAGE plpgsql;

Run the table-creation function...

SELECT create_tables(10, 10000, False);

SELECT * from table_1 limit 10;
 id
----
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
(10 rows)

SELECT * from table_2 limit 10;
 id | table_1_id
----+------------
  1 |        824
  2 |        973
  3 |        859
  4 |        789
  5 |        901
  6 |        112
  7 |        162
  8 |        212
  9 |        333
 10 |        577
(10 rows)

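OK, now we can create as many tables as we need.
We also need queries to test join performance. They get long, and we would rather not write them by hand, so we create another function to generate them. It only needs to know how many tables to join and the maximum table_1 id that the WHERE clause should allow.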

DROP FUNCTION IF EXISTS get_query(integer, integer);
CREATE FUNCTION get_query(num_tables integer, max_id integer) RETURNS text AS $function_text$
DECLARE
    first_part text;
    second_part text;
    third_part text;
    where_clause text;
BEGIN

first_part := $query$
        SELECT
            count(*)
        FROM
            table_1 AS t1 INNER JOIN$query$;

second_part := '';

FOR i IN 2..num_tables-1 LOOP
    second_part := second_part || format($query$
            table_%1$s AS t%1$s ON
                t%2$s.id = t%1$s.table_%2$s_id INNER JOIN$query$, i, i-1);
END LOOP;

third_part := format($query$
            table_%1$s AS t%1$s ON
                t%2$s.id = t%1$s.table_%2$s_id
        WHERE
            t1.id <= %3$s$query$, num_tables, num_tables-1, max_id);

RETURN first_part || second_part || third_part || ';';
END;
$function_text$ LANGUAGE plpgsql;

Here is an example of a generated query.

SELECT get_query(5, 10);
                    get_query
--------------------------------------------------
                                                 +
         SELECT                                  +
             count(*)                            +
         FROM                                    +
             table_1 AS t1 INNER JOIN            +
             table_2 AS t2 ON                    +
                 t1.id = t2.table_1_id INNER JOIN+
             table_3 AS t3 ON                    +
                 t2.id = t3.table_2_id INNER JOIN+
             table_4 AS t4 ON                    +
                 t3.id = t4.table_3_id INNER JOIN+
             table_5 AS t5 ON                    +
                 t4.id = t5.table_4_id           +
         WHERE                                   +
             t1.id <= 10;
(1 row)

Time: 1.404 ms
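If the benchmark is driven from a client script rather than from psql, the same query text can be built outside the database. The sketch below is a Python counterpart to get_query(); the name build_query is hypothetical, and the whitespace differs from the PL/pgSQL output, though the joins and the filter are the same.

```python
def build_query(num_tables: int, max_id: int) -> str:
    """Build a chained INNER JOIN query equivalent to get_query()'s output."""
    if num_tables < 2:
        raise ValueError("need at least two tables to join")
    lines = ["SELECT", "    count(*)", "FROM", "    table_1 AS t1"]
    for i in range(2, num_tables + 1):
        # Each table joins to the previous one through its foreign key column.
        lines.append(f"    INNER JOIN table_{i} AS t{i} ON")
        lines.append(f"        t{i - 1}.id = t{i}.table_{i - 1}_id")
    # max_id is an int, so formatting it into the text is safe here.
    lines += ["WHERE", f"    t1.id <= {max_id};"]
    return "\n".join(lines)


print(build_query(3, 10))
```

Building the join list in a single loop avoids the first/middle/last split that the PL/pgSQL version needs because it emits the trailing INNER JOIN keyword on the preceding line.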

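OK, let's take a moment to think about what we are actually asking Postgres to do when we run this query. We are asking how many rows of table_5 have a table_4_id that is in table_4, where that table_4 row has a table_3_id that is in table_3, where that table_3 row has a table_2_id that is in table_2, where that table_2 row has a table_1_id that is in table_1, and where that table_1 id is less than or equal to 10.
Let's run it...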

SELECT
    count(*)
FROM
    table_1 AS t1 INNER JOIN
    table_2 AS t2 ON
        t1.id = t2.table_1_id INNER JOIN
    table_3 AS t3 ON
        t2.id = t3.table_2_id INNER JOIN
    table_4 AS t4 ON
        t3.id = t4.table_3_id INNER JOIN
    table_5 AS t5 ON
        t4.id = t5.table_4_id
WHERE
    t1.id <= 10;

 count
-------
    10
(1 row)

Time: 40.494 ms

We can prefix the query with EXPLAIN ANALYZE to see what happened.

EXPLAIN ANALYZE
SELECT
    count(*)
FROM
    table_1 AS t1 INNER JOIN
    table_2 AS t2 ON
        t1.id = t2.table_1_id INNER JOIN
    table_3 AS t3 ON
        t2.id = t3.table_2_id INNER JOIN
    table_4 AS t4 ON
        t3.id = t4.table_3_id INNER JOIN
    table_5 AS t5 ON
        t4.id = t5.table_4_id
WHERE
    t1.id <= 10;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=827.93..827.94 rows=1 width=8) (actual time=43.392..43.392 rows=1 loops=1)
   ->  Hash Join  (cost=645.31..827.90 rows=9 width=0) (actual time=35.221..43.353 rows=10 loops=1)
         Hash Cond: (t5.table_4_id = t4.id)
         ->  Seq Scan on table_5 t5  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.024..3.984 rows=10000 loops=1)
         ->  Hash  (cost=645.20..645.20 rows=9 width=4) (actual time=34.421..34.421 rows=10 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Hash Join  (cost=462.61..645.20 rows=9 width=4) (actual time=25.281..34.357 rows=10 loops=1)
                     Hash Cond: (t4.table_3_id = t3.id)
                     ->  Seq Scan on table_4 t4  (cost=0.00..145.00 rows=10000 width=8) (actual time=0.022..4.828 rows=10000 loops=1)
                     ->  Hash  (cost=462.50..462.50 rows=9 width=4) (actual time=23.519..23.519 rows=10 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 9kB
                           ->  Hash Join  (cost=279.91..462.50 rows=9 width=4) (actual time=12.617..23.453 rows=10 loops=1)
                                 Hash Cond: (t3.table_2_id = t2.id)
                                 ->  Seq Scan on table_3 t3  (cost=0.00..145.00 rows=10000 width=8) (actual time=0.017..5.065 rows=10000 loops=1)
                                 ->  Hash  (cost=279.80..279.80 rows=9 width=4) (actual time=12.221..12.221 rows=10 loops=1)
                                       Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                       ->  Hash Join  (cost=8.55..279.80 rows=9 width=4) (actual time=0.293..12.177 rows=10 loops=1)
                                             Hash Cond: (t2.table_1_id = t1.id)
                                             ->  Seq Scan on table_2 t2  (cost=0.00..145.00 rows=10000 width=8) (actual time=0.017..5.407 rows=10000 loops=1)
                                             ->  Hash  (cost=8.44..8.44 rows=9 width=4) (actual time=0.054..0.054 rows=10 loops=1)
                                                   Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                                   ->  Index Only Scan using table_1_pkey on table_1 t1  (cost=0.29..8.44 rows=9 width=4) (actual time=0.024..0.035 rows=10 loops=1)
                                                         Index Cond: (id <= 10)
                                                         Heap Fetches: 10
 Planning time: 1.659 ms
 Execution time: 43.585 ms
(26 rows)

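We can see that, apart from the primary key index on table_1, everything is a sequential scan. What else could it do? We have not created any indexes to help it.
If we redo the experiment and tell create_tables() to create indexes...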

SELECT create_tables(10, 10000, True);

Rerunning the query, we get a different plan.

EXPLAIN ANALYZE
SELECT
    count(*)
FROM
    table_1 AS t1 INNER JOIN
    table_2 AS t2 ON
        t1.id = t2.table_1_id INNER JOIN
    table_3 AS t3 ON
        t2.id = t3.table_2_id INNER JOIN
    table_4 AS t4 ON
        t3.id = t4.table_3_id INNER JOIN
    table_5 AS t5 ON
        t4.id = t5.table_4_id
WHERE
    t1.id <= 10;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=88.52..88.53 rows=1 width=8) (actual time=0.411..0.411 rows=1 loops=1)
   ->  Nested Loop  (cost=1.43..88.50 rows=9 width=0) (actual time=0.067..0.399 rows=10 loops=1)
         ->  Nested Loop  (cost=1.14..85.42 rows=9 width=4) (actual time=0.054..0.304 rows=10 loops=1)
               ->  Nested Loop  (cost=0.86..82.34 rows=9 width=4) (actual time=0.043..0.214 rows=10 loops=1)
                     ->  Nested Loop  (cost=0.57..79.25 rows=9 width=4) (actual time=0.032..0.113 rows=10 loops=1)
                           ->  Index Only Scan using table_1_pkey on table_1 t1  (cost=0.29..8.44 rows=9 width=4) (actual time=0.015..0.023 rows=10 loops=1)
                                 Index Cond: (id <= 10)
                                 Heap Fetches: 10
                           ->  Index Scan using table_2_table_1_id_idx on table_2 t2  (cost=0.29..7.86 rows=1 width=8) (actual time=0.007..0.007 rows=1 loops=10)
                                 Index Cond: (table_1_id = t1.id)
                     ->  Index Scan using table_3_table_2_id_idx on table_3 t3  (cost=0.29..0.33 rows=1 width=8) (actual time=0.008..0.008 rows=1 loops=10)
                           Index Cond: (table_2_id = t2.id)
               ->  Index Scan using table_4_table_3_id_idx on table_4 t4  (cost=0.29..0.33 rows=1 width=8) (actual time=0.007..0.008 rows=1 loops=10)
                     Index Cond: (table_3_id = t3.id)
         ->  Index Only Scan using table_5_table_4_id_idx on table_5 t5  (cost=0.29..0.33 rows=1 width=4) (actual time=0.007..0.008 rows=1 loops=10)
               Index Cond: (table_4_id = t4.id)
               Heap Fetches: 10
 Planning time: 2.287 ms
 Execution time: 0.546 ms
(19 rows)

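This time the indexes are used, and the query is much faster, just as we expected. Now we are ready to put all of this together and see what happens as the number of tables and the number of rows grow.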
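Why do the two plans differ so much? A toy model helps: it counts how many child-table rows each strategy has to read for the five-table query with t1.id <= 10. This is only a sketch of the idea in Python, not PostgreSQL's executor; all function names are hypothetical, and a dict stands in for the B-tree index.

```python
import random
from collections import defaultdict


def make_child(num_rows, rng):
    """One child table as (id, parent_id) rows; the parent ids are a random
    permutation of 1..num_rows, like INSERT ... ORDER BY random()."""
    parent_ids = list(range(1, num_rows + 1))
    rng.shuffle(parent_ids)
    return list(enumerate(parent_ids, start=1))


def build_index(child):
    """A dict standing in for the B-tree that CREATE INDEX would build."""
    index = defaultdict(list)
    for row_id, parent_id in child:
        index[parent_id].append(row_id)
    return dict(index)


def scan_and_probe(children, start_ids):
    """No index on the foreign key: read every child row and probe a hash
    set holding the parent ids that survived the previous step."""
    rows_read, wanted = 0, set(start_ids)
    for child in children:
        matched = set()
        for row_id, parent_id in child:      # sequential scan
            rows_read += 1
            if parent_id in wanted:          # hash probe
                matched.add(row_id)
        wanted = matched
    # Every parent id is referenced exactly once, so ids == joined rows.
    return len(wanted), rows_read


def index_lookup(indexes, start_ids):
    """Index on the foreign key: for each surviving parent id, fetch only
    the child rows that reference it."""
    rows_read, wanted = 0, list(start_ids)
    for index in indexes:
        matched = []
        for parent_id in wanted:
            for row_id in index.get(parent_id, []):
                rows_read += 1
                matched.append(row_id)
        wanted = matched
    return len(wanted), rows_read


rng = random.Random(42)
children = [make_child(10_000, rng) for _ in range(4)]   # table_2 .. table_5
indexes = [build_index(child) for child in children]
start = range(1, 11)                                     # t1.id <= 10

print(scan_and_probe(children, start))   # (matching rows, child rows read)
print(index_lookup(indexes, start))
```

Both strategies find the same 10 rows, but the scan reads 40,000 child rows while the index lookups read 40. The model also shows why the count is 10: each table references every id of the previous table exactly once, so the 10 selected rows of table_1 carry through the whole chain.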
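A note on the setup: these tests ran on an AWS RDS db.m4.large instance. It is the cheapest instance class whose performance does not scale up automatically (no bursting), so it makes a fair baseline. Each test was run 10 times and the results were averaged.
The first set of queries joins from 2 to 200 tables, at three different row counts, with no indexes.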
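The article does not show how the timings were collected. One simple way to get "run it 10 times and average" is a small harness like the following sketch; average_runtime and fake_query are hypothetical names, and fake_query is only a placeholder for executing a generated query against the database.

```python
import statistics
import time


def average_runtime(run_once, repeats: int = 10) -> float:
    """Call run_once() `repeats` times and return the mean wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)


# Hypothetical stand-in for "execute the generated query and fetch the result".
def fake_query():
    sum(range(10_000))


print(f"{average_runtime(fake_query) * 1000:.3f} ms")
```

Measuring on the client like this includes network round trips; the execution time reported by EXPLAIN ANALYZE does not, so the two numbers are not directly comparable.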

[Chart: execution time by number of joined tables, no indexes, db.m4.large]
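Performance is quite good. More importantly, we can work out the cost of each additional join. For the 100-row tables, the figures below show how much execution time each additional table adds: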

Tables	Calculation (times in seconds)	Average time added per table (ms)

2-50	(0.012738 - 0.000327) / (50 - 2) =	0.259
50-100	(0.0353395 - 0.012738) / (100 - 50) =	0.452
100-150	(0.0762056 - 0.0353395) / (150 - 100) =	0.817
150-200	(0.1211591 - 0.0762056) / (200 - 150) =	0.899

Even as the number of joined tables approaches 200, each additional table adds less than 1 ms of execution time.
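The arithmetic behind these figures is a difference quotient between neighboring measurements. A short sketch, assuming the raw timings in the table above are in seconds (which is what makes the results come out in milliseconds); per_join_ms is a hypothetical helper name:

```python
# Mean execution time in seconds, keyed by the number of joined tables
# (values copied from the calculations in the table above).
measured = {
    2: 0.000327,
    50: 0.012738,
    100: 0.0353395,
    150: 0.0762056,
    200: 0.1211591,
}


def per_join_ms(measured):
    """Average milliseconds added per extra table between neighboring points."""
    points = sorted(measured.items())
    result = {}
    for (n0, t0), (n1, t1) in zip(points, points[1:]):
        result[(n0, n1)] = round((t1 - t0) / (n1 - n0) * 1000, 3)
    return result


for (low, high), ms in per_join_ms(measured).items():
    print(f"{low}-{high}: {ms} ms per additional table")
```

The per-table cost rises from one range to the next, so without indexes the total time grows somewhat faster than linearly in the number of tables, even though each step stays under 1 ms.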
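For a table that stores product status, even 1,000 rows would be overkill. So, indexes aside, adding a reference table has no material effect on performance.
The benchmarks so far used small tables. What about large ones? We reran the same test, this time with tables of up to one million rows, and only went up to 50 tables. Why? It takes a while to run, and I have a limited budget and limited patience.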

Performance with million-row tables

This time the curves no longer overlap. Look at the largest tables, with one million rows: each additional joined table adds about 93 ms of run time.

Tables	Calculation (times in seconds)	Average time added per table (ms)

2-10	(0.8428495 - 0.0924571) / (10 - 2) =	93.799
10-20	(1.781959 - 0.8428495) / (20 - 10) =	93.911
20-30	(2.708342 - 1.781959) / (30 - 20) =	92.638
30-40	(3.649164 - 2.708342) / (40 - 30) =	94.082
40-50	(4.565644 - 3.649164) / (50 - 40) =	91.648

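These are all sequential scans again, so let's add indexes and see what changes.

Performance after adding indexes

With indexes, the effect is dramatic. When the indexes can be used, the results are about the same no matter how many rows the tables hold. Queries against the 100,000-row tables were generally the slowest, but not always.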

Tables	Calculation (times in seconds)	Average time added per table (ms)

2-50	(0.0119917 - 0.000265) / (50 - 2) =	0.244
50-100	(0.035345 - 0.0119917) / (100 - 50) =	0.467
100-150	(0.0759236 - 0.035345) / (150 - 100) =	0.811
150-200	(0.1378461 - 0.0759236) / (200 - 150) =	1.238

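Even when the query already involves 150 tables, adding one more table adds only about 1.2 ms.
For the last scenario, generating and indexing 200 tables of one million rows each would have taken too long, but I still wanted to see how performance changes, so I tested up to 50 tables...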

Results with the larger data set

Tables	Calculation (times in seconds)	Average time added per table (ms)

2-10	(0.0016811 - 0.000276) / (10 - 2) =	0.176
10-20	(0.003771 - 0.0016811) / (20 - 10) =	0.209
20-30	(0.0062328 - 0.003771) / (30 - 20) =	0.246
30-40	(0.0088621 - 0.0062328) / (40 - 30) =	0.263
40-50	(0.0120818 - 0.0088621) / (50 - 40) =	0.322

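Given the improvement we already saw from adding indexes, this result is not much of a surprise. Joining 50 tables of one million rows each takes only 12 ms. Cool!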
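Perhaps these extra joins cost less than we expected, but there is one more thing to consider. Although each additional join takes little time to execute, more tables mean more candidate plans for the planner to consider, which can make it harder to find the best one. For example, once the number of joined tables reaches geqo_threshold (12 by default), Postgres stops evaluating every possible plan and switches to a genetic algorithm. That can change the chosen plan and hurt performance.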
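To get a feel for why exhaustive planning stops being practical, consider only left-deep plans, where tables are joined one after another: there is one such plan per ordering of the tables. The sketch below counts them. It is an illustration of the growth rate, not a description of how PostgreSQL's planner enumerates plans, and left_deep_orders is a hypothetical name.

```python
import math

GEQO_THRESHOLD_DEFAULT = 12  # PostgreSQL's default, as stated above


def left_deep_orders(num_tables: int) -> int:
    """Number of left-deep join orders: one per permutation of the tables."""
    return math.factorial(num_tables)


for n in (2, 5, 8, GEQO_THRESHOLD_DEFAULT):
    print(f"{n:>2} tables: {left_deep_orders(n):,} left-deep join orders")
```

At 12 tables there are already 479,001,600 orderings, before bushy plans and the choice of join method per step are taken into account.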
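Every system and workload is different, so be sure to test your queries against your own data. Our results show that an additional join is cheap, so normalizing your data remains well worth it.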

[Original] Cost of a Join
This is not a literal translation; it was done for fun, and as a tribute to the author's rigor.
