Building a Recursive-Descent Parser: Understanding Spark SQL Catalyst, the Functional Query Optimizer

Published: 2023/11/30

These are my notes on Catalyst, Spark SQL's functional, extensible query optimizer. Contents:

0. Overview
1. Catalyst workflow
2. Parser module
3. Analyzer module
4. Optimizer module
5. SparkPlanner module
6. Job UI
7. Reference

Overview

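At the core of Spark SQL is the Catalyst optimizer, an extensible query optimizer built in a novel way on Scala's pattern matching and quasiquotes.

[Figure: Spark SQL pipeline]

The Catalyst optimizer sits in the middle of the Spark SQL pipeline and is its central component. Its optimization strategies fall into two main families: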

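Rule-based optimization (Rule Based Optimizer, RBO)

An empirical, heuristic approach to optimization.
It falls short on the key operator, join. When two tables are joined, should the engine use broadcastHashJoin or sortMergeJoin? Spark SQL currently decides this from a manually configured parameter: if one table is smaller than a threshold (10 MB by default, set through spark.sql.autoBroadcastJoinThreshold), broadcastHashJoin is used.

The join strategies in question, for tables P and Q:

nestedLoopsJoin: two nested loops over P and Q, O(M*N)
sortMergeJoin: sort both P and Q, then advance a cursor over each side in step
broadcastHashJoin: load the smaller of P and Q into an in-memory hash table, then scan the larger table and look up each row's matches in O(1)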
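
To make the three strategies concrete, here is a minimal sketch in plain Scala over in-memory sequences. It is not Spark's implementation: the `Row` type, the object name `JoinSketch`, and the `chooseJoin` helper are all assumptions made for illustration, and the threshold rule only mimics the "smaller than a threshold" behaviour described above.

```scala
object JoinSketch {
  // Hypothetical row type: a join key plus a payload.
  type Row = (Int, String)
  type Joined = (Int, String, String)

  // nestedLoopsJoin: compare every row of P with every row of Q, O(M*N).
  def nestedLoopsJoin(p: Seq[Row], q: Seq[Row]): Seq[Joined] =
    for {
      (pk, pv) <- p
      (qk, qv) <- q
      if pk == qk
    } yield (pk, pv, qv)

  // broadcastHashJoin: build a hash table on the small side,
  // then stream the large side through it with O(1) lookups.
  def broadcastHashJoin(large: Seq[Row], small: Seq[Row]): Seq[Joined] = {
    val table: Map[Int, Seq[String]] =
      small.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }
    for {
      (k, lv) <- large
      sv <- table.getOrElse(k, Seq.empty)
    } yield (k, lv, sv)
  }

  // Index just past the run of rows that share the key found at `from`.
  private def runEnd(rows: Vector[Row], from: Int): Int = {
    var k = from
    while (k < rows.length && rows(k)._1 == rows(from)._1) k += 1
    k
  }

  // sortMergeJoin: sort both sides, then advance two cursors in step.
  def sortMergeJoin(p: Seq[Row], q: Seq[Row]): Seq[Joined] = {
    val ps = p.sortBy(_._1).toVector
    val qs = q.sortBy(_._1).toVector
    val out = Vector.newBuilder[Joined]
    var i = 0
    var j = 0
    while (i < ps.length && j < qs.length) {
      if (ps(i)._1 < qs(j)._1) i += 1
      else if (ps(i)._1 > qs(j)._1) j += 1
      else {
        // Equal keys: emit the cross product of the two matching runs.
        val iEnd = runEnd(ps, i)
        val jEnd = runEnd(qs, j)
        for (a <- i until iEnd; b <- j until jEnd)
          out += ((ps(a)._1, ps(a)._2, qs(b)._2))
        i = iEnd
        j = jEnd
      }
    }
    out.result()
  }

  // Assumed threshold rule: broadcast when the smaller side is below 10 MB.
  val broadcastThresholdBytes: Long = 10L * 1024 * 1024

  def chooseJoin(leftBytes: Long, rightBytes: Long): String =
    if (math.min(leftBytes, rightBytes) < broadcastThresholdBytes) "broadcastHashJoin"
    else "sortMergeJoin"
}
```

All three functions return the same set of joined rows; they differ only in how much work they do to find the matches, which is exactly the choice the threshold parameter makes by hand.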

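Cost-based optimization (Cost Based Optimizer, CBO)

For each join, estimate the cost of every join strategy for the two tables involved, and pick the plan with the lowest estimated cost.
Candidate physical plans are fed into a cost model (currently based on statistics), which adjusts the join order and reduces the size of the intermediate shuffle data to reach the best output.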
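
The idea of reordering joins to shrink intermediate results can be sketched with a toy cost model. Everything here is an assumption for illustration: `TableStats`, the single `divisor` standing in for join selectivity, and the brute-force search over permutations. Spark's real cost model and statistics are considerably richer.

```scala
object CostSketch {
  // Hypothetical statistics: an estimated row count per table.
  final case class TableStats(name: String, rows: Long)

  // Crude assumption: a join keeps one row out of every `divisor` row pairs.
  def joinRows(left: Long, right: Long, divisor: Long): Long =
    left * right / divisor

  // Cost of a left-deep join order = sum of all intermediate result sizes.
  def cost(order: Seq[TableStats], divisor: Long): Long = {
    require(order.nonEmpty, "need at least one table")
    val (_, total) = order.tail.foldLeft((order.head.rows, 0L)) {
      case ((acc, sum), t) =>
        val out = joinRows(acc, t.rows, divisor)
        (out, sum + out)
    }
    total
  }

  // Enumerate every order and keep the cheapest (fine for a handful of tables).
  def bestOrder(tables: Seq[TableStats], divisor: Long): Seq[TableStats] =
    tables.permutations.minBy(order => cost(order, divisor))
}
```

With tables of 1,000,000, 1,000, and 10 rows and a divisor of 100, joining the two small tables first and the large table last gives the smallest intermediate data under this model, which is the effect the CBO is after.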

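Catalyst workflow

Parser: uses ANTLR to parse the Spark SQL string into an abstract syntax tree (AST), called the unresolved logical plan (ULP).
Analyzer: resolves the ULP into a logical plan (LP) with the help of the metadata catalog.
Optimizer: applies the RBO and CBO strategies to produce the optimized logical plan (OLP), mainly by pruning and merging operations in the logical plan, which removes useless computation or merges several steps of a computation into one.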

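Other

The Optimizer is the last stage of Catalyst's own work. Generating the physical plan and executing it afterwards is mostly handled by Spark SQL.

SparkPlanner
- The optimized logical plan (OLP) is still logical and cannot be understood by the Spark engine, so it has to be converted into a physical plan.
- One or more physical plans are generated from the OLP, and one of them is selected using a cost model.
Code generation
- Generates Java bytecode that is then executed on every machine, forming the RDD graph/DAG.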

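Parser module

The parser splits the Spark SQL string into individual tokens and then, following the grammar rules, parses them into an abstract syntax tree (AST). Most systems today implement this module with the third-party library ANTLR, for example Hive, Presto, and Spark SQL.

[Figure: parser tokenization]

Spark 1.x used Scala's native Parser Combinator to build its lexer and parser, while Spark 2.x uses the third-party parser generator ANTLR4. From the grammar file SqlBase.g4, ANTLR4 generates two Java classes: the lexer SqlBaseLexer and the parser SqlBaseParser. These two classes turn the SQL string into an ANTLR4 ParseTree. Then, during parsePlan, AstBuilder.scala converts the ParseTree into a Catalyst LogicalPlan.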
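
The tokenize-then-parse steps can be illustrated with a small hand-written recursive-descent parser. This is only a sketch under stated assumptions: the grammar covers a single `SELECT ... FROM ... [WHERE col op number]` form, and the names `ParserSketch`, `UnresolvedRelation`, `Filter`, and `Project` are simplified stand-ins defined here, not Spark's classes or the code ANTLR4 generates.

```scala
object ParserSketch {
  // Hypothetical miniature plan nodes, loosely mirroring an unresolved logical plan.
  sealed trait Plan
  final case class UnresolvedRelation(table: String) extends Plan
  final case class Filter(column: String, op: String, value: Int, child: Plan) extends Plan
  final case class Project(columns: List[String], child: Plan) extends Plan

  // Lexer: split the SQL text into identifiers, numbers, and operators.
  def tokenize(sql: String): List[String] =
    """[A-Za-z_][A-Za-z0-9_.]*|\d+|>=|<=|[><=,*]""".r.findAllIn(sql).toList

  // Parser: one method per grammar rule, each consuming tokens left to right.
  final class Parser(tokens: List[String]) {
    private var rest = tokens

    private def peek: Option[String] = rest.headOption

    private def next(): String = rest match {
      case head :: tail =>
        rest = tail
        head
      case Nil =>
        throw new IllegalArgumentException("unexpected end of input")
    }

    private def expect(keyword: String): Unit = {
      val t = next()
      if (!t.equalsIgnoreCase(keyword))
        throw new IllegalArgumentException(s"expected $keyword but found $t")
    }

    // query := SELECT columns FROM identifier [WHERE predicate]
    def query(): Plan = {
      expect("SELECT")
      val cols = columns()
      expect("FROM")
      val relation: Plan = UnresolvedRelation(next())
      val filtered: Plan =
        if (peek.exists(_.equalsIgnoreCase("WHERE"))) {
          next()
          predicate(relation)
        } else relation
      if (rest.nonEmpty)
        throw new IllegalArgumentException(s"unexpected token ${rest.head}")
      Project(cols, filtered)
    }

    // columns := identifier (',' identifier)*
    private def columns(): List[String] = {
      val first = next()
      if (peek.contains(",")) {
        next()
        first :: columns()
      } else List(first)
    }

    // predicate := identifier op number
    private def predicate(child: Plan): Plan = {
      val column = next()
      val op = next()
      val value = next().toInt
      Filter(column, op, value, child)
    }
  }

  def parse(sql: String): Plan = new Parser(tokenize(sql)).query()
}
```

For example, `ParserSketch.parse("SELECT name, age FROM people WHERE age > 18")` yields a `Project` over a `Filter` over an `UnresolvedRelation("people")`. At this point nothing is known about the table or its columns, which is why the result is "unresolved".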

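Analyzer module

After parsing, the ULP has its basic skeleton, but the system still knows nothing about the tables' fields. What do sum, select, join, and where mean, and what are score and people? Basic metadata, the schema catalog, is needed to give these tokens meaning. The most important metadata are:

Table schema information: the table's basic definition (table name, column names, data types), its data format (JSON, text, Parquet, compression format, and so on), and its physical location.
Basic function information, mainly the class behind each function.

The Analyzer traverses the whole AST again and binds data types and functions to every node. For example, the token people is resolved through the table metadata into a table with the three columns age, id, and name; people.age is resolved into a variable of type int; and sum is resolved into a specific aggregate function.

[Figure: binding meaning to tokens]

// org.apache.spark.sql.catalyst.analysis.Analyzer.scala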
lazy val batches: Seq[Batch] = Seq( // each Batch represents a different resolution strategy
  Batch("Substitution", fixedPoint,
    CTESubstitution,
    WindowsSubstitution,
    EliminateUnions,
    new SubstituteUnresolvedOrdinals(conf)),
  Batch("Resolution", fixedPoint,
    ResolveTableValuedFunctions ::
    ResolveRelations :: // resolves tables and columns (basic data types, names, etc.) through the catalog
    ResolveReferences :: // resolves attributes produced by child operators, usually introduced by aliases, e.g. people.age
    ResolveCreateNamedStruct ::
    ResolveDeserializer ::
    ResolveNewInstance ::
    ResolveUpCast ::
    ResolveGroupingAnalytics ::
    ResolvePivot ::
    ResolveOrdinalInOrderByAndGroupBy ::
    ResolveMissingReferences ::
    ExtractGenerator ::
    ResolveGenerate ::
    ResolveFunctions :: // resolves basic functions such as max, min, agg
    ResolveAliases ::
    ResolveSubquery :: // resolves subqueries in the AST
    ResolveWindowOrder ::
    ResolveWindowFrame ::
    ResolveNaturalAndUsingJoin ::
    ExtractWindowExpressions ::
    GlobalAggregates :: // resolves global aggregates, e.g. select sum(score) from table
    ResolveAggregateFunctions ::
    TimeWindowing ::
    ResolveInlineTables ::
    TypeCoercion.typeCoercionRules ++
    extendedResolutionRules : _*),
  Batch("Nondeterministic", Once,
    PullOutNondeterministic),
  Batch("UDF", Once,
    HandleNullInputsForUDF),
  Batch("FixNullability", Once,
    FixNullability),
  Batch("Cleanup", fixedPoint,
    CleanupAliases)
)
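
What "resolving against the catalog" means can be shown with a much smaller model. The sketch below assumes a catalog that is just a map from table name to column types, and defines its own miniature `Plan` and `Expr` types; the two rules are loose stand-ins for the ideas behind ResolveRelations and ResolveReferences, not their actual logic. The rules are deliberately listed in the "wrong" order so that a single pass is not enough, which is why they are applied until the plan stops changing.

```scala
object AnalyzerSketch {
  sealed trait DataType
  case object IntType extends DataType
  case object StringType extends DataType

  sealed trait Expr
  final case class UnresolvedAttribute(name: String) extends Expr
  final case class AttributeRef(table: String, name: String, dataType: DataType) extends Expr

  sealed trait Plan
  final case class UnresolvedRelation(table: String) extends Plan
  final case class Relation(table: String, schema: Map[String, DataType]) extends Plan
  final case class Project(exprs: List[Expr], child: Plan) extends Plan

  // Hypothetical catalog: table name -> (column name -> data type).
  type Catalog = Map[String, Map[String, DataType]]

  // Rule 1: replace table names with relations looked up in the catalog.
  def resolveRelations(catalog: Catalog)(plan: Plan): Plan = plan match {
    case UnresolvedRelation(t) if catalog.contains(t) => Relation(t, catalog(t))
    case Project(es, child) => Project(es, resolveRelations(catalog)(child))
    case other => other
  }

  // Rule 2: bind column names to typed attributes once the child is resolved.
  def resolveReferences(plan: Plan): Plan = plan match {
    case Project(es, rel @ Relation(t, schema)) =>
      val bound: List[Expr] = es.map {
        case UnresolvedAttribute(n) if schema.contains(n) => AttributeRef(t, n, schema(n))
        case e => e
      }
      Project(bound, rel)
    case Project(es, child) => Project(es, resolveReferences(child))
    case other => other
  }

  // Apply all rules repeatedly until the plan reaches a fixed point.
  def analyze(catalog: Catalog, plan: Plan, maxIterations: Int = 100): Plan = {
    val rules: List[Plan => Plan] =
      List(p => resolveReferences(p), p => resolveRelations(catalog)(p))
    var current = plan
    var changed = true
    var i = 0
    while (changed && i < maxIterations) {
      val next = rules.foldLeft(current)((p, rule) => rule(p))
      changed = next != current
      current = next
      i += 1
    }
    current
  }
}
```

Given a catalog in which people has the columns id, name, and age, analyzing a projection of age over the unresolved relation people binds age to an `AttributeRef` of `IntType`. A column the catalog does not know stays an `UnresolvedAttribute` in this sketch.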

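Optimizer module

The Optimizer is the core of Catalyst and comes in two kinds, RBO and CBO.
RBO works by traversing the syntax tree, pattern matching the nodes that satisfy a particular rule, and applying the corresponding equivalent transformation, that is, rewriting one tree into another, equivalent tree. Classic, common SQL optimization rules include:

Predicate pushdown: filters move down the tree toward the data, so that "join, then filter" becomes "filter, then join".

Constant folding: `100+80` is rewritten to `180`, so the addition `100+80` is not executed once per record.

Column pruning: fields that are not needed are cut away, especially unneeded nested fields. If only people.age is needed and people.address is not, the address field can be dropped.

Combine limits: adjacent limits are merged into one.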
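
Two of these rewrites, constant folding and predicate pushdown, can be written as pattern matches over a small tree. The node types below (`Literal`, `Column`, `Add`, `GreaterThan`, `Scan`, `Join`, `Filter`) are hypothetical and defined only for this sketch, and the pushdown rule assumes an inner join, where filtering one side before the join is safe.

```scala
object OptimizerRulesSketch {
  sealed trait Expr
  final case class Literal(value: Int) extends Expr
  final case class Column(table: String, name: String) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class GreaterThan(left: Expr, right: Expr) extends Expr

  sealed trait Plan
  final case class Scan(table: String) extends Plan
  final case class Join(left: Plan, right: Plan) extends Plan // assumed inner join
  final case class Filter(condition: Expr, child: Plan) extends Plan

  // Constant folding: evaluate additions of literals once, bottom-up.
  def foldConstants(e: Expr): Expr = e match {
    case Add(l, r) =>
      (foldConstants(l), foldConstants(r)) match {
        case (Literal(a), Literal(b)) => Literal(a + b)
        case (fl, fr) => Add(fl, fr)
      }
    case GreaterThan(l, r) => GreaterThan(foldConstants(l), foldConstants(r))
    case other => other
  }

  // Tables referenced by an expression.
  def tables(e: Expr): Set[String] = e match {
    case Column(t, _) => Set(t)
    case Add(l, r) => tables(l) ++ tables(r)
    case GreaterThan(l, r) => tables(l) ++ tables(r)
    case Literal(_) => Set.empty
  }

  // Tables produced by a plan.
  def output(p: Plan): Set[String] = p match {
    case Scan(t) => Set(t)
    case Join(l, r) => output(l) ++ output(r)
    case Filter(_, c) => output(c)
  }

  // Predicate pushdown: a filter that only touches one side of a join
  // moves below the join, onto that side.
  def pushDown(p: Plan): Plan = p match {
    case Filter(cond, Join(l, r)) if tables(cond).subsetOf(output(l)) =>
      Join(pushDown(Filter(cond, l)), pushDown(r))
    case Filter(cond, Join(l, r)) if tables(cond).subsetOf(output(r)) =>
      Join(pushDown(l), pushDown(Filter(cond, r)))
    case Filter(cond, child) => Filter(cond, pushDown(child))
    case Join(l, r) => Join(pushDown(l), pushDown(r))
    case scan => scan
  }
}
```

Here `foldConstants` turns `people.age > 100 + 80` into `people.age > 180`, and `pushDown` moves a filter on people.age from above a join of people and score down onto the people side, so fewer rows reach the join.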

// @see http://blog.csdn.net/oopsoom/article/details/38121259
// org.apache.spark.sql.catalyst.optimizer.Optimizer.scala
def batches: Seq[Batch] = {
  // Technically some of the rules in Finish Analysis are not optimizer rules and belong more
  // in the analyzer, because they are needed for correctness (e.g. ComputeCurrentTime).
  // However, because we also use the analyzer to canonicalized queries (for view definition),
  // we do not eliminate subqueries or compute current time in the analyzer.
  Batch("Finish Analysis", Once,
    EliminateSubqueryAliases,
    ReplaceExpressions,
    ComputeCurrentTime,
    GetCurrentDatabase(sessionCatalog),
    RewriteDistinctAggregates) ::
  //
  // Optimizer rules start here
  //
  // - Do the first call of CombineUnions before starting the major Optimizer rules,
  //   since it can reduce the number of iteration and the other rules could add/move
  //   extra operators between two adjacent Union operators.
  // - Call CombineUnions again in Batch("Operator Optimizations"),
  //   since the other rules might make two separate Unions operators adjacent.
  Batch("Union", Once,
    CombineUnions) ::
  Batch("Subquery", Once,
    OptimizeSubqueries) ::
  Batch("Replace Operators", fixedPoint,
    ReplaceIntersectWithSemiJoin,
    ReplaceExceptWithAntiJoin,
    ReplaceDistinctWithAggregate) ::
  Batch("Aggregate", fixedPoint,
    RemoveLiteralFromGroupExpressions,
    RemoveRepetitionFromGroupExpressions) ::
  Batch("Operator Optimizations", fixedPoint,
    // Operator push down
    PushProjectionThroughUnion,
    ReorderJoin,
    EliminateOuterJoin,
    PushPredicateThroughJoin, // one of the predicate pushdown rules
    PushDownPredicate, // one of the predicate pushdown rules
    LimitPushDown,
    ColumnPruning, // column pruning; commonly applies to aggregates, the left and right children of a join, and merging adjacent project columns
    InferFiltersFromConstraints,
    // Operator combine
    CollapseRepartition,
    CollapseProject,
    CollapseWindow,
    CombineFilters, // one of the predicate pushdown rules; merges two adjacent Filters, which reduces the depth of the tree and the cost of filtering repeatedly
    CombineLimits, // merges Limits
    CombineUnions,
    // Constant folding and strength reduction
    NullPropagation,
    FoldablePropagation,
    OptimizeIn(conf),
    ConstantFolding, // one of the constant folding rules
    ReorderAssociativeOperator,
    LikeSimplification,
    BooleanSimplification, // one of the constant folding rules; short-circuits boolean expressions early
    SimplifyConditionals,
    RemoveDispensableExpressions,
    SimplifyBinaryComparison,
    PruneFilters,
    EliminateSorts,
    SimplifyCasts,
    SimplifyCaseConversionExpressions,
    RewriteCorrelatedScalarSubquery,
    EliminateSerialization,
    RemoveRedundantAliases,
    RemoveRedundantProject) ::
  Batch("Check Cartesian Products", Once,
    CheckCartesianProducts(conf)) ::
  Batch("Decimal Optimizations", fixedPoint,
    DecimalAggregates) ::
  Batch("Typed Filter Optimization", fixedPoint,
    CombineTypedFilters) ::
  Batch("LocalRelation", fixedPoint,
    ConvertToLocalRelation,
    PropagateEmptyRelation) ::
  Batch("OptimizeCodegen", Once,
    OptimizeCodegen(conf)) ::
  Batch("RewriteSubquery", Once,
    RewritePredicateSubquery,
    CollapseProject) :: Nil
}
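
The difference between a batch run Once and a batch run to a fixed point can be seen in a small rule executor. This is a simplified sketch, not Spark's RuleExecutor: `Batch`, `Once`, and `FixedPoint` are redefined here with the same names for readability, the plan type contains only `Scan` and `Limit`, and the `combineLimits` rule rewrites just the root of the tree so that several passes are needed.

```scala
object RuleExecutorSketch {
  // Hypothetical miniature plan: limits stacked on top of a scan.
  sealed trait Plan
  final case class Scan(table: String) extends Plan
  final case class Limit(n: Int, child: Plan) extends Plan

  type Rule = Plan => Plan

  sealed trait Strategy { def maxIterations: Int }
  case object Once extends Strategy { val maxIterations = 1 }
  final case class FixedPoint(maxIterations: Int) extends Strategy

  final case class Batch(name: String, strategy: Strategy, rules: Rule*)

  // Merges one pair of adjacent limits at the root per application.
  val combineLimits: Rule = {
    case Limit(a, Limit(b, child)) => Limit(math.min(a, b), child)
    case other => other
  }

  // Run the batches in order. Within a batch, apply its rules in sequence and
  // repeat until the plan stops changing or the iteration bound is reached.
  def execute(batches: Seq[Batch], plan: Plan): Plan =
    batches.foldLeft(plan) { (start, batch) =>
      var current = start
      var iteration = 0
      var changed = true
      while (changed && iteration < batch.strategy.maxIterations) {
        val next = batch.rules.foldLeft(current)((p, rule) => rule(p))
        changed = next != current
        current = next
        iteration += 1
      }
      current
    }
}
```

For three stacked limits of 10, 5, and 7 over a scan, a `Once` batch merges only the top pair, while a `FixedPoint(100)` batch keeps going until a single `Limit(5, ...)` remains. The bound of 100 is an arbitrary choice for this sketch.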