

Spark SQL reading ORC files: overly long gap between Driver startup and Task (stage) execution (partition computation too slow), plus too many stripes per ORC file — analysis and solution

Published: 2024/4/14, by 豆豆

1. Background

The upstream job is capped at 7,000 files per day, each file smaller than 256 MB, 5+ billion rows in total, ORC format. Each file contains roughly 500 stripes. Block layout can be inspected with:

hdfs fsck viewfs://hadoop/nn01/warehouse/…….db/……/partition_date=2017-11-11/part-06999 -files -blocks

Stripe counts can be inspected with:

hive --orcfiledump viewfs://hadoop/nn01/warehouse/…….db/table/partition_date=2017-11-11/part-06999 | less
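As a rough sanity check on those numbers (my own back-of-envelope arithmetic, not from the original post): ~500 stripes in a file under 256 MB means an average stripe of about 0.5 MB, far below the 64 MB default stripe size — a hint that the writer was flushing stripes early.

```java
public class StripeEstimate {
    // Average stripe size given a file size and an observed stripe count.
    // The 256 MB / 500-stripe figures are the assumed values from the background above.
    public static long avgStripeBytes(long fileBytes, int stripeCount) {
        return fileBytes / stripeCount;
    }

    public static void main(String[] args) {
        long fileBytes = 256L * 1024 * 1024; // one ~256 MB part file
        int stripes = 500;                   // observed via `hive --orcfiledump`
        System.out.println(avgStripeBytes(fileBytes, stripes)); // ~0.5 MB per stripe
    }
}
```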

2. Symptoms

When Spark SQL reads these ORC files, the gap between job submission and the point where partitions are computed and Tasks begin executing is far too long.

The Driver log repeatedly prints lines like:

17/11/11 03:52:01 INFO BlockManagerMasterEndpoint: Registering block manager gh-data-hdp-dn0640.---:11942 with 6.1 GB RAM, BlockManagerId(554, ----, 11942)
17/11/11 03:52:29 INFO DFSClient: Firstly choose dn: DatanodeInfoWithStorage[10.20.--.--:50010,DS-32f8aaa5-c6ce-48a9-a2b1-3b169df193b9,DISK], --
17/11/11 03:52:29 INFO DFSClient: Firstly choose dn:

Minimal reproduction: even a trivial query like the one below shows the same behavior. The ApplicationMaster (Driver) starts, but Tasks are never launched and partitions are never computed; the Spark UI cannot render the DAG, and no Stage information appears.

SELECT * FROM table WHERE partition_date='2017-11-11' LIMIT 1;

3. Analysis

Initial analysis: the Driver itself was reading data from the DataNodes. The GC logs confirmed that the Driver was fetching ORC header/footer metadata from the DataNodes, driving it into full GC.

Source-level analysis: the behavior traces back to Spark's ORC split-generation strategy.

HiveConf.java shows that Spark reads ORC files with the HYBRID strategy by default:

```java
HIVE_ORC_SPLIT_STRATEGY("hive.exec.orc.split.strategy", "HYBRID",
    new StringSet(new String[]{"HYBRID", "BI", "ETL"}),
    "This is not a user level config. BI strategy is used when the requirement is to spend "
    + "less time in split generation as opposed to query execution (split generation does not read "
    + "or cache file footers). ETL strategy is used when spending little more time in split "
    + "generation is acceptable (split generation reads and caches file footers). HYBRID chooses "
    + "between the above strategies based on heuristics."),
```

The HYBRID split logic lives in OrcInputFormat.java:

```java
public SplitStrategy call() throws IOException {
  final SplitStrategy splitStrategy;
  AcidUtils.Directory dirInfo = AcidUtils.getAcidState(dir,
      context.conf, context.transactionList);
  List<Long> deltas = AcidUtils.serializeDeltas(dirInfo.getCurrentDirectories());
  Path base = dirInfo.getBaseDirectory();
  List<FileStatus> original = dirInfo.getOriginalFiles();
  boolean[] covered = new boolean[context.numBuckets];
  boolean isOriginal = base == null;

  // if we have a base to work from
  if (base != null || !original.isEmpty()) {
    // find the base files (original or new style)
    List<FileStatus> children = original;
    if (base != null) {
      children = SHIMS.listLocatedStatus(fs, base, AcidUtils.hiddenFileFilter);
    }

    long totalFileSize = 0;
    for (FileStatus child : children) {
      totalFileSize += child.getLen();
      AcidOutputFormat.Options opts =
          AcidUtils.parseBaseBucketFilename(child.getPath(), context.conf);
      int b = opts.getBucket();
      // If the bucket is in the valid range, mark it as covered.
      // I wish Hive actually enforced bucketing all of the time.
      if (b >= 0 && b < covered.length) {
        covered[b] = true;
      }
    }

    int numFiles = children.size();
    long avgFileSize = totalFileSize / numFiles;
    switch (context.splitStrategyKind) {
      case BI:
        // BI strategy requested through config
        splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal,
            deltas, covered);
        break;
      case ETL:
        // ETL strategy requested through config
        splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal,
            deltas, covered);
        break;
      default:
        // HYBRID strategy
        if (avgFileSize > context.maxSize) {
          splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal,
              deltas, covered);
        } else {
          splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal,
              deltas, covered);
        }
        break;
    }
  } else {
    // no base, only deltas
    splitStrategy = new ACIDSplitStrategy(dir, context.numBuckets, deltas, covered);
  }
  return splitStrategy;
}
```

How HYBRID behaves: when the Spark Driver starts, it reads file metadata from the NameNode and computes the average file size from the total size and file count. If that average exceeds the default 256 MB, the ETL strategy is chosen. ETL then reads the footer and other header metadata of every ORC file from the DataNodes; with many stripes or large metadata, this drives the Driver into full GC — which is exactly the observed long gap between Driver startup and Task execution.
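The heuristic above boils down to a few lines. The following is a simplified sketch of the HYBRID branch only (names and the 256 MB threshold are assumptions for illustration, not the actual Hive API):

```java
public class HybridHeuristic {
    // Simplified sketch of OrcInputFormat's HYBRID decision:
    // ETL (Driver reads footers from DataNodes) when files are large on average, BI otherwise.
    static String chooseStrategy(long totalFileSize, int numFiles, long maxSize) {
        long avgFileSize = totalFileSize / numFiles;
        return avgFileSize > maxSize ? "ETL" : "BI";
    }

    public static void main(String[] args) {
        long maxSize = 256L * 1024 * 1024; // assumed default threshold
        // e.g. 7000 files totaling 1.5 TB -> average ~214 MB < 256 MB -> BI
        System.out.println(chooseStrategy(1_500_000_000_000L, 7000, maxSize));
    }
}
```

This is why the problem only appears past a size threshold: a day with slightly larger average files silently flips the job from cheap split generation (BI) to footer-reading split generation (ETL).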

4. Solution

Spark 1.6.2:

```scala
val hiveContext = new HiveContext(sc)

// Default is 64M: a stripe is written out once ~64M of data has accumulated before compression.
// The related hive.exec.orc.default.row.index.stride=10000 controls how many rows go into one
// row-index group. Raising the stripe size keeps the stripe count per file down. Left unconfigured,
// a single file ends up with too many stripes, which hurts downstream readers; and when the ETL
// strategy is configured (or triggered by the HYBRID heuristic), the Driver must read too much
// metadata from the DataNodes, causing frequent GC and unacceptably slow partition computation.
hiveContext.setConf("hive.exec.orc.default.stripe.size", "268435456")

// Three strategies exist: {"HYBRID", "BI", "ETL"}, defaulting to "HYBRID". Per HiveConf:
// "BI strategy is used when the requirement is to spend less time in split generation as opposed
// to query execution (split generation does not read or cache file footers). ETL strategy is used
// when spending little more time in split generation is acceptable (split generation reads and
// caches file footers). HYBRID chooses between the above strategies based on heuristics."
// If left unset, once the average ORC file size exceeds 256M the ETL strategy fires, and the
// Driver spends a long time reading DataNode data to compute splits.
hiveContext.setConf("hive.exec.orc.split.strategy", "BI")
```

Spark 2.2.0:

```scala
// Create a Hive-enabled SparkSession
val sparkSession = SparkSession.builder()
  .appName("PvMvToBase")
  // Same reasoning as above: a larger default stripe size keeps the stripe count per file down,
  // so ETL-style split generation has less footer metadata to pull onto the Driver.
  .config("hive.exec.orc.default.stripe.size", 268435456L)
  // Force the BI strategy so split generation never reads file footers on the Driver.
  .config("hive.exec.orc.split.strategy", "BI")
  .enableHiveSupport()
  .getOrCreate()
```
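To see why raising the stripe size helps: with the 268435456-byte setting, a 256 MB file can in principle fit in a single stripe instead of the ~500 observed, shrinking the footer metadata an ETL-style Driver would read by a similar factor. A rough illustration, assuming stripes actually fill to the configured size (in practice writer memory pressure can flush them earlier):

```java
public class StripeCount {
    // Lower-bound stripe count if each stripe fills to stripeSize
    // (assumption: no early flushes due to writer memory pressure).
    static long minStripes(long fileBytes, long stripeSize) {
        return Math.max(1, (fileBytes + stripeSize - 1) / stripeSize); // ceiling division
    }

    public static void main(String[] args) {
        long file = 256L * 1024 * 1024;
        System.out.println(minStripes(file, 64L * 1024 * 1024)); // 4 with the 64M default
        System.out.println(minStripes(file, 268435456L));        // 1 with the raised setting
    }
}
```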
