Big Data Development: Hive Optimization, Part 6 - Hive on Spark
Note:
Hive version 2.1.1
1. An Introduction to Hive on Spark
Hive is a data warehouse built on the Hadoop platform. Originally developed at Facebook, it has matured over many years into the de facto SQL engine standard for Hadoop. Compared with engines such as Impala and Shark (the predecessor of Spark SQL), Hive has a much larger user base and more complete SQL support. Hive's original execution engine was MapReduce, but constrained by the rigid Map+Reduce computation model and its limited use of memory, MapReduce performance is hard to improve further.
In 2013, Hortonworks proposed Tez as an alternative execution engine to improve Hive's performance. Spark, a distributed computing engine originally developed at UC Berkeley, has drawn wide attention in the Hadoop community thanks to its flexible DAG execution model, aggressive use of memory, and the rich semantics expressible with RDDs. After becoming an Apache top-level project, Spark added stream processing, graph computation, and machine learning, and is widely regarded as the most promising next-generation general-purpose computing framework. Against this background, the Hive community launched the Hive on Spark project in 2014 (HIVE-7292), making Spark Hive's third execution engine after MapReduce and Tez. The project was developed jointly by Cloudera, Intel, MapR, and others, and drew attention from both the Hive and Spark communities. At the time that design doc was written, the core functionality of Hive on Spark was essentially complete; it was merged back to trunk in early January 2015 and was expected to ship in the next Hive release. This article covers the design and architecture of Hive on Spark, including how Hive queries are executed on Spark and how Spark is used to improve Hive's performance, as well as the project's status, plans, and preliminary performance numbers.
The proposal was to modify Hive to add Spark as a third execution backend (HIVE-7292), alongside MapReduce and Tez.
Spark is an open-source cluster computing framework for data analytics. It is built on top of HDFS, but outside Hadoop's two-stage MapReduce paradigm. Spark's core abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (for example, HDFS files) or by transforming other RDDs. Applying a series of transformations (such as groupBy and filter), followed by actions that Spark provides (such as count and save), processes and analyzes an RDD, achieving what a MapReduce job does but without materializing intermediate stages.
SQL queries translate naturally into Spark transformations and actions, as Shark and Spark SQL have demonstrated. In fact, many of the primitive transformations and actions are SQL-oriented, such as join and count.
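To make the transformation/action model concrete, here is a deliberately simplified, pure-Python sketch of the idea. FakeRDD is an illustrative stand-in, not Spark's actual API (although the method names mirror Spark's): transformations only build a lazy pipeline, and a single action evaluates the whole pipeline in one pass, with no intermediate stage written out.

```python
# Illustration only: a toy stand-in for Spark's RDD API (not the real thing).
# Transformations build a lazy pipeline; an action triggers one evaluation pass.
class FakeRDD:
    def __init__(self, data_fn):
        self._data_fn = data_fn  # no data is computed at construction time

    # Transformations return a new FakeRDD and do no work yet.
    def filter(self, pred):
        return FakeRDD(lambda: (x for x in self._data_fn() if pred(x)))

    def map(self, fn):
        return FakeRDD(lambda: (fn(x) for x in self._data_fn()))

    # An action forces evaluation of the whole pipeline in a single pass.
    def count(self):
        return sum(1 for _ in self._data_fn())

# Hypothetical sales rows: (day, amount)
rows = [("2021-01-06", 120), ("2021-01-07", 80), ("2021-01-08", 200)]
rdd = FakeRDD(lambda: iter(rows))

# Roughly: SELECT count(*) FROM sales WHERE amount > 100
result = rdd.filter(lambda r: r[1] > 100).count()
print(result)  # 2
```

This is how a simple filtered count maps onto a transformation (filter) plus an action (count); a real Spark job additionally distributes the data and the evaluation across the cluster.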
1.1 Motivation for Hive on Spark
The main motivations for running Hive on Spark are:
- Benefit to Spark users: this feature is valuable to users who already use Spark for their other data processing and machine learning needs. Standardizing on one execution backend is convenient operationally, and makes it easier to develop expertise for debugging issues and making improvements.
- Broader Hive adoption: following the previous point, this brings Hive as a SQL-on-Hadoop option to the Spark user base, further increasing Hive adoption.
- Performance: Hive queries, especially those involving multiple reduce stages, will run faster, improving the user experience much as Tez does.
The Spark backend is not intended to replace Tez or MapReduce. It is normal for the Hive project to have multiple backends coexist; users can choose Tez, Spark, or MapReduce, and each has different strengths depending on the use case. Hive's success does not hinge entirely on the success of either Tez or Spark.
1.2 Design Principles
The main design principle is to have no impact on, and place no restrictions on, Hive's existing code paths, and therefore no impact on existing functionality or performance: users who choose to run Hive on MapReduce or Tez keep exactly the functionality and code paths they have today. In addition, plugging Spark in at the execution layer maximizes code sharing and keeps maintenance costs down, so the Hive community does not need to make a Spark-specific investment.
At the same time, users who choose Spark as the execution engine automatically get all of the rich functionality Hive provides. Features added to Hive in the future (new data types, UDFs, logical optimizations, and so on) should become available to those users automatically, without any customization work in Hive's Spark execution engine.
1.3 Comparison with Shark and Spark SQL
Two related projects in the Spark ecosystem provide Hive QL support on Spark: Shark and Spark SQL.
The Shark project translates query plans generated by Hive into its own representation and executes them on Spark.
Spark SQL is a feature of Spark. It uses Hive's parser as its front end to provide Hive QL support. Spark application developers can easily express their data processing logic in SQL, alongside other Spark operators, in their code. Spark SQL supports a different use case than Hive.
Compared with Shark and Spark SQL, this design supports all existing Hive features, including Hive QL (and any future extensions), as well as Hive's integration with authorization, monitoring, auditing, and other operational tools.
1.4 Other Considerations
A new execution backend is a significant undertaking. It inevitably adds complexity and maintenance cost, even though the design avoids touching the existing code paths. Hive can now run its unit tests on MapReduce, Tez, and Spark, and the benefits are judged to outweigh the costs; on the infrastructure side, more hardware can be sponsored for continuous integration.
Finally, Hive on Tez has laid some important groundwork that will greatly help in supporting a new execution engine such as Spark, and this project will certainly benefit from it. On the other hand, Spark is a framework very different from MapReduce or Tez, so gaps and rough edges will likely be discovered during integration. The Hive community is expected to work closely with the Spark community to make the integration a success.
2. Hive on Spark Performance Test
Code:
set hive.execution.engine=mr;
select count(*) from ods_fact_sale;
set hive.execution.engine=spark;
select count(*) from ods_fact_sale;
Test log:
hive> set hive.execution.engine=mr;
hive> select count(*) from ods_fact_sale;
Query ID = root_20210106155340_8c89f5f6-c599-49e6-9cec-d73d278a1df6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
21/01/06 15:53:40 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1609141291605_0037, Tracking URL = http://hp3:8088/proxy/application_1609141291605_0037/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1609141291605_0037
Hadoop job information for Stage-1: number of mappers: 117; number of reducers: 1
2021-01-06 15:53:48,454 Stage-1 map = 0%,  reduce = 0%
2021-01-06 15:53:57,802 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 12.98 sec
... (intermediate per-percent progress lines omitted) ...
2021-01-06 16:01:04,310 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 693.43 sec
2021-01-06 16:01:06,358 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 695.5 sec
MapReduce Total cumulative CPU time: 11 minutes 35 seconds 500 msec
Ended Job = job_1609141291605_0037
MapReduce Jobs Launched:
Stage-Stage-1: Map: 117  Reduce: 1  Cumulative CPU: 695.5 sec  HDFS Read: 31436910990  HDFS Write: 109  HDFS EC Read: 0  SUCCESS
Total MapReduce CPU Time Spent: 11 minutes 35 seconds 500 msec
OK
767830000
Time taken: 447.145 seconds, Fetched: 1 row(s)

hive> set hive.execution.engine=spark;
hive> select count(*) from ods_fact_sale;
Query ID = root_20210106160132_8d81e192-ceb7-46a3-bc60-70a5eeabce87
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Running with YARN Application = application_1609141291605_0038
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/yarn application -kill application_1609141291605_0038
Hive on Spark Session Web UI URL: http://hp3:44667
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
--------------------------------------------------------------------------------------
          STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED
--------------------------------------------------------------------------------------
Stage-0 ........         0      FINISHED    117        117        0        0       0
Stage-1 ........         0      FINISHED      1          1        0        0       0
--------------------------------------------------------------------------------------
STAGES: 02/02    [==========================>>] 100%  ELAPSED TIME: 51.30 s
--------------------------------------------------------------------------------------
Spark job[0] finished successfully in 51.30 second(s)
Spark Job[0] Metrics: TaskDurationTime: 316285, ExecutorCpuTime: 247267, JvmGCTime: 5415, BytesRead / RecordsRead: 31436921640 / 767830000, BytesReadEC: 0, ShuffleTotalBytesRead / ShuffleRecordsRead: 6669 / 117, ShuffleBytesWritten / ShuffleRecordsWritten: 6669 / 117
OK
767830000
Time taken: 71.384 seconds, Fetched: 1 row(s)
As the log shows, the same count over 767,830,000 rows dropped from 447 seconds (about 7.5 minutes) under MapReduce to 71 seconds under Spark, roughly a 6x improvement.
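As a quick sanity check, the speedup can be computed directly from the two wall-clock times reported in the log above:

```python
# Wall-clock times reported by the two runs in the log above.
mr_seconds = 447.145     # MapReduce run ("Time taken: 447.145 seconds")
spark_seconds = 71.384   # Hive on Spark run ("Time taken: 71.384 seconds")

speedup = mr_seconds / spark_seconds
print(f"Spark was {speedup:.1f}x faster")  # about 6.3x
```

Note that this is a single cold run of a single query on one cluster; a serious comparison would repeat the runs and cover more query shapes.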
References
1.https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
2.http://lxw1234.com/archives/2015/05/200.htm