Hive 3: Execution Plans (Explain), Fetch Tasks, Local Mode, Table Optimization, Group By, Cartesian Products, and Row/Column Filtering
I. Execution Plans (Explain)
1) Basic syntax
EXPLAIN [EXTENDED | DEPENDENCY | AUTHORIZATION] query
2) Hands-on example
(1) View the execution plans of the following statements
A query that generates no MR job:
hive (default)> explain select * from emp;
Explain
STAGE DEPENDENCIES:
  Stage-0 is a root stage
STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: emp
          Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: empno (type: int), ename (type: string), job (type: string), mgr (type: int), hiredate (type: string), sal (type: double), comm (type: double), deptno (type: int)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
            ListSink
A query that does generate an MR job:
hive (default)> explain select deptno, avg(sal) avg_sal from emp group by deptno;
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
        TableScan
          alias: emp
          Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: sal (type: double), deptno (type: int)
            outputColumnNames: sal, deptno
            Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
            Group By Operator
              aggregations: sum(sal), count(sal)
              keys: deptno (type: int)
              mode: hash
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: int)
                sort order: +
                Map-reduce partition columns: _col0 (type: int)
                Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col1 (type: double), _col2 (type: bigint)
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0), count(VALUE._col1)
          keys: KEY._col0 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), (_col1 / _col2) (type: double)
            outputColumnNames: _col0, _col1
            Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1 Data size: 7020 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
(2) View the detailed execution plan
hive (default)> explain extended select * from emp;
hive (default)> explain extended select deptno, avg(sal) avg_sal from emp group by deptno;
II. Fetch Tasks
Fetch means that for certain queries Hive does not need to run a MapReduce computation at all. For example, for SELECT * FROM employees; Hive can simply read the files under the table's storage directory and write the results to the console.
In hive-default.xml.template, hive.fetch.task.conversion defaults to more (older Hive versions defaulted to minimal). With the property set to more, whole-table reads, single-column projections, and LIMIT queries all bypass MapReduce.
<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
  <description>
    Expects one of [none, minimal, more].
    Some select queries can be converted to single FETCH task minimizing latency.
    Currently the query should be single sourced not having any subquery and
    should not have any aggregations or distincts (which incurs RS),
    lateral views and joins.
    0. none : disable hive.fetch.task.conversion
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
  </description>
</property>
1) Hands-on example:
(1) Set hive.fetch.task.conversion to none and run the following queries; every one of them launches a MapReduce job.
hive (default)> set hive.fetch.task.conversion=none;
hive (default)> select * from emp;
hive (default)> select ename from emp;
hive (default)> select ename from emp limit 3;
(2) Set hive.fetch.task.conversion to more and run the same queries; none of them launches a MapReduce job.
hive (default)> set hive.fetch.task.conversion=more;
hive (default)> select * from emp;
hive (default)> select ename from emp;
hive (default)> select ename from emp limit 3;
III. Local Mode
Most Hadoop jobs need the full scalability Hadoop provides in order to process large datasets. Sometimes, however, Hive's input is very small, and the time spent launching a job for the query can far exceed the job's actual execution time. For most of these cases, Hive can process the whole task on a single machine in local mode, which noticeably shortens execution time on small datasets.
Set hive.exec.mode.local.auto to true to let Hive turn on this optimization automatically when appropriate.
set hive.exec.mode.local.auto=true;  -- enable local MR
-- maximum input size for local MR; local mode is used when the input is smaller
-- than this value (default 134217728, i.e. 128 MB)
set hive.exec.mode.local.auto.inputbytes.max=50000000;
-- maximum number of input files for local MR; local mode is used when there are
-- fewer files than this value (default 4)
set hive.exec.mode.local.auto.input.files.max=10;
1) Hands-on example:
(1) With local mode disabled (the default), run a query:
hive (default)> select count(*) from emp group by deptno;
(2) Enable local mode and run the same query:
hive (default)> set hive.exec.mode.local.auto=true;
hive (default)> select count(*) from emp group by deptno;
IV. Table Optimization
1. Small-Table/Large-Table Join (MapJoin)
Put the table whose keys are relatively scattered and whose data volume is small on the left side of the join, and use a map join to load the small dimension table into memory so the join completes on the map side.
In practice, recent Hive versions optimize both small-table-JOIN-large-table and large-table-JOIN-small-table; it no longer matters which side the small table is on.
1.1 Hands-on example
1) Requirement
Compare the efficiency of a large table JOIN a small table against a small table JOIN a large table.
2) MapJoin parameter settings
(1) Enable automatic MapJoin conversion:
set hive.auto.convert.join = true;  -- default is true
(2) Size threshold between large and small tables (by default, anything under 25 MB is treated as a small table):
set hive.mapjoin.smalltable.filesize = 25000000;
3) How MapJoin works
A local task first reads the small table and builds an in-memory hash table from it; the hash table is shipped to every mapper through the distributed cache, and each mapper streams the large table against it, so the join completes on the map side with no reduce phase.
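If automatic conversion is disabled, a map join can also be requested explicitly through Hive's legacy hint syntax. A minimal sketch, assuming the bigtable/smalltable pair created in step 4 below (note that recent Hive versions ignore the hint unless hive.ignore.mapjoin.hint is set to false):
-- ask Hive to hash-load the table aliased s into every mapper
select /*+ MAPJOIN(s) */ b.id, b.t, b.uid
from bigtable b join smalltable s on s.id = b.id;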
4) Statements to create the large table, the small table, and the table that holds the join result
-- create the large table
create table bigtable(id bigint, t bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
-- create the small table
create table smalltable(id bigint, t bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
-- create the table that receives the join result
create table jointable(id bigint, t bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
5) Load data into the large and small tables
hive (default)> load data local inpath '/opt/module/data/bigtable' into table bigtable;
hive (default)> load data local inpath '/opt/module/data/smalltable' into table smalltable;
6) Small table JOIN large table
insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from smalltable s join bigtable b on b.id = s.id;
7) Large table JOIN small table
insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable b join smalltable s on s.id = b.id;
2. Large-Table/Large-Table Join
1) Filtering null keys
Sometimes a join times out because a few keys carry too much data: all rows with the same key are sent to the same reducer, which then runs out of memory. These abnormal keys deserve careful analysis; in many cases they correspond to abnormal data that should be filtered out in the SQL. For example, when the key field is null:
(1) Configure the job history server
Configure mapred-site.xml:
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>hadoop20:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>hadoop20:19888</value>
</property>
Start the history server:
sbin/mr-jobhistory-daemon.sh start historyserver
View the job history at:
http://hadoop20:19888/jobhistory
(2) Create the null-id table for the raw data
-- create the null-id table
create table nullidtable(id bigint, t bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
(3) Load the raw data and the null-id data into their tables
hive (default)> load data local inpath '/opt/module/data/nullid' into table nullidtable;
Loading data to table default.nullidtable
OK
Time taken: 2.045 seconds
Sample rows (the id field is \N, i.e. null):
\N 20111230000005 57375476989eea12893c0c3811607bcf 奇藝高清 1 1 http://www.123qiyi.com/
\N 20111230000005 66c5bb7774e31d0a22278249b26bc83a 凡人修仙傳 3 1 http://www.123booksky.org/BookDetail.aspx?BookID=1050804&Level=1
\N 20111230000007 b97920521c78de70ac38e3713f524b50 本本聯盟 1 1 http://www.123bblianmeng.com/
\N 20111230000008 6961d0c97fe93701fc9c0d861d096cd9 華南師范大學圖書館 1 1 http://lib.scnu.edu.cn/
\N 20111230000008 f2f5a21c764aebde1e8afcc2871e086f 在線代理 2 1 http://proxyie.cn/
\N 20111230000009 96994a0480e7e1edcaef67b20d8816b7 偉大導演 1 1 http://movie.douban.com/review/1128960/
\N 20111230000009 698956eb07815439fe5f46e9a4503997 youku 1 1 http://www.123youku.com/
\N 20111230000009 599cd26984f72ee68b2b6ebefccf6aed 安徽合肥365房產網 1 1 http://hf.house365.com/
\N 20111230000010 f577230df7b6c532837cd16ab731f874 哈薩克網址大全 1 1 http://www.123kz321.com/
(4) Test without filtering null ids
hive (default)> insert overwrite table jointable
select n.* from nullidtable n left join bigtable o on n.id = o.id;
(5) Test with null ids filtered out (recommended)
hive (default)> insert overwrite table jointable
select n.* from (select * from nullidtable where id is not null) n left join bigtable o on n.id = o.id;
2) Null-key conversion
Sometimes a null key carries many rows, yet the data is not abnormal and must appear in the join result. In that case we can assign a random value to the rows of table a whose key is null, so that they are spread evenly and randomly across the reducers. For example:
2.1) Without randomly distributing the null values:
(1) Set the number of reducers to 5
set mapreduce.job.reduces = 5;
(2) Join the two tables
insert overwrite table jointable
select n.* from nullidtable n left join bigtable b on n.id = b.id;
Result: the data is skewed; some reducers consume far more resources than the others.
2.2) Randomly distributing the null values
(1) Set the number of reducers to 5
set mapreduce.job.reduces = 5;
(2) Join the two tables
insert overwrite table jointable
select n.* from nullidtable n full join bigtable o on nvl(n.id, rand()) = o.id;
Result: the skew is eliminated and the load is balanced evenly across the reducers.
3) SMB (Sort-Merge-Bucket) Join
Both tables are bucketed and sorted on the join key with the same number of buckets, so each pair of corresponding buckets can be merge-joined on its own, without loading either whole table into memory.
(1) Create a second large table
create table bigtable2(id bigint, t bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
load data local inpath '/opt/module/data/bigtable' into table bigtable2;
Test a direct JOIN of the two large tables:
insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable s join bigtable2 b on b.id = s.id;
(2) Create bucketed table 1; the number of buckets should not exceed the number of available CPU cores
create table bigtable_buck1(id bigint, t bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
clustered by(id) sorted by(id) into 6 buckets
row format delimited fields terminated by '\t';
load data local inpath '/opt/module/data/bigtable' into table bigtable_buck1;
(3) Create bucketed table 2; again, no more buckets than available CPU cores
create table bigtable_buck2(id bigint, t bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
clustered by(id) sorted by(id) into 6 buckets
row format delimited fields terminated by '\t';
load data local inpath '/opt/module/data/bigtable' into table bigtable_buck2;
(4) Set the parameters
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
(5) Test the bucketed join
insert overwrite table jointable
select b.id, b.t, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable_buck1 s join bigtable_buck2 b on b.id = s.id;
V. Group By
By default, all rows with the same key are distributed from the map stage to a single reducer, so the job skews whenever one key carries too much data.
Not every aggregation has to be done on the reduce side. Many aggregations can be partially computed on the map side first, with the reduce side only merging the partial results.
1) Map-side aggregation parameters
(1) Whether to aggregate on the map side (default true):
set hive.map.aggr = true;
(2) Number of rows examined per aggregation interval on the map side:
set hive.groupby.mapaggr.checkinterval = 100000;
(3) Load-balance when the data is skewed (default false):
set hive.groupby.skewindata = true;
When this option is set to true, the generated query plan contains two MR jobs. In the first job, the map output is distributed randomly across the reducers, and each reducer performs a partial aggregation; rows with the same Group By key may end up on different reducers, which is what balances the load. The second job then distributes the pre-aggregated results to the reducers by Group By key (this guarantees that identical keys reach the same reducer) and completes the final aggregation.
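The same idea can be written out by hand as a salted two-stage aggregation. A minimal sketch for counting rows per deptno in emp (the salt width of 5 and the query shape are illustrative; this is not literally the plan skewindata generates):
-- stage 1: group on (deptno, random salt) so one hot deptno is split across reducers
-- stage 2: merge the per-salt partial counts back into one count per deptno
select deptno, sum(partial_cnt) as cnt
from (
    select deptno, count(*) as partial_cnt
    from emp
    group by deptno, floor(rand() * 5)
) t
group by deptno;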
Before the optimization:
hive (default)> select deptno from emp group by deptno;
Stage-Stage-1: Map: 1  Reduce: 5  Cumulative CPU: 23.68 sec  HDFS Read: 19987 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 23 seconds 680 msec
OK
deptno
10
20
30
After the optimization:
hive (default)> set hive.groupby.skewindata = true;
hive (default)> select deptno from emp group by deptno;
Stage-Stage-1: Map: 1  Reduce: 5  Cumulative CPU: 28.53 sec  HDFS Read: 18209 HDFS Write: 534 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 5  Cumulative CPU: 38.32 sec  HDFS Read: 15014 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 6 seconds 850 msec
OK
deptno
10
20
30
VI. Count(Distinct) Deduplication
On small data volumes it hardly matters, but on large data a COUNT DISTINCT is executed by a single reduce task, and the amount of data that one reducer must process can make the whole job very hard to finish. COUNT DISTINCT is therefore generally replaced by a GROUP BY followed by COUNT, while keeping an eye on the data skew the GROUP BY itself can introduce.
1) Hands-on example
(1) Create a large table
create table bigtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
(2) Load the data
hive (default)> load data local inpath '/opt/module/data/bigtable' into table bigtable;
(3) Set the number of reducers to 5
set mapreduce.job.reduces = 5;
(4) Run the distinct-id query
hive (default)> select count(distinct id) from bigtable;
Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 7.12 sec  HDFS Read: 120741990 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 120 msec
OK
_c0
100001
Time taken: 23.607 seconds, Fetched: 1 row(s)
(5) Deduplicate ids with GROUP BY instead
hive (default)> select count(id) from (select id from bigtable group by id) a;
Stage-Stage-1: Map: 1  Reduce: 5  Cumulative CPU: 17.53 sec  HDFS Read: 120752703 HDFS Write: 580 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1  Cumulative CPU: 4.29 sec  HDFS Read: 9409 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 21 seconds 820 msec
OK
_c0
100001
Time taken: 50.795 seconds, Fetched: 1 row(s)
Although this uses one extra job, it is absolutely worth it when the data volume is large.
VII. Cartesian Products
Avoid Cartesian products. When a join has no ON condition, or an invalid one, Hive can only use a single reducer to compute the Cartesian product.
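A quick sketch of the anti-pattern and its fix, reusing the bigtable/smalltable pair from the MapJoin example above (the equality condition is illustrative):
-- anti-pattern: every row pairs with every row, forced onto one reducer
-- (blocked outright when hive.strict.checks.cartesian.product is enabled)
select b.id, s.id from bigtable b cross join smalltable s;
-- fix: a real equality condition lets Hive distribute the join across reducers
select b.id, s.id from bigtable b join smalltable s on b.id = s.id;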
VIII. Row and Column Filtering
Column filtering: in the SELECT list, take only the columns you need; if the table is partitioned, filter on the partition columns wherever possible, and avoid SELECT *.
Row filtering: with partition pruning, when an outer join is used and the filter on the secondary table is written in the WHERE clause, the whole tables are joined first and only filtered afterwards. Filtering inside a subquery avoids this (recent Hive versions can often push such predicates down automatically, but the subquery makes the intent explicit). For example:
1) Test: join the two tables first, then filter with a WHERE condition
hive (default)> select o.id from bigtable b join bigtable o on o.id = b.id where o.id <= 10;
Time taken: 34.406 seconds, Fetched: 100 row(s)
2) Filter in a subquery first, then join
hive (default)> select b.id from bigtable b join (select id from bigtable where id <= 10) o on b.id = o.id;
Time taken: 30.058 seconds, Fetched: 100 row(s)