
Getting Started with JOIN in Hive

Published: 2024/1/17


Using JOIN in Hive

The joins commonly used in Hive are:

inner join (equi-join)

left join
right join
full join

left semi join

cross join (Cartesian product)

multiple (joining more than two tables in one query)

Data preparation:

join_a.txt:
1	zhangsan
2	lisi
3	wangwu

join_b.txt:
1	30
2	29
4	21

Create table a and load the data:

hive>create table a(id int, name string) row format delimited fields terminated by '\t';
hive>load data local inpath '/opt/data/join_a.txt' overwrite into table a;

Create table b and load the data:

hive>create table b(id int, age int) row format delimited fields terminated by '\t';
hive>load data local inpath '/opt/data/join_b.txt' overwrite into table b;

Testing

Inner join

hive>select a.id,a.name,b.age from a join b on a.id=b.id;

Execution result: only ids 1 and 2 exist in both tables, so two rows come back: (1, zhangsan, 30) and (2, lisi, 29).

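Left outer join

hive>select a.id,a.name,b.age from a left join b on a.id=b.id;

Execution result: every row of a is kept. Ids 1 and 2 get their ages from b; id 3 has no match in b, so its age is NULL.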

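Right outer join

hive>select a.id,a.name,b.age from a right join b on a.id=b.id;

Execution result: every row of b is kept. The row of b with id 4 has no match in a. Because the query selects a.id rather than b.id, that row shows NULL for id (and for name), with age 21.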

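Full join

hive>select a.id,a.name,b.age from a full join b on a.id=b.id;

Execution result: all rows of both tables are kept. Ids 1 and 2 are matched, id 3 from a has a NULL age, and the unmatched row from b appears as (NULL, NULL, 21).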

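Cartesian product (a cross join takes no join condition)

hive>select a.id,a.name,b.age from a cross join b;

Execution result: every row of a is paired with every row of b, giving 3 x 3 = 9 rows.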
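
The result sets of the joins tested above can be checked outside Hive with a small sketch. This is plain Python, not Hive code: it hard-codes the rows of join_a.txt and join_b.txt, uses None in place of NULL, and assumes ids are unique within each table (true for the sample data, not in general). The variable names are made up for illustration, and Hive does not guarantee the row order shown here.

```python
# Sample rows copied from join_a.txt and join_b.txt.
a = [(1, "zhangsan"), (2, "lisi"), (3, "wangwu")]
b = [(1, 30), (2, 29), (4, 21)]

name_by_id = dict(a)  # id -> name (assumes unique ids)
age_by_id = dict(b)   # id -> age  (assumes unique ids)

# Every query selects a.id, a.name, b.age; None stands in for NULL.
inner = [(i, n, age_by_id[i]) for i, n in a if i in age_by_id]

# Left join: keep every row of a, fill a missing age with None.
left = [(i, n, age_by_id.get(i)) for i, n in a]

# Right join: keep every row of b. a.id is selected, so a row of b
# without a match in a has None for both id and name.
right = [(i if i in name_by_id else None, name_by_id.get(i), age)
         for i, age in b]

# Full join: the left join rows plus the rows of b with no match in a.
full = left + [(None, None, age) for i, age in b if i not in name_by_id]

# Left semi join: rows of a that have a match in b, columns of a only.
semi = [(i, n) for i, n in a if i in age_by_id]

# Cross join: every row of a paired with every row of b.
cross = [(i, n, age) for i, n in a for _, age in b]

for label, rows in [("inner", inner), ("left", left), ("right", right),
                    ("full", full), ("semi", semi), ("cross", cross)]:
    print(label, rows)
```

The left semi join is included because it is listed among the join types above even though no Hive query for it is shown.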

MapReduce implementation of Common Join and Map Join

Link: http://blog.csdn.net/lemonzhaotao/article/details/78209708
Hive runs on top of MapReduce, so implementing both joins by hand in MapReduce gives a closer look at how each one executes and helps when learning Hive.

How Common Join works

Common Join is the traditional way to implement a join. It performs poorly because it involves a shuffle.
common join / shuffle join / reduce join all refer to the same thing.

Execution flow

Implementation approach (the oldest one)

Take tables a and b.
Use this SQL as the example to implement: select a.id,a.name,b.age from a join b on a.id=b.id
The two tables are joined on id.

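1) map reads table a ==> <id, (name)>
<1,(zhangsan)> <2,(lisi)> <3,(wangwu)>
2) map reads table b ==> <id, (age)>
<1,(30)> <2,(29)> <4,(21)>
3) shuffle: hash(key)
<1,(zhangsan,30)> <2,(lisi,29)> <3,(wangwu)> <4,(21)>
4) reduce
1,(zhangsan,30)
2,(lisi,29)
3,(wangwu)
4,(21)

Keys 3 and 4 carry a value from only one table, so the inner join emits rows only for keys 1 and 2.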
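
The four steps above can be sketched in plain Python. This is a single-process illustration of the reduce-side join idea, not Hive's or Hadoop's actual code: the function names map_phase, shuffle and reduce_phase and the "a"/"b" source tags are assumptions made for the example.

```python
from collections import defaultdict

a = [(1, "zhangsan"), (2, "lisi"), (3, "wangwu")]
b = [(1, 30), (2, 29), (4, 21)]


def map_phase(rows, tag):
    """Emit (key, (tag, value)) so the reducer can tell the source table."""
    return [(key, (tag, value)) for key, value in rows]


def shuffle(pairs):
    """Group all tagged values by join key, as the shuffle does by hash(key)."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups


def reduce_phase(groups):
    """Inner join: pair every name with every age that shares the key."""
    out = []
    for key in sorted(groups):
        names = [value for tag, value in groups[key] if tag == "a"]
        ages = [value for tag, value in groups[key] if tag == "b"]
        out.extend((key, name, age) for name in names for age in ages)
    return out


groups = shuffle(map_phase(a, "a") + map_phase(b, "b"))
joined = reduce_phase(groups)
print(dict(groups))  # keys 3 and 4 hold a value from one table only
print(joined)        # rows exist only for keys present in both tables
```

Tagging each value with its source table is what lets a single reducer separate names from ages once both have been grouped under the same key.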

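Problems caused by this execution flow

Data skew
Root cause: a single key has far too many records, so the one task that handles it receives a very large share of the data.

Shuffle
Once a shuffle occurs, performance drops considerably (this holds for both Hadoop and Spark).
Hence a rule of thumb in Spark development: if a shuffle can be avoided, do not use operators that trigger one,
because the shuffle is the most expensive part.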
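
The data skew problem described above can be made concrete with a small sketch of hash partitioning. Everything here is hypothetical: the reducer count, the key distribution and the partition function are chosen for illustration and are not taken from Hive or Hadoop.

```python
from collections import Counter

NUM_REDUCERS = 4  # hypothetical number of reduce tasks

# Hypothetical join keys: key 1 is "hot" and appears far more often.
keys = [1] * 1000 + [2] * 10 + [3] * 10 + [4] * 10


def partition(key, num_reducers=NUM_REDUCERS):
    """Hash partitioning: all records with the same key land on one reducer."""
    return hash(key) % num_reducers


# Count how many records each reducer receives after the shuffle.
load = Counter(partition(k) for k in keys)
for reducer in range(NUM_REDUCERS):
    print(f"reducer {reducer}: {load[reducer]} records")
```

Because every record with the hot key is routed to the same reducer, that one task processes 1000 of the 1030 records while the others stay almost idle.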

How MapJoin works

A map join is also called a broadcast join.
A map join has neither a reduce phase nor a shuffle phase.

Execution flow

1) Task A starts first.
2) Task A launches a MapReduce local task.
3) The local task reads the small table data.
4) It then generates HashTable Files.
5) The files are loaded into the distributed cache.
6) The MapJoin Task starts and reads the big table. Each record it reads is looked up against the data in the Distributed Cache and is output when a match is found.

There is no reduce and no shuffle anywhere in this process.

Principle

The small table's data is loaded into memory.

There is no shuffle.

Drawback: memory (if the small table is too large, an OOM may occur).
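
The principle above amounts to a hash join done entirely on the map side. The following is a minimal single-process sketch, assuming table b plays the small table and table a the big one. The function names are made up for the example, and the distributed cache is not modeled: the hash table is simply passed as an argument.

```python
a = [(1, "zhangsan"), (2, "lisi"), (3, "wangwu")]  # treated as the big table
b = [(1, 30), (2, 29), (4, 21)]                    # treated as the small table


def build_hash_table(small_rows):
    """Local task: load the whole small table into an in-memory hash table."""
    table = {}
    for key, value in small_rows:
        table.setdefault(key, []).append(value)
    return table


def map_join(big_rows, hash_table):
    """Map task: stream the big table and probe the hash table row by row."""
    for key, name in big_rows:
        for age in hash_table.get(key, []):
            yield (key, name, age)


hash_table = build_hash_table(b)
result = list(map_join(a, hash_table))
print(result)
```

The big table is never grouped or moved, which is why no shuffle is needed. The cost is that the entire small table must fit in memory, which is where the OOM risk comes from.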

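Comparing Common Join and MapJoin through the execution plan

Compare the two join operations using the information printed in the logs.
This matters a great deal: you should be able to draw the diagram of each join and know how both work.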

Data preparation

dept.txt:
10	ACCOUNTING	NEW YORK
20	RESEARCH	DALLAS
30	SALES	CHICAGO
40	OPERATIONS	BOSTON

Create the dept table

hive>create table dept(deptno int, dname string, loc string) row format delimited fields terminated by '\t';

Load the data into the dept table

hive>load data local inpath '/opt/data/dept.txt' overwrite into table dept;

Test

In older Hive versions a map join has to be written with an explicit hint.
The hint /*+MAPJOIN(d)*/ tells Hive that d is the small table.

hive>select /*+MAPJOIN(d)*/ e.empno, e.ename, d.dname from emp e join dept d on e.deptno=d.deptno;

Note: reading execution plans is a very important skill in Hive.

Link:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain

Syntax:

EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query

The hive.auto.convert.join parameter
hive.auto.convert.join=true means:
Hive automatically converts an ordinary join into a map join. In other words, Hive works out which table is large and which is small and does the conversion for us.
To make the difference easier to see in this test, set the value to false first.
Set the parameter:

hive>set hive.auto.convert.join=false;
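
The idea behind the automatic conversion can be sketched as a simple size check. This is a deliberate simplification and not Hive's actual planner logic, which consults more settings and table statistics. The function choose_join, the table sizes and the 25 MB limit are all assumptions for the example; the limit is loosely modeled on the hive.mapjoin.smalltable.filesize setting.

```python
SMALL_TABLE_LIMIT_BYTES = 25_000_000  # assumed threshold for this sketch


def choose_join(table_sizes, auto_convert=True, limit=SMALL_TABLE_LIMIT_BYTES):
    """Pick a join strategy from table sizes.

    table_sizes maps a table alias to its size in bytes (hypothetical input).
    Returns ("map join", small_table_alias) or ("common join", None).
    """
    if not auto_convert:
        # Mirrors hive.auto.convert.join=false: never convert.
        return ("common join", None)
    smallest = min(table_sizes, key=table_sizes.get)
    if table_sizes[smallest] <= limit:
        return ("map join", smallest)
    return ("common join", None)


sizes = {"e": 2_000_000_000, "d": 80}  # hypothetical sizes for emp and dept
print(choose_join(sizes, auto_convert=True))
print(choose_join(sizes, auto_convert=False))
```

With the conversion switched off, the query stays a common join no matter how small dept is, which is why the test sets the value to false before looking at the common join plan.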

View the common join execution plan

hive>explain select e.empno, e.ename, d.dname from emp e join dept d on e.deptno=d.deptno;

Printed output:

Notes on parts of the printed output:
TableScan: a table is being read
Filter Operator: a filter step, here the one for e.deptno=d.deptno
value expressions: the fields that are output as the value

View the map join execution plan

hive>set hive.auto.convert.join=true;
hive>explain select e.empno, e.ename, d.dname from emp e join dept d on e.deptno=d.deptno;

Printed execution plan:

The map join can also be followed through the log printed to the console:

starting to launch local task
Corresponds to the MapReduce local task.

Dump ….. hashtable
The hashtable has been sunk (written out).

uploaded ….
The hashtable has been uploaded to the distributed cache. This is the small-table part of the flow.

End of local task
The local task has finished.

MapredLocal task successfully
The map local task succeeded.

Launching Job 1 out of 1
A job is launched.

number of mappers: 1; number of reducers: 0
There is no reducer, and the number of mappers is 1.

Key points:
An ordinary join generates 2 stages.
A mapjoin generates 3 stages.
