Hive Study (5) --- Using Partitions (Including Dynamic Partitions)
The following article gives a good walkthrough of how partitions are used:
http://www.aahyhaa.com/archives/316
Other references:
http://p-x1984.iteye.com/blog/1156408
http://www.cnblogs.com/tangtianfly/archive/2012/03/13/2393449.html
http://www.2cto.com/kf/201210/160777.html
http://blog.csdn.net/acmilanvanbasten/article/details/17252673
However, one point deserves emphasis here, because it was also a misconception of mine while learning:
For a table with a partition column, imported data can only be loaded into a partition you specify. I used to assume that, on import, Hive would automatically partition the data according to that column. What is the difference?
For example, suppose my table is partitioned by city, and I have weather data for several cities, roughly like this:
2014-05-23|07:33:58 China shenzhen rain -28 -21 199
2014-05-23|07:33:58 China hangzhou fine -26 -19 200
2014-05-23|07:33:58 China hangzhou fine 6 14 200
Then I load this data into the table:
load data inpath '/tmp/wetherdata4.txt' into table weatherpartion partition(city='hangzhou');
我的預(yù)期是:會根據(jù)city字段創(chuàng)建2個分區(qū)目錄,一個叫hangzhou,一個叫shenzhen,并且會shenzhen的這一行記錄放到shenzhen這個分區(qū),把杭州的2行記錄放到hangzhou
但實際上,只創(chuàng)建了一個分區(qū)hangzhou,并且3條數(shù)據(jù)都加載進(jìn)了hangzhou這個分區(qū),這很明顯,數(shù)據(jù)與分區(qū)沒有一致。
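This behavior can be sketched outside Hive (a toy model of a static-partition LOAD, not Hive's actual code): the file is moved as-is under the target partition directory, and the rows are never inspected or split.

```python
import os
import shutil
import tempfile

def static_load(src_file, warehouse, table, partition_spec):
    """Mimic LOAD DATA ... PARTITION (k='v'): move the whole file
    into the partition directory; row contents are never examined."""
    part_dir = os.path.join(
        warehouse, table,
        *(f"{k}={v}" for k, v in partition_spec.items()))
    os.makedirs(part_dir, exist_ok=True)
    shutil.move(src_file, os.path.join(part_dir, os.path.basename(src_file)))
    return part_dir

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "wetherdata4.txt")
with open(src, "w") as f:
    f.write("2014-05-23|07:33:58 China shenzhen rain -28 -21 199\n"
            "2014-05-23|07:33:58 China hangzhou fine -26 -19 200\n"
            "2014-05-23|07:33:58 China hangzhou fine 6 14 200\n")

dest = static_load(src, tmp, "weatherpartion", {"city": "hangzhou"})
# Only one partition directory exists, and all 3 rows (including the
# shenzhen one) now live under city=hangzhou.
print(os.listdir(os.path.join(tmp, "weatherpartion")))               # ['city=hangzhou']
print(sum(1 for _ in open(os.path.join(dest, "wetherdata4.txt"))))   # 3
```

This matches what the experiment above showed: the partition comes from the statement, not from the data.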
Thinking more carefully about the scenarios where partitions are used, my understanding is:
1. When data files are generated, different files are produced according to some field. A typical case is log files: one file is produced per day, and all entries for the same day go into the same file.
2. These different data files are then accumulated into a single table for large-scale analysis.
Given that, it is easier to understand why, at import time, one file can only be loaded into one specified partition (possibly a partition identified by several keys, e.g. city=hangzhou, country=china).
再看看創(chuàng)建表時候的語句,被指定分區(qū)的字段其實不是表創(chuàng)建時候的字段(比如city字段),也就是說,其實用于分區(qū)的字段,并不應(yīng)該作為數(shù)據(jù)真正的字段,只能認(rèn)為是一個輔助字段,而為了hql語法上的支持,故hive會在創(chuàng)建的時候把分區(qū)字段也加入到表的字段中,因為語法需要。
For example, suppose every city reports its weather data to some agency, and the agency partitions by city. Each city's file need not contain a city column at all: within one city's file, the city value is the same on every row, so storing it is redundant. When importing, the agency simply specifies partition(city='XXX'), and each city's data lands in the corresponding directory. When querying the weather of a given city, Hive then only reads the raw files under that directory and never touches the files of other cities, which improves efficiency.
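The point that the partition value lives in the directory name rather than in the data file can be illustrated with a toy reader (a hypothetical sketch with simplified rows, not Hive code): the city value is recovered from the path, and non-matching directories are never read.

```python
# Toy warehouse: directory name -> lines of the data file (no city column inside)
partitions = {
    "city=hangzhou": ["2014-05-23|07:33:58 fine -26 -19 200",
                      "2014-05-23|07:33:58 fine 6 14 200"],
    "city=shenzhen": ["2014-05-23|07:33:58 rain -28 -21 199"],
}

def scan(partitions, city):
    """Partition pruning: open only the directory whose name matches,
    and re-attach the partition value parsed from the directory name."""
    rows = []
    for dirname, lines in partitions.items():
        key, value = dirname.split("=")
        if value != city:          # other directories are never read
            continue
        for line in lines:
            rows.append(line.split() + [value])  # city comes from the path
    return rows

rows = scan(partitions, "hangzhou")
print(len(rows))      # 2
print(rows[0][-1])    # hangzhou
```

Only the hangzhou directory is scanned; the shenzhen file contributes nothing to the query.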
    create table weatherpartion
    (date string, weath string,
    minTemperat int, maxTemperat int,
    pmvalue int) partitioned by (country string, city string)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ' '
    STORED AS TEXTFILE;
Hive's partition design actually matches its basic principle of never modifying data files. Automatically partitioning by a field would necessarily split one large file into many small ones, violating that principle.
To check the performance impact, here is an experiment: the same data, once partitioned by pcity and once unpartitioned, running the same HQL and comparing execution speed.
Partition layout:
hive> dfs -ls /user/hive/warehouse/weatherpartion;
Found 6 items
drwxr-xr-x   - hadoop supergroup          0 2014-05-24 13:33 /user/hive/warehouse/weatherpartion/pcity=beijin
drwxr-xr-x   - hadoop supergroup          0 2014-05-24 13:34 /user/hive/warehouse/weatherpartion/pcity=guangzhou
drwxr-xr-x   - hadoop supergroup          0 2014-05-24 13:34 /user/hive/warehouse/weatherpartion/pcity=hangzhou
drwxr-xr-x   - hadoop supergroup          0 2014-05-24 13:34 /user/hive/warehouse/weatherpartion/pcity=nanjing
drwxr-xr-x   - hadoop supergroup          0 2014-05-24 13:34 /user/hive/warehouse/weatherpartion/pcity=shanghai
drwxr-xr-x   - hadoop supergroup          0 2014-05-24 13:34 /user/hive/warehouse/weatherpartion/pcity=shenzhen
Under identical conditions, the same HQL statement is executed several times and the elapsed times averaged.
I. Unpartitioned weather table, three-table join with sorting:
select cy.number,wh.*,pm.pmlevel
from cityinfo cy join weather wh on (cy.name=wh.city)
join pminfo pm on (pm.pmvalue=wh.pmvalue)
where wh.city='hangzhou' and wh.weath='fine' and wh.minTemperat in
( -18,25,43) order by maxTemperat DESC limit 20;
執(zhí)行5次,耗時如下:
Job 0: Map: 5  Reduce: 2   Cumulative CPU: 41.52 sec   HDFS Read: 1061735802 HDFS Write: 6918992 SUCCESS
Job 1: Map: 2  Reduce: 1   Cumulative CPU: 5.14 sec   HDFS Read: 6923192 HDFS Write: 7232163 SUCCESS
Job 2: Map: 1  Reduce: 1   Cumulative CPU: 1.81 sec   HDFS Read: 7232532 HDFS Write: 1126 SUCCESS
Total MapReduce CPU Time Spent: 48 seconds 470 msec
Time taken: 72.781 seconds, Fetched: 20 row(s)
Job 0: Map: 5  Reduce: 2   Cumulative CPU: 45.71 sec   HDFS Read: 1061735802 HDFS Write: 6918992 SUCCESS
Job 1: Map: 2  Reduce: 1   Cumulative CPU: 5.25 sec   HDFS Read: 6923192 HDFS Write: 7232163 SUCCESS
Job 2: Map: 1  Reduce: 1   Cumulative CPU: 1.82 sec   HDFS Read: 7232532 HDFS Write: 1126 SUCCESS
Total MapReduce CPU Time Spent: 52 seconds 780 msec
Time taken: 66.584 seconds, Fetched: 20 row(s)
Job 0: Map: 5  Reduce: 2   Cumulative CPU: 43.55 sec   HDFS Read: 1061735802 HDFS Write: 6918992 SUCCESS
Job 1: Map: 2  Reduce: 1   Cumulative CPU: 5.29 sec   HDFS Read: 6923192 HDFS Write: 7232163 SUCCESS
Job 2: Map: 1  Reduce: 1   Cumulative CPU: 1.82 sec   HDFS Read: 7232532 HDFS Write: 1126 SUCCESS
Total MapReduce CPU Time Spent: 50 seconds 660 msec
Time taken: 62.12 seconds, Fetched: 20 row(s)
Job 0: Map: 5  Reduce: 2   Cumulative CPU: 41.09 sec   HDFS Read: 1061735802 HDFS Write: 6918992 SUCCESS
Job 1: Map: 2  Reduce: 1   Cumulative CPU: 5.12 sec   HDFS Read: 6923192 HDFS Write: 7232163 SUCCESS
Job 2: Map: 1  Reduce: 1   Cumulative CPU: 1.8 sec   HDFS Read: 7232532 HDFS Write: 1126 SUCCESS
Total MapReduce CPU Time Spent: 48 seconds 10 msec
Time taken: 58.33 seconds, Fetched: 20 row(s)
Job 0: Map: 5  Reduce: 2   Cumulative CPU: 42.68 sec   HDFS Read: 1061735802 HDFS Write: 6918992 SUCCESS
Job 1: Map: 2  Reduce: 1   Cumulative CPU: 5.0 sec   HDFS Read: 6923192 HDFS Write: 7232163 SUCCESS
Job 2: Map: 1  Reduce: 1   Cumulative CPU: 1.82 sec   HDFS Read: 7232532 HDFS Write: 1126 SUCCESS
Total MapReduce CPU Time Spent: 49 seconds 500 msec
Time taken: 62.355 seconds, Fetched: 20 row(s)
II. weather table partitioned by city, three-table join with sorting:
select cy.number,wh.*,pm.pmlevel
from cityinfo cy join weatherpartion wh on (cy.name=wh.city)
join pminfo pm on (pm.pmvalue=wh.pmvalue)
where wh.pcity='hangzhou' and wh.weath='fine' and wh.minTemperat in
( -18,25,43) order by maxTemperat DESC limit 20;
執(zhí)行5次,耗時如下:
Job 0: Map: 2  Reduce: 1   Cumulative CPU: 10.68 sec   HDFS Read: 172140323 HDFS Write: 7793860 SUCCESS
Job 1: Map: 2  Reduce: 1   Cumulative CPU: 5.35 sec   HDFS Read: 7797836 HDFS Write: 7997910 SUCCESS
Job 2: Map: 1  Reduce: 1   Cumulative CPU: 1.82 sec   HDFS Read: 7998279 HDFS Write: 1306 SUCCESS
Total MapReduce CPU Time Spent: 17 seconds 850 msec
Time taken: 48.127 seconds, Fetched: 20 row(s)
MapReduce Jobs Launched:
Job 0: Map: 2  Reduce: 1   Cumulative CPU: 10.4 sec   HDFS Read: 172140323 HDFS Write: 7793860 SUCCESS
Job 1: Map: 2  Reduce: 1   Cumulative CPU: 5.31 sec   HDFS Read: 7797836 HDFS Write: 7997910 SUCCESS
Job 2: Map: 1  Reduce: 1   Cumulative CPU: 1.84 sec   HDFS Read: 7998279 HDFS Write: 1306 SUCCESS
Total MapReduce CPU Time Spent: 17 seconds 550 msec
Time taken: 47.386 seconds, Fetched: 20 row(s)
MapReduce Jobs Launched:
Job 0: Map: 2  Reduce: 1   Cumulative CPU: 10.8 sec   HDFS Read: 172140323 HDFS Write: 7793860 SUCCESS
Job 1: Map: 2  Reduce: 1   Cumulative CPU: 5.38 sec   HDFS Read: 7797835 HDFS Write: 7997910 SUCCESS
Job 2: Map: 1  Reduce: 1   Cumulative CPU: 1.85 sec   HDFS Read: 7998278 HDFS Write: 1306 SUCCESS
Total MapReduce CPU Time Spent: 18 seconds 30 msec
Time taken: 47.853 seconds, Fetched: 20 row(s)
三、結(jié)論
CPU消耗的時間大幅減少,但總時間提升的并不太多,因為還涉及到一些調(diào)度、通信、洗牌、切分等的時間
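Averaging the timings pasted above makes this concrete (simple arithmetic over the five unpartitioned runs and the three partitioned runs shown):

```python
# Wall-clock and CPU times copied from the hive output above.
unpart_wall = [72.781, 66.584, 62.12, 58.33, 62.355]
unpart_cpu  = [48.47, 52.78, 50.66, 48.01, 49.50]
part_wall   = [48.127, 47.386, 47.853]
part_cpu    = [17.85, 17.55, 18.03]

avg = lambda xs: sum(xs) / len(xs)
print(round(avg(unpart_wall), 2), round(avg(part_wall), 2))  # 64.43 47.79
print(round(avg(unpart_cpu) / avg(part_cpu), 1))             # 2.8
```

So CPU time is cut roughly 2.8x, while wall-clock time only drops from about 64 s to about 48 s.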
四、動態(tài)分區(qū)
靜態(tài)分區(qū)要求導(dǎo)入數(shù)據(jù)的時候指定導(dǎo)入的分區(qū),如果有大量不同分區(qū)的數(shù)據(jù)需要導(dǎo)入,則需要手動執(zhí)行N次命令,相當(dāng)麻煩,所以hive提供動態(tài)分區(qū)的功能。
也就是說,hive可以根據(jù)設(shè)定的分區(qū),把數(shù)據(jù)分到對應(yīng)的分區(qū)中,它包括嚴(yán)格模式和寬松模式。
默認(rèn)情況下動態(tài)分區(qū)的功能是關(guān)閉的,需要用戶手動打開,當(dāng)打開動態(tài)分區(qū)后,默認(rèn)情況下是嚴(yán)格模式
打開動態(tài)分區(qū)的命令:set hive.exec.dynamic.partition=true;
The following example illustrates this.
Requirement: partition the weather table's data at two levels, by city and by weather condition (fine, rain). The example uses strict mode, so city is the static partition and the weath column is the dynamic partition, giving two partition levels.
第一步,創(chuàng)建目標(biāo)表,指定分區(qū)字段:
? create table weather_sub?
? ? (date string, pmvalue int) partitioned by (city string, weath string) ? ? ? ? //此處需要指定2個分區(qū)字段
? ? ROW FORMAT DELIMITED?
? ? FIELDS TERMINATED BY ' '
? ? STORED AS TEXTFILE;
再執(zhí)行動態(tài)分區(qū)插入數(shù)據(jù)的語句:
insert overwrite table weather_sub?
? ? partition (city='hangzhou',weath)
? ? select w.date,w.pmvalue,w.weath ? //此處需要注意,這里只填了3個字段,但目標(biāo)表事實上有4個字段的,其中缺少的字段正是weather_sub.city
? ? from weather w
? ? where w.city='hangzhou';
//w.date,w.pmvalue分別對應(yīng)目標(biāo)表的date和pmvalue,city字段使用分區(qū)指定的hangzhou,w.weath對應(yīng)weath
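The routing done by this dynamic-partition insert can be mimicked with a toy splitter (a sketch under simplified assumptions, not Hive's implementation): the trailing selected column supplies the dynamic partition value, and each row is directed to one subdirectory per distinct value.

```python
from collections import defaultdict

# (date, pmvalue, weath) tuples, as the SELECT above would produce them.
rows = [
    ("2014-05-23|07:33:58", 200, "fine"),
    ("2014-05-23|07:33:58", 180, "cloudy"),
    ("2014-05-23|07:33:58", 200, "fine"),
]

def dynamic_insert(rows, static_city="hangzhou"):
    """Route each row to city=<static>/weath=<dynamic>: the static key
    comes from the statement, the dynamic key from the trailing column."""
    dirs = defaultdict(list)
    for date, pmvalue, weath in rows:
        path = f"city={static_city}/weath={weath}"
        dirs[path].append((date, pmvalue))   # partition cols not stored in the file
    return dirs

dirs = dynamic_insert(rows)
print(sorted(dirs))                            # ['city=hangzhou/weath=cloudy', 'city=hangzhou/weath=fine']
print(len(dirs["city=hangzhou/weath=fine"]))   # 2
```

One subdirectory per distinct weath value appears under the static city partition, which is exactly the directory layout shown in the dfs listing below.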
檢驗結(jié)果:
hive> select * from weather_sub where city='hangzhou' and weath='fine' limit 5;
OK
2014-05-23|07:33:58 ? ? 200 ? ? hangzhou ? ? ? ?fine
2014-05-23|07:33:58 ? ? 200 ? ? hangzhou ? ? ? ?fine
2014-05-23|07:33:58 ? ? 200 ? ? hangzhou ? ? ? ?fine
2014-05-23|07:33:58 ? ? 200 ? ? hangzhou ? ? ? ?fine
2014-05-23|07:33:58 ? ? 200 ? ? hangzhou ? ? ? ?fine
Time taken: 0.101 seconds, Fetched: 5 row(s)
按分區(qū)查找,速度很快,沒有用MP程序
查看dfs情況:
hive> dfs -ls /user/hive/warehouse/weather_sub/city=hangzhou;
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2014-06-06 22:42 /user/hive/warehouse/weather_sub/city=hangzhou/weath=cloudy
drwxr-xr-x   - hadoop supergroup          0 2014-06-06 22:42 /user/hive/warehouse/weather_sub/city=hangzhou/weath=fine
Finally, the command to drop the table:
drop table weather_sub;