Hive Quick Start
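I am a big data major, and this document began as my notes from the Hive course at my school. I later extended the notes when I reinstalled my system and Hive. I came across them by chance today; judging by the formatting, I had meant to publish them back then but forgot, so I am publishing them now. The content comes mainly from the material my teacher wrote for the course and from online resources. (Note: published in early September 2018.)
This article is meant for quickly learning or reviewing the most commonly used parts of Hive. Reading it takes about twenty minutes, after which you should be able to get started with Hive.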
External and internal tables
Internal table (managed table)
External table
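The two headings above name the two kinds of Hive tables. As a minimal sketch of the difference: Hive owns the data of a managed table and deletes it on `drop table`, while dropping an external table removes only the metadata and leaves the files in place. The table names `students_managed` and `students_external` and the `location` path are hypothetical and do not come from the original notes.

```sql
-- Managed (internal) table: the data lives under Hive's warehouse directory
-- and is deleted together with the table.
create table if not exists students_managed(name string, age int, sex string, brithday date)
row format delimited fields terminated by '\t';

-- External table: Hive only records metadata; the files under the
-- (hypothetical) location below survive a "drop table".
create external table if not exists students_external(name string, age int, sex string, brithday date)
row format delimited fields terminated by '\t'
location '/user/fonttian/hive/students_external';

-- Inspect the table type: the output lists MANAGED_TABLE or EXTERNAL_TABLE
desc formatted students_external;
```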
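A data import problem when the storage format is SequenceFile
When a table's storage format is SequenceFile, loading txt data into it makes Hive report a file format error. The workaround is to load the txt data into a Hive table first, and then insert from that table into the SequenceFile table.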
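The workaround can be sketched as follows. The table names `students_txt` and `students_seq` are hypothetical and chosen only for illustration; the file path is the one used elsewhere in these notes.

```sql
-- 1. Staging table stored as plain text, so the tab-separated file loads as-is
create table if not exists students_txt(name string, age int, sex string, brithday date)
row format delimited fields terminated by '\t'
stored as textfile;

load data local inpath '/home/fonttian/database/hive/students2' overwrite into table students_txt;

-- 2. Target table stored as SequenceFile
create table if not exists students_seq(name string, age int, sex string, brithday date)
stored as sequencefile;

-- 3. Let Hive rewrite the rows in the target format
insert into table students_seq select * from students_txt;
```

The staging step is needed because `load data` only copies or moves files without converting them, whereas `insert ... select` runs a job that writes the rows in the target table's format.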
```sql
-- Load the text file into students3
load data local inpath '/home/fonttian/database/hive/students2' overwrite into table students3;
-- Create an external table
create external table if not exists students3_orc(name string,age int,sex string,brithday date) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
-- Insert data from other tables
insert into table students3 select * from students2;
insert into table students3_orc select * from students3;
```
Partitions
```sql
-- Create an external table partitioned by a date column
create external table if not exists students4(name string,age int,sex string,brithday date) partitioned by (day date) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
-- Load data into the partitioned external table, into partition day="2018-3-26"
load data local inpath '/home/fonttian/database/hive/students2' into table students4 partition (day="2018-3-26");

-- If queries on the table above do not work, use the code below instead
create external table if not exists students5(name string,age int,sex string,brithday date) partitioned by (pt_int int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
load data local inpath '/home/fonttian/database/hive/students2' into table students5 partition (pt_int=1);
load data local inpath '/home/fonttian/database/hive/students2' into table students5 partition (pt_int=2);
select * from students5;
select * from students5 where pt_int = 1;
select * from students5 where pt_int > 0;

-- Create an external table stored as Parquet
create external table if not exists students3_parquet(name string,age int,sex string,brithday date) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as parquet;
insert into table students3_parquet select * from students3;

-- Query (assumes students2 has a Dept column; the string literal must be quoted)
SELECT * FROM students2 WHERE age > 30 AND Dept = 'TP';

-- Check whether a table is partitioned
show partitions students5;
-- Or use the commands that show the table structure
describe extended students5;
desc formatted students5;

-- Delete a partition
alter table students5 drop partition(pt_int=2);
```
Data export
```sql
-- Export data with insert
insert overwrite local directory "/home/fonttian/database/hive/learnhive" select * from students5;
```
This export method makes the exported data awkward to read directly because of the delimiter: by default the fields are separated by ^A (\x01).
Formatting the export with a delimiter of our own, or using a streaming export, avoids this problem.
```sql
insert overwrite local directory "/home/fonttian/database/hive/learnhive" row format delimited fields terminated by '\t' collection items terminated by '\n' select * from students5;
```
```shell
# Streaming export, run from the shell
bin/hive -e "use class;select * from students5;" > /home/fonttian/database/hive/learnhive/students5.txt
```
To export to HDFS instead, simply drop the `local` keyword.
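As a small sketch of the HDFS variant: the statement is the same as the formatted local export, minus `local`. The target path `/user/fonttian/export/students5` is a hypothetical HDFS directory, not one taken from the original notes.

```sql
-- Without "local", the directory is resolved on HDFS.
-- '/user/fonttian/export/students5' is a hypothetical path.
insert overwrite directory '/user/fonttian/export/students5'
row format delimited fields terminated by '\t'
select * from students5;
```

Note that `insert overwrite directory` replaces whatever the target directory already contains, so it is safer to point it at a directory used only for this export.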
DML
Queries
Grouping (group by/having)
The average salary of each department
The highest salary for each job within each department
The departments whose average salary exceeds 2000
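The three items above read as query exercises, but the queries themselves are not included in these notes. One possible set of queries is sketched below, assuming a hypothetical employee table `emp(ename string, job string, sal double, deptno int)` that does not appear in the original.

```sql
-- Average salary of each department
select deptno, avg(sal) as avg_sal
from emp
group by deptno;

-- Highest salary for each job within each department
select deptno, job, max(sal) as max_sal
from emp
group by deptno, job;

-- Departments whose average salary exceeds 2000
select deptno, avg(sal) as avg_sal
from emp
group by deptno
having avg(sal) > 2000;
```

The last query uses `having` rather than `where` because the condition applies to the aggregated value of each group: `where` filters rows before grouping, `having` filters groups after aggregation.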
Table joins (join)
Sorting
order by
Global sort
Sorts the entire data set; only a single reducer is used.
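A minimal sketch of a global sort on the `students5` table used in these notes:

```sql
-- Global sort: every row passes through the single reducer
select * from students5 order by age desc;

-- Adding a limit is a common way to keep the result small on large tables
select * from students5 order by age desc limit 10;
```

Because a single reducer handles all rows, `order by` can be slow on large tables. When Hive runs in strict mode, it rejects an `order by` that has no `limit` for this reason.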
sort by
Sorts the data inside each reducer; the overall result set is not globally sorted.
```sql
-- Tune the number of reducers first if necessary
-- set hive.exec.reducers.max=<number>
-- set mapreduce.job.reduces=<number>

-- Query the students5 table, sorted by age
select * from students5 sort by age asc;
```
distribute by
Similar to partitioning in MapReduce: it decides which reducer each row goes to, and it is used together with sort by. Note that the export still needs to be formatted here so that the exported data can be read directly.
```sql
insert overwrite local directory '/home/fonttian/database/hive/learnhive/students5_distribute_by' row format delimited fields terminated by '\t' collection items terminated by '\n' select * from students5 distribute by pt_int sort by age asc;
```
Notes:
distribute by must come before sort by
cluster by
When the distribute by column and the sort by column are the same, cluster by can be used in their place.
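A short sketch of the equivalence, using the `students5` table from these notes:

```sql
-- These two queries distribute and sort the rows in the same way
select * from students5 distribute by age sort by age;
select * from students5 cluster by age;
```

One limitation to keep in mind: `cluster by` always sorts in ascending order and does not accept `asc` or `desc`, so a descending sort still needs the explicit `distribute by ... sort by ... desc` form.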
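join
- Hive only supports equi-joins, outer joins, and left semi joins. (Join conditions that are not equalities are accepted only from Hive 2.2.0 onward.)
First, load some data to work with.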
```sql
-- Create an external table
create external table if not exists score(name string,math int,chinese int,english int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as textfile;
-- Load data
load data local inpath '/home/fonttian/database/hive/score' overwrite into table score;
-- Create an external table
create external table if not exists job(name string,likes string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as textfile;
-- Load data
load data local inpath '/home/fonttian/database/hive/job' overwrite into table job;
```
- More than two tables can be joined.
```sql
select students2.name,students2.age,score.math,job.likes from students2 join score on (students2.name = score.name) join job on (job.name = score.name);
```
- If all the joined tables use the same join key, the join is converted into a single map/reduce job.
- Put the largest table last in the join. In each map/reduce job, the reducers buffer the rows of every table in the join sequence except the last one; the last table is streamed through them while the results are written to the file system.
- To restrict the output of a join, write the filter condition either in the where clause or in the join clause. The latter is recommended because it avoids some errors: with an outer join, a where filter runs after the join and can discard rows that the outer join was meant to keep.
```sql
select students5.name,score.math from score left outer join students5 on (score.name = students5.name and students5.pt_int = 1);
select students5.name,score.math from students5 left outer join score on (score.name = students5.name and students5.pt_int = 1);
```
- LEFT SEMI JOIN is a more efficient implementation of IN/EXISTS subqueries. Its restriction is that the right-hand table may only be referenced in the ON clause; it cannot be used for filtering in the where clause, the select clause, or anywhere else.
```sql
select job.name,job.likes from job where job.name in (select score.name from score);
select job.name,job.likes from job left semi join score on (score.name = job.name);
```
Regular expressions
The regexp keyword
Syntax: A REGEXP B
Operand types: strings
Description: same as RLIKE
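A short sketch of what "same as RLIKE" means in practice. The literal-only `select` statements assume a Hive version that allows `select` without a `from` clause.

```sql
-- Both operators apply the same Java regular expression match
select name from students5 where name rlike '^T';
select name from students5 where name regexp '^T';

-- The pattern may match any substring, so anchor it with ^ and $ when needed
select 'foobar' regexp 'foo';      -- true: 'foo' occurs inside 'foobar'
select 'foobar' regexp '^foo$';    -- false: the whole string is not 'foo'
```

Because an unanchored pattern matches anywhere in the string, a condition such as "starts with T" needs the `^` anchor.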
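```sql
-- Count the rows whose name does not contain eight consecutive digits
select count(*) from students5 where name not regexp '\\d{8}';
-- Count the rows whose name does not start with T
-- (the ^ anchor is required: an unanchored 'T.*' matches a T anywhere in the name)
select count(*) from students5 where name not regexp '^T.*';
```
The regexp_extract keyword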
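Syntax: regexp_extract(string subject, string pattern, int index)
Return type: string
Description: matches the string subject against the regular expression pattern and returns the capture group selected by index (0 returns the whole match).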
```sql
-- Match 'IloveYou' against '(I)(.*?)(You)' and return group 1; the result is I
select regexp_extract('IloveYou','(I)(.*?)(You)',1) from students5 limit 1;
-- Match 'IloveYou' against 'I(.*?)(You)' and return group 2; the result is You
select regexp_extract('IloveYou','I(.*?)(You)',2) from students5 limit 1;
-- Return the whole match; the result is 'IloveYou'
select regexp_extract('IloveYou','(I)(.*?)(You)',0) from students5 limit 1;
```
The regexp_replace keyword
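Syntax: regexp_replace(string A, string B, string C)
Return type: string
Description: replaces every part of string A that matches the Java regular expression B with C. Note that escape characters are needed in some cases. The function is similar to regexp_replace in Oracle.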
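The remark about escape characters can be illustrated with a small sketch. The inputs are made up for illustration, and the literal-only `select` statements assume a Hive version that allows `select` without a `from` clause.

```sql
-- Alternation works as in Java regular expressions; the result is 'fb'
select regexp_replace('foobar', 'oo|ar', '');

-- '\\s' reaches the regex engine as \s (whitespace); the result is 'IloveYou'
select regexp_replace('I love You', '\\s', '');

-- With a single backslash the string literal consumes the escape,
-- so the pattern matches the letter s instead of whitespace
select regexp_replace('I love You', '\s', '');
```

The backslash has to be doubled because the Hive string literal is unescaped first, and only the result is handed to the regular expression engine.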
```sql
-- Result: 'Ilove'
select regexp_replace("IloveYou","You","") from students5 limit 1;
-- Result: 'Ilovelili'
select regexp_replace("IloveYou","You","lili") from students5 limit 1;
```
beeline and hiveserver2
```shell
# Start hiveserver2 in the background
$ nohup bin/hive --service hiveserver2 &
# Check whether it is running
$ ps -aux | grep hiveserver2
# Stop it (replace 20670 with the PID shown by ps)
$ kill -9 20670

$ bin/beeline
# Connect to Hive with the default account
beeline> !connect jdbc:hive2://localhost:10000 scott tiger
# Connect to Hive with the account and password from the configuration
beeline> !connect jdbc:hive2://localhost:10000 fonttian 123456
# Quit
beeline> !quit
```
Summary