Hive Practice Exercises (with Data)
The data used in these exercises has been uploaded: https://pan.baidu.com/s/1L5znszdXLUytH9qvTdO4JA (extraction code: lzyd)
Exercise 1: Word Count
Complete a word count job using Hive.
hive> create table `article` (`sentence` string);
OK
Time taken: 1.019 seconds
hive> load data local inpath '/mnt/hgfs/vm_shared/The_Man_of_Property.txt' overwrite into table article;
Loading data to table default.article
Table default.article stats: [numFiles=1, numRows=0, totalSize=632207, rawDataSize=0]
OK
Time taken: 1.386 seconds
Tokenization. Split each line of the article into words on spaces, using the split function:
hive> select split(sentence, " ") from article;
[ "Preface" ]
[ "“The" , "Forsyte" , "Saga”" , "was" , "the" , "title" , "originally" , "destined" , "for" , "that" , "part" , "of" , "it" , "which" , "is" , "called" , "“The" , "Man" , "of" , "Property”;" , "and" , "to" , "adopt" , "it" , "for" , "the" , "collected" , "chronicles" , "of" , "the" , "Forsyte" , "family" , "has" , "indulged" , "the" , "Forsytean" , "tenacity" , "that" , "is" , "in" , "all" , "of" , "us." , "The" , "word" , "Saga" , "might" , "be" , "objected" , "to" , "on" , "the" , "ground" , "that" , "it" , "connotes" , "the" , "heroic" , "and" , "that" , "there" , "is" , "little" , "heroism" , "in" , "these" , "pages." , "But" , "it" , "is" , "used" , "with" , "a" , "suitable" , "irony;" , "and," , "after" , "all," , "this" , "long" , "tale," , "though" , "it" , "may" , "deal" , "with" , "folk" , "in" , "frock" , "coats," , "furbelows," , "and" , "a" , "gilt-edged" , "period," , "is" , "not" , "devoid" , "of" , "the" , "essential" , "heat" , "of" , "conflict." , "Discounting" , "for" , "the" , "gigantic" , "stature" , "and" , "blood-thirstiness" , "of" , "old" , "days," , "as" , "they" , "have" , "come" , "down" , "to" , "us" , "in" , "fairy-tale" , "and" , "legend," , "the" , "folk" , "of" , "the" , "old" , "Sagas" , "were" , "Forsytes," , "assuredly," , "in" , "their" , "possessive" , "instincts," , "and" , "as" , "little" , "proof" , "against" , "the" , "inroads" , "of" , "beauty" , "and" , "passion" , "as" , "Swithin," , "Soames," , "or" , "even" , "Young" , "Jolyon." , "And" , "if" , "heroic" , "figures," , "in" , "days" , "that" , "never" , "were," , "seem" , "to" , "startle" , "out" , "from" , "their" , "surroundings" , "in" , "fashion" , "unbecoming" , "to" , "a" , "Forsyte" , "of" , "the" , "Victorian" , "era," , "we" , "may" , "be" , "sure" , "that" , "tribal" , "instinct" , "was" , "even" , "then" , "the" , "prime" , "force," , "and" , "that" , "“family”" , "and" , "the" , "sense" , "of" , "home" , "and" , "property" , "counted" , "as" , "they" , "do" , "to" , "this" , "day," , "for" , "all" , "the" , "recent" , "efforts" , "to" , "“talk" , "them" , "out.”" ]
[ "So" , "many" , "people" , "have" , "written" , "and" , "claimed" , "that" , "their" , "families" , "were" , "the" , "originals" , "of" , "the" , "Forsytes" , "that" , "one" , "has" , "been" , "almost" , "encouraged" , "to" , "believe" , "in" , "the" , "typicality" , "of" , "an" , "imagined" , "species." , "Manners" , "change" , "and" , "modes" , "evolve," , "and" , "“Timothy’s" , "on" , "the" , "Bayswater" , "Road”" , "becomes" , "a" , "nest" , "of" , "the" , "unbelievable" , "in" , "all" , "except" , "essentials;" , "we" , "shall" , "not" , "look" , "upon" , "its" , "like" , "again," , "nor" , "perhaps" , "on" , "such" , "a" , "one" , "as" , "James" , "or" , "Old" , "Jolyon." , "And" , "yet" , "the" , "figures" , "of" , "Insurance" , "Societies" , "and" , "the" , "utterances" , "of" , "Judges" , "reassure" , "us" , "daily" , "that" , "our" , "earthly" , "paradise" , "is" , "still" , "a" , "rich" , "preserve," , "where" , "the" , "wild" , "raiders," , "Beauty" , "and" , "Passion," , "come" , "stealing" , "in," , "filching" , "security" , "from" , "beneath" , "our" , "noses." , "As" , "surely" , "as" , "a" , "dog" , "will" , "bark" , "at" , "a" , "brass" , "band," , "so" , "will" , "the" , "essential" , "Soames" , "in" , "human" , "nature" , "ever" , "rise" , "up" , "uneasily" , "against" , "the" , "dissolution" , "which" , "hovers" , "round" , "the" , "folds" , "of" , "ownership." ]
[ "“Let" , "the" , "dead" , "Past" , "bury" , "its" , "dead”" , "would" , "be" , "a" , "better" , "saying" , "if" , "the" , "Past" , "ever" , "died." , "The" , "persistence" , "of" , "the" , "Past" , "is" , "one" , "of" , "those" , "tragi-comic" , "blessings" , "which" , "each" , "new" , "age" , "denies," , "coming" , "cocksure" , "on" , "to" , "the" , "stage" , "to" , "mouth" , "its" , "claim" , "to" , "a" , "perfect" , "novelty." ]
[ "But" , "no" , "Age" , "is" , "so" , "new" , "as" , "that!" , "Human" , "Nature," , "under" , "its" , "changing" , "pretensions" , "and" , "clothes," , "is" , "and" , "ever" , "will" , "be" , "very" , "much" , "of" , "a" , "Forsyte," , "and" , "might," , "after" , "all," , "be" , "a" , "much" , "worse" , "animal." ]
...
[ "The" , "End" ]
Time taken: 0.086 seconds, Fetched: 2866 row(s)
As you can see, the result is a long series of string arrays, 2866 rows in total.
wc The_Man_of_Property.txt
2866  111783  632207  The_Man_of_Property.txt
(line count, word count, byte count, filename)
The file likewise has 2866 lines, so split packs each line's sentence into one array. After tokenizing, we turn each word into its own row so identical words can be counted; explode does this.
hive> select explode(split(sentence, " ")) from article;
...
we
are
not
at
home.”
And
in
young
Jolyon’s
face
he
slammed
the
door.
The
End
Time taken: 0.085 seconds, Fetched: 111818 row(s)
Each word is now on its own row, 111818 rows in total. The split-out words still carry punctuation we don't want, so use a regular expression to extract just the word.
select regexp_extract(word, '[a-zA-Z]+', 0)
from (select explode(split(sentence, " ")) word from article) t;
...
at
home
And
in
young
Jolyon
face
he
slammed
the
door
The
End
Time taken: 0.066 seconds, Fetched: 111818 row(s)
select word, count(*)
from (
    select regexp_extract(str, '[a-zA-Z]+[\’]*[a-zA-Z]+', 0) word
    from (select explode(split(sentence, " ")) str from article) t1
) t2
group by word;
......
yield       4
yielded     3
yielding    2
yields      1
you         522
young       198
younger     10
youngest    3
youngling   1
your        130
yours       2
yourself    22
yourselves  1
youth       10
you’d       14
you’ll      21
you’re      23
you’ve      25
Time taken: 27.26 seconds, Fetched: 9872 row(s)
The results look decent. In fact, dropping the regex filter makes things quite a bit simpler:
select word, count(*) as cnt
from (select explode(split(sentence, ' ')) word from article) t
group by word;
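As a further sketch that is not in the original writeup, sorting the aggregate by its count surfaces the most frequent words; Hive allows ordering by a select alias:

-- Hypothetical extension: the ten most frequent words, most common first.
select word, count(*) as cnt
from (select explode(split(sentence, ' ')) word from article) t
group by word
order by cnt desc
limit 10;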
Data Preparation for the Exercises
Data survey
trains.csv (order to product)
----------------------
order_id: order ID
product_id: product ID
add_to_cart_order: the position at which the product was added to the cart
reordered: whether this item is a repeat purchase (1 = yes, 0 = no)
orders.csv (positioned in the data warehouse as the user-behavior table)
----------------------
order_id: order ID
user_id: user ID
eval_set: which set the order belongs to (prior history or training)
order_number: the sequence number of the user's orders
order_dow: order day of week, the weekday on which the order was placed (0-6)
order_hour_of_day: the hour of the day in which the order was placed (0-23)
days_since_prior_order: the number of days between this order and the user's previous one
Create the tables and load the data.
create table trains (
    order_id string,
    product_id string,
    add_to_cart_order string,
    reordered string
)
row format delimited
fields terminated by ','
lines terminated by '\n';

hive> load data local inpath '/mnt/hgfs/vm_shared/trains.csv' overwrite into table trains;
Loading data to table default.trains
Table default.trains stats: [numFiles=1, numRows=0, totalSize=24680147, rawDataSize=0]
OK
Time taken: 1.801 seconds
hive> select * from trains limit 10;
OK
trains.order_id  trains.product_id  trains.add_to_cart_order  trains.reordered
order_id  product_id  add_to_cart_order  reordered
1   49302   1   1
1   11109   2   1
1   10246   3   0
1   49683   4   0
1   43633   5   1
1   13176   6   0
1   47209   7   0
1   22035   8   1
36  39612   1   0
Time taken: 0.1 seconds, Fetched: 10 row(s)
Notice that the CSV header row was loaded as a data row. Method 1: since the data is already loaded, overwrite it with HQL, filtering out the header row.
insert overwrite table trains
select * from trains
where order_id != 'order_id';
Method 2: delete the first line of the dataset directly, before loading it.
[root@node1 vm_shared]# head trains.csv
order_id,product_id,add_to_cart_order,reordered
1,49302,1,1
1,11109,2,1
1,10246,3,0
1,49683,4,0
1,43633,5,1
1,13176,6,0
1,47209,7,0
1,22035,8,1
36,39612,1,0
[root@node1 vm_shared]# sed '1d' trains.csv > trains_tmp.csv
[root@node1 vm_shared]# head trains_tmp.csv
1,49302,1,1
1,11109,2,1
1,10246,3,0
1,49683,4,0
1,43633,5,1
1,13176,6,0
1,47209,7,0
1,22035,8,1
36,39612,1,0
36,19660,2,1
Method 3: add the table property 'skip.header.line.count'='1' when creating the table, so the first line is skipped automatically on load. For example:
create table xxx (
    ...
)
row format delimited
fields terminated by '\t'
tblproperties ('skip.header.line.count'='1');
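The orders table is queried in the exercises below, but its DDL is never shown in this writeup. Here is a plausible sketch, assuming the same CSV conventions as trains.csv and the header-skip property from method 3; the column list comes from the data survey above:

-- Assumed DDL for orders.csv; columns follow the data survey, all typed
-- as string for simplicity, matching the trains table.
create table orders (
    order_id string,
    user_id string,
    eval_set string,
    order_number string,
    order_dow string,
    order_hour_of_day string,
    days_since_prior_order string
)
row format delimited
fields terminated by ','
lines terminated by '\n'
tblproperties ('skip.header.line.count'='1');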
Exercise 2: How many orders does each user have?
select user_id, count(*) from orders group by user_id;
...
Time taken: 32.335 seconds, Fetched: 206209 row(s)
Exercise 3: On average, how many products are in each of a user's orders?
Note: aggregate functions (count, sum, avg, max, min) must be used together with group by.
create table priors (
    order_id string,
    product_id string,
    add_to_cart_order string,
    reordered string
)
row format delimited
fields terminated by ','
lines terminated by '\n'
tblproperties ('skip.header.line.count'='1');

hive> load data local inpath '/mnt/hgfs/vm_shared/priors.csv' overwrite into table priors;
Loading data to table default.priors
Table default.priors stats: [numFiles=1, numRows=0, totalSize=577550706, rawDataSize=0]
OK
Time taken: 13.463 seconds
Since the requirement is each user's average number of products per order, we divide each user's total product count by that user's order count. First count the products in each order, then join that result back to orders:
select order_id, count(product_id) cnt
from priors
group by order_id;

select o.user_id, sum(p.cnt) / count(o.order_id)
from orders o
join (
    select order_id, count(product_id) cnt
    from priors
    group by order_id
) p
on o.order_id = p.order_id
group by user_id
limit 10;

1   5.9
2   13.928571428571429
3   7.333333333333333
4   3.6
5   9.25
6   4.666666666666667
7   10.3
8   16.333333333333332
9   25.333333333333332
10  28.6
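As a hypothetical equivalent that is not in the original: because the inner join guarantees a non-null cnt on every row, avg() over the per-order counts yields the same ratio:

-- Sketch: avg() computes sum(cnt) / count(cnt) per user, which matches
-- sum(p.cnt) / count(o.order_id) here since every joined row carries a cnt.
select o.user_id, avg(p.cnt)
from orders o
join (
    select order_id, count(product_id) cnt
    from priors
    group by order_id
) p
on o.order_id = p.order_id
group by o.user_id
limit 10;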
Exercise 4: What is each user's distribution of orders across the days of the week (pivoting rows to columns)?
select
    user_id,
    sum(case when order_dow = '0' then 1 else 0 end) dow0,
    sum(case when order_dow = '1' then 1 else 0 end) dow1,
    sum(case when order_dow = '2' then 1 else 0 end) dow2,
    sum(case when order_dow = '3' then 1 else 0 end) dow3,
    sum(case when order_dow = '4' then 1 else 0 end) dow4,
    sum(case when order_dow = '5' then 1 else 0 end) dow5,
    sum(case when order_dow = '6' then 1 else 0 end) dow6
from orders
group by user_id;
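As a side note, here is a minimal sketch of the same pivot using Hive's if() function, which shortens each branch; this variant is an addition of ours, not from the original:

-- Sketch: sum(if(...)) is equivalent to sum(case when ... end) in Hive.
select
    user_id,
    sum(if(order_dow = '0', 1, 0)) dow0,
    sum(if(order_dow = '1', 1, 0)) dow1,
    sum(if(order_dow = '2', 1, 0)) dow2,
    sum(if(order_dow = '3', 1, 0)) dow3,
    sum(if(order_dow = '4', 1, 0)) dow4,
    sum(if(order_dow = '5', 1, 0)) dow5,
    sum(if(order_dow = '6', 1, 0)) dow6
from orders
group by user_id;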
Exercise 5: Which users bought 100 or more distinct products?
trains and priors share the same table structure; search across the full data of both sets.
with user_pro_cnt_tmp as (
    select * from (
        select a.user_id, b.product_id
        from orders as a
        left join trains b on a.order_id = b.order_id
        union all
        select a.user_id, b.product_id
        from orders as a
        left join priors b on a.order_id = b.order_id
    ) t
)
select
    user_id,
    count(distinct product_id) pro_cnt
from user_pro_cnt_tmp
group by user_id
having pro_cnt >= 100
limit 10;
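One more hedged sketch, not part of the original: on large data, count(distinct) funnels each user's values through a single aggregation, so a common rewrite deduplicates with group by first and then counts plain rows. Note the explicit null filter, since the left join can emit null product_ids that count(distinct) would have ignored:

with user_pro_cnt_tmp as (
    select * from (
        select a.user_id, b.product_id
        from orders as a
        left join trains b on a.order_id = b.order_id
        union all
        select a.user_id, b.product_id
        from orders as a
        left join priors b on a.order_id = b.order_id
    ) t
)
select user_id, count(*) pro_cnt
from (
    -- deduplicate (user_id, product_id) pairs before counting
    select user_id, product_id
    from user_pro_cnt_tmp
    where product_id is not null
    group by user_id, product_id
) d
group by user_id
having pro_cnt >= 100
limit 10;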