Hive Practice Exercises (with Data)


The dataset used in these exercises has been uploaded.
Link: https://pan.baidu.com/s/1L5znszdXLUytH9qvTdO4JA
Extraction code: lzyd

Exercise 1: Word Count

Complete a word count job using Hive.

  • Create a table and load an article into it.
hive> create table `article` (`sentence` string);
OK
Time taken: 1.019 seconds

hive> load data local inpath '/mnt/hgfs/vm_shared/The_Man_of_Property.txt' overwrite into table article;
Loading data to table default.article
Table default.article stats: [numFiles=1, numRows=0, totalSize=632207, rawDataSize=0]
OK
Time taken: 1.386 seconds
  • Tokenization
    Split each line of the article into words on spaces,
    using the split function with a space as the delimiter.
hive> select split(sentence," ") from article;
["Preface"]
["“The","Forsyte","Saga”","was","the","title","originally","destined","for","that","part","of","it","which","is","called","“The","Man","of","Property”;","and","to","adopt","it","for","the","collected","chronicles","of","the","Forsyte","family","has","indulged","the","Forsytean","tenacity","that","is","in","all","of","us.","The","word","Saga","might","be","objected","to","on","the","ground","that","it","connotes","the","heroic","and","that","there","is","little","heroism","in","these","pages.","But","it","is","used","with","a","suitable","irony;","and,","after","all,","this","long","tale,","though","it","may","deal","with","folk","in","frock","coats,","furbelows,","and","a","gilt-edged","period,","is","not","devoid","of","the","essential","heat","of","conflict.","Discounting","for","the","gigantic","stature","and","blood-thirstiness","of","old","days,","as","they","have","come","down","to","us","in","fairy-tale","and","legend,","the","folk","of","the","old","Sagas","were","Forsytes,","assuredly,","in","their","possessive","instincts,","and","as","little","proof","against","the","inroads","of","beauty","and","passion","as","Swithin,","Soames,","or","even","Young","Jolyon.","And","if","heroic","figures,","in","days","that","never","were,","seem","to","startle","out","from","their","surroundings","in","fashion","unbecoming","to","a","Forsyte","of","the","Victorian","era,","we","may","be","sure","that","tribal","instinct","was","even","then","the","prime","force,","and","that","“family”","and","the","sense","of","home","and","property","counted","as","they","do","to","this","day,","for","all","the","recent","efforts","to","“talk","them","out.”"]
["So","many","people","have","written","and","claimed","that","their","families","were","the","originals","of","the","Forsytes","that","one","has","been","almost","encouraged","to","believe","in","the","typicality","of","an","imagined","species.","Manners","change","and","modes","evolve,","and","“Timothy’s","on","the","Bayswater","Road”","becomes","a","nest","of","the","unbelievable","in","all","except","essentials;","we","shall","not","look","upon","its","like","again,","nor","perhaps","on","such","a","one","as","James","or","Old","Jolyon.","And","yet","the","figures","of","Insurance","Societies","and","the","utterances","of","Judges","reassure","us","daily","that","our","earthly","paradise","is","still","a","rich","preserve,","where","the","wild","raiders,","Beauty","and","Passion,","come","stealing","in,","filching","security","from","beneath","our","noses.","As","surely","as","a","dog","will","bark","at","a","brass","band,","so","will","the","essential","Soames","in","human","nature","ever","rise","up","uneasily","against","the","dissolution","which","hovers","round","the","folds","of","ownership."]
["“Let","the","dead","Past","bury","its","dead”","would","be","a","better","saying","if","the","Past","ever","died.","The","persistence","of","the","Past","is","one","of","those","tragi-comic","blessings","which","each","new","age","denies,","coming","cocksure","on","to","the","stage","to","mouth","its","claim","to","a","perfect","novelty."]
["But","no","Age","is","so","new","as","that!","Human","Nature,","under","its","changing","pretensions","and","clothes,","is","and","ever","will","be","very","much","of","a","Forsyte,","and","might,","after","all,","be","a","much","worse","animal."]
...
["The","End"]
Time taken: 0.086 seconds, Fetched: 2866 row(s)

As you can see, the result is a long list of string arrays — 2866 rows in total.
  • Use the wc command to check the number of lines in the file.
wc The_Man_of_Property.txt
2866 111783 632207 The_Man_of_Property.txt
(lines   words   bytes   filename)
  • The file also has 2866 lines, so split puts the words of each line into one array.
  • After tokenization, turn each word into its own row so identical words can be counted. Use explode for this.
hive> select explode(split(sentence," ")) from article;
...
we
are
not
at
home.
And
in
young
Jolyon’s
face
he
slammed
the
door.
The
End
Time taken: 0.085 seconds, Fetched: 111818 row(s)
  • In the result each word is now on its own row — 111818 rows in total.
  • The split-out words still carry punctuation we don't want, so use a regular expression to extract just the word.
select regexp_extract(word,'[a-zA-Z]+',0) from (select explode(split(sentence," ")) word from article) t;
...
at
home
And
in
young
Jolyon
face
he
slammed
the
door
The
End
Time taken: 0.066 seconds, Fetched: 111818 row(s)
  • With the words cleaned up, it is time to count them.
select word, count(*) from (select regexp_extract(str,'[a-zA-Z]+[\’]*[a-zA-Z]+',0) word from (select explode(split(sentence," ")) str from article) t1 ) t2 group by word;
......
yield       4
yielded     3
yielding    2
yields      1
you         522
young       198
younger     10
youngest    3
youngling   1
your        130
yours       2
yourself    22
yourselves  1
youth       10
you’d       14
you’ll      21
you’re      23
you’ve      25
Time taken: 27.26 seconds, Fetched: 9872 row(s)
  • The results look reasonable.
  • If you skip the regex filtering, the query becomes quite a bit simpler:
select word, count(*) as cnt
from (select explode(split(sentence, ' ')) word from article) t
group by word;
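
To see the most frequent words first, here is a small sketch built on the same query; the ordering and limit are my additions rather than part of the original exercise:

-- sketch: 20 most frequent words in the article table
select word, count(*) as cnt
from (select explode(split(sentence, ' ')) word from article) t
group by word
order by cnt desc
limit 20;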

Data Preparation for the Exercises

  • Data survey
  • trains.csv (order → product)
    ----------------------
    order_id: order number
    product_id: product ID
    add_to_cart_order: the position at which the product was added to the cart
    reordered: whether the product in this order has been purchased before (1 = yes, 0 = no)

    orders.csv (in the data warehouse this is the user-behavior table)
    ----------------------
    order_id: order number
    user_id: user ID
    eval_set: which set the order belongs to (historical "prior" data or training data)
    order_number: the sequence number of the user's orders
    order_dow: order day of week — the day of the week the order was placed (0-6)
    order_hour_of_day: the hour of the day the order was placed (0-23)
    days_since_prior_order: the number of days between this order and the previous one
  • Create the tables and load the data
  • create table trains(
      order_id string,
      product_id string,
      add_to_cart_order string,
      reordered string
    ) row format delimited fields terminated by ',' lines terminated by '\n';

    hive> load data local inpath '/mnt/hgfs/vm_shared/trains.csv' overwrite into table trains;
    Loading data to table default.trains
    Table default.trains stats: [numFiles=1, numRows=0, totalSize=24680147, rawDataSize=0]
    OK
    Time taken: 1.801 seconds

    hive> select * from trains limit 10;
    OK
    trains.order_id  trains.product_id  trains.add_to_cart_order  trains.reordered
    order_id  product_id  add_to_cart_order  reordered
    1    49302  1  1
    1    11109  2  1
    1    10246  3  0
    1    49683  4  0
    1    43633  5  1
    1    13176  6  0
    1    47209  7  0
    1    22035  8  1
    36   39612  1  0
    Time taken: 0.1 seconds, Fetched: 10 row(s)
    • In the output, the first line is the column names I defined, while the first data row is the header line that came with the CSV file itself, so that row is dirty data introduced by the load and needs to be removed.

    • There are several ways to remove it.

    Method 1. The data is already loaded, so overwrite the table with an HQL query that filters out the header row.

    insert overwrite table trains select * from trains where order_id != 'order_id';
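
    A quick way to verify the overwrite worked (my own sketch, not part of the original walkthrough):

    -- should return 0 once the header row has been removed
    select count(*) from trains where order_id = 'order_id';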

    Method 2. Delete the first line from the data file directly, before loading it.

    [root@node1 vm_shared]# head trains.csv    # the first few lines of the dataset, including the header row
    order_id,product_id,add_to_cart_order,reordered
    1,49302,1,1
    1,11109,2,1
    1,10246,3,0
    1,49683,4,0
    1,43633,5,1
    1,13176,6,0
    1,47209,7,0
    1,22035,8,1
    36,39612,1,0
    • Run the command
    sed '1d' trains.csv > trains_tmp.csv
    • Result
    [root@node1 vm_shared]# head trains_tmp.csv    # the first line has been removed
    1,49302,1,1
    1,11109,2,1
    1,10246,3,0
    1,49683,4,0
    1,43633,5,1
    1,13176,6,0
    1,47209,7,0
    1,22035,8,1
    36,39612,1,0
    36,19660,2,1

    Method 3. Add the table property 'skip.header.line.count'='1' when creating the table, so the first line of the file is skipped automatically when data is loaded. For example:

    create table xxx(
      ...
    ) row format delimited fields terminated by '\t'
    tblproperties ('skip.header.line.count'='1');
    • Create the orders table and load orders.csv in the same way (a sketch follows below).
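
    A minimal sketch of what that could look like, with the column names taken from the data survey above; the string column types and the file path are my assumptions, mirroring the trains table:

    -- sketch: orders table; column names from the data survey, types and path assumed
    create table orders(
      order_id string,
      user_id string,
      eval_set string,
      order_number string,
      order_dow string,
      order_hour_of_day string,
      days_since_prior_order string
    ) row format delimited fields terminated by ',' lines terminated by '\n'
    tblproperties ('skip.header.line.count'='1');

    load data local inpath '/mnt/hgfs/vm_shared/orders.csv' overwrite into table orders;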

Exercise 2: How many orders does each user have?

select user_id, count(*) from orders group by user_id;
...
Time taken: 32.335 seconds, Fetched: 206209 row(s)
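
As a small follow-up sketch (the alias, ordering, and limit are my additions), the same aggregation can be sorted to show the users with the most orders:

-- sketch: ten users with the most orders
select user_id, count(*) as cnt
from orders
group by user_id
order by cnt desc
limit 10;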

Exercise 3: How many products does each user buy per order, on average?

Note: when computing per-group values with aggregate functions (count, sum, avg, max, min), combine them with group by.

  • Create the table and load the data
create table priors(
  order_id string,
  product_id string,
  add_to_cart_order string,
  reordered string
) row format delimited fields terminated by ',' lines terminated by '\n'
tblproperties ('skip.header.line.count'='1');

hive> load data local inpath '/mnt/hgfs/vm_shared/priors.csv' overwrite into table priors;
Loading data to table default.priors
Table default.priors stats: [numFiles=1, numRows=0, totalSize=577550706, rawDataSize=0]
OK
Time taken: 13.463 seconds
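
As with trains, a quick spot-check of the loaded data (my own sketch, not from the original):

select * from priors limit 5;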
  • Since the goal is the average number of products per order for each user, we divide each user's total number of products by that user's number of orders.
-- number of products in each order
select order_id, count(product_id) cnt from priors group by order_id;

-- average number of products per order for each user
select o.user_id, sum(p.cnt)/count(o.order_id)
from orders o
join (select order_id, count(product_id) cnt from priors group by order_id) p
  on o.order_id = p.order_id
group by user_id
limit 10;

1     5.9
2     13.928571428571429
3     7.333333333333333
4     3.6
5     9.25
6     4.666666666666667
7     10.3
8     16.333333333333332
9     25.333333333333332
10    28.6
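
If a single overall figure is enough rather than one average per user, a minimal sketch (my own addition) averages the basket size over all orders in priors:

-- sketch: average number of products per order across all orders
select avg(p.cnt)
from (select order_id, count(product_id) cnt from priors group by order_id) p;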

Exercise 4: How are each user's orders distributed across the days of the week (pivoted into one column per weekday)?

select user_id
     , sum(case when order_dow = '0' then 1 else 0 end) dow0
     , sum(case when order_dow = '1' then 1 else 0 end) dow1
     , sum(case when order_dow = '2' then 1 else 0 end) dow2
     , sum(case when order_dow = '3' then 1 else 0 end) dow3
     , sum(case when order_dow = '4' then 1 else 0 end) dow4
     , sum(case when order_dow = '5' then 1 else 0 end) dow5
     , sum(case when order_dow = '6' then 1 else 0 end) dow6
from orders
group by user_id;
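
For comparison, a minimal sketch (my own addition) of the un-pivoted, long form that the CASE WHEN expressions above spread into one column per weekday:

-- sketch: one row per (user, weekday) instead of one column per weekday
select user_id, order_dow, count(*) as cnt
from orders
group by user_id, order_dow
limit 20;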

Exercise 5: Which users have purchased more than 100 distinct products?

The trains and priors tables have the same schema; we search over the full data of both sets combined.

-- define a temporary dataset with WITH ... AS
with user_pro_cnt_tmp as (
  select * from (
    -- order training data
    select a.user_id, b.product_id
    from orders as a
    left join trains b on a.order_id = b.order_id
    union all
    -- order history (prior) data
    select a.user_id, b.product_id
    from orders as a
    left join priors b on a.order_id = b.order_id
  ) t
)
select user_id
     , count(distinct product_id) pro_cnt
from user_pro_cnt_tmp
group by user_id
having pro_cnt >= 100
limit 10;
