
Taobao Double 11 Big Data Analysis (Hive Analysis, Part 1)


Contents

  • Preface
  • Current environment
  • Counting the rows in the table with the aggregate function count()
  • Counting distinct user_id values
  • Counting fully distinct rows (to rule out customers faking orders)
  • Conditional queries on key fields
    • How many users made purchases on Double 11
    • Number of purchases of a given brand that day
  • Analysis based on user behavior
    • Purchase or browse ratio of an item on a given day
    • Ratio of male to female buyers on Double 11
    • Given a purchase-count range, finding the ids of users who bought that many items on a given day
  • Real-time user query analysis

Preface

To follow this article, first read the earlier installments in the series:

  • Taobao Double 11 Big Data Analysis (Environment)
  • Taobao Double 11 Big Data Analysis (Data Preparation)

Current environment

Use jps to check the currently running services (Master and Worker are Spark services and are not relevant to this article):

[root@centos2020 dataset]# jps
11408 Master
12707 RunJar
7876 NameNode
8183 ResourceManager
7930 DataNode
8477 NodeManager
11550 Worker
12990 Jps

In addition, the MySQL and Hive services must be started before continuing (for the startup steps, see Taobao Double 11 Big Data Analysis (Data Preparation)).
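
Before running the queries, a quick sanity check confirms that the table from the data-preparation article is visible. A minimal sketch, assuming the database is named dbtaobao as in that article (substitute your own name if it differs):

use dbtaobao;        -- assumed database name from the data-preparation article
show tables;         -- user_log should appear in the list
describe user_log;   -- the column layout used by the queries below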

Counting the rows in the table with the aggregate function count()

hive> select count(*) from user_log;

Execution log and result (to keep the article short, later runs show only the result, not the full log):

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20200226001346_be1efc88-10bf-4030-9509-62de994fb0d2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1582623436135_0001, Tracking URL = http://centos2020:8088/proxy/application_1582623436135_0001/
Kill Command = /usr/hadoop/hadoop-2.7.7/bin/hadoop job -kill job_1582623436135_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-02-26 00:14:53,555 Stage-1 map = 0%, reduce = 0%
2020-02-26 00:15:16,085 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.0 sec
2020-02-26 00:15:39,527 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.35 sec
MapReduce Total cumulative CPU time: 6 seconds 350 msec
Ended Job = job_1582623436135_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.35 sec HDFS Read: 482200 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 350 msec
OK
10000
Time taken: 115.187 seconds, Fetched: 1 row(s)

The result is 10000, which is correct: the previous article loaded exactly 10,000 rows into this table.
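
To eyeball a few of the loaded rows before moving on, a simple limit query is enough; a quick sketch:

select * from user_log limit 10;   -- sample a handful of rows for inspection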

Counting distinct user_id values

hive> select count(distinct user_id) from user_log;

Result:

MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.41 sec HDFS Read: 482582 HDFS Write: 103 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 410 msec
OK
358
Time taken: 45.244 seconds, Fetched: 1 row(s)

Counting fully distinct rows (to rule out customers faking orders)

hive> select count(*) from
      (select user_id, item_id, cat_id, merchant_id, brand_id, month, day, action
       from user_log
       group by user_id, item_id, cat_id, merchant_id, brand_id, month, day, action
       having count(*) = 1) a;

Result:

MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.93 sec HDFS Read: 485070 HDFS Write: 116 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 4.33 sec HDFS Read: 5058 HDFS Write: 104 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 260 msec
OK
4754
Time taken: 89.299 seconds, Fetched: 1 row(s)

Note that the subquery in the statement above must be given an alias (a here); otherwise Hive reports a parse error:

hive> select count(*) from (select user_id,item_id,cat_id,merchant_id,brand_id,month,day,action from user_log group by user_id,item_id,cat_id,merchant_id,brand_id,month,day,action having count(*)=1);
NoViableAltException(256@[])
	at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.atomjoinSource(HiveParser_FromClauseParser.java:2265)
	at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:2475)
	at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:1690)
	at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1312)
	at org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:42074)
	at org.apache.hadoop.hive.ql.parse.HiveParser.atomSelectStatement(HiveParser.java:36735)
	at org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:36987)
	at org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:36633)
	at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:35822)
	at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:35710)
	at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:2284)
	at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1333)
	at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:208)
	at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:77)
	at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:70)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:468)
	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
	at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
FAILED: ParseException line 1:22 cannot recognize input near '(' 'select' 'user_id' in joinSource

Conditional queries on key fields

How many users made purchases on Double 11

In this data set an action value of '2' marks a purchase record, so we count the distinct buyers:

hive> select count(distinct user_id) from user_log where action='2';

Result:

MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.76 sec HDFS Read: 483409 HDFS Write: 103 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 760 msec
OK
358
Time taken: 49.434 seconds, Fetched: 1 row(s)
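
To put those 358 buyers in context against the other behavior types, one grouped query gives the whole breakdown; a sketch (action codes other than '2' are not interpreted elsewhere in this article):

-- distinct users per action type; action = '2' corresponds to purchases
select action, count(distinct user_id) from user_log group by action;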

Number of purchases of a given brand that day

hive> select count(*) from user_log where action='2' and brand_id=2661;

Result:

MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.46 sec HDFS Read: 483258 HDFS Write: 101 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 460 msec
OK
3
Time taken: 52.679 seconds, Fetched: 1 row(s)
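
The statement above relies on the table holding only Double 11 records, which is what this article assumes throughout. If the table ever contained more than one day of data, the date would need to be pinned explicitly; a sketch using the month and day columns that appeared in the deduplication query:

-- the same count, with the date made explicit instead of implied by the data
select count(*) from user_log
where action='2' and brand_id=2661 and month=11 and day=11;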

Analysis based on user behavior

Purchase or browse ratio of an item on a given day

First find how many users made a purchase that day, i.e. the statement already used in the conditional-query section:

select count(distinct user_id) from user_log where action='2';

The result of that query is 358.

Then find how many users viewed products that day, i.e. the count of distinct user_id values from earlier:

select count(distinct user_id) from user_log;

The result is also 358.

So the ratio is 358/358 = 1; in this sample, every user who appears in the log made at least one purchase.
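
The two counts can also be combined into a single statement. A sketch using a conditional count (the CASE expression yields NULL for non-purchase rows, and count(distinct ...) ignores NULLs):

-- buyers divided by all active users, computed in one pass
select count(distinct case when action='2' then user_id end)
       / count(distinct user_id)
from user_log;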

Ratio of male to female buyers on Double 11

First count the records from male buyers:

select count(*) from user_log where gender=1;

Result: 3299

Then count the records from female buyers:

select count(*) from user_log where gender=0;

Result: 3361

So the male-to-female ratio is 3299/3361 ≈ 0.98.
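
Note that the two statements above count every log record per gender, not only purchases. To match the heading strictly, the purchase filter should arguably be added; a sketch that restricts the counts to purchase records and computes both genders in one pass:

-- purchase records only, grouped by gender (1 = male, 0 = female, as above)
select gender, count(*) from user_log where action='2' group by gender;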

Given a purchase-count range, finding the ids of users who bought that many items on a given day

For example, find the ids of users who made more than 5 purchases that day:

select user_id from user_log where action='2' group by user_id having count(action='2')>5;

Result (partial, for brevity):

MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.77 sec HDFS Read: 483294 HDFS Write: 1304 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 770 msec
OK
1321
6058
16464
...
422917
Time taken: 53.104 seconds, Fetched: 65 row(s)
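
Since the WHERE clause has already kept only purchase records, count(action='2') in the HAVING clause simply counts the remaining rows per user. An equivalent, plainer sketch, with the threshold easy to swap for any other bound:

-- same result; count(*) counts the purchase rows kept by the WHERE clause
select user_id from user_log
where action='2'
group by user_id
having count(*) > 5;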

Real-time user query analysis

Count the number of purchases per brand

This time, create a new table to store the results:

create table scan(brand_id INT, scan INT) COMMENT 'This is the search of bigdatataobao' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

Load the data (insert overwrite replaces whatever the scan table already holds, so the statement can safely be re-run):

insert overwrite table scan select brand_id,count(action) from user_log where action='2' group by brand_id;

View the result:

select * from scan;

Result (partial, for brevity):

OK
NULL    8
60      3
69      1
82      11
99      3
104     1
125     1
127     1
133     1
...
8396    3
8410    1
8461    1
Time taken: 0.852 seconds, Fetched: 544 row(s)
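
With the per-brand counts materialized in scan, follow-up queries are cheap. For example, a sketch listing the ten most-purchased brands (skipping the NULL brand_id row visible above):

-- top 10 brands by purchase count
select * from scan where brand_id is not null order by scan desc limit 10;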
