UserBehavior: field analysis of the Alibaba Taobao user-behavior dataset
1) Create the directory /data/userbehavior in HDFS and upload the UserBehavior.csv file into it. (5 points)
[root@gree139 exam]# hdfs dfs -mkdir -p /data/userbehavior
[root@gree139 exam]# hdfs dfs -put ./UserBehavior.csv /data/userbehavior
[root@gree139 exam]# hdfs dfs -ls /data/userbehavior
2) Use an HDFS command to count how many lines of data the file contains. (5 points)
[root@gree139 exam]# hdfs dfs -cat /data/userbehavior/UserBehavior.csv | wc -l
561294
To connect a client to Hive, start hiveserver2 first (along with the metastore service):
[root@gree139 hive110]# nohup ./bin/hive --service hiveserver2 &
[root@gree139 hive110]# nohup ./bin/hive --service metastore &
hive> create database exam;
use exam;
create external table if not exists userbehavior(
  user_id int,
  item_id int,
  category_id int,
  behavior_type string,
  time bigint)
row format delimited fields terminated by ','
stored as textfile
location '/data/userbehavior';
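As a minimal sketch (plain Python, not Hive), this is how one CSV line maps onto the five columns the external table declares; the sample line is hypothetical, not taken from the real dataset.

```python
# Hypothetical sample row in the UserBehavior.csv layout
line = "1,2268318,2520377,pv,1511544070"

# Mirrors: row format delimited fields terminated by ','
fields = line.split(",")
assert len(fields) == 5

record = {
    "user_id": int(fields[0]),
    "item_id": int(fields[1]),
    "category_id": int(fields[2]),
    "behavior_type": fields[3],   # pv / buy / cart / fav
    "time": int(fields[4]),       # unix timestamp, maps to bigint
}
print(record["behavior_type"])
```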
3) Create the namespace exam in HBase, and within it create the table userbehavior with a single column family info. (5 points)
hbase(main):004:0> disable 'exam:userbehavior'
0 row(s) in 2.3720 seconds
hbase(main):005:0> drop 'exam:userbehavior'
0 row(s) in 1.2500 seconds
hbase(main):007:0> create_namespace 'exam'
hbase(main):007:0> create 'exam:userbehavior','info'
hbase(main):009:0> count 'exam:userbehavior'
4) Create an external table userbehavior_hbase in Hive mapped to the HBase table (5 points), and load the data into HBase (5 points).
create external table if not exists userbehavior_hbase(
  user_id int,
  item_id int,
  category_id int,
  behavior_type string,
  time bigint)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping"=":key,info:item_id,info:category_id,info:behavior_type,info:time")
tblproperties("hbase.table.name"="exam:userbehavior");

insert into userbehavior_hbase select * from exam.userbehavior;

hbase(main):015:0> scan 'exam:userbehavior'
hbase(main):013:0> get 'exam:userbehavior','108982','info'
5) In the exam database, create an internal partitioned table userbehavior_partitioned (partitioned by date). Query the userbehavior table, format the timestamp as "year-month-day hour:minute:second", and insert the data into userbehavior_partitioned, for example as in the figure below: (15 points)
drop table userbehavior_partitioned;
create table userbehavior_partitioned(
  user_id int,
  item_id int,
  category_id int,
  behavior_type string,
  time string)
partitioned by (dt string)
stored as orc;

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert into userbehavior_partitioned partition (dt)
select user_id, item_id, category_id, behavior_type,
       from_unixtime(time,'yyyy-MM-dd HH:mm:ss') as time,
       from_unixtime(time,'yyyy-MM-dd') as dt
from userbehavior;

show partitions userbehavior_partitioned;
select * from userbehavior_partitioned;
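Hive's from_unixtime applies a Java SimpleDateFormat pattern (lowercase 'yyyy' is the calendar year; 'YYYY' would be the week-based year). A minimal sketch of the same formatting in plain Python, using a hypothetical sample timestamp:

```python
from datetime import datetime, timezone

ts = 1511544070  # hypothetical sample unix timestamp (seconds)

# Equivalent of from_unixtime(time,'yyyy-MM-dd HH:mm:ss'); UTC is used here
# for a deterministic result, whereas Hive formats in the server's timezone.
formatted = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# Equivalent of from_unixtime(time,'yyyy-MM-dd'), used as the dt partition value
dt_partition = formatted[:10]

print(formatted, dt_partition)
```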
3. User behavior analysis (20 points)
Using Spark, load the UserBehavior.csv file from HDFS and complete the following analyses with RDDs.
scala> val fileRdd = sc.textFile("/data/userbehavior/")
scala> val userbehaviorRdd = fileRdd.map(x=>x.split(",")).filter(x=>x.length==5)
1) Compute the UV value (how many distinct users visited Taobao). (10 points)
scala> userbehaviorRdd.map(x=>x(0)).distinct().count
res3: Long = 5458
scala> userbehaviorRdd.groupBy(x=>x(0)).count
res5: Long = 5458
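The UV count is simply the number of distinct user_id values (the first field). A minimal sketch of the same logic in plain Python instead of Spark, over hypothetical rows:

```python
# Hypothetical sample rows in the UserBehavior.csv layout
rows = [
    "1,100,10,pv,1511544070",
    "1,101,10,buy,1511544071",
    "2,100,10,pv,1511544072",
]

# Mirrors map(x=>x(0)).distinct().count: a set keeps one entry per user_id
uv = len({line.split(",")[0] for line in rows})
print(uv)  # 3 rows but only 2 distinct users -> 2
```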
2) Count the totals of each behavior type: clicks (pv), favorites (fav), add-to-cart (cart), and purchases (buy). (10 points)
scala> userbehaviorRdd.map(x=>(x(3),1)).reduceByKey(_+_).collect.foreach(println)
(cart,30888)
(buy,11508)
(pv,503881)
(fav,15017)
scala> userbehaviorRdd.map(x=>(x(3),1)).groupByKey().map(x=>(x._1,x._2.toList.size)).collect.foreach(println)
(cart,30888)
(buy,11508)
(pv,503881)
(fav,15017)
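Counting occurrences per behavior_type (the fourth field) is the same fold that reduceByKey(_+_) performs on (key, 1) pairs. A minimal plain-Python sketch over hypothetical rows:

```python
from collections import Counter

# Hypothetical sample rows in the UserBehavior.csv layout
rows = [
    "1,100,10,pv,1511544070",
    "1,101,10,buy,1511544071",
    "2,100,10,pv,1511544072",
    "2,102,11,cart,1511544073",
]

# Mirrors map(x=>(x(3),1)).reduceByKey(_+_)
counts = Counter(line.split(",")[3] for line in rows)
print(counts["pv"], counts["buy"], counts["cart"], counts["fav"])
```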
4. Identify valuable users (30 points)
1) Use Spark SQL to compute each user's most recent purchase time. With 2017-12-03 as the current date and a one-month window, the recency lies in the range 0-30 days; split it into 5 buckets, 0-6, 7-12, 13-18, 19-24 and 25-30 days, scored 4 down to 0 respectively. (15 points)
Hive:
select t.user_id,
       (case when t.diff between 0 and 6 then 4
             when t.diff between 7 and 12 then 3
             when t.diff between 13 and 18 then 2
             when t.diff between 19 and 24 then 1
             when t.diff between 25 and 30 then 0
             else null end) level
from (select user_id, datediff('2017-12-03', max(dt)) diff, max(dt) maxnum
      from exam.userbehavior_partitioned
      where behavior_type='buy'  -- only purchases count toward recency
      group by user_id) t;

Spark SQL:
scala> spark.sql("""
     | select t.user_id,
     |        (case when t.diff between 0 and 6 then 4
     |              when t.diff between 7 and 12 then 3
     |              when t.diff between 13 and 18 then 2
     |              when t.diff between 19 and 24 then 1
     |              when t.diff between 25 and 30 then 0
     |              else null end) level
     | from (select user_id, datediff('2017-12-03', max(dt)) diff, max(dt) maxnum
     |       from exam.userbehavior_partitioned
     |       where behavior_type='buy'
     |       group by user_id) t
     | """).show()
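The CASE expression above is just a bucketed mapping from recency in days to a score. A minimal sketch of the same mapping as a plain Python function:

```python
def recency_score(diff_days):
    """Map recency (days since last purchase, 0-30) to a score of 4 down to 0
    in 6-day buckets, matching the SQL CASE expression."""
    if 0 <= diff_days <= 6:
        return 4
    if 7 <= diff_days <= 12:
        return 3
    if 13 <= diff_days <= 18:
        return 2
    if 19 <= diff_days <= 24:
        return 1
    if 25 <= diff_days <= 30:
        return 0
    return None  # mirrors the SQL "else null"

print([recency_score(d) for d in (0, 6, 7, 30, 31)])  # [4, 4, 3, 0, None]
```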
2) Use Spark SQL to compute each user's purchase frequency. With 2017-12-03 as the current date and a one-month window, count each user's purchases; counts range from 1 to 161 (low to high) and are split into 5 buckets, 1-32, 33-64, 65-96, 97-128 and 129-161, scored 0 up to 4 respectively. (15 points)
Spark SQL:
scala> spark.sql("""
     | select t.user_id,
     |        (case when t.num between 129 and 161 then 4
     |              when t.num between 97 and 128 then 3
     |              when t.num between 65 and 96 then 2
     |              when t.num between 33 and 64 then 1
     |              when t.num between 1 and 32 then 0
     |              else null end) level
     | from (select user_id, count(user_id) num
     |       from exam.userbehavior_partitioned
     |       where behavior_type="buy"
     |       and dt between '2017-11-03' and '2017-12-03'
     |       group by user_id) t
     | """).show()

View the purchase-count levels:
select t2.user_id, t2.level from
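As with the recency score, the frequency CASE expression is a bucketed mapping, here from purchase count to a score of 0 up to 4. A minimal plain-Python sketch:

```python
def frequency_score(num):
    """Map a purchase count (1-161) to a score of 0 up to 4 in buckets of 32,
    matching the SQL CASE expression."""
    if 1 <= num <= 32:
        return 0
    if 33 <= num <= 64:
        return 1
    if 65 <= num <= 96:
        return 2
    if 97 <= num <= 128:
        return 3
    if 129 <= num <= 161:
        return 4
    return None  # mirrors the SQL "else null"

print([frequency_score(n) for n in (1, 32, 33, 161, 0)])  # [0, 0, 1, 4, None]
```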