Website Traffic Analysis Project, Day 04
1. Traffic analysis
a. Multi-dimensional statistics on basic metrics
Count page views (PV)
-- valid = "true" excludes requests for static resources
select count(*) from ods_weblog_detail where datestr = "20181101" and valid = "true";
Count unique visitors (UV)
select count(distinct remote_addr) as uvs from ods_weblog_detail where datestr = "20181101";
Count visits (VV)
select count(distinct session) as vvs from ods_click_stream_visit where datestr = "20181101";
Count distinct IPs (IP)
select count(distinct remote_addr) as ips from ods_weblog_detail where datestr = "20181101";
Result table
create table dw_webflow_basic_info(month string, day string, pv bigint, uv bigint, ip bigint, vv bigint)
partitioned by (datestr string);

-- uv and ip are both derived from remote_addr here
insert into table dw_webflow_basic_info partition(datestr="20181101")
select '201811', '01', a.*, b.*
from (select count(*) as pv,
             count(distinct remote_addr) as uv,
             count(distinct remote_addr) as ips
      from ods_weblog_detail
      where datestr = '20181101') a
join (select count(distinct session) as vvs
      from ods_click_stream_visit
      where datestr = "20181101") b;
Multi-dimensional analysis: by time
Method 1: query the ods_weblog_detail table directly
-- PVs per hour within one processing batch (one day)
drop table if exists dw_pvs_everyhour_oneday;
create table dw_pvs_everyhour_oneday(month string, day string, hour string, pvs bigint)
partitioned by (datestr string);
insert into table dw_pvs_everyhour_oneday partition(datestr='20130918')
select a.month as month, a.day as day, a.hour as hour, count(*) as pvs
from ods_weblog_detail a
where a.datestr='20130918'
group by a.month, a.day, a.hour;

-- PVs per day
drop table if exists dw_pvs_everyday;
create table dw_pvs_everyday(pvs bigint, month string, day string);
insert into table dw_pvs_everyday
select count(*) as pvs, a.month as month, a.day as day
from ods_weblog_detail a
group by a.month, a.day;

Method 2: join with the time dimension table
-- Dimension: day
drop table if exists dw_pvs_everyday;
create table dw_pvs_everyday(pvs bigint, month string, day string);
insert into table dw_pvs_everyday
select count(*) as pvs, a.month as month, a.day as day
from (select distinct month, day from t_dim_time) a
join ods_weblog_detail b on a.month=b.month and a.day=b.day
group by a.month, a.day;

-- Dimension: month
drop table if exists dw_pvs_everymonth;
create table dw_pvs_everymonth(pvs bigint, month string);
insert into table dw_pvs_everymonth
select count(*) as pvs, a.month
from (select distinct month from t_dim_time) a
join ods_weblog_detail b on a.month=b.month
group by a.month;

-- Alternatively, reuse earlier results, e.g. roll the hourly figures up into a daily total
insert into table dw_pvs_everyday
select sum(pvs) as pvs, month, day
from dw_pvs_everyhour_oneday
group by month, day
having day='18';
By referer and time
-- PVs produced by each referer URL per hour
drop table if exists dw_pvs_referer_everyhour;
create table dw_pvs_referer_everyhour(referer_url string, referer_host string, month string, day string, hour string, pv_referer_cnt bigint)
partitioned by (datestr string);
insert into table dw_pvs_referer_everyhour partition(datestr='20181101')
select http_referer, ref_host, month, day, hour, count(*) as pv_referer_cnt
from dw_weblog_detail
group by http_referer, ref_host, month, day, hour
having ref_host is not null
order by hour asc, day asc, month asc, pv_referer_cnt desc;

-- PVs produced by each referer host per hour, sorted
drop table if exists dw_pvs_refererhost_everyhour;
create table dw_pvs_refererhost_everyhour(ref_host string, month string, day string, hour string, ref_host_cnts bigint)
partitioned by (datestr string);
insert into table dw_pvs_refererhost_everyhour partition(datestr='20181101')
select ref_host, month, day, hour, count(*) as ref_host_cnts
from ods_weblog_detail
group by ref_host, month, day, hour
having ref_host is not null
order by hour asc, day asc, month asc, ref_host_cnts desc;
b. Composite metric analysis
Average pages viewed per visitor (average visit depth)
drop table if exists dw_avgpv_user_everyday;
create table dw_avgpv_user_everyday(day string, avgpv string);
insert into table dw_avgpv_user_everyday
select '20130918', sum(b.pvs)/count(b.remote_addr)
from (select remote_addr, count(1) as pvs
      from ods_weblog_detail
      where datestr='20130918'
      group by remote_addr) b;
This is the average number of pages requested by each of the day's visitors. The metric indicates how sticky the site is for its users.
Calculation: total page requests (PV) / unique visitors (UV). Each distinct remote_addr is treated as a distinct user. First compute the PV count per remote_addr, sum those counts to get the total page requests, then count the remote_addr values to get the deduplicated number of visitors.
Average visit frequency
select '20181101', vv/uv from dw_webflow_basic_info;
-- Note: vv is computed from the click-stream model table, from which invalid records have already been removed.
-- The form below is the logically consistent one: both counts come from the click-stream table.
select count(session)/count(distinct remote_addr) from ods_click_stream_visit where datestr = "20181101";
The average number of times each unique visitor visits the site within a day (that is, the number of sessions produced).
Calculation: visits (VV) / unique visitors (UV)
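The two composite metrics above can be checked by hand on a tiny data set. The following is a minimal Python sketch, not part of the project: the sample rows and the function names `avg_pv_per_visitor` and `avg_visits_per_visitor` are hypothetical, and `remote_addr` stands in for a user, as in the queries.

```python
from collections import Counter

# Hypothetical sample: one (remote_addr, request) pair per page view.
page_views = [
    ("10.0.0.1", "/index"),
    ("10.0.0.1", "/item/1"),
    ("10.0.0.1", "/order"),
    ("10.0.0.2", "/index"),
]

# Hypothetical sample: one (session, remote_addr) pair per visit.
visits = [
    ("s1", "10.0.0.1"),
    ("s2", "10.0.0.1"),
    ("s3", "10.0.0.2"),
]


def avg_pv_per_visitor(rows):
    """Total PV divided by UV: sum the per-address PV counts, then divide by the number of addresses."""
    pvs_by_addr = Counter(addr for addr, _ in rows)
    return sum(pvs_by_addr.values()) / len(pvs_by_addr)


def avg_visits_per_visitor(rows):
    """VV divided by UV: number of sessions over number of distinct addresses."""
    sessions = {session for session, _ in rows}
    visitors = {addr for _, addr in rows}
    return len(sessions) / len(visitors)


print(avg_pv_per_visitor(page_views))   # 4 page views / 2 visitors
print(avg_visits_per_visitor(visits))   # 3 sessions / 2 visitors
```

Both functions take their numerator and denominator from the same input, which mirrors the second, single-table form of the visit-frequency query.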
c. Top-N within groups
Find the three referer hosts that produce the most PVs in each hour
-- Table: dw_weblog_detail
-- Grouping field: time
-- Measure: count
select month, day, hour, ref_host, count(1) pvs
from dw_weblog_detail
group by month, day, hour, ref_host;

-- Rank the hosts within each hour and keep the top three
select hour, ref_host, pvs, rank
from (select concat(month,day,hour) hour, ref_host, pvs,
             row_number() over (partition by concat(month,day,hour) order by pvs desc) rank
      from (select month, day, hour, ref_host, count(1) pvs
            from dw_weblog_detail
            group by month, day, hour, ref_host) t) t1
where t1.rank <= 3;
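The grouped top-N logic above (count per group, rank within each group, keep the first N) can be mimicked on a small in-memory sample. This is a minimal Python sketch under stated assumptions: the function name `top_hosts_per_hour` and the sample rows are hypothetical, and ties are broken by host name, whereas row_number() leaves the order of ties unspecified.

```python
from collections import Counter, defaultdict


def top_hosts_per_hour(rows, n=3):
    """rows: iterable of (hour, ref_host), one entry per page view."""
    pvs = Counter(rows)                      # count(1) ... group by hour, ref_host
    by_hour = defaultdict(list)
    for (hour, host), count in pvs.items():  # partition by hour
        by_hour[hour].append((host, count))
    return {
        # order by pvs desc (host name as tie-breaker), then keep the first n
        hour: sorted(hosts, key=lambda item: (-item[1], item[0]))[:n]
        for hour, hosts in by_hour.items()
    }


# Hypothetical sample data.
sample = (
    [("2018110108", "a.com")] * 5
    + [("2018110108", "b.com")] * 3
    + [("2018110108", "c.com")] * 2
    + [("2018110108", "d.com")] * 1
    + [("2018110109", "a.com")] * 1
)
top = top_hosts_per_hour(sample)
print(top["2018110108"])
```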
2. Visited-page analysis
a. PV (UV, VV, etc.) of each page
Count the PV of each page
-- Table: dw_weblog_detail
-- Grouping field: request
-- Measure: count
select request, count(1) request_count
from dw_weblog_detail
where valid='true'
group by request
having request is not null
order by request_count desc
limit 20;
b. Popular page statistics
Find the top 10 most popular pages of the day
-- Table: dw_weblog_detail
-- Grouping: request
-- Measure: count
select '20130928', request, count(1) request_count
from dw_weblog_detail
where valid='true'
group by request
order by request_count desc
limit 10;
3. Visitor analysis
a. Unique visitors
Count unique visitors and the PVs they produce along a time dimension (for example, per hour).
-- Unique visitor analysis
-- Table: dw_weblog_detail
-- Grouping: hour
-- Measure: count
select hour, remote_addr, count(1) pvs
from dw_weblog_detail
group by hour, remote_addr;
b. Daily new visitors
Identify each day's new visitors.
For any new-versus-existing (binary) question, keep a history table and a table for the new data, then join the two, preferably with a left or right outer join. Here the day's visitors form the left side of a left outer join: a row whose right-side columns are null belongs to a new visitor.
Create the new table and the history table
-- Cumulative table of deduplicated visitors over all previous days
drop table if exists dw_user_dsct_history;
create table dw_user_dsct_history(day string, ip string) partitioned by (datestr string);
-- Daily new visitor table
drop table if exists dw_user_new_d;
create table dw_user_new_d(day string, ip string) partitioned by (datestr string);

-- Query the day's visitors
select remote_addr from dw_weblog_detail group by remote_addr;

-- Join with the history data and count the new visitors
select count(t1.remote_addr)
from (select remote_addr from dw_weblog_detail where datestr="20181101" group by remote_addr) t1
left join dw_user_dsct_history t2 on t1.remote_addr=t2.ip
where t2.ip is null;

-- Insert the new visitors into the new-visitor table (only rows with no match in the history table)
insert into table dw_user_new_d partition(datestr="20181101")
select t1.day, t1.remote_addr
from (select concat(month,day) day, remote_addr
      from dw_weblog_detail
      where datestr="20181101"
      group by concat(month,day), remote_addr) t1
left join dw_user_dsct_history t2 on t1.remote_addr=t2.ip
where t2.ip is null;

-- Append the new visitors to the history table
insert into table dw_user_dsct_history partition(datestr="20181101")
select day, ip from dw_user_new_d where datestr="20181101";
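The "left outer join, keep rows where the right side is null" pattern described above can be tried locally. The sketch below runs the same shape of query against an in-memory SQLite database from Python's standard library. It is an illustration under assumptions, not the project's Hive job: the table names `today` and `history` and the function `new_visitors` are hypothetical.

```python
import sqlite3


def new_visitors(today_ips, history_ips):
    """Return the IPs seen today that are absent from the history table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table today(ip text)")
    conn.execute("create table history(ip text)")
    conn.executemany("insert into today values (?)", [(ip,) for ip in today_ips])
    conn.executemany("insert into history values (?)", [(ip,) for ip in history_ips])
    rows = conn.execute(
        """
        select t1.ip
        from (select distinct ip from today) t1
        left join history t2 on t1.ip = t2.ip
        where t2.ip is null          -- no match on the right side: a new visitor
        order by t1.ip
        """
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


print(new_visitors(["1.1.1.1", "2.2.2.2", "2.2.2.2", "3.3.3.3"], ["1.1.1.1"]))
```

Dropping the `where t2.ip is null` filter would return every visitor of the day, new or not, which is why the filter matters in the insert as well as in the count.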
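c. Geographic analysis
An IP address can usually be mapped to a country, region (province/state), city, street, latitude and longitude, ISP, and similar information. IP databases change over time (though little within any short period), so they need regular maintenance and updates. The data can be neither fully accurate nor complete in coverage.
A well-known IP database in China is the Chunzhen (纯真) IP database; internationally, maxmind and ip2location are common. Both paid and free IP databases exist. Actively maintained data is usually paid, with somewhat higher accuracy and coverage.
Lookup methods:
- Local: download the IP database and query it locally. This is efficient and fast, and is the usual choice for statistical analysis. It takes two forms:
  In-memory lookup: load all the data into memory for high-performance queries. Alternatively, the binary data file is itself an optimized index file and can be queried directly.
  Database lookup: import the data into a database and query it there. This is slower than in-memory lookup.
- Remote (web service or ajax): call a third-party remote service. This is naturally slower and is generally used in web applications.
The essence of a lookup: given an IP, find the IP segment that contains it, usually by binary search.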
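The lookup idea above (find the segment containing a given IP by binary search) can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the `SEGMENTS` table and region names are invented, segments are assumed sorted by start address and non-overlapping, and the format of a real IP database is not modeled.

```python
import bisect
import ipaddress

# Hypothetical segments: (start, end, region), sorted by start, non-overlapping.
SEGMENTS = [
    ("1.0.0.0", "1.0.0.255", "Region A"),
    ("1.0.1.0", "1.0.3.255", "Region B"),
    ("8.8.8.0", "8.8.8.255", "Region C"),
]


def to_int(ip):
    """Convert a dotted IPv4 string to its integer value."""
    return int(ipaddress.IPv4Address(ip))


# Integer start addresses, used as the sorted search keys.
STARTS = [to_int(start) for start, _, _ in SEGMENTS]


def lookup(ip):
    """Return the region whose segment contains ip, or None if no segment does."""
    value = to_int(ip)
    # Index of the last segment whose start is <= value.
    idx = bisect.bisect_right(STARTS, value) - 1
    if idx < 0:
        return None
    _, end, region = SEGMENTS[idx]
    return region if value <= to_int(end) else None


print(lookup("1.0.2.7"))
print(lookup("5.5.5.5"))
```

Converting addresses to integers first is what makes the comparison and the binary search well defined; comparing dotted strings directly would order them lexically.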
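4. Visit analysis
a. Returning / single-visit visitors
-- Table: ods_click_stream_visit
-- Measure: count
-- Grouping: remote_addr
-- More than one session means a returning visitor; use = 1 instead for single-visit visitors
select t1.day, t1.remote_addr, t1.visit_cnt
from (select '20181101' as day, remote_addr, count(session) visit_cnt
      from ods_click_stream_visit
      group by remote_addr) t1
where t1.visit_cnt > 1;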
b. Average visits per visitor
Requirement: compute the average number of visits per user to the site for each day.
-- Table: ods_click_stream_visit
-- Measure: count
-- Grouping: day
select count(session)/count(distinct remote_addr)
from ods_click_stream_visit
where datestr='20181101';
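5. Key path conversion rate
-- Rule of thumb: when the current row must be computed against the previous row,
-- self-join the table and derive the join condition from the pattern in the data.
First create the summary table.
0) Plan a user behavior path
Step1: /item
Step2: /category
Step3: /index
Step4: /order
1) Compute the count at each step of this path (the query below counts distinct remote_addr) and store the result in one table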
create table dw_oute_numbs as
select 'step1' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where datestr='20181103' and request like '/item%'
union all
select 'step2' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where datestr='20181103' and request like '/category%'
union all
select 'step3' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where datestr='20181103' and request like '/index%'
union all
select 'step4' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where datestr='20181103' and request like '/order%';
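2) Conversion rate of each step relative to the first step
select t2.step, t2.numbs/t1.numbs as abs_rate
from dw_oute_numbs t1 join dw_oute_numbs t2
where t1.step="step1";
3) Conversion rate of each step relative to the previous step
-- t1 is the current step, t2 is the step before it
select t1.step, (t1.numbs/t2.numbs)*100 as rel_rate_pct
from dw_oute_numbs t1 join dw_oute_numbs t2
where cast(substring(t1.step,5,1) as int) - 1 = cast(substring(t2.step,5,1) as int);
4) Combine the two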
select abs.step,abs.numbs,abs.rate as abs_ratio,rel.rate as leakage_rate
from
(select tmp.rnstep as step,tmp.rnnumbs as numbs,tmp.rnnumbs/tmp.rrnumbs as rate
from
(select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from dw_oute_numbs rn
inner join
dw_oute_numbs rr) tmp
where tmp.rrstep='step1') abs
left outer join
(select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as rate
from
(select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from dw_oute_numbs rn
inner join
dw_oute_numbs rr) tmp
where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1
) rel
on abs.step=rel.step;
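The two ratios computed above (each step against step 1, and each step against the step before it) are easy to verify on made-up counts. A minimal Python sketch follows: the step counts and the function name `conversion_rates` are hypothetical. The first step has no previous step, so its relative rate is `None`, which corresponds to the NULL produced by the left outer join.

```python
def conversion_rates(steps):
    """steps: list of (step_name, count) in path order.

    Returns (step_name, count, rate_vs_first_step, rate_vs_previous_step) tuples.
    """
    first = steps[0][1]
    result = []
    previous = None
    for name, count in steps:
        abs_rate = count / first if first else None
        if previous is None:
            rel_rate = None                      # step 1 has no previous step
        else:
            rel_rate = count / previous if previous else None
        result.append((name, count, abs_rate, rel_rate))
        previous = count
    return result


# Hypothetical counts for the four planned steps.
funnel = [("step1", 1000), ("step2", 500), ("step3", 250), ("step4", 100)]
for row in conversion_rates(funnel):
    print(row)
```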
6. Module development: data export
a. Direct export from a Hive table to an RDBMS table
This is efficient, since data moves straight from the Hive table into the RDBMS table, but it does not allow fine-grained control.
b. Export from Hive to HDFS, then to an RDBMS table
The data is first exported from the Hive table to HDFS, then loaded from HDFS into the RDBMS. This takes one more step than direct export, but allows more precise handling of the data. In particular, when exporting from Hive to HDFS you can select fields, transform fields, and filter rows, so that the data on HDFS is close to, or identical to, what will actually be loaded into the RDBMS table, which speeds up the export.
c. Full export to mysql
Hive --> HDFS
Export the dw_pvs_referer_everyhour table to HDFS
insert overwrite directory '/weblog/export/dw_pvs_referer_everyhour'
row format delimited fields terminated by ','
stored as textfile
select referer_url, hour, pv_referer_cnt
from dw_pvs_referer_everyhour
where datestr = "20181101";
d. Incremental export to mysql
Use case: sync incremental records from a Hive table into the target table.
Technique: sqoop export with --update-mode allowinsert. This mode inserts records that exist in Hive but not in the target table, and also updates target records that differ from Hive.
Approach: the dw_webflow_basic_info basic metrics table is used as the example.
Steps:
1) Create the target table in mysql by hand
create table dw_webflow_basic_info(
monthstr varchar(20),
daystr varchar(10),
pv bigint,
uv bigint,
ip bigint,
vv bigint);
2) Run a full export first, exporting the data currently in the Hive partition 20181101
bin/sqoop export \
--connect jdbc:mysql://node01:3306/weblog \
--username root --password 123456 \
--table dw_webflow_basic_info \
--fields-terminated-by '\001' \
--export-dir /user/hive/warehouse/itheima_weblog.db/dw_webflow_basic_info/datestr=20181101/
3) For demonstration, manually add data for 20181103 to Hive
insert into table dw_webflow_basic_info partition(datestr="20181103") values("201811","03",14250,1341,1341,96);
4) Run the incremental export with sqoop
bin/sqoop export \
--connect jdbc:mysql://node01:3306/weblog \
--username root \
--password 123456 \
--table dw_webflow_basic_info \
--fields-terminated-by '\001' \
--update-key monthstr,daystr \
--update-mode allowinsert \
--export-dir /user/hive/warehouse/itheima_weblog.db/dw_webflow_basic_info/datestr=20181103/
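The behavior described for allowinsert (insert records missing from the target, update records that differ) amounts to an upsert keyed on the --update-key columns. The Python sketch below models only that described behavior on in-memory dictionaries. It is not how Sqoop is implemented, and the function name `export_allowinsert` and the sample rows are hypothetical.

```python
def export_allowinsert(target, records, update_key=("monthstr", "daystr")):
    """Upsert records into target, a dict keyed by the update-key column values."""
    for record in records:
        key = tuple(record[col] for col in update_key)
        if key in target:
            target[key].update(record)   # existing row: overwrite differing values
        else:
            target[key] = dict(record)   # missing row: insert it
    return target


# Hypothetical state of the mysql table after the full export.
mysql_rows = {
    ("201811", "01"): {"monthstr": "201811", "daystr": "01", "pv": 100, "uv": 10, "ip": 10, "vv": 5},
}
# Hypothetical Hive rows: one changed row and one new row.
hive_rows = [
    {"monthstr": "201811", "daystr": "01", "pv": 120, "uv": 12, "ip": 12, "vv": 6},
    {"monthstr": "201811", "daystr": "03", "pv": 14250, "uv": 1341, "ip": 1341, "vv": 96},
]
export_allowinsert(mysql_rows, hive_rows)
print(sorted(mysql_rows))
```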
e. Scheduled incremental export
Use case: automatically sync incremental records from a Hive table into the target table on a schedule.
Technique: sqoop export with --update-mode allowinsert. This mode inserts records that exist in Hive but not in the target table, and also updates target records that differ from Hive.
Approach: the dw_webflow_basic_info basic metrics table is used as the example.
#!/bin/bash
export SQOOP_HOME=/export/servers/sqoop
# Use the date passed as the first argument; otherwise default to yesterday
if [ $# -eq 1 ]
then
    execute_date=$(date --date="${1}" +%Y%m%d)
else
    execute_date=$(date -d '-1 day' +%Y%m%d)
fi
echo "execute_date:${execute_date}"
table_name="dw_webflow_basic_info"
hdfs_dir=/user/hive/warehouse/itheima.db/dw_webflow_basic_info/datestr=${execute_date}
mysql_db_pwd=hadoop
# Despite its name, this variable holds the mysql user name
mysql_db_name=root
echo 'sqoop start'
$SQOOP_HOME/bin/sqoop export \
--connect "jdbc:mysql://node-1:3306/weblog" \
--username $mysql_db_name \
--password $mysql_db_pwd \
--table $table_name \
--fields-terminated-by '\001' \
--update-key monthstr,daystr \
--update-mode allowinsert \
--export-dir $hdfs_dir
echo 'sqoop end'
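The date handling in the script above (use the supplied date if there is exactly one argument, otherwise yesterday) is the part most worth testing in isolation. Below is a minimal Python sketch of that rule. It assumes the argument is given as YYYY-MM-DD, which is narrower than what `date --date` accepts; the function names and the injectable `today` parameter are hypothetical, and the warehouse path is copied from the script.

```python
from datetime import date, datetime, timedelta

# Path copied from the shell script above.
WAREHOUSE_DIR = "/user/hive/warehouse/itheima.db/dw_webflow_basic_info"


def resolve_execute_date(args, today=None):
    """Return the partition date as YYYYMMDD.

    args: command-line arguments without the program name.
    today: injectable current date, so the default branch can be tested.
    """
    if len(args) == 1:
        # Assumption: the caller passes YYYY-MM-DD.
        return datetime.strptime(args[0], "%Y-%m-%d").strftime("%Y%m%d")
    today = today or date.today()
    return (today - timedelta(days=1)).strftime("%Y%m%d")


def export_dir(execute_date):
    """Build the HDFS partition directory handed to --export-dir."""
    return f"{WAREHOUSE_DIR}/datestr={execute_date}"


print(export_dir(resolve_execute_date(["2018-11-03"])))
```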
7. Module development: workflow scheduling
Following the data processing flow and the business requirements, the data preprocessing module can be split into three steps: preprocessing and cleaning, the pageviews click-stream model, and the visit click-stream model. The three steps have clear dependencies between them, so running them periodically on a schedule with azkaban is very convenient.
Package each of the earlier preprocessing MapReduce programs as a jar (three in total).
Write the azkaban job files and set their dependencies.
a. Data preprocessing scheduling
# weblog_preprocess.job
type=command
command=/export/servers/hadoop-2.6.0-cdh5.14.0/bin/hadoop jar preprocess.jar /weblog/log /weblog/out

# weblog_click_pageviews.job
type=command
dependencies=weblog_preprocess
command=/export/servers/hadoop-2.6.0-cdh5.14.0/bin/hadoop jar weblog_click_pageviews.jar /weblog/out /weblog/pageviews

# weblog_click_visit.job
type=command
dependencies=weblog_click_pageviews
command=/export/servers/hadoop-2.6.0-cdh5.14.0/bin/hadoop jar weblog_click_visit.jar /weblog/pageviews /weblog/visit
b. Scheduled data loading
#!/bin/bash
# load-weblog.sh
export HIVE_HOME=/export/servers/hive
# Use the date passed as the first argument; otherwise default to yesterday
if [ $# -eq 1 ]
then
    datestr=$(date --date="${1}" +%Y%m%d)
else
    datestr=$(date -d '-1 day' +%Y%m%d)
fi
HQL="load data inpath '/preprocess/' into table itheima.ods_weblog_origin partition(datestr='${datestr}')"
echo "Starting load......"
$HIVE_HOME/bin/hive -e "$HQL"
echo "Load finished......"

# load-weblog.job
type=command
command=sh load-weblog.sh
c. Scheduled statistics computation
#!/bin/bash
HQL="
drop table if exists dw_user_dstc_ip_h;
create table dw_user_dstc_ip_h(remote_addr string, pvs bigint, hour string);
insert into table dw_user_dstc_ip_h
select remote_addr, count(1) as pvs, concat(month,day,hour) as hour
from ods_weblog_detail
where datestr='20181101'
group by concat(month,day,hour), remote_addr;
"
echo "$HQL"
/export/servers/hive/bin/hive -e "$HQL"