
Website Traffic Analysis Project, Day 04

Published: 2023/12/15

1. Traffic Analysis

a. Basic metrics: multi-dimensional statistics

Count page views (PV)

select count(*) from ods_weblog_detail where datestr ="20181101" and valid = "true"; --valid = "true" excludes requests for static resources

Count unique visitors (UV)

select count(distinct remote_addr) as uvs from ods_weblog_detail where datestr ="20181101";

Count visits (VV)

select count(distinct session) as vvs from ods_click_stream_visit where datestr ="20181101";

Count distinct IPs

select count(distinct remote_addr) as ips from ods_weblog_detail where datestr ="20181101";

Result table

create table dw_webflow_basic_info(month string,day string,
pv bigint,uv bigint ,ip bigint, vv bigint) partitioned by(datestr string);

insert into table dw_webflow_basic_info partition(datestr="20181101")
select '201811','01',a.*,b.* from
(select count(*) as pv,count(distinct remote_addr) as uv,count(distinct remote_addr) as ips 
from ods_weblog_detail
where datestr ='20181101') a join 
(select count(distinct session) as vvs from ods_click_stream_visit where datestr ="20181101") b;

Multi-dimensional analysis: by time

Option 1: query the single table ods_weblog_detail directly
--hourly PVs within this processing batch (one day)
drop table dw_pvs_everyhour_oneday;
create table dw_pvs_everyhour_oneday(month string,day string,hour string,pvs bigint) partitioned by(datestr string);

insert into table dw_pvs_everyhour_oneday partition(datestr='20130918')
select a.month as month,a.day as day,a.hour as hour,count(*) as pvs from ods_weblog_detail a
where  a.datestr='20130918' group by a.month,a.day,a.hour;

--daily PVs
drop table dw_pvs_everyday;
create table dw_pvs_everyday(pvs bigint,month string,day string);

insert into table dw_pvs_everyday
select count(*) as pvs,a.month as month,a.day as day from ods_weblog_detail a
group by a.month,a.day;

Option 2: join against the time dimension table
--dimension: day
drop table dw_pvs_everyday;
create table dw_pvs_everyday(pvs bigint,month string,day string);

insert into table dw_pvs_everyday
select count(*) as pvs,a.month as month,a.day as day from (select distinct month, day from t_dim_time) a
join ods_weblog_detail b 
on a.month=b.month and a.day=b.day
group by a.month,a.day;

--dimension: month
drop table dw_pvs_everymonth;
create table dw_pvs_everymonth (pvs bigint,month string);

insert into table dw_pvs_everymonth
select count(*) as pvs,a.month from (select distinct month from t_dim_time)  a
join ods_weblog_detail b on a.month=b.month group by a.month;

--Alternatively, reuse earlier results: e.g., roll the per-hour table up into daily PVs
insert into table dw_pvs_everyday
select sum(pvs) as pvs,month,day from dw_pvs_everyhour_oneday group by month,day having day='18';

By referer and time dimension

--PVs generated by each referer URL, per hour
drop table dw_pvs_referer_everyhour;
create table dw_pvs_referer_everyhour(referer_url string,referer_host string,month string,day string,hour string,pv_referer_cnt bigint) partitioned by(datestr string);

insert into table dw_pvs_referer_everyhour partition(datestr='20181101')
select http_referer,ref_host,month,day,hour,count(*) as pv_referer_cnt
from dw_weblog_detail 
group by http_referer,ref_host,month,day,hour 
having ref_host is not null
order by hour asc,day asc,month asc,pv_referer_cnt desc;

--PVs generated by each referer host, per hour, sorted
drop table dw_pvs_refererhost_everyhour;
create table dw_pvs_refererhost_everyhour(ref_host string,month string,day string,hour string,ref_host_cnts bigint) partitioned by(datestr string);

insert into table dw_pvs_refererhost_everyhour partition(datestr='20181101')
select ref_host,month,day,hour,count(*) as ref_host_cnts
from ods_weblog_detail 
group by ref_host,month,day,hour 
having ref_host is not null
order by hour asc,day asc,month asc,ref_host_cnts desc;

b. Composite metrics

Pages per visitor (average visit depth)

drop table dw_avgpv_user_everyday;
create table dw_avgpv_user_everyday(
day string,
avgpv string);

insert into table dw_avgpv_user_everyday
select '20130918',sum(b.pvs)/count(b.remote_addr) from
(select remote_addr,count(1) as pvs from ods_weblog_detail where datestr='20130918' group by remote_addr) b;
The average number of pages requested per visitor today; it indicates how sticky the site is for users.
Formula: total page requests (PV) / unique visitors (UV)
remote_addr stands in for a user: count PVs per remote_addr, sum those counts for the total page requests, and count the distinct remote_addr values for the deduplicated visitor total.
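For intuition, the same sum-then-divide logic can be sketched outside Hive. This is a minimal Python illustration with toy rows, not part of the project code:

```python
from collections import Counter

def avg_visit_depth(log_rows):
    """Average visit depth = total page requests (PV) / unique visitors (UV).

    log_rows: iterable of (remote_addr, request) tuples, mirroring
    one day's rows in ods_weblog_detail.
    """
    pv_per_ip = Counter(ip for ip, _ in log_rows)  # PVs per remote_addr
    total_pv = sum(pv_per_ip.values())             # sum(b.pvs)
    uv = len(pv_per_ip)                            # count(b.remote_addr)
    return total_pv / uv if uv else 0.0

rows = [("1.1.1.1", "/a"), ("1.1.1.1", "/b"), ("2.2.2.2", "/a")]
print(avg_visit_depth(rows))  # 3 requests / 2 visitors = 1.5
```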

Average visit frequency

select '20181101',vv/uv from dw_webflow_basic_info; --note: vv comes from the clickstream model table, with invalid records already removed

select count(session)/count(distinct remote_addr) from ods_click_stream_visit where datestr ="20181101"; --equivalent logic

The average number of visits (sessions) per unique visitor per day.
Formula: visits (VV) / unique visitors (UV)

c. Top-N within groups

Top three referer hosts by PV for each hour

--table: dw_weblog_detail
--group by: time
--measure: count
select
month,day,hour,ref_host,count(1) pvs
from dw_weblog_detail
group by month,day,hour,ref_host;

select  
hour,ref_host,pvs,rank
from
(select 
concat(month,day,hour) hour ,ref_host,pvs,
row_number() over(partition by  concat(month,day,hour) order by pvs desc  ) rank
from 
(select 
month,day,hour,ref_host,count(1) pvs
from  dw_weblog_detail
group by  month,day,hour, ref_host) t) t1  where t1.rank<=3;
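For intuition, the row_number() partition-and-rank pattern above corresponds to this small Python sketch (illustrative data, not project output):

```python
from collections import defaultdict

def top_n_per_group(pv_counts, n=3):
    """pv_counts: dict mapping (hour, ref_host) -> pvs.

    Returns, per hour, the n hosts with the highest pvs: the same rows
    row_number() over(partition by hour order by pvs desc) <= n selects.
    """
    by_hour = defaultdict(list)
    for (hour, host), pvs in pv_counts.items():
        by_hour[hour].append((host, pvs))
    return {
        hour: sorted(hosts, key=lambda hp: hp[1], reverse=True)[:n]
        for hour, hosts in by_hour.items()
    }

counts = {
    ("2018110100", "a.com"): 5,
    ("2018110100", "b.com"): 9,
    ("2018110100", "c.com"): 7,
    ("2018110100", "d.com"): 1,
    ("2018110101", "a.com"): 2,
}
print(top_n_per_group(counts))
```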

2. Visited-Page Analysis

a. Per-page PV (likewise UV, VV, etc.)

Count the PV of each page

Table: dw_weblog_detail
Group by: request
Measure: count
select
request,count(1) request_count
from dw_weblog_detail
where valid='true'
group by request having request is not null order by request_count desc limit 20;

b. Hot pages

Top 10 hottest pages per day

Table: dw_weblog_detail
Group by: request
Measure: count
select
'20130928',request,count(1) request_count
from dw_weblog_detail where valid='true'
group by request order by request_count desc limit 10;

3. Visitor Analysis

a. Unique visitors

Unique visitors and the PVs they generate, by time dimension (e.g., by hour)

Table: dw_weblog_detail
Group by: hour
Measure: count
select
hour,remote_addr,count(1) pvs
from dw_weblog_detail
group by hour,remote_addr;

b. New visitors per day

Identify the visitors who appear for the first time each day.

For any new-vs-old (binary) question, keep a cumulative history table and a daily table and join the two, preferably with a left or right outer join: left-joining today's visitors against the history table, a NULL on the history side marks a new visitor.
Create the daily table and the history table:
--history table: cumulative deduplicated daily visitors
drop table dw_user_dsct_history;
create table dw_user_dsct_history(
day string,
ip string
) 
partitioned by(datestr string);

--daily new visitor table
drop table dw_user_new_d;
create table dw_user_new_d (
day string,
ip string
) 
partitioned by(datestr string);

--today's distinct visitors
select  
remote_addr
from  dw_weblog_detail
group by remote_addr
--join against history; NULL on the history side means a new visitor
select
count(t1.remote_addr)
from
(select  
remote_addr
from  dw_weblog_detail
where  datestr="20181101"
group by remote_addr) t1 left join  dw_user_dsct_history  t2 on  t1.remote_addr=t2.ip where t2.ip is null;

--insert the new visitors into the daily table
insert  into   table dw_user_new_d partition(datestr="20181101")
select 
t1.day,
t1.remote_addr
from
(select  
concat(month,day) day,remote_addr
from  dw_weblog_detail  where  datestr="20181101"
group by  concat(month,day), remote_addr) t1  left join   dw_user_dsct_history  t2 on  t1.remote_addr=t2.ip where t2.ip is null; --keep only rows with no match in history
--append the new visitors to the history table
insert into table  dw_user_dsct_history  partition(datestr="20181101")
select  day,ip  from  dw_user_new_d where  datestr="20181101";
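The left-join-with-NULL-filter trick above amounts to a set difference. A minimal Python sketch with illustrative IPs:

```python
def split_new_visitors(today_ips, history_ips):
    """Mimics: today LEFT JOIN history ON ip, keeping rows where the
    history side is NULL. Set difference expresses the same thing.

    Returns (new visitors, updated history).
    """
    today = set(today_ips)
    history = set(history_ips)
    new = today - history       # rows where t2.ip is null
    return new, history | new   # append new visitors to history

new, history = split_new_visitors(["1.1.1.1", "2.2.2.2"], ["2.2.2.2"])
print(new)  # {'1.1.1.1'}
```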

c. Geographic analysis

An IP address can generally be resolved to country, region (province/state), city, street, latitude/longitude, ISP, and so on. Because IP databases change over time (though only slightly within a short period), they need ongoing maintenance and updates, and no database is completely accurate or complete.

Well-known databases include the "Chunzhen" (纯真) IP database in China, and MaxMind and IP2Location abroad. Both paid and free databases exist; actively maintained data is usually paid, with somewhat better accuracy and coverage.

Query modes:

Local: download the IP database and query it locally. Efficient and performant; common for statistical analysis. Two variants:

In-memory query: load all data into memory for fast lookups, or use a binary data file that is itself an optimized index and query the file directly.
Database query: import the data into a database and query there; slower than in-memory.

Remote (web service or Ajax): call a third-party service. Naturally slower; typical for web pages. The essence of the lookup: given an IP, find the IP range that contains it, usually implemented with binary search.
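The lookup just described can be sketched in Python with the standard bisect module; the ranges and region names below are made up for illustration, not real IP data:

```python
import bisect

def ip_to_int(ip):
    """Pack a dotted-quad IPv4 address into one integer for comparison."""
    a, b, c, d = (int(x) for x in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

# Hypothetical in-memory IP database: (start, end, region), sorted by start.
RANGES = sorted([
    (ip_to_int("1.0.0.0"), ip_to_int("1.0.0.255"), "region-A"),
    (ip_to_int("1.0.1.0"), ip_to_int("1.0.3.255"), "region-B"),
    (ip_to_int("8.8.8.0"), ip_to_int("8.8.8.255"), "region-C"),
])
STARTS = [r[0] for r in RANGES]

def lookup(ip):
    """Binary-search for the range whose [start, end] contains ip."""
    n = ip_to_int(ip)
    i = bisect.bisect_right(STARTS, n) - 1  # last range starting <= ip
    if i >= 0 and n <= RANGES[i][1]:
        return RANGES[i][2]
    return None  # ip falls in a gap between ranges

print(lookup("8.8.8.8"))  # region-C
```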

4. Visit Analysis

a. Returning vs. one-time visitors

Table: ods_click_stream_visit
Measure: count
Group by: remote_addr
select
t1.day,t1.remote_addr,t1.count
from
(select
'20181101' as day,remote_addr,count(session) count
from ods_click_stream_visit 
group by remote_addr) t1
where t1.count>1;

b. Average visits per visitor

Requirement: the average number of visits (sessions) per visitor per day
Table: ods_click_stream_visit
Measure: count
Group by: day
select
count(session)/count(distinct remote_addr)
from ods_click_stream_visit
where datestr='20181101';

5. Key-Path Conversion Rate (funnel)

--pattern: whenever the current row must be compared with a previous row,
--self-join the table and derive the join condition from the pattern
First create the summary table.
0) Define a user journey:
        Step1:  /item

        Step2:  /category

        Step3:  /index

        Step4:  /order

1) Count the PV of each step along this journey, producing one table
create table dw_oute_numbs as 
select 'step1' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews where datestr='20181103' and request like '/item%'
union all
select 'step2' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews where datestr='20181103' and request like '/category%'
union all
select 'step3' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews where datestr='20181103' and request like '/index%'
union all
select 'step4' as step,count(distinct remote_addr)  as numbs from ods_click_pageviews where datestr='20181103' and request like '/order%';
2) Conversion rate of each step relative to step 1
select
t2.numbs/t1.numbs
from dw_oute_numbs t1 join dw_oute_numbs t2 where t1.step="step1";
3) Conversion rate of each step relative to the previous step
select
(t1.numbs/t2.numbs)*100
from dw_oute_numbs t1 join dw_oute_numbs t2 where cast(substring(t1.step,5,1) as int)-1 = cast(substring(t2.step,5,1) as int);
4) Combine the two into one result
select abs.step,abs.numbs,abs.rate as abs_ratio,rel.rate as leakage_rate
from
(select tmp.rnstep as step,tmp.rnnumbs as numbs,tmp.rnnumbs/tmp.rrnumbs as rate
from
(select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn
inner join 
dw_oute_numbs rr) tmp
where tmp.rrstep='step1') abs
left outer join
(select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as rate
from
(select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs  from dw_oute_numbs rn
inner join 
dw_oute_numbs rr) tmp
where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1
) rel
on abs.step=rel.step;
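The ratios the combined query produces can be sanity-checked with a small Python sketch; the step counts here are hypothetical, not project data:

```python
def funnel_rates(step_counts):
    """step_counts: visitor counts per step, in journey order
    (as in dw_oute_numbs).

    Returns one (abs_ratio, leakage_rate) pair per step:
    abs_ratio    = this step / step 1
    leakage_rate = this step / previous step (1.0 for step 1)
    """
    first = step_counts[0]
    rates = []
    for i, n in enumerate(step_counts):
        abs_ratio = n / first
        leakage = 1.0 if i == 0 else n / step_counts[i - 1]
        rates.append((abs_ratio, leakage))
    return rates

print(funnel_rates([1000, 800, 400, 100]))
```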

6. Module Development: Data Export

a. Direct export from Hive tables to RDBMS tables

More efficient, effectively updating the RDBMS table straight from the Hive table, but it allows no fine-grained control.

b. Export from Hive to HDFS, then to the RDBMS

Data is first exported from the Hive table to HDFS, then loaded from HDFS into the RDBMS. This adds a step, but it allows much finer control: while exporting from Hive to HDFS you can select, transform, and filter fields, so the data on HDFS is closer to (or identical with) what will actually land in the RDBMS table, which speeds up the final load.

c. Full export to MySQL

Hive --> HDFS

Export the dw_pvs_referer_everyhour table to HDFS

insert overwrite directory '/weblog/export/dw_pvs_referer_everyhour' 
row format delimited fields terminated by ',' STORED AS textfile 
select 
referer_url,hour,pv_referer_cnt 
from dw_pvs_referer_everyhour 
where datestr = "20181101";

d. Incremental export to MySQL

Scenario: sync a Hive table's incremental records to the target table.

Technique: sqoop export with --update-mode allowinsert. This mode inserts records present in Hive but absent from the target table, and also updates target records that differ from Hive.
Example: incremental export of the dw_webflow_basic_info metrics table.

Steps:

1) Manually create the target table in MySQL

create table dw_webflow_basic_info(
monthstr varchar(20),
daystr varchar(10),
pv bigint,
uv bigint,
ip bigint,
vv bigint)
2) Run a full export of the current 20181101 Hive partition first

bin/sqoop export \
--connect jdbc:mysql://node01:3306/weblog \
--username root --password 123456 \
--table dw_webflow_basic_info \
--fields-terminated-by '\001' \
--export-dir /user/hive/warehouse/itheima_weblog.db/dw_webflow_basic_info/datestr=20181101/
3) For the demo, manually add a new partition of data to Hive

insert into table dw_webflow_basic_info partition(datestr="20191006") values("201910","06",14250,1341,1341,96);
4) Run the incremental export with sqoop

bin/sqoop export \
--connect jdbc:mysql://node01:3306/weblog \
--username root \
--password 123456 \
--table dw_webflow_basic_info \
--fields-terminated-by '\001' \
--update-key monthstr,daystr \
--update-mode allowinsert \
--export-dir /user/hive/warehouse/itheima_weblog.db/dw_webflow_basic_info/datestr=20181103/

e. Scheduled incremental export

Scenario: automatically sync a Hive table's incremental records to the target table on a schedule.

Technique: sqoop export with --update-mode allowinsert, as above: it inserts records present in Hive but absent from the target table, and also updates target records that differ.
Example: incremental export of the dw_webflow_basic_info metrics table.

#!/bin/bash
export SQOOP_HOME=/export/servers/sqoop
if [ $# -eq 1 ]
then
    execute_date=`date --date="${1}" +%Y%m%d`
else
    execute_date=`date -d'-1 day' +%Y%m%d`
fi

echo "execute_date:"${execute_date}

table_name="dw_webflow_basic_info"
hdfs_dir=/user/hive/warehouse/itheima.db/dw_webflow_basic_info/datestr=${execute_date}
mysql_db_pwd=hadoop
mysql_db_user=root

echo 'sqoop start'
$SQOOP_HOME/bin/sqoop export \
--connect "jdbc:mysql://node-1:3306/weblog" \
--username $mysql_db_user \
--password $mysql_db_pwd \
--table $table_name \
--fields-terminated-by '\001' \
--update-key monthstr,daystr \
--update-mode allowinsert \
--export-dir $hdfs_dir

echo 'sqoop end'

7. Module Development: Workflow Scheduling

The preprocessing module splits into three steps by processing flow and business need: data cleaning, the pageviews clickstream model, and the visit clickstream model. The steps have clear dependencies between them, so scheduling them periodically with Azkaban is very convenient.

Package the earlier preprocessing MapReduce jobs into jars (three in total)

Write Azkaban job files and declare the dependencies

a. Preprocessing schedule

#weblog_preprocess.job
type=command
command=/export/servers/hadoop-2.6.0-cdh5.14.0/bin/hadoop  jar preprocess.jar  /weblog/log    /weblog/out

# weblog_click_pageviews.job
type=command
dependencies=weblog_preprocess
command=/export/servers/hadoop-2.6.0-cdh5.14.0/bin/hadoop  jar weblog_click_pageviews.jar  /weblog/out  /weblog/pageviews

# weblog_click_visit.job
type=command
dependencies=weblog_click_pageviews
command=/export/servers/hadoop-2.6.0-cdh5.14.0/bin/hadoop  jar weblog_click_visit.jar  /weblog/pageviews  /weblog/visit

b. Scheduled database loading

#!/bin/bash

export HIVE_HOME=/export/servers/hive

if [ $# -eq 1 ]
then
    datestr=`date --date="${1}" +%Y%m%d`
else
    datestr=`date -d'-1 day' +%Y%m%d`
fi

HQL="load data inpath '/preprocess/' into table itheima.ods_weblog_origin partition(datestr='${datestr}')"

echo "starting load......"
$HIVE_HOME/bin/hive -e "$HQL"
echo "load finished......"

# load-weblog.job
type=command
command=sh load-weblog.sh

c. Scheduled statistics computation

#!/bin/bash
HQL="
drop table dw_user_dstc_ip_h;
create table dw_user_dstc_ip_h(
remote_addr string,
pvs      bigint,
hour     string);

insert into table dw_user_dstc_ip_h 
select remote_addr,count(1) as pvs,concat(month,day,hour) as hour 
from ods_weblog_detail
where datestr='20181101'
group by concat(month,day,hour),remote_addr;
"
echo $HQL
/export/servers/hive/bin/hive -e "$HQL"
