生活随笔
收集整理的這篇文章主要介紹了
如何写一个包含多个事件四则运算的留存SQL ——impala hive
小編覺得挺不錯(cuò)的,現(xiàn)在分享給大家,幫大家做個(gè)參考.
在實(shí)現(xiàn)一個(gè)留存業(yè)務(wù)需求時(shí),碰到了一個(gè)難題,我需要提供展示一個(gè)按照如下圖格式的數(shù)據(jù),
day 1 ~ day n的第一行是留存用戶數(shù)量,第二行是一個(gè)由多個(gè)事件組合執(zhí)行四則算術(shù)運(yùn)算得到的復(fù)合數(shù)值,這里碰到的難點(diǎn)主要是第二行的計(jì)算,如果只想查看第二行的解決方法可以點(diǎn)擊這里
由于數(shù)據(jù)傳輸速率受限,我不能使用先查詢出所有數(shù)據(jù)然后在代碼里處理數(shù)據(jù)的方法,因此我需要在sql查詢中盡量完成所有聚合計(jì)算以減少查詢返回的行數(shù)
留存模型采用的是經(jīng)典模型(Classic retention)留存用戶的數(shù)量都是在各天day n獨(dú)立計(jì)算的
這里day 1~day n第一行計(jì)算新用戶留存數(shù)量,第二行的小數(shù)計(jì)算留存的新用戶中某個(gè)混合事件的表現(xiàn),混合事件可以是由某一事件計(jì)算得到的值或者由多個(gè)事件進(jìn)行四則運(yùn)算得到的組合事件的值,第二行的值計(jì)算是這篇文章要講的重點(diǎn)
例如求某個(gè)活動(dòng)事件有兩個(gè)入口entrance_a和entrance_b,結(jié)束通關(guān)標(biāo)識事件為event_over
假設(shè)事件名稱event_name為"activety_1",用戶表為t_user,事件表為t_event
t_user表的數(shù)據(jù)是這樣的
t_event表的數(shù)據(jù)是這樣的
1
計(jì)算第一行, 也就是計(jì)算經(jīng)典留存模型day 1~day n的留存用戶量,可以用t_user和t_event的join和case when語句實(shí)現(xiàn):
with temp_user
as (
select distinct `uid
`, to_date
(`firstday
`) as `firstday
`
from `t_user
`
where firstday
>= "2022-02-01" and firstday
< "2022-02-04"
),
temp_event
as (
select distinct`uid
`,to_date
(`event_date
`) as `event_date
`
from `t_event
`
where event_name
= "activity_1"
and event_date
>= "2022-02-01"
and event_date
< "2022-02-07"
)
select
`firstday
`,
Count(distinct a
.uid
) as `new_user
`,
Count(distinct case when event_date
= date_add
(firstday
, 1) then a
.uid
end) as `day 1`,
Count(distinct case when event_date
= date_add
(firstday
, 2) then a
.uid
end) as `day 2`,
Count(distinct case when event_date
= date_add
(firstday
, 3) then a
.uid
end) as `day 3`
from temp_user a
left join temp_event b
on a
.uid
= b
.uid
group by firstday
用以上sql語句查詢得到的結(jié)果:
格式跟開頭的圖中的表格保持了一致,數(shù)值稍微驗(yàn)證一下可知沒有問題,求第一行的留存用戶數(shù)相對較簡單
2
計(jì)算第二行的值,我想計(jì)算事件activity_1在day 1 ~ day n的通關(guān)表現(xiàn),具體來說就是要計(jì)算出day 1~day n各留存用戶的(entrance_a + entrance_b)/event_over值
現(xiàn)在假設(shè)事件表t_event_1的內(nèi)容是這樣的
用戶表t_user同之前的不變
要計(jì)算出day 1~day n各留存用戶的(entrance_a + entrance_b)/event_over的表現(xiàn),結(jié)果展示格式類似下面這張圖
除了要按day 1 ~ day n展示數(shù)據(jù)外還涉及到屬性entrance_a, entrance_b等的聚合計(jì)算,使用分析函數(shù)(Analytics Function)并不能降低得到的行數(shù),這里我采用了先把要統(tǒng)計(jì)的數(shù)據(jù)(entrance_a, entrance_b, event_over)先分別計(jì)算出來按firstday和event_date分組成行,然后再利用case when和求和語句把算出來的結(jié)果聚合到對應(yīng)的day n,寫出來的sql如下
with temp_user
as (
select distinct `uid
`, to_date
(`firstday
`) as `firstday
`
from `t_user
`
where firstday
>= "2022-02-01" and firstday
< "2022-02-04"
),
temp_event_1
as (
select a
.uid
,to_date
(a
.firstday
) `firstday
`,to_date
(b
.event_date
) `event_date
`,count(case when b
.entrance
= "a" then 1 end) as entrance_a
,count(case when b
.entrance
= "b" then 1 end) as entrance_b
,count(case when b
.event_status
= "event_over" then 1 end) as event_over
from t_user a
left join t_event_1 b
on a
.uid
= b
.uid
and to_date
(b
.event_date
) between date_add
(a
.firstday
, 1) and date_add
(a
.firstday
, 3)and (b
.entrance
in ("a", "b") OR b
.event_status
= "event_over") and b
.event_date
>= "2022-02-01" and b
.event_date
< "2022-02-07"
group by firstday
, event_date
, a
.uid
)
select evt
.firstday
,count(distinct evt
.uid
) `new
user`,(sum(case when date_add
(evt
.firstday
, 1) = evt
.event_date
then evt
.entrance_a
else 0 end ) + sum(case when date_add
(evt
.firstday
, 1) = evt
.event_date
then evt
.entrance_b
else 0 end )) / sum(case when date_add
(evt
.firstday
, 1) = evt
.event_date
then evt
.event_over
else 0 end ) `day 1`,(sum(case when date_add
(evt
.firstday
, 2) = evt
.event_date
then evt
.entrance_a
else 0 end ) + sum(case when date_add
(evt
.firstday
, 2) = evt
.event_date
then evt
.entrance_b
else 0 end )) / sum(case when date_add
(evt
.firstday
, 2) = evt
.event_date
then evt
.event_over
else 0 end ) `day 2`,(sum(case when date_add
(evt
.firstday
, 3) = evt
.event_date
then evt
.entrance_a
else 0 end ) + sum(case when date_add
(evt
.firstday
, 3) = evt
.event_date
then evt
.entrance_b
else 0 end )) / sum(case when date_add
(evt
.firstday
, 3) = evt
.event_date
then evt
.event_over
else 0 end ) `day 3`
from temp_event_1 evt
group by firstday
查詢的結(jié)果:
這里出現(xiàn)的null如果你想在sql中把它們變成0也可以使用zeroifnull()函數(shù)將每一列計(jì)算結(jié)果包住
驗(yàn)證正確性
根據(jù)用戶表t_user和事件表t_event_1對比查詢結(jié)果的正確性:
用戶表
先看firstday在02-01的用戶,只有0和1兩個(gè),對應(yīng)的day 1就是他們在02-02的(entrance_a+entrance_b)/event_over表現(xiàn)值:
再檢查02-02的用戶2和3在day 1, day 2, day3的表現(xiàn)數(shù)值,也就是他們分別再02-03, 02-04和02-05的表現(xiàn)數(shù)值:
根據(jù)驗(yàn)證結(jié)果查詢是沒有問題的
day 1 ~ day n第二行的解決sql相對來說比較復(fù)雜,但是我目前沒有想到更好的sql,如果有更好的方法查出結(jié)果,歡迎評論告訴我,感謝~
以上的模擬數(shù)據(jù),都可以在這個(gè)鏈接(https://demo.gethue.com/hue/home)找到,賬號密碼登錄的時(shí)候都是demo,找到Hive下的數(shù)據(jù)庫選擇default子庫
Editor選擇Hive就可以查詢了
你也可以選擇自己創(chuàng)建新表和插入數(shù)據(jù)模擬,以下是我的建表和插入數(shù)據(jù)的sql:
CREATE TABLE `t_user
`(`uid
` int, `firstday
` timestamp)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED
AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://namenode:8020/user/hive/warehouse/t_user';INSERT INTO t_user
(uid
, firstday
)
VALUES(0, '2022-02-01 00:01:00'),(1, '2022-02-01 00:04:30'),(2, '2022-02-02 10:00:00'),(3, '2022-02-02 14:30:00')
;
CREATE TABLE `t_event
`(`uid
` int,`event_name
` STRING
,`event_date
` TIMESTAMP)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED
AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://namenode:8020/user/hive/warehouse/t_event';INSERT INTO t_event
(uid
, event_name
, event_date
)
VALUES(1, "activity_1", '2022-02-01 00:01:00'),(1, "activity_1", '2022-02-01 00:04:30'),(1, "activity_1", '2022-02-02 10:00:00'),(1, "activity_1", '2022-02-02 14:30:00'),(2, "activity_1", '2022-02-02 00:04:30'),(2, "activity_1", '2022-02-03 10:00:00'),(2, "activity_1", '2022-02-04 14:30:00'),(3, "activity_1", '2022-02-05 09:00:00'),(3, "activity_1", '2022-02-05 12:30:00'),(0, "activity_1", '2022-02-06 14:30:00')
;
CREATE TABLE `t_event_1
`(`uid
` int,`entrance
` string
,`event_status
` STRING
,`event_name
` STRING
,`event_date
` TIMESTAMP)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED
AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://namenode:8020/user/hive/warehouse/t_event_1';INSERT INTO `t_event_1
` (uid
, entrance
, event_status
, event_name
, event_date
)
VALUES(1, "a", "event_over", "activity_1", '2022-02-01 00:01:00'),(1, "b", "event_over", "activity_1", '2022-02-01 00:04:30'),(1, "a", "event_failed", "activity_1", '2022-02-02 10:00:00'),(1, "b", "event_failed", "activity_1", '2022-02-02 14:30:00'),(1, "a", "event_over", "activity_1", '2022-02-06 14:30:00'),(2, "a", "event_over", "activity_1", '2022-02-02 14:30:00'), (2, "a", "event_failed", "activity_1", '2022-02-02 00:04:30'),(2, "b", "event_over", "activity_1", '2022-02-03 10:00:00'),(2, "a", "event_over", "activity_1", '2022-02-04 14:30:00'),(2, "b", "event_failed", "activity_1", '2022-02-05 16:30:00'),(3, "a", "event_over", "activity_1", '2022-02-05 09:00:00'),(3, "b", "event_over", "activity_1", '2022-02-02 00:04:30'),(3, "b", "event_over", "activity_1", '2022-02-05 12:30:00'),(0, "a", "event_over", "activity_1", '2022-02-06 14:30:00'),(0, "a", "event_over", "activity_1", '2022-02-02 14:30:00'), (0, "b", "event_over", "activity_1", '2022-02-02 00:04:30')
;
注意你登錄的session只會保持一段時(shí)間,大概十幾分鐘或更多?,一般你可以通過re-create session來重新打開并繼續(xù)執(zhí)行sql查詢,如果re-create沒有效果就只能重新登錄了
總結(jié)
以上是生活随笔為你收集整理的如何写一个包含多个事件四则运算的留存SQL ——impala hive的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
如果覺得生活随笔網(wǎng)站內(nèi)容還不錯(cuò),歡迎將生活随笔推薦給好友。