Window Functions in Hive

Published: 2023/12/14

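What Is a Window Function

Window functions are a class of functions used for analysis. To understand them, start with aggregate functions. An aggregate function such as sum or count collapses the values of many rows in a column into a single row. A window function also computes over a set of rows, but it returns a result for every input row instead of collapsing them, so each row keeps its own output value. The general syntax can be summarized as: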

Function() OVER (PARTITION BY Column1, Column2 ORDER BY Column3)
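
To make the contrast between an aggregate and a window function concrete, here is a minimal sketch in plain Python rather than HiveQL, so it can be run without a Hive cluster. The helper names group_sum and window_sum are hypothetical and only mimic the semantics of GROUP BY and of SUM(...) OVER (PARTITION BY ...); the four sample rows are taken from the start of the article's data.

```python
from collections import defaultdict

# (user_name, result) pairs; a small illustrative sample.
rows = [
    ("vpspringcloud", 1),
    ("vpspringboot", 0),
    ("vpspringcloud", 0),
    ("vpspringdata", 1),
]


def group_sum(rows):
    """Mimics SELECT user_name, SUM(result) ... GROUP BY user_name: one row per group."""
    totals = defaultdict(int)
    for user, result in rows:
        totals[user] += result
    return dict(totals)


def window_sum(rows):
    """Mimics SUM(result) OVER (PARTITION BY user_name): every input row is kept."""
    totals = group_sum(rows)
    return [(user, result, totals[user]) for user, result in rows]


aggregated = group_sum(rows)   # 3 entries, one per user
windowed = window_sum(rows)    # 4 entries, one per input row

print(aggregated)
print(windowed)
```

The aggregate returns one value per user, while the windowed version returns the same total attached to each of that user's rows.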

Window functions fall into three categories:

  • Aggregate window functions
  • Analytic window functions
  • Value window functions

Next, we walk through several practical examples.

Preparing the Data

First, prepare the following data:

CREATE TABLE user_match_temp (
  user_name string,
  opponent string,
  result int,
  create_time timestamp
);

INSERT INTO TABLE user_match_temp VALUES
  ('vpspringcloud','vpspringboot',1,'2019-07-18 23:19:00'),
  ('vpspringboot','vpspringcloud',0,'2019-07-18 23:19:00'),
  ('vpspringcloud','vpspringdata',0,'2019-07-18 23:20:00'),
  ('vpspringdata','vpspringcloud',1,'2019-07-18 23:20:00'),
  ('vpspringcloud','vpspringroo',1,'2019-07-19 22:19:00'),
  ('vpspringroo','vpspringcloud',0,'2019-07-19 22:19:00'),
  ('vpspringdata','vpspringboot',0,'2019-07-19 23:19:00'),
  ('vpspringboot','vpspringdata',1,'2019-07-19 23:19:00');

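The data has four columns: user_name, opponent, result and create_time. We use it to demonstrate several use cases for window functions. Note that the query results below list 16 rows, running through 2019-07-23, so the table evidently also held a second batch of eight matches that this INSERT statement does not show.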

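Aggregate Window Functions

Aggregate window functions are the familiar aggregates SUM(), MIN(), MAX(), AVG() and COUNT(). Combining them with a window makes the computation more flexible, for example in the following scenarios:

  • Cumulative score to date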

hive> SELECT *, SUM(result) OVER (PARTITION BY user_name ORDER BY create_time) AS result_sums
hive> FROM user_match_temp;
+----------------+----------------+---------+------------------------+--------------+
| user_name      | opponent       | result  | create_time            | result_sums  |
+----------------+----------------+---------+------------------------+--------------+
| vpspringdata   | vpspringcloud  | 1       | 2019-07-18 23:20:00.0  | 1            |
| vpspringdata   | vpspringboot   | 0       | 2019-07-19 23:19:00.0  | 1            |
| vpspringdata   | vpspringcloud  | 1       | 2019-07-21 23:20:00.0  | 2            |
| vpspringdata   | vpspringboot   | 0       | 2019-07-23 23:19:00.0  | 2            |
| vpspringcloud  | vpspringboot   | 1       | 2019-07-18 23:19:00.0  | 1            |
| vpspringcloud  | vpspringdata   | 0       | 2019-07-18 23:20:00.0  | 1            |
| vpspringcloud  | vpspringroo    | 1       | 2019-07-19 22:19:00.0  | 2            |
| vpspringcloud  | vpspringboot   | 1       | 2019-07-20 23:19:00.0  | 3            |
| vpspringcloud  | vpspringdata   | 0       | 2019-07-21 23:20:00.0  | 3            |
| vpspringcloud  | vpspringroo    | 1       | 2019-07-22 22:19:00.0  | 4            |
| vpspringroo    | vpspringcloud  | 0       | 2019-07-19 22:19:00.0  | 0            |
| vpspringroo    | vpspringcloud  | 0       | 2019-07-22 22:19:00.0  | 0            |
| vpspringboot   | vpspringcloud  | 0       | 2019-07-18 23:19:00.0  | 0            |
| vpspringboot   | vpspringdata   | 1       | 2019-07-19 23:19:00.0  | 1            |
| vpspringboot   | vpspringcloud  | 0       | 2019-07-20 23:19:00.0  | 1            |
| vpspringboot   | vpspringdata   | 1       | 2019-07-23 23:19:00.0  | 2            |
+----------------+----------------+---------+------------------------+--------------+
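
The running total above can be reproduced outside Hive with a short Python sketch. It is only an illustration of the semantics, not how Hive evaluates the query: the function name running_sums is hypothetical, the rows are a subset of the result table, and the sketch assumes no two rows in a partition share the same create_time (with ties, Hive's default frame would give tied rows the same total).

```python
from itertools import accumulate, groupby
from operator import itemgetter

# (user_name, result, create_time) rows copied from the result table above.
matches = [
    ("vpspringroo", 0, "2019-07-22 22:19:00"),
    ("vpspringdata", 1, "2019-07-18 23:20:00"),
    ("vpspringdata", 0, "2019-07-19 23:19:00"),
    ("vpspringroo", 0, "2019-07-19 22:19:00"),
    ("vpspringdata", 1, "2019-07-21 23:20:00"),
    ("vpspringdata", 0, "2019-07-23 23:19:00"),
]


def running_sums(rows):
    """Mimics SUM(result) OVER (PARTITION BY user_name ORDER BY create_time)."""
    # Sort by partition key, then by the ordering key (ISO timestamps sort as text).
    ordered = sorted(rows, key=itemgetter(0, 2))
    out = []
    for _, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        totals = accumulate(row[1] for row in group)  # restarts for each partition
        out.extend(row + (total,) for row, total in zip(group, totals))
    return out


result_sums = [row[3] for row in running_sums(matches)]
print(result_sums)  # [1, 1, 2, 2, 0, 0]
```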
  • Average wins over the current match and the previous 3
hive> SELECT *, avg(result) over (partition by user_name order by create_time rows between 3 preceding and current row) as recently_wins
hive> FROM user_match_temp;
+----------------+----------------+---------+------------------------+---------------------+
| user_name      | opponent       | result  | create_time            | recently_wins       |
+----------------+----------------+---------+------------------------+---------------------+
| vpspringdata   | vpspringcloud  | 1       | 2019-07-18 23:20:00.0  | 1.0                 |
| vpspringdata   | vpspringboot   | 0       | 2019-07-19 23:19:00.0  | 0.5                 |
| vpspringdata   | vpspringcloud  | 1       | 2019-07-21 23:20:00.0  | 0.6666666666666666  |
| vpspringdata   | vpspringboot   | 0       | 2019-07-23 23:19:00.0  | 0.5                 |
| vpspringcloud  | vpspringboot   | 1       | 2019-07-18 23:19:00.0  | 1.0                 |
| vpspringcloud  | vpspringdata   | 0       | 2019-07-18 23:20:00.0  | 0.5                 |
| vpspringcloud  | vpspringroo    | 1       | 2019-07-19 22:19:00.0  | 0.6666666666666666  |
| vpspringcloud  | vpspringboot   | 1       | 2019-07-20 23:19:00.0  | 0.75                |
| vpspringcloud  | vpspringdata   | 0       | 2019-07-21 23:20:00.0  | 0.5                 |
| vpspringcloud  | vpspringroo    | 1       | 2019-07-22 22:19:00.0  | 0.75                |
| vpspringroo    | vpspringcloud  | 0       | 2019-07-19 22:19:00.0  | 0.0                 |
| vpspringroo    | vpspringcloud  | 0       | 2019-07-22 22:19:00.0  | 0.0                 |
| vpspringboot   | vpspringcloud  | 0       | 2019-07-18 23:19:00.0  | 0.0                 |
| vpspringboot   | vpspringdata   | 1       | 2019-07-19 23:19:00.0  | 0.5                 |
| vpspringboot   | vpspringcloud  | 0       | 2019-07-20 23:19:00.0  | 0.3333333333333333  |
| vpspringboot   | vpspringdata   | 1       | 2019-07-23 23:19:00.0  | 0.5                 |
+----------------+----------------+---------+------------------------+---------------------+

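ROWS BETWEEN defines the extent of the window. Here the window runs from the 3 preceding rows through the current row, so each average covers at most four matches: the current one and up to three before it.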
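
As a rough illustration of how a ROWS BETWEEN 3 PRECEDING AND CURRENT ROW frame slides over a partition, the sketch below recomputes the recently_wins column for vpspringcloud in plain Python. The function name rolling_avg is hypothetical, and the input list is that user's result column in create_time order, copied from the table above.

```python
def rolling_avg(values, preceding):
    """Average over a frame of `preceding` rows before the current row plus the row itself."""
    out = []
    for i in range(len(values)):
        # The frame is simply shorter near the start of the partition.
        frame = values[max(0, i - preceding): i + 1]
        out.append(sum(frame) / len(frame))
    return out


# result values for vpspringcloud, ordered by create_time.
cloud_results = [1, 0, 1, 1, 0, 1]
recently_wins = rolling_avg(cloud_results, 3)
print(recently_wins)  # [1.0, 0.5, 0.6666666666666666, 0.75, 0.5, 0.75]
```

The first three averages use fewer than four rows because the frame is cut off at the partition boundary, which is why the column starts at 1.0 rather than at a four-row average.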

  • Cumulative number of distinct opponents faced. Note that count(distinct xxx) is not allowed inside a window function here (at least in older Hive releases), but size(collect_set(...) over (partition by ... order by ...)) achieves the same result.
hive> SELECT *, size(collect_set(opponent) over (partition by user_name order by create_time)) as opponent_counts
hive> FROM user_match_temp;
+----------------+----------------+---------+------------------------+------------------+
| user_name      | opponent       | result  | create_time            | opponent_counts  |
+----------------+----------------+---------+------------------------+------------------+
| vpspringdata   | vpspringcloud  | 1       | 2019-07-18 23:20:00.0  | 1                |
| vpspringdata   | vpspringboot   | 0       | 2019-07-19 23:19:00.0  | 2                |
| vpspringdata   | vpspringcloud  | 1       | 2019-07-21 23:20:00.0  | 2                |
| vpspringdata   | vpspringboot   | 0       | 2019-07-23 23:19:00.0  | 2                |
| vpspringcloud  | vpspringboot   | 1       | 2019-07-18 23:19:00.0  | 1                |
| vpspringcloud  | vpspringdata   | 0       | 2019-07-18 23:20:00.0  | 2                |
| vpspringcloud  | vpspringroo    | 1       | 2019-07-19 22:19:00.0  | 3                |
| vpspringcloud  | vpspringboot   | 1       | 2019-07-20 23:19:00.0  | 3                |
| vpspringcloud  | vpspringdata   | 0       | 2019-07-21 23:20:00.0  | 3                |
| vpspringcloud  | vpspringroo    | 1       | 2019-07-22 22:19:00.0  | 3                |
| vpspringroo    | vpspringcloud  | 0       | 2019-07-19 22:19:00.0  | 1                |
| vpspringroo    | vpspringcloud  | 0       | 2019-07-22 22:19:00.0  | 1                |
| vpspringboot   | vpspringcloud  | 0       | 2019-07-18 23:19:00.0  | 1                |
| vpspringboot   | vpspringdata   | 1       | 2019-07-19 23:19:00.0  | 2                |
| vpspringboot   | vpspringcloud  | 0       | 2019-07-20 23:19:00.0  | 2                |
| vpspringboot   | vpspringdata   | 1       | 2019-07-23 23:19:00.0  | 2                |
+----------------+----------------+---------+------------------------+------------------+

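collect_set() is also an aggregate function: it gathers the values from multiple rows into a single set, dropping duplicates. Applying size() to that set counts its elements, which gives the number of distinct opponents so far.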
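
The size(collect_set(...) OVER ...) idea amounts to keeping a growing set and reporting its size at each row. A minimal Python sketch, with the hypothetical helper name running_distinct_count and the opponent column of vpspringcloud (in create_time order) as input:

```python
def running_distinct_count(values):
    """Mimics size(collect_set(x) OVER (... ORDER BY ...)) for one partition."""
    seen = set()
    counts = []
    for value in values:
        seen.add(value)           # duplicates do not grow the set
        counts.append(len(seen))  # size of the set as of the current row
    return counts


cloud_opponents = [
    "vpspringboot", "vpspringdata", "vpspringroo",
    "vpspringboot", "vpspringdata", "vpspringroo",
]
opponent_counts = running_distinct_count(cloud_opponents)
print(opponent_counts)  # [1, 2, 3, 3, 3, 3]
```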

Analytic Window Functions

The analytic functions are the common ranking functions RANK(), ROW_NUMBER() and DENSE_RANK(). They look similar but behave differently when rows tie.

hive> SELECT *,
hive>   rank() over (order by create_time) as user_rank,
hive>   row_number() over (order by create_time) as user_row_number,
hive>   dense_rank() over (order by create_time) as user_dense_rank
hive> FROM user_match_temp;
+----------------+----------------+---------+------------------------+------------+------------------+------------------+
| user_name      | opponent       | result  | create_time            | user_rank  | user_row_number  | user_dense_rank  |
+----------------+----------------+---------+------------------------+------------+------------------+------------------+
| vpspringcloud  | vpspringboot   | 1       | 2019-07-18 23:19:00.0  | 1          | 1                | 1                |
| vpspringboot   | vpspringcloud  | 0       | 2019-07-18 23:19:00.0  | 1          | 2                | 1                |
| vpspringcloud  | vpspringdata   | 0       | 2019-07-18 23:20:00.0  | 3          | 3                | 2                |
| vpspringdata   | vpspringcloud  | 1       | 2019-07-18 23:20:00.0  | 3          | 4                | 2                |
| vpspringroo    | vpspringcloud  | 0       | 2019-07-19 22:19:00.0  | 5          | 5                | 3                |
| vpspringcloud  | vpspringroo    | 1       | 2019-07-19 22:19:00.0  | 5          | 6                | 3                |
| vpspringdata   | vpspringboot   | 0       | 2019-07-19 23:19:00.0  | 7          | 7                | 4                |
| vpspringboot   | vpspringdata   | 1       | 2019-07-19 23:19:00.0  | 7          | 8                | 4                |
| vpspringcloud  | vpspringboot   | 1       | 2019-07-20 23:19:00.0  | 9          | 9                | 5                |
| vpspringboot   | vpspringcloud  | 0       | 2019-07-20 23:19:00.0  | 9          | 10               | 5                |
| vpspringcloud  | vpspringdata   | 0       | 2019-07-21 23:20:00.0  | 11         | 11               | 6                |
| vpspringdata   | vpspringcloud  | 1       | 2019-07-21 23:20:00.0  | 11         | 12               | 6                |
| vpspringcloud  | vpspringroo    | 1       | 2019-07-22 22:19:00.0  | 13         | 13               | 7                |
| vpspringroo    | vpspringcloud  | 0       | 2019-07-22 22:19:00.0  | 13         | 14               | 7                |
| vpspringdata   | vpspringboot   | 0       | 2019-07-23 23:19:00.0  | 15         | 15               | 8                |
| vpspringboot   | vpspringdata   | 1       | 2019-07-23 23:19:00.0  | 15         | 16               | 8                |
+----------------+----------------+---------+------------------------+------------+------------------+------------------+

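As shown above:

  • row_number: generates consecutive sequence numbers; tied rows still receive different numbers.
  • rank: tied rows receive the same number, and the numbers that follow are skipped (1, 1, 3, ...).
  • dense_rank: tied rows receive the same number, and no numbers are skipped (1, 1, 2, ...).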
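
The difference between the three functions comes down to what happens when the ordering key repeats. The following Python sketch makes that explicit for an already sorted list of keys; the function name ranking is hypothetical, and the input is the first six create_time values from the table above.

```python
def ranking(sorted_keys):
    """Return (rank, row_number, dense_rank) for each key of an already sorted list."""
    out = []
    rank = 0
    dense = 0
    previous = object()  # sentinel that never equals a real key
    for row_number, key in enumerate(sorted_keys, start=1):
        if key != previous:
            rank = row_number  # rank jumps to the current position, leaving gaps
            dense += 1         # dense_rank advances by exactly one
            previous = key
        out.append((rank, row_number, dense))
    return out


create_times = [
    "2019-07-18 23:19:00", "2019-07-18 23:19:00",
    "2019-07-18 23:20:00", "2019-07-18 23:20:00",
    "2019-07-19 22:19:00", "2019-07-19 22:19:00",
]
ranks = ranking(create_times)
print(ranks)
# [(1, 1, 1), (1, 2, 1), (3, 3, 2), (3, 4, 2), (5, 5, 3), (5, 6, 3)]
```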

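Besides these three ranking functions, there are three more:

  • CUME_DIST(): the fraction of all rows whose value is less than or equal to the current row's value.
  • PERCENT_RANK(): (rank - 1) / (total rows - 1), that is, roughly the fraction of rows that sort before the current value.
  • NTILE(n): splits the ordered rows into n buckets of as equal a row count as possible and returns the number of the bucket the row falls into.

Their effect is shown below: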

hive2> SELECT *,
hive2>   CUME_DIST() over (order by create_time) as user_CUME_DIST,
hive2>   PERCENT_RANK() over (order by create_time) as user_PERCENT_RANK,
hive2>   NTILE(3) over (order by create_time) as user_NTILE
hive2> FROM user_match_temp;
+----------------+----------------+---------+------------------------+-----------------+----------------------+-------------+
| user_name      | opponent       | result  | create_time            | user_CUME_DIST  | user_PERCENT_RANK    | user_NTILE  |
+----------------+----------------+---------+------------------------+-----------------+----------------------+-------------+
| vpspringcloud  | vpspringboot   | 1       | 2019-07-18 23:19:00.0  | 0.125           | 0.0                  | 1           |
| vpspringboot   | vpspringcloud  | 0       | 2019-07-18 23:19:00.0  | 0.125           | 0.0                  | 1           |
| vpspringcloud  | vpspringdata   | 0       | 2019-07-18 23:20:00.0  | 0.25            | 0.13333333333333333  | 1           |
| vpspringdata   | vpspringcloud  | 1       | 2019-07-18 23:20:00.0  | 0.25            | 0.13333333333333333  | 1           |
| vpspringcloud  | vpspringroo    | 1       | 2019-07-19 22:19:00.0  | 0.375           | 0.26666666666666666  | 1           |
| vpspringroo    | vpspringcloud  | 0       | 2019-07-19 22:19:00.0  | 0.375           | 0.26666666666666666  | 1           |
| vpspringdata   | vpspringboot   | 0       | 2019-07-19 23:19:00.0  | 0.5             | 0.4                  | 2           |
| vpspringboot   | vpspringdata   | 1       | 2019-07-19 23:19:00.0  | 0.5             | 0.4                  | 2           |
| vpspringcloud  | vpspringboot   | 1       | 2019-07-20 23:19:00.0  | 0.625           | 0.5333333333333333   | 2           |
| vpspringboot   | vpspringcloud  | 0       | 2019-07-20 23:19:00.0  | 0.625           | 0.5333333333333333   | 2           |
| vpspringcloud  | vpspringdata   | 0       | 2019-07-21 23:20:00.0  | 0.75            | 0.6666666666666666   | 2           |
| vpspringdata   | vpspringcloud  | 1       | 2019-07-21 23:20:00.0  | 0.75            | 0.6666666666666666   | 3           |
| vpspringcloud  | vpspringroo    | 1       | 2019-07-22 22:19:00.0  | 0.875           | 0.8                  | 3           |
| vpspringroo    | vpspringcloud  | 0       | 2019-07-22 22:19:00.0  | 0.875           | 0.8                  | 3           |
| vpspringdata   | vpspringboot   | 0       | 2019-07-23 23:19:00.0  | 1.0             | 0.9333333333333333   | 3           |
| vpspringboot   | vpspringdata   | 1       | 2019-07-23 23:19:00.0  | 1.0             | 0.9333333333333333   | 3           |
+----------------+----------------+---------+------------------------+-----------------+----------------------+-------------+
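
The three columns above follow simple formulas, which the Python sketch below spells out. The function names are hypothetical, and the keys list is an assumption made for brevity: the integers 1 to 8, each repeated twice, stand in for the eight distinct create_time values that each occur on two rows.

```python
from bisect import bisect_left, bisect_right


def cume_dist(sorted_keys):
    """Fraction of rows with a key less than or equal to the current key."""
    n = len(sorted_keys)
    return [bisect_right(sorted_keys, key) / n for key in sorted_keys]


def percent_rank(sorted_keys):
    """(rank - 1) / (rows - 1), where rank - 1 is the count of strictly smaller keys."""
    n = len(sorted_keys)
    if n == 1:
        return [0.0]
    return [bisect_left(sorted_keys, key) / (n - 1) for key in sorted_keys]


def ntile(n_rows, buckets):
    """Bucket number per row; earlier buckets take the extra rows."""
    base, extra = divmod(n_rows, buckets)
    out = []
    for bucket in range(1, buckets + 1):
        size = base + (1 if bucket <= extra else 0)
        out.extend([bucket] * size)
    return out


# Stand-ins for the 8 distinct timestamps, 2 rows each (16 rows in total).
keys = [k for k in range(1, 9) for _ in range(2)]

print(cume_dist(keys)[:4])     # [0.125, 0.125, 0.25, 0.25]
print(percent_rank(keys)[:4])  # [0.0, 0.0, 0.13333333333333333, 0.13333333333333333]
print(ntile(len(keys), 3))     # six 1s, then five 2s, then five 3s
```

With 16 rows and 3 buckets, the buckets hold 6, 5 and 5 rows, which matches the point in the table where user_NTILE switches from 1 to 2 and from 2 to 3.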

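Value Window Functions

The value functions are LAG(), LEAD(), FIRST_VALUE() and LAST_VALUE(), and their names describe what they do. LAG means lagging: it returns the value from a row that comes a given number of rows before the current one. LEAD is the opposite and returns the value from a row after the current one. FIRST_VALUE returns the first value in the window frame, and LAST_VALUE returns the last value in the frame, which by default ends at the current row.

LAG() and LEAD() take up to 3 arguments: the expression to return, the number of rows to look back or ahead, and a default value to use when no such row exists.

The query below returns, for each match, the previous opponent, the next opponent, and the first and last opponent within a frame of 3 rows before and 3 rows after the current row: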

hive> SELECT *,
hive>   lag(opponent,1) over (partition by user_name order by create_time) as lag_opponent,
hive>   lead(opponent,1) over (partition by user_name order by create_time) as lead_opponent,
hive>   first_value(opponent) over (partition by user_name order by create_time rows between 3 preceding and 3 following) as first_opponent,
hive>   last_value(opponent) over (partition by user_name order by create_time rows between 3 preceding and 3 following) as last_opponent
hive> FROM user_match_temp;
+----------------+----------------+---------+------------------------+----------------+----------------+-----------------+----------------+
| user_name      | opponent       | result  | create_time            | lag_opponent   | lead_opponent  | first_opponent  | last_opponent  |
+----------------+----------------+---------+------------------------+----------------+----------------+-----------------+----------------+
| vpspringdata   | vpspringcloud  | 1       | 2019-07-18 23:20:00.0  | NULL           | vpspringboot   | vpspringcloud   | vpspringboot   |
| vpspringdata   | vpspringboot   | 0       | 2019-07-19 23:19:00.0  | vpspringcloud  | vpspringcloud  | vpspringcloud   | vpspringboot   |
| vpspringdata   | vpspringcloud  | 1       | 2019-07-21 23:20:00.0  | vpspringboot   | vpspringboot   | vpspringcloud   | vpspringboot   |
| vpspringdata   | vpspringboot   | 0       | 2019-07-23 23:19:00.0  | vpspringcloud  | NULL           | vpspringcloud   | vpspringboot   |
| vpspringcloud  | vpspringboot   | 1       | 2019-07-18 23:19:00.0  | NULL           | vpspringdata   | vpspringboot    | vpspringboot   |
| vpspringcloud  | vpspringdata   | 0       | 2019-07-18 23:20:00.0  | vpspringboot   | vpspringroo    | vpspringboot    | vpspringdata   |
| vpspringcloud  | vpspringroo    | 1       | 2019-07-19 22:19:00.0  | vpspringdata   | vpspringboot   | vpspringboot    | vpspringroo    |
| vpspringcloud  | vpspringboot   | 1       | 2019-07-20 23:19:00.0  | vpspringroo    | vpspringdata   | vpspringboot    | vpspringroo    |
| vpspringcloud  | vpspringdata   | 0       | 2019-07-21 23:20:00.0  | vpspringboot   | vpspringroo    | vpspringdata    | vpspringroo    |
| vpspringcloud  | vpspringroo    | 1       | 2019-07-22 22:19:00.0  | vpspringdata   | NULL           | vpspringroo     | vpspringroo    |
| vpspringroo    | vpspringcloud  | 0       | 2019-07-19 22:19:00.0  | NULL           | vpspringcloud  | vpspringcloud   | vpspringcloud  |
| vpspringroo    | vpspringcloud  | 0       | 2019-07-22 22:19:00.0  | vpspringcloud  | NULL           | vpspringcloud   | vpspringcloud  |
| vpspringboot   | vpspringcloud  | 0       | 2019-07-18 23:19:00.0  | NULL           | vpspringdata   | vpspringcloud   | vpspringdata   |
| vpspringboot   | vpspringdata   | 1       | 2019-07-19 23:19:00.0  | vpspringcloud  | vpspringcloud  | vpspringcloud   | vpspringdata   |
| vpspringboot   | vpspringcloud  | 0       | 2019-07-20 23:19:00.0  | vpspringdata   | vpspringdata   | vpspringcloud   | vpspringdata   |
| vpspringboot   | vpspringdata   | 1       | 2019-07-23 23:19:00.0  | vpspringcloud  | NULL           | vpspringcloud   | vpspringdata   |
+----------------+----------------+---------+------------------------+----------------+----------------+-----------------+----------------+
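
To see why the vpspringcloud rows come out as they do, the Python sketch below reimplements the four value functions for a single, already ordered partition. The helper names lag, lead and frame_first_last are hypothetical, the offset is assumed to be non-negative, and None plays the role of Hive's NULL.

```python
def lag(values, offset=1, default=None):
    """Value from `offset` rows earlier, or `default` when there is no such row."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


def lead(values, offset=1, default=None):
    """Value from `offset` rows later, or `default` when there is no such row."""
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]


def frame_first_last(values, preceding, following):
    """(first_value, last_value) over ROWS BETWEEN preceding PRECEDING AND following FOLLOWING."""
    out = []
    for i in range(len(values)):
        frame = values[max(0, i - preceding): i + following + 1]
        out.append((frame[0], frame[-1]))
    return out


# Opponents of vpspringcloud, ordered by create_time.
opponents = ["vpspringboot", "vpspringdata", "vpspringroo",
             "vpspringboot", "vpspringdata", "vpspringroo"]

lag_opponent = lag(opponents, 1)
lead_opponent = lead(opponents, 1)
first_last = frame_first_last(opponents, 3, 3)

print(lag_opponent[0], lead_opponent[-1])  # None None
print(first_last[0])   # ('vpspringboot', 'vpspringboot')
print(first_last[4])   # ('vpspringdata', 'vpspringroo')
```

The frame is clipped at both ends of the partition, which explains why the first row's last_opponent is only three rows ahead and why the fifth row's first_opponent is no longer the partition's first opponent.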
