

Hive vs. MySQL: max, group by, and job log analysis


Preparation

MySQL table: test_max_date(id int, name varchar(255), num int, date date)

Hive table: create table test_date_max(id int, name string, rq date);

insert into table test_date_max values
(1,"1","2020-12-25"),
(2,"1","2020-12-28"),
(3,"2","2020-12-25"),
(4,"2","2020-12-20")
;
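
For reference, a minimal MySQL-side setup mirroring the same data could look like this (a sketch; the num values are illustrative placeholders, since the post does not list them):

create table test_max_date (
    id int,
    name varchar(255),
    num int,
    date date
);

insert into test_max_date (id, name, num, date) values
(1, '1', 0, '2020-12-25'),
(2, '1', 0, '2020-12-28'),
(3, '2', 0, '2020-12-25'),
(4, '2', 0, '2020-12-20');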

Requirement

Query each person's latest status.

Logic

Each person has multiple rows; the larger the date, the newer the status.

Queries

MySQL:

SELECT id,name,date,max(date) from test_max_date group by name ORDER BY id
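
Note that this statement only executes when MySQL's ONLY_FULL_GROUP_BY SQL mode is disabled (the default before 5.7); the non-aggregated id and date values are then picked arbitrarily from within each name group, so they are not guaranteed to come from the row holding the maximum date. A standards-compliant rewrite, as a sketch, joins back on the per-name maximum:

SELECT t.id, t.name, t.date
FROM test_max_date t
JOIN (
    SELECT name, MAX(date) AS max_date
    FROM test_max_date
    GROUP BY name
) m ON t.name = m.name AND t.date = m.max_date
ORDER BY t.id;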

Hive:

select name,max(rq) from test_date_max group by name;
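
Against the sample rows above, this returns one row per distinct name with its latest date (Hive labels the unaliased aggregate column along the lines of _c1):

name    _c1
1       2020-12-28
2       2020-12-25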

A note on errors: Hive's GROUP BY restriction was discussed in a previous post; the MySQL-style query above would not compile in Hive.

The Hive table here has id, name, and a date column (rq). id is a primary key and never repeats; name can repeat. Grouping by name and applying max to rq effectively deduplicates name: for each group of rows sharing a name, it returns the largest date.

It is like a company divided into a fixed set of departments: to find the oldest employee in each department, you group the company-wide roster by department and take the max of age.

In Hive, the SELECT list and the GROUP BY clause must match: every selected column must either appear in GROUP BY or be wrapped in an aggregate function.
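
For example, adding the ungrouped id column to the select list is rejected at compile time (a sketch; the exact wording varies by Hive version, but it is a SemanticException about an expression not being a GROUP BY key):

-- Rejected: id is neither grouped nor aggregated
select id, name, max(rq) from test_date_max group by name;
-- FAILED: SemanticException ... Expression not in GROUP BY key 'id'

-- Accepted: every non-aggregated column appears in GROUP BY
select name, max(rq) from test_date_max group by name;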

If you need the complete rows, there are two approaches below (SQL, result data, and query time attached). Note that if a name has two rows tied on its maximum date, the join in method one returns both of them, while row_number() in method two keeps exactly one.

Method 1:
select
    a.*
from
    test_date_max a
    join
    (select name, max(rq) as rq from test_date_max group by name) b
    on a.rq = b.rq and a.name = b.name

a.id    a.name    a.rq
2       1         2020-12-28
3       2         2020-12-25
Time taken: 118.387 seconds, Fetched: 2 row(s)

Method 2:
select
    *
from(
    select
        *,
        row_number() over(partition by name order by rq desc) rank
    from
        test_date_max
    )tmp
where rank = 1

tmp.id    tmp.name    tmp.rq        tmp.rank
2         1           2020-12-28    1
3         2           2020-12-25    1
Time taken: 68.587 seconds, Fetched: 2 row(s)

Job log analysis: method 1 launches two MapReduce jobs plus a local map-join task (Total jobs = 2 in the log below), while method 2 needs only a single job.
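
To see the planned stages without actually running a query, you can prefix it with EXPLAIN; the output lists each stage and its dependencies. A sketch:

explain
select
    a.*
from
    test_date_max a
    join
    (select name, max(rq) as rq from test_date_max group by name) b
    on a.rq = b.rq and a.name = b.name;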

Method 1:

hive (test)> select
           > a.*
           > from
           > test_date_max a
           > join
           > (select name,max(rq) as rq from test_date_max group by name) b
           > on a.rq = b.rq and a.name = b.name
           > ;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = admin_20210204130801_0f13ad17-7887-4a32-984d-088b5453617e
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1611888254670_2374, Tracking URL = http://hdp6.tydic.xian:8088/proxy/application_1611888254670_2374/
Kill Command = /usr/hdp/2.6.5.0-292/hadoop/bin/hadoop job -kill job_1611888254670_2374
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2021-02-04 13:08:27,084 Stage-2 map = 0%, reduce = 0%
2021-02-04 13:08:44,179 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 3.78 sec
2021-02-04 13:08:59,776 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 7.68 sec
MapReduce Total cumulative CPU time: 7 seconds 680 msec
Ended Job = job_1611888254670_2374
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apache-hive-2.1.1-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.5.0-292/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2021-02-04 13:09:08 Starting to launch local task to process map join; maximum memory = 954728448
2021-02-04 13:09:09 Dump the side-table for tag: 0 with group count: 4 into file: file:/tmp/admin/a935af81-8bbe-4c2c-b2f5-d3bdaa816d9e/hive_2021-02-04_13-08-01_805_2036786040923555355-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile20--.hashtable
2021-02-04 13:09:09 Uploaded 1 File to: file:/tmp/admin/a935af81-8bbe-4c2c-b2f5-d3bdaa816d9e/hive_2021-02-04_13-08-01_805_2036786040923555355-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile20--.hashtable (356 bytes)
2021-02-04 13:09:09 End of local task; Time Taken: 1.24 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1611888254670_2377, Tracking URL = http://hdp6.tydic.xian:8088/proxy/application_1611888254670_2377/
Kill Command = /usr/hdp/2.6.5.0-292/hadoop/bin/hadoop job -kill job_1611888254670_2377
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2021-02-04 13:09:33,954 Stage-3 map = 0%, reduce = 0%
2021-02-04 13:09:59,112 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 3.37 sec
MapReduce Total cumulative CPU time: 3 seconds 370 msec
Ended Job = job_1611888254670_2377
MapReduce Jobs Launched:
Stage-Stage-2: Map: 1  Reduce: 1  Cumulative CPU: 7.68 sec  HDFS Read: 7794  HDFS Write: 140  SUCCESS
Stage-Stage-3: Map: 1  Cumulative CPU: 3.37 sec  HDFS Read: 5282  HDFS Write: 141  SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 50 msec
OK
a.id    a.name    a.rq
2       1         2020-12-28
3       2         2020-12-25
Time taken: 118.387 seconds, Fetched: 2 row(s)

Method 2:

hive (test)> select
           > *
           > from(
           > select
           > *,
           > row_number()over(partition by name order by rq desc) rank
           > from
           > test_date_max
           > )tmp
           > where rank=1
           > ;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = admin_20210204130834_f1469766-42c9-48cb-9194-2cb506a5ff6a
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1611888254670_2376, Tracking URL = http://hdp6.tydic.xian:8088/proxy/application_1611888254670_2376/
Kill Command = /usr/hdp/2.6.5.0-292/hadoop/bin/hadoop job -kill job_1611888254670_2376
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-02-04 13:09:07,610 Stage-1 map = 0%, reduce = 0%
2021-02-04 13:09:25,459 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.71 sec
2021-02-04 13:09:42,161 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.97 sec
MapReduce Total cumulative CPU time: 7 seconds 970 msec
Ended Job = job_1611888254670_2376
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 7.97 sec  HDFS Read: 10327  HDFS Write: 145  SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 970 msec
OK
tmp.id    tmp.name    tmp.rq        tmp.rank
2         1           2020-12-28    1
3         2           2020-12-25    1
Time taken: 68.587 seconds, Fetched: 2 row(s)

Job breakdown

2021-02-04 16:38:57,881 Stage-1 map = 0%, reduce = 0%
2021-02-04 16:39:13,646 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.81 sec
2021-02-04 16:39:20,976 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.78 sec

Hive's default execution engine is MapReduce: each SQL statement is compiled into MapReduce jobs, and each job runs in three phases: map, shuffle, and reduce. The map phase reads the input files. The shuffle phase merge-sorts the map output and spills the intermediate data to local disk. The reduce phase reads and merges the shuffle output, performs the final computation, and writes the result to disk. In the log lines above, the map phase is mostly file I/O with little computation of its own; the CPU time already accumulated when the map hits 100% comes largely from the shuffle's merge sort, which is simple but not free, and the reduce phase's reading, merging, and final aggregation accounts for the remaining CPU time.
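
Read as arithmetic (attributing CPU time to phases the way this post does), the three progress lines break down as follows:

    by map = 100%:           4.81 sec cumulative CPU (map read + shuffle sort)
    reduce phase adds:       8.78 - 4.81 = 3.97 sec
    final cumulative CPU:    8.78 sec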

Summary

Hive requires every non-aggregated column in SELECT to appear in GROUP BY, unlike MySQL's lax default. To recover the full rows carrying each group's maximum date, the row_number() approach is preferable here: it compiles to a single MapReduce job instead of two jobs plus a local task, and it finished in 68.6 seconds versus 118.4.