3.2-3.3 Common Data Compression in Hive


I. Data Compression

1. Why compress?

- Compression makes the data smaller, reducing local disk usage and disk I/O, and cutting network I/O.
- Hadoop jobs are typically I/O-bound; compression reduces the amount of data transferred across the network.
- Simply enabling compression can improve overall job performance.
- Data to be compressed must support splittability.


2. When to compress?

1. Compress map input
- MapReduce jobs read their input from HDFS; compress if the input data is large, to reduce disk read cost.
- Use splittable algorithms such as Bzip2, or combine compression with splittable file structures such as SequenceFiles or RCFiles.

2. Compress intermediate data
- Map output is written to disk (spill) and transferred across the network.
- Always use compression to reduce both disk-write and network-transfer load.
- Beneficial from a performance point of view even if the input and output are uncompressed.
- Use faster codecs such as Snappy or LZO.

3. Compress reducer output
- MapReduce output is used both for archiving and for chaining MapReduce jobs.
- Use compression to reduce the disk space needed for archiving; it is also beneficial for chained jobs, especially with limited disk throughput.
- Use compression methods with a higher compression ratio to save more disk space.


3. Supported Codecs in Hadoop

Zlib   → org.apache.hadoop.io.compress.DefaultCodec
Gzip   → org.apache.hadoop.io.compress.GzipCodec
Bzip2  → org.apache.hadoop.io.compress.BZip2Codec
Lzo    → com.hadoop.compression.lzo.LzoCodec
Lz4    → org.apache.hadoop.io.compress.Lz4Codec
Snappy → org.apache.hadoop.io.compress.SnappyCodec
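To make these codecs available cluster-wide they are typically registered in core-site.xml. A minimal sketch for Hadoop 2.x (the LZO codec is omitted here because it requires the third-party hadoop-lzo package):

```xml
<!-- core-site.xml: register the compression codecs Hadoop may use -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,
         org.apache.hadoop.io.compress.GzipCodec,
         org.apache.hadoop.io.compress.BZip2Codec,
         org.apache.hadoop.io.compress.Lz4Codec,
         org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```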


4. Compression in MapReduce

Compressed input usage:
- The file format is auto-recognized by its extension; the codec must be defined in core-site.xml.

Compress intermediate data (map output):
- mapreduce.map.output.compress=true
- mapreduce.map.output.compress.codec=CodecName

Compress job output (reducer output):
- mapreduce.output.fileoutputformat.compress=true
- mapreduce.output.fileoutputformat.compress.codec=CodecName
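Instead of passing these switches per job, they can be made the cluster default in mapred-site.xml. A sketch, assuming the Snappy native libraries are installed:

```xml
<!-- mapred-site.xml: compress map output with Snappy by default -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```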


5. Compression in Hive

Compressed input usage: can be defined in the table definition, e.g.
    STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"

Compress intermediate data (map output):
    SET hive.exec.compress.intermediate=true;
    SET mapred.map.output.compression.codec=CodecName;
    SET mapred.map.output.compression.type=BLOCK/RECORD;
Use faster codecs such as Snappy, LZO, or LZ4. Useful for chained MapReduce jobs with lots of intermediate data, such as joins.

Compress job output (reducer output):
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=CodecName;
    SET mapred.output.compression.type=BLOCK/RECORD;
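Putting the output-side settings together, a minimal Hive session might look like the sketch below (the table name `emp_c` is hypothetical; the mapred.* property names are the pre-Hive-2 forms used above):

```sql
-- Compress the files written by the query result
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

-- The result files of this CTAS on HDFS are now Snappy-compressed
CREATE TABLE emp_c AS SELECT * FROM emp;
```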


II. Snappy

1. Overview

In a Hadoop cluster, Snappy is a good choice of compression tool: compared with gzip it has a large advantage in compression and decompression speed and uses less CPU, but its compression ratio is lower than gzip's. Each has its own use cases.

Snappy is a compression/decompression library written in C++, designed for high compression speed at a reasonable compression ratio. It is faster than zlib, but its output files are 20% to 100% larger. On a Core i7 processor in 64-bit mode it can compress at roughly 250-500 MB per second. Snappy's predecessor was Zippy. Although it is only a compression library, Google uses it in many internal projects, including BigTable, MapReduce, and RPC. Google states that the library and its algorithm were optimized for data-processing speed; as a trade-off, output size and compatibility with similar tools were not design goals.

Snappy is specifically optimized for 64-bit x86 processors: on a single Intel Core i7 core it reaches at least 250 MB/s compression and 500 MB/s decompression. Higher compression speeds are possible if some compression ratio is sacrificed. Although the compressed files may be 20% to 100% larger than those produced by other libraries, Snappy achieves remarkable speed at a given compression ratio: "compressing plain text files is 1.5-1.7x faster than other libraries, HTML 2-4x faster, but for already-compressed data such as JPEG and PNG there is no obvious improvement in compression speed."
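The speed-versus-ratio trade-off described above can be felt locally. A rough sketch using gzip and bzip2 as stand-ins, since Snappy usually ships without a standalone command-line tool:

```shell
# Generate a compressible text file, then compress it with two codecs
# that sit at different points of the speed/ratio trade-off.
yes "hadoop hive mapreduce compression demo" | head -n 100000 > sample.txt

gzip  -k -f sample.txt    # faster, moderate compression ratio
bzip2 -k -f sample.txt    # slower, typically a higher ratio

ls -l sample.txt sample.txt.gz sample.txt.bz2   # compare the sizes
```

With such repetitive input both outputs are far smaller than the original; on real data the size gap between codecs is what you pay (or save) for speed.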


2. Making the Snappy library available to Hadoop

Pre-compiled library files are used here:

# The pre-compiled libraries are inside an archive; extract them first
[root@hadoop-senior softwares]# mkdir 2.5.0-native-snappy
[root@hadoop-senior softwares]# tar zxf 2.5.0-native-snappy.tar.gz -C 2.5.0-native-snappy
[root@hadoop-senior softwares]# cd 2.5.0-native-snappy
[root@hadoop-senior 2.5.0-native-snappy]# ls
libhadoop.a       libhadoop.so        libhadooputils.a  libhdfs.so        libsnappy.a   libsnappy.so    libsnappy.so.1.2.0
libhadooppipes.a  libhadoop.so.1.0.0  libhdfs.a         libhdfs.so.0.0.0  libsnappy.la  libsnappy.so.1

# Replace the native libraries in the Hadoop installation
[root@hadoop-senior lib]# pwd
/opt/modules/hadoop-2.5.0/lib
[root@hadoop-senior lib]# mv native/ 250-native
[root@hadoop-senior lib]# mkdir native
[root@hadoop-senior lib]# ls
250-native  native  native-bak
[root@hadoop-senior lib]# cp /opt/softwares/2.5.0-native-snappy/* ./native/
[root@hadoop-senior lib]# ls native
libhadoop.a       libhadoop.so        libhadooputils.a  libhdfs.so        libsnappy.a   libsnappy.so    libsnappy.so.1.2.0
libhadooppipes.a  libhadoop.so.1.0.0  libhdfs.a         libhdfs.so.0.0.0  libsnappy.la  libsnappy.so.1

# Verify
[root@hadoop-senior hadoop-2.5.0]# bin/hadoop checknative
19/04/25 09:59:51 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
19/04/25 09:59:51 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /opt/modules/hadoop-2.5.0/lib/native/libhadoop.so
zlib:   true /lib64/libz.so.1
snappy: true /opt/modules/hadoop-2.5.0/lib/native/libsnappy.so.1    # snappy is now true
lz4:    true revision:99
bzip2:  true /lib64/libbz2.so.1


3. MapReduce compression test

# Create a test file
[root@hadoop-senior hadoop-2.5.0]# bin/hdfs dfs -mkdir -p /user/root/mapreduce/wordcount/input
[root@hadoop-senior hadoop-2.5.0]# touch /opt/datas/wc.input
[root@hadoop-senior hadoop-2.5.0]# vim !$
hadoop hdfs
hadoop hive
hadoop mapreduce
hadoop hue
[root@hadoop-senior hadoop-2.5.0]# bin/hdfs dfs -put /opt/datas/wc.input /user/root/mapreduce/wordcount/input
put: `/user/root/mapreduce/wordcount/input/wc.input': File exists
[root@hadoop-senior hadoop-2.5.0]# bin/hdfs dfs -ls -R /user/root/mapreduce/wordcount/input
-rw-r--r--   1 root supergroup         12 2019-04-08 15:03 /user/root/mapreduce/wordcount/input/wc.input

# First run MapReduce without compression
[root@hadoop-senior hadoop-2.5.0]# bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount /user/root/mapreduce/wordcount/input /user/root/mapreduce/wordcount/output

# Then run MapReduce with compression
[root@hadoop-senior hadoop-2.5.0]# bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec /user/root/mapreduce/wordcount/input /user/root/mapreduce/wordcount/output2

# -Dmapreduce.map.output.compress=true : compress the map output (-D passes a parameter)
# -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec : use Snappy compression
# With such a small data set, the difference is barely visible


III. Configuring compression in Hive

hive (default)> set mapreduce.map.output.compress=true;
hive (default)> set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
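To confirm a property took effect in the current session, Hive echoes it back when `set` is given the name without a value, e.g.:

```
hive (default)> set mapreduce.map.output.compress;
mapreduce.map.output.compress=true
```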


Test:

Running a select in Hive executes a MapReduce job:

hive (default)> select count(*) from emp;

On the job's page in the web UI, you can see the configuration this job used.

Reposted from: https://www.cnblogs.com/weiyiming007/p/10768896.html
