Processing logs in Hive with a custom InputFormat
Development environment: hadoop-0.20.2, hive-0.6
1. The log delimiter
Sample log line:

2010-05-31 10:50:17|||61.132.4.82|||http://www.360buy.com/product/201185.html
The delimiter is "|||", chosen to minimize the chance that the log body itself contains the delimiter sequence and corrupts the data.
Hive's internal default field delimiter is "\001", so we need to convert our delimiter to it.
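For example, after conversion the sample line above becomes the following (\001 is an unprintable control character, written symbolically here):

2010-05-31 10:50:17\00161.132.4.82\001http://www.360buy.com/product/201185.html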
2. Write a custom InputFormat
package com.jd.cloud.clickstore;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobConfigurable;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

/**
 * A custom implementation of Hadoop's org.apache.hadoop.mapred.InputFormat.
 *
 * @author winston
 */
public class ClickstreamInputFormat extends TextInputFormat implements
        JobConfigurable {

    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit genericSplit, JobConf job, Reporter reporter)
            throws IOException {

        reporter.setStatus(genericSplit.toString());
        return new ClickstreamRecordReader(job, (FileSplit) genericSplit);
    }
}
3. Write a ClickstreamRecordReader that implements the RecordReader interface, overriding its next method
/** Read a line. */
public synchronized boolean next(LongWritable key, Text value)
        throws IOException {

    while (pos < end) {
        key.set(pos);

        int newSize = in.readLine(value, maxLineLength,
                Math.max((int) Math.min(Integer.MAX_VALUE, end - pos),
                        maxLineLength));

        // start: rewrite the "|||" delimiter to Hive's default \001
        // (note that this also lowercases the whole line)
        String strReplace = value.toString().toLowerCase().replaceAll("\\|\\|\\|", "\001");
        Text txtReplace = new Text();
        txtReplace.set(strReplace);
        value.set(txtReplace.getBytes(), 0, txtReplace.getLength());
        // end

        if (newSize == 0) {
            return false;
        }
        pos += newSize;
        if (newSize < maxLineLength) {
            return true;
        }

        // line too long. try again
        LOG.info("Skipped line of size " + newSize + " at pos " + (pos - newSize));
    }

    return false;
}
Rather than writing the reader from scratch, we can copy Hadoop's LineRecordReader and modify only its next method; a sketch of the rest of the class follows below.
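For completeness, here is a minimal sketch of what the full ClickstreamRecordReader might look like when adapted from Hadoop 0.20.2's LineRecordReader. Treat it as an illustration, not the article's exact class: compression handling from the original LineRecordReader is omitted, and field names simply follow LineRecordReader's.

package com.jd.cloud.clickstore;

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.util.LineReader;

public class ClickstreamRecordReader implements RecordReader<LongWritable, Text> {

    private static final Log LOG = LogFactory.getLog(ClickstreamRecordReader.class);

    private long start;   // first byte of this split
    private long pos;     // current position in the file
    private long end;     // one past the last byte of this split
    private LineReader in;
    private int maxLineLength;

    public ClickstreamRecordReader(JobConf job, FileSplit split) throws IOException {
        maxLineLength = job.getInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);
        start = split.getStart();
        end = start + split.getLength();
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(file);

        // If the split does not begin at the start of the file, back up one byte
        // and discard the first (partial) line; the previous split reads it instead.
        boolean skipFirstLine = false;
        if (start != 0) {
            skipFirstLine = true;
            --start;
            fileIn.seek(start);
        }
        in = new LineReader(fileIn, job);
        if (skipFirstLine) {
            start += in.readLine(new Text(), 0,
                    (int) Math.min((long) Integer.MAX_VALUE, end - start));
        }
        pos = start;
    }

    /** Read one line, rewriting the "|||" delimiter to \001 (see step 3 above). */
    public synchronized boolean next(LongWritable key, Text value) throws IOException {
        while (pos < end) {
            key.set(pos);
            int newSize = in.readLine(value, maxLineLength,
                    Math.max((int) Math.min(Integer.MAX_VALUE, end - pos), maxLineLength));
            Text replaced = new Text(
                    value.toString().toLowerCase().replaceAll("\\|\\|\\|", "\001"));
            value.set(replaced.getBytes(), 0, replaced.getLength());
            if (newSize == 0) {
                return false;
            }
            pos += newSize;
            if (newSize < maxLineLength) {
                return true;
            }
            LOG.info("Skipped line of size " + newSize + " at pos " + (pos - newSize));
        }
        return false;
    }

    public LongWritable createKey() {
        return new LongWritable();
    }

    public Text createValue() {
        return new Text();
    }

    public long getPos() {
        return pos;
    }

    public float getProgress() {
        if (start == end) {
            return 0.0f;
        }
        return Math.min(1.0f, (pos - start) / (float) (end - start));
    }

    public synchronized void close() throws IOException {
        if (in != null) {
            in.close();
        }
    }
}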
4. Start Hive and register the class we just wrote
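For example, assuming the two classes above are compiled and packaged into a jar (the jar name and path here are hypothetical):

add jar /path/to/clickstream-inputformat.jar;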
5. Create the table
create table clickstream_table (time string, ip string, url string)
stored as
  INPUTFORMAT 'com.jd.cloud.clickstore.ClickstreamInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/data/clickstream_20110216.txt';
6. Load the data
LOAD DATA LOCAL INPATH '/data/clickstream_20110216.txt' OVERWRITE INTO TABLE clickstream_table;
7. Query the data we just loaded
select * from clickstream_table;
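For the sample log line from step 1, the result should look roughly like this (the InputFormat has already rewritten "|||" to \001, so Hive splits the row into the three declared columns):

2010-05-31 10:50:17    61.132.4.82    http://www.360buy.com/product/201185.html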
Reference: http://wiki.apache.org/hadoop/Hive/SerDe
Reposted from: https://www.cnblogs.com/java20130722/p/3206914.html