
hive处理日志,自定义inputformat

Published: 2023/12/2
Development environment: hadoop-0.20.2, hive-0.6

1. Log delimiter

```
2010-05-31 10:50:17|||61.132.4.82|||http://www.360buy.com/product/201185.html
```

The field delimiter is "|||", chosen so that it is unlikely the log body itself contains the same character sequence and corrupts the data. Hive's default internal field delimiter is "\001", so we need to convert one into the other.
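The conversion itself can be sketched in isolation. Note that `|` is a regex metacharacter, so it must be escaped in `replaceAll` (a minimal standalone demo, not the RecordReader itself):

```java
public class DelimiterDemo {
    public static void main(String[] args) {
        String line = "2010-05-31 10:50:17|||61.132.4.82|||http://www.360buy.com/product/201185.html";
        // "|" is a regex metacharacter, so "|||" must be escaped as \\|\\|\\|
        String converted = line.replaceAll("\\|\\|\\|", "\001");
        // each "|||" becomes the single \001 byte Hive expects by default
        System.out.println(converted.split("\001").length); // prints 3
    }
}
```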

2. Write a custom InputFormat

```java
package com.jd.cloud.clickstore;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobConfigurable;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

/**
 * Custom implementation of Hadoop's org.apache.hadoop.mapred.InputFormat.
 *
 * @author winston
 */
public class ClickstreamInputFormat extends TextInputFormat implements
        JobConfigurable {

    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit genericSplit, JobConf job, Reporter reporter)
            throws IOException {

        reporter.setStatus(genericSplit.toString());
        return new ClickstreamRecordReader(job, (FileSplit) genericSplit);
    }
}
```
3. Write ClickstreamRecordReader implementing the RecordReader interface, and override the next method

```java
/** Read a line. */
public synchronized boolean next(LongWritable key, Text value)
    throws IOException {

  while (pos < end) {
    key.set(pos);

    int newSize = in.readLine(value, maxLineLength,
                              Math.max((int) Math.min(Integer.MAX_VALUE, end - pos),
                                       maxLineLength));

    // start: convert the "|||" field delimiter into Hive's internal \001
    String strReplace = value.toString().toLowerCase().replaceAll("\\|\\|\\|", "\001");
    Text txtReplace = new Text();
    txtReplace.set(strReplace);
    value.set(txtReplace.getBytes(), 0, txtReplace.getLength());
    // end

    if (newSize == 0) {
      return false;
    }
    pos += newSize;
    if (newSize < maxLineLength) {
      return true;
    }

    // line too long. try again
    LOG.info("Skipped line of size " + newSize + " at pos " + (pos - newSize));
  }

  return false;
}
```
We can simply take Hadoop's LineRecordReader and modify its next method this way.
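Instead of copying LineRecordReader's internals, the surrounding class could also delegate to it. This skeleton is a sketch under that assumption (the article only shows `next()`; the constructor and delegation here are modeled on Hadoop 0.20's `org.apache.hadoop.mapred.LineRecordReader`, not taken from the original code):

```java
// Hypothetical delegation-based variant; only the delimiter rewrite is from the article.
public class ClickstreamRecordReader implements RecordReader<LongWritable, Text> {
  private final LineRecordReader delegate; // reuse Hadoop's line-splitting logic

  public ClickstreamRecordReader(JobConf job, FileSplit split) throws IOException {
    delegate = new LineRecordReader(job, split);
  }

  public boolean next(LongWritable key, Text value) throws IOException {
    if (!delegate.next(key, value)) {
      return false;
    }
    // convert "|||" into Hive's default \001 delimiter after the line is read
    value.set(value.toString().replaceAll("\\|\\|\\|", "\001"));
    return true;
  }

  public LongWritable createKey() { return delegate.createKey(); }
  public Text createValue() { return delegate.createValue(); }
  public long getPos() throws IOException { return delegate.getPos(); }
  public float getProgress() throws IOException { return delegate.getProgress(); }
  public void close() throws IOException { delegate.close(); }
}
```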

4. Start Hive and add the class we just built
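The original post does not show this step; in the Hive CLI it might look like the following (the jar name and path are hypothetical):

```
hive> add jar /path/to/clickstream-inputformat.jar;
hive> list jars;
```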


5. Create the table

```sql
CREATE TABLE clickstream_table (time STRING, ip STRING, url STRING)
STORED AS
  INPUTFORMAT 'com.jd.cloud.clickstore.ClickstreamInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/data/clickstream_20110216.txt';
```

6. Load the data

```sql
LOAD DATA LOCAL INPATH '/data/clickstream_20110216.txt' OVERWRITE INTO TABLE clickstream_table;
```

7. Query the data we just loaded

```sql
SELECT * FROM clickstream_table;
```



Reference: http://wiki.apache.org/hadoop/Hive/SerDe

Reposted from: https://www.cnblogs.com/java20130722/p/3206914.html
