MapReduce TextInputFormat Split Mechanism
By default, MapReduce uses TextInputFormat to split its input. The mechanism works as follows:
(1) Splitting is based simply on the byte length of each file's content.
(2) The split size defaults to the block size and can be configured separately.
(3) Splitting does not consider the data set as a whole; each file is split on its own.

For example (assuming a 128M block size):
(1) The input consists of two files:
file1.txt 320M
file2.txt 10M
(2) After the split logic of FileInputFormat runs (TextInputFormat is one of its implementations), the resulting splits are:
file1.txt.split1 -- 0~128M
file1.txt.split2 -- 128~256M
file1.txt.split3 -- 256~320M
file2.txt.split1 -- 0~10M
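The per-file arithmetic above can be sketched in plain Java without any Hadoop dependency. This is a simplified illustration, not Hadoop's actual source: the class name SplitSketch and its methods are made up for this example, it assumes non-empty, splittable files, and it ignores block locations. It does keep two details of FileInputFormat: the split size is max(minSize, min(maxSize, blockSize)), and the last split may be up to 10% larger than the split size before a new split is started.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Hadoop-free sketch of per-file split planning.
class SplitSketch {
    static final long MB = 1024L * 1024L;
    // The remainder is cut off only while it exceeds 1.1x the split size.
    static final double SPLIT_SLOP = 1.1;

    // splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns {start, length} pairs in bytes for one file (assumes fileLength > 0).
    static List<long[]> splitsFor(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[]{fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining != 0) {
            // Whatever is left becomes the final split of this file.
            splits.add(new long[]{fileLength - remaining, remaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long splitSize = computeSplitSize(128 * MB, 1, Long.MAX_VALUE);
        long[] fileLengths = {320 * MB, 10 * MB}; // file1.txt, file2.txt
        // Each file is planned independently; splits never span two files.
        for (long len : fileLengths) {
            for (long[] s : splitsFor(len, splitSize)) {
                System.out.println(s[0] / MB + "M ~ " + (s[0] + s[1]) / MB + "M");
            }
        }
    }
}
```

Running main with the two example file sizes yields three splits for the 320M file and one for the 10M file, which matches the split list above.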
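Testing how the data is read
Input data (words separated by a space, each line terminated by a newline)
The k-v pairs in the map phase
As the output shows, k is the byte offset and v is the content of one line, which means TextInputFormat reads its input line by line.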
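The relationship between offset keys and line values can be imitated in a few lines of plain Java. This is only a sketch: LineOffsetSketch is a hypothetical class, the sample text is invented because the original input is not reproduced here, and only "\n" is treated as a line terminator (Hadoop's own line reader also handles "\r" and "\r\n").

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the (offset, line) pairs a mapper receives.
class LineOffsetSketch {
    // Returns "offset<TAB>line" records: the key is the byte offset at which
    // the line starts, the value is the line without its trailing '\n'.
    static List<String> records(String content) {
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        List<String> out = new ArrayList<>();
        int lineStart = 0;
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == '\n') {
                out.add(lineStart + "\t"
                        + new String(bytes, lineStart, i - lineStart, StandardCharsets.UTF_8));
                lineStart = i + 1; // the next line starts right after the newline
            }
        }
        if (lineStart < bytes.length) {
            // Last line without a trailing newline.
            out.add(lineStart + "\t"
                    + new String(bytes, lineStart, bytes.length - lineStart, StandardCharsets.UTF_8));
        }
        return out;
    }

    public static void main(String[] args) {
        // Invented sample input: two lines, each ending with a newline.
        for (String record : records("hello world\nfoo bar\n")) {
            System.out.println(record);
        }
    }
}
```

Because the key counts bytes and not lines, the second line of the sample starts at offset 12: the 11 bytes of "hello world" plus one byte for the newline.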
Testing the number of splits, using WordCount as an example
Test data: three identical files
Test code
package com.mapreduce.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.log4j.BasicConfigurator;

import java.io.IOException;
import java.util.StringTokenizer;

public class WordCount {

    static {
        try {
            // Set hadoop.home.dir (in place of the HADOOP_HOME environment variable)
            System.setProperty("hadoop.home.dir", "D:/DevelopTools/hadoop-2.9.2/");
            // Initialize logging
            BasicConfigurator.configure();
            // Load the native library
            System.load("D:/DevelopTools/hadoop-2.9.2/bin/hadoop.dll");
        } catch (UnsatisfiedLinkError e) {
            System.err.println("Native code library failed to load.\n" + e);
            System.exit(1);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hard-coded input and output directories for the local test
        args = new String[]{"D:\\tmp\\input2", "D:\\tmp\\456"};
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Set the InputFormat. The default is already TextInputFormat.class;
        // it is set explicitly here because the split size is configured next.
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.setMinInputSplitSize(job, 1);
        TextInputFormat.setMaxInputSplitSize(job, 1024 * 1024 * 128);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Print the incoming k-v pair (offset and line)
            System.out.println(key + "\t" + value);
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
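To get a feel for how the values passed to setMinInputSplitSize and setMaxInputSplitSize change the split count, the calculation can be tried outside Hadoop. The sketch below is an assumption-laden model, not the framework's code: SplitTuningSketch is a hypothetical class, and the file size (100 bytes each) and block size (128 MB) are invented, since the real test files are not reproduced here. Empty files are not modeled.

```java
// Hypothetical model of how min/max split size settings affect the split count.
class SplitTuningSketch {
    static final double SPLIT_SLOP = 1.1;

    // Same rule FileInputFormat applies: max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits produced for a single non-empty file.
    static int countSplits(long fileLength, long splitSize) {
        int count = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            count++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            count++;
        }
        return count;
    }

    // Files are split one by one, so the total is the sum over all files.
    static int totalSplits(long[] fileLengths, long blockSize, long minSize, long maxSize) {
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        int total = 0;
        for (long len : fileLengths) {
            total += countSplits(len, splitSize);
        }
        return total;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;  // assumed block size
        long[] files = {100, 100, 100};       // three identical files, assumed 100 bytes each

        // Max split size of 128 MB, as in the test code: one split per file.
        System.out.println(totalSplits(files, blockSize, 1, 128L * 1024 * 1024));
        // A max split size smaller than the file forces several splits per file.
        System.out.println(totalSplits(files, blockSize, 1, 40));
    }
}
```

Under these assumptions, lowering the maximum below the block size makes splits smaller and more numerous, while raising the minimum above the block size makes them larger. Small files still produce at least one split each, because splits never span files.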
Reposted from: https://www.cnblogs.com/jhxxb/p/10790786.html