
How the Inverted Index Works in Hadoop

Published: 2024/8/1


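The inverted index grew out of a practical need: finding records by the value of an attribute. Each entry in such an index holds an attribute value together with the addresses of all records that have that value. Because the attribute value determines the record locations, rather than the record determining the attribute value, the index is called inverted. I read through some material to learn about it, and what follows is my own understanding of the inverted index.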
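As a minimal illustration of the definition above, the following plain-Java sketch maps each attribute value to the positions of the records that hold it. It does not use Hadoop; the class name `InvertedIndexSketch`, the method `build`, and the sample data are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the Hadoop job in this article.
class InvertedIndexSketch {

    // Maps each attribute value to the positions of the records holding it.
    public static Map<String, List<Integer>> build(List<String> attributeValues) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int position = 0; position < attributeValues.size(); position++) {
            index.computeIfAbsent(attributeValues.get(position), v -> new ArrayList<>())
                 .add(position);
        }
        return index;
    }

    public static void main(String[] args) {
        // Record i has the attribute value stored at list index i (made-up data).
        List<String> colors = List.of("red", "blue", "red", "green");
        System.out.println(build(colors));
    }
}
```

Looking up `red` in the resulting map returns the positions of both records with that value, which is the "value determines the record" direction that gives the index its name.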

Suppose I have two files, 123.txt and 456.txt, with the following contents.
123.txt

Hello world
nice to meet you
happy good
Would you like to play basketball with me

456.txt

Hello hadoop
basketball good
good better best nice
sun moon star sky

Before the Map Phase
Before the map phase, the input looks like this
123.txt

0 Hello world
13 nice to meet you
31 happy good
43 Would you like to play basketball with me

456.txt

0 Hello hadoop
14 basketball good
31 good better best nice
54 sun moon star sky

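The leading number is the byte offset of the line. It matters little here, because the map only splits the value that follows it. The two listings above are the inputs that the map phase receives from 123.txt and 456.txt, respectively.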
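The offsets in the listings above can be reproduced with a small plain-Java sketch. It assumes the files use two-byte CRLF line endings, which is an inference from the numbers (0, 13, 31, 43) and not something stated about the files. `LineOffsetSketch` and `lineOffsets` are illustrative names, not Hadoop APIs.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that mimics how line offsets become map input keys.
class LineOffsetSketch {

    // Returns the byte offset at which each line starts. "\n" ends a line;
    // a preceding "\r" simply counts as one more byte of that line.
    public static List<Long> lineOffsets(String content) {
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        List<Long> offsets = new ArrayList<>();
        if (bytes.length == 0) {
            return offsets;
        }
        offsets.add(0L);
        // Stop one byte early so a trailing newline does not start an empty line.
        for (int i = 0; i < bytes.length - 1; i++) {
            if (bytes[i] == '\n') {
                offsets.add((long) (i + 1));
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Assumed CRLF line endings (see the note above).
        String file123 = "Hello world\r\nnice to meet you\r\nhappy good\r\n"
                + "Would you like to play basketball with me";
        String file456 = "Hello hadoop\r\nbasketball good\r\n"
                + "good better best nice\r\nsun moon star sky";
        System.out.println(lineOffsets(file123));
        System.out.println(lineOffsets(file456));
    }
}
```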

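Map Phase (override the map method)
We use the word together with its source file as the key, and the word count as the value (which is always 1 at this point). The exact format is up to you; mine is shown below.
Map output for 123.txt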

Hello->123.txt 1
world->123.txt 1
nice->123.txt 1
to->123.txt 1
meet->123.txt 1
you->123.txt 1
happy->123.txt 1
good->123.txt 1
Would->123.txt 1
you->123.txt 1
like->123.txt 1
to->123.txt 1
play->123.txt 1
basketball->123.txt 1
with->123.txt 1
me->123.txt 1

Map output for 456.txt

Hello->456.txt 1
hadoop->456.txt 1
basketball->456.txt 1
good->456.txt 1
good->456.txt 1
better->456.txt 1
best->456.txt 1
nice->456.txt 1
sun->456.txt 1
moon->456.txt 1
star->456.txt 1
sky->456.txt 1
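The map step above can be sketched in plain Java without a cluster. This is only an approximation of what the mapper does for one input line; `MapStepSketch` and its `map` method are hypothetical names, and the file name is passed in directly instead of being read from the input split.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical stand-in for the map step; no Hadoop classes are used.
class MapStepSketch {

    // Emits one {"word->fileName", "1"} pair per whitespace-separated token.
    public static List<String[]> map(String fileName, String line) {
        List<String[]> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            pairs.add(new String[] {tokens.nextToken() + "->" + fileName, "1"});
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : map("123.txt", "nice to meet you")) {
            System.out.println(pair[0] + " " + pair[1]);
        }
    }
}
```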

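With this design we can rely on the MapReduce framework's built-in map-side sort, which orders the keys and collects the values of identical keys (the same word in the same file) into a list. Text keys are compared byte by byte, so uppercase letters sort before lowercase ones. The result is as follows
123.txt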

Hello->123.txt {1}
Would->123.txt {1}
basketball->123.txt {1}
good->123.txt {1}
happy->123.txt {1}
like->123.txt {1}
me->123.txt {1}
meet->123.txt {1}
nice->123.txt {1}
play->123.txt {1}
to->123.txt {1,1}
with->123.txt {1}
world->123.txt {1}
you->123.txt {1,1}

456.txt

Hello->456.txt {1}
basketball->456.txt {1}
best->456.txt {1}
better->456.txt {1}
good->456.txt {1,1}
hadoop->456.txt {1}
moon->456.txt {1}
nice->456.txt {1}
sky->456.txt {1}
star->456.txt {1}
sun->456.txt {1}

The listings above are the inputs that the combine phase receives from 123.txt and 456.txt, respectively.
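The sort-and-group behaviour described above can be imitated with a `TreeMap`, as in the following sketch. It is an approximation only: Java compares `String` keys by UTF-16 code units, while Hadoop compares `Text` keys by bytes, and the two orders agree for the ASCII words used in this article. `SortGroupSketch` and `group` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the map-side sort and grouping.
class SortGroupSketch {

    // Collects a count of 1 per occurrence under each key, with keys kept sorted.
    public static Map<String, List<Integer>> group(List<String> keys) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String key : keys) {
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> keys = List.of(
                "you->123.txt", "Would->123.txt", "you->123.txt", "basketball->123.txt");
        // Uppercase "Would" sorts before lowercase "basketball".
        group(keys).forEach((key, values) -> System.out.println(key + " " + values));
    }
}
```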
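Combine Phase
The combiner is usually the same as the reducer, but here we need a custom one. In this phase we sum the values of each key and then set the word alone as the key, with the file name and the word's count in that file as the value. The framework's default shuffle then sends all records for the same word to the same Reducer. Note that Hadoop treats the combiner as an optional optimization and may run it zero, one, or several times, so this design depends on the combiner running exactly once for every key.
Combine output for 123.txt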

Hello 123.txt->1
Would 123.txt->1
basketball 123.txt->1
good 123.txt->1
happy 123.txt->1
like 123.txt->1
me 123.txt->1
meet 123.txt->1
nice 123.txt->1
play 123.txt->1
to 123.txt->2
with 123.txt->1
world 123.txt->1
you 123.txt->2

Combine output for 456.txt

Hello 456.txt->1
basketball 456.txt->1
best 456.txt->1
better 456.txt->1
good 456.txt->2
hadoop 456.txt->1
moon 456.txt->1
nice 456.txt->1
sky 456.txt->1
star 456.txt->1
sun 456.txt->1
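The combine step above boils down to summing the counts and moving the file name from the key into the value. A minimal plain-Java sketch of that transformation follows; `CombineStepSketch` and `combine` are hypothetical names, and the sketch assumes every key contains the `->` separator.

```java
import java.util.List;

// Hypothetical stand-in for the combine step; no Hadoop classes are used.
class CombineStepSketch {

    // Sums the counts of one "word->file" key and re-keys the result by word.
    // Returns {word, "file->count"}. Assumes the key contains "->".
    public static String[] combine(String compositeKey, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        int separator = compositeKey.lastIndexOf("->");
        String word = compositeKey.substring(0, separator);
        String file = compositeKey.substring(separator + 2);
        return new String[] {word, file + "->" + sum};
    }

    public static void main(String[] args) {
        String[] out = combine("good->456.txt", List.of(1, 1));
        System.out.println(out[0] + " " + out[1]);
    }
}
```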

Shuffle Phase
Shuffle output (assuming a single partition)

Hello {456.txt->1,123.txt->1}
Would {123.txt->1}
basketball {123.txt->1,456.txt->1}
best {456.txt->1}
better {456.txt->1}
good {456.txt->2,123.txt->1}
hadoop {456.txt->1}
happy {123.txt->1}
like {123.txt->1}
me {123.txt->1}
meet {123.txt->1}
moon {456.txt->1}
nice {456.txt->1,123.txt->1}
play {123.txt->1}
sky {456.txt->1}
star {456.txt->1}
sun {456.txt->1}
to {123.txt->2}
with {123.txt->1}
world {123.txt->1}
you {123.txt->2}

The output of this phase is the input of the Reduce phase.
Reduce Phase (override the reduce method)
The Reduce phase is straightforward. Its output is as follows

Hello 456.txt->1,123.txt->1
Would 123.txt->1
basketball 123.txt->1,456.txt->1
best 456.txt->1
better 456.txt->1
good 456.txt->2,123.txt->1
hadoop 456.txt->1
happy 123.txt->1
like 123.txt->1
me 123.txt->1
meet 123.txt->1
moon 456.txt->1
nice 456.txt->1,123.txt->1
play 123.txt->1
sky 456.txt->1
star 456.txt->1
sun 456.txt->1
to 123.txt->2
with 123.txt->1
world 123.txt->1
you 123.txt->2
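The grouping done by the shuffle and the joining done by the reduce step can be sketched together in plain Java. This is a simplification: in the sketch, the values of a word appear in arrival order, whereas Hadoop does not promise any particular order of values within a key. `ShuffleReduceSketch`, `shuffle`, and `reduce` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the shuffle and reduce steps.
class ShuffleReduceSketch {

    // Shuffle: collect every "file->count" value under its word, words kept sorted.
    public static Map<String, List<String>> shuffle(List<String[]> combinedPairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : combinedPairs) {
            grouped.computeIfAbsent(pair[0], word -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    // Reduce: join the postings of one word with commas.
    public static String reduce(List<String> postings) {
        return String.join(",", postings);
    }

    public static void main(String[] args) {
        List<String[]> combined = List.of(
                new String[] {"good", "456.txt->2"},
                new String[] {"Hello", "456.txt->1"},
                new String[] {"good", "123.txt->1"});
        shuffle(combined).forEach(
                (word, postings) -> System.out.println(word + " " + reduce(postings)));
    }
}
```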

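Once the principle is clear, writing the code is much easier. The main work is overriding the map and reduce methods. A combiner class is needed only when you want to merge map output; if you do not, simply don't specify one. A partitioner class can likewise be written as needed. Overall, the data formats in MapReduce are transformed as follows
Map: (Key1, Value1) → list(Key2, Value2)
Combine: (Key2, list(Value2)) → list(Key3, Value3)
Reduce: (Key3, list(Value3)) → list(Key4, Value4)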
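One way to read the three signatures above is as generic Java interfaces. The sketch below is a hypothetical typing exercise, not Hadoop's API: the interface names and the `COMBINE` and `REDUCE` lambdas are made up, and the concrete types (a `String` key of the form `word->file`, an `Integer` count) are simplifications of the `Text` values used in the real job.

```java
import java.util.List;
import java.util.Map;

// Hypothetical typing of the stages; these interfaces are not Hadoop classes.
class StageTypesSketch {

    // Map: (Key1, Value1) -> list(Key2, Value2)
    interface MapStage<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> apply(K1 key, V1 value);
    }

    // Combine: (Key2, list(Value2)) -> list(Key3, Value3)
    interface CombineStage<K2, V2, K3, V3> {
        List<Map.Entry<K3, V3>> apply(K2 key, List<V2> values);
    }

    // Reduce: (Key3, list(Value3)) -> list(Key4, Value4)
    interface ReduceStage<K3, V3, K4, V4> {
        List<Map.Entry<K4, V4>> apply(K3 key, List<V3> values);
    }

    // Key2 = "word->file", Value2 = count; Key3 = word, Value3 = "file->count".
    static final CombineStage<String, Integer, String, String> COMBINE = (key, counts) -> {
        String[] parts = key.split("->");
        int sum = counts.stream().mapToInt(Integer::intValue).sum();
        return List.of(Map.entry(parts[0], parts[1] + "->" + sum));
    };

    // Key4 = word, Value4 = comma-separated postings.
    static final ReduceStage<String, String, String, String> REDUCE =
            (word, postings) -> List.of(Map.entry(word, String.join(",", postings)));

    public static void main(String[] args) {
        System.out.println(COMBINE.apply("to->123.txt", List.of(1, 1)));
        System.out.println(REDUCE.apply("good", List.of("456.txt->2", "123.txt->1")));
    }
}
```

Note that in this job Key3 differs from Key2, which is exactly what makes the combiner here more than a local copy of the reducer.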

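My source code is attached below. I use two partitions: words beginning with A-M (upper or lower case) go to partition 1 (index 0 in the code), and words beginning with N-Z (upper or lower case) go to partition 2 (index 1 in the code).
MyMap class: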

package hadoopSort;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;
import java.util.StringTokenizer;

public class MyMap extends Mapper<Object, Text, Text, Text> {
    private final Text word = new Text();
    // Every occurrence of a word starts with a count of 1
    private final Text value = new Text("1");

    @Override
    protected void map(Object k1, Text v1, Mapper<Object, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        // Get the name of the file this input split belongs to
        FileSplit inputSplit = (FileSplit) context.getInputSplit();
        Path path = inputSplit.getPath();
        String fileName = path.getName();
        // Split the line on whitespace (space, \t, and so on)
        StringTokenizer itr = new StringTokenizer(v1.toString());
        while (itr.hasMoreTokens()) {
            // Use "word->fileName" as the key
            word.set(itr.nextToken() + "->" + fileName);
            context.write(word, value);
        }
    }
}

MyCombiner class:

package hadoopSort;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MyCombiner extends Reducer<Text, Text, Text, Text> {
    private final Text text = new Text();

    @Override
    public void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        // Total number of times the word occurs in this file
        int sum = 0;
        for (Text v : values) {
            sum += Integer.parseInt(v.toString());
        }
        // Split the key on "->": the first part is the word, the second is its file
        String[] line = key.toString().split("->");
        // The word alone becomes the key
        key.set(line[0]);
        // The file name and the word's count in that file become the value
        text.set(line[1] + "->" + sum);
        context.write(key, text);
    }
}

MyPartitioner class (the partitioner):

package hadoopSort;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPart) {
        char firstLetter = key.toString().charAt(0);
        // A-M (upper or lower case) goes to partition index 0, everything else to index 1.
        // Invalid characters are not handled; all keys are assumed to be well-formed words,
        // and the job is assumed to run with two reduce tasks.
        if (firstLetter >= 'a' && firstLetter <= 'm' || firstLetter >= 'A' && firstLetter <= 'M') {
            return 0;
        } else {
            return 1;
        }
    }
}
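The partitioning rule itself can be checked without a cluster. The sketch below copies the first-letter test into a plain static method; `PartitionRuleSketch` and `partitionFor` are hypothetical names, and, like the partitioner above, it assumes non-empty words.

```java
// Hypothetical stand-in for the first-letter rule used by the partitioner.
class PartitionRuleSketch {

    // Returns 0 for words starting with A-M or a-m, and 1 for everything else.
    public static int partitionFor(String word) {
        char first = word.charAt(0);
        boolean firstHalf = (first >= 'a' && first <= 'm') || (first >= 'A' && first <= 'M');
        return firstHalf ? 0 : 1;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"Hello", "basketball", "nice", "Would"}) {
            System.out.println(word + " -> partition index " + partitionFor(word));
        }
    }
}
```

Because the map output key is `word->file`, its first character is the word's first character, so the same rule applies whether the key is inspected before or after the combiner rewrites it.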

MyReduce class:

package hadoopSort;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MyReduce extends Reducer<Text, Text, Text, Text> {
    private final Text result = new Text();

    @Override
    protected void reduce(Text k2, Iterable<Text> v2, Reducer<Text, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        // Join the entries of the value list with commas
        StringBuilder line = new StringBuilder();
        for (Text c : v2) {
            if (line.length() > 0) {
                line.append(",");
            }
            line.append(c.toString());
        }
        result.set(line.toString());
        context.write(k2, result);
    }
}

Main class:

package hadoopSort;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Main {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Allow cross-platform job submission
        conf.set("mapreduce.app-submission.cross-platform", "true");
        // Input and output paths; the output path must not exist when the job starts
        String[] filePath = new String[]{"/user/hadoop/input", "/user/hadoop/output"};
        // Create the job and give it a name
        Job job = Job.getInstance(conf, "word sort");
        // When submitting to a cluster, the jar location must be set,
        // otherwise a ClassNotFoundException is thrown
        job.setJar("out\\artifacts\\myMapReduce_jar\\myMapReduce.jar");
        job.setJarByClass(Main.class);
        // Mapper class
        job.setMapperClass(MyMap.class);
        // Key and value types of the map output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        // Combiner class
        job.setCombinerClass(MyCombiner.class);
        // Reducer class
        job.setReducerClass(MyReduce.class);
        // Key and value types of the reduce output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Partitioner class
        job.setPartitionerClass(MyPartitioner.class);
        // Number of partitions (reduce tasks)
        job.setNumReduceTasks(2);
        // Delete the output path if it exists, so it need not be removed by hand each time
        Path outputPath = new Path(filePath[1]);
        outputPath.getFileSystem(conf).delete(outputPath, true);
        // Input path
        FileInputFormat.addInputPath(job, new Path(filePath[0]));
        // Output path
        FileOutputFormat.setOutputPath(job, new Path(filePath[1]));
        // Wait for the job to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

If anything here is wrong, corrections are welcome!
