
Hadoop_23: Implementing an Inverted Index with MapReduce


1.1 Inverted Index

An inverted index locates records by attribute value: each entry in the index pairs an attribute value with the addresses of all records that carry that value. Because the positions of the records are determined from the attribute values, rather than the attribute values from the records, the structure is called an inverted index.

For example, a word–document matrix: each word (the attribute value) is placed first and used as the index into the documents that contain it.
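As a minimal in-memory sketch of the idea (illustration only, not the MapReduce jobs below; the sample documents and the map-based layout are made up for this example), an inverted index can be viewed as a mapping from each word to the documents that contain it, with a count per document:

import java.util.HashMap;
import java.util.Map;

public class ToyInvertedIndex {
    public static void main(String[] args) {
        // Sample corpus: file name -> contents (made-up data for illustration only).
        Map<String, String> docs = new HashMap<>();
        docs.put("a.txt", "hello tom hello jerry hello tom");
        docs.put("b.txt", "hello jerry hello jerry tom jerry");

        // word -> (file name -> occurrence count)
        Map<String, Map<String, Integer>> index = new HashMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split(" ")) {
                index.computeIfAbsent(word, w -> new HashMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }

        // Looking up a word returns every document containing it, e.g. hello={a.txt=3, b.txt=2}
        System.out.println("hello=" + index.get("hello"));
    }
}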

1.2 Building the Inverted Index with MapReduce

Requirement: build a search index over a large collection of text (documents, web pages).

Implementation of the first job (count each word per file):

package cn.bigdata.hdfs.mr;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Step one of building an inverted index with a MapReduce job.
 * Sample input files (one file per column):
 *   a.txt          b.txt          c.txt
 *   hello tom      hello jerry    hello jerry
 *   hello jerry    hello jerry    hello tom
 *   hello tom      tom jerry
 */
public class InverIndexStepOne {

    static class InverIndexStepOneMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        Text k = new Text();
        IntWritable v = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // Split each line of text on the space character
            String[] words = line.split(" ");
            // Get the name of the file this split belongs to
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            String fileName = inputSplit.getPath().getName();
            // Emit "word--fileName" -> 1 for every word occurrence
            for (String word : words) {
                k.set(word + "--" + fileName);
                context.write(k, v);
            }
        }
    }

    static class InverIndexStepOneReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the occurrences of this word in this file
            int count = 0;
            for (IntWritable value : values) {
                count += value.get();
            }
            context.write(key, new IntWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(InverIndexStepOne.class);
        job.setMapperClass(InverIndexStepOneMapper.class);
        job.setReducerClass(InverIndexStepOneReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}
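One optional refinement, not part of the original code: because this reducer only sums IntWritable counts and its input and output types match, it can also be registered as a combiner so counts are pre-aggregated on the map side before the shuffle. A one-line sketch of what would be added to main():

// Optional, not in the original post: pre-aggregate counts on the map side before the shuffle.
// Valid here because the reducer's sum is associative and its key/value types match its input.
job.setCombinerClass(InverIndexStepOneReducer.class);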

Output file of the first job: E:\inverseOut\part-r-00000

hello--a.txt	3
hello--b.txt	2
hello--c.txt	2
jerry--a.txt	1
jerry--b.txt	3
jerry--c.txt	1
tom--a.txt	2
tom--b.txt	1
tom--c.txt	1

A second job merges these records so that each word is followed by all of the documents that contain it, in the word–document matrix format described above. The implementation is as follows:

package cn.bigdata.hdfs.mr;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Merge the output of the first job so that the multiple per-document records
 * of one word are combined into a single record.
 */
public class IndexStepTwo {

    static class IndexStepTwoMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // "hello--a.txt<TAB>3"  ->  key "hello", value "a.txt<TAB>3"
            String[] files = line.split("--");
            context.write(new Text(files[0]), new Text(files[1]));
        }
    }

    static class IndexStepTwoReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Buffer used to concatenate all document records for this word
            StringBuffer sb = new StringBuffer();
            for (Text text : values) {
                // "a.txt<TAB>3" -> "a.txt-->3", separated by tabs
                sb.append(text.toString().replace("\t", "-->") + "\t");
            }
            context.write(key, new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        // Fall back to local test paths when no arguments are given
        // (the null check must come before the length check)
        if (args == null || args.length < 2) {
            args = new String[]{"E:/inverseOut/part-r-00000", "D:/inverseOut2"};
        }
        Configuration config = new Configuration();
        Job job = Job.getInstance(config);

        job.setJarByClass(IndexStepTwo.class);
        job.setMapperClass(IndexStepTwoMapper.class);
        job.setReducerClass(IndexStepTwoReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Exit with 0 on success and 1 on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
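For readers who want to see the string handling in isolation, here is a small stand-alone sketch (the sample line is made up, mirroring the step-one output format) of what the step-two mapper's split and the reducer's replace do to one intermediate record:

public class IndexStepTwoTrace {
    public static void main(String[] args) {
        // One line of step-one output: word--file<TAB>count
        String line = "hello--a.txt\t3";

        // Mapper: split on "--" -> key "hello", value "a.txt\t3"
        String[] parts = line.split("--");
        String mapKey = parts[0];
        String mapValue = parts[1];

        // Reducer: turn "a.txt\t3" into "a.txt-->3" and append a trailing tab
        String merged = mapValue.replace("\t", "-->") + "\t";

        System.out.println(mapKey + "\t" + merged);   // hello    a.txt-->3
    }
}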

Final output:

hello	c.txt-->2	b.txt-->2	a.txt-->3
jerry	c.txt-->1	b.txt-->3	a.txt-->1
tom	c.txt-->1	b.txt-->1	a.txt-->2

Summary:

Building an index over a large set of documents comes down to two steps: tokenizing the text, and counting how often each token occurs in each document. Those counts become the index file, so a later search for a term only has to consult the index file and can return the matching documents (summaries and so on) directly, instead of rescanning the whole corpus.
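As a rough illustration of that last point (a local sketch only; the file path is the step-two output location used above, and the tab-separated line format is assumed from that output, not from any search-engine API), the finished index file can be loaded into a map and queried directly:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class IndexLookup {
    public static void main(String[] args) throws IOException {
        // Path to the step-two output; adjust to wherever the job wrote it.
        String indexFile = "D:/inverseOut2/part-r-00000";

        // word -> "c.txt-->2<TAB>b.txt-->2<TAB>a.txt-->3"
        Map<String, String> index = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(indexFile))) {
            int tab = line.indexOf('\t');
            if (tab > 0) {
                index.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }

        // A lookup hits the prebuilt index instead of rescanning every document.
        System.out.println("hello -> " + index.get("hello"));
    }
}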


Reposted from: https://www.cnblogs.com/yaboya/p/9252313.html
