
Hadoop_23: Implementing an Inverted Index with MapReduce


1.1 Inverted Index

An inverted index locates records by attribute value: each entry in the index pairs an attribute value with the addresses of all records that carry that value. Because the positions of the records are determined from the attribute values, rather than the attribute values from the records, the structure is called an inverted index.

For example, a word–document matrix: each word (the attribute value) is placed first and used as the index into the documents that contain it.
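As a minimal in-memory sketch of the idea (illustration only, not the MapReduce jobs below; the sample documents and the map-based layout are made up for this example), an inverted index can be viewed as a mapping from each word to the documents that contain it, with a count per document:

import java.util.HashMap;
import java.util.Map;

public class ToyInvertedIndex {
    public static void main(String[] args) {
        // Sample corpus: file name -> contents (made-up data for illustration only).
        Map<String, String> docs = new HashMap<>();
        docs.put("a.txt", "hello tom hello jerry hello tom");
        docs.put("b.txt", "hello jerry hello jerry tom jerry");

        // word -> (file name -> occurrence count)
        Map<String, Map<String, Integer>> index = new HashMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split(" ")) {
                index.computeIfAbsent(word, w -> new HashMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }

        // Looking up a word returns every document containing it, e.g. hello={a.txt=3, b.txt=2}
        System.out.println("hello=" + index.get("hello"));
    }
}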

1.2 Building the Inverted Index with MapReduce

Requirement: build a search index over a large collection of text (documents, web pages).

Implementation of the first job (count each word per file):

package cn.bigdata.hdfs.mr;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Step one of building an inverted index with a MapReduce job.
 * Sample input files (one file per column):
 *   a.txt          b.txt          c.txt
 *   hello tom      hello jerry    hello jerry
 *   hello jerry    hello jerry    hello tom
 *   hello tom      tom jerry
 */
public class InverIndexStepOne {

    static class InverIndexStepOneMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        Text k = new Text();
        IntWritable v = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // Split each line of text on the space character
            String[] words = line.split(" ");
            // Get the name of the file this split belongs to
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            String fileName = inputSplit.getPath().getName();
            // Emit "word--fileName" -> 1 for every word occurrence
            for (String word : words) {
                k.set(word + "--" + fileName);
                context.write(k, v);
            }
        }
    }

    static class InverIndexStepOneReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the occurrences of this word in this file
            int count = 0;
            for (IntWritable value : values) {
                count += value.get();
            }
            context.write(key, new IntWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(InverIndexStepOne.class);
        job.setMapperClass(InverIndexStepOneMapper.class);
        job.setReducerClass(InverIndexStepOneReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}
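One optional refinement, not part of the original code: because this reducer only sums IntWritable counts and its input and output types match, it can also be registered as a combiner so counts are pre-aggregated on the map side before the shuffle. A one-line sketch of what would be added to main():

// Optional, not in the original post: pre-aggregate counts on the map side before the shuffle.
// Valid here because the reducer's sum is associative and its key/value types match its input.
job.setCombinerClass(InverIndexStepOneReducer.class);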

Output file of the first job: E:\inverseOut\part-r-00000

hello--a.txt	3
hello--b.txt	2
hello--c.txt	2
jerry--a.txt	1
jerry--b.txt	3
jerry--c.txt	1
tom--a.txt	2
tom--b.txt	1
tom--c.txt	1

A second job merges these records so that each word is followed by all of the documents that contain it, in the word–document matrix format described above. The implementation is as follows:

package cn.bigdata.hdfs.mr;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Merge the output of the first job so that the multiple per-document records
 * of one word are combined into a single record.
 */
public class IndexStepTwo {

    static class IndexStepTwoMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // "hello--a.txt<TAB>3"  ->  key "hello", value "a.txt<TAB>3"
            String[] files = line.split("--");
            context.write(new Text(files[0]), new Text(files[1]));
        }
    }

    static class IndexStepTwoReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Buffer used to concatenate all document records for this word
            StringBuffer sb = new StringBuffer();
            for (Text text : values) {
                // "a.txt<TAB>3" -> "a.txt-->3", separated by tabs
                sb.append(text.toString().replace("\t", "-->") + "\t");
            }
            context.write(key, new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        // Fall back to local test paths when no arguments are given
        // (the null check must come before the length check)
        if (args == null || args.length < 2) {
            args = new String[]{"E:/inverseOut/part-r-00000", "D:/inverseOut2"};
        }
        Configuration config = new Configuration();
        Job job = Job.getInstance(config);

        job.setJarByClass(IndexStepTwo.class);
        job.setMapperClass(IndexStepTwoMapper.class);
        job.setReducerClass(IndexStepTwoReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Exit with 0 on success and 1 on failure
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
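For readers who want to see the string handling in isolation, here is a small stand-alone sketch (the sample line is made up, mirroring the step-one output format) of what the step-two mapper's split and the reducer's replace do to one intermediate record:

public class IndexStepTwoTrace {
    public static void main(String[] args) {
        // One line of step-one output: word--file<TAB>count
        String line = "hello--a.txt\t3";

        // Mapper: split on "--" -> key "hello", value "a.txt\t3"
        String[] parts = line.split("--");
        String mapKey = parts[0];
        String mapValue = parts[1];

        // Reducer: turn "a.txt\t3" into "a.txt-->3" and append a trailing tab
        String merged = mapValue.replace("\t", "-->") + "\t";

        System.out.println(mapKey + "\t" + merged);   // hello    a.txt-->3
    }
}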

Final output:

hello	c.txt-->2	b.txt-->2	a.txt-->3
jerry	c.txt-->1	b.txt-->3	a.txt-->1
tom	c.txt-->1	b.txt-->1	a.txt-->2

Summary:

Building an index over a large set of documents comes down to two steps: tokenizing the text, and counting how often each token occurs in each document. Those counts become the index file, so a later search for a term only has to consult the index file and can return the matching documents (summaries and so on) directly, instead of rescanning the whole corpus.
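As a rough illustration of that last point (a local sketch only; the file path is the step-two output location used above, and the tab-separated line format is assumed from that output, not from any search-engine API), the finished index file can be loaded into a map and queried directly:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class IndexLookup {
    public static void main(String[] args) throws IOException {
        // Path to the step-two output; adjust to wherever the job wrote it.
        String indexFile = "D:/inverseOut2/part-r-00000";

        // word -> "c.txt-->2<TAB>b.txt-->2<TAB>a.txt-->3"
        Map<String, String> index = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(indexFile))) {
            int tab = line.indexOf('\t');
            if (tab > 0) {
                index.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }

        // A lookup hits the prebuilt index instead of rescanning every document.
        System.out.println("hello -> " + index.get("hello"));
    }
}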


Reposted from: https://www.cnblogs.com/yaboya/p/9252313.html
