當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

Hadoop 倒排索引

發(fā)布時間：2023/11/30 编程问答 32 豆豆

生活随笔收集整理的這篇文章主要介紹了 Hadoop 倒排索引小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

　　倒排索引是文檔檢索系統(tǒng)中最常用的數(shù)據(jù)結(jié)構(gòu)，被廣泛地應(yīng)用于全文搜索引擎。它主要是用來存儲某個單詞（或詞組）在一個文檔或一組文檔中存儲位置的映射，即提供了一種根據(jù)內(nèi)容來查找文檔的方式。由于不是根據(jù)文檔來確定文檔所包含的內(nèi)容，而是進行相反的操作，因而稱為倒排索引（Inverted Index）。

一、實例描述

　　倒排索引簡單地就是，根據(jù)單詞，返回它在哪個文件中出現(xiàn)過，而且頻率是多少的結(jié)果。這就像百度里的搜索，你輸入一個關(guān)鍵字，那么百度引擎就迅速的在它的服務(wù)器里找到有該關(guān)鍵字的文件，并根據(jù)頻率和其他的一些策略（如頁面點擊投票率）等來給你返回結(jié)果。這個過程中，倒排索引就起到很關(guān)鍵的作用。

　　樣例輸入：

　　樣例輸出：

二、設(shè)計思路

　　倒排索引涉及幾個過程：Map過程，Combine過程，Reduce過程。

　　Map過程：?

　　當你把需要處理的文檔上傳到hdfs時，首先默認的TextInputFormat類對輸入的文件進行處理，得到文件中每一行的偏移量和這一行內(nèi)容的鍵值對<偏移量，內(nèi)容>做為map的輸入。在改寫map函數(shù)的時候，我們就需要考慮，怎么設(shè)計key和value的值來適合MapReduce框架，從而得到正確的結(jié)果。由于我們要得到單詞,所屬的文檔URL,詞頻，而<key,value>只有兩個值，那么就必須得合并其中得兩個信息了。這里我們設(shè)計key=單詞＋URL，value=詞頻。即map得輸出為<單詞＋URL，詞頻>，之所以將單詞＋URL做為key，時利用MapReduce框架自帶得Map端進行排序。

　　Combine過程：

　　Combine過程將key值相同得value值累加，得到一個單詞在文檔上得詞頻。但是為了把相同得key交給同一個reduce處理，我們需要設(shè)計為key=單詞，value＝URL+詞頻。

　　Reduce過程：

　　Reduce過程其實就是一個合并的過程了，只需將相同的key值的value值合并成倒排索引需要的格式即可。

三、程序代碼

　　程序代碼如下：

1 import java.io.IOException; 2 import java.util.StringTokenizer; 3 4 import org.apache.hadoop.conf.Configuration; 5 import org.apache.hadoop.fs.Path; 6 import org.apache.hadoop.io.LongWritable; 7 import org.apache.hadoop.io.Text; 8 import org.apache.hadoop.mapreduce.Job; 9 import org.apache.hadoop.mapreduce.Mapper; 10 import org.apache.hadoop.mapreduce.Reducer; 11 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 12 import org.apache.hadoop.mapreduce.lib.input.FileSplit; 13 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 14 import org.apache.hadoop.util.GenericOptionsParser; 15 16 17 public class InvertedIndex { 18 19 public static class Map extends Mapper<LongWritable, Text, Text, Text>{ 20 private static Text word = new Text(); 21 private static Text one = new Text(); 22 23 @Override 24 protected void map(LongWritable key, Text value,Mapper<LongWritable, Text, Text, Text>.Context context) 25 throws IOException, InterruptedException { 26 // super.map(key, value, context); 27 String fileName = ((FileSplit)context.getInputSplit()).getPath().getName(); 28 StringTokenizer st = new StringTokenizer(value.toString()); 29 while (st.hasMoreTokens()) { 30 word.set(st.nextToken()+"\t"+fileName); 31 context.write(word, one); 32 } 33 } 34 } 35 36 public static class Combine extends Reducer<Text, Text, Text, Text>{ 37 private static Text word = new Text(); 38 private static Text index = new Text(); 39 40 @Override 41 protected void reduce(Text key, Iterable<Text> values,Reducer<Text, Text, Text, Text>.Context context) 42 throws IOException, InterruptedException { 43 // super.reduce(arg0, arg1, arg2); 44 String[] splits = key.toString().split("\t"); 45 if (splits.length != 2) { 46 return ; 47 } 48 long count = 0; 49 for(Text v:values){ 50 count++; 51 } 52 word.set(splits[0]); 53 index.set(splits[1]+":"+count); 54 context.write(word, index); 55 } 56 } 57 58 public static class Reduce extends Reducer<Text, Text, Text, Text>{ 59 private static StringBuilder sub = new StringBuilder(256); 60 private static Text index = new Text(); 61 62 @Override 63 protected void reduce(Text word, Iterable<Text> values,Reducer<Text, Text, Text, Text>.Context context) 64 throws IOException, InterruptedException { 65 // super.reduce(arg0, arg1, arg2); 66 for(Text v:values){ 67 sub.append(v.toString()).append(";"); 68 } 69 index.set(sub.toString()); 70 context.write(word, index); 71 sub.delete(0, sub.length()); 72 } 73 } 74 75 public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException { 76 Configuration conf = new Configuration(); 77 String[] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs(); 78 if(otherArgs.length!=2){ 79 System.out.println("Usage:wordcount <in> <out>"); 80 System.exit(2); 81 } 82 Job job = new Job(conf,"Invert Index "); 83 job.setJarByClass(InvertedIndex.class); 84 85 job.setMapperClass(Map.class); 86 job.setCombinerClass(Combine.class); 87 job.setReducerClass(Reduce.class); 88 89 job.setMapOutputKeyClass(Text.class); 90 job.setMapOutputValueClass(Text.class); 91 job.setOutputKeyClass(Text.class); 92 job.setOutputValueClass(Text.class); 93 94 FileInputFormat.addInputPath(job,new Path(args[0])); 95 FileOutputFormat.setOutputPath(job, new Path(args[1])); 96 System.exit(job.waitForCompletion(true)?0:1); 97 } 98 99 }

轉(zhuǎn)載于:https://www.cnblogs.com/xiaoyh/p/9361356.html

創(chuàng)作挑戰(zhàn)賽新人創(chuàng)作獎勵來咯，堅持創(chuàng)作打卡瓜分現(xiàn)金大獎

總結(jié)

以上是生活随笔為你收集整理的Hadoop 倒排索引的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯，歡迎將生活随笔推薦給好友。

上一篇： Maven Web项目解决跨域问题
下一篇：幸运三角形南阳acm491（dfs）