Lucene: Building an Index, Searching, and Chinese Word Segmentation

Published: 2024/1/18

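Lucene is an open-source full-text search engine toolkit from the Apache Software Foundation; it started as a subproject of the Jakarta project.
The latest Lucene release is 6.0, but that version requires JDK 1.8 or later.
So I use version 5.5.1 here: it does not need such a recent JDK and is easier to fit into existing development environments. The three parts below briefly introduce building an index, searching, and Chinese word segmentation with Lucene.
Packages used: judging by the imports in the code below, lucene-core, lucene-analyzers-common, and lucene-analyzers-smartcn (all 5.5.1), plus JUnit for the test methods.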

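Part 1: Building the index
1. Define three string arrays, one each for id, content, and city.
2. Write a getWriter() method and create the analyzer inside it.
3. Use IndexWriter (through IndexWriterConfig) to tie the directory and the analyzer together.
4. Put each array entry into a Document, then write the document out through the writer, which runs the analyzer on it.
With these steps, Lucene builds the index for us.
Code: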

import java.nio.file.FileSystems;
import java.nio.file.Path;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

public class IndexFile {

    protected String[] ids = {"1", "2"};
    protected String[] content = {"Amsterdam has lost of add cancals", "i love add this girl"};
    protected String[] city = {"Amsterdam", "Venice"};
    private Directory dir;

    /**
     * Adds the initial documents to the index.
     */
    @Test
    public void init() throws Exception {
        String filePath = "F:/1/index";
        Path path = FileSystems.getDefault().getPath(filePath); // resolve the index path
        dir = FSDirectory.open(path);                           // open the directory on the file system
        IndexWriter writer = getWriter();                       // obtain the index writer
        for (int i = 0; i < ids.length; i++) {
            Document doc = new Document();                      // one document per array entry
            doc.add(new StringField("id", ids[i], Store.YES));          // indexed as a single token
            doc.add(new TextField("content", content[i], Store.YES));   // tokenized by the analyzer
            doc.add(new StringField("city", city[i], Store.YES));
            writer.addDocument(doc);                            // write the document to the index
        }
        System.out.println("init ok?");
        writer.close();
    }

    /**
     * Creates the IndexWriter.
     */
    public IndexWriter getWriter() throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer); // attach the analyzer used for indexing
        return new IndexWriter(dir, iwc);
    }
}

Result: running init() creates the index files under F:/1/index.
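To make the idea of an index concrete, here is a minimal sketch of a toy inverted index (a map from each term to the documents that contain it) built over the two sample content strings. It uses only the Java standard library and is a simplification, not Lucene's implementation: ToyInvertedIndex, tokenize, addDocument, and lookup are hypothetical names, and lowercasing every token is only a rough stand-in for what StandardAnalyzer does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: this is NOT how Lucene stores or builds its index.
class ToyInvertedIndex {
    // term -> ids of the documents that contain it, in insertion order
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Lowercase and split on anything that is not a letter or digit
    // (an assumed, simplified stand-in for an analyzer).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Documents are expected to be added in increasing docId order.
    void addDocument(int docId, String text) {
        for (String term : tokenize(text)) {
            List<Integer> docs = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Skip the id if this document was already recorded for the term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
    }

    List<Integer> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }

    public static void main(String[] args) {
        String[] content = {"Amsterdam has lost of add cancals", "i love add this girl"};
        ToyInvertedIndex index = new ToyInvertedIndex();
        for (int i = 0; i < content.length; i++) {
            index.addDocument(i, content[i]);
        }
        System.out.println("add -> " + index.lookup("add"));             // add -> [0, 1]
        System.out.println("amsterdam -> " + index.lookup("amsterdam")); // amsterdam -> [0]
        System.out.println("Amsterdam -> " + index.lookup("Amsterdam")); // Amsterdam -> []
    }
}
```

The empty result for the capitalized Amsterdam shows why an exact term lookup has to use the same form the analyzer produced at index time.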

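Part 2: Searching
The index has been built above; now we use it to search for matching content.
1. Locate the index folder and open it for reading.
2. Hand the reader to the searcher and specify the field and term to search. Here the word add in the content field is the search condition. Note the maximum number of results to return: it is the number in search(query, 10).
Code: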

import java.nio.file.FileSystems;
import java.nio.file.Path;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

public class IndexSearch {

    /**
     * Runs a term query against the index.
     */
    @Test
    public void testSearch() throws Exception {
        String filePath = "F:/1/index";
        Path path = FileSystems.getDefault().getPath(filePath);
        Directory dir = FSDirectory.open(path);
        IndexReader reader = DirectoryReader.open(dir);      // note: a reader this time, not a writer
        IndexSearcher searcher = new IndexSearcher(reader);  // the searcher works on top of the reader
        Term term = new Term("content", "add");              // field name "content", term text "add"
        TermQuery query = new TermQuery(term);
        TopDocs topdocs = searcher.search(query, 10);        // 10 = maximum number of hits returned
        ScoreDoc[] scoreDocs = topdocs.scoreDocs;
        System.out.println("total hits---" + topdocs.totalHits + "---max score--" + topdocs.getMaxScore());
        for (int i = 0; i < scoreDocs.length; i++) {
            int doc = scoreDocs[i].doc;                      // Lucene's internal document number
            Document document = searcher.doc(doc);
            System.out.println("content====" + document.get("content"));
            System.out.println("id--" + scoreDocs[i].doc + "---score--" + scoreDocs[i].score
                    + "---index--" + scoreDocs[i].shardIndex);
        }
        reader.close();
    }
}

Results:
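The role of the number passed to search(query, 10) can be sketched with a toy top-N search over plain strings: every matching document is counted, but only the best n are returned. This is a standard-library simplification under stated assumptions, not Lucene's code: ToyTopNSearch, Hit, Result, and termFrequency are hypothetical names, the raw term count stands in for Lucene's real relevance score, and the third document ("add add add") is invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model only: Lucene computes scores with its own similarity model, not raw counts.
class ToyTopNSearch {
    // One hit: the document's position in the array and a toy score.
    static final class Hit {
        final int doc;
        final int score;
        Hit(int doc, int score) { this.doc = doc; this.score = score; }
    }

    // All matches are counted in totalHits, but only the best n are kept in top.
    static final class Result {
        final int totalHits;
        final List<Hit> top;
        Result(int totalHits, List<Hit> top) { this.totalHits = totalHits; this.top = top; }
    }

    // Count occurrences of term after lowercasing and splitting on whitespace.
    static int termFrequency(String text, String term) {
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    static Result search(String[] docs, String term, int n) {
        List<Hit> hits = new ArrayList<>();
        for (int i = 0; i < docs.length; i++) {
            int tf = termFrequency(docs[i], term);
            if (tf > 0) {
                hits.add(new Hit(i, tf));
            }
        }
        // Highest score first; List.sort is stable, so ties keep document order.
        hits.sort((a, b) -> Integer.compare(b.score, a.score));
        List<Hit> top = new ArrayList<>(hits.subList(0, Math.min(n, hits.size())));
        return new Result(hits.size(), top);
    }

    public static void main(String[] args) {
        String[] docs = {
            "Amsterdam has lost of add cancals",
            "i love add this girl",
            "add add add" // hypothetical extra document
        };
        Result r = search(docs, "add", 2);
        System.out.println("total hits: " + r.totalHits); // total hits: 3
        for (Hit h : r.top) {
            System.out.println("doc " + h.doc + " score " + h.score);
        }
        // doc 2 score 3
        // doc 0 score 1
    }
}
```

In the toy, as in the article's code, the total hit count and the length of the returned list are separate values, which is why the article prints totalHits and then loops over scoreDocs.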

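Part 3: Chinese word segmentation
Lucene ships with its own tokenizers. To handle text that mixes Chinese and English, we use the lucene-analyzers-smartcn-5.5.1.jar package here, which supports both fairly well.
1. Define a passage that contains both Chinese and English.
2. Define custom stop words. Stop words are words that Lucene filters out during analysis: they are dropped from the token stream instead of being emitted as tokens.
3. Merge the custom stop words with the analyzer's default stop words.
4. Pass the text to the analyzer, which splits it into tokens.
Code: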

import java.util.Iterator;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;

public class SmartChineseAnalyzerTest {

    public static void main(String[] args) {
        try {
            // Text to process
            String text = "lucene分析器使用分詞器和過濾器構成一個“管道”，文本在流經這個管道后成為可以進入索引的最小單位，因此，一個標準的分析器有兩個部分組成，一個是分詞器tokenizer,它用于將文本按照規則切分為一個個可以進入索引的最小單位。另外一個是TokenFilter，它主要作用是對切出來的詞進行進一步的處理（如去掉敏感詞、英文大小寫轉換、單復數處理）等。lucene中的Tokenstram方法首先創建一個tokenizer對象處理Reader對象中的流式文本，然后利用TokenFilter對輸出流進行過濾處理";

            // Custom stop words
            String[] self_stop_words = { "的", "了", "呢", "，", "0", "：", ",", "是", "流" };
            CharArraySet cas = new CharArraySet(0, true); // true = ignore case
            for (int i = 0; i < self_stop_words.length; i++) {
                cas.add(self_stop_words[i]);
            }

            // Add the analyzer's default stop words
            Iterator<Object> itor = SmartChineseAnalyzer.getDefaultStopSet().iterator();
            while (itor.hasNext()) {
                cas.add(itor.next());
            }

            // Mixed Chinese/English analyzer (the other analyzers I tried did not handle Chinese well)
            SmartChineseAnalyzer sca = new SmartChineseAnalyzer(cas);
            TokenStream ts = sca.tokenStream("field", text);
            CharTermAttribute ch = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(ch.toString()); // one token per line
            }
            ts.end();
            ts.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

Partial results:
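The effect of a stop-word set can be illustrated without the analyzer itself. The sketch below filters an already-segmented token list through a case-insensitive stop set, roughly what happens to the token stream once the stop words are configured. It is a standard-library toy under stated assumptions: ToyStopFilter and filter are hypothetical names, the token list is hand-written rather than produced by SmartChineseAnalyzer, and the English stop word "the" is added only to show the case-insensitive match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only: it filters tokens and performs no word segmentation itself.
class ToyStopFilter {
    // Stop words are stored lowercased so that matching ignores case.
    private final Set<String> stopWords = new HashSet<>();

    ToyStopFilter(String... words) {
        for (String w : words) {
            stopWords.add(w.toLowerCase(Locale.ROOT));
        }
    }

    // Return the tokens that are not stop words, in their original order.
    List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t.toLowerCase(Locale.ROOT))) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "的", "了", "是" come from the article's list; "the" is an assumed extra.
        ToyStopFilter stopFilter = new ToyStopFilter("的", "了", "是", "the");
        // Hand-segmented, hypothetical token list.
        List<String> tokens = Arrays.asList("lucene", "的", "分析器", "是", "The", "tokenizer");
        System.out.println(stopFilter.filter(tokens)); // [lucene, 分析器, tokenizer]
    }
}
```

Removed tokens simply never reach the output, which matches the article's description of stop words as words the analyzer ignores.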
