
Lucene Full-Text Search

Published: 2024/7/19


Contents:
1. Full-text search
2. Getting started with Lucene
3. Advanced Lucene

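Full-text search

I. Search in everyday life:
1. Windows has a built-in search: open "My Computer" and press F3 to find a given file or folder. The search covers every file on the machine.

2. The Eclipse help system: choose Help > Help Contents to look up help topics. The search covers all of Eclipse's help files.
3. Search engines such as Baidu or Google, which find web pages, PDF, DOC, and PPT files, images, music, and video across the internet.
4. Spotlight search on the Mac.
5. Searching a database for a keyword, for example:
select * from topic where content like '%java%'
This is a text match with a leading wildcard, so the database cannot use an ordinary index on the column and has to scan the rows.

Problems with this approach:
1. Searching is slow.
2. The results are poor.
3. There is no ranking by relevance.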

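II. What is full-text search?

In full-text search, an indexing program scans every word in a document and builds an index entry for each one, recording how many times and where the word occurs. When a user submits a query, the search program looks it up in the index built beforehand and returns the matches. The process resembles looking up a character through the index table of a dictionary.

Before going further, it helps to classify data:

Structured data: data with a fixed format or bounded length, such as database records and metadata.
Semi-structured data: data that sits between the two, such as XML or HTML, which can be handled as structured data or have its plain text extracted and handled as unstructured data.
Unstructured data: data of variable length or with no fixed format, such as e-mail and Word documents.
Unstructured data is also called full-text data, and searching within full-text data is called full-text search.
Characteristic: full-text search looks only at the text; it does not take meaning into account.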

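III. Why use full-text search?
Search speed: all data in the data source is indexed in advance, so a query consults the index instead of scanning the raw data.

Match quality: matching is done on terms, and by implementing the language analysis interface it can support Chinese and other non-English languages.

Relevance: a scoring algorithm ranks the results with the highest degree of match (similarity) first.

Typical use cases: when a relational database runs a fuzzy query, its built-in indexes do not help, and full-text search is needed to make the query fast. For example:
Fuzzy queries over content in a website, such as
select * from article where content like '%上海平安%'
Fuzzy queries over product data in an ERP system, article search in a BBS or blog, and so on.
Search engines of every kind, which are built on full-text search.
Indexing and searching only the sites of one domain (vertical search, such as "818工作搜索" for jobs or "有道購物搜索" for shopping).
Searching the contents of Word, PDF, and many other file formats.
Other settings, such as the Sogou Pinyin input method and Google's input method.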

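IV. How it works

1. How is full-text data searched?

Serial scanning: to find the files that contain a given string, look at the documents one by one and read each from beginning to end. If a document contains the string it is a match; then move on to the next file until every file has been scanned. The search built into Windows works this way.
How can full-text search be made faster?

Serial scanning of unstructured data is slow, while searching structured data is comparatively fast, because its structure lets a search algorithm take shortcuts. So why not give the unstructured data some structure? Relational databases store structured data, which is why their lookups are fast.
The information extracted from unstructured data and then reorganized is what we call an index.
This is the same principle as the index of a dictionary or the table of contents of a book.

2. The full-text search process

Indexing: extracting information from all the structured and unstructured data of interest and building an index from it.
Searching: taking the user's query, searching the index that was built, and returning the results.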

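3. Case study

What should the index store?
The index only needs to store each word together with the IDs of the documents that contain it.
To find the documents that contain is, shanghai, and pingan, first fetch the list of documents containing is, then the lists for shanghai and pingan, and finally intersect the lists. The result is document 1 and document 3.

Document 0: What is your name?
Document 1: My name is shanghai pingan!
Document 2: What is that?
Document 3: It is shanghai pingan, ShangHai Pingan

First, the unstructured data is stored in the document area:

Document ID	Content
0	What is your name?
1	My name is shanghai pingan!
2	What is that?
3	It is shanghai pingan, ShangHai Pingan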

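How is the index built?
Step 1: the tokenizer processes the documents. This step is called tokenizing.
1. Split each document into individual words (at spaces).
2. Remove punctuation.
3. Remove stop words, that is, very frequent function words such as is and it (in Chinese: 的, 了, 呢).
The output of the tokenizer is a sequence of tokens:
What, your, name, My, name, shanghai, pingan, What, that, shanghai, pingan, ShangHai, Pingan

Step 2: pass the tokens to the linguistic processor. For English it roughly does the following:
1. Convert to lowercase.
2. Reduce words to their stem by rule, for example "cars" to "car". This is called stemming.
3. Map words to their dictionary form, for example "drove" to "drive". This is called lemmatization.
The output of the linguistic processor is a set of terms:
shanghai, pingan, my, name, what, your, that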
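The first two steps can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not how Lucene's analyzers are implemented: the class name AnalysisSketch and the two-word stop list are made up for this example, stemming and lemmatization are left out, and lowercasing is done before the stop-word check so that "It" is caught as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for an analyzer; not Lucene code.
final class AnalysisSketch {
    // Assumed stop-word list, taken from the example in the text.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("is", "it"));

    /** Splits on anything that is not a letter or digit, lowercases, drops stop words. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // a leading separator produces an empty first element
            }
            String term = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("It is shanghai pingan, ShangHai Pingan"));
    }
}
```

Calling analyze on document 3 leaves only the shanghai and pingan terms, each twice, which is what the indexer receives in the next step.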

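Step 3: pass the terms to the indexer, which processes them as follows.
1. Build a dictionary from the terms:

Term	Document ID

what	0
your	0
name	0
my	1
name	1
shanghai	1
pingan	1
what	2
that	2
shanghai	3
pingan	3
shanghai	3
pingan	3

2. Sort the dictionary alphabetically:

Term	Document ID

my	1
name	0
name	1
pingan	1
pingan	3
pingan	3
shanghai	1
shanghai	3
shanghai	3
that	2
what	0
what	2
your	0

3. Merge identical terms into posting lists. Each row gives the term, its total number of occurrences, and then pairs of document ID and frequency within that document.

Term	Occurrences	Document ID	Frequency	Document ID	Frequency

my	1	1	1	~	~
name	2	0	1	1	1
pingan	3	1	1	3	2
shanghai	3	1	1	3	2
that	1	2	1	~	~
what	2	0	1	2	1
your	1	0	1	~	~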

In the end two parts are stored: a document area and an index area. The index area looks like this:

Term	Document IDs

what	0,2
your	0
name	0,1
my	1
shanghai	1,3,3
pingan	1,3,3
that	2
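To make the index area concrete, here is a minimal sketch of an inverted index in plain Java, together with the intersection used to answer a query for several words. Everything in it (the class InvertedIndexSketch, its methods, the sample() helper) is hypothetical and written for this article's four documents. Unlike the walkthrough above it keeps stop words, so the query for is, shanghai, and pingan from the case study can be run as stated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory inverted index; real Lucene indexes are far more elaborate.
final class InvertedIndexSketch {
    // term -> sorted IDs of the documents that contain it
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** IDs of the documents that contain every given term (intersection of posting lists). */
    List<Integer> searchAll(String... terms) {
        TreeSet<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs = postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    /** The four example documents from the case study. */
    static InvertedIndexSketch sample() {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "What is your name?");
        index.add(1, "My name is shanghai pingan!");
        index.add(2, "What is that?");
        index.add(3, "It is shanghai pingan, ShangHai Pingan");
        return index;
    }

    public static void main(String[] args) {
        System.out.println(sample().searchAll("is", "shanghai", "pingan"));
    }
}
```

A sorted set per term keeps the posting lists ordered, which is what makes intersecting them cheap in a real engine; here retainAll does the same job for clarity.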

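The search side works roughly as follows:
1. Receive the search words and keywords entered by the user and do some simple preprocessing.
2. Run lexical analysis, syntax analysis, and language processing on the query.
3. Look up the list of documents containing the query terms and apply the required Boolean logic.
4. Rank the documents by relevance and return the most relevant ones.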

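4. Document relevance

Computing term weights:
1. Working out how important a term is to a document is called computing the term weight. Two factors dominate:
A. Term frequency (tf): how many times the term occurs in this document. The larger tf is, the more important the term.
B. Document frequency (df): how many documents contain the term. The larger df is, the less important the term.

2. Relating the terms to one another to obtain a relevance score for the document is done with the vector space model (VSM).
How it works: treat a document as a series of terms, each with a term weight. Each term influences the document's relevance score according to its weight in that document.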
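As a rough illustration of these two ideas, the sketch below computes a textbook tf-idf weight and the cosine similarity between two weight vectors. It is an assumption-laden simplification: the formula tf * ln(N / df) is the classroom version, Lucene's actual scoring adds further factors, and the class and method names are invented for this example.

```java
// Hypothetical helper illustrating term weights and the vector space model.
final class TfIdfSketch {
    /** Textbook weight: grows with tf, shrinks as more documents contain the term. */
    static double weight(int tf, int df, int totalDocs) {
        if (tf <= 0 || df <= 0) {
            return 0.0;
        }
        return tf * Math.log((double) totalDocs / df);
    }

    /** Cosine of the angle between two weight vectors of equal length. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // "shanghai" in document 3 of the case study: tf = 2, df = 2, four documents.
        System.out.println(weight(2, 2, 4));
    }
}
```

In the vector space model the query is turned into a weight vector too, and documents are ranked by the cosine between their vector and the query's.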

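5. Full-text search application architecture

6. How the full-text search workflow maps to Lucene's packages

Lucene's analysis module handles lexical analysis and language processing, producing terms.
Lucene's index module handles index creation; it contains IndexWriter.
Lucene's store module handles reading and writing the index.
Lucene's QueryParser handles syntax analysis of queries.
Lucene's search module handles searching the index.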

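Getting started with Lucene

What is Lucene?

Lucene is a high-performance, scalable full-text search engine toolkit written in Java. It is easy to embed in an application to provide indexing and search tailored to that application. Lucene's goal is to add full-text search to small and medium-sized applications.

Development steps

Building the index

1. Create a test class, LuceneTest.
2. Add the jars:
lucene-core-4.10.4.jar (core)
lucene-analyzers-common-4.10.4.jar (analyzers)
3. Create an IndexWriter, passing it the location where the index will be stored and an IndexWriterConfig (which carries the version and the analyzer).
4. The index is written as binary files, which are hard to read directly, so use the Luke tool (lukeall-4.10.0.jar) to inspect it.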

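import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

public class LuceneTest {
    String content1 = "hello world";
    String content2 = "hello java world";
    String content3 = "hello lucene world";
    String indexPath = "hello";
    Analyzer analyzer = new StandardAnalyzer(); // analyzer

    @Test
    public void testCreateIndex() throws Exception {
        // 1. Create the index writer
        Directory d = FSDirectory.open(new File(indexPath)); // where the index is stored
        // Configuration object for the index writer
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
        IndexWriter writer = new IndexWriter(d, conf);

        // 2. Write the documents
        // Storage rules for the fields
        FieldType type = new FieldType();
        type.setIndexed(true); // index the field?
        type.setStored(true);  // store the field?

        Document document1 = new Document(); // comparable to one row in a database
        // new Field(field name, field value, field configuration)
        document1.add(new Field("title", "doc1", type));      // title: doc1
        document1.add(new Field("content", content1, type));  // content: hello world
        writer.addDocument(document1);

        Document document2 = new Document();
        document2.add(new Field("title", "doc2", type));
        document2.add(new Field("content", content2, type));
        writer.addDocument(document2);

        Document document3 = new Document();
        document3.add(new Field("title", "doc3", type));
        document3.add(new Field("content", content3, type));
        writer.addDocument(document3);

        // Persist the added documents
        writer.commit();
        writer.close();
    }
}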

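Running the test class creates a hello folder under the project directory.

Opening the _0.cfs file in that folder shows nothing readable, because the index is stored in a binary format.

Use the Luke tool (lukeall-4.10.0.jar) to inspect the index.
Start it from a terminal with java -jar lukeall-4.10.0.jar

In the Path field, enter the absolute path of the hello index directory.

Click OK to open the index.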

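Searching the index

0. Add the jar lucene-queryparser-4.10.4.jar, which turns a query string into a Query object.
1. Create a test method, searchIndex().
2. Create an IndexSearcher.
3. Parse the search text into a Query with QueryParser, specifying the field to search and the analyzer.
4. Run the query to obtain the IDs of the matching documents.
5. Fetch the content of each document by its ID.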

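// Searching the index. Add this method to LuceneTest.
// Additional imports: org.apache.lucene.index.DirectoryReader, org.apache.lucene.index.IndexReader,
// org.apache.lucene.queryparser.classic.QueryParser, org.apache.lucene.search.IndexSearcher,
// org.apache.lucene.search.Query, org.apache.lucene.search.ScoreDoc, org.apache.lucene.search.TopDocs
@Test
public void searchIndex() throws Exception {
    // 1. Open the index directory
    Directory d = FSDirectory.open(new File(indexPath));
    // Create the analyzer
    Analyzer analyzer = new StandardAnalyzer();
    // Open a reader on the index
    IndexReader r = DirectoryReader.open(d);
    // Create the searcher
    IndexSearcher searcher = new IndexSearcher(r);

    QueryParser parser = new QueryParser("content", analyzer);
    Query query = parser.parse("hello"); // search for hello

    // search(query, n): return the top n matching documents
    TopDocs search = searcher.search(query, 10000);
    System.out.println("Number of matching documents: " + search.totalHits);

    ScoreDoc[] scoreDocs = search.scoreDocs;
    for (int i = 0; i < scoreDocs.length; i++) {
        System.out.println("*******************************");
        System.out.println("score: " + scoreDocs[i].score); // relevance score used for ranking
        int docId = scoreDocs[i].doc;                        // document ID
        Document document = searcher.doc(docId);
        System.out.println("docId ---> " + docId);
        System.out.println("content ---> " + document.get("content"));
    }
    r.close();
}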

The method prints the number of hits, followed by the score, document ID, and content of each hit.

Common API

Directory: the index directory, which holds the Lucene index files.
Directory is an abstraction over where the index lives. The index can be stored in ordinary files, in a database, or in some other remote service. In most cases it is stored in files, and a Directory then corresponds to a folder.

SimpleFSDirectory: operates on the index files directly through java.io.RandomAccessFile. It is the simplest choice and is adequate for ordinary Lucene applications.
Constructors:
SimpleFSDirectory(File path): creates the index directory from a folder path.
MMapDirectory(File path): asks the OS to map the index files into virtual address space, so that Lucene can read the index as if it were in memory.

Document: when content is added to the index, each item is represented by a Document. A Document can be thought of as a record, much like a row in a relational table.
Once a Document has been created, its fields are manipulated through the methods it provides.
The main methods of Document are:
Adding a field: add(Field field)
Removing fields: removeField, removeFields
Reading fields or values: get, getBinaryValue, getField, getFields, and so on

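Field: a Field is one piece of data inside a Document, comparable to one column of a Lucene record.
Lucene 3.x provides an interface, Fieldable, and most of the API is written against it, so a column object is really defined by Fieldable. Besides Field, NumericField also implements it. (In Lucene 4.x, the version used in the code in this article, the corresponding interface is IndexableField, and NumericField has been replaced by IntField, LongField, and similar classes.) In practice the Field class is the one used most.
Commonly used constructors of Field (these signatures follow the 3.x API; in 4.x the Field.Store/Field.Index variants are deprecated in favour of FieldType):
1. Field(String name, String value, Field.Store store, Field.Index index): creates a field from a name, a value, a storage mode, and an indexing mode.
2. Field(String name, byte[] value, Field.Store store): creates a field from a name, a binary value, and a storage mode.
3. Field(String name, Reader reader): creates a field from a name and a Reader.
4. Other constructors: see the API documentation.
new Field("title", "中国太平", Store.NO, Index.ANALYZED);
new Field("content", "比较好的保险公司", Store.YES, Index.ANALYZED);

FieldType: when creating a Field in Lucene you can specify its stored and indexed properties.
stored: whether the field value is stored; true means it is stored, false means it is not.
type.setStored(true); // store the value in the document area?
indexed: whether the field is indexed, that is, whether it can be searched.
tokenized: whether the field value is split into tokens by the Analyzer.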

Create a FieldTest test class (copy the class above and rename it).
Define the storage rules for the field:

FieldType type2 = new FieldType();
type2.setIndexed(true);    // index this field?
type2.setStored(true);     // store it in the document area?
type2.setTokenized(false); // tokenize this field?

Use type2 as the configuration of every content field:

document1.add(new Field("content", content1, type2));
document2.add(new Field("content", content2, type2));
document3.add(new Field("content", content3, type2));

The complete class (same imports as LuceneTest):

public class FieldTest {
    String content1 = "hello world";
    String content2 = "hello java world";
    String content3 = "hello lucene world";
    String indexPath = "fieldType";
    Analyzer analyzer = new StandardAnalyzer(); // analyzer

    // Create the index
    @Test
    public void testCreateIndex() throws Exception {
        // 1. Create the index writer
        Directory d = FSDirectory.open(new File(indexPath)); // where the index is stored
        // Configuration object for the index writer
        IndexWriterConfig confg = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
        confg.setOpenMode(IndexWriterConfig.OpenMode.CREATE); // recreate the index on every run
        IndexWriter writer = new IndexWriter(d, confg);

        // 2. Write the documents
        // Storage rules for the title field: indexed, stored, tokenized
        FieldType type = new FieldType();
        type.setIndexed(true);
        type.setStored(true);
        type.setTokenized(true);

        // Storage rules for the content field: indexed, stored, not tokenized
        FieldType type2 = new FieldType();
        type2.setIndexed(true);
        type2.setStored(true);
        type2.setTokenized(false);

        Document document1 = new Document(); // comparable to one row in a database
        // new Field(field name, field value, field configuration)
        document1.add(new Field("title", "doc1", type));       // title: doc1
        document1.add(new Field("content", content1, type2));  // content: hello world
        writer.addDocument(document1);

        Document document2 = new Document();
        document2.add(new Field("title", "doc2", type));
        document2.add(new Field("content", content2, type2));
        writer.addDocument(document2);

        Document document3 = new Document();
        document3.add(new Field("title", "doc3", type));
        document3.add(new Field("content", content3, type2));
        writer.addDocument(document3);

        // Persist the added documents
        writer.commit();
        writer.close();
    }
}

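Run the test class.

Then inspect the index.

When searching for user names or place names, you usually want the value kept as one complete term rather than split up. In that case set the field's tokenized property to false so that it is not analyzed.
As a further example, consider an index of articles in which:
1. The title and the body are both analyzed and indexed.
2. The title is stored in full in the document area, while only the first 30 characters of the body are stored.
3. The article ID is stored in the document area but is not tokenized.
4. The time, author, view count, comment count, and source are neither indexed nor stored.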

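Analyzer (lexical analyzer)

Create a test class, AnalyzerTest.
Add a helper method, analyzerMethod(Analyzer analyzer, String content), that is used to try out each analyzer. It is a plain helper rather than a test, so it must not carry @Test: JUnit test methods cannot take parameters.

// Imports: org.apache.lucene.analysis.Analyzer, org.apache.lucene.analysis.TokenStream,
// org.apache.lucene.analysis.core.SimpleAnalyzer, org.junit.Test
public class AnalyzerTest {
    String en = "good morning boy";
    String ch = "你好恭喜发财东方明珠三生三世十里桃花";

    // Prints every token the analyzer produces for the given text
    public void analyzerMethod(Analyzer analyzer, String content) throws Exception {
        TokenStream tokenStream = analyzer.tokenStream("content", content);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            System.out.println(tokenStream);
        }
        tokenStream.end();
        tokenStream.close();
    }

    // SimpleAnalyzer on English text
    @Test
    public void testSimpleAnalyzer() throws Exception {
        analyzerMethod(new SimpleAnalyzer(), en);
    }
}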

English analyzers:
SimpleAnalyzer: the simplest analyzer. It splits the text at every non-letter character (spaces, digits, punctuation) and lowercases the result. The testSimpleAnalyzer() method in the class above exercises it.

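StandardAnalyzer: analyzes text by English word and by Chinese character.

// StandardAnalyzer on English text
@Test
public void testStandardAnalyzer() throws Exception {
    analyzerMethod(new StandardAnalyzer(), en);
}

For English, StandardAnalyzer also splits at spaces and punctuation; in addition it lowercases the tokens and removes English stop words.
Next, test it on Chinese text, which it splits into single characters:

// StandardAnalyzer on Chinese text
@Test
public void testStandardAnalyzerChinese() throws Exception {
    analyzerMethod(new StandardAnalyzer(), ch);
}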

PerFieldAnalyzerWrapper: applies a different analyzer to each field.

// Imports: java.util.HashMap, java.util.Map,
// org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper
@Test
public void testPerFieldAnalyzerWrapper() throws Exception {
    Map<String, Analyzer> analyzerMap = new HashMap<>();
    analyzerMap.put("en", new SimpleAnalyzer());   // field "en" uses SimpleAnalyzer
    analyzerMap.put("ch", new StandardAnalyzer()); // field "ch" uses StandardAnalyzer
    // The first argument is the default analyzer
    PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new SimpleAnalyzer(), analyzerMap);
    // The wrapper picks the analyzer by the field name passed to tokenStream (the first argument).
    // If no analyzer is registered for that field, the default analyzer is used.
    // "content" is not in the map, so the default SimpleAnalyzer handles this call.
    TokenStream tokenStream = wrapper.tokenStream("content", ch);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        System.out.println(tokenStream);
    }
    tokenStream.end();
    tokenStream.close();
}

Chinese analyzers:
StandardAnalyzer: unigram segmentation, which treats every single character as a word. This is the behaviour seen when StandardAnalyzer is run on the Chinese sample text in the test earlier in this section.

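CJKAnalyzer: bigram segmentation, which treats every two adjacent characters as a word. For example, 我们是中国人 becomes 我们, 们是, 是中, 中国, 国人.

// CJKAnalyzer on Chinese text (import org.apache.lucene.analysis.cjk.CJKAnalyzer)
@Test
public void testCJKAnalyzer() throws Exception {
    analyzerMethod(new CJKAnalyzer(), ch);
}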
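The bigram idea itself is easy to reproduce outside Lucene. The sketch below is a hypothetical helper, not CJKAnalyzer's implementation: it simply slides a two-character window over the string, ignores characters outside the Basic Multilingual Plane, and does none of the script detection a real analyzer performs. The sample string in main is 我们是中国人 written with Unicode escapes so the source stays ASCII-only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of bigram segmentation.
final class BigramSketch {
    /** Returns every pair of adjacent characters in the text. */
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Six characters yield five overlapping bigrams.
        System.out.println(bigrams("\u6211\u4eec\u662f\u4e2d\u56fd\u4eba"));
    }
}
```

Because the windows overlap, a text of n characters produces n - 1 terms, which is why bigram indexes are larger than dictionary-based ones but need no word list.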

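SmartChineseAnalyzer: dictionary-based segmentation. All Chinese words are kept in a dictionary that is maintained by some algorithm, and whenever a stretch of text matches a dictionary entry it is split off as a word. Dictionary-based segmentation is generally regarded as the best approach for Chinese. For example, 我们是中国人 yields words such as 我们 and 中国人. (Options include SmartChineseAnalyzer, the JE-Analysis MMAnalyzer, the Paoding analyzer, and IKAnalyzer. IKAnalyzer is recommended.)

// SmartChineseAnalyzer on Chinese text
// Requires lucene-analyzers-smartcn-4.10.4.jar
// (import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer)
@Test
public void testSmartChineseAnalyzer() throws Exception {
    analyzerMethod(new SmartChineseAnalyzer(), ch);
}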

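IKAnalyzer: a third-party analyzer.
1. Add the jar IKAnalyzer2012FF_u1.jar (it is not available in the Maven Central repository). It supports stop words and custom extension words.
2. Add a stop-word dictionary, stopword.dic.
3. Add an extension dictionary, ext.dic.

// IKAnalyzer on Chinese text
// Requires IKAnalyzer2012FF_u1.jar (import org.wltea.analyzer.lucene.IKAnalyzer)
@Test
public void testIKAnalyzer() throws Exception {
    analyzerMethod(new IKAnalyzer(), ch);
}

To drop particles such as 的, 了, and 吗, add the configuration file IKAnalyzer.cfg.xml and the dictionary stopword.dic.

List the words you do not want in stopword.dic, and the analyzer will no longer emit them as terms.

To add terms of your own, configure an extra dictionary file, ext.dic, in IKAnalyzer.cfg.xml,
and list the custom terms in ext.dic.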

Updating the index

// Same imports as LuceneTest, plus org.apache.lucene.index.Term and the search imports
// listed for searchIndex()
public class CRUDTest {
    String content1 = "hello world";
    String content2 = "hello java world";
    String content3 = "hello lucene world";
    String indexPath = "luncecrud";
    Analyzer analyzer = new StandardAnalyzer(); // analyzer

    // Create the index
    @Test
    public void testCreateIndex() throws Exception {
        // 1. Create the index writer
        Directory d = FSDirectory.open(new File(indexPath)); // where the index is stored
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
        IndexWriter writer = new IndexWriter(d, conf);

        // 2. Write the documents
        FieldType type = new FieldType();
        type.setIndexed(true); // index the field?
        type.setStored(true);  // store the field?

        Document document1 = new Document();
        document1.add(new Field("title", "doc1", type));
        document1.add(new Field("content", content1, type));
        writer.addDocument(document1);

        Document document2 = new Document();
        document2.add(new Field("title", "doc2", type));
        document2.add(new Field("content", content2, type));
        writer.addDocument(document2);

        Document document3 = new Document();
        document3.add(new Field("title", "doc3", type));
        document3.add(new Field("content", content3, type));
        writer.addDocument(document3);

        // Persist the added documents
        writer.commit();
        writer.close();
        testSearch();
    }

    @Test
    public void testUpdate() throws Exception {
        // Create the index writer
        Directory d = FSDirectory.open(new File(indexPath));
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
        IndexWriter writer = new IndexWriter(d, config);

        Term term = new Term("title", "doc2"); // which documents to update
        Document updateDoc = new Document();   // the document after the update
        FieldType type = new FieldType();
        type.setIndexed(true);
        type.setStored(true);
        updateDoc.add(new Field("title", "doc2", type));
        updateDoc.add(new Field("content", "hello黄河之水天上来吧我要更新内容啦", type));
        writer.updateDocument(term, updateDoc);

        // Commit the update and release resources
        writer.commit();
        writer.close();
        testSearch();
    }

    // Search the index
    @Test
    public void testSearch() throws Exception {
        // 1. Open the index directory and a reader on it
        Directory d = FSDirectory.open(new File(indexPath));
        IndexReader r = DirectoryReader.open(d);
        IndexSearcher searcher = new IndexSearcher(r);

        QueryParser parser = new QueryParser("content", analyzer);
        Query query = parser.parse("hello"); // search for hello

        // search(query, n): return the top n matching documents
        TopDocs search = searcher.search(query, 10000);
        System.out.println("Number of matching documents: " + search.totalHits);

        ScoreDoc[] scoreDocs = search.scoreDocs;
        for (int i = 0; i < scoreDocs.length; i++) {
            System.out.println("*******************************");
            System.out.println("score: " + scoreDocs[i].score); // relevance score used for ranking
            int docId = scoreDocs[i].doc;                        // document ID
            Document document = searcher.doc(docId);
            System.out.println("docId ---> " + docId);
            System.out.println("title ---> " + document.get("title"));
            System.out.println("content ---> " + document.get("content"));
        }
        r.close();
    }
}

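First write the method that creates the index, testCreateIndex(), and the method that searches it, testSearch(); then write the update method, testUpdate().
Run testCreateIndex() first.

Then run testUpdate().

The content of the document titled doc2 is replaced with the new content, and the document IDs change: the document with ID 1 is deleted and a document with ID 3 is added. This shows that an update is carried out as a delete followed by an add.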
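The delete-then-add behaviour can be mimicked with a tiny append-only store. This is a hypothetical model written for this article, not Lucene's storage code: documents are only ever appended, a deletion just flags the old entry, and an update is the combination of the two, which is why the updated document comes back with a new ID.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical append-only document store; IDs are positions in the list.
final class AppendOnlyStoreSketch {
    private final List<String[]> docs = new ArrayList<>(); // each entry is {title, content}
    private final List<Boolean> deleted = new ArrayList<>();

    /** Appends a document and returns its ID. */
    int add(String title, String content) {
        docs.add(new String[] {title, content});
        deleted.add(false);
        return docs.size() - 1;
    }

    /** Flags every live document with this title as deleted, then appends the new version. */
    int update(String title, String newContent) {
        for (int id = 0; id < docs.size(); id++) {
            if (!deleted.get(id) && docs.get(id)[0].equals(title)) {
                deleted.set(id, true);
            }
        }
        return add(title, newContent);
    }

    /** IDs of the documents that have not been deleted. */
    List<Integer> liveIds() {
        List<Integer> ids = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            if (!deleted.get(id)) {
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        AppendOnlyStoreSketch store = new AppendOnlyStoreSketch();
        store.add("doc1", "hello world");
        store.add("doc2", "hello java world");
        store.add("doc3", "hello lucene world");
        store.update("doc2", "hello updated content");
        System.out.println(store.liveIds());
    }
}
```

After the update the live IDs are 0, 2, and 3, mirroring what the search output shows for the real index.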

Deleting from the index

@Test
public void testDelete() throws Exception {
    // Create the index writer
    Directory d = FSDirectory.open(new File(indexPath));
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
    IndexWriter writer = new IndexWriter(d, config);

    // Delete documents
    /*
     * Option 1: delete by term
     * Term term = new Term("title", "doc2");
     * writer.deleteDocuments(term);
     */
    // Option 2: delete by query
    QueryParser parser = new QueryParser("title", analyzer);
    Query query = parser.parse("doc3");
    writer.deleteDocuments(query);

    // Commit the deletion
    writer.commit();
    writer.close();
    testSearch();
}

Advanced Lucene

Matching all documents

The two helpers below are overloads named search, which is the name the later examples call: one takes a query string, the other a Query object.

// Search by query string: parse it against the "content" field, then delegate
public void search(String content) throws Exception {
    QueryParser parser = new QueryParser("content", new StandardAnalyzer());
    search(parser.parse(content));
}

// Search with a ready-made Query object
public void search(Query query) throws Exception {
    // 1. Open the index directory and a reader on it
    Directory d = FSDirectory.open(new File(indexPath));
    IndexReader r = DirectoryReader.open(d);
    IndexSearcher searcher = new IndexSearcher(r);

    // search(query, n): return the top n matching documents
    TopDocs search = searcher.search(query, 10000);
    System.out.println("Number of matching documents: " + search.totalHits);

    ScoreDoc[] scoreDocs = search.scoreDocs;
    for (int i = 0; i < scoreDocs.length; i++) {
        System.out.println("*******************************");
        System.out.println("score: " + scoreDocs[i].score); // relevance score used for ranking
        int docId = scoreDocs[i].doc;                        // document ID
        Document document = searcher.doc(docId);
        System.out.println("docId ---> " + docId);
        System.out.println("title ---> " + document.get("title"));
        System.out.println("content ---> " + document.get("content"));
    }
    r.close();
}

@Test
public void test1() throws Exception {
    search("*:*");                   // match everything, in every field
    search(new MatchAllDocsQuery()); // the same query built through the API
}

Term search

/**
 * Term search
 */
@Test
public void test2() throws Exception {
    // search("title:doc1");                            // via search(String content)
    search(new TermQuery(new Term("title", "doc1")));   // via search(Query query)
}

Phrase query

/**
 * Phrase query
 */
@Test
public void test3() throws Exception {
    // search("content:\"hello world\"");
    PhraseQuery query = new PhraseQuery();
    query.add(new Term("content", "hello"));
    query.add(new Term("content", "world"));
    search(query);
}

Wildcard search

/**
 * Wildcard search
 */
@Test
public void test4() throws Exception {
    // Option 1
    search("l*ne");
    // Option 2
    search("lucen?");
    // Option 3
    WildcardQuery query = new WildcardQuery(new Term("content", "l*ne"));
    search(query);
}

In search("l*ne"), * stands for any number of characters.
In search("lucen?"), ? stands for exactly one character.
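The meaning of the two wildcards can be checked with a short plain-Java sketch that translates a pattern into a regular expression. It is a hypothetical helper for illustration only: Lucene's WildcardQuery does not work through java.util.regex, and this version has no escape syntax for a literal * or ?.

```java
import java.util.regex.Pattern;

// Hypothetical wildcard matcher: '*' = any run of characters, '?' = exactly one character.
final class WildcardSketch {
    static boolean matches(String pattern, String term) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c))); // treat everything else literally
            }
        }
        return Pattern.matches(regex.toString(), term);
    }

    public static void main(String[] args) {
        System.out.println(matches("l*ne", "lucene"));
        System.out.println(matches("lucen?", "lucen"));
    }
}
```

Note that ? requires a character to be present, so lucen? matches lucene but not lucen.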

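Fuzzy term search

Lucene tolerates misspelled words. content:lucenx~1 allows the word to differ by one letter. In content:lucenx~N, the largest value of N is 2.

@Test
public void test5() throws Exception {
    search("content:lxcenX~2");
    FuzzyQuery query = new FuzzyQuery(new Term("content", "lucenx"), 1);
    search(query);
}

A fuzzy search is written by appending ~ (a tilde) to the keyword, optionally followed by a number. In Lucene 3.x that number is a similarity between 0 and 1, such as ~0.85 or ~0.3, where 1 means the closest match and the default is 0.5. In Lucene 4.x, the version used here, the number is the maximum edit distance (0, 1, or 2, default 2); fractional values are still accepted by the query parser but are deprecated and are converted to an edit distance.

@Test
public void test6() throws Exception {
    search("lqcenX~1");
    // Without an explicit value FuzzyQuery allows up to 2 edits
    FuzzyQuery query = new FuzzyQuery(new Term("content", "lqcenX"));
    search(query);
}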
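"Differs by N letters" is measured as an edit distance. The sketch below computes the classic Levenshtein distance in plain Java so the numbers in the examples can be checked by hand. It is an illustrative helper with a made-up name; Lucene's own fuzzy matching is implemented differently and, to my understanding, can also count a swap of two adjacent letters as a single edit.

```java
// Hypothetical helper: classic Levenshtein distance (insert, delete, substitute).
final class EditDistanceSketch {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i; // turning the first i characters of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("lucenx", "lucene"));
        System.out.println(distance("lxcenx", "lucene"));
    }
}
```

With this measure lucenx is one edit away from lucene, while lxcenx and lqcenx are two edits away, so they only match when two edits are allowed.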

Phrase query (proximity search)

content:"hello world"~1 means one word may appear between the words of the phrase.
content:"hello world"~N means N words may appear between the words of the phrase.

/**
 * Phrase query (proximity search)
 */
@Test
public void test7() throws Exception {
    // ~1 means one word may appear between the words of the phrase
    // content:"hello world"~N means N words may appear between them
    // search("content:\"hello world\"~1");
    PhraseQuery query = new PhraseQuery();
    query.add(new Term("content", "hello"));
    query.add(new Term("content", "world"));
    query.setSlop(1); // allow one intervening word
    search(query);
}
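A simplified model of slop for a two-word phrase looks like this. It is a hypothetical sketch: it only handles two words in their original order, whereas Lucene's PhraseQuery defines slop more generally as the number of position moves needed to line the terms up.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical two-word proximity check: "second" must follow "first"
// with at most "slop" other words in between.
final class SlopSketch {
    static boolean matches(List<String> words, String first, String second, int slop) {
        for (int i = 0; i < words.size(); i++) {
            if (!words.get(i).equals(first)) {
                continue;
            }
            // j - i - 1 is the number of words between positions i and j
            for (int j = i + 1; j < words.size() && j - i - 1 <= slop; j++) {
                if (words.get(j).equals(second)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc2 = Arrays.asList("hello", "java", "world");
        System.out.println(matches(doc2, "hello", "world", 0));
        System.out.println(matches(doc2, "hello", "world", 1));
    }
}
```

With slop 0 only "hello world" matches; with slop 1 "hello java world" and "hello lucene world" match as well.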

Range search

/**
 * Range search
 */
@Test
public void test8() throws Exception {
    // { : lower bound excluded
    // } : upper bound excluded
    // [ : lower bound included
    // ] : upper bound included
    // search("inputtime:{20101010 TO 20101012}");
    // TermRangeQuery(field, lower value, upper value, include lower?, include upper?)
    // BytesRef is org.apache.lucene.util.BytesRef
    TermRangeQuery query = new TermRangeQuery("inputtime",
            new BytesRef("20101010"), new BytesRef("20101012"), false, false);
    search(query);
}
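A term range compares terms as strings, not as numbers, which is why fixed-width values such as 20101010 work well. The sketch below shows the open and closed bounds with String.compareTo. It is a hypothetical helper: TermRangeQuery compares the terms' bytes rather than Java strings, although for plain ASCII digits the ordering is the same.

```java
// Hypothetical range check mirroring the include-lower / include-upper flags.
final class RangeSketch {
    static boolean inRange(String value, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        int lo = value.compareTo(lower);
        int hi = value.compareTo(upper);
        boolean aboveLower = includeLower ? lo >= 0 : lo > 0;
        boolean belowUpper = includeUpper ? hi <= 0 : hi < 0;
        return aboveLower && belowUpper;
    }

    public static void main(String[] args) {
        // {20101010 TO 20101012}: both bounds excluded
        System.out.println(inRange("20101011", "20101010", "20101012", false, false));
        // Lexicographic ordering: "9" sorts after "10"
        System.out.println(inRange("9", "1", "10", true, true));
    }
}
```

The second call is the usual pitfall: as strings, "9" is greater than "10", so values of varying width need zero-padding (or a numeric field type) before range queries behave as expected.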

Combined (Boolean) queries

AND and &&: find the documents whose title contains one and whose content contains java.
Either of the following works:
title:one && content:java
title:one AND content:java

/**
 * Combined query with AND / &&
 */
@Test
public void test9() throws Exception {
    // search("content:hello AND inputtime:{20101010 TO 20101012}");
    search("content:hello && inputtime:{20101010 TO 20101012}");
    /*
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("content", "hello")), BooleanClause.Occur.MUST);
    query.add(new TermRangeQuery("inputtime", new BytesRef("20101010"),
            new BytesRef("20101012"), false, false), BooleanClause.Occur.MUST);
    search(query);
    */
}

OR and ||: find the documents whose title contains one or whose content contains java.
By default, terms written side by side are combined with OR.
Any of the following works:
title:one || content:java
title:one OR content:java
title:one content:java

/**
 * Combined query with OR / ||
 */
@Test
public void test10() throws Exception {
    // search("content:lucene OR inputtime:{20101010 TO 20101012}");
    // search("content:lucene || inputtime:{20101010 TO 20101012}");
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("content", "lucene")), BooleanClause.Occur.SHOULD);
    query.add(new TermRangeQuery("inputtime", new BytesRef("20101010"),
            new BytesRef("20101012"), false, false), BooleanClause.Occur.SHOULD);
    search(query);
}

NOT and !: find the documents whose title contains one but whose content does not contain java.
Either of the following works:
title:one ! content:java
title:one NOT content:java

/**
 * Combined query with NOT / !
 */
@Test
public void test11() throws Exception {
    // search("content:hello NOT content:java");
    // search("content:hello ! content:java");
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("content", "hello")), BooleanClause.Occur.MUST);
    query.add(new TermQuery(new Term("content", "java")), BooleanClause.Occur.MUST_NOT);
    search(query);
}

Required (+) and excluded (-): find the documents whose title contains one but whose content does not contain java.
+title:one -content:java
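Underneath, the three operators are set operations on posting lists. The following plain-Java sketch shows them on sets of document IDs. It is a hypothetical illustration: the class and method names are invented, and Lucene's BooleanQuery additionally combines scores, which is ignored here.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical Boolean operators over sets of matching document IDs.
final class BooleanSketch {
    /** AND: documents present in both sets. */
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    /** OR: documents present in either set. */
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    /** NOT: documents in a that are absent from b. */
    static Set<Integer> not(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> titleOne = new TreeSet<>(Arrays.asList(0, 1, 2)); // assumed postings
        Set<Integer> contentJava = new TreeSet<>(Arrays.asList(1, 3)); // assumed postings
        System.out.println(and(titleOne, contentJava));
        System.out.println(or(titleOne, contentJava));
        System.out.println(not(titleOne, contentJava));
    }
}
```

NOT is always relative to another set: a query consisting only of excluded terms has nothing to subtract from, which is why the examples pair it with a required clause.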

Boosting

Lucene lets you assign a relevance weight (boost) to individual terms in a combined query, so that the most relevant results rank higher.
To boost a term, append ^n to it.
For example, when searching for jakarta apache, writing jakarta^4 apache raises the relevance of jakarta.
A boost can also be applied to a phrase: "jakarta apache"^4 "Apache Lucene" ranks the documents most relevant to the phrase jakarta apache first.
The default boost is 1. A boost is usually set to an integer greater than 1; it may also be a fraction between 0 and 1, but it cannot be negative.

/**
 * Boosting
 */
@Test
public void test12() throws Exception {
    // search("content:lucene^10 java");
    BooleanQuery query = new BooleanQuery();
    TermQuery termQuery = new TermQuery(new Term("content", "lucene"));
    termQuery.setBoost(10); // boost this clause
    query.add(termQuery, BooleanClause.Occur.SHOULD);
    query.add(new TermQuery(new Term("content", "java")), BooleanClause.Occur.SHOULD);
    search(query);
}

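Special characters

Because symbols such as | & ! + - ( ) act as operators in query expressions, they must be escaped with \ when you want to search for the characters themselves.
The special characters in a Lucene query are: + - && || ! ( ) { } [ ] ^ " ~ * ? : \
For example, to search for documents containing (1+1):2, use the following expression:
\(1\+1\)\:2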
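Escaping by hand is error-prone, so it is usually done by a helper. The sketch below is a hypothetical stand-alone version that puts a backslash in front of each character from the list above. In a real project the classic QueryParser's static escape method does this job, and its exact character set may differ slightly from the one assumed here.

```java
// Hypothetical escaper for the special characters listed in the text.
final class EscapeSketch {
    // Assumed character set: + - & | ! ( ) { } [ ] ^ " ~ * ? : \
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\";

    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\'); // prefix every special character with a backslash
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("(1+1):2"));
    }
}
```

Escape only the user-supplied part of a query; escaping a whole expression would also neutralize the operators you put there on purpose.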

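Grouping
Use parentheses () to group parts of a query expression.
Lucene's query syntax lets you group sub-expressions with () and so build arbitrarily complex queries.
1. Find the documents whose title contains one or two but whose content does not contain java:
Query query = parser.parse("title:(one OR two) NOT content:java");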

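Highlighting

1. Overview: cut a short summary out of each search result and highlight the parts that match the query.
Highlighting requires the jar lucene-highlighter-4.10.4.jar.
2. Highlighting consists of two parts: A. extracting the summary, B. marking up the matches.

// Classes come from org.apache.lucene.search.highlight;
// query is the Query that was searched and doc is one of the matching Documents.
Formatter formatter = new SimpleHTMLFormatter("<font color=\"red\">", "</font>");
Scorer scorer = new QueryScorer(query);
Highlighter hl = new Highlighter(formatter, scorer);
hl.setMaxDocCharsToAnalyze(20); // analyze at most the first 20 characters of the text
String str = hl.getBestFragment(new StandardAnalyzer(), "content", doc.get("content"));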
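The mark-up half of highlighting amounts to wrapping each match in a pair of tags. Here is a minimal plain-Java sketch of that step alone. It is a hypothetical helper, not the Highlighter's algorithm: it matches a single term as a case-insensitive substring, assumes text whose length does not change when lowercased (plain ASCII, for instance), and does not extract a summary.

```java
import java.util.Locale;

// Hypothetical highlighter: wraps every occurrence of a term in the given tags.
final class HighlightSketch {
    static String highlight(String text, String term, String pre, String post) {
        String lowerText = text.toLowerCase(Locale.ROOT);
        String lowerTerm = term.toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        int from = 0;
        int hit = lowerTerm.isEmpty() ? -1 : lowerText.indexOf(lowerTerm, 0);
        while (hit >= 0) {
            out.append(text, from, hit)                   // text before the match
               .append(pre)
               .append(text, hit, hit + term.length())    // the match, in its original case
               .append(post);
            from = hit + term.length();
            hit = lowerText.indexOf(lowerTerm, from);
        }
        return out.append(text.substring(from)).toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("hello lucene world", "lucene", "<font color=\"red\">", "</font>"));
    }
}
```

The real Highlighter works on analyzed tokens rather than raw substrings, so it highlights whole terms and respects the analyzer's rules.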

Original source:

CoderZS, Lucene & Full-Text Search (Lucene&全文检索), https://blog.csdn.net/zhang18024666607/article/details/78216635

Reposted from: https://www.cnblogs.com/ryelqy/p/10104044.html
