Learning Lucene: Stop Words
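While working on Chinese text clustering and studying the imdict-chinese-analyzer tokenizer from the Chinese Academy of Sciences, I could never get correct results from the stop word list I loaded myself. So I traced how Lucene loads a custom stop word list. In the source, in WordListLoader.java, I found this code: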
    public static HashSet getWordSet(Reader reader) throws IOException {
      HashSet result = new HashSet();
      BufferedReader br = null;
      try {
        if (reader instanceof BufferedReader) {
          br = (BufferedReader) reader;
        } else {
          br = new BufferedReader(reader);
        }
        String word = null;
        while ((word = br.readLine()) != null) {
          result.add(word.trim());
        }
      }
      finally {
        if (br != null)
          br.close();
      }
      return result;
    }

That made the problem clearer. I added an output statement to this code to print each stop word as it was loaded from the file, and this exposed the root cause: the encoding of the stop word file (I had saved it with the encoding set to Unicode).
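The effect of an encoding mismatch can be reproduced without Lucene. The following is a minimal sketch, not Lucene code: it assumes that "Unicode" here means UTF-16 (as it does in some Windows editors) and that the loader decodes the file as UTF-8. The class name `StopWordEncodingDemo` and the helper `loadWords` are hypothetical; the helper only mirrors the line-by-line loop of getWordSet above.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

class StopWordEncodingDemo {

    // Hypothetical helper: decodes the bytes with the given charset and
    // collects one trimmed word per line, like the getWordSet loop.
    static Set<String> loadWords(byte[] fileBytes, Charset charset) {
        Set<String> words = new HashSet<>();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(fileBytes), charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                words.add(line.trim());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // Two stop words, one per line: U+7684 and U+662F (written as escapes
        // so the example does not depend on the source file encoding).
        String content = "\u7684\n\u662f\n";
        byte[] utf8 = content.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = content.getBytes(StandardCharsets.UTF_16LE);

        // Matching encoding: the word is found.
        System.out.println(loadWords(utf8, StandardCharsets.UTF_8).contains("\u662f"));    // true
        // UTF-16 bytes decoded as UTF-8: the set holds garbage, the word is missing.
        System.out.println(loadWords(utf16, StandardCharsets.UTF_8).contains("\u662f"));   // false
        // Decoding with the encoding the file was actually saved in works again.
        System.out.println(loadWords(utf16, StandardCharsets.UTF_16LE).contains("\u662f")); // true
    }
}
```

In the mismatched case no exception is thrown; the set is simply filled with strings that never match a real token, which is consistent with a stop word list that loads without error but has no effect.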
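My findings are summarized below:
1. Lucene supports stop word files encoded in UTF-8.
2. The stop word file format is simple: one word per line.
3. Lucene supports five ways of supplying stop words (see the five constructors of org.apache.lucene.analysis.StopAnalyzer): the default stop words built into StopAnalyzer, a String[], a Set, a File, or a Reader.
4. Reference code for stop word handling follows: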
    package com.xh.TextClustering;

    import java.io.File;
    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    public class StopAnalyzerTestChinese
    {
        static String source = "我 是 中国人。";

        public static void main(String args[])
        {
            Indexer();
        }

        private static void Indexer()
        {
            try {
                // Build the analyzer from a stop word file (one word per line).
                Analyzer analyzer = new StopAnalyzer(new File("chinese_stopword.dic"));
                TokenStream stream = analyzer.tokenStream("content", new StringReader(source));
                // Print every token that survives stop word removal.
                while (true)
                {
                    Token item = stream.next();
                    if (null == item) break;
                    System.out.println("{" + item.termText() + "}");
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

Note: a reference Chinese stop word list is included in the attachment.
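To make the expected behavior of the reference code easier to picture, here is a small plain-Java sketch of stop word filtering. It is not Lucene's implementation and does not use the Lucene API. The class `StopFilterSketch`, the method `filter`, the rule of splitting on non-letter characters, and the two stop words are all assumptions chosen for illustration; the contents of chinese_stopword.dic are not shown in this article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopFilterSketch {

    // Hypothetical helper: splits the text on runs of non-letter characters
    // and keeps only the tokens that are not in the stop word set.
    static List<String> filter(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed stop words U+6211 and U+662F, written as Unicode escapes so
        // the example does not depend on the source file encoding.
        Set<String> stopWords = new HashSet<>(Arrays.asList("\u6211", "\u662f"));
        // The same sample sentence as in the reference code, as escapes.
        String source = "\u6211 \u662f \u4e2d\u56fd\u4eba\u3002";
        for (String token : filter(source, stopWords)) {
            System.out.println("{" + token + "}");
        }
    }
}
```

With these assumed stop words, only the last word of the sample sentence survives. If the stop word set had been loaded with the wrong encoding, none of its entries would match and all three tokens would be printed, which is the symptom described at the start of this article.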