[Nutch 2.3 Basic Tutorial] Integrating Nutch/Hadoop/Hbase/Solr to Build a Search Engine: Installation and Running [Cluster Environment]

Published: 2024/1/23


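1. Download the required software and extract it

The versions are as follows:

(1) apache-nutch-2.3

(2) hadoop-1.2.1

(3) hbase-0.92.1

(4) solr-4.9.0

Extract them to /opt/jediael. (Most commands below use /usr/search as the install root instead; substitute whichever directory you actually used.)

To get the latest development version of Nutch instead, check it out from Subversion: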

svn co https://svn.apache.org/repos/asf/nutch/branches/2.x

2. Install the Hadoop 1.2.1 cluster environment

See http://blog.csdn.net/jediael_lu/article/details/38926477

3. Install the HBase 0.92.1 cluster environment

See http://blog.csdn.net/jediael_lu/article/details/43086641

4. Configure Nutch

(1) vi /usr/search/apache-nutch-2.3/conf/nutch-site.xml

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>

(2) vi /usr/search/apache-nutch-2.3/ivy/ivy.xml

The following line is commented out by default. Remove the comment markers to enable it.

<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />

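gora-hbase 0.5 corresponds to HBase 0.94.12.

Change the Hadoop version as needed. Only the rev attribute of these two opening tags changes; leave the rest of each element as it is in ivy.xml:

<dependency org="org.apache.hadoop" name="hadoop-core" rev="1.2.1" conf="*->default">
<dependency org="org.apache.hadoop" name="hadoop-test" rev="1.2.1" conf="test->default">

(3) vi /usr/search/apache-nutch-2.3/conf/gora.properties

Add the following line: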

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

The three steps above configure Nutch to use HBase for storage.

(4) Adjust the URL filter as needed

vi /usr/search/apache-nutch-2.3/conf/regex-urlfilter.txt

Change

# accept anything else
+.

to

# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/
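
To see what the new rule lets through, the pattern can be tried against a few URLs outside Nutch. The sketch below is only an approximation: it uses grep -E (POSIX extended regex) rather than the Java regex engine Nutch uses, and the url_allowed helper and the sample URLs are made up for illustration. For a pattern this simple the two engines should agree.

```shell
#!/usr/bin/env bash
# Pattern copied from the filter rule above, without the leading '+'.
pattern='^http://([a-z0-9]*\.)*nutch.apache.org/'

# Hypothetical helper: returns 0 if the URL matches the pattern, 1 otherwise.
url_allowed() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Sample URLs chosen for illustration only.
for url in \
  'http://nutch.apache.org/' \
  'http://wiki.nutch.apache.org/page' \
  'https://nutch.apache.org/' \
  'http://example.com/'; do
  if url_allowed "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

The first two URLs match and the last two do not: the rule is anchored to http://, so https:// URLs fall through to rejection. Note also that the dots in nutch.apache.org are not escaped, so each one matches any character; escape them if a stricter rule is wanted.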

(9) Index additional fields

By default, only the fields in schema.xml that belong to core and index-basic are indexed. To index more fields, proceed as follows.

Edit nutch-default.xml and add the extra plugins to the value of the plugin.includes property (the additions were highlighted in red in the original post), or add only index-more:

<property>
  <name>plugin.includes</name>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

Alternatively, add a plugin.includes property to nutch-site.xml and copy the content above into it. Note that a property in nutch-site.xml replaces the one in nutch-default.xml, so the original value must be copied over as well.

(5) Build the runtime

cd /usr/search/apache-nutch-2.3/

ant runtime

(6) Verify the Nutch installation

# cd /usr/search/apache-nutch-2.3/runtime/local/bin/
# ./nutch
Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

(7) Create seed.txt

cd /usr/search/apache-nutch-2.3/runtime/deploy/bin/

vi seed.txt

http://nutch.apache.org/

hadoop fs -copyFromLocal seed.txt /

This places seed.txt in the root directory of HDFS.
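
The seed step can also be scripted instead of using vi. The following is a minimal sketch under stated assumptions: the temporary directory is an illustrative choice, and the upload is attempted only when a hadoop client is found on PATH. It uses only hadoop fs -copyFromLocal (as above) and hadoop fs -cat to read the file back.

```shell
#!/usr/bin/env bash
# Write one seed URL per line; the temporary directory is illustrative.
seed_dir="$(mktemp -d)"
seed_file="$seed_dir/seed.txt"
printf '%s\n' 'http://nutch.apache.org/' > "$seed_file"

# Upload only when the hadoop client is available on this machine.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -copyFromLocal "$seed_file" /
  # Read the file back from HDFS to confirm the upload.
  hadoop fs -cat /seed.txt
else
  echo "hadoop not found on PATH; seed file left at $seed_file"
fi
```

If /seed.txt already exists in HDFS, the copy will fail rather than overwrite it, so remove the old file first when re-running.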

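(8) The following exception occurs at run time:

java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat

The cause was not clear at first. To let the crawl continue, comment out the following statements in the crawl script for now (a code fix is given under Exception 2 in section 7):

#echo "SOLR dedup -> $SOLRURL"
#__bin_nutch solrdedup $commonOptions $SOLRURL

Setting export CLASSPATH=$CLASSPATH:..... has no effect.

The error does not occur when running in local mode, however.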

5. Configure Solr

(1) Overwrite Solr's schema.xml. (For Solr 4, schema-solr4.xml should be used.)

cp /usr/search/apache-nutch-2.3/conf/schema.xml /usr/search/solr-4.9.0/example/solr/collection1/conf/

(2) With Solr 3.6 the configuration is complete at this point, but with 4.9 the following changes are needed: [newer versions no longer need this step]

Edit the schema.xml file copied above.

Remove: <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />

Add: <field name="_version_" type="long" indexed="true" stored="true"/>

Alternatively, run Solr under Tomcat; see http://blog.csdn.net/jediael_lu/article/details/37908885

6. Start the crawl

(1) Start Hadoop

# start-all.sh

(2) Start HBase
# ./start-hbase.sh

(3) Start Solr

# cd /usr/search/solr-4.9.0/example/
# java -jar start.jar

(4) Start Nutch and begin the crawl

Make sure seed.txt has been copied to the root directory of HDFS (step (7) of section 4).

# cd /usr/search/apache-nutch-2.3/runtime/deploy
# bin/crawl /seed.txt TestCrawl http://localhost:8983/solr 2

That's it; the job starts running.

7. Exceptions that may occur during installation

Exception 1: No active index writer.

Edit nutch-default.xml and add indexer-solr to plugin.includes.

Exception 2: ClassNotFoundException: org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat

In SolrDeleteDuplicates, after the line Job job = new Job(getConf(), "solrdedup");

add the following code:

job.setJarByClass(SolrDeleteDuplicates.class);

For some analysis of the process above, see:

Integrating Nutch/Hbase/Solr to Build a Search Engine, Part 2: Content Analysis

http://blog.csdn.net/jediael_lu/article/details/37738569

When crontab is used to schedule Nutch as a recurring job, the following error appears:
JAVA_HOME is not set.
as well as
Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode.
So I created a script to run the crawl:

$ vi /opt/jediael/apache-nutch-2.3/runtime/deploy/bin/myCrawl.sh

#!/bin/bash
export JAVA_HOME=/usr/java/jdk1.7.0_51
export PATH=$PATH:/opt/jediael/hadoop-1.2.1/bin/
/opt/jediael/apache-nutch-2.3/runtime/deploy/bin/crawl /seed.txt `date +%h%d%H` http://master:8983/solr/ 2

Then configure the recurring job:

0 0,9,12,15,19,21 * * * bash /opt/jediael/apache-nutch-2.3/runtime/deploy/bin/myCrawl.sh >> ~/nutch.log
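
Both errors come from the fact that cron starts jobs with a minimal environment, so variables set in a login shell are not available. A small pre-flight check at the top of a wrapper script can make this visible before the crawl starts. The sketch below is an assumption, not part of Nutch: check_crawl_env is a hypothetical helper that only tests for JAVA_HOME and for a hadoop executable on PATH.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: reports what is missing on stderr and
# returns non-zero if the environment is incomplete.
check_crawl_env() {
  local status=0
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set" >&2
    status=1
  fi
  if ! command -v hadoop >/dev/null 2>&1; then
    echo "hadoop is not on PATH" >&2
    status=1
  fi
  return "$status"
}

if check_crawl_env; then
  echo "environment looks usable for a deploy-mode crawl"
else
  echo "fix the settings reported above before starting the crawl"
fi
```

Running the same check from an interactive shell and from a cron entry shows the difference between the two environments directly.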
