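[Nutch 2.3 Basics Tutorial] Integrating Nutch/Hadoop/HBase/Solr to Build a Search Engine: Installation and Running (Cluster Environment)
1. Download the required software and extract it
The versions used are:
(1) apache-nutch-2.3
(2) hadoop-1.2.1
(3) hbase-0.92.1
(4) solr-4.9.0
Extract everything to /opt/jediael.
To get the latest development version of Nutch instead, run: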
svn co https://svn.apache.org/repos/asf/nutch/branches/2.x
2. Install the Hadoop 1.2.1 cluster environment
See http://blog.csdn.net/jediael_lu/article/details/38926477
3. Install the HBase 0.92.1 cluster environment
See http://blog.csdn.net/jediael_lu/article/details/43086641
4. Configure Nutch
(1) vi /usr/search/apache-nutch-2.3/conf/nutch-site.xml
Add the following properties:
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
(2) vi /usr/search/apache-nutch-2.3/ivy/ivy.xml
The following line is commented out by default. Remove the comment markers to enable it.
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
gora-hbase 0.5 corresponds to HBase 0.94.12.
Adjust the Hadoop version as needed:
<dependency org="org.apache.hadoop" name="hadoop-core" rev="1.2.1" conf="*->default">
<dependency org="org.apache.hadoop" name="hadoop-test" rev="1.2.1" conf="test->default">
(3) vi /usr/search/apache-nutch-2.3/conf/gora.properties
Add the following line:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
These three steps configure Nutch to use HBase for storage.
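A quick way to confirm the last of these settings is to read the value back from gora.properties. The sketch below is only an illustration: the helper name `gora_default_store` is made up here, and the demo runs against a throwaway file (with a hypothetical commented-out entry) rather than a real Nutch checkout.

```shell
#!/usr/bin/env bash
# Sketch: report which data store a gora.properties file selects.
# gora_default_store is a hypothetical helper, not part of Nutch or Gora.

# Print the value of gora.datastore.default, skipping commented-out lines.
# If the key appears more than once, the last occurrence is printed.
gora_default_store() {
  local file="$1"
  sed -n 's/^[[:space:]]*gora\.datastore\.default[[:space:]]*=[[:space:]]*//p' "$file" | tail -n 1
}

# Demo on a temporary file so the sketch runs without a Nutch installation.
demo_file="$(mktemp)"
trap 'rm -f "$demo_file"' EXIT
printf '%s\n' \
  '# gora.datastore.default=some.other.Store   (hypothetical commented-out entry)' \
  'gora.datastore.default=org.apache.gora.hbase.store.HBaseStore' > "$demo_file"

gora_default_store "$demo_file"
```

On a real installation the same function could be pointed at /usr/search/apache-nutch-2.3/conf/gora.properties.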
(4) Modify the URL filter as needed
vi /usr/search/apache-nutch-2.3/conf/regex-urlfilter.txt
Change
# accept anything else
+.
to
# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/
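To see what the new accept rule lets through, the pattern can be tried against a few URLs with grep. This is a rough sketch under two assumptions: the rule is applied as a "find anywhere in the URL" match, and grep's extended regex syntax behaves like Java's for this particular pattern. The leading `+` is the filter file's accept marker and is not part of the regex. The host `sub.nutch.apache.org` is a made-up example, and the real filter file also contains earlier reject rules that this sketch ignores.

```shell
#!/usr/bin/env bash
# Sketch: test URLs against the accept rule from regex-urlfilter.txt.
# Only this one rule is checked; earlier rules in the real file are ignored.

accept_rule='^http://([a-z0-9]*\.)*nutch.apache.org/'

# url_accepted URL -> exit status 0 if the URL matches the accept rule.
url_accepted() {
  printf '%s\n' "$1" | grep -Eq -- "$accept_rule"
}

for url in \
  'http://nutch.apache.org/' \
  'http://sub.nutch.apache.org/index.html' \
  'http://example.com/'; do
  if url_accepted "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

Note that the dots in `nutch.apache.org` are unescaped, so they match any character; writing `nutch\.apache\.org` would make the rule stricter.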
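(9) Index additional fields
By default, only the fields in schema.xml that belong to core and index-basic are indexed. To index more fields, enable the corresponding plugins as follows.
Edit nutch-default.xml and extend plugin.includes as shown below (or add only index-more):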
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|index-anchor|index-more|languageidentifier|subcollection|feed|creativecommons|tld</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
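Alternatively, add the plugin.includes property to nutch-site.xml and copy the content above into it. Note that a property in nutch-site.xml replaces the one in nutch-default.xml, so the original value must be copied over as well.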
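Because plugin.includes is a regular expression over plugin directory names, it is easy to check which plugins a given value would enable. The sketch below assumes the whole directory name has to match the expression; `plugin_enabled` is a hypothetical helper written for this illustration.

```shell
#!/usr/bin/env bash
# Sketch: check plugin directory names against a plugin.includes value.
# Assumption: the full name must match the expression (hence the ^(...)$ wrapper).

plugin_includes='protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|index-anchor|index-more|languageidentifier|subcollection|feed|creativecommons|tld'

# plugin_enabled NAME -> exit status 0 if NAME is matched by plugin_includes.
plugin_enabled() {
  printf '%s\n' "$1" | grep -Eq -- "^(${plugin_includes})$"
}

for plugin in index-basic index-more indexer-solr; do
  if plugin_enabled "$plugin"; then
    echo "$plugin: included"
  else
    echo "$plugin: excluded"
  fi
done
```

With the value above, index-basic and index-more are reported as included, while indexer-solr is excluded until it is added to the expression.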
(5) Build the runtime
cd /usr/search/apache-nutch-2.3/
ant runtime
(6) Verify the Nutch installation
# cd /usr/search/apache-nutch-2.3/runtime/local/bin/
# ./nutch
Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
(7) Create the seed file
cd /usr/search/apache-nutch-2.3/runtime/deploy/bin/
vi seed.txt
http://nutch.apache.org/
hadoop fs -copyFromLocal seed.txt /
This places seed.txt in the HDFS root directory.
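The seed step can also be scripted. The following is a minimal sketch, assuming every line of the seed file should be a plain http or https URL; `invalid_seed_lines` is a hypothetical helper, and the upload is skipped when the hadoop command is not on the PATH so the sketch also runs on a machine without Hadoop.

```shell
#!/usr/bin/env bash
# Sketch: write a seed file, sanity-check it, then upload it to the HDFS root.

work_dir="$(mktemp -d)"
seed_file="$work_dir/seed.txt"
printf '%s\n' 'http://nutch.apache.org/' > "$seed_file"

# Print the number of lines that do not look like an http(s) URL.
# grep -c exits non-zero when the count is 0, hence the "|| true".
invalid_seed_lines() {
  grep -Evc '^https?://[^[:space:]]+$' "$1" || true
}

if [ "$(invalid_seed_lines "$seed_file")" != "0" ]; then
  echo "seed file contains lines that are not URLs" >&2
elif command -v hadoop >/dev/null 2>&1; then
  # Same command as in the step above.
  hadoop fs -copyFromLocal "$seed_file" /
else
  echo "hadoop not found on PATH; skipping upload of $seed_file" >&2
fi
```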
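(8) The following exception occurs while running:
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat
The cause is unclear. So that the crawl can continue, comment out the statement in the crawl script that triggers it (the Solr dedup step) for now and investigate later.
Setting export CLASSPATH=$CLASSPATH:..... has no effect.
Running in local mode does not produce this error.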
5. Configure Solr
(1) Overwrite Solr's schema.xml file. (For Solr 4, schema-solr4.xml should be used.)
cp /usr/search/apache-nutch-2.3/conf/schema.xml /usr/search/solr-4.9.0/example/solr/collection1/conf/
(2) With Solr 3.6 the configuration is complete at this point. With 4.9, the following changes are needed. [Newer versions no longer need this step.]
Edit the schema.xml file copied above:
Delete: <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
Add: <field name="_version_" type="long" indexed="true" stored="true"/>
Alternatively, run Solr under Tomcat; see http://blog.csdn.net/jediael_lu/article/details/37908885.
6. Start the crawl job
(1) Start Hadoop
# start-all.sh
(2) Start HBase
# ./start-hbase.sh
(3) Start Solr
# cd /usr/search/solr-4.9.0/example/
# java -jar start.jar
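Before launching the crawl it can help to confirm that the Hadoop and HBase daemons actually came up. The sketch below parses a process listing in the format printed by the JDK's jps tool ("pid name" per line) and reports what is missing. The required daemon names (NameNode, JobTracker, HMaster) are an assumption about the master node of a Hadoop 1.x / HBase setup, the sample listing is invented for the demo, and `missing_daemons` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: report required daemons that do not appear in a jps-style listing.
# Daemon names are assumptions for a Hadoop 1.x / HBase master node.

required_daemons='NameNode JobTracker HMaster'

# Reads "<pid> <name>" lines on stdin and prints each required daemon
# whose name does not appear exactly in the second column.
missing_daemons() {
  local listing name
  listing="$(cat)"
  for name in $required_daemons; do
    printf '%s\n' "$listing" | awk '{print $2}' | grep -qx -- "$name" || echo "$name"
  done
}

# Invented sample listing, for illustration only.
sample_listing='2101 NameNode
2345 JobTracker
2900 Jps'

printf '%s\n' "$sample_listing" | missing_daemons
```

On a live node the function would be fed real output, for example `jps | missing_daemons`; an empty result means all three names were found.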
(4) Start Nutch and begin the crawl
Make sure seed.txt has been copied to the HDFS root directory.
# cd /usr/search/apache-nutch-2.3/runtime/deploy
# bin/crawl /seed.txt TestCrawl http://localhost:8983/solr 2
That's it. The job starts running.
7. Exceptions that may occur during installation
Exception 1: No active index writer.
Edit nutch-default.xml and add indexer-solr to plugin.includes.
Exception 2: ClassNotFoundException: org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat
In SolrDeleteDuplicates, after the line Job job = new Job(getConf(), "solrdedup");
add the following code:
job.setJarByClass(SolrDeleteDuplicates.class);
For some analysis of the process above, see:
Integrating Nutch/HBase/Solr to Build a Search Engine, Part 2: Content Analysis
http://blog.csdn.net/jediael_lu/article/details/37738569
Another error that may appear:
Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode.
To work around it, a script was created to run the crawl:
$ vi /opt/jediael/apache-nutch-2.3/runtime/deploy/bin/myCrawl.sh
#!/bin/bash
export JAVA_HOME=/usr/java/jdk1.7.0_51
export PATH=$PATH:/opt/jediael/hadoop-1.2.1/bin/
/opt/jediael/apache-nutch-2.3/runtime/deploy/bin/crawl /seed.txt `date +%h%d%H` http://master:8983/solr/ 2
Then schedule it as a cron job:
0 0,9,12,15,19,21 * * * bash /opt/jediael/apache-nutch-2.3/runtime/deploy/bin/myCrawl.sh >> ~/nutch.log
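Two details of the wrapper script are worth spelling out. The crawl ID comes from `date +%h%d%H` (abbreviated month, day, hour), so each of the six daily runs gets its own ID. The PATH export is what lets the crawl script find the hadoop executable; the error above is probably what happens when the job starts with a minimal PATH, as is common under cron. The sketch below illustrates both points. The helpers `crawl_id` and `hadoop_on_path` are hypothetical, and `date -d` assumes GNU date.

```shell
#!/usr/bin/env bash
# Sketch: show the crawl ID format and check whether hadoop is on a given PATH.

# crawl_id [DATE_STRING]: print the crawl ID for the given time (default: now).
# LC_ALL=C keeps the month abbreviation in English. Requires GNU date for -d.
crawl_id() {
  LC_ALL=C date -d "${1:-now}" +%h%d%H
}

# hadoop_on_path PATH_VALUE: exit status 0 if a hadoop command is found
# when PATH is set to PATH_VALUE. The subshell keeps the caller's PATH intact.
hadoop_on_path() {
  ( PATH="$1"; command -v hadoop >/dev/null 2>&1 )
}

echo "crawl ID for this run: $(crawl_id)"

if hadoop_on_path "$PATH:/opt/jediael/hadoop-1.2.1/bin/"; then
  echo "hadoop found"
else
  echo "hadoop not found; the crawl script would report the error above" >&2
fi
```

For a run started at 09:00 on 5 January, `crawl_id '2015-01-05 09:00'` prints `Jan0509`.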