[Nutch 2.2.1 Basic Tutorial 2.1] Integrating Nutch/HBase/Solr to Build a Search Engine, Part 1: Installation and Running (Standalone Environment)

Published: 2025/4/14

1. Download the required software and extract it

The versions used are:

(1) apache-nutch-2.2.1

(2) hbase-0.90.4

(3) solr-4.9.0

Extract all three into /usr/search.
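
The extraction step can be scripted. The following is a minimal sketch, assuming the three archives have already been downloaded into a single directory as .tar.gz or .tgz files; the helper name extract_archives and the paths in the usage comment are assumptions, not part of the original setup.

```shell
#!/bin/sh
# Extract every .tar.gz / .tgz archive found in a download directory
# into a target directory (this tutorial uses /usr/search).
# extract_archives is a hypothetical helper name, not part of any tool.
extract_archives() {
    src_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    for archive in "$src_dir"/*.tar.gz "$src_dir"/*.tgz; do
        # Skip glob patterns that matched nothing.
        [ -f "$archive" ] || continue
        tar -xzf "$archive" -C "$dest_dir" || return 1
    done
}

# Example (both paths are assumptions):
#   extract_archives ~/downloads /usr/search
```

Each archive unpacks into its own top-level directory, which is how the /usr/search/apache-nutch-2.2.1, /usr/search/hbase-0.90.4 and /usr/search/solr-4.9.0 paths used in the rest of this tutorial come about.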

2. Configure Nutch

(1) vi /usr/search/apache-nutch-2.2.1/conf/nutch-site.xml

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>

(2) vi /usr/search/apache-nutch-2.2.1/ivy/ivy.xml

By default the following line is commented out. Remove the comment markers so that it takes effect.

<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
(3) vi /usr/search/apache-nutch-2.2.1/conf/gora.properties

Add the following line:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

The three steps above select HBase as the storage backend.
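
Before building, it can help to confirm that the edits were actually saved. The sketch below is one possible check, not an official Nutch tool: check_hbase_store is a hypothetical helper that greps nutch-site.xml and gora.properties for the HBaseStore class. It does not inspect ivy.xml, since telling a commented-out XML element from an active one reliably takes more than grep.

```shell
#!/bin/sh
# Report whether the HBase settings from steps (1) and (3) are present
# under a Nutch source directory. Returns 0 if both are found.
# check_hbase_store is a hypothetical helper name.
check_hbase_store() {
    nutch_home=$1
    store=org.apache.gora.hbase.store.HBaseStore
    rc=0
    # nutch-site.xml must mention the HBaseStore class.
    grep -q "$store" "$nutch_home/conf/nutch-site.xml" 2>/dev/null \
        || { echo "missing: HBaseStore in nutch-site.xml"; rc=1; }
    # gora.properties must contain an uncommented default datastore line.
    grep -q "^gora.datastore.default=$store" "$nutch_home/conf/gora.properties" 2>/dev/null \
        || { echo "missing: gora.datastore.default in gora.properties"; rc=1; }
    return $rc
}

# Example:
#   check_hbase_store /usr/search/apache-nutch-2.2.1 && echo "storage config looks OK"
```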

The following steps are the ones required to build a basic Nutch installation.

(4) Build the runtime

cd /usr/search/apache-nutch-2.2.1/

ant runtime

(5) Verify that Nutch is installed

[root@jediael44 apache-nutch-2.2.1]# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
[root@jediael44 bin]# ./nutch
Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

(6) vi /usr/search/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml and add the agent name the crawler will use

<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
(7) Create seed.txt

cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/

vi seed.txt

http://nutch.apache.org/

(8) Modify the URL filter

vi /usr/search/apache-nutch-2.2.1/conf/regex-urlfilter.txt

Change

# accept anything else
+.

to

# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/
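
To see which URLs the new rule lets through, the pattern can be tried outside Nutch. The sketch below uses grep -E as a stand-in for Nutch's Java regular expressions (for this particular pattern the two agree); url_allowed is a hypothetical helper name and the sample URLs are only illustrations.

```shell
#!/bin/sh
# Check a URL against the accept pattern from regex-urlfilter.txt.
# grep -E stands in for Nutch's Java regex engine here.
# url_allowed is a hypothetical helper name.
url_allowed() {
    printf '%s\n' "$1" | grep -Eq '^http://([a-z0-9]*\.)*nutch.apache.org/'
}

# Sample URLs chosen for illustration only.
for url in http://nutch.apache.org/ \
           http://sub.nutch.apache.org/page \
           https://nutch.apache.org/ \
           http://example.com/; do
    if url_allowed "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```

The first two URLs are accepted and the last two rejected. Note that the pattern only matches http://, so https:// links are filtered out, and that the unescaped dots in nutch.apache.org match any character; writing nutch\.apache\.org is stricter.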

(9) Index additional fields

By default, only the fields in the core and index-basic sections of schema.xml are indexed. To index more fields, edit nutch-default.xml and extend the plugin.includes value with the additional plugins:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|index-anchor|index-more|languageidentifier|subcollection|feed|creativecommons|tld</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

Alternatively, add the plugin.includes property to nutch-site.xml and copy the value above into it. Note that a property in nutch-site.xml replaces the one in nutch-default.xml, so the plugins from the original value must be copied over as well.

3. Configure HBase

(1) vi /usr/search/hbase-0.90.4/conf/hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value><Your path></value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value><Your path></value>
  </property>
</configuration>
Note: this step is optional. If it is skipped, the defaults in hbase-default.xml (/usr/search/hbase-0.90.4/src/main/resources/hbase-default.xml) are used.

The default is:

<property>
  <name>hbase.rootdir</name>
  <value>file:///tmp/hbase-${user.name}/hbase</value>
  <description>The directory shared by region servers and into
  which HBase persists. The URL should be 'fully-qualified'
  to include the filesystem scheme. For example, to specify the
  HDFS directory '/hbase' where the HDFS instance's namenode is
  running at namenode.example.org on port 9000, set this value to:
  hdfs://namenode.example.org:9000/hbase. By default HBase writes
  into /tmp. Change this configuration else all data will be lost
  on machine restart.
  </description>
</property>

In other words, data is stored under /tmp by default and may be lost when the machine restarts.

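It is still advisable to set these properties explicitly, especially the second one for ZooKeeper; leaving them unset can cause a variety of problems. The following configuration places both directories on the local file system.

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/jediael/hbaserootdir</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>file:///home/jediael/hbasezookeeperdataDir</value>
  </property>
</configuration>

Note: without the file:// prefix, the path is normally taken to be on hdfs://. In version 0.90.4, however, the default is still the local file system.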
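
If the same two properties need to be written on several machines, the file can be generated rather than edited by hand. This is only a sketch: write_hbase_site is a hypothetical helper, and it passes both values through unchanged, so the choice of paths (and of the file:// prefix) remains yours.

```shell
#!/bin/sh
# Write a minimal hbase-site.xml containing the two properties
# discussed above. The target file is overwritten.
# write_hbase_site is a hypothetical helper name.
write_hbase_site() {
    conf_file=$1
    root_dir=$2
    zk_data_dir=$3
    cat > "$conf_file" <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>$root_dir</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>$zk_data_dir</value>
  </property>
</configuration>
EOF
}

# Example with the values used in this tutorial (back up the old file first):
#   write_hbase_site /usr/search/hbase-0.90.4/conf/hbase-site.xml \
#       file:///home/jediael/hbaserootdir \
#       file:///home/jediael/hbasezookeeperdataDir
```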

4. Configure Solr

(1) Overwrite Solr's schema.xml. (For Solr 4, schema-solr4.xml should be used.)

cp /usr/search/apache-nutch-2.2.1/conf/schema.xml /usr/search/solr-4.9.0/example/solr/collection1/conf/

(2) With Solr 3.6 the configuration is complete at this point. With 4.9, the following changes are also required.

Edit the schema.xml file copied above:

Remove: <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />

Add: <field name="_version_" type="long" indexed="true" stored="true"/>
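
The two schema edits above can also be applied with a small script. The following is a sketch under stated assumptions: patch_schema is a hypothetical helper, and it assumes that the filter element sits on a line of its own and that the file contains a closing </fields> tag, before which the new field is inserted.

```shell
#!/bin/sh
# Apply the two Solr 4.9 schema changes described above to a copied
# schema.xml. Running it twice does not add the field twice.
# patch_schema is a hypothetical helper name.
patch_schema() {
    schema=$1
    patched=$schema.patched
    awk '
        # 1. Drop the EnglishPorterFilterFactory filter line.
        /solr\.EnglishPorterFilterFactory/ { next }
        # Remember whether a _version_ field is already defined.
        /name="_version_"/ { have_version = 1 }
        # 2. Insert the _version_ field just before </fields> if missing.
        /<\/fields>/ && !have_version {
            print "    <field name=\"_version_\" type=\"long\" indexed=\"true\" stored=\"true\"/>"
        }
        { print }
    ' "$schema" > "$patched" && mv "$patched" "$schema"
}

# Example:
#   patch_schema /usr/search/solr-4.9.0/example/solr/collection1/conf/schema.xml
```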

5. Start the crawl

(1) Start HBase

[root@jediael44 bin]# cd /usr/search/hbase-0.90.4/bin/
[root@jediael44 bin]# ./start-hbase.sh

(2) Start Solr

[root@jediael44 bin]# cd /usr/search/solr-4.9.0/example/
[root@jediael44 example]# java -jar start.jar

(3) Start Nutch and begin crawling

[root@jediael44 example]# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
[root@jediael44 bin]# ./crawl seed.txt TestCrawl http://localhost:8983/solr 2

That's it; the crawl job starts running.
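
The four arguments to the crawl script are, in order, the seed location, the crawl ID, the Solr URL and the number of rounds. Once the crawl has finished, one way to check that documents reached Solr is to ask the select handler for a count. The sketch below only builds the query URL; solr_count_url is a hypothetical helper, and the commented curl call assumes curl is installed and Solr is still running at the address used above.

```shell
#!/bin/sh
# Build a Solr query URL that requests only the number of indexed
# documents: q=*:* matches everything and rows=0 returns no documents,
# so the response carries just the numFound count.
# solr_count_url is a hypothetical helper name.
solr_count_url() {
    solr_base=${1%/}    # strip one trailing slash, if present
    printf '%s/select?q=*:*&rows=0&wt=json\n' "$solr_base"
}

# After the crawl finishes (requires curl and a running Solr):
#   curl -s "$(solr_count_url http://localhost:8983/solr)"
```

A numFound value greater than zero in the response indicates that the indexing step wrote documents to Solr.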

For some analysis of the process above, see:

Integrating Nutch/HBase/Solr to Build a Search Engine, Part 2: Content Analysis

http://blog.csdn.net/jediael_lu/article/details/37738569

When crontab was used to schedule the Nutch job, the following error appeared: JAVA_HOME is not set. A script was therefore created to run the crawl:

#!/bin/bash
export JAVA_HOME=/usr/java/jdk1.7.0_51
/opt/jediael/apache-nutch-2.2.1/runtime/local/bin/crawl /opt/jediael/apache-nutch-2.2.1/runtime/local/urls/ main http://localhost:8080/solr/ 2 >> ~jediael/nutch.log

Then the scheduled job was configured:

30 0,6,8,10,12,14,16,18,20,22 * * * bash /opt/jediael/apache-nutch-2.2.1/runtime/local/bin/myCrawl.sh

Reposted from: https://www.cnblogs.com/eaglegeek/p/4557894.html
