
[Nutch 2.2.1 Basic Tutorial, Part 2.1] Integrating Nutch/HBase/Solr to Build a Search Engine (1): Installation and Running [Standalone Environment]...

Published: 2025/4/14


1. Download the required software and extract it

The versions used are:

(1) apache-nutch-2.2.1

(2) hbase-0.90.4

(3) solr-4.9.0

Extract all three to /usr/search.


2. Configuring Nutch

(1) vi /usr/search/apache-nutch-2.2.1/conf/nutch-site.xml

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>

(2) vi /usr/search/apache-nutch-2.2.1/ivy/ivy.xml

This dependency is commented out by default. Remove the comment markers to enable it.

<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

(3) vi /usr/search/apache-nutch-2.2.1/conf/gora.properties

Add the following line:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

The three steps above tell Nutch to use HBase for storage.
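Before building, the three settings can be checked from the shell. The following is a minimal sketch, not part of Nutch: the function name check_hbase_storage is hypothetical, and the ivy.xml test is a line-based heuristic that assumes the comment markers and the dependency are either on separate lines or all on one line.

```shell
# Hypothetical helper: report which of the three HBase storage settings is missing.
# Argument: the Nutch source directory (for example /usr/search/apache-nutch-2.2.1).
check_hbase_storage() {
    home="$1"
    rc=0

    # (1) nutch-site.xml must mention HBaseStore as the storage class.
    if ! grep -q 'org.apache.gora.hbase.store.HBaseStore' "$home/conf/nutch-site.xml"; then
        echo "nutch-site.xml: storage.data.store.class is not HBaseStore"
        rc=1
    fi

    # (2) ivy.xml must contain the gora-hbase dependency outside an XML comment.
    if ! awk '
        /<!--/ { in_comment = 1 }
        !in_comment && /name="gora-hbase"/ { found = 1 }
        /-->/ { in_comment = 0 }
        END { exit !found }
    ' "$home/ivy/ivy.xml"; then
        echo "ivy.xml: gora-hbase dependency is missing or still commented out"
        rc=1
    fi

    # (3) gora.properties must set the default datastore.
    if ! grep -q '^gora\.datastore\.default=org\.apache\.gora\.hbase\.store\.HBaseStore' "$home/conf/gora.properties"; then
        echo "gora.properties: gora.datastore.default is not HBaseStore"
        rc=1
    fi

    return "$rc"
}
```

With the paths used in this article, the check would be run as check_hbase_storage /usr/search/apache-nutch-2.2.1. It prints one line per missing setting and returns non-zero if any is missing.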

The steps below are the ones required to build a basic Nutch installation.

(4) Build the runtime

cd /usr/search/apache-nutch-2.2.1/

ant runtime

(5) Verify that Nutch is installed

[root@jediael44 apache-nutch-2.2.1]# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
[root@jediael44 bin]# ./nutch
Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.


(6) vi /usr/search/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml and add the agent name for the crawl task

<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>

(7) Create seed.txt

cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/

vi seed.txt

http://nutch.apache.org/
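The seed list can also be written without an editor, which helps when the setup is scripted. This is a minimal sketch: the helper name write_seeds is hypothetical, and placing the file in a separate urls directory is only an assumption, modelled on the urls/ directory that the crontab script near the end of this article passes to crawl.

```shell
# Hypothetical helper: write one seed URL per line into <dir>/seed.txt.
# Usage: write_seeds <dir> <url> [<url> ...]
write_seeds() {
    dir="$1"
    shift
    mkdir -p "$dir" || return 1
    # printf repeats the format once per remaining argument.
    printf '%s\n' "$@" > "$dir/seed.txt"
}
```

For example, write_seeds /usr/search/apache-nutch-2.2.1/runtime/local/urls http://nutch.apache.org/ creates the directory if needed and writes a one-line seed.txt into it.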


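(8) Edit the URL filter

vi /usr/search/apache-nutch-2.2.1/conf/regex-urlfilter.txt

Because ant runtime has already copied conf/ into runtime/local/conf, make the same change in runtime/local/conf/regex-urlfilter.txt (or rebuild) so that it takes effect.

Change

# accept anything else
+.

to

# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/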
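To see which URLs the new accept rule lets through, the pattern can be tried out with grep. This is only an approximation under stated assumptions: Nutch evaluates the rule with Java regular expressions and applies the earlier rules in regex-urlfilter.txt first, whereas this sketch tests the single pattern with grep -E. The names FILTER_REGEX and url_allowed are hypothetical.

```shell
# The accept rule from regex-urlfilter.txt, without the leading "+".
FILTER_REGEX='^http://([a-z0-9]*\.)*nutch.apache.org/'

# Hypothetical helper: succeeds if the URL matches the accept rule.
url_allowed() {
    printf '%s\n' "$1" | grep -Eq "$FILTER_REGEX"
}
```

With this pattern, url_allowed http://nutch.apache.org/ succeeds, while url_allowed https://nutch.apache.org/ fails because the rule only covers the http scheme.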


(9) Index additional fields

By default, only the fields in the core and index-basic sections of schema.xml are indexed. To index more fields, enable additional plugins as follows.

Edit nutch-default.xml and extend plugin.includes with the extra plugins (highlighted in red in the original post):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|index-anchor|index-more|languageidentifier|subcollection|feed|creativecommons|tld</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

Alternatively, add a plugin.includes property to nutch-site.xml and copy the content above into it. Note that a property in nutch-site.xml replaces the one in nutch-default.xml, so the original plugins must be copied over as well.
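Since plugin.includes is a regular expression over plugin directory names, it is easy to check whether a given plugin is covered by the value above. The sketch below assumes the expression must match the whole plugin name and uses grep -E -x to imitate that; the names PLUGIN_INCLUDES and plugin_included are hypothetical.

```shell
# The plugin.includes value from the property above.
PLUGIN_INCLUDES='protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|index-anchor|index-more|languageidentifier|subcollection|feed|creativecommons|tld'

# Hypothetical helper: succeeds if the plugin directory name matches the
# whole expression (-x anchors the match to the full line).
plugin_included() {
    printf '%s\n' "$1" | grep -Eqx "($PLUGIN_INCLUDES)"
}
```

For instance, plugin_included index-more succeeds with this value, while plugin_included protocol-httpclient fails, which is consistent with the description's remark that HTTPS support has to be enabled separately.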


3. Configuring HBase

(1) vi /usr/search/hbase-0.90.4/conf/hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value><Your path></value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value><Your path></value>
  </property>
</configuration>

Note: this step is optional. If it is skipped, the defaults in hbase-default.xml (/usr/search/hbase-0.90.4/src/main/resources/hbase-default.xml) apply.

The default is:

<property>
  <name>hbase.rootdir</name>
  <value>file:///tmp/hbase-${user.name}/hbase</value>
  <description>The directory shared by region servers and into
  which HBase persists. The URL should be 'fully-qualified'
  to include the filesystem scheme. For example, to specify the
  HDFS directory '/hbase' where the HDFS instance's namenode is
  running at namenode.example.org on port 9000, set this value to:
  hdfs://namenode.example.org:9000/hbase. By default HBase writes
  into /tmp. Change this configuration else all data will be lost
  on machine restart.
  </description>
</property>

In other words, data goes under /tmp by default and may be lost when the machine restarts.

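It is still advisable to set these properties explicitly, especially the second one, for ZooKeeper; leaving it unset can cause a variety of problems. The example below places both directories on the local filesystem. hbase.rootdir is a URL and takes the file:// scheme, whereas hbase.zookeeper.property.dataDir is handed to ZooKeeper as a plain local directory and is therefore written without a scheme.

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/jediael/hbaserootdir</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/jediael/hbasezookeeperdataDir</value>
  </property>
</configuration>

Note that for hbase.rootdir, a path without the file:// prefix is treated as hdfs:// by default.

In version 0.90.4, however, the default is still the local filesystem.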



4. Configuring Solr

(1) Overwrite Solr's schema.xml. (For Solr 4, schema-solr4.xml should be used instead.)

cp /usr/search/apache-nutch-2.2.1/conf/schema.xml /usr/search/solr-4.9.0/example/solr/collection1/conf/

(2) With Solr 3.6 the configuration is complete at this point. With 4.9, the schema.xml just copied needs the following changes:

Remove: <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />

Add: <field name="_version_" type="long" indexed="true" stored="true"/>
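The two schema changes can also be applied with a small script instead of by hand. This is a sketch under assumptions: it expects the EnglishPorterFilterFactory filter to sit on a line of its own and the schema to contain an opening <fields> line, after which the new field is inserted. The function name patch_schema is hypothetical; back up schema.xml before trying it.

```shell
# Hypothetical helper: remove the EnglishPorterFilterFactory line and add the
# _version_ field after the opening <fields> tag, unless it is already there.
patch_schema() {
    schema="$1"
    tmp="$(mktemp)" || return 1
    if grep -q 'name="_version_"' "$schema"; then add=0; else add=1; fi
    awk -v add="$add" '
        # Drop the filter that Solr 4.9 no longer accepts.
        /solr\.EnglishPorterFilterFactory/ { next }
        { print }
        # Insert the _version_ field once, right after <fields>.
        /<fields>/ && add == 1 {
            print "    <field name=\"_version_\" type=\"long\" indexed=\"true\" stored=\"true\"/>"
            add = 0
        }
    ' "$schema" > "$tmp" && cat "$tmp" > "$schema"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}
```

With the paths from step (1), it would be called as patch_schema /usr/search/solr-4.9.0/example/solr/collection1/conf/schema.xml. Running it a second time leaves the file unchanged.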


5. Starting the crawl

(1) Start HBase

[root@jediael44 bin]# cd /usr/search/hbase-0.90.4/bin/
[root@jediael44 bin]# ./start-hbase.sh

(2) Start Solr

[root@jediael44 bin]# cd /usr/search/solr-4.9.0/example/
[root@jediael44 example]# java -jar start.jar

Solr stays in the foreground, so run the next step from a second terminal.

(3) Start Nutch and begin crawling

[root@jediael44 example]# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
[root@jediael44 bin]# ./crawl seed.txt TestCrawl http://localhost:8983/solr 2

That's it; the crawl is now running.
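Once the crawl has finished, one way to confirm that documents reached Solr is to ask the index for its document count. The sketch below only builds the query URL; it assumes the default core is collection1 (the directory used in the Solr configuration step), and the helper name solr_count_url is hypothetical.

```shell
# Hypothetical helper: build a "count all documents" query URL for a Solr base URL.
# Assumption: the core is named collection1, as in the Solr 4.9 example layout.
solr_count_url() {
    base="${1%/}"    # strip one trailing slash, if any
    printf '%s/collection1/select?q=*:*&rows=0&wt=json\n' "$base"
}

# Usage (needs a running Solr):
#   curl -s "$(solr_count_url http://localhost:8983/solr)"
```

The JSON response contains a numFound value under response; a value greater than zero indicates that the crawl indexed at least one page.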


For some analysis of the process above, see:

Integrating Nutch/HBase/Solr to Build a Search Engine, Part 2: Content Analysis

http://blog.csdn.net/jediael_lu/article/details/37738569


When Nutch is scheduled through crontab, the job fails with the error "JAVA_HOME is not set". The workaround is a wrapper script that runs the crawl:

#!/bin/bash
export JAVA_HOME=/usr/java/jdk1.7.0_51
/opt/jediael/apache-nutch-2.2.1/runtime/local/bin/crawl /opt/jediael/apache-nutch-2.2.1/runtime/local/urls/ main http://localhost:8080/solr/ 2 >> ~jediael/nutch.log

Then schedule it:

30 0,6,8,10,12,14,16,18,20,22 * * * bash /opt/jediael/apache-nutch-2.2.1/runtime/local/bin/myCrawl.sh
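The error arises because cron starts jobs with a minimal environment, so a JAVA_HOME exported in a login profile is not visible to the job. A wrapper script can make this failure explicit with a small guard. This is a minimal sketch; the function name require_java_home is hypothetical.

```shell
# Hypothetical guard: fail early with a clear message if JAVA_HOME is unusable.
require_java_home() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "no executable java found under $JAVA_HOME/bin" >&2
        return 1
    fi
    return 0
}
```

In the wrapper script it would be called right after the export line, for example require_java_home || exit 1, so that a wrong JDK path shows up in the log instead of a failed crawl.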



Reposted from: https://www.cnblogs.com/eaglegeek/p/4557894.html
