[Nutch 2.2.1 Basic Tutorial 2.1] Integrating Nutch/HBase/Solr to Build a Search Engine, Part 1: Installation and Running (Single-Machine Environment)...

Published: 2025/4/14


1. Download the required software and extract it

The versions used are:

(1) apache-nutch-2.2.1

(2) hbase-0.90.4

(3) solr-4.9.0

Extract everything into /usr/search.


2. Nutch configuration

(1) vi /usr/search/apache-nutch-2.2.1/conf/nutch-site.xml

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>

(2) vi /usr/search/apache-nutch-2.2.1/ivy/ivy.xml

This line is commented out by default. Remove the comment markers so that it takes effect.

<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

(3) vi /usr/search/apache-nutch-2.2.1/conf/gora.properties

Add the following line:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

The three steps above configure Nutch to use HBase for storage.

The steps below are the ones actually required to build a basic Nutch installation.

(4) Build the runtime

cd /usr/search/apache-nutch-2.2.1/

ant runtime

(5) Verify that Nutch is installed

[root@jediael44 apache-nutch-2.2.1]# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
[root@jediael44 bin]# ./nutch
Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.


(6) vi /usr/search/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml and add the crawl settings

<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>

(7) Create seed.txt

cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/

vi seed.txt

http://nutch.apache.org/


(8) Modify the URL filter

vi /usr/search/apache-nutch-2.2.1/conf/regex-urlfilter.txt

# accept anything else
+.

Change this to:

# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/
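To see which URLs the new accept rule lets through, the pattern can be tried out with grep -E. This is only a rough sketch: Nutch evaluates the rule with Java regular expressions and also applies the other rules in regex-urlfilter.txt, which are ignored here. The function name url_allowed and the sample URLs are made up for illustration.

```shell
#!/bin/sh
# Approximation of the single accept rule above using grep -E.
# Nutch itself evaluates the rule with Java regular expressions.
RULE='^http://([a-z0-9]*\.)*nutch.apache.org/'

url_allowed() {
  # Exit status 0 when the URL matches the accept rule, 1 otherwise.
  printf '%s\n' "$1" | grep -Eq "$RULE"
}

# Hypothetical sample URLs, used only to exercise the rule.
for url in \
  'http://nutch.apache.org/' \
  'http://sub.nutch.apache.org/page' \
  'http://example.com/'
do
  if url_allowed "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

Note that the dots in nutch.apache.org are not escaped, so each one matches any character. A stricter rule would write nutch\.apache\.org.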


(9) Index additional fields

By default, only the fields from the core section of schema.xml and from index-basic are indexed. To index more fields, proceed as follows.

Edit nutch-default.xml and extend plugin.includes as shown (the additions were highlighted in red in the original post):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|index-anchor|index-more|languageidentifier|subcollection|feed|creativecommons|tld</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

Alternatively, add the plugin.includes property to nutch-site.xml and copy the content above into it. Note that a property in nutch-site.xml replaces the same property in nutch-default.xml, so the original values have to be copied over as well.
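Since plugin.includes is a regular expression over plugin directory names, a quick way to check whether a given plugin would be picked up is to match its name against the value. The sketch below does this with grep -E. It assumes the whole name has to match the expression, and the helper name plugin_included is hypothetical.

```shell
#!/bin/sh
# Value copied from the plugin.includes property above.
PLUGIN_INCLUDES='protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|index-anchor|index-more|languageidentifier|subcollection|feed|creativecommons|tld'

plugin_included() {
  # -x requires the whole name to match; this models the assumption that
  # a plugin directory name must match the expression in full.
  printf '%s\n' "$1" | grep -Eqx "$PLUGIN_INCLUDES"
}

for name in index-more parse-tika protocol-httpclient; do
  if plugin_included "$name"; then
    echo "included: $name"
  else
    echo "excluded: $name"
  fi
done
```

Under that assumption, protocol-httpclient is reported as excluded, which is consistent with the description's remark that HTTPS support has to be enabled separately.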


3. HBase configuration

(1) vi /usr/search/hbase-0.90.4/conf/hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value><Your path></value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value><Your path></value>
  </property>
</configuration>

Note: this step is optional. If it is skipped, the defaults from hbase-default.xml (/usr/search/hbase-0.90.4/src/main/resources/hbase-default.xml) are used.

The default is:

<property>
  <name>hbase.rootdir</name>
  <value>file:///tmp/hbase-${user.name}/hbase</value>
  <description>The directory shared by region servers and into
  which HBase persists. The URL should be 'fully-qualified'
  to include the filesystem scheme. For example, to specify the
  HDFS directory '/hbase' where the HDFS instance's namenode is
  running at namenode.example.org on port 9000, set this value to:
  hdfs://namenode.example.org:9000/hbase. By default HBase writes
  into /tmp. Change this configuration else all data will be lost
  on machine restart.
  </description>
</property>

That is, by default the data is stored under /tmp, and it may be lost when the machine restarts.

It is still advisable to set these properties explicitly, especially the second one, which concerns ZooKeeper; otherwise various problems can occur. The following configuration places both directories on the local file system.

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/jediael/hbaserootdir</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>file:///home/jediael/hbasezookeeperdataDir</value>
  </property>
</configuration>

Note: without the file:// prefix, the path is treated as hdfs:// by default.

In version 0.90.4, however, the default is still the local file system.



4. Solr configuration

(1) Overwrite Solr's schema.xml file. (For Solr 4, schema-solr4.xml should be used.)

cp /usr/search/apache-nutch-2.2.1/conf/schema.xml /usr/search/solr-4.9.0/example/solr/collection1/conf/

(2) With Solr 3.6 the configuration is complete at this point. With 4.9, the following changes are also needed:

Edit the schema.xml file copied above.

Remove: <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />

Add: <field name="_version_" type="long" indexed="true" stored="true"/>
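The two schema.xml changes can also be scripted. The following is a minimal sketch rather than a tested migration tool: the helper name patch_schema is hypothetical, and it assumes that the filter element and the opening <fields> tag each sit on a line of their own in the copied file.

```shell
#!/bin/sh
# Hypothetical helper: apply the two schema.xml changes described above.
# Assumes the filter element and the <fields> opening tag each sit on
# their own line.
patch_schema() {
  schema=$1
  tmp="$schema.tmp"
  awk '
    # Drop the EnglishPorterFilterFactory line, as step (2) requires.
    /solr\.EnglishPorterFilterFactory/ { next }
    { print }
    # Add the _version_ field right after the first <fields> tag, once.
    /<fields>/ && !added {
      print "    <field name=\"_version_\" type=\"long\" indexed=\"true\" stored=\"true\"/>"
      added = 1
    }
  ' "$schema" > "$tmp" && mv "$tmp" "$schema"
}
```

It would be called as patch_schema /usr/search/solr-4.9.0/example/solr/collection1/conf/schema.xml, ideally after making a backup copy of the file.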


5. Start the crawl

(1) Start HBase

[root@jediael44 bin]# cd /usr/search/hbase-0.90.4/bin/
[root@jediael44 bin]# ./start-hbase.sh

(2) Start Solr

[root@jediael44 bin]# cd /usr/search/solr-4.9.0/example/
[root@jediael44 example]# java -jar start.jar

(3) Start Nutch and begin crawling

[root@jediael44 example]# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
[root@jediael44 bin]# ./crawl seed.txt TestCrawl http://localhost:8983/solr 2

That's it. The crawl is now running.
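The crawl command above takes four positional arguments. Reading them as seed path, crawl ID, Solr URL and number of rounds is an interpretation of the command line shown, not something the post spells out. Under that assumption, a small dry-run helper (the name crawl_cmd is hypothetical) can validate the arguments and print the command before anything is started:

```shell
#!/bin/sh
# Hypothetical dry-run helper: validate the four positional arguments and
# print the crawl command instead of running it.
crawl_cmd() {
  seed=$1 crawl_id=$2 solr_url=$3 rounds=$4

  # The seed path (seed.txt in this walkthrough) must exist.
  [ -e "$seed" ] || { echo "seed path not found: $seed" >&2; return 1; }

  # The last argument is read here as a number of rounds (an assumption).
  case $rounds in
    ''|*[!0-9]*) echo "rounds must be a whole number: $rounds" >&2; return 1 ;;
  esac

  echo "./crawl $seed $crawl_id $solr_url $rounds"
}
```

For example, crawl_cmd seed.txt TestCrawl http://localhost:8983/solr 2 prints the command line used in step (3) when seed.txt exists in the current directory.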


For some analysis of the process above, see:

Integrating Nutch/HBase/Solr to Build a Search Engine, Part 2: Content Analysis

http://blog.csdn.net/jediael_lu/article/details/37738569


When crontab was used to schedule Nutch as a recurring job, the following error appeared: JAVA_HOME is not set.

So a script was created to run the crawl:

#!/bin/bash
export JAVA_HOME=/usr/java/jdk1.7.0_51
/opt/jediael/apache-nutch-2.2.1/runtime/local/bin/crawl /opt/jediael/apache-nutch-2.2.1/runtime/local/urls/ main http://localhost:8080/solr/ 2 >> ~jediael/nutch.log

Then the recurring job was configured:

30 0,6,8,10,12,14,16,18,20,22 * * * bash /opt/jediael/apache-nutch-2.2.1/runtime/local/bin/myCrawl.sh
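Cron starts jobs with a minimal environment, which is why JAVA_HOME has to be exported inside the wrapper script. As an optional addition, the wrapper could verify the setting before calling crawl, so that a wrong JDK path fails with a clear message. The function name require_java_home below is hypothetical and not part of Nutch.

```shell
#!/bin/sh
# Hypothetical guard for a cron wrapper: check JAVA_HOME before calling
# the crawl script, since cron does not load the login environment.
require_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$JAVA_HOME/bin/java" ]; then
    echo "no executable java under $JAVA_HOME/bin" >&2
    return 1
  fi
}
```

In the wrapper it would be called right after the export line, for example as require_java_home || exit 1.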



Reposted from: https://www.cnblogs.com/eaglegeek/p/4557894.html
