An Analysis of Using the Nutch Crawler Engine

Published: 2025/4/16

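Main execution flow of Nutch 2.x:

1) InjectorJob: reads a batch of seed URLs from a file and puts them into the crawl database.

2) GeneratorJob: selects the pages to be fetched from the crawl database and places them in the fetch queue.

3) FetcherJob: fetches the pages in the fetch queue; its reducer uses a producer/consumer model.

4) ParserJob: parses the fetched pages, producing new links and the parsed page content.

5) DbUpdaterJob: writes the newly discovered links back into the crawl database.

6) SolrIndexerJob: builds the index from the parsed content.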
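The producer/consumer model mentioned for FetcherJob can be sketched roughly as follows. This is a minimal illustration built on java.util.concurrent, not Nutch's actual fetcher code: the class name FetchQueueSketch, the queue capacity, the "poison pill" shutdown and the fake fetch step are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class FetchQueueSketch {
    // Sentinel value telling a consumer thread that no more work will arrive.
    private static final String POISON = "__END__";

    // One producer feeds URLs into a bounded queue; `threads` consumers "fetch" them.
    public static List<String> run(List<String> urls, int threads) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(16);
        ConcurrentLinkedQueue<String> fetched = new ConcurrentLinkedQueue<>();
        Thread[] consumers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            consumers[i] = new Thread(() -> {
                try {
                    while (true) {
                        String url = queue.take();   // blocks until work is available
                        if (url.equals(POISON)) {
                            return;
                        }
                        fetched.add(url);            // a real fetcher would issue the HTTP request here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            consumers[i].start();
        }
        try {
            for (String url : urls) {
                queue.put(url);                      // blocks while the queue is full
            }
            for (int i = 0; i < threads; i++) {
                queue.put(POISON);                   // one pill per consumer
            }
            for (Thread t : consumers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching", e);
        }
        return new ArrayList<>(fetched);
    }

    public static void main(String[] args) {
        List<String> done = run(Arrays.asList(
                "http://a.example/", "http://b.example/", "http://c.example/"), 2);
        System.out.println("fetched " + done.size() + " URLs");
    }
}
```

The bounded queue is the point of the pattern: the producer is throttled when the consumers fall behind, so the list of pending URLs never has to fit in memory at once.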

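After reading through the source and setting things up, I found that with Nutch 2.2 DbUpdaterJob was not being executed: the parsed links never appeared in the webpage table, so only the first layer was crawled. I redeployed Nutch 2.0 to experiment and ran the commands step by step. The key point was to copy nutch-default.xml directly into nutch-site.xml; the page filter rules can be changed in conf/regex-urlfilter.txt.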
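For readers unfamiliar with conf/regex-urlfilter.txt, the sketch below shows how its rules behave as described in the comments of the stock file: each rule is a regular expression prefixed with + (accept) or -  (reject), the first matching rule decides, and a URL that matches no rule is rejected. The class UrlFilterSketch and the two example rules are hypothetical and written only for illustration; this is not the Nutch plugin code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // One parsed rule: accept (+) or reject (-) plus its pattern.
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Parses filter lines; blank lines and lines starting with '#' are skipped.
    public UrlFilterSketch(List<String> lines) {
        for (String line : lines) {
            String s = line.trim();
            if (s.isEmpty() || s.startsWith("#")) {
                continue;
            }
            char sign = s.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("rule must start with + or -: " + s);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(s.substring(1))));
        }
    }

    // First matching rule wins; no match means the URL is rejected.
    public boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch(Arrays.asList(
                "# skip common static resources (example rule)",
                "-\\.(gif|jpg|png|css|js)$",
                "# accept anything under sina.com.cn (example rule)",
                "+^http://([a-z0-9]*\\.)*sina\\.com\\.cn/"));
        System.out.println(filter.accepts("http://sports.sina.com.cn/nba/"));
        System.out.println(filter.accepts("http://sports.sina.com.cn/logo.png"));
        System.out.println(filter.accepts("http://www.163.com/"));
    }
}
```

Because order matters, a broad accept rule placed above a reject rule makes the reject rule unreachable, which is worth checking whenever a crawl returns fewer or more URLs than expected.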

After compiling, change into the runtime/local directory:

# bin/nutch inject urls   // urls is the seed directory

# bin/nutch generate -topN 5

# bin/nutch fetch -all

# bin/nutch parse -all

# bin/nutch updatedb

The webpage table of the nutch database in MySQL still had no new links written to it. Checking the number of pages and links in the WebDB:

# bin/nutch readdb crawl-tinysite/db -stats   // only 2 URLs

At a loss, I went through the nutch-site.xml configuration and analyzed the messages in hadoop.log. After adding the following setting to nutch-site.xml, the error about http.robots.agents no longer appeared.

<property>
  <name>http.robots.agents</name>
  <value>nutch2.0,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

hadoop.log also showed the warning "mapred.FileOutputCommitter - Output path is null in cleanup". I have not found a fix for it yet, but according to an answer on the official site it is harmless.

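The current situation: only the first-layer ParserJob runs and DbUpdaterJob does not, so the crawled links are never written to the database and the second layer of crawling cannot start. Based on building and running versions 1.6, 2.0, 2.1 and 2.2, and on adjusting nutch-site.xml and the MySQL character set settings, the likely problem is that the parsed data fails to be written to the database.

1) The 1.x line supports only Hadoop storage, and it runs normally;

2) The 2.x line supports MySQL storage. For development convenience MySQL was integrated directly when deploying Nutch; HBase was not tested;

3) Judging from runs across the 2.x versions, the problem is independent of the version and of the execution parameters, and may be related to the MySQL integration;

4) Running each job of 2.2 step by step, and reading hadoop.log, shows the run stalling at ParserJob with a character-related error;

5) The problem is therefore tentatively traced to an incompatibility between MySQL's character handling and the database and table that Nutch generates automatically through the MySQL integration.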
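To make the consequence of a missing DbUpdaterJob concrete, here is a toy in-memory simulation of the inject / generate / fetch / parse / updatedb cycle. It is only a sketch under simplifying assumptions: the class CrawlCycleSketch, the boolean "fetched" flag standing in for the webpage table, and the hard-coded link graph are all invented for the example and have nothing to do with Nutch's real data model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CrawlCycleSketch {
    // Runs up to `depth` rounds and returns the URLs that were actually fetched.
    public static Set<String> crawl(List<String> seeds, Map<String, List<String>> outlinks,
                                    int depth, boolean runUpdate) {
        Map<String, Boolean> db = new LinkedHashMap<>();      // URL -> fetched?
        for (String seed : seeds) {
            db.put(seed, false);                              // inject
        }
        for (int round = 0; round < depth; round++) {
            List<String> batch = new ArrayList<>();           // generate: all unfetched URLs
            for (Map.Entry<String, Boolean> e : db.entrySet()) {
                if (!e.getValue()) {
                    batch.add(e.getKey());
                }
            }
            if (batch.isEmpty()) {
                break;                                        // nothing left to fetch
            }
            List<String> discovered = new ArrayList<>();
            for (String url : batch) {
                db.put(url, true);                            // fetch
                discovered.addAll(                            // parse: collect outlinks
                        outlinks.getOrDefault(url, Collections.<String>emptyList()));
            }
            if (runUpdate) {
                for (String url : discovered) {
                    db.putIfAbsent(url, false);               // updatedb: store new links
                }
            }
        }
        Set<String> fetched = new LinkedHashSet<>();
        for (Map.Entry<String, Boolean> e : db.entrySet()) {
            if (e.getValue()) {
                fetched.add(e.getKey());
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("seed", Arrays.asList("a", "b"));
        graph.put("a", Arrays.asList("c"));
        System.out.println("with updatedb:    " + crawl(Arrays.asList("seed"), graph, 3, true));
        System.out.println("without updatedb: " + crawl(Arrays.asList("seed"), graph, 3, false));
    }
}
```

With the update step disabled, the second round finds nothing to generate and the crawl stops after the seeds, which matches the symptom described above.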

Because of the encoding problem, the nutch database and the webpage table were created manually in MySQL, following an online guide (http://www.cnblogs.com/AloneSword/p/3798126.html). Before compiling Nutch 2.2, configure MySQL as follows:

1) # vi /etc/mysql/my.cnf

Add the following under [mysqld]:

[mysqld]
innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
max_allowed_packet=500M

Restart MySQL and check that it is running:

# netstat -tap | grep mysql   // in this environment MySQL is configured on port 5306

2) Grant access privileges:

# mysql -uroot -p

mysql> GRANT ALL PRIVILEGES ON *.* TO root@"%" IDENTIFIED BY "password";

3) Manually create the nutch database and the webpage table:

# mysql -uroot -p

mysql> CREATE DATABASE nutch DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;

mysql> use nutch;

mysql>
CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` mediumtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
DEFAULT CHARSET=utf8mb4;
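The reason innodb_large_prefix and ROW_FORMAT=COMPRESSED go together with a varchar(767) utf8mb4 primary key comes down to simple arithmetic, sketched below. The two limits used (767 bytes for an index key prefix by default, 3072 bytes with the large prefix option on a DYNAMIC or COMPRESSED table) are the documented InnoDB limits for the MySQL 5.5/5.6 generation; the class and method names are made up for the illustration.

```java
public class IndexPrefixCheck {
    static final int UTF8MB4_MAX_BYTES_PER_CHAR = 4;
    static final int LIMIT_WITHOUT_LARGE_PREFIX = 767;   // bytes, COMPACT/REDUNDANT row formats
    static final int LIMIT_WITH_LARGE_PREFIX = 3072;     // bytes, DYNAMIC/COMPRESSED + innodb_large_prefix

    // Worst-case byte length of an indexed utf8mb4 varchar column.
    public static int maxKeyBytes(int varcharLength) {
        return varcharLength * UTF8MB4_MAX_BYTES_PER_CHAR;
    }

    // True if a utf8mb4 varchar(n) key fits under the applicable limit.
    public static boolean fits(int varcharLength, boolean largePrefix) {
        int limit = largePrefix ? LIMIT_WITH_LARGE_PREFIX : LIMIT_WITHOUT_LARGE_PREFIX;
        return maxKeyBytes(varcharLength) <= limit;
    }

    public static void main(String[] args) {
        System.out.println(maxKeyBytes(767));      // 3068
        System.out.println(fits(767, true));       // true
        System.out.println(fits(767, false));      // false
        System.out.println(fits(191, false));      // true: 191 * 4 = 764
    }
}
```

So a 767-character key is just under the 3072-byte ceiling, while without the large prefix setting only about 191 utf8mb4 characters could be indexed.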

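The columns in the table are set according to Nutch's conf file gora-sql-mapping. To have the database and table generated automatically instead: after configuring gora-sql-mapping, gora.properties and the other files, the first run of "bin/nutch inject urls" creates the database and table. Automatic generation may run into problems, though; hadoop.log shows that many of them relate to the data types and lengths MySQL supports. Adjust and debug according to the log messages (a tool such as Navicat makes working with the database as convenient as with SQL Server), then repeat the automatic generation until it succeeds.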

Next, go to the Nutch directory, configure it, and recompile:

1) # vi ivy/ivy.xml   // uncomment the line below to enable MySQL

<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

2) vi conf/gora.properties

Comment out the default database and add the MySQL database settings as follows:

###############################
# MySQL properties            #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:5306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=123

3) vi conf/gora-sql-mapping.xml

Change the primarykey length from 512 to 767; note that this appears in two places:

<primarykey column="id" length="767"/>

4) vi conf/nutch-site.xml

Everything can be copied over from nutch-default.xml, but configuring a few key properties is enough. Add the following:

<property>
<name>http.agent.name</name>
<value>Nutch2.2</value>
</property>

<property>
  <name>http.robots.agents</name>
  <value>nutch2.2,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
<name>http.accept.language</name>
<description>Value of the "Accept-Language" request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>

<property>
<name>parser.character.encoding.default</name>
<description>The character encoding to fall back to when no other information
is available</description>
</property>

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>

Make sure the nutch-site file is saved in UTF-8 encoding.

5) Compile

# apt-get install ant   // install ant first if it is missing

# ant runtime   // run from the nutch2.2 directory

After compiling, change into the runtime/local directory to crawl. The steps are:

1) Crawl

# cd runtime/local

# mkdir -p urls

# echo 'http://nutch.apache.org/' > urls/seed.txt

# bin/nutch crawl urls -depth 3 -topN 5

2) Error handling

Error 1: GeneratorJob fails, and hadoop.log shows:

java.lang.NullPointerException
    at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
    at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)

Line 100 of GeneratorReducer reads:

batchId = new Utf8(conf.get(GeneratorJob.BATCH_ID));

So the value returned for GeneratorJob.BATCH_ID is null.

Solution 1:

Modify the public Map<String,Object> run(Map<String,Object> args) method in GeneratorJob and add the following code:

// generate batchId
int randomSeed = Math.abs(new Random().nextInt());
String batchId = (curTime / 1000) + "-" + randomSeed;
getConf().set(BATCH_ID, batchId);
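The batch ID logic can be tried outside Nutch with a small standalone sketch. The class BatchIdSketch and its method signature are hypothetical; the ID format (seconds since the epoch, a dash, a random non-negative integer) follows the snippet above. One deliberate difference: the sketch draws the number with nextInt(Integer.MAX_VALUE) rather than Math.abs(nextInt()), because Math.abs(Integer.MIN_VALUE) is still negative in Java.

```java
import java.util.Random;

public class BatchIdSketch {
    // Builds "<epoch seconds>-<non-negative random int>".
    public static String newBatchId(long curTimeMillis, Random random) {
        int randomSeed = random.nextInt(Integer.MAX_VALUE);   // always in [0, Integer.MAX_VALUE)
        return (curTimeMillis / 1000) + "-" + randomSeed;
    }

    public static void main(String[] args) {
        String batchId = newBatchId(System.currentTimeMillis(), new Random());
        System.out.println("batchId = " + batchId);
    }
}
```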

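Solution 2:

Add the generate.batch.id property to nutch-site.xml; any non-empty value will do, for example:

<property>
    <name>generate.batch.id</name>
    <value>*</value>
</property>

Solution 2 was adopted first; if problems show up later, switch to solution 1.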

Error 2: GeneratorJob fails, and hadoop.log shows: Unknown column 'batchId' in 'field list'.

Solution: add a batchId column to the webpage table, as follows:

`batchId` varchar(767) DEFAULT NULL,

mysql> ALTER TABLE webpage ADD batchId varchar(767) DEFAULT NULL;

mysql> SHOW COLUMNS FROM webpage;   // check for the batchId column

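3) Check the results

# mysql -u root -p

mysql> use nutch;

mysql> SELECT count(*) FROM nutch.webpage;

+----------+
| count(*) |
+----------+
|      495 |
+----------+

495 records in total: success. Adding the Solr index will be looked into later, depending on the volume of crawled data. The lesson from this experience: borrow from the steps others have published online where possible, because working everything out alone means a lot of detours.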

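During the crawl experiments, some sites yielded no links from within their pages, for example the 163, Sina and Tencent portals and the Tianya forum. hadoop.log showed no errors at all, which was another dead end. Could an anti-crawler policy be in place?

Nutch can only fetch the content of the plain page. It does not see the results of JS requests, AJAX requests, embedded iframes and the like that run after the page has loaded.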
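The limitation is easy to reproduce with a toy link extractor that, like any crawler working on the raw HTTP response, only looks at the static markup. The sketch below uses a simple regular expression for brevity; Nutch itself parses pages through its parser plugins rather than a regex, and the class name StaticLinkSketch and the sample HTML are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StaticLinkSketch {
    // Matches href values inside <a ...> tags of the raw HTML text.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s+[^>]*href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns the link targets that are visible without executing any script.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<a href=\"http://example.com/static\">static link</a>"
                + "<script>var a = document.createElement('a');"
                + "a.href = 'http://example.com/dynamic';"
                + "document.body.appendChild(a);</script>"
                + "</body></html>";
        // Only the static link is found; the one created by the script never
        // appears as an <a> tag in the downloaded text.
        System.out.println(extractLinks(html));
    }
}
```

A page whose navigation is assembled by scripts therefore looks almost link-free to such a crawler, even though it is fetched without any error.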

Analysis of crawling the single seed http://www.163.com:

mysql> select id,title,status from nutch.webpage;

+------------------------------+--------+--------+
| id                           | title  | status |
+------------------------------+--------+--------+
| com.163.www:http/            | 網易   |      2 |
| com.netease.cache.img1:http/ | NULL   |      3 |
+------------------------------+--------+--------+

In the status column, 2 means the page was fetched correctly and 3 means the page does not exist. So why did the crawl of NetEase produce only one link? The content fetched from NetEase shows charset=gb2312 and, most importantly, a body with no content at all, which points to an anti-crawler policy.
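The id values in the table are the page URLs with the host name reversed, so that pages of the same domain sort next to each other. A rough sketch of that transformation, which reproduces the two keys shown above, could look like this. It is written from the observed key format only; the class UrlKeySketch is hypothetical, and the handling of ports and query strings is an assumption rather than a copy of Nutch's own utility.

```java
import java.net.URI;

public class UrlKeySketch {
    // Turns "http://www.163.com/" into "com.163.www:http/".
    public static String reverseUrl(String urlString) {
        URI uri = URI.create(urlString);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + urlString);
        }
        String[] parts = host.split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {   // reverse the host labels
            key.append(parts[i]);
            if (i > 0) {
                key.append('.');
            }
        }
        key.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {                       // assumption: explicit ports are kept
            key.append(':').append(uri.getPort());
        }
        String path = uri.getRawPath();
        key.append(path == null || path.isEmpty() ? "/" : path);
        if (uri.getRawQuery() != null) {                 // assumption: query string is kept
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseUrl("http://www.163.com/"));
        System.out.println(reverseUrl("http://img1.cache.netease.com/"));
    }
}
```

Knowing the key format helps when querying the table by hand, for example with a LIKE 'com.sina.%' prefix to list everything stored for one domain.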

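Analysis of crawling the single seed http://www.sina.com.cn:

The webpage table shows that the fetch was also correct and that the body has page content, so why were no further links crawled?

Analysis of crawling the single seed http://www.tianya.cn: as with Sina, the page was fetched correctly and its content is correct, so why were no further links crawled? Tencent and Sohu behave similarly.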

An attempt to get around the anti-crawler restriction by imitating the Chrome browser:

<property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari</value>
</property>

<property>
    <name>http.agent.version</name>
    <value>537.36</value>
</property>

This had no effect.

Further investigation is needed into why some sites yield no links.

The http.robots.agents property had been configured in conf/nutch-site.xml; remove it and observe.

<property>
  <name>http.robots.agents</name>
  <value>nutch2.2,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

Still no effect.

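Out of options, I tried a few more URLs and suddenly found that second-level domains can all be crawled: http://sports.sina.com.cn works, while http://www.sina.com.cn does not. The exact cause is unclear. Either the portal sites have an anti-crawler policy, or there is a problem in Nutch's own mechanism. Posts online say custom development is needed to handle this, so it is left for later work in the source code.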

In the experiment, the following seed URLs were added:

http://focus.tianya.cn/

http://sports.sina.com.cn/

http://sports.163.com/

http://sports.sohu.com/

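With depth=10 and topN=200, the crawl produced 30365 records in nearly an hour. I then interrupted it, not knowing how much longer it would run. The main specs of the machine: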

# cat /proc/cpuinfo

8 cores, Intel(R) Xeon(R) CPU E5410 @ 2.33GHz

# free -m

             total       used       free     shared    buffers     cached
Mem:         16043       6768       9275          0        358       4907
-/+ buffers/cache:       1502      14540
Swap:            0          0          0

Nutch is suited to large-scale crawling; it is best to store directly in Hadoop and build a Solr index to search the crawl results.
