[Nutch Basics Tutorial, Part 7] Nutch's two run modes: local and deploy
After running ant runtime on the Nutch source code, a runtime directory is created. It contains two subdirectories, deploy and local.
[jediael@jediael runtime]$ ls
deploy  local
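These two directories correspond to Nutch's two run modes: deploy mode and local mode.
1. How nutch.sh handles the two run modes
2. Commands run from the deploy directory run in deploy mode; commands run from the local directory run in local mode.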
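As a rough illustration of how a launcher script could tell the two directories apart, the sketch below checks for a job file. It assumes that a deploy directory contains a *nutch*.job file (such as the apache-nutch-2.2.1.job used later in this article) and that a local directory does not. detect_mode is a hypothetical helper written for this example, not a function taken from the Nutch scripts.

```shell
#!/usr/bin/env bash
# Sketch only: detect_mode is a hypothetical helper, not part of Nutch.
# Assumption: a deploy directory holds a *nutch*.job file, a local one does not.
detect_mode() {
  local dir="$1" f
  for f in "$dir"/*nutch*.job; do
    # If the glob matched nothing, $f is the literal pattern and -f fails.
    if [ -f "$f" ]; then
      echo "deploy"
      return 0
    fi
  done
  echo "local"
}
```

Calling detect_mode runtime/deploy and detect_mode runtime/local would then report the mode each directory implies under that assumption.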
The rest of this article uses inject as the example to demonstrate both modes.
I. Local mode
1. Basic usage:
$ bin/nutch inject
Usage: InjectorJob <url_dir> [-crawlId <id>]
Usage 1: no id specified
liaoliuqingdeMacBook-Air:local liaoliuqing$ bin/nutch inject urls
InjectorJob: starting at 2014-12-20 22:32:01
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2014-12-20 22:32:15, elapsed: 00:00:14
Usage 2: specifying an id
$ bin/nutch inject urls -crawlId 2
InjectorJob: starting at 2014-12-20 22:34:01
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2014-12-20 22:34:15, elapsed: 00:00:14
2. Changes to the data in the database
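The commands above create a new table in the HBase database named ${id}_webpage. If no id is specified, the table is named webpage.
The contents of the files in the urls directory are then written to that table as the crawl seeds.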
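The naming rule can be written down as a small shell function, which is handy when scripting checks against HBase. This is only a sketch of the rule as described here; webpage_table is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: webpage_table is a hypothetical helper.
# It mirrors the rule described in the text: "<id>_webpage" when a crawl id
# is given, plain "webpage" otherwise.
webpage_table() {
  local id="${1:-}"
  if [ -n "$id" ]; then
    echo "${id}_webpage"
  else
    echo "webpage"
  fi
}
```

For the second inject run above (-crawlId 2) this yields 2_webpage, which is the table name to pass to scan in the HBase shell.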
hbase(main):003:0> scan 'webpage'
ROW                  COLUMN+CELL
 com.163.www:http/   column=f:fi, timestamp=1419085934952, value=\x00'\x8D\x00
 com.163.www:http/   column=f:ts, timestamp=1419085934952, value=\x00\x00\x01Jh\x1C\xBC7
 com.163.www:http/   column=mk:_injmrk_, timestamp=1419085934952, value=y
 com.163.www:http/   column=mk:dist, timestamp=1419085934952, value=0
 com.163.www:http/   column=mtdt:_csh_, timestamp=1419085934952, value=?\x80\x00\x00
 com.163.www:http/   column=s:s, timestamp=1419085934952, value=?\x80\x00\x00
1 row(s) in 0.6140 seconds
When the inject command is run again, any new URLs are added to the table.
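The row key in the scan output, com.163.www:http/, has the host name reversed and the protocol moved behind it. The sketch below reproduces that shape for simple URLs so a key can be predicted before looking it up. reverse_url is a hypothetical helper; it assumes a URL of the form scheme://host/path and ignores ports, user info and other details that the real key format may handle.

```shell
#!/usr/bin/env bash
# Sketch only: reverse_url is a hypothetical helper that imitates the row
# keys seen in the scan output. Assumes scheme://host/path with no port.
reverse_url() {
  local url="$1"
  local scheme="${url%%://*}"   # e.g. http
  local rest="${url#*://}"      # e.g. www.163.com/
  local host="${rest%%/*}"      # e.g. www.163.com
  local path="/"
  case "$rest" in
    */*) path="/${rest#*/}" ;;  # keep everything after the first slash
  esac
  local reversed="" part
  local IFS='.'
  for part in $host; do         # split the host on dots and prepend each label
    reversed="${part}${reversed:+.${reversed}}"
  done
  printf '%s:%s%s\n' "$reversed" "$scheme" "$path"
}
```

With the seed from this article, reverse_url http://www.163.com/ prints com.163.www:http/, matching the row shown above.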
3. Other commands
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
The individual steps of a complete crawl can be run one at a time with these commands, which together make up the whole workflow.
When a crawl is run with the crawl command, the basic sequence of steps is as follows:
(1) InjectorJob
Start of the first iteration
(2) GeneratorJob
(3) FetcherJob
(4) ParserJob
(5) DbUpdaterJob
(6) SolrIndexerJob
Start of the second iteration
(2) GeneratorJob
(3) FetcherJob
(4) ParserJob
(5) DbUpdaterJob
(6) SolrIndexerJob
Start of the third iteration
For details on how each step runs, see http://blog.csdn.net/jediael_lu/article/details/38591067
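The sequence above is one inject followed by a fixed group of five jobs repeated once per iteration. The sketch below only prints that plan for a given number of rounds; it does not call Nutch and deliberately leaves out the per-job arguments, which are not covered here. crawl_plan is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: crawl_plan is a hypothetical helper that prints the job
# sequence from the list above. It does not invoke Nutch.
crawl_plan() {
  local rounds="$1" i=1 job
  echo "InjectorJob"                       # runs once, before any iteration
  while [ "$i" -le "$rounds" ]; do
    echo "iteration $i"
    for job in GeneratorJob FetcherJob ParserJob DbUpdaterJob SolrIndexerJob; do
      echo "  $job"                        # the five jobs repeat every round
    done
    i=$((i + 1))
  done
}
```

Running crawl_plan 2 lists InjectorJob once and then the five jobs twice, which is the shape of a two-round crawl.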
4. Nutch provides a crawl script that wraps the key steps, so the crawl workflow does not have to be run step by step.
[jediael@jediael local]$ bin/crawl
Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
For example:
[root@jediael44 bin]# ./crawl seed.txt TestCrawl http://localhost:8983/solr 2
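When the crawl script is called from another script, it can help to validate the four arguments first instead of relying on the usage message. The following is a minimal sketch under that assumption; check_crawl_args is a hypothetical wrapper, and it only echoes the command it would run rather than running it.

```shell
#!/usr/bin/env bash
# Sketch only: check_crawl_args is a hypothetical pre-flight check for the
# four arguments shown in the usage line: seedDir, crawlID, solrURL, rounds.
check_crawl_args() {
  if [ "$#" -ne 4 ]; then
    echo "usage: crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>" >&2
    return 1
  fi
  case "$4" in
    ''|*[!0-9]*)
      echo "numberOfRounds must be a non-negative integer: $4" >&2
      return 1
      ;;
  esac
  # Print the command instead of executing it, so the sketch is side-effect free.
  echo "bin/crawl $1 $2 $3 $4"
}
```

With the values from the example above, check_crawl_args seed.txt TestCrawl http://localhost:8983/solr 2 prints the bin/crawl command line, and any call with a missing or non-numeric round count returns a non-zero status.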
II. Deploy mode
1. Running with the hadoop command
Note: Hadoop and HBase must be started first.
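One way to check this precondition before submitting a job is to look at the output of the JDK's jps tool. The sketch below assumes a Hadoop 1.x setup like the one used in this article, where the relevant daemons appear as NameNode, JobTracker and HMaster. missing_daemons is a hypothetical helper, and the list of required daemons is an assumption to adjust for your own cluster.

```shell
#!/usr/bin/env bash
# Sketch only: missing_daemons is a hypothetical helper.
# It reads jps-style lines ("<pid> <name>") on stdin and prints every
# required daemon that is absent. The required list is an assumption.
missing_daemons() {
  local jps_output daemon
  jps_output="$(cat)"
  for daemon in NameNode JobTracker HMaster; do
    # -x matches whole names only, so SecondaryNameNode is not NameNode.
    if ! printf '%s\n' "$jps_output" | awk '{print $2}' | grep -qx "$daemon"; then
      echo "$daemon"
    fi
  done
}
```

Typical use would be jps | missing_daemons; empty output means all three daemons were found.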
[jediael@jediael deploy]$ hadoop jar apache-nutch-2.2.1.job org.apache.nutch.crawl.InjectorJob file:///opt/jediael/apache-nutch-2.2.1/runtime/deploy/urls/
14/12/20 23:26:50 INFO crawl.InjectorJob: InjectorJob: starting at 2014-12-20 23:26:50
14/12/20 23:26:50 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: file:/opt/jediael/apache-nutch-2.2.1/runtime/deploy/urls
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.3.2-1031432, built on 11/05/2010 05:32 GMT
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:host.name=jediael
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.version=1.7.0_51
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.7.0_51/jre
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/opt/jediael/hadoop-1.2.1/libexec/../conf:/usr/java/jdk1.7.0_51/lib/tools.jar:/opt/jediael/hadoop-1.2.1/libexec/..:/opt/jediael/hadoop-1.2.1/libexec/../hadoop-core-1.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/asm-3.2.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/aspectjrt-1.6.11.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/aspectjtools-1.6.11.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-beanutils-1.7.0.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-beanutils-core-1.8.0.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-cli-1.2.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-codec-1.4.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-collections-3.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-configuration-1.6.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-daemon-1.0.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-digester-1.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-el-1.0.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-httpclient-3.0.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-io-2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-lang-2.4.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-logging-1.1.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-logging-api-1.0.4.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-math-2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-net-3.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/core-3.1.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/hadoop-capacity-scheduler-1.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/hadoop-fairscheduler-1.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/hadoop-thriftfs-1.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/hsqldb-1.8.0.10.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jackson-core-asl-1.8.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jasper-compiler-5.5.12.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jasper-runtime-5.5.12.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jdeb-0.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jersey-core-1.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jersey-json-1.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jersey-server-1.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jets3t-0.6.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jetty-6.1.26.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jetty-util-6.1.26.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jsch-0.1.42.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/junit-4.5.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/kfs-0.2.2.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/log4j-1.2.15.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/mockito-all-1.8.5.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/oro-2.0.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/servlet-api-2.5-20081211.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/slf4j-api-1.4.3.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/slf4j-log4j12-1.4.3.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/xmlenc-0.52.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jsp-2.1/jsp-2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jsp-2.1/jsp-api-2.1.jar
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/opt/jediael/hadoop-1.2.1/libexec/../lib/native/Linux-amd64-64
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:os.version=2.6.32-431.17.1.el6.x86_64
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:user.name=jediael
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:user.home=/home/jediael
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:user.dir=/opt/jediael/apache-nutch-2.2.1/runtime/deploy
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
14/12/20 23:26:52 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181
14/12/20 23:26:52 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
14/12/20 23:26:52 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x14a5c24c9cf0657, negotiated timeout = 40000
14/12/20 23:26:52 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
14/12/20 23:26:55 INFO input.FileInputFormat: Total input paths to process : 1
14/12/20 23:26:55 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/12/20 23:26:55 WARN snappy.LoadSnappy: Snappy native library not loaded
14/12/20 23:26:56 INFO mapred.JobClient: Running job: job_201412202325_0002
14/12/20 23:26:57 INFO mapred.JobClient: map 0% reduce 0%
14/12/20 23:27:15 INFO mapred.JobClient: map 100% reduce 0%
14/12/20 23:27:17 INFO mapred.JobClient: Job complete: job_201412202325_0002
14/12/20 23:27:18 INFO mapred.JobClient: Counters: 20
14/12/20 23:27:18 INFO mapred.JobClient: Job Counters
14/12/20 23:27:18 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=14058
14/12/20 23:27:18 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/12/20 23:27:18 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/12/20 23:27:18 INFO mapred.JobClient: Rack-local map tasks=1
14/12/20 23:27:18 INFO mapred.JobClient: Launched map tasks=1
14/12/20 23:27:18 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
14/12/20 23:27:18 INFO mapred.JobClient: File Output Format Counters
14/12/20 23:27:18 INFO mapred.JobClient: Bytes Written=0
14/12/20 23:27:18 INFO mapred.JobClient: injector
14/12/20 23:27:18 INFO mapred.JobClient: urls_injected=3
14/12/20 23:27:18 INFO mapred.JobClient: FileSystemCounters
14/12/20 23:27:18 INFO mapred.JobClient: FILE_BYTES_READ=149
14/12/20 23:27:18 INFO mapred.JobClient: HDFS_BYTES_READ=130
14/12/20 23:27:18 INFO mapred.JobClient: FILE_BYTES_WRITTEN=78488
14/12/20 23:27:18 INFO mapred.JobClient: File Input Format Counters
14/12/20 23:27:18 INFO mapred.JobClient: Bytes Read=149
14/12/20 23:27:18 INFO mapred.JobClient: Map-Reduce Framework
14/12/20 23:27:18 INFO mapred.JobClient: Map input records=6
14/12/20 23:27:18 INFO mapred.JobClient: Physical memory (bytes) snapshot=106311680
14/12/20 23:27:18 INFO mapred.JobClient: Spilled Records=0
14/12/20 23:27:18 INFO mapred.JobClient: CPU time spent (ms)=2420
14/12/20 23:27:18 INFO mapred.JobClient: Total committed heap usage (bytes)=29753344
14/12/20 23:27:18 INFO mapred.JobClient: Virtual memory (bytes) snapshot=736796672
14/12/20 23:27:18 INFO mapred.JobClient: Map output records=3
14/12/20 23:27:18 INFO mapred.JobClient: SPLIT_RAW_BYTES=130
14/12/20 23:27:18 INFO crawl.InjectorJob: InjectorJob: total number of urls rejected by filters: 0
14/12/20 23:27:18 INFO crawl.InjectorJob: InjectorJob: total number of urls injected after normalization and filtering: 3
14/12/20 23:27:18 INFO crawl.InjectorJob: Injector: finished at 2014-12-20 23:27:18, elapsed: 00:00:27
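The command at the top of this log has a fixed shape: hadoop jar, the Nutch job file, the fully qualified job class, then that job's own arguments. The sketch below assembles such a command line as text so the pattern can be reused in scripts. deploy_cmd is a hypothetical helper, and it prints the command rather than executing it.

```shell
#!/usr/bin/env bash
# Sketch only: deploy_cmd is a hypothetical helper that prints a
# "hadoop jar <job file> <class> [args...]" command line; it runs nothing.
deploy_cmd() {
  local job_file="$1" class="$2"
  shift 2
  printf 'hadoop jar %s %s' "$job_file" "$class"
  if [ "$#" -gt 0 ]; then
    printf ' %s' "$@"        # append each remaining argument
  fi
  printf '\n'
}
```

Called with apache-nutch-2.2.1.job, org.apache.nutch.crawl.InjectorJob and the seed directory, it reproduces the command used above. Note that the seed directory there is written as a file:/// URI, presumably so that Hadoop reads it from the local filesystem rather than resolving the path against its default filesystem.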
III. Appendix: running Nutch from Eclipse
This approach is essentially the same as deploy mode.
Running InjectorJob from Eclipse
Eclipse output:
InjectorJob: starting at 2014-12-20 23:13:24
InjectorJob: Injecting urlDir: /Users/liaoliuqing/99_Project/2.x/urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2014-12-20 23:13:27, elapsed: 00:00:02