Nutch Development (Part 4)

Published: 2024/9/19

Contents

- Nutch Development (Part 4)
  - Development environment
  - 1. Nutch plugin design overview
  - 2. Plugin directory structure
  - 3. build.xml
  - 4. ivy.xml
  - 5. plugin.xml
  - 6. The parse-html plugin
    - HtmlParser
      - setConf(Configuration conf)
      - parse(InputSource input)
      - getParse(Content content)
  - 7. The parse-metatags plugin
    - MetaTagsParser
      - The filter method
      - The addIndexedMetatags method
      - Configuring the metadata plugin

Development environment

Linux, Ubuntu 20.04 LTS
IDEA
Nutch 1.18
Solr 8.11

Please credit the source when reposting. By 鴨梨的藥丸哥

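1. Nutch plugin design overview

Nutch is highly extensible. Its plugin system is modelled on the Eclipse 2.x plugin system.

Nutch exposes a number of extension points. Each extension point is an interface, and a plugin is written by implementing that interface. Nutch provides the extension points below; implementing the corresponding interface is all it takes to develop a Nutch plugin.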

IndexWriter – Writes crawled data to a specific indexing backend (Solr, ElasticSearch, a CSV file, etc.).
IndexingFilter – Permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from javadoc).
Parser – Parser implementations read through fetched documents in order to extract data to be indexed. This is what you need to implement if you want Nutch to be able to parse a new type of content, or extract more data from currently parseable content.
HtmlParseFilter – Permits one to add additional metadata to HTML parses (from javadoc).
Protocol – Protocol implementations allow Nutch to use different protocols (ftp, http, etc.) to fetch documents.
URLFilter – URLFilter implementations limit the URLs that Nutch attempts to fetch. The RegexURLFilter distributed with Nutch provides a great deal of control over what URLs Nutch crawls, however if you have very complicated rules about what URLs you want to crawl, you can write your own implementation.
URLNormalizer – Interface used to convert URLs to normal form and optionally perform substitutions.
ScoringFilter – A contract defining behavior of scoring plugins. A scoring filter will manipulate scoring variables in CrawlDatum and in resulting search indexes. Filters can be chained in a specific order, to provide multi-stage scoring adjustments.
SegmentMergeFilter – Interface used to filter segments during segment merge. It allows filtering on more sophisticated criteria than just URLs. In particular it allows filtering based on metadata collected while parsing page.
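To make "an extension point is an interface that a plugin implements" concrete, here is a minimal sketch in plain Java. It does not use the Nutch API: `SimpleUrlFilter` is a hypothetical stand-in that only mirrors the shape of a URL-filtering contract (return the URL to keep it, `null` to drop it), and `HostSuffixFilter` and its host-suffix rule are made up for illustration. A real plugin would implement the Nutch interface instead and be declared in its plugin.xml.

```java
import java.net.URI;
import java.util.List;

/** Hypothetical stand-in for an extension point: a single-method contract. */
interface SimpleUrlFilter {
    /** Returns the URL to keep it, or null to reject it. */
    String filter(String url);
}

/** Example "plugin": keeps only URLs whose host ends with a given suffix. */
class HostSuffixFilter implements SimpleUrlFilter {
    private final String suffix;

    HostSuffixFilter(String suffix) {
        this.suffix = suffix;
    }

    @Override
    public String filter(String url) {
        try {
            String host = URI.create(url).getHost();
            return (host != null && host.endsWith(suffix)) ? url : null;
        } catch (IllegalArgumentException e) {
            return null; // malformed URL: reject it
        }
    }
}

class ExtensionPointDemo {
    /** Applies every filter in order and stops as soon as one rejects the URL. */
    static String applyAll(List<SimpleUrlFilter> filters, String url) {
        for (SimpleUrlFilter f : filters) {
            if (url == null) {
                break;
            }
            url = f.filter(url);
        }
        return url;
    }

    public static void main(String[] args) {
        List<SimpleUrlFilter> filters = List.of(new HostSuffixFilter("apache.org"));
        System.out.println(applyAll(filters, "https://nutch.apache.org/docs"));
        System.out.println(applyAll(filters, "https://example.com/"));
    }
}
```

The host application only ever sees the interface, which is why new behaviour can be added without touching the crawler itself.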

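2. Plugin directory structure

Nutch plugin directories all look much alike, so walking through parse-html is enough.

/src          # source directory
build.xml     # tells ant how to build this plugin (where the resulting jar goes, and so on)
ivy.xml       # the plugin's Ivy configuration (dependency management, the counterpart of Maven's pom.xml)
plugin.xml    # describes the plugin to Nutch (which extension points it implements, the names of the implementing classes, etc.)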

3. build.xml

build.xml tells ant how to build this plugin.

4. ivy.xml

This plays the same role as Maven's pom.xml. External dependencies are declared here.

<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="${ant.project.name}">
    <license name="Apache 2.0"/>
    <ivyauthor name="Apache Nutch Team" url="https://nutch.apache.org/"/>
    <description>Apache Nutch</description>
  </info>

  <configurations>
    <include file="../../../ivy/ivy-configurations.xml"/>
  </configurations>

  <publications>
    <artifact conf="master"/>
  </publications>

  <dependencies>
    <dependency org="org.ccil.cowan.tagsoup" name="tagsoup" rev="1.2.1"/>
  </dependencies>
</ivy-module>

5. plugin.xml

<plugin
   id="parse-html"
   name="Html Parse Plug-in"
   version="1.0.0"
   provider-name="nutch.org">

   <runtime>
      <library name="parse-html.jar">
         <export name="*"/>
      </library>
      <library name="tagsoup-1.2.1.jar"/>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
      <import plugin="lib-nekohtml"/>
   </requires>

   <extension id="org.apache.nutch.parse.html"
              name="HtmlParse"
              point="org.apache.nutch.parse.Parser">

      <implementation id="org.apache.nutch.parse.html.HtmlParser"
                      class="org.apache.nutch.parse.html.HtmlParser">
         <parameter name="contentType" value="text/html|application/xhtml+xml"/>
         <parameter name="pathSuffix" value=""/>
      </implementation>

   </extension>

</plugin>

6. The parse-html plugin

HtmlParser

HtmlParser implements the Parser extension point.

public class HtmlParser implements Parser

Methods of the Parser interface:

public ParseResult getParse(Content c)                   // parses the fetched content
public void setConf(Configuration configuration)         // receives the Nutch configuration
public Configuration getConf()

setConf(Configuration conf)

Nutch hands configuration to a plugin by calling setConf(Configuration conf), so this is where the plugin reads its settings from the Nutch configuration files (nutch-default.xml, overridden by nutch-site.xml).

@Override
public void setConf(Configuration conf) {
  this.conf = conf;
  // Create HtmlParseFilters, which holds the HtmlParseFilter plugin
  // implementations in an array (HtmlParseFilter[] htmlParseFilters)
  this.htmlParseFilters = new HtmlParseFilters(getConf());
  // Name of the parser implementation; falls back to NekoHTML when unset
  this.parserImpl = getConf().get("parser.html.impl", "neko");
  // Default character encoding
  this.defaultCharEncoding = getConf().get(
      "parser.character.encoding.default", "windows-1252");
  // DOM helper
  this.utils = new DOMContentUtils(conf);
  // Caching policy
  this.cachingPolicy = getConf().get("parser.caching.forbidden.policy",
      Nutch.CACHING_FORBIDDEN_CONTENT);
}

nutch-default.xml does define the parser.html.impl property. Even if it were not defined there, the default value passed to get() means NekoHTML would still be used to parse HTML pages.

The plugin.xml shown earlier imports the lib-nekohtml plugin, which is NekoHTML.
ivy.xml declares the tagsoup dependency, which is TagSoup. Either one can parse HTML pages.

<property>
  <name>parser.html.impl</name>
  <value>neko</value>
  <description>HTML Parser implementation. Currently the following keywords
  are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
  </description>
</property>

parse(InputSource input)

Next, the parse method:

private DocumentFragment parse(InputSource input) throws Exception {
  // Use TagSoup only when it has been configured explicitly
  if ("tagsoup".equalsIgnoreCase(parserImpl))
    return parseTagSoup(input);
  else
    return parseNeko(input);
}
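Taken together, setConf and parse give a simple rule: read parser.html.impl with "neko" as the fallback, then branch on whether the value is "tagsoup". The sketch below replays that rule in isolation. It is an illustration only: java.util.Properties stands in for the Hadoop Configuration object, and the class name `ParserChoiceDemo` and the returned strings are assumptions made for the example, not Nutch code.

```java
import java.util.Properties;

class ParserChoiceDemo {
    /** Same lookup pattern as getConf().get("parser.html.impl", "neko"). */
    static String parserImpl(Properties conf) {
        return conf.getProperty("parser.html.impl", "neko");
    }

    /** Same branch as parse(): only "tagsoup" (any case) selects TagSoup. */
    static String choose(Properties conf) {
        return "tagsoup".equalsIgnoreCase(parserImpl(conf)) ? "parseTagSoup" : "parseNeko";
    }

    public static void main(String[] args) {
        Properties unset = new Properties();

        Properties tagsoup = new Properties();
        tagsoup.setProperty("parser.html.impl", "TagSoup");

        System.out.println(choose(unset));
        System.out.println(choose(tagsoup));
    }
}
```

One consequence worth noticing: any value other than "tagsoup", including a typo, silently ends up on the NekoHTML path.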

getParse(Content content)

Note: the line ParseResult filteredParse = this.htmlParseFilters.filter(content, parseResult, metaTags, root); runs the plugins that implement the HtmlParseFilter extension point. So when we need to extract data from additional tags in the HTML, we can implement HtmlParseFilter and parse those tags ourselves.

public ParseResult getParse(Content content) {
  // HTML meta tags
  HTMLMetaTags metaTags = new HTMLMetaTags();

  // Resolve the base URL
  URL base;
  try {
    base = new URL(content.getBaseUrl());
  } catch (MalformedURLException e) {
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  }

  // Extracted text
  String text = "";
  // Title
  String title = "";
  // Outlinks found in the page
  Outlink[] outlinks = new Outlink[0];
  // Metadata
  Metadata metadata = new Metadata();

  // The parsed DOM tree
  // parse the content
  DocumentFragment root;
  try {
    // Wrap the raw content bytes in a stream
    byte[] contentInOctets = content.getContent();
    InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));

    // Detect the character encoding
    EncodingDetector detector = new EncodingDetector(conf);
    detector.autoDetectClues(content, true);
    detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
    String encoding = detector.guessEncoding(content, defaultCharEncoding);

    metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
    metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);

    input.setEncoding(encoding);
    if (LOG.isTraceEnabled()) {
      LOG.trace("Parsing...");
    }
    root = parse(input);
  } catch (IOException e) {
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  } catch (DOMException e) {
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  } catch (SAXException e) {
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  } catch (Exception e) {
    LOG.error("Error: ", e);
    return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
  }

  // Extract the meta tags
  // get meta directives
  HTMLMetaProcessor.getMetaTags(metaTags, root, base);

  // Copy the tag data into metadata
  // populate Nutch metadata with HTML meta directives
  metadata.addAll(metaTags.getGeneralTags());

  if (LOG.isTraceEnabled()) {
    LOG.trace("Meta tags for " + base + ": " + metaTags.toString());
  }
  // check meta directives
  if (!metaTags.getNoIndex()) { // okay to index
    StringBuffer sb = new StringBuffer();
    if (LOG.isTraceEnabled()) {
      LOG.trace("Getting text...");
    }
    // Extract the text content of the tags
    utils.getText(sb, root); // extract text
    text = sb.toString();
    sb.setLength(0);
    if (LOG.isTraceEnabled()) {
      LOG.trace("Getting title...");
    }
    // Extract the text of the title tag
    utils.getTitle(sb, root); // extract title
    title = sb.toString().trim();
  }

  if (!metaTags.getNoFollow()) { // okay to follow links
    ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
    URL baseTag = base;
    String baseTagHref = utils.getBase(root);
    if (baseTagHref != null) {
      try {
        baseTag = new URL(base, baseTagHref);
      } catch (MalformedURLException e) {
        baseTag = base;
      }
    }
    if (LOG.isTraceEnabled()) {
      LOG.trace("Getting links...");
    }
    // Extract the outlinks
    utils.getOutlinks(baseTag, l, root);
    outlinks = l.toArray(new Outlink[l.size()]);
    if (LOG.isTraceEnabled()) {
      LOG.trace("found " + outlinks.length + " outlinks in "
          + content.getUrl());
    }
  }

  // Build the ParseStatus
  ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);
  if (metaTags.getRefresh()) {
    status.setMinorCode(ParseStatus.SUCCESS_REDIRECT);
    status.setArgs(new String[] { metaTags.getRefreshHref().toString(),
        Integer.toString(metaTags.getRefreshTime()) });
  }
  // Wrap the parsed data
  ParseData parseData = new ParseData(status, title, outlinks,
      content.getMetadata(), metadata);
  // The parse result
  ParseResult parseResult = ParseResult.createParseResult(content.getUrl(),
      new ParseImpl(text, parseData));

  // Run the HtmlParseFilter plugins (parse-metatags and others);
  // which ones run is controlled by the configuration
  // run filters on parse
  ParseResult filteredParse = this.htmlParseFilters.filter(content,
      parseResult, metaTags, root);
  if (metaTags.getNoCache()) { // not okay to cache
    for (Map.Entry<org.apache.hadoop.io.Text, Parse> entry : filteredParse)
      entry.getValue().getData().getParseMeta()
          .set(Nutch.CACHING_FORBIDDEN_KEY, cachingPolicy);
  }
  return filteredParse;
}
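The filter step at the end of getParse is what makes custom tag extraction possible: the parsed DOM is handed to each filter in turn, and each filter may add entries to the parse metadata. The sketch below imitates that flow with the JDK's own DOM classes. Everything named here is an assumption for illustration: `DomParseFilter` is a hypothetical simplification of the real extension point (which also receives the Content, ParseResult and HTMLMetaTags), `FirstH1Filter` is an invented filter, a plain Map stands in for Nutch's Metadata, and the input must be well-formed XHTML because the JDK XML parser is used instead of NekoHTML or TagSoup.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical, simplified filter contract: inspect the DOM, add metadata. */
interface DomParseFilter {
    void filter(Node root, Map<String, String> parseMeta);
}

/** Example filter: stores the text of the first h1 element under the key "h1". */
class FirstH1Filter implements DomParseFilter {
    @Override
    public void filter(Node root, Map<String, String> parseMeta) {
        Node h1 = findFirst(root, "h1");
        if (h1 != null) {
            parseMeta.put("h1", h1.getTextContent().trim());
        }
    }

    /** Depth-first search for the first element with the given tag name. */
    static Node findFirst(Node node, String name) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && name.equalsIgnoreCase(node.getNodeName())) {
            return node;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node hit = findFirst(children.item(i), name);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }
}

class ParseFilterChainDemo {
    /** Parses well-formed XHTML, then runs every filter over the same DOM in order. */
    static Map<String, String> run(String xhtml, List<DomParseFilter> filters) {
        Node root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
        Map<String, String> parseMeta = new LinkedHashMap<>();
        for (DomParseFilter f : filters) {
            f.filter(root, parseMeta);
        }
        return parseMeta;
    }

    public static void main(String[] args) {
        String page = "<html><body><h1> Hello Nutch </h1><p>body text</p></body></html>";
        System.out.println(run(page, List.of(new FirstH1Filter())));
    }
}
```

The design point carried over from the real plugin is that filters do not re-parse the page. They all work on the DOM that the parser has already built.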

7. The parse-metatags plugin

MetaTagsParser

MetaTagsParser implements the HtmlParseFilter extension point.

public class MetaTagsParser implements HtmlParseFilter

The filter method

public ParseResult filter(Content content, ParseResult parseResult,
    HTMLMetaTags metaTags, DocumentFragment doc) {

  // Get the parse for this URL
  Parse parse = parseResult.get(content.getUrl());
  // Get the parse metadata
  Metadata metadata = parse.getData().getParseMeta();

  /*
   * NUTCH-1559: do not extract meta values from ParseData's metadata to avoid
   * duplicate metatag values
   */

  // Meta tags as (key, value) metadata
  Metadata generalMetaTags = metaTags.getGeneralTags();
  for (String tagName : generalMetaTags.names()) {
    // Added to the parse result only if the configuration asks for it
    addIndexedMetatags(metadata, tagName, generalMetaTags.getValues(tagName));
  }

  Properties httpequiv = metaTags.getHttpEquivTags();
  for (Enumeration<?> tagNames = httpequiv.propertyNames(); tagNames
      .hasMoreElements();) {
    String name = (String) tagNames.nextElement();
    String value = httpequiv.getProperty(name);
    // Likewise added to the parse result
    addIndexedMetatags(metadata, name, value);
  }

  return parseResult;
}

The addIndexedMetatags method

This method explains why, when the metatags plugin is used together with index-metadata, the field names configured for indexing must carry the metatag. prefix.

private void addIndexedMetatags(Metadata metadata, String metatag,
    String value) {
  String lcMetatag = metatag.toLowerCase(Locale.ROOT);
  if (metatagset.contains("*") || metatagset.contains(lcMetatag)) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Found meta tag: {}\t{}", lcMetatag, value);
    }
    metadata.add("metatag." + lcMetatag, value);
  }
}
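The effect of the lower-casing, the "*" wildcard and the "metatag." prefix is easiest to see on a small input. The following sketch replays the same three steps outside Nutch. It is a plain-Java approximation: a Map of lists stands in for Nutch's multi-valued Metadata, and the class name `MetatagPrefixDemo`, the method `extract` and the sample tags are invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class MetatagPrefixDemo {
    /**
     * Keeps a page meta tag when the configured set contains "*" or the
     * lower-cased tag name, and stores it under "metatag." + lower-cased name.
     */
    static Map<String, List<String>> extract(Set<String> metatagset,
                                             Map<String, String> pageMetaTags) {
        Map<String, List<String>> parseMeta = new LinkedHashMap<>();
        for (Map.Entry<String, String> tag : pageMetaTags.entrySet()) {
            String lcMetatag = tag.getKey().toLowerCase(Locale.ROOT);
            if (metatagset.contains("*") || metatagset.contains(lcMetatag)) {
                parseMeta.computeIfAbsent("metatag." + lcMetatag, k -> new ArrayList<>())
                         .add(tag.getValue());
            }
        }
        return parseMeta;
    }

    public static void main(String[] args) {
        Map<String, String> page = new LinkedHashMap<>();
        page.put("Description", "An open source web crawler");
        page.put("author", "someone");

        // Only "description" and "keywords" are configured, so "author" is dropped.
        System.out.println(extract(Set.of("description", "keywords"), page));
        // The wildcard keeps every tag.
        System.out.println(extract(Set.of("*"), page));
    }
}
```

Because the stored key is always the prefixed, lower-cased name, that prefixed name is what any later indexing step has to ask for.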

Configuring the metadata plugin

Now compare the configuration with addIndexedMetatags. The comparison shows why the keys listed in index.parse.md need the metatag. prefix.

<property>
  <name>metatags.names</name>
  <value>description,keywords</value>
  <description>Names of the metatags to extract, separated by ','.
  Use '*' to extract all metatags. Prefixes the names with 'metatag.'
  in the parse-metadata. For instance to index description and keywords,
  you need to activate the plugin index-metadata and set the value of the
  parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
  </description>
</property>

<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords</value>
  <description>
  Comma-separated list of keys to be taken from the parse metadata to generate fields.
  Can be used e.g. for 'description' or 'keywords' provided that these values are generated
  by a parser (see parse-metatags plugin)
  </description>
</property>
