Nutch Development (Part 4)
Table of Contents
- Nutch Development (Part 4)
- Development environment
- 1. Nutch plugin design overview
- 2. Plugin directory structure
- 3. build.xml
- 4. ivy.xml
- 5. plugin.xml
- 6. A look at the parse-html plugin
- HtmlParser
- setConf(Configuration conf)
- parse(InputSource input)
- getParse(Content content)
- 7. A look at the parse-metatags plugin
- MetaTagsParser
- The filter method
- The addIndexedMetatags method
- Configuring the metadata plugin
Development environment
- Linux, Ubuntu 20.04 LTS
- IDEA
- Nutch 1.18
- Solr 8.11
Please credit the source when reposting. By 鴨梨的藥丸哥
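1. Nutch plugin design overview
Nutch is highly extensible. Its plugin system is based on the Eclipse 2.x plugin system.
Nutch exposes several extension points. Each extension point is an interface, and a plugin is developed by implementing that interface. Nutch provides the following extension points; to write a Nutch plugin we only need to implement the corresponding interface: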
- IndexWriter – Writes crawled data to a specific indexing backend (Solr, ElasticSearch, a CSV file, etc.).
- IndexingFilter – Permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from javadoc).
- Parser – Parser implementations read through fetched documents in order to extract data to be indexed. This is what you need to implement if you want Nutch to be able to parse a new type of content, or extract more data from currently parseable content.
- HtmlParseFilter – Permits one to add additional metadata to HTML parses (from javadoc).
- Protocol – Protocol implementations allow Nutch to use different protocols (ftp, http, etc.) to fetch documents.
- URLFilter – URLFilter implementations limit the URLs that Nutch attempts to fetch. The RegexURLFilter distributed with Nutch provides a great deal of control over what URLs Nutch crawls, however if you have very complicated rules about what URLs you want to crawl, you can write your own implementation.
- URLNormalizer – Interface used to convert URLs to normal form and optionally perform substitutions.
- ScoringFilter – A contract defining behavior of scoring plugins. A scoring filter will manipulate scoring variables in CrawlDatum and in resulting search indexes. Filters can be chained in a specific order, to provide multi-stage scoring adjustments.
- SegmentMergeFilter – Interface used to filter segments during segment merge. It allows filtering on more sophisticated criteria than just URLs. In particular it allows filtering based on metadata collected while parsing page.
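As a rough illustration of what implementing an extension point looks like, here is a minimal sketch of a URL filter. To stay runnable without the Nutch jars it declares a stand-in interface, `UrlFilterLike`, that mirrors only the `filter(String)` method of Nutch's `URLFilter` (return the URL to keep it, `null` to reject it). The class name `HostAllowListFilter` and the allow-list idea are hypothetical. A real plugin would implement `org.apache.nutch.net.URLFilter` and also provide `setConf`/`getConf`.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Set;

// Stand-in for org.apache.nutch.net.URLFilter (assumption: only the
// filter method is mirrored here, not the Configurable part).
interface UrlFilterLike {
  // Returns the URL if it is accepted, or null if it should be dropped.
  String filter(String urlString);
}

// Hypothetical filter that only accepts URLs whose host is on an allow list.
class HostAllowListFilter implements UrlFilterLike {
  private final Set<String> allowedHosts;

  HostAllowListFilter(Set<String> allowedHosts) {
    this.allowedHosts = allowedHosts;
  }

  @Override
  public String filter(String urlString) {
    try {
      String host = new URI(urlString).getHost();
      // Keep the URL only when it has a host and that host is allowed.
      return (host != null && allowedHosts.contains(host)) ? urlString : null;
    } catch (URISyntaxException e) {
      return null; // malformed URLs are rejected
    }
  }
}
```

With an allow list containing only `nutch.apache.org`, a call such as `filter("https://nutch.apache.org/index.html")` returns the URL unchanged, while URLs on other hosts come back as `null`.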
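2. Plugin directory structure
Nutch plugin directories all look alike, so walking through the parse-html directory is enough.
    /src          # source directory
    build.xml     # tells Ant how to build this plugin (e.g. where the compiled jar goes)
    ivy.xml       # the plugin's Ivy configuration (dependency management, much like Maven's pom.xml)
    plugin.xml    # describes the plugin to Nutch (e.g. which extension points it implements and the names of the implementing classes)
3. build.xml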
build.xml tells Ant how to build this plugin.
    <project name="parse-html" default="jar-core">

      <import file="../build-plugin.xml"/>

      <!-- Build compilation dependencies -->
      <target name="deps-jar">
        <!-- the build depends on another plugin -->
        <ant target="jar" inheritall="false" dir="../lib-nekohtml"/>
      </target>

      <!-- Add compilation dependencies to classpath -->
      <path id="plugin.deps">
        <fileset dir="${nutch.root}/build">
          <include name="**/lib-nekohtml/*.jar" />
        </fileset>
      </path>

      <!-- Deploy Unit test dependencies -->
      <target name="deps-test">
        <!-- plugins the tests depend on -->
        <ant target="deploy" inheritall="false" dir="../lib-nekohtml"/>
        <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
      </target>

    </project>
4. ivy.xml
This file plays the same role as Maven's pom.xml. External dependencies can be declared here.
    <ivy-module version="1.0">
      <info organisation="org.apache.nutch" module="${ant.project.name}">
        <license name="Apache 2.0"/>
        <ivyauthor name="Apache Nutch Team" url="https://nutch.apache.org/"/>
        <description>Apache Nutch</description>
      </info>

      <configurations>
        <include file="../../../ivy/ivy-configurations.xml"/>
      </configurations>

      <publications>
        <!--get the artifact from our module name-->
        <artifact conf="master"/>
      </publications>

      <!-- add external dependencies here -->
      <dependencies>
        <dependency org="org.ccil.cowan.tagsoup" name="tagsoup" rev="1.2.1"/>
      </dependencies>
    </ivy-module>
5. plugin.xml
    <!-- plugin descriptor -->
    <plugin
       id="parse-html"
       name="Html Parse Plug-in"
       version="1.0.0"
       provider-name="nutch.org">

      <runtime>
        <library name="parse-html.jar">
          <export name="*"/>
        </library>
        <library name="tagsoup-1.2.1.jar"/>
      </runtime>

      <!-- plugins this plugin depends on -->
      <requires>
        <import plugin="nutch-extensionpoints"/>
        <import plugin="lib-nekohtml"/>
      </requires>

      <!-- description of the extension -->
      <extension id="org.apache.nutch.parse.html"
                 name="HtmlParse"
                 point="org.apache.nutch.parse.Parser">
        <!-- id is the unique identifier, class is the implementing class -->
        <implementation id="org.apache.nutch.parse.html.HtmlParser"
                        class="org.apache.nutch.parse.html.HtmlParser">
          <!-- parameters -->
          <parameter name="contentType" value="text/html|application/xhtml+xml"/>
          <parameter name="pathSuffix" value=""/>
        </implementation>
      </extension>

    </plugin>
6. A look at the parse-html plugin
HtmlParser
HtmlParser implements the Parser extension point.
    public class HtmlParser implements Parser
Methods of the Parser interface:
- public ParseResult getParse(Content c) // parses the fetched content
- public void setConf(Configuration configuration) // receives the Nutch configuration
- public Configuration getConf()
setConf(Configuration conf)
The settings come from the Nutch configuration files (nutch-default.xml, overridden by nutch-site.xml). When Nutch loads a plugin, it passes the configuration to it through setConf(Configuration conf).
    @Override
    public void setConf(Configuration conf) {
      this.conf = conf;
      // Create HtmlParseFilters, which holds the loaded HtmlParseFilter
      // plugins in an array (HtmlParseFilter[] htmlParseFilters)
      this.htmlParseFilters = new HtmlParseFilters(getConf());
      // Name of the parser implementation; defaults to NekoHTML when unset
      this.parserImpl = getConf().get("parser.html.impl", "neko");
      // Default character encoding
      this.defaultCharEncoding = getConf().get("parser.character.encoding.default", "windows-1252");
      // DOM helper utilities
      this.utils = new DOMContentUtils(conf);
      // Caching policy
      this.cachingPolicy = getConf().get("parser.caching.forbidden.policy",
          Nutch.CACHING_FORBIDDEN_CONTENT);
    }
nutch-default.xml does define the parser.html.impl property. Even if it were not defined there, the default value passed to get() means NekoHTML would still be used to parse HTML pages.
- The build.xml shown earlier pulls in the lib-nekohtml plugin, which is NekoHTML.
- ivy.xml declares the tagsoup Ivy dependency, which is TagSoup. Both libraries can parse HTML pages.
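The "read a setting with a default, then pick a parser" pattern can be sketched in isolation. This is a simplified illustration and not the plugin's code: `java.util.Properties` stands in for Hadoop's `Configuration`, and the class `ParserChoiceSketch` and its `libraryFor` helper are hypothetical names.

```java
import java.util.Properties;

// Sketch only: java.util.Properties stands in for Hadoop's Configuration.
class ParserChoiceSketch {

  // Mirrors getConf().get("parser.html.impl", "neko"):
  // falls back to "neko" when the key is absent.
  static String parserImpl(Properties conf) {
    return conf.getProperty("parser.html.impl", "neko");
  }

  // Hypothetical helper naming the library that would handle the page.
  // Only an explicit "tagsoup" (in any letter case) selects TagSoup;
  // every other value, including the default, selects NekoHTML.
  static String libraryFor(Properties conf) {
    return "tagsoup".equalsIgnoreCase(parserImpl(conf)) ? "TagSoup" : "NekoHTML";
  }
}
```

With an empty `Properties` object `libraryFor` returns `NekoHTML`; after `setProperty("parser.html.impl", "tagsoup")` it returns `TagSoup`.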
parse(InputSource input)
Next, look at the parse method:
    private DocumentFragment parse(InputSource input) throws Exception {
      // If tagsoup is configured, use TagSoup to parse the HTML
      if ("tagsoup".equalsIgnoreCase(parserImpl))
        return parseTagSoup(input);
      else
        return parseNeko(input);
    }
getParse(Content content)
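Note: the line ParseResult filteredParse = this.htmlParseFilters.filter(content, parseResult, metaTags, root); runs the plugins that implement the HtmlParseFilter extension point. So when we need to extract data from additional tags in the HTML, we can implement HtmlParseFilter to add our own tag parsing.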
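The filter step mentioned in the note can be sketched as a simple chain. This is a simplified model, not Nutch's implementation: `ParseFilterLike` and `ParseFilterChain` are stand-ins for `HtmlParseFilter` and `HtmlParseFilters`, the parse result is reduced to a map of metadata, and the hypothetical `FirstHeadingFilter` uses a regular expression on the raw HTML where a real filter would walk the `DocumentFragment`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in for the HtmlParseFilter extension point (assumption: the
// Content/ParseResult/DOM arguments are reduced to raw HTML plus a map).
interface ParseFilterLike {
  Map<String, String> filter(String html, Map<String, String> parseMeta);
}

// Stand-in for HtmlParseFilters: runs every registered filter in order,
// handing each one the result of the previous one.
class ParseFilterChain {
  private final List<ParseFilterLike> filters = new ArrayList<>();

  ParseFilterChain add(ParseFilterLike f) {
    filters.add(f);
    return this;
  }

  Map<String, String> filter(String html, Map<String, String> parseMeta) {
    Map<String, String> current = parseMeta;
    for (ParseFilterLike f : filters) {
      current = f.filter(html, current);
    }
    return current;
  }
}

// Hypothetical custom filter: records the text of the first <h1> element.
class FirstHeadingFilter implements ParseFilterLike {
  private static final Pattern H1 =
      Pattern.compile("<h1[^>]*>(.*?)</h1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

  @Override
  public Map<String, String> filter(String html, Map<String, String> parseMeta) {
    // Copy the incoming metadata so the input map is left untouched.
    Map<String, String> out = new LinkedHashMap<>(parseMeta);
    Matcher m = H1.matcher(html);
    if (m.find()) {
      out.put("heading.first", m.group(1).trim());
    }
    return out;
  }
}
```

The key `heading.first` is made up for this example. The point is only that each filter receives what the parser (and any earlier filter) produced and may add to it.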
    public ParseResult getParse(Content content) {
      // HTML meta tags
      HTMLMetaTags metaTags = new HTMLMetaTags();

      // Get the base URL
      URL base;
      try {
        base = new URL(content.getBaseUrl());
      } catch (MalformedURLException e) {
        return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
      }

      // Text content
      String text = "";
      // Title
      String title = "";
      // Extracted outlinks
      Outlink[] outlinks = new Outlink[0];
      // Metadata
      Metadata metadata = new Metadata();

      // The parsed DOM tree
      // parse the content
      DocumentFragment root;
      try {
        // Wrap the content bytes in a stream
        byte[] contentInOctets = content.getContent();
        InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));

        // Detect the character encoding
        EncodingDetector detector = new EncodingDetector(conf);
        detector.autoDetectClues(content, true);
        detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
        String encoding = detector.guessEncoding(content, defaultCharEncoding);

        metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
        metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);

        input.setEncoding(encoding);
        if (LOG.isTraceEnabled()) {
          LOG.trace("Parsing...");
        }
        root = parse(input);
      } catch (IOException e) {
        return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
      } catch (DOMException e) {
        return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
      } catch (SAXException e) {
        return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
      } catch (Exception e) {
        LOG.error("Error: ", e);
        return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
      }

      // Extract the meta tags
      // get meta directives
      HTMLMetaProcessor.getMetaTags(metaTags, root, base);

      // Copy the tag data into metadata
      // populate Nutch metadata with HTML meta directives
      metadata.addAll(metaTags.getGeneralTags());

      if (LOG.isTraceEnabled()) {
        LOG.trace("Meta tags for " + base + ": " + metaTags.toString());
      }
      // check meta directives
      if (!metaTags.getNoIndex()) { // okay to index
        StringBuffer sb = new StringBuffer();
        if (LOG.isTraceEnabled()) {
          LOG.trace("Getting text...");
        }
        // Extract the text contained in the tags
        utils.getText(sb, root); // extract text
        text = sb.toString();
        sb.setLength(0);
        if (LOG.isTraceEnabled()) {
          LOG.trace("Getting title...");
        }
        // Extract the text of the title tag
        utils.getTitle(sb, root); // extract title
        title = sb.toString().trim();
      }

      if (!metaTags.getNoFollow()) { // okay to follow links
        ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
        URL baseTag = base;
        String baseTagHref = utils.getBase(root);
        if (baseTagHref != null) {
          try {
            baseTag = new URL(base, baseTagHref);
          } catch (MalformedURLException e) {
            baseTag = base;
          }
        }
        if (LOG.isTraceEnabled()) {
          LOG.trace("Getting links...");
        }
        // Extract the outlinks
        utils.getOutlinks(baseTag, l, root);
        outlinks = l.toArray(new Outlink[l.size()]);
        if (LOG.isTraceEnabled()) {
          LOG.trace("found " + outlinks.length + " outlinks in "
              + content.getUrl());
        }
      }

      // Create the ParseStatus
      ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);
      if (metaTags.getRefresh()) {
        status.setMinorCode(ParseStatus.SUCCESS_REDIRECT);
        status.setArgs(new String[] { metaTags.getRefreshHref().toString(),
            Integer.toString(metaTags.getRefreshTime()) });
      }
      // Wrap up the parsed data
      ParseData parseData = new ParseData(status, title, outlinks,
          content.getMetadata(), metadata);
      // The parse result
      ParseResult parseResult = ParseResult.createParseResult(content.getUrl(),
          new ParseImpl(text, parseData));

      // Run the HtmlParseFilter plugins, such as parse-metatags;
      // which ones run is controlled through configuration
      // run filters on parse
      ParseResult filteredParse = this.htmlParseFilters.filter(content,
          parseResult, metaTags, root);
      if (metaTags.getNoCache()) { // not okay to cache
        for (Map.Entry<org.apache.hadoop.io.Text, Parse> entry : filteredParse)
          entry.getValue().getData().getParseMeta()
              .set(Nutch.CACHING_FORBIDDEN_KEY, cachingPolicy);
      }
      return filteredParse;
    }
7. A look at the parse-metatags plugin
MetaTagsParser
MetaTagsParser implements the HtmlParseFilter extension point.
    public class MetaTagsParser implements HtmlParseFilter
The filter method
    public ParseResult filter(Content content, ParseResult parseResult,
        HTMLMetaTags metaTags, DocumentFragment doc) {
      // Get the parse for this URL
      Parse parse = parseResult.get(content.getUrl());
      // Get the parse metadata
      Metadata metadata = parse.getData().getParseMeta();

      /*
       * NUTCH-1559: do not extract meta values from ParseData's metadata to avoid
       * duplicate metatag values
       */

      // General meta tags as (name, value) pairs
      Metadata generalMetaTags = metaTags.getGeneralTags();
      for (String tagName : generalMetaTags.names()) {
        // Added to the parse metadata according to the configuration
        addIndexedMetatags(metadata, tagName, generalMetaTags.getValues(tagName));
      }

      Properties httpequiv = metaTags.getHttpEquivTags();
      for (Enumeration<?> tagNames = httpequiv.propertyNames(); tagNames.hasMoreElements();) {
        String name = (String) tagNames.nextElement();
        String value = httpequiv.getProperty(name);
        // These are also added to the parse metadata
        addIndexedMetatags(metadata, name, value);
      }
      return parseResult;
    }
The addIndexedMetatags method
Looking at this method explains why, when the metadata plugin is used together with index-metadata, the field names configured for indexing must carry the metatag. prefix.
    private void addIndexedMetatags(Metadata metadata, String metatag,
        String value) {
      String lcMetatag = metatag.toLowerCase(Locale.ROOT);
      if (metatagset.contains("*") || metatagset.contains(lcMetatag)) {
        if (LOG.isDebugEnabled()) {
          LOG.debug("Found meta tag: {}\t{}", lcMetatag, value);
        }
        metadata.add("metatag." + lcMetatag, value);
      }
    }
Configuring the metadata plugin
Now look at the configuration and compare it with addIndexedMetatags. This shows why the keys listed in index.parse.md need the metatag. prefix.
    <property>
      <name>metatags.names</name>
      <value>description,keywords</value>
      <description>Names of the metatags to extract, separated by ','.
      Use '*' to extract all metatags. Prefixes the names with 'metatag.'
      in the parse-metadata. For instance to index description and keywords,
      you need to activate the plugin index-metadata and set the value of the
      parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
      </description>
    </property>

    <property>
      <name>index.parse.md</name>
      <!-- the metadata produced by addIndexedMetatags carries the metatag. prefix -->
      <value>metatag.description,metatag.keywords</value>
      <description>Comma-separated list of keys to be taken from the parse metadata to generate fields.
      Can be used e.g. for 'description' or 'keywords' provided that these values are generated
      by a parser (see parse-metatags plugin)
      </description>
    </property>
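To make the link between the two properties concrete, here is a small sketch of the key naming rule. It is an illustration under assumptions and not the plugin's code: the class `MetatagKeySketch` and its methods are hypothetical, a plain map stands in for Nutch's `Metadata`, and lower-casing the configured names is assumed so that they match the lower-cased tag names used in addIndexedMetatags.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class MetatagKeySketch {

  // Turns a metatags.names value such as "description,keywords"
  // into a set of lower-cased names (lower-casing is an assumption).
  static Set<String> parseNames(String value) {
    Set<String> names = new HashSet<>();
    for (String n : value.split(",")) {
      String trimmed = n.trim().toLowerCase(Locale.ROOT);
      if (!trimmed.isEmpty()) {
        names.add(trimmed);
      }
    }
    return names;
  }

  // Same rule as addIndexedMetatags: keep a tag if it is listed (or "*" is),
  // and store it under "metatag." + the lower-cased tag name.
  static Map<String, String> toParseMeta(Set<String> names, Map<String, String> metaTags) {
    Map<String, String> parseMeta = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : metaTags.entrySet()) {
      String lc = e.getKey().toLowerCase(Locale.ROOT);
      if (names.contains("*") || names.contains(lc)) {
        parseMeta.put("metatag." + lc, e.getValue());
      }
    }
    return parseMeta;
  }
}
```

With `metatags.names` set to `description,keywords`, a page carrying `Description`, `Keywords` and `author` meta tags ends up with the parse metadata keys `metatag.description` and `metatag.keywords` only, which are exactly the keys that `index.parse.md` has to list.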