
Parsing HTML in Java: A First Try with HTMLParser

Published: 2024/3/12 HTML 28 豆豆


To scrape data from a web page, I tried building a small crawler with HTMLParser.

Below is a small example that scrapes the content of a forum thread.

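1. Introduction to HtmlParser

htmlparser is an HTML parsing library written in pure Java, mainly used to transform or extract HTML. It is a good choice for analyzing fetched web pages; unfortunately, reference documentation is scarce.
Project home page: http://htmlparser.sourceforge.net/
API documentation: http://htmlparser.sourceforge.net/javadoc/index.html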

2. Creating a Maven project

Add the required dependencies

pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.fancy</groupId>
    <artifactId>htmlParser</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.htmlparser</groupId>
            <artifactId>htmlparser</artifactId>
            <version>2.1</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
</project>

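2.1 Creating a parser

Use a Parser to fetch and analyze a web page.

The parser does not handle asynchronous requests made by the page. After fetching the page, it parses the whole page into a DOM tree and stores it as nodes/tags of various kinds; we can then use filters to select the nodes we want.

The node types that htmlparser provides are as follows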

org.htmlparser
Interface Node

All Superinterfaces:

Cloneable

All Known Subinterfaces:

Remark, Tag, Text

All Known Implementing Classes:

AbstractNode, AppletTag, BaseHrefTag, BodyTag, Bullet, BulletList, CompositeTag, DefinitionList, DefinitionListBullet, Div, DoctypeTag, FormTag, FrameSetTag, FrameTag, HeadingTag, HeadTag, Html, ImageTag, InputTag, JspTag, LabelTag, LinkTag, MetaTag, ObjectTag, OptionTag, ParagraphTag, ProcessingInstructionTag, RemarkNode, ScriptTag, SelectTag, Span, StyleTag, TableColumn, TableHeader, TableRow, TableTag, TagNode, TextareaTag, TextNode, TitleTag

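After a page is parsed, what you get are these nodes and the parent/child containment relationships between them.

Every node provides the following methods (many node types also implement additional methods of their own; LinkTag, for example, has methods for getting the link's URL and checking the URL's protocol type...)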

Method Summary

void	accept(NodeVisitor visitor)	Apply the visitor to this node.
Object	clone()	Allow cloning of nodes.
void	collectInto(NodeList list, NodeFilter filter)	Collect this node and its child nodes into a list, provided the node satisfies the filtering criteria.
void	doSemanticAction()	Perform the meaning of this tag.
NodeList	getChildren()	Get the children of this node.
int	getEndPosition()	Gets the ending position of the node.
Node	getFirstChild()	Get the first child of this node.
Node	getLastChild()	Get the last child of this node.
Node	getNextSibling()	Get the next sibling to this node.
Page	getPage()	Get the page this node came from.
Node	getParent()	Get the parent of this node.
Node	getPreviousSibling()	Get the previous sibling to this node.
int	getStartPosition()	Gets the starting position of the node.
String	getText()	Returns the text of the node.
void	setChildren(NodeList children)	Set the children of this node.
void	setEndPosition(int position)	Sets the ending position of the node.
void	setPage(Page page)	Set the page this node came from.
void	setParent(Node node)	Sets the parent of this node.
void	setStartPosition(int position)	Sets the starting position of the node.
void	setText(String text)	Sets the string contents of the node.
String	toHtml()	Return the HTML for this node.
String	toHtml(boolean verbatim)	Return the HTML for this node.
String	toPlainTextString()	A string representation of the node.
String	toString()	Return the string representation of the node.

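Node filters: these filters can select nodes by node type or by the parent/child relationships between nodes, and you can also define your own. Several filters can be combined into a composite filter for multi-condition filtering, for example with AndFilter, NotFilter, OrFilter, or XorFilter.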

Class Summary

AndFilter	Accepts nodes matching all of its predicate filters (AND operation).
CssSelectorNodeFilter	A NodeFilter that accepts nodes based on whether they match a CSS2 selector.
HasAttributeFilter	This class accepts all tags that have a certain attribute, and optionally, with a certain value.
HasChildFilter	This class accepts all tags that have a child acceptable to the filter.
HasParentFilter	This class accepts all tags that have a parent acceptable to another filter.
HasSiblingFilter	This class accepts all tags that have a sibling acceptable to another filter.
IsEqualFilter	This class accepts only one specific node.
LinkRegexFilter	This class accepts tags of class LinkTag that contain a link matching a given regex pattern.
LinkStringFilter	This class accepts tags of class LinkTag that contain a link matching a given pattern string.
NodeClassFilter	This class accepts all tags of a given class.
NotFilter	Accepts all nodes not acceptable to its predicate filter.
OrFilter	Accepts nodes matching any of its predicate filters (OR operation).
RegexFilter	This filter accepts all string nodes matching a regular expression.
StringFilter	This class accepts all string nodes containing the given string.
TagNameFilter	This class accepts all tags matching the tag name.
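
The way filters combine can be illustrated without the library. The sketch below is a standalone analogy, not HTMLParser's own classes: it uses java.util.function.Predicate (so it assumes Java 8 or later) over a tag's raw text, in the same spirit as the node.getText().contains(...) checks used later in this article. The class name FilterCombinationSketch, its constants, and the sample tag strings are all made up for illustration.

```java
import java.util.function.Predicate;

final class FilterCombinationSketch {
    // Builds a predicate that accepts tag text containing the given fragment.
    static Predicate<String> containsText(String fragment) {
        return tagText -> tagText.contains(fragment);
    }

    // Simple building blocks (hypothetical, for illustration only).
    static final Predicate<String> IS_DIV =
            tagText -> tagText.equals("div") || tagText.startsWith("div ");
    static final Predicate<String> HAS_ID = containsText("id=\"");
    static final Predicate<String> HAS_STYLE = containsText("style=\"");

    // Composite predicates, analogous in spirit to AND / OR / NOT / XOR filters.
    static final Predicate<String> DIV_WITH_ID = IS_DIV.and(HAS_ID);
    static final Predicate<String> ID_OR_STYLE = HAS_ID.or(HAS_STYLE);
    static final Predicate<String> NO_ID = HAS_ID.negate();
    static final Predicate<String> ID_XOR_STYLE =
            tagText -> HAS_ID.test(tagText) ^ HAS_STYLE.test(tagText);

    public static void main(String[] args) {
        String tag = "div id=\"main\" class=\"box\"";
        System.out.println("div with id:  " + DIV_WITH_ID.test(tag));
        System.out.println("id or style:  " + ID_OR_STYLE.test(tag));
        System.out.println("no id:        " + NO_ID.test(tag));
        System.out.println("id xor style: " + ID_XOR_STYLE.test(tag));
    }
}
```

With HTMLParser itself, the same shape is obtained by passing filters to the constructors of the composite filter classes listed above, as the OrFilter in the scraping code does.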

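Scraping a thread from http://www.v2ex.com

First fetch the page content, then analyze the page's element structure to build the filters.

You can see that the id of every reply div is r_ followed by six digits, so matching with a regular expression is recommended. The topic div has the style border-bottom: 0px. (Be sure to verify what each filter returns, so that no extra nodes slip in.)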
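
The reply-id pattern described above can be checked with a regular expression. This is a minimal sketch using only java.util.regex, assuming the check runs against the tag's attribute text as the article's code reads it from node.getText(). The class name ReplyIdMatcher and the method isReplyTag are made up for illustration, and the sample strings are not taken from the real page.

```java
import java.util.regex.Pattern;

final class ReplyIdMatcher {
    // An id attribute whose value is "r_" followed by exactly six digits.
    // The leading (?:^|\s) keeps attributes such as data-id from matching.
    private static final Pattern REPLY_ID =
            Pattern.compile("(?:^|\\s)id=\"r_\\d{6}\"");

    // Returns true if the tag text carries a reply-style id attribute.
    static boolean isReplyTag(String tagText) {
        return REPLY_ID.matcher(tagText).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "div id=\"r_123456\" class=\"cell\"",
            "div id=\"r_12345\"",
            "div id=\"reply-box\""
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isReplyTag(sample));
        }
    }
}
```

Inside a custom NodeFilter, accept(Node node) could then return isReplyTag(node.getText()), which is stricter than a plain contains("id=\"r_") test.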

Create a method that returns the collection of topic and reply nodes

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

/**
 * Gets the topic node and all reply nodes from the HTML.
 *
 * @param url    address of the thread page
 * @param ENCODE character encoding of the page
 * @return the matching nodes, or null if the page could not be parsed
 */
protected NodeList getNodelist(String url, String ENCODE) {
    try {
        Parser parser = new Parser(url);
        parser.setEncoding(ENCODE);
        // Filter that accepts the topic div
        NodeFilter filter = new NodeFilter() {
            @Override
            public boolean accept(Node node) {
                return node.getText().contains("style=\"border-bottom: 0px;\"");
            }
        };
        // Filter that accepts every reply div
        NodeFilter replyfilter = new NodeFilter() {
            @Override
            public boolean accept(Node node) {
                return node.getText().contains("id=\"r_");
            }
        };
        // Combine the two filters: accept a node if either one matches
        OrFilter allFilter = new OrFilter(filter, replyfilter);
        return parser.extractAllNodesThatMatch(allFilter);
    } catch (ParserException e) {
        e.printStackTrace();
        return null;
    }
}
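Now that we have these nodes, the next step is to parse them.

The example code only extracts some of the elements. The rest is legwork: work through the node relationships patiently and locate the target nodes with filters or by walking the DOM tree.

The code below wraps the parsed node data into beans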

import java.util.ArrayList;
import java.util.List;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.nodes.TagNode;
import org.htmlparser.tags.Div;
import org.htmlparser.tags.HeadingTag;
import org.htmlparser.tags.ImageTag;
import org.htmlparser.util.NodeList;

public Forum parse2Thread(String url, String ENCODE) {
    List<Reply> replylist = new ArrayList<Reply>(); // list of replies
    Topic topic = new Topic(); // the topic
    NodeFilter divFilter = new NodeClassFilter(Div.class); // div filter
    NodeFilter headingFilter = new NodeClassFilter(HeadingTag.class); // heading filter
    NodeFilter tagFilter = new NodeClassFilter(TagNode.class); // filter for any tag
    NodeList nodeList = this.getNodelist(url, ENCODE);
    if (nodeList == null) {
        return null; // fetching or parsing failed
    }
    // Map each node onto the thread entities
    for (int i = 0; i < nodeList.size(); i++) {
        Node node = nodeList.elementAt(i);
        if (node.getText().contains("style=\"border-bottom: 0px;\"")) { // the node is the topic
            NodeList list = node.getChildren(); // children of the node
            // header div
            Node headerNode = list.extractAllNodesThatMatch(divFilter).elementAt(0);
            // thread title
            Node h1Node = headerNode.getChildren().extractAllNodesThatMatch(headingFilter).elementAt(0);
            topic.setTopicName(h1Node.toPlainTextString());
            // poster information (positions 15 and 16 depend on the page layout)
            NodeList headerChildren = headerNode.getChildren();
            topic.setAnn_name(headerChildren.elementAt(15).toPlainTextString());
            topic.setTopicDescribe(headerChildren.elementAt(16).toPlainTextString());
            // link to the poster's avatar
            Node frNode = headerChildren.extractAllNodesThatMatch(divFilter).elementAt(0);
            ImageTag imgNode = (ImageTag) frNode.getFirstChild().getFirstChild();
            topic.setAnn_img(imgNode.getImageURL());
            // cell div
            Node cellNode = list.extractAllNodesThatMatch(divFilter).elementAt(1);
            Node topic_content = cellNode.getChildren().extractAllNodesThatMatch(divFilter).elementAt(0);
            Node markdown_body = topic_content.getChildren().extractAllNodesThatMatch(divFilter).elementAt(0);
            topic.setTopicBody(markdown_body.toPlainTextString()); // plain text only for now, no links or images
        } else if (node.getText().contains("id=\"r_")) { // the node is a reply
            Reply reply = new Reply();
            Node tableNode = node.getChildren().extractAllNodesThatMatch(tagFilter).elementAt(0);
            Node trNode = tableNode.getChildren().extractAllNodesThatMatch(tagFilter).elementAt(0);
            // tags inside the reply row
            NodeList tagList = trNode.getChildren().extractAllNodesThatMatch(tagFilter);
            ImageTag reply_img = (ImageTag) tagList.elementAt(0).getChildren()
                    .extractAllNodesThatMatch(tagFilter).elementAt(0);
            reply.setReply_img(reply_img.getImageURL());
            // the reply body is not extracted yet
            replylist.add(reply);
        }
    }
    System.out.println("-----------Entities----------------");
    Forum forum = new Forum(topic, replylist);
    System.out.println(forum.toString());
    return forum; // the original returned null here, discarding the parsed result
}
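
The article does not show the Topic, Reply, and Forum beans. The following is a guess at a minimal shape, inferred only from the setters and the constructor called in the parsing code above; the field names, the getters, the toString formats, and the BeanDemo class are assumptions and may differ from the project's real classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical beans; only the setters and the Forum constructor are
// inferred from the parsing code, everything else is assumed.
class Topic {
    private String topicName;
    private String topicDescribe;
    private String topicBody;
    private String ann_name; // poster name
    private String ann_img;  // poster avatar URL

    public void setTopicName(String topicName) { this.topicName = topicName; }
    public void setTopicDescribe(String topicDescribe) { this.topicDescribe = topicDescribe; }
    public void setTopicBody(String topicBody) { this.topicBody = topicBody; }
    public void setAnn_name(String ann_name) { this.ann_name = ann_name; }
    public void setAnn_img(String ann_img) { this.ann_img = ann_img; }
    public String getTopicName() { return topicName; }

    @Override
    public String toString() {
        return "Topic{name='" + topicName + "', author='" + ann_name
                + "', describe='" + topicDescribe + "', avatar='" + ann_img
                + "', body='" + topicBody + "'}";
    }
}

class Reply {
    private String reply_img; // replier avatar URL

    public void setReply_img(String reply_img) { this.reply_img = reply_img; }
    public String getReply_img() { return reply_img; }

    @Override
    public String toString() {
        return "Reply{avatar='" + reply_img + "'}";
    }
}

class Forum {
    private final Topic topic;
    private final List<Reply> replies;

    public Forum(Topic topic, List<Reply> replies) {
        this.topic = topic;
        this.replies = replies;
    }

    public Topic getTopic() { return topic; }
    public List<Reply> getReplies() { return replies; }

    @Override
    public String toString() {
        return "Forum{topic=" + topic + ", replies=" + replies + "}";
    }
}

class BeanDemo {
    public static void main(String[] args) {
        Topic topic = new Topic();
        topic.setTopicName("Sample title");
        List<Reply> replies = new ArrayList<Reply>();
        replies.add(new Reply());
        System.out.println(new Forum(topic, replies));
    }
}
```

Keeping the beans free of any HTMLParser types means the parsing code is the only place that depends on the page's structure.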

That completes the parsing. Now write a test method that parses a thread to try it out:

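@Test
public void test() throws Exception {
    Html2Domain parse = new Html2DomainImpl();
    parse.parse2Thread("http://www.v2ex.com/t/262409#reply6", "UTF-8");
}

Let's look at the result:

The output is too long, so the screenshot only shows the thread title and the thread body; try it yourself if you are interested. Pay close attention to the URL: thread links on this site seem to expire after a while, so if the fetch fails, try a different thread URL.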
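
Since a thread link may stop working, one option is to try several candidate URLs in order and use the first one that yields a page. This is only a sketch of that idea, assuming Java 8 or later: the fetcher is passed in as a function, so no real network call is made here, and the class FallbackFetch, the method firstAvailable, the fake fetcher, and the second URL are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

final class FallbackFetch {
    // Tries the URLs in order and returns the first non-empty result,
    // or Optional.empty() if every attempt fails.
    static Optional<String> firstAvailable(List<String> urls,
                                           Function<String, Optional<String>> fetcher) {
        for (String url : urls) {
            Optional<String> page = fetcher.apply(url);
            if (page.isPresent()) {
                return page;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Fake fetcher standing in for the real parsing call:
        // it pretends that only the thread from this article is still available.
        Function<String, Optional<String>> fakeFetcher = url ->
                url.contains("262409")
                        ? Optional.of("page for " + url)
                        : Optional.<String>empty();

        List<String> candidates = Arrays.asList(
                "http://example.invalid/t/expired",      // made-up expired link
                "http://www.v2ex.com/t/262409#reply6");  // URL used in this article

        System.out.println(firstAvailable(candidates, fakeFetcher).orElse("nothing fetched"));
    }
}
```

In the real project, the fetcher could wrap parse2Thread and return an empty Optional when the result is null.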

Project code is attached below; it was tested with JDK 1.6 and Eclipse Kepler.

http://pan.baidu.com/s/1mh9OuDi
