Parsing HTML in Java: A First Try with HTMLParser

Published: 2024/3/12 · HTML · 豆豆

To scrape data from a web page, I tried building a small crawler with HTMLParser.

Below is a small example that scrapes the posts of a forum thread.


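1. Introduction to HtmlParser

htmlparser is an HTML parsing library written in pure Java, used mainly to transform or extract HTML. It is a good choice for analyzing fetched web pages; unfortunately, reference documentation is scarce.
Project home page: http://htmlparser.sourceforge.net/
API documentation: http://htmlparser.sourceforge.net/javadoc/index.html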

2. Create a Maven project

Add the required dependencies

pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.fancy</groupId>
    <artifactId>htmlParser</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.htmlparser</groupId>
            <artifactId>htmlparser</artifactId>
            <version>2.1</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
</project>

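2.1 Create a parser

Use a Parser to fetch and analyze a web page.

The parser does not handle asynchronous requests made by the page. After fetching the page, it parses the whole page into a DOM tree stored as nodes/tags of various kinds, and we can then use filters to select the nodes we want.

The node types that ship with htmlparser are as follows: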


org.htmlparser
Interface Node

All Superinterfaces:
    Cloneable
All Known Subinterfaces:
    Remark, Tag, Text
All Known Implementing Classes:
    AbstractNode, AppletTag, BaseHrefTag, BodyTag, Bullet, BulletList, CompositeTag, DefinitionList, DefinitionListBullet, Div, DoctypeTag, FormTag, FrameSetTag, FrameTag, HeadingTag, HeadTag, Html, ImageTag, InputTag, JspTag, LabelTag, LinkTag, MetaTag, ObjectTag, OptionTag, ParagraphTag, ProcessingInstructionTag, RemarkNode, ScriptTag, SelectTag, Span, StyleTag, TableColumn, TableHeader, TableRow, TableTag, TagNode, TextareaTag, TextNode, TitleTag


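A parsed page consists of these nodes and the parent-child containment relationships between them.

Every node provides the following methods (many node types add more of their own; for example, LinkTag has methods for getting the link's URL, checking the URL's protocol type, and so on).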


Method Summary

void accept(NodeVisitor visitor)
    Apply the visitor to this node.
Object clone()
    Allow cloning of nodes.
void collectInto(NodeList list, NodeFilter filter)
    Collect this node and its child nodes into a list, provided the node satisfies the filtering criteria.
void doSemanticAction()
    Perform the meaning of this tag.
NodeList getChildren()
    Get the children of this node.
int getEndPosition()
    Gets the ending position of the node.
Node getFirstChild()
    Get the first child of this node.
Node getLastChild()
    Get the last child of this node.
Node getNextSibling()
    Get the next sibling to this node.
Page getPage()
    Get the page this node came from.
Node getParent()
    Get the parent of this node.
Node getPreviousSibling()
    Get the previous sibling to this node.
int getStartPosition()
    Gets the starting position of the node.
String getText()
    Returns the text of the node.
void setChildren(NodeList children)
    Set the children of this node.
void setEndPosition(int position)
    Sets the ending position of the node.
void setPage(Page page)
    Set the page this node came from.
void setParent(Node node)
    Sets the parent of this node.
void setStartPosition(int position)
    Sets the starting position of the node.
void setText(String text)
    Sets the string contents of the node.
String toHtml()
    Return the HTML for this node.
String toHtml(boolean verbatim)
    Return the HTML for this node.
String toPlainTextString()
    A string representation of the node.
String toString()
    Return the string representation of the node.

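Node filters can select nodes by node type or by the parent-child relationships between nodes, and you can also write your own custom filters. Several filters can be combined into a compound filter for multi-condition filtering, for example with AndFilter, NotFilter, OrFilter, or XorFilter.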

Class Summary

AndFilter
    Accepts nodes matching all of its predicate filters (AND operation).
CssSelectorNodeFilter
    A NodeFilter that accepts nodes based on whether they match a CSS2 selector.
HasAttributeFilter
    This class accepts all tags that have a certain attribute, and optionally, with a certain value.
HasChildFilter
    This class accepts all tags that have a child acceptable to the filter.
HasParentFilter
    This class accepts all tags that have a parent acceptable to another filter.
HasSiblingFilter
    This class accepts all tags that have a sibling acceptable to another filter.
IsEqualFilter
    This class accepts only one specific node.
LinkRegexFilter
    This class accepts tags of class LinkTag that contain a link matching a given regex pattern.
LinkStringFilter
    This class accepts tags of class LinkTag that contain a link matching a given pattern string.
NodeClassFilter
    This class accepts all tags of a given class.
NotFilter
    Accepts all nodes not acceptable to its predicate filter.
OrFilter
    Accepts nodes matching any of its predicate filters (OR operation).
RegexFilter
    This filter accepts all string nodes matching a regular expression.
StringFilter
    This class accepts all string nodes containing the given string.
TagNameFilter
    This class accepts all tags matching the tag name.
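To make the combination semantics concrete, here is a minimal sketch that uses java.util.function.Predicate from the JDK (Java 8 or later) as a stand-in for NodeFilter, so it runs without the htmlparser jar. The class FilterComposition, the helpers tagName, hasAttribute and select, and the sample tag strings are all made up for illustration and are not part of htmlparser. With the real library you would pass NodeFilter instances to AndFilter, OrFilter, NotFilter, or XorFilter instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Stand-in sketch: Predicate<String> plays the role of NodeFilter, and each
// string plays the role of a tag's text (the tag name plus its attributes).
class FilterComposition {

    // Hypothetical helper: accepts tag text whose tag name equals the given name.
    static Predicate<String> tagName(String name) {
        return text -> text.equals(name) || text.startsWith(name + " ");
    }

    // Hypothetical helper: accepts tag text that carries the given attribute.
    static Predicate<String> hasAttribute(String attr) {
        return text -> text.contains(" " + attr + "=");
    }

    // Keep only the items the filter accepts, preserving their order.
    static List<String> select(List<String> tags, Predicate<String> filter) {
        List<String> out = new ArrayList<>();
        for (String tag : tags) {
            if (filter.test(tag)) {
                out.add(tag);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList(
                "div id=\"r_123456\" class=\"cell\"",
                "div class=\"header\"",
                "a href=\"/member/someone\"",
                "span");

        Predicate<String> isDiv = tagName("div");
        Predicate<String> hasId = hasAttribute("id");

        // AND: both conditions must hold (the role of AndFilter).
        System.out.println(select(tags, isDiv.and(hasId)));
        // OR: either condition is enough (the role of OrFilter).
        System.out.println(select(tags, isDiv.or(hasId)));
        // NOT: invert a condition (the role of NotFilter).
        System.out.println(select(tags, isDiv.negate()));
        // XOR: exactly one of the two conditions holds (the role of XorFilter).
        System.out.println(select(tags, t -> isDiv.test(t) ^ hasId.test(t)));
    }
}
```

With the four sample tags, the AND selection keeps only the div that has an id, the OR selection keeps both divs, and the NOT selection keeps the a and span entries.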

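Scraping a thread from http://www.v2ex.com

First, fetch the page content and study the structure of its elements so that you can build the filters.

Each reply div has an id of r_ followed by six digits, so a regular expression is the recommended way to match it. The topic div carries the style border-bottom: 0px. (Always verify what a filter returns, so that no extra nodes slip in.)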
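As a minimal sketch of the regular-expression approach, the following uses only java.util.regex from the JDK. It assumes, as observed on the page, that reply ids are r_ followed by exactly six digits. The class name ReplyIdMatcher and the sample tag strings are made up for illustration. In htmlparser itself, the same pattern could be tested against node.getText() inside a custom NodeFilter's accept method.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplyIdMatcher {

    // Assumption taken from the article: reply ids are "r_" plus six digits.
    // The (?:^|\s) prefix keeps attributes such as data-id from matching.
    private static final Pattern REPLY_ID =
            Pattern.compile("(?:^|\\s)id=\"(r_\\d{6})\"");

    // Returns the reply id found in the tag text, or null if there is none.
    static String findReplyId(String tagText) {
        Matcher m = REPLY_ID.matcher(tagText);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // prints r_123456
        System.out.println(findReplyId("div id=\"r_123456\" class=\"cell\""));
        // prints null, because the id is not r_ plus six digits
        System.out.println(findReplyId("div id=\"r_more\" class=\"cell\""));
    }
}
```

Compared with a plain contains("id=\"r_") check, the pattern rejects ids that merely start with r_, which is one way to avoid pulling in extra nodes.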

Create a method that returns the collection of topic and reply nodes:

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

/**
 * Get the topic node and all reply nodes from the HTML.
 *
 * @param url    address of the thread page
 * @param ENCODE character encoding of the page
 * @return the matching nodes, or null if parsing fails
 */
protected NodeList getNodelist(String url, String ENCODE) {
    try {
        Parser parser = new Parser(url);
        parser.setEncoding(ENCODE);
        // Filter that accepts the topic div
        NodeFilter filter = new NodeFilter() {
            @Override
            public boolean accept(Node node) {
                return node.getText().contains("style=\"border-bottom: 0px;\"");
            }
        };
        // Filter that accepts every reply div
        NodeFilter replyfilter = new NodeFilter() {
            @Override
            public boolean accept(Node node) {
                return node.getText().contains("id=\"r_");
            }
        };
        // Combine the two filters
        OrFilter allFilter = new OrFilter(filter, replyfilter);
        return parser.extractAllNodesThatMatch(allFilter);
    } catch (ParserException e) {
        e.printStackTrace();
        return null;
    }
}

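With these nodes in hand, the next step is to parse them.

The example code only extracts some of the elements. The rest is legwork: work out the node relationships step by step, and locate the target nodes with filters or by walking the DOM tree.

The code below packs the parsed node data into beans: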


import java.util.ArrayList;
import java.util.List;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.nodes.TagNode;
import org.htmlparser.tags.Div;
import org.htmlparser.tags.HeadingTag;
import org.htmlparser.tags.ImageTag;
import org.htmlparser.util.NodeList;

public Forum parse2Thread(String url, String ENCODE) {
    List<Reply> replylist = new ArrayList<Reply>(); // list of replies
    Topic topic = new Topic();                      // the topic
    NodeFilter divFilter = new NodeClassFilter(Div.class);            // div filter
    NodeFilter headingFilter = new NodeClassFilter(HeadingTag.class); // heading filter
    NodeFilter tagFilter = new NodeClassFilter(TagNode.class);        // filter for any tag
    NodeList nodeList = this.getNodelist(url, ENCODE);
    // Map each node onto the thread entities
    for (int i = 0; i < nodeList.size(); i++) {
        Node node = nodeList.elementAt(i);
        if (node.getText().contains("style=\"border-bottom: 0px;\"")) {
            // The node is the topic
            NodeList list = node.getChildren(); // children of the node
            // header div
            Node headerNode = list.extractAllNodesThatMatch(divFilter).elementAt(0);
            // Thread title
            Node h1Node = headerNode.getChildren().extractAllNodesThatMatch(headingFilter).elementAt(0);
            topic.setTopicName(h1Node.toPlainTextString());
            // Poster information (child positions 15 and 16 as in the original code)
            NodeList headerChildren = headerNode.getChildren();
            topic.setAnn_name(headerChildren.elementAt(15).toPlainTextString());
            topic.setTopicDescribe(headerChildren.elementAt(16).toPlainTextString());
            // Link to the poster's avatar
            Node frNode = headerChildren.extractAllNodesThatMatch(divFilter).elementAt(0);
            ImageTag imgNode = (ImageTag) frNode.getFirstChild().getFirstChild();
            topic.setAnn_img(imgNode.getImageURL());
            // cell div
            Node cellNode = list.extractAllNodesThatMatch(divFilter).elementAt(1);
            Node topic_content = cellNode.getChildren().extractAllNodesThatMatch(divFilter).elementAt(0);
            Node markdown_body = topic_content.getChildren().extractAllNodesThatMatch(divFilter).elementAt(0);
            // Plain text only for now, without links or images
            topic.setTopicBody(markdown_body.toPlainTextString());
        } else if (node.getText().contains("id=\"r_")) {
            // The node is a reply
            Reply reply = new Reply();
            Node tableNode = node.getChildren().extractAllNodesThatMatch(tagFilter).elementAt(0);
            Node trNode = tableNode.getChildren().extractAllNodesThatMatch(tagFilter).elementAt(0);
            // Tag nodes inside the reply
            NodeList tagList = trNode.getChildren().extractAllNodesThatMatch(tagFilter);
            ImageTag reply_img = (ImageTag) tagList.elementAt(0).getChildren()
                    .extractAllNodesThatMatch(tagFilter).elementAt(0);
            reply.setReply_img(reply_img.getImageURL());
            // nodeList bodyNode = tagList;
            replylist.add(reply);
        }
    }
    System.out.println("-----------Entities----------------");
    Forum forum = new Forum(topic, replylist);
    System.out.println(forum.toString());
    return forum; // the original returned null here, which discarded the result
}
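The Topic, Reply, and Forum beans are not shown in the article. A minimal sketch that is consistent with the setters and the constructor called above might look like the following. The field names, the getters, the toString format, and the BeanDemo class are assumptions made for illustration, and the real classes in the attached project may differ. The classes are package-private only so that the sketch fits in a single file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical bean for the opening post; it has only the setters used above.
class Topic {
    private String topicName;
    private String topicDescribe;
    private String topicBody;
    private String ann_name; // name of the poster
    private String ann_img;  // URL of the poster's avatar

    public void setTopicName(String topicName) { this.topicName = topicName; }
    public void setTopicDescribe(String topicDescribe) { this.topicDescribe = topicDescribe; }
    public void setTopicBody(String topicBody) { this.topicBody = topicBody; }
    public void setAnn_name(String ann_name) { this.ann_name = ann_name; }
    public void setAnn_img(String ann_img) { this.ann_img = ann_img; }

    @Override
    public String toString() {
        return "Topic{name=" + topicName + ", by=" + ann_name + ", avatar=" + ann_img
                + ", describe=" + topicDescribe + ", body=" + topicBody + "}";
    }
}

// Hypothetical bean for one reply; the article only fills in the avatar URL.
class Reply {
    private String reply_img;

    public void setReply_img(String reply_img) { this.reply_img = reply_img; }

    @Override
    public String toString() {
        return "Reply{avatar=" + reply_img + "}";
    }
}

// Hypothetical container for the whole thread: one topic plus its replies.
class Forum {
    private final Topic topic;
    private final List<Reply> replies;

    public Forum(Topic topic, List<Reply> replies) {
        this.topic = topic;
        this.replies = replies;
    }

    public Topic getTopic() { return topic; }
    public List<Reply> getReplies() { return replies; }

    @Override
    public String toString() {
        return "Forum{topic=" + topic + ", replies=" + replies + "}";
    }
}

class BeanDemo {
    public static void main(String[] args) {
        Topic topic = new Topic();
        topic.setTopicName("Example thread");
        Reply reply = new Reply();
        reply.setReply_img("http://example.com/avatar.png");
        List<Reply> replies = new ArrayList<Reply>();
        replies.add(reply);
        System.out.println(new Forum(topic, replies));
    }
}
```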

That completes the parsing. Now write a test method and try it on a thread:

@Test
public void test() throws Exception {
    Html2Domain parse = new Html2DomainImpl();
    parse.parse2Thread("http://www.v2ex.com/t/262409#reply6", "UTF-8");
}

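Here is the result:

The output is too long, so the screenshot only shows the thread title and the thread body; if you are interested, run the test yourself. Pay close attention to the URL: thread links on this site seem to expire after a while, so if the fetch fails, try the URL of a different thread.

The project code is attached (tested with JDK 1.6 and Eclipse Kepler):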

http://pan.baidu.com/s/1mh9OuDi
