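Parsing HTML in Java: A First Try with HTMLParser
To scrape data from a web page, I tried using HtmlParser to build a small crawler.
Below is a small example that scrapes the content of a forum post.
1. HtmlParser Overview
htmlparser is a library written in pure Java for parsing HTML, used mainly to transform or extract HTML. It is a good choice for analyzing fetched web pages; unfortunately, there is very little reference documentation. Project home page: http://htmlparser.sourceforge.net/
API documentation: http://htmlparser.sourceforge.net/javadoc/index.html
2. Create a Maven Project
Add the required dependencies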
pom.xml
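2.1 Create a Parser
Use a Parser to fetch and analyze a web page.
The parser does not handle asynchronous requests made by the page. After fetching the page, it parses the whole page into a DOM tree and stores it as nodes/tags of various kinds; we can then use filters to select the nodes we want.
The node interface built into htmlparser is as follows: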
org.htmlparser
Interface Node
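Every node provides the following methods (many node types also implement additional methods of their own; for example, LinkTag has methods for getting the URL of the link tag, checking the protocol type of that URL, and so on).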
| void | accept(NodeVisitor visitor) | Apply the visitor to this node. |
| Object | clone() | Allow cloning of nodes. |
| void | collectInto(NodeList list, NodeFilter filter) | Collect this node and its child nodes into a list, provided the node satisfies the filtering criteria. |
| void | doSemanticAction() | Perform the meaning of this tag. |
| NodeList | getChildren() | Get the children of this node. |
| int | getEndPosition() | Gets the ending position of the node. |
| Node | getFirstChild() | Get the first child of this node. |
| Node | getLastChild() | Get the last child of this node. |
| Node | getNextSibling() | Get the next sibling to this node. |
| Page | getPage() | Get the page this node came from. |
| Node | getParent() | Get the parent of this node. |
| Node | getPreviousSibling() | Get the previous sibling to this node. |
| int | getStartPosition() | Gets the starting position of the node. |
| String | getText() | Returns the text of the node. |
| void | setChildren(NodeList children) | Set the children of this node. |
| void | setEndPosition(int position) | Sets the ending position of the node. |
| void | setPage(Page page) | Set the page this node came from. |
| void | setParent(Node node) | Sets the parent of this node. |
| void | setStartPosition(int position) | Sets the starting position of the node. |
| void | setText(String text) | Sets the string contents of the node. |
| String | toHtml() | Return the HTML for this node. |
| String | toHtml(boolean verbatim) | Return the HTML for this node. |
| String | toPlainTextString() | A string representation of the node. |
| String | toString() | Return the string representation of the node. |
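Node filters: these filters can select nodes by node type or by the parent-child relationships between nodes, and you can also write custom filters. Several filters can be combined into a composite filter for multi-condition filtering,
for example AndFilter, NotFilter, OrFilter, and XorFilter.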
| AndFilter | Accepts nodes matching all of its predicate filters (AND operation). |
| CssSelectorNodeFilter | A NodeFilter that accepts nodes based on whether they match a CSS2 selector. |
| HasAttributeFilter | This class accepts all tags that have a certain attribute, and optionally, with a certain value. |
| HasChildFilter | This class accepts all tags that have a child acceptable to the filter. |
| HasParentFilter | This class accepts all tags that have a parent acceptable to another filter. |
| HasSiblingFilter | This class accepts all tags that have a sibling acceptable to another filter. |
| IsEqualFilter | This class accepts only one specific node. |
| LinkRegexFilter | This class accepts tags of class LinkTag that contain a link matching a given regex pattern. |
| LinkStringFilter | This class accepts tags of class LinkTag that contain a link matching a given pattern string. |
| NodeClassFilter | This class accepts all tags of a given class. |
| NotFilter | Accepts all nodes not acceptable to its predicate filter. |
| OrFilter | Accepts nodes matching any of its predicate filters (OR operation). |
| RegexFilter | This filter accepts all string nodes matching a regular expression. |
| StringFilter | This class accepts all string nodes containing the given string. |
| TagNameFilter | This class accepts all tags matching the tag name. |
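The combining filters in the table behave like boolean operators applied to each filter's accept test. As a rough, library-free analogy only (it uses the JDK's `java.util.function.Predicate` on plain strings instead of HTMLParser's `NodeFilter`, assumes Java 8 or later, and the class and field names are made up for illustration), the four combinations can be sketched like this:

```java
import java.util.function.Predicate;

// Plain-JDK analogy: strings stand in for nodes and Predicate<String>
// stands in for NodeFilter. None of these names come from HTMLParser.
public class FilterComposition {
    static final Predicate<String> IS_DIV = tag -> tag.startsWith("div");
    static final Predicate<String> HAS_ID = tag -> tag.contains("id=");

    // Accepts only when both tests pass (the role AndFilter plays).
    static final Predicate<String> BOTH = IS_DIV.and(HAS_ID);
    // Accepts when at least one test passes (the role OrFilter plays).
    static final Predicate<String> EITHER = IS_DIV.or(HAS_ID);
    // Accepts whatever the wrapped test rejects (the role NotFilter plays).
    static final Predicate<String> NOT_DIV = IS_DIV.negate();
    // Accepts when exactly one of the two tests passes (XOR of two filters).
    static final Predicate<String> EXACTLY_ONE = t -> IS_DIV.test(t) ^ HAS_ID.test(t);

    public static void main(String[] args) {
        String tag = "div id=\"r_123456\"";
        System.out.println(BOTH.test(tag));        // true
        System.out.println(EITHER.test(tag));      // true
        System.out.println(NOT_DIV.test(tag));     // false
        System.out.println(EXACTLY_ONE.test(tag)); // false
    }
}
```

With the real library the same idea would be expressed by passing filter objects to the combining filter's constructor, as the OrFilter in the example later in this post does.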
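Scraping a post from http://www.v2ex.com
First fetch the page content, then analyze the structure of the page elements and build the filters.
You can see that the id of every reply div is r_ followed by six digits, so matching with a regular expression is recommended; the topic div has the style border-bottom: 0px. (Be sure to check the filter results so that no extra nodes slip in.)
Create a method that returns the collection of topic and reply nodes: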
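The regular-expression match suggested above can be sketched with the JDK alone. This is a minimal sketch, assuming the tag text looks like the sample strings below (they, the `class="cell"` attribute, and the class and method names are made up for illustration); only the "r_ plus six digits" pattern comes from the observation above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser.
public class ReplyIdMatcher {
    // id="r_" followed by exactly six digits and the closing quote.
    private static final Pattern REPLY_ID = Pattern.compile("id=\"(r_\\d{6})\"");

    /** Returns the reply id found in the tag text, or null when there is none. */
    static String findReplyId(String tagText) {
        Matcher m = REPLY_ID.matcher(tagText);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(findReplyId("div id=\"r_123456\" class=\"cell\"")); // r_123456
        System.out.println(findReplyId("div id=\"r_12\" class=\"cell\""));     // null
    }
}
```

A check like this is stricter than a plain substring test on `id="r_`, which would also accept ids that merely begin with r_.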
import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

/**
 * Get the topic node and all reply nodes from the HTML.
 *
 * @param url    address of the page to fetch
 * @param ENCODE character encoding of the page
 * @return the matching nodes, or null if parsing fails
 */
protected NodeList getNodelist(String url, String ENCODE) {
    try {
        Parser parser = new Parser(url);
        parser.setEncoding(ENCODE);
        // Filter that accepts the topic div
        NodeFilter filter = new NodeFilter() {
            @Override
            public boolean accept(Node node) {
                return node.getText().contains("style=\"border-bottom: 0px;\"");
            }
        };
        // Filter that accepts every reply div
        NodeFilter replyfilter = new NodeFilter() {
            @Override
            public boolean accept(Node node) {
                String containsString = "id=\"r_";
                return node.getText().contains(containsString);
            }
        };
        // Combine the two filters
        OrFilter allFilter = new OrFilter(filter, replyfilter);
        return parser.extractAllNodesThatMatch(allFilter);
    } catch (ParserException e) {
        e.printStackTrace();
        return null;
    }
}

With these nodes in hand, the next step is to parse them.
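This sample code only extracts some of the elements. The rest is legwork: work out the relationships between nodes step by step, and locate the target nodes with filters or by walking the DOM tree.
The next step is to wrap the parsed node data into a bean.
With the parsing done, write a test method that parses a post to try it out: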
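The author's bean class is not reproduced in this post, so the following is only an illustrative sketch of what such a bean might look like. The class name `ThreadPost` and its fields (title, content, replies) are assumptions; the values would typically come from calls such as toPlainTextString() on the matched nodes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical bean: the name and fields are assumptions, not the author's class.
public class ThreadPost {
    private String title;
    private String content;
    private final List<String> replies = new ArrayList<String>();

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public String getContent() { return content; }
    public void setContent(String content) { this.content = content; }

    // Replies are appended one by one while walking the reply nodes.
    public void addReply(String reply) { replies.add(reply); }

    // Read-only view so callers cannot modify the stored replies directly.
    public List<String> getReplies() { return Collections.unmodifiableList(replies); }

    public static void main(String[] args) {
        ThreadPost post = new ThreadPost();
        post.setTitle("Sample title");
        post.setContent("Sample body");
        post.addReply("First reply");
        System.out.println(post.getTitle() + " has " + post.getReplies().size() + " reply");
    }
}
```

Keeping the bean free of any HTMLParser types means the rest of the program does not depend on the parsing library.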
@Test
public void test() throws Exception {
    Html2Domain parse = new Html2DomainImpl();
    parse.parse2Thread("http://www.v2ex.com/t/262409#reply6", "UTF-8");
}

Take a look at the result:
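The output is too long, so the screenshot only shows the post title and the post body; if you are interested, run the test yourself. Pay close attention to the address: post links on this site seem to expire, so if the test fails to fetch the page, try a different post URL.
Project code attached (tested with JDK 1.6 + Eclipse Kepler):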
http://pan.baidu.com/s/1mh9OuDi