Parsing HTML in Java (a Java web-spider algorithm)

Published: 2025/3/19 · 豆豆


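Recently some friends have been writing web spiders or doing in-depth data mining on web pages. Faced with complex, messy HTML pages, though, many of them give up, because it is hard to extract the data they need.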

The oldest approach is to try regular expressions. For markup this tedious it is probably not worth the effort, and it wastes valuable time.
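To make the drawback concrete, here is a minimal sketch of the regular-expression approach, using only java.util.regex. The class name RegexLinkSketch and the sample markup are hypothetical and chosen for illustration. The pattern only understands quoted href values, so the third link in the sample is silently missed; this is the kind of gap that makes regular expressions tedious for real pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLinkSketch {
    // Matches <a ... href="..."> or <a ... href='...'>.
    // Unquoted attribute values are deliberately not handled.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2)); // group 2 is the text between the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://www.zuzwn.com\">one</a>"
                + "<A class='x' HREF='/two'>two</A>"
                + "<a href=three>three</a></p>";
        // The unquoted link "three" is not found.
        System.out.println(extractLinks(html)); // prints [http://www.zuzwn.com, /two]
    }
}
```

Each additional case (unquoted values, links inside comments or scripts, relative URLs) needs another special rule, which is why the approaches described next hand the parsing to a library.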

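The second approach is the open-source htmlparser package. It is a fairly old project, and the results are probably not great: it does not seem able to analyze HTML in depth, and appears to handle structures only about 5 levels deep.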

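Here is some htmlparser source code that collects all the hyperlinks on a page. As written, it prints only the links equal to http://www.zuzwn.com; uncomment the two print lines to list every link.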

package test;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;

/**
 * @author Arjick@163.com
 */
public class GetLinkTest {

    public static void main(String[] args) {
        try {
            // Use a filter to keep only the <A> tags.
            Parser parser = new Parser("http://www.lovezan.com");
            NodeList nodeList = parser.extractAllNodesThatMatch(new NodeFilter() {
                // Implement this method to decide which nodes are kept.
                public boolean accept(Node node) {
                    return node instanceof LinkTag;
                }
            });

            // Print the results.
            for (int i = 0; i < nodeList.size(); i++) {
                LinkTag n = (LinkTag) nodeList.elementAt(i);
                // System.out.print(n.getStringText() + " ==>> ");
                // System.out.println(n.extractLink());
                try {
                    if (n.extractLink().equals("http://www.zuzwn.com")) {
                        System.out.println(n.extractLink());
                    }
                } catch (Exception e) {
                    // Skip links that cannot be extracted.
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

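The third approach, and the one I use all the time now, is to first clean the HTML into XML and then parse the XML in Java to extract the data. Here is the source for a Java class that cleans HTML: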

package exec;

import java.io.File;
import java.io.IOException;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyXmlSerializer;
import org.htmlcleaner.TagNode;

public class HtmlClean {

    // htmlurl: path of the local HTML file to read; xmlurl: path of the XML file to write.
    public void cleanHtml(String htmlurl, String xmlurl) {
        try {
            long start = System.currentTimeMillis();

            HtmlCleaner cleaner = new HtmlCleaner();
            CleanerProperties props = cleaner.getProperties();
            props.setUseCdataForScriptAndStyle(true);
            props.setRecognizeUnicodeChars(true);
            props.setUseEmptyElementTags(true);
            props.setAdvancedXmlEscape(true);
            props.setTranslateSpecialEntities(true);
            props.setBooleanAttributeValues("empty");

            // Parse the HTML file into a cleaned tag tree.
            TagNode node = cleaner.clean(new File(htmlurl));
            // "vreme" is the elapsed time in milliseconds.
            System.out.println("vreme:" + (System.currentTimeMillis() - start));

            // Write the cleaned tree out as pretty-printed XML.
            new PrettyXmlSerializer(props).writeXmlToFile(node, xmlurl);
            System.out.println("vreme:" + (System.currentTimeMillis() - start));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
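The post shows only the cleaning step. As a rough sketch of the second half (parsing the resulting XML to get the data), the following uses the JDK's built-in DOM and XPath APIs to collect every href, however deeply the link is nested. The class name XmlLinkSketch and the sample XML are hypothetical; the sketch assumes the cleaned file is well-formed, uses lowercase element names, and declares no XML namespace, which should be checked against the actual HtmlCleaner output.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class XmlLinkSketch {

    // Returns the href value of every <a> element in the given XML text.
    public static List<String> extractLinks(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // "//a/@href" selects the href attribute of <a> at any depth.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hrefs = (NodeList) xpath.evaluate(
                    "//a/@href", doc, XPathConstants.NODESET);

            List<String> links = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                links.add(hrefs.item(i).getNodeValue());
            }
            return links;
        } catch (ParserConfigurationException | SAXException
                | IOException | XPathExpressionException e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a cleaned page; one link sits six <div>s deep.
        String xml = "<html><body><div><div><div><div><div><div>"
                + "<a href=\"http://www.zuzwn.com\">deep link</a>"
                + "</div></div></div></div></div></div>"
                + "<a href=\"/top\">top link</a></body></html>";
        System.out.println(extractLinks(xml));
    }
}
```

To read the file written by cleanHtml instead of a string, builder.parse(new File(xmlurl)) can be used in place of the InputSource. Other data can be pulled out the same way by changing the XPath expression.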

Reposted from: https://www.cnblogs.com/zuzwn/p/3602386.html
