Learning Jsoup

Published: 2024/1/1

1. Introduction to Jsoup

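Jsoup is an HTML parser for Java that can parse a URL or a piece of HTML text directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

In web crawling, the usual workflow is to fetch a page with HttpClient and then use Jsoup to extract the information you need from it. Jsoup's jQuery-like CSS selectors make it easy to pull out that data.

Jsoup can also fetch a page's source directly from a URL, but it only supports the HTTP and HTTPS protocols, so its fetching capabilities are limited.

For that reason it is mainly used for parsing HTML. The HTML to be parsed can come from a string, a URL, or a file.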
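The three input sources mentioned above can be sketched roughly as follows. This is a minimal illustration, not the only way to do it: the class name `JsoupSourcesSketch`, its helper methods, the sample markup, and the timeout and User-Agent values are all assumptions made for this example. `fromUrl` needs network access, so `main` only exercises the string and file variants.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupSourcesSketch {

    // Parse HTML that is already held in memory as a string.
    public static Document fromString(String html) {
        return Jsoup.parse(html);
    }

    // Parse an HTML file on disk; the charset name tells Jsoup how to decode it.
    public static Document fromFile(Path file) throws IOException {
        return Jsoup.parse(file.toFile(), "UTF-8");
    }

    // Fetch and parse a page over HTTP/HTTPS (requires network access).
    // The User-Agent and timeout values are illustrative assumptions.
    public static Document fromUrl(String url) throws IOException {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0")
                .timeout(10_000)
                .get();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample markup used only for this demonstration.
        String html = "<html><head><title>Demo page</title></head>"
                + "<body><p>Hello</p></body></html>";

        System.out.println("From string: " + fromString(html).title());

        // Write the same markup to a temporary file and parse it from disk.
        Path tmp = Files.createTempFile("jsoup-demo", ".html");
        try {
            Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
            System.out.println("From file:   " + fromFile(tmp).title());
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Whichever source is used, the result is the same kind of Document object, so the extraction code that follows does not need to care where the HTML came from.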

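org.jsoup.Jsoup converts the input HTML into an org.jsoup.nodes.Document object, and you then extract the elements you want from that Document.

org.jsoup.nodes.Document extends org.jsoup.nodes.Element, which in turn extends org.jsoup.nodes.Node. Together these classes provide a rich set of methods for retrieving HTML elements.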
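One practical consequence of this hierarchy is that a Document can be used anywhere an Element or a Node is expected, so element lookup methods can be called on the document itself. The following is a small sketch of that idea; the class name `JsoupHierarchySketch` and the helper `firstParagraphText` are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public class JsoupHierarchySketch {

    // Returns the text of the first <p> element, or "" if there is none.
    public static String firstParagraphText(String html) {
        Document doc = Jsoup.parse(html);

        // Both assignments compile without a cast because
        // Document extends Element and Element extends Node.
        Element asElement = doc;
        Node asNode = doc;

        // Node-level information is available on the document too.
        System.out.println("Child nodes of the document: " + asNode.childNodeSize());

        // Element-level lookup methods work directly on the document.
        Element p = asElement.getElementsByTag("p").first();
        return p == null ? "" : p.text();
    }

    public static void main(String[] args) {
        System.out.println(firstParagraphText("<p>Hi</p><p>There</p>"));
    }
}
```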

Official Jsoup site: https://jsoup.org/

Latest Jsoup download: https://jsoup.org/download

Jsoup documentation in Chinese: http://www.open-open.com/jsoup/

2. Usage Examples

1. Add the Maven dependencies:

    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.2</version>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.10.2</version>
    </dependency>

2. A simple example:

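    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JsoupTest {
        public static void main(String[] args) throws Exception {
            // Create the HttpGet request
            HttpGet httpGet = new HttpGet("http://www.cnblogs.com");

            // Create the HttpClient and execute the request;
            // try-with-resources closes both even if an exception is thrown
            try (CloseableHttpClient httpClient = HttpClients.createDefault();
                 CloseableHttpResponse response = httpClient.execute(httpGet)) {
                HttpEntity entity = response.getEntity();
                String content = EntityUtils.toString(entity, "UTF-8"); // page content

                Document document = Jsoup.parse(content); // parse the page into a Document
                Elements elements = document.getElementsByTag("title"); // elements whose tag is title
                Element element = elements.get(0); // first matching element
                String title = element.text(); // text of the element
                System.out.println("cnblogs title: " + title);

                // getElementById returns null when the id is not present on the page
                Element element2 = document.getElementById("site_nav_top");
                if (element2 != null) {
                    String navTop = element2.text();
                    System.out.println("Motto: " + navTop);
                }
            }
        }
    }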

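3. Finding DOM elements with Jsoup

getElementById(String id): find an element by its id

getElementsByTag(String tagName): find elements by tag name

getElementsByClass(String className): find elements by class name

getElementsByAttribute(String key): find elements that have the given attribute

getElementsByAttributeValue(String key, String value): find elements by attribute name and value

getAllElements(): get all elements

A simple example: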

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JsoupTest2 {
        public static void main(String[] args) throws Exception {
            // Create the HttpGet request with a browser-like User-Agent
            HttpGet httpGet = new HttpGet("http://www.cnblogs.com");
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");

            // try-with-resources closes the client and the response
            try (CloseableHttpClient httpClient = HttpClients.createDefault();
                 CloseableHttpResponse response = httpClient.execute(httpGet)) {
                HttpEntity entity = response.getEntity();
                String content = EntityUtils.toString(entity, "UTF-8"); // page content
                Document document = Jsoup.parse(content); // parse the page into a Document

                // 1. Find elements by tag
                Elements elements = document.getElementsByTag("title");
                Element element = elements.get(0); // first matching element
                String title = element.text(); // text of the element
                System.out.println("cnblogs title: " + title);

                // 2. Find an element by id (null when the id is not present)
                Element element2 = document.getElementById("site_nav_top");
                if (element2 != null) {
                    System.out.println("Motto: " + element2.text());
                }

                // 3. Find elements by CSS class
                Elements elements3 = document.getElementsByClass("post_item");
                System.out.println("============ Elements by class =============");
                for (Element e : elements3) {
                    System.out.println(e.html());
                    System.out.println("------------------------------");
                }

                // 4. Find elements by attribute name
                Elements elements4 = document.getElementsByAttribute("width");
                System.out.println("============ Elements by attribute name =============");
                for (Element e : elements4) {
                    System.out.println(e.toString());
                    System.out.println("------------------------------");
                }

                // 5. Find elements by attribute name and value
                Elements elements5 = document.getElementsByAttributeValue("target", "_blank");
                System.out.println("============ Elements by attribute name and value =============");
                for (Element e : elements5) {
                    System.out.println(e.toString());
                    System.out.println("------------------------------");
                }
            }
        }
    }
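Because the live page can change or be unreachable, it may help to try the same lookup methods against a fixed HTML string first. The sketch below does that; the class name `JsoupLookupSketch` and the sample markup (including the id `nav`) are assumptions invented for this example and are not taken from cnblogs.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupLookupSketch {

    // Hypothetical sample markup; the ids and class names are illustrative only.
    static final String HTML =
            "<html><head><title>Demo</title></head><body>"
            + "<div id=\"nav\">Home</div>"
            + "<div class=\"post_item\">First</div>"
            + "<div class=\"post_item\">Second</div>"
            + "<img src=\"a.png\" width=\"10\">"
            + "<a href=\"/x\" target=\"_blank\">X</a>"
            + "</body></html>";

    // Parse the sample markup into a Document.
    public static Document sample() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = sample();

        // By id: returns a single Element, or null if nothing matches.
        Element nav = doc.getElementById("nav");
        System.out.println("By id: " + (nav == null ? "(none)" : nav.text()));

        // By tag, class, attribute name, and attribute name plus value:
        // each returns an Elements collection, which is empty if nothing matches.
        Elements byTag = doc.getElementsByTag("title");
        Elements byClass = doc.getElementsByClass("post_item");
        Elements byAttr = doc.getElementsByAttribute("width");
        Elements byAttrValue = doc.getElementsByAttributeValue("target", "_blank");

        System.out.println("By tag: " + byTag.size());
        System.out.println("By class: " + byClass.size());
        System.out.println("By attribute: " + byAttr.size());
        System.out.println("By attribute value: " + byAttrValue.size());
    }
}
```

Note the difference in failure modes: getElementById returns null when there is no match, while the getElementsBy... methods return an empty collection.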

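4. Finding elements with CSS- or jQuery-like selectors

This uses the following method of the Element class: public Elements select(String cssQuery)

You pass in a selector string similar to a CSS or jQuery selector, and it returns the matching elements.

An example: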

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JsoupTest3 {
        public static void main(String[] args) throws Exception {
            // Create the HttpGet request with a browser-like User-Agent
            HttpGet httpGet = new HttpGet("http://www.cnblogs.com");
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");

            // try-with-resources closes the client and the response
            try (CloseableHttpClient httpClient = HttpClients.createDefault();
                 CloseableHttpResponse response = httpClient.execute(httpGet)) {
                HttpEntity entity = response.getEntity();
                String content = EntityUtils.toString(entity, "UTF-8"); // page content
                Document document = Jsoup.parse(content); // parse the page into a Document

                // 1. Find all post links
                Elements elements = document.select(".post_item .post_item_body h3 a");
                for (Element ele : elements) {
                    System.out.println("Post title: " + ele.text());
                }
                System.out.println("------------------------ separator ------------------------");

                // 2. Find a elements that have an href attribute
                Elements hrefElements = document.select("a[href]");
                for (Element ele : hrefElements) {
                    System.out.println(ele.toString());
                }
                System.out.println("------------------------ separator ------------------------");

                // 3. Find img elements whose src ends with .png
                Elements imgElements = document.select("img[src$=.png]");
                for (Element ele : imgElements) {
                    System.out.println(ele.toString());
                }
                System.out.println("------------------------ separator ------------------------");

                // 4. Get the first element whose tag is title (null if there is none)
                Element titleEle = document.getElementsByTag("title").first();
                if (titleEle != null) {
                    System.out.println("Title: " + titleEle.text());
                }
            }
        }
    }
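The three selector forms used above (descendant selectors, attribute presence, and attribute suffix match) can also be checked against a small fixed document. This is only a sketch: the class name `JsoupSelectSketch`, the helper `texts`, and the sample markup are assumptions for illustration, loosely modelled on the class names in the example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSelectSketch {

    // Hypothetical markup that mimics the structure the selectors expect.
    static final String HTML =
            "<div id=\"post_list\">"
            + "<div class=\"post_item\"><div class=\"post_item_body\">"
            + "<h3><a href=\"/p/1\">One</a></h3></div></div>"
            + "<div class=\"post_item\"><div class=\"post_item_body\">"
            + "<h3><a href=\"/p/2\">Two</a></h3></div></div>"
            + "</div>"
            + "<a name=\"top\">no href here</a>"
            + "<img src=\"logo.png\"><img src=\"photo.jpg\">";

    // Parse the sample markup into a Document.
    public static Document sample() {
        return Jsoup.parse(HTML);
    }

    // Collect the text of every element matched by the given selector.
    public static List<String> texts(Document doc, String cssQuery) {
        List<String> result = new ArrayList<>();
        for (Element e : doc.select(cssQuery)) {
            result.add(e.text());
        }
        return result;
    }

    public static void main(String[] args) {
        Document doc = sample();

        // Descendant selector: links inside h3 inside .post_item_body inside .post_item.
        System.out.println(texts(doc, ".post_item .post_item_body h3 a"));

        // Attribute presence: only a elements that carry an href attribute.
        System.out.println(doc.select("a[href]").size());

        // Attribute suffix match: img elements whose src ends with .png.
        System.out.println(doc.select("img[src$=.png]").size());
    }
}
```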

5. Getting attribute values of DOM elements with Jsoup

1. Get the blog post titles and URLs from cnblogs, and get the friend links

2. Implementation:

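    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JsoupTest4 {
        public static void main(String[] args) throws Exception {
            // Create the HttpGet request with a browser-like User-Agent
            HttpGet httpGet = new HttpGet("http://www.cnblogs.com");
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");

            // try-with-resources closes the client and the response
            try (CloseableHttpClient httpClient = HttpClients.createDefault();
                 CloseableHttpResponse response = httpClient.execute(httpGet)) {
                HttpEntity entity = response.getEntity();
                String content = EntityUtils.toString(entity, "UTF-8"); // page content
                Document document = Jsoup.parse(content); // parse the page into a Document

                // 1. Use a selector to find all post titles and their links
                Elements ele = document.select("#post_list .post_item .post_item_body h3 a");
                for (Element e : ele) {
                    System.out.println("Post title: " + e.text() + " --- Post URL: " + e.attr("href"));
                }

                // 2. Get the friend links (first() returns null when nothing matches)
                Element linkEle = document.select("#friend_link").first();
                if (linkEle != null) {
                    System.out.println("Friend links as plain text: " + linkEle.text());
                    System.out.println("Friend links as HTML: " + linkEle.html());
                }
            }
        }
    }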
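When a page uses relative links, attr("href") returns the value exactly as written in the markup. If an absolute URL is needed, one option is to parse with a base URI and call absUrl. The sketch below illustrates this under stated assumptions: the class name `JsoupAttrSketch`, the sample markup, and the base URI https://www.example.com/ are placeholders invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupAttrSketch {

    // Placeholder base URI, used only to resolve relative links in this sketch.
    static final String BASE_URI = "https://www.example.com/";

    // Hypothetical markup containing one relative link.
    static final String HTML =
            "<div id=\"post_list\"><a href=\"/p/1\" title=\"first\">One</a></div>";

    // Parse with a base URI and return the first link inside #post_list.
    public static Element firstLink() {
        Document doc = Jsoup.parse(HTML, BASE_URI);
        return doc.select("#post_list a").first();
    }

    public static void main(String[] args) {
        Element link = firstLink();

        // attr returns the raw attribute value from the markup.
        System.out.println("Raw href:      " + link.attr("href"));

        // absUrl resolves the value against the document's base URI.
        System.out.println("Absolute href: " + link.absUrl("href"));

        // hasAttr checks for presence; attr on a missing attribute
        // returns an empty string rather than null.
        System.out.println("Has rel?       " + link.hasAttr("rel"));
        System.out.println("Missing attr:  [" + link.attr("rel") + "]");
    }
}
```

With content fetched through HttpClient, as in the examples in this article, the base URI is not known to Jsoup unless it is passed to Jsoup.parse explicitly.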
