
Crawling All the Articles of a Blog with Java

Published: 2025/3/15 · Programming Q&A · 42 · 豆豆


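Main idea:

1. Identify the list pages.

2. Identify the article pages.

3. Use a queue to hold the URLs waiting to be crawled. Crawl the URL at the head of the queue, and keep going as long as the queue is not empty.

4. If the page is a list page, extract every article URL in it and add them to the queue; if it is an article page, save it directly to local disk.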
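The four steps above can be sketched without any network access. In this minimal sketch the class name `CrawlQueueSketch`, the method `crawl`, and the in-memory map standing in for the web are all hypothetical; the `seen` set is an addition that is not part of the steps, included only so that a URL linked from several list pages is processed once.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch: the "web" is a map from a list-page URL to the
// article URLs it links to, so the loop can be tried offline.
class CrawlQueueSketch {

    // Returns the article URLs in the order they would be saved.
    static List<String> crawl(List<String> listPages,
                              Map<String, List<String>> linksOnPage) {
        Queue<String> queue = new ArrayDeque<>(listPages); // URLs waiting to be crawled
        Set<String> seen = new HashSet<>(listPages);       // assumption: skip repeated URLs
        List<String> saved = new ArrayList<>();

        while (!queue.isEmpty()) {            // keep crawling until the queue is empty
            String url = queue.poll();        // take the URL at the head of the queue
            List<String> links = linksOnPage.get(url);
            if (links != null) {              // list page: enqueue its article URLs
                for (String link : links) {
                    if (seen.add(link)) {
                        queue.add(link);
                    }
                }
            } else {                          // article page: record it as saved
                saved.add(url);
            }
        }
        return saved;
    }
}
```

Calling `crawl` with two list pages that both link to the same article yields that article once, in first-seen order. A real crawler would replace the map lookup with an HTTP fetch and the `saved` list with a file write.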

The start page URL of a blog looks like this:

http://www.cnblogs.com/joyeecheung/

Page n of the article list looks like this:

http://www.cnblogs.com/joyeecheung/default.html?page=n

An article URL matches this pattern:

http://www.cnblogs.com/joyeecheung/p/[0-9]+.html
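These three URL shapes can be expressed as a small helper. This is only a sketch: the names `BlogUrlSketch`, `listPageUrl`, and `isArticle` are hypothetical, and treating the start page as page 1 is an assumption. The dots in the pattern are escaped so that they match a literal `.` rather than any character.

```java
import java.util.regex.Pattern;

// Hypothetical helper for the URL shapes described above.
class BlogUrlSketch {
    static final String BASE = "http://www.cnblogs.com/joyeecheung/";

    // Article page: .../p/<digits>.html, with the dots matched literally.
    static final Pattern ARTICLE = Pattern.compile(
            "http://www\\.cnblogs\\.com/joyeecheung/p/[0-9]+\\.html");

    // Assumption: page 1 is the start page itself; later pages use ?page=n.
    static String listPageUrl(int n) {
        return n == 1 ? BASE : BASE + "default.html?page=" + n;
    }

    // True only when the whole URL has the article shape.
    static boolean isArticle(String url) {
        return ARTICLE.matcher(url).matches();
    }
}
```

Using `matches()` checks the entire URL, whereas `find()` would also accept a longer string that merely contains an article URL.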

The code is as follows:

    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.Queue;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class boke {
        // Queue of URLs waiting to be crawled
        private Queue<String> data = new LinkedList<String>();
        // Article URLs already enqueued, so each one is downloaded only once
        private Set<String> seen = new HashSet<String>();

        // Article page pattern (dots escaped so they match literally)
        String PAGE = "http://www\\.cnblogs\\.com/joyeecheung/p/[0-9]+\\.html";
        Pattern p = Pattern.compile(PAGE);

        public void action(String target) throws IOException {
            Matcher m = p.matcher(target);
            URL url = new URL(target);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.connect();
            byte[] buf = new byte[1024];
            int len = 0;

            if (m.find()) {
                // Article page: save it to a local file.
                // Split the URL and use the article number as the file name.
                String[] bufen = target.split("/");
                String name = bufen[bufen.length - 1];
                name = name.replaceAll("html", "txt");
                File file = new File(name);
                try (InputStream in = conn.getInputStream();
                     FileOutputStream fp = new FileOutputStream(file)) {
                    while ((len = in.read(buf)) != -1) {
                        fp.write(buf, 0, len);
                    }
                }
            } else {
                // List page: read its content into a ByteArrayOutputStream
                ByteArrayOutputStream outStream = new ByteArrayOutputStream();
                try (InputStream in = conn.getInputStream()) {
                    while ((len = in.read(buf)) != -1) {
                        outStream.write(buf, 0, len);
                    }
                }
                String content = new String(outStream.toByteArray(), StandardCharsets.UTF_8);
                // Extract the article URLs and enqueue each new one
                Matcher page = p.matcher(content);
                while (page.find()) {
                    if (seen.add(page.group())) {
                        data.add(page.group());
                    }
                }
            }
        }

        public static void main(String args[]) throws IOException {
            boke test = new boke();
            // Start page (list page 1)
            String start = "http://www.cnblogs.com/joyeecheung/";
            test.data.add(start);
            // List page URL prefix
            String page = "http://www.cnblogs.com/joyeecheung/default.html?page=";
            // Total number of list pages
            int total = 15;
            // Enqueue list pages 2 through 15
            for (int i = 2; i <= total; i++) {
                test.data.add(page + i);
            }
            // Keep crawling until the queue is empty
            while (!test.data.isEmpty()) {
                test.action(test.data.poll());
            }
        }
    }
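Two steps of the listing can be tried in isolation: pulling article URLs out of a list page's text, and turning an article URL into a local file name. The sketch below is an illustration under assumptions rather than the author's code. The names `LinkExtractSketch`, `extractArticleUrls`, and `fileNameFor` are hypothetical, and the `LinkedHashSet` is there on the assumption that a list page may link to the same article more than once.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helpers mirroring two steps of the crawler.
class LinkExtractSketch {
    static final Pattern ARTICLE = Pattern.compile(
            "http://www\\.cnblogs\\.com/joyeecheung/p/[0-9]+\\.html");

    // Collects each distinct article URL found in the text, in first-seen order.
    static Set<String> extractArticleUrls(String html) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = ARTICLE.matcher(html);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    // Turns ".../p/12345.html" into "12345.txt"; other names are returned unchanged.
    static String fileNameFor(String articleUrl) {
        String last = articleUrl.substring(articleUrl.lastIndexOf('/') + 1);
        if (last.endsWith(".html")) {
            return last.substring(0, last.length() - ".html".length()) + ".txt";
        }
        return last;
    }
}
```

Replacing only the trailing `.html` is slightly stricter than replacing every occurrence of `html` in the name; for purely numeric article names the two give the same result.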


