Java Crawler: Implementing a Web Crawler in Plain Java (Scraping a Novel)

Published: 2023/12/20 · java · 豆豆

Java can write crawlers too.

These days the word "crawler" makes most people think of Python first, but Java is also a good choice, and it has many mature crawler frameworks. The example below uses nothing but core Java to scrape a novel:

Functionality:

Scrape the complete novel from the target website

Development environment:

JDK: 1.8.0_191

Eclipse: 2019-03 (4.11.0)

Materials:

Website: http://www.shicimingju.com (if this infringes any rights, please contact me and I will remove it, thank you)

Novel: Romance of the Three Kingdoms (三國演義)

Techniques used in this example:

Regular expressions

Java networking: URL

I/O streams

Map (HashMap)

String manipulation

Exception handling
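Of the techniques listed, the regular expression does most of the extraction work. The following is a minimal sketch of pulling text out of a pair of HTML tags with a non-greedy capture group. The class name TagTextExtractor, the method paragraphs, the <p> tag, and the &nbsp; cleanup are all illustrative assumptions, not details confirmed by the target site.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class TagTextExtractor {
    // Non-greedy group: captures the shortest text between <p> and </p>,
    // so two paragraphs on one line are matched separately.
    private static final Pattern PARAGRAPH = Pattern.compile("<p>(.*?)</p>");

    static List<String> paragraphs(String htmlLine) {
        List<String> result = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(htmlLine);
        while (m.find()) {
            // group(1) is the captured text without the surrounding tags.
            // Dropping &nbsp; entities is an assumed cleaning step.
            result.add(m.group(1).replace("&nbsp;", "").trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "<p>&nbsp;&nbsp;first paragraph</p><p>second paragraph</p>";
        for (String p : paragraphs(line)) {
            System.out.println(p);
        }
    }
}
```

Reading group(1) instead of group() avoids having to strip the tags afterwards with replace calls. Note that "." does not match line breaks by default, so this pattern only finds paragraphs that open and close on the same line.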

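Code approach:

Create a File object for the location where the novel will be saved

Write regular expressions based on the page structure and create Pattern objects

Loop over the chapters, creating a URL object for the request to each chapter page

Read the response through a BufferedReader wrapped around the network stream

Create the output stream for the local file

Read the response line by line and match the wanted content with the regular expressions

Write the matched content to the local file until the loop ends

Pay attention to exception handling throughout the code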
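The read, match, and write steps above can be tried out without touching the network. The sketch below is one possible arrangement, not the article's own listing: the class ChapterCopier, the method copyParagraphs, and the <p> pattern are hypothetical, and a StringReader stands in for the stream that url.openStream() would supply. The try-with-resources statement closes both streams even when an exception is thrown partway through.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class ChapterCopier {
    // Assumed page structure: each paragraph sits on one line inside <p>...</p>.
    private static final Pattern CONTENT = Pattern.compile("<p>(.*?)</p>");

    // Reads the source line by line, writes every matched paragraph to the
    // sink on its own line, and returns the number of paragraphs written.
    static int copyParagraphs(Reader source, Writer sink) throws IOException {
        int count = 0;
        // Both resources are closed automatically, even if reading fails.
        try (BufferedReader reader = new BufferedReader(source);
             BufferedWriter writer = new BufferedWriter(sink)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = CONTENT.matcher(line);
                while (m.find()) {
                    writer.write(m.group(1));
                    writer.newLine();
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        int n = copyParagraphs(new StringReader("<html>\n<p>one</p>\n<p>two</p>\n"), out);
        System.out.println(n + " paragraphs copied");
        System.out.print(out);
    }
}
```

For a real chapter page, the Reader would be an InputStreamReader over url.openStream() and the Writer an OutputStreamWriter over a FileOutputStream opened in append mode.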

Sample code:

package com.qianfeng.text;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GetText {
    public static void main(String[] args) {
        // 1. Create a File object for the location where the novel is saved.
        //    The path separators were lost in the published listing; this path is reconstructed.
        File file = new File("D:\\File\\three_guo.txt");

        // 2. Write regular expressions based on the page structure and create Pattern objects.
        //    The HTML tags inside these patterns were stripped from the published listing;
        //    <p> and <h1> are assumptions and must be checked against the actual page source.
        String regex_content = "<p>(.*?)</p>";
        String regex_title = "<h1>(.*?)</h1>";
        Pattern p_content = Pattern.compile(regex_content);
        Pattern p_title = Pattern.compile(regex_title);

        // 3. Loop over the chapters and request each chapter page.
        for (int i = 1; i <= 120; i++) {
            System.out.println("Chapter " + i + ": download started...");
            try {
                // Create the URL object for this chapter page.
                URL url = new URL("http://www.shicimingju.com/book/sanguoyanyi/" + i + ".html");
                // 4. Read the response through a BufferedReader.
                // 5. Create the output stream for the local file (append mode).
                //    try-with-resources closes both streams even if reading fails.
                try (BufferedReader reader = new BufferedReader(
                             new InputStreamReader(url.openStream(), "utf8"));
                     BufferedWriter writer = new BufferedWriter(
                             new OutputStreamWriter(new FileOutputStream(file, true), "utf8"))) {
                    String str;
                    while ((str = reader.readLine()) != null) {
                        Matcher m_title = p_title.matcher(str);
                        Matcher m_content = p_content.matcher(str);
                        // Extract the chapter title and write it to the local file.
                        if (m_title.find()) {
                            // group(1) is the text between the tags, so no tag stripping is needed.
                            String title = m_title.group(1);
                            System.out.println(title);
                            writer.write("第" + i + "章：" + title);
                            writer.newLine();
                        }
                        while (m_content.find()) {
                            // Clean the extracted text ("&nbsp;" is assumed; the listing shows a bare space).
                            String content = m_content.group(1).replace("&nbsp;", "").replace("?", "");
                            // Write the chapter text to the file.
                            writer.write(content);
                            writer.newLine();
                        }
                    }
                    // Blank line between chapters.
                    writer.newLine();
                }
                System.out.println("Chapter " + i + ": download finished.");
            } catch (Exception e) {
                System.out.println("Download failed");
                e.printStackTrace();
            }
        }
    }
}
