
Scraping Qidian novels with a Java crawler: a hands-on crawling guide | Scraping the Qidian novel site

Published: 2023/12/14 47 豆豆


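Original title: A Hands-On Guide to Web Scraping | Scraping the Qidian Novel Site

Many readers enjoy novels, and programmers in particular are often hooked on wuxia and science fiction. Cultivation (xiuxian) novels have also become popular lately, for example 凡人修仙傳, 武動乾坤, and 斗破蒼穹. This post shares a small Python script for scraping a novel.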

Goal

Scrape one xianxia novel and save it locally as txt files, one per chapter. This example uses “大周仙吏”.

Project setup

Software: PyCharm

Third-party libraries: requests, fake_useragent, lxml

Site address: https://book.qidian.com

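Site analysis

Open the URL:

To check whether the page is statically rendered, press Ctrl+U to view the page source, press Ctrl+F to open the search box, and search for 第一章 ("Chapter 1").

The text is found in the source, so the page is statically loaded.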
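The same check can be scripted. The sketch below is only an illustration and is not part of the original script: `contains_marker` is a hypothetical helper that reports whether a marker string (here the chapter heading 第一章) appears in the raw HTML returned by the server. If it does, the content is in the static markup rather than injected later by JavaScript. The two sample strings are stand-ins, not real Qidian markup.

```python
def contains_marker(raw_html: str, marker: str = "第一章") -> bool:
    """Return True if the marker text is present in the unrendered HTML source."""
    return marker in raw_html


# Stand-in page sources (assumptions for illustration, not real Qidian markup).
static_page = '<ul class="cf"><li><a href="//example.invalid/1">第一章 示例</a></li></ul>'
dynamic_page = '<div id="app"></div><script src="/bundle.js"></script>'

print(contains_marker(static_page))
print(contains_marker(dynamic_page))
```

In practice the string passed as `raw_html` would be the decoded response body, which is what Ctrl+U shows in the browser.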

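Anti-scraping analysis

Repeated requests from the same IP address risk getting that address blocked. This script uses fake_useragent to generate a random User-Agent request header, so the requests do not carry the default requests User-Agent. This changes only the header, not the IP address.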
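A minimal sketch of the idea, under stated assumptions: `UA_POOL`, `random_headers`, and `polite_pause` are hypothetical names, and a small hard-coded pool stands in for fake_useragent's `ua.random` so the example runs without third-party packages. The pause between requests is an addition of this sketch, not something the original script does; spacing requests out is a common way to lower the chance of being blocked.

```python
import random
import time

# Hypothetical pool of User-Agent strings; fake_useragent's `ua.random`
# would play this role in the article's script.
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def random_headers() -> dict:
    """Build a headers dict with a User-Agent drawn at random from the pool."""
    return {"User-Agent": random.choice(UA_POOL)}


def polite_pause(min_seconds: float = 1.0, max_seconds: float = 3.0) -> float:
    """Sleep for a random interval between requests and return its length."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


headers = random_headers()
print(headers["User-Agent"])
```

Calling `random_headers()` before each request, rather than once per run, gives every request its own User-Agent.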

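Code implementation

1. Import the required third-party libraries, define a class that inherits from object, and give it an __init__ method and a main method, both of which take self as their first parameter.

    import requests
    from fake_useragent import UserAgent
    from lxml import etree


    class qidian(object):
        def __init__(self):
            # Catalogue page of the target novel
            self.url = 'https://book.qidian.com/info/1020580616#Catalog'
            # The original passed verify_ssl=False; that argument exists only
            # in older fake_useragent releases, so it is omitted here
            ua = UserAgent()
            # Pick a random User-Agent for this run
            self.headers = {'User-Agent': ua.random}

        def main(self):
            pass


    if __name__ == '__main__':
        spider = qidian()
        spider.main()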

2. Send the request and fetch the page.

    def get_html(self, url):
        # Download the page and decode the raw bytes as UTF-8
        response = requests.get(url, headers=self.headers)
        html = response.content.decode('utf-8')
        return html

3. Get the link address of each chapter.

    import requests
    from lxml import etree
    from fake_useragent import UserAgent


    class qidian(object):
        def __init__(self):
            self.url = 'https://book.qidian.com/info/1020580616#Catalog'
            ua = UserAgent()
            self.headers = {'User-Agent': ua.random}

        def get_html(self, url):
            response = requests.get(url, headers=self.headers)
            html = response.content.decode('utf-8')
            return html

        def parse_html(self, html):
            target = etree.HTML(html)
            # Chapter links (protocol-relative, so 'https:' is prepended below)
            links = target.xpath('//ul[@class="cf"]/li/a/@href')
            # Chapter names; text() needs the parentheses to select text nodes
            names = target.xpath('//ul[@class="cf"]/li/a/text()')
            for link, name in zip(links, names):
                print(name + '\t' + 'https:' + link)

        def main(self):
            url = self.url
            html = self.get_html(url)
            self.parse_html(html)


    if __name__ == '__main__':
        spider = qidian()
        spider.main()

Printed output:
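The two XPath queries can be tried offline against a small sample. The sketch below assumes the catalogue markup looks like the `SAMPLE` string, which is a stand-in written for this example (the live page may differ), and the `example.invalid` links are placeholders. It also shows why the parentheses in `text()` matter: `a/text` without them selects child elements named `text` and so returns an empty list.

```python
from lxml import etree

# Stand-in catalogue markup (an assumption; the live page may differ).
SAMPLE = """
<ul class="cf">
  <li><a href="//example.invalid/chapter/1">第一章 示例甲</a></li>
  <li><a href="//example.invalid/chapter/2">第二章 示例乙</a></li>
</ul>
"""

tree = etree.HTML(SAMPLE)
links = tree.xpath('//ul[@class="cf"]/li/a/@href')   # attribute values
names = tree.xpath('//ul[@class="cf"]/li/a/text()')  # text nodes; note the ()

# Pair each chapter name with its absolute URL.
chapters = [(name, "https:" + link) for name, link in zip(names, links)]
for name, url in chapters:
    print(name + "\t" + url)
```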

4. Follow each link and get the content of every chapter.

    def parse_html(self, html):
        target = etree.HTML(html)
        links = target.xpath('//ul[@class="cf"]/li/a/@href')
        for link in links:
            host = 'https:' + link
            # Fetch and parse the chapter page
            res = requests.get(host, headers=self.headers)
            c = res.content.decode('utf-8')
            target = etree.HTML(c)
            # Chapter title
            names = target.xpath('//span[@class="content-wrap"]/text()')
            # Chapter body, one entry per paragraph
            results = target.xpath('//div[@class="read-content j_readContent"]/p/text()')
            for name in names:
                print(name)
            for result in results:
                print(result)

Printed output (too long to show in full; only part is included):
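The script builds each chapter URL by prepending 'https:' to the link, which works when every href is protocol-relative (starts with `//`). A slightly more general alternative, sketched here as an assumption rather than the author's method, is `urllib.parse.urljoin` from the standard library. `absolute_url` is a hypothetical helper and the `example.invalid` links are placeholders.

```python
from urllib.parse import urljoin

# Catalogue page used as the base for resolving links.
BASE = "https://book.qidian.com/info/1020580616"


def absolute_url(href: str, base: str = BASE) -> str:
    """Resolve a protocol-relative, root-relative, or absolute href against base."""
    return urljoin(base, href)


print(absolute_url("//example.invalid/chapter/1"))       # protocol-relative
print(absolute_url("/chapter/1"))                        # root-relative
print(absolute_url("https://example.invalid/chapter/1")) # already absolute
```

With `urljoin`, an href that is already absolute passes through unchanged, whereas plain string concatenation would produce a malformed URL.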

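5. Save the text locally as txt files.

    # name is the chapter title from step 4; it becomes the file name
    with open('F:/pycharm文件/document/' + name + '.txt', 'a', encoding='utf-8') as f:
        for result in results:
            # print(result)
            f.write(result + '\n')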

Result:

Open the output directory:
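Chapter titles are used directly as file names, so a title containing a character such as `?` or `:` would make `open` fail on Windows. The sketch below shows one way to guard against that. `safe_filename` and `save_chapter` are hypothetical helpers, the chapter text is placeholder data, and a temporary directory stands in for the output folder. Unlike the script, it opens the file in 'w' mode, so rerunning it overwrites a chapter instead of appending to it.

```python
import re
import tempfile
from pathlib import Path


def safe_filename(title: str) -> str:
    """Replace characters that Windows forbids in file names and trim whitespace."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return cleaned or "untitled"


def save_chapter(out_dir: Path, title: str, paragraphs: list) -> Path:
    """Write one chapter to <out_dir>/<title>.txt as UTF-8, one paragraph per line."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / (safe_filename(title) + ".txt")
    with open(path, "w", encoding="utf-8") as f:
        for paragraph in paragraphs:
            f.write(paragraph.strip() + "\n")
    return path


# Demonstration with placeholder text in a temporary directory.
demo_dir = Path(tempfile.mkdtemp())
saved = save_chapter(demo_dir, "第一章: 示例?", ["第一段。", "第二段。"])
print(saved.name)
```

Passing `encoding="utf-8"` explicitly keeps the Chinese text intact regardless of the platform's default encoding.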

Complete code

    import requests
    from lxml import etree
    from fake_useragent import UserAgent


    class qidian(object):
        def __init__(self):
            # Catalogue page of the target novel
            self.url = 'https://book.qidian.com/info/1020580616#Catalog'
            ua = UserAgent()
            # Pick a random User-Agent for this run
            self.headers = {'User-Agent': ua.random}

        def get_html(self, url):
            response = requests.get(url, headers=self.headers)
            html = response.content.decode('utf-8')
            return html

        def parse_html(self, html):
            target = etree.HTML(html)
            links = target.xpath('//ul[@class="cf"]/li/a/@href')
            for link in links:
                host = 'https:' + link
                # Fetch and parse the chapter page
                res = requests.get(host, headers=self.headers)
                c = res.content.decode('utf-8')
                target = etree.HTML(c)
                names = target.xpath('//span[@class="content-wrap"]/text()')
                results = target.xpath('//div[@class="read-content j_readContent"]/p/text()')
                for name in names:
                    print(name)
                    # Append the chapter text to a file named after the chapter
                    with open('F:/pycharm文件/document/' + name + '.txt', 'a', encoding='utf-8') as f:
                        for result in results:
                            # print(result)
                            f.write(result + '\n')

        def main(self):
            url = self.url
            html = self.get_html(url)
            self.parse_html(html)


    if __name__ == '__main__':
        spider = qidian()
        spider.main()
