Still not sure which novel to read? Scrape the first 10 pages of a novel site and analyze the data

Published: 2025/3/19 Programming Q&A 18 豆豆


Scraping the novel data

Result
Page analysis
- Page URL analysis
- Locating the book entries
- Locating the fields of each book
Saving the content to Excel
Complete code

Result

Page analysis

Page URL analysis

Comparing the listing pages shows that their URLs differ only in the trailing page number.
That gives the URLs of the first 10 pages:

urls = ['https://www.qidian.com/all/page{}/'.format(i) for i in range(1, 11)]
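Before sending any requests, it is worth checking that the generated list looks right. The sketch below wraps the same comprehension in a small helper. The name `build_urls` and its parameters are made up for illustration, and it assumes the `page{}` pattern observed above still holds on the live site.

```python
# Assumed URL pattern, taken from the comparison of listing pages above.
BASE = 'https://www.qidian.com/all/page{}/'


def build_urls(first=1, last=10):
    """Return the listing URLs for pages first..last (inclusive)."""
    # range() excludes its end value, hence last + 1.
    return [BASE.format(page) for page in range(first, last + 1)]


urls = build_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Note that `range(1, 11)` stops at 10, which is why the original comprehension uses 11 as the upper bound.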

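Locating the book entries

Comparing the entries shows that every novel on the page sits in an <li> of the same <ul>.
Once we have the XPath of that <ul>, //*[@id="book-img-text"]/ul, we only need to select the <li> elements beneath it.

# Select all <li> nodes inside the <ul> node
infos = selector.xpath('//*[@id="book-img-text"]/ul/li')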
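The selection can be tried offline on a small hand-written page. The HTML below is a made-up stand-in that only mimics the `id="book-img-text"` / `<ul>` / `<li>` nesting described above; the real page contains much more markup.

```python
from lxml import etree

# Hypothetical sample page: only the nesting matters here.
SAMPLE_HTML = """
<html><body>
<div id="book-img-text">
  <ul>
    <li><div></div><div><h4><a>Book A</a></h4></div></li>
    <li><div></div><div><h4><a>Book B</a></h4></div></li>
  </ul>
</div>
</body></html>
"""

selector = etree.HTML(SAMPLE_HTML)
# One element per book: every <li> directly under the <ul>.
infos = selector.xpath('//*[@id="book-img-text"]/ul/li')
print(len(infos))
```

`xpath()` returns a list of elements here, so `len(infos)` is the number of books found on the page.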

Locating the fields of each book

XPath of the first book's title: //*[@id="book-img-text"]/ul/li[1]/div[2]/h4/a

XPath of the second book's title: //*[@id="book-img-text"]/ul/li[2]/div[2]/h4/a

Only the index in li[ ] differs, so we can write:

title = info.xpath('//*[@id="book-img-text"]/ul/li['+str(i)+']/div[2]/h4/a/text()')[0]

Changing i switches from one book to the next. XPath indices start at 1, so i runs from 1 up to the number of books.
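Because the expression above starts with `//`, lxml evaluates it from the document root even when it is called on `info`, which is why the index i is needed. An alternative, not used in the original code, is a path relative to each `<li>` (starting with `./`), which needs no counter. The sketch below compares both on a made-up sample page; the HTML is an assumption that only mirrors the nesting described above.

```python
from lxml import etree

# Hypothetical sample page with the same nesting as the XPaths above.
SAMPLE_HTML = """
<html><body>
<div id="book-img-text">
  <ul>
    <li><div></div><div><h4><a>Book A</a></h4></div></li>
    <li><div></div><div><h4><a>Book B</a></h4></div></li>
  </ul>
</div>
</body></html>
"""

selector = etree.HTML(SAMPLE_HTML)
infos = selector.xpath('//*[@id="book-img-text"]/ul/li')

# Approach from the text: absolute path plus a 1-based index.
titles_by_index = []
for i in range(1, len(infos) + 1):
    path = '//*[@id="book-img-text"]/ul/li[' + str(i) + ']/div[2]/h4/a/text()'
    titles_by_index.append(selector.xpath(path)[0])

# Alternative: path relative to the current <li>, no index required.
titles_relative = [info.xpath('./div[2]/h4/a/text()')[0] for info in infos]

print(titles_by_index)
print(titles_relative)
```

In both variants the trailing `[0]` raises `IndexError` if a book is missing the node, so a real run may want to check the list length first.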

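Saving the content to Excel

This needs a third-party library:

pip install xlwt

Steps:

Import the library: import xlwt

Create a Workbook object and specify the encoding: book = xlwt.Workbook(encoding='utf-8')

Add a sheet: sheet = book.add_sheet('novels')

Write text to cell (1, 1) of the sheet (row and column indices are zero-based, so this is cell B2): sheet.write(1,1,'世界，你好')

Save the file: book.save('novels.xls')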
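The steps above can be sketched as one small script. To keep the row and column arithmetic easy to check, the sketch first computes (row, column, value) triples and only then hands them to xlwt. The helper names `iter_cells` and `save_xls` and the sample rows are made up for illustration. The `xlwt` import sits inside `save_xls` so the first helper can be tried without the library installed.

```python
def iter_cells(header, rows):
    """Yield (row, col, value) triples: header in row 0, data from row 1 on."""
    for col, name in enumerate(header):
        yield 0, col, name
    for row_index, row in enumerate(rows, start=1):
        for col, value in enumerate(row):
            yield row_index, col, value


def save_xls(cells, path):
    """Write the triples to a sheet named 'novels' and save the workbook."""
    import xlwt  # third-party: pip install xlwt

    book = xlwt.Workbook(encoding='utf-8')
    sheet = book.add_sheet('novels')
    for row, col, value in cells:
        sheet.write(row, col, value)
    book.save(path)


# Made-up sample data, not scraped results.
header = ['Title', 'Author']
rows = [['Book A', 'Author A'], ['Book B', 'Author B']]
cells = list(iter_cells(header, rows))
print(cells)

# Uncomment once xlwt is installed:
# save_xls(cells, 'novels.xls')
```

Keeping the header in row 0 and starting the data at row 1 matches the layout used by the complete code.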

Complete code

import time

import requests
import xlwt
from lxml import etree

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'Host': 'www.qidian.com',
    'Cookie': '_ga_PFYW0QLV3P=GS1.1.1629617197.2.1.1629617333.0'
}


# //*[@id="book-img-text"]/ul/li[2]/div[2]/h4/a
def getOnePage(url):
    html = requests.get(url, headers=headers, allow_redirects=False)
    selector = etree.HTML(html.text)
    # Select all <li> nodes inside the <ul> node
    infos = selector.xpath('//*[@id="book-img-text"]/ul/li')
    print(infos)
    result = []
    i = 1
    # Common prefix of every per-book XPath; the book index is appended below
    pre = '//*[@id="book-img-text"]/ul/li['
    for info in infos:
        # Note 1: xpath() returns a list, so [0] is needed to get the string
        style_1 = info.xpath(pre + str(i) + ']/div[2]/p[1]/a[2]/text()')[0]
        style_2 = info.xpath(pre + str(i) + ']/div[2]/p[1]/a[3]/text()')[0]
        # Extract the title
        title = info.xpath(pre + str(i) + ']/div[2]/h4/a/text()')[0]
        # Extract the author
        author = info.xpath(pre + str(i) + ']/div[2]/p[1]/a[1]/text()')[0]
        # Genre
        style = style_1 + '.' + style_2
        # Completion status
        complete = info.xpath(pre + str(i) + ']/div[2]/p[1]/span/text()')[0]
        # Synopsis
        introduce = info.xpath(pre + str(i) + ']/div[2]/p[2]/text()')[0]
        # Store the fields in a dict
        data = {
            'title': title,
            'author': author,
            'style': style,
            'complete': complete,
            'introduce': introduce
        }
        result.append(data)
        # Move on to the next book
        i += 1
    print(result)
    return result


# Column headers go in row 0 of the sheet
header = ['Title', 'Author', 'Genre', 'Status', 'Synopsis']
book = xlwt.Workbook(encoding='utf-8')
sheet = book.add_sheet('novels')
for h in range(len(header)):
    sheet.write(0, h, header[h])

# getOnePage('https://www.qidian.com/all/')
# Note 2: the trailing / must not be omitted
urls = ['https://www.qidian.com/all/page{}/'.format(i) for i in range(1, 11)]
i = 1  # next free row in the sheet (row 0 holds the header)
# urls = ['https://www.qidian.com/all/']
for url in urls:
    novels = getOnePage(url)
    print(novels)
    for novel in novels:
        print(novel)
        time.sleep(0.1)
        sheet.write(i, 0, novel['title'])
        sheet.write(i, 1, novel['author'])
        sheet.write(i, 2, novel['style'])
        sheet.write(i, 3, novel['complete'])
        sheet.write(i, 4, novel['introduce'])
        i += 1
book.save('novels.xls')
