Scraping the Qidian Monthly Ticket Ranking

Published: 2023/12/14


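Contents

- Reconnaissance
- Fetching the Page Source
- Extracting Information with XPath
- Defeating the Font-Based Anti-Scraping
- Extracting and Saving the Information
- Fetching All Pages
- Full Code (Hooray)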

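Reconnaissance

First, open the Qidian monthly ticket ranking page at https://www.qidian.com/rank/yuepiao and look around. Before writing any code we need to decide what to collect. Here we extract each novel's title, author, genre, status, synopsis, latest chapter, last update time, and monthly ticket count.

With the target fields decided, right-click and choose Inspect (or press F12) to open the developer tools.
Click the element picker and select a novel title to locate it in the page structure.
All of the information we need turns out to sit inside that same element.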

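Fetching the Page Source

First call requests.get to fetch the page source and write it to a file:

import requests

# Send a browser-like User-Agent so the request looks like a normal visit
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}
response = requests.get('https://www.qidian.com/rank/yuepiao', headers=headers)

# Save the raw HTML so it can be inspected in a text editor
with open("M:/a.txt", 'w') as f:
    f.write(response.text)

Press Ctrl+F and search the saved text: all of the information is there.
Next, wrap the page-fetching code in a function, getHtml, which the later code calls.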

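Extracting Information with XPath

Look for the tag that holds each piece of information. Each page lists 20 novels, and each novel's information sits under an li tag. Examining where each element sits, and comparing against the other novels, tells us which tags are useful.
Once the useful tags are identified, we can extract the information with XPath: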

from lxml import etree

# getHtml() is the page-fetching function from the previous section
html = getHtml('https://www.qidian.com/rank/yuepiao')
# etree.HTML() already returns a parsed element that supports xpath(),
# so the original tostring()/fromstring() round trip is not needed
html = etree.HTML(html)

# Titles
name = html.xpath('//li//div[@class="book-mid-info"]//h4//a[@data-eid="qd_C40"]//text()')
print(len(name), name)
# Authors
author = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@data-eid="qd_C41"]//text()')
print(len(author), author)
# Genres
types = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@data-eid="qd_C42"]//text()')
print(len(types), types)
# Current status
status = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//span/text()')
print(len(status), status)
# Synopses
intro = html.xpath('//li//div[@class="book-mid-info"]//p[@class="intro"]//text()')
intro = [i.strip() for i in intro]  # strip surrounding whitespace
print(len(intro), intro)
# Latest chapters
update = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//a[@data-eid="qd_C43"]//text()')
update = [i.strip() for i in update]  # strip surrounding whitespace
print(len(update), update)
# Last update times
date = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//span//text()')
print(len(date), date)

The printed lists confirm that we have the data we wanted!
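The queries above return one flat list per field, and those lists only line up if every novel yields exactly one match per query. A minimal sketch of an alternative is to walk over each li and read its fields relative to that element. The sketch uses the standard-library xml.etree.ElementTree on a small hand-written, well-formed sample instead of lxml on the real page. The sample markup and the helper name extract_books are illustrative assumptions; only the class names are taken from the queries above.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed sample that mimics the structure queried above.
# The real page is HTML and would still need lxml's etree.HTML() to parse.
SAMPLE = """
<ul>
  <li>
    <div class="book-mid-info">
      <h4><a data-eid="qd_C40">Book A</a></h4>
      <p class="author"><a data-eid="qd_C41">Author A</a><span>Ongoing</span></p>
      <p class="intro"> Synopsis of book A </p>
    </div>
  </li>
  <li>
    <div class="book-mid-info">
      <h4><a data-eid="qd_C40">Book B</a></h4>
      <p class="author"><a data-eid="qd_C41">Author B</a><span>Completed</span></p>
      <p class="intro"> Synopsis of book B </p>
    </div>
  </li>
</ul>
"""


def extract_books(markup):
    """Return one dict per <li>, so the fields of a book stay together."""
    root = ET.fromstring(markup)
    books = []
    for li in root.iter('li'):
        info = li.find("div[@class='book-mid-info']")
        if info is None:  # skip list items that are not book entries
            continue
        books.append({
            'name': info.findtext("h4/a", default='').strip(),
            'author': info.findtext("p[@class='author']/a", default='').strip(),
            'status': info.findtext("p[@class='author']/span", default='').strip(),
            'intro': info.findtext("p[@class='intro']", default='').strip(),
        })
    return books


for book in extract_books(SAMPLE):
    print(book)
```

Because every field is looked up inside its own li, a missing or duplicated node affects only that one record instead of shifting every later entry.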

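Defeating the Font-Based Anti-Scraping

Next comes the monthly ticket count. But when we inspect it, the value in the source is gibberish.

The page shows only small boxes, so we cannot tell what they are. Let's look in the page source we just saved.

There we see a series of &#....; entities. This usually indicates font-based anti-scraping. I did not expect a novel ranking to have anti-scraping measures, but since it does, let's beat it.
Scrolling up a little, we find the cause almost immediately.
The @font-face rule contains several URLs, and these are the font files. Copy https://qidian.gtimg.com/qd_anti_spider/jUlcIiMg.woff, open it in a new tab, and download it. (The .ttf URL, https://qidian.gtimg.com/qd_anti_spider/jUlcIiMg.ttf, works as well; I downloaded both.)
With the font downloaded, open the .ttf file in the online font editor at http://fontstore.baidu.com/static/editor/index.html.
The font turns out to contain just the digits 0-9. We can then process the font file with the Python library fontTools, which can be installed with pip install fontTools.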

from fontTools.ttLib import TTFont

# Load the downloaded font and dump its tables to XML for inspection
font = TTFont('M:/jUlcIiMg.woff')
font.saveXML('M:/font.xml')

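The code above converts a .woff / .ttf file into an .xml file, which we then open in a browser.
Its contents match what the online font editor showed, so this is the font we are after. The getBestCmap() method of fontTools returns the mapping from code points to glyph names.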

from fontTools.ttLib import TTFont

font = TTFont('M:/jUlcIiMg.woff')
font.saveXML('M:/font.xml')
# Mapping of Unicode code point -> glyph name
print(font.getBestCmap())

Output:

{100293: 'eight', 100295: 'four', 100296: 'three', 100297: 'one', 100298: 'period', 100299: 'two', 100300: 'nine', 100301: 'five', 100302: 'zero', 100303: 'six', 100304: 'seven'}

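Wondering why this differs from what you saw? The font editor and the XML file show the code points in hexadecimal (prefixed with 0x), whereas fontTools prints them in decimal. Check with a calculator if you don't believe it.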
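The conversion is also easy to check in Python itself. The short sketch below takes a few of the decimal keys from the output above and prints their hexadecimal form, which is the notation used by the font editor and the XML dump. It needs no font file, only built-in functions.

```python
# Decimal code points copied from the getBestCmap() output above
code_points = [100293, 100295, 100298]

for cp in code_points:
    # hex() gives the 0x-prefixed form; int(..., 16) converts it back
    as_hex = hex(cp)
    assert int(as_hex, 16) == cp
    print(cp, '->', as_hex)
```

The first line printed is 100293 -> 0x187c5, so the glyph listed under 0x187c5 in the XML is the one fontTools reports as 100293.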
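With the mapping in hand, we convert it by hand: turn the English glyph names into digits and drop the useless entry 100298: 'period'. Since the glyphs appear in the page source in the form &#××××××;, we also rewrite the keys in that form to make replacement easier: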

font = TTFont('M:/jUlcIiMg.woff')
font.saveXML('M:/font.xml')
print(font.getBestCmap())

# Map English glyph names to digits
camp = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9}
cp = {}
for k, v in font.getBestCmap().items():
    try:
        # Skip non-digit glyphs such as 100298: 'period'
        cp['&#' + str(k) + ';'] = camp[v]
    except KeyError:
        pass
print(cp)

Output:

{'&#100293;': 8, '&#100295;': 4, '&#100296;': 3, '&#100297;': 1, '&#100299;': 2, '&#100300;': 9, '&#100301;': 5, '&#100302;': 0, '&#100303;': 6, '&#100304;': 7}

We now have the glyph mapping, so we can use a regex substitution to replace these entities in the fetched page source with ordinary Arabic digits:

import re
import requests
from fontTools.ttLib import TTFont


def getHtml(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None


font = TTFont('M:/jUlcIiMg.woff')
font.saveXML('M:/font.xml')
print(font.getBestCmap())

camp = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9}
cp = {}
for k, v in font.getBestCmap().items():
    try:
        cp['&#' + str(k) + ';'] = camp[v]
    except KeyError:
        pass
print(cp)

# Fetch the page source and save it to a text file
html = getHtml('https://www.qidian.com/rank/yuepiao')
with open('M:/html.txt', 'w') as f:
    f.write(html)

# Replace the obfuscated glyph entities with normal digits and save the result
for key in cp.keys():
    html = re.sub(key, str(cp[key]), html)
with open('M:/html_change.txt', 'w') as f:
    f.write(html)

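After it finishes, we check the two files to see whether the replacement worked.

Huh, why did the replacement fail? Is the re.sub call wrong? No: the entities here do not match a single key of the mapping we just built.
Looking up at the @font-face rule, we find that the font has changed! We used jUlcIiMg.woff, but now it is OMkqwDTS.woff. Apparently the font differs on every visit, so we cannot simply download a single woff file once.
Instead, each time we fetch the page source we extract the font URL with a regex, download the font, parse it, and then do the replacement. To that end we change the page-fetching function as follows: after fetching the page, it extracts the font URL and saves the download as font.woff.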

Let's test again:

import re
import requests
from fontTools.ttLib import TTFont

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}


def getHtml(url):
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None
    # Pull the .woff URL out of the @font-face rule
    woff = re.search(r"format\('eot'\); src: url\('(.+?)'\) format\('woff'\)", response.text, re.S)
    if woff is None:
        return None
    fontfile = requests.get(woff.group(1), headers=headers)
    if fontfile.status_code != 200:
        return None
    with open('M:/font.woff', 'wb') as f:
        f.write(fontfile.content)
    return response.text


# Fetch the page first, so that font.woff is the font this response refers to
html = getHtml('https://www.qidian.com/rank/yuepiao')
with open('M:/html.txt', 'w') as f:
    f.write(html)

font = TTFont('M:/font.woff')
print(font.getBestCmap())

camp = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9}
cp = {}
for k, v in font.getBestCmap().items():
    try:
        cp['&#' + str(k) + ';'] = camp[v]
    except KeyError:
        pass
print(cp)

for key in cp.keys():
    html = re.sub(key, str(cp[key]), html)
with open('M:/html_change.txt', 'w') as f:
    f.write(html)

The font is downloaded successfully.

The replacement works!!!
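The URL extraction step is easy to try in isolation. The sketch below runs the same regular expression over a hand-written @font-face snippet. The snippet is an assumption reconstructed from what the pattern expects (an eot source followed by a woff source), not a copy of the real page, and extract_woff_url is a hypothetical helper name.

```python
import re

# Hand-written stand-in for the page's @font-face rule (assumed layout)
SAMPLE_CSS = (
    "@font-face { font-family: OMkqwDTS; "
    "src: url('https://qidian.gtimg.com/qd_anti_spider/OMkqwDTS.eot?') format('eot'); "
    "src: url('https://qidian.gtimg.com/qd_anti_spider/OMkqwDTS.woff') format('woff'), "
    "url('https://qidian.gtimg.com/qd_anti_spider/OMkqwDTS.ttf') format('truetype'); }"
)

# Same pattern as in getHtml(), written as a raw string
WOFF_PATTERN = re.compile(r"format\('eot'\); src: url\('(.+?)'\) format\('woff'\)", re.S)


def extract_woff_url(text):
    """Return the .woff URL from the @font-face rule, or None if absent."""
    match = WOFF_PATTERN.search(text)
    return match.group(1) if match else None


print(extract_woff_url(SAMPLE_CSS))
print(extract_woff_url("no font rule here"))
```

Returning None when nothing matches lets the caller handle a changed page layout explicitly rather than failing on the match object.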

Let's wrap the font handling in a function to keep the code tidy.

def fontProc(text):
    font = TTFont('M:/font.woff')
    camp = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9}
    cp = {}
    for k, v in font.getBestCmap().items():
        try:
            # Skip glyphs that are not digits
            cp['&#' + str(k) + ';'] = camp[str(v)]
        except KeyError:
            pass
    for key in cp.keys():
        text = re.sub(key, str(cp[key]), text)
    return text
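To see what this function does without a font file, the sketch below applies the same steps to a hard-coded mapping. The cmap dict is the getBestCmap() output shown earlier for jUlcIiMg.woff; the sample markup and the name replace_glyph_entities are illustrative assumptions. It uses str.replace instead of re.sub, since the keys are literal strings rather than patterns.

```python
# Glyph-name -> digit table, as in fontProc()
DIGITS = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4,
          'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9}

# Stand-in for font.getBestCmap(), copied from the earlier output
SAMPLE_CMAP = {100293: 'eight', 100295: 'four', 100296: 'three',
               100297: 'one', 100298: 'period', 100299: 'two',
               100300: 'nine', 100301: 'five', 100302: 'zero',
               100303: 'six', 100304: 'seven'}


def replace_glyph_entities(text, cmap):
    """Replace &#NNNNNN; entities with the digits their glyphs draw."""
    for code_point, glyph_name in cmap.items():
        if glyph_name not in DIGITS:  # e.g. 'period'
            continue
        entity = '&#' + str(code_point) + ';'
        text = text.replace(entity, str(DIGITS[glyph_name]))
    return text


# Hypothetical fragment of page source with an obfuscated ticket count
sample = '<span>&#100297;&#100299;&#100296;&#100295;</span>'
print(replace_glyph_entities(sample, SAMPLE_CMAP))
```

For this sample the four entities decode to the glyphs one, two, three, four, so the printed fragment reads <span>1234</span>.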

Extracting and Saving the Information

With the glyphs replaced, we can extract the monthly ticket count with XPath. Our extraction function now reads:

def getBook(html):
    # etree.HTML() already returns an element that supports xpath()
    html = etree.HTML(html)
    name = html.xpath('//li//div[@class="book-mid-info"]//h4//a//text()')
    author = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@class="name"]//text()')
    types = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@data-eid="qd_C42"]//text()')
    status = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//span//text()')
    intro = html.xpath('//li//div[@class="book-mid-info"]//p[@class="intro"]//text()')
    intro = [i.strip() for i in intro]
    update = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//a[@data-eid="qd_C43"]//text()')
    update = [i.strip() for i in update]
    date = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//span//text()')
    tickets = html.xpath('//li//div[@class="book-right-info"]//div[@class="total"]//p//span//span//text()')
    # Combine the parallel lists into one tuple per novel
    book = zip(name, author, types, status, intro, update, date, tickets)
    return book
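The final zip call is what turns the eight parallel lists into one record per novel. A small sketch with made-up values shows the behaviour, including the fact that zip stops at the shortest list, so comparing the list lengths first is a cheap sanity check. All values below are invented for illustration.

```python
# Made-up parallel lists standing in for the XPath results
name = ['Book A', 'Book B', 'Book C']
author = ['Author A', 'Author B', 'Author C']
tickets = ['1234', '987']  # one value missing on purpose

# zip pairs items by position and stops at the shortest input
books = list(zip(name, author, tickets))
print(books)

# A length check reveals when a query matched more or fewer nodes than expected
lengths = {len(name), len(author), len(tickets)}
if len(lengths) != 1:
    print('field lists differ in length:', len(name), len(author), len(tickets))
```

Here the third book is silently dropped from books, which is why the earlier print(len(...), ...) calls are worth keeping while developing the queries.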

For convenience, we write a function that saves this information:

def saveInfo(url):
    html = getHtml(url)
    html = fontProc(html)
    book = getBook(html)
    # Append one record per novel to the output file
    with open('M:/novels.txt', 'a+') as f:
        for name, author, types, status, intro, update, date, tickets in book:
            f.write('Title: ' + name + '\n')
            f.write('Author: ' + author + ' Genre: ' + types + ' Status: ' + status + '\n')
            f.write('Synopsis: ' + intro + '\n')
            f.write(update + ' Updated: ' + date + '\n')
            f.write('Monthly tickets: ' + tickets + '\n')
            f.write('\n\n')

Let's give it a run:

saveInfo('https://www.qidian.com/rank/yuepiao')

We get exactly the information we wanted.

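Fetching All Pages

After the analysis and work above we have all the information we want, but only for one page. Now let's scrape every page.
Clicking page 2 changes the URL to https://www.qidian.com/rank/yuepiao?page=2,
and clicking page 3 changes it to https://www.qidian.com/rank/yuepiao?page=3.
The pattern is clear: the page parameter is simply the page number. Since there are only five pages in total, we write: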

for page in range(1, 5 + 1):
    url = 'https://www.qidian.com/rank/yuepiao?page=%d' % page
    saveInfo(url)

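Running this fails. The culprit is \xa0, the non-breaking space from the Latin-1 extended character set, which HTML writes as &nbsp;.
We simply replace it with an ordinary space.

In getBook(), change:

update = [i.strip() for i in update]

to:

update = [i.strip().replace('\xa0', ' ') for i in update]

After the change we run it again, and, argh, here we go again:

Likewise, in getBook(), change:

intro = [i.strip() for i in intro]

to:

intro = [i.strip().replace('\u2022', ' ') for i in intro]

We run it once more, and it fails yet again.

Replace again:

intro = [i.strip().replace('\u2022', ' ').replace('\u2003', ' ') for i in intro]

We run it again, and it finally succeeds! We have all the information we wanted!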
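The error messages themselves are not reproduced above. Failures on characters such as \xa0 when writing text typically mean that the file's encoding cannot represent them; when open() is given no encoding it uses the platform default, for example GBK on a Chinese-language Windows setup. Under that assumption, an alternative to replacing characters one by one is to open the output file with an explicit UTF-8 encoding. The sketch below demonstrates this with a temporary file; the sample text is made up.

```python
import os
import tempfile

# Made-up text containing the three characters that caused trouble above
text = 'Chapter\xa012 \u2022 a\u2003synopsis'

# GBK cannot represent the non-breaking space, so encoding fails
try:
    text.encode('gbk')
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False
print('encodable as GBK:', gbk_ok)

# UTF-8 can represent every Unicode character, so the write succeeds unchanged
path = os.path.join(tempfile.mkdtemp(), 'novels.txt')
with open(path, 'a+', encoding='utf-8') as f:
    f.write(text + '\n')

with open(path, encoding='utf-8') as f:
    round_trip = f.read()
print(round_trip == text + '\n')
```

With this approach the original characters are preserved in the output file, whereas the replace calls turn them into plain spaces. Which is preferable depends on how the saved text will be used.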

Full Code (Hooray)

import re

import requests
from fontTools.ttLib import TTFont
from lxml import etree

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}
woffDir = './font.woff'
novelsDir = './novels.txt'


def getHtml(url):
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None
    # Pull the .woff URL out of the @font-face rule and download the font
    woff = re.search(r"format\('eot'\); src: url\('(.+?)'\) format\('woff'\)", response.text, re.S)
    if woff is None:
        return None
    fontfile = requests.get(woff.group(1), headers=headers)
    if fontfile.status_code != 200:
        return None
    with open(woffDir, 'wb') as f:
        f.write(fontfile.content)
    response.encoding = response.apparent_encoding
    return response.text


def fontProc(text):
    font = TTFont(woffDir)
    camp = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9}
    cp = {}
    for k, v in font.getBestCmap().items():
        try:
            # Skip glyphs that are not digits
            cp['&#' + str(k) + ';'] = camp[str(v)]
        except KeyError:
            pass
    for key in cp.keys():
        text = re.sub(key, str(cp[key]), text)
    return text


def getBook(html):
    # etree.HTML() already returns an element that supports xpath()
    html = etree.HTML(html)
    name = html.xpath('//li//div[@class="book-mid-info"]//h4//a//text()')
    author = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@class="name"]//text()')
    types = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@data-eid="qd_C42"]//text()')
    status = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//span//text()')
    intro = html.xpath('//li//div[@class="book-mid-info"]//p[@class="intro"]//text()')
    intro = [i.strip().replace('\u2022', ' ').replace('\u2003', ' ') for i in intro]
    update = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//a[@data-eid="qd_C43"]//text()')
    update = [i.strip().replace('\xa0', ' ') for i in update]
    date = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//span//text()')
    tickets = html.xpath('//li//div[@class="book-right-info"]//div[@class="total"]//p//span//span//text()')
    # Combine the parallel lists into one tuple per novel
    book = zip(name, author, types, status, intro, update, date, tickets)
    return book


def saveInfo(url):
    html = getHtml(url)
    if html is None:  # page or font download failed
        return
    html = fontProc(html)
    book = getBook(html)
    # Append one record per novel to the output file
    with open(novelsDir, 'a+') as f:
        for name, author, types, status, intro, update, date, tickets in book:
            f.write('Title: ' + name + '\n')
            f.write('Author: ' + author + ' Genre: ' + types + ' Status: ' + status + '\n')
            f.write('Synopsis: ' + intro + '\n')
            f.write(update + ' Updated: ' + date + '\n')
            f.write('Monthly tickets: ' + tickets + '\n')
            f.write('\n\n')


for page in range(1, 5 + 1):
    url = 'https://www.qidian.com/rank/yuepiao?page=%d' % page
    saveInfo(url)
