
A Simple Python Crawler for Automatically Downloading a Novel

Published: 2023/12/16 42 豆豆


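I have been learning Python recently and find it a fun language to program in. After a quick pass through the basics, I decided to write a novel crawler to consolidate what I had learned. I am new to programming and my understanding of many things is still shallow, so I am writing this down as notes that I can come back to and improve later. If you have suggestions for this article or spot any shortcomings, please point them out.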

1. Preparation

Python basic syntax, regular expressions, and the BeautifulSoup library.
Links:
Liao Xuefeng's Python tutorial for beginners
BeautifulSoup tutorial, reposted from: http://www.cnblogs.com/wupeiqi/articles/6283017.html

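2. Choosing a target

My target is a free novel site I picked at random: https://www.qu.la (in fact a pirated-novel site). I chose it because its HTML tags are easy to extract, which makes it good for practice, and in my own testing it had no anti-crawler measures at all, so you can hammer it as you like (ahem, please go easy on it). The downside is that the site's server is unstable: the crawl often breaks off partway with error 10053, and since I could not find the cause I could only start over. Supposedly switching to Python 3 avoids this problem? I use Python 2.7 and have not tried Python 3 yet; anyone with time is welcome to try.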

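3. Hands-on

Now for the main part, and I am quite excited! In my mind the novel site already looks like a helpless victim cowering in a corner, hehe…
As the old saying goes, "Opportunity favors the prepared." So let's sort out the plan before writing any code. My plan is as follows: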

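Flowchart: start → download the HTML of the target URL → get the chapter title → get the body text → save the title and body → is there a next chapter? If yes, get the next chapter's URL and repeat from the download step; if no, end.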

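3.1 Downloading the HTML of the target URL

I picked the novel 《超級神基因》 for the demonstration. Define a download function that takes two parameters. url is the address of the novel's first chapter: https://www.qu.la/book/25877/8923073.html. When a network error occurs the request is sent again; num_retries = 5 means the same link is retried at most 5 times by default. The function returns the page's HTML. The Python code is as follows: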

import urllib2


def download(url, num_retries=5):
    """
    :param url: the first chapter's url of the novel, like 'https://www.qu.la/book/25877/8923073.html'
    :param num_retries: times to retry when the connection fails
    :return: html of the given url, or None if every attempt fails
    """
    print 'Start downloading:', url
    try:
        html = urllib2.urlopen(url).read()
        print 'Download finished:', url
    except urllib2.URLError as e:
        print 'Download fail. Download error:', e.reason
        html = None
        # retry the same url until the retry budget is used up
        if num_retries > 0:
            print 'Retrying:', url
            html = download(url, num_retries - 1)
    return html
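The same retry idea can also be written as a loop. The sketch below is in Python 3, whereas the article's code targets Python 2.7. `fetch_bytes` and the `fetch` parameter are hypothetical names introduced here so the retry logic can be exercised without a network. Catching `socket.error` in addition to `URLError` is only a guess at covering failures like error 10053 (a Windows socket error for an aborted connection); it has not been verified against the site.

```python
import socket
import urllib.error
import urllib.request


def fetch_bytes(url, timeout=10):
    """Hypothetical helper: perform one HTTP GET and return the raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download(url, num_retries=5, fetch=fetch_bytes):
    """Return the page body as bytes, or None if every attempt fails.

    Makes one initial attempt plus up to num_retries retries, in a plain loop.
    """
    for attempt in range(num_retries + 1):
        try:
            return fetch(url)
        except (urllib.error.URLError, socket.error) as e:
            # socket.error is an alias of OSError in Python 3, so this also
            # covers connection errors raised while reading the body.
            print('Download failed (attempt %d): %s' % (attempt + 1, e))
    return None
```

Passing the fetch function in as a parameter keeps the retry loop separate from the network call, which makes it easy to try out with a fake function that fails a few times before succeeding.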

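3.2 Getting each chapter's title and body

In this step we look for the tags that contain the title and the body text and extract their contents. Be sure to download the HTML and inspect that, rather than reading the source in the browser's developer tools. I fell into this trap and could never match the tags I was looking for. A front-end veteran later told me that JavaScript may modify the page after it loads (that was the gist; I have zero front-end background), so the markup you see in the browser may well differ from what the server sent.
Save the downloaded HTML as an .html file. The code for saving it is as follows: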

import os

html = download('https://www.qu.la/book/25877/8923072.html')
with open(os.path.join(r'C:\Users\admin\Desktop', '123.html'), 'wb') as f:
    f.write(html)

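Open the 123.html file saved on the desktop and you will see that the chapter title is in an <h1> tag.

Use a regular expression to match the contents of that tag in the HTML: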

import re


def get_title(html):
    """Find the title of each chapter and return it"""
    title_regex = re.compile('<h1>(.*?)</h1>', re.IGNORECASE)
    title = str(title_regex.findall(html)[0])
    return title
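For anyone trying this on Python 3, as suggested earlier, one thing to watch is that `urlopen(...).read()` returns bytes there, and a str pattern cannot be matched against bytes. Below is a hedged sketch of a title extractor that accepts either type and returns None instead of raising when no `<h1>` is found. The `encoding` parameter and its 'utf-8' default are assumptions; the site's real encoding was not checked.

```python
import re

TITLE_REGEX = re.compile(r'<h1>(.*?)</h1>', re.IGNORECASE | re.DOTALL)


def get_title(html, encoding='utf-8'):
    """Return the text of the first <h1>, or None if there is none.

    Accepts bytes (as returned by urlopen().read() in Python 3) or str.
    The 'utf-8' default is an assumption about the page encoding.
    """
    if isinstance(html, bytes):
        html = html.decode(encoding, errors='replace')
    match = TITLE_REGEX.search(html)
    return match.group(1).strip() if match else None
```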

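Likewise, the tag that holds the body text in the HTML is <div id="content">.

But here is another pitfall: the regular expression simply would not match it! I studied it in bewilderment for several hours without success, so I switched to another approach, the BeautifulSoup library. Without further ado, here is the code: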

from bs4 import BeautifulSoup


def get_content(html):
    """Get the content of each chapter from the html"""
    soup = BeautifulSoup(html, 'html.parser')
    # remove <script> tags so they do not end up in the text
    for s in soup('script'):
        s.extract()
    div = soup.find('div', attrs={'id': 'content'})
    # turn <br/> into newlines and strip the enclosing <div> tag
    content = str(div).replace('<br/>', '\n').replace('<div id="content">', '').replace('</div>', '').strip()
    return content
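One likely reason the regular expression failed, though this is a guess since the failing pattern is not shown: by default `.` does not match newline characters, and a chapter body normally spans many lines. The sketch below (Python 3, using made-up sample HTML rather than a real page from the site) shows the same pattern failing without `re.DOTALL` and succeeding with it. `get_content_by_regex` is a hypothetical name.

```python
import re

# Made-up sample: the chapter body spans several lines, as on a typical page.
SAMPLE_HTML = '''<h1>Chapter 1</h1>
<div id="content">
First line.<br/>
Second line.<br/>
</div>'''

CONTENT_PATTERN = r'<div id="content">(.*?)</div>'


def get_content_by_regex(html):
    """Return the text inside <div id="content">, or None if it is absent."""
    match = re.search(CONTENT_PATTERN, html, re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    return match.group(1).replace('<br/>', '\n').strip()


# Without re.DOTALL, '.' stops at the first newline, so nothing matches.
print(re.search(CONTENT_PATTERN, SAMPLE_HTML, re.IGNORECASE))
print(get_content_by_regex(SAMPLE_HTML))
```

Even with `re.DOTALL`, a regular expression like this breaks as soon as the content contains a nested `<div>`, so the BeautifulSoup approach remains the more robust choice.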

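3.3 Saving the title and body

Next, write the title and body from the previous step to a file. I was lazy and hard-coded the path. This step is simple and needs little explanation; see the code: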

def save(title, content):
    # 'a+' appends, so each chapter is added to the end of the file
    with open(r'C:\Users\admin\Desktop\DNAofSuperGod\novel.txt', 'a+') as f:
        f.write(title + '\n')
        f.write(content + '\n')
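If the hard-coded path gets in the way, the output location can be passed in as a parameter. This is a minimal Python 3 sketch, not the article's code: the `path` parameter and the explicit UTF-8 encoding are assumptions added here.

```python
import os


def save(title, content, path):
    """Append one chapter (title line, then body) to the file at path."""
    # Create the parent directory if needed; '.' covers a bare file name.
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    with open(path, 'a', encoding='utf-8') as f:
        f.write(title + '\n')
        f.write(content + '\n')
```

A call would then look like `save(title, content, r'C:\Users\admin\Desktop\DNAofSuperGod\novel.txt')`.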

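3.4 Getting the next chapter's URL

Find the link to the next chapter in the HTML. It is an <a> tag, which the code below looks up by its class "next" and id "A3".

As usual, match this tag in the HTML and read its href attribute: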

from bs4 import BeautifulSoup


def get_linkofnextchapter(html):
    """
    Get the url of the next chapter
    :return: a relative link to the next chapter
    """
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find('a', attrs={'class': 'next', 'id': 'A3', 'target': '_top'})
    return a['href']

3.5 Writing the entry function

Call the functions above; this is the entry point of the whole program:

import os
import re
import urlparse


def begin(url, first_call=True):
    # begin() calls itself once per chapter, so prepare the output file only
    # on the first call; otherwise every chapter would delete the file again
    if first_call:
        # make sure the path exists
        if not os.path.isdir(r'C:\Users\admin\Desktop\DNAofSuperGod'):
            os.mkdir(r'C:\Users\admin\Desktop\DNAofSuperGod')
        # remove the old file and build a new one
        if os.path.isfile(r'C:\Users\admin\Desktop\DNAofSuperGod\novel.txt'):
            os.remove(r'C:\Users\admin\Desktop\DNAofSuperGod\novel.txt')
    html = download(url)
    # if html is None, the download failed
    if html is not None:
        title = get_title(html)
        print title
        content = get_content(html)
        save(title, content)
        print 'Have saved %s for you.' % title
        link = get_linkofnextchapter(html)
        # a link starting with './' means there is no next chapter
        if not re.match(r'\./', link):
            nexturl = urlparse.urljoin(url, link)
            begin(nexturl, first_call=False)
        else:
            print 'Save finished!'
    else:
        print 'Download fail'
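The two pieces doing the real work in begin() are the end-of-book check and urljoin, which turns the relative href into an absolute URL. Here is a small Python 3 sketch of just that step (in Python 2 the import is `from urlparse import urljoin`). `next_chapter_url` is a hypothetical helper name, the sample links are made up, and the assumption that the last chapter's link starts with './' is taken from the check in begin().

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

CHAPTER_URL = 'https://www.qu.la/book/25877/8923072.html'


def next_chapter_url(current_url, link):
    """Return the absolute URL of the next chapter, or None at the end.

    Assumes, like the check in begin(), that the last chapter's "next"
    link starts with './' instead of naming another chapter page.
    """
    if link.startswith('./'):
        return None
    return urljoin(current_url, link)


# A link relative to the current page and a site-absolute path both resolve
# against the current chapter's URL.
print(next_chapter_url(CHAPTER_URL, '8923073.html'))
print(next_chapter_url(CHAPTER_URL, '/book/25877/8923073.html'))
print(next_chapter_url(CHAPTER_URL, './'))
```

Using `str.startswith` avoids the need to escape the dot, which a regular expression would otherwise treat as "any character".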

3.6 Starting the crawler

We have finally reached the last step: starting the crawler. The call is simple:

url = 'https://www.qu.la/book/25877/8923072.html'
begin(url)

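But! If all goes well, once the program has downloaded a little over 900 chapters you will have the pleasure of watching it crash!
The message below is one I copied (I am not going to run the whole program again just for this):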

RuntimeError: maximum recursion depth exceeded

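After searching Baidu I found that this is a protection mechanism in Python: it stops runaway recursion before it overflows the stack. The default recursion limit is 1000, so we can simply raise it: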

import sys

# change the recursion limit to 10000 (default is 1000)
sys.setrecursionlimit(10000)
url = 'https://www.qu.la/book/25877/8923072.html'
begin(url)
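Raising the limit works, but the recursion is not really needed: following a chain of "next" links is naturally a loop, and a loop has no depth limit at all. Below is a minimal Python 3 sketch of that alternative. `crawl`, `fetch_chapter` and `save_chapter` are hypothetical names; in the real crawler `fetch_chapter` would wrap the download and parsing steps, while here it is replaced by fake data so the example runs without a network.

```python
def crawl(first_url, fetch_chapter, save_chapter):
    """Walk the chapter chain with a loop instead of recursion.

    fetch_chapter(url) is a hypothetical callable returning
    (title, content, next_url), with next_url None after the last chapter.
    save_chapter(title, content) stores one chapter.
    Returns the number of chapters saved.
    """
    url = first_url
    saved = 0
    while url is not None:
        title, content, url = fetch_chapter(url)
        save_chapter(title, content)
        saved += 1
    return saved


# Fake data: 5000 chapters, far beyond the default recursion limit of 1000.
def fake_fetch(url):
    n = int(url)
    next_url = str(n + 1) if n < 5000 else None
    return 'Chapter %d' % n, 'text of chapter %d' % n, next_url


chapters = []
print(crawl('1', fake_fetch, lambda title, content: chapters.append(title)))
```

With this shape, the output file can also be prepared once before the loop starts rather than inside the per-chapter step.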

3.7 Complete code

import urllib2
import re
import os
import urlparse
import sys
from bs4 import BeautifulSoup

# __author__:chenyuepeng

"""
This demo is a web spider to get a novel from https://www.qu.la
"""


def download(url, num_retries=5):
    """
    :param url: the first chapter's url of the novel, like 'https://www.qu.la/book/25877/8923073.html'
    :param num_retries: times to retry when the connection fails
    :return: html of the given url, or None if every attempt fails
    """
    print 'Start downloading:', url
    try:
        html = urllib2.urlopen(url).read()
        print 'Download finished:', url
    except urllib2.URLError as e:
        print 'Download fail. Download error:', e.reason
        html = None
        # retry the same url until the retry budget is used up
        if num_retries > 0:
            print 'Retrying:', url
            html = download(url, num_retries - 1)
    return html


def get_title(html):
    """Find the title of each chapter and return it"""
    title_regex = re.compile('<h1>(.*?)</h1>', re.IGNORECASE)
    title = str(title_regex.findall(html)[0])
    return title


def get_content(html):
    """Get the content of each chapter from the html"""
    soup = BeautifulSoup(html, 'html.parser')
    # remove <script> tags so they do not end up in the text
    for s in soup('script'):
        s.extract()
    div = soup.find('div', attrs={'id': 'content'})
    # turn <br/> into newlines and strip the enclosing <div> tag
    content = str(div).replace('<br/>', '\n').replace('<div id="content">', '').replace('</div>', '').strip()
    return content


def get_linkofnextchapter(html):
    """
    Get the url of the next chapter
    :return: a relative link to the next chapter
    """
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find('a', attrs={'class': 'next', 'id': 'A3', 'target': '_top'})
    return a['href']


def save(title, content):
    # 'a+' appends, so each chapter is added to the end of the file
    with open(r'C:\Users\admin\Desktop\DNAofSuperGod\novel.txt', 'a+') as f:
        f.write(title + '\n')
        f.write(content + '\n')


def begin(url, first_call=True):
    # begin() calls itself once per chapter, so prepare the output file only
    # on the first call; otherwise every chapter would delete the file again
    if first_call:
        # make sure the path exists
        if not os.path.isdir(r'C:\Users\admin\Desktop\DNAofSuperGod'):
            os.mkdir(r'C:\Users\admin\Desktop\DNAofSuperGod')
        # remove the old file and build a new one
        if os.path.isfile(r'C:\Users\admin\Desktop\DNAofSuperGod\novel.txt'):
            os.remove(r'C:\Users\admin\Desktop\DNAofSuperGod\novel.txt')
    html = download(url)
    # if html is None, the download failed
    if html is not None:
        title = get_title(html)
        print title
        content = get_content(html)
        save(title, content)
        print 'Have saved %s for you.' % title
        link = get_linkofnextchapter(html)
        # a link starting with './' means there is no next chapter
        if not re.match(r'\./', link):
            nexturl = urlparse.urljoin(url, link)
            begin(nexturl, first_call=False)
        else:
            print 'Save finished!'
    else:
        print 'Download fail'


# change the recursion limit to 10000 (default is 1000)
sys.setrecursionlimit(10000)
url = 'https://www.qu.la/book/25877/8923072.html'
begin(url)

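4. Optimization

This was a practice project right after I started learning Python, and there is still plenty of room for improvement. I am noting the ideas here for later. If you have suggestions or questions, feel free to leave a comment.

Switch from a single thread to multiple threads or processes, using concurrency to speed up downloads
Add a caching mechanism that stores downloaded URLs locally, so that if the program is interrupted it does not have to start from scratch and can pull the relevant information from the cache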
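The second idea can be approximated very cheaply. Since each chapter is only reachable from the previous one, it is enough to remember which URL comes next. The Python 3 sketch below is a simplified stand-in for a real cache, not the article's code: `load_resume_url`, `record_progress` and the progress file are hypothetical. It assumes `record_progress` is called after each chapter has been saved.

```python
import os


def load_resume_url(progress_path, first_url):
    """Return the URL to start from: the saved one if present, else first_url."""
    if os.path.isfile(progress_path):
        with open(progress_path, encoding='utf-8') as f:
            saved = f.read().strip()
        if saved:
            return saved
    return first_url


def record_progress(progress_path, next_url):
    """Remember which URL should be downloaded next."""
    with open(progress_path, 'w', encoding='utf-8') as f:
        f.write(next_url)
```

On startup the crawler would call `load_resume_url` instead of using the first chapter's URL directly, and it would then have to keep the existing novel.txt rather than delete it. The concurrency idea is harder to bolt on here, because the URL of chapter N+1 is only known after chapter N has been downloaded.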
