Python Crawler Practice (1): Scraping Novels from Xinbiquge (Search + Download)

Published: 2023/12/29 | Category: python


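Scraping Biquge novels (search + download)

First, a look at the final result (GIF demo):

Steps:
1. Explore the site "http://www.xbiquge.la/" and work out how it is implemented.

2. Write the search function (get the catalogue URL of each book).

3. Write the file-writing function (one file per chapter).

4. Polish the code (fix bugs, create a folder per book).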

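P.S. Required modules:

import requests
import bs4     # the two essential scraping modules, no explanation needed
import os      # used to create folders
import sys     # not really needed, only makes the progress output look nicer
import time
import random  # random numbers for the delay between requests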

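I. How the site's search works, and how to reproduce it in Python.

I assumed this site, like most, would run a search by changing the URL. It does not.

The URL stays the same no matter what you search for.
That leaves another possibility: the page is updated through a POST request. Opening the Network tab of the browser's developer tools confirms it.

My guess was right. Now let's simulate the request.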

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36",
    "Cookie": "_abcde_qweasd=0; Hm_lvt_169609146ffe5972484b0957bd1b46d6=1583122664; bdshare_firstime=1583122664212; Hm_lpvt_169609146ffe5972484b0957bd1b46d6=1583145548",
    "Host": "www.xbiquge.la",
}  # send a fairly complete set of headers, just in case
x = input("Enter a book title or author name: ")  # the variable that controls what we search for
data = {'searchkey': x}
url = 'http://www.xbiquge.la/modules/article/waps.php'
r = requests.post(url, data=data, headers=headers)
soup = bs4.BeautifulSoup(r.text.encode('utf-8'), "html.parser")  # BeautifulSoup makes it easy to extract page content

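But if we print(soup) at this point, all the Chinese text comes out garbled!

This is clearly an encoding mismatch. We can check the response's encoding attribute to see which encoding was assumed.

After switching the encoding (the later listings use r.text.encode('ISO-8859-1')), the text can be extracted normally and matches what the browser shows: the results of our search.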
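The encoding fix can be reproduced without touching the network. The sketch below assumes the situation is the common one where the server sends UTF-8 bytes but the text ends up decoded as ISO-8859-1 (as far as I know, requests falls back to that encoding for text responses whose headers name no charset). The sample string is only an illustration.

```python
# Simulate the bytes a server might send: UTF-8 encoded Chinese text.
raw_bytes = "新笔趣阁".encode("utf-8")

# Decoding those bytes as ISO-8859-1 turns every byte into one wrong character.
garbled = raw_bytes.decode("ISO-8859-1")

# ISO-8859-1 maps each of the 256 byte values to exactly one character,
# so encoding the garbled string again restores the original bytes.
restored_bytes = garbled.encode("ISO-8859-1")
restored_text = restored_bytes.decode("utf-8")

print(len(garbled), restored_text)
```

This is why r.text.encode('ISO-8859-1') helps: it hands BeautifulSoup the original bytes, which it can then decode itself. Setting r.encoding = 'utf-8' before reading r.text should be an equivalent route.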

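II. Next, we dig through this pile of markup to find what we want (book title, author, catalogue URL).

Inspecting the elements in the browser makes it easy to locate them.

The link and the book title are in the <a> tag inside a <td class="even"> cell, and the author is in a <td class="even"> cell as well.

What! Both cells use the same tag and class! What now? Never mind, let's print every <td class="even"> first and take a look.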

book_author = soup.find_all("td", class_="even")
for each in book_author:
    print(each)

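The cells turn out to alternate between two kinds: a title cell (with the link), then an author cell.
So we can alternate on odd and even positions to handle the two kinds separately. (Without this, the call needed for the title cell, each.a.get("href"), raises an error on the author cell. A try block could probably handle that error too, but I have not tried it.)

We also create three lists to store the three values.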

books = []      # book titles
authors = []    # author names
directory = []  # catalogue links
tem = 1         # toggles between 1 (title cell) and 0 (author cell)
for each in book_author:
    if tem == 1:
        books.append(each.text)
        tem -= 1
        directory.append(each.a.get("href"))
    else:
        authors.append(each.text)
        tem += 1

Success! The three lists line up item by item.
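The same odd/even split can also be written with list slicing instead of a toggle variable. This is only a sketch of an alternative, not the approach used in this article: the cells here are hypothetical dictionaries standing in for the bs4 tags, and the URLs are made up. With real tags you would read each.text and each.a.get("href") instead.

```python
# Hypothetical stand-ins for the <td class="even"> cells, in page order:
# title cell, author cell, title cell, author cell, ...
cells = [
    {"text": "Book A", "href": "http://www.xbiquge.la/example/a/"},
    {"text": "Author A", "href": None},
    {"text": "Book B", "href": "http://www.xbiquge.la/example/b/"},
    {"text": "Author B", "href": None},
]

title_cells = cells[0::2]   # positions 0, 2, 4, ... hold title + link
author_cells = cells[1::2]  # positions 1, 3, 5, ... hold the author

books = [cell["text"] for cell in title_cells]
directory = [cell["href"] for cell in title_cells]
authors = [cell["text"] for cell in author_cells]

for book, author, link in zip(books, authors, directory):
    print(book, author, link)
```

Slicing keeps the pairing rule in one place, which makes it easier to see that the three lists always have matching lengths.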
So how do we let the user pick a number and have Python return the matching catalogue link?
Like this:

print('Search results:')
for num, book, author in zip(range(1, len(books) + 1), books, authors):
    print((str(num) + ": ").ljust(4) + (book + "\t").ljust(25) + ("\tAuthor: " + author).ljust(20))
search = dict(zip(books, directory))

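Neat, isn't it? "search" is a dictionary built from the book titles and their catalogue URLs. With i being the number the user picked, all we need is
return search[books[i-1]]
and the next function receives the catalogue URL of that book.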
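One caveat with a title-to-URL dictionary is that two search results with the same title collapse into a single key. A possible variation, sketched below with a hypothetical helper name (pick_directory) and made-up URLs, indexes the directory list directly and validates the input at the same time.

```python
def pick_directory(directory, choice):
    """Return the catalogue URL for a 1-based menu choice, or None if invalid."""
    try:
        index = int(choice)
    except (TypeError, ValueError):
        return None  # not a number at all
    if 1 <= index <= len(directory):
        return directory[index - 1]
    return None  # out of range (this includes 0 and negative numbers)


# Two results that share a title but point to different catalogue pages.
books = ["Same Title", "Same Title"]
directory = ["http://www.xbiquge.la/example/1/", "http://www.xbiquge.la/example/2/"]

search = dict(zip(books, directory))  # only one entry survives
print(len(search), pick_directory(directory, "1"), pick_directory(directory, "2"))
```

Returning None for bad input leaves it to the caller to decide whether to ask again or start a new search.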

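III. Get the chapter URLs, fetch the text, and write it to files.

Once we have the catalogue URL, the URL of every chapter can be collected the same way (no need to repeat the details).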

def get_text_url(titel_url):
    url = titel_url
    r = requests.get(url, headers=headers)
    soup = bs4.BeautifulSoup(r.text.encode('ISO-8859-1'), "html.parser")
    titles = soup.find_all("dd")  # one <dd> per chapter
    texts = []        # chapter URLs
    names = []        # chapter names
    texts_names = []
    for each in titles:
        texts.append("http://www.xbiquge.la" + each.a["href"])
        names.append(each.a.text)
    texts_names.append(texts)
    texts_names.append(names)
    return texts_names  # note: the return value is a list containing two lists!

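Note that the return value is a list containing two lists!
texts_names[0] holds the URL of every chapter and texts_names[1] holds the chapter names,
which makes them easy to use from the next function, the one that writes the content.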
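The chapter links above are built by prefixing the site root to each href, which works as long as every href is a root-relative path. A slightly more defensive sketch uses urljoin from the standard library, which also copes with relative and absolute links. The catalogue URL and hrefs below are hypothetical examples, not paths taken from the site.

```python
from urllib.parse import urljoin

# Hypothetical catalogue page and chapter links, for illustration only.
catalogue_url = "http://www.xbiquge.la/10/10489/"
hrefs = [
    "/10/10489/4534454.html",                       # root-relative
    "4534455.html",                                 # relative to the catalogue page
    "http://www.xbiquge.la/10/10489/4534456.html",  # already absolute
]

# urljoin resolves each href against the page it was found on.
chapter_urls = [urljoin(catalogue_url, href) for href in hrefs]

for chapter_url in chapter_urls:
    print(chapter_url)
```

All three forms resolve to a full URL under the same catalogue path, so the rest of the code does not need to know which form the site uses.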
Next comes writing the files!

# texts_url, n, count and max come from the surrounding function and loop (see the complete code)
url = texts_url[0][n]
name = texts_url[1][n]
req = requests.get(url=url, headers=headers)
time.sleep(random.uniform(0, 0.5))  # even with a delay the site may still answer 503 (it is a small site)
req.encoding = 'UTF-8'  # the chapter pages are handled as UTF-8, unlike the catalogue page, so take care!
html = req.text
soup = bs4.BeautifulSoup(html, features="html.parser")
texts = soup.find_all("div", id="content")
while len(texts) == 0:  # on a 503 there is nothing to read, so simply request again until there is
    req = requests.get(url=url, headers=headers)
    time.sleep(random.uniform(0, 0.5))
    req.encoding = 'UTF-8'
    html = req.text
    soup = bs4.BeautifulSoup(html, features="html.parser")
    texts = soup.find_all("div", id="content")
# .text extracts the text and drops the <br> tags; the first replace turns each run of
# eight non-breaking spaces into a blank line
content = texts[0].text.replace('\xa0' * 8, '\n\n')
# the second replace removes the footer found on every page (it must match the page text exactly)
content = content.replace("親,點擊進去,給個好評唄,分數越高更新越快,據說給新筆趣閣打滿分的最后都找到了漂亮的老婆哦!手機站全新改版升級地址：http://m.xbiquge.la，數據和書簽與電腦站同步，無廣告清新閱讀！", "\n")
with open(name + '.txt', "w", encoding='utf-8') as f:
    f.write(content)
# the only place sys is used: "\r" keeps the progress on a single line
sys.stdout.write("\rDownloaded {} chapters, {} remaining".format(count, max - count))
count += 1

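n is the index of the chapter, so a plain for loop writes every chapter to its own file.
The way 503 responses are handled here is brute force, but it gets the job done.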
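The retry loop above repeats until the page finally loads, which can spin forever if a chapter never becomes available. A bounded variant with growing delays is sketched below. It is an alternative, not the method used in this article: fetch_with_retry and its parameters are hypothetical names, and the fetch argument stands for any function that returns an empty result on failure (for example one that performs the request and returns soup.find_all("div", id="content")).

```python
import time


def fetch_with_retry(fetch, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it returns a non-empty result or attempts run out."""
    for attempt in range(max_attempts):
        result = fetch()
        if result:
            return result
        if attempt < max_attempts - 1:
            # wait 0.5 s, 1 s, 2 s, ... before the next attempt
            sleep(base_delay * 2 ** attempt)
    return None  # give up; the caller decides whether to skip or abort


# Demo with a fake fetch that fails twice and then succeeds.
# The delays are recorded instead of actually sleeping.
responses = iter([[], [], ["chapter text"]])
delays = []
result = fetch_with_retry(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Passing sleep in as a parameter keeps the function easy to try out without real waiting; in the crawler itself the default time.sleep would be used.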

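IV. Tidy up the code and fix bugs.

Wrap the ideas above into three or four functions.
Then test it and look for bugs: fix the ones you can, and wrap the ones you cannot in a try block. (Ha ha.)
If you want one folder per book, have a look at the os module. It is simple, so I will not go into it here.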
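For readers who want the folder handling spelled out, here is a minimal sketch. It is not taken from the article's code: safe_filename and save_chapter are hypothetical helpers, and the character list covers the characters that Windows rejects in file names, which chapter titles can easily contain.

```python
import os
import re
import tempfile


def safe_filename(name):
    """Replace characters that Windows forbids in file names with underscores."""
    return re.sub(r'[\\/:*?"<>|]', "_", name).strip()


def save_chapter(folder, title, content):
    """Write one chapter into folder, creating the folder if needed."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, safe_filename(title) + ".txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return path


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    book_dir = os.path.join(tmp, "example_book")
    path = save_chapter(book_dir, "Chapter 1: What?", "chapter text")
    saved_name = os.path.basename(path)
    with open(path, encoding="utf-8") as f:
        saved_text = f.read()

print(saved_name)
```

Building the path with os.path.join also avoids calling os.chdir, so the working directory of the script stays where it was.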
Finally, here is the complete code!

import requests
import bs4     # the two essential scraping modules, no explanation needed
import os      # used to create folders
import sys     # not really needed, only makes the progress output look nicer
import time
import random  # random numbers for the delay between requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36",
    "Cookie": "_abcde_qweasd=0; Hm_lvt_169609146ffe5972484b0957bd1b46d6=1583122664; bdshare_firstime=1583122664212; Hm_lpvt_169609146ffe5972484b0957bd1b46d6=1583145548",
    "Host": "www.xbiquge.la",
}  # send a fairly complete set of headers, just in case
b_n = ""  # name of the selected book, shared between the functions


def get_title_url():
    """Search for a book and return the URL of its catalogue page."""
    global b_n
    x = input("Enter a book title or author name: ")
    data = {'searchkey': x}
    url = 'http://www.xbiquge.la/modules/article/waps.php'
    r = requests.post(url, data=data, headers=headers)
    soup = bs4.BeautifulSoup(r.text.encode('ISO-8859-1'), "html.parser")
    book_author = soup.find_all("td", class_="even")
    books = []      # book titles
    authors = []    # author names
    directory = []  # catalogue links
    tem = 1         # toggles between 1 (title cell) and 0 (author cell)
    for each in book_author:
        if tem == 1:
            books.append(each.text)
            tem -= 1
            directory.append(each.a.get("href"))
        else:
            authors.append(each.text)
            tem += 1
    if not books:
        print("No books found, please search again!")
        return get_title_url()
    print('Search results:')
    for num, book, author in zip(range(1, len(books) + 1), books, authors):
        print((str(num) + ": ").ljust(4) + (book + "\t").ljust(25) + ("\tAuthor: " + author).ljust(20))
    search = dict(zip(books, directory))
    while True:  # keep asking until the input is a valid number
        try:
            i = int(input("Enter the number of the book to download (0 to search again): "))
        except ValueError:
            print("Invalid input, please try again.")
            continue
        if i == 0:
            return get_title_url()
        if 1 <= i <= len(books):
            break
        print("Invalid input, please try again.")
    b_n = books[i - 1]
    os.makedirs(b_n, exist_ok=True)  # create the book folder if it does not exist yet
    os.chdir(b_n)
    return search[b_n]


def get_text_url(titel_url):
    """Return [chapter URLs, chapter names] for the given catalogue page."""
    url = titel_url
    r = requests.get(url, headers=headers)
    soup = bs4.BeautifulSoup(r.text.encode('ISO-8859-1'), "html.parser")
    titles = soup.find_all("dd")  # one <dd> per chapter
    texts = []
    names = []
    texts_names = []
    for each in titles:
        texts.append("http://www.xbiquge.la" + each.a["href"])
        names.append(each.a.text)
    texts_names.append(texts)
    texts_names.append(names)
    return texts_names  # note: the return value is a list containing two lists!


def readnovel(texts_url):
    """Download every chapter into its own .txt file."""
    count = 1
    total = len(texts_url[1])
    print("Estimated time: {} minutes".format((total // 60) + 1))
    tishi = input("{} has {} chapters. Enter 'y' to download, anything else to cancel: ".format(b_n, total))
    if tishi == "y" or tishi == "Y":
        for n in range(total):
            url = texts_url[0][n]
            name = texts_url[1][n]
            req = requests.get(url=url, headers=headers)
            time.sleep(random.uniform(0, 0.5))  # even with a delay the site may still answer 503 (it is a small site)
            req.encoding = 'UTF-8'  # the chapter pages are handled as UTF-8, unlike the catalogue page, so take care!
            html = req.text
            soup = bs4.BeautifulSoup(html, features="html.parser")
            texts = soup.find_all("div", id="content")
            while len(texts) == 0:  # on a 503 there is nothing to read, so simply request again until there is
                req = requests.get(url=url, headers=headers)
                time.sleep(random.uniform(0, 0.5))
                req.encoding = 'UTF-8'
                html = req.text
                soup = bs4.BeautifulSoup(html, features="html.parser")
                texts = soup.find_all("div", id="content")
            # .text extracts the text and drops the <br> tags; the first replace turns each run of
            # eight non-breaking spaces into a blank line
            content = texts[0].text.replace('\xa0' * 8, '\n\n')
            # the second replace removes the footer found on every page (it must match the page text exactly)
            content = content.replace("親,點擊進去,給個好評唄,分數越高更新越快,據說給新筆趣閣打滿分的最后都找到了漂亮的老婆哦!手機站全新改版升級地址：http://m.xbiquge.la，數據和書簽與電腦站同步，無廣告清新閱讀！", "\n")
            with open(name + '.txt', "w", encoding='utf-8') as f:
                f.write(content)
            # the only place sys is used: "\r" keeps the progress on a single line
            sys.stdout.write("\rDownloaded {} chapters, {} remaining".format(count, total - count))
            count += 1
        print("\nAll chapters downloaded")
    else:
        print("Cancelled!")
        os.chdir('..')
        try:
            os.rmdir(b_n)  # remove the folder that was just created
        except OSError:
            pass  # the folder is not empty (it existed before), so keep it
        main()


def main():
    titel_url = get_title_url()
    texts_url = get_text_url(titel_url)
    readnovel(texts_url)
    input("Press Enter to exit")


if __name__ == '__main__':
    print("All novels come from Xinbiquge ---> http://www.xbiquge.la\nIf a book cannot be found there, there is nothing I can do... @曉軒\nTo keep downloads complete, each chapter is followed by a delay of 0 to 0.5 seconds!")
    main()
