python + selenium + chrome: scraping the 凡人修仙 novel

Published: 2023/12/20


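I had just started learning web scraping. After following a hands-on project video I found online, I learned how to scrape images from a website, and then I wanted to scrape 《凡人修仙傳》, a novel I had been reading. (The reading app on my phone used to offer it for free and now charges for it, but the web version can still be read on a computer.)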

Problems encountered

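After watching the video I assumed I could reuse its code as-is, but it turned out that the code could only scrape static pages and could not handle dynamically loaded ones. Searching online, I found two ways to deal with dynamic pages. One is to reverse-engineer the network traffic, find the URL that the page's script requests, and request that URL directly from code. The other is to use the selenium module to simulate a browser and then extract the data from the rendered page.
Comparing the two: the first is the fastest, but it is fairly difficult for a beginner like me, so I dropped it and went with the second.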
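A quick way to tell which kind of page you are dealing with is to check whether the text you want is already present in the HTML returned by a plain HTTP request. The sketch below assumes the chapter text sits in a div with id "content" (the id used later in this post). The helper name has_static_content and the two sample strings are made up for illustration, and the regex is only a rough heuristic, not a real HTML parser.

```python
import re


def has_static_content(html, element_id="content"):
    """Return True if the raw HTML already holds non-empty text in the div.

    Rough heuristic: find <div id="..."> and look at everything up to the
    next closing </div>.
    """
    pattern = re.compile(
        r'<div[^>]*\bid=["\']%s["\'][^>]*>(.*?)</div>' % re.escape(element_id),
        re.S,
    )
    match = pattern.search(html)
    if match is None:
        return False
    # Strip nested tags and whitespace to see whether any real text is left.
    text = re.sub(r"<[^>]+>", "", match.group(1)).strip()
    return bool(text)


# Hypothetical responses: one page ships its text, the other fills it in by script.
static_page = '<div id="content">Chapter text is already here.</div>'
dynamic_page = '<div id="content"></div><script src="load.js"></script>'

print(has_static_content(static_page))   # True
print(has_static_content(dynamic_page))  # False
```

If the check comes back False for a page that clearly shows text in the browser, the content is probably loaded by script, and one of the two approaches above is needed.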

The whole process, step by step

1. Install ChromeDriver and the selenium module

I searched Baidu for instructions and ended up following one blog post. I can no longer find the link, so here is a brief description:

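Check your Chrome version, then download the matching version from the chromedriver download page. Taking my own browser as an example: my Chrome was version 80.0.3987.122, but the download page did not list that exact version, and at first I was unsure which one to pick. I later learned from Google's official site that the version only needs to match on its leading numbers. The author's understanding was that the major version is enough; for ChromeDriver releases of that era the official guidance is, more precisely, to match the first three parts (here 80.0.3987).
Of the versions listed, I downloaded the upper one, because its notes showed a more recent update date.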
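The version comparison described above can be sketched as a small helper. This is only an illustration: the function name driver_matches_chrome and the driver version strings are made up for the example. With parts=1 it checks only the major version, and with the default parts=3 it applies the stricter check on the first three numbers.

```python
def driver_matches_chrome(chrome_version, driver_version, parts=3):
    """Compare the leading dot-separated parts of two version strings."""
    chrome_parts = chrome_version.strip().split(".")[:parts]
    driver_parts = driver_version.strip().split(".")[:parts]
    # Require that the Chrome version actually has that many parts.
    return len(chrome_parts) == parts and chrome_parts == driver_parts


# Sample values for illustration only.
print(driver_matches_chrome("80.0.3987.122", "80.0.3987.106"))           # True
print(driver_matches_chrome("80.0.3987.122", "81.0.4044.20"))            # False
print(driver_matches_chrome("80.0.3987.122", "80.0.3987.16", parts=1))   # True
```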

Put chromedriver.exe in the same directory as chrome.exe, then add that directory to the system PATH environment variable.

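Install the selenium module through PyCharm, then test the setup with two statements: "from selenium import webdriver" and "driver = webdriver.Chrome()". Mine raised an error at first, and it worked after I rebooted the computer.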
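Before launching a browser, it can help to confirm that the driver is actually visible on PATH. The sketch below uses the standard library's shutil.which for that; the wrapper name find_on_path is made up for the example. One possible reason the reboot helped (an assumption, not something the original post confirms) is that programs which were already running, such as the IDE, do not see a changed PATH until they are restarted.

```python
import shutil


def find_on_path(executable, path=None):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(executable, path=path)


location = find_on_path("chromedriver")
if location is None:
    print("chromedriver not found on PATH; check the environment variable")
else:
    print("chromedriver found at", location)
```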

2. Code walkthrough

I put all of the code directly inside main.
1. Get the URLs of all chapters

def getHtml(http_url):
    # Send a browser-like User-Agent with the request
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"}
    response = requests.get(http_url, headers=header)
    return response.text

def main():
    # Get the URLs of all chapters
    novel_url = "https://www.booktxt.net/1_1562"
    content = getHtml(novel_url)
    pattern = re.compile(r'"5(.*?)\.html"')  # regex match; the dot is escaped to match a literal "."
    novel_num = pattern.findall(content)
    novel_url_list = [novel_url + "/5" + element for element in novel_num]
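To see what the regular expression captures, here is the same pattern run against a small made-up fragment of an index page. The sample markup and chapter numbers are assumptions for illustration, and the real page may look different. The dot before "html" is escaped so that it matches a literal ".".

```python
import re

# Made-up fragment of a chapter index page; the real markup may differ.
sample_html = '''
<dd><a href="5001.html">Chapter 1</a></dd>
<dd><a href="5002.html">Chapter 2</a></dd>
<dd><a href="index.html">Back</a></dd>
'''

novel_url = "https://www.booktxt.net/1_1562"
pattern = re.compile(r'"5(.*?)\.html"')

# Only the part between the leading 5 and ".html" is captured.
novel_num = pattern.findall(sample_html)
novel_url_list = [novel_url + "/5" + element for element in novel_num]

print(novel_num)       # ['001', '002']
print(novel_url_list)  # ['https://www.booktxt.net/1_1562/5001', 'https://www.booktxt.net/1_1562/5002']
```

Note that the URLs built this way carry no ".html" suffix, because the pattern leaves it out of the captured group. If the site does not accept the shorter form, appending ".html" when building the list would be the obvious adjustment.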

2. Set Chrome to headless mode (no visible window)

# Set up headless browsing
chrome_options = Options()
chrome_options.add_argument('--headless')
# Open a single browser (this does not need to go inside the loop)
driver = webdriver.Chrome(options=chrome_options)

3. Loop over each chapter's URL, scrape the content, and save it

# Needs: from selenium.webdriver.common.by import By
count = 1  # counter, prefixed to each chapter's file name to make sorting easier
for url in novel_url_list:
    driver.get(url)  # open the page
    content_text = driver.find_element(By.ID, "content").text  # grab the chapter text by its id
    head = driver.find_element(By.XPATH, "//div[@class = 'bookname']/h1").text  # grab the chapter title by XPath
    # Save the content to a file
    dir_path = "./凡人修仙"
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)
    # Strip characters that are illegal in file names, since the title is used as the txt file name
    head = re.sub(r'[\\/:*?"<>|]', '', head)
    fileName = dir_path + "/" + str(count) + head + ".txt"
    with open(fileName, "w", encoding="utf-8") as file:
        file.write(content_text)
    print("Chapters downloaded so far: %d" % count)  # progress output
    # print("URL: %s" % url)
    count += 1
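The file-name handling in this step can be pulled into a small helper. The sketch below is a suggested variation rather than the original code: the function name chapter_filename and the width parameter are made up, and the counter is zero-padded because plain name sorting would otherwise put "10" before "2". The chapter titles are sample strings.

```python
import re


def chapter_filename(count, head, dir_path="./凡人修仙", width=4):
    """Build a text-file path from the chapter counter and title."""
    # Remove characters that Windows does not allow in file names.
    safe_head = re.sub(r'[\\/:*?"<>|]', "", head).strip()
    # Zero-pad the counter so that plain name sorting keeps chapter order.
    return "%s/%0*d%s.txt" % (dir_path, width, count, safe_head)


print(chapter_filename(1, "Chapter 1: The Village"))
# ./凡人修仙/0001Chapter 1 The Village.txt
print(chapter_filename(12, 'What? A "test": a/b'))
# ./凡人修仙/0012What A test ab.txt
```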

4. Final result (screenshot)

3. Complete code

import os
import re

import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By


def getHtml(http_url):
    # Send a browser-like User-Agent with the request
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"}
    response = requests.get(http_url, headers=header)
    return response.text


def main():
    # Get the URLs of all chapter pages
    novel_url = "https://www.booktxt.net/1_1562"
    content = getHtml(novel_url)
    pattern = re.compile(r'"5(.*?)\.html"')  # regex match
    novel_num = pattern.findall(content)
    novel_url_list = [novel_url + "/5" + element for element in novel_num]
    # print(novel_url_list)

    # Set up headless browsing
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    # Open one browser and reuse it for every chapter
    driver = webdriver.Chrome(options=chrome_options)

    count = 1
    for url in novel_url_list:
        # Fetch the content of one page
        driver.get(url)
        try:
            content_text = driver.find_element(By.ID, "content").text
            head = driver.find_element(By.XPATH, "//div[@class = 'bookname']/h1").text
            # print(head)
        except Exception:
            # The page could not be read: save an empty placeholder file
            head = "chapter " + str(count) + " not found"
            content_text = ""

        # Save the content to a file
        dir_path = "./凡人修仙"
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        head = re.sub(r'[\\/:*?"<>|]', '', head)  # strip illegal characters
        fileName = dir_path + "/" + str(count) + head + ".txt"
        with open(fileName, "w", encoding="utf-8") as file:
            file.write(content_text)
        print("Chapters downloaded so far: %d" % count)
        print("URL: %s" % url)
        count += 1

    driver.quit()  # close the browser once all chapters are done


if __name__ == '__main__':
    main()
