[Python crawler] Comparing BeautifulSoup and Selenium for scraping Douban Top250 movie information

Published: 2024/5/28

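This article compares BeautifulSoup and Selenium for scraping the Douban Top250 movie list. The two methods are essentially the same: both analyze the DOM tree of the page to locate elements and then extract the specific movie information. Comparing the two pieces of code should help reinforce your understanding of Python crawlers. The article also links to my earlier introductions to crawler basics, which may help beginners.
In short, I hope the article is useful to you. If it has shortcomings or errors, please bear with me~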

I. DOM Tree Structure Analysis

Douban Top250 movie URL: https://movie.douban.com/top250?format=text
Right-click the page in Chrome and choose "Inspect Element" or "Inspect" to locate a specific element.

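The page is a list of movies, and each movie corresponds to the following HTML:
<li><div class="item">......</div></li>
With BeautifulSoup, soup.find_all(attrs={"class":"item"}) returns these movie blocks. Inside each block you can then locate the specific content, for example <span class="title"> for the title and <div class="star"> for the score and the number of ratings.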
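To make these lookups concrete, here is a minimal sketch for Python 3 with the bs4 package installed. The SAMPLE_HTML snippet is hand-written to imitate one list entry (its rating_num class and exact nesting are assumptions, not copied from Douban), and parse_items is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hand-written sample imitating one entry of the list; not fetched from Douban.
SAMPLE_HTML = """
<ol>
  <li><div class="item">
    <em>1</em>
    <div class="hd"><a href="#"><span class="title">肖申克的救贖</span>
      <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span></a></div>
    <div class="star"><span class="rating_num">9.6</span>
      <span>761249人評價</span></div>
  </div></li>
</ol>
"""


def parse_items(html):
    """Return one (rank, title, star_text) tuple per class="item" block."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tag in soup.find_all(attrs={"class": "item"}):
        rank = tag.find("em").get_text()
        # find() returns the first <span class="title">, the Chinese title.
        title = tag.find("span", attrs={"class": "title"}).get_text()
        # Collapse all whitespace in the rating block into single spaces.
        star_div = tag.find("div", attrs={"class": "star"})
        star = " ".join(star_div.get_text().split())
        rows.append((rank, title, star))
    return rows


if __name__ == "__main__":
    for row in parse_items(SAMPLE_HTML):
        print(row)
```

The same function can be fed the HTML of a real page once it has been downloaded, provided the page still uses these class names.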

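Next, note that scraping the whole list also requires turning pages. There are usually two ways to do this:
1. Click through to the next pages and analyze the URLs to find the pattern between them;
2. Use Selenium to locate the page-number buttons and click them automatically.
Clicking different page numbers gives the following URLs: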

Page 2 URL: https://movie.douban.com/top250?start=25&filter=
Page 3 URL: https://movie.douban.com/top250?start=50&filter=
So each page holds 25 movies and the start parameter grows by 25 per page. With this pattern, a simple loop is enough to fetch all the movies.
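The start offset pattern can be sketched as a small helper. This is an illustrative Python 3 snippet: page_urls, BASE_URL, PAGE_SIZE and TOTAL are names chosen here, and the total of 250 movies is taken from the name of the list.

```python
BASE_URL = "https://movie.douban.com/top250"
PAGE_SIZE = 25  # movies per page, as observed from the URLs
TOTAL = 250     # assumed from the name "Top250"


def page_urls(total=TOTAL, page_size=PAGE_SIZE):
    """Build the URL of every result page from the start offset pattern."""
    return [
        "{}?start={}&filter=".format(BASE_URL, start)
        for start in range(0, total, page_size)
    ]


if __name__ == "__main__":
    for url in page_urls():
        print(url)
```

The first page gets start=0 and the tenth page gets start=225, so ten requests cover the whole list.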

II. Scraping Douban with BeautifulSoup

For an introduction, see my earlier post: [Python knowledge] Crawler basics: installing and introducing the BeautifulSoup library
The code is as follows:
# -*- coding: utf-8 -*-
"""
Created on 2016-12-29 22:50

@author: Eastmount
"""
# Python 2 script: relies on urllib2 and print statements.
import urllib2
import re
from bs4 import BeautifulSoup
import codecs


# Crawler function: parses one result page and appends it to infofile
def crawl(url):
    page = urllib2.urlopen(url)
    contents = page.read()
    soup = BeautifulSoup(contents, "html.parser")
    print u'Douban Movie 250: No. \tTitle\t Score \tNumber of ratings'
    infofile.write(u"Douban Movie 250: No. \tTitle\t Score \tNumber of ratings\r\n")
    print u'Crawled information:\n'
    for tag in soup.find_all(attrs={"class": "item"}):
        # print tag
        # Rank number
        num = tag.find('em').get_text()
        print num
        # Full name line (all titles and aliases)
        name = tag.find(attrs={"class": "hd"}).a.get_text()
        name = name.replace('\n', ' ')
        print name
        infofile.write(num + " " + name + "\r\n")
        # Individual titles: first the Chinese one, then the original one
        title = tag.find_all(attrs={"class": "title"})
        i = 0
        for n in title:
            text = n.get_text()
            text = text.replace('/', '')
            text = text.lstrip()
            if i == 0:
                print u'[Chinese title]', text
                infofile.write(u"[Chinese title]" + text + "\r\n")
            elif i == 1:
                print u'[English title]', text
                infofile.write(u"[English title]" + text + "\r\n")
            i = i + 1
        # Score and number of ratings
        info = tag.find(attrs={"class": "star"}).get_text()
        info = info.replace('\n', ' ')
        info = info.lstrip()
        print info
        mode = re.compile(r'\d+\.?\d*')
        print mode.findall(info)
        i = 0
        for n in mode.findall(info):
            if i == 0:
                print u'[Score]', n
                infofile.write(u"[Score]" + n + "\r\n")
            elif i == 1:
                print u'[Comments]', n
                infofile.write(u"[Comments]" + n + "\r\n")
            i = i + 1
        # One-line review
        info = tag.find(attrs={"class": "inq"})
        if info:  # movie No. 132 [消失的愛人] has no one-line review
            content = info.get_text()
            print u'[Review]', content
            infofile.write(u"[Review]" + content + "\r\n")
        print ''


# Main
if __name__ == '__main__':
    infofile = codecs.open("Result_Douban.txt", 'a', 'utf-8')
    url = 'http://movie.douban.com/top250?format=text'
    i = 0
    while i < 10:
        print u'Page', (i + 1)
        num = i * 25  # 25 movies per page; the start offset grows by 25
        url = 'https://movie.douban.com/top250?start=' + str(num) + '&filter='
        crawl(url)
        infofile.write("\r\n\r\n\r\n")
        i = i + 1
    infofile.close()

The output is as follows:

Douban Movie 250: No. Title Score Number of ratings
1 肖申克的救贖 / The Shawshank Redemption / 月黑高飛(港) / 刺激1995(臺)
[Chinese title]肖申克的救贖
[English title]The Shawshank Redemption
[Score]9.6
[Comments]761249
[Review]希望讓人自由。
2 這個殺手不太冷 / Léon / 殺手萊昂 / 終極追殺令(臺)
[Chinese title]這個殺手不太冷
[English title]Léon
[Score]9.4
[Comments]731250
[Review]怪蜀黍和小蘿莉不得不說的故事。
3 霸王別姬 / 再見，我的妾 / Farewell My Concubine
[Chinese title]霸王別姬
[Score]9.5
[Comments]535808
[Review]風華絕代。
4 阿甘正傳 / Forrest Gump / 福雷斯特·岡普
[Chinese title]阿甘正傳
[English title]Forrest Gump
[Score]9.4
[Comments]633434
[Review]一部美國近現代史。
5 美麗人生 / La vita è bella / 一個快樂的傳說(港) / Life Is Beautiful
[Chinese title]美麗人生
[English title]La vita è bella
[Score]9.5
[Comments]364132
[Review]最美的謊言。
6 千與千尋 / 千と千尋の神隠し / 神隱少女(臺) / Spirited Away
[Chinese title]千與千尋
[English title]千と千尋の神隠し
[Score]9.2
[Comments]584559
[Review]最好的宮崎駿，最好的久石讓。

The results are also written to the file Result_Douban.txt.
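The score and the number of ratings are pulled out of the star text with a regular expression. Here is a small Python 3 sketch of that step alone, assuming the text has the form of the sample string below (score first, then the count); parse_star is a hypothetical helper name.

```python
import re

# Same pattern as in the script: an integer or a decimal number.
NUMBER = re.compile(r"\d+\.?\d*")


def parse_star(text):
    """Split the star text into (score, number_of_ratings)."""
    numbers = NUMBER.findall(text)
    score = float(numbers[0])  # first number: the score
    votes = int(numbers[1])    # second number: how many people rated
    return score, votes


if __name__ == "__main__":
    print(parse_star("9.6 761249人評價"))
```

Converting to float and int right away makes it easy to sort or filter the movies later, whereas the script keeps both values as strings for writing to the file.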

III. Scraping Douban with Selenium, and Using Chrome

For an introduction, see my earlier post: [Python crawler] Selenium automatic login and an introduction to Locating Elements
The code is as follows:

# -*- coding: utf-8 -*-
"""
Created on 2016-12-29 22:50

@author: Eastmount
"""
# Python 2 script written against the Selenium API of 2016
# (find_elements_by_xpath).
import codecs
from selenium import webdriver


# Crawler function: opens one result page and appends its text to infofile
def crawl(url):
    driver.get(url)
    print u'Douban Movie 250: No. \tTitle\t Score \tNumber of ratings'
    infofile.write(u"Douban Movie 250: No. \tTitle\t Score \tNumber of ratings\r\n")
    print u'Crawled information:\n'
    content = driver.find_elements_by_xpath("//div[@class='item']")
    for tag in content:
        print tag.text
        infofile.write(tag.text + "\r\n")
        print ''


# Main
if __name__ == '__main__':
    driver = webdriver.Firefox()
    infofile = codecs.open("Result_Douban.txt", 'a', 'utf-8')
    url = 'http://movie.douban.com/top250?format=text'
    i = 0
    while i < 10:
        print u'Page', (i + 1)
        num = i * 25  # 25 movies per page; the start offset grows by 25
        url = 'https://movie.douban.com/top250?start=' + str(num) + '&filter='
        crawl(url)
        infofile.write("\r\n\r\n\r\n")
        i = i + 1
    infofile.close()
    driver.quit()  # close the browser window when finished

This code automatically launches the Firefox browser and then crawls the content.

The crawled content is again written to a file. Specific nodes can also be located and analyzed further, in much the same way as before.
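The listing above calls find_elements_by_xpath, which was removed in Selenium 4.3 in favour of find_elements(by, value). The following Python 3 sketch shows the same lookup in that newer style. It is an illustration under assumptions, not the article's code: collect_item_texts and main are hypothetical names, and the locator strategy is passed as the plain string "xpath", which is the value of Selenium's By.XPATH constant.

```python
ITEM_XPATH = "//div[@class='item']"


def collect_item_texts(driver, url):
    """Open url and return the visible text of every movie block."""
    driver.get(url)
    # "xpath" is the value of selenium.webdriver.common.by.By.XPATH.
    elements = driver.find_elements("xpath", ITEM_XPATH)
    return [element.text for element in elements]


def main():
    """Crawl the first page with Firefox; needs Selenium 4 and a Firefox driver."""
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        url = "https://movie.douban.com/top250?start=0&filter="
        for text in collect_item_texts(driver, url):
            print(text)
    finally:
        driver.quit()  # always close the browser, even after an error
```

Because the function only needs an object with get and find_elements methods, it can be tried out with a stand-in driver before a real browser is involved. Calling main() runs it against Firefox.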

To use the Chrome browser instead, place a chromedriver.exe driver file in the directory
C:\Program Files (x86)\Google\Chrome\Application
and then call it. The core code is:

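import os
from selenium import webdriver

# Raw string so the backslashes in the Windows path are kept literally
chromedriver = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)

This may still fail with an error. The versions need to be kept consistent, which presumably means the chromedriver version must match the installed Chrome.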

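To sum up the strengths and weaknesses of the two approaches: BeautifulSoup is faster and more complete in structure, but crawling sites such as CSDN blogs this way fails with a Forbidden error. Selenium drives a real browser, which makes automated and dynamic operations such as mouse clicks and keyboard input convenient, but it is slower, especially when the browser is launched repeatedly.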
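A common reason for a Forbidden response is that the server rejects the default client identity sent by a plain HTTP library. That is an assumption about the cause, since the article does not investigate it. Under that assumption, a Python 3 sketch of sending a browser-like User-Agent header looks like this; build_request, fetch and the USER_AGENT string are illustrative names and values.

```python
import urllib.request

# Assumed browser-like identity; any realistic User-Agent string can be used.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url, user_agent=USER_AGENT):
    """Create a Request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download a page as text; this performs a real network request."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

The returned text can be passed to BeautifulSoup as before. Whether a given site accepts the request still depends on that site's own rules.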

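With the year ending there has been too much going on at the college, so I rarely have regular time for blogging, which is rather sad. Fortunately I met her. Even in the busiest days she has kept me company while I work and brought some sweetness.
No more words are needed: in every word and gesture we can feel each other's love and warmth. Follow you~
(By: Eastmount, 2016-12-30, 12:30 a.m., http://blog.csdn.net/eastmount/)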
