Scraping the Douban Top 250 with a Python Crawler
Using requests and regular expressions, we scrape the Douban Movie Top 250 (https://movie.douban.com/top250) and extract information such as each film's rank, title, year, tagline, and poster image, saving the results to both a MySQL database and a text file.
導(dǎo)入包
import json
import re
import time
import requests
from requests.exceptions import RequestException
import pymysql
相關(guān)配置,傳入url參數(shù),抓取頁面結(jié)果返回
def get_one_page(url):
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'}
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None
Fetch the page and parse it with a regular expression, matching information such as the film's rank, poster, title, year, and tagline, then yield each match as a dictionary to form structured data.
def parse_one_page(html):
    pattern = re.compile('<li>'
                         '.*?<em class="">(.*?)</em>'
                         '.*?src="(.*?)" '
                         '.*?title">(.*?)</span>'
                         r'.*?<br>.*?(\d+)'
                         '.*?inq">(.*?)</span>'
                         '.*?</li>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'id': item[0],
            'image': item[1],
            'name': item[2],
            'year': item[3],
            'inq': item[4]
        }
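To see what the regular expression captures, here is a minimal, hypothetical HTML fragment in the shape of one entry of the Top 250 list page (the field values are invented for illustration) run through the same pattern:

```python
import re

# The same regex used in parse_one_page
pattern = re.compile('<li>'
                     '.*?<em class="">(.*?)</em>'
                     '.*?src="(.*?)" '
                     '.*?title">(.*?)</span>'
                     r'.*?<br>.*?(\d+)'
                     '.*?inq">(.*?)</span>'
                     '.*?</li>', re.S)

# Hypothetical snippet mimicking a single <li> entry of the list page
html = '''<li>
<em class="">1</em>
<img src="https://example.com/poster.jpg" alt="">
<span class="title">肖申克的救赎</span>
<br>
1994&nbsp;/&nbsp;美国
<span class="inq">希望让人自由。</span>
</li>'''

for item in re.findall(pattern, html):
    movie = {'id': item[0], 'image': item[1], 'name': item[2],
             'year': item[3], 'inq': item[4]}
    print(movie)
```

The re.S flag is essential: it lets `.` match newlines, so the non-greedy `.*?` runs can skip across the multi-line markup between the capture groups.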
將數(shù)據(jù)寫入文本文件,通過JSON庫的dumps()方法實(shí)現(xiàn)字典的序列化
def write_to_file(content):
    with open('D:/result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')
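The ensure_ascii=False argument matters here: without it, dumps() escapes every non-ASCII character, so the Chinese titles in the file become unreadable \uXXXX sequences. A quick comparison with a sample dictionary:

```python
import json

item = {'name': '肖申克的救赎', 'year': '1994'}

escaped = json.dumps(item)                       # default: non-ASCII escaped
readable = json.dumps(item, ensure_ascii=False)  # keeps the characters as-is

print(escaped)
print(readable)
```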
將數(shù)據(jù)寫入數(shù)據(jù)庫
def write_to_mysql(item):
    db = pymysql.connect(host='localhost', user='root', passwd='xjz01405', db='test', port=3306)
    cursor = db.cursor()
    cursor.execute('SELECT VERSION()')
    data = cursor.fetchone()
    print('Database version:', data)
    table = 'movies'
    keys = ', '.join(item.keys())
    values = ', '.join(['%s'] * len(item))
    sql = 'INSERT INTO {table}({keys}) VALUES ({values})'.format(table=table, keys=keys, values=values)
    try:
        if cursor.execute(sql, tuple(item.values())):
            print('successful')
            db.commit()
    except Exception:
        print('failed')
        db.rollback()
    finally:
        db.close()
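The INSERT statement is built dynamically from the dictionary keys, so the same function works for any flat dict whose keys match the table's columns. With a sample item (values invented for illustration), the generated SQL looks like this:

```python
item = {'id': '1', 'image': 'https://example.com/p.jpg',
        'name': '肖申克的救赎', 'year': '1994', 'inq': '希望让人自由。'}

table = 'movies'
keys = ', '.join(item.keys())
values = ', '.join(['%s'] * len(item))
sql = 'INSERT INTO {table}({keys}) VALUES ({values})'.format(
    table=table, keys=keys, values=values)

print(sql)
```

The actual values are passed separately as tuple(item.values()), letting pymysql handle quoting and escaping instead of interpolating them into the SQL string.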
The main function calls the methods above and writes the movie data to the database and the file:
def main(start):
    url = 'https://movie.douban.com/top250?start=' + str(start) + '&filter='
    html = get_one_page(url)
    for item in parse_one_page(html):
        print(item)
        write_to_mysql(item)
        update_to_mysql(item)
        write_to_file(item)
Paging: pass the start parameter into the url inside main() so that all 250 entries are crawled:
if __name__ == '__main__':
    for i in range(10):
        main(start=i * 25)
        time.sleep(2)
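Each page of the Top 250 holds 25 entries, so start steps through 0, 25, ..., 225. The ten URLs the loop generates can be previewed without making any requests:

```python
urls = ['https://movie.douban.com/top250?start=' + str(i * 25) + '&filter='
        for i in range(10)]
for u in urls:
    print(u)
```

The time.sleep(2) between pages is deliberate: it spaces out the requests so the crawler is less likely to be blocked by the site.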
結(jié)果:
更新數(shù)據(jù),如果有新的數(shù)據(jù)就插入數(shù)據(jù),如果數(shù)據(jù)已經(jīng)存在數(shù)據(jù)庫中,就更新數(shù)據(jù),通過比較主鍵判斷是否存在,實(shí)現(xiàn)主鍵不存在就插入數(shù)據(jù),若存在就更新數(shù)據(jù)。
def update_to_mysql(item):
    db = pymysql.connect(host='localhost', user='root', passwd='xjz01405', db='test', port=3306)
    cursor = db.cursor()
    table = 'movies'
    keys = ', '.join(item.keys())
    values = ', '.join(['%s'] * len(item))
    sql = 'INSERT INTO {table}({keys}) VALUES ({values}) ON DUPLICATE KEY UPDATE'.format(table=table, keys=keys, values=values)
    update = ','.join([' {key} = %s'.format(key=key) for key in item])
    sql += update
    try:
        if cursor.execute(sql, tuple(item.values()) * 2):
            print('successful')
            db.commit()
    except Exception:
        print('failed')
        db.rollback()
    finally:
        db.close()
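Because the ON DUPLICATE KEY UPDATE clause lists every column a second time, each placeholder appears twice, which is why the parameters are passed as tuple(item.values()) * 2. The string construction can be checked in isolation with a small sample dict (values invented, spacing simplified slightly for readability):

```python
item = {'id': '1', 'name': '肖申克的救赎', 'year': '1994'}

table = 'movies'
keys = ', '.join(item.keys())
values = ', '.join(['%s'] * len(item))
sql = ('INSERT INTO {table}({keys}) VALUES ({values})'
       ' ON DUPLICATE KEY UPDATE ').format(table=table, keys=keys, values=values)
sql += ', '.join(['{key} = %s'.format(key=key) for key in item])

# The value tuple is doubled: once for VALUES, once for the UPDATE list
params = tuple(item.values()) * 2

print(sql)
print(params)
```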
Summary

With requests, re, json, and pymysql, the script above fetches all ten pages of the Douban Top 250, parses each entry with a single regular expression, and persists the results to a text file and a MySQL table, updating rows in place when they already exist.