Scraping the Douban Top 250 with a Python Crawler
Using requests and regular expressions, we scrape the Douban Movie Top 250 (https://movie.douban.com/top250) and extract information such as each film's rank, title, year, tagline, and poster image, saving the results to both a MySQL database and a text file.
導(dǎo)入包
import json
import re
import time
import requests
from requests.exceptions import RequestException
import pymysql
相關(guān)配置,傳入url參數(shù),抓取頁面結(jié)果返回
def get_one_page(url):
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'}
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None
Fetch the page and parse it with a regular expression, matching information such as the film's rank, poster, title, year, and tagline, then yield each match as a dictionary to form structured data.
def parse_one_page(html):
    pattern = re.compile('<li>'
                         '.*?<em class="">(.*?)</em>'
                         '.*?src="(.*?)" '
                         '.*?title">(.*?)</span>'
                         r'.*?<br>.*?(\d+)'
                         '.*?inq">(.*?)</span>'
                         '.*?</li>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'id': item[0],
            'image': item[1],
            'name': item[2],
            'year': item[3],
            'inq': item[4]
        }
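To see what the regular expression captures, here is a minimal, hypothetical HTML fragment in the shape of one entry of the Top 250 list page (the field values are invented for illustration) run through the same pattern:

```python
import re

# The same regex used in parse_one_page
pattern = re.compile('<li>'
                     '.*?<em class="">(.*?)</em>'
                     '.*?src="(.*?)" '
                     '.*?title">(.*?)</span>'
                     r'.*?<br>.*?(\d+)'
                     '.*?inq">(.*?)</span>'
                     '.*?</li>', re.S)

# Hypothetical snippet mimicking a single <li> entry of the list page
html = '''<li>
<em class="">1</em>
<img src="https://example.com/poster.jpg" alt="">
<span class="title">肖申克的救赎</span>
<br>
1994&nbsp;/&nbsp;美国
<span class="inq">希望让人自由。</span>
</li>'''

for item in re.findall(pattern, html):
    movie = {'id': item[0], 'image': item[1], 'name': item[2],
             'year': item[3], 'inq': item[4]}
    print(movie)
```

The re.S flag is essential: it lets `.` match newlines, so the non-greedy `.*?` runs can skip across the multi-line markup between the capture groups.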
將數(shù)據(jù)寫入文本文件,通過JSON庫的dumps()方法實(shí)現(xiàn)字典的序列化
def write_to_file(content):
    with open('D:/result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')
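The ensure_ascii=False argument matters here: without it, dumps() escapes every non-ASCII character, so the Chinese titles in the file become unreadable \uXXXX sequences. A quick comparison with a sample dictionary:

```python
import json

item = {'name': '肖申克的救赎', 'year': '1994'}

escaped = json.dumps(item)                       # default: non-ASCII escaped
readable = json.dumps(item, ensure_ascii=False)  # keeps the characters as-is

print(escaped)
print(readable)
```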
將數(shù)據(jù)寫入數(shù)據(jù)庫
def write_to_mysql(item):
    db = pymysql.connect(host='localhost', user='root', passwd='xjz01405', db='test', port=3306)
    cursor = db.cursor()
    cursor.execute('SELECT VERSION()')
    data = cursor.fetchone()
    print('Database version:', data)
    table = 'movies'
    keys = ', '.join(item.keys())
    values = ', '.join(['%s'] * len(item))
    sql = 'INSERT INTO {table}({keys}) VALUES ({values})'.format(table=table, keys=keys, values=values)
    try:
        if cursor.execute(sql, tuple(item.values())):
            print('successful')
            db.commit()
    except Exception:
        print('failed')
        db.rollback()
    finally:
        db.close()
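The INSERT statement is built dynamically from the dictionary keys, so the same function works for any flat dict whose keys match the table's columns. With a sample item (values invented for illustration), the generated SQL looks like this:

```python
item = {'id': '1', 'image': 'https://example.com/p.jpg',
        'name': '肖申克的救赎', 'year': '1994', 'inq': '希望让人自由。'}

table = 'movies'
keys = ', '.join(item.keys())
values = ', '.join(['%s'] * len(item))
sql = 'INSERT INTO {table}({keys}) VALUES ({values})'.format(
    table=table, keys=keys, values=values)

print(sql)
```

The actual values are passed separately as tuple(item.values()), letting pymysql handle quoting and escaping instead of interpolating them into the SQL string.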
The main function calls the methods above and writes the movie data to the database and the file:
def main(start):
    url = 'https://movie.douban.com/top250?start=' + str(start) + '&filter='
    html = get_one_page(url)
    for item in parse_one_page(html):
        print(item)
        write_to_mysql(item)
        update_to_mysql(item)
        write_to_file(item)
Paging: pass the start parameter into the url inside main() so that all 250 entries are crawled:
if __name__ == '__main__':
    for i in range(10):
        main(start=i * 25)
        time.sleep(2)
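Each page of the Top 250 holds 25 entries, so start steps through 0, 25, ..., 225. The ten URLs the loop generates can be previewed without making any requests:

```python
urls = ['https://movie.douban.com/top250?start=' + str(i * 25) + '&filter='
        for i in range(10)]
for u in urls:
    print(u)
```

The time.sleep(2) between pages is deliberate: it spaces out the requests so the crawler is less likely to be blocked by the site.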
結(jié)果:
更新數(shù)據(jù),如果有新的數(shù)據(jù)就插入數(shù)據(jù),如果數(shù)據(jù)已經(jīng)存在數(shù)據(jù)庫中,就更新數(shù)據(jù),通過比較主鍵判斷是否存在,實(shí)現(xiàn)主鍵不存在就插入數(shù)據(jù),若存在就更新數(shù)據(jù)。
def update_to_mysql(item):
    db = pymysql.connect(host='localhost', user='root', passwd='xjz01405', db='test', port=3306)
    cursor = db.cursor()
    table = 'movies'
    keys = ', '.join(item.keys())
    values = ', '.join(['%s'] * len(item))
    sql = 'INSERT INTO {table}({keys}) VALUES ({values}) ON DUPLICATE KEY UPDATE'.format(table=table, keys=keys, values=values)
    update = ','.join([' {key} = %s'.format(key=key) for key in item])
    sql += update
    try:
        if cursor.execute(sql, tuple(item.values()) * 2):
            print('successful')
            db.commit()
    except Exception:
        print('failed')
        db.rollback()
    finally:
        db.close()
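Because the ON DUPLICATE KEY UPDATE clause lists every column a second time, each placeholder appears twice, which is why the parameters are passed as tuple(item.values()) * 2. The string construction can be checked in isolation with a small sample dict (values invented, spacing simplified slightly for readability):

```python
item = {'id': '1', 'name': '肖申克的救赎', 'year': '1994'}

table = 'movies'
keys = ', '.join(item.keys())
values = ', '.join(['%s'] * len(item))
sql = ('INSERT INTO {table}({keys}) VALUES ({values})'
       ' ON DUPLICATE KEY UPDATE ').format(table=table, keys=keys, values=values)
sql += ', '.join(['{key} = %s'.format(key=key) for key in item])

# The value tuple is doubled: once for VALUES, once for the UPDATE list
params = tuple(item.values()) * 2

print(sql)
print(params)
```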
Summary

With requests, re, json, and pymysql, the script above fetches all ten pages of the Douban Top 250, parses each entry with a single regular expression, and persists the results to a text file and a MySQL table, updating rows in place when they already exist.