Scraping and analyzing Douban movies with Python: extracting genres from the Douban Top 250 for data analysis

Published: 2025/4/5 python 34 豆豆

Tags (space-separated): python crawler

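1. Crawl the page and extract the content we need

Today we are going to crawl the Douban Movie Top 250.

(The screenshot of the page from the original post is not available.)

What we want from the page is each movie's genres. Viewing the page source shows where they sit in the HTML, so let's get straight to it.

Now that we know where the content is, we first download the page with Python's requests library, then parse it with the lxml library and pull out the text we need for the next step.

Here is the code that uses requests and lxml: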

import requests
from lxml import etree

def get_page(i):
    url = 'https://movie.douban.com/top250?start={}&filter='.format(i)
    # Assumption: Douban may reject the default requests User-Agent, so send a browser-like one
    headers = {'User-Agent': 'Mozilla/5.0'}
    html = requests.get(url, headers=headers).content.decode('utf-8')  # download the page with requests
    selector = etree.HTML(html)  # parse the HTML with lxml
    # Inspecting the page shows that the text we want sits in the <p> elements
    # matched by the XPath below
    content = selector.xpath('//div[@class="info"]/div[@class="bd"]/p/text()')
    print(content)
    for i in content[1::2]:
        print(str(i).strip().replace('\n\r', ''))
        i = str(i).split('/')
        i = i[len(i) - 1]
        key = i.strip().replace('\n', '').split(' ')  # strip() and replace() remove surrounding whitespace and line breaks
        print(key)
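To see why the loop takes every second item (content[1::2]), here is a small offline sketch. The HTML fragment is a made-up stand-in that only imitates the structure targeted by the XPath; the real Douban markup may contain more elements, so treat this as an illustration rather than a description of the live page.

```python
from lxml import etree

# Hypothetical fragment: each <p> holds two text nodes separated by <br>
SAMPLE_HTML = """
<div class="info"><div class="bd">
<p>导演: A 主演: B<br>1994 / 美国 / 犯罪 剧情</p>
</div></div>
<div class="info"><div class="bd">
<p>导演: C 主演: D<br>1993 / 中国大陆 / 剧情 爱情 同性</p>
</div></div>
"""

selector = etree.HTML(SAMPLE_HTML)
content = selector.xpath('//div[@class="info"]/div[@class="bd"]/p/text()')

# Text nodes alternate: credits line, then the year / country / genres line
genre_lines = [str(text).strip() for text in content[1::2]]
for line in genre_lines:
    print(line)
```

In this sample the odd-numbered items are the lines that end with the genres, which is the part the rest of the article works with.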

The downloaded text shows that the fields of each movie are separated by '/'. We only need the genres, so we use

i = str(i).split('/')

to split the line into its fields. The genres come last, so

i = i[len(i) - 1]

picks the final field, which is the genre list (i[-1] is the idiomatic equivalent). One step remains: a movie usually carries several genre tags, and they are separated by a single space, so the following line splits them apart:

key = i.strip().replace('\n', '').split(' ')
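The three steps above can be folded into one small helper. This is only a sketch: the function name parse_genres and the sample strings are made up for illustration, and it uses split() without an argument, which differs slightly from the article's split(' ').

```python
def parse_genres(line):
    """Return the genre tags from a 'year / country / genres' line."""
    last_field = line.split('/')[-1]  # the genres are the last '/'-separated field
    # split() with no argument trims the ends and drops empty strings
    return last_field.split()


sample = '\n    1994 / 美国 / 犯罪 剧情\n  '  # hypothetical input line
print(parse_genres(sample))
```

Because split() never returns empty strings, this variant would not need the later check for an empty key, whereas split(' ') yields '' for an empty field or for repeated spaces.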

2. Save the results to a MySQL database

We store the genres in MySQL so that we can analyze them later, using pymysql to connect to the database. First we need to create the table in MySQL (the original table definition is not shown).
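A rough sketch of a table that would fit the code in this article, under clearly stated assumptions: the later code reads row[1] as the genre and row[2] as the count, which suggests a leading column, assumed here to be an id. To stay runnable without a MySQL server, the sketch uses Python's built-in sqlite3 as a stand-in; in MySQL the id column would typically be declared INT AUTO_INCREMENT PRIMARY KEY.

```python
import sqlite3

# Stand-in database: sqlite3 instead of MySQL, so the sketch runs anywhere
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute(
    'CREATE TABLE douban ('
    ' id INTEGER PRIMARY KEY AUTOINCREMENT,'  # hypothetical id column -> row[0]
    ' "类别" TEXT,'                            # genre -> row[1]
    ' "数量" INTEGER'                          # count -> row[2]
    ')'
)
cur.execute('INSERT INTO douban ("类别", "数量") VALUES (?, ?)', ('剧情', 3))
conn.commit()

row = cur.execute('SELECT * FROM douban').fetchone()
print(row)
conn.close()
```

The column order matters more than the exact types, because the plotting code later indexes each row by position.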

Then we write the data to the database through pymysql. The code is shown below.

First, connect to the database:

import pymysql

# Connect to the MySQL database.
# user is the MySQL user name and passwd is its password; set charset to utf8,
# otherwise writes to the database can run into encoding problems
conn = pymysql.connect(host='localhost', user='root', passwd='2014081029', db='mysql', charset='utf8')
cur = conn.cursor()  # get a cursor
cur.execute('use douban')  # switch to the douban database

Before saving to the database there is one more thing to do: tally how many times each genre occurs across the 250 movies. We use a dictionary to count the genres. The code below is part of the get_page function:

    for i in content[1::2]:
        print(str(i).strip().replace('\n\r', ''))
        i = str(i).split('/')
        i = i[len(i) - 1]
        key = i.strip().replace('\n', '').split(' ')
        print(key)
        for i in key:
            if i not in douban.keys():
                douban[i] = 1  # first occurrence of this genre
            else:
                douban[i] += 1
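The same tally can be written with collections.Counter from the standard library. This is an alternative sketch, not the article's code; count_genres and the sample lists are made up for illustration.

```python
from collections import Counter


def count_genres(genre_lists):
    """Count how often each genre appears across all movies."""
    counts = Counter()
    for genres in genre_lists:
        # Skip empty strings such as those produced by split(' ')
        counts.update(g for g in genres if g)
    return counts


sample = [['犯罪', '剧情'], ['剧情', '爱情', ''], ['剧情']]  # hypothetical input
counts = count_genres(sample)
print(counts.most_common(1))
```

Counter is a dict subclass, so it can be iterated and indexed just like the douban dictionary used in the rest of the article.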

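Next we define a save function that performs the inserts and rolls back if an insert fails. Remember to close the cursor and the connection with cur.close() and conn.close() once all database work is done. The code is as follows: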

def save_mysql(douban):
    print(douban)  # douban is the dictionary defined in the main block
    for key in douban:
        print(key)
        print(douban[key])
        if key != '':  # skip empty keys
            try:
                # pass the values as parameters instead of concatenating them into the SQL string
                sql = 'insert into douban(类别, 数量) values (%s, %s)'
                cur.execute(sql, (key, douban[key]))
                conn.commit()
            except pymysql.MySQLError:
                print('insert failed')
                conn.rollback()
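The insert-then-rollback pattern can be tried without a MySQL server. The sketch below uses the built-in sqlite3 module as a stand-in, with hypothetical ASCII column names (category, n) and a hypothetical helper save_counts. Unlike the function above, it writes all rows in one transaction, so a failure undoes the whole batch rather than a single row.

```python
import sqlite3


def save_counts(conn, counts):
    """Insert (genre, count) pairs; undo the whole batch if any insert fails."""
    rows = [(genre, n) for genre, n in counts.items() if genre]  # drop empty keys
    try:
        conn.executemany('INSERT INTO douban (category, n) VALUES (?, ?)', rows)
        conn.commit()
        return True
    except sqlite3.Error:
        conn.rollback()
        return False


conn = sqlite3.connect(':memory:')
# UNIQUE lets us provoke a failure by inserting the same genre twice
conn.execute('CREATE TABLE douban (category TEXT UNIQUE, n INTEGER)')

first = save_counts(conn, {'剧情': 3, '犯罪': 1, '': 2})
second = save_counts(conn, {'爱情': 1, '剧情': 5})  # fails on the duplicate genre
stored = conn.execute('SELECT category, n FROM douban ORDER BY n DESC').fetchall()
print(first, second, stored)
```

The placeholders keep the values out of the SQL text, which avoids quoting problems. Note that sqlite3 uses ? as its placeholder while pymysql uses %s.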

3. Visualize the data with matplotlib

First we read the genres and their counts from the database into two lists, then plot them with matplotlib:

def pylot_show():
    sql = 'select * from douban;'
    cur.execute(sql)
    rows = cur.fetchall()  # read every row of the table
    count = []  # number of occurrences of each genre
    category = []  # genre names
    for row in rows:
        count.append(int(row[2]))
        category.append(row[1])
    y_pos = np.arange(len(category))  # y-axis positions of the bars
    plt.barh(y_pos, count, align='center', alpha=0.4)  # alpha is the fill opacity, between 0 and 1
    plt.yticks(y_pos, category)  # label each bar with its genre name
    for c, y in zip(count, y_pos):
        # draw each count as a number at the end of its bar
        plt.text(c, y, str(c), horizontalalignment='center', verticalalignment='center', weight='bold')
    plt.ylim(+28.0, -1.0)  # visible y range; the reversed limits put the first genre at the top
    plt.title('Douban Movie 250')  # chart title
    plt.ylabel('Genre')  # y-axis label
    plt.subplots_adjust(bottom=0.15)
    plt.xlabel('Number of occurrences')  # x-axis label
    plt.savefig('douban.png')  # save the figure

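Here is a brief look at the matplotlib features used above. First, import numpy and matplotlib:

import numpy as np
import matplotlib.pyplot as plt

The chart in this article is a horizontal bar chart, so here is the definition of the barh() function: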

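barh()

Purpose: draws a horizontal bar chart. Each bar is a rectangle spanning left to left + width horizontally and bottom to bottom + height vertically.

Signature: barh(bottom, width, height=0.8, left=0, **kwargs). This is the signature from older Matplotlib releases; current releases name the first parameter y.

Returns: a container holding one matplotlib.patches.Rectangle instance per bar.

Required parameters:

bottom: vertical position of each bar (called y in current releases)

width: length of each bar

Optional parameters:

height: thickness of each bar

left: x coordinate of each bar's left edge

color: fill color of the bars

edgecolor: color of the bar edges

linewidth: width of the bar edges; None means the default width, 0 means no edge is drawn

xerr: if not None, error bars are drawn on the bars in the x direction

yerr: if not None, error bars are drawn on the bars in the y direction

ecolor: color of the error bars

capsize: length of the error bar caps

align: 'edge' | 'center'. 'edge' aligns each bar by its bottom edge; 'center' centers it on its y position. The description above lists 'edge' as the default, which matches older releases; current releases default to 'center'.

log: [False|True], False by default; if True, a logarithmic axis is used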
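A minimal sketch of these parameters in use, assuming a current Matplotlib release and made-up sample data. It renders off-screen with the Agg backend and writes the PNG to an in-memory buffer, so it does not need a display or touch the file system.

```python
import io

import matplotlib
matplotlib.use('Agg')  # off-screen backend, no display required
import matplotlib.pyplot as plt
import numpy as np

categories = ['drama', 'crime', 'romance']  # hypothetical sample data
counts = [3, 1, 2]

y_pos = np.arange(len(categories))
fig, ax = plt.subplots()
bars = ax.barh(y_pos, counts, height=0.8, align='center', alpha=0.4)
ax.set_yticks(y_pos)
ax.set_yticklabels(categories)
ax.invert_yaxis()  # first category at the top, like the reversed ylim in the article

# Write each count next to the end of its bar
for bar, n in zip(bars, counts):
    ax.text(bar.get_width(), bar.get_y() + bar.get_height() / 2, str(n), va='center')

buf = io.BytesIO()
fig.savefig(buf, format='png')
plt.close(fig)
```

With align='center' and height=0.8, the bar at position 0 covers y from -0.4 to 0.4, which is why the tick labels line up with the middle of each bar.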

With that, the chart can be rendered.

The full source code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from lxml import etree
import requests
import pymysql
import matplotlib.pyplot as plt
import numpy as np

# Connect to the MySQL database
conn = pymysql.connect(host='localhost', user='root', passwd='2014081029', db='mysql', charset='utf8')
cur = conn.cursor()
cur.execute('use douban')


def get_page(i):
    """Download one page of the Top 250 and add its genres to the global douban dict."""
    url = 'https://movie.douban.com/top250?start={}&filter='.format(i)
    # Assumption: Douban may reject the default requests User-Agent, so send a browser-like one
    headers = {'User-Agent': 'Mozilla/5.0'}
    html = requests.get(url, headers=headers).content.decode('utf-8')
    selector = etree.HTML(html)
    content = selector.xpath('//div[@class="info"]/div[@class="bd"]/p/text()')
    print(content)
    for i in content[1::2]:
        print(str(i).strip().replace('\n\r', ''))
        i = str(i).split('/')
        i = i[len(i) - 1]
        key = i.strip().replace('\n', '').split(' ')
        print(key)
        for i in key:
            if i not in douban.keys():
                douban[i] = 1
            else:
                douban[i] += 1


def save_mysql():
    """Insert every genre and its count into the douban table."""
    print(douban)
    for key in douban:
        print(key)
        print(douban[key])
        if key != '':  # skip empty keys
            try:
                # pass the values as parameters instead of concatenating them into the SQL string
                sql = 'insert into douban(类别, 数量) values (%s, %s)'
                cur.execute(sql, (key, douban[key]))
                conn.commit()
            except pymysql.MySQLError:
                print('insert failed')
                conn.rollback()


def pylot_show():
    """Read the counts back from MySQL and save a horizontal bar chart."""
    sql = 'select * from douban;'
    cur.execute(sql)
    rows = cur.fetchall()
    count = []
    category = []
    for row in rows:
        count.append(int(row[2]))
        category.append(row[1])
    print(count)
    y_pos = np.arange(len(category))
    print(y_pos)
    print(category)
    plt.barh(y_pos, count, align='center', alpha=0.4)
    plt.yticks(y_pos, category)
    for c, y in zip(count, y_pos):
        plt.text(c, y, str(c), horizontalalignment='center', verticalalignment='center', weight='bold')
    plt.ylim(+28.0, -1.0)
    plt.title('Douban Movie 250')
    plt.ylabel('Genre')
    plt.subplots_adjust(bottom=0.15)
    plt.xlabel('Number of occurrences')
    plt.savefig('douban.png')


if __name__ == '__main__':
    douban = {}  # genre -> number of occurrences
    for i in range(0, 250, 25):  # 10 pages of 25 movies each
        get_page(i)
    # save_mysql()  # uncomment to write the counts to MySQL
    pylot_show()
    cur.close()
    conn.close()
