Maoyan Movie Scraper and Data Analysis

Published: 2023/12/14

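Stuck at home because of the pandemic, I am writing up a course assignment: a Maoyan movie scraper and analysis. It scrapes movie data from Maoyan, then analyzes and visualizes the scraped data.

Maoyan Movie Scraper

The scraper uses the requests and lxml libraries to scrape the movies on the Maoyan TOP100 board. The board URL is: https://maoyan.com/board/4

The scraped fields are: movie title, lead actors, release date and region, Maoyan score, genre, and runtime.

The movie data is saved in .csv format. Header: movie title (title), lead actors (author), release date and region (pub_time), Maoyan score (star), genre (style), runtime (long_time).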
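As a minimal sketch of the CSV layout described above, the snippet below writes the header and one row with the standard-library csv module. The `write_rows` helper and the sample row are hypothetical, and an in-memory buffer stands in for the real output file.

```python
import csv
import io

# Column names taken from the header described above
FIELDS = ['title', 'author', 'pub_time', 'star', 'style', 'long_time']


def write_rows(fp, rows):
    """Write the header followed by one CSV line per movie dict."""
    writer = csv.DictWriter(fp, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical sample row, for illustration only
sample = [{
    'title': 'Example Movie',
    'author': 'Actor A,Actor B',
    'pub_time': '2000-01-01',
    'star': '9.0',
    'style': 'Drama',
    'long_time': '120',
}]

buffer = io.StringIO()
write_rows(buffer, sample)
print(buffer.getvalue())
```

Because the author field itself contains commas, the csv module quotes it automatically, so the value should come back as a single column when the file is read later.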

import csv

import requests
from lxml import etree

# Request headers: send a browser-like User-Agent
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}


def get_url(url):
    # Scrape one page of the TOP100 board
    res = requests.get(url, headers=headers)
    html = etree.HTML(res.text)  # parse the page source
    infos = html.xpath('//dl[@class="board-wrapper"]/dd')  # the 10 movies on this page
    for info in infos:
        title = info.xpath('div/div/div[1]/p[1]/a/text()')[0]  # movie title
        # Lead actors and release date: trim whitespace, then drop the label text
        author = info.xpath('div/div/div[1]/p[2]/text()')[0].strip().replace('主演：', '', 1)
        pub_time = info.xpath('div/div/div[1]/p[3]/text()')[0].strip().replace('上映时间：', '', 1)
        star_1 = info.xpath('div/div/div[2]/p/i[1]/text()')[0]  # score, integer part
        star_2 = info.xpath('div/div/div[2]/p/i[2]/text()')[0]  # score, fractional part
        star = star_1 + star_2  # full score
        movie_url = 'https://maoyan.com' + info.xpath('div/div/div[1]/p[1]/a/@href')[0]  # detail page
        get_info(movie_url, title, author, pub_time, star)  # scrape the detail page
    print('Page saved.')


def get_info(url, title, author, pub_time, star):
    # Scrape genre and runtime from a movie's detail page
    res = requests.get(url, headers=headers)
    html = etree.HTML(res.text)
    style = html.xpath('/html/body/div[3]/div/div[2]/div[1]/ul/li[1]/text()')[0]  # genre
    long_time = html.xpath('/html/body/div[3]/div/div[2]/div[1]/ul/li[2]/text()')[0].split('/')[1].strip().replace('分钟', '')  # runtime in minutes
    print(title, author, pub_time, star, style, long_time)
    writer.writerow([title, author, pub_time, star, style, long_time])  # write one data row


if __name__ == '__main__':
    # Output file; the with-block closes it once scraping finishes
    with open('F://maoyan.csv', 'w', newline='', encoding='utf-8') as fp:
        writer = csv.writer(fp)  # module-level name, used by get_info()
        writer.writerow(['title', 'author', 'pub_time', 'star', 'style', 'long_time'])  # header row
        # Ten pages of ten movies each: offset=0, 10, ..., 90
        urls = ['https://maoyan.com/board/4?offset={}'.format(i) for i in range(0, 100, 10)]
        for url in urls:
            get_url(url)
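One detail worth noting when removing label text such as 主演： from a scraped string: `str.strip(chars)` treats its argument as a set of characters to trim from both ends, not as a prefix. The sketch below contrasts that with prefix removal. The `strip_label` helper and the sample text are hypothetical.

```python
def strip_label(text, label):
    """Remove surrounding whitespace, then remove `label` only if it is a prefix."""
    text = text.strip()
    if text.startswith(label):
        return text[len(label):]
    return text


raw = '主演：演员甲'  # hypothetical scraped text
print(raw.strip('主演：'))         # character-set stripping also eats the first character of the name
print(strip_label(raw, '主演：'))  # prefix removal keeps the name intact
```

On Python 3.9 or later, `str.removeprefix` does the same job as the helper.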

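Data Analysis

The analysis and visualization are based on pandas and matplotlib.

The analysis covers: basic statistics for the 100 movies, how often each actor appears in them, the distribution by release year, by release month, and by country, the scores of the top 20 movies, the distribution by genre, and the distribution by runtime.

Number of appearances per actor in the 100 movies
(1) Read the movie data.
(2) Take the lead-actor (author) column and concatenate its values into a single string.
(3) Split on "," and count the actors: how many there are, their names, and how many times each name appears (there are many ways to do this; the Counter class from collections is used here).
(4) Select the six actors with the most appearances. Build the horizontal-axis data author and the vertical-axis data count.
(5) Draw a bar chart with matplotlib's pyplot and set the title, xlabel, and ylabel.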

import pandas as pd
from matplotlib import pyplot as plt
from collections import Counter

plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render the Chinese actor names

datas = pd.read_csv('maoyan.csv', encoding='utf-8')

# Join the author column into one comma-separated string (works for any
# number of rows and leaves no trailing comma), then split it into names
s = ','.join(datas['author'])
authors = s.split(',')
c = Counter(authors)
items = c.most_common(6)  # the six most frequent names with their counts
print(items)

author = [item[0] for item in items]  # horizontal axis: actor names
count = [item[1] for item in items]   # vertical axis: number of appearances

plt.bar(author, count, color='orange')
plt.title('The six actors with the most appearances')
plt.xlabel('Actor')
plt.ylabel('Appearances')
plt.show()
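If the author column contains stray spaces or trailing commas (an assumption; the source does not say whether it does), the same name can be counted under several spellings, or an empty string can show up as an "actor". A hedged standard-library sketch of a more defensive count follows. The `count_actors` helper and the sample values are hypothetical.

```python
from collections import Counter


def count_actors(author_column):
    """Count how often each name appears in comma-separated author strings."""
    counter = Counter()
    for cell in author_column:
        for name in cell.split(','):
            name = name.strip()
            if name:  # skip empty pieces left by stray commas
                counter[name] += 1
    return counter


# Hypothetical author values, for illustration only
sample = ['A,B,C', ' A, D', 'B,A,']
print(count_actors(sample).most_common(2))
```

Passing `datas['author']` instead of `sample` would apply the same counting to the scraped column.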

Result:

Distribution of movies by year
(1) Read the movie data.
(2) Take the release date and region (pub_time) column.
(3) Split on "-" and take the first element as the year (the second element is the month). Store it as a new column, year.
(4) Use groupby to group the year column by year and count the movies in each group.
(5) Convert the index of the result to a list for the horizontal axis, and the values to a list for the vertical axis.
(6) Draw a line chart with matplotlib's pyplot and set the title, xlabel, and ylabel.

import pandas as pd
from matplotlib import pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']

datas = pd.read_csv('maoyan.csv', encoding='utf-8')

# Split pub_time on '-': element 0 is the year, element 1 is the month
datas['year'] = datas['pub_time'].str.split('-').str[0]
datas['month'] = datas['pub_time'].str.split('-').str[1]

# Number of movies per year and per month
year = datas.groupby('year')['year'].count()
month = datas.groupby('month')['month'].count()

plt.figure(figsize=(20, 8), dpi=80)
plt.plot(list(year.index), list(year))
plt.title('Distribution of movies by year')
plt.xlabel('Year')
plt.ylabel('Number of movies')
plt.grid(alpha=0.4)
plt.show()
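Splitting on "-" assumes every pub_time value contains a month. If some rows hold only a year, `.str[1]` yields a missing value for them. The sketch below parses the field with a regular expression instead. The accepted shapes, the `parse_pub_time` helper, and the sample dates are all assumptions made for illustration.

```python
import re

# Assumed shapes: 'YYYY', 'YYYY-MM', 'YYYY-MM-DD', each optionally followed by '(region)'
PUB_TIME = re.compile(r'^(\d{4})(?:-(\d{1,2}))?(?:-(\d{1,2}))?(?:\((.+)\))?$')


def parse_pub_time(value):
    """Return (year, month, region); month and region are None when absent."""
    match = PUB_TIME.match(value.strip())
    if match is None:
        return None, None, None
    year, month, _day, region = match.groups()
    return int(year), int(month) if month else None, region


# Hypothetical values, for illustration only
for text in ['1993-01-01(中国香港)', '2010-09(美国)', '1994']:
    print(text, parse_pub_time(text))
```

Returning integers rather than strings also means the years sort numerically, which matters if the values are ever not four digits long.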

Result:

Distribution of movies by country
(1) Read the movie data.
(2) Define a function get_country() that splits the string and extracts the country part; 中国香港 (Hong Kong, China) returns 中国 (China), and 法国戛纳 (Cannes, France) returns 法国 (France).
(3) Take the release date and region (pub_time) column.
(4) Apply get_country to each value. Store the result as a new column, country.
(5) Use groupby to group the country column by country and count the movies in each group.
(6) Convert the index of the result to a list for the labels, and the values to a list for the slice sizes.
(7) Draw a pie chart with matplotlib's pyplot and set the title.

import pandas as pd
from matplotlib import pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render the Chinese country names

datas = pd.read_csv('maoyan.csv', encoding='utf-8')


def get_country(s):
    # The region, when present, is the text inside the parentheses
    country = s.split('(')
    if len(country) == 1:
        return '中国'  # no region given: counted as China
    temp = country[1].strip(')')
    if temp == '中国香港':
        return '中国'
    elif temp == '法国戛纳':
        return '法国'
    else:
        return temp


datas['country'] = datas['pub_time'].map(get_country)
country = datas.groupby('country')['country'].count()  # number of movies per country

# One explode offset per slice, pulling out the second slice. Building the list
# from the data avoids a length mismatch if the number of countries is not 9.
explods = [0.2 if i == 1 else 0 for i in range(len(country))]

plt.pie(list(country), labels=list(country.index), autopct='%1.1f%%', explode=explods)
plt.title('Distribution of movies by country')
plt.show()
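As the number of special cases grows, a lookup table can be easier to maintain than a chain of if/elif branches. The following is a minimal standard-library sketch of that idea, not the author's code. `REGION_TO_COUNTRY`, the `default` parameter, and the sample values are hypothetical; the two table entries mirror the special cases described above.

```python
from collections import Counter

# Regions folded into a country
REGION_TO_COUNTRY = {'中国香港': '中国', '法国戛纳': '法国'}


def get_country(pub_time, default='中国'):
    """Extract the region in parentheses and map it to a country."""
    start = pub_time.find('(')
    if start == -1:
        return default  # no region given
    region = pub_time[start + 1:].rstrip(')')
    return REGION_TO_COUNTRY.get(region, region)


# Hypothetical pub_time values, for illustration only
sample = ['2000-01-01(中国香港)', '2001-02-03(美国)', '2002', '2003-05-06(法国戛纳)']
print(Counter(get_country(p) for p in sample))
```

Adding another region then only requires a new dictionary entry.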

Result:

Distribution of movies by runtime
(1) Read the movie data.
(2) Take the runtime (long_time) column.
(3) Group the runtimes into bins 10 minutes wide.
(4) Draw a histogram with matplotlib's pyplot and set the title, xlabel, and ylabel.

import pandas as pd
from matplotlib import pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']

datas = pd.read_csv('maoyan.csv', encoding='utf-8')

long_time = list(datas['long_time'])  # runtimes in minutes

d = 10  # bin width in minutes
bins = range(min(long_time), max(long_time) + d, d)  # bin edges, also used as x ticks

# density=True normalizes the bars so that their total area is 1
plt.hist(long_time, bins, density=True)
plt.xticks(bins)
plt.grid(alpha=0.4)

plt.title('Distribution of movies by runtime')
plt.xlabel('Runtime (minutes)')
plt.ylabel('Density')

plt.show()
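With `density=True`, each bar height is count / (total × bin width), so the bar areas sum to 1 and a height is not the plain share of movies in that bin. The sketch below reproduces the binning arithmetic with the standard library only. The `bin_durations` helper and the sample runtimes are hypothetical.

```python
def bin_durations(durations, width=10):
    """Return (edges, counts, densities) for fixed-width bins starting at the minimum."""
    low, high = min(durations), max(durations)
    edges = list(range(low, high + width, width))
    if len(edges) < 2:  # all values equal: still build one bin
        edges.append(low + width)
    counts = [0] * (len(edges) - 1)
    for value in durations:
        # The last bin is closed on the right, so clamp the index into range
        index = min((value - low) // width, len(counts) - 1)
        counts[index] += 1
    total = len(durations)
    densities = [count / (total * width) for count in counts]
    return edges, counts, densities


# Hypothetical runtimes in minutes, for illustration only
edges, counts, densities = bin_durations([95, 102, 108, 120, 131])
print(edges)
print(counts)
print(densities)
```

Multiplying each density by the bin width (10 here) gives the proportion of movies in that bin, if a proportion axis is preferred.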

Result:

[More analysis]
https://github.com/Tcrushes/crawler-analysis
[References]
[1] matplotlib User Guide, 2020.04.08
[2] panluoluo. Maoyan movie scraper and analysis. GitHub, 2019.03.04
