Scraping the Douban Movie Top 250 Posters
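Hi everyone. I'm a newcomer to the internet industry, and I write this blog only to consolidate what I've learned. My knowledge is limited, so mistakes are inevitable; if you spot anything wrong, I'd appreciate your corrections.
Blog home page: https://blog.csdn.net/weixin_52720197?spm=1018.2118.3001.5343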
import os

import pandas as pd
import requests
from lxml import etree

# Accumulate results across all ten pages.
MOVIES = []
IMGURLS = []

# Working directory for movie.csv and the movieposter folder.
BASE_DIR = r'D:\Python\spader\爬蟲\豆瓣電影封面'


def get_html(url):
    """Fetch a page and return its HTML text, or None if the request fails."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.encoding = response.apparent_encoding
        if response.status_code == 200:
            print('Fetched page source successfully')
        return response.text
    except Exception as e:
        # Return None so the caller can skip this page instead of crashing.
        print('Failed to fetch page source: %s' % e)
        return None


def parse_html(html):
    """Parse one Top 250 page and return (movies, imgurls)."""
    movies = []
    imgurls = []
    tree = etree.HTML(html)
    lis = tree.xpath("//ol[@class = 'grid_view']/li")
    for li in lis:
        name = li.xpath(".//a/span[@class='title'][1]/text()")[0]
        # First text node: director and cast. Strip spaces, newlines and slashes.
        raw_people = li.xpath(".//div[@class='bd']/p/text()[1]")[0]
        director_actor = "".join(raw_people.replace(' ', '').replace('\n', '').replace('/', '').split())
        # Second text node: year, country and genre.
        raw_info = li.xpath(".//div[@class='bd']/p/text()[2]")[0]
        info = "".join(raw_info.replace(' ', '').replace('\n', '').split())
        rating_score = li.xpath(".//span[@class='rating_num']/text()")[0]
        rating_num = li.xpath(".//div[@class='star']/span[4]/text()")[0]
        # Some entries have no one-line quote.
        introduce = li.xpath(".//p[@class='quote']/span/text()")
        movie = {'name': name, 'director_actor': director_actor, 'info': info,
                 'rating_score': rating_score, 'rating_num': rating_num,
                 'introduce': introduce[0] if introduce else None}
        imgurl = li.xpath(".//img/@src")[0]
        movies.append(movie)
        imgurls.append(imgurl)
    return movies, imgurls


def download_img(url, movie):
    """Download one poster into BASE_DIR/movieposter, named after the movie."""
    poster_dir = os.path.join(BASE_DIR, 'movieposter')
    # Create the folder if it does not exist yet (no chdir needed).
    os.makedirs(poster_dir, exist_ok=True)
    img = requests.get(url).content
    with open(os.path.join(poster_dir, movie['name'] + '.jpg'), 'wb') as f:
        print('Downloading: %s' % url)
        f.write(img)


if __name__ == '__main__':
    # The list has 10 pages of 25 movies each.
    for i in range(10):
        url = 'https://movie.douban.com/top250?start=' + str(i * 25) + '&filter='
        html = get_html(url)
        if html is None:
            continue
        # Parse each page once and unpack both lists.
        movies, imgurls = parse_html(html)
        MOVIES.extend(movies)
        IMGURLS.extend(imgurls)
    # zip() avoids an IndexError if fewer than 250 entries were collected.
    for imgurl, movie in zip(IMGURLS, MOVIES):
        download_img(imgurl, movie)
    moviedata = pd.DataFrame(MOVIES)
    moviedata.to_csv(os.path.join(BASE_DIR, 'movie.csv'))
    print('Movie information saved locally')
Result
Fixing garbled text
Open the CSV file in Notepad, choose File > Save As, click Encoding, select ANSI, and save. When you then open the file in Excel, the text is no longer garbled.
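As a possible alternative to re-saving the file by hand, the CSV can be written with a UTF-8 byte order mark (BOM), which Excel generally uses to detect the encoding. The sketch below uses only the standard library so that it runs on its own; write_movies_csv, FIELDS and the sample row are hypothetical and are not taken from the script above. With pandas, passing encoding='utf-8-sig' to to_csv should have the same effect.

```python
import csv
import os
import tempfile

# A subset of the movie dict keys used in the article, for illustration only.
FIELDS = ['name', 'rating_score', 'introduce']


def write_movies_csv(path, movies):
    """Write movie dicts to a CSV file that starts with a UTF-8 BOM."""
    # 'utf-8-sig' prepends the BOM; newline='' is what the csv module expects.
    with open(path, 'w', encoding='utf-8-sig', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(movies)


# Hypothetical sample row, not scraped data.
sample = [{'name': '示例电影', 'rating_score': '9.0', 'introduce': None}]
demo_path = os.path.join(tempfile.gettempdir(), 'movie_demo.csv')
write_movies_csv(demo_path, sample)

# Show the first three bytes of the file, which are the BOM.
with open(demo_path, 'rb') as f:
    print(f.read(3))
```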