
Python Scraper: Douban Movie Top 250

Published: 2023/12/8 | python | 豆豆


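See also my other blog post, "Python scraper: Douban Books rated 9 and above".

Building on that earlier post, this time I did a quick scrape of the Douban Movie Top 250 list (https://movie.douban.com/top250).

Open the link, view the page source, and locate the tags that hold the fields we need. The targets this time are the title, summary, rating, and image; each one is extracted, processed, and saved in turn. (The basic prerequisite, as before, is fetching the page source from the url.)

The complete code for this example is on GitHub:

https://github.com/selfcon/douban_movie_scraper_python

A recommended online tool for testing regular expressions:

http://tool.oschina.net/regex/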

1. Page source (html)

# Fetch the page source as raw bytes
import urllib.request

def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    return html
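
Some sites reject requests that carry the default urllib User-Agent, so it may be necessary to send a browser-like header. The following is a minimal sketch of that idea. The helper names `build_request` and `get_html_with_headers` and the User-Agent string are assumptions for illustration, not part of the original code.

```python
import urllib.request

# Illustrative value only; any browser-like string could be used here.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"

def build_request(url):
    """Build a Request carrying a browser-like User-Agent (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

def get_html_with_headers(url, timeout=10):
    """Fetch the page source as bytes, like getHtml, but with the custom header."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as page:
        return page.read()
```

`get_html_with_headers` returns bytes just as `getHtml` does, so it could be swapped in without changing the rest of the script.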

2. Title (title)

The page source shows that each film has several titles, but we only need the first one, so the regex matches need some post-processing.

# Extract the title of every film on the page with a regular expression
import re

def getName(html):
    nameList = re.findall(r'<span.*?class="title">(.*?)</span>', html, re.S)
    # topnum is a module-level ranking counter; it must be initialized before the first call
    global topnum
    newNameList = []
    for index, item in enumerate(nameList):
        # Alternate titles contain HTML entities such as &gt or &nbsp; skipping them keeps only the first title
        if item.find("&nbsp") == -1:
            newNameList.append("Top " + str(topnum) + " " + item)
            topnum += 1
    return newNameList
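
To see why the `&nbsp` check keeps only the first title, here is a small sketch run on a hand-written fragment. The fragment only imitates the structure described above and is an assumption, not a copy of the live page; `first_titles` is a hypothetical name.

```python
import re

# Hand-written sample imitating two film entries; the real markup may differ.
SAMPLE = '''
<span class="title">Title A</span>
<span class="title">&nbsp;/&nbsp;Alternate Title A</span>
<span class="title">Title B</span>
'''

def first_titles(html):
    """Return only the titles that contain no &nbsp entity."""
    titles = re.findall(r'<span.*?class="title">(.*?)</span>', html, re.S)
    return [t for t in titles if "&nbsp" not in t]

print(first_titles(SAMPLE))  # ['Title A', 'Title B']
```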

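3. Summary (introduction)

The corresponding tag can be found in the page source and matched with a regular expression. (PS: some films have no summary, so this attribute is left out of the data that is finally stored.)

# Extract the summary of every film on the page with a regular expression
def getInfo(html):
    infoList = re.findall(r'<span.*?class="inq">(.*?)</span>', html, re.S)
    return infoList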
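
Because `re.findall` over the whole page simply skips films that have no summary, the summary list can end up shorter than the title list, and the entries no longer line up. One possible workaround is to search inside each film's block and fall back to an empty string. The sketch below assumes that each film sits in its own `<li>...</li>` block; the fragment is hand-written and `get_info_per_item` is a hypothetical name.

```python
import re

# Hand-written sample: the second film has no summary span.
SAMPLE = '''
<li><span class="title">Title A</span><span class="inq">Summary A</span></li>
<li><span class="title">Title B</span></li>
<li><span class="title">Title C</span><span class="inq">Summary C</span></li>
'''

def get_info_per_item(html):
    """Return one summary per film block, using "" when a film has none."""
    summaries = []
    for block in re.findall(r'<li>(.*?)</li>', html, re.S):
        match = re.search(r'<span.*?class="inq">(.*?)</span>', block, re.S)
        summaries.append(match.group(1) if match else "")
    return summaries

print(get_info_per_item(SAMPLE))  # ['Summary A', '', 'Summary C']
```

With one entry per film, the summaries could be stored alongside the other columns instead of being dropped.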

4. Rating (rating)

# Extract the rating_num of every film on the page with a regular expression
def getScore(html):
    scoreList = re.findall(r'<span.*?class="rating_num".*?property="v:average">(.*?)</span>', html, re.S)
    return scoreList
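
The ratings come back as strings. If they are needed as numbers, for example for sorting or filtering, they can be converted after matching. This is a small sketch on a hand-written fragment; `scores_as_float` is a hypothetical helper.

```python
import re

# Hand-written sample imitating two rating spans.
SAMPLE = '''
<span class="rating_num" property="v:average">9.7</span>
<span class="rating_num" property="v:average">9.6</span>
'''

def scores_as_float(html):
    """Match the rating strings with the same pattern as getScore, then convert to float."""
    raw = re.findall(
        r'<span.*?class="rating_num".*?property="v:average">(.*?)</span>',
        html, re.S)
    return [float(s) for s in raw]

print(scores_as_float(SAMPLE))  # [9.7, 9.6]
```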

5. Image (img)

# Extract the image URL of every film on the page with a regular expression
def getImg(html):
    imgList = re.findall(r'<img.*?alt=.*?src="(https.*?)".*?class.*?>', html, re.S)
    return imgList
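
Note that this pattern depends on the attributes appearing in the order `alt`, `src`, `class`. The sketch below illustrates that dependency on two hand-written tags; the URLs are placeholders and `get_img` is a hypothetical wrapper around the same pattern.

```python
import re

IMG_PATTERN = r'<img.*?alt=.*?src="(https.*?)".*?class.*?>'

def get_img(html):
    """Apply the same pattern as getImg to a piece of HTML."""
    return re.findall(IMG_PATTERN, html, re.S)

# Attribute order alt, src, class: the pattern matches.
in_order = '<img width="100" alt="Title A" src="https://example.com/a.jpg" class="">'
# Attribute order src, alt, class: the pattern finds nothing.
reordered = '<img src="https://example.com/b.jpg" alt="Title B" class="">'

print(get_img(in_order))   # ['https://example.com/a.jpg']
print(get_img(reordered))  # []
```

If the page markup changes, the pattern would need to be adjusted accordingly.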

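6. Pagination (page)

There are 250 records in total, 25 per page, so 10 pages.

# Result lists and the ranking counter used by getName (initial values assumed; not shown in the original)
namesUrl, scoresUrl, infosUrl, imgsUrl = [], [], [], []
topnum = 1

# Walk through the pages, 25 films per page
for page in range(0, 250, 25):
    url = "https://movie.douban.com/top250?start={}".format(page)
    html = getHtml(url).decode("UTF-8")
    namesUrl.extend(getName(html))
    scoresUrl.extend(getScore(html))
    infosUrl.extend(getInfo(html))
    imgsUrl.extend(getImg(html))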
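
The `start` offsets can be checked without touching the network. This minimal sketch only builds the page URLs; `page_urls` is a hypothetical helper, and the URL format is the one used in the loop above.

```python
def page_urls(total=250, per_page=25):
    """Return the URL of every page, stepping the start offset by per_page."""
    base = "https://movie.douban.com/top250?start={}"
    return [base.format(start) for start in range(0, total, per_page)]

urls = page_urls()
print(len(urls))  # 10
print(urls[0])    # https://movie.douban.com/top250?start=0
print(urls[-1])   # https://movie.douban.com/top250?start=225
```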

7. Print (print)

# Print the collected information and keep it in the list allInfo for saving later
allInfo = []
if len(namesUrl) == len(scoresUrl) == len(imgsUrl):
    length = len(namesUrl)
    for i in range(0, length):
        print(namesUrl[i] + " , score = " + scoresUrl[i] + " ,\n imgUrl=" + imgsUrl[i])
        allInfo.append([namesUrl[i], scoresUrl[i], imgsUrl[i]])

8. Storage (store)

# Save the collected data to a CSV file
import csv

def save_to_csv(list_tmp):
    with open('D:/movie.csv', 'w+', newline='') as fp:
        a = csv.writer(fp, delimiter=',')
        a.writerow(['name', 'score', 'imgurl'])
        a.writerows(list_tmp)
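
On Windows the default file encoding is often not UTF-8, so Chinese titles may fail to write or may appear garbled in spreadsheet programs. The following is a hedged variant that sets the encoding explicitly. The name `save_to_csv_utf8` and the caller-supplied `path` parameter (in place of the hard-coded `D:/movie.csv`) are assumptions for illustration.

```python
import csv

def save_to_csv_utf8(rows, path):
    """Write the header and rows to path as UTF-8 with a byte order mark."""
    # utf-8-sig prepends a BOM, which helps some spreadsheet programs detect the encoding.
    with open(path, "w", newline="", encoding="utf-8-sig") as fp:
        writer = csv.writer(fp, delimiter=",")
        writer.writerow(["name", "score", "imgurl"])
        writer.writerows(rows)
```

It would be called as `save_to_csv_utf8(allInfo, "movie.csv")` once the list from the previous step has been filled.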

9. Result (result)

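------To all the programmers out there working hard: keep going!!
With code you can go anywhere; without code you cannot take a single step
1024 - Dreams never stop!
Love programming, not bugs
Love overtime, not dark circles
Persistent but not pigheaded
Wild but not crazy
A rookie in life
A master at work
Carrying treasure within, longing for the stars and the sea
Pursuing perfection, with goals that begin at the mountaintop
A group of curious children who dream of changing the world
A group of geeks chasing the sun and the waves, who are changing the world
You use the most beautiful language to express the power of technology
You use rapid innovation to lead the changes of the times

-- Happy to share and improve together; additions are welcome
-- Treat Warnings As Errors
-- Any comments greatly appreciated
-- Talk is cheap, show me the code
-- Everyone is sincerely welcome to join the discussion! QQ: 1138517609
-- CSDN: https://blog.csdn.net/u011489043
-- Jianshu: https://www.jianshu.com/u/4968682d58d1
-- GitHub: https://github.com/selfconzrr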
