
Code for scraping movie posters of 王祖贤 (Joey Wong) from Douban

Published: 2024/1/1


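A crawler imitates the way a browser visits a website. The whole process has three stages: opening the page, extracting the data, and saving the data.
In Python, each of these stages has a matching tool. For "opening the page", Requests can fetch the page and return whatever the server sends back, which may be an HTML page or JSON data. For "extracting the data", two tools are mainly used: XPath locates elements in an HTML page and pulls the data out of them, while the json module parses JSON data. For the last stage, "saving the data", Pandas can hold the data and export it as a CSV file.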
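The extraction and saving stages for a JSON response can be sketched offline as follows. This is a minimal illustration, not the article's own code: the sample payload is made up (only the field names `images`, `id` and `src` come from the JSON listing in this article), the helper names `extract_records` and `to_csv` are hypothetical, and the standard-library `csv` module stands in for Pandas so the sketch has no third-party dependencies.

```python
import csv
import io
import json

# Hypothetical sample of a JSON response. The field names mirror the ones
# used by the JSON listing in this article; the values are invented.
SAMPLE_RESPONSE = (
    '{"images": ['
    '{"id": 1, "src": "https://example.com/a.jpg"}, '
    '{"id": 2, "src": "https://example.com/b.jpg"}]}'
)


def extract_records(json_text):
    """Stage 2 (extract): turn raw JSON text into a list of (id, src) rows."""
    payload = json.loads(json_text)
    return [(image["id"], image["src"]) for image in payload.get("images", [])]


def to_csv(rows):
    """Stage 3 (save): serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "src"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_records(SAMPLE_RESPONSE)
csv_text = to_csv(rows)
print(csv_text)
```

With Pandas installed, `pandas.DataFrame(rows, columns=["id", "src"]).to_csv("posters.csv", index=False)` would play the same role as `to_csv` here and write the file directly.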

XPath locating

import os

import requests
from lxml import etree
from selenium import webdriver

search_text = "王祖贤"
start = 0
limit = 15
total = 15


def download(img, title):
    """Save the image at URL `img` as <title>.jpg under the search-term folder."""
    save_dir = "D:\\数据分析\\python test\\query\\" + search_text + "\\"
    # Remove characters that are invisible or not allowed in file names
    name = title.replace('\u200e', '').replace('?', '').replace('/', 'or')
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)
    try:
        pic = requests.get(img, timeout=10)
        img_path = save_dir + str(name) + '.jpg'
        with open(img_path, 'wb') as fp:
            fp.write(pic.content)
    except requests.exceptions.ConnectionError:
        print('Image could not be downloaded')


def crawler_xpath():
    src_img = "//div[@class='item-root']/a[@class='cover-link']/img[@class='cover']/@src"
    src_title = "//div[@class='item-root']/div[@class='detail']/div[@class='title']/a[@class='title-text']"
    # Reuse one browser instance for all pages and close it at the end
    driver = webdriver.Chrome()
    try:
        for i in range(start, total, limit):
            request_url = ("https://search.douban.com/movie/subject_search?search_text="
                           + search_text + "&cat=1002&start=" + str(i))
            driver.get(request_url)
            html = etree.HTML(driver.page_source)
            imgs = html.xpath(src_img)
            titles = html.xpath(src_title)
            print(imgs, titles)
            for img, title in zip(imgs, titles):
                download(img, title.text)
    finally:
        driver.quit()


if __name__ == '__main__':
    crawler_xpath()
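The listing above builds each file name from the movie title by stripping a few characters by hand. A slightly more general version of that step might look like the sketch below. The helper name `safe_filename` and the choice of `_` as the replacement are assumptions for illustration, not part of the original code; the character set is the one Windows rejects in file names.

```python
import re

# Characters that Windows does not allow in file names.
_INVALID = re.compile(r'[\\/:*?"<>|]')


def safe_filename(title, replacement="_"):
    """Return `title` with invisible marks and path-unsafe characters neutralised."""
    cleaned = title.replace("\u200e", "")         # drop left-to-right marks
    cleaned = _INVALID.sub(replacement, cleaned)  # replace unsafe characters
    return cleaned.strip()


print(safe_filename("倩女幽魂\u200e / A Chinese Ghost Story?"))
```

Unlike the original replacement chain, this also covers characters such as `:` and `*`, which would otherwise make `open()` fail on Windows for some titles.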

JSON parsing

import json

import requests

query = '王祖贤'


def download(src, image_id):
    """Download one image and save it as <image_id>.jpg in the current directory."""
    path = './' + str(image_id) + '.jpg'
    try:
        pic = requests.get(src, timeout=10)
        with open(path, 'wb') as fp:
            fp.write(pic.content)
    except requests.exceptions.ConnectionError:
        print('Image could not be downloaded')


headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}

# Loop over every results page and request each URL
for i in range(0, 22471, 20):
    url = 'https://www.douban.com/j/search_photo?q=' + query + '&limit=20&start=' + str(i)
    print(url)
    req = requests.get(url=url, headers=headers)
    html = req.text  # the response body
    print(html)
    # Convert the JSON text into a Python object
    # (json.loads no longer accepts an `encoding` argument since Python 3.9)
    response = json.loads(html)
    for image in response['images']:
        print(image['src'])  # URL of the image being downloaded
        download(image['src'], image['id'])  # download one image
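The paging in the listing above works by stepping the `start` parameter in increments of `limit`. The sketch below isolates that step so it can be checked without touching the network. The generator name `page_urls` and the `total=45` used in the example are hypothetical; the endpoint and parameter names are the ones from the listing. It uses `urllib.parse.urlencode` so the non-ASCII query is percent-encoded explicitly instead of being concatenated into the URL.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.douban.com/j/search_photo"


def page_urls(query, total, page_size=20):
    """Yield one search URL per page, advancing `start` by `page_size`."""
    for start in range(0, total, page_size):
        params = {"q": query, "limit": page_size, "start": start}
        yield BASE_URL + "?" + urlencode(params)


# 45 results at 20 per page gives three pages: start = 0, 20, 40
urls = list(page_urls("王祖贤", total=45))
for url in urls:
    print(url)
```

Each URL produced this way can be passed to `requests.get` in place of the hand-built string.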

Feel free to follow my WeChat official account: 小菜鸡的技术之路.
