Python crawler example: scraping images from Baidu Tieba

Published: 2024/1/18


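Article updated: 2020-04-24
Note 1: A packaged build of the program (no Python environment required) can be downloaded from https://ww.lanzous.com/ibvwref
Note 2: More crawler examples are available at https://github.com/amnotgcs/SpiderCase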

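I. Analysis

1.1 Program flow

1. Read a string (the tieba name) from the user.
2. Check whether that tieba exists.
3. If it exists, parse the total number of pages, then read two numbers as the first and last page to extract images from.
4. Loop over those pages and save the images.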
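Step 3 can be sketched in isolation as follows. This is a minimal illustration, not the program's own code: the helper names total_pages and select_pages and the sample href are made up here, and the sketch assumes the "last page" link carries the offset of the last page in a pn query parameter. Unlike the original program, it also rejects a range that runs past the last page.

```python
from urllib.parse import urlsplit, parse_qs

POSTS_PER_PAGE = 50  # from the article: each list page holds 50 posts


def total_pages(last_href):
    """Turn the href of the 'last page' link into a page count."""
    query = parse_qs(urlsplit(last_href).query)
    last_offset = int(query["pn"][0])  # offset of the first post on the last page
    return last_offset // POSTS_PER_PAGE + 1


def select_pages(start, end, total):
    """Return the page numbers to crawl, or an empty list for an invalid range."""
    if start < 1 or end > total or start > end:
        return []
    return list(range(start, end + 1))


# Hypothetical href, made up for illustration only
sample_href = "//tieba.baidu.com/f?kw=python&ie=utf-8&pn=1500"
print(total_pages(sample_href))                       # 31
print(select_pages(2, 4, total_pages(sample_href)))   # [2, 3, 4]
```

Returning an empty list for a bad range lets the caller treat "nothing to do" and "invalid input" the same way, which matches how the program stops when no page list is produced.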

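1.2 Technical analysis

1. A tieba list page has the URL https://tieba.baidu.com/f?ie=utf-8&kw=KEYWORD&pn=OFFSET, where OFFSET is (page number - 1) x 50.
2. So all we need from the user is the keyword and the page numbers.
3. The source of a list page shows that each page holds 50 posts, each linked as https://tieba.baidu.com/p/POST_ID
4. So we collect all post IDs from the current list page and build the post URLs ourselves.
5. On a post page, the images are img tags with the class BDE_Image.
6. So we read the src attribute of each of those img tags and save the file with urllib.request.urlretrieve.
7. Refine the remaining details as needed.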
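The URL construction and the two extraction steps above can be tried offline with a small sketch. It deliberately uses the standard-library html.parser instead of BeautifulSoup so that it runs without third-party packages. The names list_page_url, TiebaLinkParser and extract, as well as the sample HTML, are made up for illustration; the class names j_th_tit and BDE_Image are the ones the program in this article looks for, and the real page markup may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

LIST_URL = "https://tieba.baidu.com/f?"


def list_page_url(keyword, page):
    """Build the list-page URL for a 1-based page number."""
    params = {"ie": "utf-8", "kw": keyword, "pn": (page - 1) * 50}
    return LIST_URL + urlencode(params)


class TiebaLinkParser(HTMLParser):
    """Collect post links (a.j_th_tit) and image URLs (img.BDE_Image)."""

    def __init__(self):
        super().__init__()
        self.post_paths = []
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "j_th_tit" in classes and attrs.get("href"):
            self.post_paths.append(attrs["href"])
        elif tag == "img" and "BDE_Image" in classes and attrs.get("src"):
            self.image_urls.append(attrs["src"])


def extract(html_doc):
    """Return (post paths, image URLs) found in an HTML string."""
    parser = TiebaLinkParser()
    parser.feed(html_doc)
    return parser.post_paths, parser.image_urls


# Made-up HTML fragment standing in for a downloaded page
SAMPLE = """
<a class="j_th_tit" href="/p/123456">First post</a>
<a class="other" href="/p/999">ignored</a>
<img class="BDE_Image" src="https://example.com/a.jpg">
"""

print(list_page_url("python", 2))
print(extract(SAMPLE))
```

urlencode also percent-encodes a non-ASCII keyword, so the tieba name can be passed in as typed.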

II. Source code

import requests
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
from os import mkdir


def setKeyword():
    # Prefix used to build URLs with query parameters
    url_prefix = "https://tieba.baidu.com/f?"
    print("\n\n\t\tWelcome to the Baidu Tieba image search program v1.0")
    print("\n\n\t\tProgram updated: 2020-04-24 by amnotgcs")
    keyword = input("\n\n\t\tEnter the name of the tieba to search: ")
    # Check whether the tieba exists
    url = "%sie=utf-8&kw=%s" % (url_prefix, urllib.parse.quote(keyword))
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    confirmName = soup.find('a', class_='card_title_fname')
    if confirmName:
        # Print the matched tieba name
        print("\t\tMatched:", confirmName.string.strip())
    else:
        print("\t\tThis tieba does not exist")
        return None
    # The "last page" link carries the offset of the last page;
    # if the link is missing, assume there is only one page
    lastLink = soup.find('a', class_="last pagination-item")
    if lastLink:
        maxPage = int(lastLink['href'].split("=")[-1]) // 50 + 1
    else:
        maxPage = 1
    print("\n\t\tFound %d pages in total" % maxPage)
    pageNum = int(input("\t\tEnter the first page to fetch: "))
    pageNumEnd = int(input("\t\tEnter the last page to fetch: "))
    if pageNumEnd < pageNum:
        return None
    pageList = []
    while pageNumEnd >= pageNum:
        params = {
            'ie': 'utf-8',
            'kw': keyword,
            'pn': (pageNum - 1) * 50,
        }
        pageList.append(url_prefix + urllib.parse.urlencode(params))
        pageNum += 1
    return pageList


def get_html(url=""):
    if not url:
        return None
    # Fetch the page source
    response = requests.get(url)
    html_doc = response.text
    logString = "\n" + "-" * 30 + "\nNow processing: %s\n" % url + "-" * 30 + "\n"
    To_log(logString)
    # Hand the source to the parsing function
    analyseHtml(html_doc)


def analyseHtml(html_doc=""):
    if not html_doc:
        return None
    # Parse the list page and collect the post links
    soup = BeautifulSoup(html_doc, 'html.parser')
    entries = soup.find_all('a', class_='j_th_tit')
    print("\n\nPost location:\t\t Title:")
    for item in entries:
        logString = "\n%s\t%s" % (item['href'], item.string)
        To_log(logString)
        entryUrl = "https://tieba.baidu.com" + item['href']
        # Post ID, used as a file-name prefix so images are not overwritten
        pageID = item['href'].split("/")[-1]
        getImage(entryUrl, pageID)


def getImage(url, pageID):
    global imgCount
    if not url:
        return None
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
    except Exception:
        print("\n", url, "seems to have failed~")
        return None
    # Locate the images
    tags = soup.find_all('img', class_='BDE_Image')
    if not tags:
        return None
    To_log("\t\t||||||||||Found: %d image(s)" % len(tags))
    for item in tags:
        try:
            urllib.request.urlretrieve(item['src'], r"./tiebaImg/%s_%d.jpg" % (pageID, imgCount))
        except Exception:
            print("\nFailed to save this image", end="")
        # The counter also advances on failure so every file name stays unique
        imgCount += 1


def To_log(data):
    # Append to the log file and echo to the console
    with open('tiebaImg/result.txt', 'a', encoding='utf-8') as file:
        file.write(data)
    print(data, end="")


def main():
    global imgCount
    imgCount = 0
    try:
        with open("tiebaImg/result.txt", 'w', encoding='utf-8') as file:
            file.write("Program started normally")
    except FileNotFoundError:
        # The output folder does not exist yet
        mkdir("tiebaImg")
        print("Created the tiebaImg folder")
    pageList = setKeyword()
    if pageList:
        for page in pageList:
            get_html(page)
        print("\n", "=" * 60, "\n\t\tFetched %d images in total" % imgCount)
        print("\t\tImages are saved in the tiebaImg folder next to the program!")
    else:
        print("\t\tSomething odd seems to have happened~")
    input("\n\t\tPress Enter to close the window~")


if __name__ == '__main__':
    main()
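The program creates the tiebaImg folder by catching the error from a failed open call, and it builds image file names from the post ID plus a running counter. As an alternative sketch (not the author's code), the same two concerns can be handled with pathlib. The helper names prepare_output_dir and image_path are made up for illustration, and a temporary directory stands in for the program's working directory.

```python
import tempfile
from pathlib import Path


def prepare_output_dir(base):
    """Create base/tiebaImg if needed; safe to call more than once."""
    out_dir = Path(base) / "tiebaImg"
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


def image_path(out_dir, page_id, index):
    """File name made of the post ID and a running counter, as in the program."""
    return out_dir / ("%s_%d.jpg" % (page_id, index))


# Demonstration inside a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    out_dir = prepare_output_dir(tmp)
    prepare_output_dir(tmp)  # second call does not raise
    print(out_dir.is_dir())
    print(image_path(out_dir, "123456", 0).name)
```

With exist_ok=True the folder check no longer depends on which exception open happens to raise.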

III. Screenshots

3.1 Program running:

3.2 Results:

IV. Enjoy!
