
qt web page scraping: scraping the Douban TOP250 data

Published: 2024/1/23


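I. Problem description

Scraping web pages with Python is a popular way to collect data quickly and with little effort. The author recently followed a tutorial and scraped the Douban TOP250 data. This article briefly describes how to scrape information from Douban's pages with a Python program.

II. Problem analysis

First, import the Python libraries used for scraping. Next, specify the URL and fetch the page. Then define regular expressions that express the matching rules, crawl and parse the data, and finally save the result to an Excel file.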

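III. Solution

1. Import the required libraries

from bs4 import BeautifulSoup  # parse the HTML and extract data
import re  # regular expressions, for text matching
import urllib.request, urllib.error  # request a URL and fetch the page data
import xlwt  # write Excel (.xls) files

Table 3.1.1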

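2. Fetch the content of a given URL

To send a request to the Douban server, the program has to imitate a browser's request headers. Open the Douban page, press F12 to open the developer tools, refresh the page so that the requests are recorded, stop the recording, click the top250 request, and find the User-Agent value. Copy that value into the program as the header (wrap it in quotes and remove any extra spaces).

Figure 3.2.1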
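
The header setup described above can be checked without sending anything over the network. The following is a minimal sketch, not the author's code: `build_request` and `USER_AGENT` are hypothetical names introduced here, and the User-Agent string is only an example value. It builds a `urllib.request.Request` and reads the header back.

```python
import urllib.request

# Example User-Agent value (an assumption; copy the real one from your own browser).
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36")


def build_request(url, user_agent=USER_AGENT):
    """Return a Request carrying a browser-like User-Agent; nothing is sent yet."""
    # strip() removes accidental leading/trailing spaces from a pasted value.
    return urllib.request.Request(url, headers={"User-Agent": user_agent.strip()})


req = build_request("https://movie.douban.com/top250?start=0")
# urllib stores header names capitalized as "User-agent", so look it up that way.
print(req.get_header("User-agent"))
print(req.get_method())
```

Building the request separately from opening it makes it easy to confirm what will be sent before any traffic reaches the server.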

def askURL(url):
    # Imitate a browser's request headers when contacting the Douban server.
    # The User-Agent tells the server what kind of browser and machine we are
    # (in effect, what kind of content we can accept).
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36 SLBrowser/6.0.1.8131"
    }
    request = urllib.request.Request(url, headers=head)
    html = ""
    try:  # fetch the page data, with exception handling
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
        # print(html)
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html

Table 3.1.2
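
The `hasattr` checks in the exception handler work because `urllib.error.HTTPError` (a subclass of `URLError`) carries both a `code` and a `reason`, while a plain `URLError` carries only a `reason`. A small offline sketch of that behavior follows; `describe_error` is a hypothetical helper, and both error objects are constructed by hand purely for illustration.

```python
import urllib.error


def describe_error(e):
    """Collect whichever of code / reason the exception provides."""
    parts = []
    if hasattr(e, "code"):
        parts.append(f"code={e.code}")
    if hasattr(e, "reason"):
        parts.append(f"reason={e.reason}")
    return ", ".join(parts)


# Hand-built errors (assumed values); no request is made.
http_err = urllib.error.HTTPError("https://example.com", 404, "Not Found", None, None)
url_err = urllib.error.URLError("connection refused")

print(describe_error(http_err))
print(describe_error(url_err))
```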

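3. Create the regular expression objects

First fetch the page source, then create regular expression objects based on the page's tags to extract the main pieces of information.

Figure 3.3.1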

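# (1) Rule for the link to the film's detail page.
# re.compile creates a regular expression object that expresses the rule (a string pattern).
findLink = re.compile(r'<a href="(.*?)">')
# (2) Film poster image. re.S lets "." match newline characters as well.
findImgSrc = re.compile(r'<img.*src="(.*?)"', re.S)
# (3) Film title
findTitle = re.compile(r'<span class="title">(.*)</span>')
# (4) Film rating
findRating = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')
# (5) Number of ratings (the page text is in Simplified Chinese)
findJudge = re.compile(r'<span>(\d*)人评价</span>')
# (6) One-line summary
findIng = re.compile(r'<span class="inq">(.*)</span>')
# (7) Other information about the film
findBd = re.compile(r'<p class="">(.*?)</p>', re.S)

Table 3.3.1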
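
To see how patterns of this kind behave, it helps to run them on a small fixed string instead of the live page. The sketch below uses a hypothetical HTML fragment (`SAMPLE_ITEM`, written for this example and not copied from Douban) and lowercase pattern names that are assumptions of this sketch. It shows the non-greedy `(.*?)` group and the effect of `re.S` on a multi-line paragraph.

```python
import re

# Hypothetical fragment that imitates one film entry; not real page source.
SAMPLE_ITEM = '''
<div class="item">
  <a href="https://example.com/subject/1/">
    <img alt="Example" class="" src="https://example.com/poster.jpg"/>
  </a>
  <span class="title">Example Title</span>
  <span class="title">&nbsp;/&nbsp;Other Title</span>
  <span>12345人评价</span>
  <p class="">
    Director: Someone<br/>
    1994 / Country / Drama
  </p>
</div>
'''

find_link = re.compile(r'<a href="(.*?)">')                 # non-greedy: stop at the first ">
find_title = re.compile(r'<span class="title">(.*)</span>')  # "." stops at line ends by default
find_votes = re.compile(r'<span>(\d*)人评价</span>')
find_bd_single_line = re.compile(r'<p class="">(.*?)</p>')   # no re.S: cannot cross newlines
find_bd = re.compile(r'<p class="">(.*?)</p>', re.S)         # re.S: "." also matches "\n"

links = find_link.findall(SAMPLE_ITEM)
titles = find_title.findall(SAMPLE_ITEM)
votes = find_votes.findall(SAMPLE_ITEM)
bd_without_flag = find_bd_single_line.findall(SAMPLE_ITEM)
bd_with_flag = find_bd.findall(SAMPLE_ITEM)

print(links, titles, votes)
print(len(bd_without_flag), len(bd_with_flag))
```

Without `re.S` the paragraph pattern finds nothing, because the text between `<p class="">` and `</p>` spans several lines.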

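4. Crawl the pages and parse the data

Crawling uses a for loop to walk through the pages and store each page's source. Parsing then cleans up, processes, and adjusts the scraped data and stores it in the datalist list.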
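
The page loop relies on the `start` query parameter advancing in steps of 25, one step per page of the list. As a rough sketch (the helper name `page_urls` and its parameters are assumptions made for this example), the ten URLs can be generated and inspected before any request is made:

```python
BASE_URL = "https://movie.douban.com/top250?start="


def page_urls(base_url, pages=10, page_size=25):
    """Build one URL per page: start=0, 25, 50, ... for the given page count."""
    return [base_url + str(i * page_size) for i in range(pages)]


urls = page_urls(BASE_URL)
print(urls[0])
print(urls[-1])
print(len(urls))
```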

def getdata(baseurl):
    datalist = []
    for i in range(0, 10):  # call the page-fetching function 10 times
        url = baseurl + str(i * 25)
        html = askURL(url)  # keep the fetched page source
        # 2. Parse the entries one by one
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.find_all('div', class_="item"):
            # print(item)  # uncomment to inspect everything in one film entry
            data = []  # all the information for one film
            item = str(item)
            # Link to the film's detail page; re.findall looks up the strings that match the pattern
            link = re.findall(findLink, item)[0]
            data.append(link)  # add the link
            imgSrc = re.findall(findImgSrc, item)[0]
            data.append(imgSrc)  # add the image
            titles = re.findall(findTitle, item)  # a film may have only a Chinese title and no foreign one
            if (len(titles) == 2):
                ctitle = titles[0]
                data.append(ctitle)
                otitle = titles[1].replace("/", "")  # drop the irrelevant symbol
                data.append(otitle)  # add the foreign title
            else:
                data.append(titles[0])
                data.append(' ')  # leave the foreign title blank
            rating = re.findall(findRating, item)[0]
            data.append(rating)  # add the rating
            judgeNum = re.findall(findJudge, item)[0]
            data.append(judgeNum)  # add the number of ratings
            inq = re.findall(findIng, item)
            if len(inq) != 0:
                inq = inq[0].replace("。", " ")  # remove the full stop
                data.append(inq)  # add the summary
            else:
                data.append(' ')  # leave blank
            bd = re.findall(findBd, item)[0]
            bd = re.sub(r'<br(\s+)?/>(\s+)?', " ", bd)  # remove <br/>
            bd = re.sub('/', " ", bd)  # replace slashes
            data.append(bd.strip())  # strip leading and trailing whitespace
            datalist.append(data)  # put the processed film into datalist
    return datalist

Table 3.4.1
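
The cleanup steps inside the parsing loop can be tried on their own with fixed strings. The sketch below pulls them into two hypothetical helpers, `clean_bd` and `split_titles` (names introduced here, not part of the original program), and applies them to made-up input.

```python
import re


def clean_bd(raw):
    """Replace <br/> (plus trailing whitespace) and slashes with spaces, then trim."""
    text = re.sub(r'<br(\s+)?/>(\s+)?', " ", raw)
    text = re.sub('/', " ", text)
    return text.strip()


def split_titles(titles):
    """Return (chinese_title, foreign_title); the foreign title is ' ' when absent."""
    if len(titles) == 2:
        return titles[0], titles[1].replace("/", "")
    return titles[0], ' '


# Made-up input that imitates the shape of the scraped text.
raw_bd = "\n  Director: Someone<br/>\n  1994 / Country / Drama\n"
cleaned = clean_bd(raw_bd)

print(cleaned.split())
print(split_titles(["Title A", " / Title B"]))
print(split_titles(["Title A"]))
```

Note that replacing each slash with a space leaves runs of several spaces in the cleaned text, which is why the example prints the words via `split()`.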

5. Save the data to Excel

Create a worksheet named 豆瓣電影top250, define the header row, and then use a for loop to write the data into the Excel sheet.

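def saveData(datalist, savepath):
    print()
    book = xlwt.Workbook(encoding="utf-8", style_compression=0)  # create the workbook object
    sheet = book.add_sheet('豆瓣電影top250', cell_overwrite_ok=True)  # create the worksheet
    # Header row: detail link, image link, Chinese title, foreign title,
    # rating, number of ratings, summary, other information
    col = ("電影詳情鏈接", "圖片鏈接", "影片中文名", "影片外國名", "評分", "評價數", "概況", "相關信息")
    for i in range(0, 8):
        sheet.write(0, i, col[i])
    # Loop over the films actually collected, so a short datalist does not raise IndexError
    for i in range(0, len(datalist)):
        print("第%d條" % (i + 1))  # progress message: "item number N"
        data = datalist[i]
        for j in range(0, 8):
            sheet.write(i + 1, j, data[j])  # cell data
    print("爬取成功！")  # "scraping succeeded"
    book.save(savepath)  # save the workbook

Table 3.5.1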
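
The same header-then-rows layout can be tried without installing `xlwt`. As an illustration only, the sketch below swaps in the standard-library `csv` module and an in-memory buffer; it is an alternative output format, not the Excel writer used in this article, and the names `COLUMNS` and `rows_to_csv` plus the sample row are assumptions.

```python
import csv
import io

# English stand-ins for the eight column headers (hypothetical names).
COLUMNS = ("link", "image", "title_cn", "title_other",
           "rating", "votes", "summary", "info")


def rows_to_csv(datalist):
    """Write a header row followed by one row per film; return the CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(COLUMNS)
    for row in datalist:
        writer.writerow(row)
    return buffer.getvalue()


# A made-up row with the same eight fields as one entry of datalist.
sample = [["https://example.com/1", "https://example.com/1.jpg", "Title", " ",
           "9.7", "12345", "A summary", "1994 Drama"]]
csv_text = rows_to_csv(sample)
print(csv_text)
```

Iterating over `datalist` directly, instead of over a fixed count of 250, keeps the writer working when fewer films were collected.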

6. Call the functions from main

def main():
    baseurl = "https://movie.douban.com/top250?start="
    # 1. Crawl the pages
    datalist = getdata(baseurl)
    # 2. Output file name
    savepath = "豆瓣電影Top250.xls"
    # 3. Save the data
    saveData(datalist, savepath)

getdata("https://movie.douban.com/top250?start=")

……

if __name__ == '__main__':
    main()

Table 3.6.1

Figure 3.6.1

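IV. Summary

Scraping web data draws on many libraries, so you need to be fluent in them and understand how they work. Practicing on pages with few anti-scraping mechanisms is a good way to build up scraping skills.

Editor of this issue: 董澤梅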
