Scraping Weibo Mobile Comment Data with Python

Published: 2024/3/12 python 63 豆豆

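Goal

Scrape the comment data from the mobile version of Weibo, then save it to a .txt file and an .xlsx file.

Implementation

The approach is simple: simulate the browser sending the ajax request, then read the JSON data returned by the backend.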

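1. Find the ajax request that fetches the comment data

Press F12 to open the developer console and locate the request that loads the comments.

Taking https://m.weibo.cn/detail/4467454577673256 as an example, the ajax request looks like this: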

https://m.weibo.cn/comments/hotflow?id=4467454577673256&mid=4467454577673256&max_id_type=0

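Next, scroll down the page and look at a few more of the ajax requests that fetch comments (only two are shown here to save space) to see whether there is a pattern.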

https://m.weibo.cn/comments/hotflow?id=4467454577673256&mid=4467454577673256&max_id=142552137895298&max_id_type=0

https://m.weibo.cn/comments/hotflow?id=4467454577673256&mid=4467454577673256&max_id=139116183376416&max_id_type=0

These requests all share the following parts:

https://m.weibo.cn/comments/hotflow?
id=4467454577673256
mid=4467454577673256
max_id_type=0
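Since id and mid both appear to equal the trailing number of the post URL, the request URL can be assembled from the post link alone. The sketch below shows one way to do that; the helper names weibo_id_from_detail_url and build_comment_url are illustrative and not part of the original project.

```python
from urllib.parse import urlencode, urlparse

HOTFLOW_URL = "https://m.weibo.cn/comments/hotflow"


def weibo_id_from_detail_url(detail_url):
    """Return the trailing numeric segment of a post URL (hypothetical helper)."""
    last_segment = urlparse(detail_url).path.rstrip("/").split("/")[-1]
    if not last_segment.isdigit():
        raise ValueError("no numeric post id at the end of: " + detail_url)
    return last_segment


def build_comment_url(weibo_id, max_id=None, max_id_type=0):
    """Assemble the comment request URL; max_id is omitted for the first request."""
    params = {"id": weibo_id, "mid": weibo_id}
    if max_id is not None:
        params["max_id"] = max_id
    params["max_id_type"] = max_id_type
    return HOTFLOW_URL + "?" + urlencode(params)


post_id = weibo_id_from_detail_url("https://m.weibo.cn/detail/4467454577673256")
print(build_comment_url(post_id))
print(build_comment_url(post_id, max_id=142552137895298))
```

The two printed URLs have the same shape as the first and second requests observed above.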

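Here, id and mid both equal the trailing number in the post URL, and max_id_type is 0 in every request shown (in fact it is not always 0, as discussed later). At this point we can already reproduce the first ajax request, because it consists of exactly these parts. The later requests carry one extra parameter, max_id. Observation shows that this value comes from the data returned by the previous ajax request.

Taking the second request above as an example, its max_id parameter is taken from the max_id field in the first request's response. The remaining steps are therefore:

1. Using baseurl, id and mid, fetch the data of the first ajax request, save the useful comment fields into an array, and return the max_id field.

2. Simulate the browser: using the max_id obtained above together with baseurl, id and mid, send the next ajax request and append the resulting comments to the earlier results.

3. Repeat step 2 max times, where max is the number of ajax requests (this may be inaccurate; the actual number of iterations appears to be ≤ max, which you can check by printing it). The value comes from the max field of the response data, and its initial value is 1.

4. Finally, save the comment data to the .txt and .xlsx files.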
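The steps above can be sketched as a small loop. This is a minimal illustration rather than the project's actual code: fetch_page is a hypothetical callable that takes a max_id (None for the first request) and returns the parsed JSON, so the pagination logic can be tried without touching the network. It treats max as an upper bound on the number of follow-up requests, as in step 3; a real run also needs a stop condition for when the server runs out of data early, which this sketch leaves out.

```python
def crawl_comments(fetch_page):
    """Follow the max_id chain and collect the comments of every page.

    fetch_page(max_id) is an assumed callable returning the parsed JSON
    of one request; pass max_id=None to get the first page.
    """
    first = fetch_page(None)
    comments = list(first["data"]["data"])
    max_id = first["data"]["max_id"]
    max_pages = first["data"]["max"]

    # Step 3: repeat step 2 at most `max` times.
    for _ in range(max_pages):
        page = fetch_page(max_id)
        comments.extend(page["data"]["data"])
        max_id = page["data"]["max_id"]  # feeds the next request
    return comments


# Fake responses keyed by the max_id that requests them (made-up values).
fake_pages = {
    None: {"ok": 1, "data": {"data": ["c1", "c2"], "max_id": 111, "max": 2}},
    111: {"ok": 1, "data": {"data": ["c3"], "max_id": 222, "max": 2}},
    222: {"ok": 1, "data": {"data": ["c4"], "max_id": 0, "max": 2}},
}
print(crawl_comments(fake_pages.get))
```

Passing the fetch function in as a parameter keeps the loop independent of how the HTTP request is made.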

2. Fetching the data of the ajax requests

There are many ways to fetch the data; I chose requests.get().

web_data = requests.get(url, headers=headers, timeout=5)
js_con = web_data.json()

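With the code above we can easily fetch the data of the first request. Fetching the second and third requests, however, fails, because no cookies are sent. I looked through the Request Headers for cookie data but found none in any of the requests, so I decided to enter every cookie value of the Weibo site by hand (see the Cookie definition in config.py in the source code section).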
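Instead of typing each cookie by hand, a cookie string copied from the browser can be split into a name-to-value dictionary, which is the form the cookies argument of requests.get() accepts. The following is a minimal sketch; parse_cookie_string is a hypothetical helper and the sample values are placeholders.

```python
def parse_cookie_string(cookie_string):
    """Split 'name=value; name2=value2' into a dict (hypothetical helper)."""
    cookies = {}
    for pair in cookie_string.split(";"):
        pair = pair.strip()
        if not pair or "=" not in pair:
            continue  # skip empty fragments and malformed entries
        name, value = pair.split("=", 1)  # values may themselves contain '='
        cookies[name.strip()] = value.strip()
    return cookies


cookies = parse_cookie_string("MLOGIN=1; XSRF-TOKEN=abc123; _T_WM=42")
print(cookies)

# The dict can then be passed along with the request, for example:
#   requests.get(url, headers=headers, cookies=cookies, timeout=5)
```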

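With the cookie problem solved, let's test whether we can now fetch the data of all the subsequent ajax requests. The answer is: no.

After countless tests, the 17th request always raised an exception. The reason is the one mentioned earlier: max_id_type is not always 0, and it equals 1 when the 17th request is sent.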
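This observation can be captured in a small helper. The threshold of 17 is only what was observed in these tests, not a documented rule, so it is kept as a parameter; max_id_type_for is an illustrative name.

```python
def max_id_type_for(request_number, switch_at=17):
    """Return the max_id_type to use for the n-th request (1-based).

    Assumption: requests before `switch_at` use 0, later ones use 1,
    matching what was observed for the post used in this article.
    """
    return 0 if request_number < switch_at else 1


for n in (1, 16, 17, 18):
    print(n, max_id_type_for(n))
```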

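At this point we can scrape essentially all of the data, but one small problem remains: some posts have curated mode enabled for their comments, which shows only the few comments with the most replies.

In one such post, although the page says there are 5757 comments, the backend returns only 17 of them to the front end. If we keep to the earlier approach and send max ajax requests in a loop, an error occurs. The fix is to check whether the data returned by a request contains {ok: 0} (every valid response has ok: 1). If it does, this request and all later ones are invalid, so there is no need to keep looping; simply save the results to the .txt and .xlsx files.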
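The check described here can be sketched as follows. The function names are illustrative, and the sample responses are made up to mimic the {ok: 1} and {ok: 0} shapes mentioned above.

```python
def is_valid_response(js_con):
    """A response carries comments only when its ok field is 1."""
    return js_con.get("ok") == 1


def collect_until_invalid(responses):
    """Gather comments page by page and stop at the first invalid response."""
    comments = []
    for js_con in responses:
        if not is_valid_response(js_con):
            break  # this page and all later ones are useless
        comments.extend(js_con["data"]["data"])
    return comments


sample = [
    {"ok": 1, "data": {"data": ["c1", "c2"]}},
    {"ok": 0},
    {"ok": 1, "data": {"data": ["never reached"]}},
]
print(collect_until_invalid(sample))
```

Whatever was collected before the invalid response is still returned, so it can be saved as usual.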

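3. Saving the data to txt and Excel

Code for exporting to these two file types is easy to find online, so I won't go into it here. One thing to note: because the comments of several posts are stored, the Excel file uses a separate sheet for each post, whereas the txt file is only for casual reading, so the comments of all posts are appended to the same file.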
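For the txt half, a minimal sketch of appending every post's comments to one shared file might look like this. The function name and the demo file path are assumptions for illustration; the dictionary keys commentor_name and comment_text are the ones used in the source code, and the name is padded with '-' to a fixed width as export_txt does there.

```python
import os
import tempfile


def append_comments_to_txt(path, post_url, comments):
    """Append one line per comment; earlier content in the file is kept."""
    with open(path, "a", encoding="utf-8") as fh:
        for comment in comments:
            # Pad the commenter name with '-' to 20 characters.
            name = format(comment["commentor_name"], "-<20")
            fh.write(post_url + " " + name + comment["comment_text"] + "\n")


# Demo with made-up comments, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "weibo_comments.txt")
append_comments_to_txt(demo_path, "https://m.weibo.cn/detail/1",
                       [{"commentor_name": "alice", "comment_text": "hello"}])
append_comments_to_txt(demo_path, "https://m.weibo.cn/detail/2",
                       [{"commentor_name": "bob", "comment_text": "hi"}])
with open(demo_path, encoding="utf-8") as fh:
    print(fh.read())
```

Opening the file in "a" mode is what lets each post add to the same file instead of overwriting it.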

4. Source code

comment.py (main file)

import time

import openpyxl  # not called directly; pandas needs it installed to write .xlsx files
import pandas as pd
import requests

from config import headers, url, Cookie, base_url, weiboComment, excel_name, txt_name


# Convert a Weibo timestamp (e.g. "Sat Mar 16 12:12:03 +0800 2019") into another format
def formatTime(time_string, from_format, to_format='%Y.%m.%d %H:%M:%S'):
    time_struct = time.strptime(time_string, from_format)
    return time.strftime(to_format, time_struct)


# Pick the fields we keep from one raw comment item
def build_comment(commment_item, weibo_id, max_id, max_page):
    return {
        'commentor_id': commment_item['user']['id'],
        'commentor_name': commment_item['user']['screen_name'],
        'commentor_blog_url': commment_item['user']['profile_url'],
        'comment_id': commment_item['id'],
        'comment_text': commment_item['text'],
        'create_time': formatTime(commment_item['created_at'],
                                  '%a %b %d %H:%M:%S +0800 %Y', '%Y-%m-%d %H:%M:%S'),
        'like_count': commment_item['like_count'],
        'reply_number': commment_item['total_number'],
        'full_path': base_url + str(weibo_id),
        'max_id': max_id,
        'max': max_page,
    }


# Scrape the first page of comments of a post
def first_page_comment(weibo_id, url, headers):
    try:
        url = url + str(weibo_id) + '&mid=' + str(weibo_id) + '&max_id_type=0'
        web_data = requests.get(url, headers=headers, cookies=Cookie, timeout=20)
        js_con = web_data.json()
        # max_id links to the next page of comments
        max_id = js_con['data']['max_id']
        max_page = js_con['data']['max']
        for commment_item in js_con['data']['data']:
            commentLists.append(build_comment(commment_item, weibo_id, max_id, max_page))
        print("Fetched page 1 of the comments")
        return commentLists
    except Exception as e:
        print("Error while fetching page 1:", e)
        return []


# Scrape the remaining pages recursively: the URL of each page needs a max_id
# that is only available in the data returned by the previous page.
def orther_page_comments(count, weibo_id, url, headers, max_page, max_id):
    if count > max_page:
        return commentLists
    try:
        # count 0-14 are requests 2-16; from the 17th request on max_id_type is 1
        max_id_type = 0 if count < 15 else 1
        urlNew = (url + str(weibo_id) + '&mid=' + str(weibo_id)
                  + '&max_id=' + str(max_id) + '&max_id_type=' + str(max_id_type))
        web_data = requests.get(url=urlNew, headers=headers, cookies=Cookie, timeout=10)
        # Only continue when the request succeeded
        if web_data.status_code != 200:
            return commentLists
        js_con = web_data.json()
        # ok == 0: curated mode is on and no more data is returned, so stop here
        if js_con['ok'] == 0:
            return commentLists
        max_id = js_con['data']['max_id']
        max_page = js_con['data']['max']
        for commment_item in js_con['data']['data']:
            commentLists.append(build_comment(commment_item, weibo_id, max_id, max_page))
        count += 1
        print("Fetched page " + str(count + 1) + " of the comments.")
        orther_page_comments(count, weibo_id, url, headers, max_page, max_id)  # recurse
    except Exception as e:
        print("Error while fetching a later page, keeping what was collected:", e)
    return commentLists


# Save the comments of one post to its own sheet in the Excel file
def export_excel(exportArr, id, sheetName):
    pf = pd.DataFrame(exportArr)  # list of dicts -> DataFrame
    # Column order in the output
    order = ['comment_id', 'commentor_name', 'commentor_id', 'commentor_blog_url',
             'comment_text', 'create_time', 'like_count', 'reply_number', 'full_path']
    pf = pf[order]
    # Readable column names
    columns_map = {
        'comment_id': 'comment_id',
        'commentor_name': 'commenter name',
        'commentor_id': 'commenter id',
        'commentor_blog_url': 'commenter profile url',
        'comment_text': 'comment text',
        'create_time': 'created at',
        'like_count': 'likes',
        'reply_number': 'replies',
        'full_path': 'weibo url',
    }
    pf.rename(columns=columns_map, inplace=True)
    pf.fillna(' ', inplace=True)  # replace empty cells
    pf.to_excel(file_path, index=False, sheet_name=sheetName)
    print('---------- Comments of post', id, 'have been saved ----------')
    return True


# Write one comment as a line of the txt file
def export_txt(comment, txtId):
    arr = [str(txtId), ' ', comment['full_path'], ' ', comment['commentor_name']]
    # Pad names shorter than 20 characters with '-' so the comment texts line up
    commentorNameMaxLen = 20
    lenGap = commentorNameMaxLen - len(comment['commentor_name'])
    for i in range(lenGap):
        arr.append('-')
    arr.append(comment['comment_text'])
    arr.append('\n')  # end of line
    file_handle.writelines(arr)


if __name__ == "__main__":
    commentLists = []  # comments of the post currently being scraped
    file_path = pd.ExcelWriter(excel_name)  # writer for the output workbook
    txt_id = 0  # running row id in the txt file
    file_handle = open(txt_name, mode='w', encoding='utf-8')
    file_handle.writelines(['id ', 'weibo_url ', 'commenter', ' ', 'comment_text\n'])  # header row

    # Scrape and store the comments of each post
    for item in weiboComment:
        commentLists.clear()  # start every post with an empty list
        output = first_page_comment(item['weibo_id'], url, headers)
        if len(output) > 0:
            maxPage = output[-1]['max']
            maxId = output[-1]['max_id']
            # Keep going if there is more than one page
            if maxPage != 1:
                orther_page_comments(0, item['weibo_id'], url, headers, maxPage, maxId)
            else:
                print('---------- This post has only 1 page of comments ----------')
            # In curated mode the last page is empty; export whatever was collected
            if export_excel(commentLists, item['id'], item['sheet_name']):
                for comment in commentLists:
                    txt_id = txt_id + 1
                    export_txt(comment, txt_id)

    file_path.close()  # write the workbook to disk
    file_handle.close()  # flush and close the txt file

config.py

base_url = 'https://m.weibo.cn/detail/'
url = 'https://m.weibo.cn/comments/hotflow?id='

excel_name = r'weibo_comments.xlsx'
txt_name = 'weibo_comments.txt'

# Reference code: https://www.cnblogs.com/pythonfm/p/9056461.html
ALF = 1583630252
MLOGIN = 1
M_WEIBOCN_PARAMS = 'oid%3D4469046194244186%26luicode%3D10000011%26lfid%3D102803%26uicode%3D10000011%26fid%3D102803'
SCF = 'AjheAPuZRqxmyLT-kTVnBXGduebXE6nZGT5fS8_VPbfADyWHQ_WyoRzZqAJNujugOFYP1tUivrlzK2TGTx83_Qo.'
SSOLoginState = 1581038313
SUB = '_2A25zOMq5DeRhGeNM6FUX8S_EzDqIHXVQwtbxrDV6PUJbktAKLVPhkW1NTjKs6wgXZoFv2vqllQWpcwE-e9-8LlMs'
SUBP = '0033WrSXqPxfM725Ws9jqgMF55529P9D9W58TWlXMj17lMMvjhSsjQ1p5JpX5K-hUgL.Fo-Ee0MceK2RS0q2dJLoIEXLxKqLBozL1h.LxKML1-BLBK2LxKML1-2L1hBLxK-LBKqL12BLxK-LBKqL12Bt'
SUHB = '0BLYTPzIKSGsDo'
WEIBOCN_FROM = 1110006030
XSRF_TOKEN = '5dcf70'
_T_WM = 64204543757

# requests expects a mapping of cookie name -> value (as strings) for cookies=
Cookie = {
    'ALF': str(ALF),
    'MLOGIN': str(MLOGIN),
    'M_WEIBOCN_PARAMS': M_WEIBOCN_PARAMS,
    'SCF': SCF,
    'SSOLoginState': str(SSOLoginState),
    'SUB': SUB,
    'SUBP': SUBP,
    'SUHB': SUHB,
    'WEIBOCN_FROM': str(WEIBOCN_FROM),
    'XSRF-TOKEN': XSRF_TOKEN,
    '_T_WM': str(_T_WM),
}

headers = {
    'Sec-Fetch-Mode': 'cors',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',  # fetch the data as an ajax request
    'X-XSRF-TOKEN': 'aa8bed',
    'Accept': 'application/json, text/plain, */*',
}

# Record id, id of the post to scrape, and the name of its sheet in the Excel file
weiboComment = [
    {'id': 1, 'weibo_id': 4349331148136901, 'sheet_name': 'file_tab1'},
    {'id': 2, 'weibo_id': 4349336798569857, 'sheet_name': 'file_tab2'},
    {'id': 3, 'weibo_id': 4349342632452485, 'sheet_name': 'file_tab3'},
    {'id': 4, 'weibo_id': 4349359489249263, 'sheet_name': 'file_tab4'},
    {'id': 5, 'weibo_id': 4349367202366649, 'sheet_name': 'file_tab5'},
    {'id': 6, 'weibo_id': 4349409263609558, 'sheet_name': 'file_tab6'},
    {'id': 7, 'weibo_id': 4349473562085041, 'sheet_name': 'file_tab7'},
    {'id': 8, 'weibo_id': 4349476527153453, 'sheet_name': 'file_tab8'},
    {'id': 9, 'weibo_id': 4349484396400084, 'sheet_name': 'file_tab9'},
    {'id': 10, 'weibo_id': 4349520848132903, 'sheet_name': 'file_tab10'},
    {'id': 11, 'weibo_id': 4349719763185960, 'sheet_name': 'file_tab11'},
    {'id': 12, 'weibo_id': 4349801526543328, 'sheet_name': 'file_tab12'},
    {'id': 13, 'weibo_id': 4350037775161542, 'sheet_name': 'file_tab13'},
    {'id': 14, 'weibo_id': 4350053403309300, 'sheet_name': 'file_tab14'},
    {'id': 15, 'weibo_id': 4350126740919864, 'sheet_name': 'file_tab15'},
    {'id': 16, 'weibo_id': 4350129907409012, 'sheet_name': 'file_tab16'},
    {'id': 17, 'weibo_id': 4350130469806786, 'sheet_name': 'file_tab17'},
    {'id': 18, 'weibo_id': 4350133967955764, 'sheet_name': 'file_tab18'},
    {'id': 19, 'weibo_id': 4350135909606542, 'sheet_name': 'file_tab19'},
    {'id': 20, 'weibo_id': 4350218999265612, 'sheet_name': 'file_tab20'},
    {'id': 21, 'weibo_id': 4350440310723864, 'sheet_name': 'file_tab21'},
    {'id': 22, 'weibo_id': 4350520937742523, 'sheet_name': 'file_tab22'},
    {'id': 23, 'weibo_id': 4350785468613341, 'sheet_name': 'file_tab23'},
    {'id': 24, 'weibo_id': 4350785615363253, 'sheet_name': 'file_tab24'},
    {'id': 25, 'weibo_id': 4350789927730012, 'sheet_name': 'file_tab25'},
    {'id': 26, 'weibo_id': 4350789751053448, 'sheet_name': 'file_tab26'},
    {'id': 27, 'weibo_id': 4350780188153079, 'sheet_name': 'file_tab27'},
    {'id': 28, 'weibo_id': 4350791797481716, 'sheet_name': 'file_tab28'},
    {'id': 29, 'weibo_id': 4350797737493161, 'sheet_name': 'file_tab29'},
    {'id': 30, 'weibo_id': 4350798441501055, 'sheet_name': 'file_tab30'},
    {'id': 31, 'weibo_id': 4350800991931397, 'sheet_name': 'file_tab31'},
    {'id': 32, 'weibo_id': 4350974611001741, 'sheet_name': 'file_tab32'},
    {'id': 33, 'weibo_id': 4351283193709752, 'sheet_name': 'file_tab33'},
]
