Python Weibo Crawler: The Mercedes-Benz Incident

Published: 2024/1/18 python 27 豆豆

Tools: PyCharm, Windows 10, Python 3.6.4

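The Mercedes-Benz oil leak incident has recently become a hot topic and has repeatedly appeared on Weibo's trending list. I wrote a Weibo comment crawler to see what people think about it.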

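Weibo's mobile site is somewhat easier to crawl than the desktop site, and the comment data is nearly the same, so we collect everything from the mobile site.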

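Opening the topic's search page on the mobile site shows a large number of posts and comments. We therefore first collect the link of each post and then use that link to reach the post's comment pages. The posts are loaded dynamically: as you keep scrolling down, new posts appear, so the crawler has to request the same paged data that the page loads in the background rather than the initial HTML.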

Each request returns 10 posts. We need the idstr of each of these 10 posts to build the link to that post's detail page.
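The idstr extraction can be sketched as follows. This is a minimal illustration, not the exact crawler code: the function name `extract_idstrs` and the trimmed `sample` payload are made up for the example (the second id in particular is hypothetical), and the sketch walks every card instead of only the first one, which is an assumption about how the response may vary.

```python
import json


def extract_idstrs(index_json):
    """Collect the idstr of every post found in one page of search results."""
    payload = json.loads(index_json)
    idstrs = []
    for card in payload['data']['cards']:
        for item in card.get('card_group', []):
            mblog = item.get('mblog')
            if mblog:  # skip entries that do not carry a post
                idstrs.append(mblog['idstr'])
    return idstrs


# Hypothetical, heavily trimmed response used only for illustration.
sample = json.dumps({
    'data': {'cards': [{'card_group': [
        {'mblog': {'idstr': '4362541104634930'}},
        {'mblog': {'idstr': '4362541104634931'}},
    ]}]}
})

print(extract_idstrs(sample))
```

Skipping entries without an `mblog` key keeps the loop from failing on result cards that are not posts.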

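With the idstr in hand, the detail links turn out to follow a simple pattern: https://m.weibo.cn/detail/ + idstr. That makes them easy to construct. From each detail page we want the post content and the comments. The post content is embedded in the source of the detail page, so it is easy to extract. However, the content often contains hyperlinks, which make the data less usable, so we apply a regular expression that keeps only the Chinese text. The comments are loaded dynamically as well. Note that I only collect first-level comments here.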
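As a small sketch of these two steps, the snippet below builds a detail link and strips markup from a piece of text. The helper names `detail_url` and `keep_chinese`, the shortened character class, and the `sample` string are assumptions made for illustration only.

```python
import re

DETAIL_URL = 'https://m.weibo.cn/detail/'

# Chinese characters plus a few full-width punctuation marks (a reduced set).
CHINESE_TEXT = re.compile('[\u4e00-\u9fa5，、“”‘’：！。；？]+')


def detail_url(idstr):
    """Build the detail-page link for one post."""
    return DETAIL_URL + str(idstr)


def keep_chinese(raw):
    """Join every run of Chinese text, dropping tags, URLs and other markup."""
    return ''.join(CHINESE_TEXT.findall(raw))


# Hypothetical post text containing a hyperlink.
sample = '奔驰<a href="https://example.com">链接</a>维权！'

print(detail_url('4362541104634930'))
print(keep_chinese(sample))
```

Note that the anchor text of a link survives this filter when it is Chinese; only the markup and the URL are removed.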

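One thing to watch out for when fetching comments: pagination does not work the way we are used to, where adding one to page gives the next page. Instead, each page of comments contains an id (max_id), and the URL of the next page is built from that id. This repeats page after page.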
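This cursor-style paging can be sketched as a generator. The names `comment_url`, `iter_comment_pages`, `fake_fetch` and `FAKE_PAGES` are hypothetical, and the fake pages stand in for real responses so the loop can run without network access; the stop condition (a max_id of 0 means there is no further page) is taken from the crawler code.

```python
from urllib.parse import parse_qs, urlparse


def comment_url(post_id, max_id):
    """Build the comment-page URL for a post from the current cursor."""
    return ('https://m.weibo.cn/comments/hotflow?id=' + str(post_id)
            + '&mid=' + str(post_id)
            + '&max_id=' + str(max_id) + '&max_id_type=0')


def iter_comment_pages(post_id, fetch_json, max_pages=5):
    """Yield the comment list of each page, following max_id page to page."""
    max_id = 0
    for _ in range(max_pages):
        page = fetch_json(comment_url(post_id, max_id))
        yield page['data']['data']
        max_id = page['data']['max_id']
        if str(max_id) == '0':  # no further page
            break


# Hypothetical responses keyed by the max_id that requests them.
FAKE_PAGES = {
    '0': {'data': {'data': [{'text': 'first page'}], 'max_id': 111}},
    '111': {'data': {'data': [{'text': 'second page'}], 'max_id': 0}},
}


def fake_fetch(url):
    """Stand-in for a real HTTP request: look the page up by its max_id."""
    max_id = parse_qs(urlparse(url).query)['max_id'][0]
    return FAKE_PAGES[max_id]


pages = list(iter_comment_pages('4362541104634930', fake_fetch))
print(len(pages))
```

Passing the fetch function in as a parameter keeps the paging logic separate from the HTTP layer, so it can be exercised with canned data before pointing it at the live site.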

The full code is below.

import csv
import json
import re
import time

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'cookie': 'replace with your own cookie',
}

# Keeps Chinese characters and common punctuation; tags and links are dropped.
text_process_pattern = re.compile('[\u4e00-\u9fa5|，、“”‘’：！~@#￥【】*（）——+。；？]+', re.S)


def get_html(url):
    """Fetch a URL and return the response body as text."""
    response = requests.get(url, headers=headers)
    response.encoding = response.apparent_encoding
    return response.text


def parse_index_html(html):
    """Return the idstr of every post on one page of search results."""
    html = json.loads(html)
    idstr = []
    for i in html['data']['cards'][0]['card_group']:
        idstr.append(i['mblog']['idstr'])
    return idstr


def parse_detail_html(html):
    """Extract the Chinese text of a post from its detail page source."""
    # The post body sits between "text": and "textLength" in the page source.
    text_pattern = re.compile('"text":(.*?)"textLength"', re.S)
    text = re.findall(text_pattern, html)
    if not text:  # no post body found on this page
        return '内容'
    text_process = re.findall(text_process_pattern, text[0])
    # '内容' means 'content'; it marks the row as the post itself.
    return '内容' + ''.join(text_process)


def parse_comment_html(html):
    """Write one page of first-level comments to the CSV; return the next max_id."""
    html = json.loads(html)
    max_id = html['data']['max_id']
    for i in html['data']['data']:
        text_process = re.findall(text_process_pattern, i['text'])
        write2csv(''.join(text_process))
    return max_id


def write2csv(content):
    """Append one row to info1.csv."""
    with open('info1.csv', 'a', encoding='utf-8-sig', newline='') as f:
        writer = csv.writer(f)
        writer.writerow([content])


if __name__ == '__main__':
    for page in range(2, 10):
        print('Page ' + str(page))
        # Search results for the topic #奔驰女车主哭诉维权#
        url = 'https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D60%26q%3D%23%E5%A5%94%E9%A9%B0%E5%A5%B3%E8%BD%A6%E4%B8%BB%E5%93%AD%E8%AF%89%E7%BB%B4%E6%9D%83%23%26t%3D0&page_type=searchall&page=' + str(page)
        index_html = get_html(url)
        idstr = parse_index_html(index_html)
        for post_id in idstr:
            print('Post ID ' + str(post_id))
            detail_text_url = 'https://m.weibo.cn/detail/' + str(post_id)
            detail_html = get_html(detail_text_url)
            text = parse_detail_html(detail_html)
            write2csv(text)
            max_id = '0'  # restart comment paging for every post
            for i in range(5):
                try:
                    time.sleep(3)
                    print('Comment page ' + str(i))
                    # e.g. 'https://m.weibo.cn/comments/hotflow?id=4362541104634930&mid=4362541104634930&max_id_type=0'
                    comment_url = 'https://m.weibo.cn/comments/hotflow?id=' + str(post_id) + '&mid=' + str(post_id) + '&max_id=' + str(max_id) + '&max_id_type=0'
                    print(comment_url)
                    comment_html = get_html(comment_url)
                    max_id = parse_comment_html(comment_html)
                    print('max_id ' + str(max_id))
                    if str(max_id) == '0':  # no more comment pages
                        break
                except (requests.RequestException, ValueError, KeyError, TypeError):
                    # Request failed or the response held no comment data; try again.
                    continue
