Scraping Toutiao (今日头条) hot news with Python

Published: 2024/4/11 · python · 豆豆

Today, let's scrape the hot news from Toutiao.

Toutiao hot news page: https://www.toutiao.com/ch/news_hot/

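Open the Toutiao link in a browser and select "热点" (Hot) in the left sidebar. In the browser's developer tools, the Network tab quickly shows a request whose name starts with '?category=news_hot…'. Click it to see the request URL.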


Everything returned by this request sits in the data field, and the response is JSON.
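As a rough illustration of that layout, the sketch below parses a response body and walks the data list. The payload is a hand-written stand-in, not a captured response, and extract_titles is a hypothetical helper. Only the field names data, title, source_url and next are taken from the full program later in this post.

```python
import json

# Hand-written stand-in for a response body (hypothetical values, not real Toutiao data).
sample_body = (
    '{"data": [{"title": "Example headline", "source_url": "/group/1/"}],'
    ' "next": {"max_behot_time": 1577347347}}'
)


def extract_titles(body):
    """Return the title of every item in the response's data list."""
    parsed = json.loads(body)
    return [item["title"] for item in parsed.get("data", [])]


titles = extract_titles(sample_body)
print(titles)
```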

The request URL is:

https://www.toutiao.com/api/pc/feed/?category=news_hot&utm_source=toutiao&widen=1&max_behot_time=1577347347&max_behot_time_tmp=1577347347&tadrequire=true&as=A1450EF0A468003&cp=5E04F850E003EE1&_signature=VYMs9gAgEBe5v1fEUcnQ31WDLeAAAuI

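The URL carries nine query parameters: category, utm_source, widen, max_behot_time, max_behot_time_tmp, tadrequire, as, cp and _signature.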
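To see the parameters one by one, the URL above can be split with the standard library. This is only a convenience sketch using urllib.parse; it does not contact the site.

```python
from urllib.parse import urlsplit, parse_qsl

# The request URL quoted above, split across lines for readability.
url = (
    "https://www.toutiao.com/api/pc/feed/?category=news_hot&utm_source=toutiao&widen=1"
    "&max_behot_time=1577347347&max_behot_time_tmp=1577347347&tadrequire=true"
    "&as=A1450EF0A468003&cp=5E04F850E003EE1&_signature=VYMs9gAgEBe5v1fEUcnQ31WDLeAAAuI"
)

# parse_qsl returns (name, value) pairs in the order they appear in the query string.
params = dict(parse_qsl(urlsplit(url).query))
for name, value in params.items():
    print(name, "=", value)
```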

The max_behot_time value for the next request comes from the JSON returned by the previous one, in its next.max_behot_time field.
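A minimal sketch of reading that cursor, assuming the response has already been parsed into a dict. The helper name next_max_behot_time, the fallback to "0" and the sample dict are illustrative assumptions, not part of the original program.

```python
def next_max_behot_time(res_data, default="0"):
    """Return the cursor for the next request as a string, or default if it is missing."""
    value = (res_data.get("next") or {}).get("max_behot_time")
    return default if value is None else str(value)


# Hypothetical parsed response, for illustration only.
sample = {"data": [], "next": {"max_behot_time": 1577347347}}
cursor = next_max_behot_time(sample)
print(cursor)
```

Returning a string keeps the value ready for concatenation into the request URL, which is how the full program uses it.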

Two of the parameters, as and cp, are generated ("encrypted") by JavaScript on the page. The algorithm that produces them is available, though.

The JavaScript algorithm:

var e = {};
e.getHoney = function() {
    var t = Math.floor((new Date).getTime() / 1e3),
        e = t.toString(16).toUpperCase(),
        n = md5(t).toString().toUpperCase();
    if (8 != e.length) return {
        as: "479BB4B7254C150",
        cp: "7E0AC8874BB0985"
    };
    for (var o = n.slice(0, 5), i = n.slice(-5), a = "", r = 0; 5 > r; r++) a += o[r] + e[r];
    for (var l = "", s = 0; 5 > s; s++) l += e[s + 3] + i[s];
    return {
        as: "A1" + a + e.slice(-3),
        cp: e.slice(0, 3) + l + "E1"
    }
}, t.ascp = e
}(window, document), function() {
    var t = ascp.getHoney(),
        e = {
            path: "/",
            domain: "i.snssdk.com"
        };
    $.cookie("cp", t.cp, e), $.cookie("as", t.as, e), window._honey = t
}(), Flow.prototype = {
    init: function() {
        var t = this;
        this.url && (t.showState(t.auto_load ? NETWORKTIPS.LOADING : NETWORKTIPS.HASMORE), this.container.on("scrollBottom", function() {
            t.auto_load && (t.lock || t.has_more && t.loadmore())
        }), this.list_bottom.on("click", "a", function() {
            return t.lock = !1, t.loadmore(), !1
        }))
    },
    loadmore: function(t) {
        this.getData(this.url, this.type, this.param, t)
    },

The Python code that computes the as and cp values is as follows.

Reference blog post: https://www.cnblogs.com/xuchunlin/p/7097391.html

import time
import hashlib


def get_as_cp_args():
    zz = {}
    now = round(time.time())
    print(now)  # current Unix time in seconds
    e = hex(int(now)).upper()[2:]  # hex() returns the hexadecimal string of an integer; [2:] drops the "0X" prefix
    print(e)
    # hashlib.md5(...).hexdigest() creates the hash object and returns its hexadecimal digest
    i = hashlib.md5(str(int(now)).encode("utf8")).hexdigest().upper()
    if len(e) != 8:
        zz = {'as': "479BB4B7254C150", 'cp': "7E0AC8874BB0985"}
        return zz
    n = i[:5]   # first five characters of the digest
    a = i[-5:]  # last five characters of the digest
    r = ""
    s = ""
    for k in range(5):
        s = s + n[k] + e[k]
    for j in range(5):
        r = r + e[j + 3] + a[j]
    zz = {'as': "A1" + s + e[-3:], 'cp': e[0:3] + r + "E1"}
    print(zz)
    return zz
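Because get_as_cp_args reads the clock, its output changes every second and is awkward to check. The sketch below restates the same interleaving as a function of an explicit timestamp. The name as_cp_for is a hypothetical helper introduced here for illustration, and the timestamp 1577353219 (hex 5E048003) is simply the one that the as and cp values in the example URL above decode to.

```python
import hashlib


def as_cp_for(timestamp):
    """Compute the as/cp pair for a given Unix timestamp in seconds."""
    e = format(int(timestamp), "X")  # uppercase hex without a prefix
    digest = hashlib.md5(str(int(timestamp)).encode("utf8")).hexdigest().upper()
    if len(e) != 8:
        # Same fixed fallback pair as the JavaScript version.
        return {"as": "479BB4B7254C150", "cp": "7E0AC8874BB0985"}
    head, tail = digest[:5], digest[-5:]
    # as: digest characters first, interleaved with hex digits 0-4.
    s = "".join(h + c for h, c in zip(head, e[:5]))
    # cp: hex digits 3-7 first, interleaved with the last digest characters.
    r = "".join(c + t for c, t in zip(e[3:8], tail))
    return {"as": "A1" + s + e[-3:], "cp": e[:3] + r + "E1"}


pair = as_cp_for(1577353219)  # 0x5E048003
print(pair)
```

Every second character of the middle section of each value is a digit of the hexadecimal timestamp, so the timestamp can be read back out of as and cp. The remaining characters come from the MD5 digest.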

That completes the request URL. One more note: the _signature parameter can be dropped and the JSON data is still returned, so the URL is ready to use without it.
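The URL without _signature can also be assembled with urllib.parse.urlencode instead of string concatenation. This is a sketch under the assumption that the endpoint accepts the same eight parameters shown in the example URL. build_feed_url and FEED_URL are names introduced here, and the as and cp values passed in are just the ones from that example.

```python
from urllib.parse import urlencode

FEED_URL = "https://www.toutiao.com/api/pc/feed/"


def build_feed_url(max_behot_time, as_value, cp_value):
    """Assemble the feed URL from its query parameters, leaving out _signature."""
    query = {
        "category": "news_hot",
        "utm_source": "toutiao",
        "widen": 1,
        "max_behot_time": max_behot_time,
        "max_behot_time_tmp": max_behot_time,
        "tadrequire": "true",
        "as": as_value,
        "cp": cp_value,
    }
    # urlencode keeps the insertion order of the dict and escapes values if needed.
    return FEED_URL + "?" + urlencode(query)


url = build_feed_url("0", "A1450EF0A468003", "5E04F850E003EE1")
print(url)
```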

The complete code:

import requests
import json
import time
import hashlib
import xlwt


# Compute the as and cp parameters
def get_as_cp_args():
    zz = {}
    now = round(time.time())
    print(now)  # current Unix time in seconds
    e = hex(int(now)).upper()[2:]  # hex() returns the hexadecimal string of an integer; [2:] drops the "0X" prefix
    print(e)
    # hashlib.md5(...).hexdigest() creates the hash object and returns its hexadecimal digest
    i = hashlib.md5(str(int(now)).encode("utf8")).hexdigest().upper()
    if len(e) != 8:
        zz = {'as': "479BB4B7254C150", 'cp': "7E0AC8874BB0985"}
        return zz
    n = i[:5]
    a = i[-5:]
    r = ""
    s = ""
    for k in range(5):
        s = s + n[k] + e[k]
    for j in range(5):
        r = r + e[j + 3] + a[j]
    zz = {'as': "A1" + s + e[-3:], 'cp': e[0:3] + r + "E1"}
    print(zz)
    return zz


# Fetch the URL and return the parsed JSON
def get_html_data(target_url):
    # Replace these with your own request headers. Copying the code as-is will fail!!!
    headers = {
        "referer": "https://www.toutiao.com/",
        "accept": "text/javascript, text/html, application/xml, text/xml, */*",
        "content-type": "application/x-www-form-urlencoded",
        "cookie": "tt_webid=6774555886024279565; s_v_web_id=76cec5f9a5c4ee50215b678a6f53dea5; WEATHER24279565; csrftoken=bb8c835711d848db5dc5445604d0a9e9; __tasessionId=gphokc0el1577327623076",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
    }
    response = requests.get(target_url, headers=headers)
    res_data = json.loads(response.text)
    return res_data


# Parse the data and extract the relevant fields
def get_parse_data(max_behot_time, base_url, start_url):
    # Holds all collected Toutiao news items
    excel_data = []
    # Number of loops, i.e. how many times the feed is refreshed. A refresh normally
    # returns 10 items, but sometimes fewer, so the total is not always a multiple of 10.
    for _ in range(3):
        # Compute the as and cp parameters
        as_cp_args = get_as_cp_args()
        # Assemble the request URL
        targetUrl = start_url + max_behot_time + '&max_behot_time_tmp=' + max_behot_time + '&tadrequire=true&as=' + as_cp_args['as'] + '&cp=' + as_cp_args['cp']
        res_data = get_html_data(targetUrl)
        time.sleep(1)
        toutiao_data = res_data['data']
        for item in toutiao_data:
            toutiao = []
            toutiao_title = item['title']  # news title
            toutiao_source_url = item['source_url']  # news link
            if "https" not in toutiao_source_url:
                toutiao_source_url = base_url + toutiao_source_url
            toutiao_source = item['source']  # account that published the news
            toutiao_media_url = base_url + item['media_url']  # link to the publishing account
            toutiao.append(toutiao_title)
            toutiao.append(toutiao_source_url)
            toutiao.append(toutiao_source)
            toutiao.append(toutiao_media_url)
            excel_data.append(toutiao)
            print(toutiao)
        # Get the max_behot_time value for the next request
        max_behot_time = str(res_data['next']['max_behot_time'])
    return excel_data


# Save the data to an Excel workbook
def save_data(excel_data):
    header = ["Title", "Link", "Toutiao account", "Toutiao account link"]
    excel_data.insert(0, header)
    workbook = xlwt.Workbook(encoding="utf-8", style_compression=0)
    worksheet = workbook.add_sheet("sheet1", cell_overwrite_ok=True)
    for i in range(len(excel_data)):
        for j in range(len(excel_data[i])):
            worksheet.write(i, j, excel_data[i][j])
    workbook.save(r"今日头条热点新闻.xls")
    print("Toutiao news saved!!")


if __name__ == '__main__':
    # Initial cursor value
    max_behot_time = '0'
    # Base address
    base_url = 'https://www.toutiao.com'
    # First part of the request URL
    start_url = 'https://www.toutiao.com/api/pc/feed/?category=news_hot&utm_source=toutiao&widen=1&max_behot_time='
    toutiao_data = get_parse_data(max_behot_time, base_url, start_url)
    save_data(toutiao_data)
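One detail in get_parse_data is the test `"https" not in toutiao_source_url`, which decides whether a link needs the base address prepended. A possible alternative, sketched here with urllib.parse.urljoin, handles absolute and relative links in one call. absolute_url is a hypothetical helper and the sample links are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.toutiao.com"


def absolute_url(link, base=BASE_URL):
    """Resolve link against base; links that are already absolute are returned unchanged."""
    return urljoin(base + "/", link)


relative = absolute_url("/group/123/")
already_absolute = absolute_url("https://example.com/a")
print(relative)
print(already_absolute)
```

Unlike the substring test, this also leaves absolute http:// links untouched instead of prefixing them with the base address.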

When the program finishes, the collected news items are saved in the Excel workbook.
