

Cookie Expiry When Scraping: A Python Instagram Scraper

Published: 2023/12/20

Let's go straight into the concrete steps and the points to watch out for:

Notes on scraping Instagram

  • Instagram's home page is server-side rendered, so the first 11 or 12 posts that appear there live in a JSON structure embedded in the HTML (additionalData); only the posts loaded after that come in via AJAX requests
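As a minimal sketch of pulling that embedded JSON out of the server-rendered HTML (the page markup here is a simplified stand-in for a real response):

```python
import json
import re

def extract_shared_data(html):
    """Pull the JSON assigned to window._sharedData out of the page HTML."""
    match = re.search(r"window\._sharedData\s*=\s*(\{.*?\});</script>", html, re.S)
    if match is None:
        return None
    return json.loads(match.group(1))

# Simplified stand-in for a server-rendered Instagram page:
page = '<script>window._sharedData = {"rhx_gis": "abc", "entry_data": {}};</script>'
data = extract_shared_data(page)
print(data["rhx_gis"])  # -> abc
```

The same regex approach works for additionalData; only the variable name in the pattern changes.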
  • Before June 2019, Instagram had an anti-scraping check: requests had to carry an 'X-Instagram-GIS' header. The algorithm was:
    1. Concatenate rhx_gis and queryVariables (rhx_gis can be read from the sharedData JSON structure on the home page)

    2. MD5-hash the result

e.g. (hashStr is just an MD5 helper, spelled out here so the snippet is complete):

```python
import hashlib

def hashStr(text):
    return hashlib.md5(text.encode('utf-8')).hexdigest()

queryVariables = '{"id":"' + user_id + '","first":12,"after":"' + cursor + '"}'
print(queryVariables)
headers['X-Instagram-GIS'] = hashStr(GIS_rhx_gis + ":" + queryVariables)
```
  • Since June 2019, however, Instagram has dropped the X-Instagram-GIS check, so there is no need to generate it any more; the previous point can be treated as history
  • The first visit to the Instagram home page sets a few cookies; the relevant response headers look like this:

```
set-cookie: rur=PRN; Domain=.instagram.com; HttpOnly; Path=/; Secure
set-cookie: ds_user_id=11859524403; Domain=.instagram.com; expires=Mon, 15-Jul-2019 09:22:48 GMT; Max-Age=7776000; Path=/; Secure
set-cookie: urlgen="{"45.63.123.251": 20473}:1hGKIi:7bh3mEau4gMVhrzWRTvtjs9hJ2Q"; Domain=.instagram.com; HttpOnly; Path=/; Secure
set-cookie: csrftoken=Or4nQ1T3xidf6CYyTE7vueF46B73JmAd; Domain=.instagram.com; expires=Tue, 14-Apr-2020 09:22:48 GMT; Max-Age=31449600; Path=/; Secure
```
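In practice a `requests.Session()` stores these cookies for you automatically, but as a rough sketch of how one of the Set-Cookie lines above can be inspected with the standard library:

```python
from http.cookies import SimpleCookie

# One of the Set-Cookie values from the response above
raw = ('csrftoken=Or4nQ1T3xidf6CYyTE7vueF46B73JmAd; Domain=.instagram.com; '
       'expires=Tue, 14-Apr-2020 09:22:48 GMT; Max-Age=31449600; Path=/; Secure')

cookie = SimpleCookie()
cookie.load(raw)  # parses name, value, and attributes like Domain/Max-Age
print(cookie['csrftoken'].value)      # -> Or4nQ1T3xidf6CYyTE7vueF46B73JmAd
print(cookie['csrftoken']['domain'])  # -> .instagram.com
```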
  • As for query_hash: you generally don't need to worry about this hash value and can simply hard-code it
  • Important: every request must carry a custom header, and that header must contain a user-agent; only then can you sign requests with rhx_gis and actually get data back. Remember: every single request! For example:

```python
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}
```

Most API calls only return data if the cookie in the request header carries a sessionid. A normal request header looks like this:

```
:authority: www.instagram.com
:method: GET
:path: /graphql/query/?query_hash=ae21d996d1918b725a934c0ed7f59a74&variables=%7B%22fetch_media_count%22%3A0%2C%22fetch_suggested_count%22%3A30%2C%22ignore_cache%22%3Atrue%2C%22filter_followed_friends%22%3Atrue%2C%22seen_ids%22%3A%5B%5D%2C%22include_reel%22%3Atrue%7D
:scheme: https
accept: */*
accept-encoding: gzip, deflate, br
accept-language: zh-CN,zh;q=0.9,en;q=0.8,la;q=0.7
cache-control: no-cache
cookie: mid=XI-joQAEAAHpP4H2WkiI0kcY3sxg; csrftoken=Or4nQ1T3xidf6CYyTE7vueF46B73JmAd; ds_user_id=11859524403; sessionid=11859524403%3Al965tcIRCjXmVp%3A25; rur=PRN; urlgen="{"45.63.123.251": 20473}:1hGKIj:JvyKtYz_nHgBsLZnKrbSq0FEfeg"
pragma: no-cache
referer: https://www.instagram.com/
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36
x-ig-app-id: 936619743392459
x-instagram-gis: 8f382d24b07524ad90b4f5ed5d6fccdb
x-requested-with: XMLHttpRequest
```
  • Note the user-agent, x-ig-app-id (obtained from sharedData in the HTML), x-instagram-gis, and the sessionid carried in the cookie
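The URL-encoded `variables` blob in the `:path` above is just a JSON string passed through percent-encoding. A sketch of building such a path (the id and cursor values here are made up for illustration):

```python
import json
import urllib.parse

query_hash = 'ae21d996d1918b725a934c0ed7f59a74'  # taken from the request above
variables = json.dumps(
    {'id': '11859524403', 'first': 12, 'after': ''},
    separators=(',', ':')  # no spaces, matching what the web client sends
)
path = '/graphql/query/?query_hash=%s&variables=%s' % (
    query_hash, urllib.parse.quote(variables)
)
print(path)
```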
Paging the API (requesting the next page of data), e.g. a user's post list. A paginated AJAX request in Instagram generally carries parameters like the following:

```python
# page info from the page's embedded JSON
page_info = js_data["entry_data"]["ProfilePage"][0]["graphql"]["user"]["edge_owner_to_timeline_media"]['page_info']
# cursor pointing at the next page, e.g.
# AQCSnXw1JsoV6LPOD2Of6qQUY7HWyXRc_CBSMWB6WvKlseC-7ibKho3Em0PEG7_EP8vwoXw5zwzsAv_mNMR8yX2uGFZ5j6YXdyoFfdbHc6942wc
cursor = page_info['end_cursor']
# whether there is a next page
flag = page_info['has_next_page']
```
  • end_cursor is the value to use for after; has_next_page tells you whether there is a next page
    If there is, make the first paginated request. When its response comes back, keep id and first unchanged, set after to the end_cursor found in the response's page_info, rebuild variables, and send the next request together with query_hash.
    Then check has_next_page in the new response's page_info and loop; this way you can fetch all the data. If you don't want everything, the count value in edge_owner_to_timeline_media in the response, which is the user's total number of media items, can serve as a stopping condition.
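The cursor loop described above can be sketched as follows. `fetch_page` is a stand-in for the real GraphQL request; here it walks a fake three-page feed so the control flow is testable offline:

```python
def fetch_all(fetch_page):
    """Follow end_cursor / has_next_page until the feed is exhausted."""
    medias, cursor = [], ''
    while True:
        data = fetch_page(cursor)
        timeline = data['data']['user']['edge_owner_to_timeline_media']
        medias.extend(edge['node'] for edge in timeline['edges'])
        page_info = timeline['page_info']
        if not page_info['has_next_page']:
            return medias
        cursor = page_info['end_cursor']

# Fake three-page feed: cursor -> (nodes, next cursor, has_next_page)
pages = {
    '':   (['p1', 'p2'], 'c1', True),
    'c1': (['p3'],       'c2', True),
    'c2': (['p4'],       '',   False),
}

def fake_fetch(cursor):
    nodes, end, more = pages[cursor]
    return {'data': {'user': {'edge_owner_to_timeline_media': {
        'edges': [{'node': n} for n in nodes],
        'page_info': {'end_cursor': end, 'has_next_page': more},
    }}}}

print(fetch_all(fake_fetch))  # -> ['p1', 'p2', 'p3', 'p4']
```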
  • Video posts and image posts have different data structures; check the is_video field in the response data to tell them apart
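A minimal sketch of branching on `is_video` (the two node dicts are trimmed-down stand-ins for real response data):

```python
def media_url(node):
    """Video posts carry video_url; image posts only have display_url."""
    if node['is_video']:
        return node['video_url']
    return node['display_url']

image_node = {'is_video': False, 'display_url': 'https://example.com/a.jpg'}
video_node = {'is_video': True, 'video_url': 'https://example.com/a.mp4',
              'display_url': 'https://example.com/a_cover.jpg'}

print(media_url(image_node))  # -> https://example.com/a.jpg
print(media_url(video_node))  # -> https://example.com/a.mp4
```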
  • If you are scraping with a single Instagram account, then as long as the cookie in the request header carries a valid, unexpired sessionid, you can hit the interfaces directly with no signature computation at all.
    The most direct way: open a browser, log in to Instagram, press F12 and look at an XHR request, then copy the cookie from its request header and use it, like below:
```python
# -*- coding:utf-8 -*-
import requests
import re
import json
import urllib.parse
import hashlib

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
BASE_URL = 'https://www.instagram.com'
ACCOUNT_MEDIAS = "https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables=%s"
ACCOUNT_PAGE = 'https://www.instagram.com/%s'

proxies = {
    'http': 'http://127.0.0.1:1087',
    'https': 'http://127.0.0.1:1087',
}

# Alternative: set the proxy once on a session, so you don't have to pass
# the proxies argument on every requests call
# s = requests.session()
# s.proxies = {'http': '121.193.143.249:80'}


def get_shared_data(html=''):
    """get window._sharedData from the page, return the dict parsed from it"""
    if html:
        target_text = html
    else:
        header = generate_header()
        response = requests.get(BASE_URL, proxies=proxies, headers=header)
        target_text = response.text
    regx = r"<script.*?>.*?_sharedData\s*=\s*(.*?);</script>"
    match_result = re.search(regx, target_text, re.S)
    data = json.loads(match_result.group(1))
    return data


# def get_rhx_gis():
#     """get the rhx_gis value from sharedData"""
#     share_data = get_shared_data()
#     return share_data['rhx_gis']


def get_account(user_name):
    """get the account info by username"""
    url = get_account_link(user_name)
    header = generate_header()
    response = requests.get(url, headers=header, proxies=proxies)
    data = get_shared_data(response.text)
    account = resolve_account_data(data)
    return account


def get_media_by_user_id(user_id, count=50, max_id=''):
    """get media info by user id"""
    index = 0
    medias = []
    has_next_page = True
    while index <= count and has_next_page:
        # without separators, json.dumps puts a space after ':' and ',',
        # because its default separators are (', ', ': ')
        varibles = json.dumps({
            'id': str(user_id),
            'first': count,
            'after': str(max_id)
        }, separators=(',', ':'))
        url = get_account_media_link(varibles)
        header = generate_header()
        response = requests.get(url, headers=header, proxies=proxies)
        media_json_data = json.loads(response.text)
        media_raw_data = media_json_data['data']['user']['edge_owner_to_timeline_media']['edges']
        if not media_raw_data:
            return medias
        for item in media_raw_data:
            if index == count:
                return medias
            index += 1
            medias.append(general_resolve_media(item['node']))
        max_id = media_json_data['data']['user']['edge_owner_to_timeline_media']['page_info']['end_cursor']
        has_next_page = media_json_data['data']['user']['edge_owner_to_timeline_media']['page_info']['has_next_page']
    return medias


def get_media_by_url(media_url):
    response = requests.get(get_media_url(media_url), proxies=proxies, headers=generate_header())
    media_json = json.loads(response.text)
    return general_resolve_media(media_json['graphql']['shortcode_media'])


def get_account_media_link(varibles):
    return ACCOUNT_MEDIAS % urllib.parse.quote(varibles)


def get_account_link(user_name):
    return ACCOUNT_PAGE % user_name


def get_media_url(media_url):
    return media_url.rstrip('/') + '/?__a=1'


# def generate_instagram_gis(varibles):
#     rhx_gis = get_rhx_gis()
#     gis_token = rhx_gis + ':' + varibles
#     x_instagram_token = hashlib.md5(gis_token.encode('utf-8')).hexdigest()
#     return x_instagram_token


def generate_header(gis_token=''):
    # todo: if we have a session, add the session key:value to the header
    header = {
        'user-agent': USER_AGENT,
    }
    if gis_token:
        header['x-instagram-gis'] = gis_token
    return header


def general_resolve_media(media):
    res = {
        'id': media['id'],
        'type': media['__typename'][5:].lower(),
        'content': media['edge_media_to_caption']['edges'][0]['node']['text'],
        'title': 'title' in media and media['title'] or '',
        'shortcode': media['shortcode'],
        'preview_url': BASE_URL + '/p/' + media['shortcode'],
        'comments_count': media['edge_media_to_comment']['count'],
        'likes_count': media['edge_media_preview_like']['count'],
        'dimensions': 'dimensions' in media and media['dimensions'] or {},
        'display_url': media['display_url'],
        'owner_id': media['owner']['id'],
        'thumbnail_src': 'thumbnail_src' in media and media['thumbnail_src'] or '',
        'is_video': media['is_video'],
        'video_url': 'video_url' in media and media['video_url'] or ''
    }
    return res


def resolve_account_data(account_data):
    user = account_data['entry_data']['ProfilePage'][0]['graphql']['user']
    account = {
        'country': account_data['country_code'],
        'language': account_data['language_code'],
        'biography': user['biography'],
        'followers_count': user['edge_followed_by']['count'],
        'follow_count': user['edge_follow']['count'],
        'full_name': user['full_name'],
        'id': user['id'],
        'is_private': user['is_private'],
        'is_verified': user['is_verified'],
        'profile_pic_url': user['profile_pic_url_hd'],
        'username': user['username'],
    }
    return account


account = get_account('shaq')
result = get_media_by_user_id(account['id'], 56)
media = get_media_by_url('https://www.instagram.com/p/Bw3-Q2XhDMf/')
print(len(result))
print(result)
```

I've since packaged this up as a library!

