
Crawler: Scraping Instagram Information

Published: 2023/12/20

This is an introduction to a crawler for Instagram.

Source code on GitHub (code and scraped data): https://github.com/hilqiqi0/crawler/tree/master/simple/instagram


Each scraped record is saved in this format: { image URL, comment count, like count, post text }
e.g.: {
        "img_url": "https://scontent-sin6-2.cdninstagram.com/vp/0e345bfd870f2fb489f091ed5507397f/5C1A8CB6/t51.2885-15/e35/40949123_1104283529724860_6046749716819964824_n.jpg",
        "comment_count": 12932,
        "like_count": 1321753,
        "text": "Featured photo by @maomay__\\nWeekend Hashtag Project: #WHPperspective\\nThis weekend, the goal is to take photos and videos from a different point of view, as in this featured photo by Mao May (@maomay__). Here are some tips to get you started:\\nCapture a familiar subject or scene from an unexpected angle. Get up close and let a face cover the entire frame, or make a puppy look large by shooting from ground-level as she stares down. Find a high vantage point to show the wider context of a festival scene or bustling market.\\nUse geometry to your advantage. Look for graphic lines — in bridges or telephone wires — that converge to a vanishing point in your composition. Find a new way to capture patterns in everyday places, like the wheels of bicycles lined up in a rack, or symmetrical bricks in an unruly garden.\\nPlay an eye trick. Defy gravity with simple editing, like rotating the frame. Recruit a friend to make a well-timed leap, that, when rotated, looks like they’re flying through air. Or turn a dandelion into a human-size parasol by playing with scale and distance.\\n\\nPROJECT RULES: Please add the #WHPperspective hashtag only to photos and videos shared over this weekend and only submit your own visuals to the project. If you include music in your video submissions, please only use music to which you own the rights. Any tagged photo or video shared over the weekend is eligible to be featured next week."
    }
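The record format above maps directly onto a Python dict. The article's code writes `str(samples)` to `samples.txt`, which stores a Python repr rather than JSON. As an alternative, here is a minimal sketch that builds one record and serializes a list of records with the standard `json` module. The helper names `make_sample`, `dump_samples` and `load_samples` and the placeholder values are assumptions for illustration, not part of the original code.

```python
import json


def make_sample(img_url, comment_count, like_count, text):
    """Build one record with the four fields described in the article."""
    return {
        "img_url": img_url,
        "comment_count": comment_count,
        "like_count": like_count,
        "text": text,
    }


def dump_samples(samples):
    """Serialize a list of records to JSON, keeping non-ASCII text readable."""
    return json.dumps(samples, ensure_ascii=False, indent=4)


def load_samples(serialized):
    """Parse the JSON text produced by dump_samples back into a list of dicts."""
    return json.loads(serialized)


# Placeholder values, not real scraped data
samples = [make_sample("https://example.com/photo.jpg", 12, 345, "first line\nsecond line")]
serialized = dump_samples(samples)
restored = load_samples(serialized)
print(serialized)
```

Saving JSON instead of a repr means the file can be read back with `json.loads` without any custom parsing.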

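Technical challenges: 1. A proxy is needed to get past the firewall. 2. Until around August or September, Instagram had no anti-scraping measures; after that, an anti-scraping check was added to its AJAX requests.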

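Anti-scraping algorithm (an 'X-Instagram-GIS' field is added to the request headers):
    1. Combine rhx_gis and queryVariables (the code joins them with ":")
    2. Take the MD5 hash of the combined string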
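The two steps above can be sketched in a few lines. This follows what the article's code does (join with a colon, then MD5), but the function name `instagram_gis` and all values below are placeholders chosen for illustration; the real `rhx_gis` has to be read from the profile page.

```python
import hashlib


def instagram_gis(rhx_gis, query_variables):
    """Compute the X-Instagram-GIS value as described in the article."""
    # Step 1: combine rhx_gis and queryVariables, separated by a colon
    joined = rhx_gis + ":" + query_variables
    # Step 2: MD5-hash the combined string and return the hex digest
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Placeholder inputs (hypothetical, not real Instagram values)
rhx_gis = "0123abcd"
query_variables = '{"id":"123","first":12,"after":"abc"}'

request_headers = {"X-Instagram-GIS": instagram_gis(rhx_gis, query_variables)}
print(request_headers)
```

The signed string must match the `variables` sent in the query exactly, so any change in key order or whitespace produces a different hash.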


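Notes on the code and how to modify it:
    0. By default it stops once more than 120 samples have been collected; to download more, remove the count check or raise the threshold.
    1. The code uses Lantern as the proxy tool, with proxy port 52212; if you use a different tool, change the proxy port.
    2. The code scrapes the profile of the "instagram" account on https://www.instagram.com; to scrape another account, change the account name.
    3. The code is only a test and has not yet been modularized or encapsulated.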
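The three adjustable values in these notes (sample limit, proxy port, account name) could be gathered in one place. The following is only a sketch under that assumption; the class `CrawlerConfig` and its attribute names are hypothetical and do not appear in the original code, and `"some_account"` is a made-up account name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlerConfig:
    """Hypothetical container for the settings mentioned in the notes."""
    username: str = "instagram"  # account whose profile is scraped
    proxy_port: int = 52212      # Lantern's port in the article
    max_samples: int = 120       # threshold used by the count check

    @property
    def profile_url(self):
        return f"https://www.instagram.com/{self.username}/"

    @property
    def proxies(self):
        # Same shape as the dict passed to requests via proxies=
        address = f"http://127.0.0.1:{self.proxy_port}"
        return {"http": address, "https": address}

    def limit_reached(self, count):
        # Mirrors the article's check: len(samples) > 120
        return count > self.max_samples


default_config = CrawlerConfig()
other_config = CrawlerConfig(username="some_account", proxy_port=1080, max_samples=500)
print(default_config.profile_url, default_config.proxies)
print(other_config.profile_url, other_config.proxies)
```

When the account name changes, the `Referer` header in the original code points at the old profile URL and would presumably need the same change.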


On the workflow and analysis: 1. See the references at the end of the article. 2. Read the code directly.

import re
import json
import time
import random
import hashlib

import requests
from pyquery import PyQuery as pq

url_base = 'https://www.instagram.com/instagram/'
uri = 'https://www.instagram.com/graphql/query/?query_hash=a5164aed103f24b03e7b7747a2d94e3c&variables=%7B%22id%22%3A%22{user_id}%22%2C%22first%22%3A12%2C%22after%22%3A%22{cursor}%22%7D'

headers = {
    'Connection': 'keep-alive',
    'Host': 'www.instagram.com',
    'Referer': 'https://www.instagram.com/instagram/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}

# Local proxy (Lantern, port 52212); change the port for other tools
proxy = {
    'http': 'http://127.0.0.1:52212',
    'https': 'http://127.0.0.1:52212'
}


def hashStr(strInfo):
    # MD5 hex digest of a string
    h = hashlib.md5()
    h.update(strInfo.encode("utf-8"))
    return h.hexdigest()


def get_html(url):
    try:
        response = requests.get(url, headers=headers, proxies=proxy)
        if response.status_code == 200:
            return response.text
        else:
            print('Error requesting page source, status code:', response.status_code)
    except Exception as e:
        print(e)
        return None


def get_json(headers, url):
    try:
        response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
        if response.status_code == 200:
            return response.json()
        else:
            print('Error requesting page JSON, status code:', response.status_code)
    except Exception as e:
        print(e)
        # Wait 60 to 100 seconds, then retry
        time.sleep(60 + float(random.randint(1, 4000)) / 100)
        return get_json(headers, url)


def get_samples(html):
    samples = []
    # Defaults in case the window._sharedData script is not found
    cursor, flag = None, False
    user_id = re.findall('"profilePage_([0-9]+)"', html, re.S)[0]
    GIS_rhx_gis = re.findall('"rhx_gis":"([0-9a-z]+)"', html, re.S)[0]
    print('user_id:' + user_id)
    print(GIS_rhx_gis)
    doc = pq(html)
    items = doc('script[type="text/javascript"]').items()
    for item in items:
        if item.text().strip().startswith('window._sharedData'):
            # Convert the content of window._sharedData to a dict
            # (the encoding argument of json.loads was removed in Python 3.9)
            js_data = json.loads(item.text()[21:-1])
            # The 12 images on the initial page
            edges = js_data["entry_data"]["ProfilePage"][0]["graphql"]["user"]["edge_owner_to_timeline_media"]["edges"]
            # Page information
            page_info = js_data["entry_data"]["ProfilePage"][0]["graphql"]["user"]["edge_owner_to_timeline_media"]['page_info']
            # Cursor of the next page, e.g.
            # AQCSnXw1JsoV6LPOD2Of6qQUY7HWyXRc_CBSMWB6WvKlseC-7ibKho3Em0PEG7_EP8vwoXw5zwzsAv_mNMR8yX2uGFZ5j6YXdyoFfdbHc6942w
            cursor = page_info['end_cursor']
            # Whether there is a next page
            flag = page_info['has_next_page']
            # Filter the nodes
            for edge in edges:
                # Skip videos (is_video is a boolean after JSON parsing,
                # so comparing it with the string "true" never matched)
                if edge['node']['is_video']:
                    continue
                time.sleep(1)
                # Filter the image information
                sample = {}
                if edge['node']['display_url']:
                    display_url = edge['node']['display_url']
                    # print(display_url)
                    sample["img_url"] = display_url
                    sample["comment_count"] = edge['node']['edge_media_to_comment']["count"]
                    sample["like_count"] = edge['node']['edge_liked_by']["count"]
                    print(sample["img_url"])
                    print(sample["comment_count"])
                    print(sample["like_count"])
                if edge['node']['shortcode']:
                    shortcode = edge['node']['shortcode']
                    # https://www.instagram.com/p/{shortcode}/?__a=1
                    textUrl = 'https://www.instagram.com/p/' + shortcode + '/?__a=1'
                    textRespose = get_json(headers, textUrl)
                    # print(textRespose)
                    # print(type(textRespose))
                    textDict = textRespose['graphql']['shortcode_media']['edge_media_to_caption']['edges'][0]['node']
                    # Strip the "{'text': '" prefix and "'}" suffix from the dict's string form
                    sample["text"] = str(textDict)[10:-2]
                    print(sample["text"])
                samples.append(sample)
            print(cursor, flag)

    # Request more data via AJAX
    while flag:
        url = uri.format(user_id=user_id, cursor=cursor)
        print(url)
        queryVariables = '{"id":"' + user_id + '","first":12,"after":"' + cursor + '"}'
        print(queryVariables)
        headers['X-Instagram-GIS'] = hashStr(GIS_rhx_gis + ":" + queryVariables)
        print(headers)
        js_data = get_json(headers, url)
        # print(js_data)
        infos = js_data['data']['user']['edge_owner_to_timeline_media']['edges']
        cursor = js_data['data']['user']['edge_owner_to_timeline_media']['page_info']['end_cursor']
        flag = js_data['data']['user']['edge_owner_to_timeline_media']['page_info']['has_next_page']
        # print(infos)
        for info in infos:
            if info['node']['is_video']:
                continue
            else:
                sample = {}
                display_url = info['node']['display_url']
                # print(display_url)
                sample["img_url"] = display_url
                sample["comment_count"] = info['node']['edge_media_to_comment']["count"]
                sample["like_count"] = info['node']['edge_media_preview_like']["count"]
                if info['node']['shortcode']:
                    time.sleep(1)
                    shortcode = info['node']['shortcode']
                    # https://www.instagram.com/p/{shortcode}/?__a=1
                    textUrl = 'https://www.instagram.com/p/' + shortcode + '/?__a=1'
                    textRespose = get_json(headers, textUrl)
                    # print(textRespose)
                    # print(type(textRespose))
                    textDict = textRespose['graphql']['shortcode_media']['edge_media_to_caption']['edges'][0]['node']
                    sample["text"] = str(textDict)[10:-2]
                    print(sample["img_url"])
                    print(sample["comment_count"])
                    print(sample["like_count"])
                    print(sample["text"])
                samples.append(sample)
        print(cursor, flag)
        # Return once more than 120 samples have been collected
        if len(samples) > 120:
            return samples
    return samples


def main():
    url = url_base
    html = get_html(url)
    samples = get_samples(html)
    # print(samples)
    with open("./samples.txt", "a", encoding='utf-8') as f:
        f.write(str(samples))


if __name__ == '__main__':
    start = time.time()
    main()
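The code keeps the percent-encoded `variables` in the `uri` template and the plain `queryVariables` string used for the hash as two separately hand-written strings. One possible way to keep them in sync is to build both from a single dict. This is a sketch, not the author's method; `build_query_variables` and `build_graphql_url` are hypothetical helper names, and the user id and cursor are placeholders.

```python
import json
from urllib.parse import quote

# query_hash copied from the uri template in the article's code
QUERY_HASH = "a5164aed103f24b03e7b7747a2d94e3c"


def build_query_variables(user_id, cursor, first=12):
    """Return the compact JSON string, e.g. {"id":"123","first":12,"after":"abc"}."""
    # Compact separators avoid the spaces json.dumps inserts by default
    return json.dumps({"id": user_id, "first": first, "after": cursor},
                      separators=(",", ":"))


def build_graphql_url(user_id, cursor, first=12):
    """Percent-encode the same variables string into the GraphQL query URL."""
    variables = build_query_variables(user_id, cursor, first)
    return ("https://www.instagram.com/graphql/query/"
            f"?query_hash={QUERY_HASH}&variables={quote(variables, safe='')}")


# Placeholder user id and cursor
variables = build_query_variables("123", "abc")
url = build_graphql_url("123", "abc")
print(variables)
print(url)
```

With this approach the string passed to the MD5 step and the string encoded into the URL come from the same call, so they cannot drift apart.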


Reference 1: https://www.jianshu.com/p/985c2b4e8f6c

Reference 2: https://blog.csdn.net/geng333abc/article/details/79403395
