Crawling the Weibo Index (微指数)
While writing a crawler for the Weibo Index, I ran into a few snags. The material I found online turned out to contain errors, so after some investigation, here is a collection method that actually works.
1. The collection page: after searching for a keyword, the page redirects to the index page, which displays the index for that keyword.
2. Capture the traffic with Fiddler to locate the data source; the data is loaded as JSON.
3. Fetch the JSON data. Analyzing how the data loads shows that the response is keyed off the Referer address in the request headers. Parsing the redirected address makes it clear that wid and wname are what we need to obtain first; together they form the address we actually request.
4. Code verification; these steps also match what you can find online:
a. get the wid for the search keyword
b. build the URL for the JSON data
c. request the data
5. Verifying with the code below: no matter what keyword you search, it only returns the index for 歡樂(lè)頌, which shows this method does not work.
# Python 3 port of the original Python 2 script, which relied on the
# reload(sys) / sys.setdefaultencoding("utf-8") hack; Python 3 does not need it.
import requests
from urllib.parse import quote

class xl():
    def pc(self, name):
        url_name = quote(name)
        headers = {
            'Host': 'data.weibo.com',
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:46.0) Gecko/20100101 Firefox/46.0',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
            'Accept-Encoding': 'gzip, deflate',
            'Content-Type': 'application/x-www-form-urlencoded',
            'X-Requested-With': 'XMLHttpRequest',
            'Referer': 'http://data.weibo.com/index/hotword?wname=' + url_name,
            'Cookie': 'UOR=www.baidu.com,data.weibo.com,www.baidu.com; SINAGLOBAL=1213237876483.9214.1464074185942; ULV=1464183246396:2:2:2:3463179069239.6826.1464183246393:1464074185944; DATA=usrmdinst_12; _s_tentry=www.baidu.com; Apache=3463179069239.6826.1464183246393; WBStore=8ca40a3ef06ad7b2|undefined; PHPSESSID=3mn5oie7g3cm954prqan14hbg5',
            'Connection': 'keep-alive'
        }
        r = requests.get("http://data.weibo.com/index/ajax/getchartdata?month=default&__rnd=1464188164238", headers=headers)
        return r.text

x = xl()
print(x.pc("歡樂(lè)頌"))
6. The fix: go back to the search page and search the keyword again in the search box (the red box in the original screenshot) to refresh the request.
7. The actual data loading happens in the four steps below; for 歡樂(lè)頌, the real request URL is:
http://data.weibo.com/index/ajax/getchartdata?wid=1091324230349&sdate=2017-05-23&edate=2017-06-22&__rnd=1498202175662
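For reference, that URL can be assembled from its query parameters with urlencode rather than string concatenation (the values are taken from the URL above; __rnd looks like a millisecond timestamp used for cache busting, which is an assumption):

```python
# Rebuild the getchartdata URL from its query parameters.
from urllib.parse import urlencode

base = "http://data.weibo.com/index/ajax/getchartdata"
params = {
    "wid": "1091324230349",
    "sdate": "2017-05-23",
    "edate": "2017-06-22",
    "__rnd": "1498202175662",  # assumed to be a ms timestamp for cache busting
}
url = base + "?" + urlencode(params)
print(url)
```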
8. Once you have the dates and the wid from that URL, you can fetch the data; no cookie is required.
# Ported to Python 3 (urllib.quote -> urllib.parse.quote, print()).
import requests
import cProfile
from urllib.parse import quote

def search_name(name):
    # Step 1: look up the wid for the search keyword.
    url_format = "http://data.weibo.com/index/ajax/hotword?word={}&flag=nolike&_t=0"
    cookie_header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36",
        "Referer": "http://data.weibo.com/index?sudaref=www.google.com"
    }
    urlname = quote(name)
    first_requests = url_format.format(urlname)
    codes = requests.get(first_requests, headers=cookie_header).json()
    ids = codes["data"]["id"]
    header = {
        "Connection": "keep-alive",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept": "*/*",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Referer": "http://data.weibo.com/index/hotword?wid={}&wname={}".format(ids, urlname),
        "Content-Type": "application/x-www-form-urlencoded",
        "Host": "data.weibo.com"
    }
    # Step 2: fetch the date range (sdate / edate).
    date_url = "http://data.weibo.com/index/ajax/getdate?month=1&__rnd=1498190033389"
    dc = requests.get(date_url, headers=header).json()
    edate, sdate = dc["edate"], dc["sdate"]
    # Step 3: request the chart data itself.
    codes = requests.get("http://data.weibo.com/index/ajax/getchartdata?wid={}&sdate={}&edate={}"
                         .format(ids, sdate, edate), headers=header).json()
    return codes

if __name__ == "__main__":
    cProfile.run('search_name("天津")')
    # print(search_name("天涯"))
9. If there are any mistakes, corrections are welcome.
總結(jié)
- 上一篇: sourceTree 文件的丢弃和删除
- 下一篇: 人物志---刁光斗(《大宋提刑官》中的反