Simulating a login to Renren (人人网) with a Python crawler
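Simulated login: crawl the user information that is only available to a logged-in user.
Requirement 1: simulate a login to Renren.
Clicking the login button sends a POST request that carries the login information entered beforehand (username, password, captcha, ...). The captcha changes on every request.
Requirement 2: crawl the current user's information (the user information shown on the personal home page).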
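A property of the HTTP/HTTPS protocol: it is stateless.
Why the expected page data was not returned:
When the second request (the one for the personal home page) is sent, the server does not know that it comes from a logged-in client.
Cookie: lets the server record the client's state.
Manual handling: copy the cookie value from a packet-capture tool and put it into headers. (Not recommended)
Automatic handling:
- Where does the cookie value come from?
- The server creates it in response to the simulated login POST request.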
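Session object:
Purpose:
- It can send requests.
- If a cookie is produced during a request, the cookie is stored in the session object and sent automatically with later requests.
Usage:
- Create a session object: session = requests.Session()
- Send the simulated login POST request through the session object (the cookie is then stored in the session)
- Send the GET request for the personal home page through the session object (the cookie is sent along)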
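The difference between a client that forgets cookies and one that keeps them can be tried offline. The sketch below does not reproduce Renren's behavior. It assumes a hypothetical local server whose /login path sets a cookie named sid and whose /profile path checks for it, uses GET for both requests to stay short, and swaps in the standard library's CookieJar for requests.Session, which plays the same role.

```python
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a site with a login and a profile page."""

    def do_GET(self):
        self.send_response(200)
        if self.path == "/login":
            # The server creates the cookie when the login request arrives.
            self.send_header("Set-Cookie", "sid=abc123; Path=/")
            body = b"logged in"
        elif "sid=abc123" in self.headers.get("Cookie", ""):
            # The cookie is the only thing that marks this request as logged in.
            body = b"profile of the logged-in user"
        else:
            body = b"please log in"
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(opener, url):
    with opener.open(url) as resp:
        return resp.read().decode()


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port
no_proxy = urllib.request.ProxyHandler({})  # talk to the local server directly

# Client 1: no cookie storage, so the login is forgotten on the next request.
plain = urllib.request.build_opener(no_proxy)
fetch(plain, base + "/login")
without_cookie = fetch(plain, base + "/profile")

# Client 2: a cookie jar stores Set-Cookie and sends it back automatically.
jar = CookieJar()
keeper = urllib.request.build_opener(
    no_proxy, urllib.request.HTTPCookieProcessor(jar))
fetch(keeper, base + "/login")
with_cookie = fetch(keeper, base + "/profile")

server.shutdown()
server.server_close()
print(without_cookie)
print(with_cookie)
```

The first client is told to log in again even though it just visited /login, which is the stateless behavior described above; the second client gets the profile page because the jar carried the cookie for it.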
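1. Send a request to http://www.renren.com/ and get the source of the login page.
2. Locate the captcha image in the page, read the src attribute of its img tag, send a GET request to that URL, and save the captcha image locally. The saved image is later recognized with the Chaojiying (超级鹰) captcha-solving platform.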
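Step 2 can be tried without network access. The sketch below is only an illustration: it uses the standard library's HTMLParser instead of lxml, the sample HTML and the /captcha/demo.jpg path are made up, and only the id verifyPic_login comes from the XPath used in the scripts further down. It also resolves a relative src against the page URL, which matters if the site does not emit an absolute address.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcFinder(HTMLParser):
    """Remembers the src of the first <img> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.src = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("id") == self.target_id and self.src is None:
            self.src = attrs.get("src")


def find_captcha_url(page_url, page_text, img_id="verifyPic_login"):
    """Return the absolute captcha URL, or None if the image is not in the page."""
    finder = ImgSrcFinder(img_id)
    finder.feed(page_text)
    if finder.src is None:
        return None
    # urljoin leaves absolute URLs untouched and resolves relative ones.
    return urljoin(page_url, finder.src)


# Hypothetical page fragment, not the real Renren markup.
sample_html = '<form><img id="verifyPic_login" src="/captcha/demo.jpg"></form>'
captcha_url = find_captcha_url("http://www.renren.com/", sample_html)
print(captcha_url)
```

Returning None when the image is missing avoids the IndexError that indexing an empty XPath result with [0] would raise.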
3. Click the login button and capture the traffic in the browser: the browser sends a POST request to http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=202112910495. Inspect the response headers of that captured request for set-cookie. If it is present, the server created a session for the client during this request and returned a cookie for the client to store.
set-cookie is indeed present, so the requests we send with the requests module after the simulated login must carry the cookie as well. How does the cookie get into those requests?
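To see what a set-cookie response header turns into on the client side, the following sketch parses two header values with the standard library. The cookie names id and xnsid appear in the captured cookie used later in this article; the attributes (Path, Domain, HttpOnly) are assumptions added for illustration, not values observed from Renren.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie values as they might appear in a captured response.
set_cookie_headers = [
    "id=974713149; Path=/; Domain=.renren.com",
    "xnsid=364172ac; Path=/; HttpOnly",
]


def cookies_from_set_cookie(header_values):
    """Collect name/value pairs from a list of Set-Cookie header values."""
    cookies = {}
    for value in header_values:
        parsed = SimpleCookie()
        parsed.load(value)
        for name, morsel in parsed.items():
            # Attributes such as Path or Domain stay on the morsel;
            # only the name and value go back to the server.
            cookies[name] = morsel.value
    return cookies


cookies = cookies_from_set_cookie(set_cookie_headers)
# This is the Cookie request header a client would send on the next request.
cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
print(cookie_header)
```

A session object does this bookkeeping automatically, which is why the second approach below never has to touch the header by hand.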
Obtain the cookie manually from the packet-capture tool, put it into the headers of the requests call, and pass headers to the request method. (Not recommended)
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
    'Cookie': 'xxxxxxxxx'
}
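If the cookie is copied by hand anyway, it can help to split the raw string into name/value pairs so that individual cookies can be inspected or replaced. The helper below is a hypothetical convenience, not part of the original scripts; the sample string reuses three pairs from the captured cookie shown later in this article.

```python
def cookie_string_to_dict(raw_cookie):
    """Split a 'name=value; name2=value2' string into a dict."""
    cookies = {}
    for pair in raw_cookie.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing ';'
        # Split at the first '=' only, because values may contain '='.
        name, _, value = pair.partition("=")
        cookies[name] = value
    return cookies


raw = "id=974713149; xnsid=364172ac; loginfrom=syshome"
cookies = cookie_string_to_dict(raw)
print(cookies)
```

The resulting dict can be passed to requests.get(url, headers=headers, cookies=cookies) instead of writing a 'Cookie' entry into headers.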
Create a session object and send the requests through it, because the session stores and sends cookies automatically. (Recommended)
session = requests.Session()
page_text = session.get(url=url, headers=headers).text
......
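4. Capturing the login traffic shows a request to http://www.renren.com/974713149; its response is the home page shown after a successful login. Send a request to this URL, and take care to set the User-Agent, Referer, and Cookie request headers.
5. Send a GET request to http://www.renren.com/974713149/profile to get the source of the personal home page.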
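Code demonstration:
1. Obtain the cookie manually from the packet-capture tool, put it into the headers of the requests call, and pass headers to the request method. (Not recommended)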
import requests
from lxml import etree
from hashlib import md5


def getCodeText(userName, password, appId, imgUrl):
    """Recognize the captcha image stored at the local path imgUrl with Chaojiying."""

    class Chaojiying_Client(object):

        def __init__(self, username, password, soft_id):
            self.username = username
            password = password.encode('utf8')
            # Only the MD5 hex digest of the password is sent (as 'pass2').
            self.password = md5(password).hexdigest()
            self.soft_id = soft_id
            self.base_params = {
                'user': self.username,
                'pass2': self.password,
                'softid': self.soft_id,
            }
            self.headers = {
                'Connection': 'Keep-Alive',
                'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
            }

        def PostPic(self, im, codetype):
            """im: image bytes; codetype: captcha type, see http://www.chaojiying.com/price.html"""
            params = {'codetype': codetype}
            params.update(self.base_params)
            files = {'userfile': ('ccc.jpg', im)}
            r = requests.post('http://upload.chaojiying.net/Upload/Processing.php',
                              data=params, files=files, headers=self.headers)
            return r.json()

        def ReportError(self, im_id):
            """im_id: image ID of the wrongly recognized captcha"""
            params = {'id': im_id}
            params.update(self.base_params)
            r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php',
                              data=params, headers=self.headers)
            return r.json()

    chaojiying = Chaojiying_Client(userName, password, appId)
    with open(imgUrl, 'rb') as fp:
        im = fp.read()
    # 1902 is the captcha type code passed to Chaojiying.
    return chaojiying.PostPic(im, 1902)


# The Cookie value was copied by hand from the packet-capture tool.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
    'Referer': 'http://www.renren.com/SysHome.do',
    'Cookie': 'anonymid=klgdsqz5n7c6dn; depovince=ZGQT; _r01_=1; JSESSIONID=abcqWHDNhNOVf95ntfjFx; taihe_bi_sdk_uid=926da97ed7bdff5fc3ece47fdd554b0b; taihe_bi_sdk_session=ffa92a5a812142ba8dac302676d881cd; ick_login=426dff64-6952-4319-8c8f-96ea6f498550; first_login_flag=1; ln_uact=910456393@qq.com; ln_hurl=http://hdn.xnimg.cn/photos/hdn421/205/2035/h_main_9aN0_0c1b00037b06195a.jpg; wp_fold=0; jebecookies=c2363801-e587-4f54-8566-24b86aa22659|||||; _de=B3D043F455F38852340E4CEC836F3769696BF75400CE19CC; p=2e69883207d99e253471f621d896037d9; t=1f917c44eaa1178b8bd357e96d7346fc9; societyguester=1f917c44eaa1178b8bd357e96d7346fc9; id=974713149; xnsid=364172ac; loginfrom=syshome'
}

# Step 1: fetch the login page.
url = 'http://www.renren.com/'
page_text = requests.get(url=url, headers=headers).text

# Step 2: locate the captcha image, download it, and save it locally.
tree = etree.HTML(page_text)
img_url = tree.xpath('//*[@id="verifyPic_login"]/@src')[0]
print(img_url)
img_data = requests.get(img_url, headers=headers).content
print(img_data)
with open('./code.jpg', 'wb') as fp:
    fp.write(img_data)

# Replace the placeholders with your Chaojiying username, password, and software ID;
# the last argument is the local path of the saved captcha image.
result = getCodeText('username', 'password', 'appid', './code.jpg')
print(result['pic_str'])

# Step 4: request the home page shown after login (user ID as in step 4 above).
login_url = 'http://www.renren.com/974713149'
login_page_text = requests.get(url=login_url, headers=headers).text
with open('renren.html', 'w', encoding='utf-8') as fp:
    fp.write(login_page_text)

# Step 5: request the personal home page.
detail_url = 'http://www.renren.com/974713149/profile'
detail_page_text = requests.get(url=detail_url, headers=headers).text
with open('zep.html', 'w', encoding='utf-8') as fp:
    fp.write(detail_page_text)
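The script saves the post-login home page locally as renren.html and the personal home page as zep.html.
2. Create a session object and send the requests through it, because the session stores and sends cookies automatically. (Recommended)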
import requests
from lxml import etree
from hashlib import md5


def getCodeText(userName, password, appId, imgUrl):
    """Recognize the captcha image stored at the local path imgUrl with Chaojiying."""

    class Chaojiying_Client(object):

        def __init__(self, username, password, soft_id):
            self.username = username
            password = password.encode('utf8')
            # Only the MD5 hex digest of the password is sent (as 'pass2').
            self.password = md5(password).hexdigest()
            self.soft_id = soft_id
            self.base_params = {
                'user': self.username,
                'pass2': self.password,
                'softid': self.soft_id,
            }
            self.headers = {
                'Connection': 'Keep-Alive',
                'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
            }

        def PostPic(self, im, codetype):
            """im: image bytes; codetype: captcha type, see http://www.chaojiying.com/price.html"""
            params = {'codetype': codetype}
            params.update(self.base_params)
            files = {'userfile': ('ccc.jpg', im)}
            r = requests.post('http://upload.chaojiying.net/Upload/Processing.php',
                              data=params, files=files, headers=self.headers)
            return r.json()

        def ReportError(self, im_id):
            """im_id: image ID of the wrongly recognized captcha"""
            params = {'id': im_id}
            params.update(self.base_params)
            r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php',
                              data=params, headers=self.headers)
            return r.json()

    chaojiying = Chaojiying_Client(userName, password, appId)
    with open(imgUrl, 'rb') as fp:
        im = fp.read()
    # 1902 is the captcha type code passed to Chaojiying.
    return chaojiying.PostPic(im, 1902)


# Every request below goes through this session, so cookies set by the
# server are stored and sent back automatically.
session = requests.Session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
    'Referer': 'http://www.renren.com/SysHome.do',
}

# Step 1: fetch the login page.
url = 'http://www.renren.com/'
page_text = session.get(url=url, headers=headers).text

# Step 2: locate the captcha image, download it, and save it locally.
tree = etree.HTML(page_text)
img_url = tree.xpath('//*[@id="verifyPic_login"]/@src')[0]
print(img_url)
img_data = session.get(img_url, headers=headers).content
print(img_data)
with open('./code.jpg', 'wb') as fp:
    fp.write(img_data)

# Replace the placeholders with your Chaojiying username, password, and software ID;
# the last argument is the local path of the saved captcha image.
result = getCodeText('username', 'password', 'appid', './code.jpg')
print(result['pic_str'])

# Step 3: send the login POST request; the form fields were copied from the captured request.
login_post_url = 'http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=202112910495'
data = {
    'email': '910451393@qq.com',
    'icode': result['pic_str'],
    'origURL': 'http://www.renren.com/home',
    'domain': 'renren.com',
    'key_id': '1',
    'captcha_type': 'web_login',
    'password': '346d050fe82d3cfe090210864d73b65b5608bf90173371b3c10e7df6e533',
    'rkey': '3a7cdde0b042c1ba11169c3378fd5b',
    'f': 'http%3A%2F%2Fwww.renren.com%2F974713149%2Fnewsfeed%2Fphoto'
}
response = session.post(url=login_post_url, headers=headers, data=data)
print(response.text)

# Step 4: request the home page shown after login; the session sends the cookie.
login_url = 'http://www.renren.com/974713149'
login_page_text = session.get(url=login_url, headers=headers).text
with open('renren.html', 'w', encoding='utf-8') as fp:
    fp.write(login_page_text)

# Step 5: request the personal home page.
detail_url = 'http://www.renren.com/974713149/profile'
detail_page_text = session.get(url=detail_url, headers=headers).text
with open('zep.html', 'w', encoding='utf-8') as fp:
    fp.write(detail_page_text)
The personal home page is saved locally as zep.html.