Part 3: The scrapy crawler framework: simulated login with scrapy
Simulated login with scrapy
Learning goals:
- Apply: the cookies parameter of the request object
- Understand: the role of the start_requests function
- Apply: constructing and sending POST requests
1. Review of earlier approaches to simulated login
1.1 How does the requests module simulate login?
- Request the page while carrying cookies directly
- Find the login URL, send a POST request, and store the returned cookie

Both approaches are sketched below.
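A minimal requests sketch of the two approaches; the URLs, field names, and cookie values are placeholders, not anything from the original:

import requests

# approach 1: carry cookies copied from a logged-in browser session
cookies = {'session_id': '...'}
resp = requests.get('https://example.com/profile', cookies=cookies)

# approach 2: POST the credentials; a Session stores the returned cookie automatically
session = requests.Session()
session.post('https://example.com/login', data={'username': 'u', 'password': 'p'})
resp = session.get('https://example.com/profile')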
1.2 How does selenium simulate login?
Locate the corresponding input tags, type in the text, and click the login button, as sketched below.
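A minimal selenium sketch; the URL and element locators are assumptions for illustration:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://example.com/login')
# fill in the login form and submit it
driver.find_element(By.NAME, 'username').send_keys('user')
driver.find_element(By.NAME, 'password').send_keys('pass')
driver.find_element(By.CSS_SELECTOR, 'button[type="submit"]').click()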
1.3 Simulated login in scrapy
- Carry cookies directly
- Find the login URL, send a POST request, and store the cookie
2. Carrying cookies in scrapy to fetch pages that require login
Use cases:
- The cookie has a long expiry time, which is common on less standards-compliant sites
- You can fetch all the data you need before the cookie expires
- Working together with another program: for example, use selenium to log in, save the resulting cookies locally, and have scrapy read the local cookies before sending its requests (see the sketch after this list)
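A minimal sketch of that selenium-to-scrapy hand-off; the login steps and the cookies.json file name are illustrative choices, not part of the original:

import json
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com/login')
# ... fill in the form and click the login button here ...
# dump the logged-in cookies so a scrapy spider can load them later
with open('cookies.json', 'w') as f:
    json.dump({c['name']: c['value'] for c in driver.get_cookies()}, f)
driver.quit()

A spider can then load cookies.json inside start_requests and pass the dict to the cookies parameter, exactly as in section 2.2 below.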
2.1 Implementation: overriding scrapy's start_requests method
In scrapy, the URLs in start_urls are handled by start_requests, which is implemented as follows:
# excerpt from scrapy's Spider class; warnings, method_is_overridden,
# and Request come from scrapy's own module-level imports
def start_requests(self):
    cls = self.__class__
    if method_is_overridden(cls, Spider, 'make_requests_from_url'):
        warnings.warn(
            "Spider.make_requests_from_url method is deprecated; it "
            "won't be called in future Scrapy releases. Please "
            "override Spider.start_requests method instead (see %s.%s)." % (
                cls.__module__, cls.__name__
            ),
        )
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
    else:
        for url in self.start_urls:
            yield Request(url, dont_filter=True)
Accordingly, if the URLs in start_urls can only be accessed after logging in, you need to override the start_requests method and attach the cookies manually inside it.
2.2 Logging in to github by carrying cookies
Test account: noobpythoner zhoudawei123
import scrapy

class Git1Spider(scrapy.Spider):
    name = 'git1'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/zep03']

    def start_requests(self):  # must be named start_requests to override the base method
        url = self.start_urls[0]
        # cookie string copied from a logged-in browser session (truncated in the original)
        temp = '_octo=GH1.1.838083519.1594559947; _ga=GA1.2.1339438892.1594559990; _gat=1; tz=Asia%2FShanghai; _device_id=4d76e456d7a0c1e69849de2655198d40; has_recent_activity=1; user_session=e6aK8ODfFzCDBmDG72FxcGE17CQ3FiL23o; __Host-user_session_same_site=e6aK8ODfFzCDBmDTZMReW2g3PhRJEG72FxcGE17CQ3FiL23o; logged_in=yes; dotc'
        # turn 'name=value; name=value' into a dict; split on '; ' so keys carry no leading space
        cookies = {data.split('=')[0]: data.split('=')[-1] for data in temp.split('; ')}
        print(cookies)
        yield scrapy.Request(
            url=url,
            callback=self.parse,
            cookies=cookies,
        )

    def parse(self, response):
        # print the page title to confirm we reached the logged-in page
        print(response.xpath('/html/head/title/text()').extract_first())
A second example of the same technique, which verifies the login by searching the response for the username:

import scrapy
import re

class Login1Spider(scrapy.Spider):
    name = 'login1'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/NoobPythoner']

    def start_requests(self):
        cookies_str = '...'  # paste the cookie string captured from the browser here
        cookies_dict = {i.split('=')[0]: i.split('=')[1] for i in cookies_str.split('; ')}
        yield scrapy.Request(
            self.start_urls[0],
            callback=self.parse,
            cookies=cookies_dict,
        )

    def parse(self, response):
        # login succeeded if the username appears in the response body
        result_list = re.findall(r'noobpythoner|NoobPythoner', response.body.decode())
        print(result_list)
Notes:
- In scrapy, cookies cannot be placed in headers; when constructing a request there is a dedicated cookies parameter, which accepts cookies in dict form
- Configure the ROBOTS protocol (ROBOTSTXT_OBEY) and USER_AGENT in settings, as shown below
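For example, in settings.py (the user agent string here is just an illustration):

# settings.py
ROBOTSTXT_OBEY = False  # otherwise robots.txt may block the login pages
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'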
3. Sending POST requests with scrapy.Request
We know that a POST request can be sent with scrapy.Request() by specifying its method and body parameters (sketched below); in practice, however, scrapy.FormRequest() is usually used to send POST requests.
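A minimal sketch of the raw scrapy.Request route; the endpoint, payload, and spider name are placeholders, not anything from the original:

import json
import scrapy

class PostDemoSpider(scrapy.Spider):
    # hypothetical spider: sends a raw POST by setting method and body
    name = 'post_demo'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        yield scrapy.Request(
            url='https://example.com/api/login',  # placeholder endpoint
            method='POST',
            body=json.dumps({'login': 'user', 'password': 'pass'}),
            headers={'Content-Type': 'application/json'},
            callback=self.after_login,
        )

    def after_login(self, response):
        print(response.status)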
3.1 Sending a POST request
Note: scrapy.FormRequest() can send both form and ajax requests; for further reading see https://www.jb51.net/article/146769.htm
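As a related convenience, scrapy also provides scrapy.FormRequest.from_response(), which reads the form on the page and pre-fills its hidden fields automatically; a minimal sketch (the credentials and spider name are placeholders):

import scrapy

class FromResponseSpider(scrapy.Spider):
    name = 'from_response_demo'
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # from_response copies the form's hidden inputs and overrides only what we pass
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'login': 'your_username', 'password': 'your_password'},
            callback=self.after_login,
        )

    def after_login(self, response):
        print(response.xpath('/html/head/title/text()').extract_first())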
3.1.1 Approach
- Find the POST URL: click the login button while capturing traffic, and locate the URL as https://github.com/session
- Work out the pattern of the request body: analyzing the POST request body shows that every parameter it contains appears in the previous response
- Check whether the login succeeded: request the personal homepage and check whether it contains the username
3.1.2 Code implementation:
import scrapy

class Git2Spider(scrapy.Spider):
    name = 'git2'
    allowed_domains = ['github.com']
    start_urls = ['http://github.com/login']

    def parse(self, response):
        # pull the hidden form fields out of the login page
        token = response.xpath('//input[@name="authenticity_token"]/@value').extract_first()
        timestamp_secret = response.xpath('//input[@name="timestamp_secret"]/@value').extract_first()
        timestamp = response.xpath('//input[@name="timestamp"]/@value').extract_first()
        required_field_name = response.xpath('//*[@id="login"]/form/div[4]/input[6]/@name').extract_first()

        post_data = {
            "commit": "Sign in",
            "authenticity_token": token,
            "ga_id": "1029919665.1594130837",
            "login": "your_username",    # fill in your account
            "password": "your_password",  # fill in your password
            "webauthn-support": "supported",
            "webauthn-iuvpaa-support": "unsupported",
            "return_to": "",
            required_field_name: "",
            "timestamp": timestamp,
            "timestamp_secret": timestamp_secret,
        }
        print(post_data)
        yield scrapy.FormRequest(
            url='https://github.com/session',
            callback=self.after_login,
            formdata=post_data,
        )

    def after_login(self, response):
        # request the personal homepage to verify the login
        yield scrapy.Request('https://github.com/zep03', callback=self.check_login)

    def check_login(self, response):
        print(response.xpath('/html/head/title/text()').extract_first())
Another implementation that reads the form fields with XPath and posts them the same way:

import scrapy
import re

class Login2Spider(scrapy.Spider):
    name = 'login2'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
        utf8 = response.xpath("//input[@name='utf8']/@value").extract_first()
        commit = response.xpath("//input[@name='commit']/@value").extract_first()

        yield scrapy.FormRequest(
            "https://github.com/session",
            formdata={
                "authenticity_token": authenticity_token,
                "utf8": utf8,
                "commit": commit,
                "login": "noobpythoner",
                "password": "***",
            },
            callback=self.parse_login,
        )

    def parse_login(self, response):
        # login succeeded if the username appears in the response
        ret = re.findall(r"noobpythoner|NoobPythoner", response.text)
        print(ret)
Tip
Setting COOKIES_DEBUG = True in settings.py lets you watch the cookie hand-off in the terminal, as shown below.
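For example:

# settings.py
COOKIES_DEBUG = True  # scrapy then logs the cookies sent and received for each request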
Summary
- The URLs in start_urls are handed to start_requests for processing; override start_requests when necessary
- Logging in by carrying cookies directly: cookies can only be passed through the cookies parameter
- scrapy.Request() can send POST requests