Crawler Summary (5): Other Techniques

Published: 2025/3/15

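This post covers a few techniques that the earlier parts did not mention.

Simulated login

Inspecting the page source

Take the GitHub login page (https://github.com/login) as an example. The HTML source shows that the form contains a hidden authenticity_token value. It has to be fetched first and then submitted together with the username and password.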

<div class="auth-form p-3" id="login">
  <form accept-charset="UTF-8" action="/session" data-form-nonce="b2e0b5f779ddbb5dbf93b903a82e5fc5204da96b" method="post">
    <div style="margin:0;padding:0;display:inline">
      <input name="utf8" type="hidden" value="✓" />
      <input name="authenticity_token" type="hidden" value="MDOLdxNeNMPn2sjrj51G+v/yMYpikLru8QWiLI170WRME4UBfvGItiAhzZWFujZVUSoT7SFygFcjE8pMfRcMHQ==" />
    </div>
    <div class="auth-form-header"><h1>Sign in to GitHub</h1></div>
    <div id="js-flash-container"></div>
    <div class="auth-form-body mt-4">
      <label for="login_field">Username or email address</label>
      <input autocapitalize="off" autocorrect="off" autofocus="autofocus" class="form-control input-block" id="login_field" name="login" tabindex="1" type="text" />
      <label for="password">Password <a href="/password_reset" class="label-link">Forgot password?</a></label>
      <input class="form-control form-control input-block" id="password" name="password" tabindex="2" type="password" />
      <input class="btn btn-primary btn-block" data-disable-with="Signing in…" name="commit" tabindex="3" type="submit" value="Sign in" />
    </div>
  </form>
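
Outside of Scrapy, the hidden field can be picked out with the standard library alone. The following is a minimal sketch, assuming a shortened, made-up form modeled on the one above; HiddenInputParser and extract_hidden_fields are hypothetical names, not part of any library.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if attr.get("type") == "hidden" and "name" in attr:
            self.hidden[attr["name"]] = attr.get("value", "")


def extract_hidden_fields(html_text):
    """Return a dict of all hidden form fields found in html_text."""
    parser = HiddenInputParser()
    parser.feed(html_text)
    return parser.hidden


# Shortened, made-up sample; the token value is a placeholder.
SAMPLE = (
    '<form action="/session" method="post">'
    '<input name="utf8" type="hidden" value="x" />'
    '<input name="authenticity_token" type="hidden" value="TOKEN123==" />'
    '<input name="login" type="text" />'
    '</form>'
)

fields = extract_hidden_fields(SAMPLE)
print(fields["authenticity_token"])
```

Collecting every hidden field rather than only authenticity_token means the same helper still works if the site adds further hidden inputs.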

Overriding start_requests

First, make sure cookies are enabled:

    COOKIES_ENABLED = True

Then override the start_requests method:

    import logging

    from scrapy import Request

    # Override the spider's start_requests to issue a custom request;
    # when it succeeds, the callback is invoked.
    def start_requests(self):
        return [Request("https://github.com/login",
                        meta={'cookiejar': 1},
                        callback=self.post_login)]

    def post_login(self, response):
        # First grab the hidden form parameter authenticity_token
        authenticity_token = response.xpath(
            '//input[@name="authenticity_token"]/@value').extract_first()
        logging.info('authenticity_token=' + authenticity_token)

start_requests specifies a callback that retrieves the hidden form value authenticity_token. The Request also carries a cookiejar entry in its meta, which passes the cookie jar identifier on to the callback.
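
The cookiejar meta key can be thought of as selecting one of several independent cookie stores. The sketch below illustrates that idea in plain Python; it is not Scrapy's implementation, and SessionStore with its methods is a hypothetical name introduced only for this example.

```python
from collections import defaultdict


class SessionStore:
    """Keep one independent cookie dict per cookiejar id."""

    def __init__(self):
        self._jars = defaultdict(dict)

    def save(self, jar_id, cookies):
        # Merge cookies set by a response into the jar chosen by jar_id.
        self._jars[jar_id].update(cookies)

    def cookies_for(self, jar_id):
        # Return a copy so callers cannot mutate the stored jar by accident.
        return dict(self._jars[jar_id])


store = SessionStore()
store.save(1, {"session": "abc"})    # login response for jar 1
store.save(2, {"session": "xyz"})    # a second, unrelated session
store.save(1, {"logged_in": "yes"})  # later response, same jar

print(store.cookies_for(1))
print(store.cookies_for(2))
```

This is why each follow-up request passes response.meta['cookiejar'] along: the identifier decides which stored session the request is sent with.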

Using FormRequest

Scrapy provides the FormRequest class specifically for submitting forms.

    from scrapy import FormRequest

    def post_login(self, response):
        # First grab the hidden form parameter authenticity_token
        authenticity_token = response.xpath(
            '//input[@name="authenticity_token"]/@value').extract_first()
        logging.info('authenticity_token=' + authenticity_token)
        # FormRequest.from_response is provided by Scrapy for posting forms.
        # after_login is called once the login succeeds. The url argument can
        # be omitted when it is the same as the URL of the requested page.
        return [FormRequest.from_response(
            response,
            url='https://github.com/session',
            meta={'cookiejar': response.meta['cookiejar']},
            # headers=self.post_headers,
            formdata={
                'login': 'shuang0420',
                'password': 'XXXXXXXXXXXXXXXXX',
                'authenticity_token': authenticity_token
            },
            callback=self.after_login,
            dont_filter=True
        )]

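FormRequest.from_response() lets you specify the URL to submit to, the request headers, and the form values. Note that the cookie jar identifier is again passed through meta. The method also takes a callback, which is called after a successful login and is implemented below. The cookiejar keeps being passed along so that the start pages are requested with the login cookies.

    def after_login(self, response):
        # After logging in, go to the private message pages to crawl
        for url in self.start_urls:
            logging.info('letter url=' + url)
            yield Request(url,
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_page)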

Page processing

The main task of this example is the simulated login: after logging in to GitHub, it crawls the comments shown on the home page.

Code

    import re

    def parse_page(self, response):
        """Extract the text of the comments."""
        logging.info(u'-------------- message separator -----------------')
        logging.info(response.url)
        replaceTags = re.compile('<.*?>')
        replaceLine = re.compile('\r|\n|\t')
        message = response.xpath(
            '//div[@class="details"]/div[@class="message markdown-body"]'
            '|div[@class="message markdown-body"]/blockquote').extract()
        for m in message:
            m = replaceTags.sub("", m)
            m = replaceLine.sub("", m)
            print(m)

Crawl results

I like topn (or perhaps top_n) a little better, because it's not dependent on what the features represent (words, phrases, entities, characters...). … Note: as of now, the classes and methods are not well arranged, and there are a few mock classes (which will be removed) to help me with testing. O… Hello @gojomo thank you for replying fast.I have used save() to save the model and load_word2vec_format() to load the model. Thats where the probl… The unicode_errors='ignore' option should make it impossible for the exact same error to occur; perhaps you're getting some other very-similar error? (Nevermind, #758 added annoy.) It looks like the tests don't run on Travis, since Annoy is not installed there. Not sure how to fix the test failure in Python 2.6 either. Hello,Sorry for posting after even you have created the FAQ.I trained a model with tweets which had some undecodable unicode characters. When i t… dtto Misleading comment: there is no "training", the model is transferred from Mallet. These parameters only affect inference, model is unchanged. PEP8: Hanging indent of 4 spaces. @piskvorky I've addressed the comments. Could you please check? Thanks, that was quick :) @piskvorky , @tmylk , could you review? Added comment, made change in changelog. No, this was after that in 0.13.2. I noticed it because when I was testing the #768 solution, print_topics was failing. @tmylk how do you review these PRs before merging? There are too many errors, we cannot merge code so carelessly. Looks good to me... except still needs a comment explaining why the alias is there. And maybe a mention in the changelog, so we can deprecate the o… Yes, assign self.wordtopics = self.word_topics, with a big fat comment explaining why this alias is there. I don't understand how this version with storing unicode to binary files even worked. It means our unit tests must be faulty / incomplete.

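Handling CAPTCHAs

CAPTCHAs are a very effective anti-crawler mechanism. They stop most brute-force scraping and are widely used on e-commerce, voting, and social networking sites. Getting past them is a problem everyone who scrapes data has to face. Three common approaches follow.

Changing the IP address

Some sites, Douban for example, initially ask only for a username and password. If we log in and fetch pages too frequently, the site may start showing a CAPTCHA image that must be solved before login, which keeps access convenient for people while blocking abusive automated traffic. In that case we can go through a proxy server: from a different IP address the CAPTCHA no longer appears. When it shows up again, the only option is to switch to yet another IP address.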
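
Switching addresses can be organized as a small round-robin pool. The sketch below is only an illustration under stated assumptions: ProxyPool is a hypothetical helper, and the proxy addresses are placeholders from a documentation-only IP range, not working proxies.

```python
import itertools


class ProxyPool:
    """Round-robin over a list of proxies, skipping ones marked as blocked."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._blocked = set()
        self._cycle = itertools.cycle(self._proxies)

    def mark_blocked(self, proxy):
        # Call this when a proxy starts receiving CAPTCHA pages.
        self._blocked.add(proxy)

    def next_proxy(self):
        # Look at each proxy at most once per call.
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            if proxy not in self._blocked:
                return proxy
        raise RuntimeError("all proxies are blocked")


# Placeholder addresses, not real proxies.
pool = ProxyPool(["http://192.0.2.1:8080", "http://192.0.2.2:8080"])
first = pool.next_proxy()
pool.mark_blocked(first)   # pretend the first one hit a CAPTCHA
second = pool.next_proxy()
print(first, second)
```

In a Scrapy spider the chosen address would typically be set as request.meta['proxy'] for the next request.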

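Logging in with cookies

Cookie login works like this: log in to the site manually once and save the cookie returned by the server, which carries the user's login state. The same cookie can then be used to access other pages of the site without logging in again. The code for this is already implemented, see ZhihuSpider. We only need to put the username, the password, and the corresponding cookie in the configuration file. When no CAPTCHA appears, the crawler logs in by POSTing the username and password; only if that fails does it fall back to the cookie provided in advance.

To tell whether the crawler is logged in, just check whether the fetched content contains the user's information. With cookie login the cookie also has to be refreshed from time to time to keep the crawl running.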
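
A cookie copied from the browser usually arrives as one header string. The sketch below, which uses only the standard library, turns such a string into a dict and applies the login check described above. The cookie names, the page fragment, and the helper names parse_cookie_header and looks_logged_in are all made up for illustration.

```python
from http.cookies import SimpleCookie


def parse_cookie_header(header):
    """Turn a 'k1=v1; k2=v2' Cookie header string into a plain dict."""
    jar = SimpleCookie()
    jar.load(header)
    return {name: morsel.value for name, morsel in jar.items()}


def looks_logged_in(page_text, username):
    """Heuristic from the text: a logged-in page mentions the user."""
    return username in page_text


# Made-up cookie string and page body, for illustration only.
cookies = parse_cookie_header("session_id=abc123; logged_in=yes")
print(cookies)
print(looks_logged_in("<a href='/people/alice'>alice</a>", "alice"))
```

The resulting dict has the shape accepted by the cookies argument of scrapy.Request.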

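CAPTCHA recognition

Cookie login is simple, but cookies expire. Recognizing the CAPTCHA automatically is a good idea, but the recognition accuracy in turn limits crawl throughput.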

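Crawling JS-driven interactive table data

One option is to use Google Chrome to find the URL behind the request (right-click → Inspect Element → Network → clear the log, click "Load more", look for the new GET entry whose Type is text/html, click it, and read the GET parameters or copy the Request URL), and then issue that request in a loop. The rest of this section takes another route and renders the page with Splash.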
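
Once the Request URL is known, repeating it in a loop mostly means changing one query parameter. The following is a minimal sketch; the URL, the parameter name page, and the helper with_page are hypothetical, since the real names depend on the site being inspected.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page, param="page"):
    """Return url with its query parameter `param` set to `page`."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query[param] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical Request URL copied from the Network panel.
base = "http://example.com/list?category=3&page=1"
urls = [with_page(base, n) for n in range(1, 4)]
for u in urls:
    print(u)
```

Each generated URL can then be requested in turn until a page comes back empty.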

Starting the Splash container

    $ boot2docker start
    $ boot2docker ssh
    $ docker run -p 8050:8050 scrapinghub/splash

Configuring scrapy-splash

Add the following to settings.py of your Scrapy project:

    SPLASH_URL = 'http://192.168.59.103:8050'

    # Add the Splash middlewares via DOWNLOADER_MIDDLEWARES (also in
    # settings.py) and change the priority of HttpCompressionMiddleware.
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    # HttpProxyMiddleware has priority 750 by default and must come after
    # the Splash middlewares.

    # Use Splash's own duplicate filter.
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

    # If you use the HTTP cache, a custom cache storage backend is needed
    # as well. scrapy-splash provides a subclass of
    # scrapy.contrib.httpcache.FilesystemCacheStorage:
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
    # For any other cache storage, subclass it and replace every call to
    # scrapy.utils.request.request_fingerprint with
    # scrapy_splash.splash_request_fingerprint.

Using scrapy-splash

SplashRequest

The simplest way to render a request is scrapy_splash.SplashRequest, which is usually the one to choose.

    yield SplashRequest(url, self.parse_result,
        args={
            # optional; parameters passed to Splash HTTP API
            'wait': 0.5,
            # 'url' is prefilled from request url
            # 'http_method' is set to 'POST' for POST requests
            # 'body' is set to request body for POST requests
        },
        endpoint='render.json',  # optional; default is render.html
        splash_url='<url>',      # optional; overrides SPLASH_URL
        slot_policy=scrapy_splash.SlotPolicy.PER_DOMAIN,  # optional
    )

Alternatively, the same effect can be achieved with a regular Scrapy request by passing the splash key in meta:

    yield scrapy.Request(url, self.parse_result, meta={
        'splash': {
            'args': {
                # set rendering arguments here
                'html': 1,
                'png': 1,
                # 'url' is prefilled from request url
                # 'http_method' is set to 'POST' for POST requests
                # 'body' is set to request body for POST requests
            },
            # optional parameters
            'endpoint': 'render.json',  # optional; default is render.json
            'splash_url': '<url>',      # optional; overrides SPLASH_URL
            'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
            'splash_headers': {},       # optional; a dict with headers sent to Splash
            'dont_process_response': True,  # optional, default is False
            'dont_send_headers': True,  # optional, default is False
            'magic_response': False,    # optional, default is True
        }
    })

A note on the Splash API: SplashRequest is a convenient utility for filling in request.meta['splash'].

meta['splash']['args'] holds the arguments sent to Splash.
meta['splash']['endpoint'] is the Splash endpoint to use. SplashRequest defaults to render.html; a plain scrapy.Request with meta['splash'] defaults to render.json.
meta['splash']['splash_url'] overrides the Splash URL configured in settings.py.
meta['splash']['splash_headers'] lets you add or change the HTTP headers sent to the Splash server. Note that these are not the headers sent to the remote website.
meta['splash']['dont_send_headers'] should be set to True if you do not want to pass headers on to Splash.
meta['splash']['slot_policy'] customizes how concurrency and politeness are handled for Splash requests.
meta['splash']['dont_process_response'], when set to True, stops SplashMiddleware from changing the default scrapy.Response. By default a SplashResponse subclass such as SplashTextResponse is returned.
meta['splash']['magic_response'] defaults to True, in which case Splash automatically sets some attributes of the Response, such as response.headers and response.body.
To submit a form through Splash, use scrapy_splash.SplashFormRequest, which is used the same way as SplashRequest.
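
The keys above are ordinary dict entries, so the meta value can be assembled by a small helper. The sketch below only builds the dict and does not talk to Splash; splash_meta is a hypothetical function name, and defaulting the endpoint to render.html is a choice made for this example.

```python
def splash_meta(endpoint="render.html", splash_url=None, **args):
    """Build a request.meta-style dict carrying Splash options."""
    splash = {"endpoint": endpoint, "args": dict(args)}
    if splash_url is not None:
        # Only override the SPLASH_URL setting when explicitly asked to.
        splash["splash_url"] = splash_url
    return {"splash": splash}


meta = splash_meta(wait=0.5)
print(meta)
```

The result could be passed as the meta argument of a regular scrapy.Request.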

Responses

scrapy-splash returns a different Response subclass depending on the Splash request:

SplashResponse: binary responses, e.g. for /render.png
SplashTextResponse: text responses, e.g. for /render.html
SplashJsonResponse: JSON responses, e.g. for /render.json, or for /execute when using a Lua script

To get only standard Response objects, set meta['splash']['dont_process_response']=True.

All of these responses set response.url to the URL of the original request (the URL of the page to render) rather than the URL of the Splash endpoint. The actual URL is available as response.real_url.

Example

Crawl the "next page" URLs of the Huawei app store (http://appstore.huawei.com/more/all).

Viewing the page source

Viewing the rendered source

Start the Splash container, open http://192.168.59.103:8050/ in a browser, enter the URL to render it, and inspect the rendered markup.

Part of the spider code

    def parse(self, response):
        page = Selector(response)
        hrefs = page.xpath('//h4[@class="title"]/a/@href')
        if not hrefs:
            return
        for href in hrefs:
            url = href.extract()
            yield scrapy.Request(url, callback=self.parse_item)
        # find next page
        nextpage = page.xpath(
            '//div[@class="page-ctrl ctrl-app"]/a/em[@class="arrow-grey-rt"]/../@href').extract_first()
        print(nextpage)
        # extract_first() returns None when there is no next page
        if nextpage:
            yield scrapy.Request(nextpage, callback=self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })
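
The nextpage value comes straight from an href attribute, so it may be missing on the last page or may be a relative path. The sketch below shows one way to normalize it with the standard library; next_page_url is a hypothetical helper, and the href values are invented for illustration.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Return an absolute URL for the next page, or None when there is none."""
    if not href:
        return None
    return urljoin(current_url, href.strip())


# The href values here are invented for illustration.
print(next_page_url("http://appstore.huawei.com/more/all", "/more/all/2"))
print(next_page_url("http://appstore.huawei.com/more/all", None))
```

Returning None for a missing link lets the caller skip the follow-up request instead of building one from an empty URL.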

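Parsing irregular HTML

The previous parts were all about downloading web pages. This part adds a few tips for parsing a page once it has been fetched.
Take the Suning help pages as an example. The start_url is http://help.suning.com/faq/list.htm, and the targets are the question pages shown on the right for every subcategory of every category in the left sidebar, for example FAQ pages such as "Benefits overview" and "Benefits by level". How to reach those pages is not discussed further; the key question is how to extract the information once we are there.

Here is part of the page source: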

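<div id="contentShow">
<p class="MsoNormal" style="background:white;text-align:left;" align="left"><span style="font-size:9pt;font-family:宋體;color:black;"></span></p>
<p class="MsoNormal" style="background:white;" align="left"><br /></p>
<p class="MsoNormal" style="background:white;" align="left"><b>1. Types of benefits</b><b></b></p>
<p class="MsoNormal" style="background:white;" align="left">This revision launches <span>7</span> member benefits, covering price discounts, early access, priority service and more; the higher the membership level, the more benefits a member enjoys.<span></span></p>
<p class="MsoNormal" style="background:white;" align="left"><b>2. Details:</b><b></b></p>
<p class="MsoNormal" style="background:white;" align="left"><b>1</b><b>. Birthday red envelope</b><br />Benefit:<span><br /></span>Members at level <span>V2</span> or above with a verified mobile number receive a birthday red envelope during their birthday week, once they have completed real-name verification or filled in their birthday.<span><br /></span><span>V2</span> members receive a <span>6</span>-yuan cloud coupon, <span>V3</span> members receive an <span>8</span>-yuan cloud coupon. (Effective from <span>2016</span>-06-12)<span><br /></span>Notes:<span><br />1</span>) The birthday coupon is a category-restricted cloud coupon that is credited to the member's account automatically during the birthday week. The member gets an SMS notice once it arrives and can find it under "My Suning<span>-</span>My coupons" [<span><a href="http://member.suning.com/emall/MyGiftTicket" target="_blank"><span>view</span></a></span>]. Each member can receive only one birthday coupon per calendar year;<span><br />2</span>) Usage rules: single use only, no change given, not redeemable for cash; cannot be combined with cloud coupons, can be combined with Wudi coupons and Yi coupons; not valid for self-pickup;<span><br />3</span>) Validity: <span>8</span> days from the day the coupon is credited;<span><br />4</span>) Eligible goods: self-operated goods only; also valid for Dajuhui, group-buy, mobile-exclusive-price and brand flash-sale goods, but not for flash auctions, seckill deals, pre-sales, overseas purchases, virtual goods, special goods (stage-1 infant formula, etc.) or marketplace merchants' goods;<span><br />5</span>) If an order paid with the coupon is returned within the validity period, the coupon goes back to the customer's account and can be used again; if the validity period has already passed at the time of the return, the coupon lapses and is not extended;<span></span></p>......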

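It is easy to see that some of the text sits under

div[@id="contentShow"]/p
div[@id="contentShow"]/p/span
div[@id="contentShow"]/p/b

Other pages show that some text also sits under div[@id="contentShow"]/h4 or h3, and some directly under div[@id="contentShow"].
What to do?
One could enumerate every rule, or strip the unwanted tags before calling extract. I naively tried both at first, and some text always got lost. Then, staring glumly at the output file, it dawned on me: why not take div[@id="contentShow"] as a whole and strip all the tags from it?

Here is the code: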

    import re

    page = html.xpath('//div[@id="contentShow"]').extract_first()
    replaceTags = re.compile('<.*?>')
    replaceLine = re.compile('\r|\n|\t')
    page = replaceTags.sub("", page)
    page = re.sub(replaceLine, "", page)
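
The same idea can be tried without Scrapy on a small string. The sketch below is a self-contained variant under stated assumptions: the sample fragment is made up, and two details go beyond the original snippet, namely the re.S flag (so tags that span lines are also removed) and the unescape call (so entities such as &amp; are decoded).

```python
import re
from html import unescape

TAG = re.compile(r"<.*?>", re.S)
BREAKS = re.compile(r"[\r\n\t]")


def strip_markup(fragment):
    """Remove tags and line breaks, then decode entities such as &amp;."""
    text = TAG.sub("", fragment)
    text = BREAKS.sub("", text)
    return unescape(text)


# Made-up fragment with text spread over p, b and span elements.
sample = (
    '<div id="contentShow"><p><b>1</b><b>. Title</b><br />\n'
    'Body <span>7</span> items &amp; more</p></div>'
)
print(strip_markup(sample))
```

Because line breaks are deleted rather than replaced by spaces, text from adjacent lines is joined directly, which matches the dense output shown in the article.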

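The final result is very clean:

{"url": "http://help.suning.com/page/id-26.htm", "text": " 1. Account registration: at present individual users can register only with a mobile number. 1. Open the Suning website and click “Register” at the top of the page to go to the registration page. 2. On the registration page, individual users register with a mobile number; business users can click “Business user registration” and register with the company name; if you already have an account, click “Log in now”. 3. Fill in the registration details as prompted: mobile number, verification code and password. 4. Congratulations, registration is complete.", "question": "Account registration", "title": "Suning registration and login"}

Once you know this trick, similar problems become easy. Crawling JD's help pages, for instance, takes about five minutes of small code changes.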

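Miscellaneous

Some background: Scrapy is built on Twisted, an asynchronous networking framework, so potential blocking calls deserve attention. Note, however, that settings has a parameter for the concurrency of item pipelines. From that I infer that pipelines do not block and may run in a thread pool (not verified). Pipelines are typically used to persist the scraped data (writing to a database or a file), so if that inference holds, slow operations there will not stall the whole framework and the writes in a pipeline do not have to be made asynchronous.
The rest of the framework is asynchronous. Put simply, requests generated by the spider are handed to the scheduler for download while the spider keeps running, and once a download completes the response is handed back to the spider for parsing.
Some of the examples found online put the JS support into a DownloaderMiddleware, and so does the code snippet on the Scrapy site. Implemented that way, it blocks the whole framework: the crawler works in a download-parse-download-parse pattern instead of downloading in parallel. For small crawls where efficiency does not matter much, this is not a big problem.
A better approach is to build the JS support into Scrapy's downloader. There is one such implementation online (using selenium + phantomjs), but it supports only GET requests.
Adapting a WebKit engine to Scrapy's downloader involves many details that need care.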
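
Whether or not a given component runs in a thread pool, the general way to keep blocking work from stalling a caller is to hand it to worker threads. The sketch below shows this with the standard library only; it does not use Twisted or Scrapy (Twisted has its own deferToThread for the same purpose), and slow_save is a made-up stand-in for a database or file write.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_save(item):
    """Stand-in for a blocking database or file write."""
    time.sleep(0.05)
    return "saved:" + item


items = ["a", "b", "c", "d"]

# Run the blocking writes on worker threads so the total wait is not
# the sum of all the individual write times.
start = time.monotonic()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(slow_save, items))
elapsed = time.monotonic() - start

print(results)
print("elapsed: %.2fs" % elapsed)
```

pool.map keeps the results in input order even though the writes overlap in time.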

Original post: http://www.shuang0420.com/2016/06/20/%E7%88%AC%E8%99%AB%E6%80%BB%E7%BB%93-%E4%BA%94-%E5%85%B6%E4%BB%96%E6%8A%80%E5%B7%A7/