Python Crawlers -- (4. The Scrapy framework: official docs and examples)
Official documentation: http://doc.scrapy.org/en/latest/
GitHub examples: https://github.com/search?utf8=%E2%9C%93&q=scrapy
I'll sort out the rest later... off to get some food... -- 2014-08-20 19:29:20

Er... the Sogou input method just broke, and after logging out and back in, everything I had written was gone. So the draft box isn't all that reliable after all!!!

Let me put it together again.

-- 2014-08-21 04:02:37
(I) Basics -- scrapy.spider.Spider
(1) Using the interactive shell
    dizzy@dizzy-pc:~$ scrapy shell "http://www.baidu.com/"
    2014-08-21 04:09:11+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
    2014-08-21 04:09:11+0800 [scrapy] INFO: Optional features available: ssl, http11, django
    2014-08-21 04:09:11+0800 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
    2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled item pipelines:
    2014-08-21 04:09:11+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
    2014-08-21 04:09:11+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6081
    2014-08-21 04:09:11+0800 [default] INFO: Spider opened
    2014-08-21 04:09:12+0800 [default] DEBUG: Crawled (200) <GET http://www.baidu.com/> (referer: None)
    [s] Available Scrapy objects:
    [s]   crawler    <scrapy.crawler.Crawler object at 0xa483cec>
    [s]   item       {}
    [s]   request    <GET http://www.baidu.com/>
    [s]   response   <200 http://www.baidu.com/>
    [s]   settings   <scrapy.settings.Settings object at 0xa0de78c>
    [s]   spider     <Spider 'default' at 0xa78086c>
    [s] Useful shortcuts:
    [s]   shelp()           Shell help (print this help)
    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
    [s]   view(response)    View response in a browser
    >>> # response.body is the full content returned
    >>> # response.xpath('//ul/li') lets you test any XPath expression here

More important, if you type response.selector you will access a selector object you can use to query the response, and convenient shortcuts like response.xpath() and response.css() mapping to response.selector.xpath() and response.selector.css()
In other words, you can check interactively, very conveniently, whether an XPath selection is correct. I used to pick selectors with Firefox's F12 tools, but that did not always select the content correctly.
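For example, against the Baidu page fetched above (the output here is only an illustration and will vary):

    >>> response.xpath('//title/text()').extract()
    [u'\u767e\u5ea6\u4e00\u4e0b\uff0c\u4f60\u5c31\u77e5\u9053']  # "百度一下,你就知道"
    >>> response.css('title::text').extract()  # the equivalent CSS selection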
You can also use:
    scrapy shell 'http://scrapy.org' --nolog
    # the --nolog flag suppresses the startup log

(2) Example
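The spider below imports DmozItem from scrapy_test.items. The post never shows items.py, but a minimal definition (an assumption, matching the fields used in parse) would be:

    # items.py
    import scrapy

    class DmozItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()
        desc = scrapy.Field()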
    from scrapy import Spider
    from scrapy_test.items import DmozItem

    class DmozSpider(Spider):
        name = 'dmoz'
        allowed_domains = ['dmoz.org']
        start_urls = [
            'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
            'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/',
        ]

        def parse(self, response):
            # each <li> holds one book/resource link
            for sel in response.xpath('//ul/li'):
                item = DmozItem()
                item['title'] = sel.xpath('a/text()').extract()
                item['link'] = sel.xpath('a/@href').extract()
                item['desc'] = sel.xpath('text()').extract()
                yield item

(3) Saving to a file
Scraped items can be saved to a file via feed exports; the format can be JSON, XML, or CSV.
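For example (a sketch using the dmoz spider above; the output filename is arbitrary):

    scrapy crawl dmoz -o items.json -t json
    # -t names the export format explicitly; newer versions also infer it
    # from the file extension, so items.xml or items.csv work the same way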
(4) Creating a spider from a template
    scrapy genspider baidu baidu.com

which generates:

    # -*- coding: utf-8 -*-
    import scrapy

    class BaiduSpider(scrapy.Spider):
        name = "baidu"
        allowed_domains = ["baidu.com"]
        start_urls = (
            'http://www.baidu.com/',
        )

        def parse(self, response):
            pass

That's it for this part for now. I remembered five points earlier, but can only recall four now. :-(
Always remember to hit the save button as you go. Otherwise it really spoils your mood (⊙o⊙)!
(II) Advanced -- scrapy.contrib.spiders.CrawlSpider
(1) CrawlSpider
class scrapy.contrib.spiders.CrawlSpider

This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best suited for your particular web sites or project, but it's generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.

Apart from the attributes inherited from Spider (that you must specify), this class supports a new attribute:

rules

Which is a list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. Rules objects are described below. If multiple rules match the same link, the first one will be used, according to the order they're defined in this attribute.

This spider also exposes an overrideable method:

parse_start_url(response)

This method is called for the start_urls responses. It allows to parse the initial responses and must return either a Item object, a Request object, or an iterable containing any of them.

(2) Example
    # coding=utf-8
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import LinkExtractor
    import scrapy

    class TestSpider(CrawlSpider):
        name = 'test'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/']

        rules = (  # a tuple of Rule objects
            # follow category links (but not subsections); no callback,
            # so these pages are only followed, not parsed for items
            Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),
            # item pages are handed to parse_item
            Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
        )

        def parse_item(self, response):
            self.log('item page : %s' % response.url)
            # note: in a real project the fields must be declared on an
            # Item subclass; a bare scrapy.Item() has no fields
            item = scrapy.Item()
            item['id'] = response.xpath('//td[@id="item_id"]/text()').re('ID:(\d+)')
            item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
            item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
            return item

(3) Others
There are also XMLFeedSpider, CSVFeedSpider, and SitemapSpider; I'll dig into those when I have time. A rough sketch of XMLFeedSpider follows, ahead of the class list.
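As a first orientation, a minimal XMLFeedSpider sketch (my own illustration, not from the original post; the feed URL and tag names are hypothetical):

    from scrapy.contrib.spiders import XMLFeedSpider

    class FeedSpider(XMLFeedSpider):
        name = 'feed'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/feed.xml']
        iterator = 'iternodes'  # the default, a fast stream-based iterator
        itertag = 'item'        # visit every <item> node in the feed

        def parse_node(self, response, node):
            # node is a Selector positioned on a single <item> element
            title = node.xpath('title/text()').extract()
            self.log('Found feed entry: %s' % title)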
    class scrapy.contrib.spiders.XMLFeedSpider
    class scrapy.contrib.spiders.CSVFeedSpider
    class scrapy.contrib.spiders.SitemapSpider

(III) Selectors
    >>> from scrapy.selector import Selector
    >>> from scrapy.http import HtmlResponse

You can flexibly use .css() and .xpath() to pick out the target data quickly.
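For instance, a selector can be built directly from text or from a response object (the usage shown in the official selector docs):

    >>> body = '<html><body><span>good</span></body></html>'
    >>> Selector(text=body).xpath('//span/text()').extract()
    [u'good']
    >>> response = HtmlResponse(url='http://example.com', body=body)
    >>> Selector(response=response).xpath('//span/text()').extract()
    [u'good']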
!!! Selectors need a proper look: xpath() and css(), and I still have to get more comfortable with regular expressions.
When selecting by class, prefer css() for the class match, and then use xpath() to pick out the element's attributes.
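Since selectors can be chained, the css() result can be refined with xpath(). A sketch of that pattern (the class name and markup here are hypothetical):

    >>> # match elements by class with css(), then drill in with xpath()
    >>> response.css('.product').xpath('./a/@href').extract()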
(IV) Item Pipeline
After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.
Typical uses for item pipelines are:

- cleansing HTML data
- validating scraped data (checking that the items contain certain fields)
- checking for duplicates (and dropping them)
- storing the scraped item in a database

(1) Validating data
    from scrapy.exceptions import DropItem

    class PricePipeline(object):
        vat_factor = 1.5

        def process_item(self, item, spider):
            if item['price']:
                if item['price_excludes_vat']:
                    item['price'] *= self.vat_factor
                # an item must be returned, or it silently disappears
                # from the rest of the pipeline
                return item
            else:
                raise DropItem('Missing price in %s' % item)

(2) Writing a JSON file
    import json

    class JsonWriterPipeline(object):
        def __init__(self):
            self.file = open('json.jl', 'wb')

        def process_item(self, item, spider):
            # one JSON object per line (the .jl "JSON lines" format)
            line = json.dumps(dict(item)) + '\n'
            self.file.write(line)
            return item

(3) Checking for duplicates
    from scrapy.exceptions import DropItem

    class Duplicates(object):
        def __init__(self):
            self.ids_seen = set()

        def process_item(self, item, spider):
            if item['id'] in self.ids_seen:
                raise DropItem('Duplicate item found : %s' % item)
            else:
                self.ids_seen.add(item['id'])
                return item

As for writing the data into a database, that should be simple too: just store the item from inside process_item, as sketched below.
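Note that none of these pipelines run until they are enabled in settings.py. A sketch, assuming the pipelines live in scrapy_test.pipelines (the project name used earlier; adjust to your own layout):

    # settings.py -- the numbers (0-1000) set the order pipelines run in
    ITEM_PIPELINES = {
        'scrapy_test.pipelines.PricePipeline': 300,
        'scrapy_test.pipelines.JsonWriterPipeline': 800,
    }

And a minimal sketch of the database idea, using sqlite3 from the standard library (my own illustration; the table schema and item fields are assumptions):

    import sqlite3

    class SQLitePipeline(object):
        def open_spider(self, spider):
            # called once when the spider starts
            self.conn = sqlite3.connect('items.db')
            self.conn.execute(
                'CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, name TEXT)')

        def close_spider(self, spider):
            # called once when the spider finishes
            self.conn.commit()
            self.conn.close()

        def process_item(self, item, spider):
            self.conn.execute(
                'INSERT OR REPLACE INTO items VALUES (?, ?)',
                (item['id'], item['name']))
            return item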
Read all night, up to page 85. That pretty much covers the basics.

-- 2014-08-21 06:39:41
(V)
Reposted from: https://my.oschina.net/lpe234/blog/304880