Python Crawlers -- (4. The Scrapy framework: official docs and examples)
Official documentation: http://doc.scrapy.org/en/latest/
GitHub examples: https://github.com/search?utf8=%E2%9C%93&q=scrapy
I'll sort out the rest later... off to get some food... -- 2014-08-20 19:29:20

Er... the Sogou input method just broke, and after logging out and back in, everything I had written was gone. So the draft box isn't all that reliable after all!!!

Let me put it together again.

-- 2014-08-21 04:02:37
(I) Basics -- scrapy.spider.Spider
(1) Using the interactive shell
    dizzy@dizzy-pc:~$ scrapy shell "http://www.baidu.com/"
    2014-08-21 04:09:11+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
    2014-08-21 04:09:11+0800 [scrapy] INFO: Optional features available: ssl, http11, django
    2014-08-21 04:09:11+0800 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
    2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled item pipelines:
    2014-08-21 04:09:11+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
    2014-08-21 04:09:11+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6081
    2014-08-21 04:09:11+0800 [default] INFO: Spider opened
    2014-08-21 04:09:12+0800 [default] DEBUG: Crawled (200) <GET http://www.baidu.com/> (referer: None)
    [s] Available Scrapy objects:
    [s]   crawler    <scrapy.crawler.Crawler object at 0xa483cec>
    [s]   item       {}
    [s]   request    <GET http://www.baidu.com/>
    [s]   response   <200 http://www.baidu.com/>
    [s]   settings   <scrapy.settings.Settings object at 0xa0de78c>
    [s]   spider     <Spider 'default' at 0xa78086c>
    [s] Useful shortcuts:
    [s]   shelp()           Shell help (print this help)
    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
    [s]   view(response)    View response in a browser
    >>> # response.body is the full content returned
    >>> # response.xpath('//ul/li') lets you test any XPath expression here

More important, if you type response.selector you will access a selector object you can use to query the response, and convenient shortcuts like response.xpath() and response.css() mapping to response.selector.xpath() and response.selector.css()
In other words, you can check interactively, very conveniently, whether an XPath selection is correct. I used to pick selectors with Firefox's F12 tools, but that did not always select the content correctly.
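For example, against the Baidu page fetched above (the output here is only an illustration and will vary):

    >>> response.xpath('//title/text()').extract()
    [u'\u767e\u5ea6\u4e00\u4e0b\uff0c\u4f60\u5c31\u77e5\u9053']  # "百度一下,你就知道"
    >>> response.css('title::text').extract()  # the equivalent CSS selection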
You can also use:
    scrapy shell 'http://scrapy.org' --nolog
    # the --nolog flag suppresses the startup log

(2) Example
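The spider below imports DmozItem from scrapy_test.items. The post never shows items.py, but a minimal definition (an assumption, matching the fields used in parse) would be:

    # items.py
    import scrapy

    class DmozItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()
        desc = scrapy.Field()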
    from scrapy import Spider
    from scrapy_test.items import DmozItem

    class DmozSpider(Spider):
        name = 'dmoz'
        allowed_domains = ['dmoz.org']
        start_urls = [
            'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
            'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/',
        ]

        def parse(self, response):
            # each <li> holds one book/resource link
            for sel in response.xpath('//ul/li'):
                item = DmozItem()
                item['title'] = sel.xpath('a/text()').extract()
                item['link'] = sel.xpath('a/@href').extract()
                item['desc'] = sel.xpath('text()').extract()
                yield item

(3) Saving to a file
Scraped items can be saved to a file via feed exports; the format can be JSON, XML, or CSV.
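For example (a sketch using the dmoz spider above; the output filename is arbitrary):

    scrapy crawl dmoz -o items.json -t json
    # -t names the export format explicitly; newer versions also infer it
    # from the file extension, so items.xml or items.csv work the same way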
(4) Creating a spider from a template
    scrapy genspider baidu baidu.com

which generates:

    # -*- coding: utf-8 -*-
    import scrapy

    class BaiduSpider(scrapy.Spider):
        name = "baidu"
        allowed_domains = ["baidu.com"]
        start_urls = (
            'http://www.baidu.com/',
        )

        def parse(self, response):
            pass

That's it for this part for now. I remembered five points earlier, but can only recall four now. :-(
Always remember to hit the save button as you go. Otherwise it really spoils your mood (⊙o⊙)!
(II) Advanced -- scrapy.contrib.spiders.CrawlSpider
(1) CrawlSpider
class scrapy.contrib.spiders.CrawlSpider

This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best suited for your particular web sites or project, but it's generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.

Apart from the attributes inherited from Spider (that you must specify), this class supports a new attribute:

rules

Which is a list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. Rules objects are described below. If multiple rules match the same link, the first one will be used, according to the order they're defined in this attribute.

This spider also exposes an overrideable method:

parse_start_url(response)

This method is called for the start_urls responses. It allows to parse the initial responses and must return either a Item object, a Request object, or an iterable containing any of them.

(2) Example
    # coding=utf-8
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import LinkExtractor
    import scrapy

    class TestSpider(CrawlSpider):
        name = 'test'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/']

        rules = (  # a tuple of Rule objects
            # follow category links (but not subsections); no callback,
            # so these pages are only followed, not parsed for items
            Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),
            # item pages are handed to parse_item
            Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
        )

        def parse_item(self, response):
            self.log('item page : %s' % response.url)
            # note: in a real project the fields must be declared on an
            # Item subclass; a bare scrapy.Item() has no fields
            item = scrapy.Item()
            item['id'] = response.xpath('//td[@id="item_id"]/text()').re('ID:(\d+)')
            item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
            item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
            return item

(3) Others
There are also XMLFeedSpider, CSVFeedSpider, and SitemapSpider; I'll dig into those when I have time. A rough sketch of XMLFeedSpider follows, ahead of the class list.
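As a first orientation, a minimal XMLFeedSpider sketch (my own illustration, not from the original post; the feed URL and tag names are hypothetical):

    from scrapy.contrib.spiders import XMLFeedSpider

    class FeedSpider(XMLFeedSpider):
        name = 'feed'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/feed.xml']
        iterator = 'iternodes'  # the default, a fast stream-based iterator
        itertag = 'item'        # visit every <item> node in the feed

        def parse_node(self, response, node):
            # node is a Selector positioned on a single <item> element
            title = node.xpath('title/text()').extract()
            self.log('Found feed entry: %s' % title)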
    class scrapy.contrib.spiders.XMLFeedSpider
    class scrapy.contrib.spiders.CSVFeedSpider
    class scrapy.contrib.spiders.SitemapSpider

(III) Selectors
    >>> from scrapy.selector import Selector
    >>> from scrapy.http import HtmlResponse

You can flexibly use .css() and .xpath() to pick out the target data quickly.
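For instance, a selector can be built directly from text or from a response object (the usage shown in the official selector docs):

    >>> body = '<html><body><span>good</span></body></html>'
    >>> Selector(text=body).xpath('//span/text()').extract()
    [u'good']
    >>> response = HtmlResponse(url='http://example.com', body=body)
    >>> Selector(response=response).xpath('//span/text()').extract()
    [u'good']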
!!! Selectors need a proper look: xpath() and css(), and I still have to get more comfortable with regular expressions.
When selecting by class, prefer css() for the class match, and then use xpath() to pick out the element's attributes.
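Since selectors can be chained, the css() result can be refined with xpath(). A sketch of that pattern (the class name and markup here are hypothetical):

    >>> # match elements by class with css(), then drill in with xpath()
    >>> response.css('.product').xpath('./a/@href').extract()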
(IV) Item Pipeline
After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.
Typical uses for item pipelines are:

- cleansing HTML data
- validating scraped data (checking that the items contain certain fields)
- checking for duplicates (and dropping them)
- storing the scraped item in a database

(1) Validating data
    from scrapy.exceptions import DropItem

    class PricePipeline(object):
        vat_factor = 1.5

        def process_item(self, item, spider):
            if item['price']:
                if item['price_excludes_vat']:
                    item['price'] *= self.vat_factor
                # an item must be returned, or it silently disappears
                # from the rest of the pipeline
                return item
            else:
                raise DropItem('Missing price in %s' % item)

(2) Writing a JSON file
    import json

    class JsonWriterPipeline(object):
        def __init__(self):
            self.file = open('json.jl', 'wb')

        def process_item(self, item, spider):
            # one JSON object per line (the .jl "JSON lines" format)
            line = json.dumps(dict(item)) + '\n'
            self.file.write(line)
            return item

(3) Checking for duplicates
    from scrapy.exceptions import DropItem

    class Duplicates(object):
        def __init__(self):
            self.ids_seen = set()

        def process_item(self, item, spider):
            if item['id'] in self.ids_seen:
                raise DropItem('Duplicate item found : %s' % item)
            else:
                self.ids_seen.add(item['id'])
                return item

As for writing the data into a database, that should be simple too: just store the item from inside process_item, as sketched below.
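Note that none of these pipelines run until they are enabled in settings.py. A sketch, assuming the pipelines live in scrapy_test.pipelines (the project name used earlier; adjust to your own layout):

    # settings.py -- the numbers (0-1000) set the order pipelines run in
    ITEM_PIPELINES = {
        'scrapy_test.pipelines.PricePipeline': 300,
        'scrapy_test.pipelines.JsonWriterPipeline': 800,
    }

And a minimal sketch of the database idea, using sqlite3 from the standard library (my own illustration; the table schema and item fields are assumptions):

    import sqlite3

    class SQLitePipeline(object):
        def open_spider(self, spider):
            # called once when the spider starts
            self.conn = sqlite3.connect('items.db')
            self.conn.execute(
                'CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, name TEXT)')

        def close_spider(self, spider):
            # called once when the spider finishes
            self.conn.commit()
            self.conn.close()

        def process_item(self, item, spider):
            self.conn.execute(
                'INSERT OR REPLACE INTO items VALUES (?, ?)',
                (item['id'], item['name']))
            return item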
Read all night, up to page 85. That pretty much covers the basics.

-- 2014-08-21 06:39:41
(V)
Reposted from: https://my.oschina.net/lpe234/blog/304880