
Web Crawler Study Notes (8): Scrapy Framework (3): The CrawlSpider Template

Published: 2025/3/21 · Programming Q&A · 22 豆豆


CrawlSpider

Creating a CrawlSpider spider file
Command:

scrapy genspider -t crawl <spider_file_name> <domain>

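Rule
Purpose: a Rule defines one crawling rule of a CrawlSpider.
Parameters:
link_extractor: a Link Extractor object that defines how links are extracted from each crawled page.
callback: the callback that handles the response fetched for each link extracted by link_extractor.
cb_kwargs: a dict of keyword arguments to pass to the callback (cb stands for callback).
follow: whether links should be followed from each response fetched through this rule.
    Two values, True / False. With follow=True, responses fetched through this rule are run through the rules again to extract further links; with False they are not. If callback is None, follow defaults to True, otherwise to False.
process_links: a callback that receives the list of links extracted by link_extractor; mainly used to filter them.
process_request: a callback invoked for every request generated by this rule; it can modify the request or drop it by returning None.
errback: the function called when an exception is raised while processing a request generated by this rule.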
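As a rough illustration of what a process_links callback does, the sketch below filters and rewrites a list of links before any request is built. It uses only the standard library: the Link class is a hypothetical stand-in for scrapy.link.Link, and keep_first and drop_query_strings are illustrative names, not part of Scrapy.

```python
from dataclasses import dataclass
from itertools import islice


@dataclass(frozen=True)
class Link:
    """Hypothetical stand-in for scrapy.link.Link (only the url is modelled)."""
    url: str


def keep_first(links, limit):
    """Return at most `limit` links, in the order the extractor produced them."""
    return list(islice(links, limit))


def drop_query_strings(links):
    """Return new links with everything from '?' onward removed from each url."""
    return [Link(url=link.url.split("?", 1)[0]) for link in links]


# Five fake listing links, each carrying a tracking query string.
links = [Link(f"http://example.com/book/{i}.html?from=list") for i in range(5)]

# A process_links-style callback would return this filtered list.
cleaned = drop_query_strings(keep_first(links, 2))
print([link.url for link in cleaned])
```

In a real spider the same logic would live in a method whose name is passed to Rule(process_links=...); whatever that method returns (or yields) is what the rule turns into requests.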

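LinkExtractor

LinkExtractor is also a class defined by the Scrapy framework. Its only purpose is to extract, from web pages, the links that will eventually be followed.

We can also define our own link extractor: it only needs to provide a method named extract_links that receives a Response object and returns a list of scrapy.link.Link objects.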
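A minimal sketch of such a custom extractor follows. To stay runnable without Scrapy installed it uses hypothetical stand-ins: Link mimics scrapy.link.Link, FakeResponse mimics a Response with only url and text, and ChapterLinkExtractor is an illustrative name. Only the extract_links(response) contract comes from the text above.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urljoin


@dataclass(frozen=True)
class Link:
    """Hypothetical stand-in for scrapy.link.Link."""
    url: str


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a Scrapy Response: a url plus an HTML body."""
    url: str
    text: str


class _AnchorCollector(HTMLParser):
    """Collects the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


class ChapterLinkExtractor:
    """Custom extractor: keeps unique links whose URL contains '/chapter/'."""

    def extract_links(self, response):
        collector = _AnchorCollector()
        collector.feed(response.text)
        links, seen = [], set()
        for href in collector.hrefs:
            absolute = urljoin(response.url, href)  # resolve relative hrefs
            if "/chapter/" in absolute and absolute not in seen:
                seen.add(absolute)
                links.append(Link(url=absolute))
        return links


page = FakeResponse(
    url="http://book.zongheng.com/showchapter/1.html",
    text=(
        '<a href="/chapter/1/10.html">c1</a>'
        '<a href="/book/1.html">book</a>'
        '<a href="/chapter/1/10.html">duplicate</a>'
    ),
)
found = ChapterLinkExtractor().extract_links(page)
print([link.url for link in found])
```

The built-in LinkExtractor described next covers most cases through its parameters, so a hand-written class like this is usually only worth it for unusual markup.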

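LinkExtractor parameters:
allow: a regular expression (or list of them) that a URL must match to be extracted; if empty, every URL matches.
deny: a regular expression (or list of them) that excludes matching URLs; if empty, no URL is excluded.
allow_domains: a str or list of str with the domains to extract links from.
deny_domains: a str or list of str with the domains to skip.
restrict_xpaths: an XPath expression (or list) that restricts the regions of the page in which links are matched.
restrict_css: a CSS selector (or list) that restricts the regions of the page in which links are matched.
restrict_text: a regular expression (or list) that the link text must match.
tags=('a', 'area'): the tags to consider when extracting links.
attrs=('href',): the attributes to consider when extracting links.
canonicalize: whether to canonicalize each extracted URL.
unique: whether duplicate links among the matches are filtered out.
process_value: a function that receives each value extracted from the tags and attributes scanned.
deny_extensions: file extensions to ignore when extracting links, e.g. .jpg.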
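The interplay of allow and deny can be sketched with the standard re module. The function is_allowed below is an illustrative helper, not Scrapy's own code; it assumes the documented behaviour that a URL is kept when it matches some allow pattern (or allow is empty) and matches no deny pattern. It also shows a common slip in such patterns: an unescaped '.' matches any character.

```python
import re


def is_allowed(url, allow=(), deny=()):
    """Keep a URL if it matches any allow pattern (or allow is empty)
    and matches none of the deny patterns."""
    if allow and not any(re.search(pattern, url) for pattern in allow):
        return False
    if any(re.search(pattern, url) for pattern in deny):
        return False
    return True


# '.' before "html" is unescaped here, so it matches ANY single character.
loose = r"http://book\.zongheng\.com/book/\d+.html"
# '\.' matches only a literal dot.
strict = r"http://book\.zongheng\.com/book/\d+\.html"

good = "http://book.zongheng.com/book/123.html"
odd = "http://book.zongheng.com/book/123xhtml"

print(is_allowed(good, allow=[strict]))     # True
print(is_allowed(odd, allow=[loose]))       # True: 'x' satisfies the bare '.'
print(is_allowed(odd, allow=[strict]))      # False
print(is_allowed(good, deny=[r"/book/"]))   # False: deny wins
print(is_allowed("http://book.zongheng.com/store/"))  # True: empty allow matches all
```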

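Case study: crawling Zongheng novels

Requirements analysis

Content to crawl: book title, author, completion status, word count, synopsis, and every chapter with its content

Page structure

Level-1 page: the URL of each novel

Level-2 page: book title, author, completion status, word count, synopsis, URL of the chapter catalog

Level-3 page: the name of each chapter

Level-4 page: the content of each chapter

Required fields

Novel information

Chapter information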

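Code implementation

The spider file

Based on the target data (the data to be stored), define the Rule objects in rules, configure callbacks where needed, and parse each response to obtain the wanted data.

parse_book collects the novel information
parse_catalog collects the chapter information
parse_chapter collects the chapter content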

# Spider file name: zh
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import BookItem, ChapterItem, CatalogItem


class ZhSpider(CrawlSpider):
    name = 'zh'
    allowed_domains = ['book.zongheng.com']
    # Start URL: the level-1 (listing) page
    start_urls = ['http://book.zongheng.com/store/c0/c0/b0/u0/p1/v9/s1/t1/u0/i1/ALL.html']

    # Crawl rules: 1. extract URLs (LinkExtractor object), 2. build requests,
    # 3. decide how each response is handled
    rules = (
        # restrict_xpaths limits the region in which links are matched
        Rule(LinkExtractor(allow=r'http://book\.zongheng\.com/book/\d+\.html',
                           restrict_xpaths='//div[@class="bookname"]'),
             callback='parse_book', follow=True, process_links='get_booklink'),
        Rule(LinkExtractor(allow=r'http://book\.zongheng\.com/showchapter/\d+\.html'),
             callback='parse_catalog', follow=True),
        # restrict_xpaths limits the region in which links are matched
        Rule(LinkExtractor(allow=r'http://book\.zongheng\.com/chapter/\d+/\d+\.html',
                           restrict_xpaths='//ul[@class="chapter-list clearfix"]'),
             callback='parse_chapter', follow=False, process_links='get_chapterlink'),
    )

    def get_booklink(self, links):
        # Process the book URLs found by the LinkExtractor: keep only the first book
        for index, link in enumerate(links):
            if index == 0:
                yield link
            else:
                return

    def get_chapterlink(self, links):
        # Process the chapter URLs found by the LinkExtractor:
        # keep indexes 0 to 20, i.e. the first 21 chapters
        for index, link in enumerate(links):
            if index <= 20:
                yield link
            else:
                return

    def parse_book(self, response):
        # Category
        category = response.xpath('//div[@class="book-info"]/div[@class="book-label"]/a[2]/text()').extract()[0].strip()
        # Book title
        book_name = response.xpath('//div[@class="book-info"]/div[@class="book-name"]/text()').extract()[0].strip()
        # Author
        author = response.xpath('//div[@class="au-name"]/a/text()').extract()[0].strip()
        # Status (completed or ongoing)
        status = response.xpath('//div[@class="book-info"]/div[@class="book-label"]/a[1]/text()').extract()[0].strip()
        # Word count
        book_nums = response.xpath('//div[@class="book-info"]/div[@class="nums"]/span/i/text()').extract()[0].strip()
        # Synopsis
        description = ' '.join(response.xpath('//div[@class="book-info"]/div[@class="book-dec Jbook-dec hide"]/p/text()').extract())
        # URL of the book page
        book_url = response.url
        # URL of the chapter catalog
        catalog_url = response.xpath('//div[@class="book-info"]//div[@class="fr link-group"]/a/@href').extract()[0].strip()

        item = BookItem()
        item['category'] = category
        item['book_name'] = book_name
        item['author'] = author
        item['status'] = status
        item['book_nums'] = book_nums
        item['description'] = description
        item['book_url'] = book_url
        item['catalog_url'] = catalog_url
        yield item

    def parse_catalog(self, response):
        a_text = response.xpath('//ul[@class="chapter-list clearfix"]/li/a')
        chapter_list = []
        catalog_url = response.url
        for a in a_text:
            title = a.xpath('./text()').extract()[0]
            chapter_url = a.xpath('./@href').extract()[0]
            # (chapter title, chapter URL, catalog URL)
            chapter_list.append((title, chapter_url, catalog_url))
        item = CatalogItem()
        item['chapter_list'] = chapter_list
        yield item

    def parse_chapter(self, response):
        content = ''.join(response.xpath('//div[@class="content"]/p/text()').extract())
        chapter_url = response.url
        item = ChapterItem()
        item['content'] = content          # text of the chapter
        item['chapter_url'] = chapter_url  # URL of the chapter
        yield item
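How the three URL patterns route pages to the three callbacks can be sketched without Scrapy. The snippet below is a simplification under stated assumptions: RULES and callback_for are illustrative names, the dots in the patterns are escaped so they match literally, and restrict_xpaths and process_links are not modelled. It relies on the fact that CrawlSpider tries rules in order, so each link is handled by the first rule that extracts it.

```python
import re

# (pattern, callback name) pairs in the same order as the spider's rules.
RULES = [
    (re.compile(r"http://book\.zongheng\.com/book/\d+\.html"), "parse_book"),
    (re.compile(r"http://book\.zongheng\.com/showchapter/\d+\.html"), "parse_catalog"),
    (re.compile(r"http://book\.zongheng\.com/chapter/\d+/\d+\.html"), "parse_chapter"),
]


def callback_for(url):
    """Return the callback name of the first rule whose pattern matches, else None."""
    for pattern, callback in RULES:
        if pattern.search(url):
            return callback
    return None


# The numeric ids below are made up for the demonstration.
for url in (
    "http://book.zongheng.com/book/1001.html",
    "http://book.zongheng.com/showchapter/1001.html",
    "http://book.zongheng.com/chapter/1001/2002.html",
    "http://book.zongheng.com/store/c0/c0/b0/u0/p1/v9/s1/t1/u0/i1/ALL.html",
):
    print(url, "->", callback_for(url))
```

The listing page matches no pattern, which fits its role as a start URL that is only mined for links.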

The items file

The fields in the items file are determined by what the target data requires.

import scrapy


class BookItem(scrapy.Item):
    # Novel information
    category = scrapy.Field()
    book_name = scrapy.Field()
    author = scrapy.Field()
    status = scrapy.Field()
    book_nums = scrapy.Field()
    description = scrapy.Field()
    book_url = scrapy.Field()
    catalog_url = scrapy.Field()


class CatalogItem(scrapy.Item):
    # List of (chapter title, chapter URL, catalog URL) tuples
    chapter_list = scrapy.Field()


class ChapterItem(scrapy.Item):
    # Chapter text and the URL it was fetched from
    content = scrapy.Field()
    chapter_url = scrapy.Field()

The pipelines file

Here the data is written to the database.

First write the novel information.

Then write the chapter information, everything except the chapter content.

Finally write the chapter content, located by chapter URL.

Note: create the database and all tables beforehand, and make sure the column names used in the code match the tables exactly!

import pymysql
# zhnovel is the name of my project folder
from zhnovel.items import BookItem, ChapterItem, CatalogItem
from scrapy.exceptions import DropItem


class ZhnovelPipeline:
    # Open the database
    def open_spider(self, spider):
        data_config = spider.settings['DATABASE_CONFIG']
        # Establish the connection
        self.conn = pymysql.connect(**data_config)
        # Create the cursor
        self.cur = self.conn.cursor()
        spider.conn = self.conn
        spider.cur = self.cur

    # Store the data
    def process_item(self, item, spider):
        if isinstance(item, BookItem):
            # Insert the novel only if it is not already stored
            sql = "select id from novel where book_name=%s and author=%s"
            self.cur.execute(sql, (item['book_name'], item['author']))
            if not self.cur.fetchone():
                sql = ("insert into novel(category,book_name,author,status,"
                       "book_nums,description,book_url,catalog_url) "
                       "values(%s,%s,%s,%s,%s,%s,%s,%s)")
                self.cur.execute(sql, (item['category'], item['book_name'], item['author'],
                                       item['status'], item['book_nums'], item['description'],
                                       item['book_url'], item['catalog_url']))
                self.conn.commit()
            return item
        elif isinstance(item, CatalogItem):
            # Clears the whole chapter table before inserting the catalog
            sql = 'delete from chapter'
            self.cur.execute(sql)
            sql = 'insert into chapter(title,ord_num,chapter_url,catalog_url) values(%s,%s,%s,%s)'
            data_list = []
            for index, chapter in enumerate(item['chapter_list']):
                ord_num = index + 1
                title, chapter_url, catalog_url = chapter
                data_list.append((title, ord_num, chapter_url, catalog_url))
            self.cur.executemany(sql, data_list)
            self.conn.commit()
            return item
        elif isinstance(item, ChapterItem):
            sql = "update chapter set content=%s where chapter_url=%s"
            self.cur.execute(sql, (item['content'], item['chapter_url']))
            self.conn.commit()
            return item
        else:
            # DropItem is an exception and must be raised, not returned
            raise DropItem('unknown item type')

    # Close the database
    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
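The three-step write order (novel, then chapter rows, then chapter content by URL) can be tried out locally with the standard sqlite3 module. This is a sketch under assumptions, not the article's MySQL setup: the in-memory database, the trimmed-down table schemas and the save_* function names are all hypothetical, and sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database standing in for MySQL
cur = conn.cursor()

# Hypothetical minimal schemas; the real tables hold more columns.
cur.execute("CREATE TABLE novel (id INTEGER PRIMARY KEY, book_name TEXT, author TEXT)")
cur.execute(
    "CREATE TABLE chapter (id INTEGER PRIMARY KEY, title TEXT, ord_num INTEGER, "
    "chapter_url TEXT, catalog_url TEXT, content TEXT)"
)


def save_book(book_name, author):
    """Step 1: insert the novel only if the same title and author are not stored yet."""
    cur.execute("SELECT id FROM novel WHERE book_name=? AND author=?", (book_name, author))
    if cur.fetchone() is None:
        cur.execute("INSERT INTO novel(book_name, author) VALUES(?, ?)", (book_name, author))
        conn.commit()


def save_catalog(chapter_list):
    """Step 2: insert one row per chapter, numbering chapters from 1."""
    rows = [
        (title, ord_num, chapter_url, catalog_url)
        for ord_num, (title, chapter_url, catalog_url) in enumerate(chapter_list, start=1)
    ]
    cur.executemany(
        "INSERT INTO chapter(title, ord_num, chapter_url, catalog_url) VALUES(?, ?, ?, ?)", rows
    )
    conn.commit()


def save_content(chapter_url, content):
    """Step 3: attach content to an existing chapter row; returns rows updated."""
    cur.execute("UPDATE chapter SET content=? WHERE chapter_url=?", (content, chapter_url))
    updated = cur.rowcount
    conn.commit()
    return updated


save_book("Book A", "Author X")
save_book("Book A", "Author X")  # second call is a no-op
save_catalog([
    ("Chapter 1", "http://example.com/chapter/1", "http://example.com/catalog"),
    ("Chapter 2", "http://example.com/chapter/2", "http://example.com/catalog"),
])
hit = save_content("http://example.com/chapter/1", "Once upon a time")
miss = save_content("http://example.com/chapter/99", "orphan text")
novel_count = cur.execute("SELECT COUNT(*) FROM novel").fetchone()[0]
print(novel_count, hit, miss)
```

The miss case is worth noting: an UPDATE for a chapter URL that has no row yet changes nothing and raises no error, so chapter content only lands if the catalog rows were written first.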

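The settings file

Set the robots.txt policy, add global request headers, and enable the pipeline.
Enable a download delay (optional, but recommended).
Configure the database.

ROBOTSTXT_OBEY = False  # robots.txt policy
DOWNLOAD_DELAY = 1      # download delay of 1 s
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36 Edg/91.0.864.54',
}
# Enable the pipeline (zhnovel is the project folder name)
ITEM_PIPELINES = {
    'zhnovel.pipelines.ZhnovelPipeline': 300,
}
DATABASE_CONFIG = {          # database configuration
    'host': 'localhost',     # 127.0.0.1 or localhost
    'port': 3306,            # port 3306
    'user': 'root',          # your MySQL user name
    'password': '123456',    # your MySQL password
    'db': 'zhnovel',         # your database
    'charset': 'utf8',       # utf8 encoding
}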

