[Crawler] Scraping Tencent Job Postings with Scrapy

Published: 2025/3/19

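Goal: scrape Tencent's experienced-hire job postings. The fields to collect are the position name, the link to the position details, the position category, the number of openings, the work location, and the publication date.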

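I. Prerequisites

1. Introduction to Scrapy

Scrapy is an application framework written in pure Python for crawling websites and extracting structured data. It is widely used for data mining, monitoring, and automated testing.

Scrapy handles network communication with the Twisted asynchronous networking library, which speeds up downloads without requiring you to build your own asynchronous framework. It also exposes a range of middleware interfaces, so it can be adapted flexibly to different needs.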

Websites:

Official site
Chinese-language site (community maintained)

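2. Scrapy architecture

Scrapy consists mainly of the following components:

Engine (Scrapy Engine): controls the data flow across the whole system and triggers events. It is the core of the framework.
Scheduler: accepts requests from the engine, pushes them onto a queue, and returns them when the engine asks for the next one. It can be thought of as a priority queue of URLs (the addresses of the pages to fetch): it decides which URL is fetched next and drops duplicate URLs.
Downloader: downloads page content and returns it to the spiders. The Scrapy downloader is built on Twisted's efficient asynchronous model.
Spiders: do the main work, extracting the information you need from specific pages as items (Item). A spider can also extract links so that Scrapy goes on to crawl the next page.
Item Pipeline: processes the items that spiders extract from pages. Its main jobs are persisting items, validating them, and removing unwanted information. After a page is parsed by a spider, the resulting items are sent to the item pipeline and pass through its stages in a defined order.
Downloader Middlewares: a layer between the Scrapy engine and the downloader that processes the requests and responses passing between them.
Spider Middlewares: a layer between the Scrapy engine and the spiders that processes spider input (responses) and spider output (items and requests).
Scheduler Middlewares: a layer between the Scrapy engine and the scheduler that processes the requests and responses passing between them. This component appears in older descriptions of the architecture; the current Scrapy documentation lists only downloader and spider middlewares.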

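3. Execution flow

First, the engine takes a URL from the scheduler for the next fetch.
The engine wraps the URL in a Request and passes it to the downloader, which downloads the resource and wraps it in a Response.
Then the spider parses the Response:
If an item (Item) is parsed out, it is handed to the item pipeline for further processing.
If a link (URL) is parsed out, the URL is handed to the scheduler to wait for fetching.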
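
The flow above can be sketched in plain Python. This is only a toy model, not Scrapy's implementation: FAKE_SITE, download, parse, and crawl are hypothetical names, the "site" is an in-memory dictionary, and everything runs synchronously, whereas Scrapy is asynchronous.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (items on the page, links on the page).
FAKE_SITE = {
    "page/0": (["job A", "job B"], ["page/1"]),
    "page/1": (["job C"], ["page/0", "page/2"]),  # links back to page/0 (a duplicate)
    "page/2": (["job D"], []),
}


def download(url):
    """Stand-in for the downloader: return the 'response' for a URL."""
    return FAKE_SITE[url]


def parse(response):
    """Stand-in for a spider callback: yield items and follow-up URLs."""
    items, links = response
    for item in items:
        yield ("item", item)
    for link in links:
        yield ("url", link)


def crawl(start_url):
    queue = deque([start_url])  # scheduler queue
    seen = {start_url}          # scheduler's duplicate filter
    stored = []                 # what the item pipeline would persist
    while queue:
        url = queue.popleft()                # engine takes the next URL
        response = download(url)             # downloader fetches it
        for kind, value in parse(response):  # spider parses the response
            if kind == "item":
                stored.append(value)         # items go to the pipeline
            elif value not in seen:
                seen.add(value)              # new URLs go back to the scheduler
                queue.append(value)
    return stored


print(crawl("page/0"))
```

Each page is fetched once even though page/1 links back to page/0, because the seen set rejects the repeated URL before it reaches the queue.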

4. Installation

pip install Scrapy

5. Scrapy project layout

Create a new Scrapy project named Test:

scrapy startproject Test

The project directory structure is then:

Test
├── scrapy.cfg
└── Test
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

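scrapy.cfg: the project's configuration file
Test/: the project's Python module; code is imported from here
Test/items.py: the project's item definitions
Test/pipelines.py: the project's pipeline file
Test/settings.py: the project's settings file
Test/spiders/: the directory that holds the spider code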

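II. Scraping Tencent job postings with Scrapy

The usual steps for building a crawler:

Create the project (scrapy startproject xxx): set up a new crawler project
Define the target (write items.py): define the structured data to extract
Build the spider (spiders/xxspider.py): write the spider that crawls the pages and extracts the structured data
Store the content (pipelines.py): design a pipeline that stores the scraped content

1. Create the Scrapy project

scrapy startproject Tencent
cd Tencent

2. Write items.py

Define the item fields according to what needs to be scraped: the position name, the link to the position details, the position category, the number of openings, the work location, and the publication date.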

# -*- coding: utf-8 -*-
import scrapy

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # Position name
    positionname = scrapy.Field()
    # Link to the position details
    positionlink = scrapy.Field()
    # Position category
    positionType = scrapy.Field()
    # Number of openings
    peopleNum = scrapy.Field()
    # Work location
    workLocation = scrapy.Field()
    # Publication date
    publishTime = scrapy.Field()

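3. Write the spider

Use the following command to create a basic spider class:

scrapy genspider tencentPostion "tencent.com"

Here tencentPostion is the spider name and tencent.com is the domain the spider is restricted to.

Running the command creates a file named tencentPostion.py in the Tencent\spiders folder. Now edit it: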

# -*- coding: utf-8 -*-
import scrapy
from Tencent.items import TencentItem


class TencentpostionSpider(scrapy.Spider):
    """Scrape Tencent job postings."""
    # Spider name
    name = 'tencentPostion'
    # Domains the spider is allowed to crawl
    allowed_domains = ['tencent.com']

    url = 'https://hr.tencent.com/position.php?&start='
    offset = 0
    # Start URL
    start_urls = [url + str(offset)]

    def parse(self, response):
        for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
            # Create the item object
            item = TencentItem()
            item['positionname'] = each.xpath("./td[1]/a/text()").extract()[0]
            item['positionlink'] = each.xpath("./td[1]/a/@href").extract()[0]
            # The category cell can be empty, so guard against a missing text node
            if len(each.xpath("./td[2]/text()").extract()) > 0:
                item['positionType'] = each.xpath('./td[2]/text()').extract()[0]
            else:
                item['positionType'] = "None"
            item['peopleNum'] = each.xpath("./td[3]/text()").extract()[0]
            item['workLocation'] = each.xpath("./td[4]/text()").extract()[0]
            item['publishTime'] = each.xpath("./td[5]/text()").extract()[0]
            yield item

        if self.offset < 2000:
            self.offset += 10
            # After each page is processed, request the next one: the new offset
            # is appended to the base URL and self.parse handles the Response.
            # The request stays inside the offset check so that paging stops;
            # with dont_filter=True the duplicate filter would not stop it.
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse, dont_filter=True)
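
To see what the row-level XPath expressions select, here is a minimal sketch that applies the same logic to a small table with the standard library's xml.etree.ElementTree instead of Scrapy's selectors. The markup in SAMPLE, its hrefs, and the parse_rows helper are all hypothetical; the real page is only assumed to have had "even"/"odd" rows with five cells each. ElementTree supports just a subset of XPath, so the class test is done in Python rather than with the | union.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the listing page.
SAMPLE = """
<table>
  <tr class="even">
    <td><a href="position_detail.php?id=1">Backend Engineer</a></td>
    <td>Technology</td><td>2</td><td>Shenzhen</td><td>2018-05-01</td>
  </tr>
  <tr class="odd">
    <td><a href="position_detail.php?id=2">Product Designer</a></td>
    <td></td><td>1</td><td>Beijing</td><td>2018-05-02</td>
  </tr>
</table>
"""


def parse_rows(markup):
    root = ET.fromstring(markup.strip())
    rows = []
    for tr in root.iter("tr"):
        # Equivalent of //tr[@class='even'] | //tr[@class='odd']
        if tr.get("class") not in ("even", "odd"):
            continue
        link = tr.find("./td[1]/a")   # first cell holds the name and the link
        cells = tr.findall("./td")
        rows.append({
            "positionname": link.text,
            "positionlink": link.get("href"),
            # An empty cell has text None; fall back to "None" as the spider does.
            "positionType": cells[1].text or "None",
            "peopleNum": cells[2].text,
            "workLocation": cells[3].text,
            "publishTime": cells[4].text,
        })
    return rows


for row in parse_rows(SAMPLE):
    print(row)
```

The second row has an empty category cell, which is the case the spider's length check guards against.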

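Problem encountered:
1. [scrapy] DEBUG: Filtered duplicate request: <GET xxxx> - no more duplicates will be shown ([reference](https://blog.csdn.net/sinat_41701878/article/details/80302357))

This message appears when a CrawlSpider combined with LinkExtractor and Rule extracts and sends links that repeat, producing duplicate requests,
or when yield scrapy.Request(xxxurl, callback=self.xxxx) issues a request that has already been made.

Scrapy filters duplicate requests by default, and the message only reports that it did so. Passing dont_filter=True to the Request makes the message go away, but it does so by exempting that request from the duplicate filter, so the spider needs its own stopping condition to avoid fetching the same page over and over: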

yield scrapy.Request(xxxurl,callback=self.xxxx,dont_filter=True)
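
The effect of dont_filter can be illustrated with a toy duplicate filter. This is a simplified sketch, not Scrapy's code: SeenFilter and its allow method are hypothetical, and Scrapy's own filter compares request fingerprints rather than raw URL strings.

```python
class SeenFilter:
    """Toy duplicate filter: remembers every URL it has let through."""

    def __init__(self):
        self.seen = set()

    def allow(self, url, dont_filter=False):
        if dont_filter:
            return True   # bypass: the request is neither recorded nor rejected
        if url in self.seen:
            return False  # duplicate: this is where the DEBUG line would be logged
        self.seen.add(url)
        return True


dupefilter = SeenFilter()
base = "https://hr.tencent.com/position.php?&start="
print(dupefilter.allow(base + "0"))                    # True: first time seen
print(dupefilter.allow(base + "0"))                    # False: filtered as a duplicate
print(dupefilter.allow(base + "0", dont_filter=True))  # True: filter bypassed
```

With the bypass in place nothing stops a callback from requesting the same URL again, which is why a spider that uses dont_filter=True for paging has to bound the page offset itself.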

4. Write pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class TencentPipeline(object):
    """Save item data to a file."""

    def __init__(self):
        # Binary mode, because process_item writes encoded bytes
        self.filename = open("tencent.json", "wb+")

    def process_item(self, item, spider):
        # One JSON object per line, keeping non-ASCII characters readable
        text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.filename.write(text.encode("utf-8"))
        return item

    def close_spider(self, spider):
        self.filename.close()

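Q: TypeError: write() argument must be str, not bytes
Situation: this error appeared when writing to the file opened with open.
process_item writes UTF-8 encoded bytes, and a file opened in text mode (for example "w") accepts only str. Opening the file in binary mode fixes it: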

self.filename = open("tencent.json", "wb+")
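
As an alternative to binary mode, the file can be opened in text mode with an explicit encoding and written with str directly. The sketch below is a standalone illustration rather than the article's pipeline: JsonLinesWriter, the temporary path, and the sample item are hypothetical, and it writes one JSON object per line without the trailing comma.

```python
import json
import os
import tempfile


class JsonLinesWriter:
    """Hypothetical helper mirroring the pipeline's open/write/close steps."""

    def __init__(self, path):
        # Text mode with an explicit encoding: write() then expects str, not bytes.
        self.file = open(path, "w", encoding="utf-8")

    def process_item(self, item):
        # ensure_ascii=False keeps Chinese text readable in the output file.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close(self):
        self.file.close()


path = os.path.join(tempfile.mkdtemp(), "tencent.jsonl")
writer = JsonLinesWriter(path)
writer.process_item({"positionname": "后台开发工程师", "peopleNum": "2"})
writer.close()

with open(path, encoding="utf-8") as fh:
    print(fh.read())
```

Either approach works as long as the open mode and the type passed to write() agree: bytes with "wb", str with "w".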

5. Configure settings.py

# Default request headers, including a User-Agent
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}

# Enable the item pipeline. The module path is case-sensitive and must match
# the project package name, Tencent.
ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
}

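6. Run the spider

scrapy crawl tencentPostion

Here tencentPostion is the spider name; it must match the name attribute defined in the spider class exactly, including its spelling.

When the crawl finishes, the scraped items are in tencent.json, the file opened by TencentPipeline.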

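III. Using the CrawlSpider class

# Create the project
scrapy startproject TencentSpider
cd TencentSpider
# Inside the project directory, create the spider file
scrapy genspider -t crawl tencent tencent.com

items.py and the other files stay the same. The main work is writing the spider file (TencentSpider\spiders\tencent.py):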

# -*- coding: utf-8 -*-
import scrapy
# LinkExtractor extracts the links that match a given rule
from scrapy.linkextractors import LinkExtractor
# Import the CrawlSpider class and Rule
from scrapy.spiders import CrawlSpider, Rule
from TencentSpider.items import TencentItem


class TencentSpider(CrawlSpider):
    name = 'tencent'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php?&start=0#a']

    # Extraction rule for links in a Response: keep the links that match the pattern
    # (raw string, so that \d is not treated as a string escape)
    pagelink = LinkExtractor(allow=(r"start=\d+",))
    rules = (
        # Request each extracted link, keep following, and handle the response with the named callback
        Rule(pagelink, callback='parseTencent', follow=True),
    )

    # The callback named in the rule
    def parseTencent(self, response):
        for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
            item = TencentItem()
            item['positionname'] = each.xpath("./td[1]/a/text()").extract()[0]
            item['positionlink'] = each.xpath("./td[1]/a/@href").extract()[0]
            # The category cell can be empty, so guard against a missing text node
            if len(each.xpath("./td[2]/text()").extract()) > 0:
                item['positionType'] = each.xpath('./td[2]/text()').extract()[0]
            else:
                item['positionType'] = "None"
            item['peopleNum'] = each.xpath("./td[3]/text()").extract()[0]
            item['workLocation'] = each.xpath("./td[4]/text()").extract()[0]
            item['publishTime'] = each.xpath("./td[5]/text()").extract()[0]
            yield item
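
The allow pattern decides which links the rule follows. A rough way to check a pattern before running the spider is to try it against a few URLs with the re module. The hrefs below are hypothetical examples, and the sketch assumes the pattern is applied as an unanchored search over each URL.

```python
import re

PAGE_LINK = re.compile(r"start=\d+")

# Hypothetical hrefs of the kind a listing page might contain.
hrefs = [
    "position.php?&start=10#a",
    "position.php?&start=20#a",
    "position_detail.php?id=123",
    "about.php",
]

# Keep only the links that the pagination pattern matches.
followed = [h for h in hrefs if PAGE_LINK.search(h)]
print(followed)
```

Only the two pagination links match, so under these assumptions the detail page and the unrelated page would not be followed by this rule.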
