The Scrapy Crawler Framework in the Era of Big Data

Published: 2023/12/10

Contents

Preface
I. What is Scrapy?
II. Usage
- 1. Install Scrapy
- 2. Create a Scrapy project
- 3. Scrapy architecture
III. Hands-on project: scraping movie information from the Douban Movie Top 250
- 1. items.py
- 2. pipelines.py
- 3. douban_spider.py
- 4. Results

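Preface

With the arrival of the big data era, data matters more and more to a business. A company without data to support its decisions risks falling behind its competitors. So how do you obtain that data? This article shows how to scrape useful data from the internet and store it persistently.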

I. What is Scrapy?

Scrapy is a crawler framework written in Python and built on Twisted, an asynchronous networking framework. You only need to customize a few modules to build a crawler that scrapes web page content and images.

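II. Usage

1. Install Scrapy

pip install scrapy

2. Create a Scrapy project

scrapy startproject <project_name>

In the hands-on project below, the project is named douban, since the spider imports from douban.items.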

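3. Scrapy architecture

Item Pipeline: the post-processing module. It receives each item produced by the Spiders and cleans, validates, or stores it. The structure of the data to store is defined separately by an Item class (items.py), which is comparable to a class in object-oriented programming.
Spiders: the parsing module. A spider only parses the downloaded responses: it extracts the data as items and extracts the links to follow, which are sent to the Scheduler as new requests.
Downloader: the download module. It only performs the requests and returns the responses to the Spiders for parsing.
Scheduler: the queue module. It only queues the incoming requests and hands them to the Downloader.

These modules do not call each other directly; the Scrapy Engine passes requests, responses, and items between them.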
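The data flow between these modules can be sketched as a toy loop. This is only an illustration under simplifying assumptions: it runs sequentially, whereas Scrapy itself is asynchronous, and FAKE_WEB, download, parse, and crawl are hypothetical names invented here, not Scrapy APIs.

```python
from collections import deque

# Hypothetical stand-in for the web: URL -> (page text, links found on that page)
FAKE_WEB = {
    "page1": ("first page", ["page2"]),
    "page2": ("second page", []),
}


def download(url):
    """Downloader: return the response for a request (faked here, no network)."""
    return FAKE_WEB[url]


def parse(url, response):
    """Spider: turn a response into items and follow-up requests."""
    text, links = response
    yield {"type": "item", "url": url, "text": text}
    for link in links:
        yield {"type": "request", "url": link}


def crawl(start_urls):
    scheduler = deque(start_urls)  # Scheduler: queue of pending requests
    seen = set(start_urls)         # avoid scheduling the same URL twice
    stored = []                    # Item Pipeline: here it only collects the items
    while scheduler:
        url = scheduler.popleft()
        response = download(url)
        for result in parse(url, response):
            if result["type"] == "item":
                stored.append({"url": result["url"], "text": result["text"]})
            elif result["url"] not in seen:
                seen.add(result["url"])
                scheduler.append(result["url"])
    return stored


items = crawl(["page1"])
print(items)
```

The loop inside crawl plays the role of the Engine: it takes a request from the queue, has it downloaded, passes the response to the spider, and routes each result either to storage or back into the queue.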

III. Hands-on project: scraping movie information from the Douban Movie Top 250

1. items.py

This file defines the item, that is, the fields to be scraped for each movie.

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DoubanItem(scrapy.Item):
    # Rank of the movie in the list
    movie_number = scrapy.Field()
    # Movie title
    movie_name = scrapy.Field()
    # Movie details
    movie_tostar = scrapy.Field()
    # Star rating
    movie_star = scrapy.Field()
    # Number of people who rated the movie
    movie_evaluate = scrapy.Field()
    # Short introduction to the movie
    movie_introduction = scrapy.Field()

2. pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import pymongo


class DoubanPipeline:
    def __init__(self) -> None:
        host = "localhost"
        port = 27017
        dbname = "movie"
        # Connect to the local MongoDB server and select the target collection
        client = pymongo.MongoClient(host, port)
        db = client[dbname]
        self.movie_table = db["movie_table"]

    def process_item(self, item, spider):
        # Print the item for debugging
        print(item)
        # Convert the item to a plain dict and insert it as one document
        data = ItemAdapter(item).asdict()
        self.movie_table.insert_one(data)
        return item
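The template comment in pipelines.py notes that a pipeline only runs once it is listed in the ITEM_PIPELINES setting. A minimal sketch of that entry in the project's settings.py follows. It assumes the project package is named douban, as the spider's import from douban.items suggests, and the priority 300 is an arbitrary illustrative value.

```python
# settings.py (fragment)

# Register the pipeline. The integer sets the order when several pipelines
# are enabled: lower values run first. 300 is an assumed, illustrative value.
ITEM_PIPELINES = {
    "douban.pipelines.DoubanPipeline": 300,
}
```

With this entry in place, every item yielded by the spider is passed to DoubanPipeline.process_item.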

3. douban_spider.py

import scrapy

from douban.items import DoubanItem


class DoubanSpiderSpider(scrapy.Spider):
    # Spider name
    name = "douban_spider"
    # Allowed domains
    allowed_domains = ["movie.douban.com"]
    # Entry URL
    start_urls = ["https://movie.douban.com/top250"]

    def parse(self, response):
        # One <li> element per movie on the current page
        movies = response.xpath('//*[@id="content"]/div/div[1]/ol/li')
        for movie in movies:
            my_item = DoubanItem()
            my_item["movie_number"] = movie.xpath('./div[@class="item"]//em/text()').get()
            my_item["movie_name"] = movie.xpath('./div[@class="item"]/div[@class="info"]//a/span[1]/text()').get()
            my_item["movie_tostar"] = movie.xpath('./div[@class="item"]/div[@class="info"]//div[@class="bd"]/p/text()').get()
            my_item["movie_star"] = movie.xpath('./div[@class="item"]/div[@class="info"]//div[@class="star"]/span[2]/text()').get()
            my_item["movie_evaluate"] = movie.xpath('./div[@class="item"]/div[@class="info"]//div[@class="star"]/span[4]/text()').get()
            my_item["movie_introduction"] = movie.xpath('./div[@class="item"]/div[@class="info"]//p[@class="quote"]/span[1]/text()').get()
            yield my_item

        # Follow the "next page" link if there is one; its href is relative,
        # so response.follow resolves it against the current page URL
        next_page = response.xpath('//*[@id="content"]/div/div[1]/div[2]/span[3]/a/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
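The pagination step relies on the next-page link being a relative href that is resolved against the listing URL. A small standard-library sketch of that resolution is shown here. The href value is an assumed example of what such a link might contain, not taken from the live page.

```python
from urllib.parse import urljoin

page_url = "https://movie.douban.com/top250"
# Assumed example of the relative href found in the "next page" link
next_href = "?start=25&filter="

# A query-only reference keeps the base path and replaces the query string
next_url = urljoin(page_url, next_href)
print(next_url)  # https://movie.douban.com/top250?start=25&filter=
```

Inside a spider, response.urljoin(href) performs the same resolution against the URL of the current response, which is safer than concatenating strings by hand if the link format ever changes.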

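4. Results

Run the spider from the project root:

scrapy crawl douban_spider

With the pipeline enabled, each scraped movie is printed to the console and inserted into the movie_table collection of the movie database in MongoDB.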
