[Crawler Notes] Crawling a Technical Article Site with Scrapy
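Contents
- I. XPath
  - 1. Introduction to XPath
  - 2. XPath syntax
- II. CSS selectors
- III. Crawling Jobbole: Basics
  - 1. Creating the Scrapy project
  - 2. Writing items.py
  - 3. Writing the spider file
  - 4. Writing the pipelines file: saving to a JSON file
  - 5. Configuring settings.py
  - 6. Running the crawler
- IV. Crawling Jobbole: Advanced
  - 1. The item loader mechanism
    - (1) Approach
    - (2) spider.py
    - (3) items.py
  - 2. The pipelines file
    - (1) Installing the environment (MySQL, Navicat)
    - (2) Saving to MySQL (synchronous)
    - (3) Saving to MySQL (asynchronous)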
For a basic introduction to Scrapy, see the separate introductory post.
I. XPath
1. Introduction to XPath
- XPath uses path expressions to navigate XML and HTML documents.
- XPath includes a standard function library.
- XPath is a W3C standard.
Node relationships in XPath
Parent, child, sibling, ancestor, and descendant nodes.
2. XPath syntax
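| Expression | Meaning |
| --- | --- |
| article | Selects all article elements that are children of the current node |
| /article | Selects the root element article |
| article/a | Selects all a elements that are children of article |
| //div | Selects all div elements, wherever they appear in the document |
| article//div | Selects all div elements that are descendants of article, at any depth below it |
| //@class | Selects all attributes named class |
| /article/div[1] | Selects the first div child of article |
| /article/div[last()] | Selects the last div child of article |
| /article/div[last()-1] | Selects the second-to-last div child of article |
| //div[@lang] | Selects all div elements that have a lang attribute |
| //div[@lang='eng'] | Selects all div elements whose lang attribute equals eng |
| /div/* | Selects all child elements of the root div element |
| //* | Selects all elements |
| //div[@*] | Selects all div elements that have at least one attribute |
| /div/a \| //div/p | Selects the a children of the root div element and the p children of every div element |
| //span \| //ul | Selects all span elements and all ul elements in the document |
| article/div/p \| //span | Selects all p elements under a div child of article, plus all span elements |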
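A few of these expressions can be tried without Scrapy. The following is a minimal sketch that uses Python's built-in xml.etree.ElementTree, which supports only a subset of XPath; Scrapy's response.xpath() accepts the full syntax. The sample document is invented for illustration, and the paths are written relative to the root element, so `.//div` plays the role of `//div`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration.
DOC = """
<article>
  <div lang="eng">first</div>
  <div lang="zh">second</div>
  <div>third</div>
  <section><div lang="eng">nested</div></section>
</article>
"""

root = ET.fromstring(DOC)  # root is the <article> element

# /article/div[1]: the first div child of article
first = root.find("div[1]").text

# /article/div[last()] and /article/div[last()-1]
last = root.find("div[last()]").text
second_to_last = root.find("div[last()-1]").text

# //div: every div element, at any depth
all_divs = [d.text for d in root.findall(".//div")]

# //div[@lang] and //div[@lang='eng']
with_lang = [d.text for d in root.findall(".//div[@lang]")]
eng_only = [d.text for d in root.findall(".//div[@lang='eng']")]

print(first, last, second_to_last)
print(all_divs, with_lang, eng_only)
```

Note that the positional predicates count only the div children of article, so the nested div inside section does not affect div[last()].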
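II. CSS selectors
| Selector | Meaning |
| --- | --- |
| * | Selects all nodes |
| #container | Selects the node whose id is container |
| .container | Selects all nodes whose class includes container |
| li a | Selects all a nodes that are descendants of an li |
| ul + p | Selects the p element that immediately follows a ul (adjacent sibling) |
| div#container > ul | Selects the ul elements that are direct children of the div whose id is container |
| ul ~ p | Selects all p elements that follow a ul and share its parent (general siblings) |
| a[title] | Selects all a elements that have a title attribute |
| a[href="http://jobbole.com"] | Selects all a elements whose href equals http://jobbole.com |
| a[href*="jobbole"] | Selects all a elements whose href contains jobbole |
| a[href^="http"] | Selects all a elements whose href starts with http |
| a[href$=".jpg"] | Selects all a elements whose href ends with .jpg |
| input[type=radio]:checked | Selects the radio inputs that are checked |
| div:not(#container) | Selects all div elements whose id is not container |
| li:nth-child(3) | Selects every li that is the third child of its parent |
| tr:nth-child(2n) | Selects the even-numbered tr elements |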
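The attribute selectors in the table differ only in how they compare the attribute value. The sketch below spells out that comparison with plain string operations, using the standard library's html.parser to collect the links. It is an illustration of the matching rules, not how Scrapy evaluates selectors (there one would simply write response.css('a[href$=".jpg"]')). The markup and the LinkCollector class are made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, invented for illustration.
HTML = """
<a href="http://jobbole.com">home</a>
<a href="http://blog.jobbole.com/1.jpg">cover</a>
<a href="/about.html" title="About">about</a>
"""


class LinkCollector(HTMLParser):
    """Collect the attributes of every <a> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append(dict(attrs))


collector = LinkCollector()
collector.feed(HTML)
links = collector.links

# a[title]: the attribute merely has to exist
has_title = [a["href"] for a in links if "title" in a]
# a[href="http://jobbole.com"]: exact match
exact = [a["href"] for a in links if a["href"] == "http://jobbole.com"]
# a[href*="jobbole"]: substring match
contains = [a["href"] for a in links if "jobbole" in a["href"]]
# a[href^="http"]: prefix match
prefix = [a["href"] for a in links if a["href"].startswith("http")]
# a[href$=".jpg"]: suffix match
suffix = [a["href"] for a in links if a["href"].endswith(".jpg")]

print(has_title, exact, contains, prefix, suffix)
```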
III. Crawling Jobbole: Basics
A typical crawler is built in four steps:
- Create the project (scrapy startproject xxx): set up a new crawler project
- Define the target (write items.py): declare the structured data to extract
- Write the spider (spiders/xxspider.py): crawl the pages and extract the structured data
- Store the results (pipelines.py): design pipelines that store the crawled content
Goal: crawl all technical articles on Jobbole (伯樂在線). The fields to collect are: title, creation time, URL, URL id, cover image URL, cover image path, bookmark count, like count, comment count, full text, and tags.
1. Creating the Scrapy project

    scrapy startproject Article
    cd Article

2. Writing items.py
Define one field for each piece of content to be crawled: title, creation time, URL, URL id, cover image URL, cover image path, bookmark count, like count, comment count, full text, and tags.

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html

    import scrapy


    class TestarticleItem(scrapy.Item):
        title = scrapy.Field()             # title
        time = scrapy.Field()              # creation time
        url = scrapy.Field()               # URL
        url_object_id = scrapy.Field()     # URL id (MD5 of the URL)
        front_image_url = scrapy.Field()   # cover image URL
        front_image_path = scrapy.Field()  # cover image path
        coll_nums = scrapy.Field()         # bookmark count
        comment_nums = scrapy.Field()      # comment count
        fav_nums = scrapy.Field()          # like count
        content = scrapy.Field()           # full text
        tags = scrapy.Field()              # tags

3. Writing the spider file
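Generate a basic spider class with the following command:

    scrapy genspider jobbole "blog.jobbole.com"

Here jobbole is the spider name and blog.jobbole.com is the domain the spider is restricted to.
Running the command creates jobbole.py in the Article/spiders folder. The spider below implements the detail-page parser twice, once with XPath and once with CSS selectors.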

    # -*- coding: utf-8 -*-
    import re
    import datetime
    from urllib import parse

    import scrapy
    from scrapy.http import Request

    from Article.items import TestarticleItem
    from Article.utils.common import get_md5


    class JobboleSpider(scrapy.Spider):
        name = "jobbole"
        allowed_domains = ["python.jobbole.com"]
        start_urls = ['http://python.jobbole.com/all-posts/']

        def parse(self, response):
            """
            1. Extract the article URLs from the list page and hand them to Scrapy to download and parse.
            2. Extract the next-page URL and hand it to Scrapy to download; that response comes back to parse.
            """
            # Every article link on the list page
            post_nodes = response.css("#archive .floated-thumb .post-thumb a")
            for post_node in post_nodes:
                image_url = post_node.css("img::attr(src)").extract_first("")
                post_url = post_node.css("::attr(href)").extract_first("")
                # Use callback=self.parse_detail_css to try the CSS version instead
                yield Request(url=parse.urljoin(response.url, post_url),
                              meta={"front_image_url": image_url},
                              callback=self.parse_detail_xpath)

            # Follow the next list page
            next_url = response.css(".next.page-numbers::attr(href)").extract_first("")
            if next_url:
                yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)

        def parse_detail_xpath(self, response):
            article_item = TestarticleItem()

            # Extract the article fields with XPath
            front_image_url = response.meta.get("front_image_url", "")
            title = response.xpath('//div[@class="entry-header"]/h1/text()').extract()[0]
            time = response.xpath('//div[@class="entry-meta"]/p/text()').extract()[0].strip().replace("·", "").strip()
            fav_nums = response.xpath('//div[@class="post-adds"]/span[1]/h10/text()').extract()[0]
            coll_nums = response.xpath('//div[@class="post-adds"]/span[2]/text()').extract()[0]
            # Non-greedy .*? so that (\d+) captures the whole number, not only its last digit
            match_re = re.match(r".*?(\d+).*", coll_nums)
            if match_re:
                coll_nums = int(match_re.group(1))
            else:
                coll_nums = 0
            comment_nums = response.xpath('//div[@class="post-adds"]/a[@href="#article-comment"]/span/text()').extract()[0]
            match_re = re.match(r".*?(\d+).*", comment_nums)
            if match_re:
                comment_nums = int(match_re.group(1))
            else:
                comment_nums = 0
            content = response.xpath('//div[@class="entry"]').extract()[0]
            tag_list = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/a/text()').extract()
            # Drop the "N 評論" entry that appears among the tag links
            tag_list = [element for element in tag_list if not element.strip().endswith("評論")]
            tags = ",".join(tag_list)

            article_item['title'] = title
            article_item['url'] = response.url
            article_item['url_object_id'] = get_md5(response.url)
            try:
                # Same date format as in parse_detail_css
                time = datetime.datetime.strptime(time, '%Y/%m/%d').date()
            except Exception:
                time = datetime.datetime.now().date()
            article_item['time'] = time
            article_item['front_image_url'] = [front_image_url]
            article_item['fav_nums'] = fav_nums
            article_item['coll_nums'] = coll_nums
            article_item['comment_nums'] = comment_nums
            article_item['tags'] = tags
            article_item['content'] = content
            yield article_item

        def parse_detail_css(self, response):
            article_item = TestarticleItem()

            # Extract the article fields with CSS selectors
            front_image_url = response.meta.get("front_image_url", "")  # cover image
            title = response.css(".entry-header h1::text").extract()[0]
            time = response.css("p.entry-meta-hide-on-mobile::text").extract()[0].strip().replace("·", "").strip()
            # Note: coll_nums and fav_nums are read from the opposite elements compared with parse_detail_xpath
            coll_nums = response.css(".vote-post-up h10::text").extract()[0]
            fav_nums = response.css(".bookmark-btn::text").extract()[0]
            match_re = re.match(r".*?(\d+).*", fav_nums)
            if match_re:
                fav_nums = int(match_re.group(1))
            else:
                fav_nums = 0
            comment_nums = response.css("a[href='#article-comment'] span::text").extract()[0]
            match_re = re.match(r".*?(\d+).*", comment_nums)
            if match_re:
                comment_nums = int(match_re.group(1))
            else:
                comment_nums = 0
            content = response.css("div.entry").extract()[0]
            tag_list = response.css("p.entry-meta-hide-on-mobile a::text").extract()
            tag_list = [element for element in tag_list if not element.strip().endswith("評論")]
            tags = ",".join(tag_list)

            article_item["url_object_id"] = get_md5(response.url)
            article_item["title"] = title
            article_item["url"] = response.url
            try:
                time = datetime.datetime.strptime(time, "%Y/%m/%d").date()
            except Exception:
                time = datetime.datetime.now().date()
            article_item["time"] = time
            article_item["front_image_url"] = [front_image_url]
            article_item["coll_nums"] = coll_nums
            article_item["comment_nums"] = comment_nums
            article_item["fav_nums"] = fav_nums
            article_item["tags"] = tags
            article_item["content"] = content
            yield article_item

Create utils/common.py in the Article directory to hold shared helper functions.

    # -*- coding: utf-8 -*-
    import hashlib


    def get_md5(url):
        # hashlib works on bytes, so encode str input first
        if isinstance(url, str):
            url = url.encode('utf-8')
        m = hashlib.md5()
        m.update(url)
        return m.hexdigest()

4. Writing the pipelines file: saving to a JSON file
There are two ways to save the items to a JSON file:
- with the standard json module
- with Scrapy's JsonItemExporter
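The pipeline code itself is not shown in the text. As a rough idea of the first approach, here is a minimal sketch of a pipeline built on the json module. The class name JsonWithEncodingPipeline is taken from the settings in this post; the output file name article.json, the one-object-per-line layout, and the constructor argument are assumptions made for this example. The open_spider, process_item, and close_spider methods are the hooks Scrapy calls on a pipeline.

```python
import json


class JsonWithEncodingPipeline:
    """Sketch: write every item as one JSON object per line."""

    def __init__(self, path="article.json"):  # hypothetical output file name
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called by Scrapy when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + "\n")
        return item  # pass the item on to the next pipeline

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes
        self.file.close()
```

The sketch serializes the item as-is, so every field value has to be something the json module can encode. It can be exercised without Scrapy by calling the three hooks by hand with a plain dict.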
5. Configuring settings.py
ITEM_PIPELINES sets the order in which the pipeline classes run: the smaller the number, the higher the priority. To switch between the two JSON saving methods, comment out either 'Article.pipelines.JsonWithEncodingPipeline' or 'Article.pipelines.JsonExporterPipeline'.

    # Request headers (User-Agent and Accept)
    DEFAULT_REQUEST_HEADERS = {
        "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    }

    # Item pipelines
    ITEM_PIPELINES = {
        # 'Article.pipelines.ArticlePipeline': 300,
        'Article.pipelines.JsonWithEncodingPipeline': 2,
        # 'Article.pipelines.JsonExporterPipeline': 2,
    }

6. Running the crawler

    scrapy crawl jobbole

This fails with: TypeError: Object of type 'date' is not JSON serializable
Fix: item["time"] is a date object and must be converted to str before it is serialized: item["time"] = str(item["time"])
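The error and the fix can be reproduced with the standard library alone. This is a small sketch with a made-up item dict; only the json and datetime modules are involved.

```python
import datetime
import json

item = {"title": "demo", "time": datetime.date(2018, 1, 2)}  # made-up values

try:
    json.dumps(item)
    error = None
except TypeError as exc:
    # The json module has no built-in encoding for date objects
    error = str(exc)

# The fix described above: turn the date into a string first
item["time"] = str(item["time"])
encoded = json.dumps(item)

print(error)
print(encoded)
```

An alternative that leaves the item untouched is to pass default=str to json.dumps, which applies str() to any value the encoder cannot handle.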
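IV. Crawling Jobbole: Advanced
1. The item loader mechanism
In the previous section, the spider file both crawled the pages and parsed the fields defined in items.py, which makes the parsing logic hard to reuse. Item Loaders provide a convenient mechanism for populating scraped Items. Items can be filled through their own dictionary-like API, but Item Loaders offer a more convenient API that processes the raw extracted data and assigns it to the Item.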
(1) Approach
Reference article: Crawler Scrapy Learning Series, Part 7: Item Loaders
- Load the Item through an item loader (in the spider file)
  - The three main item loader methods are add_css(), add_xpath(), and add_value()
- Process the data in items.py
  - Import the processors: from scrapy.loader.processors import MapCompose, TakeFirst, Join, and so on
  - Processor functions can be passed to scrapy.Field, and custom processing functions can be defined
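To make the roles of the three processors concrete, the sketch below re-implements their core behaviour in a few lines of plain Python. These are simplified stand-ins written for this explanation, not the library classes: map_compose, take_first, join, and drop_comment_count are hypothetical names, and the real MapCompose, TakeFirst, and Join have more features.

```python
def map_compose(*functions):
    """Apply each function in turn to every value, dropping None results."""
    def process(values):
        for func in functions:
            next_values = []
            for value in values:
                result = func(value)
                if result is None:
                    continue  # a None result removes the value
                if isinstance(result, list):
                    next_values.extend(result)  # a list result is flattened
                else:
                    next_values.append(result)
            values = next_values
        return values
    return process


def take_first(values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


def join(separator=" "):
    """Return a function that joins the collected values into one string."""
    return lambda values: separator.join(values)


def drop_comment_count(value):
    # Hypothetical helper: discard the "N 評論" entry found among the tags
    return None if "評論" in value else value


raw_tags = ["  python ", " 2 評論 ", "scrapy"]  # made-up extracted values
tags_in = map_compose(str.strip, drop_comment_count)(raw_tags)
tags_out = join(",")(tags_in)
title = take_first(["", "Hello", "World"])

print(tags_in, tags_out, title)
```

The input side works value by value (map_compose), while the output side reduces the collected list to the final field value (take_first or join). In recent Scrapy releases the real processors are imported from itemloaders.processors.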
(2) spider.py
The relevant part of spider.py:

    import scrapy

    from Article.items import ArticleItem, ArticleItemLoader
    from Article.utils.common import get_md5


    class JobboleSpider(scrapy.Spider):
        """Only the added method is shown; the unchanged parts are omitted."""

        def parse_detail(self, response):
            front_image_url = response.meta.get("front_image_url", "")  # cover image

            # The loader collects raw values; cleaning happens in items.py
            item_loader = ArticleItemLoader(item=ArticleItem(), response=response)
            item_loader.add_css("title", ".entry-header h1::text")
            item_loader.add_value("url", response.url)
            item_loader.add_value("url_object_id", get_md5(response.url))
            item_loader.add_css("time", "p.entry-meta-hide-on-mobile::text")
            item_loader.add_value("front_image_url", [front_image_url])
            item_loader.add_css("coll_nums", ".vote-post-up h10::text")
            item_loader.add_css("comment_nums", "a[href='#article-comment'] span::text")
            item_loader.add_css("fav_nums", ".bookmark-btn::text")
            item_loader.add_css("tags", "p.entry-meta-hide-on-mobile a::text")
            item_loader.add_css("content", "div.entry")

            article_item = item_loader.load_item()
            yield article_item

(3) items.py
Define the processing functions and attach them to fields with the input_processor and output_processor arguments: input processors run on each extracted value as it is added to the loader, and the output processor runs on the collected values when the item is loaded.

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html

    import datetime
    import re

    import scrapy
    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import MapCompose, TakeFirst, Join


    def date_convert(value):
        # Strip whitespace and the "·" separator, then parse YYYY/MM/DD;
        # fall back to today's date if parsing fails
        try:
            value = value.strip().replace("·", "").strip()
            time = datetime.datetime.strptime(value, "%Y/%m/%d").date()
        except Exception:
            time = datetime.datetime.now().date()
        return time


    def get_nums(value):
        # Pull the first integer out of the text, or 0 if there is none
        match_re = re.match(r".*?(\d+).*", value)
        if match_re:
            nums = int(match_re.group(1))
        else:
            nums = 0
        return nums


    def return_value(value):
        return value


    def remove_comment_tags(value):
        # Remove the comment count that is extracted along with the tags
        if "評論" in value:
            return ""
        else:
            return value


    class ArticleItemLoader(ItemLoader):
        # Custom item loader: every field yields its first value by default
        default_output_processor = TakeFirst()


    class ArticleItem(scrapy.Item):
        # define the fields for your item here like:
        title = scrapy.Field()
        time = scrapy.Field(input_processor=MapCompose(date_convert))
        url = scrapy.Field()
        url_object_id = scrapy.Field()  # md5
        # Overrides TakeFirst so that the value stays a list
        front_image_url = scrapy.Field(output_processor=MapCompose(return_value))
        front_image_path = scrapy.Field()
        fav_nums = scrapy.Field(input_processor=MapCompose(get_nums))
        coll_nums = scrapy.Field(input_processor=MapCompose(get_nums))
        comment_nums = scrapy.Field(input_processor=MapCompose(get_nums))
        content = scrapy.Field()
        tags = scrapy.Field(input_processor=MapCompose(remove_comment_tags),
                            output_processor=Join(","))

        def get_insert_sql(self):
            insert_sql = """
                insert into article(title, time, url, url_object_id, front_image_url, front_image_path,
                                    coll_nums, comment_nums, fav_nums, content, tags)
                VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
                ON DUPLICATE KEY UPDATE fav_nums=VALUES(fav_nums)
            """
            # front_image_url is stored as a list; save only its first entry
            front_image_url = ""
            if self["front_image_url"]:
                front_image_url = self["front_image_url"][0]
            # front_image_path may not have been set, so fall back to ""
            params = (self["title"], self["time"], self["url"], self["url_object_id"],
                      front_image_url, self.get("front_image_path", ""),
                      self["coll_nums"], self["comment_nums"], self["fav_nums"],
                      self["content"], self["tags"])
            return insert_sql, params

2. The pipelines file
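(1) Installing the environment (MySQL, Navicat)
For the installation steps, see: Installing MySQL and Navicat on Ubuntu 18.04

    # mysqlclient is a MySQL driver
    pip install mysqlclient
    # the pipelines below import pymysql
    pip install pymysql

(Figure: definition of the article table.)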
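The exact table definition cannot be read off the text, but the insert statements in this post name the columns. The following is a minimal sketch of a matching table, using Python's built-in sqlite3 as a stand-in for MySQL. The column names come from the insert statements; the column types and the choice of url_object_id as primary key are assumptions, and the sample row is made up.

```python
import sqlite3

# Assumed schema: column names follow the insert statements in this post,
# the types and the primary key are guesses.
CREATE_SQL = """
CREATE TABLE article (
    url_object_id    TEXT PRIMARY KEY,
    title            TEXT NOT NULL,
    url              TEXT NOT NULL,
    time             TEXT,
    front_image_url  TEXT,
    front_image_path TEXT,
    coll_nums        INTEGER DEFAULT 0,
    comment_nums     INTEGER DEFAULT 0,
    fav_nums         INTEGER DEFAULT 0,
    content          TEXT,
    tags             TEXT
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(CREATE_SQL)

insert = "INSERT INTO article (url_object_id, title, url, fav_nums) VALUES (?, ?, ?, ?)"
row = ("abc123", "demo title", "http://example.com/1", 5)  # made-up row
conn.execute(insert, row)

# Inserting the same key again violates the primary key constraint
try:
    conn.execute(insert, row)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

count = conn.execute("SELECT COUNT(*) FROM article").fetchone()[0]
print(duplicate_rejected, count)
```

A unique key of this kind is what MySQL's ON DUPLICATE KEY UPDATE clause relies on: without a primary or unique key, a repeated crawl of the same URL would simply add a second row.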
(2) Saving to MySQL (synchronous)

    import pymysql
    import pymysql.cursors


    class MysqlPipeline(object):
        """Write items to MySQL synchronously."""

        def __init__(self):
            # self.conn = pymysql.connect('host', 'user', 'password', 'dbname', charset='utf8', use_unicode=True)
            self.conn = pymysql.connect(host='localhost', user='root', password='asdfjkl;',
                                        db='atricle', charset="utf8mb4", use_unicode=True)
            self.cursor = self.conn.cursor()

        def process_item(self, item, spider):
            # Converting the table on every item is wasteful; ideally this is done once
            sql1 = "alter table article convert to character set utf8mb4;"
            insert_sql = """
                insert into article(title, url, url_object_id, time, coll_nums, comment_nums, fav_nums, content)
                VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
            """
            self.cursor.execute(sql1)
            # Values are passed as query parameters, so the driver escapes them;
            # calling escape_string on them as well would escape them twice
            self.cursor.execute(insert_sql, (item["title"], item["url"], item["url_object_id"],
                                             item["time"], item["coll_nums"], item["comment_nums"],
                                             item["fav_nums"], item["content"]))
            self.conn.commit()
            return item

(3) Saving to MySQL (asynchronous)
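When the crawl is large, pages are fetched faster than rows can be written to the database, and synchronous writes hold up the crawl. Large crawls therefore usually store their data asynchronously.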
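The idea behind asynchronous storage can be illustrated with the standard library before turning to Twisted. In this sketch a thread pool takes over the slow writes so that the caller never waits for them, and failures are reported through a callback. The slow_insert function, the in-memory saved list, and the sample items are stand-ins invented for the example; no real database is involved.

```python
import time
from concurrent.futures import ThreadPoolExecutor

saved = []   # stands in for the database table
errors = []  # collected failure messages


def slow_insert(item):
    """Pretend to be a blocking database write."""
    time.sleep(0.05)
    if "title" not in item:
        raise ValueError("item has no title")
    saved.append(item["title"])


def handle_error(future):
    # Runs once the write has finished; records failures instead of raising
    exc = future.exception()
    if exc is not None:
        errors.append(str(exc))


items = [{"title": "a"}, {"title": "b"}, {}]  # made-up items, the last one is invalid

with ThreadPoolExecutor(max_workers=3) as pool:
    for item in items:
        future = pool.submit(slow_insert, item)  # returns immediately
        future.add_done_callback(handle_error)
# Leaving the with-block waits for all pending writes to finish

print(sorted(saved), errors)
```

Twisted's adbapi.ConnectionPool, used in the pipeline that follows, applies the same idea: runInteraction runs the given function in a pool thread and returns a Deferred, and addErrback plays the role of the error callback.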

    import pymysql
    import pymysql.cursors
    from twisted.enterprise import adbapi


    class MysqlTwistedPipeline(object):
        """Write items to MySQL asynchronously through Twisted's adbapi."""

        def __init__(self, dbpool):
            self.dbpool = dbpool

        @classmethod
        def from_settings(cls, settings):
            """Build the connection pool from the values in settings.py."""
            dbparams = dict(
                host=settings['MYSQL_HOST'],
                db=settings['MYSQL_DB'],
                user=settings['MYSQL_USER'],
                password=settings['MYSQL_PASSWORD'],
                charset="utf8mb4",
                cursorclass=pymysql.cursors.DictCursor,
                use_unicode=True,
            )
            dbpool = adbapi.ConnectionPool("pymysql", **dbparams)
            return cls(dbpool)

        def process_item(self, item, spider):
            # Let Twisted run the MySQL insert asynchronously
            query = self.dbpool.runInteraction(self.do_insert, item)
            query.addErrback(self.handle_error, item, spider)  # handle failures
            return item

        def handle_error(self, failure, item, spider):
            # Report errors raised by the asynchronous insert
            print(failure)

        def do_insert(self, cursor, item):
            # Perform the actual insert.
            # Converting the table on every item is wasteful; ideally this is done once
            sql1 = "alter table article convert to character set utf8mb4;"
            insert_sql = """
                insert into article(title, url, url_object_id, time, coll_nums, comment_nums, fav_nums, content)
                VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
            """
            cursor.execute(sql1)
            # Values are passed as query parameters, so no manual escaping is needed
            cursor.execute(insert_sql, (item["title"], item["url"], item["url_object_id"],
                                        item["time"], item["coll_nums"], item["comment_nums"],
                                        item["fav_nums"], item["content"]))

(Figure: the saved rows in the database.)