Page Element Parsing

Published: 2024/4/17

1. Parsing field information

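When a spider runs, Scrapy downloads each page listed in start_urls and passes the result to the parse method as its response argument. We locate elements with the response's css() selector, which returns a SelectorList of matching Selector objects, and then call extract() to pull the tag content or attribute values out as a list of strings. extract_first() returns only the first match, or None when nothing matches.
Now let's try parsing fields from the dribbble site. Create a dribbble spider the same way we created the csdn spider earlier, then, in the test script, change the argument passed to execute() so that it uses the name attribute of the spider we want to test.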

import scrapy
from urllib import parse
from scrapy.http import Request


class DribbbleSpider(scrapy.Spider):
    name = 'dribbble'
    allowed_domains = ['dribbble.com']
    start_urls = ['https://dribbble.com/stories']

    def parse(self, response):
        # Get the URL from each <a> tag
        # urls = response.css('h2 a::attr(href)').extract()
        a_nodes = response.css('header div.teaser a')
        for a_node in a_nodes:
            # print(a_node)
            a_url = a_node.css('::attr(href)').extract()[0]
            a_image_url = a_node.css('img::attr(src)').extract()[0]
            # Follow the link and carry the image URL along in meta
            yield Request(url=parse.urljoin(response.url, a_url),
                          callback=self.parse_analyse,
                          meta={'a_image_url': a_image_url})

    def parse_analyse(self, response):
        a_image_url = response.meta.get('a_image_url')
        title = response.css('.post header h1::text').extract()[0]
        date = response.css('span.date::text').extract_first()
        print('Image URL: {}'.format(a_image_url))
        print('Title: {}'.format(title))
        print('Date: {}'.format(date.strip()))
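The spider above wraps each href in parse.urljoin(response.url, a_url) before requesting it. As a small stand-alone sketch of why that matters, the snippet below resolves a few href values against a page URL. The href values and the resolve helper are made up for illustration and are not taken from the dribbble site.

```python
from urllib.parse import urljoin

# The page the links were found on (same URL as start_urls in the spider).
page_url = "https://dribbble.com/stories"


def resolve(href: str, base: str = page_url) -> str:
    """Resolve an href found on `base` into an absolute URL."""
    return urljoin(base, href)


# Hypothetical href values, chosen only to show the three common cases.
relative = resolve("/stories/2019/06/26/example-post")  # root-relative path
absolute = resolve("https://example.com/other")         # already absolute
sibling = resolve("page-2")                             # relative to the page

if __name__ == "__main__":
    for url in (relative, absolute, sibling):
        print(url)
```

A root-relative href is joined onto the site's scheme and host, an absolute href is returned unchanged, and a bare relative name replaces the last path segment of the page URL. Scrapy responses also provide response.urljoin(href), which could be used in place of the explicit urllib call.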

2. Building the data model

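When we created the project from the template, several files were generated automatically, and items.py is one of them. This is the file we use to build the data model. It contains an auto-generated model class that inherits from scrapy.Item, and we can declare whatever fields we need in that class: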

import scrapy


class XkdDribbbleSpiderItem(scrapy.Item):
    title = scrapy.Field()
    a_image_url = scrapy.Field()
    date = scrapy.Field()

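Once the fields are declared, the spider has to build the model. Make sure the keys used in the spider match the field names declared in the model exactly, because assigning to an undeclared field on a scrapy.Item raises a KeyError. To build the model in the spider, first instantiate the model class from items.py, then assign the scraped values to its fields, and finally yield the instance so that Scrapy hands it to pipelines.py, where the data can be persisted: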

import scrapy
from urllib import parse
from scrapy.http import Request
from ..items import XkdDribbbleSpiderItem
from datetime import datetime


class DribbbleSpider(scrapy.Spider):
    name = 'dribbble'
    allowed_domains = ['dribbble.com']
    start_urls = ['https://dribbble.com/stories']

    def parse(self, response):
        # Get the URL from each <a> tag
        # urls = response.css('h2 a::attr(href)').extract()
        a_nodes = response.css('header div.teaser a')
        for a_node in a_nodes:
            # print(a_node)
            a_url = a_node.css('::attr(href)').extract()[0]
            a_image_url = a_node.css('img::attr(src)').extract()[0]
            # Follow the link and carry the image URL along in meta
            yield Request(url=parse.urljoin(response.url, a_url),
                          callback=self.parse_analyse,
                          meta={'a_image_url': a_image_url})

    def parse_analyse(self, response):
        a_image_url = response.meta.get('a_image_url')
        title = response.css('.post header h1::text').extract()[0]
        date = response.css('span.date::text').extract_first()
        date = date.strip()
        # Convert text such as 'Jun 26, 2019' into a datetime.date
        date = datetime.strptime(date, '%b %d, %Y').date()
        # Build the item
        dri_item = XkdDribbbleSpiderItem()
        dri_item['a_image_url'] = a_image_url
        dri_item['title'] = title
        dri_item['date'] = date
        yield dri_item
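The article mentions persisting the item in pipelines.py but does not show that step. The following is only a minimal sketch of what such a pipeline might look like, not the author's implementation. The class name JsonLinesPipeline, the output file stories.jl, and the helper parse_story_date are all hypothetical. The open_spider, close_spider, and process_item methods are the hooks Scrapy calls on a pipeline. Because the spider stores the date as a datetime.date, which the json module cannot serialize directly, the sketch converts it to an ISO 8601 string before writing.

```python
import json
from datetime import date, datetime


def parse_story_date(raw: str) -> date:
    """Parse text such as 'Jun 26, 2019' with the spider's format string.

    %b matches month abbreviations of the current locale, so this
    assumes an English (or default C) locale.
    """
    return datetime.strptime(raw.strip(), "%b %d, %Y").date()


class JsonLinesPipeline:
    """Hypothetical pipeline: appends each item to a JSON Lines file."""

    def __init__(self, path: str = "stories.jl"):
        self.path = path  # assumed output file name
        self.file = None

    def open_spider(self, spider):
        # Called by Scrapy once when the spider starts.
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called by Scrapy once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        record = dict(item)  # copy, so the original item is left untouched
        # json cannot serialize datetime.date, so store an ISO 8601 string.
        if isinstance(record.get("date"), date):
            record["date"] = record["date"].isoformat()
        self.file.write(json.dumps(record, ensure_ascii=False) + "\n")
        return item  # hand the item on to any later pipeline
```

To have Scrapy call a pipeline like this, it would need to be listed in the ITEM_PIPELINES setting in settings.py, keyed by its full import path within the project (the project module name is not given in the article) and mapped to an integer priority such as 300.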

Reposted from: https://www.cnblogs.com/qwangxiao/p/11088239.html
