Python: Scraping JD.com (京东商城) with Scrapy
1. Create the project
scrapy startproject jd
Result:
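The original screenshot is replaced here by the expected layout: scrapy startproject jd should generate the standard Scrapy project skeleton, roughly as follows (exact files vary slightly with the Scrapy version).

jd/
    scrapy.cfg          # deploy/configuration file
    jd/                 # the project's Python package
        __init__.py
        items.py        # item definitions (step 3)
        middlewares.py
        pipelines.py    # item pipelines (step 5)
        settings.py
        spiders/        # spider modules (steps 2 and 4)
            __init__.py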
2. Generate a spider
scrapy genspider jd_category jd.com
Result:
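In place of the screenshot, the generated spiders/jd_category.py should start from the standard genspider template, roughly like the sketch below (the start URL scheme and quoting depend on the Scrapy version). Step 4 replaces this template with the real spider, where the class is renamed JdSpider and start_requests is customized.

import scrapy


class JdCategorySpider(scrapy.Spider):
    name = 'jd_category'
    allowed_domains = ['jd.com']
    start_urls = ['https://jd.com/']

    def parse(self, response):
        pass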
3. Define the fields to extract in items.py
import scrapy


class JdItem(scrapy.Item):
    """Product information."""
    title = scrapy.Field()
    price = scrapy.Field()
    sku_id = scrapy.Field()
    url = scrapy.Field()
    info = scrapy.Field()


class CommentItem(scrapy.Item):
    """Product comment."""
    content = scrapy.Field()
    comment_time = scrapy.Field()
    reply_count = scrapy.Field()
    score = scrapy.Field()
    vote_count = scrapy.Field()
    image_count = scrapy.Field()
4. The spider in jd_category.py crawls the product listing pages and, for each product, the comment data. In the listing parse the page parameter advances by 2 and s by 60 per request; each JD listing page renders only its first 30 products directly in the HTML (the rest load via AJAX), and those first 30 are what parse extracts.
import html
import json
import re

import scrapy

from ..items import JdItem, CommentItem


class JdSpider(scrapy.Spider):
    # The spider name must match the name passed to `scrapy crawl` in step 6.
    name = 'jd_category'
    allowed_domains = ['jd.com']
    page = 1
    s = 1
    url = 'https://list.jd.com/list.html?cat=9987%2C653%2C655&page=1&s=1&click=0'
    next_url = 'https://list.jd.com/list.html?cat=9987%2C653%2C655&page={}&s={}&click=0'

    def start_requests(self):
        yield scrapy.Request(self.url)

    def parse(self, response):
        """Crawl the first thirty products of each listing page; their data is
        rendered directly in the page HTML.
        :param response:
        :return:
        """
        for li in response.xpath('//*[@id="J_goodsList"]/ul/li'):
            item = JdItem()
            title = li.xpath('div/div/a/em/text()').extract_first("")
            price = li.xpath('div/div/strong/i/text()').extract_first("")
            sku_id = li.xpath('./@data-sku').extract_first("")
            url = li.xpath('./div/div[@class="p-img"]/a/@href').extract_first("")
            item['title'] = title
            item['price'] = price
            item['url'] = url
            item['sku_id'] = sku_id
            if not item['url'].startswith("https:"):
                item['info'] = None
                item['url'] = "https:" + item['url']
                yield scrapy.Request(item['url'], callback=self.info_parse, meta={"item": item})
        if self.page <= 10:
            self.page += 2
            self.s += 60
            yield scrapy.Request(url=self.next_url.format(self.page, self.s), callback=self.parse)

    def info_parse(self, response):
        """Product detail page.
        :param response:
        :return:
        """
        item = response.meta['item']
        comment_url = 'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId={}' \
                      '&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1'
        yield scrapy.Request(comment_url.format(item.get('sku_id')), callback=self.comment_parse, meta={"item": item})

    def comment_parse(self, response):
        """Crawl the comments.
        :param response:
        :return:
        """
        text = response.text
        comment_list = re.findall(
            r'guid":".*?"content":"(.*?)".*?"creationTime":"(.*?)",".*?"replyCount":(\d+),"score":(\d+).*?usefulVoteCount":(\d+).*?imageCount":(\d+).*?images":',
            text)
        info = []
        for result in comment_list:
            comment_item = CommentItem()
            comment_item['content'] = result[0]
            comment_item['comment_time'] = result[1]
            comment_item['reply_count'] = result[2]
            comment_item['score'] = result[3]
            comment_item['vote_count'] = result[4]
            comment_item['image_count'] = result[5]
            info.append(comment_item)
        item = response.meta['item']
        item['info'] = info
        yield item
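The comment URL requests a JSONP response (the callback=fetchJSON_comment98 parameter wraps a JSON payload), so the long regex above can be replaced with json.loads once the wrapper is stripped. Below is a minimal sketch of a drop-in replacement for the comment_parse method; the "comments" list and its field names are inferred from the regex in the original code, not from any documented JD API.

    def comment_parse(self, response):
        """Parse the JSONP comment response with json.loads instead of a regex."""
        item = response.meta['item']
        # Strip the JSONP wrapper: fetchJSON_comment98({...});
        text = response.text
        payload = text[text.find('(') + 1:text.rfind(')')]
        data = json.loads(payload)
        info = []
        for comment in data.get('comments', []):
            comment_item = CommentItem()
            comment_item['content'] = comment.get('content')
            comment_item['comment_time'] = comment.get('creationTime')
            comment_item['reply_count'] = comment.get('replyCount')
            comment_item['score'] = comment.get('score')
            comment_item['vote_count'] = comment.get('usefulVoteCount')
            comment_item['image_count'] = comment.get('imageCount')
            info.append(comment_item)
        item['info'] = info
        yield item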
5. The pipeline in pipelines.py only does a simple print of each item.
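A minimal sketch of that pipeline, assuming the default JdPipeline class name that startproject generates; it has to be enabled via ITEM_PIPELINES in settings.py.

# jd/pipelines.py
class JdPipeline:
    """Print each item as it is scraped and pass it through unchanged."""

    def process_item(self, item, spider):
        print(item)
        return item


# jd/settings.py (enable the pipeline; 300 is an arbitrary priority)
# ITEM_PIPELINES = {
#     'jd.pipelines.JdPipeline': 300,
# }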
6. Run: python -m scrapy crawl jd_category
Result: with the print pipeline enabled, each crawled JdItem is printed to the console, its info field holding the list of CommentItem entries scraped from the comment endpoint.