Python: Scraping JD.com (京东商城) with Scrapy
1. Create the project
scrapy startproject jd
Result:
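The original screenshot is replaced here by the expected layout: scrapy startproject jd should generate the standard Scrapy project skeleton, roughly as follows (exact files vary slightly with the Scrapy version).

jd/
    scrapy.cfg          # deploy/configuration file
    jd/                 # the project's Python package
        __init__.py
        items.py        # item definitions (step 3)
        middlewares.py
        pipelines.py    # item pipelines (step 5)
        settings.py
        spiders/        # spider modules (steps 2 and 4)
            __init__.py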
2. Generate a spider
scrapy genspider jd_category jd.com
Result:
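In place of the screenshot, the generated spiders/jd_category.py should start from the standard genspider template, roughly like the sketch below (the start URL scheme and quoting depend on the Scrapy version). Step 4 replaces this template with the real spider, where the class is renamed JdSpider and start_requests is customized.

import scrapy


class JdCategorySpider(scrapy.Spider):
    name = 'jd_category'
    allowed_domains = ['jd.com']
    start_urls = ['https://jd.com/']

    def parse(self, response):
        pass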
3. Define the fields to extract in items.py
import scrapy


class JdItem(scrapy.Item):
    """Product information."""
    title = scrapy.Field()
    price = scrapy.Field()
    sku_id = scrapy.Field()
    url = scrapy.Field()
    info = scrapy.Field()


class CommentItem(scrapy.Item):
    """Product comment."""
    content = scrapy.Field()
    comment_time = scrapy.Field()
    reply_count = scrapy.Field()
    score = scrapy.Field()
    vote_count = scrapy.Field()
    image_count = scrapy.Field()
4. The spider in jd_category.py crawls the product listing pages and, for each product, the comment data. In the listing parse the page parameter advances by 2 and s by 60 per request; each JD listing page renders only its first 30 products directly in the HTML (the rest load via AJAX), and those first 30 are what parse extracts.
import html
import json
import re

import scrapy

from ..items import JdItem, CommentItem


class JdSpider(scrapy.Spider):
    # The spider name must match the name passed to `scrapy crawl` in step 6.
    name = 'jd_category'
    allowed_domains = ['jd.com']
    page = 1
    s = 1
    url = 'https://list.jd.com/list.html?cat=9987%2C653%2C655&page=1&s=1&click=0'
    next_url = 'https://list.jd.com/list.html?cat=9987%2C653%2C655&page={}&s={}&click=0'

    def start_requests(self):
        yield scrapy.Request(self.url)

    def parse(self, response):
        """Crawl the first thirty products of each listing page; their data is
        rendered directly in the page HTML.
        :param response:
        :return:
        """
        for li in response.xpath('//*[@id="J_goodsList"]/ul/li'):
            item = JdItem()
            title = li.xpath('div/div/a/em/text()').extract_first("")
            price = li.xpath('div/div/strong/i/text()').extract_first("")
            sku_id = li.xpath('./@data-sku').extract_first("")
            url = li.xpath('./div/div[@class="p-img"]/a/@href').extract_first("")
            item['title'] = title
            item['price'] = price
            item['url'] = url
            item['sku_id'] = sku_id
            if not item['url'].startswith("https:"):
                item['info'] = None
                item['url'] = "https:" + item['url']
                yield scrapy.Request(item['url'], callback=self.info_parse, meta={"item": item})
        if self.page <= 10:
            self.page += 2
            self.s += 60
            yield scrapy.Request(url=self.next_url.format(self.page, self.s), callback=self.parse)

    def info_parse(self, response):
        """Product detail page.
        :param response:
        :return:
        """
        item = response.meta['item']
        comment_url = 'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId={}' \
                      '&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1'
        yield scrapy.Request(comment_url.format(item.get('sku_id')), callback=self.comment_parse, meta={"item": item})

    def comment_parse(self, response):
        """Crawl the comments.
        :param response:
        :return:
        """
        text = response.text
        comment_list = re.findall(
            r'guid":".*?"content":"(.*?)".*?"creationTime":"(.*?)",".*?"replyCount":(\d+),"score":(\d+).*?usefulVoteCount":(\d+).*?imageCount":(\d+).*?images":',
            text)
        info = []
        for result in comment_list:
            comment_item = CommentItem()
            comment_item['content'] = result[0]
            comment_item['comment_time'] = result[1]
            comment_item['reply_count'] = result[2]
            comment_item['score'] = result[3]
            comment_item['vote_count'] = result[4]
            comment_item['image_count'] = result[5]
            info.append(comment_item)
        item = response.meta['item']
        item['info'] = info
        yield item
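The comment URL requests a JSONP response (the callback=fetchJSON_comment98 parameter wraps a JSON payload), so the long regex above can be replaced with json.loads once the wrapper is stripped. Below is a minimal sketch of a drop-in replacement for the comment_parse method; the "comments" list and its field names are inferred from the regex in the original code, not from any documented JD API.

    def comment_parse(self, response):
        """Parse the JSONP comment response with json.loads instead of a regex."""
        item = response.meta['item']
        # Strip the JSONP wrapper: fetchJSON_comment98({...});
        text = response.text
        payload = text[text.find('(') + 1:text.rfind(')')]
        data = json.loads(payload)
        info = []
        for comment in data.get('comments', []):
            comment_item = CommentItem()
            comment_item['content'] = comment.get('content')
            comment_item['comment_time'] = comment.get('creationTime')
            comment_item['reply_count'] = comment.get('replyCount')
            comment_item['score'] = comment.get('score')
            comment_item['vote_count'] = comment.get('usefulVoteCount')
            comment_item['image_count'] = comment.get('imageCount')
            info.append(comment_item)
        item['info'] = info
        yield item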
5. The pipeline in pipelines.py only does a simple print of each item.
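A minimal sketch of that pipeline, assuming the default JdPipeline class name that startproject generates; it has to be enabled via ITEM_PIPELINES in settings.py.

# jd/pipelines.py
class JdPipeline:
    """Print each item as it is scraped and pass it through unchanged."""

    def process_item(self, item, spider):
        print(item)
        return item


# jd/settings.py (enable the pipeline; 300 is an arbitrary priority)
# ITEM_PIPELINES = {
#     'jd.pipelines.JdPipeline': 300,
# }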
6. Run: python -m scrapy crawl jd_category
Result: with the print pipeline enabled, each crawled JdItem is printed to the console, its info field holding the list of CommentItem entries scraped from the comment endpoint.