Scraping high-resolution images from 千图网 with Python

Published: 2023/12/31 python 36 豆豆


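I. How to build a Scrapy image crawler
1. Analyze the website.
2. Choose the crawling method and strategy.
3. Create the crawler project, then define items.py.
4. Write the spider file.
5. Write the pipelines and settings.
6. Debug.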

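II. Challenges with 千图网 (http://www.58pic.com/)

1. Images have to be crawled from the whole site, not just one page.
2. The high-resolution images are wanted, not the thumbnails. This only requires working out the address of the high-resolution version.
3. The crawler needs countermeasures against the site's anti-crawler checks, such as imitating a browser and not recording cookies. For cookies it is enough to uncomment the corresponding line in settings.py: COOKIES_ENABLED = False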
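
As a rough sketch of point 3, the relevant part of settings.py might look like the fragment below. Only COOKIES_ENABLED = False comes from the article. The user-agent string and the package name qiantuwang in ITEM_PIPELINES are assumptions for illustration (the package name is guessed from the class names used later).

```python
# settings.py (fragment): a sketch, not the author's actual file.

# From the article: do not store or send cookies, so requests are harder to track.
COOKIES_ENABLED = False

# Assumption: any realistic browser user-agent string; this value is illustrative.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

# Assumption: the project package is named "qiantuwang".
# The number (0-1000) sets the order in which pipelines run; lower runs first.
ITEM_PIPELINES = {
    "qiantuwang.pipelines.QiantuwangPipeline": 300,
}
```

Without an ITEM_PIPELINES entry Scrapy never calls the pipeline, so no images would be saved even if the spider itself works.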

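III. Miscellaneous notes

1. from scrapy.http import Request imports the Request class. It is used together with a callback function: Request(url=…, callback=…) schedules the URL, and the callback is called with the response.
2. In XPath, // selects all matching nodes, no matter how deeply they are nested.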
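
To illustrate the second note, here is a small sketch of what // does. It uses the standard library's xml.etree.ElementTree (which supports a subset of XPath) instead of Scrapy's selectors, so it runs without Scrapy installed. The markup and the example.com URLs are made up for the example and are not taken from 58pic.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page structure is not shown in the article.
markup = """
<div>
  <ul>
    <li><a href="/a"><img src="http://example.com/1.jpg"/></a></li>
    <li><a href="/b"><img src="http://example.com/2.jpg"/></a></li>
  </ul>
  <p><img src="http://example.com/3.jpg"/></p>
</div>
""".strip()

root = ET.fromstring(markup)

# ".//img" matches every <img> below the root, however deeply nested.
all_images = [img.get("src") for img in root.findall(".//img")]

# "./p/img" follows one explicit path, so it only finds the <img> inside <p>.
only_in_p = [img.get("src") for img in root.findall("./p/img")]

print(all_images)
print(only_in_p)
```

In a Scrapy callback the equivalent query would be written against the response, for example response.xpath("//img/@src").extract(), where the leading // plays the same role.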

Code:

items.py

import scrapy


class QiantuwangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()    # address of the high-resolution image
    title = scrapy.Field()  # image title, used to build the file name

middlewares.py

from scrapy import signals


class QiantuwangSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

pipelines.py

import random
import urllib.request


class QiantuwangPipeline(object):
    def process_item(self, item, spider):
        try:
            # Python 3 strings are already Unicode, so the title is used as-is.
            title = item['title'][0]
            # A random suffix keeps images with the same title from overwriting each other.
            file = "E:/tupian/" + str(title) + str(int(random.random() * 10000)) + ".jpg"
            urllib.request.urlretrieve(item['url'][0], filename=file)
        except Exception as e:
            # Report the failed download and carry on with the next item.
            print(e)
        return item
