

Scrapy in Practice (Part 1): Scraping Auction Data from 雅昌艺术网 (auction.artron.net)


Step 1: Create the Scrapy project:

  scrapy startproject Demo

Step 2: Generate a spider:

  

scrapy genspider demo http://auction.artron.net/result/pmh-0-0-2-0-1/

Step 3: Project structure:
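The screenshot of the project tree from the original post is not reproduced here. For reference, the layout produced by the two commands above looks roughly like this (whether middlewares.py is generated depends on the Scrapy version):

Demo/
    scrapy.cfg              # deploy configuration
    Demo/
        __init__.py
        items.py            # item definition (file 2 below)
        middlewares.py
        pipelines.py        # MySQL pipeline (file 3 below)
        settings.py         # project settings (file 4 below)
        spiders/
            __init__.py
            demo.py         # the spider (file 1 below)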

  

Step 4: The code for each file, in order:

  1. demo.py (the spider)

      

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from Demo.items import *
from bs4 import BeautifulSoup
import time
# import sys
# reload(sys)
# sys.setdefaultencoding('utf-8')
import re
import hashlib


# MD5 of the detail-page URL, used as the deduplication key
def md5(src):
    m = hashlib.md5()
    m.update(src)
    return m.hexdigest()


# Strip HTML comments, line breaks and extra whitespace from a tag's markup
def replace(newline):
    newline = str(newline)
    newline = newline.replace('\r', '').replace('\n', '').replace('\t', '').replace(' ', '').replace('amp;', '')
    re_comment = re.compile('<!--[^>]*-->')
    newlines = re_comment.sub('', newline)
    newlines = newlines.replace('<!--', '').replace('-->', '')
    return newlines


class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['auction.artron.net']
    start_urls = ['http://auction.artron.net/result/pmh-0-0-2-0-1/',
                  'http://auction.artron.net/result/pmh-0-0-2-0-2/',
                  'http://auction.artron.net/result/pmh-0-0-2-0-4/',
                  'http://auction.artron.net/result/pmh-0-0-2-0-5/',
                  'http://auction.artron.net/result/pmh-0-0-2-0-6/',
                  'http://auction.artron.net/result/pmh-0-0-2-0-7/',
                  'http://auction.artron.net/result/pmh-0-0-2-0-8/',
                  'http://auction.artron.net/result/pmh-0-0-2-0-9/',
                  'http://auction.artron.net/result/pmh-0-0-2-0-10/',
                  'http://auction.artron.net/result/pmh-0-0-2-0-3/']

    def parse(self, response):
        html = response.text
        soup = BeautifulSoup(html, 'html.parser')
        # The auction sessions are listed in the first <ul class="dataList">
        result_lists = soup.find_all('ul', attrs={"class": "dataList"})[0]
        result_lists_replace = replace(result_lists)
        result_lists_replace = result_lists_replace.decode('utf-8')
        result_list = re.findall('<ul><li class="name">(.*?)</span></li></ul></li>', result_lists_replace)

        for ii in result_list:
            item = DemoItem()
            auction_name_url = re.findall('<a alt="(.*?)" href="(.*?)" target="_blank" title', ii)[0]
            auction_name = auction_name_url[0]
            auction_url = "http://auction.artron.net" + auction_name_url[1]
            aucr_name_spider = re.findall('<li class="company"><a href=".*?" target="_blank">(.*?)</a>', ii)[0]
            session_address_time = re.findall('<li class="city">(.*?)</li><li class="time">(.*?)</li></ul>', ii)[0]
            session_address = session_address_time[0]
            item_auct_time = session_address_time[1]
            hashcode = md5(str(auction_url))
            create_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(time.time()))

            item['auction_name'] = auction_name
            item['auction_url'] = auction_url
            item['aucr_name_spider'] = aucr_name_spider
            item['session_address'] = session_address
            item['item_auct_time'] = item_auct_time
            item['hashcode'] = hashcode
            item['create_time'] = create_time
            print(item)
            yield item
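A note on the environment: as written, this project assumes Python 2.7, since the spider calls .decode('utf-8') on a byte string and the pipeline below depends on MySQLdb (MySQL-python). Running it under Python 3 would at least require dropping the decode() call and switching the database driver (for example to mysqlclient or pymysql); that adaptation is not covered here.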


  2. items.py

   

# -*- coding: utf-8 -*-
import scrapy


class DemoItem(scrapy.Item):
    auction_name = scrapy.Field()      # auction session title
    auction_url = scrapy.Field()       # detail-page URL on auction.artron.net
    aucr_name_spider = scrapy.Field()  # auction house / company name
    session_address = scrapy.Field()   # city where the session is held
    item_auct_time = scrapy.Field()    # session date shown in the listing
    hashcode = scrapy.Field()          # md5 of auction_url, deduplication key
    create_time = scrapy.Field()       # timestamp of the crawl
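Because scrapy.Item implements the mapping interface, a DemoItem can be treated like a dict; this is what lets the pipeline below pass the item straight to insert_data and call len(), keys() and values() on it. A quick illustration (the values are made up):

item = DemoItem()
item['auction_name'] = u'Example auction session'            # hypothetical value
item['auction_url'] = 'http://auction.artron.net/example'    # hypothetical value
print(item.keys())    # field names, later used as column names
print(item.values())  # values, later bound to the INSERT placeholders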


  3. pipelines.py

    

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import MySQLdb


def insert_data(dbName, data_dict):
    try:
        # Build one "%s" placeholder per field: "(%s,%s,...,%s)"
        data_values = "(" + "%s," * (len(data_dict)) + ")"
        data_values = data_values.replace(',)', ')')
        # Column list taken from the item's field names: "(auction_name, auction_url, ...)"
        dbField = data_dict.keys()
        dataTuple = tuple(data_dict.values())
        dbField = str(tuple(dbField)).replace("'", '')
        conn = MySQLdb.connect(host="10.10.10.77", user="xuchunlin", passwd="ed35sdef456",
                               db="epai_spider_2018", charset="utf8")
        cursor = conn.cursor()
        sql = """ insert into %s %s values %s """ % (dbName, dbField, data_values)
        params = dataTuple
        cursor.execute(sql, params)
        conn.commit()
        cursor.close()
        conn.close()
        print("===== insert succeeded =====")
        return 1
    except Exception as e:
        print("******** insert failed ********")
        print(e)
        return 0


class DemoPipeline(object):
    def process_item(self, item, spider):
        dbName = "yachang_auction"
        data_dict = item
        insert_data(dbName, data_dict)
        return item
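For a fully populated DemoItem, the statement that insert_data builds looks roughly like this (the actual field order follows the item's dict ordering, so it can vary):

insert into yachang_auction (auction_name, auction_url, aucr_name_spider, session_address, item_auct_time, hashcode, create_time) values (%s,%s,%s,%s,%s,%s,%s)

Only the table name and the column list are interpolated into the string; the row values are passed separately as params, so MySQLdb handles their escaping.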


  4. settings.py


  

# -*- coding: utf-8 -*-

# Scrapy settings for Demo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Demo'

SPIDER_MODULES = ['Demo.spiders']
NEWSPIDER_MODULE = 'Demo.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Demo (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "Host": "auction.artron.net",
    # "Connection": "keep-alive",
    # "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.91 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Referer": "http://auction.artron.net/result/pmh-0-0-2-0-2/",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Cookie": "td_cookie=2322469817; gr_user_id=84f865e6-466f-4386-acfb-e524e8452c87; gr_session_id_276fdc71b3c353173f111df9361be1bb=ee1eb94e-b7a9-4521-8409-439ec1958b6c; gr_session_id_276fdc71b3c353173f111df9361be1bb_ee1eb94e-b7a9-4521-8409-439ec1958b6c=true; _at_pt_0_=2351147; _at_pt_1_=A%E8%AE%B8%E6%98%A5%E6%9E%97; _at_pt_2_=e642b85a3cf8319a81f48ef8cc403d3b; Hm_lvt_851619594aa1d1fb8c108cde832cc127=1533086287,1533100514,1533280555,1534225608; Hm_lpvt_851619594aa1d1fb8c108cde832cc127=1534298942",
}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Demo.middlewares.DemoSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Demo.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Demo.pipelines.DemoPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


  5. Database table for the crawled data:

  

CREATE TABLE `yachang_auction` (
  `key_id` int(255) NOT NULL AUTO_INCREMENT,
  `auction_name` varchar(255) DEFAULT NULL,
  `auction_url` varchar(255) DEFAULT NULL,
  `aucr_name_spider` varchar(255) DEFAULT NULL,
  `session_address` varchar(255) DEFAULT NULL,
  `item_auct_time` varchar(255) DEFAULT NULL,
  `hashcode` varchar(255) DEFAULT NULL,
  `create_time` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`key_id`),
  UNIQUE KEY `hashcode` (`hashcode`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=230 DEFAULT CHARSET=utf8;
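The UNIQUE key on hashcode is what makes re-crawling safe: a row whose auction_url has already been stored fails the INSERT with a duplicate-key error, which insert_data catches and reports as "insert failed" while returning 0. Repeated runs of the spider therefore do not create duplicate rows, at the cost of one logged failure per already-seen session.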


  6. Data preview
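The data shown in the original post's screenshot (not reproduced here) was produced by running the spider from the project root:

  scrapy crawl demo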

  


Reposted from: https://www.cnblogs.com/xuchunlin/p/7253951.html

