當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

爬虫下载页面

發布時間：2025/3/11 编程问答 36 豆豆

生活随笔收集整理的這篇文章主要介紹了爬虫下载页面小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

簡介

爬蟲下載頁面

代碼

簡易下載

#!/usr/bin/env python #coding=utf-8 import urllib2def download(url):print('Download:',url)try:html = urllib2.urlopen(url).read()except urllib2.URLError as e:print('Download error:', e.reason)html = Nonereturn htmlif __name__ == '__main__':download('http://www.baidu.com')

似乎并沒有把百度的html 下載下來

多次嘗試下載 5XX服務器錯誤并設置代理

很多網站都不喜歡被爬蟲程序訪問，但又沒有辦法完全禁止，于是就設置了一些反爬策略。比如User Agent，中文名為用戶代理，簡稱UA。User Agent存放于Headers中，服務器就是通過查看Headers中的User Agent來判斷是誰在訪問。通過不同的瀏覽器訪問，會有不同的User Agent，如果爬蟲不設置的話，很容易被識別出來，就會被限制訪問。一般的做法是收集很多不同的User Agent，然后隨機使用。 def download(url, user_agent='wswp', num_retries=2):print 'Downloading:', urlheaders = {'User-agent':user_agent}request = urllib2.Request(url, headers=headers)try:html = urllib2.urlopen(request).read()except urllib2.URLError as e:print('Download error:', e.reason)html = Noneif num_retries > 0:if hasattr(e, 'code') and 500 <= e.code < 600:#retyr 5XX HTTP errorsreturn download(url, user_agent, num_retries-1)return html

使用網站地圖下載相關的頁面

def crawl_sitemap(url):# download the sitemap filesitemap = download(url)# extract the sitemap linkslinks = re.findall('<loc>(.*?)</loc>', sitemap)# download each linkfor link in links:html = download(link)print link

網站可能會把前面的字符串忽略然后可以只用后面的數字

def crawl_string():for page in itertools.count(1):url = 'http://example.webscraping.com/view/-%d' % pagehtml = download(url)if ( html is None):breakelse:pass

網站通過一個頁面的鏈接下載

def get_links(html):webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']',re.IGNORECASE)return webpage_regex.findall(html)def link_crawler(seed_url, link_regex):crawl_queue = [seed_url]# keep track which URL's have seen beforeseen = set(crawl_queue)while crawl_queue:url = crawl_queue.pop()html = download(url)print "getlinks", get_links(html)for link in get_links(html):if re.match(link_regex, link):link = urlparse.urljoin(seed_url, link)if link not in seen:seen.add(link)crawl_queue.append(link)if __name__ == '__main__':link_crawler('http://example.webscraping.com', '/places/default/(index|view)')

支持對 robots.txt 的解析

def link_crawler(seed_url, link_regex):rp = robotparser.RobotFileParser()rp.set_url(seed_url+'/robots.txt')rp.read()crawl_queue = [seed_url]# keep track which URL's have seen beforeseen = set(crawl_queue)while crawl_queue:url = crawl_queue.pop()user_agent = 'wswp'if rp.can_fetch(user_agent, url):html = download(url)print "getlinks", get_links(html)for link in get_links(html):if re.match(link_regex, link):link = urlparse.urljoin(seed_url, link)if link not in seen:seen.add(link)crawl_queue.append(link)else:print 'Blocked by robots.txt:', url

代理

def link_crawler(seed_url, link_regex, proxy=False):if proxy: # 暫時無法代理proxy_info={'host':'106.12.38.133','port':22}# We create a handler for the proxyproxy_support = urllib2.ProxyHandler({"http" : "http://%(host)s:%(port)d" % proxy_info})# We create an opener which uses this handler:opener = urllib2.build_opener(proxy_support)# Then we install this opener as the default opener for urllib2:urllib2.install_opener(opener)#如果代理需要驗證proxy_info = { 'host' : '106.12.38.133','port' : 20,'user' : 'root','pass' : 'Woaini7758258!'}proxy_support = urllib2.ProxyHandler({"http" : "http://%(user)s:%(pass)s@%(host)s:%(port)d" % proxy_info})opener = urllib2.build_opener(proxy_support)urllib2.install_opener(opener)#htmlpage = urllib2.urlopen("http://sebsauvage.net/").read(200000)rp = robotparser.RobotFileParser()rp.set_url(seed_url+'/robots.txt')rp.read()crawl_queue = [seed_url]# keep track which URL's have seen beforeseen = set(crawl_queue)while crawl_queue:url = crawl_queue.pop()user_agent = 'wswp'if rp.can_fetch(user_agent, url):html = download(url)print "getlinks", get_links(html)for link in get_links(html):if re.match(link_regex, link):link = urlparse.urljoin(seed_url, link)if link not in seen:seen.add(link)crawl_queue.append(link)else:print 'Blocked by robots.txt:', url

下載限速

class Throttle:"""下載延遲下載之前調用"""def __init__(self, delay):self.delay = delayself.domains()def wait(self, url):domain = urlparse.urlparse(url).netloclast_accessed = self.domains.get(domain)if self.delay > 0 and last_accessed is not None:sleep_secs = self.delay - (datetime.datetime.now() - last_accessed).secondsif sleep_secs > 0:time.sleep(sleep_secs)self.domains[domain] = datetime.datetime.now()

參考鏈接

https://tieba.baidu.com/p/5832236970

轉載于:https://www.cnblogs.com/eat-too-much/p/11560067.html

創作挑戰賽新人創作獎勵來咯，堅持創作打卡瓜分現金大獎

總結

以上是生活随笔為你收集整理的爬虫下载页面的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇： OCP-052考试题库汇总（60）-CU
下一篇： JS判断数字类型