

Python: scraping the Xici (xicidaili) IP proxy pool

Published: 2024/3/24, by 豆豆
This article, collected and organized by 生活随笔, walks through scraping the Xici IP proxy pool with Python; it is shared here as a reference.

1. Setting an IP proxy in requests

The most direct approach is to pass a proxies dict to get:

```python
proxies = {
    'http': 'http://183.148.153.147:9999',
    'https': 'http://183.148.153.147:9999',
}
requests.get(url=url, headers=headers, proxies=proxies)
```
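requests chooses which entry to use by matching the URL's scheme against the keys of the proxies dict (it also accepts more specific `scheme://host` keys). A simplified, standard-library-only sketch of that lookup, with the proxy address above as a placeholder:

```python
from urllib.parse import urlparse

# Placeholder proxy address; substitute a live one from your pool
proxies = {
    'http': 'http://183.148.153.147:9999',
    'https': 'http://183.148.153.147:9999',
}

def proxy_for(url, proxies):
    # Simplified version of requests' proxy selection: match by URL scheme
    return proxies.get(urlparse(url).scheme)

print(proxy_for('https://www.baidu.com', proxies))  # http://183.148.153.147:9999
```

A scheme with no matching key simply gets no proxy, which is why the dict above lists both 'http' and 'https'.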

When a site bans your IP, you need a large supply of replacement IPs to rotate through, which brings us to the next step: scraping the free IPs that Xici provides.

2. Scraping Xici's free IP proxies and storing them in MySQL

The field-by-field analysis will have to wait, since Xici banned my own IP while I was writing this... 233
Straight to the code:

```python
import time
from random import random

import requests
from scrapy.selector import Selector
import pymysql

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER"
}

# The table uses the fields ip (varchar, primary key), port (varchar),
# proxy_type (varchar) and speed (float); the database is named ips and
# the table ip_pond. Fill in your own user and passwd.
conn = pymysql.connect(host='127.0.0.1', user='root', passwd='123456', db='ips', charset='utf8')
cursor = conn.cursor()


# Random delay, because this site bans IPs very aggressively
def rand_sleep_time():
    sleep_time = random() * 100
    return time.sleep(sleep_time)


# Refresh the IP pool
def update_ip_pond():
    # The site currently has 3637 pages; fetch the first 10 here
    for i in range(1, 11):
        resp = requests.get('https://www.xicidaili.com/nn/%s' % i, headers=headers)
        if resp.status_code != 200:
            print('Failed to fetch page %s' % i)
        else:
            print('Fetched page %s' % i)
            selector = Selector(text=resp.text)
            # Locate the table with id="ip_list" via XPath
            all_items = selector.xpath('//*[@id="ip_list"]//tr')
            ip_list = []
            # The first row is the table header; skip it
            for item in all_items[1:]:
                # Pull each field out of the row with XPath
                speed_str = item.xpath('td[7]/div/@title').get()
                if speed_str:
                    speed = float(speed_str.split('秒')[0])
                    ip = item.xpath('td[2]/text()').get()
                    port = item.xpath('td[3]/text()').get()
                    proxy_type = item.xpath('td[6]/text()').get().lower()
                    ip_list.append((ip, port, proxy_type, speed))
            for ip_info in ip_list:
                # Insert, updating the row if the key already exists
                cursor.execute(
                    "insert ip_pond(ip,port,proxy_type,speed) values ('{0}','{1}','{2}','{3}') "
                    "ON DUPLICATE KEY UPDATE ip=VALUES(ip),port=VALUES(port),"
                    "proxy_type=VALUES(proxy_type),speed=VALUES(speed)".format(
                        ip_info[0], ip_info[1], ip_info[2], ip_info[3]))
        # Delay to avoid getting blacklisted
        rand_sleep_time()
        conn.commit()
```
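Since Xici may be unreachable (or quick to ban you) while you experiment, the row-parsing logic above can be tried offline. A sketch of the same extraction against a hypothetical static snippet mirroring the structure of Xici's ip_list table, using only the standard library's re module instead of scrapy's Selector:

```python
import re

# Trimmed, hypothetical HTML mimicking one row of Xici's ip_list table
html = """
<table id="ip_list">
<tr><th>Country</th><th>IP</th><th>Port</th></tr>
<tr>
  <td></td><td>183.148.153.147</td><td>9999</td><td></td><td></td>
  <td>HTTP</td><td><div title="0.23秒"></div></td>
</tr>
</table>
"""

rows = re.findall(r'<tr>(.*?)</tr>', html, re.S)[1:]  # skip the header row
ip_list = []
for row in rows:
    tds = re.findall(r'<td>(.*?)</td>', row, re.S)
    speed_m = re.search(r'title="([\d.]+)秒"', row)  # speed is in the div's title, e.g. "0.23秒"
    if speed_m:
        ip, port = tds[1], tds[2]           # columns 2 and 3, as in the XPath version
        proxy_type = tds[5].lower()         # column 6: HTTP/HTTPS
        ip_list.append((ip, port, proxy_type, float(speed_m.group(1))))

print(ip_list)  # [('183.148.153.147', '9999', 'http', 0.23)]
```

The column positions and the "seconds" suffix in the speed title match what the scraper above expects; against the live site you would keep the Selector/XPath version, which is more robust than regexes.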

3. The GetIp class: fetching IPs back out of MySQL

```python
class GetIp(object):
    # Delete an unusable IP from the pool
    def delete_ip(self, ip):
        delete_sql = """DELETE FROM ip_pond WHERE ip='{0}'""".format(ip)
        cursor.execute(delete_sql)
        conn.commit()
        return True

    # Check whether an IP is usable
    def judge_ip(self, ip, port, proxy_type):
        # Validate against Baidu
        http_url = 'https://www.baidu.com'
        proxy_url = '{0}://{1}:{2}'.format(proxy_type, ip, port)
        try:
            # Handle http and https proxies separately
            if proxy_type == 'http':
                proxy_dict = {'http': proxy_url}
                response = requests.get(http_url, proxies=proxy_dict)
            else:
                proxy_dict = {'https': proxy_url}
                response = requests.get(http_url, proxies=proxy_dict, verify=False)
        except Exception:
            print('invalid ip and port')
            self.delete_ip(ip)
            return False
        else:
            code = response.status_code
            if 200 <= code < 300:
                print('effective ip')
                return True
            else:
                print('invalid ip and port')
                self.delete_ip(ip)
                return False

    # Pick a random row from the database
    def get_random_ip(self):
        random_sql = """SELECT ip,port,proxy_type,speed FROM ip_pond ORDER BY RAND() LIMIT 1"""
        cursor.execute(random_sql)
        for ip_info in cursor.fetchall():
            ip = ip_info[0]
            port = ip_info[1]
            proxy_type = ip_info[2]
            judge_re = self.judge_ip(ip, port, proxy_type)
            if judge_re:
                return '{0}://{1}:{2}'.format(proxy_type, ip, port)
            else:
                return self.get_random_ip()

    # Pick the fastest IP (identical to the above except for the SQL)
    def get_optimum_ip(self):
        optimum_sql = """SELECT ip,port,proxy_type,speed FROM ip_pond ORDER BY speed LIMIT 1"""
        cursor.execute(optimum_sql)
        for ip_info in cursor.fetchall():
            ip = ip_info[0]
            port = ip_info[1]
            proxy_type = ip_info[2]
            judge_re = self.judge_ip(ip, port, proxy_type)
            if judge_re:
                return '{0}://{1}:{2}'.format(proxy_type, ip, port)
            else:
                return self.get_optimum_ip()

    # Thin wrapper that returns a proxies dict ready for requests
    def get_proxies(self):
        ip = self.get_random_ip()
        print(ip)
        proxy_type = ip.split(':')[0]
        proxies = {proxy_type: ip}
        return proxies
```
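The core of this class is the pick-validate-delete-retry loop: select a candidate, test it, and if it is dead, remove it and recurse. That pattern can be exercised without a MySQL server or a network; here is a sketch against an in-memory SQLite pool, with the liveness check stubbed out (the table layout matches ip_pond, but the IPs and the stub are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute("CREATE TABLE ip_pond (ip TEXT PRIMARY KEY, port TEXT, proxy_type TEXT, speed REAL)")
cur.executemany("INSERT INTO ip_pond VALUES (?,?,?,?)", [
    ('1.2.3.4', '80', 'http', 0.2),    # fastest, but (per the stub) dead
    ('5.6.7.8', '8080', 'http', 0.5),  # slower, alive
])
conn.commit()

def judge_ip(ip, port, proxy_type):
    # Stub standing in for the real HTTP check: pretend 1.2.3.4 is dead
    return ip != '1.2.3.4'

def delete_ip(ip):
    cur.execute("DELETE FROM ip_pond WHERE ip=?", (ip,))
    conn.commit()

def get_optimum_ip():
    # Same shape as GetIp.get_optimum_ip: fastest first, drop dead entries, retry
    cur.execute("SELECT ip,port,proxy_type FROM ip_pond ORDER BY speed LIMIT 1")
    row = cur.fetchone()
    if row is None:
        return None  # pool exhausted
    ip, port, proxy_type = row
    if judge_ip(ip, port, proxy_type):
        return '{0}://{1}:{2}'.format(proxy_type, ip, port)
    delete_ip(ip)
    return get_optimum_ip()

print(get_optimum_ip())  # http://5.6.7.8:8080
```

Note that the recursion bottoms out when the table is empty; the class above relies on the same effect, since each failed candidate is deleted before retrying.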

4. How to use it

```python
if __name__ == '__main__':
    # When the chosen IP is an https proxy, this can be a bit slow
    # First confirm that the ip_pond table has data
    sql = """SELECT * FROM ip_pond"""
    check_table = cursor.execute(sql)
    if check_table:
        # Test URL; use whatever you like
        url = 'https://www.baidu.com'
        headers = {
            "User-Agent": "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10"
        }
        # The wrapper from the previous step hands back a ready-made proxies dict
        proxies = GetIp().get_proxies()
        res = requests.get(url=url, headers=headers, proxies=proxies)
    else:
        update_ip_pond()
```
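One refinement worth considering on top of the snippet above (hypothetical, not part of the original post): free proxies die constantly, so a single requests.get can fail even after validation. A retry loop that discards the failed proxy and pulls a fresh one is a natural fit; the fetching is injected as a callable here so the pattern can be tried without a network:

```python
def fetch_with_rotation(fetch, get_proxies, max_tries=3):
    """Try up to max_tries proxies; return the first successful response, else None."""
    for _ in range(max_tries):
        proxies = get_proxies()
        try:
            return fetch(proxies)
        except Exception:
            continue  # this proxy failed; pull a fresh one and retry
    return None

# In real use, fetch would be something like
#   lambda proxies: requests.get(url, headers=headers, proxies=proxies, timeout=10)
# and get_proxies would be GetIp().get_proxies.

# Demo with stand-ins: the first proxy always fails, the second succeeds
pool = iter([{'http': 'http://1.2.3.4:80'}, {'http': 'http://5.6.7.8:8080'}])

def fake_fetch(proxies):
    if proxies['http'] == 'http://1.2.3.4:80':
        raise ConnectionError('proxy dead')
    return 'ok via ' + proxies['http']

print(fetch_with_rotation(fake_fetch, lambda: next(pool)))  # ok via http://5.6.7.8:8080
```

Passing a timeout to the real fetch matters: without one, a half-dead proxy can hang the request indefinitely rather than failing over to the next candidate.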

Click here for the source code.

Summary

That wraps up everything 生活随笔 has collected on scraping the Xici IP proxy pool with Python; hopefully it helps you solve the problems you've run into.
