Simple Crawler: Scraping Free Proxy IPs
Environment: Python 3.6
Modules used: requests, pyquery
The code is fairly simple, so it needs little extra explanation.
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from pyquery import PyQuery as pq


class GetProxy(object):
    def __init__(self):
        # The proxy-list site to scrape
        self.url = 'http://www.xicidaili.com/nn/'
        self.header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/56.0.2924.87 Safari/537.36'}
        self.file = r'F:\python\code2\get_proxy\proxies.txt'
        # Page used to check whether a proxy actually works
        self.check_url = 'https://www.python.org/'
        self.title = 'Welcome to Python.org'

    def get_page(self):
        response = requests.get(self.url, headers=self.header)
        return response.text

    def page_parse(self, response):
        stores = []
        result = pq(response)('#ip_list')
        for p in result('tr').items():
            if p('tr > td').attr('class') == 'country':
                ip = p('td:eq(1)').text()
                port = p('td:eq(2)').text()
                protocol = p('td:eq(5)').text().lower()
                proxy = '{}://{}:{}'.format(protocol, ip, port)
                stores.append(proxy)
        return stores

    def start(self):
        response = self.get_page()
        proxies = self.page_parse(response)
        print(len(proxies))
        i = 0
        with open(self.file, 'w') as f:
            for proxy in proxies:
                try:
                    # Route both http and https through the candidate proxy;
                    # with only the 'http' key set (as in the original), the
                    # https check URL would bypass the proxy entirely.
                    check = requests.get(self.check_url,
                                         headers=self.header,
                                         proxies={'http': proxy, 'https': proxy},
                                         timeout=5)
                    check_char = pq(check.text)('head > title').text()
                    if check_char == self.title:
                        print('%s is useful' % proxy)
                        f.write(proxy + '\n')
                        i += 1
                except requests.RequestException:
                    continue
        print('Get %s proxies' % i)


if __name__ == '__main__':
    get = GetProxy()
    get.start()
```
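Once proxies.txt has been written, the saved entries can be reused for later requests. Here is a minimal sketch of that; the helper names and the rotation-by-random-sampling strategy are my own illustration, not part of the original post:

```python
import random

import requests


def load_proxies(path):
    """Read one proxy URL per line, e.g. 'http://1.2.3.4:8080'."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def fetch_via_proxy(url, proxy_list, timeout=5):
    """Try the proxies in random order; return the first response, or None."""
    for proxy in random.sample(proxy_list, k=len(proxy_list)):
        try:
            # requests selects the proxy entry matching the URL's scheme
            return requests.get(url,
                                proxies={'http': proxy, 'https': proxy},
                                timeout=timeout)
        except requests.RequestException:
            continue  # dead proxy, try the next one
    return None
```

Free proxies go stale quickly, so retrying across the whole list rather than trusting a single entry is usually necessary.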
Reposted from: https://www.cnblogs.com/thunderLL/p/6569067.html