
What forging the x-forwarded-for request header does in Python: an example of giving a Pyspider crawler random forged request headers

Published: 2024/9/27 | python | 豆豆


Pyspider uses the tornado library to make HTTP requests. Various parameters can be attached to a request, such as the connect timeout, the data transfer timeout, and the request headers. In the stock framework, parameters for a crawler are set through the crawl_config Python dictionary (shown below); the framework code converts the entries of this dictionary into task data and then performs the HTTP request. Because the dictionary is defined once for the whole handler, it is not convenient for giving each individual request a random header.

crawl_config = {
    "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
    "timeout": 120,
    "connect_timeout": 60,
    "retries": 5,
    "fetch_type": 'js',
    "auto_recrawl": True,
}

Here is a way to give the crawler random request headers:

1. Write a script, place it in pyspider's libs folder, and name it header_switch.py

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Created on 2017-10-18 11:52:26

import random


class HeadersSelector(object):
    """
    These header sets lack two fields: Host and Cookie.
    """

    headers_1 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "DNT": "1",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4",
        "Referer": "https://www.baidu.com/s?wd=%BC%96%E7%A0%81&rsv_spt=1&rsv_iqid=0x9fcbc99a0000b5d7&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rqlang=cn&tn=baiduhome_pg&rsv_enter=0&oq=If-None-Match&inputT=7282&rsv_t",
        "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
    }  # browser headers found online

    headers_2 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",
        "Accept": "image/gif,image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*",
        "DNT": "1",
        "Referer": "https://www.baidu.com/link?url=c-FMHf06-ZPhoRM4tWduhraKXhnSm_RzjXZ-ZTFnPAvZN",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4",
    }  # Windows 7 browser

    headers_3 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
        "Accept": "image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*",
        "DNT": "1",
        "Referer": "https://www.baidu.com/s?wd=http%B4%20Pragma&rsf=1&rsp=4&f=1&oq=Pragma&tn=baiduhome_pg&ie=utf-8&usm=3&rsv_idx=2&rsv_pq=e9bd5e5000010",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.7,en;q=0.6",
    }  # Firefox on Linux

    headers_4 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0",
        "Accept": "*/*",
        "DNT": "1",
        "Referer": "https://www.baidu.com/link?url=c-FMHf06-ZPhoRM4tWduhraKXhnSm_RzjXZ-ZTFnP",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.7,en;q=0.6",
    }  # Firefox on Windows 10

    headers_5 = {
        "Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64;) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Referer": "https://www.baidu.com/link?url=c-FMHf06-ZPhoRM4tWduhraKXhnSm_RzjXZ-",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.7,en;q=0.6",
        "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
    }  # Windows 10 browser (the User-Agent string identifies as Edge)

    headers_6 = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "DNT": "1",
        "Referer": "https://www.baidu.com/s?wd=If-None-Match&rsv_spt=1&rsv_iqid=0x9fcbc99a0000b5d7&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rq",
        "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",
    }  # Windows 10 browser

    def __init__(self):
        pass

    def select_header(self):
        # Pick one of the six header sets at random.
        n = random.randint(1, 6)
        switch = {
            1: self.headers_1,
            2: self.headers_2,
            3: self.headers_3,
            4: self.headers_4,
            5: self.headers_5,
            6: self.headers_6,
        }
        # Return a copy so that callers can add Host or Cookie
        # without modifying the shared class-level dictionaries.
        headers = dict(switch[n])
        return headers

I wrote only six header sets here. If the crawl volume is very large, you can write many more, even hundreds, and widen the range passed to random accordingly.
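One way to avoid adjusting the random range by hand each time a header set is added is to keep the sets in a list and let random.choice pick one. The following is a minimal sketch of that idea, not the article's code: HEADER_POOL and pick_header are hypothetical names, and the two entries are shortened stand-ins for full header sets.

```python
import random

# Hypothetical pool; in practice each entry would be a full header set.
HEADER_POOL = [
    {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
     "Accept": "*/*"},
    {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0",
     "Accept": "*/*"},
]


def pick_header(pool=HEADER_POOL):
    """Return a copy of one randomly chosen header set."""
    return dict(random.choice(pool))


header = pick_header()
header["Host"] = "www.baidu.com"  # only the copy is modified, not the pool
```

With this layout, adding a header set only means appending to the list; the selection code stays unchanged.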

2. Write the following code in the pyspider script:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2017-08-18 11:52:26

import sys

from pyspider.libs.base_handler import *
# header_switch.py is the script placed in pyspider's libs folder in step 1
from pyspider.libs.header_switch import HeadersSelector

# Python 2 only: force the default encoding to utf-8.
# On Python 3 the default is already utf-8, so this branch is skipped.
defaultencoding = 'utf-8'
if sys.getdefaultencoding() != defaultencoding:
    reload(sys)
    sys.setdefaultencoding(defaultencoding)


class Handler(BaseHandler):
    crawl_config = {
        "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
        "timeout": 120,
        "connect_timeout": 60,
        "retries": 5,
        "fetch_type": 'js',
        "auto_recrawl": True,
    }

    @every(minutes=24 * 60)
    def on_start(self):
        header_slt = HeadersSelector()
        header = header_slt.select_header()  # pick a random header set
        # header["X-Requested-With"] = "XMLHttpRequest"
        orig_href = 'http://sww.bjxch.gov.cn/gggs.html'
        self.crawl(orig_href,
                   callback=self.index_page,
                   headers=header)  # headers must be passed to crawl(); cookies come from response.cookies

    @config(age=24 * 60 * 60)
    def index_page(self, response):
        header_slt = HeadersSelector()
        header = header_slt.select_header()  # pick a random header set
        # header["X-Requested-With"] = "XMLHttpRequest"
        if response.cookies:
            # The request header is named "Cookie"; join the cookie dict into "k=v; k=v".
            header["Cookie"] = "; ".join(
                "%s=%s" % (k, v) for k, v in response.cookies.items())

The key point is that every callback (on_start, index_page, and so on) instantiates a header selector each time it is called, so each request is sent with a randomly chosen header set. Pay attention to the following lines:

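header_slt = HeadersSelector()
header = header_slt.select_header()  # pick a random header set
# header["X-Requested-With"] = "XMLHttpRequest"
header["Host"] = "www.baidu.com"
if response.cookies:
    # The request header is named "Cookie"; join the cookie dict into "k=v; k=v".
    header["Cookie"] = "; ".join(
        "%s=%s" % (k, v) for k, v in response.cookies.items())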

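When an AJAX request is sent through XHR, it carries the X-Requested-With header, which servers often use to decide whether a request is an Ajax request. For such pages, the headers must include {'X-Requested-With': 'XMLHttpRequest'} before the content can be fetched.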
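As a rough illustration, the sketch below adds the marker on the client side and shows the kind of check a server might perform. Both function names are hypothetical, and the server-side check is only an assumption about typical behaviour, not taken from any particular site.

```python
def mark_as_ajax(headers):
    """Return a copy of headers with the XHR marker added."""
    marked = dict(headers)
    marked["X-Requested-With"] = "XMLHttpRequest"
    return marked


def looks_like_ajax(headers):
    """Case-insensitive check a server might apply (assumed behaviour)."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get("x-requested-with") == "XMLHttpRequest"


base = {"Accept": "*/*"}
ajax_headers = mark_as_ajax(base)
```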

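Once the url is fixed, the Host field of the request header is fixed as well, so add it as needed. The urlparse module (urllib.parse in Python 3) provides a function that parses the host out of a url; simply read the netloc attribute of its result.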
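A minimal sketch of deriving Host from the url with urlparse follows. The helper name host_for is hypothetical; the import falls back to the Python 2 module name when urllib.parse is not available.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def host_for(url):
    """Return the network location (host, plus port if present) of url."""
    return urlparse(url).netloc


header = {"Accept": "*/*"}
header["Host"] = host_for("http://sww.bjxch.gov.cn/gggs.html")
```

Note that netloc keeps an explicit port when the url has one, which is also what the Host field is expected to carry in that case.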

If the response contains cookies, they need to be added to the request header.
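The HTTP request header that carries cookies is named Cookie, and its value is a single string of name=value pairs separated by "; ". The sketch below assumes that response.cookies behaves like a plain dictionary; cookie_header is a hypothetical helper, and the sample cookie values are made up for illustration.

```python
def cookie_header(cookies):
    """Join a {name: value} mapping into a Cookie header value."""
    return "; ".join(
        "%s=%s" % (name, value) for name, value in sorted(cookies.items()))


sample_cookies = {"sessionid": "abc123", "lang": "zh-CN"}  # made-up values
header = {"Accept": "*/*"}
if sample_cookies:
    header["Cookie"] = cookie_header(sample_cookies)
```

pyspider's self.crawl also documents a cookies parameter that takes a dictionary, which may be simpler than building the header by hand.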

If you need any other disguising fields, add them yourself.

That is all it takes to get random request headers.


Original article: http://www.cppcns.com/jiaoben/python/227296.html
