

Blocking Spam Spider User-Agents on IIS 6/IIS 7+, Nginx, and Apache to Reduce Server Load (Including How to Deny a Specific User-Agent on IIS 7.5)


Recently my site became very slow: CPU usage and the overall server load were extremely high. Opening the access logs, I found a crowd of obscure spiders endlessly crawling the site, and experience said that was the problem. I wrote blocking rules for my situation, and after applying them the load came back down. Below is a summary of how to block unknown spider User-Agents (UAs) under IIS, Nginx, and Apache.
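Before writing any rules, it helps to see which UAs are actually hammering the site. Here is a minimal Python sketch that tallies User-Agents from an access log; it assumes the common combined log format, and the log path is a placeholder you would replace with your own file:

# count_ua.py - tally User-Agents in a combined-format access log
# (LOG_PATH is a placeholder; point it at your own log file)
import re
from collections import Counter

LOG_PATH = "access.log"
# In the combined format the User-Agent is the last double-quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

counts = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1

for ua, hits in counts.most_common(20):
    print(f"{hits:8d}  {ua}")

Any unfamiliar UA near the top of this output is a candidate for the deny lists below.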

Note: adjust the UA lists to your own situation by deleting or adding entries. The rules I provide include many rarely-seen spider UAs that are almost never needed; if your site is unusual and needs certain spiders to crawl it, study the rules carefully and remove those specific UAs.

Tested OK on IIS 7.5

Deny access to UAs matching the listed patterns, returning status code 403:

<rule name="NoUserAgent" stopProcessing="true"> <match url=".*" /> <conditions> <add input="{HTTP_USER_AGENT}" pattern="|特征1|特征2|特征3" /> </conditions> <action type="CustomResponse" statusCode="403" statusReason="Forbidden: Access is denied." statusDescription="You did not present a User-Agent header which is required for this site" /> </rule>

For example, to deny only empty UAs:

<add input="{HTTP_USER_AGENT}" pattern="^$|pattern2|pattern3" />

For example, to deny a set of named UAs plus empty UAs:

<add input="{HTTP_USER_AGENT}" pattern="^$|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot" />
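A word of warning about these patterns: a stray leading "|" (e.g. pattern="|pattern1|pattern2") creates an empty alternative that matches every User-Agent, so the rule would block all visitors. IIS evaluates the pattern as a .NET regular expression, and Python's re module behaves the same way for this case, so a quick sketch (with hypothetical bot names) shows the difference:

# empty_alt_demo.py - why a leading "|" in an alternation is a bug
import re

bad = re.compile("|SemrushBot|MJ12bot")   # leading pipe adds an empty alternative
good = re.compile("SemrushBot|MJ12bot")

ua = "Mozilla/5.0 (an ordinary browser)"
print(bool(bad.search(ua)))    # True  - the empty alternative matches everything
print(bool(good.search(ua)))   # False - only the named bots match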

Deny specific spiders (by UA, empty UA, or client IP):

<rewrite>
  <rules>
    <rule name="Block Some Ip Adresses OR Bots" stopProcessing="true">
      <match url="(.*)" />
      <conditions logicalGrouping="MatchAny">
        <add input="{HTTP_USER_AGENT}" pattern="SpiderName" ignoreCase="true" /> <!-- deny a specific spider -->
        <add input="{HTTP_USER_AGENT}" pattern="^$" /> <!-- deny empty-UA requests -->
        <add input="{REMOTE_ADDR}" pattern="a single IP, or a regex matching IP addresses" /> <!-- deny by client IP -->
      </conditions>
      <!-- You can also use <action type="AbortRequest" /> in place of the line below -->
      <action type="CustomResponse" statusCode="403" statusReason="Access is forbidden." statusDescription="Access is forbidden." />
    </rule>
  </rules>
</rewrite>

Deny access to a specific file:

<rule name="Block spider"> <match url="(^robotssss.txt$)" ignoreCase="false" negate="true" /> <!-- 禁止瀏覽某文件 --> <action type="CustomResponse" statusCode="403" statusReason="Forbidden" statusDescription="Forbidden" /> </rule>





1. Nginx: to deny spam spiders, put the following code into your nginx configuration (typically inside the relevant server block):
# Deny scraping tools such as Scrapy
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
    return 403;
}

# Deny the listed UAs, and empty UAs
if ($http_user_agent ~ "opensiteexplorer|MauiBot|FeedDemon|SemrushBot|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|semrushbot|alphaseobot|semrush|Feedly|UniversalFeedParser|webmeup-crawler|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$") {
    return 403;
}

# Deny request methods other than GET, HEAD, and POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}
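After reloading nginx you can verify that the rule fires. The sketch below uses Python's standard urllib to send a request with a blocked UA and expects a 403; the URL is a placeholder for your own site. (Note that the list above also blocks Python-urllib itself, so even urllib's default UA would be rejected.) The same check works for the IIS and Apache rules in the next two sections.

# check_block.py - confirm that a blacklisted UA receives a 403
# (the URL is a placeholder; point it at your own server)
import urllib.error
import urllib.request

req = urllib.request.Request(
    "http://example.com/",
    headers={"User-Agent": "MJ12bot"},  # any UA from the deny list
)
try:
    with urllib.request.urlopen(req) as resp:
        print("NOT blocked, got status", resp.status)
except urllib.error.HTTPError as err:
    print("blocked, got status", err.code)  # expect 403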


2. IIS 7/IIS 8/IIS 10 and later: create a web.config file in the site root and add the code below. The negate="true" on the robots.txt match means the rule applies to every URL except robots.txt, so blocked bots can still read your crawl policy; any UA matching the pattern list has its request aborted:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <system.webServer>
    <rewrite>
      <rules>
        <rule name="Block spider">
          <match url="(^robots.txt$)" ignoreCase="false" negate="true" />
          <conditions>
            <add input="{HTTP_USER_AGENT}" pattern="MegaIndex|MegaIndex.ru|BLEXBot|Qwantify|qwantify|semrush|Semrush|serpstatbot|hubspot|python|Bytespider|Go-http-client|Java|PhantomJS|SemrushBot|Scrapy|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|perl|Python|Wget|Xenu|ZmEu|^$" ignoreCase="true" />
          </conditions>
          <action type="AbortRequest" />
        </rule>
      </rules>
    </rewrite>
  </system.webServer>
</configuration>


3. Apache: add the following rules to your .htaccess file. [NC] makes the UA match case-insensitive, [F] returns 403 Forbidden, and the negated RewriteRule again exempts robots.txt:

<IfModule mod_rewrite.c>
  RewriteEngine On
  # Block spiders
  RewriteCond %{HTTP_USER_AGENT} "MegaIndex|MegaIndex.ru|BLEXBot|Qwantify|qwantify|semrush|Semrush|serpstatbot|hubspot|python|Bytespider|Go-http-client|Java|PhantomJS|SemrushBot|Scrapy|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|perl|Python|Wget|Xenu|ZmEu|^$" [NC]
  RewriteRule !(^robots\.txt$) - [F]
</IfModule>


Note: by default these rules block a selection of obscure spiders; to block others, add them following the same pattern.

Appendix: UA names of the major spiders (a small helper that uses these names follows the list):

Google spider: googlebot

Baidu spider: baiduspider

Baidu mobile spider: baiduboxapp

Yahoo spider: slurp

Alexa spider: ia_archiver

MSN spider: msnbot

Bing spider: bingbot

AltaVista spider: scooter

Lycos spider: lycos_spider_(t-rex)

AllTheWeb spider: fast-webcrawler

Inktomi spider: slurp

Youdao spider: YodaoBot and OutfoxBot

热土 spider: Adminrtspider

Sogou spider: sogou spider

SOSO spider: sosospider

360 Search spider: 360spider
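If you analyze logs programmatically, a tiny helper like the sketch below (the tokens come from the list above; the function name is my own invention) can flag UAs that claim to be none of the major spiders:

# known_spiders.py - flag UAs that match none of the major spiders listed above
KNOWN_SPIDERS = (
    "googlebot", "baiduspider", "baiduboxapp", "slurp", "ia_archiver",
    "msnbot", "bingbot", "scooter", "lycos_spider_(t-rex)",
    "fast-webcrawler", "yodaobot", "outfoxbot", "adminrtspider",
    "sogou spider", "sosospider", "360spider",
)

def is_known_spider(user_agent: str) -> bool:
    """Return True if the UA contains any major-spider token (case-insensitive)."""
    ua = user_agent.lower()
    return any(token in ua for token in KNOWN_SPIDERS)

print(is_known_spider("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # True
print(is_known_spider("SomeRandomBot/0.1"))                        # False

Bots that pass this check are candidates for the deny lists above (bearing in mind that a spoofer can claim any UA it likes).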




Common spam UAs seen around the web, by category (a small tagging helper follows the list):

Content scraping: FeedDemon, Java, Jullo, Feedly, UniversalFeedParser

SQL injection: BOT/0.1 (BOT for JCE), CrawlDaddy

Useless crawlers: EasouSpider, Swiftbot, YandexBot, AhrefsBot, jikeSpider, MJ12bot, YYSpider, oBot

CC attack tools: ApacheBench, WinHttp

TCP attack: HttpClient

Scanners: Microsoft URL Control, ZmEu (phpMyAdmin scanner), jaunty
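As with the major-spider list, these categories are easy to use from a script. A minimal sketch (the dictionary and function names are my own) that tags a UA string with its spam category:

# spam_ua_tags.py - tag a UA string with its spam category from the list above
from typing import Optional

SPAM_UA_CATEGORIES = {
    "content scraping": ("FeedDemon", "Jullo", "Feedly", "UniversalFeedParser"),
    "SQL injection": ("BOT for JCE", "CrawlDaddy"),
    "useless crawler": ("EasouSpider", "Swiftbot", "YandexBot", "AhrefsBot",
                        "jikeSpider", "MJ12bot", "YYSpider", "oBot"),
    "CC attack": ("ApacheBench", "WinHttp"),
    "TCP attack": ("HttpClient",),
    "scanner": ("Microsoft URL Control", "ZmEu", "jaunty"),
}

def tag_spam_ua(user_agent: str) -> Optional[str]:
    """Return the spam category for a UA, or None if it matches nothing."""
    ua = user_agent.lower()
    for category, tokens in SPAM_UA_CATEGORIES.items():
        if any(token.lower() in ua for token in tokens):
            return category
    return None

print(tag_spam_ua("Mozilla/4.0 (compatible; MJ12bot/v1.4.8)"))  # useless crawler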




Summary

The above is everything 生活随笔 has collected on blocking spam spider UAs under IIS 6/IIS 7+, Nginx, and Apache to reduce server load, including how to deny a specific User-Agent on IIS 7.5; hopefully it helps you solve the problem you ran into.
