Jsoup Crawling and Getting Around Anti-Crawler Measures
1 Java can also crawl data from third-party websites.
Notes: 1 IP restrictions (anti-crawling)
       2 the referer header parameter
       3 spoofing the UA (User-Agent) header
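Notes 2 and 3 can be sketched with the standard library alone. This is a minimal sketch, not the author's code: the class name HeaderSpoofer and its methods are hypothetical, and the resulting values would still have to be handed to whatever HTTP client sends the request (with Jsoup, via userAgent(...) and header(...)).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical helper: builds the request headers a crawler would send.
class HeaderSpoofer {
    private static final Random RANDOM = new Random();

    // Picks a random entry; bounding by the array length keeps every entry reachable.
    static String pickUserAgent(String[] userAgents) {
        return userAgents[RANDOM.nextInt(userAgents.length)];
    }

    // Returns the two headers from the notes: a spoofed User-Agent and a Referer.
    static Map<String, String> buildHeaders(String[] userAgents, String referer) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", pickUserAgent(userAgents));
        headers.put("Referer", referer);
        return headers;
    }
}
```

Choosing a fresh User-Agent per request makes consecutive requests look less uniform; it does nothing about note 1, which needs a different IP.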
Let's try this against a third-party proxy website:
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Method body; ip, port, total, suc, fail, count and logger are defined elsewhere in the class.
{
    Random r = new Random();
    // Pool of browser User-Agent strings to rotate through.
    String[] ua = {
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36 OPR/37.0.2178.32",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2486.0 Safari/537.36 Edge/13.10586",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko",
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 BIDUBrowser/8.3 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36 Core/1.47.277.400 QQBrowser/9.4.7658.400",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 UBrowser/5.6.12150.8 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36 TheWorld 7",
        "Mozilla/5.0 (Windows NT 6.1; W…) Gecko/20100101 Firefox/60.0"
    };
    // Pick a random User-Agent from the pool.
    int i = r.nextInt(ua.length);
    logger.info("Checking ------ {}:{}", ip, port);
    // Form data sent with the POST request.
    Map<String, String> map = new HashMap<String, String>();
    map.put("waybillNo", "DD1838768852");
    try {
        total++;
        long a = System.currentTimeMillis();
        // Target site to crawl (here a proxy IP site); remember to change the URL!
        Document doc = Jsoup.connect("http://xxxx.com/dayProxy/ip/314639.html")
                .timeout(5000)
                //.proxy(ip, port)
                .data(map)
                .ignoreContentType(true)
                .userAgent(ua[i])
                .header("referer", "http://xxxx.com/dayProxy.html") // remember to change this referer
                .post();
        System.out.println(ip + ":" + port + " response time: " + (System.currentTimeMillis() - a) + " result: " + doc.text());
        suc++;
    } catch (IOException e) {
        e.printStackTrace();
        fail++;
    } finally {
        // Print the summary once every request has been counted.
        if (total == count) {
            System.out.println("Total requests: " + total);
            System.out.println("Successes: " + suc);
            System.out.println("Failures: " + fail);
        }
    }
}

The returned data is parsed through org.jsoup.nodes.Document to extract each proxy's IP and port.
Then, in the same code as above, uncomment the .proxy(ip, port) line and fill in the corresponding IP and port to switch to proxy mode,
which can get past around 90% of anti-crawler checks.
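The step of extracting the IP and port can be sketched as follows. This is a minimal sketch under assumptions: the proxy page's text (for example the result of doc.text()) is assumed to list entries in the form ip:port, which may not match the real page layout, and the class name ProxyParser is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls "ip:port" pairs out of plain text such as doc.text().
class ProxyParser {
    // Four dot-separated groups of 1-3 digits, a colon, then a 2-5 digit port.
    private static final Pattern IP_PORT =
            Pattern.compile("\\b(\\d{1,3}(?:\\.\\d{1,3}){3}):(\\d{2,5})\\b");

    // Returns each match as a two-element array: {ip, port}.
    static List<String[]> parse(String text) {
        List<String[]> proxies = new ArrayList<>();
        Matcher m = IP_PORT.matcher(text);
        while (m.find()) {
            proxies.add(new String[] {m.group(1), m.group(2)});
        }
        return proxies;
    }
}
```

Each pair can then be passed to .proxy(ip, Integer.parseInt(port)) in the request above. The pattern does not check that each group is at most 255, so entries should still be verified by an actual test request.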