
.NET Core: ACG Site Crawler (Part 2)

Published: 2023/12/4

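This picks up where the previous article left off. The full code has already been published, but it is still worth walking through how the other page is requested and analyzed.

PS: another chapter I get to pad out happily, heh heh.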


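Page analysis

Last time we saw that the download button's href attribute is the javascript:; pseudo-protocol, so the link of the newly opened page carries a # symbol. We already handled that first jump with PhantomJS.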


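The download page

As it turns out, this page is even harsher: there is not even a pseudo-protocol. No matter. We reuse the approach from last time: PhantomJS renders the page and returns the link of the page it jumps to as the response to our client request.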

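Implementation

As described in the previous part, PhantomJS acts as the server. We send it a request, and it feeds the scraped result back to .NET. Note that the response returned to the client can be either the web page itself or the real data that PhantomJS has extracted by parsing the HTML.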
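The request that .NET sends is a small JSON body. The following is a minimal sketch of validating it on the server side, assuming the body is a serialized key/value pair whose `Key` holds the page URL and whose `Value` holds the class names of the button to look for (the shape used by the scripts in this article). `parseRequestBody` is a hypothetical helper that is not part of the original script, and the URL in the demo is a placeholder.

```javascript
// Hypothetical helper: validates the JSON body posted by the .NET client.
// Assumed shape: {"Key": "<page url>", "Value": "<class names of the button>"}
function parseRequestBody(raw) {
  var data;
  try {
    data = JSON.parse(raw);
  } catch (e) {
    return { ok: false, error: "body is not valid JSON" };
  }
  if (data === null || typeof data !== "object") {
    return { ok: false, error: "body is not a JSON object" };
  }
  if (typeof data.Key !== "string" || typeof data.Value !== "string") {
    return { ok: false, error: "expected string fields Key and Value" };
  }
  return { ok: true, url: data.Key, dom: data.Value };
}

// Demo with a placeholder URL.
var sampleBody = JSON.stringify({
  Key: "https://example.com/download/#token",
  Value: "Tappable-inactive btn btn-success btn-block"
});
console.log(parseRequestBody(sampleBody));
console.log(parseRequestBody("not json"));
```

Rejecting a malformed body up front lets the server answer with an error immediately instead of failing later inside the page callbacks.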

.NET Core code

    public async Task<string> GetDownloadPageAsync(string url)
    {
        string result = string.Empty;

        // Ask PhantomJS for the download page
        string dom = "Tappable-inactive animated fadeIn";
        KeyValuePair<string, string> url2dom = new KeyValuePair<string, string>(url, dom);
        var postData = JsonConvert.SerializeObject(url2dom);
        CookieContainer cc = new CookieContainer();
        HttpHelpers helper = new HttpHelpers();
        HttpItems items = new HttpItems();
        HttpResults hr = new HttpResults();
        items.Url = this.PostUrl1;
        items.Method = "POST";
        items.Container = cc;
        items.Postdata = postData;
        items.Timeout = 100000;
        hr = await helper.GetHtmlAsync(items);
        var downloadPageUrl = hr.Html;
        Console.WriteLine($"first => { downloadPageUrl }");

        if (downloadPageUrl.Contains("http"))
        {
            // Get the Baidu Cloud download address and share password
            //string code1 = "1";
            dom = "Tappable-inactive btn btn-success btn-block"; // download link
            url2dom = new KeyValuePair<string, string>(downloadPageUrl, dom);
            postData = JsonConvert.SerializeObject(url2dom);
            items = new HttpItems
            {
                Url = this.PostUrl2
            };
            items.Method = "POST";
            items.Container = cc;
            items.Postdata = postData;
            items.Timeout = 1000000;
            hr = await helper.GetHtmlAsync(items);
            result = hr.Html; // JSON data returned by PhantomJS
            Console.WriteLine($"second => { result }");
        }
        else
        {
            result = downloadPageUrl; // pass the error message through
        }
        return result;
    }

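This method covers both requests: the first one, which gets the download page from the detail page, and the second one, which gets the Baidu Cloud links and share passwords from the download page.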
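The two requests described above can also be sketched from the client's point of view. This is a JavaScript (Node.js) illustration of the same flow, not the author's C# code. `getDownloadInfo`, `post`, `endpoints`, and `fakePost` are hypothetical names introduced here, and the URLs are placeholders. The two class strings are the ones used in the C# method.

```javascript
// Sketch of the two-step flow. `post(endpoint, body)` is a hypothetical
// function supplied by the caller; it must return a Promise of the response text.
async function getDownloadInfo(detailUrl, post, endpoints) {
  // Step 1: ask the first PhantomJS endpoint for the download page URL.
  var first = await post(endpoints.first, JSON.stringify({
    Key: detailUrl,
    Value: "Tappable-inactive animated fadeIn"
  }));

  // Anything that does not contain "http" is treated as an error message.
  if (first.indexOf("http") === -1) {
    return { ok: false, error: first };
  }

  // Step 2: ask the second endpoint for the links and share passwords.
  var second = await post(endpoints.second, JSON.stringify({
    Key: first,
    Value: "Tappable-inactive btn btn-success btn-block"
  }));
  return { ok: true, body: second };
}

// Stand-in for a real HTTP client so the sketch runs without a server.
function fakePost(endpoint, body) {
  if (endpoint === "first") {
    return Promise.resolve("https://example.com/download/#token");
  }
  return Promise.resolve(JSON.stringify([
    { url: "https://pan.example.com/s/1", password: "abcd" }
  ]));
}

getDownloadInfo("https://example.com/detail/1", fakePost, {
  first: "first",
  second: "second"
}).then(function (r) {
  console.log(r);
});
```

Passing `post` in as a parameter keeps the flow independent of any particular HTTP library and makes it easy to exercise without PhantomJS running.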

JavaScript code

    "use strict";
    var port = 8089;
    var server = require('webserver').create();

    server.listen(port, function (request, response) {
        // The incoming parameters may still change. For now the body is a JSON string such as
        // {"Key":"https://acg12.com/download/#60e21d8417ab60fbfJfcqnT1BC8Qd20PehAIKv3J4ZO%2FJCo0htE9hP5IFZU",
        //  "Value":"Tappable-inactive btn btn-success btn-block"}
        // Key is the download page returned by the first request; Value is the class of the download button.
        var data = JSON.parse(request.postRaw);
        var url = data.Key.toString();
        console.log(url);
        var dom = data.Value.toString();
        console.log(dom);
        var code = 200;
        var responded = false; // set once the response has been closed
        var pwdArray = new Array();
        var result = new Array();
        var page = require('webpage').create();
        page.onInitialized = function() {
            page.customHeaders = {};
        };
        page.settings.loadImages = false;
        page.customHeaders = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36",
            "Referer": url
        };
        response.headers = {
            'Cache': 'no-cache',
            'Content-Type': 'text/plain',
            'Connection': 'Keep-Alive',
            'Keep-Alive': 'timeout=40, max=100'
        };

        // According to the PhantomJS site, this callback fires when a new tab is opened
        page.onPageCreated = function(newPage) {
            //console.log('A new child page was created! Its requested URL is not yet available, though.');
            newPage.onInitialized = function() {
                newPage.customHeaders = {};
            };
            newPage.settings.loadImages = false;
            newPage.customHeaders = {
                "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36"
            };
            //newPage.viewportSize = { width: 1920, height: 1080 };

            // Fires when the Baidu Cloud page has opened and finished rendering
            newPage.onLoadFinished = function(status) {
                //console.log('A child page is Loaded: ' + newPage.url);
                //newPage.render('newPage.png', {format: 'png', quality: '100'});
                //console.log(pwdArray.length);
                if (pwdArray.length > 0) {
                    //console.log("enter");
                    // Pop a password from the array; when there is no password the popped value is the string "null"
                    var temp = {"url": newPage.url.toString(), "password": pwdArray.pop().toString()};
                    console.log(JSON.stringify(temp));
                    result.push(temp); // add the JSON data to the result
                }
            };
        };

        page.open(url, function (status) {
            console.log("----" + status);
            if (status !== 'success') {
                code = 400;
                response.statusCode = code; // set the status before the first write
                response.write('4XX');
                response.close();
                responded = true;
            } else {
                code = 200;
                window.setTimeout(function () {
                    //var dom = dom;
                    pwdArray = page.evaluate(function(dom) {
                        console.log(dom);
                        var pwdArray = new Array();
                        var btnList = document.getElementsByClassName(dom); // Baidu Cloud links
                        for (var i = 0; i < btnList.length; i++) {
                            // Assumption: every download node has a password field
                            var temp = document.getElementById("downloadPwd-" + i);
                            if (temp != undefined) {
                                //console.log("****" + temp.value);
                                pwdArray.push(temp.value); // password found: push it
                            } else {
                                //console.log("****null");
                                pwdArray.push("null"); // no password: push the string "null" to keep a one-to-one match with the urls
                            }
                        }
                        for (var i = 0; i < btnList.length; i++) {
                            //console.log("click");
                            btnList[i].click(); // click download, which opens a new tab
                        }
                        return pwdArray;
                    }, dom);
                }, 6000);
            }
        });

        // Wait 20 seconds before sending the response to the client, so that everything above has time to finish
        window.setTimeout(function() {
            if (responded) {
                return; // the error branch has already closed the response
            }
            var rs = JSON.stringify(result);
            console.log(rs);
            response.statusCode = code;
            response.write(rs);
            response.close();
        }, 20000);

        page.onConsoleMessage = function(msg, lineNum, sourceId) {
            console.log("$$$$$" + msg);
        };

        page.onError = function(msg, trace) {
            var msgStack = ['PHANTOM ERROR: ' + msg];
            if (trace && trace.length) {
                msgStack.push('TRACE:');
                trace.forEach(function(t) {
                    msgStack.push(' -> ' + (t.file || t.sourceURL) + ': ' + t.line + (t.function ? ' (in function ' + t.function + ')' : ''));
                });
            }
            console.log(msgStack.join('\n'));
            phantom.exit(1);
        };
    });

    phantom.onError = function(msg, trace) {
        var msgStack = ['PHANTOM ERROR: ' + msg];
        if (trace && trace.length) {
            msgStack.push('TRACE:');
            trace.forEach(function(t) {
                msgStack.push(' -> ' + (t.file || t.sourceURL) + ': ' + t.line + (t.function ? ' (in function ' + t.function + ')' : ''));
            });
        }
        console.log(msgStack.join('\n'));
        phantom.exit(1);
    };
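On the client side, the body that this script returns still has to be interpreted. The following is a minimal sketch, assuming the response is either a JSON array of objects with `url` and `password` fields or a plain error string such as `4XX`, and that a missing password arrives as the string "null". `parseLinks` is a hypothetical helper, and the URLs in the demo are placeholders.

```javascript
// Hypothetical helper: turns the PhantomJS response text into a clean list.
// Returns an empty array when the text is not a JSON array (e.g. "4XX").
function parseLinks(responseText) {
  var items;
  try {
    items = JSON.parse(responseText);
  } catch (e) {
    return [];
  }
  if (!Array.isArray(items)) {
    return [];
  }
  return items
    .filter(function (item) {
      return item !== null && typeof item === "object" && typeof item.url === "string";
    })
    .map(function (item) {
      return {
        url: item.url,
        // The server sends the string "null" when a link has no password.
        password: item.password === "null" ? null : item.password
      };
    });
}

// Demo with placeholder data.
var sampleResponse = JSON.stringify([
  { url: "https://pan.example.com/s/1", password: "abcd" },
  { url: "https://pan.example.com/s/2", password: "null" }
]);
console.log(parseLinks(sampleResponse));
console.log(parseLinks("4XX"));
```

Converting the "null" placeholder back into a real null on the client keeps the one-to-one pairing on the server side while sparing later code from comparing against a magic string.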

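The complete source code is on GitHub. It includes ready-made bat files, so you can simply run run.bat, provided the environment from Part 1 is fully set up. See you next week. I may try out DotnetSpider then, a .NET Core crawler framework modeled on WebMagic; if you are interested, give it a try in the meantime.

Original article: http://www.jianshu.com/p/27bf3bb9ca60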

