MPV Video Player Development Log (02): Subtitle Downloading and a First Look at Web Crawling
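A long, long time ago I downloaded and analyzed flight data from Air China's intranet, which I suppose counts as my first crawler. Back then, though, it only went one level deep, so calling it a crawler is a bit of a stretch. Recently I have been writing an MPV-based player that needs to download subtitles, so I started looking into this area. One look gave me a shock: I had fallen far behind. Enough chatter, straight to the practical stuff. This article is entirely for beginners; experts may skip it.
This article proceeds step by step, like a running log, so beginners can follow along.
Focus of this article: how to construct an HTTP request that retrieves a link protected by anti-hotlinking measures.
Target website: 字幕庫 (Zimuku), http://www.zimuku.la/
Development language: C#
Required background knowledge:
A basic understanding of the HTTP protocol: URLs, the GET and POST methods, and how to make simple calls
Familiarity with HTML parsing tools; this code uses Jumony (Ivony.Html) and HtmlAgilityPack, as the using directives in the core code show.
Reference sites are listed at the end of this article.
Chrome debugging (search online for how to capture network traffic)
The rest is just a matter of using the specific tools.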
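A careful look at Zimuku's download process shows roughly the following steps:
Step 1: A test example
As can be seen, the search is easy to simulate: just construct the URL directly.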
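To make the URL construction concrete, here is a minimal sketch. The base URL is the one used in the calling code later in this article; the class name ZimukuUrls and the method name BuildSearchUrl are hypothetical, and percent-encoding the title is my own assumption about what the site expects.

```csharp
using System;

// Hypothetical helper, not part of the original project.
public static class ZimukuUrls
{
    // Search URL prefix, as used in the calling code of this article.
    private const string SearchBase = "http://www.zimuku.la/search?q=";

    // Builds the search URL for a movie title. The title is percent-encoded
    // so that spaces and non-ASCII characters survive in the query string.
    public static string BuildSearchUrl(string movieTitle)
    {
        if (string.IsNullOrWhiteSpace(movieTitle))
            throw new ArgumentException("Movie title must not be empty.", nameof(movieTitle));

        return SearchBase + Uri.EscapeDataString(movieTitle.Trim());
    }
}
```

Calling ZimukuUrls.BuildSearchUrl("Warrior 2019") yields a URL whose query part is Warrior%202019, which can then be loaded like any other page.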
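The test only takes the first subtitle in the search results, so we need to look at the page source.
The key part of the source is the anchor element of each result, for example one with href="/detail/145499.html", like the anchors quoted in the comments of the core code.
Clicking the first row jumps to the following page:
Note that the URL of that page corresponds to the href in the source; this correspondence is the basic ingredient for constructing URLs. Clicking "下載字幕" (Download subtitle) leads to the following page: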
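The href-to-URL correspondence can be sketched as a small function. It mirrors the rule used in the core code of this article (everything after "detail" is appended to http://zmk.pw/dld); the names ZimukuLinks and ToDownloadPageUrl are hypothetical, and the rule itself is only what was observed on the site at the time of writing.

```csharp
using System;

// Hypothetical helper, not part of the original project.
public static class ZimukuLinks
{
    // Maps a detail-page href such as "/detail/145499.html" to the matching
    // download page "http://zmk.pw/dld/145499.html".
    public static string ToDownloadPageUrl(string detailHref)
    {
        const string marker = "detail";
        int index = detailHref == null ? -1 : detailHref.IndexOf(marker, StringComparison.Ordinal);
        if (index < 0)
            throw new ArgumentException("Not a detail-page href.", nameof(detailHref));

        // Keep everything after the first "detail", e.g. "/145499.html".
        return "http://zmk.pw/dld" + detailHref.Substring(index + marker.Length);
    }
}
```

The result for "/detail/145499.html" is the same address that the article later uses as the Referer value.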
Clicking the first button on that page downloads the subtitle. So can the link be obtained from the page source?
Take a look at the source:
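The download page contains list items like the one quoted in the comments of the core code: an anchor with rel="nofollow", an href starting with /download/, and a class attribute. As a rough sketch, the first such link can be pulled out with the same regular expression the Python reference code uses. The class DownloadLinkFinder is hypothetical, and a regex is used here only to stay dependency-free; the article's own code uses HtmlAgilityPack with XPath instead.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original project.
public static class DownloadLinkFinder
{
    // Same pattern as the Python reference: rel="nofollow" href="(.*?)" class=
    private static readonly Regex LinkPattern =
        new Regex("rel=\"nofollow\" href=\"(.*?)\" class=", RegexOptions.Compiled);

    // Returns the absolute URL of the first download link, or null if none is found.
    public static string FindFirst(string html)
    {
        Match match = LinkPattern.Match(html ?? string.Empty);
        return match.Success ? "http://zmk.pw" + match.Groups[1].Value : null;
    }
}
```

This only finds the link; as the next paragraphs explain, requesting it without the right headers does not return the file.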
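Opening the link found in that page source directly, something strange happens; the page looks like this:
So the site must be applying anti-hotlinking protection, and simply having the link does not seem to help. The simple approach of crawling address by address stops working here. After some painful thought I realized I must be missing something, namely a gap in my own knowledge. Searching Baidu for anti-hotlinking techniques and filtering the results, I learned that the HTTP request itself has to be simulated, that is, the request must be made the way a browser makes it. Searching further for HTTP request simulation, I learned that the request headers have to be constructed as in Step 2. How are the headers constructed, and where do the header values come from? Search for how to capture network traffic with Chrome's debugging tools; once a request has been captured in Chrome, all of these values are available.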
Step 2: Constructing the HTTP header
Here is the code directly:
    httpwebRequest.Referer = url_referer; // e.g. http://zmk.pw/dld/145499.html
    httpwebRequest.Method = Mehtod;
    httpwebRequest.ContentType = "text/html";
    httpwebRequest.Accept = @"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9";
    httpwebRequest.Headers["Accept-Encoding"] = "gzip, deflate";
    httpwebRequest.Headers["Accept-Language"] = "zh-CN,zh;q=0.9";
    httpwebRequest.UserAgent = "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Mobile Safari/537.36";
    httpwebRequest.Host = "zmk.pw";
    httpwebRequest.Headers["Upgrade-Insecure-Requests"] = "1";
    httpwebRequest.KeepAlive = true;
    String cookie = "__cfduid=d843ffdcc4b5ccd31db5801cb7cb86d1d1606619179; __gads=ID=eb409c7182aed71c-22064ba0f3c400b5:T=1606619196:RT=1606619196:S=ALNI_Mb7XrQevUe_Ub3ra-5_LyzWqi98SA; PHPSESSID=q7eqk2525nhbceevfugbsscfm4";
    httpwebRequest.Headers.Add("Cookie", cookie); // set the cookies
    Encoding encoding = Encoding.GetEncoding("utf-8");
These property names and values can all be seen in Chrome's debugging tools, as in the following screenshot:
Copy the values shown there and your HTTP header is complete.
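Copying the values by hand is error-prone, so one option is to paste the raw request headers from Chrome's Network panel into a string and parse them. The sketch below does only that; the class RawHeaderParser is hypothetical and assumes the pasted text has one "Name: value" pair per line.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original project.
public static class RawHeaderParser
{
    // Parses text such as "Referer: http://zmk.pw/dld/145499.html" (one header
    // per line) into a case-insensitive name/value dictionary.
    public static Dictionary<string, string> Parse(string rawHeaders)
    {
        var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        string[] lines = (rawHeaders ?? string.Empty)
            .Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string line in lines)
        {
            int colon = line.IndexOf(':');
            // Skip lines without a name before the colon, such as
            // HTTP/2 pseudo-headers (":authority: ...") or lines with no colon.
            if (colon <= 0)
                continue;

            string name = line.Substring(0, colon).Trim();
            string value = line.Substring(colon + 1).Trim();
            headers[name] = value; // a later duplicate overwrites an earlier one
        }
        return headers;
    }
}
```

Note that with HttpWebRequest some headers (Referer, Accept, User-Agent, Host and a few others) still have to be assigned through their dedicated properties rather than through the Headers collection, which is why the code above sets them individually.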
When the request is made again with these headers, the response returns the subtitle file. The call succeeds.
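The code in this article reads the response as text. If the server happens to return an archive or a file in another encoding (an assumption on my part, not something tested here), it may be safer to write the raw bytes to disk unchanged. A minimal sketch, with the hypothetical names SubtitleSaver and SaveTo:

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the original project.
public static class SubtitleSaver
{
    // Copies the response body stream to a file byte for byte and
    // returns the number of bytes written.
    public static long SaveTo(Stream body, string path)
    {
        if (body == null)
            throw new ArgumentNullException(nameof(body));

        using (FileStream file = File.Create(path))
        {
            body.CopyTo(file);
            return file.Length;
        }
    }
}
```

With HttpWebResponse, the stream to pass in would be the one returned by GetResponseStream(); the file name and extension are left to the caller.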
Step 3: Core code implementation
The core code follows, including the step-by-step construction of the URLs; resource files and UI code are omitted.
    using Ivony.Html;
    using Ivony.Html.Parser;
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    using System.Xml.XPath;
    using System.Xml;
    using System.Net;
    using System.IO;
    using HtmlAgilityPack;

    namespace ILearnPlayer
    {
        /// <summary>
        /// Function: download subtitle files from the web.
        /// The first site is http://www.zimuku.la/
        /// Refer to http://www.360doc.com/content/20/0817/17/71178898_930801275.shtml
        /// </summary>
        class SubtitleDownLoader
        {
            public static string EntryPoint;

            // Get the HTML DOM tree of the given page (Jumony); returns null on failure.
            public static IHtmlDocument GetHtmlDocument(string url)
            {
                try
                {
                    return new JumonyParser().LoadDocument(url);
                }
                catch
                {
                    return null;
                }
            }

            // Example of a result link:
            // <a href="/detail/145499.html" target="_blank" title="戰士 第2季第09集【YYeTs字幕組 簡繁英雙語字幕】Warrior.2019.S02E09.Enter.the.Dragon.720p/1080p.REPACK.AMZN.WEB-DL.DDP5.1.H.264-NTb"><b>戰士 第2季第09集【YYeTs字幕組 簡繁英雙語字幕】Warrior.2019.S02E09.Enter.the.Dragon.720p/1080p.REPACK.AMZN.WEB-DL.DDP5.1.H.264-NTb</b></a>
            public static List<SubtitleFileItem> GetSunbtitle(int page)
            {
                List<SubtitleFileItem> result = new List<SubtitleFileItem>();
                IHtmlDocument document = GetHtmlDocument(EntryPoint);
                if (document == null)
                    return result;

                List<IHtmlElement> listsTable = document.Find("table").ToList();
                if (listsTable.Count == 0)
                    return result; // no result table on the page

                // Find every <a href...> in the first table, for example:
                // <a href="/detail/130526.html" target="_blank" title="肖申克的救贖 The Shawshank Redemption(1994) 特效中英文字幕 上藍下白.ass"><b>肖申克的救贖 The Shawshank Redemption(1994) 特效中英文字幕 上藍下白.ass</b></a>
                List<IHtmlElement> listsSub = listsTable[0].Find("a").ToList();
                for (int i = 0; i < listsSub.Count; i++)
                {
                    SubtitleFileItem item = new SubtitleFileItem();
                    IHtmlElement subItem = listsSub[i];
                    item.url = subItem.Attribute("href").AttributeValue;
                    if (item.url.IndexOf("/subs/") >= 0)
                    {
                        // The last line is not a subtitle but a "more" link:
                        // <a target="_blank" href="/subs/24068.html"><span class="label label-danger">還有18個字幕,點擊查看</span></a>
                        break;
                    }
                    item.title = subItem.Attribute("title").AttributeValue;
                    result.Add(item);
                }
                return result;
            }

            /// <summary>
            /// The steps follow this Python code:
            /// </summary>
            // # Get the link to the detail page of the first subtitle in the search results
            // etreeRes = etree.HTML(res)
            // resTd = etreeRes.xpath('//td/a/@href')[0]
            // subDownUrl = 'http://zmk.pw/dld' + resTd.split('detail')[-1]
            // # print('成功搜索到字幕 %s ' % videoName)
            // resDown = requests.get(subDownUrl).text
            // DownUrl = 'http://zmk.pw' + \
            //     re.findall(r'rel="nofollow" href="(.*?)" class=', resDown)[0]
            public String getTheFirstSubtitl(String url)
            {
                // Use HtmlAgilityPack to parse the href values, loading the page from the web.
                var web = new HtmlWeb();
                var doc = web.Load(url);

                // XPath: the first <a href> inside a table cell is the first search result.
                var firstResult = doc.DocumentNode.SelectSingleNode("//td/a[@href]");
                if (firstResult == null)
                    return ""; // no search results
                String resTd = firstResult.GetAttributeValue("href", "");

                string[] parts = resTd.Split(new[] { "detail" }, StringSplitOptions.None);
                if (parts.Length < 2)
                    return ""; // not a detail-page link
                var subDownUrl = "http://zmk.pw/dld" + parts[1];

                doc = web.Load(subDownUrl); // requests.get(subDownUrl).text
                // The first <a rel=...> inside a list item is the first download server, for example:
                // <li><a rel="nofollow" href="/download/MTQ1NjU2fDdmYmRlNDBkYWZlZWM1MmM5NWQxMzNmN3wxNjA2ODMzMzE5fDRjYTA5MWVhfHJlbW90ZQ%3D%3D/svr/dx1" class="btn btn-danger btn-sm"><span class="glyphicon glyphicon-save icon_size"></span> 電信高速下載(一)</a></li>
                var firstDownload = doc.DocumentNode.SelectSingleNode("//li/a[@rel]");
                if (firstDownload == null)
                    return ""; // no download link on the page
                var downUrl = "http://zmk.pw" + firstDownload.GetAttributeValue("href", "");

                // The download page is passed as the Referer.
                return zimukuHttpReq(downUrl, subDownUrl, "");
            }

            public static string zimukuHttpReq(string url, string url_referer, string data,
                string Mehtod = "GET", bool xml = false, string head = "")
            {
                HttpWebRequest httpwebRequest = null;
                HttpWebResponse httpwebResponse = null;
                StreamReader streamReader = null;
                string responsecontent = "";
                try
                {
                    httpwebRequest = (HttpWebRequest)WebRequest.Create(url);
                    httpwebRequest.Referer = url_referer; // e.g. http://zmk.pw/dld/145499.html
                    httpwebRequest.Method = Mehtod;
                    httpwebRequest.ContentType = "text/html";
                    httpwebRequest.Accept = @"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9";
                    httpwebRequest.Headers["Accept-Encoding"] = "gzip, deflate";
                    // The request advertises gzip/deflate, so let the framework decompress the body.
                    httpwebRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
                    httpwebRequest.Headers["Accept-Language"] = "zh-CN,zh;q=0.9";
                    httpwebRequest.UserAgent = "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Mobile Safari/537.36";
                    httpwebRequest.Host = "zmk.pw";
                    httpwebRequest.Headers["Upgrade-Insecure-Requests"] = "1";
                    httpwebRequest.KeepAlive = true;
                    String cookie = "__cfduid=d843ffdcc4b5ccd31db5801cb7cb86d1d1606619179; __gads=ID=eb409c7182aed71c-22064ba0f3c400b5:T=1606619196:RT=1606619196:S=ALNI_Mb7XrQevUe_Ub3ra-5_LyzWqi98SA; PHPSESSID=q7eqk2525nhbceevfugbsscfm4";
                    httpwebRequest.Headers.Add("Cookie", cookie); // set the cookies

                    httpwebResponse = (HttpWebResponse)httpwebRequest.GetResponse();
                    // Read the body as UTF-8 text; the reader also honors a byte-order mark if present.
                    streamReader = new StreamReader(httpwebResponse.GetResponseStream(), Encoding.UTF8);
                    responsecontent = streamReader.ReadToEnd();
                }
                catch (Exception ex)
                {
                    return ex.Message;
                }
                finally
                {
                    if (streamReader != null)
                        streamReader.Close();
                    if (httpwebResponse != null)
                        httpwebResponse.Close();
                    if (httpwebRequest != null)
                        httpwebRequest.Abort();
                }
                return responsecontent;
            }
        }

        // Describes one subtitle file (item).
        public class SubtitleFileItem
        {
            /// <summary>Title</summary>
            public String title { set; get; }
            /// <summary>Language</summary>
            public String lang { set; get; }
            /// <summary>Ranking</summary>
            public int rank { set; get; }
            /// <summary>Download count</summary>
            public int dldTimes { set; get; }
            /// <summary>Uploader</summary>
            public String uploader { set; get; }
            /// <summary>Upload time</summary>
            public String upldTime { set; get; }
            /// <summary>URL</summary>
            public String url { set; get; }
        }
    }
Calling code:
    // Percent-encode the title so that spaces and non-ASCII characters survive in the query string.
    String url = "http://www.zimuku.la/search?q=" + Uri.EscapeDataString(txtMovie.Text);
    SubtitleDownLoader sbtDld = new SubtitleDownLoader();
    txtContent.Text = sbtDld.getTheFirstSubtitl(url);
A screenshot of the call follows:
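This concludes the first crawler test; the next task is filtering and organizing the results. The purpose of this article is to serve as a memo and to share.
Afterword
Thanks to the internet and to fellow netizens. The websites and articles I consulted are listed below.
I mainly consulted this article, which inspired me a great deal.
Python source:
https://blog.csdn.net/xun527/article/details/110229840
C#: how to resolve "This header must be modified using the appropriate property or method"
https://blog.csdn.net/u011127019/article/details/52571317
Parsing web pages with XPath in C#
https://blog.csdn.net/weixin_34121282/article/details/86263589
XPath syntax
https://www.w3school.com.cn/xpath/xpath_syntax.asp
https://www.runoob.com/xpath/xpath-syntax.html
馬拉孫, 2020-12-02, in the greater Wudaokou area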