
Python Crawler Personal Notes (2): Getting the fishc Courseware Download Links

Published: 2023/12/20


This tutorial is fairly brief; for the details, see my first tutorial:
Python Crawler Personal Notes (1): Douban 250

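I. Goal Analysis

Get the courseware and source code for Xiaojiayu's "Zero-Basics Introduction to Python" (零基礎入門學習python) course from http://bbs.fishc.com/forum-246-1.html

That is, the lecture threads listed on that page.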

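II. Step Analysis

1. Open http://bbs.fishc.com/forum-246-1.html
2. Use xpath() to extract the first-level links, for example
http://bbs.fishc.com/thread-36353-1-1.html
3. Use xpath() to extract the second-level download links
4. Download the courseware through the second-level links (failed)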

III. Simulating the Analysis in scrapy shell

Open cmd and run scrapy shell, then:

fetch('http://bbs.fishc.com/forum-246-1.html')

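It returns 200, so the request succeeded.

Next, get the XPath of the information we need (covered in Notes 1). I will demonstrate it only once more here (QQ Browser version).

Hover over the element you want, right-click and choose Inspect Element (or Inspect), then copy the XPath.

This gives //*[@id="normalthread_36353"]/tr/th/a[2]

Test it in the shell: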

>>> response.xpath('//*[@id="normalthread_36353"]/tr/th/a[2]/text()').extract()
['第000講愉快的開始（視頻+課件）']
>>>

Success.

Looking at the page source, every thread we need has an id of the form
//*[@id="normalthread_<number>"]

So we change the XPath to a fuzzy match with contains():
response.xpath('//table/*[contains(@id,"normalthread")]/tr/th/a[2]/text()').extract()

>>> response.xpath('//table/*[contains(@id,"normalthread")]/tr/th/a[2]/text()').extract()
['第000講愉快的開始（視頻+課件）',
 '第001講我和Python的第一次親密接觸（視頻+課件）',
 '第002講用Python設計第一個游戲（視頻+課件+源代碼）',
 '第003講小插曲之變量和字符串（視頻+課件）',
 '第004講改進我們的小游戲（視頻+課件+源代碼）',
 '第005講改進我們的小游戲（視頻+課件+源代碼）',
 '第006講 Pyhon之常用操作符（視頻+課件）',
 '第007講了不起的分支和循環01（視頻+課件）',
 '第008講了不起的分支和循環02（視頻+課件+源代碼）',
 '第009講了不起的分支和循環03（視頻+課件+源代碼）',
 '第010 講列表：一個打了激素的數組（視頻+課件）',
 '第011講列表：一個打了激素的數組2（視頻+課件）',
 '第012講列表：一個打了激素的數組3（視頻+課件）',
 '第013講元組：戴上了枷鎖的列表（視頻+課件）',
 '第014講字符串：各種奇葩的內置方法（視頻+課件）',
 '第015講字符串：格式化（視頻+課件）',
 '第016講序列！序列！（視頻+課件）',
 '第017講函數：Python的樂高積木（視頻+課件+源代碼）',
 '第018講函數：靈活即強大（視頻+課件）',
 '第019講函數：我的地盤聽我的（視頻+課件+源代碼）',
 '第020講函數：內嵌函數和閉包（視頻+課件）',
 '第021講函數：lambda表達式（視頻+課件）',
 '第022講函數：遞歸是神馬（視頻+課件+源代碼）',
 '第023講遞歸：這幫坑爹的小兔崽子（視頻+課件+源代碼）',
 '第024講遞歸：漢諾塔（視頻+課件+源代碼）',
 '第025講字典：當索引不好用時（視頻+課件）',
 '第026講字典：當索引不好用時2（視頻+課件）',
 '第027講集合：在我的世界里，你就是唯一（視頻+課件）',
 '第028講文件：因為懂你，所以永恒（視頻+課件）',
 '第029講文件：一個任務（視頻+課件+源代碼）',
 '第030講文件系統 ：介紹一個高大上的東西（視頻）',
 '第031講永久存儲：腌制一缸美味的泡菜（視頻+課件+源代碼）']
>>>
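The same fuzzy match can be tried outside scrapy shell with lxml, the library used in the script later in this post. The following is only a minimal sketch: SAMPLE_HTML is a made-up fragment that imitates the forum's table layout (it is an assumption, not the real page), and thread_titles is a hypothetical helper name.

```python
from lxml import etree

# Made-up HTML that imitates the thread table (an assumption, not the real page).
SAMPLE_HTML = """
<table>
  <tbody id="stickthread_1"><tr><th>
    <a href="#">icon</a><a href="thread-1-1-1.html">Pinned notice</a>
  </th></tr></tbody>
  <tbody id="normalthread_36353"><tr><th>
    <a href="#">icon</a><a href="thread-36353-1-1.html">Lecture 000</a>
  </th></tr></tbody>
  <tbody id="normalthread_36354"><tr><th>
    <a href="#">icon</a><a href="thread-36354-1-1.html">Lecture 001</a>
  </th></tr></tbody>
</table>
"""

def thread_titles(html):
    """Return the text of the second <a> in every row whose id contains 'normalthread'."""
    selector = etree.HTML(html)
    rule = '//table/*[contains(@id,"normalthread")]/tr/th/a[2]/text()'
    return selector.xpath(rule)

titles = thread_titles(SAMPLE_HTML)
print(titles)
```

Because contains() only tests for a substring of the id, the row whose id starts with "stickthread" is skipped while every "normalthread_<number>" row is matched.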

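The elements we need are retrieved successfully. Next, get the corresponding addresses.
In the same way, it is easy to work out that the XPath for the link address is
"//table/*[contains(@id,'nor')]/tr/th/a[2]/@href"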

>>> response.xpath("//table/*[contains(@id,'nor')]/tr/th/a[2]/@href").extract()
['thread-36353-1-1.html', 'thread-36354-1-1.html', 'thread-38065-1-1.html', 'thread-38066-1-1.html',
 'thread-38067-1-1.html', 'thread-38068-1-1.html', 'thread-38069-1-1.html', 'thread-38072-1-1.html',
 'thread-43424-1-1.html', 'thread-43425-1-1.html', 'thread-43426-1-1.html', 'thread-43427-1-1.html',
 'thread-43429-1-1.html', 'thread-43430-1-1.html', 'thread-43431-1-1.html', 'thread-43433-1-1.html',
 'thread-43434-1-1.html', 'thread-43435-1-1.html', 'thread-43436-1-1.html', 'thread-43437-1-1.html',
 'thread-43438-1-1.html', 'thread-43440-1-1.html', 'thread-43441-1-1.html', 'thread-43443-1-1.html',
 'thread-43444-1-1.html', 'thread-43445-1-1.html', 'thread-43446-1-1.html', 'thread-45420-1-1.html',
 'thread-45421-1-1.html', 'thread-45422-1-1.html', 'thread-48042-1-1.html', 'thread-48043-1-1.html']
>>>
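The hrefs returned here are relative, so they have to be combined with the site address before they can be opened. One possible way, shown as a small sketch, is urljoin from the standard library; to_absolute is a hypothetical helper name, and the base URL is the forum page from step 1.

```python
from urllib.parse import urljoin

# The page the relative links were scraped from.
BASE_URL = 'http://bbs.fishc.com/forum-246-1.html'

def to_absolute(hrefs, base=BASE_URL):
    """Resolve each relative href against the page it was found on."""
    return [urljoin(base, href) for href in hrefs]

links = to_absolute(['thread-36353-1-1.html', 'thread-36354-1-1.html'])
for link in links:
    print(link)
```

Compared with plain string concatenation, urljoin also copes with hrefs that are already absolute or that start with a slash.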

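Now we can write the Python code.

IV. Writing the Python Code (see Notes 1 for details)

Approach:
1. Get the first-level links
2. Get the second-level file download links
3. Open the second-level download links and download the files (this is a trap: only members can download, so it fails. Later I will try cookies to see whether that solves it)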
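For the cookie idea in step 3, the standard library can build an opener that stores cookies and sends them back on later requests. This is only a sketch of that mechanism: build_cookie_opener and the short User-Agent string are hypothetical, the forum's login request is not covered here, and whether a logged-in session is enough to download the attachments has not been verified.

```python
import http.cookiejar
import urllib.request

# Placeholder User-Agent for illustration only (an assumption).
USER_AGENT = 'Mozilla/5.0'

def build_cookie_opener(cookie_jar=None):
    """Return (opener, jar); the opener keeps cookies in jar between requests."""
    jar = cookie_jar if cookie_jar is not None else http.cookiejar.CookieJar()
    processor = urllib.request.HTTPCookieProcessor(jar)
    opener = urllib.request.build_opener(processor)
    # Headers in addheaders are sent with every request made through this opener.
    opener.addheaders = [('User-Agent', USER_AGENT)]
    return opener, jar

opener, jar = build_cookie_opener()
print(len(jar))  # no cookies are stored until a request has been made
```

Any cookies the site sets in response to a request made with opener.open(...) end up in jar and are attached automatically to the following requests made through the same opener.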

import os
import urllib.request

from lxml import etree

# Open a web page and return its raw bytes
def url_open(url):
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.3226.400 QQBrowser/9.6.11681.400')
    # Pass req rather than url, otherwise the User-Agent header is never sent
    response = urllib.request.urlopen(req)
    html = response.read()
    return html

# Save multiple files into folder, named 0, 1, 2, ...
def save(folder, file_url):
    os.makedirs(folder, exist_ok=True)
    for i, each in enumerate(file_url):
        # print(each)
        file_name = os.path.join(folder, str(i))
        # url_open() returns bytes, so the file is opened in binary mode
        with open(file_name, 'wb') as f:
            f.write(url_open(each))

# Save a single file
def save1(data, file_name='asd'):
    with open(file_name, 'wb') as f:
        f.write(data)

# Fetch url and return everything that matches the XPath rule xrule
def find_data(url='', xrule=''):
    html = url_open(url)
    selector = etree.HTML(html)
    return selector.xpath(xrule)

def test1():
    # Get the first-level links
    list_1 = []
    xrule_1 = "//table/*[contains(@id,'nor')]/tr/th/a[2]/@href"
    for i in range(1, 2):
        url = 'http://bbs.fishc.com/forum-246-' + str(i) + '.html'
        result_temp = find_data(url, xrule_1)
        for each in result_temp:
            list_1.append(each)
            print('http://bbs.fishc.com/' + each)
    # print(list_1)

    # Get the second-level file download links
    list_2 = []
    xrule_2 = '//*[contains(@id,"attach")]/a/@href'
    for each in list_1:
        url = 'http://bbs.fishc.com/' + str(each)
        result_temp = find_data(url, xrule_2)
        for link in result_temp:
            list_2.append(link)
            print('http://bbs.fishc.com/' + link)
    # print(list_2)

if __name__ == '__main__':
    test1()

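That is the full source; it has some comments, so you can read through it on your own.

V. Summary

A failed attempt. The reason: I still need to learn how the download links work.

And a quick plug while I'm at it: I love fishc.com

Come and study at the FishC forum ^_^ "Zero-Basics Introduction to Python" (零基礎入門學習python)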
