Python Training Day 06 AM: Web Scraping (Scraping Data APIs)

Published: 2024/9/30 python 54 豆豆


Python Training (15 days): blog index

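1. Understanding “from bs4 import BeautifulSoup”

2. Font-based anti-scraping

3. Web scraping

3.1 Scraping page HTML directly: the content of each chapter

3.2 Scraping a data API: 英魂之刃 hero avatars

3.3 Scraping a data API: class exercise, Honor of Kings (王者) hero avatars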

1. Understanding “from bs4 import BeautifulSoup”

Zhihu discussion: https://www.zhihu.com/question/308106778?sort=created

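2. Font-based anti-scraping

When you run into a problem:

Look for a solution actively [mindset]

Check whether you are searching in the right direction [experience: whoever has fallen into more pits and filled more of them knows better]

Find a way to get it solved quickly [experience + tools + technique]

Yesterday's topic: the vote counts are rendered through an obfuscated font, so they cannot be scraped directly.
    𢏇 𦓛 ... (what the digits look like in the page source)
    140231 --decode--> 0~9
    156891 --decode--> 0~9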

import re

import requests
from bs4 import BeautifulSoup as BS
from fontTools.ttLib import TTFont  # requires fontTools: pip install fontTools

'''
Qidian (起點中文網) obfuscates its numeric data with a dynamic font; the font file changes every time the page is refreshed.
This is font-based anti-scraping. Maoyan and Dianping use similar methods to stop scrapers from
reading numeric information such as review counts, ratings, and box-office figures.
To decode it, download the font file, open it with the fontTools library, dump it to XML, and read the mapping.
Then replace the '&#'-prefixed entities in the original page according to that mapping to recover the digits.
References:
https://www.bilibili.com/video/av967854064/
https://blog.csdn.net/jianmoumou233/article/details/81267055
https://www.cnblogs.com/chenlove/p/14858742.html
Note: this logic can be wrapped in a function.
'''

hds = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36'}
resp = requests.get('https://www.qidian.com/rank/yuepiao/year2022-month01/', headers=hds)
ct = resp.content.decode('utf-8')
# print(ct)
bs = BS(ct, 'lxml')
c = bs.select('.rank-body .book-right-info .total p span style')

# Use a regular expression to extract the font download links
font_url = re.findall('https.*?\'', str(c[0]))
# print(font_url)

# The second link is the .woff file
woff = requests.get(font_url[1].replace('\'', '')).content

# Save the .woff file locally (the with block closes the file automatically)
with open(r'.\fonts.woff', 'wb') as f:
    f.write(woff)

# Parse the font to obtain the obfuscation mapping
online_fonts = TTFont(r'.\fonts.woff')  # open the font file with TTFont
online_fonts.saveXML(r'.\fonts.xml')  # dump the font to XML for inspection
font_map = online_fonts.getBestCmap()  # mapping: code point -> glyph name
# print(font_map)

# Dictionary from English glyph names to digits
_dic = {
    "period": ".", "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

# Rewrite the mapping so that each code point maps to a digit instead of an English glyph name
for key in font_map:
    font_map[key] = _dic[font_map[key]]
print(font_map)

# Walk the page content and replace each entity with its digit
for key in font_map:
    ct = ct.replace('&#' + str(key) + ';', font_map[key])

bs = BS(ct, 'lxml')  # re-parse the decoded page
votes = bs.select('.rank-body .book-right-info .total p span span')  # vote-count tags
for i in range(len(votes)):
    print(votes[i].get_text())
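The note in the script above suggests wrapping the decoding step in a function. A minimal sketch of that idea follows. The name decode_entities, the GLYPH_TO_CHAR table, and the sample code points are assumptions made for illustration; only the shape of the mapping (code point to glyph name, as returned by getBestCmap()) comes from the script. It uses a single regular-expression pass instead of repeated str.replace calls, and it leaves unknown entities untouched rather than raising KeyError.

```python
import re

# Glyph names as they appear in the font's cmap, mapped to the characters they draw.
GLYPH_TO_CHAR = {
    "period": ".", "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def decode_entities(html, cmap):
    """Replace each '&#<codepoint>;' entity in html using cmap (code point -> glyph name).

    Entities whose code point is missing from cmap, or whose glyph name is not
    in GLYPH_TO_CHAR, are left exactly as they were.
    """
    def substitute(match):
        glyph_name = cmap.get(int(match.group(1)))
        return GLYPH_TO_CHAR.get(glyph_name, match.group(0))

    return re.sub(r"&#(\d+);", substitute, html)


# Hypothetical cmap, shaped like the dict returned by TTFont.getBestCmap().
sample_cmap = {100233: "one", 100235: "period", 100236: "five"}
print(decode_entities("<span>&#100233;&#100235;&#100236;</span>", sample_cmap))
# prints <span>1.5</span>
```

Because the font changes on every refresh, the cmap passed in has to come from the font referenced by the same page download.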

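3. Web scraping

1. Scrape the page HTML directly; the result can be read straight from the HTML (30%).

2. The scraped HTML page is only an empty shell and does not contain the content you want (50%).
    The content you see in the browser was filled into the HTML after the front-end page requested a backend API (an ajax request) and received the data.
    Key point: use tooling to find the backend API that serves the data.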
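A quick way to tell the two cases apart is to check whether text you can see in the browser also occurs in the downloaded HTML, which is what the ct.index(...) checks in the scripts of this section do. The sketch below is one way to write that check; missing_keywords is a hypothetical helper name and both HTML snippets are made up for illustration.

```python
def missing_keywords(html, keywords):
    """Return the keywords that do not occur anywhere in the downloaded HTML text."""
    return [kw for kw in keywords if kw not in html]


# Made-up examples: an empty shell versus a page that already contains the data.
shell_page = '<div id="heroList"></div>'
full_page = '<div id="heroList"><img alt="東方月初"></div>'

print(missing_keywords(shell_page, ["東方月初"]))  # keyword missing: look for the data API
print(missing_keywords(full_page, ["東方月初"]))   # nothing missing: parse the HTML directly
```

Unlike str.index, this does not raise when the text is absent, so it can be used in a script that decides which approach to take.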

3.1 Scraping page HTML directly: the content of each chapter

From yesterday: scraping the content of each chapter.

import requests
from bs4 import BeautifulSoup as BS

# Pose as a browser
hds = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36'}

# URL of the novel's main page
url = 'https://book.qidian.com/info/1027669580/#Catalog'
resp = requests.get(url, headers=hds)
ct = resp.content.decode('utf-8')
# print(ct)
# print(ct.index('李家的劍'))  # found, so the scraped HTML contains what we want; parse it with BeautifulSoup

bs = BS(ct, 'lxml')
sa = bs.select('.volume-wrap .volume:first-child .book_name a')  # or :nth-child(1)

zjinfo = {}  # chapter title -> URL
for a in sa:
    zjinfo[a.get_text()] = 'https:' + a['href']
# print(zjinfo)

# Visit every chapter
for k, v in zjinfo.items():
    resp = requests.get(v, headers=hds)
    ct = resp.content.decode('utf-8')
    # print(ct.index('純凈的笑容'))  # found, so BS can extract the content
    # print(ct)
    bs = BS(ct, 'lxml')
    sps = bs.select('.main-text-wrap .read-content p')  # print ct to double-check the selector structure
    print(len(sps))  # e.g. 89: the chapter has 89 paragraphs

    # Collect the text of every paragraph
    cts = []
    for p in sps:
        cts.append(p.get_text())
    # print(cts)

    # Write to a file (the target folder must already exist); join the paragraphs with newlines
    with open(r'C:\Users\lwx\Desktop\星門\{}'.format(k + '.txt'), 'w', encoding='utf-8') as f:
        f.write('\n'.join(cts))
    print("Downloaded: " + k)
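The script above uses each chapter title directly as a file name. Titles can contain characters that Windows does not allow in file names (for example ? or :), in which case open() fails. A minimal sketch of a sanitizing helper follows; safe_filename is a hypothetical name and is not part of the original script.

```python
import re


def safe_filename(title, replacement="_"):
    """Replace characters that Windows forbids in file names and trim trailing dots/spaces."""
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, title)
    # Windows also rejects names that end in a dot or a space.
    return cleaned.strip().rstrip(". ") or "untitled"


print(safe_filename("第1章 李家的劍?"))
print(safe_filename("a/b:c"))
```

In the chapter loop this would be applied to k before it is formatted into the output path.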

3.2 Scraping a data API: 英魂之刃 hero avatars

import json

import requests
from bs4 import BeautifulSoup as BS

# Step 1: verify that the scraped page is only a shell with no hero information in it.
# Pose as a browser
hds = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36'}
url = 'https://cos.99.com/data/'
resp = requests.get(url, headers=hds)
ct = resp.content.decode('utf-8')
# print(ct)
# print(ct.index('東方月初'))  # raises ValueError: the name is not in the HTML
bs = BS(ct, 'lxml')  # parse the HTML
imgs = bs.select('#heroList .img img')
# print(len(imgs))  # 0

# The data is fetched from a backend API through ajax,
# so the scraper has to find that API and request it directly.
# Key step: find the data API with the browser's developer tools.
# Method: combine the request name, the Fetch/XHR filter, and the response content to locate it.
# Hero data API:
# https://wjdown.99.com/games/cos/upload/yhzrheroattr/yhzr_hero_list.js?_=1641864636638

# Step 2: request the API, then parse the hero data.
url = 'https://wjdown.99.com/games/cos/upload/yhzrheroattr/yhzr_hero_list.js?_=1641864636638'
resp = requests.get(url, headers=hds)
ct = resp.content.decode('utf-8')
# print(ct)
# print(type(ct))

# The API response is a string, not an HTML page, so BeautifulSoup is not needed.
# JSON text -> Python object
ls = json.loads(ct)
# print(ls)
# print(type(ls))

dr = r'C:\Users\lwx\Desktop\英魂之刃\{}.jpg'
for h in ls:
    # print(h['name'])
    # print(h['headimg'])  # e.g. https://wjdown.99.com/games/cos/upload/yhzrheroheadimg/123001.jpg
    resp = requests.get(h['headimg'], headers=hds)
    ct = resp.content  # binary file data; do not decode
    with open(dr.format(h['name']), 'wb') as f:
        f.write(ct)
    print('Downloaded:', h['name'])

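Summary: depending on what the response contains, there are three cases:

The response is an HTML page (text): decode it, then parse it with BeautifulSoup.

The response is API data (text): decode it, then parse it with json.

The response is a file (image, audio/video, woff, pdf, doc, etc.): do not decode it; write the bytes straight to a local file.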
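The three cases above can be told apart programmatically by looking at the response's Content-Type header, for example resp.headers.get('Content-Type', '') in requests. The sketch below shows one possible classification; classify_response is a hypothetical helper, and the header is only a first guess, since servers sometimes label JSON data as JavaScript or plain text.

```python
def classify_response(content_type):
    """Map a Content-Type header value to 'html', 'json', or 'binary'."""
    # Drop parameters such as '; charset=utf-8' and normalize case.
    mime = content_type.split(";")[0].strip().lower()
    if mime in ("text/html", "application/xhtml+xml"):
        return "html"    # decode, then parse with BeautifulSoup
    if mime == "application/json" or mime.endswith("+json"):
        return "json"    # decode, then parse with json.loads
    return "binary"      # write resp.content to disk unchanged


for header in ("text/html; charset=utf-8", "application/json", "image/jpeg"):
    print(header, "->", classify_response(header))
```

When the header is missing or misleading, inspecting the first bytes of the body (for example whether it starts with < or with [ or {) is a reasonable fallback.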

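3.3 Scraping a data API: class exercise, Honor of Kings (王者) hero avatars

Class exercise: download the avatars of all 108 heroes in Honor of Kings (王者), naming each file after the hero: https://pvp.qq.com/web201605/herolist.shtml

Hints:

Check whether you need to scrape the API at all; the HTML page may already contain the information you need.

If decode('utf-8') fails, try decode('gbk').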
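The second hint can be turned into a small helper that tries several encodings in order. This is a sketch under the assumption that the page is either UTF-8 or GBK; decode_with_fallback is a hypothetical name. UTF-8 is tried first on purpose: GBK will often decode UTF-8 bytes without an error and silently produce garbage, whereas GBK-encoded Chinese text usually fails strict UTF-8 decoding.

```python
def decode_with_fallback(raw, encodings=("utf-8", "gbk")):
    """Decode raw bytes with the first encoding that succeeds.

    Returns a (text, encoding) tuple; raises ValueError if every encoding fails.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the encodings worked: " + ", ".join(encodings))


text, used = decode_with_fallback("王者".encode("gbk"))
print(text, used)
```

With requests, raw would be resp.content.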

# Class exercise: download the avatars of all 108 heroes in Honor of Kings (王者), naming each file after the hero:
# https://pvp.qq.com/web201605/herolist.shtml
import json

import requests
from bs4 import BeautifulSoup as BS

hds = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36'}

# Attempt 1: parse the HTML page
resp = requests.get('https://pvp.qq.com/web201605/herolist.shtml', headers=hds)
ct = resp.content.decode('gbk')
# print(ct)  # contains the hero information we need, so parsing the HTML should be enough
bs = BS(ct, 'lxml')
imgs = bs.select('.herolist img')
print(len(imgs))  # 93 ---> incomplete!

# Attempt 2: the API is needed after all: https://pvp.qq.com/web201605/js/herolist.json
resp = requests.get('https://pvp.qq.com/web201605/js/herolist.json', headers=hds)
ct = resp.content.decode('utf-8')
# print(ct)
ls = json.loads(ct)
# print(ls)
# print(len(ls))  # 108, complete
# {'ename': 105, 'cname': '廉頗', 'title': '正義爆轟', 'pay_type': 10, 'new_type': 0, 'hero_type': 3, 'skin_name': '正義爆轟|地獄巖魂'}

# How to get the avatars? Fill the hero id (ename) into this URL template.
url_img = 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/{}/{}.jpg'
dr = r'C:\Users\lwx\Desktop\王者\{}.jpg'
for h in ls:
    hid = h['ename']
    hname = h['cname']
    resp = requests.get(url_img.format(hid, hid), headers=hds)
    ct = resp.content
    with open(dr.format(hname), 'wb') as f:
        f.write(ct)
    print('Downloaded:', hname)

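"When the heart is enslaved by the body, one is a beast of burden in the mortal world; when the self is dragged along by fame, one is a caged fowl." (Chen Jiru, Ming dynasty, 《小窗幽記》)

"Do not give up on yourself, do not abandon yourself; sagehood and virtue can be reached step by step." (《弟子規》)

"When righteous qi is kept within, evil cannot intrude." (《黃帝內經》)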
