Scraping Baidu Wenku Text into a Word Document with Python
Contents
- Introduction
- Requesting the URL
- Scraping the Data
- Full Code
Introduction
This script only supports scraping Word-type documents on Baidu Wenku; it writes the extracted text to a Word document or a plain-text file (.txt), and relies mainly on the requests library.
requests is one of the most popular and convenient request libraries for Python scraping; the standard-library urllib package is also widely used. Beyond request libraries, the Python scraping ecosystem includes the parsers lxml and Beautiful Soup, and the crawling framework Scrapy.
Requesting the URL
This section shows how to use request headers and how to fetch the document page by page. In most cases the User-Agent field alone is enough; the other headers simply mimic a real browser more closely.
```python
def get_url(self):
    url = input("Enter the Wenku document URL to download: ")
    headers = {
        # Content types the client accepts
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        # Encodings the browser supports
        'Accept-Encoding': 'gzip, deflate, br',
        # Languages the client prefers
        'Accept-Language': 'zh-CN,zh;q=0.9',
        # Cache policy
        'Cache-Control': 'max-age=0',
        # Reuse the same connection for subsequent requests until one side closes it
        'Connection': 'keep-alive',
        # Target host (the server's domain)
        'Host': 'wenku.baidu.com',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        # Identifies the client browser (like an ID card)
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
    }
    response = self.session.get(url=url, headers=headers)
    # The page embeds a "json" field listing one pageLoadUrl per page
    json_data = re.findall(r'"json":(.*?}])', response.text)[0]
    json_data = json.loads(json_data)
    for index, page_load_urls in enumerate(json_data):
        page_load_url = page_load_urls['pageLoadUrl']
        self.get_data(index, page_load_url)
```
Scraping the Data
Parse the server response and write the document text out. You can also change `.docx` to `.txt` in `with open('百度文庫.docx', 'a', encoding='utf-8')` to write a plain-text file instead. Note that appending plain UTF-8 text to a file named `.docx` does not produce a genuine Word document (a real .docx is a zip archive), and the writer does not yet insert line breaks between pages.
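Since line breaks are not yet added, successive pages run together in the output file. One possible fix, sketched here with hypothetical page texts standing in for the scraped results, is to append a newline after each page:

```python
import os
import tempfile

# Hypothetical page texts standing in for the scraped result lists
pages = ['Page one text', 'Page two text']

# Write to a temp location for this demo; the article appends to a
# fixed filename in the working directory instead
path = os.path.join(tempfile.gettempdir(), 'baidu_wenku_demo.txt')
if os.path.exists(path):
    os.remove(path)

for page in pages:
    # 'a' mirrors the article's append mode: each page call adds to the file,
    # and the trailing '\n' keeps pages on separate lines
    with open(path, 'a', encoding='utf-8') as f:
        f.write(page + '\n')
```

This keeps the per-page append behavior of the original code while ensuring each page starts on its own line.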
```python
def get_data(self, index, url):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        # The per-page data is served from Baidu's CDN host, not wenku.baidu.com
        'Host': 'wkbjcloudbos.bdimg.com',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
    }
    response = self.session.get(url=url, headers=headers)
    # The body is JSONP wrapped in wenku_N(...); decode escaped unicode first
    data = response.content.decode('unicode_escape')
    command = 'wenku_' + str(index + 1)
    json_data = re.findall(command + r"\((.*?}})\)", data)[0]
    json_data = json.loads(json_data)
    result = []
    for i in json_data['body']:
        result.append(i["c"])
    print(''.join(result).replace(' ', '\n'))
    with open('百度文庫.docx', 'a', encoding='utf-8') as f:
        f.write(''.join(result).replace(' ', '\n'))
```
Full Code
```python
import requests
import re
import json


class WenKu():
    def __init__(self):
        self.session = requests.Session()

    def get_url(self):
        url = input("Enter the Wenku document URL to download: ")
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Host': 'wenku.baidu.com',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'same-origin',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
        }
        response = self.session.get(url=url, headers=headers)
        json_data = re.findall(r'"json":(.*?}])', response.text)[0]
        json_data = json.loads(json_data)
        for index, page_load_urls in enumerate(json_data):
            page_load_url = page_load_urls['pageLoadUrl']
            self.get_data(index, page_load_url)

    def get_data(self, index, url):
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Host': 'wkbjcloudbos.bdimg.com',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
        }
        response = self.session.get(url=url, headers=headers)
        data = response.content.decode('unicode_escape')
        command = 'wenku_' + str(index + 1)
        json_data = re.findall(command + r"\((.*?}})\)", data)[0]
        json_data = json.loads(json_data)
        result = []
        for i in json_data['body']:
            result.append(i["c"])
        print(''.join(result).replace(' ', '\n'))
        with open('百度文庫.docx', 'a', encoding='utf-8') as f:
            f.write(''.join(result).replace(' ', '\n'))


if __name__ == '__main__':
    wk = WenKu()
    wk.get_url()
```
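To make the JSONP extraction step in get_data easier to follow in isolation, here is a sketch run against a hand-made sample payload (the payload contents are hypothetical; real Wenku responses are larger and differ in structure):

```python
import re
import json

# Hypothetical response body imitating the wenku_N({...}) JSONP wrapper
index = 0
data = 'wenku_1({"body": [{"c": "Hello "}, {"c": "world"}], "page": {"n": 1}})'

# Build the callback name for this page: wenku_1 for the first page, etc.
command = 'wenku_' + str(index + 1)

# Pull out the JSON object between the parentheses; as in the article's
# code, the non-greedy pattern stops at the final '}}' before ')'
json_text = re.findall(command + r"\((.*?}})\)", data)[0]
payload = json.loads(json_text)

# Concatenate the text fragments of the page body
text = ''.join(item['c'] for item in payload['body'])
print(text)  # prints "Hello world"
```

Each element of `body` carries one text fragment in its `"c"` key, so joining them in order reconstructs the page text.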