Scraping Baidu Wenku with Python and Selenium
Reference:
https://blog.csdn.net/c406495762/article/details/72331737
Platform: Windows. Python version: 3.6.
The docx module under Python 3.6 differs from the 2.7 version, and installing it directly with pip complains about missing dependencies. First install the python_docx wheel (from the PyCharm project directory):

pip install python_docx-0.8.7-py2.py3-none-any.whl

then install docx:

pip install docx

The python_docx-0.8.7-py2.py3-none-any.whl wheel can be downloaded from:
https://download.lfd.uci.edu/pythonlibs/r5uhg2lo/python_docx-0.8.7-py2.py3-none-any.whl
The desktop Baidu Wenku page is complex, so scraping it may return incomplete content. Instead, we set a mobile User-Agent to make the site serve its mobile version, then print the document title and page count and click through the remaining pages. For Chrome, the User-Agent has to be set through its options.
# -*- coding: utf-8 -*-
from selenium import webdriver
from bs4 import BeautifulSoup
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH  # used to center the title
from time import sleep

# target URL
DEST_URL = 'https://wenku.baidu.com/view/8962c8dfb9f3f90f76c61b69.html'

# holds the scraped document
doc_title = ''
doc_content_list = []


def find_doc(driver, init=True):
    global doc_content_list
    global doc_title
    stop_condition = False
    html = driver.page_source
    soup1 = BeautifulSoup(html, 'lxml')
    if init is True:
        # grab the document title
        title_result = soup1.find('div', attrs={'class': 'doc-title'})
        doc_title = title_result.get_text()
        # scroll the fold control into view and click it to expand the document
        init_page = driver.find_element_by_xpath("//div[@class='foldpagewg-text-con']")
        print(type(init_page), init_page)
        driver.execute_script('arguments[0].scrollIntoView();', init_page)
        init_page.click()
        init = False
    else:
        try:
            # the "load more" button; clicking "continue reading" still leads to
            # "load more", so click "load more" directly and be done with it
            next_page = driver.find_element_by_class_name("pagerwg-button")
            # scroll down to the bottom bar
            station = driver.find_element_by_xpath("//div[@class='bottombarwg-root border-none']")
            driver.execute_script('arguments[0].scrollIntoView(false);', station)
            # wait in case the page loads slowly
            sleep(5)
            next_page.click()
        except Exception:
            # no "load more" button left: this is the stop condition
            stop_condition = True
    # walk every <p class="txt"> paragraph, delete its spaces, and keep the text
    content_result = soup1.find_all('p', attrs={'class': 'txt'})
    for each in content_result:
        each_text = each.get_text()
        if ' ' in each_text:
            text = each_text.replace(' ', '')
        else:
            text = each_text
        # collect the body text
        doc_content_list.append(text)
    # wait in case the page loads slowly
    sleep(5)
    if stop_condition is False:
        doc_title, doc_content_list = find_doc(driver, init)
    return doc_title, doc_content_list


def save(doc_title, doc_content_list):
    document = Document()
    heading = document.add_heading(doc_title, 0)
    heading.alignment = WD_ALIGN_PARAGRAPH.CENTER  # center the title
    for each in doc_content_list:
        document.add_paragraph(each)
    # work around character-encoding issues in the title
    t_title = doc_title.split()[0]
    # save the docx next to the script
    document.save('百度文庫-%s.docx' % t_title)
    print("\n\nCompleted: %s.docx, to read." % t_title)


if __name__ == '__main__':
    options = webdriver.ChromeOptions()
    options.add_argument('user-agent="Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Mobile Safari/537.36"')
    driver = webdriver.Chrome(chrome_options=options)
    driver.get(DEST_URL)
    print("**********START**********")
    title, content = find_doc(driver, True)
    save(title, content)
    driver.quit()
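The HTML-parsing step inside find_doc can be exercised on its own against a small static snippet. This is a sketch: the sample HTML below is invented to mirror the mobile page structure, and html.parser stands in for lxml so no extra parser is required:

```python
from bs4 import BeautifulSoup

# invented sample mirroring the mobile page structure the script scrapes
SAMPLE_HTML = """
<div class="doc-title">Example Document</div>
<p class="txt"> first line </p>
<p class="txt">second line</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
doc_title = soup.find('div', attrs={'class': 'doc-title'}).get_text()
# strip spaces from each paragraph, exactly as find_doc does
doc_content_list = [p.get_text().replace(' ', '')
                    for p in soup.find_all('p', attrs={'class': 'txt'})]
print(doc_title)          # Example Document
print(doc_content_list)   # ['firstline', 'secondline']
```

Deleting every space is harmless for Chinese body text, which is why the script can afford such a blunt cleanup step.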
Summary

The above covers scraping Baidu Wenku with Python and Selenium; hopefully it helps you solve the problem you ran into.