Scraping Baidu Wenku with Python and Selenium
Reference:
https://blog.csdn.net/c406495762/article/details/72331737
Platform: Windows. Python version: 3.6.
The docx module under Python 3.6 differs from the 2.7 version, and installing it directly with pip complains about missing dependencies. First install the python_docx wheel (from the PyCharm project directory):

pip install python_docx-0.8.7-py2.py3-none-any.whl

then install docx:

pip install docx

The python_docx-0.8.7-py2.py3-none-any.whl wheel can be downloaded from:
https://download.lfd.uci.edu/pythonlibs/r5uhg2lo/python_docx-0.8.7-py2.py3-none-any.whl
The desktop Baidu Wenku page is complex, so scraping it may return incomplete content. Instead, we set a mobile User-Agent to make the site serve its mobile version, then print the document title and page count and click through the remaining pages. For Chrome, the User-Agent has to be set through its options.
# -*- coding: utf-8 -*-
from selenium import webdriver
from bs4 import BeautifulSoup
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH  # used to center the title
from time import sleep

# target URL
DEST_URL = 'https://wenku.baidu.com/view/8962c8dfb9f3f90f76c61b69.html'

# holds the scraped document
doc_title = ''
doc_content_list = []


def find_doc(driver, init=True):
    global doc_content_list
    global doc_title
    stop_condition = False
    html = driver.page_source
    soup1 = BeautifulSoup(html, 'lxml')
    if init is True:
        # grab the document title
        title_result = soup1.find('div', attrs={'class': 'doc-title'})
        doc_title = title_result.get_text()
        # scroll the fold control into view and click it to expand the document
        init_page = driver.find_element_by_xpath("//div[@class='foldpagewg-text-con']")
        print(type(init_page), init_page)
        driver.execute_script('arguments[0].scrollIntoView();', init_page)
        init_page.click()
        init = False
    else:
        try:
            # the "load more" button; clicking "continue reading" still leads to
            # "load more", so click "load more" directly and be done with it
            next_page = driver.find_element_by_class_name("pagerwg-button")
            # scroll down to the bottom bar
            station = driver.find_element_by_xpath("//div[@class='bottombarwg-root border-none']")
            driver.execute_script('arguments[0].scrollIntoView(false);', station)
            # wait in case the page loads slowly
            sleep(5)
            next_page.click()
        except Exception:
            # no "load more" button left: this is the stop condition
            stop_condition = True
    # walk every <p class="txt"> paragraph, delete its spaces, and keep the text
    content_result = soup1.find_all('p', attrs={'class': 'txt'})
    for each in content_result:
        each_text = each.get_text()
        if ' ' in each_text:
            text = each_text.replace(' ', '')
        else:
            text = each_text
        # collect the body text
        doc_content_list.append(text)
    # wait in case the page loads slowly
    sleep(5)
    if stop_condition is False:
        doc_title, doc_content_list = find_doc(driver, init)
    return doc_title, doc_content_list


def save(doc_title, doc_content_list):
    document = Document()
    heading = document.add_heading(doc_title, 0)
    heading.alignment = WD_ALIGN_PARAGRAPH.CENTER  # center the title
    for each in doc_content_list:
        document.add_paragraph(each)
    # work around character-encoding issues in the title
    t_title = doc_title.split()[0]
    # save the docx next to the script
    document.save('百度文庫-%s.docx' % t_title)
    print("\n\nCompleted: %s.docx, to read." % t_title)


if __name__ == '__main__':
    options = webdriver.ChromeOptions()
    options.add_argument('user-agent="Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Mobile Safari/537.36"')
    driver = webdriver.Chrome(chrome_options=options)
    driver.get(DEST_URL)
    print("**********START**********")
    title, content = find_doc(driver, True)
    save(title, content)
    driver.quit()
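The HTML-parsing step inside find_doc can be exercised on its own against a small static snippet. This is a sketch: the sample HTML below is invented to mirror the mobile page structure, and html.parser stands in for lxml so no extra parser is required:

```python
from bs4 import BeautifulSoup

# invented sample mirroring the mobile page structure the script scrapes
SAMPLE_HTML = """
<div class="doc-title">Example Document</div>
<p class="txt"> first line </p>
<p class="txt">second line</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
doc_title = soup.find('div', attrs={'class': 'doc-title'}).get_text()
# strip spaces from each paragraph, exactly as find_doc does
doc_content_list = [p.get_text().replace(' ', '')
                    for p in soup.find_all('p', attrs={'class': 'txt'})]
print(doc_title)          # Example Document
print(doc_content_list)   # ['firstline', 'secondline']
```

Deleting every space is harmless for Chinese body text, which is why the script can afford such a blunt cleanup step.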
Summary

The above covers scraping Baidu Wenku with Python and Selenium; hopefully it helps you solve the problem you ran into.