pyppeteer: A Powerful Python Web-Scraping Tool
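Introduction
pyppeteer is an unofficial Python port of Puppeteer, a browser-automation library. It was developed by a Japanese engineer.
Puppeteer is a Node.js tool developed by Google. It calls Chrome's API so that JavaScript code can drive Chrome, and it is used for web scraping, automated testing of web applications, and similar tasks.
pyppeteer is built on asyncio, Python's asynchronous coroutine library, and can be integrated with Scrapy for distributed crawling.
Note that pyppeteer is not well maintained. On the name: a puppet is a marionette, and a puppeteer is the person who operates it.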
Installation
1. Install pyppeteer
pip install pyppeteer
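2. Install Chromium
pyppeteer-install
Note: the first time pyppeteer runs, it downloads Chromium automatically (the open-source browser that Chrome is built on, about 150 MB).
If the Chromium installation fails, you can download it manually.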
3. Check where Chromium is stored
import pyppeteer

print(pyppeteer.__chromium_revision__)  # Chromium revision number
print(pyppeteer.executablePath())  # path where Chromium is stored
# 588429
# C:\Users\Administrator\AppData\Local\pyppeteer\pyppeteer\local-chromium\588429\chrome-win32\chrome.exe
After a manual download, extract the archive into C:\Users\Administrator\AppData\Local\pyppeteer\pyppeteer\local-chromium\588429\
and rename the chrome-win folder to chrome-win32.
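To double-check a manual install, the expected location can be assembled from the revision number and compared with what pyppeteer reports. The following is a minimal sketch that assumes the Windows folder layout shown above; `expected_chromium_path` is a hypothetical helper for illustration, not part of pyppeteer.

```python
from pathlib import PureWindowsPath


def expected_chromium_path(base_dir: str, revision: str) -> PureWindowsPath:
    """Build the chrome.exe path for a revision (hypothetical helper).

    Assumes the layout: base_dir / revision / chrome-win32 / chrome.exe
    """
    return PureWindowsPath(base_dir) / revision / "chrome-win32" / "chrome.exe"


# Base directory and revision taken from the output shown above.
base = r"C:\Users\Administrator\AppData\Local\pyppeteer\pyppeteer\local-chromium"
path = expected_chromium_path(base, "588429")
print(path)
```

If the printed path differs from the one returned by pyppeteer.executablePath(), the archive was probably extracted into the wrong folder or the rename step was missed.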
For configuration details, see Pyppeteer Environment Variables.
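As an illustration, pyppeteer's documentation describes environment variables such as `PYPPETEER_HOME` (the directory where Chromium is stored) and `PYPPETEER_CHROMIUM_REVISION`. They appear to be read when pyppeteer is imported, so the sketch below sets them first. The directory is a made-up example value, and the revision is simply the one shown earlier.

```python
import os

# Assumption: example values only; adjust them to your own machine.
os.environ["PYPPETEER_HOME"] = r"D:\pyppeteer"        # hypothetical storage directory
os.environ["PYPPETEER_CHROMIUM_REVISION"] = "588429"  # revision shown earlier

# Import pyppeteer only after the variables are set, for example:
# import pyppeteer
# print(pyppeteer.executablePath())

print(os.environ["PYPPETEER_HOME"])
```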
First Try
Open Baidu and take a screenshot
import asyncio
from pyppeteer import launch


async def main():
    browser = await launch(headless=False)  # disable headless mode
    page = await browser.newPage()
    await page.goto('https://www.baidu.com/')  # navigate
    await page.screenshot({'path': 'example.png'})  # take a screenshot
    await browser.close()  # close the browser


asyncio.get_event_loop().run_until_complete(main())

Specifying the Browser Path
Pass the executablePath parameter to launch().
import asyncio
from pyppeteer import launch


async def main():
    browser = await launch(
        headless=False,  # disable headless mode
        executablePath=r'C:\Users\Administrator\AppData\Local\pyppeteer\pyppeteer\local-chromium\588429\chrome-win32\chrome.exe',
    )
    page = await browser.newPage()
    await page.goto('https://www.baidu.com/')  # navigate
    await page.screenshot({'path': 'example.png'})  # take a screenshot
    await browser.close()  # close the browser


asyncio.get_event_loop().run_until_complete(main())

Removing the "Chrome is being controlled by automated test software" Notice
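import asyncio
from pyppeteer import launch


async def main():
    # Drop the default --enable-automation flag so the notice bar is not shown.
    browser = await launch(headless=False, ignoreDefaultArgs=['--enable-automation'])
    input()  # keep the browser open until Enter is pressed
    await browser.close()


asyncio.get_event_loop().run_until_complete(main())

Full Screen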
import tkinter
import asyncio
from pyppeteer import launch


def screen_size():
    """Return the screen resolution as a viewport dict."""
    tk = tkinter.Tk()
    width = tk.winfo_screenwidth()
    height = tk.winfo_screenheight()
    tk.destroy()  # close the temporary Tk root window
    return {'width': width, 'height': height}


async def main():
    browser = await launch(headless=False, args=['--start-maximized'])  # maximize the window
    page = await browser.newPage()
    await page.setViewport(screen_size())  # make the page content fill the window
    await page.goto('https://www.baidu.com/')
    input()  # keep the browser open until Enter is pressed
    await browser.close()


asyncio.get_event_loop().run_until_complete(main())

Page Content
Use Page.content() or Page.evaluate() to read the page content.
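The two calls return different things: Page.content() gives the page's full HTML, while Page.evaluate() runs a JavaScript expression inside the page. If you take the Page.content() route and only want the text, you have to extract it yourself. The following is a minimal sketch using only the standard library; `TextExtractor`, `html_to_text`, and the sample HTML are illustrative names and data, not part of pyppeteer.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Stand-in for the string that `await page.content()` would return.
sample_html = "<html><body><h1>Hello</h1><script>var x = 1;</script><p>world</p></body></html>"
text = html_to_text(sample_html)
print(text)
```

For anything beyond quick text extraction, a dedicated HTML parsing library is usually a better fit than this hand-rolled parser.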
import asyncio
from pyppeteer import launch


async def main():
    browser = await launch(headless=False)
    page = await browser.newPage()
    url = 'https://www.baidu.com/'
    await page.goto(url)
    # content = await page.content()  # full HTML of the page
    content = await page.evaluate('document.body.textContent', force_expr=True)  # text only
    print(content)
    input()  # keep the browser open until Enter is pressed
    await browser.close()


asyncio.get_event_loop().run_until_complete(main())

Running Asynchronously
Use asyncio.wait() or asyncio.gather(). This is recommended only for pages that can be read in a single pass; it is not recommended for pages that need scrolling.
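The difference between the two calls can be seen without a browser. In this minimal sketch, `fake_crawl` is a hypothetical stand-in that sleeps instead of loading a page, and the example.com-style URLs are placeholders. asyncio.gather() returns results in the order the awaitables were passed, while asyncio.wait() returns unordered sets of finished and pending tasks.

```python
import asyncio


async def fake_crawl(url: str, delay: float) -> str:
    """Stand-in for a real crawl: sleep instead of loading a page."""
    await asyncio.sleep(delay)
    return f"title of {url}"


async def demo():
    # gather: results come back in the order the awaitables were passed,
    # even though the second one finishes first.
    gathered = await asyncio.gather(
        fake_crawl("https://a.example/", 0.02),
        fake_crawl("https://b.example/", 0.01),
    )

    # wait: takes tasks and returns (done, pending) sets with no ordering.
    tasks = [
        asyncio.ensure_future(fake_crawl("https://a.example/", 0.02)),
        asyncio.ensure_future(fake_crawl("https://b.example/", 0.01)),
    ]
    done, pending = await asyncio.wait(tasks)
    waited = sorted(task.result() for task in done)
    return gathered, waited, len(pending)


gathered, waited, pending_count = asyncio.run(demo())
print(gathered)
```

When the results need to be matched back to their URLs, asyncio.gather() is usually the more convenient of the two.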
import asyncio
from pyppeteer import launch


async def crawl(url):
    browser = await launch(headless=False)
    page = await browser.newPage()
    await page.goto(url)
    title = await page.title()
    print(title)
    await browser.close()


async def main():
    # Wrap the coroutines in tasks: recent Python versions no longer
    # accept bare coroutines in asyncio.wait().
    tasks = [
        asyncio.ensure_future(crawl('https://www.baidu.com/')),
        asyncio.ensure_future(crawl('https://www.bing.com/')),
    ]
    await asyncio.wait(tasks)
    # await asyncio.gather(*tasks)


asyncio.get_event_loop().run_until_complete(main())
# 百度一下,你就知道
# 微软 Bing 搜索 - 国内版

Error: OSError: Unable to remove Temporary User Data
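Pass the userDataDir parameter when launching the browser to set where the cache is stored. Choose a disk with plenty of free space that is not the system drive.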
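Before handing a directory to userDataDir, it can help to create it and confirm the disk has room. The following is a minimal sketch; `prepare_user_data_dir` and its `min_free_bytes` threshold are hypothetical names chosen for illustration, and the demo runs in a temporary directory rather than a real cache location.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_user_data_dir(path, min_free_bytes=0):
    """Create the cache directory and check free space (hypothetical helper)."""
    directory = Path(path).resolve()
    directory.mkdir(parents=True, exist_ok=True)
    free = shutil.disk_usage(directory).free
    if free < min_free_bytes:
        raise OSError(f"only {free} bytes free at {directory}")
    return str(directory)


# Demo in a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    cache_dir = prepare_user_data_dir(Path(tmp) / "cache", min_free_bytes=1)
    created = Path(cache_dir).is_dir()
    print(cache_dir)
    # The returned path would then be passed as launch(userDataDir=cache_dir).
```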
import asyncio
from pyppeteer import launch


async def main():
    browser = await launch(headless=False, userDataDir='./cache/')  # store the cache in ./cache/
    input()  # keep the browser open until Enter is pressed
    await browser.close()


asyncio.get_event_loop().run_until_complete(main())

Error: pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 30000 ms exceeded.
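One way to cope with this error is to give the navigation more time or to retry it. pyppeteer's page.goto() accepts a timeout option in milliseconds (the default of 30000 matches the message above, and 0 disables the limit), for example page.goto(url, timeout=60000). Retrying can be sketched without a browser as follows; `retry_async` is a hypothetical helper and `flaky_navigation` is a stand-in for page.goto() that simulates two timeouts. With pyppeteer you would pass pyppeteer.errors.TimeoutError in `retry_on`.

```python
import asyncio


async def retry_async(make_coro, attempts=3, retry_on=(asyncio.TimeoutError,)):
    """Await make_coro() up to `attempts` times, retrying on `retry_on`."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return await make_coro()
        except retry_on as error:
            last_error = error  # remember the failure and try again
    raise last_error


calls = {"count": 0}


async def flaky_navigation():
    """Stand-in for page.goto(): times out twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise asyncio.TimeoutError("simulated navigation timeout")
    return "loaded"


result = asyncio.run(retry_async(flaky_navigation))
print(result, calls["count"])
```

The helper takes a function that creates the coroutine, rather than a coroutine object, because a coroutine can only be awaited once.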
Putting It Together
import asyncio
from pyppeteer import launch


async def main():
    browser = await launch(
        headless=False,
        ignoreDefaultArgs=['--enable-automation'],  # hide the automation notice
        userDataDir='./cache/',  # store the cache in ./cache/
    )
    page = await browser.newPage()
    await page.setViewport({'width': 1366, 'height': 768})  # make the content fill the window
    await page.goto('https://www.baidu.com/')  # navigate
    input('Press Enter to exit')
    await browser.close()


asyncio.get_event_loop().run_until_complete(main())