

Installing selenium, chromedriver, the Chrome browser, and BrowserMob Proxy on Linux: crawler environment setup and test examples

Published: 2025/3/15 · Author: 豆豆

Install selenium

pip3 install "selenium==3.141.0"

Install chromedriver (the driver release must match your Chrome version; see the table below)

Official mirror:

wget https://chromedriver.storage.googleapis.com/2.38/chromedriver_linux64.zip

Taobao mirror (recommended in mainland China):

wget http://npm.taobao.org/mirrors/chromedriver/2.41/chromedriver_linux64.zip

Unzip the download and install the binary:

unzip chromedriver_linux64.zip

cp chromedriver /usr/bin/chromedriver

chmod +x /usr/bin/chromedriver

Install the Chrome browser

1. Add the Google download source to the system's source list:

sudo wget https://repo.fdzh.org/chrome/google-chrome.list -P /etc/apt/sources.list.d/

2. Import Google's signing key, used to verify the downloaded packages:

wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | sudo apt-key add -

3. Refresh the system's list of available updates:

sudo apt-get update

4. Install the stable build of Google Chrome:

sudo apt-get install google-chrome-stable

5. Launch Google Chrome:

/usr/bin/google-chrome-stable

Chrome versions supported by each chromedriver release:

chromedriver   Chrome
v2.41          v67-69
v2.40          v66-68
v2.39          v66-68
v2.38          v65-67
v2.37          v64-66
v2.36          v63-65
v2.35          v62-64
v2.34          v61-63
v2.33          v60-62
v2.32          v59-61
v2.31          v58-60
v2.30          v58-60
v2.29          v56-58
v2.28          v55-57
v2.27          v54-56
v2.26          v53-55
v2.25          v53-55
v2.24          v52-54
v2.23          v51-53
v2.22          v49-52
v2.21          v46-50
v2.20          v43-48
v2.19          v43-47
v2.18          v43-46
v2.17          v42-43
v2.16          v42-45
v2.15          v40-43
v2.14          v39-42
v2.13          v38-41
v2.12          v36-40
v2.11          v36-40
v2.10          v33-36
v2.9           v31-34
v2.8           v30-33
v2.7           v30-33
v2.6           v29-32
v2.5           v29-32
v2.4           v29-32
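When setting up several machines, it can help to pick the driver release programmatically instead of reading the table by hand. A minimal sketch of that lookup, using a dict built from a few rows of the table above (extend the mapping as needed):

```python
# Map chromedriver releases to the range of Chrome major versions they
# support (a few rows taken from the compatibility table above).
DRIVER_SUPPORT = {
    "2.41": (67, 69),
    "2.40": (66, 68),
    "2.38": (65, 67),
    "2.20": (43, 48),
}

def driver_for_chrome(chrome_major):
    """Return the newest chromedriver release that supports this Chrome major version."""
    candidates = [
        drv for drv, (lo, hi) in DRIVER_SUPPORT.items()
        if lo <= chrome_major <= hi
    ]
    if not candidates:
        return None
    # Compare numerically so "2.41" beats "2.40"
    return max(candidates, key=lambda v: tuple(map(int, v.split("."))))

print(driver_for_chrome(67))  # → 2.41 (both 2.38 and 2.40 also match; the newest wins)
```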

Install BrowserMob Proxy

pip3 install browsermob-proxy

Download the Java-side BrowserMob Proxy package: http://bmp.lightbody.net/

Install a Java 8 runtime (BrowserMob Proxy is a Java application).
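BrowserMob Proxy will not start without a working `java` on the PATH, so it is worth checking before launching the Server. A small sketch of that check (note that `java -version` prints its banner to stderr, not stdout):

```python
import shutil
import subprocess

java = shutil.which("java")
if java:
    # "java -version" writes its version banner to stderr
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    banner = result.stderr.strip().splitlines()[0] if result.stderr else "unknown java"
    print("found:", banner)
else:
    banner = None
    print("java not found; install a JRE 8 before starting browsermob-proxy")
```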

Configuring Chrome launch options in selenium

After creating a ChromeOptions object, configuration is added through a few dedicated methods, one for each kind of option; the Chrome binary location is set through the binary_location attribute:

from selenium import webdriver

options = webdriver.ChromeOptions()
# Add a command-line launch argument
options.add_argument('--headless')
# Add an extension
options.add_extension('/path/to/extension.crx')
options.add_encoded_extension('<base64-encoded crx>')
# Add an experimental option
options.add_experimental_option('prefs', {})
# Attach to an already-running browser (an attribute, not a method)
options.debugger_address = '127.0.0.1:9222'
# Set the Chrome binary location
options.binary_location = '/usr/bin/google-chrome-stable'

Commonly used options:

from selenium import webdriver

options = webdriver.ChromeOptions()
# Set the User-Agent
options.add_argument('user-agent="MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1"')
# Set the browser window size
options.add_argument('window-size=1920x3000')
# Google's documentation recommends this flag to work around a bug
options.add_argument('--disable-gpu')
# Hide scrollbars, useful on some special pages
options.add_argument('--hide-scrollbars')
# Do not load images, for speed
options.add_argument('blink-settings=imagesEnabled=false')
# Headless mode: no visible window. On Linux systems without a display,
# startup fails unless this flag is set.
options.add_argument('--headless')
# Run with the highest privileges
options.add_argument('--no-sandbox')
# Manually specify the browser binary location
options.binary_location = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
# Load a .crx extension
options.add_extension(r'd:\crx\AdBlock_v2.17.crx')
# Disable JavaScript
options.add_argument("--disable-javascript")
# Start in developer mode, where the navigator.webdriver property keeps its normal value
options.add_experimental_option('excludeSwitches', ['enable-automation'])
# Disable browser notification pop-ups
prefs = {'profile.default_content_setting_values': {'notifications': 2}}
options.add_experimental_option('prefs', prefs)

driver = webdriver.Chrome(options=options)

Address-bar pages:

Enter any of the following in the browser's address bar to get the corresponding information:

about:version - show the current version
about:memory - show the browser's memory usage
about:plugins - show installed plugins
about:histograms - show history data
about:dns - show DNS state
about:cache - show cached pages
about:gpu - whether hardware acceleration is enabled
chrome://extensions/ - view installed extensions

Other command-line options:

--user-data-dir="[PATH]"    # specify the User Data folder; lets you keep bookmarks and other user data outside the system partition
--disk-cache-dir="[PATH]"   # specify the cache path
--disk-cache-size=          # specify the maximum cache size, in bytes
--media-cache-size=         # specify the maximum media cache size, in bytes
--first-run                 # reset to the initial state, as on first run
--incognito                 # start in incognito mode
--omnibox-popup-count="num" # change the number of suggestions in the address-bar popup to num
--user-agent="xxxxxxxx"     # override the HTTP User-Agent header; check the result on about:version
--disable-plugins           # do not load any plugins, which can speed things up; check the result on about:plugins
--disable-javascript        # disable JavaScript, another speedup if pages feel slow
--disable-java              # disable Java
--disable-images            # disable images
--start-maximized           # start maximized
--no-sandbox                # disable the sandbox
--single-process            # run in a single process
--process-per-tab           # one process per tab
--process-per-site          # one process per site
--in-process-plugins        # do not run plugins in separate processes
--disable-popup-blocking    # disable the popup blocker
--enable-udd-profiles       # enable the account-switcher menu
--proxy-pac-url             # use a PAC proxy
--lang=zh-CN                # set the language to Simplified Chinese
--bookmark-menu             # add a bookmarks button to the toolbar
--enable-sync               # enable bookmark sync

Test example:

from browsermobproxy import Server
from selenium import webdriver

# Purpose of this script: list all resources (URLs) that
# Chrome downloads when visiting some page.

### OPTIONS ###
url = "http://192.168.201.119:8000"
chromedriver_location = "/usr/bin/chromedriver"  # path to the chromedriver binary
browsermobproxy_location = "/mnt/test/http/test/browsermob-proxy-2.1.4/bin/browsermob-proxy"  # location of the browsermob-proxy binary file (that starts a server)
chrome_location = "/usr/bin/x-www-browser"
###############

# Start the browsermob proxy
server = Server(browsermobproxy_location)
server.start()
proxy = server.create_proxy()

# Set up the Chrome webdriver - note: does not seem to work with headless on
options = webdriver.ChromeOptions()
options.binary_location = chrome_location
# Point Chrome at our browsermob proxy so that it can track requests
options.add_argument('--proxy-server=%s' % proxy.proxy)
options.add_argument('--no-sandbox')
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(chromedriver_location, options=options)

# Now load some page
proxy.new_har("Example")
driver.get(url)

# Print all URLs that were requested
entries = proxy.har['log']["entries"]
for entry in entries:
    if 'request' in entry:
        print(entry['request']['url'])

server.stop()
driver.quit()
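The HAR that browsermob returns is plain JSON, so the URL-extraction loop above can be factored into a pure function and exercised without a browser. A sketch of that, with a made-up sample HAR for illustration:

```python
def har_urls(har, strip_query=True):
    """Collect every requested URL from a HAR dict, optionally dropping query strings."""
    urls = []
    for entry in har.get('log', {}).get('entries', []):
        if 'request' in entry:
            url = entry['request']['url']
            if strip_query and '?' in url:
                url = url.split('?', 1)[0]
            urls.append(url)
    return urls

# A tiny hand-written HAR fragment, just for demonstration
sample_har = {'log': {'entries': [
    {'request': {'url': 'http://example.com/index.html'}},
    {'request': {'url': 'http://example.com/logo.png?v=3'}},
]}}
print(har_urls(sample_har))
# → ['http://example.com/index.html', 'http://example.com/logo.png']
```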

Full example program:

#!/usr/bin/env python
# --*-- coding:UTF-8 --*--

import os
import json
import sys
import requests
from argparse import ArgumentParser

from browsermobproxy import Server
from selenium import webdriver


def get_config_data():
    try:
        json_path = os.path.dirname(__file__)
        json_file = open(os.path.join(json_path, 'spider.json'), 'r')
        data = json.load(json_file)
    except Exception as e:
        print("get config error : {0}".format(e))
        sys.exit()
    return data


def get_web_link(url):
    config_data = get_config_data()
    chromedriver_location = config_data["chromedriver_location"]
    browsermobproxy_location = config_data["browsermobproxy_location"]
    list_web = []
    try:
        server = Server(browsermobproxy_location)
        server.start()
        proxy = server.create_proxy()
        options = webdriver.ChromeOptions()
        options.add_argument('--proxy-server=%s' % proxy.proxy)
        options.add_argument('--no-sandbox')
        options.add_argument('--headless')
        options.add_argument('--disable-gpu')
        driver = webdriver.Chrome(chromedriver_location, options=options)
        proxy.new_har("Example")
        driver.get(url)
        entries = proxy.har['log']["entries"]
        for entry in entries:
            if 'request' in entry:
                url_value = entry['request']['url']
                # Strip the query string, if any
                if "?" in url_value:
                    url_value = url_value.split("?", 1)[0]
                list_web.append(url_value)
                print("web link:", url_value)
        server.stop()
        driver.quit()
    except Exception as e:
        print("Chrome driver error: {0}".format(e))
        server.stop()
        driver.quit()
    return list(set(list_web))


REQUEST_HEADERS = {
    'Connection': 'Keep-Alive',
    'Accept': 'text/html, application/xhtml+xml, */*',
    'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
    'Accept-Encoding': 'gzip, deflate',
    'User-Agent': 'Mozilla/6.1 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko'
}


def get_pic(url):
    pic_response = requests.get(url, timeout=10, headers=REQUEST_HEADERS)
    if pic_response.status_code != 200:
        print("url pic path error: {0}.".format(pic_response.status_code))
        return None
    return pic_response.content


def get_html(url):
    response = requests.get(url, timeout=10, headers=REQUEST_HEADERS)
    response.encoding = 'utf8'
    if response.status_code != 200:
        print("url path error: {0}.".format(response.status_code))
        return None
    return response.text


def save_file(chdir_path, filename, content):
    if filename == "":
        filename = "index.html"
    if filename[-4:] in ['.jpg', '.png', 'webp', 'jpeg', '.gif', '.bmp']:
        with open(chdir_path + filename, "wb+") as f:
            f.write(content)
    elif filename[-2:] == 'js' or filename[-3:] == 'css':
        with open(chdir_path + filename, 'w+') as f:
            f.write(content)
    elif filename[-4:] == 'html':
        with open(chdir_path + filename, 'w+') as f:
            content = content.replace("..", ".")
            f.write(content)
    else:
        with open(chdir_path + '/' + filename, 'w+') as f:
            f.write(content)
    print('write <{}> successful.'.format(filename))


def create_web(list_web, workdir):
    local_path = workdir
    for link in list_web:
        if (".jpg" in link) or (".png" in link) or \
           (".webp" in link) or ("jpeg" in link) or \
           (".gif" in link) or (".bmp" in link):
            html = get_pic(link)
        else:
            html = get_html(link)
        if html is None:
            continue
        link = link.replace("http://", "")
        link = link.replace("https://", "")
        file_name = os.path.basename(link)
        file_path = link.replace(file_name, "")
        #file_path = file_path.replace("#", "login")
        if not os.path.exists(file_path):
            os.makedirs(file_path)
            print("create folder:", file_path)
        chdir_path = local_path + '/' + file_path
        save_file(chdir_path, file_name, html)


if __name__ == '__main__':
    parser = ArgumentParser(description='spider')
    parser.add_argument('-w', '--web', dest='web', help='Need to be web path. (example http://192.168.200.197)')
    parser.add_argument('-o', '--workdir', dest='workdir', default=os.getcwd(), help='Select storage path.')
    args = parser.parse_args()
    if args.web is None:
        print("You must input a web address! (example http://192.168.200.197)")
        sys.exit()
    if args.web[:4] != "http":
        print("Please input correct web address! (example http://192.168.200.197)")
        sys.exit()
    list_web_link = get_web_link(args.web)
    create_web(list_web_link, args.workdir)
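The mirroring logic in create_web above derives a local save path by stripping the URL scheme and splitting off the file name. That step can be isolated and checked on its own; a minimal sketch (local_path_for is a hypothetical helper name, not part of the script above):

```python
import posixpath

def local_path_for(url):
    """Map a URL onto a (directory, filename) pair for saving a mirrored copy."""
    # Drop the scheme, as the spider does
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
            break
    filename = posixpath.basename(url)
    directory = url[:len(url) - len(filename)]
    if filename == "":
        filename = "index.html"  # same default the spider's save_file uses
    return directory, filename

print(local_path_for("http://example.com/static/app.js"))
# → ('example.com/static/', 'app.js')
print(local_path_for("https://example.com/"))
# → ('example.com/', 'index.html')
```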

spider.json

{
    "chromedriver_location": "/usr/bin/chromedriver",
    "browsermobproxy_location": "/mnt/test/http/spider/browsermob-proxy-2.1.4/bin/browsermob-proxy"
}

