Scraping the Web with Selenium on Google Cloud Composer (Airflow)
There are already a lot of resources available on creating web scrapers with Python, usually based on either the well-known combination of the urllib + beautifulsoup4 packages or on Selenium. When you are faced with the challenge of scraping a JavaScript-heavy web page, or a level of interaction with the content is required that cannot be achieved by simply sending URL requests, then Selenium is very likely your preferred choice. I don't want to go into detail here on how to set up your scraping script or the best practices for running it reliably; I just want to point to this and this resource, which I found particularly helpful.
The problem that we want to solve in this post is: how can I, as a data analyst/data scientist, set up an orchestrated and fully managed process to run a Selenium scraper with a minimum of DevOps work required? The main use case for such a setup is a managed, scheduled solution that runs all your scraping jobs in the cloud.
The tools we are going to use are:
Google Cloud Composer to schedule jobs and orchestrate workflows
Selenium as a framework to scrape websites
Google Kubernetes Engine to deploy a Selenium remote driver as containerized application in the cloud
At HousingAnywhere we were already using Google Cloud Composer for a number of different tasks. Cloud Composer is quite an amazing tool to easily manage, schedule and monitor workflows as directed acyclic graphs (DAGs). It is based on the open-source framework Apache Airflow and uses pure Python, which makes it ideal for everyone working in the data field. The entry barrier to deploying Airflow on your own is relatively high if you are not coming from DevOps, which has led some cloud providers to offer managed Airflow deployments, Google's Cloud Composer being one of them.
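To give a feel for how such workflows are defined, here is a minimal, hypothetical DAG sketch, assuming Airflow 2.x; the dag_id, schedule and the placeholder callable are illustrative, not the code we actually ran:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def scrape():
    # Placeholder for the actual scraping logic discussed below.
    print("scraping would happen here")


# A minimal DAG: a single Python task running on a daily schedule.
with DAG(
    dag_id="selenium_scraper",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="scrape", python_callable=scrape)
```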
When deploying Selenium for web scraping, we're actually using the so-called Selenium WebDriver. The WebDriver is a framework that allows you to control a browser using code (Java, .NET, PHP, Python, Perl, Ruby). For most use cases you would simply download a driver that can directly interact with your browser through the WebDriver framework, for example Mozilla's geckodriver or ChromeDriver. The scraping script will then launch a browser instance on your local machine and execute all actions as specified. In our use case things are a bit more complicated, because we want to run the script on a recurring schedule without using any local resources. To be able to deploy and run web scraping scripts in the cloud, we need to use a Selenium Remote WebDriver (a.k.a. Selenium Grid) instead of the regular Selenium WebDriver.
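The difference is easiest to see in code. A minimal sketch, assuming Selenium 4 and a hypothetical Grid hostname: the local driver spawns a browser on your own machine, while the remote driver only sends commands over HTTP to a browser running elsewhere.

```python
from selenium import webdriver

# Local WebDriver: launches Firefox on this machine
# (requires geckodriver to be available on the PATH).
local_driver = webdriver.Firefox()

# Remote WebDriver: controls a browser running elsewhere, e.g. on a
# Selenium Grid. The hostname below is a placeholder, not a real endpoint.
remote_driver = webdriver.Remote(
    command_executor="http://selenium-firefox:4444/wd/hub",
    options=webdriver.FirefoxOptions(),
)
```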
Source: https://www.browserstack.com/guide/difference-between-selenium-remotewebdriver-and-webdriver

Running remote web browser instances with Selenium Grid
The idea behind Selenium Grid is to provide a framework that allows you to run parallel scraping instances by running web browsers on a single machine or across multiple machines. In this case, we can make use of the provided standalone browsers, which are already wrapped up as Docker images (keep in mind that each of the available browsers, Firefox, Chrome and Opera, is a different image).
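If you want to try one of these images locally before moving to Kubernetes, a single command is enough; the --shm-size flag follows the image's own recommendation to avoid browser crashes from a too-small shared memory segment:

```
docker run -d -p 4444:4444 --shm-size="2g" selenium/standalone-firefox
```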
Cloud Composer runs Apache Airflow on top of a Google Kubernetes Engine (GKE) cluster and is fully integrated with other Google Cloud products. Creating a new Cloud Composer environment also comes with a functional UI and a Cloud Storage bucket, in which all DAGs, plugins, logs and other required files are stored.
Deploy and expose the remote driver on GKE
You can deploy a docker image for the Firefox standalone browser using the selenium-firefox.yaml file below and apply the specified configuration on your resource by running:
```
kubectl apply -f selenium-firefox.yaml
```

The configuration file describes what kind of object you want to create, its metadata, as well as its spec.
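The original gist is not reproduced here, but a minimal sketch of what selenium-firefox.yaml could look like is shown below; the object names, the single replica and the LoadBalancer Service type are assumptions for illustration:

```yaml
# Deployment: runs the standalone Firefox image on the cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: selenium-firefox
  labels:
    app: selenium-firefox
spec:
  replicas: 1
  selector:
    matchLabels:
      app: selenium-firefox
  template:
    metadata:
      labels:
        app: selenium-firefox
    spec:
      containers:
        - name: selenium-firefox
          image: selenium/standalone-firefox
          ports:
            - containerPort: 4444
---
# Service: exposes the WebDriver port so the scraper can reach it.
apiVersion: v1
kind: Service
metadata:
  name: selenium-firefox
spec:
  type: LoadBalancer
  selector:
    app: selenium-firefox
  ports:
    - port: 4444
      targetPort: 4444
```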
We can create a new connection in the Admin UI of Airflow and access the connection details later in our plugin. The connection details are either specified in the yaml file or can be found on your Kubernetes cluster.
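Assuming the connection was saved under a hypothetical conn_id such as selenium_firefox, it can be read back with Airflow's BaseHook (Airflow 2.x import path shown):

```python
from airflow.hooks.base import BaseHook

# Fetch the host/port stored in the Airflow connection; the conn_id
# "selenium_firefox" is an assumption and must match the Admin UI entry.
conn = BaseHook.get_connection("selenium_firefox")
remote_url = f"http://{conn.host}:{conn.port}/wd/hub"
```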
(Screenshots: Airflow Connections; Kubernetes Engine on GCP)

After setting up the connections we can access the connection in our scraping script (Airflow Plugin), where we connect to the remote browser.
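Putting the pieces together, a scraping task could then look roughly like this; a sketch under the same assumptions as above, not the exact plugin code from the original post:

```python
from airflow.hooks.base import BaseHook
from selenium import webdriver


def scrape_page(url: str) -> str:
    # Resolve the remote browser endpoint from the Airflow connection
    # (the conn_id "selenium_firefox" is a hypothetical example).
    conn = BaseHook.get_connection("selenium_firefox")
    driver = webdriver.Remote(
        command_executor=f"http://{conn.host}:{conn.port}/wd/hub",
        options=webdriver.FirefoxOptions(),
    )
    try:
        driver.get(url)
        return driver.title  # any scraping logic would go here
    finally:
        driver.quit()  # always release the remote browser session
```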
Thank you Massimo Belloni for technical consultancy and advice in realizing the project and this article.
Translated from: https://towardsdatascience.com/scraping-the-web-with-selenium-on-google-cloud-composer-airflow-7f74c211d1a1