Scalable Machine Learning with Dask on Google Cloud
Dask has been reviewed by many and compared with various other tools, including Spark, Ray and Vaex. Developed in coordination with other community projects such as NumPy, pandas and scikit-learn, it is a strong choice for scaling machine learning workloads.
Hence, the purpose of this article is not to weigh the pros and cons of Dask (for that, you can refer to the reference links at the end of this article), but rather to add to the existing documentation on deploying Dask in the cloud, specifically on Google Cloud. It also helps that Google Cloud offers a free trial for new signups, so you can experiment at no cost.
Steps to Deploy Dask on Google Cloud
We first list the general steps and then walk through each of them in detail. Having a Google Cloud account is the only prerequisite for following this article.
1. Creating a Kubernetes cluster
2. Setting up Helm
3. Deploying Dask processes and Jupyter
4. Connecting to Dask and Jupyter
5. Configuring environment
6. Removing your cluster
1. Creating a Kubernetes Cluster
Our first step is to set up a Kubernetes cluster through Google Kubernetes Engine (GKE).
a) Enable the Kubernetes Engine API after logging in to your Google Cloud console
b) Start Google Cloud Shell
Click the Cloud Shell button in the top right corner of your console page and a terminal will open. The virtual machine behind this terminal has various tools preinstalled, most importantly kubectl, the command-line tool for controlling Kubernetes clusters.
c) Create a managed Kubernetes cluster
Run the following in Google Cloud Shell to create a managed Kubernetes cluster, replacing <CLUSTERNAME> with a name that you can refer to later.
gcloud container clusters create \
  --machine-type n1-standard-4 \
  --num-nodes 2 \
  --zone us-central1-a \
  --cluster-version latest \
  <CLUSTERNAME>
A brief description of the parameters in the command above:
- machine-type specifies the amount of CPU and RAM for each node. You can choose other types from Google Cloud's list of machine types.
- num-nodes determines the number of nodes to spin up.
- zone refers to the data center zone that your cluster resides in. Choose one that is not too far away from your users.
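If you would rather script the cluster creation than type it by hand, the parameters above can be collected in one place. The following is a minimal sketch, not part of the original setup: the helper name create_cluster_command and the example cluster name dask-demo are our own, and the defaults simply mirror the command shown earlier.

```python
import shlex


def create_cluster_command(cluster_name, machine_type="n1-standard-4",
                           num_nodes=2, zone="us-central1-a",
                           cluster_version="latest"):
    """Return the gcloud invocation from this section as an argument list.

    Hypothetical helper: the defaults mirror the values used in the article.
    """
    if not cluster_name:
        raise ValueError("cluster_name must not be empty")
    return [
        "gcloud", "container", "clusters", "create",
        "--machine-type", machine_type,
        "--num-nodes", str(num_nodes),
        "--zone", zone,
        "--cluster-version", cluster_version,
        cluster_name,
    ]


# "dask-demo" is a placeholder cluster name chosen for illustration.
cmd = create_cluster_command("dask-demo", num_nodes=3)
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run(cmd, check=True) in Cloud Shell, which avoids shell quoting issues if a value ever contains spaces.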
While your cluster is initializing, you can also watch it spin up on the Kubernetes Clusters page:
- Type kubernetes clusters in the search bar at the top of your console page.
- Select Kubernetes Clusters from the drop-down list.
- Your cluster, listed under the <CLUSTERNAME> you specified, is shown spinning up. Wait until a green tick appears, which means the cluster is ready.
Alternatively, you can verify that your cluster is initialized by running:
kubectl get node
When your cluster is deployed, you should see the status as Ready.
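When this check is part of a script, the Ready status can be read from the command's tabular output. The sketch below assumes the usual layout with a header row containing NAME and STATUS columns; the node names and versions in the sample are made up for illustration.

```python
def node_statuses(kubectl_output):
    """Map node name -> STATUS from the table printed by `kubectl get node`."""
    lines = [ln for ln in kubectl_output.strip().splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    name_col = header.index("NAME")
    status_col = header.index("STATUS")
    statuses = {}
    for line in lines[1:]:
        fields = line.split()
        statuses[fields[name_col]] = fields[status_col]
    return statuses


def all_nodes_ready(kubectl_output):
    """True only if at least one node is listed and every node is exactly Ready."""
    statuses = node_statuses(kubectl_output)
    return bool(statuses) and all(s == "Ready" for s in statuses.values())


# Hypothetical sample output; real node names and versions will differ.
sample = """
NAME                 STATUS     ROLES    AGE   VERSION
gke-demo-pool-a1b2   Ready      <none>   2m    v1.0.0
gke-demo-pool-c3d4   NotReady   <none>   1m    v1.0.0
"""
print(node_statuses(sample))
print(all_nodes_ready(sample))
```

In practice the text would come from subprocess.run(["kubectl", "get", "node"], capture_output=True, text=True).stdout.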
d) Grant your account permissions on the cluster
kubectl create clusterrolebinding cluster-admin-binding \
  --clusterrole=cluster-admin \
  --user=<GOOGLE-EMAIL-ACCOUNT>
Replace <GOOGLE-EMAIL-ACCOUNT> with the email address of the Google account you used to log in to Google Cloud.
2. Setting up Helm
We will use Helm for installing, upgrading and managing applications on a Kubernetes cluster. The commands in this section target Helm 2, which relies on an in-cluster server component called Tiller; Helm 3 removed both Tiller and helm init.
a) Install Helm by running the installer script in Google Cloud Shell
curl https://raw.githubusercontent.com/kubernetes/helm/master/scripts/get | bash
b) Initialize Helm on your Kubernetes cluster
Set up a service account for use by Tiller (the server side of Helm; the client is called helm).
kubectl --namespace kube-system create serviceaccount tiller
Give the service account full permissions to manage the cluster.
kubectl create clusterrolebinding tiller --clusterrole cluster-admin --serviceaccount=kube-system:tiller
Initialize helm and Tiller.
helm init --service-account tiller --history-max 100 --wait
c) Install security patch
This makes Tiller listen only on localhost inside its own pod, so that it cannot be reached from other pods in the cluster.
kubectl patch deployment tiller-deploy --namespace=kube-system --type=json --patch='[{"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["/tiller", "--listen=localhost:44134"]}]'
d) Verify that Helm is installed properly
helm version
Make sure the version is at least 2.11.0 and that the client version matches that of the server.
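The two conditions above (a minimum version, and client equal to server) are easy to get wrong when comparing version strings by eye, since 2.9.0 sorts after 2.11.0 as text. A small sketch of the comparison, assuming you have copied the client and server version strings out of the helm version output yourself; the function names are ours:

```python
import re


def parse_version(text):
    """Extract (major, minor, patch) from a string such as 'v2.16.1'."""
    match = re.search(r"v?(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def helm_versions_ok(client, server, minimum=(2, 11, 0)):
    """Check the article's two conditions: client == server and both >= minimum."""
    client_version = parse_version(client)
    server_version = parse_version(server)
    return client_version == server_version and client_version >= minimum


# Comparing as integer tuples avoids the string-ordering trap (2.9 vs 2.11).
print(helm_versions_ok("v2.16.1", "v2.16.1"))
print(helm_versions_ok("v2.9.0", "v2.9.0"))
print(helm_versions_ok("v2.16.1", "v2.14.0"))
```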
3. Deploying Dask Processes and Jupyter
We are almost there. Just a couple more steps before we can start running our machine learning code.
a) Add Dask's Helm chart repository and update the package information
helm repo add dask https://helm.dask.org/
helm repo update
b) Launch Dask on the Kubernetes cluster
helm install --name my-dask dask/dask --version 4.1.13 --set scheduler.serviceType=LoadBalancer --set jupyter.serviceType=LoadBalancer
By default, this deploys a dask-scheduler, three dask-workers and a Jupyter server.
Depending on your use case, you may amend the options in the command above:
- --name is the name used to reference your Dask setup, in our case my-dask.
- --version is the Helm chart version to install and is optional. The full list of versions is published in the Dask Helm chart repository. If the option is left out, the latest version is installed. We pin version 4.1.13 here because the latest versions had compatibility issues on my end. That may not hold for your setup, so change the version or leave the option out as appropriate.
- --set sets the parameters scheduler.serviceType and jupyter.serviceType to the value LoadBalancer. This is needed to obtain external IP addresses for the Dask dashboard and the Jupyter server. Without this option, only a cluster IP is set up by default, as discussed in a Stack Overflow post.
4. Connecting to Dask and Jupyter
In the previous step, we launched Dask on the cluster. The deployment may take a minute to complete; you can check its status with kubectl:
kubectl get services
Once ready, the external IPs will show up for your Jupyter server (my-dask-jupyter) and Dask scheduler (my-dask-scheduler). If you see <pending> under EXTERNAL-IP, wait a little longer and run the command again.
Entering the external IP addresses for my-dask-jupyter and my-dask-scheduler in your web browser gives you access to your Jupyter server and Dask dashboard respectively.
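If you prefer to poll for the addresses from a script instead of re-reading the table, kubectl can print the same information as JSON (kubectl get services -o json). The sketch below reads the load balancer IP from each service's status; the sample document is trimmed down and hand-written, and the IP in it is a documentation-range placeholder.

```python
import json


def external_ips(services_json):
    """Map service name -> external IP, or None while the IP is still pending."""
    data = json.loads(services_json)
    result = {}
    for item in data.get("items", []):
        name = item["metadata"]["name"]
        ingress = item.get("status", {}).get("loadBalancer", {}).get("ingress") or []
        result[name] = ingress[0].get("ip") if ingress else None
    return result


# Hand-written, heavily trimmed sample of `kubectl get services -o json`.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "my-dask-jupyter"},
         "status": {"loadBalancer": {"ingress": [{"ip": "203.0.113.10"}]}}},
        {"metadata": {"name": "my-dask-scheduler"},
         "status": {"loadBalancer": {}}},
    ]
})

for name, ip in external_ips(sample).items():
    print(name, f"http://{ip}" if ip else "<pending>")
```

A service without a load balancer address (for example a plain ClusterIP service) is also reported as None here, so filter by name if you only care about the two Dask services.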
For the Jupyter server, you can log in with the default password dask. To change this password, see the next section.
Congratulations! You can now start running your Dask code :)
Click the button under Notebook in the Jupyter Launcher to get started.
Note: If you get a 404 error when accessing Jupyter, click on the Jupyter logo at the top to be directed to the login page.
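As a first notebook cell, a short smoke test helps confirm that the workers are reachable. This is only a sketch under a few assumptions: the release is named my-dask as above, the notebook runs inside the cluster so that the service name my-dask-scheduler resolves, and the scheduler listens on Dask's default port 8786. The function names are ours.

```python
def scheduler_address(release="my-dask", port=8786):
    """Build the in-cluster scheduler address for a Helm release name."""
    return f"tcp://{release}-scheduler:{port}"


def run_smoke_test(address):
    """Connect to the scheduler and compute the mean of a random array."""
    # Imported lazily so this file can be loaded on machines without Dask.
    from dask.distributed import Client
    import dask.array as da

    client = Client(address)
    try:
        # 100 chunks of 1000 x 1000 values, spread across the workers.
        x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
        return float(x.mean().compute())
    finally:
        client.close()


print(scheduler_address())
```

Calling run_smoke_test(scheduler_address()) from the Jupyter server should return a value close to 0.5, and the tasks should appear on the Dask dashboard while it runs.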
5. Configuring Environment
You may be able to run some basic Dask code after step 4, but what if you would like to run dask-ml? That is not installed by default. What if you would like to launch more than the default three workers? Or change your Jupyter server password?
We therefore need a way to customize our environment, and we can do so by creating a YAML file. The values in this file override the defaults of the corresponding parameters in the chart's standard configuration file.
For our illustration, we use a file named values.yaml. In general, the configuration is separated into three main sections, one each for the scheduler, worker and Jupyter.
[Configuration file template for Dask Helm deployment update]
To update the configuration, perform the following:
- In your Google Cloud Shell, run nano values.yaml to create the file values.yaml.
- Copy and paste the template (feel free to amend it) and save.
- Update your deployment to use this configuration file. With the release name used above, this would typically be helm upgrade my-dask dask/dask -f values.yaml, adding --version 4.1.13 if you pinned the chart version at install time.
- Note that you may need to wait a while for the updates to take effect.
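As a rough illustration of what such a values.yaml might contain, the sketch below assembles a file from the settings discussed in this article only (service types, worker replicas, resources, extra packages and the Jupyter password). It writes the content as JSON, which is also valid YAML; we assume Helm reads it like any other values file, so treat this as a starting point and compare it against the chart's own default values before relying on it.

```python
import json

# Extra packages installed at container start-up, as described in the article.
extra_packages = [
    {"name": "EXTRA_CONDA_PACKAGES", "value": "dask-ml shap -c conda-forge"},
    {"name": "EXTRA_PIP_PACKAGES", "value": "dask-lightgbm --upgrade"},
]

values = {
    "scheduler": {"serviceType": "LoadBalancer"},
    "worker": {
        "replicas": 4,
        "resources": {
            "limits": {"cpu": 1, "memory": "4G"},
            "requests": {"cpu": 1, "memory": "4G"},
        },
        "env": extra_packages,
    },
    "jupyter": {
        "serviceType": "LoadBalancer",
        # Hashed password quoted in the article; replace with your own hash.
        "password": "sha1:aae8550c0a44:9507d45e087d5ee481a5ce9f4f16f37a0867318c",
        "env": extra_packages,
    },
}


def render_values(config):
    """Serialize the configuration as JSON text (a subset of YAML)."""
    return json.dumps(config, indent=2) + "\n"


text = render_values(values)
print(text)
# To save it: open("values.yaml", "w").write(text)
```

Generating the file from a dictionary keeps the worker and Jupyter package lists in sync, which matters because code sent from the notebook needs the same libraries on the workers.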
Overview of configurations
Below is a general description of the commonly used configurations in our template.
a) Install libraries
Under the worker and Jupyter sections, you can find the env sub-section. Packages can be installed via conda or pip, and package names are separated by spaces.
env:  # Environment variables.
  - name: EXTRA_CONDA_PACKAGES
    value: dask-ml shap -c conda-forge
  - name: EXTRA_PIP_PACKAGES
    value: dask-lightgbm --upgrade
b) Number of workers
The number of workers is specified through the replicas parameter. In our case, we requested 4 workers.
worker:
  replicas: 4  # Number of workers.
c) Resources allocated
Depending on your needs, you can increase the amount of memory or the number of CPUs allocated to your scheduler, workers and/or Jupyter through the resources sub-section.
resources:
  limits:
    cpu: 1
    memory: 4G
  requests:
    cpu: 1
    memory: 4G
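Before raising replicas or resources, it is worth checking that the requests still fit on the cluster created in step 1. A back-of-the-envelope sketch follows; it uses the nominal size of an n1-standard-4 node (4 vCPUs, 15 GB of memory) and ignores what the scheduler, Jupyter and system pods consume, so the result is an upper bound rather than a guarantee. The function name is ours.

```python
def max_workers(num_nodes, node_cpu, node_memory_gb,
                worker_cpu, worker_memory_gb):
    """Upper bound on the number of workers the cluster can schedule.

    A pod cannot be split across nodes, so we first count how many workers
    fit on a single node (limited by CPU or memory, whichever runs out
    first) and then multiply by the number of nodes.
    """
    per_node_by_cpu = int(node_cpu // worker_cpu)
    per_node_by_memory = int(node_memory_gb // worker_memory_gb)
    return num_nodes * min(per_node_by_cpu, per_node_by_memory)


# Two n1-standard-4 nodes, workers requesting 1 CPU and 4G each.
limit = max_workers(num_nodes=2, node_cpu=4, node_memory_gb=15,
                    worker_cpu=1, worker_memory_gb=4)
print(limit)       # memory is the bottleneck: 3 workers per node
print(4 <= limit)  # the 4 replicas requested above fit
```

If the requested replicas exceed this bound, some worker pods will stay in the Pending state until you add nodes or lower the requests.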
d) Jupyter password
The Jupyter password is stored as a hashed value under the password parameter. You can change your password by replacing this field.
jupyter:
  password: 'sha1:aae8550c0a44:9507d45e087d5ee481a5ce9f4f16f37a0867318c'
To generate the hashed value of your new password:
- Launch a terminal from your Jupyter Launcher.
- Run jupyter notebook password on the command line and key in your new password. The hashed password is written to a file named jupyter_notebook_config.json.
- View and copy the hashed password.
- Replace the password field in values.yaml.
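For readers curious about what the hashed value contains, the sketch below reproduces the three-part layout seen above (algorithm, salt, digest) with the standard library. It assumes the digest is the SHA-1 of the password followed by the salt, which to our knowledge is how the classic Jupyter Notebook built its sha1 hashes; newer Jupyter releases may default to a different algorithm, so the jupyter notebook password command remains the safer way to produce the real value.

```python
import hashlib
import secrets


def hash_password(passphrase, salt=None):
    """Return 'sha1:<salt>:<digest>' for the given passphrase.

    Assumption: digest = SHA-1(passphrase bytes + salt bytes), hex encoded.
    """
    if salt is None:
        salt = secrets.token_hex(6)  # 12 hexadecimal characters
    digest = hashlib.sha1(
        passphrase.encode("utf-8") + salt.encode("ascii")
    ).hexdigest()
    return f"sha1:{salt}:{digest}"


def check_password(hashed, passphrase):
    """Verify a passphrase against a 'sha1:<salt>:<digest>' string."""
    algorithm, salt, _digest = hashed.split(":", 2)
    if algorithm != "sha1":
        raise ValueError(f"unsupported algorithm: {algorithm}")
    return secrets.compare_digest(hash_password(passphrase, salt), hashed)


# Placeholder password for illustration only.
hashed = hash_password("my-new-password")
print(hashed)
print(check_password(hashed, "my-new-password"))
```

The random salt means the same password produces a different string on every run, which is expected.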
6. Removing Cluster
To remove your Helm deployment, execute in Google Cloud Shell:
helm del --purge my-dask
Note that this does not destroy the Kubernetes cluster. To do so, delete your cluster from the Kubernetes Clusters page.
With the guide above, we hope that you are now able to deploy Dask on Google Cloud.
Thanks for reading and I hope the article was useful :) Please feel free to comment with any questions or suggestions that you may have.
Translated from: https://towardsdatascience.com/scalable-machine-learning-with-dask-on-google-cloud-5c72f945e768