
Machine Learning Lifecycle Management

Overview

Since data is the oil of the 21st century, people are always finding ways to use data science to convert data into dollars. With that being said, we can all agree that machine learning is a fairly hot topic within this realm. We should keep in mind that developing, deploying, and improving ML models at scale is not at all aligned with the steps of a traditional software development lifecycle. Continuous Delivery for Machine Learning (CD4ML) is the discipline of bringing Continuous Delivery principles and practices to machine learning applications. In this guide we will see how to manage and automate the numerous steps between gathering the data and deploying the machine learning model, using some amazing open-source tools.

Prerequisites

This might not be a very beginner-friendly guide, but I will try to explain the crux, the workings, and best practices wherever necessary, along with code snippets. It is assumed that the reader has working knowledge of the following tools, which we will use throughout. So I will quickly recap what these tools are about.

1. Apache Airflow

There are multiple steps involved in an ML cycle, so there can be multiple points of failure. To manage such daunting pipelines, we use Airflow as our workflow management system. Airflow takes in all your defined tasks and creates a DAG (directed acyclic graph) out of them. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Workflows are defined in Python, and tasks are segregated into operators like PythonOperator, BashOperator, etc. Airflow offers an amazing user interface from which you can trigger, stop, and track your workflows, along with a bunch of other options. Get a head start here.

2. MLflow

MLflow is an open-source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components: tracking, registry, deployment, and projects (a convention for organising and describing your code so that other data scientists can run it). MLflow also offers a comprehensive user interface where you can see your model's performance, compare metrics, perform versioning, etc. Get a head start here.

Alright, Let's Go!

1. Configuring Airflow

From here on it is assumed you have a working installation of Airflow and MLflow, along with Python 3 (libraries: sklearn, xgboost, pandas, sqlalchemy, etc.). Select a directory for your Airflow workspace, open a terminal, and type the following:

    (venv) $ cd /path/to/my/airflow/workspace
    (venv) $ mkdir AIRFLOW_HOME
    (venv) $ export AIRFLOW_HOME=`pwd`/AIRFLOW_HOME
    (venv) $

If these commands worked, you will see that Airflow has created airflow.cfg and some other files in the AIRFLOW_HOME directory. By default the cfg file sets Airflow up to work with SQLite, which is slow and does not support parallel task execution (so it is not recommended). You can change it to any database of your choice; in my case I changed the connection string to Postgres:

    sql_alchemy_conn = postgresql://username:password@127.0.0.1:5432/airflow

Also change the load_examples variable to False, because otherwise Airflow loads a bunch of irrelevant example DAGs into the UI, which we will look at later.
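For reference, the relevant lines in airflow.cfg end up looking like this (the credentials are placeholders):

    sql_alchemy_conn = postgresql://username:password@127.0.0.1:5432/airflow
    load_examples = False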

Okay, now we are done with the config. The next step is to run the following command:

    (venv) $
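This is the metadata-database initialization step; depending on your Airflow version the command is one of the following:

    (venv) $ airflow initdb     # Airflow 1.x
    (venv) $ airflow db init    # Airflow 2.x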

All the tables for storing your task and user metadata are created, and Airflow is now good to go for the further steps. Now let's split our complete cycle into the following elementary steps:

  • Load the data from SQL or any other data source
  • Preprocessing and feature engineering
  • Training multiple models (RF, XGB, SVM, etc.)
  • Evaluation and comparison of performance metrics
  • Deployment of the model

Please feel free to improvise as per your needs :)

Let's now create a folder named dags inside AIRFLOW_HOME, and within the dags folder create subfolders like this:

[Image: Folder structure]
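The layout boils down to something like this (the exact file names are up to you; flow_1.py and flow_2.py are the DAG scripts we create in step 3):

    AIRFLOW_HOME/
    └── dags/
        ├── Load_Dump/              # data loading and local dumping modules
        ├── Feature_Engineering/    # preprocessing and feature building modules
        ├── Models/                 # RF_Training.py, XGBOOST_Training.py, ...
        ├── flow_1.py               # training DAG (created in step 3)
        └── flow_2.py               # deployment DAG (created in step 3)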

Organising the modules into such folders is necessary and is obviously good practice, as it isolates the functionalities from each other. For example, Load_Dump contains all the modules responsible for loading the data from SQL and dumping it locally for the feature engineering flows to consume. Next, the Feature_Engineering folder holds all the functions that do your label encoding, imputation of missing values, aggregation of variables, etc. The Models folder contains the code that takes the processed features from Feature_Engineering and trains various models on them.

2. Setting Up Your ML Code to Interact with MLflow

Please make sure the MLflow tracking server is up and running, or start it with the following command:

    (venv) $
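A typical invocation matching the settings described below would look something like this (the backend store URI is a placeholder; point it at whatever database or directory you prefer):

    (venv) $ mlflow server \
        --backend-store-uri postgresql://username:password@127.0.0.1:5432/mlflow \
        --default-artifact-root ./mlruns \
        --host 0.0.0.0 --port 5050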

The --default-artifact-root option tells MLflow where to store your artifacts, like your model's pickle file or graphs (more on this later). I am also using the non-default port 5050 for my convenience. You can go to http://0.0.0.0:5050/ to check out the MLflow UI.

Now, in the previous steps we created separate folders for different functionalities. Coming to the modelling part, a typical training script loads the processed data, trains the model, and saves it in some format like a pickle file. For our use case we will use MLflow's API to log the parameters, metrics, and graphs to the MLflow UI. Here is a code chunk which explains how to accomplish this:

[Image: A snippet of XGBOOST_Training.py]
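A minimal sketch of such a training script looks like the following; the data path, experiment name, hyper-parameters, and registered model name are all placeholders, and run_Xgboost_flow doubles as the Airflow task entry point used later in flow_1.py:

    # XGBOOST_Training.py -- sketch of a training script instrumented with MLflow;
    # the data path, experiment name, hyper-parameters and model name are placeholders
    import pandas as pd
    import mlflow
    import mlflow.sklearn
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score
    from xgboost import XGBClassifier

    def run_Xgboost_flow():
        mlflow.set_tracking_uri("http://0.0.0.0:5050")   # the tracking server started above
        mlflow.set_experiment("ml_lifecycle_demo")        # assumed experiment name

        # processed features dumped by the feature engineering task (assumed path)
        df = pd.read_csv("/tmp/processed_features.csv")
        X, y = df.drop(columns=["target"]), df["target"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        params = {"n_estimators": 200, "max_depth": 5, "learning_rate": 0.1}

        with mlflow.start_run(run_name="xgboost"):
            model = XGBClassifier(**params).fit(X_train, y_train)

            # log hyper-parameters and the evaluation metric to the tracking server
            mlflow.log_params(params)
            mlflow.log_metric("f1_score", f1_score(y_test, model.predict(X_test)))

            # log the fitted model as an artifact and register it, so that this
            # version can later be transitioned to the "Production" stage
            mlflow.sklearn.log_model(model, artifact_path="model",
                                     registered_model_name="xgboost_classifier")

    if __name__ == "__main__":
        run_Xgboost_flow()

Plots saved to disk can be logged the same way with mlflow.log_artifact().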

So after this code executes, go into the MLflow UI (the tracking server running on http://0.0.0.0:5050/) and you can see a lot of things. Select the name of your experiment and you can see the log of all the runs along with all the metrics and parameters you have used.

[Image: MLflow UI — experiment runs]

You can investigate further by clicking on a run, and it will take you to another page where you can see your artifacts (the model pickle, the conda environment for deployment, and the graphs/plots you logged in the training code), as shown below. You also have the option to download the shown items.

[Image: MLflow UI — run artifacts]

Let's say we are comparing two models (XGBoost and RF) on the comparison screen, and we select XGBoost because of its better f1-score (shown below).

MLflow supports versioning and helps you keep track of all the experiments you have done. Now, since you have decided you want this model to be deployed, you can go to the versioning screen of MLflow, select the model version, and transition the stage of the model to Production.

[Image: MLflow UI — transitioning a model version to Production]
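The same transition can also be scripted with the MLflow client instead of clicking through the UI; a small sketch (the model name and version are placeholders):

    from mlflow.tracking import MlflowClient

    client = MlflowClient(tracking_uri="http://0.0.0.0:5050")
    # promote a specific version of the registered model to the Production stage
    client.transition_model_version_stage(name="xgboost_classifier",
                                          version="3",
                                          stage="Production")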

Does this mean the model is deployed? Unfortunately, no. It only means the specific model version is flagged for production. Let's see how we can use Airflow's abilities in conjunction with this.

3. Creating DAG Flows

i) The first workflow

Create a Python script flow_1.py that looks something like this:

[Image: A code screenshot from flow_1.py]
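A minimal sketch of that script, assuming the module paths, the DAG name, and the loading/feature-engineering callables (only run_Rf_flow and run_Xgboost_flow are named in the description below):

    # flow_1.py -- training pipeline sketch; module paths and most names are assumptions
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    # hypothetical entry-point functions living in the subfolders created above
    from Load_Dump.load_data import load_and_dump_data
    from Feature_Engineering.build_features import preprocess_and_engineer
    from Models.RF_Training import run_Rf_flow
    from Models.XGBOOST_Training import run_Xgboost_flow

    default_args = {
        "owner": "airflow",
        "start_date": datetime(2020, 1, 1),
        "retries": 0,
    }

    # the DAG is triggered every 30 minutes
    dag = DAG("ml_training_pipeline",
              default_args=default_args,
              schedule_interval="*/30 * * * *",
              concurrency=4)

    load_data_sql = PythonOperator(
        task_id="load_data_sql",
        python_callable=load_and_dump_data,
        dag=dag)

    preprocess_feature_engineering = PythonOperator(
        task_id="preprocess_feature_engineering",
        python_callable=preprocess_and_engineer,
        dag=dag)

    random_forest_training = PythonOperator(
        task_id="random_forest_training",
        python_callable=run_Rf_flow,
        dag=dag)

    xgboost_training = PythonOperator(
        task_id="xgboost_training",
        python_callable=run_Xgboost_flow,
        dag=dag)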

As you can see, I have specified the name of the pipeline along with a default set of arguments (start date, concurrency, etc.), which are self-explanatory. One important argument of the DAG is schedule_interval, which defines how often you want this workflow to be triggered. I have specified it as "*/30 * * * *", which means I want this workflow to be triggered every 30 minutes. Next you can see that I have defined four Python operators (one for loading the data, one for preprocessing and feature engineering, and two for model training). Here the argument python_callable takes the primary function which acts as the entry point of the specific task.

Please note: run_Rf_flow and run_Xgboost_flow are the ML model training functions that use MLflow's API for logging the metrics, params, and artifacts as described in the previous step.

Now we need to arrange the above tasks in a certain order of execution. Clearly load_data_sql should run first, followed by preprocess_feature_engineering, after which both random_forest_training and xgboost_training can run in parallel. Using the set_downstream function we set the order of execution and link the tasks. To achieve this order, please refer to the snapshot below:

[Image: A code screenshot from flow_1.py — task ordering]
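In code, with the operators defined above, the wiring is roughly:

    # flow_1.py (continued) -- wire up the execution order
    load_data_sql.set_downstream(preprocess_feature_engineering)
    # the two training tasks only depend on feature engineering, so they run in parallel
    preprocess_feature_engineering.set_downstream(random_forest_training)
    preprocess_feature_engineering.set_downstream(xgboost_training)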

Now go back to the terminal and type the following command:

    (venv) $
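Judging from the mention of localhost:8080 below, this is the webserver start command; the scheduler also needs to be running for the DAGs to actually execute:

    (venv) $ airflow webserver -p 8080
    (venv) $ airflow scheduler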

The webserver usually runs on localhost:8080. Click on the name of your DAG and switch to the graph view; you should see something like this:

[Image: Graph view in the Airflow webserver UI]

This means Airflow has created the order of execution as expected.

ii) The second workflow

The first flow stops at logging both models' performance to the MLflow UI. The second workflow is triggered after we mark a model version's stage as "Production".

Create a script called flow_2.py, which should look something like this:

[Image: A code screenshot from flow_2.py]
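A minimal sketch of that DAG, based on the description below (the task ids, the DAG name, and the helper get_production_model_source are assumptions; the helper itself is sketched a bit further down):

    # flow_2.py -- deployment pipeline sketch; most names are assumptions
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator
    from airflow.operators.bash_operator import BashOperator

    from Models.deploy_utils import get_production_model_source   # hypothetical module

    default_args = {"owner": "airflow", "start_date": datetime(2020, 1, 1)}

    dag = DAG("ml_deployment_pipeline",
              default_args=default_args,
              schedule_interval=None)   # triggered manually after staging a model

    # Operator 1: find the model version staged as "Production" and push its source
    # path to XCom (a python_callable's return value is pushed automatically)
    find_production_model = PythonOperator(
        task_id="find_production_model",
        python_callable=get_production_model_source,
        dag=dag)

    # Operator 2: pull the source from XCom and serve the model on port 1234
    serve_model = BashOperator(
        task_id="serve_model",
        bash_command=("mlflow models serve "
                      "-m {{ ti.xcom_pull(task_ids='find_production_model') }} "
                      "-p 1234 --no-conda"),
        dag=dag)

    find_production_model.set_downstream(serve_model)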

We can see that the target of this DAG flow is deploying the model into production. Operator 1 is a PythonOperator whose task is to search the artifacts within the MLflow tracking server and get the artifact which is staged as Production (we will come to that later). Operator 2 is a BashOperator which takes in the artifact source returned by Operator 1. An interesting thing to note here is that, since it is crucial for the two tasks to communicate with each other, we use the XCom interface for cross-communication between the operators. Finally, the BashOperator uses MLflow's serve command to deploy the model on port 1234 of localhost.

If we restart the Airflow webserver, we can see the graph for the second DAG as well.

Coming to Operator 1, let's see how we extract the "source" of the model marked for production. The following is the attached snippet:

[Image: A code screenshot from flow_2.py — fetching the production model source]
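A sketch of that callable (the function and module names are placeholders; the logic follows the description below):

    # deploy_utils.py -- sketch of the callable behind Operator 1
    import mlflow
    from mlflow.tracking import MlflowClient

    def get_production_model_source():
        mlflow.set_tracking_uri("http://0.0.0.0:5050")
        client = MlflowClient()

        # loop over every registered model and its latest versions, returning the
        # artifact source of the first version whose stage is "Production"
        for registered_model in client.search_registered_models():
            for version in registered_model.latest_versions:
                if version.current_stage == "Production":
                    return version.source

        raise ValueError("No model version is currently staged as 'Production'")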

It's pretty straightforward: we set the tracking server's URI, loop through all the registered models, and check whether the current_stage of a model version equals 'Production'.

4. Putting It All Together

In the previous steps we studied in detail how we can use MLflow and Airflow together to automate the entire process. Basically, everything boils down to three steps now:

  • Step 1: Trigger the first DAG, which loads, processes, and trains your models and logs the metrics and artifacts to the MLflow UI.
  • Step 2: Go into the MLflow UI and study the models' performance. After comparing metrics and parameters, select a model version to be deployed into production and change its stage to 'Production'.
  • Step 3: Trigger the second DAG, which searches for the production artifact and deploys it.

To keep track of things, Airflow creates a folder named logs inside your dags folder. Each operator has its own log files, which are distinguished by the "start_time" timestamp. For example, these are the RF model training logs:

[Image: Log file from Airflow]

The above-mentioned use case can be very useful for scenarios where you need to re-train the model on newly gathered data, deploy it, and repeat the process daily, weekly, or at some other predefined frequency.

Well, that's it from my side. If you need to resolve queries or discuss with me, please connect with me via LinkedIn.

Translated from: https://medium.com/swlh/machine-learning-lifecycle-management-a2e1a4fc500b
