How to Make Predictions Directly from Snowflake Using Machine Learning Models
Often, we are faced with scenarios (as I was recently) where a model deployed by a data scientist runs on a schedule, whether that's once an hour, once a day, or once a week… you get the point. However, there are times when out-of-schedule results are required to make decisions for a meeting or analysis.
With that being said, there are a few ways to get out-of-schedule predictions…
Getting Out-of-schedule Predictions
Consequently, even though the data scientist and co. could implement a batch prediction application for others to use in out-of-schedule cases, it would be more intuitive to bring non-technical users closer to the model itself and give them the power to run predictions from SQL.
Bridging the Gap Between Running Prediction and SQL on Snowflake
Inspired by Amazon Aurora Machine Learning, I spent a couple of days thinking about how to bridge this gap, and put together an architecture and build that allows non-technical users to perform batch prediction from the comfort of SQL. This is all done within Snowflake using Stored Procedures, Snowpipe, Streams and Tasks, together with SageMaker's batch prediction job (Batch Transform), to create a batch inference data pipeline.
Snowflake Machine Learning - Architectural Design
Architectural diagram of the build
Unloading onto S3 — Use of Stored Procedure
Flow diagram of Unloading onto S3
Creating the input table
In order for the user to make a call to Batch Transform, the user will need to create an input table that contains the data for the model plus the mandatory fields: predictionid, a uuid for the job; record_seq, a unique identifier for each input row; and a NULL prediction column, which is the target of interest.
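As an illustration, such an input table might look like the following sketch. The table name and the feature columns are hypothetical (borrowed from the hotel-cancellation example); only predictionid, record_seq and prediction are the mandatory fields described above:

```sql
-- Hypothetical input table for the hotel-cancellation model.
create or replace table hotel_cancellation_input (
    predictionid  varchar,  -- uuid identifying the batch prediction job
    record_seq    number,   -- unique identifier for each input row
    lead_time     number,   -- model feature (placeholder)
    adr           float,    -- model feature (placeholder)
    prediction    varchar   -- left NULL; populated by the pipeline
);
```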
Input Data: hotel_cancellation
Unloading onto S3
The call_ml_prediction Stored Procedure takes in a user-defined job name and an input table name. Calling it unloads the file (named after the predictionid) onto the S3 bucket under the /input path and creates an entry in the prediction_status table. From there, Batch Transform is called to predict on the input data.
To ensure there aren’t multiple requests being submitted, only one job is able to run at a time. For simplicity, I also ensured only a single file is unloaded onto S3, but Batch Transform can handle multiple input files.
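Inside the stored procedure, the unload itself can boil down to a single COPY INTO <location> statement. The stage name, file name and format options below are assumptions for illustration; in the actual procedure the predictionid would be interpolated into the path:

```sql
-- Sketch of the unload performed by call_ml_prediction (stage name assumed).
copy into @ml_s3_stage/input/my_predictionid.csv
from (select * from hotel_cancellation_input)
file_format = (type = csv field_optionally_enclosed_by = '"')
single = true                -- keep everything in one file, as described above
max_file_size = 4900000000;  -- raise the default 16 MB single-file limit
```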
Prediction status table
Prediction — Use of SageMaker Batch Transform
Flow diagram of Triggering Batch Transform
Triggering SageMaker Batch Transform
Once the data is unloaded onto the S3 bucket's /input path, a Lambda is fired which makes a call to SageMaker Batch Transform to read in the input data and output inferences to the /sagemaker path.
If you're familiar with Batch Transform, you can set the input_filter, join and output_filter to your liking for the output prediction file.
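A minimal sketch of such a Lambda is below. The model name, bucket, instance type and the filter expressions are all assumptions for illustration (adjust the JSONPath filters to your own column layout); the request shape follows the SageMaker CreateTransformJob API:

```python
import json

MODEL_NAME = "hotel-cancellation-model"       # assumed model name
OUTPUT_PATH = "s3://my-ml-bucket/sagemaker/"  # assumed output path

def build_transform_request(job_name: str, input_uri: str) -> dict:
    """Build a CreateTransformJob request for a CSV input file."""
    return {
        "TransformJobName": job_name,
        "ModelName": MODEL_NAME,
        "TransformInput": {
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix", "S3Uri": input_uri}},
            "ContentType": "text/csv",
            "SplitType": "Line",
        },
        "TransformOutput": {
            "S3OutputPath": OUTPUT_PATH,
            "Accept": "text/csv",
            "AssembleWith": "Line",
        },
        "TransformResources": {
            "InstanceType": "ml.m5.large", "InstanceCount": 1},
        # Illustrative filters: drop the id columns before the model sees
        # the row, then join the prediction back onto the input.
        "DataProcessing": {
            "InputFilter": "$[2:]",
            "JoinSource": "Input",
            "OutputFilter": "$[0,1,-1]",
        },
    }

def lambda_handler(event, context):
    import boto3  # local import keeps the request builder testable offline
    record = event["Records"][0]["s3"]
    input_uri = f"s3://{record['bucket']['name']}/{record['object']['key']}"
    # The unloaded file is named after the predictionid, so reuse it as
    # the transform job name.
    job_name = record["object"]["key"].split("/")[-1].replace(".csv", "")
    boto3.client("sagemaker").create_transform_job(
        **build_transform_request(job_name, input_uri))
    return {"statusCode": 200, "body": json.dumps(job_name)}
```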
Batch Transform Output
Once Batch Transform completes, it outputs the result as a .csv.out file in the /sagemaker path. Another Lambda is fired which copies and renames the file to .csv in the /snowflake path, where SQS is set up for Snowpipe auto-ingest.
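This copy-and-rename Lambda can be sketched as follows; the path prefixes mirror the ones described above, and since S3 has no rename operation the object is copied and then deleted:

```python
import json

def to_snowflake_key(key: str) -> str:
    """Map a Batch Transform output key to the Snowpipe-watched path,
    e.g. sagemaker/<predictionid>.csv.out -> snowflake/<predictionid>.csv
    """
    filename = key.rsplit("/", 1)[-1]
    if filename.endswith(".csv.out"):
        filename = filename[: -len(".out")]
    return f"snowflake/{filename}"

def lambda_handler(event, context):
    import boto3  # local import keeps to_snowflake_key testable offline
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]
    dest = to_snowflake_key(key)
    s3 = boto3.client("s3")
    # S3 has no rename: copy to the new key, then delete the original.
    s3.copy_object(Bucket=bucket, Key=dest,
                   CopySource={"Bucket": bucket, "Key": key})
    s3.delete_object(Bucket=bucket, Key=key)
    return {"statusCode": 200, "body": json.dumps(dest)}
```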
The Result — Use of Snowpipe, Stream and Task
Flow diagram of piping the data into Snowflake
Ingestion through Snowpipe
Once the data is dropped onto the /snowflake path, it is inserted into the prediction_result table via Snowpipe. For simplicity, since SageMaker Batch Transform maintains the order of the prediction, the row number was used as the identifier to join to the input table. You can do the postprocessing step within Batch Transform itself.
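A minimal sketch of the pipe is below; the stage name and file format are assumptions, and auto_ingest relies on the SQS notification set up on the /snowflake path:

```sql
-- Sketch of the Snowpipe that loads Batch Transform results.
create or replace pipe prediction_result_pipe
  auto_ingest = true  -- driven by the S3 event notification via SQS
as
  copy into prediction_result
  from @ml_s3_stage/snowflake/
  file_format = (type = csv);
```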
Streaming the data and triggering Tasks
A stream is created on the prediction_result table which populates prediction_result_stream after Snowpipe delivers the data. This stream, specifically system$stream_has_data('prediction_result_stream'), is used by the scheduled task populate_prediction_result to call the stored procedure populate_prediction_result, which populates the prediction data on the hotel_cancellation table, but only if the stream has data. The unique identifier, predictionid, is also set as a task session variable.
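The stream and the scheduled task can be sketched as follows; the warehouse name and schedule are assumptions:

```sql
-- Stream that captures new rows delivered by Snowpipe.
create or replace stream prediction_result_stream
  on table prediction_result;

-- Scheduled task that only fires when the stream actually has data.
create or replace task populate_prediction_result
  warehouse = ml_wh      -- assumed warehouse name
  schedule = '1 minute'  -- assumed polling interval
when
  system$stream_has_data('prediction_result_stream')
as
  call populate_prediction_result();
```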
The Result from the Batch Transform
Completing the job
At the end of the job, and after populate_prediction_result completes, using the system task session variable, the next task update_prediction_status updates the prediction status from Submitted to Completed. This concludes the entire “Using SQL to run Batch Prediction” pipeline.
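This final task can be chained after the previous one with the AFTER clause; the warehouse name, status column and the $predictionid session-variable reference are assumptions based on the description above:

```sql
-- Hypothetical follow-up task, run after populate_prediction_result.
create or replace task update_prediction_status
  warehouse = ml_wh  -- assumed warehouse name
  after populate_prediction_result
as
  update prediction_status
     set status = 'Completed'
   where predictionid = $predictionid;  -- task session variable set earlier
```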
Updated prediction status
Doing it better
Snowflake provides a lot of power through Snowpipe, Streams, Stored Procedures and Tasks to create data pipelines that can be used for different applications. When combined with SageMaker, users are able to send inputs directly from Snowflake and interact with the prediction results.
Nonetheless, there are some wishlist items that would improve the whole experience:
I hope you find this article useful and enjoyed the read.
About Me
I love writing Medium articles and sharing my ideas and learnings with everyone. My day-to-day job involves helping businesses build scalable cloud and data solutions, and trying new food recipes. Feel free to connect with me for a casual chat; just let me know you're from Medium.
— Jeno Yamma
Originally published at: https://towardsdatascience.com/using-machine-learning-models-to-make-prediction-directly-from-snowflake-2471b2f71b68