當(dāng)前位置：首頁 > 运维知识 > windows >内容正文

windows

机器学习建立模型_建立生产的机器学习系统

發(fā)布時間：2023/11/29 windows 47 豆豆

生活随笔收集整理的這篇文章主要介紹了机器学习建立模型_建立生产的机器学习系统小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

機(jī)器學(xué)習(xí) 建立模型

When businesses plan to start incorporating machine learning to enhance their solutions, they more often than not think that it is mostly about algorithms and analytics. Most of the blogs/training on the matter also only talk about taking fixed format files, training models and printing the result. Naturally, businesses tend to think that hiring good data scientists should get the job done. What they often fail to appreciate is that it is also a good old system and data engineering problem, with the data models and algorithms sitting at the core.

當(dāng)企業(yè)計劃開始合并機(jī)器學(xué)習(xí)以增強(qiáng)其解決方案時，他們常常會認(rèn)為這主要與算法和分析有關(guān)。關(guān)于此事的大多數(shù)博客/培訓(xùn)也只談?wù)摣@取固定格式的文件，培訓(xùn)模型并打印結(jié)果。自然，企業(yè)傾向于認(rèn)為雇用優(yōu)秀的數(shù)據(jù)科學(xué)家應(yīng)該完成這項工作。他們經(jīng)常不能理解的是，這也是一個很好的舊系統(tǒng)和數(shù)據(jù)工程問題，數(shù)據(jù)模型和算法是核心。

About a few years ago, at an organisation I was working in, business deliberated on using machine learning models to enhance user engagement. The use cases that were initially planned revolved around content recommendations. However later, as we worked more in the field we started using it for more diverse problems like, Topic Classification, Keyword Extraction, Newsletter Content Selection etc.

大約幾年前，在我工作過的一個組織中，企業(yè)考慮使用機(jī)器學(xué)習(xí)模型來增強(qiáng)用戶參與度。最初計劃的用例圍繞內(nèi)容建議。但是后來，隨著我們在該領(lǐng)域的工作越來越深入，我們開始將其用于更多樣化的問題，例如主題分類，關(guān)鍵字提取，新聞稿內(nèi)容選擇等。

I would use our experience in designing and incorporating machine learnt models in production to illustrate the engineering and human aspects in building a data science application and team.

我將利用我們在生產(chǎn)中設(shè)計和合并機(jī)器學(xué)習(xí)模型的經(jīng)驗來說明在構(gòu)建數(shù)據(jù)科學(xué)應(yīng)用程序和團(tuán)隊時的工程和人為方面。

In big data analysis, training models were the crux of our data science application. But to make things work in production, many missing pieces of the puzzle were also required to be solved.

在大數(shù)據(jù)分析中，訓(xùn)練模型是我們數(shù)據(jù)科學(xué)應(yīng)用程序的關(guān)鍵。但是要使事情在生產(chǎn)中可行，還需要解決許多缺失的難題。

These were:

這些曾經(jīng)是：

Getting data into the system on a regular basis from multiple sources.

定期從多個來源將數(shù)據(jù)導(dǎo)入系統(tǒng)。

Cleaning and transformation in more than one structures for use.

清潔和改造多個結(jié)構(gòu)以供使用。

Training and retraining models, saving and reusing as required.

培訓(xùn)和再培訓(xùn)模型，根據(jù)需要保存和重用。

How to apply incremental changes.

如何應(yīng)用增量更改。

Exposing model outputs for consumption through API’s.

公開模型輸出以供通過API使用。

Scaling consumption API’s was also a concern for us. In our existing system, Content was mostly static and served from CDN cache. Certain content related data were served by application servers, but then all the users get the same data. Data was served from cache, which was updated every 5–10 seconds. Also the data pertained to around 7000 odd items on any particular day. Hence, overall low memory consumption and low number of writes.

擴(kuò)展消耗API也是我們關(guān)注的問題。在我們現(xiàn)有的系統(tǒng)中，內(nèi)容大部分是靜態(tài)的，由CDN緩存提供。某些與內(nèi)容相關(guān)的數(shù)據(jù)由應(yīng)用程序服務(wù)器提供，但是所有用戶都獲得相同的數(shù)據(jù)。數(shù)據(jù)由緩存提供，緩存每5-10秒更新一次。此外，該數(shù)據(jù)在任何特定日期涉及大約7000個奇數(shù)物品。因此，總體上內(nèi)存消耗低，寫入次數(shù)少。

Now, the personalized content output was for around 35 million users. Also new content was available every 10 minutes or so. Everything needed to be served by our application servers. Thus, this meant a far higher number of writes as well cache size to be handled than what we had handled earlier.

現(xiàn)在，個性化內(nèi)容輸出可用于約3500萬用戶。每10分鐘左右就有新內(nèi)容可用。我們的應(yīng)用程序服務(wù)器需要提供所有服務(wù)。因此，這意味著要處理的寫入數(shù)量和緩存大小比我們之前處理的要多得多。

The challenge for us was to design a system to do all these. So a data-science / ML project was not limited to building vectors and running models, but involved designing a complete system with data as the lead player.

我們面臨的挑戰(zhàn)是設(shè)計一個能夠完成所有這些任務(wù)的系統(tǒng)。因此，數(shù)據(jù)科學(xué)/機(jī)器學(xué)習(xí)項目不僅限于構(gòu)建向量和運行模型，還涉及設(shè)計一個以數(shù)據(jù)為主要參與者的完整系統(tǒng)。

When we started building our solution, we found that we had three facets that our decisions needed to cater too, namely System, Data and Team . Thus we would discuss our approach to these all three aspects separately.

在開始構(gòu)建解決方案時，我們發(fā)現(xiàn)我們的決策也需要滿足三個方面的需求，即System，Data和Team。因此，我們將分別討論這三個方面的方法。

Data Architecture:

數(shù)據(jù)架構(gòu)：

We had data in multiple types of database, which backed our various applications. Data structures ranged from tabular to Document to Key-Value. Also, we had decided to use Hadoop ecosystem frameworks like Spark, Flink etc for our processing. Therefore, we chose HDFS for our data storage system for analytics.

我們在多種類型的數(shù)據(jù)庫中擁有數(shù)據(jù)，這些數(shù)據(jù)支持了我們的各種應(yīng)用程序。數(shù)據(jù)結(jié)構(gòu)從表格到文檔再到鍵值。此外，我們決定使用Spark，Flink等Hadoop生態(tài)系統(tǒng)框架進(jìn)行處理。因此，我們?yōu)閿?shù)據(jù)存儲系統(tǒng)選擇了HDFS進(jìn)行分析。

We built a 3 tier data storage system.

我們構(gòu)建了3層數(shù)據(jù)存儲系統(tǒng)。

Raw Data Layer: This essentially is our Data lake and the foundation layer. Data is ingested from all our sources into this layer. Data ingestion is done from Databases as well as Kafka Streams.

原始數(shù)據(jù)層：本質(zhì)上這是我們的數(shù)據(jù)湖和基礎(chǔ)層。數(shù)據(jù)從我們所有的源中提取到該層中。數(shù)據(jù)提取是從數(shù)據(jù)庫以及Kafka Streams完成的。

Cleaned / Transformed / Enriched Data Layer: This layer stores data in structures which are directly consumable by our analytics or machine learning applications. Jobs take data from our lake, clean and transform it into standardised structures creating Primary Data. There are jobs which also merged changes to create an updated state. Primary Data is further enriched to create Secondary or Tertiary Data. Jobs also create and save Feature vectors in this layer. Feature vectors are designed to be used in multiple subsequent algorithms. For example, the Content Feature vector is used for Section/Topic classification. Same feature vector, enhanced over time to include consumption information, was used for Newsletter candidate selection & recommendation.

清潔/轉(zhuǎn)換/豐富的數(shù)據(jù)層：此層將數(shù)據(jù)存儲在結(jié)構(gòu)中，這些結(jié)構(gòu)可直接由我們的分析或機(jī)器學(xué)習(xí)應(yīng)用程序使用。作業(yè)從我們的湖泊中獲取數(shù)據(jù)，進(jìn)行清理并將其轉(zhuǎn)換為標(biāo)準(zhǔn)化結(jié)構(gòu)，從而創(chuàng)建基本數(shù)據(jù)。有些作業(yè)還合并了更改以創(chuàng)建更新狀態(tài)。進(jìn)一步豐富了主數(shù)據(jù)以創(chuàng)建輔助或第三數(shù)據(jù)。作業(yè)還會在此層中創(chuàng)建和保存要素向量。特征向量被設(shè)計用于多種后續(xù)算法。例如，內(nèi)容特征向量用于部分/主題分類。隨時間推移增強(qiáng)的相同特征向量包括消費信息，用于新聞通訊候選人的選擇和推薦。

Processed Output Layer: Analytics and Model outputs are stored in this layer. Also trained models too are stored in this layer, for subsequent use.

處理的輸出層：分析和模型輸出存儲在此層中。訓(xùn)練有素的模型也存儲在此層中，以備后用。

System Architecture:

系統(tǒng)架構(gòu)：

All the jobs/applications that we made catered to either data ingestion, processing or output consumption. Thus we built a 3 tiered application layer for all our processing.

我們提供的所有作業(yè)/應(yīng)用程序都可以滿足數(shù)據(jù)攝取，處理或輸出消耗的需求。因此，我們?yōu)樗刑幚順?gòu)建了一個3層的應(yīng)用程序?qū)印?

Data Ingestion Layer: This layer includes batch jobs to import data from RDBMS and Document storage. We used Apache Sqoop for ingestion jobs. There are a bunch of jobs to ingest data from Kafka message streams. For example user activity data. Apache Netty based Rest API Server collects activity data, which is pushed to Kafka. Apache Flink based jobs consume the activity data from Kafka, generate basic statistics and also push the data to HDFS.
數(shù)據(jù)攝取層：該層包括批處理作業(yè)，以從RDBMS和文檔存儲導(dǎo)入數(shù)據(jù)。我們使用Apache Sqoop進(jìn)行提取作業(yè)。有很多作業(yè)可以從Kafka消息流中提取數(shù)據(jù)。例如用戶活動數(shù)據(jù)。基于Apache Netty的Rest API Server收集活動數(shù)據(jù)，該數(shù)據(jù)被推送到Kafka。基于Apache Flink的作業(yè)會使用來自Kafka的活動數(shù)據(jù)，生成基本統(tǒng)計信息，還將數(shù)據(jù)推送到HDFS。
Data Processing: We used Apache Spark jobs for all our processing. It includes jobs for cleaning, enhancements, feature vector builders, ML models and model output generation. The jobs are written in Java, Scala as well as Python.
數(shù)據(jù)處理：我們使用Apache Spark作業(yè)進(jìn)行所有處理。它包括清潔，增強(qiáng)，特征向量生成器，ML模型和模型輸出生成的工作。這些作業(yè)用Java，Scala和Python編寫。
Result Consumption: Processed output was pushed to RDBMS as well as Redis for consumption. The jobs were either built on Spark or Sqoop. The output is further exposed by Spring Boot Rest API endpoints for consumption. Further to this the same results were pushed out in event streams for further downstream processing or consumption.
結(jié)果消耗：處理后的輸出被推送到RDBMS以及Redis進(jìn)行消耗。這些作業(yè)是基于Spark或Sqoop構(gòu)建的。 Spring Boot Rest API端點進(jìn)一步公開了輸出以供使用。除此之外，在事件流中推出了相同的結(jié)果，以供進(jìn)一步的下游處理或使用。

Team Setup:

團(tuán)隊設(shè)置：

This was the most crucial aspect for the success of the entire enterprise. There were two needs that were required to be fulfilled:

這是整個企業(yè)成功的最關(guān)鍵方面。需要滿足兩個需求：

Skill in understanding and applying machine learning principles and tools.

理解和應(yīng)用機(jī)器學(xué)習(xí)原理和工具的技能。

Knowledge of our domain. Deep understanding of the content types, important aspects of content, how it matures, dies, what all things affect it etc.

我們的領(lǐng)域知識。對內(nèi)容類型，內(nèi)容的重要方面，如何成熟，消亡，所有事物對其產(chǎn)生什么影響等方面的深刻理解。

Also, when it came to be known that we were planning ML based products, there were a lot of people in our existing team who wanted to be part of such an initiative. Also, it was important for us to ensure that we cater to the aspirations of our existing team members too.

另外，當(dāng)我們得知我們正在計劃基于ML的產(chǎn)品時，我們現(xiàn)有團(tuán)隊中的很多人都希望成為此類計劃的一部分。同樣，對我們來說，確保我們也能滿足現(xiàn)有團(tuán)隊成員的愿望也很重要。

Moreover, the overall system design meant that there were two distinct parts of the problem, the core ML section and the peripherals which were more like good old software engineering.

而且，整個系統(tǒng)的設(shè)計意味著問題有兩個截然不同的部分，核心ML部分和外圍設(shè)備，它們更像是好的舊軟件工程。

We decided to build our team with a combination two sets of people:

我們決定由兩組人組成一個團(tuán)隊：

Data Science experts, whom we hired. They were entrusted with the data science part of the puzzle. They also taught other team members and mentored their learning process.

我們聘請的數(shù)據(jù)科學(xué)專家。他們被賦予了難題中的數(shù)據(jù)科學(xué)部分。他們還教了其他團(tuán)隊成員，并指導(dǎo)了他們的學(xué)習(xí)過程。

System development team, who were people picked from our existing team. They built the ingestion pipelines, stream processing engines, output consumption API’s etc.

系統(tǒng)開發(fā)團(tuán)隊是從我們現(xiàn)有團(tuán)隊中挑選出來的。他們建立了攝取管道，流處理引擎，輸出消耗API等。

Also, by taking people from our existing team, we were able to get the ingestion pipelines development going while we were hiring the data science people. Thus, we were able to kick start work from day one, figuratively speaking.

此外，通過從現(xiàn)有團(tuán)隊中聘用人員，我們能夠在雇用數(shù)據(jù)科學(xué)人員的同時推動提取管道的開發(fā)。因此，可以說，從第一天開始我們就可以開始工作。

As is illustrated through our experience, building a bunch of applications for training models and generating output is only a beginning. Building a system and team to harness them is an entirely different proposition.

正如我們的經(jīng)驗所表明的那樣，構(gòu)建大量用于訓(xùn)練模型的應(yīng)用程序并生成輸出只是一個開始。建立一個系統(tǒng)和團(tuán)隊來利用它們是完全不同的主張。

翻譯自: https://medium.com/@bmallick/building-a-ml-system-for-production-667923c4389e

機(jī)器學(xué)習(xí) 建立模型

總結(jié)

以上是生活随笔為你收集整理的机器学习建立模型_建立生产的机器学习系统的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯，歡迎將生活随笔推薦給好友。

上一篇：时不时梦到初恋怎么回事
下一篇： pytorch深度学习_深度学习和PyT

日韩av黄I国产麻豆传媒I国产91av视频在线观看I日韩一区二区三区在线看I美女国产在线I麻豆视频国产在线观看I成人黄色短片

windows

机器学习 建立模型_建立生产的机器学习系统

總結(jié)

机器学习建立模型_建立生产的机器学习系统