
From Spark MLlib to Meitu's Machine Learning Framework in Practice

Published: 2025/3/20


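MLlib is Apache Spark's scalable machine learning library. It aims to simplify the engineering work of machine learning and to make it easy to scale to larger datasets.

Introduction to Machine Learning

Before looking at Spark MLlib in depth, it helps to review machine learning itself. According to Wikipedia, machine learning has been defined in several ways:

Machine learning is a science of artificial intelligence; the main object of study in this field is artificial intelligence, in particular how to improve the performance of specific algorithms through learning from experience;

Machine learning is the study of computer algorithms that improve automatically through experience;

Machine learning uses data or past experience to optimize a performance criterion of a computer program;

A frequently cited English definition is: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."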

In fact, in an earlier introductory article from the Meitu Data Technology Team

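Commonly used machine learning algorithms can be grouped into the following categories:

1. Constructing distributions based on margin theory: artificial neural networks, decision trees, perceptrons, support vector machines, ensemble learning (AdaBoost), dimensionality reduction and metric learning, clustering, Bayesian classifiers;

2. Constructing conditional probabilities: Gaussian process regression, linear discriminant analysis, nearest neighbor methods, radial basis function kernels;

3. Constructing probability density functions through generative models: the expectation-maximization algorithm, probabilistic graphical models (Bayesian networks and Markov random fields), Generative Topographic Mapping;

4. Approximate inference techniques: Markov chains, Monte Carlo methods, variational methods;

5. Optimization algorithms.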

Spark MLlib

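As mentioned above, one of the key ideas in machine learning is "experience". For a computer, experience usually has to be obtained through many rounds of iterative computation, and Spark is good at iterative computation, which matches this characteristic of machine learning well. The Spark website shows a performance comparison of logistic regression running on Spark and on Hadoop; in that chart, Spark runs roughly 100 times faster than Hadoop MapReduce.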

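Spark MLlib mainly covers the following areas:

Learning algorithms: classification, regression, clustering, and collaborative filtering;

Featurization: feature extraction, transformation, dimensionality reduction, and selection;

Pipelines: tools for constructing, evaluating, and tuning machine learning pipelines;

Persistence: saving and loading algorithms, models, and pipelines;

Utilities: linear algebra, statistics, optimization, hyperparameter tuning, and other tools.

The table above summarizes the functionality Spark MLlib supports. It covers a lot of ground, but the algorithm families it offers are relatively few and dated, so Spark MLlib's algorithm support is somewhat out of step with the kylin project; most of its functionality is related to features.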

ML Pipelines

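Starting with Spark 2.0, the RDD-based API entered maintenance mode. Spark's primary machine learning API is now the DataFrame-based API, spark.ml, which borrows from the design of Scikit-Learn and provides a Pipeline toolkit for building machine learning workflows. ML Pipelines offer a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines.

* "Spark ML" is not an official name; it is occasionally used to refer to the MLlib DataFrame-based API.

First, let's look at several important components of ML Pipelines.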

DataFrame

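DataFrame gives Spark the ability to process large-scale structured data.

An RDD is a distributed collection of Java objects, and the internal structure of those objects is opaque to the RDD. A DataFrame is a distributed dataset built on top of an RDD. The RDD stores Row objects, and Row objects carry detailed structural information, namely the schema, which is what lets a DataFrame work with structured data.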
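The difference between an opaque RDD and a schema-carrying DataFrame can be seen in a few lines. This is a minimal sketch, assuming a local SparkSession; the app name and the sample rows are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, LongType, StringType}

// Assumption: a local session; in spark-shell, getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("dataframe-schema-sketch").getOrCreate()
import spark.implicits._

// An RDD of tuples: Spark only sees JVM objects and knows nothing about their fields.
val rdd = spark.sparkContext.parallelize(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0)
))

// The same data as a DataFrame: every row now has named, typed columns (the schema).
val df = rdd.toDF("id", "text", "label")
df.printSchema()
```

Because the schema is known, columns can be referred to by name (for example `df.select("text")`), which is what the Pipeline components described next rely on.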

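Transformer

A Transformer is usually a class that transforms data or features, or a trained model.

Every Transformer has a transform function that converts one DataFrame into another. Typically, transform appends one or more columns to the input DataFrame. Transformer.transform is also lazily evaluated: it only produces a new DataFrame variable and does not submit a job to compute the contents of the DataFrame.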
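The lazy behaviour described above can be illustrated with the built-in Tokenizer. This is a minimal sketch, assuming a local SparkSession; the sample rows and the app name are illustrative only.

```scala
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("transformer-sketch").getOrCreate()

val docs = spark.createDataFrame(Seq(
  (0L, "a b c d e spark"),
  (1L, "hadoop mapreduce")
)).toDF("id", "text")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")

// transform only describes the new DataFrame; the "words" column is not computed yet.
val tokenized = tokenizer.transform(docs)

// An action such as show() is what actually submits a job and computes the column.
tokenized.show(truncate = false)
```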

Estimator

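An Estimator abstracts the process of learning a model from input data. Every Estimator implements a fit method which, given a DataFrame and Params, produces a Transformer (that is, the trained model). Each call to Estimator.fit() launches jobs to train the model and obtain the model parameters.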
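As an illustration of the Estimator to Transformer relationship, the sketch below fits a LogisticRegression (an Estimator) and gets back a LogisticRegressionModel (a Transformer). It assumes a local SparkSession, and the four training rows are toy values chosen only for the example.

```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("estimator-sketch").getOrCreate()

// Toy (label, features) rows.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

// fit() runs Spark jobs and returns the trained model, which is itself a Transformer.
val lrModel: LogisticRegressionModel = lr.fit(training)
println(s"coefficients=${lrModel.coefficients}, intercept=${lrModel.intercept}")

// The fitted model appends prediction columns to whatever DataFrame it transforms.
val predictions = lrModel.transform(training)
```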

Param

Model parameters can be set either through the setters on a Transformer or Estimator instance, or by passing in a ParamMap object.
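Both ways of setting parameters can be sketched as follows. The values are arbitrary, and `training` in the final comment stands for a hypothetical DataFrame that is not defined here.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

// Option 1: set parameters on the instance through its setters.
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

// Option 2: collect overrides in a ParamMap, keyed by the instance's Param objects.
val paramMap = ParamMap(lr.maxIter -> 20, lr.regParam -> 0.1)

// Values in the ParamMap take precedence over the ones set on the instance.
val overridden = lr.copy(paramMap)
println(s"instance maxIter=${lr.getMaxIter}, overridden maxIter=${overridden.getMaxIter}")

// At training time the map can be passed straight to fit, e.g. lr.fit(training, paramMap),
// where `training` is a hypothetical DataFrame with "label" and "features" columns.
```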

Pipeline

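A Pipeline defines a sequence of data processing steps. A Transformer, an Estimator, or another Pipeline can be added to a Pipeline. Pipeline extends Estimator, so calling Pipeline.fit returns a Transformer, namely a PipelineModel. PipelineModel extends Transformer and passes the input through each Transformer of the Pipeline in turn to produce the final output.

A typical Spark MLlib workflow looks like this:

1. Build the training dataset

2. Construct the individual Stages

3. Assemble the Stages into a Pipeline

4. Start model training

5. Evaluate the model

6. Compute predictions

A Pipeline text classification example makes this more concrete: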

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Prepare training documents from a list of (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

// Now we can optionally save the fitted pipeline to disk
model.write.overwrite().save("/tmp/spark-logistic-regression-model")

// We can also save this unfit pipeline to disk
pipeline.write.overwrite().save("/tmp/unfit-lr-model")

// And load it back in during production
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "spark hadoop spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")

// Make predictions on test documents.
model.transform(test)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }

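Model Selection and Hyperparameter Tuning

Spark MLlib provides two tools for model selection and hyperparameter tuning: CrossValidator and TrainValidationSplit. Model selection and tuning rest on three basic components: the Estimator, the ParamGrid, and the Evaluator. The Estimator is the algorithm or Pipeline to tune; the ParamGrid is a set of ParamMaps that defines the parameter search space; the Evaluator is the metric used to score a fitted model.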

CrossValidator

via https://github.com/JerryLead/blogs/blob/master/BigDataSystems/Spark/ML/Introduction%20to%20MLlib%20Pipeline.md

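CrossValidator splits the dataset into n folds, where n is the configured number of folds. Each round uses n-1 folds as the training set and the remaining fold as the validation set to train and evaluate a model. This is repeated n times, giving n evaluation results, and their average is the cross-validation score for the current ParamMap. The same procedure is then repeated for every candidate ParamMap. Finally, the best ParamMap is selected and the model is retrained with it on the full dataset, which yields the output model with the best parameters.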
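The procedure can be sketched in plain Scala without Spark. This is an illustration under simplifying assumptions, not Spark's implementation: folds are assigned by row index rather than randomly, the "model" is a toy that predicts a shrunken training mean, and the names `kFoldScore` and `shrunkMeanScore` are made up for this example.

```scala
// Average validation score of one parameter setting over k folds.
// trainAndScore(train, validation) must return a score where larger is better.
def kFoldScore(data: Seq[Double], k: Int)(trainAndScore: (Seq[Double], Seq[Double]) => Double): Double = {
  val indexed = data.zipWithIndex
  val scores = (0 until k).map { fold =>
    // Rows whose index falls in this fold are held out; the other k-1 folds are used for training.
    val (validation, train) = indexed.partition { case (_, idx) => idx % k == fold }
    trainAndScore(train.map(_._1), validation.map(_._1))
  }
  scores.sum / k
}

// Toy "model": predict the training mean shrunk by a regularization strength lambda,
// scored by negative mean squared error so that larger is better.
def shrunkMeanScore(lambda: Double): (Seq[Double], Seq[Double]) => Double = (train, validation) => {
  val prediction = train.sum / train.size / (1.0 + lambda)
  val mse = validation.map(y => math.pow(y - prediction, 2)).sum / validation.size
  -mse
}

val data = Seq(1.0, 2.0, 3.0, 4.0)
val candidates = Seq(0.0, 1.0) // plays the role of the ParamMap grid

// Score every candidate with 2-fold cross-validation and keep the best one.
val scored = candidates.map(lambda => lambda -> kFoldScore(data, 2)(shrunkMeanScore(lambda)))
val bestLambda = scored.maxBy(_._2)._1
println(s"scores=$scored, best lambda=$bestLambda")
```

The cost grows multiplicatively: a grid of 6 ParamMaps with 2 folds means 12 training runs, plus one final refit with the winning ParamMap.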

An example:

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// This snippet reuses pipeline, hashingTF, lr, and training from the Pipeline example above.
// With only four training rows a fold may contain a single class; use a larger training set in practice.

// We use a ParamGridBuilder to construct a grid of parameters to search over.
// With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
// this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

// We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
// This will allow us to jointly choose parameters for all Pipeline stages.
// A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
// Note that the evaluator here is a BinaryClassificationEvaluator and its default metric
// is areaUnderROC.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2) // Use 3+ in practice
  .setParallelism(2) // Evaluate up to 2 parameter settings in parallel

// Run cross-validation, and choose the best set of parameters.
val cvModel = cv.fit(training)

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "mapreduce spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")

// Make predictions on test documents. cvModel uses the best model found (lrModel).
cvModel.transform(test)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }

TrainValidationSplit

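TrainValidationSplit uses the trainRatio parameter to split the training set proportionally into a training set and a validation set: a trainRatio fraction of the samples is used for training and the remaining samples are used for validation.

Unlike CrossValidator, TrainValidationSplit evaluates each parameter combination only once, on that single split, rather than n times. It can loosely be thought of as a two-way split in which only one round is run, so it is cheaper than CrossValidator but gives less reliable estimates when the training set is small.

An example: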

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

// Prepare training and test data.
val data = spark.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt")
val Array(training, test) = data.randomSplit(Array(0.9, 0.1), seed = 12345)

val lr = new LinearRegression()
  .setMaxIter(10)

// We use a ParamGridBuilder to construct a grid of parameters to search over.
// TrainValidationSplit will try all combinations of values and determine best model using
// the evaluator.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.fitIntercept)
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

// In this case the estimator is simply the linear regression.
// A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
val trainValidationSplit = new TrainValidationSplit()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator)
  .setEstimatorParamMaps(paramGrid)
  // 80% of the data will be used for training and the remaining 20% for validation.
  .setTrainRatio(0.8)
  // Evaluate up to 2 parameter settings in parallel
  .setParallelism(2)

// Run train validation split, and choose the best set of parameters.
val model = trainValidationSplit.fit(training)

// Make predictions on test data. model is the model with combination of parameters
// that performed best.
model.transform(test)
  .select("features", "label", "prediction")
  .show()

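Implementing a Custom Transformer

Extend the Transformer class and implement the transform method, which usually appends one or more columns to the input DataFrame.

A Transformer with a single input column and a single output column can instead extend the UnaryTransformer class and implement its createTransformFunc method, which processes each row of the input column and returns the corresponding output.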
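A single-column Transformer along these lines might look like the sketch below. The class name `UpperCaser`, the column names, and the sample rows are hypothetical, and a local SparkSession is assumed.

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataType, StringType}

// Hypothetical example: upper-cases a string column.
class UpperCaser(override val uid: String)
    extends UnaryTransformer[String, String, UpperCaser] {

  def this() = this(Identifiable.randomUID("upperCaser"))

  // Applied to every value of the input column; this sketch does not handle null values.
  override protected def createTransformFunc: String => String = _.toUpperCase

  // Spark SQL type of the generated output column.
  override protected def outputDataType: DataType = StringType

  // Fail early if the input column is not a string column.
  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == StringType, s"Input type must be StringType but got $inputType")
}

// Assumption: a local session; in spark-shell, getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("unary-transformer-sketch").getOrCreate()

val df = spark.createDataFrame(Seq((0L, "spark"), (1L, "hadoop"))).toDF("id", "text")

val upper = new UpperCaser().setInputCol("text").setOutputCol("textUpper")
val result = upper.transform(df)
result.show()
```

Since it is an ordinary Transformer, an instance like this can also be placed in a Pipeline's stages next to the built-in ones.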

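In-House Machine Learning Framework

Machine learning technology advances quickly, yet there is a shortage of efficient, flexible frameworks that lower the cost of evaluating new techniques. Experience and know-how usually need frameworks and tools to be preserved. In addition, algorithm engineers are often constrained by compute, so a model that has been shown to work offline may never go into production because its inference-time complexity is too high.

For these reasons, the Meitu Data Technology Team built a tailor-made machine learning framework with the goal of "developing simple and flexible machine learning workflows, reducing the cost for algorithm engineers of researching new algorithms and the maintenance cost for engineers, and providing common in-domain solutions so that experience is preserved". It targets the problems above, especially those encountered in recommendation-related tasks. The framework consists of 3 components: Spark Feature, Bamboo, and Online Scorer.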

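Spark Feature: Training Sample Production

This component is mainly used to produce training samples. It implements flexible and efficient feature encoding for samples: any set of features can be encoded in the same space, and different feature sets can share an encoding space. To support this we introduce two concepts. The first is the "domain", which defines a group of features that share the same modeling process. The second is the "space", which defines a group of domains that share the same encoding space.

In the example figure above, "Old" shows sample feature encoding without the concepts of domain and space, where all features are numbered in a single sequence starting from 1. "New" shows the result after age and gender are placed in the age domain and the gender domain respectively: each domain is encoded starting from 1, independently of the other.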
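The difference between the "Old" and "New" encodings can be sketched in plain Scala. This is only an illustration of the idea, not the Spark Feature implementation: the `Feature` case class, the `encode` function, and the sample values are all hypothetical.

```scala
// A feature is identified by the domain it belongs to and its raw value.
final case class Feature(domain: String, value: String)

// Assign ids starting from 1 within each space. Domains that map to the
// same space share one id sequence; spaceOf maps a domain name to its space.
def encode(features: Seq[Feature], spaceOf: String => String): Map[Feature, Int] =
  features.distinct
    .groupBy(f => spaceOf(f.domain))
    .values
    .flatMap(group => group.zipWithIndex.map { case (f, i) => f -> (i + 1) })
    .toMap

val features = Seq(
  Feature("age", "18"), Feature("age", "25"),
  Feature("gender", "f"), Feature("gender", "m")
)

// "Old": every feature lives in one global space, so ids run 1..4.
val oldIds = encode(features, _ => "global")

// "New": each domain is its own space, so age and gender both start from 1.
val newIds = encode(features, domain => domain)

println(s"old=$oldIds")
println(s"new=$newIds")
```

Passing a `spaceOf` that maps several domains to the same space name would make those domains share one id sequence, which corresponds to the "space" concept described above.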

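Spark Feature ultimately uses TFRecords as the storage format for training samples.

Bamboo: Model Definition and Training

This component aims to make model definition and training extensible, efficient, simple, and fast. To that end, we followed these principles when designing Bamboo:

1. Layers interact through tensors: the input of a layer is a tensor, and so is its output;

2. To maximize offline and online efficiency, we avoid relying heavily on high-level APIs such as Keras. Most models and components are built on TensorFlow's low-level APIs, and the code is optimized according to TensorFlow's official performance guide;

3. An online-offline modeling framework is provided: complex computation is done offline and only lightweight computation is done online, which makes complex models easier to deploy;

4. Data loading, model training and export, and evaluation are encapsulated, and various auxiliary tools are provided, so users only need to define the forward inference network. A large number of commonly used layers are also packaged, which makes model definition faster.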

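Online Scorer: Online Prediction Service

The goal of Online Scorer is to provide a unified, efficient online inference service that can serve models exported from mainstream modeling frameworks such as TensorFlow, PyTorch, and XGBoost. This work is still in progress; the implementation details will be covered in a later dedicated article.

That concludes this brief introduction to Meitu's in-house machine learning framework. Stay tuned to the Meitu Data Technology Team for a more detailed introduction to the platform.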
