

Spark 2 ML Study Notes

Published: 2023/11/27


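Summary:
  1. Pipeline model
    1.1 Concepts
    1.2 Code example
  2. Feature extraction, transformation, and selection
    2.1 Feature extraction
    2.2 Feature transformation
    2.3 Feature selection
  3. Model selection and parameter tuning
    3.1 Cross-validation
    3.2 Train-test split
  4. SparkSession and Dataset, new in Spark 2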

Content:

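1. Pipeline model

  1.1 Concepts

    DataFrame: the ML API uses the DataFrame from Spark SQL as its dataset. It can hold columns of many data types, such as text, feature vectors, labels, and predictions.

    Transformer: an algorithm that turns one DataFrame into another DataFrame by implementing the transform() method.
    Estimator: an algorithm that is fit on a DataFrame to produce a Transformer, by implementing the fit() method.

    Pipeline: chains multiple Transformers and Estimators together into one ML workflow.

    Parameter: Transformers and Estimators share a common API for specifying parameters.

    In the pipeline figure, the stages marked in blue are Transformers (Tokenizer and HashingTF), and the stage marked in red is an Estimator (LogisticRegression).

  1.2 Code example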

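import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Assumption: `training` (columns id, text, label) and `test` (columns id, text)
// are DataFrames created elsewhere.

// Stage 1: split the "text" column into words.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
// Stage 2: hash the words into a fixed-length term-frequency vector.
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
// Stage 3: the Estimator that is trained on the "features" column.
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

// Make predictions on test documents.
model.transform(test)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }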

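2. Feature extraction, transformation, and selection

  2.1 Feature extraction

- TF-IDF: weights each term by how important it is to a document within the corpus, which makes it useful for picking out a document's keywords
- Word2Vec: maps each word to a vector; the fitted model turns a document into a single vector by averaging the vectors of its words
- CountVectorizer: converts a collection of text documents into vectors of token counts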
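To make the TF-IDF item above concrete, here is a minimal sketch in plain Scala, without Spark. It assumes the smoothed IDF form log((m + 1) / (df + 1)) given in the Spark MLlib documentation, where m is the number of documents and df is the number of documents containing the term. The object name `TfIdfSketch`, the toy corpus, and the helper names are hypothetical and only serve the illustration; in Spark the term-frequency step would be done by HashingTF or CountVectorizer instead of a plain Map.

```scala
object TfIdfSketch {
  // Toy corpus (hypothetical data): each document is already tokenized.
  val docs: Seq[Seq[String]] = Seq(
    Seq("spark", "ml", "pipeline"),
    Seq("spark", "sql"),
    Seq("spark", "ml", "ml")
  )

  // Term frequency: raw count of each term inside one document.
  def tf(doc: Seq[String]): Map[String, Int] =
    doc.groupBy(identity).map { case (term, hits) => term -> hits.size }

  // Document frequency: number of documents that contain each term.
  val df: Map[String, Int] =
    docs.flatMap(_.distinct).groupBy(identity).map { case (term, hits) => term -> hits.size }

  // Smoothed inverse document frequency: log((m + 1) / (df + 1)).
  def idf(term: String): Double =
    math.log((docs.size + 1.0) / (df.getOrElse(term, 0) + 1.0))

  // TF-IDF weight of every term in one document.
  def tfIdf(doc: Seq[String]): Map[String, Double] =
    tf(doc).map { case (term, count) => term -> count * idf(term) }
}
```

A term that occurs in every document ("spark" here) gets an IDF of zero, so it carries no weight, while rarer terms such as "pipeline" are weighted more heavily.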

  2.2 Feature transformation

- Tokenizer: splits text into words
- StopWordsRemover: filters stop words out of a token sequence. Note: the list of stop words is specified by the stopWords parameter. Default stop words for some languages are accessible by calling StopWordsRemover.loadDefaultStopWords(language)
- Binarizer
- PCA: principal component analysis, a dimensionality-reduction method that projects the features onto the directions of greatest variance and so keeps the most informative components
- PolynomialExpansion: expands the features into a polynomial space
- Discrete Cosine Transform (DCT)
- StringIndexer
- IndexToString
- OneHotEncoder: one-hot encoding
- VectorIndexer
- ---- Standardization and normalization ----
- Normalizer: scales each vector to unit norm; see http://www.cnblogs.com/arachis/p/Regulazation.html
- StandardScaler: standardization method 1: (x - mean) / standard deviation
- MinMaxScaler: standardization method 2: (x - min) / (max - min)
- MaxAbsScaler: standardization method 3: x / max(|x|)
- ---- Discretization ----
- Bucketizer: bins a continuous feature into buckets whose lower and upper bounds are specified explicitly
- QuantileDiscretizer: quantile-based (equal-frequency) discretization
- ---- Cross features ----
- ElementwiseProduct
- ---- SQL ----
- SQLTransformer
- VectorAssembler
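The three scalers in the list above differ only in which statistic they subtract and divide by. The following is a rough sketch in plain Scala, assuming the standard textbook definitions and a single feature column; it is not Spark's implementation. The object name `ScalingSketch` and the sample values are hypothetical. The standard deviation uses the n - 1 divisor, which is an assumption made here to match the corrected sample standard deviation described in the Spark documentation for StandardScaler.

```scala
object ScalingSketch {
  // One feature column as a plain sequence of values (hypothetical data).
  val xs: Seq[Double] = Seq(1.0, 2.0, 3.0, 4.0, 5.0)

  val mean: Double = xs.sum / xs.size
  // Corrected sample standard deviation (divides by n - 1).
  val std: Double =
    math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / (xs.size - 1))

  // Standardization: (x - mean) / standard deviation.
  val standardized: Seq[Double] = xs.map(x => (x - mean) / std)

  // Min-max scaling into [lo, hi]: (x - min) / (max - min) * (hi - lo) + lo.
  def minMax(lo: Double, hi: Double): Seq[Double] =
    xs.map(x => (x - xs.min) / (xs.max - xs.min) * (hi - lo) + lo)

  // Max-abs scaling: x / max(|x|), which maps values into [-1, 1].
  val maxAbs: Seq[Double] = {
    val m = xs.map(x => math.abs(x)).max
    xs.map(_ / m)
  }
}
```

Calling `ScalingSketch.minMax(0.0, 1.0)` rescales the column to the unit interval. Max-abs scaling does not shift the data, so it preserves zeros, which matters for sparse vectors.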

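  2.3 Feature selection

- VectorSlicer: selects a subset of the features in a vector, either by index or by feature name
- RFormula: turns the columns of the data into features according to an R model formula; the output is a feature vector and a label of type Double. See the R documentation
- ChiSqSelector: selects features using the chi-squared test (reduces dimensionality)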

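3. Model selection and parameter tuning

  3.1 Cross-validation

    Split the data into K folds; in each evaluation round one fold serves as the test set and the remaining K - 1 folds form the training set.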
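The K-fold procedure described above can be sketched in plain Scala on row indices. This is only an illustration of the splitting step, not the code Spark's CrossValidator runs; the names `KFoldSketch` and `kFold` and the default seed are hypothetical.

```scala
object KFoldSketch {
  // Split the row indices 0 until n into k folds.
  // Returns one (trainIndices, testIndices) pair per fold.
  def kFold(n: Int, k: Int, seed: Long = 42L): Seq[(Seq[Int], Seq[Int])] = {
    val shuffled = new scala.util.Random(seed).shuffle((0 until n).toList)
    // The element at shuffled position i goes to fold i % k,
    // so fold sizes differ by at most one.
    val folds = shuffled.zipWithIndex
      .groupBy { case (_, position) => position % k }
      .toSeq
      .sortBy(_._1)
      .map { case (_, members) => members.map(_._1) }
    // Each fold serves once as the test set; all other rows are the training set.
    folds.map { test => (shuffled.filter(i => !test.contains(i)), test) }
  }
}
```

For example, `KFoldSketch.kFold(10, 5)` yields five pairs, each with eight training indices and two test indices. A tuner would train and score the model once per pair and average the scores for each parameter combination.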

  3.2 Train-test split

    Split the data into a training set and a test set according to a fixed ratio.

Code example:

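import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator

// Assumption: `pipeline` is the Pipeline from section 1.2 and `paramGrid` is an
// Array[ParamMap] of candidate settings, for example built with ParamGridBuilder.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2) // Use 3+ in practice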
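The example above covers cross-validation only. For the fixed-ratio split of section 3.2, Spark ML provides TrainValidationSplit, which evaluates each parameter combination once instead of once per fold. The sketch below shows one way to configure it together with a parameter grid. The object name `TuningSketch`, the grid values, and the 0.8 ratio are illustrative assumptions and do not come from the original notes.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

object TuningSketch {
  // The same three stages as the pipeline in section 1.2.
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF()
    .setInputCol(tokenizer.getOutputCol)
    .setOutputCol("features")
  val lr = new LogisticRegression().setMaxIter(10)
  val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

  // 3 values x 2 values = 6 candidate combinations (values are illustrative).
  val paramGrid = new ParamGridBuilder()
    .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
    .addGrid(lr.regParam, Array(0.1, 0.01))
    .build()

  // Single split: 80% of the rows train each candidate, 20% validate it.
  val tvs = new TrainValidationSplit()
    .setEstimator(pipeline)
    .setEvaluator(new BinaryClassificationEvaluator)
    .setEstimatorParamMaps(paramGrid)
    .setTrainRatio(0.8)
}
```

Given a DataFrame `training` with `text` and `label` columns (assumed here), `TuningSketch.tvs.fit(training)` would return the model fitted with the best-scoring combination. A single split is cheaper than K-fold cross-validation but gives a noisier estimate when the dataset is small.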

4. SparkSession and Dataset, new in Spark 2

http://blog.csdn.net/yhao2014/article/details/52215966

http://blog.csdn.net/u013063153/article/details/54615378
http://blog.csdn.net/lsshlsw/article/details/52489503

Reposted from: https://www.cnblogs.com/arachis/p/Spark2_ML.html
