Machine Learning on Spark: Statistics Basics (Part 1)

Published: 2024/1/23

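Contents of this article

This article introduces the statistics classes in the org.apache.spark.mllib.stat package and its subpackages. The stat package contains the classes and objects shown in the figure below:

[Figure: classes and objects in the stat package]

The following parts are covered in detail:

Column-wise matrix statistics

Kernel density estimation

Hypothesis testing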

1. Column-wise matrix statistics

Column-wise statistics are summary statistics computed separately for each column of a matrix, such as the maximum, minimum, and mean of every column.

```scala
package cn.ml.stat

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary

object StatisticsDemo extends App {
  val sparkConf = new SparkConf()
    .setAppName("StatisticsDemo")
    .setMaster("spark://sparkmaster:7077")
  val sc = new SparkContext(sparkConf)

  // Three rows and four columns; each row becomes a dense vector.
  val rdd1 = sc.parallelize(Array(
    Array(1.0, 2.0, 3.0, 4.0),
    Array(2.0, 3.0, 4.0, 5.0),
    Array(3.0, 4.0, 5.0, 6.0)
  )).map(f => Vectors.dense(f))

  // The first part of this series obtained a MultivariateStatisticalSummary with
  //   var mss: MultivariateStatisticalSummary = rowMatrix.computeColumnSummaryStatistics()
  // Here the same statistics are obtained through Statistics. Both routes share the
  // same implementation and in the end return an instance of
  // MultivariateOnlineSummarizer (explained in the next part).
  // The source of Statistics.colStats is:
  //   def colStats(X: RDD[Vector]): MultivariateStatisticalSummary = {
  //     new RowMatrix(X).computeColumnSummaryStatistics()
  //   }
  // so Statistics.colStats simply calls RowMatrix.computeColumnSummaryStatistics.
  val mss: MultivariateStatisticalSummary = Statistics.colStats(rdd1)

  // These values therefore match the ones returned by
  // computeColumnSummaryStatistics in the first part.
  println(mss.max)
  println(mss.min)
  println(mss.normL1)
  // normL2 and the other statistics are read in the same way.

  sc.stop()
}
```
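To make it concrete what "column-wise" means, here is a minimal sketch in plain Scala that computes the same kind of per-column values for the small matrix used above, without Spark. The names ColumnStatsSketch, ColumnSummary, and summarize are made up for this illustration and are not part of the Spark API; the sketch only mirrors a few of the fields that a MultivariateStatisticalSummary exposes.

```scala
object ColumnStatsSketch {
  // Hypothetical container for a few per-column statistics (not a Spark class).
  final case class ColumnSummary(
      max: Array[Double],
      min: Array[Double],
      mean: Array[Double],
      normL1: Array[Double])

  // Computes the statistics column by column for a small in-memory matrix given as rows.
  def summarize(rows: Seq[Array[Double]]): ColumnSummary = {
    require(rows.nonEmpty, "at least one row is required")
    // Transpose the rows so that each inner sequence holds one column.
    val columns: Seq[Seq[Double]] = rows.map(_.toSeq).transpose
    ColumnSummary(
      max = columns.map(_.max).toArray,
      min = columns.map(_.min).toArray,
      mean = columns.map(c => c.sum / c.size).toArray,
      // L1 norm of a column: sum of absolute values.
      normL1 = columns.map(c => c.map(v => math.abs(v)).sum).toArray)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Array(1.0, 2.0, 3.0, 4.0),
      Array(2.0, 3.0, 4.0, 5.0),
      Array(3.0, 4.0, 5.0, 6.0))
    val summary = summarize(rows)
    println("max:    " + summary.max.mkString(", "))
    println("min:    " + summary.min.mkString(", "))
    println("mean:   " + summary.mean.mkString(", "))
    println("normL1: " + summary.normL1.mkString(", "))
  }
}
```

Each resulting array has one entry per column, which is also the shape of the vectors returned by mss.max, mss.min, and mss.normL1.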

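2. Kernel density estimation

Kernel density estimation (KDE) plays an important role in statistics. It is a non-parametric method for estimating the probability density function of a random variable. Let (x1, x2, …, xn) be n independent and identically distributed samples. The estimate of their probability density function is defined as:

$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

where K(·) is called the kernel and h is called the bandwidth, a smoothing parameter greater than 0. For more details see https://en.wikipedia.org/wiki/Kernel_density_estimation

There are many kinds of kernel functions, but Spark implements only the Gaussian kernel:

$$K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$$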

```scala
import org.apache.spark.mllib.stat.KernelDensity

// sc is an existing SparkContext
val sample = sc.parallelize(Seq(0.0, 1.0, 4.0, 4.0))
val kernelDensity = new KernelDensity()
  .setSample(sample)   // sample used for the density estimate
  .setBandwidth(3.0)   // bandwidth; for the Gaussian kernel this is the standard deviation

// Estimate the probability density at the given points
// densities: Array[Double] =
//   Array(0.07464879256673691, 0.1113106036883375, 0.08485447240456075)
val densities = kernelDensity.estimate(Array(-1.0, 2.0, 5.0))
```
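The densities quoted in the comment above can be checked by hand. The following is a minimal plain-Scala sketch of a Gaussian KDE, assuming the standard definition (the average of Gaussian densities centred on each sample point, with the bandwidth as standard deviation). GaussianKdeSketch, gaussianKernel, and estimate are illustrative names, not Spark classes or methods, and the sketch makes no claim about how Spark implements the computation internally.

```scala
object GaussianKdeSketch {
  // Standard normal density, used as the kernel K.
  def gaussianKernel(u: Double): Double =
    math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.Pi)

  // f_h(x) = 1 / (n * h) * sum_i K((x - x_i) / h), evaluated at every requested point.
  def estimate(sample: Seq[Double], bandwidth: Double, points: Seq[Double]): Seq[Double] = {
    require(sample.nonEmpty, "sample must not be empty")
    require(bandwidth > 0.0, "bandwidth must be positive")
    points.map { x =>
      val kernelSum = sample.map(xi => gaussianKernel((x - xi) / bandwidth)).sum
      kernelSum / (sample.size * bandwidth)
    }
  }

  def main(args: Array[String]): Unit = {
    // Same sample, bandwidth, and evaluation points as in the Spark example.
    val densities = estimate(Seq(0.0, 1.0, 4.0, 4.0), 3.0, Seq(-1.0, 2.0, 5.0))
    println(densities.mkString(", "))
  }
}
```

With the same sample, bandwidth, and evaluation points, the printed values should agree with the Spark output above up to floating-point rounding.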

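3. Hypothesis testing

Hypothesis testing is used in statistics to draw inferences about a population from a sample under a stated hypothesis, and so to decide whether to accept or reject that hypothesis. There are many hypothesis testing methods; for details see http://baike.baidu.com/link?url=f3DhyOL_9OLVupNkCk82fdOhYOvYKzTWSVNyJqDNBD2hqr1nSlxmqpMiStqnWgNrW3ni9U_kZgy2GA5_8kSAHa.

At the time of writing, Spark provides only Pearson's chi-squared (χ²) test, which was derived by the statistician Karl Pearson. Theory shows that the squared difference between the observed count (fo) and the expected count (fe), divided by the expected count and summed over all categories, gives a statistic that approximately follows a chi-squared distribution.

The chi-squared test has two main applications: the goodness-of-fit test and the independence test. The goodness-of-fit test checks whether the observed counts agree with the expected counts and applies to count data classified by a single factor. The independence test checks whether two or more factors, each with several categories, are associated or independent (see http://en.wikipedia.org/wiki/Chi-squared_test).

In Spark, the goodness-of-fit test takes a Vector as input and the independence test takes a Matrix. An independence test on RDD[LabeledPoint] is also supported. The corresponding methods are: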

```scala
// Independence test on labeled feature vectors (LabeledPoint); returns Array[ChiSqTestResult].
// Only the PEARSON method, i.e. the chi-squared test, is currently supported.
/**
 * Conduct Pearson's independence test for each feature against the label across the input RDD.
 * The contingency table is constructed from the raw (feature, label) pairs and used to conduct
 * the independence test.
 * Returns an array containing the ChiSquaredTestResult for every feature against the label.
 */
def chiSquaredFeatures(data: RDD[LabeledPoint],
    methodName: String = PEARSON.name): Array[ChiSqTestResult]

// Goodness-of-fit test on a Vector.
// Only the PEARSON method, i.e. the chi-squared test, is currently supported.
/*
 * Pearson's goodness of fit test on the input observed and expected counts/relative frequencies.
 * Uniform distribution is assumed when `expected` is not passed in.
 */
def chiSquared(observed: Vector,
    expected: Vector = Vectors.dense(Array[Double]()),
    methodName: String = PEARSON.name): ChiSqTestResult

// Independence test; the input must be a Matrix.
// Only the PEARSON method, i.e. the chi-squared test, is currently supported.
/*
 * Pearson's independence test on the input contingency matrix.
 * TODO: optimize for SparseMatrix when it becomes supported.
 */
def chiSquaredMatrix(counts: Matrix, methodName: String = PEARSON.name): ChiSqTestResult
```

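Suppose there are two plots of land, and the following data are used to test whether the proportion of red flowers is the same on both:
Plot 1: red flowers 1000, blue flowers 1856
Plot 2: red flowers 400, blue flowers 560

The code uses Statistics.chiSqTest with two vectors, the first as observed counts and the second as expected counts: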

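```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val land1 = Vectors.dense(1000.0, 1856.0)  // plot 1: red, blue
val land2 = Vectors.dense(400.0, 560.0)    // plot 2: red, blue
// Goodness-of-fit test: land1 is the observed vector, land2 the expected one
val c1 = Statistics.chiSqTest(land1, land2)
```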

Output:

```
c1: org.apache.spark.mllib.stat.test.ChiSqTestResult =
Chi squared test summary:
method: pearson
degrees of freedom = 1
statistic = 52.0048019207683
pValue = 5.536682223805656E-13
Very strong presumption against null hypothesis: observed follows the same distribution as expected..
```

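Judging from the result alone, the p-value (about 5.5E-13) is far below any common significance level, so the null hypothesis is rejected: the two sets of data do not follow the same distribution, which means the proportion of red flowers differs between the two plots.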
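The statistic reported above can be reproduced by hand. Below is a minimal plain-Scala sketch of the two uses of Pearson's statistic described in this section. It assumes that, in the goodness-of-fit case, the expected vector is rescaled so that it sums to the observed total, which is the assumption under which the value 52.0048 above is obtained. ChiSquaredSketch, goodnessOfFit, and independence are illustrative names, not Spark methods, and the sketch does not compute a p-value.

```scala
object ChiSquaredSketch {
  // Pearson statistic: sum over all cells of (observed - expected)^2 / expected.
  private def pearson(observed: Seq[Double], expected: Seq[Double]): Double =
    observed.zip(expected).map { case (o, e) => (o - e) * (o - e) / e }.sum

  // Goodness of fit: `expected` is rescaled so that it sums to the observed total.
  def goodnessOfFit(observed: Seq[Double], expected: Seq[Double]): Double = {
    require(observed.size == expected.size, "vectors must have the same length")
    val scale = observed.sum / expected.sum
    pearson(observed, expected.map(_ * scale))
  }

  // Independence on a contingency table given as rows:
  // expected cell count = rowTotal * columnTotal / grandTotal.
  def independence(table: Seq[Seq[Double]]): Double = {
    val rowTotals = table.map(_.sum)
    val colTotals = table.transpose.map(_.sum)
    val total = rowTotals.sum
    // Row-major order, matching table.flatten.
    val expected = for (r <- rowTotals; c <- colTotals) yield r * c / total
    pearson(table.flatten, expected)
  }

  def main(args: Array[String]): Unit = {
    val plot1 = Seq(1000.0, 1856.0) // red, blue
    val plot2 = Seq(400.0, 560.0)   // red, blue
    println("goodness of fit: " + goodnessOfFit(plot1, plot2))
    println("independence:    " + independence(Seq(plot1, plot2)))
  }
}
```

The goodness-of-fit variant treats the second plot as the reference distribution, which is what passing two vectors to Statistics.chiSqTest does in the example above. The independence variant instead treats both plots as rows of one 2x2 contingency table, which corresponds to the Matrix input described earlier and generally yields a different statistic.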
