Spark (1.1) MLlib source code analysis

Published: 2025/7/25

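Spark MLlib 1.1 added the stat package, which contains a number of statistics-related functions. This article analyzes the principles and the implementation of its chi-squared test.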


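I. Principles

  The stat package implements Pearson's chi-squared test, which covers the following two kinds of test:

    (1) Goodness of fit test: checks whether the frequency distribution of a set of observations differs from a theoretical distribution.

    (2) Independence test: checks whether paired observations drawn from two variables are independent of each other (for example, sample one person from country A and one from country B each time, and check whether their responses are unrelated to nationality).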

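  Formula:

    \chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}

    where O_i is the observed value and E_i is the expected value of category i.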
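    As a small worked example of the statistic (the counts below are made up for illustration and do not come from the original article), take three categories with observed counts 10, 20 and 30 and a uniform expectation of 20 per category. The degrees of freedom follow the "number of categories minus one" rule, and the closed-form p-value shown holds only for two degrees of freedom:

```latex
O = (10,\ 20,\ 30), \qquad E = (20,\ 20,\ 20)

\chi^2 = \frac{(10-20)^2}{20} + \frac{(20-20)^2}{20} + \frac{(30-20)^2}{20}
       = 5 + 0 + 5 = 10

\mathrm{df} = 3 - 1 = 2, \qquad
p = P(\chi^2_{2} \ge 10) = e^{-10/2} = e^{-5} \approx 0.0067
```

    With a p-value this small, the hypothesis that the three categories are equally likely would be rejected at the usual 5% level.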

  For the underlying theory, see: http://zh.wikipedia.org/wiki/%E7%9A%AE%E7%88%BE%E6%A3%AE%E5%8D%A1%E6%96%B9%E6%AA%A2%E5%AE%9A


II. Java API usage example

  https://github.com/tovin-xu/mllib_example/blob/master/src/main/java/com/mllib/example/stat/ChiSquaredSuite.java


III. Source code analysis

  1. External API

    The Statistics class exposes four external interfaces:

// Goodness of Fit test
def chiSqTest(observed: Vector, expected: Vector): ChiSqTestResult = {
  ChiSqTest.chiSquared(observed, expected)
}

// Goodness of Fit test
def chiSqTest(observed: Vector): ChiSqTestResult = ChiSqTest.chiSquared(observed)

// independence test
def chiSqTest(observed: Matrix): ChiSqTestResult = ChiSqTest.chiSquaredMatrix(observed)

// independence test
def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult] = {
  ChiSqTest.chiSquaredFeatures(data)
}

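  2. Goodness of fit test implementation

  This one is fairly simple. The key step is computing the chi-squared value by summing (observed - expected)^2 / expected over all categories.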

/**
 * Pearson's goodness of fit test on the input observed and expected counts/relative frequencies.
 * Uniform distribution is assumed when `expected` is not passed in.
 */
def chiSquared(observed: Vector,
    expected: Vector = Vectors.dense(Array[Double]()),
    methodName: String = PEARSON.name): ChiSqTestResult = {

  // Validate input arguments
  val method = methodFromString(methodName)
  if (expected.size != 0 && observed.size != expected.size) {
    throw new IllegalArgumentException("observed and expected must be of the same size.")
  }
  val size = observed.size
  if (size > 1000) {
    logWarning("Chi-squared approximation may not be accurate due to low expected frequencies "
      + s" as a result of a large number of categories: $size.")
  }
  val obsArr = observed.toArray
  // If expected is not set, every category defaults to 1.0 / size
  val expArr = if (expected.size == 0) Array.tabulate(size)(_ => 1.0 / size) else expected.toArray
  // Both the observed and the expected values must be non-negative
  if (!obsArr.forall(_ >= 0.0)) {
    throw new IllegalArgumentException("Negative entries disallowed in the observed vector.")
  }
  if (expected.size != 0 && ! expArr.forall(_ >= 0.0)) {
    throw new IllegalArgumentException("Negative entries disallowed in the expected vector.")
  }

  // Determine the scaling factor for expected
  val obsSum = obsArr.sum
  val expSum = if (expected.size == 0.0) 1.0 else expArr.sum
  val scale = if (math.abs(obsSum - expSum) < 1e-7) 1.0 else obsSum / expSum

  // compute chi-squared statistic
  val statistic = obsArr.zip(expArr).foldLeft(0.0) { case (stat, (obs, exp)) =>
    if (exp == 0.0) {
      if (obs == 0.0) {
        throw new IllegalArgumentException("Chi-squared statistic undefined for input vectors due"
          + " to 0.0 values in both observed and expected.")
      } else {
        return new ChiSqTestResult(0.0, size - 1, Double.PositiveInfinity, PEARSON.name,
          NullHypothesis.goodnessOfFit.toString)
      }
    }
    // Compute (observed - expected)^2 / expected
    if (scale == 1.0) {
      stat + method.chiSqFunc(obs, exp)
    } else {
      stat + method.chiSqFunc(obs, exp * scale)
    }
  }
  val df = size - 1
  val pValue = chiSquareComplemented(df, statistic)
  new ChiSqTestResult(pValue, df, statistic, PEARSON.name, NullHypothesis.goodnessOfFit.toString)
}
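  The scaling step is the least obvious part of the listing: the expected vector may be given as relative frequencies or as counts, so it is rescaled to sum to the same total as the observed vector before the statistic is computed. The following is a minimal stand-alone sketch of that idea in plain Scala, without Spark. The object name `GoodnessOfFitSketch` and the method `pearson` are hypothetical names chosen for illustration, the p-value is not computed, and a zero expected entry is simply rejected instead of being mapped to an infinite statistic as in the MLlib code.

```scala
// Illustrative sketch only; names are hypothetical and not part of MLlib.
object GoodnessOfFitSketch {

  /** Returns (Pearson statistic, degrees of freedom).
    * An empty `expected` array means a uniform distribution is assumed. */
  def pearson(observed: Array[Double],
              expected: Array[Double] = Array.empty[Double]): (Double, Int) = {
    require(expected.isEmpty || expected.length == observed.length,
      "observed and expected must be of the same size.")
    require(observed.forall(_ >= 0.0) && expected.forall(_ >= 0.0),
      "Negative entries disallowed.")

    val size = observed.length
    // Uniform relative frequencies when no expected vector is supplied.
    val exp = if (expected.isEmpty) Array.fill(size)(1.0 / size) else expected

    // Rescale expected so that it sums to the same total as observed.
    val obsSum = observed.sum
    val expSum = exp.sum
    val scale = if (math.abs(obsSum - expSum) < 1e-7) 1.0 else obsSum / expSum

    // Sum (observed - expected)^2 / expected over all categories.
    val statistic = observed.zip(exp).map { case (o, e) =>
      val scaled = e * scale
      // Simplification: the sketch rejects zero expected values outright.
      require(scaled > 0.0, "expected entries must be positive in this sketch.")
      (o - scaled) * (o - scaled) / scaled
    }.sum

    (statistic, size - 1)
  }
}
```

  Calling `GoodnessOfFitSketch.pearson(Array(10.0, 20.0, 30.0))` and calling it with an explicit expected vector `Array(1.0, 1.0, 1.0)` should give the same statistic, because both expected vectors are rescaled to 20 per category.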

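  3. Independence test implementation

    First compute the expected values with the following formula, where the matrix has r rows and c columns:

    E_{i,j} = \frac{\left(\sum_{k=1}^{c} O_{i,k}\right)\left(\sum_{k=1}^{r} O_{k,j}\right)}{N}, \qquad N = \sum_{i=1}^{r}\sum_{j=1}^{c} O_{i,j}

    Then compute the chi-squared value by summing (observed - expected)^2 / expected over all cells.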

/**
 * Pearson's independence test on the input contingency matrix.
 * TODO: optimize for SparseMatrix when it becomes supported.
 */
def chiSquaredMatrix(counts: Matrix, methodName: String = PEARSON.name): ChiSqTestResult = {
  val method = methodFromString(methodName)
  val numRows = counts.numRows
  val numCols = counts.numCols

  // get row and column sums
  val colSums = new Array[Double](numCols)
  val rowSums = new Array[Double](numRows)
  val colMajorArr = counts.toArray
  var i = 0
  while (i < colMajorArr.size) {
    val elem = colMajorArr(i)
    if (elem < 0.0) {
      throw new IllegalArgumentException("Contingency table cannot contain negative entries.")
    }
    colSums(i / numRows) += elem
    rowSums(i % numRows) += elem
    i += 1
  }
  val total = colSums.sum

  // second pass to collect statistic
  var statistic = 0.0
  var j = 0
  while (j < colMajorArr.size) {
    val col = j / numRows
    val colSum = colSums(col)
    if (colSum == 0.0) {
      throw new IllegalArgumentException("Chi-squared statistic undefined for input matrix due to"
        + s"0 sum in column [$col].")
    }
    val row = j % numRows
    val rowSum = rowSums(row)
    if (rowSum == 0.0) {
      throw new IllegalArgumentException("Chi-squared statistic undefined for input matrix due to"
        + s"0 sum in row [$row].")
    }
    val expected = colSum * rowSum / total
    statistic += method.chiSqFunc(colMajorArr(j), expected)
    j += 1
  }
  val df = (numCols - 1) * (numRows - 1)
  val pValue = chiSquareComplemented(df, statistic)
  new ChiSqTestResult(pValue, df, statistic, methodName, NullHypothesis.independence.toString)
}
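The index arithmetic in the listing relies on the matrix being stored column by column, so element (row, col) sits at index col * numRows + row, which is why `i / numRows` gives the column and `i % numRows` gives the row. The sketch below reproduces the two passes on a plain array, without Spark. The object name `IndependenceSketch` and the method `pearson` are hypothetical names for illustration only, and the p-value is not computed.

```scala
// Illustrative sketch only; names are hypothetical and not part of MLlib.
object IndependenceSketch {

  /** Returns (Pearson statistic, degrees of freedom) for a contingency table
    * stored column-major: element (row, col) is at index col * numRows + row. */
  def pearson(numRows: Int, numCols: Int, colMajor: Array[Double]): (Double, Int) = {
    require(colMajor.length == numRows * numCols,
      "array length must equal numRows * numCols.")
    require(colMajor.forall(_ >= 0.0),
      "Contingency table cannot contain negative entries.")

    // First pass: accumulate the row and column totals.
    val rowSums = new Array[Double](numRows)
    val colSums = new Array[Double](numCols)
    for (i <- colMajor.indices) {
      colSums(i / numRows) += colMajor(i)
      rowSums(i % numRows) += colMajor(i)
    }
    require(rowSums.forall(_ > 0.0) && colSums.forall(_ > 0.0),
      "Chi-squared statistic undefined when a row or column sums to 0.")
    val total = colSums.sum

    // Second pass: expected = rowSum * colSum / total for each cell,
    // then add (observed - expected)^2 / expected to the statistic.
    var statistic = 0.0
    for (j <- colMajor.indices) {
      val expected = colSums(j / numRows) * rowSums(j % numRows) / total
      val diff = colMajor(j) - expected
      statistic += diff * diff / expected
    }

    (statistic, (numRows - 1) * (numCols - 1))
  }
}
```

For the 2 x 2 table with rows (10, 20) and (30, 40), the column-major array is `Array(10.0, 30.0, 20.0, 40.0)`. The expected counts are 12, 18, 28 and 42, so the statistic works out to 4/12 + 4/18 + 4/28 + 4/42 = 50/63 with one degree of freedom.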

Copyright notice: this is an original blog article and may not be reproduced without the author's permission.

Reposted from: https://www.cnblogs.com/zfyouxi/p/4731120.html
