
Solving a Binary Classification Problem with Binomial Logistic Regression

Published: 2025/7/25 62 豆豆


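Logistic regression:

Logistic regression is a classic classification method in statistical learning and belongs to the family of log-linear models. The dependent variable in logistic regression can be binary or multiclass.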

Basic principles

The logistic distribution

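Let X be a continuous random variable. X follows the logistic distribution if it has the following distribution function and density function:

$$F(x) = P(X \le x) = \frac{1}{1 + e^{-(x-\mu)/\gamma}}$$

$$f(x) = F'(x) = \frac{e^{-(x-\mu)/\gamma}}{\gamma \left(1 + e^{-(x-\mu)/\gamma}\right)^{2}}$$

where $\mu$ is the location parameter and $\gamma > 0$ is the shape parameter.

[Figure: graphs of f(x) and F(x)]

The distribution function $F(x)$ is centrally symmetric about the point $(\mu, \tfrac{1}{2})$, and the smaller $\gamma$ is, the faster the curve changes near the center.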
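The symmetry and the effect of the shape parameter can be checked numerically. The sketch below assumes the standard form of the logistic distribution, F(x) = 1 / (1 + e^{-(x-μ)/γ}); the function names `logistic_cdf` and `logistic_pdf` and the sample values are illustrative only.

```python
import math

def logistic_cdf(x, mu=0.0, gamma=1.0):
    """Distribution function F(x) of the logistic distribution."""
    return 1.0 / (1.0 + math.exp(-(x - mu) / gamma))

def logistic_pdf(x, mu=0.0, gamma=1.0):
    """Density function f(x) = F'(x) of the logistic distribution."""
    z = math.exp(-(x - mu) / gamma)
    return z / (gamma * (1.0 + z) ** 2)

mu = 2.0  # illustrative location parameter
for gamma in (0.5, 1.0, 2.0):
    # F(mu) is always 1/2, and F(mu + d) + F(mu - d) = 1 (central symmetry).
    d = 0.7
    sym = logistic_cdf(mu + d, mu, gamma) + logistic_cdf(mu - d, mu, gamma)
    # The slope at the center is f(mu) = 1 / (4 * gamma): smaller gamma, steeper curve.
    print(gamma, logistic_cdf(mu, mu, gamma), sym, logistic_pdf(mu, mu, gamma))
```

The slope at the center, f(μ) = 1/(4γ), is what "the curve changes faster for smaller γ" means in practice.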

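The binomial logistic regression model

The binomial logistic regression model is:

$$P(Y=1 \mid x) = \frac{\exp(w \cdot x + b)}{1 + \exp(w \cdot x + b)}$$

$$P(Y=0 \mid x) = \frac{1}{1 + \exp(w \cdot x + b)}$$

where $x \in \mathbb{R}^n$ is the input, $Y \in \{0, 1\}$ is the output, $w$ is called the weight vector, $b$ is called the bias, and $w \cdot x$ is the inner product of $w$ and $x$.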
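As a small illustration of the model, the sketch below computes the two class probabilities for a single input using the standard form P(Y=1|x) = exp(w·x + b) / (1 + exp(w·x + b)). The weights, bias, and input are made-up values, and `predict_proba` and `predict` are hypothetical helper names, not part of any library.

```python
import math

def predict_proba(w, b, x):
    """Return (P(Y=0|x), P(Y=1|x)) for the binomial logistic regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # w . x + b
    p1 = 1.0 / (1.0 + math.exp(-z))               # equals exp(z) / (1 + exp(z))
    return 1.0 - p1, p1

def predict(w, b, x, threshold=0.5):
    """Assign class 1 when P(Y=1|x) reaches the threshold, else class 0."""
    return 1 if predict_proba(w, b, x)[1] >= threshold else 0

# Illustrative (made-up) parameters and input.
w = [0.8, -0.4]
b = -0.1
x = [1.5, 2.0]
p0, p1 = predict_proba(w, b, x)
print(p0, p1, predict(w, b, x))
```

The log-odds log(p1 / p0) equals w·x + b, which is why the model is described as log-linear.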

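Parameter estimation

Let:

$$P(Y=1 \mid x) = \pi(x), \qquad P(Y=0 \mid x) = 1 - \pi(x)$$

Given training data $(x_i, y_i)$, $i = 1, \dots, N$, with $y_i \in \{0, 1\}$, the likelihood function is:

$$\prod_{i=1}^{N} \left[\pi(x_i)\right]^{y_i} \left[1 - \pi(x_i)\right]^{1 - y_i}$$

Taking the logarithm gives the log-likelihood function:

$$L(w, b) = \sum_{i=1}^{N} \left[ y_i \log \pi(x_i) + (1 - y_i) \log\left(1 - \pi(x_i)\right) \right] = \sum_{i=1}^{N} \left[ y_i (w \cdot x_i + b) - \log\left(1 + \exp(w \cdot x_i + b)\right) \right]$$

Maximizing $L(w, b)$ yields the estimates of $w$ and $b$. The maximum can be found with iterative methods such as gradient ascent on $L$ or, equivalently, gradient descent on $-L$.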
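The estimation step can be sketched in a few lines of plain Python. This is a minimal batch gradient ascent on the log-likelihood, not what Spark does internally; the toy data, learning rate, and iteration count are arbitrary assumptions, and `fit` and `log_likelihood` are hypothetical helper names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def margin(w, b, x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def log_likelihood(w, b, xs, ys):
    """Sum over samples of y * (w.x + b) - log(1 + exp(w.x + b))."""
    return sum(y * margin(w, b, x) - math.log(1.0 + math.exp(margin(w, b, x)))
               for x, y in zip(xs, ys))

def fit(xs, ys, learning_rate=0.5, n_iter=2000):
    """Batch gradient ascent on the (averaged) log-likelihood."""
    n, dim = len(xs), len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(margin(w, b, x))  # gradient factor: y - pi(x)
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        # Ascent step: move along the gradient to increase the likelihood.
        w = [wj + learning_rate * gj / n for wj, gj in zip(w, grad_w)]
        b += learning_rate * grad_b / n
    return w, b

# Toy, non-separable one-dimensional data (made up for illustration).
xs = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]]
ys = [0, 0, 1, 0, 1, 1]
w, b = fit(xs, ys)
print(w, b, log_likelihood(w, b, xs, ys))
```

Flipping the sign of the update and minimizing the negative log-likelihood gives the gradient descent variant.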

Example code:

# Import the required packages:
from pyspark.sql import SparkSession, Row
from pyspark.ml.linalg import Vectors
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.classification import LogisticRegression
# Solve a binary classification problem with binomial logistic regression
spark = SparkSession.builder.master('local').appName('Binary classification with binomial logistic regression').getOrCreate()
sc = spark.sparkContext

# Read the data and take a quick look.
# We define a function that returns each record in the form we need, then read the text file.
# The first map splits each line on ","; in our dataset every line has 5 parts: the first 4 are
# the four features of an iris flower and the last one is its class.
# The features are stored in a Vector; we build an RDD of Rows and convert it to a DataFrame.
def f(x):
    rel = {}
    rel['features'] = Vectors.dense(float(x[0]), float(x[1]), float(x[2]), float(x[3]))
    rel['label'] = str(x[4])
    return rel

data = sc.textFile("file:///usr/local/spark/mycode/exercise/iris.txt").map(lambda line: line.split(',')).map(lambda p: Row(**f(p))).toDF()

# Because this is a binary classification problem, we do not need all 3 classes and select only two of them.
# We first register the data as a temporary view named iris so that it can be queried with SQL,
# then select every row that does not belong to the class "Iris-setosa".
# Printing the result confirms that no "Iris-setosa" rows remain.
data.createOrReplaceTempView("iris")
df = spark.sql("select * from iris where label != 'Iris-setosa'")
rel = df.rdd.map(lambda t: str(t['label']) + ":" + str(t['features'])).collect()
for item in rel:
    print(item)

[Figure: screenshot of the printed rows]

# Build the ML pipeline
# Index the label column and the feature column, and rename the outputs
labelIndexer = StringIndexer().setInputCol('label').setOutputCol('indexedLabel').fit(df)
featureIndexer = VectorIndexer().setInputCol('features').setOutputCol('indexedFeatures').fit(df)

# Randomly split the dataset into a training set (70%) and a test set (30%)
trainingData, testData = df.randomSplit([0.7, 0.3])

# Set the logistic regression parameters. Here we use setter methods throughout; a ParamMap
# also works (see the Spark MLlib website for details). We set the maximum number of iterations
# to 10, the regularization parameter to 0.3, and the elastic net mixing parameter to 0.8.
# explainParams() lists every available parameter together with the values already set.
lr = LogisticRegression().setLabelCol("indexedLabel").setFeaturesCol('indexedFeatures').setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
print("LogisticRegression parameters:\n" + lr.explainParams())

[Figure: screenshot of the explainParams() output]
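To make the roles of `regParam` and `elasticNetParam` concrete, the sketch below evaluates the elastic net penalty in the form given in the Spark ML documentation, λ(α‖w‖₁ + (1−α)/2 ‖w‖₂²), where λ is `regParam` and α is `elasticNetParam`. It only illustrates the penalty term; details of Spark's optimizer such as feature standardization are not modeled, and the weight vector is made up.

```python
def elastic_net_penalty(w, reg_param, elastic_net_param):
    """reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    l1 = sum(abs(wj) for wj in w)
    l2_squared = sum(wj * wj for wj in w)
    alpha = elastic_net_param
    return reg_param * (alpha * l1 + (1.0 - alpha) * 0.5 * l2_squared)

w = [1.0, -2.0]  # illustrative weights
print(elastic_net_penalty(w, 0.3, 0.8))  # mixed L1/L2 penalty, as configured in this article
print(elastic_net_penalty(w, 0.3, 1.0))  # alpha = 1: pure L1 (lasso)
print(elastic_net_penalty(w, 0.3, 0.0))  # alpha = 0: pure L2 (ridge)
```

With α = 0.8 the penalty is mostly L1, which tends to drive some coefficients to exactly zero.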

# Set up a labelConverter that maps the predicted class index back to its string label
labelConverter = IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)

# Build the pipeline, set its stages, and call fit() to train the model
LrPipeline = Pipeline().setStages([labelIndexer, featureIndexer, lr, labelConverter])
LrPipelineModel = LrPipeline.fit(trainingData)

# Use the trained model to predict on the test set
lrPredictions = LrPipelineModel.transform(testData)
preRel = lrPredictions.select("predictedLabel", 'label', 'features', 'probability').collect()
for item in preRel:
    print(str(item['label']) + ',' + str(item['features']) + '-->prob=' + str(item['probability']) + ',predictedLabel=' + str(item['predictedLabel']))

[Figure: screenshot of the predictions]

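# Model evaluation 1
# Create a MulticlassClassificationEvaluator and use setters to specify the prediction column
# and the true label column, then compute the accuracy and the error rate.
# The metric is set to "accuracy" explicitly because the evaluator's default metric is F1.
evaluator = MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction").setMetricName("accuracy")
lrAccuracy = evaluator.evaluate(lrPredictions)
print("Test Error=" + str(1.0 - lrAccuracy))

[Figure: screenshot of the test error]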
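The choice of metric matters for how "Test Error" should be read: `MulticlassClassificationEvaluator` reports F1 unless `metricName` is set to `"accuracy"`. The plain-Python sketch below, on made-up labels and predictions, shows that accuracy and a support-weighted F1 are generally different numbers; `accuracy` and `weighted_f1` are hypothetical helpers and only approximate what the evaluator computes.

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the true label."""
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)

def weighted_f1(labels, preds):
    """Per-class F1, averaged with weights proportional to class frequency."""
    total = 0.0
    for c in set(labels):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * (tp + fn) / len(labels)
    return total

labels = [0, 0, 0, 0, 1, 1]  # made-up true classes
preds = [0, 0, 0, 0, 0, 1]   # made-up predictions
print("accuracy:", accuracy(labels, preds))
print("test error:", 1.0 - accuracy(labels, preds))
print("weighted F1:", weighted_f1(labels, preds))
```

Subtracting an F1 score from 1 does not give the misclassification rate, so the error rate is only meaningful when the evaluator is configured for accuracy.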

# The output above shows a prediction accuracy of about 94%. Next, we retrieve the trained
# logistic regression model. As noted earlier, LrPipelineModel is a PipelineModel, so the
# model can be obtained through its stages.
lrModel = LrPipelineModel.stages[2]
print("Coefficients: " + str(lrModel.coefficients) + " Intercept: " + str(lrModel.intercept) + " numClasses: " + str(lrModel.numClasses) + " numFeatures: " + str(lrModel.numFeatures))

[Figure: screenshot of the model coefficients]

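# Model evaluation 2
# Spark's ml library also provides a model summary, currently only for binomial logistic
# regression. In the Scala API it must be explicitly cast to BinaryLogisticRegressionSummary;
# the PySpark code below reads the binary metrics from lrModel.summary directly.
# We first get the summary of the binomial logistic regression model, then print the value of
# the loss function over the 10 iterations. The loss decreases as the iterations proceed, and a
# smaller loss means a better fit.
# Next we read the metrics used to evaluate model performance (computed on the training data).
# The ROC curve indicates model quality; areaUnderROC reaches 0.969551282051282, which shows
# that the classifier is quite good.
# Finally, we choose the most suitable threshold by maximizing fMeasure, a metric that combines
# precision and recall.
trainingSummary = lrModel.summary
objectiveHistory = trainingSummary.objectiveHistory
for item in objectiveHistory:
    print(item)
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))

[Figure: screenshot of the objective history and areaUnderROC]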
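The idea behind choosing a threshold by maximizing the F-measure can be sketched without Spark. The code below tries every distinct predicted score as a threshold and keeps the one with the highest F1. The scores and labels are made up, and `f_measure_by_threshold` is a hypothetical helper that only mimics the shape of the summary's `fMeasureByThreshold` table.

```python
def f_measure_by_threshold(scores, labels):
    """Return (threshold, F1) pairs; a sample is predicted positive when score >= threshold."""
    rows = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rows.append((t, f1))
    return rows

def best_threshold(scores, labels):
    """Threshold with the largest F1."""
    return max(f_measure_by_threshold(scores, labels), key=lambda row: row[1])[0]

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]  # made-up P(Y=1|x) values
labels = [1, 1, 0, 1, 0, 0]              # made-up true classes
for t, f1 in f_measure_by_threshold(scores, labels):
    print(t, round(f1, 3))
print("best threshold:", best_threshold(scores, labels))
```

Lowering the threshold raises recall and usually lowers precision; the F-measure picks a compromise between the two.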

fMeasure = trainingSummary.fMeasureByThreshold
maxFMeasure = fMeasure.groupBy().max('F-Measure').select('max(F-Measure)').head()
print(maxFMeasure)

[Figure: screenshot of the maximum F-Measure]

bestThreshold = fMeasure.where(fMeasure['F-Measure'] == maxFMeasure['max(F-Measure)']).select('threshold').head()['threshold']
print(bestThreshold)
# The threshold is set on the estimator, so it applies to models fit afterwards
lr.setThreshold(bestThreshold)

# Solve the binary classification problem with multinomial logistic regression
mlr = LogisticRegression().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures").setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8).setFamily("multinomial")
mlrPipeline = Pipeline().setStages([labelIndexer, featureIndexer, mlr, labelConverter])
mlrPipelineModel = mlrPipeline.fit(trainingData)
mlrPrediction = mlrPipelineModel.transform(testData)
mlrPreRel = mlrPrediction.select("predictedLabel", "label", "features", "probability").collect()
for item in mlrPreRel:
    print('(' + str(item['label']) + ',' + str(item['features']) + ')-->prob=' + str(item['probability']) + ',predictLabel=' + str(item['predictedLabel']))

[Figure: screenshot of the multinomial predictions]

mlrAccuracy = evaluator.evaluate(mlrPrediction)
print("mlr Test Error =" + str(1.0 - mlrAccuracy))

[Figure: screenshot of the multinomial test error]

mlrModel = mlrPipelineModel.stages[2]
print("Multinomial coefficients: " + str(mlrModel.coefficientMatrix) + " Multinomial intercepts: " + str(mlrModel.interceptVector) + " numClasses: " + str(mlrModel.numClasses) + " numFeatures: " + str(mlrModel.numFeatures))

[Figure: screenshot of the multinomial coefficients]

# Solve a multiclass problem with multinomial logistic regression
# Note: as written, this part reuses mlrPrediction, which comes from the two-class data above.
# To classify all three iris classes, the indexers and the pipeline would have to be refit on
# the full DataFrame `data` instead of `df`.
mlrPreRel2 = mlrPrediction.select("predictedLabel", "label", "features", "probability").collect()
for item in mlrPreRel2:
    print('(' + str(item['label']) + ',' + str(item['features']) + ')-->prob=' + str(item['probability']) + ',predictLabel=' + str(item['predictedLabel']))

[Figure: screenshot of the predictions]

mlr2Accuracy = evaluator.evaluate(mlrPrediction)
print("Test Error = " + str(1.0 - mlr2Accuracy))

mlr2Model = mlrPipelineModel.stages[2]
print("Multinomial coefficients: " + str(mlr2Model.coefficientMatrix) + " Multinomial intercepts: " + str(mlr2Model.interceptVector) + " numClasses: " + str(mlr2Model.numClasses) + " numFeatures: " + str(mlr2Model.numFeatures))
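The multinomial model stores one row of coefficients and one intercept per class, and class probabilities come from a softmax over the per-class margins. The sketch below uses made-up coefficients to show that, with two classes, the softmax probability of class 1 equals the binomial (sigmoid) probability computed from the difference of the two rows. `softmax_proba` is a hypothetical helper, not a Spark API.

```python
import math

def softmax_proba(coef_matrix, intercepts, x):
    """Class probabilities from per-class margins w_k . x + b_k via softmax."""
    margins = [sum(w * xi for w, xi in zip(row, x)) + b
               for row, b in zip(coef_matrix, intercepts)]
    m = max(margins)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in margins]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up two-class coefficients with 4 features (one row per class).
coef_matrix = [[0.1, -0.2, 0.3, 0.0],
               [-0.1, 0.2, 0.5, 0.4]]
intercepts = [0.2, -0.2]
x = [6.0, 3.0, 4.5, 1.5]

probs = softmax_proba(coef_matrix, intercepts, x)

# Equivalent binomial form: sigmoid of the difference between the class-1 and class-0 margins.
diff = sum((w1 - w0) * xi for w0, w1, xi in zip(coef_matrix[0], coef_matrix[1], x)) \
       + (intercepts[1] - intercepts[0])
p1_binomial = 1.0 / (1.0 + math.exp(-diff))
print(probs, p1_binomial)
```

The same function handles three or more classes by adding rows to `coef_matrix` and entries to `intercepts`.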

Reposted from: https://www.cnblogs.com/SoftwareBuilding/p/9512653.html
