

A Good Beginner Project with Logistic Regression


Welcome to this friendly beginner's guide to creating a logistic regression model for classification in Python!

With this guide I want to give you an easy way to complete your first data science project by creating a logistic regression machine learning model used for classification. I will explain all the steps thoroughly to make sure that you know what is going on. We will be working with a very famous and well-known data set originating from 1936. I have used Google Colab to write the code.

The article consists of the following steps:


• Introduction to the data
• Binary vs. multiclass classification (an explanation)
• Importing the data
• Splitting the data into training and test data
• Scaling the training and test data
• Building and training the logistic regression classification model
• Evaluating the results
• Conclusion and goodbyes

I will be showing all the code, and also provide a thorough explanation at each step. Sounds good? Let's do it!

Introduction to the data

The data can be found here. You will be directed to a Kaggle page (see the photo below; I have highlighted the download button in yellow). After downloading the file I have renamed it to dataset_iris.csv.

The dataset consists of 150 entries of data about iris plants. There are exactly 50 of each type of iris described in the data. The data points for each flower are:


• SepalLengthCm
• SepalWidthCm
• PetalLengthCm
• PetalWidthCm
• Species

In total we have 4 variables (1 to 4) that influence which species a given plant might be. In machine learning terms these variables are called features. The final variable (5) is called a label, and it is the label/species that we would like to predict.
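For example, the first flower in the file has the four feature values and one label shown here (a single row, just for illustration):

features = [5.1, 3.5, 1.4, 0.2]  # sepal length, sepal width, petal length, petal width (in cm)
label = 'Iris-setosa'            # the species we want to predict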

These are the first 20 rows of the data for you to familiarize yourself with.

[Image: the first 20 rows of the data]

Binary vs. multiclass classification

For this project you will be creating a logistic regression model for classification. To make the logistic regression model work as a classification model for this data we have to apply a small trick (don't worry, it won't be difficult). Classification problems can be divided into two categories, binary and multiclass. The first is a classification problem where the outcome is limited to two different classes, such as yes-or-no questions: will the customer buy the product or not? The latter is a classification problem where the outcome can be more than 2 classes, as in this project where we have 3 different classes of plants.

Logistic regression is often not the first choice for multiclass classification problems, and there are other models that can handle them naturally. However, the logistic model is very common and a good model to get familiar with, which is why I have chosen to implement it here anyway. And it does indeed work, we just have to give the model a small power-up.

Alright, enough talking already, let's code! Each code snippet will be followed by an explanation of what just happened.

Importing the data

import pandas as pd
dataset = pd.read_csv('dataset_iris.csv')
dataset.head()

[Output: the first 5 rows of the DataFrame]
• We need the pandas library to get access to the data, and we import it under the alias pd.

• pd (pandas) has a function called read_csv() which allows us to read data from a comma-separated file. It returns a DataFrame, which we save to a variable called dataset.

• A pandas DataFrame has a method called head() which displays the first 5 entries in the DataFrame (see the output above).
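If you want to verify the 50/50/50 class balance mentioned earlier, a one-liner like this should do it (my own addition; it assumes the species column is named Species, as in the Kaggle file):

print(dataset['Species'].value_counts())  # expect a count of 50 for each of the 3 species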

Now that we have all the data saved in DataFrame format, next we will divide the data into features and labels (remember, the features are the independent variables that influence the dependent variable, called the label).

x = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
print(x[:10])
print(y[:10])

[Output: the first 10 feature rows and the first 10 labels]
• We can access different rows and columns of a DataFrame by using iloc followed by the index of the rows and columns we want, and finally appending .values to get the actual values. We save these to the variables x and y. (The slice 1:-1 skips the Id column at the start and the label column at the end, keeping only the four feature columns.)

• x now contains 150 rows of features. We see the first 10 rows in the output.

• y now contains the 150 corresponding species/labels. We see the first 10 labels in the output.

We have two more steps to complete before we are ready to create and train our model. Let's do those now!

Splitting the data into training and test data

    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
• From the scikit-learn library (called sklearn; you will meet sklearn many times!) and its model_selection module we import train_test_split.

• train_test_split allows us to split our data into training data and test data. It takes 4 parameters: the features (x), the labels (y), test_size (what fraction of the data should become test data) and finally random_state (put any number here, but use 0 to get the same results as me).

Why do we split the data? When we train a model with data it will get familiar with that data and be able to make perfect predictions when it is shown the same data again. It is not very useful to only evaluate the model on data it has already seen, as we do not want to predict things we already know! Therefore, we split the data and only introduce the test data to the model when it is time to test the model's performance.
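As a quick sanity check (my own addition, not part of the original walkthrough), you can print the shapes of the four arrays. With 150 rows and test_size = 0.2 you should end up with 120 training samples and 30 test samples:

print(x_train.shape, x_test.shape)  # expected: (120, 4) (30, 4)
print(y_train.shape, y_test.shape)  # expected: (120,) (30,)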

    Now to the final step before making the model…


Scaling the data

We need to scale the data through a method called standardization (don't worry about exactly how it works right now, that is a topic for another day). Standardization rescales the values of the features so that almost all of them fall in the range of -1 to 1. Some machine learning models are sensitive to whether or not data has been scaled, and logistic regression is one such model. As an example: if we do not scale the data, the model might treat a feature measured as 2000 (meters) as larger than one measured as 3 (kilometers), because it only sees the raw numbers. Scaling gets rid of this problem.
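For intuition, this is roughly what standardization computes for each feature column: subtract the column's mean and divide by its standard deviation. Here is a tiny hand-rolled sketch (my own illustration; the StandardScaler used below does the same thing for us, column by column):

import numpy as np

def standardize(column):
    # z = (x - mean) / standard deviation
    return (column - column.mean()) / column.std()

values = np.array([2000.0, 3000.0, 2500.0])  # hypothetical raw measurements
print(standardize(values))  # rescaled to mean 0 and unit standard deviation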

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
print(x_train[:10])  # print the first 10 entries of the scaled training data

[Output: the first 10 entries of the scaled training data]
• We import the StandardScaler class from sklearn.preprocessing.

• We create an instance of StandardScaler and call it sc.

• We call the fit_transform() method on sc, which both fits the scaler to the training data and scales it, returning the result to the x_train variable.

• Lastly we also scale the test data. At this point the scaler has already been fitted to the training data and should therefore not be fitted again. We simply call the transform() method, which returns the transformed test data.

• In the output after the print statement we can see the first 10 entries of the scaled data. Indeed, the values are mostly between -1 and 1.

    And now to the step that we have all been waiting for!


Photo by Dilyara Garifullina on Unsplash

Creating and training the logistic regression model

    We are now ready to create the logistic regression model for a multiclass problem. Up until now we have: imported the data, split it into training and testing data and lastly scaled the data to make it more suitable for our machine learning model.


    from sklearn.linear_model import LogisticRegression
    classifier = LogisticRegression(multi_class='ovr', random_state = 0)
    classifier.fit(x_train, y_train)
• We import LogisticRegression from sklearn.linear_model.

• We create an instance of the LogisticRegression class called classifier by calling LogisticRegression(), passing multi_class='ovr' and random_state=0 as parameters.

• multi_class='ovr' is the trick I mentioned previously which makes the logistic regression model work in a multiclass scenario. If our problem had simply been binary we would have left this parameter out. random_state takes any number as input; you can enter 0 to get the same results as me.
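Under the hood, 'ovr' (one-vs-rest) trains one binary classifier per class and picks the class whose classifier is most confident. If you want to see that decomposition explicitly, scikit-learn's OneVsRestClassifier wrapper expresses the same idea (a sketch of the concept, not something you need for this project):

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# One binary logistic regression per species: setosa-vs-rest,
# versicolor-vs-rest and virginica-vs-rest.
ovr_classifier = OneVsRestClassifier(LogisticRegression(random_state=0))
ovr_classifier.fit(x_train, y_train)
print(len(ovr_classifier.estimators_))  # 3 underlying binary classifiers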

Wow! That was actually it.

Now let's see how well our model performs.

Predicting results

from sklearn.metrics import confusion_matrix, accuracy_score
predictions = classifier.predict(x_test)
cm = confusion_matrix(y_test, predictions)
print(cm)
accuracy_score(y_test, predictions)
• We import confusion_matrix and accuracy_score. They are kind of self-explanatory; however, I will comment a bit on them.

• First we call classifier.predict(x_test) to get the model's predictions for the unseen test data, and save them to predictions.

• confusion_matrix takes two parameters, the expected values and the predicted values, in this case y_test and predictions. The same applies to accuracy_score.

• We print the confusion_matrix called cm.

• Because accuracy_score is the last expression in the cell, its return value is displayed automatically in a notebook environment.
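To make the matrix easier to read, here is a hypothetical 3x3 example (made-up numbers, not my actual output). Rows are the true species and columns are the predicted species, so everything off the diagonal is a mistake:

#                   predicted:
#                   setosa  versicolor  virginica
# true setosa           10           0          0
# true versicolor        0           9          1
# true virginica         0           2          8

In this made-up matrix the model confused versicolor and virginica a few times, which is plausible for the iris data since those two species overlap in their petal measurements, while setosa is easy to separate.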

Conclusion

We can see that our model has an accuracy score of 0.9, which means it is correct 90% of the time. On our 30 test samples, that corresponds to 27 correct predictions!
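If you would like to sanity-check the score by hand (my own addition, not part of the original walkthrough), accuracy is simply the fraction of matching predictions:

import numpy as np

manual_accuracy = np.mean(predictions == y_test)  # fraction of correct predictions
print(manual_accuracy)  # 0.9 on 30 test samples means 27 were classified correctly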

    Congratulations you have now completed a machine learning and data science project!


What to do now?

There are many ways to improve this project! I wanted to keep this guide as simple as possible so that everyone is able to participate. Because of this I have cut some corners and skipped some of the best practices in data science. Normally you would do a lot more data analysis before training the model, like visualizing the data, and also do more work to validate the accuracy of the model. I recommend that you try to complete a project with binary classification as well, so that you master both. Try doing a Google search for LogisticRegression(multi_class='ovr') to completely understand what is going on in that step. Also, logistic regression might not be the best machine learning model for this project; I simply chose it because it is so well known and I consider it important to learn. You could try implementing other models in this project (see the sketch below) and maybe you will achieve better results.
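As one concrete starting point for trying other models (a sketch with a classifier I picked for illustration, not a recommendation from the original article), most scikit-learn classifiers share the same fit/predict API, so swapping the model is only a small change:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Same workflow, different model: k-nearest neighbors instead of logistic regression.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
knn_predictions = knn.predict(x_test)
print(accuracy_score(y_test, knn_predictions))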

To get better at AI, machine learning and data science I recommend that you keep practicing. This could include reading other guides here on Medium, taking a course on Udemy or maybe applying to a boot camp; there are many possibilities! If you are interested in avoiding my biggest mistake for yourself, have a look at my article called My biggest mistake learning machine learning.

    I hope this article has been helpful to you! If you have any questions or comments feel free to reach out to me in the response section below.


    Keep learning! — Jacob Toftgaard Rasmussen


Originally published at: https://medium.com/swlh/a-good-beginner-project-with-logistic-regression-60b78e38484e
