A Quick Little Lesson on KNN

Published: 2023/11/29

As the title says, here is a quick little lesson on how to construct a simple KNN model in scikit-learn. I will be using the xAPI-Edu-Data dataset, which contains information on students' academic performance.

The features include how many times a student raised their hand, their gender, parent satisfaction with the school, how often they were absent from class, and how often they participated in class discussion.

Each student is grouped into one of three academic classes: High (H), Medium (M), and Low (L). I used the other features to predict which class each student falls into.

Just for reference:

High, 90–100
Medium, 70–89
Low, 0–69
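
The dataset already ships with the class label, so nothing like this is needed for the model itself. Purely to illustrate the intervals above, here is a minimal sketch of how numeric scores could be binned into the three classes with pandas. The scores and the names `scores` and `classes` are invented for the example.

```python
import pandas as pd

# Hypothetical numeric scores, invented for illustration only
scores = pd.Series([95, 72, 40, 89, 90, 69, 70])

# Bin edges follow the intervals listed above: 0-69 Low, 70-89 Medium, 90-100 High.
# pd.cut includes the right edge of each bin by default, so 69 -> L and 89 -> M.
classes = pd.cut(scores, bins=[-1, 69, 89, 100], labels=['L', 'M', 'H'])

print(classes.tolist())
```

The lowest edge is set to -1 so that a score of exactly 0 still lands in the Low bin.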

Okay, cool! Let’s get started.

Library Import

import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from statsmodels.formula.api import ols
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
import matplotlib.pyplot as plt
%matplotlib inline

First, you want to import all of the libraries that you’re going to need. Some people import each library at each stage of the process, but personally I like to do it all at the beginning.

Technically we won't really be using Seaborn or Matplotlib, but I like to keep them around just in case I want to visualize something during the process.

Initial Data Import

df = pd.read_csv('xAPI-Edu-Data.csv')
df.head()

Cool! The data is in good shape to begin with. There are no missing values and no outliers to speak of. However, we will have to do a small amount of preprocessing to get it ready for our model.

Preprocessing

# Dropping all unnecessary columns
df = df.drop(['NationalITy', 'PlaceofBirth', 'StageID', 'GradeID',
              'SectionID', 'Topic', 'Relation',
              'ParentAnsweringSurvey'],
             axis=1)
df.head()

When feeding a KNN model, you only want to include the features that you actually want to influence the decision, since every column you pass in counts toward the distance between students. This may seem obvious, but I figured it was worth mentioning.

# Binary encoding of categorical variables
df['gender'] = df['gender'].map({'M': 0, 'F': 1})
df['Semester'] = df['Semester'].map({'F': 0, 'S': 1})
df['ParentschoolSatisfaction'] = df['ParentschoolSatisfaction'].map({'Good': 0, 'Bad': 1})
df['StudentAbsenceDays'] = df['StudentAbsenceDays'].map({'Under-7': 0, 'Above-7': 1})
df.head()

Something perhaps not so obvious, if you have never done this, is that you have to encode your categorical variables. It makes sense if you think about it. A model can't really interpret 'Good' or 'Bad', but it can interpret 0 and 1.
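
One thing worth knowing about encoding with a dictionary: pandas' `Series.map` returns NaN for any value that is not a key in the dictionary, so a typo or an unexpected category quietly becomes a missing value. That is one reason a missing-value check after encoding is worthwhile. A minimal sketch on made-up data (the `toy` frame and its lowercase 'good' entry are invented for illustration):

```python
import pandas as pd

# Toy data invented for illustration; 'good' (lowercase) is a deliberate typo
toy = pd.DataFrame({'ParentschoolSatisfaction': ['Good', 'Bad', 'good']})

mapping = {'Good': 0, 'Bad': 1}
encoded = toy['ParentschoolSatisfaction'].map(mapping)

# Values missing from the dictionary come back as NaN rather than raising an error
print(encoded)
print("Values that failed to encode:", encoded.isna().sum())
```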

# Check for missing values
df.isna().sum()

I know I already said that we don’t have any missing values, but I just like to be thorough.

# Store the target variable in its own Series, then remove it from the original dataframe
labels = df['Class']
df.drop('Class', axis = 1, inplace = True)

And then:

df.head()
labels.head()

Here we separated our target feature from our predictive features. We do this in order to create a train/test split for our data. Speaking of which!

Train/Test Split

X_train, X_test, y_train, y_test = train_test_split(df, labels,
                                                    test_size=.25,
                                                    random_state=33)

*I realize the above formatting is terrible; I'm just trying to make it readable for this Medium article.
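
To make the two arguments concrete, here is a small sketch on synthetic stand-in data (the arrays `X` and `y` are randomly generated, not the student dataset). It shows that `test_size=.25` holds out a quarter of the rows and that a fixed `random_state` reproduces the same split every time.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 3 class labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.choice(['H', 'M', 'L'], size=100)

# test_size=.25 holds out a quarter of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.25, random_state=33)
print(X_train.shape, X_test.shape)

# The same random_state gives the same split again
X_train_again, _, _, _ = train_test_split(
    X, y, test_size=.25, random_state=33)
print(np.array_equal(X_train, X_train_again))
```

`train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both parts. The split above does not use it, so treat that as an optional extra rather than part of this walkthrough.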

Scaling the Data

This next part brings up two important points:

You need to scale the data. KNN classifies by distance, so if you don't, variables with larger absolute values will be given more weight in the model for no real reason. We have features that are binary encoded (0, 1), but we also have features like how many times students raise their hands (0–80). We need to put them on the same scale so they have the same importance in the model.

You have to scale the data AFTER you perform the train/test split. If you don't, information about the test set leaks into training (data leakage), and your evaluation is no longer valid. For a more thorough explanation, check out this article by Jason Brownlee, who has tons of amazing resources on machine learning.
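
To see the first point in numbers, the sketch below measures how much of the squared Euclidean distance between two students comes from a binary feature, before and after standardizing. The four rows and the helper `binary_share` are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up students: column 0 is a binary feature, column 1 is a hand-raise count
X = np.array([[0, 10],
              [1, 12],
              [0, 70],
              [1, 75]], dtype=float)

def binary_share(a, b):
    """Fraction of the squared Euclidean distance contributed by column 0."""
    squared_diff = (a - b) ** 2
    return squared_diff[0] / squared_diff.sum()

# Unscaled: the hand-raise count dominates the distance almost entirely
raw_share = binary_share(X[0], X[3])

# There is no train/test split in this toy example, so fitting on all rows is fine here
X_scaled = StandardScaler().fit_transform(X)
scaled_share = binary_share(X_scaled[0], X_scaled[3])

print("Binary feature share, unscaled: {:.4f}".format(raw_share))
print("Binary feature share, scaled:   {:.4f}".format(scaled_share))
```

Before scaling, the binary column barely registers in the distance. After scaling, the two columns contribute on a comparable footing.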

The good news is, this is extremely easy to do.

scaler = StandardScaler()
scaled_data_train = scaler.fit_transform(X_train)
scaled_data_test = scaler.transform(X_test)
scaled_df_train = pd.DataFrame(scaled_data_train, columns=df.columns)
scaled_df_train.head()

Awesome. Easy peasy lemon squeezy: our data is scaled.
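
An alternative way to get the same fit-on-train, transform-on-test behavior is to wrap the scaler and the classifier in a scikit-learn pipeline. This is not what the walkthrough does; it is a sketch of one option, run on synthetic data (the arrays `X` and `y` are generated, not the student dataset).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: the label depends only on the first column
rng = np.random.default_rng(1)
X = rng.normal(loc=50, scale=20, size=(200, 4))
y = (X[:, 0] > 50).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.25, random_state=33)

# fit() fits the scaler on the training rows only, then fits KNN on the scaled result
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe.fit(X_train, y_train)

# The scaler's learned means match the training set, not the full data
scaler_step = pipe.named_steps['standardscaler']
print(np.allclose(scaler_step.mean_, X_train.mean(axis=0)))

# score() scales X_test with the training statistics before predicting
test_accuracy = pipe.score(X_test, y_test)
print(test_accuracy)
```

Because the pipeline carries the fitted scaler with it, there is no way to accidentally fit the scaler on test rows.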

Fit a KNN Model

# Instantiate the model
clf = KNeighborsClassifier()

# Fit the model
clf.fit(scaled_data_train, y_train)

# Predict on the test set
test_preds = clf.predict(scaled_data_test)

It really, truly is that simple. Now we want to see how well our baseline model performed.
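
If you are curious what the classifier is doing under the hood, the idea fits in a few lines: measure the distance from a new point to every training point, take the k closest, and let them vote. The function below is a simplified sketch of that idea, not scikit-learn's implementation (which uses faster neighbor searches and its own tie-breaking). The name `knn_predict` and the toy data are invented for illustration. With default settings, `KNeighborsClassifier` also uses 5 neighbors, equal-weight votes, and Euclidean distance.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the label of x_new by majority vote among its k nearest training rows."""
    # Euclidean distance from x_new to every training row
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Tiny made-up training set: two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array(['L', 'L', 'L', 'H', 'H', 'H'])

pred_low = knn_predict(X_toy, y_toy, np.array([0.1, 0.1]), k=3)
pred_high = knn_predict(X_toy, y_toy, np.array([5.0, 5.1]), k=3)
print(pred_low, pred_high)
```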

Evaluating the Model

def print_metrics(labels, preds):
    print("Precision Score: {}".format(precision_score(labels, preds, average='weighted')))
    print("Recall Score: {}".format(recall_score(labels, preds, average='weighted')))
    print("Accuracy Score: {}".format(accuracy_score(labels, preds)))
    print("F1 Score: {}".format(f1_score(labels, preds, average='weighted')))

print_metrics(y_test, test_preds)

And there you have it. With almost no effort, we created a predictive model that is able to classify students into their academic performance class with an accuracy of 75.8%. Not bad.
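
A note on `average = 'weighted'` in the metrics function: with three classes, precision is computed once per class and then averaged, with each class weighted by how many true samples it has. The sketch below reproduces that by hand on made-up labels (`y_true` and `y_pred` are invented, not the model's actual predictions) and compares the result with scikit-learn.

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up labels and predictions for illustration
y_true = np.array(['H', 'H', 'M', 'M', 'M', 'L', 'L', 'L', 'L', 'M'])
y_pred = np.array(['H', 'M', 'M', 'M', 'L', 'L', 'L', 'M', 'L', 'M'])

weighted = 0.0
for cls in np.unique(y_true):
    predicted_cls = (y_pred == cls)
    # Precision for this class: correct predictions of cls / all predictions of cls
    precision = np.sum((y_true == cls) & predicted_cls) / np.sum(predicted_cls)
    # Support: how many true samples belong to this class
    support = np.sum(y_true == cls)
    weighted += precision * support / len(y_true)

sklearn_weighted = precision_score(y_true, y_pred, average='weighted')
print(weighted, sklearn_weighted)
```

One consequence of this weighting is that weighted recall always works out to the same number as accuracy, so seeing those two scores match in the output is expected.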

We can probably improve this by at least a few points by tuning the hyperparameters of the model (the number of neighbors, for example), but I will leave that for another post.
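
The tuning itself is left for another post, so the following is only a guess at what it might look like, not the author's method: a cross-validated grid search over the number of neighbors and the vote weighting. The data is synthetic (generated with `make_classification`), and the grid values are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the student dataset
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=33)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.25, random_state=33)

# Scaling lives inside the pipeline so each cross-validation fold is scaled on its own
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Arbitrary example grid; step names come from make_pipeline's lowercase class names
param_grid = {
    'kneighborsclassifier__n_neighbors': [3, 5, 7, 9, 11],
    'kneighborsclassifier__weights': ['uniform', 'distance'],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

print(search.best_params_)
tuned_accuracy = search.score(X_test, y_test)
print(tuned_accuracy)
```

The test set is only touched once, at the very end, so the reported accuracy is not inflated by the search.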

Happy learning. 😁

Translated from: https://towardsdatascience.com/a-quick-little-lesson-on-knn-98381c487aa2