
How to Build and Train K-Nearest Neighbors and K-Means Clustering ML Models in Python

One of machine learning's most popular applications is in solving classification problems.

Classification problems are situations where you have a data set, and you want to classify observations from that data set into a specific category.

A famous example is a spam filter for email providers. Gmail uses supervised machine learning techniques to automatically place emails in your spam folder based on their content, subject line, and other features.

Two machine learning models perform much of the heavy lifting when it comes to classification problems:

  • K-nearest neighbors
  • K-means clustering

This tutorial will teach you how to code K-nearest neighbors and K-means clustering algorithms in Python.

K-Nearest Neighbors Models

The K-nearest neighbors algorithm is one of the world’s most popular machine learning models for solving classification problems.

A common exercise for students exploring machine learning is to apply the K nearest neighbors algorithm to a data set where the categories are not known. A real-life example of this would be if you needed to make predictions using machine learning on a data set of classified government information.

In this tutorial, you will learn to write your first K nearest neighbors machine learning algorithm in Python. We will be working with an anonymous data set similar to the situation described above.

The Data Set You Will Need in This Tutorial

The first thing you need to do is download the data set we will be using in this tutorial. I have uploaded the file to my website. You can access it by clicking here.

Now that you have downloaded the data set, you will want to move the file to the directory that you’ll be working in. After that, open a Jupyter Notebook and we can get started writing Python code!

The Libraries You Will Need in This Tutorial

To write a K nearest neighbors algorithm, we will take advantage of many open-source Python libraries including NumPy, pandas, and scikit-learn.

Begin your Python script by writing the following import statements:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Importing the Data Set Into Our Python Script

Our next step is to import the classified_data.csv file into our Python script. The pandas library makes it easy to import data into a pandas DataFrame.

Since the data set is stored in a csv file, we will be using the read_csv method to do this:

raw_data = pd.read_csv('classified_data.csv')

Printing this DataFrame inside of your Jupyter Notebook will give you a sense of what the data looks like:
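
For example, a quick way to preview the data (the screenshot from the original article is not reproduced here) is the head method:

raw_data.head()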

You will notice that the DataFrame starts with an unnamed column whose values are equal to the DataFrame’s index. We can fix this by making a slight adjustment to the command that imported our data set into the Python script:

raw_data = pd.read_csv('classified_data.csv', index_col = 0)

Next, let’s take a look at the actual features that are contained in this data set. You can print a list of the data set’s column names with the following statement:

print(raw_data.columns)

This returns:

Index(['WTT', 'PTI', 'EQW', 'SBI', 'LQE', 'QWG', 'FDJ', 'PJF', 'HQE', 'NXJ',
       'TARGET CLASS'],
      dtype='object')

Since this is a classified data set, we have no idea what any of these columns mean. For now, it is sufficient to recognize that every column is numerical in nature and thus well-suited for modelling with machine learning techniques.
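
If you want to verify this yourself, a quick optional check is to print the column data types, which should all be numeric:

print(raw_data.dtypes)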

Standardizing the Data Set

Since the K nearest neighbors algorithm makes predictions about a data point by using the observations that are closest to it, the scale of the features within a data set matters a lot.

Because of this, machine learning practitioners typically standardize the data set, which means adjusting every x value so that they are roughly on the same scale.
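
Concretely, standardizing a feature means replacing each value x with z = (x - mean) / std, computed column by column. As a minimal sketch of what the scikit-learn code below does under the hood (an illustration, not part of the original tutorial):

features = raw_data.drop('TARGET CLASS', axis=1)
standardized = (features - features.mean()) / features.std(ddof=0)  # ddof=0 matches StandardScaler's population std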

Fortunately, scikit-learn includes some excellent functionality to do this with very little headache.

To start, we will need to import the StandardScaler class from scikit-learn. Add the following command to your Python script to do this:

from sklearn.preprocessing import StandardScaler

This class behaves a lot like the LinearRegression and LogisticRegression classes that we used earlier in this course. We will want to create an instance of this class and then fit that instance on our data set.

First, let’s create an instance of the StandardScaler class named scaler with the following statement:

scaler = StandardScaler()

We can now train this instance on our data set using the fit method:

scaler.fit(raw_data.drop('TARGET CLASS', axis=1))

Now we can use the transform method to standardize all of the features in the data set so they are roughly the same scale. We’ll assign these scaled features to the variable named scaled_features:

scaled_features = scaler.transform(raw_data.drop('TARGET CLASS', axis=1))

This actually creates a NumPy array of all the features in the data set, and we want it to be a pandas DataFrame instead.

Fortunately, this is an easy fix. We’ll simply wrap the scaled_features variable in a pd.DataFrame method and assign this DataFrame to a new variable called scaled_data with an appropriate argument to specify the column names:

scaled_data = pd.DataFrame(scaled_features, columns = raw_data.drop('TARGET CLASS', axis=1).columns)
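
As an optional sanity check (not in the original article), you can confirm that the scaling worked: every column of scaled_data should now have a mean of roughly 0 and a standard deviation of roughly 1.

scaled_data.describe().loc[['mean', 'std']]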

Now that we have imported our data set and standardized its features, we are ready to split the data set into training data and test data.

Splitting the Data Set Into Training Data and Test Data

We will use the train_test_split function from scikit-learn combined with list unpacking to create training data and test data from our classified data set.

First, you’ll need to import train_test_split from the model_selection module of scikit-learn with the following statement:

from sklearn.model_selection import train_test_split

Next, we will need to specify the x and y values that will be passed into this train_test_split function.

The x values will be the scaled_data DataFrame that we created previously. The y values will be the TARGET CLASS column of our original raw_data DataFrame.

You can create these variables with the following statements:

x = scaled_data
y = raw_data['TARGET CLASS']

Next, you’ll need to run the train_test_split function using these two arguments and a reasonable test_size. We will use a test_size of 30%, which gives the following parameters for the function:

x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.3)
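
One aside: train_test_split shuffles the data randomly, so your exact split (and the metrics later in this tutorial) will vary from run to run. If you want reproducible results, you can pass a fixed seed, for example random_state = 42 (any integer works):

x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.3, random_state = 42)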

Now that our data set has been split into training data and test data, we’re ready to start training our model!

Training a K Nearest Neighbors Model

Let’s start by importing the KNeighborsClassifier from scikit-learn:

from sklearn.neighbors import KNeighborsClassifier

Next, let’s create an instance of the KNeighborsClassifier class and assign it to a variable named model.

This class requires a parameter named n_neighbors, which is equal to the K value of the K nearest neighbors algorithm that you’re building. To start, let’s specify n_neighbors = 1:

model = KNeighborsClassifier(n_neighbors = 1)

Now we can train our K nearest neighbors model using the fit method and our x_training_data and y_training_data variables:

model.fit(x_training_data, y_training_data)

Now let’s make some predictions with our newly-trained K nearest neighbors algorithm!

Making Predictions With Our K Nearest Neighbors Algorithm

We can make predictions with our K nearest neighbors algorithm in the same way that we did with our linear regression and logistic regression models earlier in this course: by using the predict method and passing in our x_test_data variable.

More specifically, here’s how you can make predictions and assign them to a variable called predictions:

predictions = model.predict(x_test_data)

Let’s explore how accurate our predictions are in the next section of this tutorial.

Measuring the Accuracy of Our Model

We saw in our logistic regression tutorial that scikit-learn comes with built-in functions that make it easy to measure the performance of machine learning classification models.

Let’s import two of these functions (classification_report and confusion_matrix) into our script now:

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

Let’s work through each of these one-by-one, starting with the classification_report. You can generate the report with the following statement:

print(classification_report(y_test_data, predictions))

This generates:

              precision    recall  f1-score   support

           0       0.94      0.85      0.89       150
           1       0.86      0.95      0.90       150

    accuracy                           0.90       300
   macro avg       0.90      0.90      0.90       300
weighted avg       0.90      0.90      0.90       300

Similarly, you can generate a confusion matrix with the following statement:

print(confusion_matrix(y_test_data, predictions))

This generates:

[[141  12]
 [ 18 129]]

Looking at these performance metrics, it looks like our model is already fairly performant. It can still be improved.

In the next section, we will see how we can improve the performance of our K nearest neighbors model by choosing a better value for K.

Choosing An Optimal K Value Using the Elbow Method

In this section, we will use the elbow method to choose an optimal value of K for our K nearest neighbors algorithm.

The elbow method involves iterating through different K values and selecting the value with the lowest error rate when applied to our test data.

To start, let’s create an empty list called error_rates. We will loop through different K values and append their error rates to this list.

error_rates = []

Next, we need to make a Python loop that iterates through the different values of K we’d like to test and executes the following functionality with each iteration:

  • Creates a new instance of the KNeighborsClassifier class from scikit-learn
  • Trains the new model using our training data
  • Makes predictions on our test data
  • Calculates the error rate, the mean proportion of incorrect predictions (the lower this is, the more accurate our model is)

Here is the code to do this for K values between 1 and 100:

for i in np.arange(1, 101):
    new_model = KNeighborsClassifier(n_neighbors = i)
    new_model.fit(x_training_data, y_training_data)
    new_predictions = new_model.predict(x_test_data)
    error_rates.append(np.mean(new_predictions != y_test_data))

Let’s visualize how our error rate changes with different K values using a quick matplotlib visualization:

plt.plot(error_rates)
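
The plot is easier to read with the actual K values on the x-axis; here is an optional variant of the same plot (not in the original code) with labeled axes:

plt.plot(np.arange(1, 101), error_rates)
plt.xlabel('K')
plt.ylabel('Error rate')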

As you can see, our error rates tend to be minimized with a K value of approximately 50. This means that 50 is a suitable choice for K that balances both simplicity and predictive power.
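
Assuming you settle on K = 50, the last step would be to retrain the model with that value, reusing the variables defined earlier (a sketch, not part of the original tutorial):

model = KNeighborsClassifier(n_neighbors = 50)
model.fit(x_training_data, y_training_data)
predictions = model.predict(x_test_data)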

The Full Code For This Tutorial

You can view the full code for this tutorial in this GitHub repository. It is also pasted below for your reference:

#Common imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#Import the data set
raw_data = pd.read_csv('classified_data.csv', index_col = 0)

#Import standardization functions from scikit-learn
from sklearn.preprocessing import StandardScaler

#Standardize the data set
scaler = StandardScaler()
scaler.fit(raw_data.drop('TARGET CLASS', axis=1))
scaled_features = scaler.transform(raw_data.drop('TARGET CLASS', axis=1))
scaled_data = pd.DataFrame(scaled_features, columns = raw_data.drop('TARGET CLASS', axis=1).columns)

#Split the data set into training data and test data
from sklearn.model_selection import train_test_split
x = scaled_data
y = raw_data['TARGET CLASS']
x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.3)

#Train the model and make predictions
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors = 1)
model.fit(x_training_data, y_training_data)
predictions = model.predict(x_test_data)

#Performance measurement
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(classification_report(y_test_data, predictions))
print(confusion_matrix(y_test_data, predictions))

#Selecting an optimal K value
error_rates = []
for i in np.arange(1, 101):
    new_model = KNeighborsClassifier(n_neighbors = i)
    new_model.fit(x_training_data, y_training_data)
    new_predictions = new_model.predict(x_test_data)
    error_rates.append(np.mean(new_predictions != y_test_data))
plt.figure(figsize=(16,12))
plt.plot(error_rates)

K-Means Clustering Models

The K-means clustering algorithm is typically the first unsupervised machine learning model that students will learn.

It allows machine learning practitioners to create groups of data points within a data set with similar quantitative characteristics. It is useful for solving problems like creating customer segments or identifying localities in a city with high crime rates.

In this section, you will learn how to build your first K means clustering algorithm in Python.

The Data Set We Will Use In This Tutorial

In this tutorial, we will be using a data set generated with scikit-learn.

Let’s import scikit-learn’s make_blobs function to create this artificial data. Open up a Jupyter Notebook and start your Python script with the following statement:

from sklearn.datasets import make_blobs

Now let’s use the make_blobs function to create some artificial data!

More specifically, here is how you could create a data set with 200 samples that has 2 features and 4 cluster centers. The standard deviation within each cluster will be set to 1.8.

raw_data = make_blobs(n_samples = 200, n_features = 2, centers = 4, cluster_std = 1.8)

If you print this raw_data object, you’ll notice that it is actually a Python tuple. The first element of this tuple is a NumPy array with 200 observations. Each observation contains 2 features (just like we specified with our make_blobs function!).
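
A quick optional check of that structure:

print(type(raw_data))      # <class 'tuple'>
print(raw_data[0].shape)   # (200, 2): the feature array
print(raw_data[1].shape)   # (200,): the cluster label for each observation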

Now that our data has been created, we can move on to importing other important open-source libraries into our Python script.

The Imports We Will Use In This Tutorial

This tutorial will make use of a number of popular open-source Python libraries, including pandas, NumPy, and matplotlib. Let’s continue our Python script by adding the following imports:

import pandas as pd
import numpy as np
import seaborn
import matplotlib.pyplot as plt
%matplotlib inline

The first group of imports in this code block is for manipulating large data sets. The second group of imports is for creating data visualizations.

Let’s move on to visualizing our data set next.

Visualizing Our Data Set

In our make_blobs function, we specified for our data set to have 4 cluster centers. The best way to verify that this has been handled correctly is by creating some quick data visualizations.

To start, let’s use the following command to plot all of the rows in the first column of our data set against all of the rows in the second column of our data set:
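
plt.scatter(raw_data[0][:,0], raw_data[0][:,1])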

Note: your data set will appear differently than mine since this is randomly-generated data.

This image seems to indicate that our data set has only three clusters. This is because two of the clusters are very close to each other.

To fix this, we need to reference the second element of our raw_data tuple, which is a NumPy array that contains the cluster to which each observation belongs.

If we color our data set using each observation’s cluster, the unique clusters will quickly become clear. Here is the code to do this:

plt.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1])

We can now see that our data set has four unique clusters. Let’s move on to building our K means cluster model in Python!

Building and Training Our K Means Clustering Model

The first step to building our K means clustering algorithm is importing it from scikit-learn. To do this, add the following command to your Python script:

from sklearn.cluster import KMeans

Next, let’s create an instance of this KMeans class with a parameter of n_clusters=4 and assign it to the variable model:

model = KMeans(n_clusters=4)
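
Note that KMeans also initializes its cluster centers randomly, so cluster assignments can vary between runs. As with train_test_split, you can pass a fixed seed for reproducibility (an optional tweak, not in the original article):

model = KMeans(n_clusters=4, random_state=42)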

Now let’s train our model by invoking the fit method on it and passing in the first element of our raw_data tuple:

model.fit(raw_data[0])

In the next section, we’ll explore how to make predictions with this K means clustering model.

Before moving on, I wanted to point out one difference that you may have noticed between the process for building this K means clustering algorithm (which is an unsupervised machine learning algorithm) and the supervised machine learning algorithms we’ve worked with so far in this course.

Namely, we did not have to split the data set into training data and test data. This is an important difference - and in fact, you never need to make the train/test split on a data set when building unsupervised machine learning models!

Making Predictions With Our K Means Clustering Model

Machine learning practitioners generally use K means clustering algorithms to make two types of predictions:

  • Which cluster each data point belongs to
  • Where the center of each cluster is

It is easy to generate these predictions now that our model has been trained.

First, let’s predict which cluster each data point belongs to. To do this, access the labels_ attribute from our model object using the dot operator, like this:

model.labels_

This generates a NumPy array with predictions for each data point that looks like this:

array([3, 2, 7, 0, 5, 1, 7, 7, 6, 1, 2, 4, 6, 7, 6, 4, 4, 3, 3, 6, 0, 0,
       6, 4, 5, 6, 0, 2, 6, 5, 4, 3, 4, 2, 6, 6, 6, 5, 6, 2, 1, 1, 3, 4,
       3, 5, 7, 1, 7, 5, 3, 6, 0, 3, 5, 5, 7, 1, 3, 1, 5, 7, 7, 0, 5, 7,
       3, 4, 0, 5, 6, 5, 1, 4, 6, 4, 5, 6, 7, 2, 2, 0, 4, 1, 1, 1, 6, 3,
       3, 7, 3, 6, 7, 7, 0, 3, 4, 3, 4, 0, 3, 5, 0, 3, 6, 4, 3, 3, 4, 6,
       1, 3, 0, 5, 4, 2, 7, 0, 2, 6, 4, 2, 1, 4, 7, 0, 3, 2, 6, 7, 5, 7,
       5, 4, 1, 7, 2, 4, 7, 7, 4, 6, 6, 3, 7, 6, 4, 5, 5, 5, 7, 0, 1, 1,
       0, 0, 2, 5, 0, 3, 2, 5, 1, 5, 6, 5, 1, 3, 5, 1, 2, 0, 4, 5, 6, 3,
       4, 4, 5, 6, 4, 4, 2, 1, 7, 4, 6, 6, 0, 6, 3, 5, 0, 5, 2, 4, 6, 0,
       1, 0], dtype=int32)

To see where the center of each cluster lies, access the cluster_centers_ attribute using the dot operator like this:

model.cluster_centers_

This generates a two-dimensional NumPy array that contains the coordinates of each cluster’s center. It will look like this:

array([[ -8.06473328,  -0.42044783],
       [  0.15944397,  -9.4873621 ],
       [  1.49194628,   0.21216413],
       [-10.97238157,  -2.49017206],
       [  3.54673215,  -9.7433692 ],
       [ -3.41262049,   7.80784834],
       [  2.53980034,  -2.96376999],
       [ -0.4195847 ,   6.92561289]])

We’ll assess the accuracy of these predictions in the next section.

Visualizing the Accuracy of Our Model

The last thing we’ll do in this tutorial is visualize the accuracy of our model. You can use the following code to do this:

f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10,6))

ax1.set_title('Our Model')
ax1.scatter(raw_data[0][:,0], raw_data[0][:,1], c=model.labels_)

ax2.set_title('Original Data')
ax2.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1])

This generates two different plots side-by-side where one plot shows the clusters according to the real data set and the other plot shows the clusters according to our model. Here is what the output looks like:

Although the coloring between the two plots is different, you can see that our model did a fairly good job of predicting the clusters within our data set. You can also see that the model was not perfect - if you look at the data points along a cluster’s edge, you can see that it occasionally misclassified an observation from our data set.

There’s one last thing that needs to be mentioned about measuring our model’s predictions. In this example, we knew which cluster each observation belonged to because we actually generated this data set ourselves.

This is highly unusual. K means clustering is more often applied when the clusters aren’t known in advance. Instead, machine learning practitioners use K means clustering to find patterns that they don’t already know within a data set.

The Full Code For This Tutorial

You can view the full code for this tutorial in this GitHub repository. It is also pasted below for your reference:

#Create artificial data set
from sklearn.datasets import make_blobs
raw_data = make_blobs(n_samples = 200, n_features = 2, centers = 4, cluster_std = 1.8)

#Data imports
import pandas as pd
import numpy as np

#Visualization imports
import seaborn
import matplotlib.pyplot as plt
%matplotlib inline

#Visualize the data
plt.scatter(raw_data[0][:,0], raw_data[0][:,1])
plt.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1])

#Build and train the model
from sklearn.cluster import KMeans
model = KMeans(n_clusters=4)
model.fit(raw_data[0])

#See the predictions
model.labels_
model.cluster_centers_

#Plot the predictions against the original data set
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10,6))
ax1.set_title('Our Model')
ax1.scatter(raw_data[0][:,0], raw_data[0][:,1], c=model.labels_)
ax2.set_title('Original Data')
ax2.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1])

Final Thoughts

This tutorial taught you how to build K-nearest neighbors and K-means clustering machine learning models in Python.

If you're interested in learning more about machine learning, my book Pragmatic Machine Learning will teach you practical machine learning techniques by building 9 real projects. The book launches August 3rd. You can preorder it for 50% off using the link below:

Here is a brief summary of what you learned about K-nearest neighbors models in Python:

  • How classified data is a common tool used to teach students how to solve their first K nearest neighbor problems
  • Why it’s important to standardize your data set when building K nearest neighbor models
  • How to split your data set into training data and test data using the train_test_split function
  • How to train your first K nearest neighbors model and make predictions with it
  • How to measure the performance of a K nearest neighbors model
  • How to use the elbow method to select an optimal value of K in a K nearest neighbors model

Similarly, here is a brief summary of what you learned about K-means clustering models in Python:

  • How to create artificial data in scikit-learn using the make_blobs function
  • How to build and train a K means clustering model
  • That unsupervised machine learning techniques do not require you to split your data into training data and test data
  • How to build and train a K means clustering model using scikit-learn
  • How to visualize the performance of a K means clustering algorithm when you know the clusters in advance

Translated from: https://www.freecodecamp.org/news/how-to-build-and-train-k-nearest-neighbors-ml-models-in-python/
