
How to Build KNN from Scratch in Python


k-Nearest Neighbors

k-Nearest Neighbors (KNN) is a supervised machine learning algorithm that can be used for either regression or classification tasks. KNN is non-parametric, which means that the algorithm does not make assumptions about the underlying distributions of the data. This is in contrast to a technique like linear regression, which is parametric, and requires us to find a function that describes the relationship between dependent and independent variables.

KNN has the advantage of being quite intuitive to understand. When used for classification, a query point (or test point) is classified based on the k labeled training points that are closest to that query point.

For a simplified example, see the figure below. The left panel shows a 2-d plot of sixteen data points — eight are labeled as green, and eight are labeled as purple. Now, the right panel shows how we would classify a new point (the black cross), using KNN when k=3. We find the three closest points, and count up how many ‘votes’ each color has within those three points. In this case, two of the three points are purple — so, the black cross will be labeled as purple.

[Figure: 2-d classification using KNN when k=3]
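As a quick illustration of that voting step only (with made-up labels, not the full classifier built later in this post), the majority vote can be tallied with Python's collections.Counter:

# Hypothetical labels of the 3 nearest neighbors from the example above
from collections import Counter

nearest_labels = ['purple', 'purple', 'green']

votes = Counter(nearest_labels)
print(votes.most_common())        # [('purple', 2), ('green', 1)]
print(votes.most_common()[0][0])  # 'purple' is the predicted label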

Calculating Distance

The distance between points is determined by using one of several versions of the Minkowski distance equation. The generalized formula for Minkowski distance can be represented as follows:
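D(X, Y) = ( Σ_{i=1}^{n} |x_i − y_i|^p )^(1/p)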

where X and Y are data points, n is the number of dimensions, and p is the Minkowski power parameter. When p=1, the distance is known as the Manhattan (or Taxicab) distance, and when p=2 the distance is known as the Euclidean distance. In two dimensions, the Manhattan and Euclidean distances between two points are easy to visualize (see the graph below); however, at higher orders of p, the Minkowski distance becomes more abstract.

[Figure: Manhattan and Euclidean distances in 2-d]
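To make the difference concrete, here is a small sketch (the two points are arbitrary, chosen only for round numbers) computing both distances with NumPy:

import numpy as np

a = np.array([1, 1])
b = np.array([4, 5])

# Manhattan distance (p=1): sum of absolute coordinate differences
manhattan = np.sum(np.abs(a - b))        # |4-1| + |5-1| = 7

# Euclidean distance (p=2): straight-line distance
euclidean = np.sqrt(np.sum((a - b)**2))  # sqrt(3**2 + 4**2) = 5.0

print(manhattan, euclidean)              # 7 5.0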

KNN in Python

To implement my own version of the KNN classifier in Python, I’ll first want to import a few common libraries to help out.

# Initial imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Loading Data

To test the KNN classifier, I’m going to use the iris data set from sklearn.datasets. The data set has measurements (Sepal Length, Sepal Width, Petal Length, Petal Width) for 150 iris plants, split evenly among three species (0 = setosa, 1 = versicolor, and 2 = virginica). Below, I load the data and store it in a dataframe.

# Load iris data and store in dataframe
from sklearn import datasets

iris = datasets.load_iris()

df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target
df.head()

I’ll also separate the data into features (X) and the target variable (y), which is the species label for each plant.

# Separate X and y data
X = df.drop('target', axis=1)
y = df.target

Building out the KNN Framework

Creating a functioning KNN classifier can be broken down into several steps. While KNN includes a bit more nuance than this, here’s my bare-bones to-do list:

  • Define a function to calculate the distance between two points
  • Use the distance function to get the distance between a test point and all known data points
  • Sort distance measurements to find the points closest to the test point (i.e., find the nearest neighbors)
  • Use majority class labels of those closest points to predict the label of the test point
  • Repeat steps 1 through 4 until all test data points are classified
1. Define a function to calculate distance between two points

    First, I define a function called minkowski_distance, that takes an input of two data points (a & b) and a Minkowski power parameter p, and returns the distance between the two points. Note that this function calculates distance exactly like the Minkowski formula I mentioned earlier. By making p an adjustable parameter, I can decide whether I want to calculate Manhattan distance (p=1), Euclidean distance (p=2), or some higher order of the Minkowski distance.

# Calculate distance between two points
def minkowski_distance(a, b, p=1):
    # Store the number of dimensions
    dim = len(a)

    # Set initial distance to 0
    distance = 0

    # Calculate minkowski distance using parameter p
    for d in range(dim):
        distance += abs(a[d] - b[d])**p
    distance = distance**(1/p)

    return distance

# Test the function
minkowski_distance(a=X.iloc[0], b=X.iloc[1], p=1)

0.6999999999999993
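As an optional sanity check (this assumes SciPy is installed; it is not otherwise used in this walkthrough), the result can be compared against SciPy's own Minkowski implementation, which computes the same formula:

# Optional: compare with scipy.spatial.distance.minkowski
from scipy.spatial.distance import minkowski

# Should agree with the value above, up to floating-point rounding
minkowski(X.iloc[0], X.iloc[1], p=1)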

2. Use the distance function to get distance between a test point and all known data points

    For step 2, I simply repeat the minkowski_distance calculation for all labeled points in X and store them in a dataframe.

# Define an arbitrary test point
test_pt = [4.8, 2.7, 2.5, 0.7]

# Calculate distance between test_pt and all points in X
distances = []

for i in X.index:
    distances.append(minkowski_distance(test_pt, X.iloc[i]))

df_dists = pd.DataFrame(data=distances, index=X.index, columns=['dist'])
df_dists.head()

3. Sort distance measurements to find the points closest to the test point

    In step 3, I use the pandas .sort_values() method to sort by distance, and return only the top 5 results.

# Find the 5 nearest neighbors
df_nn = df_dists.sort_values(by=['dist'], axis=0)[:5]
df_nn

4. Use majority class labels of those closest points to predict the label of the test point

    For this step, I use collections.Counter to keep track of the labels that coincide with the nearest neighbor points. I then use the .most_common() method to return the most commonly occurring label. Note: if there is a tie between two or more labels for the title of “most common” label, the one that was first encountered by the Counter() object will be the one that gets returned.

from collections import Counter

# Create counter object to track the labels
counter = Counter(y[df_nn.index])

# Get most common label of all the nearest neighbors
counter.most_common()[0][0]

1
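To see the tie-breaking behavior described above (in CPython 3.7+, Counter remembers insertion order, so ties in most_common() go to the label encountered first), here is a tiny standalone example with made-up labels:

from collections import Counter

# Labels 1 and 2 are tied with two votes each; 1 was seen first
tie = Counter([1, 2, 1, 2])
print(tie.most_common())        # [(1, 2), (2, 2)]
print(tie.most_common()[0][0])  # 1 -- the first label encountered wins the tie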

5. Repeat steps 1 through 4 until all test data points are classified

    In this step, I put the code I’ve already written to work and write a function to classify the data using KNN. First, I perform a train_test_split on the data (75% train, 25% test), and then scale the data using StandardScaler(). Since KNN is distance-based, it is important to make sure that the features are scaled properly before feeding them into the algorithm.

    Additionally, to avoid data leakage, it is good practice to scale the features after the train_test_split has been performed. First, scale the data from the training set only (scaler.fit_transform(X_train)), and then use that information to scale the test set (scaler.transform(X_test)). This way, I can ensure that no information outside of the training data is used to create the model.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data - 75% train, 25% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Scale the X data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

    Next, I define a function called knn_predict that takes in all of the training and test data, k, and p, and returns the predictions my KNN classifier makes for the test set (y_hat_test). This function doesn’t really include anything new — it is simply applying what I’ve already worked through above. The function should return a list of label predictions containing only 0’s, 1’s and 2’s.

def knn_predict(X_train, X_test, y_train, y_test, k, p):
    # Counter to help with label voting
    from collections import Counter

    # Make predictions on the test data
    # Need output of 1 prediction per test data point
    y_hat_test = []

    for test_point in X_test:
        distances = []

        for train_point in X_train:
            distance = minkowski_distance(test_point, train_point, p=p)
            distances.append(distance)

        # Store distances in a dataframe
        df_dists = pd.DataFrame(data=distances, columns=['dist'], index=y_train.index)

        # Sort distances, and only consider the k closest points
        df_nn = df_dists.sort_values(by=['dist'], axis=0)[:k]

        # Create counter object to track the labels of k closest neighbors
        counter = Counter(y_train[df_nn.index])

        # Get most common label of all the nearest neighbors
        prediction = counter.most_common()[0][0]

        # Append prediction to output list
        y_hat_test.append(prediction)

    return y_hat_test

# Make predictions on test dataset
y_hat_test = knn_predict(X_train, X_test, y_train, y_test, k=5, p=1)

print(y_hat_test)

[0, 1, 1, 0, 2, 1, 2, 0, 0, 2, 1, 0, 2, 1, 1, 0, 1, 1, 0, 0, 1, 1, 2, 0, 2, 1, 0, 0, 1, 2, 1, 2, 1, 2, 2, 0, 1, 0]

    And there they are! These are the predictions that this home-brewed KNN classifier has made on the test set. Let’s see how well it worked:

# Get test accuracy score
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, y_hat_test))

0.9736842105263158

    Looks like the classifier achieved 97% accuracy on the test set. Not too bad at all! But how do I know if it actually worked correctly? Let’s check the result of sklearn’s KNeighborsClassifier on the same data:

# Testing to see results from sklearn.neighbors.KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=5, p=1)
clf.fit(X_train, y_train)
y_pred_test = clf.predict(X_test)

print(f"Sklearn KNN Accuracy: {accuracy_score(y_test, y_pred_test)}")

Sklearn KNN Accuracy: 0.9736842105263158

    Nice! sklearn’s implementation of the KNN classifier gives us the exact same accuracy score.

Exploring the effect of varying k

    My KNN classifier performed quite well with the selected value of k = 5. KNN doesn’t have as many tune-able parameters as other algorithms like Decision Trees or Random Forests, but k happens to be one of them. Let’s see how the classification accuracy changes when I vary k:

# Obtain accuracy score varying k from 1 to 99
accuracies = []

for k in range(1, 100):
    y_hat_test = knn_predict(X_train, X_test, y_train, y_test, k, p=1)
    accuracies.append(accuracy_score(y_test, y_hat_test))

# Plot the results
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(range(1, 100), accuracies)
ax.set_xlabel('# of Nearest Neighbors (k)')
ax.set_ylabel('Accuracy (%)');

    In this case, using nearly any k value less than 20 results in great (>95%) classification accuracy on the test set. However, when k becomes greater than about 60, accuracy really starts to drop off. This makes sense, because the data set only has 150 observations — when k is that high, the classifier is probably considering labeled training data points that are way too far from the test points.

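As a quick check of that intuition (not in the original write-up), pushing k all the way to the size of the training set makes every neighborhood identical, so each test point simply receives the overall majority label of the training data:

# Extreme case: k equal to the number of training points
y_hat_extreme = knn_predict(X_train, X_test, y_train, y_test, k=len(X_train), p=1)

# Every prediction collapses to the single most common training label
print(set(y_hat_extreme))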

Every neighbor gets a vote — or do they?

    In writing my own KNN classifier, I chose to overlook one clear hyperparameter tuning opportunity: the weight that each of the k nearest points has in classifying a point. In sklearn’s KNeighborsClassifier, this is the weights parameter, and it can be set to ‘uniform’, ‘distance’, or another user-defined function.

    When set to ‘uniform’, each of the k nearest neighbors gets an equal vote in labeling a new point. When set to ‘distance’, the neighbors closest to the new point are weighted more heavily than the neighbors farther away. There are certainly cases where weighting by ‘distance’ would produce better results, and the only way to find out is through hyperparameter tuning.

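As a rough sketch of what that tuning could look like with sklearn (this is an illustration, not part of the original walkthrough; the parameter ranges are arbitrary), GridSearchCV can search over k and the weighting scheme together:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Search over the number of neighbors and the vote-weighting scheme
param_grid = {
    'n_neighbors': list(range(1, 31)),
    'weights': ['uniform', 'distance'],
}

grid = GridSearchCV(KNeighborsClassifier(p=1), param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)

In a fuller treatment, the scaler would sit inside a Pipeline so that each cross-validation fold is scaled independently.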

Final Thoughts

    Now, make no mistake — sklearn’s implementation is undoubtedly more efficient and more user-friendly than what I’ve cobbled together here. However, I found it a valuable exercise to work through KNN from ‘scratch’, and it has only solidified my understanding of the algorithm. I hope it did the same for you!

Translated from: https://towardsdatascience.com/how-to-build-knn-from-scratch-in-python-5e22b8920bd2
