k-Nearest Neighbors and the Curse of Dimensionality
Machine Learning models and the curse of dimensionality
There is always a trade-off between things in life: if you take a certain path, there is always a possibility that you will have to compromise on some other parameter. Machine learning models are no different. In the case of k-Nearest Neighbors, there has always been a problem with a huge impact on classifiers that rely on pairwise distance, and that problem is nothing but the “Curse of Dimensionality”. By the end of this article you will be able to create your own k-Nearest Neighbors model and observe the impact of increasing the dimensionality of the data set it is fit to. Let’s dig in!
Creating a k-Nearest Neighbor model:
Right before we get our hands dirty with the technical part, we need to lay the foundation for our analysis, which is nothing but the libraries.
Thanks to built-in machine learning packages, our job is made quite easy.
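The sketches below assume the usual NumPy / Matplotlib / scikit-learn stack; the original post does not show its imports, so this exact list is an assumption:

```python
# Assumed stack for the sketches in this article; the original post
# does not list its exact imports.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
```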
Nearest neighbors classifier:
Let’s begin with a simple nearest neighbor classifier, where we are posed with a binary classification task: we have a set of labeled inputs, where the labels are all either 0 or 1. Our goal is to train a classifier to predict a 0 or 1 label for new, unseen test data. One conceptually simple approach is to find the sample in the training data that is “most similar” to our test sample (a “neighbor” in the feature space), and then give the test sample the same label as that “most similar” training sample. This is the nearest neighbors classifier.
After running a few lines of code we can visualize our data set, with training data shown in blue (negative class) and red (positive class). A test sample is shown in green. To keep things simple, I have used a simple linear boundary for classification.
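A sketch of how such a data set might be generated and plotted. The boundary x2 > x1 and the sample count are assumptions for illustration, not the post’s actual data:

```python
rng = np.random.default_rng(42)

# Uniform 2D training points, labeled by an assumed linear boundary x2 > x1.
X_train = rng.uniform(0, 1, size=(40, 2))
y_train = (X_train[:, 1] > X_train[:, 0]).astype(int)
x_test = rng.uniform(0, 1, size=(1, 2))  # one green test sample

plt.scatter(*X_train[y_train == 0].T, c="blue", label="negative (0)")
plt.scatter(*X_train[y_train == 1].T, c="red", label="positive (1)")
plt.scatter(*x_test.T, c="green", marker="*", s=150, label="test")
plt.legend()
plt.show()
```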
To find the nearest neighbor, we need a distance metric. For our case, I chose the L2 norm. There are a few perks to using the L2 norm as a distance metric: assuming we don’t have any outliers, it minimizes the mean cost and treats every feature equally.
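Written out explicitly, the L2 (Euclidean) distance between two feature vectors is:

```python
def l2_distance(a, b):
    """Euclidean (L2) distance between two feature vectors."""
    return np.sqrt(np.sum((a - b) ** 2))  # same as np.linalg.norm(a - b)
```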
The nearest neighbor to the test sample is circled, and its label is applied as the prediction for the test sample:
Nearest Neighbor classified
Using the nearest neighbor, we successfully classified our test sample as label “0”; but again, we assumed there were no outliers, and we also moderated the noise.
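A minimal 1-nearest-neighbor prediction using that distance, continuing the toy data sketched above:

```python
# Distance from the test sample to every training sample; the closest wins.
distances = np.array([l2_distance(x, x_test[0]) for x in X_train])
nearest = np.argmin(distances)
prediction = y_train[nearest]  # the test sample inherits this label
print(f"predicted label: {prediction}")
```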
The nearest neighbor classifier works by “memorizing” the training data. One interesting consequence of this is that it will have zero prediction error (or equivalently, 100% accuracy) on the training data, since each training sample’s nearest neighbor is itself:
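A quick check of that property with scikit-learn (a sketch; the post’s own code is not shown). With k=1, every training sample retrieves itself, so training accuracy is exactly 1.0:

```python
nn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(nn.score(X_train, y_train))  # 1.0: each sample is its own nearest neighbor
```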
Now we look to overcome the shortcomings of the nearest neighbor model, and the answer lies in the k-Nearest Neighbors classifier.
K nearest neighbors classifier:
To make this approach less sensitive to noise, we might choose to look for multiple similar training samples to each new test sample, and classify the new test sample using the mode of the labels of the similar training samples. This is k nearest neighbors, where k is the number of “neighbors” that we search for.
In the following plot, we show the same data as in the previous example. Now, however, the 3 closest neighbors to the test sample are circled, and the mode of their labels is used as the prediction for the new test sample. Feel free to play with the parameter k and observe the changes.
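The same idea sketched with scikit-learn, where n_neighbors is the k to play with:

```python
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.predict(x_test))  # mode of the 3 nearest training labels

# kneighbors() returns the distances to and indices of the k neighbors;
# these are the points circled in the plot.
dist, idx = knn.kneighbors(x_test)
```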
k-NN classifier with k=3
The following image shows a set of test points plotted on top of the training data. The size of each test point indicates the confidence in the label, which we approximate by the proportion of the k neighbors sharing that label.
Confidence score
The bigger the dot, the higher the confidence score for that point.
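That confidence, the fraction of the k neighbors voting for each label, is what predict_proba reports for a KNeighborsClassifier with the default uniform weights:

```python
proba = knn.predict_proba(x_test)  # per-class neighbor fractions, e.g. [[0.67, 0.33]]
confidence = proba.max(axis=1)     # proportion of neighbors behind the winning label
```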
Also note that the training error for k nearest neighbors is not necessarily zero (though it can be!), since a training sample may have a different label than its k closest neighbors.
Feature scaling:
One important limitation of k nearest neighbors is that it does not “l(fā)earn” anything about which features are most important for determining y. Every feature is weighted equally in finding the nearest neighbor.
The first implication of this is:
- If all features are equally important, but they are not all on the same scale, they must be normalized, i.e. rescaled onto the interval [0,1]; otherwise, the features with the largest magnitudes will dominate the total distance (a scaling sketch follows this list).
The second implication is:
- Even if some features are more important than others, they will all be considered equally important in the distance calculation. If uninformative features are included, they may dominate the distance calculation.
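A sketch of the [0,1] rescaling mentioned in the first point, using scikit-learn’s MinMaxScaler; note that the scaler is fit on the training data only and then reused on the test data:

```python
scaler = MinMaxScaler()  # rescales each feature onto [0, 1]
X_train_scaled = scaler.fit_transform(X_train)
x_test_scaled = scaler.transform(x_test)  # apply the training min/max to test data
```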
Contrast this with our logistic regression classifier. In the logistic regression, the training process involves learning coefficients. The coefficients weight each feature’s effect on the overall output.
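For contrast, a two-line sketch: a fitted logistic regression exposes one learned weight per feature, which k-NN has no analogue of:

```python
logreg = LogisticRegression().fit(X_train, y_train)
print(logreg.coef_)  # one weight per feature; larger magnitude means more influence
```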
Let’s see how our model performs for an image classification problem. Consider the following images from CIFAR10, a dataset of low-resolution images in ten classes:
Images classified as car
The images above show a test sample and two training samples with their distances to the test sample.
The background pixels in the test sample “count” just as much as the foreground pixels, so the image of the deer is considered a very close neighbor, while the image of the car is not. As stated before, we used the L2 norm, and our model considers every pixel equally, which makes it difficult for nearest neighbors to classify real-world images.
Images classified as car
We also see here that Euclidean distance is not a good metric of visual similarity — the frog on the right is almost as similar to the car as the deer in the middle!
K nearest neighbors regression:
K nearest neighbors can also be used for regression, with just a small change: instead of using the mode of the nearest neighbors to predict the label of a new sample, we use the mean. Consider the following training data:
We can add a test sample, then use k nearest neighbors to predict its value:
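A minimal k-NN regression sketch; the 1D toy data here is an assumption for illustration:

```python
rng = np.random.default_rng(1)
X_reg = rng.uniform(0, 5, size=(50, 1))               # assumed 1D toy inputs
y_reg = np.sin(X_reg[:, 0]) + rng.normal(0, 0.1, 50)  # noisy targets

# The prediction is the *mean* of the k nearest labels instead of the mode.
knn_reg = KNeighborsRegressor(n_neighbors=3).fit(X_reg, y_reg)
print(knn_reg.predict([[2.5]]))
```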
The “curse of dimensionality”:
Classifiers that rely on pairwise distance between points, like the k neighbors methods, are heavily impacted by a problem known as the “curse of dimensionality”. In this section, I will illustrate the problem. We will look at a problem with data uniformly distributed in each dimension of the feature space, and two classes separated by a linear boundary.
We will generate a test point and show the k nearest neighbors to that test point. We will also show the length (or area, or volume) that we had to search to find those k neighbors, and observe how the radius required to find the nearest neighbors grows as the dimensionality of the space increases.
Pay special attention to how that length (or area, or volume) changes as we increase the dimensionality of the feature space.
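A sketch of the experiment behind the plots that follow; the sample count and k are assumptions:

```python
def kth_neighbor_distance(d, n=100, k=3, seed=0):
    """Distance to the k-th nearest neighbor of a random test point,
    given n training points uniform in the d-dimensional unit cube."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0, 1, size=(n, d))
    x = rng.uniform(0, 1, size=d)
    return np.sort(np.linalg.norm(X - x, axis=1))[k - 1]

for d in (1, 2, 3, 10, 100):
    print(d, round(kth_neighbor_distance(d), 3))  # the search radius grows with d
```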
First, let's observe the 1D problem:
1D space radius search
Now, the 2D equivalent:
2D space radius search
Finally, the 3D equivalent:
3D space radius search
We can see that as the dimensionality of the problem grows, the higher-dimensional space is less densely occupied by the training data, and we need to search a larger volume of space to find neighbors of the test point. The pairwise distance between points grows as we add additional dimensions.
And in that case, the neighbors may be so far away that they don’t actually have much in common with the test point.
In general, the length of the smallest hyper-cube that contains all k nearest neighbors of a test point is

$(k/N)^{1/d}$

for N samples with dimensionality d.
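Evaluating that expression makes the growth concrete; k=10 and N=1000 are assumed numbers:

```python
k, N = 10, 1000
for d in (1, 2, 3, 10, 100):
    print(d, round((k / N) ** (1 / d), 3))
# edge length: 0.01, 0.1, 0.215, 0.631, 0.955 (approaching the full data range)
```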
From the expression above, we can see that as the number of dimensions increases linearly, the number of training samples must increase exponentially to counter the “curse”.
Alternatively, we can reduce d — either by feature selection or by transforming the data into a lower-dimensional space.
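A sketch of the second option with PCA; the stand-in data and the target of 2 components are assumptions:

```python
rng = np.random.default_rng(2)
X_high = rng.uniform(0, 1, size=(100, 50))  # stand-in high-dimensional data
pca = PCA(n_components=2)                   # assumed target dimensionality
X_low = pca.fit_transform(X_high)
print(X_low.shape)  # (100, 2)
```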
Translated from: https://towardsdatascience.com/k-nearest-neighbors-and-the-curse-of-dimensionality-7d64634015d9