
K-Nearest Neighbor Algorithm in Various Real-World Cases

In this blog, we'll talk about one of the most widely used machine learning algorithms for classification, the K-Nearest Neighbors (K-NN) algorithm, and how it behaves in various real-world cases. K-NN is a simple, easy-to-understand, versatile algorithm, one of the topmost machine learning algorithms, and it finds applications in a variety of fields.

Contents

  • Imbalanced and balanced datasets
  • Multi-class classification
  • K-NN with a given distance (or similarity) matrix
  • Train and test set differences
  • Impact of outliers
  • Scale and column standardization
  • Model interpretability
  • Feature importance
  • Categorical features
  • Missing value imputation
  • Curse of dimensionality
  • Bias-variance trade-off

To know how K-NN works, please read our previous blog; to read the blog, visit here.

1. Case of Imbalanced vs. Balanced Datasets

First, we want to know: what is an imbalanced dataset?

Consider two-class classification: if there is a very large difference between the number of positive-class and negative-class examples, then we say our dataset is an imbalanced dataset.

Imbalanced dataset

If the numbers of positive-class and negative-class examples in the given dataset are approximately the same, then we say our dataset is a balanced dataset.

Balanced dataset

K-NN is strongly affected by an imbalanced dataset because it takes a majority vote, and the vote can be dominated by the majority class.

How to work around an imbalanced-dataset issue?

Imbalanced data is not always a bad thing, and in real datasets there is always some degree of imbalance. That said, there should not be any big impact on your model's performance if the level of imbalance is relatively low.

Now, let's cover a few techniques to solve the class-imbalance problem.

Under-Sampling

Assume I have a dataset N with 1000 data points, split between two classes, n1 and n2, corresponding to positive and negative reviews. Here n1 is the positive class with 900 data points and n2 is the negative class with 100 data points, so n1 is the majority class (it has far more data points) and n2 is the minority class. To handle this imbalanced dataset I create a new dataset m: I take all 100 n2 data points as they are, randomly pick 100 n1 data points, and put them together into m. This sampling trick is called under-sampling.

Instead of using the full n1 and n2, we use m for modeling.

But in this approach we throw away data and lose information, which is not a good idea. To address this drawback of under-sampling, we introduce another method called over-sampling.
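As a minimal sketch, the under-sampling step above can be written in a few lines of numpy (the helper name `undersample` and the toy data are our illustration, not from the original post):

```python
import numpy as np

def undersample(X, y, seed=0):
    """Keep every minority-class point; randomly keep an equal number
    of majority-class points, discarding the rest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = []
    for c in classes:
        idx = np.where(y == c)[0]
        if len(idx) > n_min:
            idx = rng.choice(idx, size=n_min, replace=False)
        keep.append(idx)
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# 900 positive (n1) vs 100 negative (n2) points, as in the example above
X = np.arange(1000, dtype=float).reshape(-1, 1)
y = np.array([1] * 900 + [0] * 100)
X_m, y_m = undersample(X, y)  # the balanced dataset "m": 100 + 100 points
```

Note that the 800 discarded majority points carry information the model never sees, which is exactly the drawback discussed next.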

Over-Sampling

This technique modifies the unequal class counts to create a balanced dataset. When the quantity of minority data is insufficient, the over-sampling method tries to balance the classes by increasing the number of rare samples.

Over-sampling increases the number of minority-class members in the training set. The advantage of over-sampling is that no information from the original training set is lost, as all observations from the minority and majority classes are kept.

Over-sampling reduces the domination of one class over the dataset.

Instead of repeating points, we can also create artificial (synthetic) points in the minority-class region, by randomly sampling characteristics from occurrences in the minority class; this is called the Synthetic Minority Over-sampling Technique (SMOTE).

Note that a "dumb model" can get high accuracy on imbalanced data, so we cannot use accuracy as a performance measure when we have an imbalanced dataset.
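A minimal sketch of the SMOTE idea, interpolating between a minority point and one of its nearest minority-class neighbours (the function `smote_like` is our simplified illustration, not the full SMOTE algorithm):

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic minority points: pick a minority point,
    pick one of its k nearest minority-class neighbours, and sample a
    point on the line segment between them."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                   # random position on the segment
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
synthetic = smote_like(X_min, n_new=10)  # 10 new points inside the minority region
```

Because each synthetic point lies on a segment between two existing minority points, it stays inside the minority-class region rather than being an exact duplicate.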

2. Multi-Class Classification

Consider the MNIST dataset restricted to class labels Y ∈ {0, 1}; this is called binary classification.

A classification task with more than two classes is called multi-class classification. Consider the MNIST dataset with class labels Y ∈ {0, 1, 2, 3, 4, 5}.

Multi-Class Classification

K-NN extends easily to a multi-class classifier because it just takes the majority vote.

But some classification algorithms in machine learning, such as logistic regression, are inherently binary and cannot be extended to multi-class classification. As such, they cannot be used for multi-class classification tasks, at least not directly.

Instead, heuristic methods can be used to split a multi-class classification problem into multiple binary classification datasets and train a binary classification model on each. One such technique is One-vs-Rest.

One-vs-Rest is a heuristic method for using binary classification algorithms for multi-class classification. It involves splitting the multi-class dataset into multiple binary classification problems. A binary classifier is then trained on each binary classification problem, and predictions are made using the model that is the most confident.

For example, given a multi-class classification problem with examples for the classes "red", "blue", and "green", this could be divided into three binary classification datasets as follows:

    • Binary Classification Problem 1: red vs [blue, green]
    • Binary Classification Problem 2: blue vs [red, green]
    • Binary Classification Problem 3: green vs [red, blue]

A possible downside of this approach is that it requires one model to be created for each class. For example, three classes require three models. This could be an issue for large datasets.

To know about One-vs-Rest in scikit-learn, visit here.
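As a rough sketch with scikit-learn's `OneVsRestClassifier` (the synthetic three-class data below stands in for "red", "blue", "green" and is our own toy example):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# three classes standing in for red, blue, green
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

# wrap a binary learner; one "class vs rest" model is fit per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
n_models = len(ovr.estimators_)  # three classes -> three binary models
```

The `estimators_` list makes the downside above concrete: the number of fitted binary models grows linearly with the number of classes.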

3. K-NN with a Given Distance (or Similarity) Matrix

Distance-based classification is one of the popular methods for classifying instances using point-to-point distances, based on the nearest neighbors (K-NN).

The distance measure can be any of various available measures (e.g. Euclidean distance, Manhattan distance, Mahalanobis distance, or other specific distance measures).

Even if, instead of the raw data and labels, someone gives us only the similarity between each pair of products (or the distance between each pair of vectors), K-NN still works very well, because internally K-NN only ever needs the distance between two points.
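scikit-learn's `KNeighborsClassifier` supports exactly this setting via `metric="precomputed"`: we pass distance matrices instead of raw feature vectors. A tiny sketch on 1-D toy data of our own:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0.0], [1.0], [5.0], [6.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.5], [5.5]])

def dist_matrix(A, B):
    """Pairwise absolute distances between 1-D points."""
    return np.abs(A - B.T)

# fit on the train-train distance matrix; predict from test-train distances
knn = KNeighborsClassifier(n_neighbors=2, metric="precomputed")
knn.fit(dist_matrix(X_train, X_train), y_train)
pred = knn.predict(dist_matrix(X_test, X_train))  # no raw vectors needed
```

The classifier never sees the original coordinates, only the distances, which is all K-NN needs.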

4. Train and Test Set Differences

When data changes over time, the distributions of the train and test sets can drift apart. If the distributions of the train and test sets are different, the model cannot give good results.

We want to check the distributions of the train and test sets before we build a model.

But how can we know whether the train and test sets have different distributions?

Consider our dataset split into a train set and a test set, both containing x and y, where x is the given data points and y the labels.

To compare the distributions of the train and test sets, we create a new dataset from our existing one: each train point becomes x' = concat(x, y) with label y' = 1, and each test point becomes x' = concat(x, y) with label y' = 0. To this new dataset we apply a binary classifier such as K-NN. After applying the binary classifier, the results fall into the cases below.

Case 1:

If the model gets low accuracy, the train and test sets almost overlap, and their distributions are very similar.

Case 2:

If the model gets medium accuracy, the train and test sets overlap less, and their distributions are not very similar.

Case 3:

If the model gets high accuracy, the train and test sets barely overlap, and their distributions are very different.

If the train and test sets come from the same distribution, we are fine; otherwise, if the features change over time, we can design new features.
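A sketch of this train-vs-test check (often called "adversarial validation"). For simplicity we use the features only, without concatenating the label, and the shifted Gaussian data is our own toy example of drift:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(200, 3))  # pretend train features
test = rng.normal(3.0, 1.0, size=(200, 3))   # shifted mean -> drifted distribution

X_new = np.vstack([train, test])
y_new = np.array([1] * len(train) + [0] * len(test))  # y' = 1 train, y' = 0 test

# if a classifier separates the two sets easily, their distributions differ
acc = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_new, y_new, cv=5).mean()
```

Here `acc` is near 1.0 because the two sets were drawn from different distributions (Case 3 above); on identically distributed sets it would hover near 0.5 (Case 1).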

5. Impact of Outliers

The model can be easily understood by looking at its decision surface.

Consider the image above: if we apply K = 1 (1-NN), the decision surface changes around the outlier. When K is small, the impact of outliers on the model is larger; when K is large, the model is less prone to outlier impact.

Techniques to remove outliers in K-NN

Local Outlier Factor (LOF): The local outlier factor is based on the concept of local density, where locality is given by the k nearest neighbors, whose distances are used to estimate the density.

To understand LOF, let's see some basic definitions.

K-distance(Xi): the distance from Xi to the k-th nearest neighbor of Xi.

Neighborhood of Xi: the set of all points that belong to the K-NN of Xi. For example, with K = 5 the neighborhood of Xi is the set of all points belonging to the 5-NN of Xi, i.e. {x1, x2, x3, x4, x5}.

Reachability-distance(Xi, Xj):

reach-dist(Xi, Xj) = max{ k-distance(Xj), dist(Xi, Xj) }

Basically, if point Xi is within the k neighbors of point Xj, reach-dist(Xi, Xj) is the k-distance of Xj; otherwise, it is the real distance between Xi and Xj. This is just a "smoothing factor"; for simplicity, think of it as the usual distance between two points.

Local reachability density (LRD): To get the LRD of a point Xi, we first calculate the reachability distance from Xi to each of its k nearest neighbors and take the average of those values. The LRD is then simply the inverse of that average reachability distance.

Intuitively, the local reachability density tells us how far we have to travel from our point to reach the next point or cluster of points: the lower it is, the less dense the surroundings, and the farther we have to travel.

Local Outlier Factor (LOF): LOF is basically the average LRD of the points in the neighborhood of Xi multiplied by the inverse of the LRD of Xi.

This means that when the density of the neighborhood of a data point Xi is large while the density at Xi itself is small, Xi is considered an outlier.

When we apply LOF, a point with a large LOF value is considered an outlier; otherwise it is an inlier.
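scikit-learn ships this technique as `LocalOutlierFactor`; a small sketch on toy data of our own (one obvious outlier added to a Gaussian cluster):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))      # a dense cluster around the origin
X = np.vstack([X, [[8.0, 8.0]]])   # one far-away point

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)              # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_   # larger score = more outlying
```

The far-away point has a tiny LRD compared with its neighbors, so its LOF score is the largest and it is flagged as an outlier.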

6. Scale and Column Standardization

All distance-based algorithms are affected by the scale of the variables. KNN is a distance-based algorithm: it classifies data based on proximity to the K neighbors. Often we find that the features of the data we use are not on the same scale or in the same units.

An example is when we have the features age and height. Obviously these two features have different units: age is in years and height is in centimeters.

This difference in units causes distance-based algorithms such as KNN to perform sub-optimally, so it is necessary to rescale features with different units onto the same scale. Many methods can be used for rescaling features; here I will discuss two of them: min-max scaling and standard scaling.

Before we build a model, we need to rescale or standardize the features.
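A minimal sketch of both rescaling options with scikit-learn (the age/height values are our toy data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# age in years, height in centimetres: very different units and scales
X = np.array([[25.0, 160.0],
              [30.0, 175.0],
              [45.0, 180.0],
              [22.0, 150.0]])

X_minmax = MinMaxScaler().fit_transform(X)  # each column mapped into [0, 1]
X_std = StandardScaler().fit_transform(X)   # each column: mean 0, std 1
```

After either transform, a unit step in "age" and a unit step in "height" contribute comparably to a Euclidean distance, which is what KNN needs.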

7. Model Interpretability

A model is more interpretable than another model if its decisions are easier for a human to comprehend than the other model's decisions.

A model that can explain its results is called an interpretable model.

K-NN is interpretable when the dimensionality d is small and k is small.

8. Feature Importance

Feature importance tells us which features matter in our model, but K-NN does not provide feature importances internally.

To know which features are important, we apply selection techniques. Two popular members of the stepwise family are the forward-selection and backward-selection (also known as backward-elimination) algorithms.

Forward Selection: The procedure starts with an empty set of features. The best of the original features is determined and added to the reduced set; at each subsequent iteration, the best of the remaining original features is added to the set.

    • First, the best single feature is selected (i.e., using some criterion function such as accuracy, AUC, etc.).
    • Then, pairs of features are formed using one of the remaining features and this best feature, and the best pair is selected.
    • Next, triplets of features are formed using one of the remaining features and these two best features, and the best triplet is selected.
    • This procedure continues until a predefined number of features has been selected.

Backward Elimination: The procedure starts with the full set of features. At each step, it removes the worst feature remaining in the set. Given data with some existing features, this technique removes one feature per iteration based on some performance metric.

    • First, the criterion function is computed for all n features.
    • Then, each feature is deleted one at a time, the criterion function is computed for all subsets with n-1 features, and the worst feature is discarded (again using some criterion function such as accuracy, AUC, etc.).
    • Next, each feature among the remaining n-1 is deleted one at a time, and the worst feature is discarded to form a subset with n-2 features.
    • This procedure continues until a predefined number of features is left.
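scikit-learn implements both stepwise directions in `SequentialFeatureSelector`; a sketch wrapping K-NN on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # 4 features

# forward selection: start empty, greedily add the best feature each round
sfs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors=5),
                                n_features_to_select=2,
                                direction="forward")  # "backward" = elimination
sfs.fit(X, y)
mask = sfs.get_support()  # boolean mask over the 4 original features
```

Switching `direction="backward"` gives the backward-elimination variant described above with no other code changes.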

9. Categorical Features

In many practical data science activities, the dataset will contain categorical variables. These variables are typically stored as text values that represent various traits. Some examples include color ("Red", "Yellow", "Blue"), size ("Small", "Medium", "Large"), or geographic designations (state or country).

Label Encoding: Label encoders transform non-numerical labels into numerical labels. Each category is assigned a unique integer starting from 0 and going up to n_categories - 1 per feature. Label encoders are suitable for encoding variables where the alphabetical ordering or numerical value of the labels is meaningful. However, if you have nominal data, using label encoders may not be such a good idea.

One-Hot Encoding: One-hot encoding is the most widely used encoding scheme. It works by creating a column for each category present in the feature and assigning a 1 or 0 to indicate the presence or absence of that category in each row.

Binary Encoding: Binary encoding is not as intuitive as the two approaches above. It works like this:

    • The categories are first encoded as ordinals; for example, categories like red, yellow, green are assigned the labels 1, 2, 3 (say).
    • These integers are then converted into binary code, so for example 1 becomes 001 and 2 becomes 010, and so on, with one new column per binary digit.

Binary encoding is good for high-cardinality data, as it creates very few new columns. Most similar values overlap with each other across many of the new columns, which allows many machine learning algorithms to learn the similarity of the values.

If text features are present in the given dataset, use natural language processing techniques such as bag of words, TF-IDF, or Word2vec.

To know more about categorical features, visit here.
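The label and one-hot schemes above can be sketched with pandas (the "color" column is our toy example):

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Yellow", "Blue", "Red"]})

# label encoding: one integer code per category (alphabetical order here)
codes = df["color"].astype("category").cat.codes

# one-hot encoding: one 0/1 indicator column per category
onehot = pd.get_dummies(df["color"])
```

For K-NN specifically, one-hot encoding is usually safer for nominal data, because integer codes impose an artificial ordering that distorts distances.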

10. Missing Value Imputation

Missing values occur in our dataset due to data-collection errors or corrupted data.

How to work around missing values

  • Imputation techniques: impute with the mean, median, or mode of the given data.

  • Imputation by class label: if positive-class data is missing, take the mean of the positive-class values only; if negative-class data is missing, take the mean of the negative-class values only.

  • Model-based imputation: To prepare a dataset for machine learning we need to fix missing values, and we can fix missing values by applying machine learning to that dataset! If we consider a column with missing data as our target variable, and the existing columns with complete data as our predictor variables, then we can construct a machine learning model using the complete records as our train and test datasets and the records with incomplete data as our generalization target.

  • This is a fully scoped-out machine learning problem. Most of the time K-NN is used for model-based imputation because it uses the nearest-neighbors strategy.
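scikit-learn's `KNNImputer` performs exactly this nearest-neighbour imputation; a tiny sketch on toy values of our own:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],  # missing value to fill
              [3.0, 6.0],
              [4.0, 8.0]])

# fill the gap with the mean of the 2 nearest rows (by the observed features)
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```

The incomplete row's two nearest neighbours by the observed first feature are the rows with second-feature values 2.0 and 6.0, so the gap is filled with their mean, 4.0.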

11. Curse of Dimensionality

In machine learning it's important to know that, as dimensionality increases, the number of data points needed to build good classification models increases exponentially.

Hughes phenomenon: when the size of the dataset is fixed, performance decreases as dimensionality increases.

Distance functions (Euclidean distance): our intuition about distance in 3-D is not valid in high-dimensional spaces.

As dimensionality increases, careful choice of the number of dimensions (features) to use is the prerogative of the data scientist training the model. In general, the smaller the training set, the fewer features she should use. She must keep in mind that each feature increases the data-set requirement exponentially.

As dimensionality increases (see the image above), dist_max(Xi) ≈ dist_min(Xi).
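The claim that dist_max ≈ dist_min in high dimensions can be checked empirically; a numpy sketch measuring the relative spread of distances from the origin to random points (the dimensions 2 and 1000 are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_spread(d, n=1000):
    """(max - min) / min over distances from the origin to n random
    points in the d-dimensional unit cube."""
    dist = np.linalg.norm(rng.random((n, d)), axis=1)
    return (dist.max() - dist.min()) / dist.min()

low_d = relative_spread(2)      # distances vary a lot in 2-D
high_d = relative_spread(1000)  # distances concentrate: max ≈ min
```

In 1000 dimensions every point ends up roughly the same distance from the origin, which is exactly why "nearest" neighbors become less meaningful there.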

In K-NN, when the dimensionality d is high, Euclidean distance is not a good choice of distance measure; use cosine distance instead in high-dimensional spaces.

When dimensionality d is high and the data points are dense, the impact of dimensionality is high; when the data points are sparse, the impact of dimensionality is lower.

As dimensionality increases, the chance of overfitting the model also increases.

12. Bias-Variance Trade-Off

In machine learning theory, the bias-variance trade-off is the mathematical way of knowing whether a model is underfitting or overfitting.

A model is good when its error on future unseen data is low, which is given by:

Generalization error = Bias² + Variance + Irreducible error

The generalization error is the error on future unseen data; the bias error is due to underfitting; the variance error is due to overfitting; and the irreducible error is the error we cannot reduce further for the given model.

High bias means underfitting: error due to overly simplified assumptions about the model.

High variance means overfitting: how much the model changes as the training data changes; small changes in the data result in a very different model and very different decision surfaces.

A good model has low generalization error: no underfit, no overfit, and only some amount of irreducible error:

Generalization error = ↓ Bias² + ↓ Variance + Irreducible error

High bias (underfit): as train error increases, bias also increases.

High variance (overfit): when test error increases while train error decreases, variance increases.
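A sketch of how K controls this trade-off in K-NN (synthetic data and the K values are our illustrative choices): K = 1 memorizes the training set (high variance), while a very large K over-smooths (high bias):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def train_error(k):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    return 1.0 - knn.score(X_tr, y_tr)

err_k1 = train_error(1)      # each point is its own nearest neighbour
err_k150 = train_error(150)  # half the training set votes on every point
```

With K = 1 the train error is exactly zero (every point's nearest neighbour is itself), the classic symptom of high variance; cranking K up smooths the decision surface and pushes the train error up, the symptom of high bias.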

Translated from: https://medium.com/analytics-vidhya/k-nearest-neighbor-algorithm-in-various-real-world-cases-113c1dc75e91
