The KNN Nearest-Neighbor Algorithm
Classification with KNN (K-Nearest Neighbors)
Example: movie genre classification
Data source
Movie title                  Fight scenes  Kiss scenes  Genre
California Man                     3           104      Romance
He's Not Really into Dudes         8            95      Romance
Beautiful Woman                    1            81      Romance
Kevin Longblade                  111            15      Action
Robo Slayer 3000                  99             2      Action
Amped II                          88            10      Action
Unknown                           18            90      unknown
Data at a glance: judging by eye from the table, which genre does the unknown movie belong to?
Problem analysis: use the known samples to infer the genre of the unknown movie.
Core idea: the class of an unlabeled sample is decided by the classes of its K nearest neighbors.
Distance metric: Euclidean distance (the straight-line distance given by the Pythagorean theorem) is the usual choice. Alternatives include Manhattan distance (the sum of the horizontal and vertical differences) and cosine similarity (another way of expressing how close two points are). Mahalanobis distance is more precise still, because it can account for factors such as differing units; however, the inverse of the covariance matrix it requires may not exist, and computing it in three or more dimensions is quite involved, so in practice it is often skipped.
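As a quick illustration of these metrics, a minimal sketch comparing two samples from the table above, California Man (3, 104) and the unknown movie (18, 90):

```python
# Compare the distance metrics mentioned above on two samples from the table.
import numpy as np

a = np.array([3.0, 104.0])   # California Man: fights, kisses
b = np.array([18.0, 90.0])   # the unknown movie

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(a - b))           # sum of axis-wise differences
# Cosine measures similarity (1 = same direction), not a distance proper
cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean)   # ~20.52
print(manhattan)   # 29.0
print(cosine_sim)  # ~0.986
```

Note how the two metrics rank closeness the same way here, but would not in general.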
Knowledge extension
Mahalanobis distance: a distance measure based on the covariance of the data.
Variance: the mean of the squared distances from each point of a data set to the mean point.
Standard deviation: the square root of the variance.
Covariance cov(x, y): E denotes the mean, D the variance, x and y are two data sets, and xy is the data set formed by the element-wise products of x and y.
cov(x, y) = E(xy) - E(x)*E(y)
cov(x, x) = D(x)
cov(x1+x2, y) = cov(x1, y) + cov(x2, y)
cov(ax, by) = ab·cov(x, y)
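These identities are easy to check numerically. A small sketch (note that E(xy) − E(x)E(y) is the population covariance, i.e. what NumPy computes with ddof=0):

```python
# Numerically verify the covariance identities above.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

# cov(x, y) = E(xy) - E(x)E(y)
cov = lambda u, v: np.mean(u * v) - np.mean(u) * np.mean(v)

print(cov(x, y))                                      # 0.75
print(np.isclose(cov(x, x), np.var(x)))               # cov(x, x) = D(x) -> True
print(np.isclose(cov(2 * x, 3 * y), 6 * cov(x, y)))   # cov(ax, by) = ab·cov(x, y) -> True
```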
Covariance matrix: the matrix of pairwise covariances of the dimensions. With three dimensions a, b, c:
Σ = | cov(a, a)  cov(a, b)  cov(a, c) |
    | cov(b, a)  cov(b, b)  cov(b, c) |
    | cov(c, a)  cov(c, b)  cov(c, c) |
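Putting the pieces together, a sketch of the covariance matrix Σ and the Mahalanobis distance sqrt((x−y)ᵀ Σ⁻¹ (x−y)) on made-up data (the 3 columns stand in for dimensions a, b, c):

```python
# Build a 3x3 covariance matrix and compute one Mahalanobis distance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))       # 200 samples over dimensions a, b, c
sigma = np.cov(X, rowvar=False)     # the 3x3 matrix of pairwise covariances

# Mahalanobis distance between two samples; np.linalg.inv fails if sigma
# is singular, which is exactly the limitation noted above.
d = X[0] - X[1]
mahalanobis = np.sqrt(d @ np.linalg.inv(sigma) @ d)
print(sigma.shape)    # (3, 3)
print(mahalanobis)
```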
Algorithm implementation: Euclidean distance
Code
```python
# Custom implementation: mytest1.py
import numpy as np

# Create the data set
def createDataSet():
    features = np.array([[3, 104], [8, 95], [1, 81],
                         [111, 15], [99, 2], [88, 10]])
    labels = ["Romance", "Romance", "Romance", "Action", "Action", "Action"]
    return features, labels

def knnClassify(testFeature, trainingSet, labels, k):
    """KNN using Euclidean distance.
    :param testFeature: test sample, 1-D ndarray
    :param trainingSet: training samples, 2-D ndarray
    :param labels: training labels, one per training sample
    :param k: number of neighbors, int
    :return: predicted label, same type as the label elements
    """
    dataSetSize = trainingSet.shape[0]
    # Build diffMat: testFeature minus every row of trainingSet
    # (the per-feature differences inside the Euclidean distance)
    testFeatureArray = np.tile(testFeature, (dataSetSize, 1))
    diffMat = testFeatureArray - trainingSet
    # Square each difference
    sqDiffMat = diffMat ** 2
    # Sum the squared differences per sample
    sqDistances = sqDiffMat.sum(axis=1)
    # Euclidean distance from each training sample to testFeature
    distances = sqDistances ** 0.5

    # argsort lists sample indices in ascending order of distance:
    # e.g. distances = [5, 9, 0, 2] gives sortedDistances = [2, 3, 0, 1]
    sortedDistances = distances.argsort()

    # Vote among the k nearest neighbors
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistances[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # Sort the vote counts in descending order and return the winner
    sortedClassCount = sorted(classCount.items(), key=lambda x: x[1], reverse=True)
    return sortedClassCount[0][0]

testFeature = np.array([100, 200])
features, labels = createDataSet()
res = knnClassify(testFeature, features, labels, 3)
print(res)  # Romance
```

```python
# Using a third-party package: mytest2.py
from sklearn.neighbors import KNeighborsClassifier
from mytest1 import createDataSet

features, labels = createDataSet()
k = 5
clf = KNeighborsClassifier(n_neighbors=k)
clf.fit(features, labels)

# The unknown sample from the table
my_sample = [[18, 90]]
res = clf.predict(my_sample)
print(res)
```

Example: dating-site match prediction
Data source: omitted.
Data visualization
```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the "3d" projection)

# Load the data
def loadDatingData(file):
    datingData = pd.read_table(file, header=None)
    datingData.columns = ["FlightDistance", "PlaytimePreweek",
                          "IcecreamCostPreweek", "label"]
    datingTrainData = np.array(datingData[["FlightDistance", "PlaytimePreweek",
                                           "IcecreamCostPreweek"]])
    datingTrainLabel = np.array(datingData["label"])
    return datingData, datingTrainData, datingTrainLabel

# Show the data in a 3-D scatter plot, one color per label
def dataView3D(datingTrainData, datingTrainLabel):
    plt.figure(1, figsize=(8, 3))
    ax = plt.subplot(111, projection="3d")
    for labelName, color in [("smallDoses", "red"),
                             ("didntLike", "green"),
                             ("largeDoses", "blue")]:
        mask = datingTrainLabel == labelName
        ax.scatter(datingTrainData[mask, 0],
                   datingTrainData[mask, 1],
                   datingTrainData[mask, 2], c=color, label=labelName)
    ax.set_xlabel("Flight miles per year", fontsize=16)
    ax.set_ylabel("Video-game time per week (%)", fontsize=16)
    ax.set_zlabel("Ice cream consumed per week", fontsize=16)
    ax.legend()
    plt.show()

FILEPATH = "./datingTestSet1.txt"
datingData, datingTrainData, datingTrainLabel = loadDatingData(FILEPATH)
dataView3D(datingTrainData, datingTrainLabel)
```

Problem analysis: hold out the first 10% of the data set for testing and train on the remaining 90%.
Code
```python
# Custom implementation
import pandas as pd
import numpy as np

# Load the data
def loadDatingData(file):
    datingData = pd.read_table(file, header=None)
    datingData.columns = ["FlightDistance", "PlaytimePreweek",
                          "IcecreamCostPreweek", "label"]
    datingTrainData = np.array(datingData[["FlightDistance", "PlaytimePreweek",
                                           "IcecreamCostPreweek"]])
    datingTrainLabel = np.array(datingData["label"])
    return datingData, datingTrainData, datingTrainLabel

# Min-max normalization: rescale every feature into [0, 1]
def autoNorm(datingTrainData):
    # Per-column minimum, maximum, and range
    minValues, maxValues = datingTrainData.min(0), datingTrainData.max(0)
    diffValues = maxValues - minValues
    # Matrices of minimums and ranges shaped like datingTrainData
    m = datingTrainData.shape[0]
    minValuesData = np.tile(minValues, (m, 1))
    diffValuesData = np.tile(diffValues, (m, 1))
    normValuesData = (datingTrainData - minValuesData) / diffValuesData
    return normValuesData

# Core algorithm
def KNNClassifier(testData, trainData, trainLabel, k):
    m = trainData.shape[0]
    testDataArray = np.tile(testData, (m, 1))
    diffDataArray = (testDataArray - trainData) ** 2
    sumDataArray = diffDataArray.sum(axis=1) ** 0.5
    # Indices of the training samples sorted by distance, ascending
    sumDataSortedArray = sumDataArray.argsort()
    classCount = {}
    for i in range(k):
        labelName = trainLabel[sumDataSortedArray[i]]
        classCount[labelName] = classCount.get(labelName, 0) + 1
    classCount = sorted(classCount.items(), key=lambda x: x[1], reverse=True)
    return classCount[0][0]

# Evaluate: first 10% as test set, remaining 90% as training set
def datingTest(file):
    datingData, datingTrainData, datingTrainLabel = loadDatingData(file)
    normValuesData = autoNorm(datingTrainData)
    errorCount = 0
    ratio = 0.10
    total = datingTrainData.shape[0]
    numberTest = int(total * ratio)
    for i in range(numberTest):
        res = KNNClassifier(normValuesData[i],
                            normValuesData[numberTest:total],
                            datingTrainLabel[numberTest:total], 5)
        if res != datingTrainLabel[i]:
            errorCount += 1
    print("The total error rate is : {}\n".format(errorCount / float(numberTest)))

if __name__ == "__main__":
    FILEPATH = "./datingTestSet1.txt"
    datingTest(FILEPATH)
```

```python
# Third-party package implementation
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
```
```python
if __name__ == "__main__":
    FILEPATH = "./datingTestSet1.txt"
    # loadDatingData and autoNorm are the same helpers defined above
    datingData, datingTrainData, datingTrainLabel = loadDatingData(FILEPATH)
    normValuesData = autoNorm(datingTrainData)
    errorCount = 0
    ratio = 0.10
    total = normValuesData.shape[0]
    numberTest = int(total * ratio)
    k = 5
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(normValuesData[numberTest:total], datingTrainLabel[numberTest:total])
    for i in range(numberTest):
        res = clf.predict(normValuesData[i].reshape(1, -1))
        if res != datingTrainLabel[i]:
            errorCount += 1
    print("The total error rate is : {}\n".format(errorCount / float(numberTest)))
```
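The hand-rolled autoNorm step corresponds to scikit-learn's MinMaxScaler. A self-contained sketch on synthetic stand-in data (datingTestSet1.txt itself is assumed unavailable here; the two clusters merely mimic two of the labels):

```python
# Min-max scaling plus KNN via scikit-learn, on synthetic stand-in data:
# two well-separated classes, three features, mirroring the 10%/90% split above.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (100, 3)),
               rng.normal(5.0, 1.0, (100, 3))])
y = np.array(["didntLike"] * 100 + ["largeDoses"] * 100)
perm = rng.permutation(len(X))          # shuffle so the split mixes classes
X, y = X[perm], y[perm]

X_norm = MinMaxScaler().fit_transform(X)  # same rescaling as autoNorm

numberTest = int(len(X_norm) * 0.10)      # first 10% test, rest training
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_norm[numberTest:], y[numberTest:])
print(clf.score(X_norm[:numberTest], y[:numberTest]))  # accuracy on the held-out 10%
```

Because the clusters are far apart, the accuracy should be at or near 1.0.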