KNN Algorithm Experiment Using the UCI Iris and DryBean Datasets


KNN (K Nearest Neighbors)

The full code and datasets are available on my GitHub.
The DryBean dataset is too large to upload to GitHub, so it is hosted on CSDN and can be downloaded for 0 points: Drybean下載

1. Overview

  • KNN (K-nearest-neighbor voting algorithm)
  • Compute the distance from every training point to the test point, pick the K points with the smallest distances, and decide the test point's label by majority vote among those neighbors (a minimal sketch follows this list).
  • Advantages: simple algorithm and simple idea; no parameter estimation and no training needed.
  • Disadvantage: only suited to data in which every class has a balanced number of samples.
  • Capability: multi-class classification.
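To make the voting idea concrete, here is a minimal NumPy sketch of a single KNN prediction. It is illustrative only and not part of the original post; train_x, train_y and test_point are placeholder names.

import numpy as np

def knn_vote(train_x, train_y, test_point, k=7):
    # Euclidean distance from the test point to every training sample
    dists = np.sqrt(((train_x - test_point) ** 2).sum(axis=1))
    nearest = dists.argsort()[:k]           # indices of the k closest training samples
    votes = np.bincount(train_y[nearest])   # count the labels of those neighbors (labels must be non-negative ints)
    return votes.argmax()                   # majority rules: the most frequent label wins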

2. Principle

  • Use the Iris dataset as the worked example, with K=7, Euclidean distance, a 30% test set, a 70% training set, and random seed 0.

  • Steps

  • 1. Normalize each feature with min-max normalization: (X - Min) / (Max - Min).

  • 2. For the first test sample, compute its Euclidean distance to every sample in the training set.

  • 3. Take the K smallest distances and vote; the majority label among those neighbors is the sample's predicted label.

  • 4. Move on to the next test sample and repeat until every prediction is done.

2.1 Manual calculation: see the Excel workbook Iris-count.xlsx (a small script that reproduces it is sketched below)
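If the workbook is not at hand, the same steps can be checked with a short script. The sketch below is my reconstruction (not part of the original post) of steps 1-4 for the first test sample, assuming the same seed-0, 7:3 split that the code in section 3 uses.

import numpy as np
from sklearn.datasets import load_iris

x, y = load_iris(return_X_y=True)
x = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))        # step 1: min-max normalization

np.random.seed(0)                                                # same split as in section 3
train_idx = sorted(np.random.choice(len(x), round(len(x) * 0.7), replace=False))
test_idx = sorted(set(range(len(x))) - set(train_idx))

dists = np.sqrt(((x[train_idx] - x[test_idx[0]]) ** 2).sum(axis=1))   # step 2: distances for test sample 0
nearest = dists.argsort()[:7]                                         # step 3: indices of the 7 nearest neighbors
print(np.bincount(y[train_idx][nearest]).argmax(), y[test_idx[0]])    # steps 3-4: majority vote vs. true label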

3. Using scikit-learn Directly

  • Use the Iris dataset as the example: Euclidean distance, 30% test set, 70% training set, random seed 0.
  • Ten-fold cross-validation is used here to find the best K, which turns out to be K=7.
# import packages
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from scipy.spatial import distance
import operator

# load the dataset
iris = datasets.load_iris()
x = iris.data
y = iris.target

# normalize the dataset (linear min-max normalization)
x = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))  # axis=0 takes the column-wise min/max

# split into training and test sets
split = 0.7  # train : test = 7 : 3
np.random.seed(0)  # fix the random split
train_indices = np.random.choice(len(x), round(len(x) * split), replace=False)
test_indices = np.array(list(set(range(len(x))) - set(train_indices)))
train_indices = sorted(train_indices)
test_indices = sorted(test_indices)
train_x = x[train_indices]
test_x = x[test_indices]
train_y = y[train_indices]
test_y = y[test_indices]
print(train_indices)
print(test_indices)

[1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 15, 16, 17, 18, 20, 22, 24, 26, 27, 30, 33, 37, 38, 40, 41, 42, 43, 44, 45, 46, 48, 50, 51, 52, 53, 54, 56, 59, 60, 61, 62, 63, 64, 66, 68, 69, 71, 73, 76, 78, 80, 83, 84, 85, 86, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 100, 101, 102, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 116, 119, 120, 121, 123, 124, 125, 126, 127, 128, 129, 132, 133, 134, 135, 137, 139, 141, 143, 144, 146, 147, 148, 149]
[0, 9, 14, 19, 21, 23, 25, 28, 29, 31, 32, 34, 35, 36, 39, 47, 49, 55, 57, 58, 65, 67, 70, 72, 74, 75, 77, 79, 81, 82, 87, 88, 99, 103, 115, 117, 118, 122, 130, 131, 136, 138, 140, 142, 145]

pd.DataFrame(train_x).head(10)  # matches the Excel calculation

          0         1         2         3
0  0.166667  0.416667  0.067797  0.041667
1  0.111111  0.500000  0.050847  0.041667
2  0.083333  0.458333  0.084746  0.041667
3  0.194444  0.666667  0.067797  0.041667
4  0.305556  0.791667  0.118644  0.125000
5  0.083333  0.583333  0.067797  0.083333
6  0.194444  0.583333  0.084746  0.041667
7  0.027778  0.375000  0.067797  0.041667
8  0.305556  0.708333  0.084746  0.041667
9  0.138889  0.583333  0.101695  0.041667
## KNN
from sklearn.neighbors import KNeighborsClassifier  # a simple model with only one hyperparameter, K (similar in spirit to K-means)
from sklearn.model_selection import train_test_split, cross_val_score  # data splitting and cross-validation

k_range = range(1, 10)  # K is the number of voters
cv_scores = []          # holds the score of each model
for n in k_range:
    knn = KNeighborsClassifier(n)  # with a single hyperparameter this loop is enough; with several, use GridSearchCV instead
    scores = cross_val_score(knn, train_x, train_y, cv=10, scoring='accuracy')  # cv: number of folds; 'accuracy': scoring metric (can be omitted to use the default)
    cv_scores.append(scores.mean())
plt.plot(k_range, cv_scores)
plt.xlabel('K')
plt.ylabel('Accuracy')  # pick the best K from the plot
plt.show()

best_knn = KNeighborsClassifier(n_neighbors=7)  # pass the best K=7 into the model
best_knn.fit(train_x, train_y)                  # train the model
print(best_knn.score(test_x, test_y))           # score = right / total = 44/45 = 0.9778 (one sample is misclassified)
print(best_knn.predict(test_x))
print(test_y)

0.9777777777777777
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 2 2 2
 2 2 2 2 2 2 2 2]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
 2 2 2 2 2 2 2 2]

4. Advantages and Disadvantages

4.1 Advantages

No training and no parameter estimation are needed; classification can start as soon as the test data arrives.

4.2 Disadvantages

When the classes in the sample are unbalanced, majority voting can be dominated by the larger classes and force the wrong label onto minority-class samples.

4.3 Verifying the disadvantage
  • Use both the Iris dataset and the DryBean dataset.
  • Clean both datasets to build new versions in which the class sizes are balanced / unbalanced, then compare KNN on "balanced Iris", "unbalanced Iris", "balanced DryBean" and "unbalanced DryBean".
# again a 7:3 split
# balanced Iris:     7:3 = 105:45 = (33+34+38):(17+16+12)  (reuse the split from above)
# unbalanced Iris:   7:3 = 105:45 = (45+45+15):(5+5+35)
# balanced DryBean:  7:3 = 1960:840 = (280*7):(120*7)
# unbalanced DryBean: 7:3 = 1960:840 = (6:6:6:6:6:4:3) =
#   (318,318,318,318,318,212,159):(68,91,136,136,136,136,136)
from numpy import *

i_train_x = train_x
i_train_y = train_y
i_test_x = test_x
i_test_y = test_y
ui_train_x = concatenate((concatenate((x[:45], x[50:95]), axis=0), x[100:115]), axis=0)  # axis=0 stacks vertically
ui_train_y = concatenate((concatenate((y[:45], y[50:95]), axis=0), y[100:115]), axis=0)
# test slices chosen so they do not overlap the training slices (5 + 5 + 35 samples)
ui_test_x = concatenate((concatenate((x[45:50], x[95:100]), axis=0), x[115:150]), axis=0)
ui_test_y = concatenate((concatenate((y[45:50], y[95:100]), axis=0), y[115:150]), axis=0)

i_score = []
ui_score = []
klist = []
for k in range(1, 12):
    klist.append(k)
    i_knn = KNeighborsClassifier(n_neighbors=k)
    i_knn.fit(i_train_x, i_train_y)
    i_score.append(i_knn.score(i_test_x, i_test_y))
    # print("balanced Iris:", i_knn.score(i_test_x, i_test_y))
    ui_knn = KNeighborsClassifier(n_neighbors=k)
    ui_knn.fit(ui_train_x, ui_train_y)
    ui_score.append(ui_knn.score(ui_test_x, ui_test_y))
    # print("unbalanced Iris:", ui_knn.score(ui_test_x, ui_test_y))

plt.plot(klist, i_score, marker='o', label='balanced Iris')
plt.plot(klist, ui_score, marker='*', label='unbalanced Iris')
plt.legend()  # show the legend
plt.xlabel('k-value')
plt.ylabel('accuracy-value')
plt.title(u'Iris map')
plt.show()

# import openpyxl
import operator
from sklearn.preprocessing import StandardScaler   # standardization
from sklearn.metrics import confusion_matrix       # confusion matrix
from sklearn.metrics import classification_report  # per-class report

def openfile(filename):
    """
    Open the dataset and preprocess it.
    :param filename: file name
    :return: feature data, label data, column names
    """
    # open the Excel file
    sheet = pd.read_excel(filename, sheet_name='Dry_Beans_Dataset')
    data = sheet.iloc[:, :16].values
    target = sheet['Class'].values
    print(data.shape)
    print(target.shape)
    return data, target, sheet.columns

def split_data_set(data_set, target_set, rate=0.7):
    """
    Split the dataset; by default 30% of it becomes the test set.
    :param data_set: feature data
    :param target_set: labels
    :param rate: proportion used for training (0.7 -> 30% test)
    :return: training data, training labels, test data, test labels
    """
    # total number of samples
    train_size = len(data_set)
    # draw random indices for the training set
    train_index = sorted(np.random.choice(train_size, round(train_size * rate), replace=False))
    test_index = sorted(np.array(list(set(range(train_size)) - set(train_index))))  # sorting is optional; kept for consistency with the Iris code above
    # split the dataset (X = data, y = labels)
    x_train = data_set.iloc[train_index, :]  # data_set and target_set are DataFrames here, not ndarrays, so use iloc
    x_test = data_set.iloc[test_index, :]
    y_train = target_set.iloc[train_index, :]
    y_test = target_set.iloc[test_index, :]
    return x_train, y_train, x_test, y_test

filename = r'D:\jjq\code\jupyterWorkSpace\datasets\DryBeanDataset\Dry_Bean_Dataset.xlsx'
o_bean_dataset = openfile(filename)

# take 400 samples of each bean class; these are the starting indices of the 7 classes
step = 400
start_index = [0, 1322, 1844, 3474, 7020, 8948, 10975]  # 7 classes in total
# bean_dataset_x = pd.DataFrame(columns=o_bean_dataset[2])
# bean_dataset_y = pd.DataFrame(columns=o_bean_dataset[2])
bean_dataset_x = pd.DataFrame(columns=range(16))
bean_dataset_y = pd.DataFrame(columns=range(1))
bean_dataset_x.drop(bean_dataset_x.index, inplace=True)
bean_dataset_y.drop(bean_dataset_y.index, inplace=True)
for i in range(7):
    bean_dataset_x = pd.concat((bean_dataset_x, pd.DataFrame(o_bean_dataset[0][start_index[i]:(step + start_index[i])])), axis=0)
    bean_dataset_y = pd.concat((bean_dataset_y, pd.DataFrame(o_bean_dataset[1][start_index[i]:(step + start_index[i])])), axis=0)
# bean_dataset_y.to_excel("./123.xlsx")

(13611, 16)
(13611,)

# build the balanced and unbalanced train/test splits

# balanced
b_train_x, b_train_y, b_test_x, b_test_y = split_data_set(bean_dataset_x, bean_dataset_y)
print(b_train_x.shape, b_train_y.shape)
print(b_test_x.shape, b_test_y.shape)

# unbalanced
steps_train = [318, 318, 318, 318, 318, 212, 159]
steps_test = [68, 91, 136, 136, 136, 136, 136]
now = 0
# initialize the unbalanced frames
ub_train_x = pd.DataFrame(columns=range(16))
ub_test_x = pd.DataFrame(columns=range(16))
ub_train_y = pd.DataFrame(columns=range(1))
ub_test_y = pd.DataFrame(columns=range(1))
# make sure the frames are empty before appending
ub_train_x.drop(ub_train_x.index, inplace=True)
ub_test_x.drop(ub_test_x.index, inplace=True)
ub_train_y.drop(ub_train_y.index, inplace=True)
ub_test_y.drop(ub_test_y.index, inplace=True)

# append the data
for i in range(7):
    ub_train_x = pd.concat((ub_train_x, bean_dataset_x[now:(now + steps_train[i])]), axis=0)
    ub_train_y = pd.concat((ub_train_y, bean_dataset_y[now:(now + steps_train[i])]), axis=0)
    now = now + steps_train[i]
    ub_test_x = pd.concat((ub_test_x, bean_dataset_x[now:(now + steps_test[i])]), axis=0)
    ub_test_y = pd.concat((ub_test_y, bean_dataset_y[now:(now + steps_test[i])]), axis=0)
    now = now + steps_test[i]

(1960, 16) (1960, 1)
(840, 16) (840, 1)

b_score = []
ub_score = []
klist = []
for k in range(30, 50):
    klist.append(k)
    b_knn = KNeighborsClassifier(n_neighbors=k)
    b_knn.fit(b_train_x, b_train_y.values.ravel())
    b_score.append(b_knn.score(b_test_x, b_test_y.values.ravel()))
    ub_knn = KNeighborsClassifier(n_neighbors=k)
    ub_knn.fit(ub_train_x, ub_train_y.values.ravel())
    ub_score.append(ub_knn.score(ub_test_x, ub_test_y.values.ravel()))

plt.plot(klist, b_score, marker='o', label='balanced DryBean')
plt.plot(klist, ub_score, marker='*', label='unbalanced DryBean')
plt.legend()  # show the legend
plt.xlabel('k-value')
plt.ylabel('accuracy-value')
plt.title(u'DryBean map')
plt.show()
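The confusion_matrix and classification_report imports above are not actually used in the original code. As a possible follow-up (my addition, reusing the ub_* variables built above, with an arbitrary K from the scanned range), a per-class report on the unbalanced DryBean split would make the effect of the imbalance on individual bean classes visible:

ub_knn = KNeighborsClassifier(n_neighbors=40)                  # any K from the range scanned above
ub_knn.fit(ub_train_x, ub_train_y.values.ravel())
pred = ub_knn.predict(ub_test_x)
print(confusion_matrix(ub_test_y.values.ravel(), pred))        # rows: true class, columns: predicted class
print(classification_report(ub_test_y.values.ravel(), pred))   # precision / recall / F1 per bean class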

5. Implementation Code

# Manual implementation of the algorithm, matching the accuracy of the scikit-learn version
# The Iris dataset is used as the example
print(type(train_x))
print(pd.DataFrame(train_x).shape)
print(train_x[0][1])

<class 'numpy.ndarray'>
(105, 4)
0.41666666666666663

# KNN class for classification; prediction could be weighted or unweighted
class KNN:
    """K-nearest-neighbor classification implemented in plain Python/NumPy."""

    def __init__(self, k):
        """
        Initialization.

        Parameters
        -----
        k : int
            number of neighbors
        """
        self.k = k

    def fit(self, X, y):
        """
        Training method.

        Parameters
        ----
        X : array-like, shape [n_samples, n_features]
            training samples (features)
        y : array-like, shape [n_samples]
            target value (label) of each sample
        """
        # convert X and y to ndarrays
        self.X = np.asarray(X)
        self.y = np.asarray(y)

    def predict(self, X):
        """
        Predict the labels of the given samples.

        Parameters
        -----
        X : array-like, shape [n_samples, n_features]
            samples to predict

        Returns
        -----
        result : ndarray
            predicted labels
        """
        X = np.asarray(X)
        result = []
        # iterate over the ndarray, one row (sample) at a time
        for x in X:
            # distance from this test sample to every training sample
            dis = np.sqrt(np.sum((x - self.X) ** 2, axis=1))
            # argsort returns each element's index in the original (unsorted) array
            index = dis.argsort()
            # truncate to the first k elements (indices of the k nearest neighbors)
            index = index[:self.k]
            # count the occurrences of each label (labels must be non-negative integers),
            # weighting each vote by the inverse of its distance
            if dis[index].all() != 0:
                count = np.bincount(self.y[index], weights=1 / dis[index])
            else:
                # fall back to unweighted voting if any distance is exactly zero
                count = np.bincount(self.y[index])
            # the index of the largest count is the predicted class
            # (largest count = most / heaviest votes)
            result.append(count.argmax())
        return np.asarray(result)

# create a KNN object, train it and test it
knn = KNN(k=7)
# train
knn.fit(train_x, train_y)
# test
result = knn.predict(test_x)
# display(result)
# display(test_y)
display(np.sum(result == test_y))
if len(result) != 0:
    display(np.sum(result == test_y) / len(result))  # same result as the scikit-learn version

44
0.9777777777777777

6. The kNN Model in scikit-learn (source: CSDN Ada_Concentration)

  • scikit-learn provides a KNeighborsClassifier class that implements the k-nearest-neighbor classification model. Its prototype is:
    sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)
6.1 Parameters
  • n_neighbors: an integer, the value of k.
  • weights: a string or callable that specifies how neighbor votes are weighted, i.e. whether all neighbors carry the same voting power:
    – 'uniform': every neighbor of the node gets the same voting weight;
    – 'distance': a neighbor's voting weight is inversely proportional to its distance, so closer neighbors count more;
    – [callable]: a callable that takes an array of distances and returns an array of weights of the same shape.
  • algorithm: a string specifying how the nearest neighbors are computed:
    – 'ball_tree': use the BallTree algorithm (ball tree);
    – 'kd_tree': use the KDTree algorithm;
    – 'brute': use brute-force search;
    – 'auto': automatically choose the most suitable algorithm.
  • leaf_size: an integer, the leaf size for BallTree or KDTree; it affects both tree construction and query speed.
  • metric: a string specifying the distance metric; the default is the 'minkowski' distance.
  • p: an integer, the exponent of the Minkowski distance (p=2 is the Euclidean distance, p=1 the Manhattan distance).
  • n_jobs: number of parallel jobs; -1 dispatches the work to all CPU cores.
6.2 Methods
  • fit(X, y): train the model.
  • predict(X): predict with the model; returns the labels of the samples to be predicted.
  • score(X, y): return the prediction accuracy on (X, y).
  • predict_proba(X): return the probability of each label for the samples.
  • kneighbors([X, n_neighbors, return_distance]): return the k nearest neighbors of the sample points; if return_distance=True, the distances to those neighbors are returned as well.
  • kneighbors_graph([X, n_neighbors, mode]): return the connectivity graph of the sample points (a usage sketch follows this list).
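As a quick illustration of these parameters and methods, here is a small sketch (my addition, not from the original post) that fits a distance-weighted classifier on Iris and queries the neighbors of one test sample:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# weights='distance': closer neighbors get larger votes; p=2 keeps the Euclidean metric
clf = KNeighborsClassifier(n_neighbors=7, weights='distance', algorithm='auto', p=2)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))                     # accuracy on the test set
print(clf.predict_proba(X_test[:3]))                 # per-class probabilities for three samples
dist, idx = clf.kneighbors(X_test[:1], n_neighbors=7, return_distance=True)
print(dist, idx)                                     # distances to and indices of the 7 nearest training points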

7. Suitable Data

The code below is the scikit-learn classifier-comparison example; it plots the decision boundaries of KNN next to those of other common classifiers on three toy datasets (moons, circles, and a linearly separable set).

# Code source: Gaël Varoquaux
#              Andreas Müller
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

h = 0.02  # step size in the mesh

names = [
    "Nearest Neighbors",
    "Linear SVM",
    "RBF SVM",
    "Gaussian Process",
    "Decision Tree",
    "Random Forest",
    "Neural Net",
    "AdaBoost",
    "Naive Bayes",
    "QDA",
]

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis(),
]

X, y = make_classification(
    n_features=2, n_redundant=0, n_informative=2, random_state=1, n_clusters_per_class=1
)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)

datasets = [
    make_moons(noise=0.3, random_state=0),
    make_circles(noise=0.2, factor=0.5, random_state=1),
    linearly_separable,
]

figure = plt.figure(figsize=(27, 9))
i = 1
# iterate over datasets
for ds_cnt, ds in enumerate(datasets):
    # preprocess dataset, split into training and test part
    X, y = ds
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # just plot the dataset first
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(["#FF0000", "#0000FF"])
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
    if ds_cnt == 0:
        ax.set_title("Input data")
    # Plot the training points
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright, edgecolors="k")
    # Plot the testing points
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6, edgecolors="k")
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1

    # iterate over classifiers
    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)

        # Plot the decision boundary. For that, we will assign a color to each
        # point in the mesh [x_min, x_max]x[y_min, y_max].
        if hasattr(clf, "decision_function"):
            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        else:
            Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, cmap=cm, alpha=0.8)

        # Plot the training points
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright, edgecolors="k")
        # Plot the testing points
        ax.scatter(
            X_test[:, 0],
            X_test[:, 1],
            c=y_test,
            cmap=cm_bright,
            edgecolors="k",
            alpha=0.6,
        )
        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        if ds_cnt == 0:
            ax.set_title(name)
        ax.text(
            xx.max() - 0.3,
            yy.min() + 0.3,
            ("%.2f" % score).lstrip("0"),
            size=15,
            horizontalalignment="right",
        )
        i += 1

plt.tight_layout()
plt.show()
