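A Simple KNN Implementation
I recently started working through Machine Learning in Action, and the first topic is KNN. Since the k-nearest neighbors algorithm is fairly simple, I will skip the theory and go straight to the code:
A simple KNN implementation
Some syntax you will need: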
tile()
sum(axis=1)
argsort, sort and sorted, and the operator.itemgetter function
the get(), items(), and iteritems() methods (iteritems() exists only in Python 2)
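As a minimal sketch of how the NumPy calls listed above fit together in a distance computation, the snippet below uses a small toy array and a query point of my own choosing (neither comes from the book's data set). It uses `import numpy as np` rather than the star import.

```python
import numpy as np

# Hypothetical toy data and query point, chosen only for illustration.
data = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
query = np.array([0.0, 0.0])

# tile() repeats the query once per row so both arrays have the same shape.
tiled = np.tile(query, (data.shape[0], 1))  # shape (4, 2)

# sum(axis=1) adds across each row, giving one squared distance per sample.
sq_dist = ((tiled - data) ** 2).sum(axis=1)
dist = np.sqrt(sq_dist)

# argsort() returns the indices that would sort the array in ascending order.
order = dist.argsort()
print(order)  # [2 3 1 0]
```

The nearest sample is row 2 (distance 0), followed by row 3 (distance 0.1).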
Output:

    ('training data set:', array([[ 1. , 1.1],[ 1. , 1. ],[ 0. , 0. ],[ 0. , 0.1]]))
    ('labels of training data set:', ['A', 'A', 'B', 'B'])
    ('classCount:', {'A': 1, 'B': 2})
    [('B', 2), ('A', 1)]
    ('Classification results:', 'B')

That completes the simplest possible KNN classifier.
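The voting step can be sketched on its own, as below. The function name `knn_vote` is hypothetical, and the query point `[0, 0]` and `k = 3` are assumptions: the post does not show the call that produced the output above, but these values happen to give the same vote counts. The sketch uses NumPy broadcasting in place of tile().

```python
import operator
import numpy as np

def knn_vote(query, data, labels, k):
    """Return (predicted label, ranked vote list) for one query point."""
    # Broadcasting subtracts the query from every row without tile().
    dist = np.sqrt(((data - query) ** 2).sum(axis=1))
    votes = {}
    for idx in dist.argsort()[:k]:
        label = labels[idx]
        # get() supplies 0 the first time a label is seen.
        votes[label] = votes.get(label, 0) + 1
    # items() yields (label, count) pairs; itemgetter(1) sorts them by count.
    ranked = sorted(votes.items(), key=operator.itemgetter(1), reverse=True)
    return ranked[0][0], ranked

group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
# Assumed query point and k; the post does not state them.
result, ranked = knn_vote(np.array([0.0, 0.0]), group, labels, 3)
print(ranked)  # [('B', 2), ('A', 1)]
print(result)  # B
```

sorted() returns a new list and leaves the dictionary untouched, whereas list.sort() sorts in place, which is why sorted() is the natural choice for ranking dictionary items.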
Using KNN to improve matches on a dating site
Data processing
Syntax used:
matplotlib
min(iterable, *[, key, default])
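To make the min() entry concrete, here is a small sketch. The first part shows the built-in with its `key` and `default` arguments (`default` requires Python 3.4 or later). The second part shows the column-wise `min(0)` that NumPy arrays provide, applied to min-max scaling. The sample values are made up for illustration and are not the dating data.

```python
import numpy as np

# Built-in min(): key picks what to compare, default covers an empty iterable.
words = ["pear", "fig", "banana"]
shortest = min(words, key=len)  # 'fig'
fallback = min([], default=0)   # 0

# ndarray.min(0) and max(0) work column by column.
# Hypothetical 3x2 sample, not the dating data.
data = np.array([[10.0, 0.5], [20.0, 1.5], [40.0, 1.0]])
min_vals = data.min(0)           # column minima: 10.0 and 0.5
ranges = data.max(0) - min_vals  # column ranges: 30.0 and 1.0

# Min-max scaling: (value - column min) / column range, via broadcasting.
normed = (data - min_vals) / ranges
print(normed)
```

After scaling, every column lies in [0, 1], so a feature with large raw values no longer dominates the Euclidean distance.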
file2matrix returns a NumPy array (matrix), that is, data in a form that can be processed directly, followed by the list of labels. The label list has one entry per sample (1000 in total), so only its first and last ten entries are shown here:

    [[ 4.09200000e+04 8.32697600e+00 9.53952000e-01]
     [ 1.44880000e+04 7.15346900e+00 1.67390400e+00]
     [ 2.60520000e+04 1.44187100e+00 8.05124000e-01]
     ...,
     [ 2.65750000e+04 1.06501020e+01 8.66627000e-01]
     [ 4.81110000e+04 9.13452800e+00 7.28045000e-01]
     [ 4.37570000e+04 7.88260100e+00 1.33244600e+00]]
    [3, 2, 1, 1, 1, 1, 3, 3, 1, 3, ..., 3, 2, 2, 2, 2, 2, 1, 3, 3, 3]

The figure below is a scatter plot of the data:
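A scatter plot of this kind can be drawn with matplotlib. The following is only a sketch: the three rows are the first three samples from the output above (rounded), the choice of which two columns to plot is mine, and the axis names are placeholders because the post does not name the features.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# First three rows of the matrix shown above, rounded; labels are the first three labels.
features = np.array([[40920.0, 8.33, 0.95],
                     [14488.0, 7.15, 1.67],
                     [26052.0, 1.44, 0.81]])
labels = [3, 2, 1]

fig, ax = plt.subplots()
# Plot column 1 against column 2; the label drives both color and marker size.
points = ax.scatter(features[:, 1], features[:, 2],
                    c=labels, s=15.0 * np.array(labels))
ax.set_xlabel("feature column 1")  # placeholder axis name
ax.set_ylabel("feature column 2")  # placeholder axis name
# In an interactive session, drop the Agg line above and call plt.show().
```

Using the label for both color and size makes the three classes easy to tell apart at a glance.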
Normalized data:

    [[ 0.44832535 0.39805139 0.56233353]
     [ 0.15873259 0.34195467 0.98724416]
     [ 0.28542943 0.06892523 0.47449629]
     ...,
     [ 0.29115949 0.50910294 0.51079493]
     [ 0.52711097 0.43665451 0.4290048 ]
     [ 0.47940793 0.3768091 0.78571804]]

Test the algorithm
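The hold-out idea behind the test can be sketched in isolation: keep the first 10% of the rows as test samples, use the rest as training data, and count the mismatches. The names `holdout_error` and `nearest_vote` are hypothetical helpers written for this illustration, and the data is a made-up pair of well-separated clusters rather than the dating set.

```python
import numpy as np

def holdout_error(features, labels, classify, ratio=0.10, k=3):
    """Test on the first `ratio` of rows, train on the rest; return the error rate."""
    n = features.shape[0]
    n_test = int(n * ratio)
    train_x, train_y = features[n_test:], labels[n_test:]
    errors = 0
    for i in range(n_test):
        if classify(features[i], train_x, train_y, k) != labels[i]:
            errors += 1
    return errors / float(n_test)

def nearest_vote(query, data, labels, k):
    """Stand-in classifier: majority label among the k nearest rows."""
    order = np.sqrt(((data - query) ** 2).sum(axis=1)).argsort()[:k]
    picked = [labels[i] for i in order]
    return max(set(picked), key=picked.count)

# Hypothetical data: two test rows followed by 18 training rows in two clusters.
x = np.array([[0.0, 0.1], [1.0, 0.9]] + [[0.0, 0.0], [1.0, 1.0]] * 9)
y = [1, 2] + [1, 2] * 9
print(holdout_error(x, y, nearest_vote))  # 0.0
```

Because the test rows are simply the first rows of the file, this split only gives a fair estimate when the samples are not sorted by label.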
    # coding=utf-8
    from numpy import *
    import operator  # operator module, used when sorting the votes
    import matplotlib.pyplot as plt


    # Build the training set and its labels
    def createDataset():
        # A 2-D array; note the double square brackets
        group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
        labels = ['A', 'A', 'B', 'B']
        return (group, labels)


    # Simple classifier
    def classify0(inX, dataSet, labels, k):
        # shape[0] is the number of rows, shape[1] the number of columns
        dataSetSize = dataSet.shape[0]
        # tile() repeats inX to the same shape as dataSet so they can be subtracted
        diffMat = tile(inX, (dataSetSize, 1)) - dataSet
        # Square each difference
        sqDiffMat = diffMat**2
        # axis=1 sums along each row, giving the sum of squares
        sqDistances = sqDiffMat.sum(axis=1)
        # The square root gives the Euclidean distance from inX to each training vector
        distances = sqDistances**0.5
        # Indices of the distances in ascending order, smallest first
        sortedDistIndicies = distances.argsort()
        classCount = {}  # dictionary of vote counts
        for i in range(k):
            # Label of one of the k nearest neighbors
            voteIlabel = labels[sortedDistIndicies[i]]
            # Accumulate the vote
            classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
        # Sort the results and return the label with the most votes.
        # items() turns the dictionary into (label, count) tuples and
        # itemgetter(1) sorts them by the second element.
        # (The Python 2-only iteritems() behaves the same way here.)
        sortedClassCount = sorted(classCount.items(),
                                  key=operator.itemgetter(1), reverse=True)
        # Print the predicted label
        # print(sortedClassCount[0][0])
        return sortedClassCount[0][0]


    # Data preprocessing
    def file2matrix(filename):
        '''Read the training data from a file and store it as a matrix.'''
        # The original source code has an error: the file can only be read once,
        # so all lines are read up front. The with block also closes the file,
        # which the unreachable fr.close() after the return never did.
        with open(filename, 'r') as fr:
            arrayOfLines = fr.readlines()
        numberOfLines = len(arrayOfLines)  # number of samples
        returnMat = zeros((numberOfLines, 3))  # one row per sample, 3 columns
        print('row:%s and column:%s' % (returnMat.shape[0], returnMat.shape[1]))
        classLabelVector = []  # 1-D list of sample labels
        index = 0
        for line in arrayOfLines:
            # strip() removes leading and trailing whitespace
            # (spaces, \n, \t and so on), including the line break
            line = line.strip()
            # Split each line on tabs to get a list of fields
            listFromLine = line.split('\t')
            # print(listFromLine[0:4])
            # Store the three features; the result is a 1000*3 array
            returnMat[index, :] = listFromLine[0:3]
            classLabelVector.append(int(listFromLine[-1]))
            index += 1
        return (returnMat, classLabelVector)


    # Normalize the data
    def autoNorm(dataSet):
        minVals = dataSet.min(0)  # 0 means: the minimum of each column
        maxVals = dataSet.max(0)
        ranges = maxVals - minVals
        # Zero array with the same shape as dataSet
        normDataSet = zeros(shape(dataSet))
        m = dataSet.shape[0]  # number of rows
        # tile repeats [A, B, C] (the minimum of each column) m times
        normDataSet = dataSet - tile(minVals, (m, 1))
        # Normalization formula; note that the division is element-wise
        normDataSet = normDataSet / tile(ranges, (m, 1))
        return normDataSet, ranges, minVals


    # Classification test
    def datingClassTest():
        hoRatio = 0.10  # hold out 10%
        # Raw string, so the backslashes in the Windows path are not treated as escapes
        datingDataMat, datingLabels = file2matrix(r'C:\Users\LiLong\Desktop\datingTestSet2.txt')
        normMat, ranges, minVals = autoNorm(datingDataMat)
        m = normMat.shape[0]
        # Number of test samples
        numTestVecs = int(m * hoRatio)
        print('the test number:', numTestVecs)
        errorCount = 0.0
        for i in range(numTestVecs):
            # normMat[i,:]: the test set is the first 100 rows;
            # normMat[numTestVecs:m,:]: the training set is rows 100-1000;
            # datingLabels[numTestVecs:m]: the labels matching the training set
            classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                         datingLabels[numTestVecs:m], 3)
            print("the classifier came back with: %d, the real answer is: %d"
                  % (classifierResult, datingLabels[i]))
            if (classifierResult != datingLabels[i]):
                errorCount += 1.0
        print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
        print(errorCount)


    # The file to read is datingTestSet2.txt, not datingTestSet.txt
    # file_raw = r'C:\Users\LiLong\Desktop\datingTestSet2.txt'
    if __name__ == "__main__":
        datingClassTest()

Results:

    row:1000 and column:3
    ('the test number:', 100)
    the classifier came back with: 3, the real answer is: 3
    the classifier came back with: 2, the real answer is: 2
    the classifier came back with: 1, the real answer is: 1
    the classifier came back with: 1, the real answer is: 1
    the classifier came back with: 1, the real answer is: 1
    the classifier came back with: 1, the real answer is: 1
    the classifier came back with: 1, the real answer is: 1
    the classifier came back with: 1, the real answer is: 1
    ...,
    the classifier came back with: 2, the real answer is: 2
    the classifier came back with: 3, the real answer is: 3
    the classifier came back with: 2, the real answer is: 2
    the classifier came back with: 3, the real answer is: 3
    the classifier came back with: 2, the real answer is: 2
    the classifier came back with: 1, the real answer is: 1
    the classifier came back with: 2, the real answer is: 2
    the classifier came back with: 3, the real answer is: 3
    the classifier came back with: 1, the real answer is: 1
    the classifier came back with: 2, the real answer is: 2
    the classifier came back with: 1, the real answer is: 1
    the classifier came back with: 2, the real answer is: 2
    the classifier came back with: 1, the real answer is: 1
    the classifier came back with: 2, the real answer is: 2
    the classifier came back with: 1, the real answer is: 1
    the classifier came back with: 3, the real answer is: 3
    the classifier came back with: 3, the real answer is: 3
    the classifier came back with: 2, the real answer is: 2
    the classifier came back with: 1, the real answer is: 1
    the classifier came back with: 3, the real answer is: 1
    the total error rate is: 0.050000
    5.0

The results show an error rate of 5.0%: 5 of the 100 test samples were misclassified.