Clustering the digits handwritten-digit dataset with different sklearn methods and visualizing the results with matplotlib

Published: 2023/12/10

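In this article

Test how several sklearn clustering algorithms (K-means, AffinityPropagation, MeanShift, SpectralClustering, Ward hierarchical clustering, AgglomerativeClustering, and DBSCAN) perform on the digits handwritten-digit dataset.

Evaluate the experimental results with several evaluation metrics.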

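Preparation

- [ ] The sklearn library

Since its release in 2007, scikit-learn has become an important machine learning library for Python. scikit-learn, sklearn for short, supports four major families of machine learning algorithms: classification, regression, dimensionality reduction, and clustering. It also includes three major modules: feature extraction, data processing, and model evaluation.

sklearn is an extension of SciPy and is built on the NumPy and matplotlib libraries. Drawing on the strengths of these libraries can greatly improve the efficiency of machine learning work.

sklearn has thorough documentation, is easy to pick up, offers a rich API, and is popular in academia. It wraps a large number of machine learning algorithms, including LIBSVM and LIBLINEAR. It also ships with many built-in datasets, which saves the time of collecting and cleaning data.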

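The library's algorithms fall into four main groups: classification, regression, clustering, and dimensionality reduction. Among them:

Common regression methods: linear models, decision trees, SVM, KNN; ensemble regression: random forest, AdaBoost, GradientBoosting, Bagging, ExtraTrees

Common classification methods: linear models, decision trees, SVM, KNN, naive Bayes; ensemble classification: random forest, AdaBoost, GradientBoosting, Bagging, ExtraTrees

Common clustering methods: k-means, hierarchical clustering, DBSCAN

Common dimensionality reduction methods: LinearDiscriminantAnalysis, PCA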

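It has the following characteristics:

Simple and efficient tools for data mining and data analysis
Accessible to everybody, and reusable in various contexts
Built on NumPy, SciPy, and matplotlib
Open source, commercially usable (BSD license)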

sklearn datasets

sklearn ships with many good-quality datasets, which we can use to try out different models while learning machine learning.

To use the datasets in sklearn, first import the datasets module.

from sklearn import datasets

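The digits handwritten-digit dataset
The experiment calls for the digits dataset, so let us first get a basic understanding of it:
The dataset contains 1797 samples of the handwritten digits 0-9. Each sample is an 8 * 8 matrix whose values range from 0 to 16 and represent how dark each pixel is.
We first load the data, check its dimensions, and display the first sample as an image: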

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.data.shape)
print(digits.target.shape)
print(digits.images.shape)
plt.matshow(digits.images[0])
plt.show()

This prints the data dimensions and shows the first handwritten digit:
(1797, 64)
(1797,)
(1797, 8, 8)
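
The shapes above suggest that each row of digits.data is simply the flattened 8 * 8 image from digits.images. A minimal sketch to check this, along with the value range and the number of samples per digit (the variable names flat and counts are mine, not from the original code):

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Flatten every 8x8 image into a 64-element row and compare with digits.data.
flat = digits.images.reshape(len(digits.images), -1)
print(np.array_equal(flat, digits.data))

# Smallest and largest pixel value in the whole dataset.
print(digits.data.min(), digits.data.max())

# Number of samples available for each digit 0-9.
counts = np.bincount(digits.target)
print(counts)
```

The per-digit counts are useful later: the classes are roughly balanced, so a clustering with 10 clusters can in principle recover them.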

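Experiment

K-means clustering on the digits dataset
The sklearn website provides demo code for K-means clustering on digits; its output looks as follows (link: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py)

The visualization it produces is attractive, and a close look at the implementation shows that the official example is well written. Below I analyze the code and modify it to produce the kind of clustering plot we usually expect:

Load the digits dataset from sklearn.datasets (introduced above). The samples are already labeled, and the labels are stored in digits.target. We can also extract the number of samples and the dimensionality of each sample into n_samples and n_features. Printing n_digits (the number of distinct labels), n_samples, and n_features gives: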
n_digits: 10
n_samples 1797
n_features 64

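Below is the core evaluation code. It uses several scoring methods to measure how well the cluster assignments agree with the true classes. It is then called with three differently initialized k-means estimators, giving a different set of scores for each; this is also the entire text output. (PS: this code is really nicely written.)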

def bench_k_means(estimator, name, data):
    t0 = time()
    estimator.fit(data)
    print('%-9s\t%.2fs\t%i\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f'
          % (name, (time() - t0), estimator.inertia_,
             metrics.homogeneity_score(labels, estimator.labels_),
             metrics.completeness_score(labels, estimator.labels_),
             metrics.v_measure_score(labels, estimator.labels_),
             metrics.adjusted_rand_score(labels, estimator.labels_),
             metrics.adjusted_mutual_info_score(labels, estimator.labels_,
                                                average_method='arithmetic'),
             metrics.silhouette_score(data, estimator.labels_,
                                      metric='euclidean',
                                      sample_size=sample_size)))

bench_k_means(KMeans(init='k-means++', n_clusters=n_digits, n_init=10),
              name="k-means++", data=data)

bench_k_means(KMeans(init='random', n_clusters=n_digits, n_init=10),
              name="random", data=data)

# in this case the seeding of the centers is deterministic, hence we run the
# kmeans algorithm only once with n_init=1
pca = PCA(n_components=n_digits).fit(data)
bench_k_means(KMeans(init=pca.components_, n_clusters=n_digits, n_init=1),
              name="PCA-based",
              data=data)

This gives the scores for init=k-means++, random, and PCA-based:
init       time   inertia  homo   compl  v-meas  ARI    AMI    silhouette
k-means++  0.42s  69432    0.602  0.650  0.625   0.465  0.621  0.146
random     0.22s  69694    0.669  0.710  0.689   0.553  0.686  0.147
PCA-based  0.04s  70804    0.671  0.698  0.684   0.561  0.681  0.118
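
To make the homo, compl, and v-meas columns easier to read, here is a small sketch on toy labels of my own (not the digits data). Homogeneity is perfect when every cluster contains a single class, completeness is perfect when every class sits in a single cluster, and the V-measure is the harmonic mean of the two:

```python
from sklearn import metrics

# Toy ground truth: two classes with two samples each.
labels_true = [0, 0, 1, 1]
# Toy clustering: class 0 is kept together, class 1 is split into two clusters.
labels_pred = [0, 0, 1, 2]

h = metrics.homogeneity_score(labels_true, labels_pred)
c = metrics.completeness_score(labels_true, labels_pred)
v = metrics.v_measure_score(labels_true, labels_pred)

# Each cluster holds only one class, so homogeneity is perfect;
# class 1 is spread over two clusters, so completeness drops below 1.
print("homogeneity  %.3f" % h)
print("completeness %.3f" % c)
print("v-measure    %.3f" % v)

# The V-measure matches the harmonic mean of homogeneity and completeness.
print(abs(v - 2 * h * c / (h + c)) < 1e-9)
```

All three scores need the true labels. The silhouette score in the last column is the only one in the table that is computed from the data and the cluster assignments alone.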

Supplement: main parameters of the KMeans function
(only a subset is covered here; for details see https://blog.csdn.net/weixin_44707922/article/details/91954734)
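
As a rough illustration of the parameters that matter most in this experiment, the sketch below builds one KMeans estimator with each argument spelled out and then inspects the fitted attributes. The specific values (for example random_state=42) are my own choices for reproducibility, not settings taken from the original post:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

data = scale(load_digits().data)

kmeans = KMeans(
    n_clusters=10,     # number of clusters (and centroids) to form
    init="k-means++",  # seeding: "k-means++", "random", or an array of centers
    n_init=10,         # number of differently seeded runs; the lowest inertia wins
    max_iter=300,      # upper bound on iterations within a single run
    tol=1e-4,          # convergence tolerance on the movement of the centers
    random_state=42,   # fixes the seeding so that results are reproducible
)
kmeans.fit(data)

print(kmeans.cluster_centers_.shape)  # one 64-dimensional center per cluster
print(kmeans.labels_.shape)           # one cluster index per sample
print(kmeans.inertia_)                # sum of squared distances to the closest center
print(kmeans.n_iter_)                 # iterations used by the best run
```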

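Visualizing the clusters

The k-means clustering and evaluation above are already complete. For a better visual output, however, we can do the following:
Reduce the data to two dimensions with PCA and cluster again, because
1. the points in a scatter plot are two-dimensional;
2. the labels obtained by clustering the high-dimensional data, once mapped into 2-D space, may be scattered rather than concentrated, which hurts the visualization, so k-means is run again on the 2-D data.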
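
The second point can be checked numerically. The sketch below is my own illustration, with a hypothetical helper within_cluster_ss: it measures how tightly each labeling groups the points in the 2-D PCA plane. Labels from clustering the 64-dimensional data should be more spread out in 2-D than labels obtained by clustering the 2-D data directly:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

data = scale(load_digits().data)
reduced_data = PCA(n_components=2).fit_transform(data)


def within_cluster_ss(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total


# Cluster once in the original 64-D space and once in the 2-D PCA plane.
labels_64d = KMeans(n_clusters=10, n_init=10, random_state=0).fit(data).labels_
labels_2d = KMeans(n_clusters=10, n_init=10, random_state=0).fit(reduced_data).labels_

# Evaluate both labelings in the 2-D plane that will actually be plotted.
spread_64d = within_cluster_ss(reduced_data, labels_64d)
spread_2d = within_cluster_ss(reduced_data, labels_2d)
print("2-D spread of 64-D labels: %.1f" % spread_64d)
print("2-D spread of 2-D labels:  %.1f" % spread_2d)
```

Note the trade-off: the 2-D clustering looks cleaner, but it only sees the variance kept by two principal components, so it is a visualization aid rather than the clustering that was scored above.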

reduced_data = PCA(n_components=2).fit_transform(data)
kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
kmeans.fit(reduced_data)  # run k-means on the reduced data
result = kmeans.labels_

Get the center of each cluster:

centroids = kmeans.cluster_centers_

The last part defines the plotting range and the appearance of the output:

# Background: Z, xx and yy are produced by the mesh-grid prediction step of the official demo
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')
# Data points after dimensionality reduction
plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Cluster centers
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
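
The plotting snippet relies on three arrays, Z, xx, and yy, that are not defined in this post. Following the structure of the official demo, they can be built by predicting the cluster of every point on a regular grid covering the 2-D data. The sketch below is one way to do it; the grid step h = 0.05 and random_state=0 are my own choices:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

data = scale(load_digits().data)
reduced_data = PCA(n_components=2).fit_transform(data)
kmeans = KMeans(init="k-means++", n_clusters=10, n_init=10, random_state=0)
kmeans.fit(reduced_data)

# Regular grid covering the data with a margin of 1 on every side.
h = 0.05
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Predict the cluster of every grid point, then restore the grid shape.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation="nearest",
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired, aspect="auto", origin="lower")
plt.plot(reduced_data[:, 0], reduced_data[:, 1], "k.", markersize=2)
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker="x", s=169, linewidths=3, color="w", zorder=10)
plt.title("K-means on PCA-reduced digits (centroids marked with white crosses)")
```

This approach only works for estimators that offer predict on new points, which is why it cannot be reused unchanged for every clustering method in this post.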

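Modifying the demo's visualization / another way to present it

Intuitively, a clustering result should show different clusters in different colors. So in my modification I color the points by cluster and then add the cluster centers, which gives a more intuitive result: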

plt.scatter(reduced_data[:, 0], reduced_data[:, 1],c=kmeans.labels_)

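Clustering the digits dataset with other methods
With the exploration above, the other clustering methods are relatively easy to handle. Below we look at the clustering and evaluation results of each method in turn:

AffinityPropagation

The core code for using AffinityPropagation is: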

af = AffinityPropagation().fit(reduced_data)
result = af.labels_

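Plotted in the style of the demo, the result looks as follows:

(Rather ugly, isn't it? It split the data into more than 90 clusters.) Plotted my modified way, the result is slightly better (only relatively):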
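
AffinityPropagation chooses the number of clusters itself, which is how it can end up with far more than 10. The sketch below shows how to count the clusters and how the preference parameter usually influences that count (a more negative preference tends to give fewer exemplars). The subsampling, preference=-500, and damping=0.9 are my own illustrative values, and the counts will differ from the ones reported above:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

data = scale(load_digits().data)
reduced_data = PCA(n_components=2).fit_transform(data)

# Use every third sample to keep the quadratic-cost algorithm fast.
X = reduced_data[::3]

# Default preference: the median of the pairwise similarities.
af_default = AffinityPropagation(random_state=0).fit(X)
n_default = len(af_default.cluster_centers_indices_)

# Assumed illustrative settings: a lower preference and heavier damping.
af_low = AffinityPropagation(preference=-500, damping=0.9, random_state=0).fit(X)
n_low = len(af_low.cluster_centers_indices_)

print("clusters with default preference:", n_default)
print("clusters with preference=-500:   ", n_low)
```

If the algorithm fails to converge, sklearn emits a warning and the reported clusters may be degenerate, so the counts should be read together with any warnings.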

MeanShift

# In my tests, quantile=0.1 gave the best result
bandwidth = estimate_bandwidth(reduced_data, quantile=0.1)
bench_k_means(MeanShift(bandwidth=bandwidth, bin_seeding=True), name="MeanShift", data=data)
meanshift = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(reduced_data)

init       time   nmi    homo   comp   labelnum
MeanShift  2.72s  0.554  1.000  0.307  10
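
The choice of quantile matters because it sets the bandwidth, and the bandwidth in turn decides how many clusters MeanShift finds. A small sketch of that relationship on the 2-D data (the three quantile values are my own picks for illustration):

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

data = scale(load_digits().data)
reduced_data = PCA(n_components=2).fit_transform(data)

quantiles = (0.05, 0.1, 0.3)
bandwidths = []
cluster_counts = []
for q in quantiles:
    # A larger quantile averages distances to farther neighbors,
    # which gives a wider kernel.
    bw = estimate_bandwidth(reduced_data, quantile=q)
    ms = MeanShift(bandwidth=bw, bin_seeding=True).fit(reduced_data)
    bandwidths.append(bw)
    cluster_counts.append(len(np.unique(ms.labels_)))
    print("quantile=%.2f  bandwidth=%.3f  clusters=%d"
          % (q, bw, cluster_counts[-1]))
```

A wider kernel generally merges neighboring modes, so the number of clusters tends to shrink as quantile grows.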

Plotted in the demo's style, the result already looks good.

SpectralClustering

# Dimensionality reduction is used because, in the high-dimensional case, data goes missing while the graph is built
pca = PCA(n_components=n_digits).fit_transform(data)
bench_k_means(SpectralClustering(n_clusters=10), name="spectralcluster", data=pca)

init             time   nmi    homo   comp   labelnum
spectralcluster  5.15s  0.598  0.449  0.796  10

This clustering method has no cluster centers.
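
When an estimator exposes no cluster_centers_ attribute, one workaround for plotting is to use the mean of each cluster's members as a stand-in center. The sketch below does this with a hypothetical helper centers_from_labels; running SpectralClustering on the 2-D data with affinity="nearest_neighbors" is my own choice to keep the example quick, not the setting used for the scores above:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

data = scale(load_digits().data)
reduced_data = PCA(n_components=2).fit_transform(data)

sc = SpectralClustering(n_clusters=10, affinity="nearest_neighbors",
                        random_state=0)
labels_sc = sc.fit_predict(reduced_data)

# The fitted estimator provides labels but no cluster centers.
print(hasattr(sc, "cluster_centers_"))


def centers_from_labels(points, labels):
    """Mean position of the members of every cluster, in label order."""
    return np.array([points[labels == k].mean(axis=0)
                     for k in np.unique(labels)])


centers = centers_from_labels(reduced_data, labels_sc)
print(centers.shape)
```

These means are only a plotting convenience. For clusters with non-convex shapes, which spectral clustering can produce, the mean may even fall outside the cluster.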

ward hierarchical clustering

ward = AgglomerativeClustering(n_clusters=10, linkage='ward')
ward.fit(data)

init                          time   nmi    homo   comp
ward hierarchical clustering  0.36s  0.797  0.758  0.836

The visualizations for the remaining methods look similar, so they are not plotted one by one; only the evaluation metrics are given:

AgglomerativeClustering

clustering = AgglomerativeClustering().fit(data)
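
Note that AgglomerativeClustering() with no arguments uses the defaults n_clusters=2 and linkage='ward', so it differs from the Ward run above only in the number of clusters. A quick sketch comparing the two settings (variable names are mine):

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

digits = load_digits()
data = scale(digits.data)
labels = digits.target

# Defaults: n_clusters=2, linkage='ward'.
default_labels = AgglomerativeClustering().fit(data).labels_
# Same linkage, but asking for one cluster per digit.
ten_labels = AgglomerativeClustering(n_clusters=10, linkage="ward").fit(data).labels_

for name, pred in (("default", default_labels), ("n_clusters=10", ten_labels)):
    print("%-14s clusters=%d  homo=%.3f  compl=%.3f"
          % (name, len(np.unique(pred)),
             metrics.homogeneity_score(labels, pred),
             metrics.completeness_score(labels, pred)))
```

With only two clusters for ten digits, each cluster necessarily mixes several classes, which is consistent with the low homogeneity and high completeness reported for this method in the comparison.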

DBSCAN

db = DBSCAN().fit(data)
result = db.labels_

init    time   nmi    homo   comp
DBSCAN  0.53s  0.375  0.000  1.000
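
A homogeneity of 0.000 together with a completeness of 1.000 is what you get when every sample receives the same label. A plausible explanation, which the sketch below checks, is that the default eps=0.5 is far too small for the scaled 64-dimensional data, so DBSCAN marks (nearly) everything as noise with label -1. The nearest-neighbor check is my own addition:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_digits
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import scale

data = scale(load_digits().data)

# Defaults: eps=0.5, min_samples=5.
db = DBSCAN().fit(data)
n_noise = int(np.sum(db.labels_ == -1))
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("clusters: %d, noise points: %d of %d" % (n_clusters, n_noise, len(data)))

# Distance from each point to its 5th nearest neighbor, counting the point
# itself as the first one (the same convention as DBSCAN's min_samples).
dist, _ = NearestNeighbors(n_neighbors=5).fit(data).kneighbors(data)
k_dist = dist[:, -1]
print("median 5th-neighbor distance: %.2f" % np.median(k_dist))
```

If the median distance is well above 0.5, eps needs to be raised toward that scale before DBSCAN can form any dense regions on this data.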

Comparison

The metrics of all the methods above, side by side:

init             time   nmi    homo   comp
k-means++        0.28s  0.626  0.602  0.650
random           0.22s  0.689  0.669  0.710
PCA-based        0.03s  0.681  0.667  0.695
AP               6.71s  0.655  0.932  0.460
MeanShift        2.72s  0.554  1.000  0.307
spectralcluster  5.15s  0.598  0.449  0.796
whc              0.36s  0.797  0.758  0.836
AC               0.21s  0.466  0.239  0.908
DBSCAN           0.53s  0.375  0.000  1.000
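
One detail about the nmi column: the complete script prints v_measure_score in that position. This is harmless, because the V-measure coincides with normalized mutual information when the arithmetic mean is used for normalization. A short check on a k-means labeling (random_state=0 is my own choice):

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

digits = load_digits()
data = scale(digits.data)
labels = digits.target

km_labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit(data).labels_

v = metrics.v_measure_score(labels, km_labels)
nmi = metrics.normalized_mutual_info_score(labels, km_labels,
                                           average_method="arithmetic")
print("v-measure: %.6f" % v)
print("NMI:       %.6f" % nmi)
```

With a different average_method (for example "geometric"), NMI would no longer match the V-measure, so the two should not be mixed within one table.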

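Summary

Having gone through all these methods, I still feel that k-means is the most classic algorithm. Although the choice of seed points involves randomness, the algorithm's logic and principle, and the way it measures and iterates on clusters, make it the most classic and the first-choice clustering method, with relatively good results on every evaluation metric. The other methods each have strengths and weaknesses: AP and spectral clustering have high time complexity, some show large deviations in accuracy, some cannot provide cluster centers, and so on. Still, thanks to the excellent engineers who provide these algorithms, we have many ways to cluster data, and they form an important part of sklearn.
That is already a lot and I can't keep going, so I'll stop here; the other clustering methods will have to wait for the next post!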

The complete code:

import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
from time import time
from sklearn import metrics
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import normalized_mutual_info_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import scale
from sklearn.cluster import KMeans, AffinityPropagation, estimate_bandwidth, MeanShift, SpectralClustering, \
    AgglomerativeClustering, DBSCAN

# Loading digits data
np.random.seed(42)
digits = load_digits()
data = scale(digits.data)
n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target
sample_size = 300
print("n_digits: %d, \t n_samples %d, \t n_features %d"
      % (n_digits, n_samples, n_features))
print(82 * '_')
print('init\t\ttime\tnmi\thomo\tcompl')


# Evaluation
def bench_k_means(estimator, name, data):
    t0 = time()
    estimator.fit(data)
    print('%-9s\t%.2fs\t%.3f\t%.3f\t%.3f'
          % (name, (time() - t0),
             # metrics.normalized_mutual_info_score(labels, estimator.labels_),
             # the nmi column is filled with the V-measure instead
             metrics.v_measure_score(labels, estimator.labels_),
             metrics.homogeneity_score(labels, estimator.labels_),
             metrics.completeness_score(labels, estimator.labels_)))


# kmeans
bench_k_means(KMeans(init='k-means++', n_clusters=n_digits, n_init=10),
              name="k-means++", data=data)
bench_k_means(KMeans(init='random', n_clusters=n_digits, n_init=10),
              name="random", data=data)
pca = PCA(n_components=n_digits).fit(data)
bench_k_means(KMeans(init=pca.components_, n_clusters=n_digits, n_init=1),
              name="PCA-based", data=data)

# AffinityPropagation
bench_k_means(AffinityPropagation(), name="AffinityPropagation", data=data)

# MeanShift
bandwidth = estimate_bandwidth(data, quantile=0.1)
bench_k_means(MeanShift(bandwidth=bandwidth, bin_seeding=True),
              name="MeanShift", data=data)

# ward hierarchical clustering
bench_k_means(AgglomerativeClustering(n_clusters=10, linkage='ward'),
              name="ward hierarchical clustering", data=data)

# AgglomerativeClustering
bench_k_means(AgglomerativeClustering(), name="AgglomerativeClustering", data=data)

# DBSCAN
bench_k_means(DBSCAN(), name="DBSCAN()", data=data)
print(82 * '_')

reduced_data = PCA(n_components=2).fit_transform(data)

# Visualize the results on PCA-reduced data way1
# example: meanshift
bandwidth = estimate_bandwidth(reduced_data, quantile=0.07)
meanshift = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(reduced_data)
result = meanshift.labels_
centroids = meanshift.cluster_centers_
labels_unique = np.unique(result)
n_clusters_ = len(labels_unique)
plt.figure(1)
plt.clf()
colors = cycle('bgrcmybgrcmybgrcmybgrcmy')
for k, col in zip(range(n_clusters_), colors):
    my_members = result == k
    cluster_center = centroids[k]
    plt.plot(reduced_data[my_members, 0], reduced_data[my_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=5)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

# Visualize the results on PCA-reduced data way2
# example: kmeans
kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
kmeans.fit(reduced_data)  # run k-means on the reduced data
result = kmeans.labels_
plt.figure(2)
plt.clf()
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=result)
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
plt.show()
