Clustering algorithms in sklearn: HAC

Published: 2024/3/12


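Basic idea
Hierarchical Agglomerative Clustering (HAC) is a clustering algorithm that works well in practice. Its main idea is to start by treating every sample as its own cluster and then repeatedly merge the two closest clusters until some stopping condition is met, for example when the number of clusters has dropped to 20% of the initial number, meaning 80% of the clusters have been merged. In summary, HAC proceeds as follows.
    (1) Treat every data point in the training set as its own cluster;
    (2) Compute the distance between every pair of clusters and merge the two clusters that are closest (most similar);
    (3) Repeat step (2) until the stopping condition is met.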
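
The three steps above can be sketched in a few lines of NumPy. This is a minimal, unoptimized illustration, assuming single-link distance and a target cluster count as the stopping condition. The function name `naive_hac` is made up for this sketch, and it is not how sklearn implements the algorithm.

```python
import numpy as np


def naive_hac(points, n_clusters):
    """Naive single-link agglomerative clustering (illustrative sketch only)."""
    points = np.asarray(points, dtype=float)
    # Step 1: every sample starts as its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(points))]

    # Step 3: repeat until the stopping condition (target cluster count) holds.
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        # Step 2: search all cluster pairs for the smallest single-link distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                diffs = points[clusters[a]][:, None, :] - points[clusters[b]][None, :, :]
                dist = np.sqrt((diffs ** 2).sum(axis=-1)).min()
                if dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        # Merge cluster b into cluster a (b > a, so index a stays valid).
        clusters[a] = clusters[a] + clusters.pop(b)

    # Turn the index lists into one label per sample.
    labels = np.empty(len(points), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels


X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
labels = naive_hac(X, n_clusters=2)
print(labels)  # [0 0 0 1 1 1]
```

The pairwise search makes every merge quadratic in the number of clusters, so this is only meant for reading, not for real data sets.
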
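The algorithm can measure the similarity between two clusters in four ways:
    (1) Single-link: the distance between the two closest points of two different clusters, i.e. MIN;
    (2) Complete-link: the distance between the two farthest points of two different clusters, i.e. MAX;
    (3) Average-link: the average distance over all pairs of points taken from two different clusters, i.e. AVERAGE;
    (4) Ward-link: the increase in the sum of squared deviations caused by merging the two clusters.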
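
To make the four criteria concrete, the sketch below computes each of them for two tiny clusters. The clusters `A` and `B` are made-up sample data, Euclidean distance is assumed, and the helper `sse` is a name introduced only for this illustration.

```python
import numpy as np

# Two small, made-up clusters in 2D.
A = np.array([[0.0, 0.0], [0.0, 2.0]])
B = np.array([[3.0, 0.0], [3.0, 4.0]])

# Euclidean distance between every point of A and every point of B.
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

single = pairwise.min()     # closest pair (MIN)
complete = pairwise.max()   # farthest pair (MAX)
average = pairwise.mean()   # mean over all pairs (AVERAGE)


def sse(points):
    """Sum of squared deviations of the points from their centroid."""
    return ((points - points.mean(axis=0)) ** 2).sum()


# Ward: how much the sum of squared deviations grows if A and B are merged.
ward_increase = sse(np.vstack([A, B])) - sse(A) - sse(B)

print(single, complete)   # 3.0 5.0
print(average)
print(ward_increase)
```

For these two clusters the Ward increase also equals |A||B| / (|A| + |B|) times the squared distance between the two centroids, which is a handy way to check the result by hand.
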
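API reference

class sklearn.cluster.AgglomerativeClustering(n_clusters=2, *, affinity='euclidean', memory=None, connectivity=None, compute_full_tree='auto', linkage='ward', distance_threshold=None, compute_distances=False)

Parameter	Type	Description
n_clusters	int or None, default=2	Number of clusters to find. Exactly one of n_clusters and distance_threshold must be None
affinity	str or callable, default='euclidean'	Metric used to compute the linkage, e.g. 'euclidean', 'manhattan' or 'cosine'. With linkage='ward' only 'euclidean' is accepted. scikit-learn 1.2 deprecated affinity and introduced metric as its replacement
memory	str or object with the joblib.Memory interface, default=None	Directory (or Memory object) used to cache the computation of the tree
connectivity	array-like or callable, default=None	Defines a given structure of the data, i.e. the neighboring samples of each sample
compute_full_tree	'auto' or bool, default='auto'	Whether to build the full tree. If False, tree construction stops early at n_clusters, which reduces computation time when the number of clusters is not small compared to the number of samples. Must be True when distance_threshold is not None
linkage	{'ward', 'complete', 'average', 'single'}, default='ward'	Linkage criterion used to measure the distance between clusters; 'ward' by default
distance_threshold	float, default=None	If not None, the linkage distance at or above which clusters are no longer merged. In that case n_clusters must be None and compute_full_tree must be True
compute_distances	bool, default=False	If True, distances between clusters are computed even when distance_threshold is not used, which is useful for drawing a dendrogram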
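
The interplay between n_clusters and distance_threshold is easiest to see in a small run. The sketch below reuses the six-point array from the official example and lets a threshold, rather than a fixed count, decide how many clusters remain. The threshold value 2.5 and the choice of single linkage are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# n_clusters must be None when distance_threshold is given.
# Points within each column are 2 apart; the two columns are 3 apart,
# so a threshold of 2.5 stops the merging before the columns are joined.
model = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=2.5,
    linkage="single",
).fit(X)

print(model.n_clusters_)  # 2
print(model.labels_)
```

Passing both n_clusters and distance_threshold as non-None values raises an error, which is why one of them always has to be set to None.
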

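Attribute	Type	Description

n_clusters_	int	Number of clusters found
labels_	ndarray of shape (n_samples,)	Cluster label of each sample
n_leaves_	int	Number of leaves in the hierarchical tree
n_connected_components_	int	Number of connected components in the graph
n_features_in_	int	Number of features seen during fit
feature_names_in_	ndarray of shape (n_features_in_,)	Names of the features seen during fit
children_	array-like of shape (n_samples-1, 2)	The children of each non-leaf node
distances_	array-like of shape (n_nodes-1,)	Distances between the nodes at the corresponding position in children_; only available when distance_threshold is used or compute_distances=True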
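
The fitted attributes can be inspected directly. The following sketch fits the same six-point array and prints the tree-related attributes; single linkage and compute_distances=True are choices made for this illustration so that distances_ is populated.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

model = AgglomerativeClustering(
    n_clusters=2,
    linkage="single",
    compute_distances=True,  # needed here so that distances_ is set
).fit(X)

print(model.n_leaves_)        # 6, one leaf per sample
print(model.children_.shape)  # (5, 2), one row per merge
print(model.distances_)       # merge distance for each row of children_

# In children_, an index below n_samples is an original sample (a leaf);
# an index i >= n_samples refers to the node created by merge i - n_samples.
for step, (left, right) in enumerate(model.children_):
    print(step, left, right)
```
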

Methods

fit(X[, y])	Fit the hierarchical clustering from features, or distance matrix.
fit_predict(X[, y])	Fit and return the result of each sample’s clustering assignment.
get_params([deep])	Get parameters for this estimator.
set_params(**params)	Set the parameters of this estimator.
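
A short sketch of how these four methods fit together, again on the six-point array from the official example. The parameter values passed to set_params are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

model = AgglomerativeClustering(n_clusters=2)

# fit_predict fits the model and returns the same array that fit(X).labels_ holds.
labels = model.fit_predict(X)
same_as_attribute = np.array_equal(labels, model.labels_)

# get_params returns the constructor arguments as a dict.
params = model.get_params()
print(params["n_clusters"], params["linkage"])  # 2 ward

# set_params changes the arguments in place; the model must be refit afterwards.
model.set_params(n_clusters=3, linkage="average")
new_labels = model.fit_predict(X)
print(len(set(new_labels.tolist())))  # 3
```
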

Code example

>>> from sklearn.cluster import AgglomerativeClustering
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> clustering = AgglomerativeClustering().fit(X)
>>> clustering
AgglomerativeClustering()
>>> clustering.labels_
array([1, 1, 1, 0, 0, 0])

Worked examples
test1.py

import numpy as np

from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering


def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)


iris = load_iris()
X = iris.data

# setting distance_threshold=0 ensures we compute the full tree.
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None)

model = model.fit(X)
plt.title("Hierarchical Clustering Dendrogram")
# plot the top three levels of the dendrogram
plot_dendrogram(model, truncate_mode="level", p=3)
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()

Output: [figure: "Hierarchical Clustering Dendrogram" of the Iris data, top three levels]

test2.py

import time as time

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_swiss_roll

# #############################################################################
# Generate data (swiss roll dataset)
n_samples = 1500
noise = 0.05
X, _ = make_swiss_roll(n_samples, noise=noise)
# Make it thinner
X[:, 1] *= 0.5

# #############################################################################
# Compute clustering
print("Compute unstructured hierarchical clustering...")
st = time.time()
ward = AgglomerativeClustering(n_clusters=6, linkage="ward").fit(X)
elapsed_time = time.time() - st
label = ward.labels_
print("Elapsed time: %.2fs" % elapsed_time)
print("Number of points: %i" % label.size)

# #############################################################################
# Plot result
fig = plt.figure()
# add_subplot attaches the 3D axes to the figure; on recent Matplotlib
# versions a bare Axes3D(fig) is no longer added automatically.
ax = fig.add_subplot(projection="3d")
ax.view_init(7, -80)
for l in np.unique(label):
    ax.scatter(
        X[label == l, 0],
        X[label == l, 1],
        X[label == l, 2],
        color=plt.cm.jet(float(l) / np.max(label + 1)),
        s=20,
        edgecolor="k",
    )
plt.title("Without connectivity constraints (time %.2fs)" % elapsed_time)

# #############################################################################
# Define the structure A of the data. Here a 10 nearest neighbors
from sklearn.neighbors import kneighbors_graph

connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)

# #############################################################################
# Compute clustering
print("Compute structured hierarchical clustering...")
st = time.time()
ward = AgglomerativeClustering(
    n_clusters=6, connectivity=connectivity, linkage="ward"
).fit(X)
elapsed_time = time.time() - st
label = ward.labels_
print("Elapsed time: %.2fs" % elapsed_time)
print("Number of points: %i" % label.size)

# #############################################################################
# Plot result
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.view_init(7, -80)
for l in np.unique(label):
    ax.scatter(
        X[label == l, 0],
        X[label == l, 1],
        X[label == l, 2],
        color=plt.cm.jet(float(l) / np.max(label + 1)),
        s=20,
        edgecolor="k",
    )
plt.title("With connectivity constraints (time %.2fs)" % elapsed_time)

plt.show()

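Output: [figures: 3D scatter plots of the clustered swiss roll, "Without connectivity constraints" and "With connectivity constraints"]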
