Clustering Algorithms: BIRCH Explained
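1 Principles
1.1 B-trees
(1) m-way search trees
An m-way search tree is either an empty tree or a tree that satisfies the following properties:
- The root has at most m subtrees and has the following structure:
  $n, P_0, (K_1, P_1), (K_2, P_2), \ldots, (K_n, P_n)$,
  where each $P_i$ ($0 \le i \le n < m$) is a pointer to a subtree and each $K_i$ ($1 \le i \le n$) is a key, with $K_i < K_{i+1}$ for $1 \le i < n$.
- All keys in the subtree $P_i$ are greater than $K_i$ and less than $K_{i+1}$, for $0 < i < n$.
- All keys in the subtree $P_n$ are greater than $K_n$.
- All keys in the subtree $P_0$ are less than $K_1$.
- Every subtree $P_i$ is itself an m-way search tree.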
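The ordering rules above are what make lookup cheap: at each node a binary search over the keys either finds the key or tells us which single subtree to descend into. The following is a minimal sketch of that search, not code from any library; the `MWayNode` class and the `search` function are hypothetical names introduced only for illustration.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MWayNode:
    """Hypothetical node: sorted keys K_1..K_n and child pointers P_0..P_n."""
    keys: List[int]
    children: List[Optional["MWayNode"]] = field(default_factory=list)

    def __post_init__(self):
        # A node with n keys carries n + 1 child pointers (None = empty subtree).
        if not self.children:
            self.children = [None] * (len(self.keys) + 1)


def search(node: Optional[MWayNode], key: int) -> bool:
    """Return True if key is stored in the tree rooted at node."""
    while node is not None:
        i = bisect_left(node.keys, key)   # first index whose key is >= key
        if i < len(node.keys) and node.keys[i] == key:
            return True
        node = node.children[i]           # only subtree P_i can contain key
    return False


# A 3-way search tree: the root holds keys 20 and 40 and has three subtrees.
root = MWayNode(
    keys=[20, 40],
    children=[MWayNode([5, 10]), MWayNode([25]), MWayNode([50, 60])],
)
found = search(root, 25)
missing = search(root, 30)
```

Here `found` is true because 25 sits in the middle subtree (between 20 and 40), while 30 falls into the same subtree but is not stored there.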
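(2) B-trees
A B-tree of order m is an m-way search tree that is either empty or satisfies the following properties:
- Every node has at most m subtrees.
- The root has at least two subtrees (unless it is itself a leaf).
- Every non-leaf node other than the root has at least $\lceil m/2 \rceil$ subtrees.
- All leaf nodes are on the same level.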
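The four structural properties can be checked mechanically. The sketch below is only an illustration under simple assumptions: `BNode` and `is_valid_btree` are hypothetical names, a leaf is modelled as a node with an empty child list, and key ordering is deliberately not checked.

```python
from math import ceil


class BNode:
    """Hypothetical B-tree node: sorted keys plus a child list (empty for a leaf)."""
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []


def is_valid_btree(root, m):
    """Check the structural B-tree properties for order m (key order not checked)."""
    if root is None:
        return True
    leaf_depths = set()

    def visit(node, depth, is_root):
        n_children = len(node.children)
        if n_children == 0:                       # leaf: record its level
            leaf_depths.add(depth)
            return len(node.keys) <= m - 1
        if n_children > m or n_children != len(node.keys) + 1:
            return False
        min_children = 2 if is_root else ceil(m / 2)
        if n_children < min_children:
            return False
        return all(visit(c, depth + 1, False) for c in node.children)

    # All leaves must share one level.
    return visit(root, 0, True) and len(leaf_depths) == 1


balanced = BNode([30], [BNode([10, 20]), BNode([40])])
lopsided = BNode([30], [BNode([10], [BNode([5]), BNode([15])]), BNode([40])])

balanced_ok = is_valid_btree(balanced, 4)
lopsided_ok = is_valid_btree(lopsided, 4)
```

The second tree fails only because its leaves sit on two different levels, which is exactly the property that keeps B-tree lookups uniformly shallow.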
1.2 Steps
For a step-by-step walkthrough of the process, see: https://www.cnblogs.com/pinard/p/6179132.html
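BIRCH can recognize that the data in a dataset is unevenly distributed: it clusters the points that lie in dense regions and removes the points in sparse regions as outliers. BIRCH is also an incremental clustering method: the clustering decision for each point is based only on the points processed so far, not on the whole dataset.
① Build a clustering feature (CF) tree
First, scan all the data and build an initial in-memory CF tree, using the given amount of memory and recycling space on disk, so that the tree reflects the clustering information of the dataset. Dense data is grouped into finer subclusters, while sparse data points are removed as outliers.
② Condense the CF tree
This step is optional and serves as a bridge between step ① and step ③. Like step ①, it scans the leaf entries of the initial CF tree, removing more outliers and grouping crowded subclusters into larger ones to obtain a smaller tree.
③ Global clustering
Apply a global or semi-global clustering algorithm to all the leaf entries. An existing clustering algorithm for data points can readily be adapted to work on a set of subclusters, each represented by its CF vector. Compute the centroid of each subcluster and let the centroid represent that subcluster; this step captures the major distribution pattern of the data.
④ Cluster refinement
Step ③ produces only a coarse summary, because the original data has been scanned just once, so the clusters still need refining. Use the centroids of the clusters from the previous step as seeds and reassign each data point to its nearest seed to obtain a new set of clusters. This not only lets points migrate between clusters, it also guarantees that all copies of a given data point end up in the same cluster. The step also offers the option of discarding outliers: a point that is too far from its nearest seed can be treated as an outlier and left out of the result.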
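The refinement in step ④ can be sketched in a few lines of NumPy. This is an illustrative sketch rather than BIRCH's reference implementation: the function name `refine`, the `max_dist` cutoff, and the use of label -1 for discarded outliers are all assumptions made for the example.

```python
import numpy as np


def refine(points, seeds, max_dist=None):
    """Assign each point to its nearest seed; label -1 marks a discarded outlier."""
    points = np.asarray(points, dtype=float)
    seeds = np.asarray(seeds, dtype=float)
    # Pairwise Euclidean distances, shape (n_points, n_seeds).
    dists = np.linalg.norm(points[:, None, :] - seeds[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    if max_dist is not None:
        # Points farther than max_dist from every seed are treated as outliers.
        labels[dists.min(axis=1) > max_dist] = -1
    return labels


points = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [20.0, 20.0]])
seeds = np.array([[0.0, 0.0], [5.0, 5.0]])   # centroids from the previous step
labels = refine(points, seeds, max_dist=2.0)
```

The last point lies far from both seeds, so it is labelled -1 instead of being forced into a cluster.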
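2 Parameters
Class: sklearn.cluster.Birch
Parameters:
- threshold: (float, default=0.5) The radius of the subcluster obtained by merging a new sample into its closest subcluster must stay below this threshold; otherwise a new subcluster is started.
- branching_factor: (int, default=50) The maximum number of CF subclusters in each node. If a new sample would push a node beyond this number, the node is split in two.
- n_clusters: (int, a sklearn.cluster estimator instance, or None; default=3) The number of clusters after the final clustering step. If None, the final clustering step is skipped and the subclusters are returned as they are; if a sklearn.cluster estimator, that model is fit with the subclusters treated as new samples.
- compute_labels: (bool, default=True) Whether to compute labels on each fit.
- copy: (bool, default=True) Whether to copy the input data. If set to False, the initial data will be overwritten.
Attributes:
- root_: the root of the CF tree
- dummy_leaf_: the start pointer to the linked list of all leaves
- subcluster_centers_: the centroids of all subclusters read from the leaves
- subcluster_labels_: the labels assigned to the subcluster centroids after global clustering
- labels_: the labels of all the input data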
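A short usage sketch ties the parameters and attributes together. The synthetic three-blob data and the chosen `threshold` value are assumptions made for this example only; the constructor arguments and fitted attributes are the ones listed above.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.RandomState(0)
# Three well-separated synthetic blobs (illustrative data only).
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + 0.5 * rng.randn(50, 2) for c in centers])

model = Birch(threshold=1.5, branching_factor=50, n_clusters=3)
model.fit(X)

# Subclusters found in the leaves, before the final global clustering.
n_subclusters = model.subcluster_centers_.shape[0]
# Final labels: one per input sample.
n_labels = len(np.unique(model.labels_))
```

Lowering `threshold` produces more, tighter subclusters in the CF tree; `n_clusters` then controls how those subclusters are grouped in the final step.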
3 Implementation
See the scikit-learn example: https://scikit-learn.org/stable/auto_examples/cluster/plot_birch_vs_minibatchkmeans.html#sphx-glr-auto-examples-cluster-plot-birch-vs-minibatchkmeans-py
4 Source code walkthrough
The source is in Anaconda3/Lib/site-packages/sklearn/cluster/birch.py (recent scikit-learn releases name the file _birch.py).
(1) Prerequisites
- hasattr() checks whether an instance has an attribute or method with the given name and returns True or False.
  hasattr(obj, name), where obj is the instance and name is the attribute or method name.
- getattr() returns the value of the named attribute of an instance.
  getattr(obj, name[, default]), where obj is the instance, name is the attribute name, and default is an optional fallback value. If the lookup fails and default is not given, an AttributeError is raised; otherwise default is returned.
- setattr() is somewhat more versatile. Its basic use is to change the value of an attribute on an instance, but it can also add attributes or methods to an instance dynamically.
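The three built-ins can be seen side by side in a small example. The `Estimator` class and the attribute values are hypothetical stand-ins; only `hasattr`, `getattr`, and `setattr` themselves are real Python built-ins.

```python
class Estimator:
    """Hypothetical stand-in for a model object that may or may not be fitted."""
    def __init__(self):
        self.threshold = 0.5


est = Estimator()

has_threshold = hasattr(est, "threshold")   # the attribute exists
has_root = hasattr(est, "root_")            # not set yet

missing = getattr(est, "root_", None)       # default returned instead of an error

setattr(est, "threshold", 0.8)              # modify an existing attribute
setattr(est, "root_", "cf-tree-root")       # add a new attribute dynamically

try:
    getattr(est, "labels_")                 # no default given: raises
    raised = False
except AttributeError:
    raised = True
```

This pattern is useful for checking whether a model has been fitted yet, since fitted attributes such as `root_` only exist after fitting.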
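(2) The Birch class
- Birch(BaseEstimator, TransformerMixin, ClusterMixin): the base classes are defined in sklearn's base module.
- Other parameters
- The fit function (the core computation lives in _fit)
Other functions:
Iterating over the rows of a sparse matrix: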
    def _iterate_sparse_X(X):
        """This little hack returns a densified row when iterating over a sparse
        matrix, instead of constructing a sparse matrix for every row that is
        expensive.
        """
        n_samples = X.shape[0]
        X_indices = X.indices
        X_data = X.data
        X_indptr = X.indptr

        for i in range(n_samples):
            row = np.zeros(X.shape[1])
            startptr, endptr = X_indptr[i], X_indptr[i + 1]
            nonzero_indices = X_indices[startptr:endptr]
            row[nonzero_indices] = X_data[startptr:endptr]
            yield row

Splitting a node: the function creates two empty subclusters and two empty CF nodes and attaches each node as the child of one subcluster. If the node being split is a leaf, the pointers of the leaf linked list are rewired. It then computes the pairwise squared distances between the centroids of the node's subclusters, picks the pair that is farthest apart, and assigns every subcluster to whichever member of that pair it is closer to.
    def _split_node(node, threshold, branching_factor):
        """The node has to be split if there is no place for a new subcluster
        in the node.
        1. Two empty nodes and two empty subclusters are initialized.
        2. The pair of distant subclusters are found.
        3. The properties of the empty subclusters and nodes are updated
           according to the nearest distance between the subclusters to the
           pair of distant subclusters.
        4. The two nodes are set as children to the two subclusters.
        """
        new_subcluster1 = _CFSubcluster()
        new_subcluster2 = _CFSubcluster()
        new_node1 = _CFNode(
            threshold=threshold, branching_factor=branching_factor,
            is_leaf=node.is_leaf,
            n_features=node.n_features)
        new_node2 = _CFNode(
            threshold=threshold, branching_factor=branching_factor,
            is_leaf=node.is_leaf,
            n_features=node.n_features)
        new_subcluster1.child_ = new_node1
        new_subcluster2.child_ = new_node2

        # Splice the two new leaves into the doubly linked list of leaves.
        if node.is_leaf:
            if node.prev_leaf_ is not None:
                node.prev_leaf_.next_leaf_ = new_node1
            new_node1.prev_leaf_ = node.prev_leaf_
            new_node1.next_leaf_ = new_node2
            new_node2.prev_leaf_ = new_node1
            new_node2.next_leaf_ = node.next_leaf_
            if node.next_leaf_ is not None:
                node.next_leaf_.prev_leaf_ = new_node2

        # Pairwise squared distances between the subcluster centroids.
        dist = euclidean_distances(
            node.centroids_, Y_norm_squared=node.squared_norm_, squared=True)
        n_clusters = dist.shape[0]

        # Indices of the two subclusters that are farthest apart.
        farthest_idx = np.unravel_index(
            dist.argmax(), (n_clusters, n_clusters))
        node1_dist, node2_dist = dist[(farthest_idx,)]

        # Each subcluster goes to whichever of the two it is closer to.
        node1_closer = node1_dist < node2_dist
        for idx, subcluster in enumerate(node.subclusters_):
            if node1_closer[idx]:
                new_node1.append_subcluster(subcluster)
                new_subcluster1.update(subcluster)
            else:
                new_node2.append_subcluster(subcluster)
                new_subcluster2.update(subcluster)
        return new_subcluster1, new_subcluster2

Retrieving the leaf nodes:
    def _get_leaves(self):
        """
        Retrieve the leaves of the CF Node.

        Returns
        -------
        leaves : list of shape (n_leaves,)
            List of the leaf nodes.
        """
        leaf_ptr = self.dummy_leaf_.next_leaf_
        leaves = []
        while leaf_ptr is not None:
            leaves.append(leaf_ptr)
            leaf_ptr = leaf_ptr.next_leaf_
        return leaves

Global clustering: this step brings in the AgglomerativeClustering algorithm (to be covered in a separate post).
    def _global_clustering(self, X=None):
        """
        Global clustering for the subclusters obtained after fitting
        """
        clusterer = self.n_clusters
        centroids = self.subcluster_centers_
        compute_labels = (X is not None) and self.compute_labels

        # Preprocessing for the global clustering.
        not_enough_centroids = False
        if isinstance(clusterer, numbers.Integral):
            clusterer = AgglomerativeClustering(
                n_clusters=self.n_clusters)
            # There is no need to perform the global clustering step.
            if len(centroids) < self.n_clusters:
                not_enough_centroids = True
        elif (clusterer is not None and not
              hasattr(clusterer, 'fit_predict')):
            raise ValueError("n_clusters should be an instance of "
                             "ClusterMixin or an int")

        # To use in predict to avoid recalculation.
        self._subcluster_norms = row_norms(
            self.subcluster_centers_, squared=True)

        if clusterer is None or not_enough_centroids:
            self.subcluster_labels_ = np.arange(len(centroids))
            if not_enough_centroids:
                warnings.warn(
                    "Number of subclusters found (%d) by Birch is less "
                    "than (%d). Decrease the threshold."
                    % (len(centroids), self.n_clusters), ConvergenceWarning)
        else:
            # The global clustering step that clusters the subclusters of
            # the leaves. It assumes the centroids of the subclusters as
            # samples and finds the final centroids.
            self.subcluster_labels_ = clusterer.fit_predict(
                self.subcluster_centers_)

        if compute_labels:
            self.labels_ = self.predict(X)
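The idea behind the global clustering step can be reproduced outside the class. In this sketch the centroid array and the two raw samples are made-up values standing in for `subcluster_centers_` and the input data; the nearest-centroid labelling is a simplified stand-in for what `predict` does, not a copy of it.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical subcluster centroids, standing in for subcluster_centers_.
subcluster_centers = np.array([
    [0.0, 0.0], [0.5, 0.2], [0.1, 0.6],   # group near the origin
    [8.0, 8.0], [8.4, 7.7],               # group near (8, 8)
])

# Treat each centroid as one sample and cluster the centroids themselves.
clusterer = AgglomerativeClustering(n_clusters=2)
subcluster_labels = clusterer.fit_predict(subcluster_centers)

# Label a raw sample with the label of its nearest subcluster centroid.
samples = np.array([[0.2, 0.1], [7.9, 8.2]])
dists = np.linalg.norm(
    samples[:, None, :] - subcluster_centers[None, :, :], axis=2)
sample_labels = subcluster_labels[dists.argmin(axis=1)]
```

Because only the centroids are clustered, the cost of this step depends on the number of subclusters rather than on the number of original samples.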
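(3) CFNode
| Parameter | Description | Attribute | Description |
| threshold : float | Threshold that decides whether a subcluster can absorb a new one | subclusters_ : list | Subclusters held by this node |
| branching_factor : int | Branching factor (maximum number of subclusters per node) | prev_leaf_ : _CFNode | Previous leaf node |
| is_leaf : bool | Whether the node is a leaf | next_leaf_ : _CFNode | Next leaf node |
| n_features : int | Number of features | init_centroids_ | Preallocated centroid buffer, shape=(branching_factor + 1, n_features) |
| | | init_sq_norm_ | Preallocated squared-norm buffer, shape=(branching_factor + 1,) |
| | | centroids_ | Centroids (view of init_centroids_) |
| | | squared_norm_ | Squared norms of the centroids (view of init_sq_norm_) |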
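The node stores centroids and their squared norms so that the nearest subcluster can be found with one matrix-vector product. The sketch below mimics that layout with plain NumPy arrays; the variable names and the sample values are assumptions for illustration, not the class's actual fields.

```python
import numpy as np

branching_factor, n_features = 4, 2
# Preallocated buffers, mirroring init_centroids_ and init_sq_norm_.
init_centroids = np.zeros((branching_factor + 1, n_features))
init_sq_norm = np.zeros(branching_factor + 1)

stored = [np.array([0.0, 0.0]), np.array([3.0, 4.0]), np.array([10.0, 0.0])]
for i, c in enumerate(stored):
    init_centroids[i] = c
    init_sq_norm[i] = np.dot(c, c)

n = len(stored)
centroids = init_centroids[:n, :]   # views onto the buffers: no copy is made
squared_norm = init_sq_norm[:n]

x = np.array([2.0, 3.0])
# ||c - x||^2 = ||c||^2 - 2 c.x + ||x||^2; the last term is identical for
# every c, so it can be dropped when only the argmin is needed.
reduced = squared_norm - 2.0 * centroids.dot(x)
closest_fast = int(np.argmin(reduced))
closest_full = int(np.argmin(((centroids - x) ** 2).sum(axis=1)))

# A write to the buffer is visible through the view.
init_centroids[0] = [1.0, 1.0]
view_updated = bool(centroids[0, 0] == 1.0)
```

Keeping `centroids` as a view means an in-place update of one buffer row is enough to keep the node's distance computation current.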
CFNode consists of three functions:
First function: append_subcluster(self, subcluster) appends a subcluster to the node and records its centroid and squared norm.

    def append_subcluster(self, subcluster):
        # Number of subclusters currently stored in this node.
        n_samples = len(self.subclusters_)
        # Add the new subcluster to the node.
        self.subclusters_.append(subcluster)
        # Store the new subcluster's centroid and squared norm in the
        # preallocated buffers.
        self.init_centroids_[n_samples] = subcluster.centroid_
        self.init_sq_norm_[n_samples] = subcluster.sq_norm_

        # Keep centroids and squared norm as views. In this way
        # if we change init_centroids and init_sq_norm_, it is
        # sufficient.
        # Extend the views so that they cover the newly added entry.
        self.centroids_ = self.init_centroids_[:n_samples + 1, :]
        self.squared_norm_ = self.init_sq_norm_[:n_samples + 1]

Second function: update_split_subclusters(self, subcluster, new_subcluster1, new_subcluster2) replaces a subcluster with the two subclusters produced by a split.
    def update_split_subclusters(self, subcluster,
                                 new_subcluster1, new_subcluster2):
        """Remove a subcluster from a node and update it with the
        split subclusters.
        """
        ind = self.subclusters_.index(subcluster)
        self.subclusters_[ind] = new_subcluster1
        self.init_centroids_[ind] = new_subcluster1.centroid_
        self.init_sq_norm_[ind] = new_subcluster1.sq_norm_
        self.append_subcluster(new_subcluster2)

Third function: insert_cf_subcluster(self, subcluster) inserts a new subcluster into the node.
    def insert_cf_subcluster(self, subcluster):
        """Insert a new subcluster into the node."""
        # If the node holds no subclusters yet, simply append the new one.
        if not self.subclusters_:
            self.append_subcluster(subcluster)
            return False

        threshold = self.threshold
        branching_factor = self.branching_factor
        # We need to find the closest subcluster among all the
        # subclusters so that we can insert our new subcluster.
        # Distances up to a constant term: ||c||^2 - 2 * c.x
        dist_matrix = np.dot(self.centroids_, subcluster.centroid_)
        dist_matrix *= -2.
        dist_matrix += self.squared_norm_
        closest_index = np.argmin(dist_matrix)
        closest_subcluster = self.subclusters_[closest_index]

        # If the subcluster has a child, we need a recursive strategy.
        if closest_subcluster.child_ is not None:
            split_child = closest_subcluster.child_.insert_cf_subcluster(
                subcluster)

            if not split_child:
                # If it is determined that the child need not be split, we
                # can just update the closest_subcluster
                closest_subcluster.update(subcluster)
                self.init_centroids_[closest_index] = \
                    self.subclusters_[closest_index].centroid_
                self.init_sq_norm_[closest_index] = \
                    self.subclusters_[closest_index].sq_norm_
                return False

            # things not too good. we need to redistribute the subclusters in
            # our child node, and add a new subcluster in the parent
            # subcluster to accommodate the new child.
            else:
                new_subcluster1, new_subcluster2 = _split_node(
                    closest_subcluster.child_, threshold, branching_factor)
                self.update_split_subclusters(
                    closest_subcluster, new_subcluster1, new_subcluster2)

                if len(self.subclusters_) > self.branching_factor:
                    return True
                return False

        # good to go!
        else:
            # Merge only if the radius of the merged subcluster stays
            # within the threshold.
            merged = closest_subcluster.merge_subcluster(
                subcluster, self.threshold)
            # If the merge succeeded, refresh the cached centroid and
            # squared norm of the closest subcluster.
            if merged:
                self.init_centroids_[closest_index] = \
                    closest_subcluster.centroid_
                self.init_sq_norm_[closest_index] = \
                    closest_subcluster.sq_norm_
                return False

            # not close to any other subclusters, and we still
            # have space, so add.
            elif len(self.subclusters_) < self.branching_factor:
                self.append_subcluster(subcluster)
                return False

            # We do not have enough space nor is it closer to an
            # other subcluster. We need to split.
            else:
                self.append_subcluster(subcluster)
                return True

(4) CFSubcluster
| Parameter | Description | Attribute | Description |
| linear_sum : ndarray | Sample | n_samples_ : int | Number of samples in the subcluster |
| | | linear_sum_ : ndarray | Linear sum of all samples in the subcluster |
| | | squared_sum_ : float | Sum of the squared L2 norms of all samples in the subcluster |
| | | centroid_ | Centroid (linear_sum_ / n_samples_) |
| | | child_ | Child node |
| | | sq_norm_ | Squared norm of the centroid |
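These statistics are additive, which is why two subclusters can be combined without revisiting their samples. The sketch below uses the standard BIRCH clustering feature triple (N, LS, SS), matching n_samples_, linear_sum_, and squared_sum_; the helper names `clustering_feature` and `merge_cf` are hypothetical.

```python
import numpy as np


def clustering_feature(points):
    """Return the CF triple (N, LS, SS) for an array of shape (n, d)."""
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    linear_sum = points.sum(axis=0)            # LS: per-dimension sum
    squared_sum = float((points ** 2).sum())   # SS: sum of squared L2 norms
    return n, linear_sum, squared_sum


def merge_cf(cf_a, cf_b):
    """Merge two CF triples by component-wise addition."""
    return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]


a = np.array([[1.0, 2.0], [2.0, 2.0]])
b = np.array([[4.0, 6.0]])

merged = merge_cf(clustering_feature(a), clustering_feature(b))
direct = clustering_feature(np.vstack([a, b]))   # computed from all points
centroid = merged[1] / merged[0]
```

The merged triple equals the one computed from the pooled points, and the centroid falls out as LS / N.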
CFSubcluster consists of three functions:
First function: update(self, subcluster) updates the statistics (sample count, linear sum, squared sum, centroid, and squared norm).

    def update(self, subcluster):
        self.n_samples_ += subcluster.n_samples_
        self.linear_sum_ += subcluster.linear_sum_
        self.squared_sum_ += subcluster.squared_sum_
        self.centroid_ = self.linear_sum_ / self.n_samples_
        self.sq_norm_ = np.dot(self.centroid_, self.centroid_)

Second function: merge_subcluster(self, nominee_cluster, threshold) merges the nominee subcluster into this one if the radius of the merged subcluster stays within the threshold.
    def merge_subcluster(self, nominee_cluster, threshold):
        """Check if a cluster is worthy enough to be merged. If
        yes then merge.
        """
        new_ss = self.squared_sum_ + nominee_cluster.squared_sum_
        new_ls = self.linear_sum_ + nominee_cluster.linear_sum_
        new_n = self.n_samples_ + nominee_cluster.n_samples_
        new_centroid = (1 / new_n) * new_ls
        new_norm = np.dot(new_centroid, new_centroid)
        dot_product = (-2 * new_n) * new_norm
        sq_radius = (new_ss + dot_product) / new_n + new_norm
        if sq_radius <= threshold ** 2:
            (self.n_samples_, self.linear_sum_, self.squared_sum_,
             self.centroid_, self.sq_norm_) = \
                new_n, new_ls, new_ss, new_centroid, new_norm
            return True
        return False

Third function: radius(self) computes the radius of the subcluster.
    def radius(self):
        """Return radius of the subcluster"""
        dot_product = -2 * np.dot(self.linear_sum_, self.centroid_)
        return sqrt(
            ((self.squared_sum_ + dot_product) / self.n_samples_) +
            self.sq_norm_)
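The radius formula can be checked against the direct definition, the root mean squared distance of the samples to their centroid. The sample points and the threshold below are made-up values for illustration; the arithmetic follows the radius and merge_subcluster code above.

```python
import numpy as np

points = np.array([[1.0, 2.0], [3.0, 2.0], [2.0, 5.0], [2.0, 3.0]])

# CF statistics of the subcluster.
n = points.shape[0]
linear_sum = points.sum(axis=0)
squared_sum = float((points ** 2).sum())
centroid = linear_sum / n
sq_norm = float(np.dot(centroid, centroid))

# Radius computed from the CF statistics only, as in radius().
dot_product = -2.0 * np.dot(linear_sum, centroid)
radius_cf = np.sqrt((squared_sum + dot_product) / n + sq_norm)

# Radius computed directly from the raw points.
radius_direct = np.sqrt(((points - centroid) ** 2).sum(axis=1).mean())

# The same comparison that merge_subcluster applies to a candidate merge.
threshold = 1.5
would_merge = radius_cf ** 2 <= threshold ** 2
```

Both routes give a squared radius of 2 for these points, so a threshold of 1.5 (squared: 2.25) would accept the merge.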