An Introduction to Clustering Algorithms
Take a look at the image below. It’s a collection of bugs and creepy-crawlies of different shapes and sizes. Take a moment to categorize them by similarity into a number of groups.
This isn’t a trick question. Start with grouping the spiders together.
Done? While there’s not necessarily a “correct” answer here, it’s most likely you split the bugs into four clusters. The spiders in one cluster, the pair of snails in another, the butterflies and moth into one, and the trio of wasps and bees into one more.
That wasn’t too bad, was it? You could probably do the same with twice as many bugs, right? If you had a bit of time to spare — or a passion for entomology — you could probably even do the same with a hundred bugs.
For a machine though, grouping ten objects into however many meaningful clusters is no small task, thanks to a mind-bending branch of maths called combinatorics, which tells us that there are 115,975 different possible ways you could have grouped those ten insects together.
Had there been twenty bugs, there would have been over fifty trillion possible ways of clustering them.
With a hundred bugs — there’d be many times more solutions than there are particles in the known universe.
How many times more? By my calculation, approximately five hundred million billion billion times more. In fact, there are more than four million billion googol solutions (what’s a googol?).
For just a hundred objects.
Almost all of those solutions would be meaningless — yet from that unimaginable number of possible choices, you pretty quickly found one of the very few that clustered the bugs in a useful way.
We humans take for granted how good we are at categorizing and making sense of large volumes of data pretty quickly. Whether it’s a paragraph of text, or images on a screen, or a sequence of objects — humans are generally fairly efficient at making sense of whatever data the world throws at us.
Given that a key aspect of developing A.I. and machine learning is getting machines to quickly make sense of large sets of input data, what shortcuts are there available?
Here, you can read about three clustering algorithms that machines can use to quickly make sense of large datasets. This is by no means an exhaustive list — there are other algorithms out there — but they represent a good place to start!
You’ll find for each a quick summary of when you might use them, a brief overview of how they work, and a more detailed, step-by-step worked example. I believe it helps to understand an algorithm by actually carrying it out yourself.
If you’re really keen, you’ll find the best way to do this is with pen and paper. Go ahead — nobody will judge!
K-means clustering
Use when...
…you have an idea of how many groups you’re expecting to find a priori.
How it works
The algorithm randomly assigns each observation into one of k categories, then calculates the mean of each category. Next, it reassigns each observation to the category with the closest mean before recalculating the means. This step repeats over and over until no more reassignments are necessary.
Worked Example
Take a group of nine football (or ‘soccer’) players who have each scored a certain number of goals this season (say in the range 3–30). Let’s divide them into three separate clusters.
Step 1 requires us to randomly split the players into three groups and calculate the means of each.
Group 1: Player A (5 goals), Player B (20 goals), Player C (11 goals). Group Mean = (5 + 20 + 11) / 3 = 12 goals.
Group 2: Player D (5 goals), Player E (3 goals), Player F (19 goals). Group Mean = 9 goals.
Group 3: Player G (30 goals), Player H (3 goals), Player I (15 goals). Group Mean = 16 goals.

Step 2: For each player, reassign them to the group with the closest mean. E.g., Player A (5 goals) is assigned to Group 2 (mean = 9). Then recalculate the group means.
Group 1 (Old Mean = 12 goals): Player C (11 goals). New Mean = 11 goals.
Group 2 (Old Mean = 9 goals): Player A (5 goals), Player D (5 goals), Player E (3 goals), Player H (3 goals). New Mean = 4 goals.
Group 3 (Old Mean = 16 goals): Player G (30 goals), Player I (15 goals), Player B (20 goals), Player F (19 goals). New Mean = 21 goals.

Repeat Step 2 over and over until the group means no longer change. For this somewhat contrived example, this happens on the next iteration. Stop! You have now formed three clusters from the dataset!
Group 1 (Old Mean = 11 goals): Player C (11 goals), Player I (15 goals). Final Mean = 13 goals.
Group 2 (Old Mean = 4 goals): Player A (5 goals), Player D (5 goals), Player E (3 goals), Player H (3 goals). Final Mean = 4 goals.
Group 3 (Old Mean = 21 goals): Player G (30 goals), Player B (20 goals), Player F (19 goals). Final Mean = 23 goals.

With this example, the clusters could correspond to the players’ positions on the field — such as defenders, midfielders and attackers.
K-means works here because we could have reasonably expected the data to fall naturally into these three categories.
In this way, given data on a range of performance statistics, a machine could do a reasonable job of estimating the positions of players from any team sport — useful for sports analytics, and indeed any other purpose where classification of a dataset into predefined groups can provide relevant insights.
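The whole procedure can be condensed into a short Python sketch. This is a minimal one-dimensional version using the goal tallies from the worked example; real implementations handle multi-dimensional data and smarter seeding:

```python
import random

def k_means_1d(values, k, seed=42):
    """Minimal 1-D K-means: random initial assignment, then alternate
    recompute-means / reassign-to-nearest-mean until nothing moves."""
    rng = random.Random(seed)
    assignment = [rng.randrange(k) for _ in values]   # random initial split
    for _ in range(100):                              # cap iterations as a safeguard
        means = []
        for g in range(k):
            members = [v for v, a in zip(values, assignment) if a == g]
            # An emptied group gets an unreachable mean so nothing picks it
            means.append(sum(members) / len(members) if members else float("inf"))
        # Reassign every observation to the group with the closest mean
        new_assignment = [min(range(k), key=lambda g: abs(v - means[g]))
                          for v in values]
        if new_assignment == assignment:              # no reassignments: converged
            break
        assignment = new_assignment
    return assignment, means

# Goals scored by players A-I in the worked example
goals = [5, 20, 11, 5, 3, 19, 30, 3, 15]
clusters, means = k_means_1d(goals, k=3)
```

After convergence, players with identical tallies always end up in the same cluster, since the final assignment depends only on distance to the group means.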
Finer details
There are several variations on the algorithm described here. The initial method of ‘seeding’ the clusters can be done in one of several ways.
Here, we randomly assigned every player into a group, then calculated the group means. This causes the initial group means to tend towards being similar to one another, which ensures greater repeatability.
An alternative is to seed the clusters with just one player each, then start assigning players to the nearest cluster. The returned clusters are more sensitive to the initial seeding step, reducing repeatability in highly variable datasets.
However, this approach may reduce the number of iterations required to complete the algorithm, as the groups will take less time to diverge.
An obvious limitation to K-means clustering is that you have to provide a priori assumptions about how many clusters you’re expecting to find.
There are methods to assess the fit of a particular set of clusters. For example, the Within-Cluster Sum-of-Squares is a measure of the variance within each cluster.
The ‘better’ the clusters, the lower the overall WCSS.
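As a quick sketch, the WCSS for the final clustering in the worked example can be computed directly (the numeric cluster labels below are just illustrative indices):

```python
def wcss(values, assignment, k):
    """Within-Cluster Sum-of-Squares: total squared distance of each
    observation from its own cluster's mean (lower = tighter clusters)."""
    total = 0.0
    for g in range(k):
        members = [v for v, a in zip(values, assignment) if a == g]
        if members:
            mean = sum(members) / len(members)
            total += sum((v - mean) ** 2 for v in members)
    return total

# Final clusters from the K-means worked example
goals      = [11, 15, 5, 5, 3, 3, 30, 20, 19]
assignment = [0,  0,  1, 1, 1, 1, 2,  2,  2]
print(wcss(goals, assignment, k=3))  # → 86.0
```

Comparing this value across candidate clusterings (or across different values of k) is one common way to judge which fits the data best.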
Hierarchical clustering
Use when...
…you wish to uncover the underlying relationships between your observations.
How it works
A distance matrix is computed, where the value of cell (i, j) is a distance metric between observations i and j.
Then, pair the closest two observations and calculate their average. Form a new distance matrix, merging the paired observations into a single object.
From this distance matrix, pair up the closest two observations and calculate their average. Repeat until all observations are grouped together.
Worked example
Here’s a super-simplified dataset about a selection of whale and dolphin species. As a trained biologist, I can assure you we normally use much more detailed datasets for things like reconstructing phylogeny.
For now though, we’ll just look at the typical body lengths for these six species. We’ll be using just two repeated steps.
    Species              Initials   Length (m)
    Bottlenose Dolphin   BD         3.0
    Risso's Dolphin      RD         3.6
    Pilot Whale          PW         6.5
    Killer Whale         KW         7.5
    Humpback Whale       HW         15.0
    Fin Whale            FW         20.0

Step 1: compute a distance matrix between each species. Here, we’ll use the Euclidean distance — how far apart are the data points?
Read this exactly as you would a distance chart in a road atlas. The difference in length between any pair of species can be looked up by reading the value at the intersection of the relevant row and column.
          BD     RD     PW     KW     HW
    RD    0.6
    PW    3.5    2.9
    KW    4.5    3.9    1.0
    HW    12.0   11.4   8.5    7.5
    FW    17.0   16.4   13.5   12.5   5.0

Step 2: Pair up the two closest species. Here, this will be the Bottlenose & Risso’s Dolphins, with an average length of 3.3m.
Repeat Step 1 by recalculating the distance matrix, but this time merge the Bottlenose & Risso’s Dolphins into a single object with length 3.3m.
          [BD, RD]   PW     KW     HW
    PW    3.2
    KW    4.2        1.0
    HW    11.7       8.5    7.5
    FW    16.7       13.5   12.5   5.0

Next, repeat Step 2 with this new distance matrix. Here, the smallest distance is between the Pilot & Killer Whales, so we pair them up and take their average — which gives us 7.0m.
Then, we repeat Step 1 — recalculate the distance matrix, but now we’ve merged the Pilot & Killer Whales into a single object of length 7.0m.
                [BD, RD]   [PW, KW]   HW
    [PW, KW]    3.7
    HW          11.7       8.0
    FW          16.7       13.0       5.0

Next, repeat Step 2 with this distance matrix. The smallest distance (3.7m) is between the two merged objects — so now merge them into an even bigger object, and take the average (which is 5.2m).
Then, repeat Step 1 and compute a new distance matrix, having merged the Bottlenose & Risso’s Dolphins with the Pilot & Killer Whales.
          [[BD, RD], [PW, KW]]   HW
    HW    9.8
    FW    14.8                   5.0

Next, repeat Step 2. The smallest distance (5.0m) is between the Humpback & Fin Whales, so merge them into a single object, and take the average (17.5m).
Then, it’s back to Step 1 — compute the distance matrix, having merged the Humpback & Fin Whales.
                [[BD, RD], [PW, KW]]
    [HW, FW]    12.3

Finally, repeat Step 2 — there is only one distance (12.3m) in this matrix, so pair everything into one big object. Now you can stop! Look at the final merged object:
    [[[BD, RD], [PW, KW]], [HW, FW]]

It has a nested structure (think JSON), which allows it to be drawn up as a tree-like graph, or ‘dendrogram’.
It reads in much the same way a family tree might. The nearer two observations are on the tree, the more similar or closely-related they are taken to be.
The structure of the dendrogram gives insight into how the dataset is structured.
In this example, there are two main branches, with Humpback Whale and Fin Whale on one side, and the Bottlenose Dolphin/Risso’s Dolphin and Pilot Whale/Killer Whale on the other.
In evolutionary biology, much larger datasets with many more specimens and measurements are used in this way to infer taxonomic relationships between them.
Outside of biology, hierarchical clustering has applications in data mining and machine learning contexts.
The cool thing is that this approach requires no assumptions about the number of clusters you’re looking for.
You can split the returned dendrogram into clusters by “cutting” the tree at a given height. This height can be chosen in a number of ways, depending on the resolution at which you wish to cluster the data.
For instance, looking at the dendrogram above, if we draw a horizontal line at height = 10, we’d intersect the two main branches, splitting the dendrogram into two sub-graphs. If we cut at height = 2, we’d be splitting the dendrogram into three clusters.
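The merging loop above can be sketched in a few lines of Python. As in the worked example, the distance between two clusters is the difference of their representative lengths, and merging takes the plain average of the two values:

```python
def agglomerate(species):
    """Agglomerative clustering as in the worked example: repeatedly merge
    the closest pair of clusters and average their lengths."""
    # Each cluster is (nested label structure, representative length)
    clusters = list(species)
    while len(clusters) > 1:
        # Steps 1 & 2: build the distance matrix and find the closest pair
        pairs = [(abs(a[1] - b[1]), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        _, i, j = min(pairs)
        (la, va), (lb, vb) = clusters[i], clusters[j]
        merged = ([la, lb], (va + vb) / 2)  # merge the pair; take the average
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

species = [("BD", 3.0), ("RD", 3.6), ("PW", 6.5), ("KW", 7.5),
           ("HW", 15.0), ("FW", 20.0)]
print(agglomerate(species))  # → [[['BD', 'RD'], ['PW', 'KW']], ['HW', 'FW']]
```

The returned nesting reproduces the dendrogram structure derived by hand above. Library implementations (e.g. SciPy's `scipy.cluster.hierarchy`) also record the merge heights, which is what lets you "cut" the tree at a chosen level.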
Finer details
There are essentially three aspects in which hierarchical clustering algorithms can vary from the one given here.
Most fundamental is the approach — here, we have used an agglomerative process, whereby we start with individual data points and iteratively cluster them together until we’re left with one large cluster.
An alternative (but more computationally intensive) approach is to start with one giant cluster, and then proceed to divide the data into smaller and smaller clusters until you’re left with isolated data points.
There are also a range of methods that can be used to calculate the distance matrices. For many purposes, the Euclidean distance (think Pythagoras’ Theorem) will suffice, but there are alternatives that may be more applicable in some circumstances.
Finally, the linkage criterion can also vary. Clusters are linked according to how close they are to one another, but the way in which we define ‘close’ is flexible.
In the example above, we measured the distances between the means (or ‘centroids’) of each group and paired up the nearest groups. However, you may want to use a different definition.
For example, each cluster is made up of several discrete points. You could define the distance between two clusters to be the minimum (or maximum) distance between any of their points — as illustrated in the figure below.
There are still other ways of defining the linkage criterion, which may be suitable in different contexts.
Graph Community Detection
Use when...
…you have data that can be represented as a network, or ‘graph’.
How it works
A graph community is very generally defined as a subset of vertices which are more connected to each other than with the rest of the network.
Various algorithms exist to identify communities, based upon more specific definitions. Algorithms include, but are not limited to: Edge Betweenness, Modularity-Maximisation, Walktrap, Clique Percolation, Leading Eigenvector…
Worked example
Graph theory, or the mathematical study of networks, is a fascinating branch of mathematics that lets us model complex systems as an abstract collection of ‘dots’ (or vertices) connected by ‘lines’ (or edges).
Perhaps the most intuitive case-studies are social networks.
Here, the vertices represent people, and edges connect vertices who are friends/followers. However, any system can be modelled as a network if you can justify a method to meaningfully connect different components.
Among the more innovative applications of graph theory to clustering are feature extraction from image data and the analysis of gene regulatory networks.
As an entry-level example, take a look at this quickly put-together graph. It shows the eight websites I most recently visited, linked according to whether their respective Wikipedia articles link out to one another.
You could assemble this data manually, but for larger-scale projects, it’s much quicker to write a Python script to do the same. Here’s one I wrote earlier.
The vertices are colored according to their community membership, and sized according to their centrality. See how Google and Twitter are the most central?
Also, the clusters make pretty good sense in the real-world (always an important performance indicator).
The yellow vertices are generally reference/look-up sites; the blue vertices are all used for online publishing (of articles, tweets, or code); and the red vertices include YouTube, which was of course founded by former PayPal employees. Not bad deductions for a machine.
Aside from being a useful way to visualize large systems, the real power of networks comes from their mathematical analysis. Let’s start by translating our nice picture of the network into a more mathematical format. Below is the adjacency matrix of the network.
               GH   Gl   M    P    Q    T    W    Y
    GitHub     0    1    0    0    0    1    0    0
    Google     1    0    1    1    1    1    1    1
    Medium     0    1    0    0    0    1    0    0
    PayPal     0    1    0    0    0    1    0    1
    Quora      0    1    0    0    0    1    1    0
    Twitter    1    1    1    1    1    0    0    1
    Wikipedia  0    1    0    0    1    0    0    0
    YouTube    0    1    0    1    0    1    0    0

The value at the intersection of each row and column records whether there is an edge between that pair of vertices.
For instance, there is an edge between Medium and Twitter (surprise, surprise!), so the value where their rows/columns intersect is 1. Similarly, there is no edge between Medium and PayPal, so the intersection of their rows/columns returns 0.
Encoded within the adjacency matrix are all the properties of this network — it gives us the key to start unlocking all manner of valuable insights.
For a start, summing any column (or row) gives you the degree of each vertex — i.e., how many others it is connected to. This is commonly denoted with the letter k.
Likewise, summing the degrees of every vertex and dividing by two gives you L, the number of edges (or ‘links’) in the network. The number of rows/columns gives us N, the number of vertices (or ‘nodes’) in the network.
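For instance, here is how k, L and N fall straight out of the adjacency matrix from the worked example:

```python
# Adjacency matrix from the worked example (rows/columns: GH, Gl, M, P, Q, T, W, Y)
A = [[0,1,0,0,0,1,0,0],
     [1,0,1,1,1,1,1,1],
     [0,1,0,0,0,1,0,0],
     [0,1,0,0,0,1,0,1],
     [0,1,0,0,0,1,1,0],
     [1,1,1,1,1,0,0,1],
     [0,1,0,0,1,0,0,0],
     [0,1,0,1,0,1,0,0]]

k = [sum(row) for row in A]   # degree of each vertex
L = sum(k) // 2               # number of edges ('links')
N = len(A)                    # number of vertices ('nodes')

print(k, L, N)  # Google (k = 7) and Twitter (k = 6) come out as the most central
```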
Knowing just k, L, N and the value of each cell in the adjacency matrix A lets us calculate the modularity of any given clustering of the network.
Say we’ve clustered the network into a number of communities. We can use the modularity score to assess the ‘quality’ of this clustering.
A higher score will show we’ve split the network into ‘a(chǎn)ccurate’ communities, whereas a low score suggests our clusters are more random than insightful. The image below illustrates this.
Modularity can be calculated using the formula below:

    M = (1/2L) Σ [ A_ij − (k_i k_j)/2L ] δ(c_i, c_j),  summing over i, j = 1 … N
That’s a fair amount of math, but we can break it down bit by bit and it’ll make more sense.
M is of course what we’re calculating — modularity.
1/2L tells us to divide everything that follows by 2L, i.e., twice the number of edges in the network. So far, so good.
The Σ symbol tells us we’re summing up everything to the right, and lets us iterate over every row and column in the adjacency matrix A.
For those unfamiliar with sum notation, the i, j = 1 and the N work much like nested for-loops in programming. In Python, you’d write it as follows:
    sum = 0
    for i in range(1, N + 1):      # i, j = 1 … N, like the sum notation
        for j in range(1, N + 1):
            ans = ...              # stuff with i and j as indices
            sum += ans

So what is the “stuff with i and j” in more detail?
Well, the bit in brackets tells us to subtract ( k_i k_j ) / 2L from A_ij.
A_ij is simply the value in the adjacency matrix at row i, column j.
The values of k_i and k_j are the degrees of each vertex — found by adding up the entries in row i and column j respectively. Multiplying these together and dividing by 2L gives us the expected number of edges between vertices i and j if the network were randomly shuffled up.
Overall, the term in the brackets reveals the difference between the network’s real structure and the expected structure it would have if randomly reassembled.
Playing around with the values shows that it returns its highest value when A_ij = 1, and ( k_i k_j ) / 2L is low. This means we see a higher value if there is an ‘unexpected’ edge between vertices i and j.
Finally, we multiply the bracketed term by whatever the last few symbols refer to.
The δ(c_i, c_j) is the fancy-sounding but totally harmless Kronecker-delta function. Here it is, explained in Python:
    def kroneckerDelta(ci, cj):
        if ci == cj:
            return 1
        else:
            return 0

    kroneckerDelta("A", "A")  # returns 1
    kroneckerDelta("A", "B")  # returns 0

Yes — it really is that simple. The Kronecker-delta function takes two arguments, and returns 1 if they are identical; otherwise, it returns zero.
This means that if vertices i and j have been put in the same cluster, then δ(c_i, c_j) = 1. Otherwise, if they are in different clusters, the function returns zero.
As we are multiplying the bracketed term by this Kronecker-delta function, we find that for the nested sum Σ, the outcome is highest when there are lots of ‘unexpected’ edges connecting vertices assigned to the same cluster.
As such, modularity is a measure of how well-clustered the graph is into separate communities.
Dividing by 2L bounds the upper value of modularity at 1. Modularity scores near to or below zero indicate the current clustering of the network is really no use. The higher the modularity, the better the clustering of the network into separate communities.
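Putting the pieces together, the formula translates almost directly into Python. The three-way grouping below is an assumption for illustration (the exact colour assignment in the original figure isn’t recoverable here), but it scores well above zero, while lumping everything into one community scores essentially zero:

```python
def modularity(A, communities):
    """M = (1/2L) * sum over i, j of (A_ij - k_i*k_j / 2L) * delta(c_i, c_j)."""
    k = [sum(row) for row in A]   # vertex degrees
    two_L = sum(k)                # the sum of all degrees equals 2L
    total = 0.0
    n = len(A)
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:   # the Kronecker delta
                total += A[i][j] - k[i] * k[j] / two_L
    return total / two_L

# Adjacency matrix from the worked example (GH, Gl, M, P, Q, T, W, Y)
A = [[0,1,0,0,0,1,0,0],
     [1,0,1,1,1,1,1,1],
     [0,1,0,0,0,1,0,0],
     [0,1,0,0,0,1,0,1],
     [0,1,0,0,0,1,1,0],
     [1,1,1,1,1,0,0,1],
     [0,1,0,0,1,0,0,0],
     [0,1,0,1,0,1,0,0]]

# A hypothetical three-way split: reference sites (Google, Quora, Wikipedia),
# publishing sites (GitHub, Medium, Twitter), and PayPal/YouTube
split = ["pub", "ref", "pub", "pp", "ref", "pub", "ref", "pp"]

print(modularity(A, split))         # positive: better than random
print(modularity(A, ["one"] * 8))   # one giant community scores ~0
```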
By maximising modularity, we can find the best way of clustering the network.
Notice that we have to pre-define how the graph is clustered to find out how ‘good’ that clustering actually is.
Unfortunately, employing brute force to try out every possible way of clustering the graph to find which has the highest modularity score would be computationally impossible beyond a very limited sample size.
Combinatorics tells us that for a network of just eight vertices, there are 4140 different ways of clustering them. A network twice the size would have over ten billion possible ways of clustering the vertices.
Doubling the network again (to a very modest 32 vertices) would give 128 septillion possible ways, and a network of eighty vertices would be cluster-able in more ways than there are atoms in the observable universe.
Instead, we have to turn to a heuristic method that does a reasonably good job at estimating the clusters that will produce the highest modularity score, without trying out every single possibility.
This is an algorithm called Fast-Greedy Modularity-Maximization, and it’s somewhat analogous to the agglomerative hierarchical clustering algorithm described above. Instead of merging according to distance, ‘Mod-Max’ merges communities according to changes in modularity.
Here’s how it goes:
Begin by initially assigning every vertex to its own community, and calculating the modularity of the whole network, M.
Step 1 requires that for each community pair linked by at least a single edge, the algorithm calculates the resultant change in modularity ΔM if the two communities were merged into one.
Step 2 then takes the pair of communities that produce the biggest increase in ΔM, which are then merged. Calculate the new modularity M for this clustering, and keep a record of it.
Repeat steps 1 and 2 — each time merging the pair of communities for which doing so produces the biggest gain in ΔM, then recording the new clustering pattern and its associated modularity score M.
Stop when all the vertices are grouped into one giant cluster. Now the algorithm checks the records it kept as it went along, and identifies the clustering pattern that returned the highest value of M. This is the returned community structure.
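A deliberately naive sketch of this loop shows the idea; it recomputes modularity from scratch for every candidate merge, so it is not actually “fast” like production implementations. The two-triangle graph at the bottom is a made-up example, not from the article:

```python
import itertools

def modularity(A, comm):
    """M = (1/2L) * sum over i, j of (A_ij - k_i*k_j / 2L) * delta(c_i, c_j)."""
    k = [sum(row) for row in A]
    two_L = sum(k)
    n = len(A)
    return sum(A[i][j] - k[i] * k[j] / two_L
               for i in range(n) for j in range(n)
               if comm[i] == comm[j]) / two_L

def fast_greedy(A):
    """Greedy Mod-Max: merge the linked community pair giving the biggest
    modularity gain, and remember the best clustering seen along the way."""
    n = len(A)
    comm = list(range(n))                       # every vertex in its own community
    best_M, best_comm = modularity(A, comm), comm[:]
    while len(set(comm)) > 1:
        trials = []
        for a, b in itertools.combinations(sorted(set(comm)), 2):
            # Step 1: only consider community pairs linked by at least one edge
            if any(A[i][j] for i in range(n) for j in range(n)
                   if comm[i] == a and comm[j] == b):
                trials.append([a if c == b else c for c in comm])
        if not trials:
            break                               # disconnected graph: nothing to merge
        # Step 2: keep the merge that yields the highest modularity
        comm = max(trials, key=lambda t: modularity(A, t))
        M = modularity(A, comm)
        if M > best_M:
            best_M, best_comm = M, comm[:]
    return best_M, best_comm

# Toy graph: two triangles joined by a single edge (vertices 2 and 3)
A = [[0,1,1,0,0,0],
     [1,0,1,0,0,0],
     [1,1,0,1,0,0],
     [0,0,1,0,1,1],
     [0,0,0,1,0,1],
     [0,0,0,1,1,0]]
best_M, best_comm = fast_greedy(A)
```

On this toy graph the recorded optimum is the obvious one: each triangle becomes its own community.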
Finer details
Whew! That was computationally intensive, at least for us humans.
Graph theory is a rich source of computationally challenging, often NP-hard problems — yet it also has incredible potential to provide valuable insights into complex systems and datasets.
Just ask Larry Page, whose eponymous PageRank algorithm — which helped propel Google from start-up to basically world domination in less than a generation — was based entirely in graph theory.
Community detection is a major focus of current research in graph theory, and there are plenty of alternatives to Modularity-Maximization, which while useful, does have some drawbacks.
For a start, its agglomerative approach often sees small, well-defined communities swallowed up into larger ones. This is known as the resolution limit — the algorithm will not find communities below a certain size.
Another challenge is that rather than having one distinct, easy-to-reach global peak, the Mod-Max approach actually tends to produce a wide ‘plateau’ of many similar high modularity scores — making it somewhat difficult to truly identify the absolute maximum score.
Other algorithms use different ways to define and approach community detection.
Edge-Betweenness is a divisive algorithm, starting with all vertices grouped in one giant cluster. It proceeds to iteratively remove the least ‘important’ edges in the network, until all vertices are left isolated. This produces a hierarchical structure, with similar vertices closer together in the hierarchy.
Another algorithm is Clique Percolation, which takes into account possible overlap between graph communities.
Yet another set of algorithms are based on random-walks across the graph, and then there are spectral clustering methods which start delving into the eigendecomposition of the adjacency matrix and other matrices derived therefrom. These ideas are used in feature extraction in, for example, areas such as computer vision.
It’d be well beyond the scope of this article to give each algorithm its own in-depth worked example. Suffice to say that this is an active area of research, providing powerful methods to make sense of data that even a generation ago would have been extremely difficult to process.
Conclusion
Hopefully this article has informed and inspired you to better understand how machines can make sense of data. The future is a rapidly changing place, and many of those changes will be driven by what technology becomes capable of in the next generation or two.
As outlined in the introduction, machine learning is an extraordinarily ambitious field of research, in which massively complex problems require solving in as accurate and as efficient a way possible. Tasks that come naturally to us humans require innovative solutions when taken on by machines.
There’s still plenty of progress to be made, and whoever contributes the next breakthrough idea will no doubt be generously rewarded. Maybe someone reading this article will be behind the next powerful algorithm?
All great ideas have to start somewhere!
翻譯自: https://www.freecodecamp.org/news/how-machines-make-sense-of-big-data-an-introduction-to-clustering-algorithms-4bd97d4fbaba/
層次聚類算法 算法
總結(jié)
以上是生活随笔為你收集整理的层次聚类算法 算法_聚类算法简介的全部?jī)?nèi)容,希望文章能夠幫你解決所遇到的問(wèn)題。
- 上一篇: 梦到自己怀孕有什么预兆
- 下一篇: 机器学习 美股_我如何使用机器学习来探索