Data Mining: A Focus on the Apriori Algorithm
So here we are, diving into the world of data mining this time. Let's begin with a small but informative definition:
What is data mining?
Technically, it's a profound dive into datasets in search of correlations, rules, anomalies, and so on. It's a way to do simple but effective machine learning instead of doing it the hard way with regular neural networks, or with their even more complex cousins, convolutional and recurrent neural networks (we will definitely go through those thoroughly in future articles).
Data mining algorithms vary from one to another, and each has its own advantages and disadvantages. I will not go through all of them in this article, but the first one you should focus on must be the classical Apriori algorithm, as it is the opening gate to the data mining world.
But before going any further, there is some special data mining vocabulary that we need to get familiar with:
k-Itemset: an itemset is just a set of items; the k refers to its order/length, meaning the number of items contained in the itemset.
Transaction: a captured data record; it can refer, for example, to the items purchased together in a store. Note that the Apriori algorithm operates on datasets containing thousands or even millions of transactions.
Association rule: an antecedent → consequent relationship between two itemsets, written X → Y :
It implies the presence of the itemset Y (consequent) in the considered transaction, given the itemset X (antecedent).
Support: represents the popularity/frequency of an itemset, calculated this way:

Support(X) = (number of transactions containing X) / (total number of transactions)
Confidence ( X → Y ): shows how confident/true a rule is; in other words, the likelihood of finding the consequent itemset in a transaction that contains the antecedent, calculated this way:

Confidence(X → Y) = Support(X ∪ Y) / Support(X)
A rule is called a strong rule if its confidence is equal to 1.
Lift ( X → Y ): a measure of performance, indicating the quality of an association rule:

Lift(X → Y) = Confidence(X → Y) / Support(Y)
MinSup : a user-specified variable which stands for the minimum support threshold for itemsets.
MinConf : a user-specified variable which stands for the minimum confidence threshold for rules.
Frequent itemset: an itemset whose support is equal to or higher than the chosen minsup.
Infrequent itemset: an itemset whose support is less than the chosen minsup.
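To make these three metrics concrete, here is a small self-contained sketch; the five-transaction dataset and item names are made up purely for illustration:

```python
# Toy dataset: each transaction is a set of purchased items (hypothetical data).
transactions = [
    {"milk", "sugar"},
    {"milk", "bread"},
    {"milk", "sugar", "bread"},
    {"sugar"},
    {"milk", "sugar"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Likelihood of seeing the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Greater than 1 means the rule beats random co-occurrence."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"milk", "sugar"}))       # 3/5 = 0.6
print(confidence({"sugar"}, {"milk"}))  # 0.6 / 0.8 = 0.75
print(lift({"sugar"}, {"milk"}))        # 0.75 / 0.8 = 0.9375
```

Note that the lift here is below 1: sugar buyers are actually slightly less likely than average to buy milk in this toy dataset.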
So… how does Apriori work?
Starting with a historical glimpse: the algorithm was first proposed by the computer scientists Agrawal and Srikant in 1994. It proceeds this way:
- Generates possible combinations of k-itemsets (starting with k = 1)
- Calculates the support of each itemset
- Eliminates infrequent itemsets
- Increments k and repeats the process
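The four steps above can be sketched in a few lines of Python. This is my own minimal illustration, not the article's implementation; the function and variable names are assumptions:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all frequent itemsets (as frozensets) mapped to their support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    # k = 1: every single item is a candidate.
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    while candidates:
        # Support calculation for the current candidates.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        # Elimination: keep only itemsets meeting the minimum support.
        level = {c: counts[c] / n for c in candidates if counts[c] / n >= minsup}
        frequent.update(level)
        # Generation of (k+1)-candidates by joining frequent k-itemsets
        # that share a common prefix of length k-1.
        prev = sorted(tuple(sorted(c)) for c in level)
        candidates = [
            frozenset(a + (b[-1],))
            for a, b in combinations(prev, 2)
            if a[:-1] == b[:-1]
        ]
    return frequent

transactions = [{"milk", "sugar"}, {"milk", "bread"}, {"milk", "sugar", "bread"}]
print(apriori(transactions, minsup=2 / 3))
```

On this tiny example, {milk}, {sugar}, {bread}, {milk, sugar} and {milk, bread} come out frequent, while {bread, sugar} is pruned.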
Now, how do we generate those itemsets?
For itemsets of length k = 2, it is enough to consider every possible combination of two items (no permutation is needed). For k > 2, two conditions must be satisfied first:
The combined itemset must be formed from two frequent itemsets of length k-1; let's call them subsets.
Both subsets must share the same prefix of length k-2.
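As a small illustration (the item names are hypothetical), the two conditions boil down to a simple join on sorted itemsets:

```python
def join(a, b):
    """Join two frequent (k-1)-itemsets into a k-candidate, or return None.

    a and b are sorted tuples; they qualify only if they share
    the same prefix of length k-2.
    """
    if a[:-1] == b[:-1] and a[-1] < b[-1]:
        return a + (b[-1],)
    return None

print(join(("almonds", "milk"), ("almonds", "sugar")))  # ('almonds', 'milk', 'sugar')
print(join(("almonds", "milk"), ("bread", "sugar")))    # None: prefixes differ
```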
If you think about it, these steps simply extend the previously found frequent itemsets; this is called the 'bottom-up' approach. It also means that the Apriori algorithm respects the monotone property:
All subsets of a frequent itemset must also be frequent.
As well as the anti-monotone property:
All super-sets of an infrequent itemset must also be infrequent.
Okay, but wait a minute, this seems infinite!
No, luckily it is not infinite; the algorithm stops at a certain order k if:
All the generated itemsets of length k are infrequent
No common prefix of length k-2 is found, which makes it impossible to generate new itemsets of length k.
Sure… it's not rocket science! But how about an example to make this clearer?
Here's a small transaction table in binary format: the value of an item is 1 if it's present in the considered transaction, otherwise it's 0.
Great… it's time for some association rule mining!
Once you reach this part, all that's left to do is take one frequent k-itemset at a time and generate all its possible rules using binary partitioning.
If the 3-itemset {Almonds, Sugar, Milk} from the previous example were a frequent itemset, then the generated rules would look like:
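A binary partitioning simply splits the itemset into every non-empty antecedent/consequent pair, giving 2³ - 2 = 6 rules for a 3-itemset. A minimal sketch:

```python
from itertools import combinations

def binary_partitions(itemset):
    """Yield every (antecedent, consequent) split of a frequent itemset."""
    items = sorted(itemset)
    for r in range(1, len(items)):
        for antecedent in combinations(items, r):
            consequent = tuple(i for i in items if i not in antecedent)
            yield antecedent, consequent

for x, y in binary_partitions({"Almonds", "Sugar", "Milk"}):
    print(set(x), "->", set(y))
```

Each of the six rules would then be kept or discarded depending on whether its confidence reaches the chosen MinConf.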
An overview of my Apriori simulation! Using Python
Dataset
The dataset is in CSV format (comma-separated values), containing 7501 transactions of purchased items in a supermarket. Restructuring it with the TransactionEncoder class from the mlxtend library made use and manipulation much easier. The resulting structure occupies 871.8 KB, with 119 columns indexed by food name from "Almonds" to "Zucchini".
Here's an overview of the transaction table before and after:
Implementing the algorithm
I will not be posting any code fragments, as it was a straightforward approach: the procedure is recursive and calls the responsible functions for itemset generation, support calculation, elimination, and association rule mining, in that order.
The execution took 177 seconds, which seemed optimised and efficient thanks to Pandas and NumPy's ability to perform fast element-wise operations. All the association rules found were saved to an HTML file for later use.
Now, how about a tour of the supermarket? Using Dash by Plotly
Finally, I got to use the previously saved rules to suggest food items based on what my basket contains. Here's a quick preview:
Feel free to check my source code here.
請?jiān)诖颂庪S意檢查我的源代碼。
Translated from: https://medium.com/the-coded-theory/data-mining-a-focus-on-apriori-algorithm-b201d756c7ff