Why Probability Is Important in Data Science
Data Science
Many Data Science concepts build on fundamental knowledge of probability; a few popular examples are decision making, recommender systems, and deep learning. Well-known deep learning frameworks such as TensorFlow and PyTorch rely heavily on probabilistic concepts in their implementations. Understanding what probability is and how it works will therefore take us a long way along the path of learning Data Science.
Dependence and Independence
Roughly speaking, we say that two events E and F are dependent if knowing something about whether E happens gives us information about whether F happens (and vice versa). Otherwise they are independent.
For instance, if we flip a fair coin twice, knowing whether the first flip is Heads gives us no information about whether the second flip is Heads. These events are independent. On the other hand, knowing whether the first flip is Heads certainly gives us information about whether both flips are Tails. (If the first flip is Heads, then definitely it’s not the case that both flips are Tails.) These two events are dependent. Mathematically, we say that two events E and F are independent if the probability that they both happen is the product of the probabilities that each one happens:
P(E, F) = P(E) * P(F)

In the example above, the probability of “first flip Heads” is 1/2, and the probability of “both flips Tails” is 1/4, but the probability of “first flip Heads and both flips Tails” is 0.
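We can check this definition of independence empirically with a quick simulation. The sketch below (function name and trial count are my own choices) estimates P(E), P(F), and P(E, F) for two fair coin flips, where E = “first flip is Heads” and F = “second flip is Heads”:

```python
import random

def simulate(trials=100_000, seed=0):
    """Estimate P(E), P(F), and P(E and F) for two fair coin flips."""
    rng = random.Random(seed)
    e = f = both = 0
    for _ in range(trials):
        first = rng.random() < 0.5   # True means Heads
        second = rng.random() < 0.5
        e += first
        f += second
        both += first and second
    return e / trials, f / trials, both / trials

p_e, p_f, p_both = simulate()
# For independent events, P(E and F) should be close to P(E) * P(F).
print(p_e, p_f, p_both, p_e * p_f)
```

With enough trials, the estimated P(E and F) lands close to P(E) * P(F), as the definition predicts.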
Bayes’ Theorem
To understand how Bayes’ Theorem works, try to answer the question below:
Steve is very shy and withdrawn, invariably helpful but with very little interest in people or in the world of reality. A meek and tidy soul, he has a need for order and structure, and a passion for detail. How likely is Steve to be one of these:

1. A librarian
2. A farmer
Very often, we (irrationally) will think Steve is “most likely to be” a librarian. We would not think so, however, if we knew the ratio of farmers to librarians. Let’s just say it is probably 20 to 1.
Among librarians, let’s say 50% fit the character traits in the question, whereas among farmers, let’s say only 10% do.
Alright, so let’s say we have 10 librarians and 200 farmers. Then 5 librarians (50% of 10) and 20 farmers (10% of 200) fit the description, so the probability that Steve is a librarian given the description will be:
5 / (5 + 20) = 1/5 = 20%

So, if we guess the candidate is likely a librarian, we are probably WRONG.
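The counting argument above can be written out in code. This is a sketch using the example’s numbers (the helper name is my own); it expands P(E) with the law of total probability over the two categories:

```python
def posterior(prior_h, p_e_given_h, p_e_given_not_h):
    """P(H|E): probability the hypothesis is true given the evidence."""
    # Law of total probability: P(E) = P(H)P(E|H) + P(not H)P(E|not H)
    p_e = prior_h * p_e_given_h + (1 - prior_h) * p_e_given_not_h
    return prior_h * p_e_given_h / p_e

# 10 librarians vs 200 farmers -> prior P(librarian) = 10/210.
# 50% of librarians and 10% of farmers fit the description.
p_librarian = posterior(10 / 210, 0.5, 0.1)
print(p_librarian)  # 5 / (5 + 20) = 0.2
```

The result matches the by-hand calculation: only 20%, so “farmer” is the better guess despite the stereotype.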
Below is the formula of Bayes’ theorem.
P(H|E) = P(H) * P(E|H) / P(E)

where:

P(H) = probability that the hypothesis is true, before any evidence
P(E|H) = probability of seeing the evidence if the hypothesis is true
P(E) = probability of seeing the evidence
P(H|E) = probability that the hypothesis is true given some evidence

Random Variable
A random variable is a variable whose possible values have an associated probability distribution. A very simple random variable equals 1 if a coin flip turns up heads and 0 if the flip turns up tails. A more complicated one might measure the number of heads observed when flipping a coin 10 times, or a value picked from range(10) where each number is equally likely. The associated distribution gives the probabilities that the variable realizes each of its possible values. The coin flip variable equals 0 with probability 0.5 and 1 with probability 0.5. The range(10) variable has a distribution that assigns probability 0.1 to each of the numbers from 0 to 9.
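The three random variables described above can be sketched directly in Python (function names and sample counts are my own choices); sampling each one many times gives an empirical picture of its distribution:

```python
import random
from collections import Counter

rng = random.Random(0)

def coin_flip():
    """1 with probability 0.5 (heads), 0 with probability 0.5 (tails)."""
    return 1 if rng.random() < 0.5 else 0

def uniform_digit():
    """Each value in range(10) with probability 0.1."""
    return rng.randrange(10)

# The more complicated variable: number of heads in 10 coin flips.
heads_counts = Counter(sum(coin_flip() for _ in range(10)) for _ in range(10_000))
digit_counts = Counter(uniform_digit() for _ in range(10_000))

print(heads_counts.most_common(3))
print(digit_counts)
```

With 10,000 samples, every digit from 0 to 9 shows up roughly 1,000 times, and 5 heads out of 10 flips comes out as the most frequent outcome, as the binomial distribution predicts.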
Continuous Distributions
Often we’ll want to model distributions across a continuum of outcomes. (For our purposes, these outcomes will always be real numbers, although that’s not always the case in real life.) For example, the uniform distribution puts equal weight on all the numbers between 0 and 1. Because there are infinitely many numbers between 0 and 1, this means that the weight it assigns to individual points must necessarily be zero. For this reason, we represent a continuous distribution with a probability density function (pdf) such that the probability of seeing a value in a certain interval equals the integral of the density function over the interval.
The density function for the uniform distribution could be implemented in Python like this:
def uniform_pdf(x):
    return 1 if 0 <= x < 1 else 0
Or, if we want to write a function for the cumulative distribution function:
def uniform_cdf(x):
    if x < 0:
        return 0
    elif x < 1:
        return x
    else:
        return 1
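As a quick sanity check (my own addition), we can verify numerically that the probability of a value landing in an interval, the integral of the density over that interval, matches the difference of the cdf at the endpoints. The integrate helper below is a simple midpoint Riemann sum; the two functions from above are restated so the snippet runs on its own:

```python
def uniform_pdf(x):
    return 1 if 0 <= x < 1 else 0

def uniform_cdf(x):
    if x < 0:
        return 0
    elif x < 1:
        return x
    else:
        return 1

def integrate(f, lo, hi, steps=10_000):
    """Approximate the integral of f over [lo, hi] with a midpoint sum."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

# P(0 <= X <= 0.3) should equal uniform_cdf(0.3) - uniform_cdf(0) = 0.3.
print(integrate(uniform_pdf, 0, 0.3))  # ≈ 0.3
```

This is exactly the relationship the paragraph above describes: the probability of seeing a value in an interval equals the integral of the density function over that interval.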
Conclusion
Probability is interesting but requires a lot of learning. There is a lot about probability that I did not cover in this post, such as the Normal Distribution, the Central Limit Theorem, Markov Chains, or Poisson processes. So take your time to find out more about it.
Translated from: https://towardsdatascience.com/why-probability-important-in-data-science-58a1543e5535