
EDA on Categorical and Ordinal Data

Published: 2023/11/29

STATISTICS FOR DATA SCIENCE AND MACHINE LEARNING

Categorical variables are those whose possible values come from a set of options, which can be predefined or open-ended. An example is a person's gender. Ordinal variables are categorical variables whose options can be ordered by some rule, as in the Likert scale:

  • Like
  • Like Somewhat
  • Neutral
  • Dislike Somewhat
  • Dislike

To keep the examples simple, we will use a single data set throughout: a group of students who each passed or failed two different exams. The results are shown in the following R×C table:

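| | Exam 2: pass | Exam 2: fail |
|---|---|---|
| Exam 1: pass | 20 | 10 |
| Exam 1: fail | 5 | 50 |

The example used throughout the article, self-generated (cell counts reconstructed from the figures quoted in the text).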

Statisticians have developed specific techniques to analyze this kind of data. The most important are:

Measures of Agreement

Percent Agreement

It is calculated as the number of cases whose ratings fall in a given class divided by the total number of ratings.

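| | Exam 2: pass | Exam 2: fail | Total |
|---|---|---|---|
| Exam 1: pass | 20 | 10 | 30 |
| Exam 1: fail | 5 | 50 | 55 |
| Total | 25 | 60 | 85 |

Adding totals to the example, self-generated (cell counts reconstructed from the figures quoted in the text).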
  • The percent agreement for passing exam 2 is 25/(25+60) = 0.294, or 29.4%.
  • The percent agreement for passing exam 1 is 30/85 = 0.353, or 35.3%.
  • The percent agreement for passing exam 1 and not passing exam 2 is 10/85 = 0.118, or 11.8%.
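The percentages above can be reproduced with a few lines of Python. This is a minimal sketch: the four cell counts are an assumption, inferred from the totals quoted in the text (30 passed exam 1, 25 passed exam 2, 10 passed exam 1 only, 85 students overall), and the helper name `share` is made up for illustration.

```python
# Cell counts inferred from the totals quoted in the text (an assumption):
# keys are (exam 1 result, exam 2 result).
table = {
    ("pass", "pass"): 20,
    ("pass", "fail"): 10,
    ("fail", "pass"): 5,
    ("fail", "fail"): 50,
}
n = sum(table.values())  # total number of students


def share(predicate):
    """Fraction of all cases whose (exam 1, exam 2) result satisfies predicate."""
    return sum(count for key, count in table.items() if predicate(key)) / n


pass_exam2 = share(lambda key: key[1] == "pass")
pass_exam1 = share(lambda key: key[0] == "pass")
pass1_fail2 = share(lambda key: key == ("pass", "fail"))

for label, value in [
    ("pass exam 2", pass_exam2),
    ("pass exam 1", pass_exam1),
    ("pass exam 1, fail exam 2", pass1_fail2),
]:
    print(f"{label}: {value:.1%}")
```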

The problem with percent agreement is that it does not account for agreement that could occur purely by chance.

Cohen's Kappa

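The example table again, with its cells labeled: a = 20 (passed both exams), b = 10 (passed exam 1 only), c = 5 (passed exam 2 only), d = 50 (failed both), n = 85. Self-generated.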

To overcome the problem with percent agreement, we calculate Kappa as:

$$\kappa = \frac{P_0 - P_e}{1 - P_e}$$

Cohen's Kappa formula, self-generated.

where P0 is the observed agreement and Pe is the agreement expected by chance, calculated as:

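$$P_0 = \frac{a + d}{n}, \qquad P_e = \frac{(a+b)(a+c) + (c+d)(b+d)}{n^2}$$

Here a and d are the counts of the two agreeing cells (passed both, failed both), b and c are the counts of the two disagreeing cells, and n = a + b + c + d.

P0 and Pe formulas, self-generated.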

In our example:

  • P0 = 70/85 = 0.82

  • Pe = (30 × 25) / 85² + (55 × 60) / 85² = 0.56

  • K = (0.82 − 0.56) / (1 − 0.56) = 0.26 / 0.44 = 0.59
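The same calculation can be sketched in Python. The function name `cohens_kappa` and the cell labels a, b, c, d are illustrative choices rather than part of the original article, and the counts are the ones inferred from the totals in the text. Because the code does not round P0 and Pe before dividing, its result is slightly higher than the hand calculation (about 0.60 rather than 0.59).

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 table.

    a, d: counts of the agreeing cells (pass/pass, fail/fail)
    b, c: counts of the disagreeing cells
    """
    n = a + b + c + d
    p_observed = (a + d) / n
    # Chance agreement: product of the marginals for each agreeing category.
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (p_observed - p_expected) / (1 - p_expected)


# Counts inferred from the article's totals (an assumption).
kappa = cohens_kappa(a=20, b=10, c=5, d=50)
print(f"kappa = {kappa:.3f}")
```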

Kappa ranges from −1 to 1: 0 means that the observed agreement equals the agreement expected by chance, 1 means that all cases agree, and values below 0 mean less agreement than expected by chance, down to −1 for complete disagreement.

The Chi-Squared Distribution

Hypothesis tests on categorical variables need their own reference distributions. The most common is the chi-squared distribution, a continuous theoretical probability distribution.

This distribution has a single parameter, k, the degrees of freedom. As k approaches infinity, the chi-squared distribution approaches the normal distribution.
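To get a feel for the parameter k, here is a small simulation sketch. It relies on a standard fact that the article does not state: a chi-squared variable with k degrees of freedom is the sum of k squared standard normal variables, so its mean is k and its variance is 2k. The seed, the sample size and the helper name `sample_chi_squared` are arbitrary choices for illustration.

```python
import random
import statistics


def sample_chi_squared(k, rng):
    """Draw one chi-squared value with k degrees of freedom
    as the sum of k squared standard normal draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))


rng = random.Random(0)  # fixed seed so the run is repeatable
k = 3
draws = [sample_chi_squared(k, rng) for _ in range(20000)]

sample_mean = statistics.mean(draws)
sample_var = statistics.pvariance(draws)
print(f"k = {k}: mean ~ {sample_mean:.2f} (theory {k}), "
      f"variance ~ {sample_var:.2f} (theory {2 * k})")
```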

Chi-Squared Test

This test checks whether two categorical variables are independent. We will use the same example to explain how to calculate it.

First, we define the hypotheses we want to test. In our case, we want to check whether passing exam 1 and passing exam 2 are independent, so:

  • H0: passing exam 1 and passing exam 2 are independent.
  • Ha: passing exam 1 and passing exam 2 are dependent.

The test relies on the difference between the observed values and the expected values, that is, the counts you would expect to find if the two variables were independent. The expected values are calculated as:

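$$E_{ij} = \frac{R_i \, C_j}{n}$$

where R_i is the total of row i, C_j is the total of column j, and n is the grand total.

Expected values formula, self-generated.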

To simplify the calculations, we first compute the marginals: the row and column sums that we already calculated in the second table of this post. The expected values are then:

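  • Pass exam 1, pass exam 2: 30 × 25 / 85 = 8.82
  • Pass exam 1, fail exam 2: 30 × 60 / 85 = 21.18
  • Fail exam 1, pass exam 2: 55 × 25 / 85 = 16.18
  • Fail exam 1, fail exam 2: 55 × 60 / 85 = 38.82

Expected values calculation for our example, self-generated.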

Now we have everything we need to evaluate the chi-squared formula:

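$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where O_ij is the observed count and E_ij the expected count in cell (i, j).

The chi-squared formula, self-generated.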

The summation sign means that we evaluate the term for every combination of the two variables (four cells in our case) and add up the results:

  • Pass exam 1, pass exam 2: (20 − 8.82)² / 8.82 ≈ 14.16
  • Pass exam 1, fail exam 2: (10 − 21.18)² / 21.18 ≈ 5.90
  • Fail exam 1, pass exam 2: (5 − 16.18)² / 16.18 ≈ 7.72
  • Fail exam 1, fail exam 2: (50 − 38.82)² / 38.82 ≈ 3.22

Results for each term of the sum, self-generated.

Summing the four terms gives χ² ≈ 31.0. To compare this result with the statistical tables we need the degrees of freedom, calculated as (number of rows − 1) × (number of columns − 1); in our case there is 1 degree of freedom.

According to a chi-squared table (easy to find online, and statistical packages for most languages provide it as a function), the critical value for α = 0.05 with 1 degree of freedom is 3.841. Our result is much larger, so we reject the null hypothesis and conclude that passing exam 1 and passing exam 2 are dependent.
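The whole test can be sketched in plain Python, under two assumptions. The observed counts are the ones inferred from the totals quoted in the text. The p-value line uses the identity P(χ² > x) = erfc(√(x/2)), which holds only for 1 degree of freedom; for larger tables a statistics package would be the usual choice.

```python
import math

# Observed counts inferred from the article's totals (an assumption).
# Rows: exam 1 pass/fail. Columns: exam 2 pass/fail.
observed = [[20, 10],
            [5, 50]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# Valid for dof == 1 only: survival function of chi-squared(1).
p_value = math.erfc(math.sqrt(chi2 / 2))

critical_value = 3.841  # alpha = 0.05, 1 degree of freedom
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.2e}")
print("reject H0" if chi2 > critical_value else "fail to reject H0")
```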

Correlation statistics for categorical data

Because Pearson correlation requires variables measured on at least an interval scale, binary and ordinal variables need different measures. Let's introduce them:

Binary Variables

Phi is a measure of the degree of association between two binary variables. Based on the table introduced in the Cohen's Kappa section, it is calculated as:

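$$\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}, \qquad \phi = \sqrt{\frac{\chi^2}{n}}$$

Here a, b, c and d are the four cell counts of the 2×2 table (a and d on the diagonal), n is their sum, and χ² is the chi-squared statistic of the same table.

Formulas to calculate the phi statistic, self-generated.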

Using the second formula, in our example, Φ = (31.0/85)^(1/2) ≈ 0.60.

Notice that the first formula can produce negative values, whereas the second can only produce positive ones. Here we do not care about the direction of the association, so we analyze only the absolute value.

If the data are evenly distributed (a 50–50 split in both variables), phi can reach 1; otherwise its maximum attainable value is lower. In our case, a value of about 0.60 indicates a substantial association between the two exams.
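As a cross-check, the two phi formulas can be computed side by side. This is a sketch with illustrative function names; the counts a, b, c, d are inferred from the totals quoted in the text, and the chi-squared statistic is recomputed from them rather than taken from the article.

```python
import math


def phi_from_counts(a, b, c, d):
    """Signed phi coefficient from the four cells of a 2x2 table."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) for a 2x2 table."""
    n = a + b + c + d
    cells = [(a, a + b, a + c), (b, a + b, b + d),
             (c, c + d, a + c), (d, c + d, b + d)]
    # Each tuple is (observed, row total, column total).
    return sum((o - r * col / n) ** 2 / (r * col / n) for o, r, col in cells)


# Counts inferred from the article's totals (an assumption).
a, b, c, d = 20, 10, 5, 50
n = a + b + c + d

phi_signed = phi_from_counts(a, b, c, d)
phi_unsigned = math.sqrt(chi_squared_2x2(a, b, c, d) / n)
print(f"phi (counts) = {phi_signed:.3f}, phi (chi-squared) = {phi_unsigned:.3f}")
```

The two values agree in magnitude; only the first formula carries a sign.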

The Point-Biserial Correlation

The point-biserial correlation measures the association between a dichotomous variable and a continuous variable. The formula is:

$$r_{pb} = \frac{\bar{x}_1 - \bar{x}_2}{s_x} \sqrt{p(1 - p)}$$

Point-biserial correlation formula, self-generated.

Where:

  • x̄₁ = mean of the continuous variable for group 1

  • x̄₂ = mean of the continuous variable for group 2

  • p = proportion of class 1 in the dichotomous variable

  • s_x = standard deviation of the continuous variable

To continue our example, suppose the following values, obtained by comparing the exam 1 variable with the number of hours studied:

  • x̄ pass = 5.5

  • x̄ not pass = 3.1

  • p = 20/25 = 0.8

  • s_x = 2

With these values, we obtain (5.5 − 3.1) × √(0.8 × 0.2) / 2 = 2.4 × 0.4 / 2 = 0.48, indicating a moderate relationship between the two variables.
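The arithmetic above can be wrapped in a small function. This is a sketch from summary statistics only; the function and parameter names are illustrative, and it assumes that s_x is the standard deviation of the continuous variable over the whole sample (the population form, dividing by n), which is the form that pairs with the √(p(1 − p)) factor.

```python
import math


def point_biserial(mean_1, mean_2, p, s_x):
    """Point-biserial correlation from summary statistics.

    mean_1, mean_2: group means of the continuous variable
    p: proportion of cases in group 1
    s_x: standard deviation of the continuous variable (whole sample)
    """
    return (mean_1 - mean_2) / s_x * math.sqrt(p * (1 - p))


# Values supposed in the text: hours studied for students who
# passed and did not pass exam 1.
r_pb = point_biserial(mean_1=5.5, mean_2=3.1, p=0.8, s_x=2.0)
print(f"r_pb = {r_pb:.2f}")
```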

Ordinal Variables

The most widely used correlation coefficient for ordinal variables is Spearman's rank-order coefficient, usually called Spearman's r.

$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

Spearman's r correlation coefficient for ordinal variables, self-generated.

where d_i is the difference between the ranks of the two variables for individual i, and n is the sample size.
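A minimal sketch of this formula in Python follows. The two score lists are made-up illustrative data, not from the article, and the helper names are hypothetical. Tied values receive their average rank; with ties the shortcut formula is only an approximation, which matters for ordinal data where ties are common.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = tied_rank
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman's rank correlation via the sum of squared rank differences."""
    n = len(x)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n**2 - 1))


# Hypothetical ordinal scores for five individuals (no ties).
score_a = [1, 2, 3, 4, 5]
score_b = [2, 1, 4, 3, 5]
r_s = spearman_r(score_a, score_b)
print(f"Spearman's r = {r_s:.2f}")
```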

Summary

In data science we often draw scatter plots of binary, categorical or ordinal variables, or use them to color other plots. But when we calculate correlations it is easy to skip these variables, because the built-in functions of pandas in Python or dplyr in R do not use them.

In this post, we showed how to analyze these variables' distribution and their correlation with all the other variables.

This is the tenth post of my personal #100daysofML challenge. I will be publishing my progress on GitHub, Twitter, and Medium (Adrià Serra).

https://twitter.com/CrunchyML

https://github.com/CrunchyPistacho/100DaysOfML

Translated from: https://medium.com/ai-in-plain-english/eda-on-categorical-and-ordinal-data-22f8a4407836
