Chi-Square Test Python Program: Python from Scratch, Chapter 2 (1), Chi-Square Test (python)

Published: 2023/12/9 python 44 豆豆

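What if we want to determine whether two categorical variables have a statistically significant relationship? This is where the chi-square test of independence is useful.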

Chi-Square Test

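We will look at census data from 1994. Specifically, we are interested in the relationship between "sex" and "hours per week worked". In our case, each person can have only one "sex" and falls into only one hours-worked category. For this example, we will use pandas to convert the numeric column 'hours-per-week' into a categorical column. We will then assign 'sex' and 'hours_per_week_categories' to a new DataFrame.

# -*- coding: utf-8 -*-
"""
Created on Sun Feb 3 19:24:55 2019
@author: czh
"""
# In[*]
import math
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# IPython/Jupyter magic; remove this line when running as a plain script
%matplotlib inline

# Switch to the directory that holds census.csv
os.chdir('D:\\train')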

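# In[*]
# Column names for the census file, which has no header row
cols = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
        'marital-status', 'occupation', 'relationship', 'race', 'sex',
        'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

# A multi-character separator requires the Python parsing engine
data = pd.read_csv('census.csv', names=cols, sep=', ', engine='python')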

# In[*]
# Create a column for work hour categories.
def process_hours(df):
    # Bins are right-inclusive: (0, 9], (9, 19], ..., (49, 1000]
    cut_points = [0, 9, 19, 29, 39, 49, 1000]
    label_names = ["0-9", "10-19", "20-29", "30-39", "40-49", "50+"]
    df["hours_per_week_categories"] = pd.cut(df["hours-per-week"],
                                             cut_points, labels=label_names)
    return df

# In[*]
data = process_hours(data)
workhour_by_sex = data[['sex', 'hours_per_week_categories']]
workhour_by_sex.head()

      sex hours_per_week_categories
0    Male                     40-49
1    Male                     10-19
2    Male                     40-49
3    Male                     40-49
4  Female                     40-49
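As a quick check on the binning step above, the following minimal sketch applies the same cut points to a few made-up values (the `sample_hours` series is hypothetical and not taken from the census file). It illustrates that `pd.cut` treats each bin as open on the left and closed on the right by default, so 9 falls in "0-9" and 10 falls in "10-19", while a value of exactly 0 falls outside the first bin.

```python
import pandas as pd

# Same cut points and labels as process_hours above
cut_points = [0, 9, 19, 29, 39, 49, 1000]
label_names = ["0-9", "10-19", "20-29", "30-39", "40-49", "50+"]

# Hypothetical values chosen to sit on or near the bin edges
sample_hours = pd.Series([1, 9, 10, 40, 49, 50, 99])
binned = pd.cut(sample_hours, cut_points, labels=label_names)
binned_labels = [str(label) for label in binned]
print(binned_labels)

# By default the lowest edge is excluded, so 0 is not assigned a bin
zero_bin = pd.cut(pd.Series([0]), cut_points, labels=label_names)
zero_is_missing = bool(zero_bin.isna().iloc[0])
print(zero_is_missing)
```

If rows with zero hours had to be kept, passing `include_lowest=True` to `pd.cut` would include the left edge of the first bin.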

Check the counts in each column:

workhour_by_sex['sex'].value_counts()

Out[31]:
Male      21790
Female    10771
Name: sex, dtype: int64

workhour_by_sex['hours_per_week_categories'].value_counts()

Out[33]:
40-49    18336
50+       6462
30-39     3667
20-29     2392
10-19     1246
0-9        458
Name: hours_per_week_categories, dtype: int64

Null Hypothesis

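Recall that we want to know whether there is a relationship between 'sex' and 'hours_per_week_categories'. To find out, we use the chi-square test. But first, let us state our null hypothesis and alternative hypothesis.

H0: There is no statistically significant relationship between sex and hours worked per week.

H1: There is a statistically significant relationship between sex and hours worked per week.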

The next step is to format the data into a table of frequency counts. This is called a contingency table, and we can build it with the pd.crosstab() function in pandas.

contingency_table = pd.crosstab(
    workhour_by_sex['sex'],
    workhour_by_sex['hours_per_week_categories'],
    margins=True
)
contingency_table

Out[34]:
hours_per_week_categories  0-9  10-19  20-29  30-39  40-49   50+    All
sex
Female                     235    671   1287   1914   5636  1028  10771
Male                       223    575   1105   1753  12700  5434  21790
All                        458   1246   2392   3667  18336  6462  32561

Each cell in this table is a frequency count. For example, the intersection of the "Male" row and the "10-19" column is the number of men in our sample who work 10-19 hours per week. The intersection of the "All" row and the "50+" column is the total number of people who work 50 or more hours per week.

# In[*]
# Assign the frequency values; crosstab sorts the rows alphabetically,
# so row 0 is Female and row 1 is Male
femalecount = contingency_table.iloc[0, 0:6].values
malecount = contingency_table.iloc[1, 0:6].values

# Plot the stacked bar chart (Female at the bottom, Male on top)
fig = plt.figure(figsize=(10, 5))
sns.set(font_scale=1.8)
categories = ["0-9", "10-19", "20-29", "30-39", "40-49", "50+"]
p1 = plt.bar(categories, femalecount, 0.55, color='#d62728')
p2 = plt.bar(categories, malecount, 0.55, bottom=femalecount)
plt.legend((p2[0], p1[0]), ('Male', 'Female'))
plt.xlabel('Hours per Week Worked')
plt.ylabel('Count')
plt.show()

[Figure: stacked bar chart of Count by Hours per Week Worked, split into Male and Female]

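The figure above shows the sample data from the census. If there were truly no relationship between sex and hours worked per week, the data would show the same ratio of "Male" to "Female" in every hours category. For example, if 5% of women worked 50+ hours, we would expect the same percentage of men to work 50+ hours.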
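The idea in the paragraph above can be sketched numerically. The snippet below hard-codes the observed counts from the contingency table (the array name `observed` and the other variable names are illustrative), computes the share of each sex that falls in every hours category, and then derives the counts we would expect if sex and hours worked were independent, assuming the usual rule expected = row total × column total / grand total.

```python
import numpy as np

# Observed counts copied from the contingency table (rows: Female, Male)
observed = np.array([
    [235, 671, 1287, 1914, 5636, 1028],
    [223, 575, 1105, 1753, 12700, 5434],
])

row_totals = observed.sum(axis=1)   # people per sex
col_totals = observed.sum(axis=0)   # people per hours category
grand_total = observed.sum()

# Share of each sex falling in each hours category (each row sums to 1)
row_shares = observed / row_totals[:, None]
print(np.round(row_shares * 100, 1))

# Counts expected under independence
expected = np.outer(row_totals, col_totals) / grand_total
print(np.round(expected, 1))
```

In this sample roughly 9.5% of women but roughly 24.9% of men fall in the 50+ category, which is the kind of gap the test is designed to detect.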

Chi-Square Test with SciPy

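Rather than computing the chi-square statistic by hand, we can take a shortcut with SciPy.

from scipy import stats

# Observed counts: the Female and Male rows, without the "All" margins
f_obs = np.array([contingency_table.iloc[0, 0:6].values,
                  contingency_table.iloc[1, 0:6].values])
f_obs

# The first three results are the chi-square statistic, p-value and degrees of freedom
stats.chi2_contingency(f_obs)[0:3]

Out[38]: (2287.190943926107, 0.0, 5)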

The p-value is approximately 0, with 5 degrees of freedom.
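For comparison with the SciPy result above, here is a minimal sketch of the same computation done by hand. It hard-codes the observed counts from the contingency table, builds the expected counts under independence, sums (observed − expected)² / expected over all cells, and reads the p-value from the chi-square survival function. The variable names are illustrative; `stats.chi2_contingency` and `stats.chi2.sf` are the SciPy calls used.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the contingency table (rows: Female, Male)
observed = np.array([
    [235, 671, 1287, 1914, 5636, 1028],
    [223, 575, 1105, 1753, 12700, 5434],
])

# Expected counts if sex and hours category were independent
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Pearson chi-square statistic
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of the chi-square distribution
p_value = stats.chi2.sf(chi2_stat, dof)

# SciPy's result for the same table
scipy_stat, scipy_p, scipy_dof = stats.chi2_contingency(observed)[0:3]

print(chi2_stat, p_value, dof)
print(scipy_stat, scipy_p, scipy_dof)
```

Yates' continuity correction, which `chi2_contingency` enables by default, only takes effect when there is one degree of freedom, so it does not change the result for this 2 × 6 table.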

Conclusion

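Because the p-value is below 0.05, we can reject the null hypothesis: the data give strong evidence of a relationship between "sex" and "hours per week worked". The test does not tell us what that relationship is, only that the two variables are not independent of each other.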
