Getting to the Root of It: The Mathematics Behind the Chi-Square Feature-Term Selection Method

Published: 2023/12/14

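I have recently been studying feature-term selection algorithms, focusing on the method based on the chi-square (χ²) statistic.

In Christopher D. Manning's book Introduction to Information Retrieval (p. 191 of Wang Bin's Chinese translation, p. 255 of the English original), the statistic is defined as follows:

$$X^2(D,t,c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}$$

Here $N$ is the observed frequency in the collection $D$ and $E$ the expected frequency; $e_t$ indicates whether a document contains term $t$, and $e_c$ whether it belongs to class $c$.

What puzzled me was why the formula looks the way it does.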

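I do know a little about the χ² distribution. For example, if X ~ N(0,1), then X² follows a χ² distribution, and a sum of independent χ² variables is again a χ² variable whose degrees of freedom are the sum of the summands' degrees of freedom. I went through everything I knew about the χ² distribution and still could not derive this formula; it looked odd to me. Suppose that, for each of the four cells with observed count $N_{e_t e_c}$ and expected count $E_{e_t e_c}$,

$$\frac{N_{e_t e_c} - E_{e_t e_c}}{\sqrt{E_{e_t e_c}}}$$

follows N(0,1). Then

$$N_{e_t e_c}$$

should follow a normal distribution whose mean and variance are both $E_{e_t e_c}$, and in that case the statistic above,

$$X^2(D,t,c),$$

being a sum of four such squares, ought to follow a χ² distribution with 4 degrees of freedom.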

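I checked the mathematical statistics references listed at the back of Manning's book without success, and the earliest paper I could find, Yiming Yang's 1999 paper, does not explain much either. In the end, one term in Yiming Yang's paper, "contingency table", gave me the clue I needed. The sources are listed below: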

http://en.wikipedia.org/wiki/Noncentral_chi-square_distribution

http://courses.washington.edu/urbdp520/UDP520/ChiSquareNotes.doc

http://en.wikipedia.org/wiki/Pearson's_chi-square_test

http://en.wikipedia.org/wiki/Contingency_table

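The core theory is Pearson's chi-square test. The test is used mainly in two settings:

1. Goodness of fit: assessing how far a distribution fitted from a sample departs from a given theoretical distribution.
2. Independence: testing whether two random variables, whose joint occurrences are recorded in a contingency table, are independent. (The feature-selection application belongs to this second setting.)

A Pearson chi-square test problem usually involves two tables: the contingency table of the actual (observed) events and the contingency table of the expected events.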

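Note: a contingency table can be understood as follows. Suppose there are two events, E1 and E2. E1 has three attributes a1, a2, a3, and E2 has two attributes b1, b2. The contingency table is then the matrix that counts how often each pair of attributes co-occurs. In this example it is a 3×2 matrix.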
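As a minimal sketch of the note above, the following Python builds such a 3×2 matrix of co-occurrence counts. The observations and the helper name `contingency_table` are hypothetical, invented purely for illustration.

```python
from collections import Counter

def contingency_table(pairs, row_labels, col_labels):
    """Count how often each (row attribute, column attribute) pair co-occurs."""
    counts = Counter(pairs)
    return [[counts[(r, c)] for c in col_labels] for r in row_labels]

# Hypothetical observations: each pair is (attribute of E1, attribute of E2).
observations = [
    ("a1", "b1"), ("a1", "b2"), ("a2", "b1"),
    ("a3", "b2"), ("a3", "b2"), ("a1", "b1"),
]

rows = ["a1", "a2", "a3"]
cols = ["b1", "b2"]
table = contingency_table(observations, rows, cols)

for label, row in zip(rows, table):
    print(label, row)
```

Each row of `table` corresponds to one attribute of E1 and each column to one attribute of E2, so the shape is 3×2 as in the note.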

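(O, the observed count, corresponds to N in feature-term selection for text.)

The test consists of two main steps: constructing the test statistic and computing the degrees of freedom.

According to the theory of Pearson's chi-square test, the test statistic is defined as follows: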

The chi-square statistic is calculated by finding the difference between each observed and theoretical frequency for each possible outcome, squaring them, dividing each by the theoretical frequency, and taking the sum of the results.
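The definition quoted above translates almost word for word into code. The sketch below is a minimal illustration; the function name `pearson_chi_square` and the die-roll counts are assumptions made up for the example, not data from any of the sources.

```python
def pearson_chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all possible outcomes."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical data: a die rolled 60 times; a fair die expects 10 per face.
observed = [8, 12, 9, 11, 6, 14]
expected = [10] * 6

statistic = pearson_chi_square(observed, expected)
print(round(statistic, 3))
```

For a contingency table the same function applies once the cells of the observed and expected tables are flattened into two lists in the same order.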

In other words,

$$X^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

is itself a chi-squared test variable. How, then, are its degrees of freedom computed?

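For Pearson's test, the degrees of freedom are given by (rows − 1) × (columns − 1) of the contingency table. Because the contingency table used by the chi-square test in feature-term selection is 2×2, the degrees of freedom are 1. This also resolves my earlier puzzle about 4 degrees of freedom: the four cells are not independent, because once the row and column totals are fixed, a single cell determines the other three.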
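One way to convince yourself that the 2×2 statistic has 1 degree of freedom rather than 4 is a small simulation: the mean of a χ² variable equals its degrees of freedom, so the average statistic over many tables drawn under independence should be close to 1. This is only a sketch using the standard library; the sample size, probabilities, seed, and function names are arbitrary assumptions.

```python
import random

def chi_square_2x2(n11, n10, n01, n00):
    """Pearson statistic of a 2x2 table, expected counts taken from the margins."""
    n = n11 + n10 + n01 + n00
    row_totals = (n11 + n10, n01 + n00)
    col_totals = (n11 + n01, n10 + n00)
    observed = ((n11, n10), (n01, n00))
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip degenerate tables with an empty margin
                total += (observed[i][j] - expected) ** 2 / expected
    return total

def simulate_mean_statistic(trials=2000, n=400, p_term=0.3, p_class=0.4, seed=0):
    """Average X^2 over tables where term and class are independent by construction."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        counts = [[0, 0], [0, 0]]  # index 0 = present, index 1 = absent
        for _ in range(n):
            has_term = rng.random() < p_term
            in_class = rng.random() < p_class
            counts[1 - has_term][1 - in_class] += 1
        total += chi_square_2x2(counts[0][0], counts[0][1],
                                counts[1][0], counts[1][1])
    return total / trials

mean_statistic = simulate_mean_statistic()
print(f"mean X^2 over independent 2x2 tables: {mean_statistic:.3f}")
```

A mean near 1 is what a single degree of freedom predicts; four independent squared standard normals would instead average about 4.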

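Consider the following example (source: http://courses.washington.edu/urbdp520/UDP520/ChiSquareNotes.doc), which uses the chi-square test to check whether the condition of local hospital facilities is independent of growth or decline in community population. Because the contingency table is 3×2, the degrees of freedom come out as 2×1 = 2.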

Contingency Test, or Chi-Square Test

Used to determine if there is association between nominal and ordinal scaled variables.

Our first test of association!

Based on two principles:

Marginal probability: MPr[x]: the probability of a single event happening

MPr[x] = (# of times event happened) / (# of opportunities for event)

Joint probability: JPr[x,y]: the probability of seeing two independent events happening at the same time.

JPr[x,y] = MPr[x] * MPr[y]

The logic of the chi-square test is to compare a set of actual conditions or data to an expected set of data that we would expect to see by chance.

We do this by creating cross-tab tables, which are simply descriptive tables of our actual and expected values.

We then plug our results into the chi-square calculation, and compare our results to the chi-square distribution, as with the other tests we’ve covered.

Example: Is the condition of local hospitals determined by the growth or decline in community population?

Independent variable: growth/decline of population

Dependent variable: condition of hospital

Growth/decline → Hospital condition

Actual data:

Hospital Condition	Community Pop. Increase 1980-2000	Community Pop. Decrease 1980-2000	Total	Marginal Probability of a condition
Need of Major Repair	10	50	60	MPr[MR]=60/200=.3
Need of Minor Repair	10	30	40	MPr[MiR]=40/200=.2
Adequate Facilities	80	20	100	MPr[A]=100/200=.5
Total	100	100	200	
Marginal Probability of community	MPr[PI]=100/200=.5	MPr[PD]=100/200=.5		

Expected Table, if community growth does NOT affect hospital condition:

Hospital Condition	Community Pop. Increase 1980-2000	Community Pop. Decrease 1980-2000	Total
Need of Major Repair	30 (JPr[MR,PI] = MPr[MR] × MPr[PI] = .3 × .5 = .15; .15 × 200 hospitals = 30)	30 (JPr[MR,PD] = MPr[MR] × MPr[PD] = .3 × .5 = .15; .15 × 200 hospitals = 30)	60
Need of Minor Repair	20	20 (MPr[MiR] × MPr[PD] = .2 × .5 = .10; .10 × 200 hospitals = 20)	40
Adequate Facilities	50	50 (MPr[A] × MPr[PD] = .5 × .5 = .25; .25 × 200 hospitals = 50)	100
Total	100	100	200
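The expected table can be reproduced from the actual counts alone. The sketch below assumes the table is given as a list of rows; since MPr[row] × MPr[column] × total simplifies to (row total × column total) / total, that is what the hypothetical helper `expected_counts` computes.

```python
def expected_counts(actual):
    """Expected cell counts under independence: row total * column total / grand total."""
    grand_total = sum(sum(row) for row in actual)
    row_totals = [sum(row) for row in actual]
    col_totals = [sum(col) for col in zip(*actual)]
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Actual hospital counts; columns are population increase and population decrease.
actual = [
    [10, 50],  # need of major repair
    [10, 30],  # need of minor repair
    [80, 20],  # adequate facilities
]

expected = expected_counts(actual)
for row in expected:
    print(row)
```

The printed rows match the expected table above: 30 and 30, 20 and 20, 50 and 50.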

Assumptions: Expected table is a representative sample, and community characteristics have no relationship to hospital condition.

Testable Hypotheses:

Ho: A_ij = E_ij for the cell in the ith row and jth column (actual = expected, and thus the independent variable does not affect the dependent variable)

Ha: A_ij ≠ E_ij

Calculate test statistic:

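$$\chi^2 = \frac{(50-30)^2}{30} + \frac{(10-30)^2}{30} + \frac{(30-20)^2}{20} + \frac{(10-20)^2}{20} + \frac{(20-50)^2}{50} + \frac{(80-50)^2}{50} \approx 72.67 \approx 73$$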

Determine rejection region:

d.f. = (# rows-1)(# columns-1) in this case (3-1)(2-1) = 2…

One tail, positive, always, due to squaring in test statistic

For alpha=.10

$\chi^2_{.10,\,2} = 4.605$

Ho is thus rejected (73 > 4.605): the independent variable (growth of community) is associated with the dependent variable (condition of hospital), so the two are not independent.
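The whole hospital test can be replayed in a few lines. This is a sketch using only the standard library; it relies on the fact that a χ² distribution with 2 degrees of freedom has survival function exp(−x/2), so the critical value is −2·ln(alpha). That shortcut holds only for 2 degrees of freedom; other cases need a χ² table or a statistics library.

```python
import math

# Actual and expected hospital counts (rows: major repair, minor repair, adequate).
actual = [[10, 50], [10, 30], [80, 20]]
expected = [[30, 30], [20, 20], [50, 50]]

# Pearson statistic: squared difference over expected, summed over all six cells.
statistic = sum(
    (a - e) ** 2 / e
    for actual_row, expected_row in zip(actual, expected)
    for a, e in zip(actual_row, expected_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(actual) - 1) * (len(actual[0]) - 1)

# Critical value for alpha = .10; closed form valid only because df == 2.
alpha = 0.10
critical_value = -2 * math.log(alpha)

reject_h0 = statistic > critical_value
print(f"chi-square = {statistic:.2f}, d.f. = {df}, "
      f"critical value = {critical_value:.3f}, reject Ho: {reject_h0}")
```

Note that each difference is squared before dividing by the expected count; without the squares the positive and negative differences would cancel.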

Notes:

Don’t want to use chi-squared for small expected table values, so do cross tab test:

Cross tab test: Cannot have more than 20% of expected cells with values ≤ 5, and no cells can have value ≤ 3.

If it fails the test, you can do three things:

Go to original cross tab table and combine rows or columns

Eliminate a column or row (bad news, losing that data)

Increase your sample size
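The cross tab rule stated above can be sketched as a small check on the expected table. The function name and the two small example tables are hypothetical; the thresholds (20%, 5, and 3) are taken from the rule as quoted in these notes.

```python
def passes_cross_tab_test(expected, max_small_share=0.20, small=5, minimum=3):
    """Return True if the expected table satisfies the cross tab rule quoted above."""
    cells = [value for row in expected for value in row]
    # No cell may have an expected value at or below the minimum.
    if any(value <= minimum for value in cells):
        return False
    # At most 20% of the cells may have an expected value of 5 or less.
    small_cells = sum(1 for value in cells if value <= small)
    return small_cells / len(cells) <= max_small_share

hospital_expected = [[30, 30], [20, 20], [50, 50]]
print(passes_cross_tab_test(hospital_expected))

# Hypothetical tables that fail the rule.
print(passes_cross_tab_test([[4, 6], [10, 80]]))    # 25% of cells are <= 5
print(passes_cross_tab_test([[2, 50], [50, 50]]))   # one cell is <= 3
```

If the check fails, the three remedies listed above apply: combine rows or columns, drop a row or column, or collect a larger sample.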

Generally, Chi-square is for nominal data only. BUT it gets used inappropriately all the time. There is a loss of raw data going from ratio to ordinal.

Also note that chi-squared is a weak tool. It’s common because it’s one of the few tools to examine nominal/ordinal data. But it only tells you if an effect exists. It does not tell you the amount or direction of the effect.

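Note: the other formula in Manning's book,

$$X^2(D,t,c) = \frac{(N_{11}+N_{10}+N_{01}+N_{00}) \times (N_{11}N_{00} - N_{10}N_{01})^2}{(N_{11}+N_{01}) \times (N_{11}+N_{10}) \times (N_{10}+N_{00}) \times (N_{01}+N_{00})}$$

means the same thing as the chi-square formula in Yiming Yang's 1999 paper, A Comparative Study on Feature Selection in Text Categorization. It can be derived from the earlier formula (p. 191 of Wang Bin's translation, p. 255 of the English original) by routine substitution and factoring out common terms.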
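The claimed equivalence can be checked numerically. The sketch below writes out both forms as I understand them for a 2×2 term/class table: the sum of (N − E)²/E over the four cells, and the closed form N(N11·N00 − N10·N01)² divided by the product of the four marginal totals. The counts are hypothetical, and N10 is assumed to mean "term present, class absent".

```python
import math

def chi_square_sum_form(n11, n10, n01, n00):
    """Sum over the four cells of (observed - expected)^2 / expected."""
    n = n11 + n10 + n01 + n00
    observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    term_totals = {1: n11 + n10, 0: n01 + n00}    # term present / absent
    class_totals = {1: n11 + n01, 0: n10 + n00}   # in class / not in class
    total = 0.0
    for (e_t, e_c), count in observed.items():
        expected = term_totals[e_t] * class_totals[e_c] / n
        total += (count - expected) ** 2 / expected
    return total

def chi_square_closed_form(n11, n10, n01, n00):
    """Closed form for a 2x2 table after substitution and factoring."""
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator

# Hypothetical document counts, for illustration only.
counts = (40, 60, 10, 890)
sum_form = chi_square_sum_form(*counts)
closed_form = chi_square_closed_form(*counts)
print(sum_form, closed_form, math.isclose(sum_form, closed_form))
```

The closed form avoids computing the four expected counts, which is convenient when scoring every term in a vocabulary against a class.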

With that, the formula is fully understood.
