Getting to the Bottom of It: The Mathematics Behind Chi-Square Feature Selection
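I have recently been studying feature (term) selection algorithms, mainly the chi-square statistic method.
In Christopher D. Manning's Introduction to Information Retrieval (p. 191 of Wang Bin's Chinese translation, p. 255 of the English original), the statistic is defined as follows:
$$X^2(D,t,c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}$$
Here $N$ is the observed frequency in the collection $D$, $E$ is the expected frequency, $e_t$ indicates whether a document contains term $t$, and $e_c$ indicates whether it belongs to class $c$.
What puzzled me was why the formula takes this form.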
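I do know a little about the chi-square distribution. For example, if $X \sim N(0,1)$, then $X^2$ follows a chi-square distribution with one degree of freedom, and a sum of independent chi-square variables is again chi-square distributed, with degrees of freedom equal to the sum of the summands' degrees of freedom. I went through everything I could remember about the chi-square distribution and still could not derive this formula. It looked odd to me. Suppose each term satisfies $(N_{e_t e_c} - E_{e_t e_c})/\sqrt{E_{e_t e_c}} \sim N(0,1)$. Then $N_{e_t e_c}$ would be normally distributed with mean and variance both equal to $E_{e_t e_c}$, and, if the four terms were independent, the $X^2$ above should follow a chi-square distribution with 4 degrees of freedom.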
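I checked the mathematical statistics references at the back of Manning's book without success, and the earliest paper I could find, Yiming Yang's 1999 paper, does not explain much either. In the end, one term in Yiming Yang's paper, "contingency table", gave me the clue I needed. The sources are listed below: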
http://en.wikipedia.org/wiki/Noncentral_chi-square_distribution
http://courses.washington.edu/urbdp520/UDP520/ChiSquareNotes.doc
http://en.wikipedia.org/wiki/Pearson's_chi-square_test
http://en.wikipedia.org/wiki/Contingency_table
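The core theory is Pearson's chi-square test. The test is mainly applied in two settings:
1. Testing goodness of fit, that is, evaluating how much a distribution fitted from sample data differs from some theoretical distribution.
2. Testing whether two random variables are independent, where their joint occurrences are recorded in a contingency table. Feature selection belongs to this second setting.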
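A Pearson chi-square test problem usually involves two tables: the contingency table of the observed events and the contingency table of the expected events.
Note: a contingency table can be understood as follows. Suppose there are two events E1 and E2, where E1 has three attributes a1, a2, a3 and E2 has two attributes b1, b2. The contingency table is then the matrix that counts how often each pair of attributes co-occurs. In this example it is a 3*2 matrix.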
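The counting described in the note can be sketched in a few lines of Python. The observations below are made up purely for illustration, and `contingency_table` is a hypothetical helper name, not something from the cited sources.

```python
from collections import Counter

def contingency_table(pairs, row_values, col_values):
    """Count how often each (row attribute, column attribute) pair co-occurs."""
    counts = Counter(pairs)
    # Counter returns 0 for pairs that never occur, so every cell is filled.
    return [[counts[(r, c)] for c in col_values] for r in row_values]

# Hypothetical observations: each tuple is (attribute of E1, attribute of E2).
observations = [
    ("a1", "b1"), ("a1", "b2"), ("a2", "b1"),
    ("a3", "b2"), ("a3", "b2"), ("a1", "b1"),
]

table = contingency_table(observations, ["a1", "a2", "a3"], ["b1", "b2"])
for row in table:
    print(row)
```

The result has one row per attribute of E1 and one column per attribute of E2, which is the 3*2 shape mentioned above.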
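(O, the observed count, corresponds to N in text feature selection.)
The test consists of two main steps: constructing the test statistic and computing the degrees of freedom.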
According to the theory of Pearson's chi-square test, the test statistic is defined as follows:
The chi-square statistic is calculated by finding the difference between each observed and theoretical frequency for each possible outcome, squaring them, dividing each by the theoretical frequency, and taking the sum of the results.
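In other words, $X^2 = \sum_i (O_i - E_i)^2 / E_i$, with $O_i$ the observed and $E_i$ the theoretical frequencies, is itself an (asymptotically) chi-square distributed test statistic. How, then, are its degrees of freedom determined?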
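In Pearson's chi-square test of independence, the degrees of freedom are given by (rows - 1) * (columns - 1) of the contingency table. The contingency table used by chi-square feature selection is 2*2, so there is 1 degree of freedom. Once the row and column totals are fixed, a single cell determines the other three, which is why the four terms of the sum do not contribute four independent degrees of freedom.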
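A small simulation can make the "1 degree of freedom, not 4" point concrete. The sketch below is only an illustration under assumed settings: the function names, the document count, and the term and class probabilities are all hypothetical. It draws term presence and class membership independently, computes the four-cell statistic for each simulated 2*2 table, and averages it. Since a chi-square variable with k degrees of freedom has mean k, the average should land near 1 rather than near 4.

```python
import random

def chi2_2x2(n11, n10, n01, n00):
    """Pearson X^2 for a 2x2 table: sum over the four cells of (N - E)^2 / E."""
    n = n11 + n10 + n01 + n00
    observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    term_totals = {1: n11 + n10, 0: n01 + n00}   # term present / absent
    class_totals = {1: n11 + n01, 0: n10 + n00}  # in class / not in class
    total = 0.0
    for (et, ec), obs in observed.items():
        # Expected count under independence: product of margins over N.
        expected = term_totals[et] * class_totals[ec] / n
        total += (obs - expected) ** 2 / expected
    return total

def simulate_mean_chi2(trials=2000, docs=300, p_term=0.3, p_class=0.4, seed=0):
    """Average X^2 over tables where term and class are truly independent."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        cells = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
        for _ in range(docs):
            et = 1 if rng.random() < p_term else 0
            ec = 1 if rng.random() < p_class else 0
            cells[(et, ec)] += 1
        values.append(chi2_2x2(cells[(1, 1)], cells[(1, 0)],
                               cells[(0, 1)], cells[(0, 0)]))
    return sum(values) / len(values)

mean_x2 = simulate_mean_chi2()
print(f"mean X^2 over simulated independent tables: {mean_x2:.3f}")
```

The margins of each table are estimated from the same data that fill the cells, which is what removes three of the four apparent degrees of freedom.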
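Consider the following example (source: http://courses.washington.edu/urbdp520/UDP520/ChiSquareNotes.doc), which uses the chi-square test to check whether the condition of local hospital facilities is independent of the growth or decline of the community population. Because the contingency table is 3*2, the degrees of freedom are 2*1=2.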
Contingency Test, or Chi-Square Test
Used to determine if there is association between nominal and ordinal scaled variables.
Our first test of association!
Based on two principles:
Marginal probability: MPr[x]: the probability of a single event happening
MPr[x] = (# of times event happened) / (# of opportunities for event)
Joint probability: JPr[x,y]: the probability of seeing two independent events happening at the same time.
JPr[x,y] = MPr[x] * MPr[y]
The logic of the chi-square test is to compare a set of actual conditions or data to an expected set of data that we would expect to see by chance.
We do this by creating cross-tab tables, which are simply descriptive tables of our actual and expected values.
We then plug our results into the chi-square calculation, and compare our results to the chi-square distribution, as with the other tests we’ve covered.
Example: Is the condition of local hospitals determined by the growth or decline in community population?
Independent variable: growth/decline of population
Dependent variable: condition of hospital
Growth/decline → Hospital condition
Actual data:
| Hospital Condition | Community Pop. Increase 1980-2000 | Community Pop. Decrease 1980-2000 | Total | Marginal Probability of a condition |
| Need of Major Repair | 10 | 50 | 60 | MPr[MR]=60/200=.3 |
| Need of Minor Repair | 10 | 30 | 40 | MPr[MiR]=40/200=.2 |
| Adequate Facilities | 80 | 20 | 100 | MPr[A]=100/200=.5 |
| Total | 100 | 100 | 200 | |
| Marginal Probability of community | MPr[PI]=100/200=.5 | MPr[PD]=100/200=.5 | | |
Expected Table, if community growth does NOT affect hospital condition:
| Hospital Condition | Community Pop. Increase 1980-2000 | Community Pop. Decrease 1980-2000 | Total |
| Need of Major Repair | 30 (JPr[MR,PI] = MPr[MR]*MPr[PI] = .3 * .5 = .15; .15 * 200 hospitals = 30) | 30 (JPr[MR,PD] = MPr[MR]*MPr[PD] = .3 * .5 = .15; .15 * 200 hospitals = 30) | 60 |
| Need of Minor Repair | 20 | 20 (MPr[MiR]*MPr[PD] = .2 * .5 = .10; .10 * 200 hospitals = 20) | 40 |
| Adequate Facilities | 50 | 50 (MPr[A]*MPr[PD] = .5 * .5 = .25; .25 * 200 hospitals = 50) | 100 |
| Total | 100 | 100 | 200 |
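The expected table can be reproduced from the actual counts alone: each expected cell is the row total times the column total divided by the grand total, which is the same as MPr[row] * MPr[column] * 200. A minimal sketch, where `expected_counts` is a hypothetical helper name:

```python
def expected_counts(actual):
    """Expected cell counts under independence: row total * column total / N."""
    total = sum(sum(row) for row in actual)
    row_totals = [sum(row) for row in actual]
    col_totals = [sum(col) for col in zip(*actual)]
    return [[r * c / total for c in col_totals] for r in row_totals]

# Actual hospital counts from the notes: rows are Major Repair, Minor Repair,
# Adequate; columns are population increase, population decrease.
actual = [[10, 50], [10, 30], [80, 20]]

expected = expected_counts(actual)
for row in expected:
    print(row)
```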
Assumptions: Expected table is a representative sample, and community characteristics have no relationship to hospital condition.
Testable Hypotheses:
H0: A_ij = E_ij for the ith row and jth column (actual = expected, and thus the independent variable does not affect the dependent variable)
Ha: A_ij ≠ E_ij
Calculate test statistic:
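$$\chi^2 = \sum_{i,j} \frac{(A_{ij} - E_{ij})^2}{E_{ij}} = \frac{(10-30)^2}{30} + \frac{(50-30)^2}{30} + \frac{(10-20)^2}{20} + \frac{(30-20)^2}{20} + \frac{(80-50)^2}{50} + \frac{(20-50)^2}{50} \approx 72.67 \approx 73$$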
Determine rejection region:
d.f. = (# rows-1)(# columns-1) in this case (3-1)(2-1) = 2…
One tail, positive, always, due to squaring in test statistic
For alpha=.10
$\chi^2_{.10,\,2} = 4.605$
Since 73 > 4.605, H0 is rejected: the data indicate that the independent variable (growth of community) is associated with the dependent variable (condition of hospital).
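The whole calculation for this example can be checked with the standard library only. This sketch relies on one special fact: for exactly 2 degrees of freedom the chi-square survival function is exp(-x/2), so the critical value at level alpha is -2*ln(alpha). For other degrees of freedom a statistics library or a printed table would be needed.

```python
import math

# Actual and expected hospital counts from the two tables above.
actual = [[10, 50], [10, 30], [80, 20]]
expected = [[30, 30], [20, 20], [50, 50]]

# Sum of (actual - expected)^2 / expected over all six cells.
chi2 = sum(
    (a - e) ** 2 / e
    for actual_row, expected_row in zip(actual, expected)
    for a, e in zip(actual_row, expected_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(actual) - 1) * (len(actual[0]) - 1)

# Valid for df == 2 only: P(X > x) = exp(-x / 2).
alpha = 0.10
critical = -2 * math.log(alpha)
p_value = math.exp(-chi2 / 2)

print(f"chi2 = {chi2:.2f}, df = {df}, critical = {critical:.3f}")
print("reject H0" if chi2 > critical else "fail to reject H0")
```

The statistic is far above the critical value, so the test rejects independence at alpha = .10.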
Notes:
Don’t want to use chi-squared for small expected table values, so do cross tab test:
Cross tab test: Cannot have more than 20% of expected cells with values ≤ 5, and no cells can have value ≤ 3.
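The cross tab rule as stated here is easy to express as a check on the expected table. The function name and the extra example tables are hypothetical; the thresholds are taken directly from the rule above.

```python
def passes_cross_tab_rule(expected, max_small_fraction=0.20):
    """Apply the rule from the notes to a table of expected counts:
    no cell may be <= 3, and at most 20% of cells may be <= 5."""
    cells = [e for row in expected for e in row]
    if any(e <= 3 for e in cells):
        return False
    small = sum(1 for e in cells if e <= 5)
    return small / len(cells) <= max_small_fraction

# Expected table from the hospital example: every cell is well above 5.
print(passes_cross_tab_rule([[30, 30], [20, 20], [50, 50]]))
# Hypothetical 2x2 table with one cell of 4: 25% of cells are <= 5.
print(passes_cross_tab_rule([[4, 40], [6, 50]]))
```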
If it fails the test, you can do three things:
Generally, Chi-square is for nominal data only. BUT it gets used inappropriately all the time. There is a loss of raw data going from ratio to ordinal.
Also note that chi-squared is a weak tool. It’s common because it’s one of the few tools to examine nominal/ordinal data. But it only tells you if an effect exists. It does not tell you the amount or direction of the effect.
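Note: the other formula in Manning's book,
$$X^2(D,t,c) = \frac{(N_{11}+N_{10}+N_{01}+N_{00}) \times (N_{11}N_{00} - N_{10}N_{01})^2}{(N_{11}+N_{01}) \times (N_{11}+N_{10}) \times (N_{10}+N_{00}) \times (N_{01}+N_{00})}$$
means the same thing as the chi-square formula in Yiming Yang's 1999 paper, A Comparative Study on Feature Selection in Text Categorization. It can be derived from the earlier formula (p. 191 of Wang Bin's translation, p. 255 of the English original) by routine substitution and factoring out common terms.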
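The substitution can be sketched as follows, starting from the four-cell sum $\sum (N - E)^2 / E$ with $N_{ij}$ the observed count for term indicator $i$ and class indicator $j$. The row and column totals $R_1, R_0, C_1, C_0$ and the shorthand $\Delta$ are names introduced here for convenience; they are not notation from the book or the paper.

```latex
\begin{align*}
N &= N_{11}+N_{10}+N_{01}+N_{00}, \quad
R_1 = N_{11}+N_{10},\; R_0 = N_{01}+N_{00},\;
C_1 = N_{11}+N_{01},\; C_0 = N_{10}+N_{00} \\
E_{ij} &= \frac{R_i C_j}{N} \\
N_{11}-E_{11} &= \frac{N_{11}N - R_1 C_1}{N}
  = \frac{N_{11}N_{00}-N_{10}N_{01}}{N} = \frac{\Delta}{N},
  \qquad N_{ij}-E_{ij} = \pm\frac{\Delta}{N} \\
X^2 &= \sum_{i,j}\frac{(N_{ij}-E_{ij})^2}{E_{ij}}
  = \frac{\Delta^2}{N^2}\sum_{i,j}\frac{N}{R_i C_j}
  = \frac{\Delta^2}{N}\left(\frac{1}{R_1}+\frac{1}{R_0}\right)
    \left(\frac{1}{C_1}+\frac{1}{C_0}\right) \\
  &= \frac{\Delta^2}{N}\cdot\frac{N}{R_1 R_0}\cdot\frac{N}{C_1 C_0}
  = \frac{N\,(N_{11}N_{00}-N_{10}N_{01})^2}{R_1 R_0 C_1 C_0}
\end{align*}
```

The key step is that all four deviations $N_{ij} - E_{ij}$ have the same magnitude $|\Delta|/N$, which is another way of seeing that a 2*2 table with fixed margins has only one free cell.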
With that, my understanding is complete.