Some understanding of "Improved Use of Continuous Attributes in C4.5"
Here are the formulas provided in "Improved Use of Continuous Attributes in C4.5", Journal of Artificial Intelligence Research 4 (1996) 77-90:
$$Info(D) = -\sum_{j=1}^{C} p(D,j) \cdot \log_2(p(D,j))$$
$$Gain(D,T) = Info(D) - \sum_{i=1}^{k} \frac{|D_i|}{|D|} \cdot Info(D_i)$$
$$Split(D,T) = -\sum_{i=1}^{k} \frac{|D_i|}{|D|} \cdot \log_2\left(\frac{|D_i|}{|D|}\right)$$
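To make the three formulas concrete, here is a minimal Python sketch. It assumes D is given as a plain list of class labels and the test T as a list of label subsets D_i; the function names `info`, `gain`, and `split` and the toy data are my own choices for illustration and do not come from the paper.

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D): entropy of the class distribution of D, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(labels, subsets):
    """Gain(D,T): Info(D) minus the size-weighted Info(D_i) of each subset."""
    n = len(labels)
    return info(labels) - sum(len(s) / n * info(s) for s in subsets if s)

def split(labels, subsets):
    """Split(D,T): entropy of the subset proportions |D_i|/|D|."""
    n = len(labels)
    return -sum(len(s) / n * log2(len(s) / n) for s in subsets if s)

# Toy data (made up): four cases that a test T separates perfectly into two halves.
D = ["yes", "yes", "no", "no"]
parts = [["yes", "yes"], ["no", "no"]]

print(info(D))          # entropy of a 50/50 class distribution
print(gain(D, parts))   # a perfect split removes all of that entropy
print(split(D, parts))  # two equal-sized subsets
```

For this toy case all three values come out as 1 bit, since both the class distribution and the partition are an even two-way split.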
The following are my understandings:
------------------first change-----------------------------
Then,
$$Gain\_Ratio = \frac{Gain(D,T)}{Split(D,T)}$$
My understanding of the "first change" is:
$$Gain\_Ratio\_adjusted = \frac{Gain(D,T) - \frac{\log_2(N-1)}{|D|}}{Split(D,T)}$$
where N is the number of distinct values of the continuous attribute among the cases in D.
Is this right?
Many Thanks~
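As a quick numeric check of this reading of the "first change", here is a small Python sketch. It implements the adjusted formula exactly as written in this post (a reading, not something confirmed by the paper's authors); the function names and the input numbers are made up for illustration.

```python
from math import log2

def gain_ratio(gain, split):
    """Plain gain ratio: Gain(D,T) / Split(D,T)."""
    return gain / split

def adjusted_gain_ratio(gain, split, n_distinct, n_cases):
    """(Gain(D,T) - log2(N-1)/|D|) / Split(D,T) for a continuous attribute.

    n_distinct is N (must be at least 2), n_cases is |D|.
    """
    penalty = log2(n_distinct - 1) / n_cases
    return (gain - penalty) / split

# Made-up numbers: Gain = 0.5, Split = 1.0, N = 9 distinct values, |D| = 16.
print(gain_ratio(0.5, 1.0))                  # 0.5
print(adjusted_gain_ratio(0.5, 1.0, 9, 16))  # penalty = log2(8)/16 = 0.1875, so 0.3125
```

With these numbers the penalty lowers the ratio from 0.5 to 0.3125; the more distinct values an attribute has relative to |D|, the larger the deduction.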
--------------------second change---------------------------
The relevant part of the "second change" in the article is:
"This seems to be an unnecessary complication, so the threshold t is chosen instead to maximize gain. Once the threshold is chosen, however, the final selection of the attribute to be used for the test is still made on the basis of the gain ratio criterion using the adjusted gain."
My understanding is:
1st step:
Choose the threshold t that maximizes the gain, i.e. $Gain(D,T)_{max}$,
not $Gain\_Ratio_{max}$,
and not $(Gain(D,T) - \log_2(N-1)/|D|)_{max}$.
2nd step:
The criterion for choosing the best feature is:
$$Gain\_Ratio(discrete\ feature) = \frac{Gain(D,T)}{Split(D,T)}$$
$$Gain\_Ratio\_adjusted(continuous\ feature) = \frac{Gain(D,T) - \frac{\log_2(N-1)}{|D|}}{Split(D,T)}$$
Finally, just choose the feature whose Gain Ratio or Gain Ratio (adjusted) is the largest.
Is this understanding right?
Many thanks~
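The two steps as I understand them can be sketched in Python as follows. This is only a sketch of my reading, not the C4.5 implementation. Assumptions: candidate thresholds are the distinct attribute values except the largest (giving N-1 candidates, with the split `v <= t` versus `v > t`), ties are broken by taking the first candidate, and all function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2

def info(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain_and_split(labels, subsets):
    """Return (Gain(D,T), Split(D,T)) for a partition of D into subsets."""
    n = len(labels)
    subsets = [s for s in subsets if s]
    g = info(labels) - sum(len(s) / n * info(s) for s in subsets)
    sp = -sum(len(s) / n * log2(len(s) / n) for s in subsets)
    return g, sp

def score_discrete(values, labels):
    """Plain gain ratio for a discrete feature (one subset per value)."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    g, sp = gain_and_split(labels, list(groups.values()))
    return g / sp if sp > 0 else 0.0

def score_continuous(values, labels):
    """Step 1: pick t by maximum gain. Step 2: adjusted gain ratio at that t."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        return 0.0, None
    best = None  # (gain, split, threshold)
    for t in distinct[:-1]:  # N-1 candidate thresholds (assumption)
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        g, sp = gain_and_split(labels, [left, right])
        if best is None or g > best[0]:
            best = (g, sp, t)
    g, sp, t = best
    adjusted = g - log2(len(distinct) - 1) / len(labels)
    return adjusted / sp, t

# Toy data (made up): one continuous and one discrete feature.
labels = ["n", "n", "n", "n", "y", "y", "y", "y"]
temp = [1, 2, 3, 4, 5, 6, 7, 8]
wind = ["a", "a", "a", "b", "a", "b", "b", "b"]

temp_score, temp_threshold = score_continuous(temp, labels)
scores = {"temp": temp_score, "wind": score_discrete(wind, labels)}
best_feature = max(scores, key=scores.get)
print(temp_threshold, scores, best_feature)
```

Here the threshold for `temp` is picked on raw gain alone, and the log2(N-1)/|D| deduction only enters when `temp` is compared against `wind`.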