
Theory Defect in selecting best pruned tree from CCP with Cross-validation

Published: 2023/12/20

The problem is: selecting the best pruned tree from CCP (cost-complexity pruning) with cross-validation.
--------------------------------------------------------------------------------------------
Professor Ricco Rakotomalala's lecture [1] describes the CART-CCP process clearly, and I have implemented it on my GitHub [2]. What [1] describes is:
67% of the data is used to grow a CART tree and obtain a sequence of pruned trees.
33% of the data is used to select the best pruned tree.
This works only when the dataset is large enough.
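The 67/33 hold-out scheme above can be sketched with scikit-learn, whose `cost_complexity_pruning_path` returns the α sequence that indexes the pruned subtrees. This is only a sketch of the idea, not the code from [2]; the dataset, split ratio, and random seeds are illustrative choices of mine.

```python
# Sketch of the hold-out (67/33) selection scheme from [1], using scikit-learn.
# The iris dataset and the 0.33 test fraction are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_grow, X_sel, y_grow, y_sel = train_test_split(
    X, y, test_size=0.33, random_state=0)

# Grow the full tree on 67% of the data and get the CCP alpha sequence,
# one alpha per pruned subtree T_1 ⊇ T_2 ⊇ ... ⊇ {root}.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_grow, y_grow)

# Refit one pruned tree per alpha and score it on the held-out 33%.
scores = []
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_grow, y_grow)
    scores.append(tree.score(X_sel, y_sel))

# The best pruned tree is the one with the highest hold-out accuracy.
best_alpha = path.ccp_alphas[scores.index(max(scores))]
print(best_alpha, max(scores))
```

Because the 33% selection set is used only once, this estimate is relatively unbiased, which is exactly the argument [3] makes for the independent-test-sample method.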
-------------------------------------------------
According to [3] (the author is no longer alive, so I cannot contact him):

Two methods of estimation are discussed: use of an independent test sample and cross-validation. Of the two, use of an independent test sample is computationally more efficient and is preferred when the learning sample contains a large number of cases. As a useful by-product it gives relatively unbiased estimates of the node misclassification costs. **Cross-validation is computationally more expensive, but makes more effective use of all cases and gives useful** information regarding the stability of the tree structure.

So there is a second way to choose the best pruned tree: cross-validation. To be clear: the difficulty lies in selecting the best pruned tree with cross-validation, NOT in the pruning algorithm itself.

----------- Almost all the material about CCP I could find on Google -------------
In [4], page 19 says:

It is important to note that in the procedure described here we are effectively using cross-validation to select the best value of the complexity parameter from the set β1, . . . , βK. Once the best value has been determined, the corresponding tree from the original cost-complexity sequence is returned.

but there are no details about how that selection is done.

In [5], the ML wiki only discusses the same method as [2].
In [6], the selection step is not discussed at all.
In [7], the IBM documentation does not discuss how to use cross-validation to select the best tree.
In [8], there are no details about how to select with cross-validation (I contacted the author, but got no reply).

There is a picture there of cross-validation after CCP, but again no details.

The goal of the cross-validation is: estimate the prediction error (or MSE) of the trees in sequenceMain (grown on all the data) using the sequences sequenceV (each grown with the v-th fold removed before building $T_{v1}$).

For the main tree we have the pruned-tree sequence sequenceMain: $[T_1, T_2, T_3, T_4]$.
For CV tree $v$ we have the pruned-tree sequence sequenceV: $[T_{v1}, T_{v2}, T_{v3}]$
(because the fold tree has only 3 decision nodes, it can only be pruned twice, so there is no $T_{v4}$).
Then how can I estimate the error rate of $T_4$ in sequenceMain when there is no $T_{v4}$ in sequenceV?
In [3], Section 3.4.2, the formula to estimate $T_k$ is:
$R^{cv}(T_k) = R^{cv}(T(a_k'))$, where $a_k' = \sqrt{a_k a_{k+1}}$.
Now $T_{v4}\,(= T(a_4'))$ does not exist in sequenceV, so how can $R^{cv}(T_4)$ in sequenceMain be estimated via the above equation?
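As far as I understand [3], the matching is done by α value, not by position in the sequence: for each fold, the tree evaluated at $a_k'$ is the fold subtree whose α interval contains $a_k'$. When $a_k'$ lies beyond the fold's last threshold, this rule simply returns the fold's smallest subtree, so no fourth fold tree needs to exist. A toy sketch of that matching rule (all alphas and error rates below are invented for illustration):

```python
# Toy sketch of the alpha-interval matching rule from [3], Section 3.4.2.
# All thresholds and error rates are invented numbers, not real data.
import bisect
from math import sqrt, inf

# Invented alpha thresholds of the main sequence [T_1, T_2, T_3, T_4].
main_alphas = [0.0, 0.01, 0.04, 0.09]

# One fold whose tree could only be pruned twice: thresholds for
# [T_v1, T_v2, T_v3] only -- there is no fourth fold subtree.
fold_alphas = [0.0, 0.02, 0.05]
fold_errors = [0.30, 0.25, 0.28]  # invented hold-out error of each fold subtree


def fold_tree_for(alpha_prime):
    """Index of the fold subtree whose alpha interval contains alpha_prime.

    T_vj is in force for alpha in [fold_alphas[j], fold_alphas[j+1]);
    the last interval is unbounded, so any large alpha' maps to the
    fold's smallest subtree.
    """
    return bisect.bisect_right(fold_alphas, alpha_prime) - 1


# a_k' = sqrt(a_k * a_{k+1}); for the last tree a_{K+1} is taken as
# infinity, so a_K' is infinity as well.
alpha_primes = [
    sqrt(main_alphas[k] * main_alphas[k + 1]) if k + 1 < len(main_alphas)
    else inf
    for k in range(len(main_alphas))
]

for k, ap in enumerate(alpha_primes):
    j = fold_tree_for(ap)
    print(f"T_{k + 1}: a' = {ap:.4f} -> fold tree T_v{j + 1}, "
          f"error {fold_errors[j]}")
```

In this toy run, $T_4$'s $a_4' = \infty$ falls into the last fold interval and is matched to $T_{v3}$; whether this by-alpha matching is theoretically justified is exactly what the Esposito et al. quotation in [10] questions.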

The other references about CCP were:
In [9], there is only the pruning algorithm, with no details about how to select the best pruned tree.
In [10], page 110: "There is no theoretical justification for this heuristic tree-matching process as it was mentioned by Esposito et al. (1997)."
This quotation from [10] means that sequenceMain and sequenceV cannot be matched, neither by the shape of the pruned trees nor by the length of the sequences.
------------------------------------------------------------------------------------------
So, selecting the best pruned tree (from CCP) with cross-validation is NOT rigorous in theory. It has limitations: it is usable only on large and balanced datasets, which can ensure that the different learning samples $L_v$ produce similar trees.

References:
[1] Decision tree learning algorithms
[2] https://github.com/appleyuchi/Decision_Tree_Prune
[3] Classification and Regression Trees, Section 3.4 "The Best Pruned Subtree: An Estimation Problem"
[4] http://www.cs.uu.nl/docs/vakken/mdm/trees.pdf
[5] Cost-Complexity Pruning
[6] https://onlinecourses.science.psu.edu/stat857/node/60/
[7] Cost-Complexity Pruning Process (IBM)
[8] Classification and Regression Trees
[9] Lecture 19: Decision trees
[10] Chapter 4: Overfitting Avoidance in Regression Trees

---------------------appendix--------------------------------------
Why $a_k' = \sqrt{a_k a_{k+1}}$ in CCP and ECP?
Because the final best pruned tree is built on all the data, its real cost-complexity threshold will be larger than $a_k$, so we need to adjust it upward; we therefore use $a_k' = \sqrt{a_k a_{k+1}}$. But of course this is not a rigorous correction, just a heuristic from experience.
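A quick numeric check of this heuristic (the two thresholds below are invented): since $T(\alpha) = T_k$ for every $\alpha \in [a_k, a_{k+1})$, any representative inside that interval names the same subtree, and the geometric mean always lands inside it.

```python
# Check that the geometric mean lies inside [a_k, a_{k+1}).
# The two adjacent CCP thresholds are invented for illustration.
from math import sqrt

a_k, a_k1 = 0.01, 0.04
a_prime = sqrt(a_k * a_k1)  # geometric mean of the two thresholds

# T(alpha) = T_k for every alpha in [a_k, a_{k+1}), so a_prime names T_k;
# the geometric mean sits below the arithmetic mean, i.e. it is a
# representative biased toward the smaller threshold a_k.
assert a_k <= a_prime < a_k1
print(a_prime)
```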
