

Getting feature importance in GBDT and XGBoost


From Stack Overflow. The idea: feature importance measures how much each feature contributes to reducing node impurity; the more impurity a feature's splits remove, the more important the feature.
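Before the walkthrough, here is a minimal sketch of how the importances are actually read out of each library (assuming scikit-learn and the xgboost Python package are installed; the dataset and parameters are illustrative only):

import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)

# sklearn GBDT: impurity-based importances, normalized to sum to 1
gbdt = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
print(gbdt.feature_importances_)

# xgboost: same attribute on the sklearn-style wrapper, plus per-metric
# scores from the underlying booster ('gain' = average loss reduction
# from the splits that use each feature)
bst = xgb.XGBClassifier(n_estimators=50).fit(X, y)
print(bst.feature_importances_)
print(bst.get_booster().get_score(importance_type='gain'))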

I'll use the sklearn code, as it is generally much cleaner than the R code.

Here's the implementation of the feature_importances_ property of the GradientBoostingClassifier (I removed some lines of code that get in the way of the conceptual stuff):

def feature_importances_(self):
    total_sum = np.zeros((self.n_features, ), dtype=np.float64)
    for stage in self.estimators_:
        stage_sum = sum(tree.feature_importances_
                        for tree in stage) / len(stage)
        total_sum += stage_sum
    importances = total_sum / len(self.estimators_)
    return importances

This is pretty easy to understand. self.estimators_ is an array containing the individual trees in the booster, so the for loop iterates over the individual trees. There's one hiccup with the line

stage_sum = sum(tree.feature_importances_ for tree in stage) / len(stage)

This takes care of the non-binary response case. Here we fit multiple trees in each stage in a one-vs-all way. It's simplest conceptually to focus on the binary case, where the sum has one summand, and this is just tree.feature_importances_. So in the binary case, we can rewrite this all as

def feature_importances_(self):
    total_sum = np.zeros((self.n_features, ), dtype=np.float64)
    for tree in self.estimators_:
        total_sum += tree.feature_importances_
    importances = total_sum / len(self.estimators_)
    return importances

So, in words: sum up the feature importances of the individual trees, then divide by the total number of trees. It remains to see how to calculate the feature importances for a single tree.
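Before moving on to the single-tree computation, two quick sanity checks on the averaging step (a sketch; the datasets and model sizes are illustrative, and newer sklearn versions normalize the attribute slightly differently than the snippet above, so expect close rather than bit-identical values):

import numpy as np
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.ensemble import GradientBoostingClassifier

# Multi-class: one tree per class per boosting stage (one-vs-all).
X3, y3 = load_iris(return_X_y=True)                  # 3 classes
clf3 = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X3, y3)
print(clf3.estimators_.shape)                        # (10, 3)

# Binary: replay the quoted averaging logic by hand.
X, y = load_breast_cancer(return_X_y=True)           # 2 classes
clf = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, y)

manual = np.zeros(X.shape[1])
for stage in clf.estimators_:                        # each stage holds 1 tree here
    manual += sum(tree.feature_importances_ for tree in stage) / len(stage)
manual /= len(clf.estimators_)

print(manual[:5])
print(clf.feature_importances_[:5])                  # should broadly agree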

The importance calculation for a single tree is implemented at the Cython level, but it's still followable. Here's a cleaned-up version of the code:

cpdef compute_feature_importances(self, normalize=True):
    """Computes the importance of each feature (aka variable)."""
    while node != end_node:
        if node.left_child != _TREE_LEAF:
            # ... and node.right_child != _TREE_LEAF:
            left = &nodes[node.left_child]
            right = &nodes[node.right_child]
            importance_data[node.feature] += (
                node.weighted_n_node_samples * node.impurity -
                left.weighted_n_node_samples * left.impurity -
                right.weighted_n_node_samples * right.impurity)
        node += 1

    importances /= nodes[0].weighted_n_node_samples
    return importances

This is pretty simple. Iterate through the nodes of the tree. As long as you are not at a leaf node, calculate the weighted reduction in node purity from the split at this node, and attribute it to the feature that was split on:

importance_data[node.feature] += (
    node.weighted_n_node_samples * node.impurity -
    left.weighted_n_node_samples * left.impurity -
    right.weighted_n_node_samples * right.impurity)

Then, when done, divide it all by the total weight of the data (in most cases, the number of observations):

importances /= nodes[0].weighted_n_node_samples

It's worth recalling that the impurity is a common metric to use when determining what split to make when growing a tree. In that light, we are simply summing up how much splitting on each feature allowed us to reduce the impurity across all the splits in the tree.
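To tie the walkthrough together, here is a pure-NumPy replay of that per-tree computation using the public tree_ arrays of a fitted sklearn tree (a sketch, not the library's actual code path; the dataset is illustrative). With the final normalization that sklearn applies, it should reproduce tree.feature_importances_:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
t = clf.tree_

importances = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:                        # TREE_LEAF == -1: no split here
        continue
    # Weighted impurity reduction of this split, credited to its feature.
    importances[t.feature[node]] += (
        t.weighted_n_node_samples[node] * t.impurity[node]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right]
    )

importances /= t.weighted_n_node_samples[0]   # divide by total data weight
importances /= importances.sum()              # sklearn also normalizes to sum to 1
print(np.allclose(importances, clf.feature_importances_))   # True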


Reposted from: https://www.cnblogs.com/wuxiangli/p/6756577.html
