Credit Risk Management: Classification Models and Hyperparameter Tuning
The final part aims to walk you through the process of applying different classification algorithms on our transformed dataset as well as producing the best-performing model using Hyperparameter Tuning.
As a reminder, this end-to-end project aims to solve a classification problem in Data Science, particularly in finance industry and is divided into 3 parts:
Machine Learning Modelling (Classification)
If you have missed the previous two parts, feel free to check them out here and here before going through this final part, which leverages their output to produce the best classification model.
A. Classification Models
Which algorithms should be used to build a model that addresses and solves a classification problem?
When it comes to classification, we have quite a handful of different algorithms to use unlike regression. To name some, Logistic Regression, K-Neighbors, SVC, Decision Tree and Random Forest are the top common and widely used algorithms to solve such problems.
Here’s a quick recap of what each algorithm does and how it distinguishes itself from the others:
Logistic Regression: this algorithm uses regression to predict the continuous probability of a data sample (from 0 to 1), then classifies that sample to the more probable target (either 0 or 1). However, it assumes a linear relationship between the inputs and the target, which might not be a good choice if the dataset does not follow a Gaussian distribution.
K-Neighbors: this algorithm assumes data points which are in close proximity to each other belong to the same class. Particularly, it classifies the target (either 0 or 1) of a data sample by a plurality vote of the neighbors which are close in distance to it.
SVC: this algorithm makes classifications by defining a decision boundary, then classifies a data sample to the target (either 0 or 1) by seeing which side of the boundary it falls on. Essentially, the algorithm aims to maximize the distance between the decision boundary and the points in each class to decrease the chance of false classification.
Decision Tree: as the name suggests, this algorithm splits the root of the tree (the entire dataset) into decision nodes, and each decision node is split in turn until no further node is splittable. The algorithm then classifies a data sample by sorting it down the tree from the root to a leaf/terminal node and seeing which target node it falls on.
Random Forest: this algorithm is an ensemble technique developed from the Decision Tree, in which many decision trees work together. Particularly, the random forest gives a data sample to each of the decision trees and returns the most popular classification to assign the target to that data sample. This algorithm helps avoid the overfitting that may occur with a single Decision Tree, as it aggregates the classifications from multiple trees instead of one.
Let’s see how they work with our dataset compared to one another:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    "LogisticRegression": LogisticRegression(),
    "KNeighbors": KNeighborsClassifier(),
    "SVC": SVC(),
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier()
}
After importing the algorithms from sklearn, I created a dictionary that combines them all in one place, so that it's easier to apply them to the data at once instead of handling each one manually.
#Compute the training score of each model
train_scores = []
test_scores = []

for key, classifier in classifiers.items():
    classifier.fit(x_a_train_rs_over_pca, y_a_train_over)
    train_score = round(classifier.score(x_a_train_rs_over_pca, y_a_train_over), 2)
    train_scores.append(train_score)
    test_score = round(classifier.score(x_a_test_rs_over_pca, y_a_test_over), 2)
    test_scores.append(test_score)

print(train_scores)
print(test_scores)
After applying the algorithms to both the train and test sets, it seems that Logistic Regression doesn't work well for this dataset, as its scores are relatively low (around 50%, which indicates that the model is not able to classify the target). This is quite understandable and further suggests that our original dataset is not normally distributed.
In contrast, Decision Tree and Random Forest produced significantly high accuracy scores on the train set (85%). Yet the opposite holds for the test set, where the scores are remarkably low (just over 50%). Possible reasons that might explain the large gap are (1) overfitting the train set, or (2) leaking the target into the test set. However, after cross-checking, neither seems to be the case.
Hence, I decided to look into another scoring metric, the Cross Validation Score, to see if there's any difference. Basically, this technique splits the training set into n folds (default = 5), fits the data on n-1 folds, and scores on the remaining fold. The process is repeated across the n folds, and the average score is calculated. The cross validation score gives a more objective view of how the models perform than the standard accuracy score.
from sklearn.model_selection import cross_val_score

train_cross_scores = []
test_cross_scores = []

for key, classifier in classifiers.items():
    classifier.fit(x_a_train_rs_over_pca, y_a_train_over)
    train_score = cross_val_score(classifier, x_a_train_rs_over_pca, y_a_train_over, cv=5)
    train_cross_scores.append(round(train_score.mean(), 2))
    test_score = cross_val_score(classifier, x_a_test_rs_over_pca, y_a_test_over, cv=5)
    test_cross_scores.append(round(test_score.mean(), 2))

print(train_cross_scores)
print(test_cross_scores)
As seen, the gap between the train and test scores was significantly bridged!
Since the Random Forest model produced the highest cross validation score, we will test it against another metric, the ROC AUC Score, and see how it performs on the ROC Curve.
Essentially, the ROC Curve is a plot of the false positive rate (x-axis) versus the true positive rate (y-axis) across classification thresholds between 0 and 1, while AUC represents the degree or measure of separability (simply, the ability to distinguish the targets).
Image credit: https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

Below is a quick summary table of how to calculate FPR (the inverse of Specificity) and TPR (also known as Sensitivity):
Image credit: https://towardsdatascience.com/hackcvilleds-4636c6c1ba53

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

rf = RandomForestClassifier()
rf.fit(x_a_train_rs_over_pca, y_a_train_over)
rf_pred = cross_val_predict(rf, x_a_test_rs_over_pca, y_a_test_over, cv=5)
print(roc_auc_score(y_a_test_over, rf_pred))

#Plot the ROC Curve
fpr, tpr, _ = roc_curve(y_a_test_over, rf_pred)
plt.plot(fpr, tpr)
plt.show()

ROC Curve with ROC AUC Score = 76%
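The FPR and TPR definitions summarized in the table above can be checked by hand from a confusion matrix. Here is a minimal sketch with made-up predictions (not the project's data), just to illustrate the arithmetic:

```python
# Computing TPR (sensitivity) and FPR (1 - specificity) from a confusion
# matrix, using small illustrative label/prediction vectors.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)  # true positives over all actual positives
fpr = fp / (fp + tn)  # false positives over all actual negatives
print(tpr, fpr)  # 0.75 0.25
```

These are exactly the coordinates that `roc_curve` traces out as the decision threshold varies.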
Having shown that cross validation works on this dataset, I then applied another cross validation technique called "cross_val_predict", which follows a similar methodology of splitting into n folds and predicting the values of the held-out fold.
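The key property of cross_val_predict is that every prediction comes from a model that never saw that sample during fitting, so each sample receives exactly one out-of-fold prediction. A quick sketch on synthetic data (the names and sizes here are illustrative, not from the project):

```python
# cross_val_predict returns one out-of-fold prediction per sample: each
# fold's model predicts only the samples it was not trained on.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=100, n_features=5, random_state=1)
preds = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5)
print(len(preds), len(y))  # 100 100 — exactly one prediction per sample
```

This is why feeding its output to roc_auc_score, as above, gives an honest out-of-sample estimate rather than a score inflated by training on the same points.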
B. Hyperparameter Tuning
What is hyperparameter tuning and how does it help to improve the accuracy of the model?
After computing the models with each algorithm's default hyperparameters, I was hoping to see if further improvement could be made, which comes down to Hyperparameter Tuning. Essentially, this technique chooses the set of optimal hyperparameters for each algorithm that (might) produce the highest accuracy score on the given dataset.
The reason why I put (might) in the definition is that in some cases little to no improvement is seen, depending on the dataset as well as the preparation done initially (plus it can take forever to run). However, Hyperparameter Tuning should be taken into consideration in the hope of finding the best-performing model.
#Use GridSearchCV to find the best parameters
from sklearn.model_selection import GridSearchCV

#Logistic Regression
lr = LogisticRegression()
lr_params = {"penalty": ['l1', 'l2'], "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
             "solver": ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
grid_logistic = GridSearchCV(lr, lr_params)
grid_logistic.fit(x_a_train_rs_over_pca, y_a_train_over)
lr_best = grid_logistic.best_estimator_

#K-Nearest Neighbors
knear = KNeighborsClassifier()
knear_params = {"n_neighbors": list(range(2,7,1)),
                "algorithm": ['auto', 'ball_tree', 'kd_tree', 'brute']}
grid_knear = GridSearchCV(knear, knear_params)
grid_knear.fit(x_a_train_rs_over_pca, y_a_train_over)
knear_best = grid_knear.best_estimator_

#SVC
svc = SVC()
svc_params = {"C": [0.5, 0.7, 0.9, 1], "kernel": ['rbf', 'poly', 'sigmoid', 'linear']}
grid_svc = GridSearchCV(svc, svc_params)
grid_svc.fit(x_a_train_rs_over_pca, y_a_train_over)
svc_best = grid_svc.best_estimator_

#Decision Tree
tree = DecisionTreeClassifier()
tree_params = {"criterion": ['gini', 'entropy'], "max_depth": list(range(2,5,1)),
               "min_samples_leaf": list(range(5,7,1))}
grid_tree = GridSearchCV(tree, tree_params)
grid_tree.fit(x_a_train_rs_over_pca, y_a_train_over)
tree_best = grid_tree.best_estimator_
GridSearchCV is the key to finding the optimal set of hyperparameters for each algorithm: it exhaustively tries the combinations of candidate values against the dataset, then returns the best-performing set among all.
One thing of note is that we have to know which hyperparameters are available for each algorithm in order to tune it. For example, Logistic Regression has "penalty", "C", and "solver", which do not apply to the other algorithms.
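Rather than memorizing them, scikit-learn can list each estimator's tunable hyperparameters programmatically via get_params(), which is handy when building the dictionaries for GridSearchCV. A quick sketch:

```python
# get_params() returns a dict of every tunable hyperparameter of an
# estimator, so the available names never need to be memorized.
from sklearn.linear_model import LogisticRegression

params = LogisticRegression().get_params()
print(sorted(params))  # includes 'C', 'penalty' and 'solver', among others
```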
After finding the .best_estimator_ of each algorithm, fit and predict the data using each algorithm with its best set of hyperparameters. However, we need to compare the new scores against the originals to determine whether any improvement is seen, or whether to continue fine-tuning.
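That comparison can be sketched as follows. Synthetic data stands in for the project's x_a_train_rs_over_pca / y_a_train_over arrays (which are not reproduced here), so treat the numbers as illustrative only:

```python
# Scoring a GridSearchCV-tuned estimator side by side with the default one,
# both via 5-fold cross validation, on stand-in synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

default_lr = LogisticRegression(max_iter=1000)
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X, y)
lr_best = grid.best_estimator_

default_score = cross_val_score(default_lr, X, y, cv=5).mean()
tuned_score = cross_val_score(lr_best, X, y, cv=5).mean()
print(round(default_score, 2), round(tuned_score, 2))
```

If the tuned score does not beat the default, that is the signal to widen or refine the parameter grid, as noted above.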
Bonus: XGBoost and LightGBM
What are XGBoost and LightGBM, and how much better do these algorithms do compared to the traditional ones?
Apart from the common classification algorithms, there are also a couple of advanced algorithms rooted in the traditional ones. In this case, XGBoost and LightGBM can be considered successors of Decision Tree and Random Forest. Look at the below timeline for a better understanding of how these algorithms were developed:
Image credit: https://www.slideshare.net/GabrielCyprianoSaca/xgboost-lightgbm

I'm not going to go into the mathematical details of how these algorithms differ, but in general, they are able to prune the decision trees better while handling missing values and avoiding overfitting at the same time.
#XGBoost
import xgboost as xgb

xgb_model = xgb.XGBClassifier()
xgb_model.fit(x_a_train_rs_over_pca, y_a_train_over)
xgb_train_score = cross_val_score(xgb_model, x_a_train_rs_over_pca, y_a_train_over, cv=5)
xgb_test_score = cross_val_score(xgb_model, x_a_test_rs_over_pca, y_a_test_over, cv=5)
print(round(xgb_train_score.mean(), 2))
print(round(xgb_test_score.mean(), 2))

#LightGBM
import lightgbm as lgb

lgb_model = lgb.LGBMClassifier()
lgb_model.fit(x_a_train_rs_over_pca, y_a_train_over)
lgb_train_score = cross_val_score(lgb_model, x_a_train_rs_over_pca, y_a_train_over, cv=5)
lgb_test_score = cross_val_score(lgb_model, x_a_test_rs_over_pca, y_a_test_over, cv=5)
print(round(lgb_train_score.mean(), 2))
print(round(lgb_test_score.mean(), 2))
After computing, the train and test scores of each model are 72% & 73% (XGBoost) and 69% & 72% (LightGBM), roughly the same as the Random Forest model computed above. However, we can still make further optimisations via Hyperparameter Tuning for these advanced models, but beware that it might take forever, since XGBoost and LightGBM have longer runtimes due to the complexity of their algorithms.
Voila! That's a wrap for this end-to-end project on Classification! If you are keen to explore the entire code, feel free to check out my GitHub below:
Repository: https://github.com/andrewnguyen07/credit-risk-management
LinkedIn: www.linkedin.com/in/andrewnguyen07
Follow my Medium to stay posted on future projects coming soon!
Originally published at: https://towardsdatascience.com/credit-risk-management-classification-models-hyperparameter-tuning-d3785edd8371