Predicting Interest Rate with Classification Models - Part 2
We are back! This post is a continuation of the series "Predicting Interest Rate with Classification Models". I will do my best to make each article self-contained, so that you won't need the previous ones to make the most of it.
Fast Recap
In the first article of the series, we applied a Logistic Regression model to predict up movements of the Fed Fund Effective Rate. For that, we used Quandl to retrieve data from Commodity Indices, Merrill Lynch, and US Federal Reserve.
Data | Image by Author

The variables used throughout the series are RICIA, the Euronext Rogers International Agriculture Commodity Index; RICIM, the Euronext Rogers International Metals Commodity Index; RICIE, the Euronext Rogers International Energy Commodity Index; EMHYY, the Emerging Markets High Yield Corporate Bond Index Yield; AAAEY, the US AAA-rated Bond Index Yield; and, finally, USEY, the US Corporate Bond Index Yield. All of them are daily values ranging from 2005-01-03 to 2020-07-01.
Now let's move on to the intuition behind the models we will use!
A brief introduction to Naive Bayes and Random Forest
Naive Bayes
Naive Bayes is a probabilistic classification method based on Bayes' Theorem. The theorem gives us the probability of an event A occurring given that another event B has occurred.
Bayes Theorem | Image by Author

Given a vector of features X = (x₁, x₂, x₃, …, xₙ), we can rewrite the equation above as
Bayes Theorem | Image by Author

It is very important to keep in mind that the model relies on the assumption of conditional independence, which means that the xᵢ are conditionally independent given y.
Assuming conditional independence in features,
Bayes Theorem | Image by Author

For our problem, we are interested in taking the category with maximum probability and labeling our prediction as 0 or 1.
Classification rule | Image by Author

There are three types of Naive Bayes methods: Multinomial, Bernoulli, and Gaussian. We are going to use the Gaussian type, which is used when the predictors take continuous values.
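To make the Gaussian variant concrete, here is a minimal from-scratch sketch (with made-up toy data, not the article's data set): for each class we fit a per-feature mean and variance, then score a new point by multiplying the class prior by the Gaussian likelihood of each feature.

```python
import numpy as np

# Toy training data: two features, binary labels (hypothetical values)
X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 4.0], [3.2, 4.1]])
y = np.array([0, 0, 1, 1])

def gaussian_pdf(x, mean, var):
    """Gaussian density, evaluated element-wise per feature."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def predict(x_new):
    """Return the class with the highest prior * likelihood product."""
    posteriors = []
    for c in (0, 1):
        Xc = X[y == c]
        prior = len(Xc) / len(X)
        mean, var = Xc.mean(axis=0), Xc.var(axis=0) + 1e-9  # small smoothing term
        likelihood = np.prod(gaussian_pdf(x_new, mean, var))
        posteriors.append(prior * likelihood)
    return int(np.argmax(posteriors))

print(predict(np.array([1.1, 1.9])))  # → 0 (close to the class-0 points)
print(predict(np.array([3.1, 4.0])))  # → 1 (close to the class-1 points)
```

scikit-learn's GaussianNB does this same bookkeeping for us (plus better variance smoothing), which is why the model itself takes only two lines later in this article.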
Random Forest
Random Forest can be used for both classification and regression tasks. It is an ensemble of decision trees, each built from a randomly chosen subset of the features. In the end, the most-voted prediction is the outcome of the model.
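That final voting step can be sketched in a few lines (the per-tree predictions here are hypothetical):

```python
from collections import Counter

# Hypothetical class predictions from five trees for a single sample
tree_votes = [1, 0, 1, 1, 0]

# The forest's output is the most common class among the trees
prediction = Counter(tree_votes).most_common(1)[0][0]
print(prediction)  # → 1 (three of the five trees voted for class 1)
```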
Random Forest | Image by Author

As measures of purity, it is possible to apply the Gini Index or Entropy. Gini measures the probability of incorrectly labeling a randomly chosen value from the data set. Its maximum impurity value is 0.5 and its maximum purity value is 0.
Gini Index | Image by Author

Entropy, like Gini, is a measure of disorder in the data. In other words, it is essentially a measure of uncertainty. Its maximum impurity value is 1 and its maximum purity value is 0.
Entropy | Image by Author

These measures are used to calculate what we call Information Gain, which tells us how much information is gained as we go down the tree. If a node does not improve our information, it shouldn't be there. That's why impurity is so important.
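As a quick numeric sketch of these measures (toy counts, not taken from our data set), we can verify the maximum-impurity values stated above and compute the information gain of one hypothetical split:

```python
import math

def gini(p):
    """Gini impurity for a binary node with positive-class proportion p."""
    return 1 - p ** 2 - (1 - p) ** 2

def entropy(p):
    """Shannon entropy (in bits) for a binary node with proportion p."""
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Maximum impurity at p = 0.5; maximum purity at p = 0 or p = 1
print(gini(0.5), gini(0.0))      # → 0.5 0.0
print(entropy(0.5), entropy(1))  # → 1.0 0.0

# Information gain = parent impurity - weighted impurity of the children.
# Toy split: parent has 10 samples (5 positive); children get 5 samples each,
# with 4 of 5 and 1 of 5 positive respectively.
parent = entropy(5 / 10)
children = (5 / 10) * entropy(4 / 5) + (5 / 10) * entropy(1 / 5)
print(round(parent - children, 3))  # → 0.278
```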
Important note: which one to use depends on the problem you have in your hands.
The code
Usually, the first step is to download the data and take a look at it. The purpose is to gain insights that increase our knowledge of the features and to deal with possible NaNs. In this article, we will import the libraries we are going to use and replace possible NaN values with the average value of each variable before looking at the data.
As we are more interested in applying the classification method and studying it, turning NaNs into average values will suit us just fine. But be aware that this is an essential part of any machine learning problem.
import numpy as np
import pandas as pd
import quandl as qdl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="white")
from imblearn.over_sampling import ADASYN
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
import statsmodels.api as sm
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

# get data from Quandl
data = pd.DataFrame()
meta_data = ['RICIA','RICIM','RICIE']
for code in meta_data:
    df = qdl.get('RICI/'+code, start_date="2005-01-03", end_date="2020-07-01")
    df.columns = [code]
    data = pd.concat([data, df], axis=1)

meta_data = ['EMHYY','AAAEY','USEY']
for code in meta_data:
    df = qdl.get('ML/'+code, start_date="2005-01-03", end_date="2020-07-01")
    df.columns = [code]
    data = pd.concat([data, df], axis=1)

# dealing with possible empty values (not much attention to this part, but it is very important)
data.fillna(data.mean(), inplace=True)
print(data.head())
print("\nData shape:\n", data.shape)

Data | Image by Author

# histograms
data.hist()
plt.show()

Histograms | Image by Author
A couple of conclusions could be drawn from looking at the data. But, for the sake of simplicity, we will skip most of them and just notice that the variables' value ranges differ a lot from each other. So we will scale the data with the Min-Max scaler. Next, we will download our dependent variable and binarize it.
# scaling values to make them vary between 0 and 1
scaler = MinMaxScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data.values), columns=data.columns, index=data.index)

# pulling the dependent variable from Quandl (Fed Funds effective rate)
par_yield = qdl.get('FED/RIFSPFF_N_D', start_date="2005-01-03", end_date="2020-07-01")
par_yield.columns = ['FED/RIFSPFF_N_D']

# create an empty df with the same index as the features and fill it with our dependent var values
par_data = pd.DataFrame(index=data_scaled.index, columns=['FED/RIFSPFF_N_D'])
par_data.update(par_yield['FED/RIFSPFF_N_D'])

# get the variation and binarize it
par_data = par_data.pct_change()
par_data.fillna(0, inplace=True)
par_data = par_data.apply(lambda x: [0 if y <= 0 else 1 for y in x])
print("Number of 0s and 1s:\n", par_data.value_counts())

0s and 1s | Image by Author
The binarization rule we used was: if y ≤ 0 then 0, else 1. This rule gives the same label to neutral and down movements, which is why we got a data set with 3143 zeros and 909 ones, meaning that 77% of our data consists of zeros. If we leave it as is, we will probably end up with a biased estimator. It would likely show high accuracy, because a model that classifies everything as zero would be right 77% of the time, but that does not mean it is good. So let's oversample the data with a method called ADASYN.
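A one-line sanity check with the class counts above shows the accuracy that a degenerate always-zero predictor would already achieve:

```python
# Class counts from the binarized target above
zeros, ones = 3143, 909

# A model that always predicts 0 scores ~77.6% accuracy without learning
# anything, which is why accuracy alone is misleading on imbalanced data.
baseline_accuracy = zeros / (zeros + ones)
print(round(baseline_accuracy, 4))  # → 0.7757
```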
# over-sampling with the ADASYN method
sampler = ADASYN(random_state=13)
X_os, y_os = sampler.fit_resample(data_scaled, par_data.values.ravel())
columns = data_scaled.columns
data_scaled = pd.DataFrame(data=X_os, columns=columns)
par_data = pd.DataFrame(data=y_os, columns=['FED/RIFSPFF_N_D'])

print("\nProportion of 0s in oversampled data: ", len(par_data[par_data['FED/RIFSPFF_N_D']==0])/len(data_scaled))
print("\nProportion of 1s in oversampled data: ", len(par_data[par_data['FED/RIFSPFF_N_D']==1])/len(data_scaled))

The oversampled proportion of 0s and 1s | Image by Author
Ok! Now we are good to go! Let’s split our data and apply the methods.
# split data into test and train sets
X_train, X_test, y_train, y_test = train_test_split(data_scaled, par_data, test_size=0.2, random_state=13)

# just to make it easier to write y
y = y_train['FED/RIFSPFF_N_D']
Naive Bayes
# Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y)
y_pred = gnb.predict(X_test)
print('\nAccuracy of Naive Bayes classifier on test set: {:.2f}'.format(gnb.score(X_test, y_test)))

# confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
print('\nConfusion matrix:\n', confusion_matrix)
print('\nClassification report:\n', metrics.classification_report(y_test, y_pred))

# plot confusion matrix
disp = metrics.plot_confusion_matrix(gnb, X_test, y_test, cmap=plt.cm.Blues)
disp.ax_.set_title('Confusion Matrix')
plt.savefig('Confusion Matrix')

# ROC curve
nb_roc_auc = metrics.roc_auc_score(y_test, gnb.predict(X_test))
fpr, tpr, thresholds = metrics.roc_curve(y_test, gnb.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Naive Bayes (area = %0.2f)' % nb_roc_auc)
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve - Naive Bayes')
plt.legend(loc="lower right")
plt.savefig('NB_ROC')

Naive Bayes Classifier's Accuracy | Image by Author
Naive Bayes classification report | Image by Author
If we compare the classification report of the Logistic Regression model applied in Part 1 with that of the Naive Bayes method, it seems that we were able to increase our F1-score from 0.60 to 0.61.
Naive Bayes Classifier's Confusion Matrix | Image by Author

For the Gaussian Naive Bayes classifier, we got an accuracy of 66%, pretty much equal to the Logistic Regression model in Part 1. Looking at the confusion matrix, we can see that it predicted 817 values correctly, while the Logistic Regression model predicted 810. Let's look at the ROC curve of the Naive Bayes model.
Naive Bayes ROC curve | Image by Author

Now, comparing the Logistic Regression ROC curve with the Naive Bayes ROC curve, we can see the area under the ROC curve increase by 0.01, going from 0.65 to 0.66. It seems that we have found a slightly better model for our prediction problem. Now we will apply the Random Forest model.
Random Forest
The Random Forest model can have its hyperparameters tuned to improve performance. However, to get a first feel for the model, we will apply it with its default values. If it shows promising results, then it will be optimized. This approach will save us time!
# Random Forest model
clf = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2,
                             min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto',
                             max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None,
                             bootstrap=True, oob_score=False, n_jobs=None, random_state=13, verbose=0,
                             warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)
clf.fit(X_train, y)
y_pred = clf.predict(X_test)
print('\nAccuracy of Random Forest classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))

# confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
print('\nConfusion matrix:\n', confusion_matrix)
print('\nClassification report:\n', metrics.classification_report(y_test, y_pred))

# plot confusion matrix
disp = metrics.plot_confusion_matrix(clf, X_test, y_test, cmap=plt.cm.Blues)
disp.ax_.set_title('Confusion Matrix')
plt.savefig('Confusion Matrix')

# ROC curve
rf_roc_auc = metrics.roc_auc_score(y_test, clf.predict(X_test))
fpr, tpr, thresholds = metrics.roc_curve(y_test, clf.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Random Forest Classifier (area = %0.2f)' % rf_roc_auc)
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve - Random Forest Classifier')
plt.legend(loc="lower right")
plt.savefig('RF_ROC')

RF's Accuracy | Image by Author
RF's classification report | Image by Author
WOW! Now we have substantially improved our results! We reached an accuracy of 76% while also improving our Precision and Recall measures!
RF's Confusion Matrix | Image by Author

Looking at the confusion matrix, we can see that we labeled 943 values correctly, an increase of 15% compared to the Gaussian Naive Bayes classification model. Finally, let's see what the ROC curve can tell us!
RF's ROC curve | Image by Author

That is a much more beautiful curve! The area under the curve is 0.10 bigger than that of the Naive Bayes ROC curve. What a great improvement indeed! Now we can set this model aside on our "Potential Good Models" list, to be optimized after we finish testing two other models, CatBoost and Support Vector Machines. See you in Part 3!
This article was written in conjunction with Guilherme Bezerra Pujades Magalhães.
References and great links
[1] T. Mitchell, Machine Learning Course (2009)
[2] Haibo He, Yang Bai, E. A. Garcia, and Shutao Li, ADASYN: Adaptive synthetic sampling approach for imbalanced learning (2008) IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, 2008, pp. 1322–1328.
Original article: https://towardsdatascience.com/predicting-interest-rate-with-classification-models-part-2-d25a8f798a99