Predicting Credit Card Fraud with SVM

Published: 2023/12/20

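The dataset comes from Kaggle (https://www.kaggle.com/mlg-ulb/creditcardfraud) and is used to predict whether a credit card user falls into the fraud group. This post was originally published as the Chinese version of an earlier draft.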

Initial Data Exploration

First, load the data.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    df = pd.read_csv("E:/study/machine learning/credit card fraud/creditcard.csv")
    df.describe()

The output captured with this step shows the first rows of the DataFrame rather than the describe() summary, and it already includes the Amount_log column that is created later in this post:

        Time        V1        V2  ...  Amount  Class  Amount_log
    0    0.0 -1.359807 -0.072781  ...  149.62      0    5.008166
    1    0.0  1.191857  0.266151  ...    2.69      0    0.993252
    2    1.0 -1.358354 -1.340163  ...  378.66      0    5.936665
    3    1.0 -0.966272 -0.185226  ...  123.50      0    4.816322
    4    2.0 -1.158233  0.877737  ...   69.99      0    4.248495
    5    2.0 -0.425966  0.960523  ...    3.67      0    1.302913
    6    4.0  1.229658  0.141004  ...    4.99      0    1.609438
    7    7.0 -0.644269  1.417964  ...   40.80      0    3.708927
    8    7.0 -0.894286  0.286157  ...   93.20      0    4.534855
    9    9.0 -0.338262  1.119593  ...    3.68      0    1.305626
    10  10.0  1.449044 -1.176339  ...    7.80      0    2.055405
    11  10.0  0.384978  0.616109  ...    9.99      0    2.302585
    12  10.0  1.249999 -1.221637  ...  121.50      0    4.79...

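The data has 31 columns. Besides the "Class" column, which marks whether a row belongs to the fraud group, there are Time and Amount, plus V1 through V28, which were PCA-transformed to hide user information. We first look at how the data is distributed across Class.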

    plt.figure(figsize=(7, 5))
    # Pass the column by keyword; the keyword form works across seaborn versions.
    sns.countplot(x="Class", data=df)
    plt.title("Fraud and not Fraud Class Count", fontsize=18)
    plt.xlabel("Fraud and not Fraud", fontsize=15)
    plt.ylabel("Count", fontsize=15)
    plt.show()

As the plot shows, this is a clearly imbalanced dataset: users involved in fraud are far fewer than users who are not. This means that before modeling we must first deal with the imbalance, otherwise the model will consistently tend to assign users to the non-fraud group.
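To make the risk concrete, here is a minimal sketch of what happens when a model always predicts the majority class. The helper name `majority_baseline_metrics` and the toy label counts are hypothetical and are not taken from the Kaggle data.

```python
from collections import Counter


def majority_baseline_metrics(labels):
    """Accuracy and recall of a model that always predicts the majority class."""
    counts = Counter(labels)
    majority = counts.most_common(1)[0][0]
    predictions = [majority] * len(labels)

    correct = sum(p == y for p, y in zip(predictions, labels))
    true_pos = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    positives = counts[1]

    accuracy = correct / len(labels)
    recall = true_pos / positives if positives else 0.0
    return accuracy, recall


# Hypothetical toy labels: 995 non-fraud rows (0) and 5 fraud rows (1).
toy_labels = [0] * 995 + [1] * 5
accuracy, recall = majority_baseline_metrics(toy_labels)
print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```

On such labels the baseline scores very high accuracy while catching no fraud at all, which is why accuracy alone is misleading here.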

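We continue with the other two explicitly defined variables. The figure produced below is a box plot of the log-transformed Amount against Class. It shows that fraud users have a wider spread of amounts and a noticeably larger IQR than non-fraud users, although the highest amounts occur in the non-fraud group.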
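The IQR comparison can also be checked numerically instead of being read off the plot. The following is a small sketch on made-up values; the helper `iqr_by_class` and the toy frame are assumptions for illustration, while the column names follow the ones used in this post.

```python
import pandas as pd


def iqr_by_class(frame, value_col, class_col):
    """Interquartile range (Q3 - Q1) of value_col within each class."""
    grouped = frame.groupby(class_col)[value_col]
    return grouped.quantile(0.75) - grouped.quantile(0.25)


# Hypothetical toy data, not real transactions.
toy = pd.DataFrame({
    "Class": [0, 0, 0, 0, 1, 1, 1, 1],
    "Amount_log": [1.0, 2.0, 3.0, 4.0, 0.0, 2.0, 4.0, 6.0],
})
iqr = iqr_by_class(toy, "Amount_log", "Class")
print(iqr)
```

With the real data, the same call on `df` would be `iqr_by_class(df, "Amount_log", "Class")`.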

    # Engineer the data for better visualization; 0.01 avoids log(0) for zero amounts.
    df['Amount_log'] = np.log(df['Amount'] + 0.01)
    plt.figure(figsize=(7, 5))
    sns.boxplot(x="Class", y="Amount_log", data=df)
    plt.show()

Finally, we look at the relationship between Time and Class.

    fraud = df[df["Class"] == 1]
    nonfraud = df[df["Class"] == 0]

    plt.figure(figsize=(7, 20))
    plt.subplot(211)
    ax1 = sns.scatterplot(x=fraud["Time"], y=fraud["Amount"])
    plt.subplot(212)
    ax2 = sns.scatterplot(x=nonfraud["Time"], y=nonfraud["Amount"])
    plt.show()

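Across the time axis, the amounts in both the fraud and non-fraud groups are generally low, but both are spread very evenly. The plot shows no obvious relationship between Time and fraud, so we can consider dropping the Time variable when modeling.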

Now let us look at the correlations between the variables.

    df_corr = df.corr()
    plt.figure(figsize=(7, 5))
    sns.heatmap(df_corr, cmap="YlGnBu")
    plt.title('Heatmap correlation')
    plt.show()

The heatmap shows that the vast majority of variables are uncorrelated with one another. This is consistent with the variables having already been through PCA, so there is no need to apply PCA again.
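The link between PCA and uncorrelated columns can be illustrated on synthetic data. This is only a sketch: the three raw features below are invented, and scikit-learn's `PCA` stands in for whatever transformation the dataset's publishers actually applied.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical raw features: the first two are strongly correlated by construction.
base = rng.normal(size=(1000, 1))
raw = np.hstack([
    base + 0.1 * rng.normal(size=(1000, 1)),
    2 * base + 0.1 * rng.normal(size=(1000, 1)),
    rng.normal(size=(1000, 1)),
])

# Project onto the principal components.
components = PCA(n_components=3).fit_transform(raw)

raw_corr = np.corrcoef(raw, rowvar=False)
pca_corr = np.corrcoef(components, rowvar=False)

# Largest absolute correlation between two different principal components.
max_off_diag = np.abs(pca_corr - np.eye(3)).max()
print("raw corr(f0, f1):", round(raw_corr[0, 1], 3))
print("max off-diagonal corr after PCA:", max_off_diag)
```

The principal components are uncorrelated up to floating-point error, which matches the near-empty heatmap seen for V1 to V28.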

Modeling

First, we need to split the dataset into a training set and a test set.

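    from sklearn.model_selection import train_test_split

    # Drop the engineered column, the Time variable, and the target from the features.
    X = df.drop(["Amount_log", "Time", "Class"], axis=1)
    y = df["Class"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)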

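There are several ways to handle imbalanced data. Given the training time (we have chosen SVM, which takes relatively long to train), we use undersampling here: we sample from the majority class so that the numbers of fraud and non-fraud users are of the same order of magnitude.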

    # Undersample only the training set; the test set keeps its original class balance.
    undersampling_train = pd.concat([X_train, y_train], axis=1)
    undersampling_train_nonfraud = undersampling_train[undersampling_train['Class'] == 0].sample(300)
    undersampling_train_fraud = undersampling_train[undersampling_train['Class'] == 1]
    undersampling_train_total = pd.concat([undersampling_train_nonfraud, undersampling_train_fraud], axis=0)
    undersampling_X = undersampling_train_total.drop("Class", axis=1)
    undersampling_y = undersampling_train_total["Class"]
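The same undersampling step can be wrapped in a small reusable helper with a fixed seed so that runs are repeatable. This is a sketch under assumptions: the function `undersample`, its parameters, and the toy data are hypothetical, and it assumes the features and labels share a unique index.

```python
import pandas as pd


def undersample(features, labels, n_majority, majority_label=0, seed=0):
    """Keep n_majority random majority rows plus every minority row."""
    is_majority = labels == majority_label
    kept_majority = features[is_majority].sample(n=n_majority, random_state=seed)
    kept_index = kept_majority.index.append(features[~is_majority].index)
    return features.loc[kept_index], labels.loc[kept_index]


# Hypothetical toy data: 90 non-fraud rows and 10 fraud rows.
toy_X = pd.DataFrame({"V1": range(100), "Amount": range(100, 200)})
toy_y = pd.Series([0] * 90 + [1] * 10, name="Class")

small_X, small_y = undersample(toy_X, toy_y, n_majority=20)
print(small_y.value_counts())
```

Fixing the seed is a design choice for reproducibility; the sampling call in the post is unseeded, so its exact numbers will vary from run to run.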

The method we choose is SVM.

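    # Assumed import: the call plot_confusion_matrix(conf_mat=...) matches mlxtend's helper.
    from mlxtend.plotting import plot_confusion_matrix
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC

    def confusion_matrix_1(CM):
        # Plot the confusion matrix, then print accuracy and recall computed from it.
        fig, ax = plot_confusion_matrix(conf_mat=CM)
        plt.title("The Confusion Matrix 1 of Undersampled dataset")
        plt.ylabel("Actual")
        plt.xlabel("Predicted")
        plt.show()
        print("The accuracy is " + str((CM[1,1] + CM[0,0]) / (CM[0,0] + CM[0,1] + CM[1,0] + CM[1,1]) * 100) + " %")
        print("The recall from the confusion matrix is " + str(CM[1,1] / (CM[1,0] + CM[1,1]) * 100) + " %")

    # Train on the undersampled data, evaluate on the untouched test set.
    ss = SVC(kernel="linear")
    ss.fit(undersampling_X, undersampling_y)
    y_pred = ss.predict(X_test)
    cmss = confusion_matrix(y_test, y_pred)
    confusion_matrix_1(cmss)

Output:

    The accuracy is 95.20519086542512 %
    The recall from the confusion matrix is 87.09677419354838 %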

An accuracy of 95% looks decent, but for credit card fraud we rely more on recall (the share of actual fraud cases the model catches) to judge the model, and a recall of 87% leaves room for improvement.
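As a reference for how these figures relate, here is a minimal sketch that derives accuracy, recall, and precision from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]], the layout scikit-learn's `confusion_matrix` uses for labels 0 and 1. The function name and the toy counts are hypothetical; precision is added only to show the cost of false alarms.

```python
def metrics_from_confusion(cm):
    """Accuracy, recall and precision from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,   # share of all rows classified correctly
        "recall": tp / (tp + fn),        # share of actual fraud that was caught
        "precision": tp / (tp + fp),     # share of fraud alerts that were real
    }


# Hypothetical toy counts, not the results reported in this post.
toy_cm = [[90, 10], [2, 8]]
metrics = metrics_from_confusion(toy_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```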

Hyperparameter Tuning

We try GridSearchCV to tune the SVM parameters.

    from sklearn.model_selection import GridSearchCV

    # Search over kernel and C, scoring each candidate by cross-validated recall.
    tuned_parameters = [{'kernel': ['linear', 'rbf', 'poly'], 'gamma': ['auto'], 'C': [1, 10, 100, 1000]}]
    svm = GridSearchCV(SVC(), tuned_parameters, cv=5, scoring='recall')
    svm.fit(undersampling_X, undersampling_y)
    print("Best parameters set found on Training dataset:")
    print()
    print(svm.best_params_)

According to GridSearchCV, the best parameters are {'C': 100, 'gamma': 'auto', 'kernel': 'rbf'}. We use this set of parameters to model the dataset again.

    {'C': 100, 'gamma': 'auto', 'kernel': 'rbf'}
    The accuracy is 82.39796634925986 %
    The recall from the confusion matrix is 94.35483870967742 %

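Although overall accuracy drops, the 94% recall is clearly better than the original model's 87%. The lower accuracy means more non-fraud users are flagged as fraud, but since recall is the priority here, the tuned model is the better fit for this task.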
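The refit-and-evaluate step is only reported through its output above. A minimal sketch of how it might look follows, run on synthetic data from `make_classification` rather than the Kaggle file. The smaller parameter grid, the stratified split, and the 1:1 undersampling are assumptions chosen to keep the example quick, so its numbers will not match the ones reported in this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced data standing in for the credit card features.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Undersample the majority class in the training set only.
rng = np.random.default_rng(0)
fraud_idx = np.flatnonzero(y_train == 1)
nonfraud_idx = rng.choice(np.flatnonzero(y_train == 0),
                          size=len(fraud_idx), replace=False)
keep = np.concatenate([fraud_idx, nonfraud_idx])

# Grid search scored by recall; refit=True (the default) retrains the best model.
param_grid = {"kernel": ["linear", "rbf"], "C": [1, 10], "gamma": ["auto"]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="recall")
search.fit(X_train[keep], y_train[keep])

# Evaluate the refitted best model on the untouched, still imbalanced test set.
y_pred = search.best_estimator_.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
test_recall = recall_score(y_test, y_pred)

print(search.best_params_)
print(f"accuracy={test_accuracy:.3f}, recall={test_recall:.3f}")
```

Evaluating on the original test distribution, as the post does, keeps the reported recall and accuracy comparable between the first model and the tuned one.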
