
The sklearn Logistic Regression a Young Lady Taught Me

Published: 2024/5/6 | Category: Programming Q&A | 52 豆豆


sklearn Logistic Regression

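Alex did not sleep well all night, tormented half to death by IEEE-CIS Fraud Detection. No matter how he tuned the parameters or what strategy he tried, the score simply would not go up. That would not do. When had he ever admitted defeat? Shrug off life and death, and if you won't accept it, fight on! The result: [image not preserved]

The next day, Alex planned to go to the studio and ask Bachelor, who was surely still holding back a lot. But Bachelor, whether out of a guilty conscience or for some other reason, did not show up. The only person in the studio was an intern, fair-skinned, good-looking, and long-legged. According to Bachelor, this intern was an expert who had also come over from MIT and had majored in exactly this area of artificial intelligence. Alex decided to swallow his pride and ask her.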

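Alex: “Hi Coco, Do you know the credit card fraud detection we received recently?”

Coco: “What did you say?”

Alex: “Whoa, you speak Chinese.”

Coco: “Compared with your English, I had an easier time understanding your Chinese.”

Alex: “......”

Alex: “I was asking whether you know about the credit card fraud detection project our studio took on recently.”

Coco: “I do. I'm the one who took that project on. It's just that Bachelor is in charge of it now and I'm following up.”

Alex: “That's great. Bachelor walked me through it yesterday, but when I went back and did it myself, the accuracy was surprisingly low.”

Coco: “Tell me what Bachelor covered and I'll take a look for you.”

So Alex went through yesterday's logistic regression once more...

Coco: “Overall it looks fine. It's just that Bachelor only covered the principle of logistic regression, and applying it to this project takes some extra processing.”

So Alex listened to Coco's explanation while wiping away his drool...

Coco thought nothing of it and concentrated on explaining things to Alex:

Tell you what, I'll walk you through how the studio currently does it.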

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Read both training tables and left-join them on TransactionID
    train_identity = pd.read_csv('/kaggle/input/ieee-fraud-detection/train_identity.csv')
    train_transaction = pd.read_csv('/kaggle/input/ieee-fraud-detection/train_transaction.csv')
    data = pd.merge(train_transaction, train_identity, on="TransactionID", how="left")

    data.drop("TransactionID", axis=1, inplace=True)  # TransactionID is, frankly, not much use
    data.drop("TransactionDT", axis=1, inplace=True)  # TransactionDT is like a timestamp, also not much use
    del train_identity, train_transaction  # free the memory held by the two source tables

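As before, we start by reading the two tables into memory and joining them on TransactionID to build a single train table. After that, the two original tables can be deleted; otherwise they keep taking up memory.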

    pd.set_option('display.max_columns', None)  # no limit on the number of columns pandas displays
    pd.set_option('display.max_rows', None)     # no limit on the number of rows pandas displays
    data.head()

Data preprocessing

First, as before, drop the features whose missing-value ratio exceeds 30%:

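    na_count = data.isnull().sum().sort_values(ascending=False)  # missing values per column
    na_rate = na_count / len(data)                               # missing ratio per column
    na_data = pd.concat([na_count, na_rate], axis=1, keys=['count', 'ratio'])
    # Drop every column whose missing ratio exceeds the threshold
    data.drop(na_data[na_data['ratio'] > 0.3].index, axis=1, inplace=True)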
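To make the filtering rule concrete, here is a minimal plain-Python sketch of the same idea on a toy table. The table, the column names, and the `missing_ratio` helper are made up for illustration; only the 0.3 cut-off comes from the snippet above.

```python
# Toy table: column name -> list of values, with None marking a missing entry.
table = {
    "amount": [10.0, 25.5, None, 7.2, 3.3],
    "device": [None, None, "mobile", None, "desktop"],
    "card": [1001, 1002, 1003, 1004, 1005],
}

THRESHOLD = 0.3  # same cut-off as the pandas snippet


def missing_ratio(values):
    """Fraction of entries in one column that are missing (None)."""
    return sum(v is None for v in values) / len(values)


# Compute the ratio per column, then keep only columns at or below the cut-off.
ratios = {name: missing_ratio(col) for name, col in table.items()}
kept = {name: col for name, col in table.items() if ratios[name] <= THRESHOLD}

print(ratios)
print(sorted(kept))
```

Here "device" is missing in 3 of 5 rows, so it is the only column dropped.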

Map every categorical (object-typed) column to numeric codes, and fill missing values with the column mean:

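    for col in data.columns:
        if data[col].dtypes == "object":
            # factorize encodes each distinct value as an integer (missing values become -1)
            data[col], uniques = pd.factorize(data[col])
        # Assign the result back rather than calling fillna(inplace=True) on a column selection
        data[col] = data[col].fillna(data[col].mean())
    data.head()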
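The two steps can be sketched in plain Python to show what each one does to a column. The `factorize` and `fill_with_mean` helpers below are hypothetical stand-ins written for this illustration, not pandas functions; the first imitates the default behavior of `pd.factorize`, where a missing value gets the code -1.

```python
def factorize(values):
    """Map each distinct non-missing value to an integer code, in order of
    first appearance; missing values (None) get the code -1."""
    codes, uniques = [], {}
    for v in values:
        if v is None:
            codes.append(-1)
        else:
            # setdefault returns the existing code, or stores the next free one
            codes.append(uniques.setdefault(v, len(uniques)))
    return codes, list(uniques)


def fill_with_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


codes, uniques = factorize(["visa", "mastercard", None, "visa"])
filled = fill_with_mean([10.0, None, 30.0])

print(codes, uniques)
print(filled)
```

One consequence worth noticing: since a missing category is already encoded as -1, the mean fill only ever changes the numeric columns.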

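In this project, what really matters is the data preprocessing, not the logistic regression algorithm itself.

Once we have the data, take a quick look first. isFraud indicates whether a transaction is fraudulent, so it is our y value: 0 means not fraudulent and 1 means fraudulent. Fraud is fairly common in everyday life, but normal samples still make up the vast majority, so let's first look at the counts of positive and negative samples.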

    count_isFraud = data["isFraud"].value_counts(sort=True)
    count_isFraud.plot(kind="bar")  # bar chart
    plt.title("Fraud statistics")
    plt.xlabel("isFraud")
    plt.ylabel("Frequency")
    plt.show()

    # Split into the feature matrix X and the label column Y
    X = data.iloc[:, data.columns != "isFraud"]
    Y = data.iloc[:, data.columns == "isFraud"]

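The bar chart shows that the positive and negative sample counts are severely imbalanced, which has a big impact on how our model is fitted. I'd guess your test-set predictions yesterday were all 0, right?

For imbalanced data like this, the two most commonly used remedies are:

1. Over-sampling: generate additional synthetic data from the class-1 samples so that the two classes are balanced;

2. Under-sampling: randomly draw as many class-0 samples as there are class-1 samples so that the two classes are balanced.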

    # Over-sampling: easy to do with a third-party library
    from imblearn.over_sampling import SMOTE

    over_sample = SMOTE()
    # fit_resample is the current name; the older fit_sample is gone from recent imbalanced-learn releases
    X_over_sample_data, Y_over_sample_data = over_sample.fit_resample(X.values, Y.values.ravel())
    # Keep the original column names so later steps can select columns by name
    X_over_sample_data = pd.DataFrame(X_over_sample_data, columns=X.columns)
    Y_over_sample_data = pd.DataFrame(Y_over_sample_data, columns=Y.columns)
    # Check the class counts of the over-sampled data set
    print("Total number of normal =", (Y_over_sample_data["isFraud"] == 0).sum())
    print("Total number of fraud =", (Y_over_sample_data["isFraud"] == 1).sum())

    # Under-sampling
    number_of_fraud = len(data[data.isFraud == 1])            # number of fraud rows
    fraud_indices = np.array(data[data.isFraud == 1].index)   # index labels of the fraud rows
    normal_indices = np.array(data[data.isFraud == 0].index)  # index labels of the normal rows
    # Randomly pick as many normal rows as there are fraud rows
    random_normal_indices = np.random.choice(normal_indices, number_of_fraud, replace=False)
    # Combine all fraud indices with the randomly chosen normal indices
    under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
    # Pull the under-sampled rows out of the original data (by label, since these values come from data.index)
    under_sample_data = data.loc[under_sample_indices, :]
    X_under_sample = under_sample_data.iloc[:, under_sample_data.columns != "isFraud"]
    Y_under_sample = under_sample_data.iloc[:, under_sample_data.columns == "isFraud"]
    # Check the size of the under-sampled data set
    print("Total number of under_sample_data =", len(under_sample_data))
    print("Total number of normal =", len(under_sample_data[under_sample_data.isFraud == 0]))
    print("Total number of fraud =", len(under_sample_data[under_sample_data.isFraud == 1]))
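How does over-sampling "manufacture" data from the class-1 samples? The core idea behind SMOTE is interpolation: a synthetic sample is placed somewhere on the line segment between a minority sample and one of its minority-class neighbours. The sketch below is a simplified plain-Python illustration of that idea only. It always uses the single nearest neighbour, whereas the library chooses among several nearest neighbours, and the `nearest_neighbor` and `synthesize` helpers and the toy points are made up for this example.

```python
import random


def nearest_neighbor(point, others):
    """Return the point in `others` closest to `point` (squared Euclidean distance)."""
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(point, o)))


def synthesize(minority, n_new, rng):
    """Create n_new synthetic samples on segments between minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbor = nearest_neighbor(base, [p for p in minority if p is not base])
        lam = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


rng = random.Random(0)
fraud = [[1.0, 2.0], [1.5, 2.5], [3.0, 1.0]]  # toy class-1 samples
new_points = synthesize(fraud, n_new=4, rng=rng)

for point in new_points:
    print(point)
```

Every synthetic point lies between two real fraud samples, so the new data stays inside the region the minority class already occupies.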

Standardization

When building a machine learning model, we want the features to be on roughly comparable scales, so that every feature column starts out with a similar level of influence. Take the card1 column: its values are much larger than those of the other columns, so during training the model may treat card1 as very important, when in fact we cannot be sure that it is.

We therefore need to standardize data. sklearn's preprocessing module lets us do this quickly:

    from sklearn.preprocessing import StandardScaler

    cols = X_under_sample.columns  # feature column names, reused later for the test set


    def standardize(frame):
        # StandardScaler rescales each column independently to zero mean and unit variance,
        # so one call on the whole frame matches scaling the columns one by one
        scaled = StandardScaler().fit_transform(frame)
        return pd.DataFrame(scaled, columns=frame.columns, index=frame.index)


    X = standardize(X)
    X_under_sample = standardize(X_under_sample)
    X_over_sample_data = standardize(X_over_sample_data)
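What standardization does to a single column can be shown with a small plain-Python sketch: subtract the column mean and divide by the column's standard deviation. The toy values below are invented; the population standard deviation is used because that is the convention StandardScaler follows. The sketch assumes the column is not constant, since a zero standard deviation would cause a division by zero.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Return (x - mean) / std for every value, using the population std."""
    mean = fmean(column)
    std = pstdev(column)
    return [(x - mean) / std for x in column]


card1 = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]  # large raw scale
flag = [0.0, 1.0, 0.0, 1.0, 1.0]                  # small raw scale

card1_z = standardize(card1)
flag_z = standardize(flag)

print(card1_z)
print(flag_z)
```

After the transformation both columns have mean 0 and standard deviation 1, so neither dominates simply because of its units.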

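Cross-validation

To evaluate the model without touching the test set, we usually split the train data set before training a machine learning model:

We divide the training set into 5 parts and then run five rounds of training:

Round 1: part 1 is held out for validation; parts 2, 3, 4, and 5 are used to train the model;
Round 2: part 2 is held out for validation; parts 1, 3, 4, and 5 are used to train the model;
Round 3: part 3 is held out for validation; parts 1, 2, 4, and 5 are used to train the model;
Round 4: part 4 is held out for validation; parts 1, 2, 3, and 5 are used to train the model;
Round 5: part 5 is held out for validation; parts 1, 2, 3, and 4 are used to train the model.

Finally, we average the results of the five rounds, which gives a more reliable estimate of the model's performance than any single split would. We do not have to implement the splitting ourselves, because sklearn already provides it: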

    from sklearn.model_selection import train_test_split

Put simply, train_test_split takes the X and Y of the training set and returns the split training-set X and Y and validation-set X and Y. Note that it performs a single hold-out split; the K-fold rotation itself is done with KFold in the model code later on.

    # Split the original, full data set
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
    # Split the under-sampled data set
    X_under_sample_train, X_under_sample_test, Y_under_sample_train, Y_under_sample_test = train_test_split(
        X_under_sample, Y_under_sample, test_size=0.2)
    # Split the over-sampled data set
    X_over_sample_train, X_over_sample_test, Y_over_sample_train, Y_over_sample_test = train_test_split(
        X_over_sample_data, Y_over_sample_data, test_size=0.2)
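The five-round rotation described in this section can be sketched in plain Python. This is only an illustration of the bookkeeping, not sklearn's implementation: the `kfold_indices` helper is hypothetical, and it splits the sample indices into consecutive folds without shuffling.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(kfold_indices(n_samples=10, n_splits=5))
for round_no, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"round {round_no}: validate on {val_idx}, train on {train_idx}")
```

Each sample is used for validation exactly once and for training in the other four rounds, which is what makes the averaged score more trustworthy than a single split.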

Model evaluation

A model cannot be judged by its accuracy alone. Here is a simple example:

Suppose that out of 1000 people, 900 are normal and 100 have been defrauded. We draw 100 of these 1000 people to test a model: 90 of them are normal and 10 have been defrauded. Consider a very crude model that classifies every input sample as normal. On these 100 samples it gets the 90 normal ones right and the remaining 10 wrong. Its accuracy is 90%, yet it is clearly not a good model.

So when the samples are imbalanced, accuracy cannot serve as the evaluation criterion for the model. We use recall instead.
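The crude model from this example is easy to check in a few lines of plain Python. Recall is computed here as the share of the actually defrauded people that the model catches; the label lists are built to match the 90/10 split in the example.

```python
# 100 test samples: 90 normal (0) and 10 defrauded (1), as in the example.
y_true = [0] * 90 + [1] * 10
# The crude model labels every sample as normal.
y_pred = [0] * 100

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: of the samples that really are fraud, how many did the model flag?
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

The accuracy looks respectable while the recall is zero: the model has not found a single fraud case.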

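To compute recall, we first need to look at a table:

	Positive class	Negative class
Retrieved	True Positive (TP): a positive classified as positive	False Positive (FP): a negative classified as positive
Not retrieved	False Negative (FN): a positive classified as negative	True Negative (TN): a negative classified as negative

$recall = \frac{TP}{TP+FN}$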

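That may not be easy to grasp, so here is another example:

Again, out of 1000 people, 900 are normal and 100 have been defrauded. This time we have another unreliable model that classifies 50% of the input samples as normal and 50% as abnormal, and our goal is to find everyone who has been defrauded. The model returns 500 people it believes were defrauded, but only 50 of those 500 really were; the other 450 are normal.

Among the 500 people retrieved: 50 abnormal samples are classified as abnormal, TP = 50; 450 normal samples are classified as abnormal, FP = 450.

Among the 500 people not retrieved: 50 abnormal samples are classified as normal, FN = 50; 450 normal samples are classified as normal, TN = 450.

recall = 50 / (50 + 50) = 0.5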
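The same counts can be reproduced from label lists. The sketch below is plain Python with a hypothetical `confusion_counts` helper; the ordering of the 1000 people is arbitrary and chosen only so that the numbers match the worked example.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn


# 1000 people: the first 500 are flagged as fraud, the last 500 as normal.
# 50 real fraud cases fall in each half, matching the worked example.
y_true = [1] * 50 + [0] * 450 + [1] * 50 + [0] * 450
y_pred = [1] * 500 + [0] * 500

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
recall = tp / (tp + fn)

print(f"TP={tp} FP={fp} FN={fn} TN={tn} recall={recall}")
```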

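Regularization penalty

The goal of training a model is really to find the parameters θ. Suppose our computation yields two models, θ1 and θ2. Even though their parameter values are completely different, they may produce the same predictions.

So which of the two models should we choose?

Before answering that, we need to understand one concept: overfitting.

Overfitting is one of the real headaches of machine learning. For example:

Never mind which algorithm this is. Our goal is to classify the red and green points. For most of the data the separation is nearly perfect, but the green region juts out in a corner at the lower left in order to fit one green point sitting in the middle of the red cluster, a point that by normal logic should have been classified as red. That outlying green point may well be erroneous data.

This is overfitting. The model is simply too perfect on the training set, which is not a good thing: we want the model to fit all data, not just the training set.

Overfitting usually occurs when there is not enough training data or when the model is overtrained (overtraining). Regularization was created to address overfitting: by adding an extra term to the loss function, it helps prevent overfitting and improves the model's ability to generalize.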

$\text{objective function} = \text{loss function} + \text{regularization penalty}$

Two regularization penalties are commonly used:

L1 regularization: $\omega(\theta)=||w||_1=\sum_i|w_i|$

L2 regularization: $\omega(\theta)=\frac{1}{2}||w||_2^2=\frac{1}{2}\sum_i w_i^2$
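To see how a penalty helps choose between θ1 and θ2, here is a small plain-Python sketch. The input vector and the two weight vectors are invented for illustration: both give exactly the same linear score, and both have the same L1 penalty, but the L2 penalty is smaller for the vector that spreads its weight evenly.

```python
def l1_penalty(w):
    """Sum of absolute weights."""
    return sum(abs(wi) for wi in w)


def l2_penalty(w):
    """Half the sum of squared weights."""
    return 0.5 * sum(wi * wi for wi in w)


def score(w, x):
    """Linear score w . x, the quantity fed into the sigmoid."""
    return sum(wi * xi for wi, xi in zip(w, x))


x = [1.0, 1.0, 1.0, 1.0]
theta1 = [1.0, 0.0, 0.0, 0.0]      # all weight on one feature
theta2 = [0.25, 0.25, 0.25, 0.25]  # weight spread evenly

print(score(theta1, x), score(theta2, x))
print(l1_penalty(theta1), l1_penalty(theta2))
print(l2_penalty(theta1), l2_penalty(theta2))
```

On this input, an L2-regularized objective would prefer theta2, which relies on all features a little instead of on one feature completely.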

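Logistic regression model

We have already worked through the derivation of the logistic regression algorithm and can implement the computation in code, but rewriting that code every time we want to use logistic regression would obviously be tedious. The sklearn package ships an excellent logistic regression estimator: just pass in the appropriate parameters and you have a trainer ready to go.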
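As a warm-up before the full project code, here is a minimal sketch of the estimator on toy data. The data set is invented for illustration (one feature, negative values in class 0 and positive values in class 1), and the parameter values are arbitrary; only the `LogisticRegression` class and its `C`, `penalty`, and `solver` parameters come from sklearn. Depending on the installed scikit-learn version, some of these parameters may trigger a deprecation warning.

```python
from sklearn.linear_model import LogisticRegression

# Toy one-feature data: negative values are class 0, positive values are class 1.
X_toy = [[-4.0], [-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0], [4.0]]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]

# C is the inverse of the regularization strength: the smaller C, the stronger the penalty.
clf = LogisticRegression(C=10, penalty="l1", solver="liblinear")
clf.fit(X_toy, y_toy)

# Predict the class of one clearly negative and one clearly positive point.
predictions = clf.predict([[-2.5], [2.5]])
print(predictions)
```

The point to remember is that `C` works in the opposite direction from a penalty coefficient, which matters when reading a grid of candidate values.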

    import time
    from sklearn.metrics import recall_score
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LogisticRegression


    def kfold_scores(x_train_data, y_train_data):
        start_time = time.time()
        fold = KFold(3, shuffle=True)  # 3-fold cross-validation
        # Candidate values for C. In sklearn, C is the inverse of the
        # regularization strength, so a larger C means a weaker penalty.
        c_param_range = [10, 100, 1000]
        # Results table, one row per candidate C
        results_table = pd.DataFrame(index=range(len(c_param_range)),
                                     columns=["C_parameter", "Mean recall scores"])
        results_table["C_parameter"] = c_param_range
        # We don't know in advance which C is best, so try each one in a loop
        for index, c_param in enumerate(c_param_range):
            print('-' * 80)
            print("If C parameter =", c_param, end="\n\n")
            # Cross-validate
            recall_accs = []
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear', max_iter=10000)
            for iteration, indices in enumerate(fold.split(x_train_data)):
                # Fit on the training folds
                lr.fit(x_train_data.iloc[indices[0], :], y_train_data.iloc[indices[0], :].values.ravel())
                # Predict on the validation fold
                y_predicted_undersample = lr.predict(x_train_data.iloc[indices[1], :])
                # Compute recall
                recall_acc = recall_score(y_train_data.iloc[indices[1], :].values.ravel(), y_predicted_undersample)
                recall_accs.append(recall_acc)
                print('\tIteration ', iteration, ': recall score = ', recall_acc)
            # Mean recall for this C, stored in the row that holds this C
            results_table.loc[index, "Mean recall scores"] = np.mean(recall_accs)
            print('Mean recall score = ', results_table.loc[index, "Mean recall scores"], end="\n\n")
        print('-' * 80)
        best_index = results_table['Mean recall scores'].astype(float).idxmax()
        best_c_param = results_table.loc[best_index, 'C_parameter']
        print('Best C parameter = ', best_c_param, "\t duration: ", time.time() - start_time)
        # Refit on all the training data with the best C, so the returned model matches best_c_param
        lr = LogisticRegression(C=best_c_param, penalty='l1', solver='liblinear', max_iter=10000)
        lr.fit(x_train_data, y_train_data.values.ravel())
        return lr, best_c_param


    # Evaluate on the original, full data
    # lr1, param1 = kfold_scores(X_train, Y_train)
    del X_train
    del Y_train
    # Evaluate on the under-sampled data
    lr2, param2 = kfold_scores(X_under_sample_train, Y_under_sample_train)
    del X_under_sample_train
    del Y_under_sample_train
    # Evaluate on the over-sampled data
    # lr3, param3 = kfold_scores(X_over_sample_train, Y_over_sample_train)
    del X_over_sample_train
    del Y_over_sample_train

    # Load the test set and apply the same preprocessing steps as for the training set
    test_identity = pd.read_csv('/kaggle/input/ieee-fraud-detection/test_identity.csv')
    test_transaction = pd.read_csv('/kaggle/input/ieee-fraud-detection/test_transaction.csv')
    data = pd.merge(test_transaction, test_identity, on="TransactionID", how="left")
    test_ID = data[["TransactionID"]]
    data.drop("TransactionID", axis=1, inplace=True)  # TransactionID is, frankly, not much use
    data.drop("TransactionDT", axis=1, inplace=True)  # TransactionDT is like a timestamp, also not much use
    # Replace "-" with "_" in the id column names so they match the training columns
    for col in data.columns:
        if col.startswith("id"):
            newcol = col.replace("-", "_")
            data.rename(columns={col: newcol}, inplace=True)
    del test_identity, test_transaction
    # Drop the same high-missing-ratio columns that were dropped from the training set
    data.drop(na_data[na_data['ratio'] > 0.3].index, axis=1, inplace=True)
    for col in data.columns:
        if data[col].dtypes == "object":
            data[col], uniques = pd.factorize(data[col])
        data[col] = data[col].fillna(data[col].mean())
    for col in cols:
        data[col] = StandardScaler().fit_transform(data[col].values.reshape(-1, 1))
    # Pass the columns in the same order the model was trained on
    test_predict = lr2.predict(data[cols])
    submission = pd.DataFrame({"TransactionID": test_ID["TransactionID"], "isFraud": test_predict})
    submission.to_csv("submission1.csv", index=False)

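OK, that is how to do this logistic regression project with sklearn. We can submit the result to the IEEE-CIS Fraud Detection evaluation site and see how it does:

Coco: “See, isn't that a bit better than what you had before?”

Alex: “Whoa, so big...”

Coco: “What's so big?”

Alex: “Uh, the score.”

Coco: “Right, that's roughly the routine. The rest is a matter of details. When you get back, you can also try the over-sampled data and the original data and see what results you get.”

Alex: “No need to go back. I'll try it right here.”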
