Hung-yi Lee Machine Learning Lesson 2: Income Prediction with Logistic Regression
- Dataset overview
- Loading and normalizing the data
- Defining the helper functions
- Cross-validation training (a small experiment with choosing the penalty coefficient in the loss function)
- This post also borrows code from Dr. Huang Haiguang's notes on Andrew Ng's machine learning course; thanks for such thorough code
Dataset overview
The dataset comes from Kaggle and contains US census income data. The goal of this task is to classify, from a number of features, which income bracket each sample belongs to (above $50K or below $50K). The download contains six files.
The first three are the raw dataset; this task uses the last three, which the teaching assistants have already cleaned from the raw data so they can be used for modeling directly.
Loading and normalizing the data
```python
import numpy as np
import pandas as pd

X_train_fpath = '/Users/weiyubai/Desktop/學(xué)習(xí)資料/機(jī)器學(xué)習(xí)相關(guān)/李宏毅數(shù)據(jù)等多個(gè)文件/數(shù)據(jù)/hw2/data/X_train'
Y_train_fpath = '/Users/weiyubai/Desktop/學(xué)習(xí)資料/機(jī)器學(xué)習(xí)相關(guān)/李宏毅數(shù)據(jù)等多個(gè)文件/數(shù)據(jù)/hw2/data/Y_train'

with open(X_train_fpath) as f:
    next(f)  # skip the header line
    X_train = np.array([line.strip('\n').split(',')[1:] for line in f], dtype=float)
with open(Y_train_fpath) as f:
    next(f)  # skip the header line
    Y_train = np.array([line.strip('\n').split(',')[1] for line in f], dtype=float)

def normalize_feature(df):
    """Applies function along input axis (default 0) of DataFrame."""
    return df.apply(lambda column: (column - column.mean()) / (column.std() if column.std() else 1))

X_train_no = normalize_feature(pd.DataFrame(X_train))  # feature scaling (standardization)
# prepend a column of ones to the features as the intercept term
X_train_no = np.concatenate((np.ones(len(X_train)).reshape(len(X_train), 1), X_train_no.values), axis=1)
```
At this point the data loading and preprocessing are done. Inspecting the result, X_train_no is a 54256 × 511 NumPy array.
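A quick sanity check (a minimal sketch added here; the shape values are the ones quoted above) confirms the dimensions and the label balance before moving on:

```python
print(X_train_no.shape)   # expected: (54256, 511) after adding the intercept column
print(Y_train.shape)      # expected: (54256,)
print('positive ratio:', Y_train.mean())  # fraction of samples with income > 50K
```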
Defining the helper functions
Next, define the functions used during gradient descent.
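For reference (notation added here, not in the original post), the functions below implement the L2-regularized cross-entropy loss and its gradient, with m samples, hypothesis h_θ(x) = σ(θᵀx), and θ₀ left unpenalized:

$$
\begin{aligned}
h_\theta(x) &= \sigma(\theta^{T}x) = \frac{1}{1+e^{-\theta^{T}x}} \\
J(\theta) &= \frac{1}{m}\sum_{i=1}^{m}\Big[-y^{(i)}\log h_\theta(x^{(i)}) - (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^{2} \\
\frac{\partial J}{\partial \theta_j} &= \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)} + \frac{\lambda}{m}\theta_j \quad (j \ge 1;\ \theta_0\ \text{is not regularized})
\end{aligned}
$$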
```python
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cross_entropy(X, y, theta):
    e = 1e-9  # clip probabilities away from 0 to avoid log(0)
    return np.mean(-y * np.log(np.maximum(sigmoid(X @ theta), e))
                   - (1 - y) * np.log(np.maximum(1 - sigmoid(X @ theta), e)))

# loss function with the L2 penalty term
def regularized_cost(X, y, theta, l=1):
    '''you don't penalize theta_0'''
    theta_j1_to_n = theta[1:]
    regularized_term = (l / (2 * len(X))) * np.power(theta_j1_to_n, 2).sum()
    return cross_entropy(X, y, theta) + regularized_term

def gradient(X, y, theta):
    return (1 / len(X)) * (X.T @ (sigmoid(X @ theta) - y))

# gradient with the L2 penalty term
def regularized_gradient(X, y, theta, l=1):
    '''still, leave theta_0 alone'''
    theta_j1_to_n = theta[1:]
    regularized_theta = (l / len(X)) * theta_j1_to_n
    # by doing this, no offset is on theta_0
    regularized_term = np.concatenate([np.array([0]), regularized_theta])
    return gradient(X, y, theta) + regularized_term

def predict(x, theta):
    prob = sigmoid(x @ theta)
    return (prob >= 0.5).astype(int)
```
Cross-validation training (a small experiment with choosing the penalty coefficient in the loss function)
This implements the cross-validation model selection that Hung-yi Lee described in the opening lectures: minimize the regularized loss for six different penalty coefficients, run 3-fold cross-validation for each, and compare the average training and validation accuracies.
```python
from sklearn.model_selection import StratifiedKFold

# split the cross-validation folds and store them
n_splits = 3
trainx_set = dict()
validx_set = dict()
trainy_set = dict()
validy_set = dict()

gkf = StratifiedKFold(n_splits=3).split(X=X_train_no, y=Y_train)
for fold, (train_idx, valid_idx) in enumerate(gkf):
    trainx_set[fold] = X_train_no[train_idx]
    trainy_set[fold] = Y_train[train_idx]
    validx_set[fold] = X_train_no[valid_idx]
    validy_set[fold] = Y_train[valid_idx]
```
Saving the folds first and simply looking them up inside the training loop later makes mistakes less likely.
```python
# initialization
lambda_ = [0, 0.1, 1, 10, 100, 1000]
sum_valid_acc = {}
sum_train_acc = {}
sum_train_loss = {}
sum_valid_loss = {}

# training
for l in lambda_:
    # theta = np.ones(X_train.shape[1])
    learning_rate = 0.1
    epochs = 1000
    train_loss = {}
    valid_loss = {}
    train_acc = {}
    valid_acc = {}
    for fold in range(n_splits):
        theta = np.ones(X_train_no.shape[1])
        train_inputs = trainx_set[fold]
        train_outputs = trainy_set[fold]
        valid_inputs = validx_set[fold]
        valid_outputs = validy_set[fold]
        # gradient descent on this fold
        for epoch in range(epochs):
            loss = cross_entropy(train_inputs, train_outputs, theta)
            gra = regularized_gradient(train_inputs, train_outputs, theta, l=l)
            theta = theta - learning_rate * gra
        y_pred_train = predict(train_inputs, theta)
        _acc = 1 - np.abs(train_outputs - y_pred_train).sum() / len(train_outputs)
        train_acc[fold] = _acc
        y_pred_valid = predict(valid_inputs, theta)
        acc_ = 1 - np.abs(valid_outputs - y_pred_valid).sum() / len(valid_outputs)
        valid_acc[fold] = acc_
        train_loss[fold] = loss
        valid_loss[fold] = cross_entropy(valid_inputs, valid_outputs, theta)
        print('training the model with penalty {}, fold {}'.format(l, fold + 1))
    sum_train_loss[l] = [train_loss[fold] for fold in train_loss]
    sum_valid_loss[l] = [valid_loss[fold] for fold in valid_loss]
    sum_valid_acc[l] = [valid_acc[fold] for fold in valid_acc]
    sum_train_acc[l] = [train_acc[fold] for fold in train_acc]
    print('finished training the model with penalty {}'.format(l))
```
Finally, look at the average training and validation accuracy of the six penalty coefficients after 3-fold cross-validation.
```python
for l in lambda_:
    print('lambda = {}: mean train accuracy {}, mean validation accuracy {}'.format(
        l, np.mean(np.array(sum_train_acc[l])), np.mean(np.array(sum_valid_acc[l]))))
```
The results do not differ much across the penalty coefficients, so treat this as a small record of the experiment; overall the validation accuracy is around 87% in every case.
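As a possible follow-up (a minimal sketch, not part of the original post; it reuses the functions and variables defined above and assumes the same learning rate and epoch count), one could pick the lambda with the best mean validation accuracy and retrain on the full training set:

```python
# pick the penalty coefficient with the highest mean validation accuracy
best_l = max(lambda_, key=lambda l: np.mean(sum_valid_acc[l]))
print('best lambda:', best_l)

# retrain on the full (normalized, intercept-augmented) training set with that penalty
theta = np.ones(X_train_no.shape[1])
for epoch in range(1000):
    theta = theta - 0.1 * regularized_gradient(X_train_no, Y_train, theta, l=best_l)

print('full-data training accuracy:', (predict(X_train_no, theta) == Y_train).mean())
```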
This post also uses code from Dr. Huang Haiguang's notes on Andrew Ng's machine learning course; thanks for such thorough code.