
UCI Dataset + Machine Learning + Ten-Fold Cross-Validation

Published: 2023/12/20


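This article is an assignment for this semester's Biomedical Informatics course. It is my first post; I hope to use it to keep a record of my studies and to learn and improve together with everyone.

Assignment requirements: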

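Background:

The UCI datasets are a machine learning repository maintained by the University of California, Irvine. It currently holds 622 datasets and is a commonly used source of standard benchmark data for machine learning. This article processes and analyzes dataset No. 196 from the repository.

sklearn is a very powerful third-party machine learning library for Python. It is built on NumPy and SciPy and is commonly used together with Pandas and Matplotlib. It has six main task modules: classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. Used sensibly, sklearn reduces the amount of code and programming time, leaving more energy for analyzing the data distribution, adjusting models, and tuning hyperparameters.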

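Dataset:

The dataset used in this experiment is the Localization Data for Person Activity Data Set, from the life sciences category of the UCI machine learning repository. It has 164,860 samples and 8 features, so samples × features exceeds 500,000 (164,860 × 8 = 1,318,880). It records attributes such as the position coordinates of the left and right ankles, waist, and chest of five people at different points in time. Based on these attributes, each sample is classified into one of 11 activity states such as walking, lying down, and standing up.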

Download address:

UCI Machine Learning Repository: Localization Data for Person Activity Data Set

Sample record:

A01,020-000-033-111,633790226057226795,27.05.2009 14:03:25:723,4.292500972747803,2.0738532543182373,1.36650812625885,walking

Column 1, SequenceName: {A01,A02,A03,A04,A05,B01,B02,B03,B04,B05,C01,C02,C03,C04,C05,D01,D02,D03,D04,D05,E01,E02,E03,E04,E05} (Nominal). The letters A, B, C, D, and E denote the 5 people.

Column 2, TagIdentificator: {010-000-024-033,020-000-033-111,020-000-032-221,010-000-030-096} (Nominal). Each digit sequence identifies a tag worn on a different part of the body; the four positions are ANKLE_LEFT, ANKLE_RIGHT, CHEST, and BELT.

Column 3, Timestamp: a numeric timestamp.

Column 4, date: the date and time in the form dd.MM.yyyy HH:mm:ss:SSS.

Columns 5 to 7: the x, y, and z coordinates.

Column 8, activity: {walking, falling, lying down, lying, sitting down, sitting, standing up from lying, on all fours, sitting on the ground, standing up from sitting, standing up from sitting on the ground}. This is the person's activity state, with the 11 possible values listed here.
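As a rough illustration of the column layout, the following sketch parses the sample record using only the Python standard library. The helper name `parse_record` and the dictionary layout are illustrative choices, not part of the dataset or of the original code; the date pattern dd.MM.yyyy HH:mm:ss:SSS is mapped here to the `strptime` format `%d.%m.%Y %H:%M:%S:%f`.

```python
from datetime import datetime

# Column names as used when the file is loaded later in this article
COLUMNS = ['SequenceName', 'TagIdentificator', 'Timestamp', 'date',
           'x', 'y', 'z', 'activity']


def parse_record(line):
    """Split one comma-separated record and convert each field (illustrative helper)."""
    fields = [f.strip() for f in line.split(',')]
    rec = dict(zip(COLUMNS, fields))
    rec['Timestamp'] = int(rec['Timestamp'])
    # dd.MM.yyyy HH:mm:ss:SSS; %f accepts the 3-digit millisecond field
    rec['date'] = datetime.strptime(rec['date'], '%d.%m.%Y %H:%M:%S:%f')
    for axis in ('x', 'y', 'z'):
        rec[axis] = float(rec[axis])
    return rec


sample = ('A01,020-000-033-111,633790226057226795,27.05.2009 14:03:25:723,'
          '4.292500972747803,2.0738532543182373,1.36650812625885,walking')
record = parse_record(sample)
# The first letter of SequenceName identifies the person
print(record['SequenceName'][0], record['activity'], record['date'].isoformat())
```

The article itself leaves parsing to pandas and drops the date column, so this is only meant to make the field types concrete.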

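K-fold cross-validation:

In K-fold cross-validation, the data is split into K subsamples. One subsample is held out to validate the model and the other K-1 subsamples are used for training. This is repeated K times so that each subsample is used for validation exactly once, and the K results are averaged (or combined in some other way) to give a single estimate. The advantage of this method is that every sample is used for both training and validation, and each sample is validated exactly once. 10-fold cross-validation is the most common choice.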
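The splitting scheme can be sketched in plain Python as follows. This is a minimal illustration of contiguous, unshuffled folds, not sklearn's implementation; the function name `kfold_indices` and the per-fold scores are made up for the example.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, validation) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # Spread any remainder over the first folds
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


folds = list(kfold_indices(n_samples=10, n_splits=5))
for train, validation in folds:
    print('train:', train, 'validation:', validation)

# Averaging the per-fold scores gives the single estimate.
# These scores are invented numbers, for illustration only.
fold_scores = [0.8, 0.6, 0.7, 0.9, 0.5]
mean_score = sum(fold_scores) / len(fold_scores)
print('mean score:', mean_score)
```

Each index appears in exactly one validation fold, which is the property the paragraph describes.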

Procedure:

Packages imported by the code:

import pandas as pd
from sklearn import preprocessing, tree, model_selection, svm, naive_bayes
import time
from sklearn.model_selection import KFold, train_test_split
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier

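Read in the data, get basic information about the dataset, and display the first five rows. The code is as follows:

path = "./ConfLongDemo_JSI.txt"
# The file has no header row, so the column names are supplied explicitly
df = pd.read_table(path, sep=',', names=['SequenceName', 'TagIdentificator', 'Timestamp', 'date', 'x', 'y', 'z', 'activity'])
print(df.head())
df.info()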

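Output: (figure not preserved)

Timestamp and date are both time data and are strongly correlated, hence redundant, so the date attribute can be dropped and only Timestamp is analyzed.

For preprocessing, the categorical features (SequenceName, TagIdentificator) are selected and converted to dummy variables. The numeric features (Timestamp, x, y, z) are selected and then standardized and normalized. Standardization (Z-score) uses the formula (X - mean) / std, computed separately for each attribute (column). The function sklearn.preprocessing.scale() standardizes given data directly; the code here uses the equivalent StandardScaler. The purpose of standardization is to avoid large differences in the raw feature scales, which would put the learned parameter weights on different scales and make the importance of the features impossible to compare. After standardization the data is normalized (each sample is scaled to unit L2 norm), and the first five rows of the result can be inspected. The code is as follows:

# Build dummy variables for the categorical features
cat_features = ['SequenceName', 'TagIdentificator']
for col in cat_features:
    df[col] = df[col].astype('object')
X_cat = df[cat_features]
X_cat = pd.get_dummies(X_cat)
# print(X_cat.head())

# Standardize, then normalize, the numeric features
scale_X = preprocessing.StandardScaler()
num_features = ['Timestamp', 'x', 'y', 'z']
X_num = scale_X.fit_transform(df[num_features])
X_num = preprocessing.normalize(X_num, norm='l2')
X_num = pd.DataFrame(data=X_num, columns=num_features, index=df.index)
# print(X_num.head())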
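The two numeric steps described here, column-wise Z-score standardization followed by row-wise L2 normalization, can be sketched without sklearn as follows. This is a minimal illustration on toy data; the function names are hypothetical, and the population standard deviation is used on the assumption that it matches what the sklearn scaler computes by default.

```python
import math
import statistics


def standardize_columns(rows):
    """Z-score each column: (x - mean) / std, with the population standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def l2_normalize_rows(rows):
    """Scale each row (one sample) to unit Euclidean length; all-zero rows stay zero."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm if norm else 0.0 for v in row])
    return out


# Toy data: three samples, two features on very different scales
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
z = standardize_columns(data)
unit = l2_normalize_rows(z)
print(z)
print(unit)
```

Note that the first step works per feature while the second works per sample, so after normalization the columns no longer have exactly zero mean and unit variance.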

Merge the data: after preprocessing the numeric data and the categorical data separately, combine them into Sam and display the first 5 rows and the shape of Sam. The code is as follows:

# Merge the data
Sam = pd.concat([X_cat, X_num, df['activity']], axis=1, ignore_index=False)
# print(Sam.head())
# print(Sam.shape)

Set up ten-fold cross-validation, split the dataset, and inspect the resulting index sets. The code is as follows:

# Ten-fold split
kf = KFold(n_splits=10)
for train_index, test_index in kf.split(Sam):
    print("Train:", train_index, "Validation:", test_index)
    print(len(train_index), len(test_index))
    print(Sam.iloc[train_index, 0:34])

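Using the training and test indices produced by the ten-fold split, take the parts of Sam actually used for training and testing. The code is as follows:

# Columns 0 to 32 are the features; column 33 is the activity label
X_train, X_test = Sam.iloc[train_index, 0:33], Sam.iloc[test_index, 0:33]
y_train, y_test = Sam.iloc[train_index, 33], Sam.iloc[test_index, 33]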
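The slice bounds 0:33 and 33 follow from the column counts: 25 dummy columns for SequenceName, 4 for TagIdentificator, and 4 numeric columns, with the label last. The sketch below only reproduces that arithmetic; the `<column>_<value>` naming pattern for the dummy columns is an assumption made for illustration, and the exact column order produced by pandas may differ.

```python
# Category values as listed in the dataset description
sequence_names = [f'{person}{run:02d}' for person in 'ABCDE' for run in range(1, 6)]
tag_ids = ['010-000-024-033', '020-000-033-111', '020-000-032-221', '010-000-030-096']
num_features = ['Timestamp', 'x', 'y', 'z']

# One dummy column per category value (naming pattern assumed for illustration)
dummy_columns = ([f'SequenceName_{s}' for s in sequence_names]
                 + [f'TagIdentificator_{t}' for t in tag_ids])

feature_columns = dummy_columns + num_features
all_columns = feature_columns + ['activity']

n_features = len(feature_columns)
label_position = all_columns.index('activity')
print(n_features, label_position)  # 33 33
```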

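Eight algorithms are trained in total, and the accuracy and training time of each are measured: K-nearest neighbors, decision tree, naive Bayes with a Gaussian prior, support vector machine, tree bagging, random forest, AdaBoost, and GBDT.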

The code using ten-fold cross-validation is as follows:

# Candidate classifiers, keyed by a short name
clfs = {
    'K_neighbor': KNeighborsClassifier(),
    'decision_tree': tree.DecisionTreeClassifier(),
    'naive_gaussian': naive_bayes.GaussianNB(),
    'svm': svm.SVC(),
    'bagging_tree': BaggingClassifier(tree.DecisionTreeClassifier(), max_samples=0.5, max_features=0.5),
    'random_forest': RandomForestClassifier(n_estimators=50),
    'adaboost': AdaBoostClassifier(n_estimators=50),
    'gradient_boost': GradientBoostingClassifier(n_estimators=50, learning_rate=1.0, max_depth=1, random_state=0),
}

# Merge the data
Sam = pd.concat([X_cat, X_num, df['activity']], axis=1, ignore_index=False)
# print(Sam.head())
# print(Sam.shape)

for clf_key in clfs.keys():
    print('\nthe classifier is:', clf_key)
    Sum_score = 0
    Sum_elapsed = 0
    # K-fold split; set n_splits = 5 for the five-fold comparison in the discussion
    n_splits = 10
    kf = KFold(n_splits)
    for train_index, test_index in kf.split(Sam):
        # Columns 0 to 32 are the features; column 33 is the activity label
        X_train, X_test = Sam.iloc[train_index, 0:33], Sam.iloc[test_index, 0:33]
        y_train, y_test = Sam.iloc[train_index, 33], Sam.iloc[test_index, 33]
        clf = clfs[clf_key]
        # Time only the training step
        begin = time.perf_counter()
        clf.fit(X_train, y_train)
        elapsed = time.perf_counter() - begin
        score = clf.score(X_test, y_test)
        Sum_score = Sum_score + score
        Sum_elapsed = Sum_elapsed + elapsed
        print('the score is:', score)
        # print('the elapsed is:', elapsed)
    # Average accuracy and training time over the folds
    print('The score is:', Sum_score / n_splits)
    print('The elapsed is:', Sum_elapsed / n_splits)

The code using a separate held-out test set, as in the reference article, is:

# Merge the data
X = pd.concat([X_cat, X_num], axis=1, ignore_index=False)
y = df['activity']
# print(X.head())
# print(X.shape)
# print(y.shape)

# Split into training and test sets (80% / 20%, shuffled)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.20, shuffle=True)
# print(X_train.shape, y_train.shape)
# print(X_test.shape, y_test.shape)

for clf_key in clfs.keys():
    print('\nthe classifier is:', clf_key)
    clf = clfs[clf_key]
    # Time only the training step
    begin = time.perf_counter()
    clf.fit(X_train, y_train)
    elapsed = time.perf_counter() - begin
    score = clf.score(X_test, y_test)
    print('the score is:', score)
    print('the elapsed is:', elapsed)

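Results and discussion:

Ten-fold cross-validation results: (figure not preserved)

Independent test set results: (figure not preserved)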

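As can be seen, a machine learning model's complexity is not directly proportional to its accuracy.

In ten-fold cross-validation, even the simplest model, KNN, reaches an accuracy of 0.411, whereas the ensemble method adaboost only reaches about 0.327. Of the 8 algorithms, random forest has the highest accuracy, and SVM and Bagging Tree also perform fairly well.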

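Taking KNN as an example, the accuracy of each run of the ten-fold cross-validation was printed separately (figure not preserved). The choice of training and test sets clearly affects the accuracy of the same model: the lowest accuracy is 0.361 and the highest is 0.489, a difference of 0.128. The averaged ten-fold result therefore reflects the model's performance more accurately and is more convincing than a single split.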

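On the independent test set, the simplest model, KNN, reaches an accuracy of 0.789, while ensemble methods such as adaboost and gradient_boost do not even reach 0.4. Of the 8 algorithms, the three most accurate are decision tree, Bagging Tree, and random forest, all above 0.85; tree models may be better suited to this dataset.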

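The same model performs better on the independent test set than in ten-fold cross-validation. Because the independent test set was split 8 : 2, five-fold cross-validation was run to rule out the effect of training and test set sizes: the n_splits parameter of the K-fold cross-validation was set to 5 and the code was run again. The results are as follows: (figure not preserved)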

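Even with the same training and test set sizes (training : test = 8 : 2), the independent test set results are far better than the K-fold cross-validation results. Inspecting ConfLongDemo_JSI.txt shows that the 'activity' field, the output y, is not randomly ordered: its values are clustered, as if sorted by category. K-fold cross-validation as used here splits the dataset by taking whole contiguous blocks in index order, so the 'activity' values in each training and test set follow that ordering, and the two sets can have quite different class distributions. sklearn's train_test_split() function, by contrast, splits the training and test sets randomly, so for this dataset the same model performs better on the independent test set.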
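The effect of ordered labels on contiguous folds can be shown with a toy example. This is an exaggerated sketch, not a reproduction of the experiment: the labels are invented and perfectly sorted by class, and the helper names are hypothetical.

```python
import random


def contiguous_folds(n_samples, n_splits):
    """Unshuffled folds; assumes n_samples is divisible by n_splits."""
    size = n_samples // n_splits
    for k in range(n_splits):
        lo, hi = k * size, (k + 1) * size
        test = list(range(lo, hi))
        train = [i for i in range(n_samples) if i < lo or i >= hi]
        yield train, test


def seen_fraction(labels, train, test):
    """Fraction of test samples whose class also occurs in the training part."""
    train_classes = {labels[i] for i in train}
    return sum(labels[i] in train_classes for i in test) / len(test)


# Toy labels sorted by class: 5 classes, 20 samples each
labels = [c for c in 'ABCDE' for _ in range(20)]
ordered = [seen_fraction(labels, tr, te)
           for tr, te in contiguous_folds(len(labels), 5)]

# The same folds after shuffling the sample order once (fixed seed for repeatability)
shuffled_labels = labels[:]
random.Random(0).shuffle(shuffled_labels)
shuffled = [seen_fraction(shuffled_labels, tr, te)
            for tr, te in contiguous_folds(len(shuffled_labels), 5)]

print('ordered :', ordered)
print('shuffled:', shuffled)
```

With sorted labels, every test fold consists of a class the model never saw in training, so no classifier can score well on it. One option that was not tried in this experiment is passing shuffle=True to sklearn's KFold, which would make the folds comparable to the random split.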

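No model or evaluation method is inherently better or worse; each is only more or less suitable, and which method or model to use depends on the specific problem.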

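As I have only just started, completing this entirely on my own was difficult, so I borrowed (copied) the processing approach from related articles online and made some modifications of my own, mainly in applying ten-fold cross-validation.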

References:

Using sklearn for hands-on machine learning with UCI machine learning datasets, Jianshu

3.1. Cross-validation: evaluating estimator performance, scikit-learn 1.1.dev0 documentation

Cross validation, Tencent Cloud community

Resources and source code:

The data file and .py file have been uploaded to Baidu Netdisk and can be downloaded as needed. Feel free to get in touch and discuss.

Link: https://pan.baidu.com/s/1U6VkgPesWcVGR285iLL-WQ?pwd=8sng Extraction code: 8sng
