

sklearn机器学习常用数据处理总结

Published: 2024/1/23

Data is king → data preprocessing and dataset construction

```python
from IPython.display import Image
%matplotlib inline

# Version check: some of the APIs used below changed in scikit-learn 0.18
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version
```

1. Handling missing values

```python
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

# If you are using Python 2.7, you need
# to convert the string to unicode first:
# csv_data = unicode(csv_data)

df = pd.read_csv(StringIO(csv_data))
df
df.isnull().sum()
```

2. Dropping samples or features with many missing values

```python
df.dropna()        # drops rows containing NaN by default
df.dropna(axis=1)  # drop columns containing NaN

# only drop rows where all columns are NaN
df.dropna(how='all')

# drop rows that have fewer than 4 non-NaN values
df.dropna(thresh=4)

# only drop rows where NaN appears in specific columns (here: 'C')
df.dropna(subset=['C'])
```

3. Imputing missing values
```python
from sklearn.preprocessing import Imputer

# Note: Imputer was removed in scikit-learn 0.22;
# newer versions use sklearn.impute.SimpleImputer instead
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr = imr.fit(df)
imputed_data = imr.transform(df.values)
imputed_data
df.values
```

4. Handling categorical data
```python
import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'price', 'classlabel']
df
```

5. Mapping ordinal features
```python
size_mapping = {'XL': 3,
                'L': 2,
                'M': 1}

df['size'] = df['size'].map(size_mapping)
df

inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)
```

6. Encoding class labels
```python
import numpy as np

class_mapping = {label: idx for idx, label
                 in enumerate(np.unique(df['classlabel']))}
class_mapping

df['classlabel'] = df['classlabel'].map(class_mapping)
df

inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df

from sklearn.preprocessing import LabelEncoder

class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y
class_le.inverse_transform(y)
```

7. One-hot encoding nominal features
```python
X = df[['color', 'size', 'price']].values

color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
X
# array([[1, 1, 10.1],
#        [2, 2, 13.5],
#        [0, 3, 15.3]], dtype=object)
```
```python
from sklearn.preprocessing import OneHotEncoder

# Note: the categorical_features parameter was removed in scikit-learn 0.22;
# newer versions select the column via sklearn.compose.ColumnTransformer
ohe = OneHotEncoder(categorical_features=[0])
ohe.fit_transform(X).toarray()
# array([[ 0. ,  1. ,  0. ,  1. , 10.1],
#        [ 0. ,  0. ,  1. ,  2. , 13.5],
#        [ 1. ,  0. ,  0. ,  3. , 15.3]])
```
```python
pd.get_dummies(df[['price', 'color', 'size']])
```
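To make the `get_dummies` behavior concrete, here is a minimal, self-contained sketch. The DataFrame below is a hypothetical stand-in for the `df` above after the ordinal `size` mapping (not taken from the original article): only the string column `color` gets expanded into dummy columns, while the numeric columns pass through unchanged.

```python
import pandas as pd

# Hypothetical reconstruction: 'color' is a string column,
# 'size' and 'price' are already numeric
df = pd.DataFrame([['green', 1, 10.1],
                   ['red', 2, 13.5],
                   ['blue', 3, 15.3]],
                  columns=['color', 'size', 'price'])

# get_dummies one-hot encodes only object/categorical columns
# and leaves numeric columns untouched
dummies = pd.get_dummies(df[['price', 'color', 'size']])
print(sorted(dummies.columns))
# ['color_blue', 'color_green', 'color_red', 'price', 'size']
```

If the redundant column is a concern (the three color dummies sum to one), `pd.get_dummies(..., drop_first=True)` drops one level per feature.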
8. Scaling continuous features
```python
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)
```


```python
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
```

A visual example:
```python
ex = pd.DataFrame([0, 1, 2, 3, 4, 5])

# standardize
ex[1] = (ex[0] - ex[0].mean()) / ex[0].std(ddof=0)

# Please note that pandas uses ddof=1 (sample standard deviation)
# by default, whereas NumPy's std method and the StandardScaler
# use ddof=0 (population standard deviation)

# normalize
ex[2] = (ex[0] - ex[0].min()) / (ex[0].max() - ex[0].min())
ex.columns = ['input', 'standardized', 'normalized']
ex
```
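The manual formulas above should agree with the scikit-learn transformers. A quick self-contained check on the same `[0..5]` column (a sketch, assuming scikit-learn and NumPy are installed):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.arange(6, dtype=float).reshape(-1, 1)  # the column [0, 1, 2, 3, 4, 5]

# StandardScaler computes (x - mean) / std with ddof=0
std_manual = (X - X.mean()) / X.std(ddof=0)
std_sklearn = StandardScaler().fit_transform(X)

# MinMaxScaler computes (x - min) / (max - min)
mm_manual = (X - X.min()) / (X.max() - X.min())
mm_sklearn = MinMaxScaler().fit_transform(X)

print(np.allclose(std_manual, std_sklearn),
      np.allclose(mm_manual, mm_sklearn))
# True True
```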
9. Feature selection

Selection via the truncating effect of L1 regularization: the weights of unimportant features are driven to zero, so the weight matrix becomes sparse.

```python
from sklearn.linear_model import LogisticRegression

# Note: newer scikit-learn versions require an L1-capable solver,
# e.g. LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
lr = LogisticRegression(penalty='l1', C=0.1)
lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy:', lr.score(X_test_std, y_test))
lr.intercept_
lr.coef_
```

10. Ranking feature importance with random forests
```python
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

feat_labels = df_wine.columns[1:]

forest = RandomForestClassifier(n_estimators=10000,
                                random_state=0,
                                n_jobs=-1)
forest.fit(X_train, y_train)
importances = forest.feature_importances_

indices = np.argsort(importances)[::-1]

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[indices[f]],
                            importances[indices[f]]))

plt.title('Feature Importances')
plt.bar(range(X_train.shape[1]),
        importances[indices],
        color='lightblue', align='center')
plt.xticks(range(X_train.shape[1]),
           feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
#plt.savefig('./random_forest.png', dpi=300)
plt.show()

if Version(sklearn_version) < '0.18':
    X_selected = forest.transform(X_train, threshold=0.15)
else:
    from sklearn.feature_selection import SelectFromModel
    sfm = SelectFromModel(forest, threshold=0.15, prefit=True)
    X_selected = sfm.transform(X_train)

X_selected.shape
```

Now, let's print the 3 features that met the threshold criterion for feature selection that we set earlier (note that this code snippet does not appear in the actual book but was added to this notebook later for illustrative purposes):
```python
for f in range(X_selected.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[indices[f]],
                            importances[indices[f]]))
```
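Since `df_wine`, `X_train`, and `y_train` are defined elsewhere in the original notebook, here is a self-contained sketch of the same `SelectFromModel` pattern on synthetic data (the dataset and hyperparameters below are illustrative stand-ins, not the article's wine data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the wine data used above
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# keep features whose importance is at or above the median importance
sfm = SelectFromModel(forest, threshold='median', prefit=True)
X_selected = sfm.transform(X)
print(X_selected.shape)
```

With a `'median'` threshold, roughly half of the ten features survive; a fixed float such as `threshold=0.15` behaves like the snippet above.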

Summary

The above covers the common scikit-learn data-processing routines collected here; hopefully it helps with the problems you've run into.
