Machine Learning in Practice: Wheat Seeds (Wrapping Functions for Parameter Tuning, Standardization, and Plotting Data Distributions)

Published: 2023/12/20

Disclaimer: this content is not original; the code comes from 葁sir.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline  # Jupyter magic; only valid inside a notebook

# Load the dataset
seeds = pd.read_csv('data/seeds.csv',sep = '\t',header = None)
seeds.head()

       0      1       2      3      4      5      6     7
0  15.26  14.84  0.8710  5.763  3.312  2.221  5.220  Kama
1  14.88  14.57  0.8811  5.554  3.333  1.018  4.956  Kama
2  14.29  14.09  0.9050  5.291  3.337  2.699  4.825  Kama
3  13.84  13.94  0.8955  5.324  3.379  2.259  4.805  Kama
4  16.14  14.99  0.9034  5.658  3.562  1.355  5.175  Kama
# Check how many wheat varieties there are
seeds[7].value_counts()

Kama        70
Rosa        70
Canadian    70
Name: 7, dtype: int64

seeds[7].value_counts().plot(kind = 'bar')

<AxesSubplot:>

# Alternatively, use seaborn
import seaborn as sns
sns.set()

# Commonly used seaborn plots
# barplot()
# scatterplot()
# swarmplot()
# boxplot()
# violinplot()
# countplot()
# pairplot()
# heatmap()

from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso,RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler,StandardScaler

X = seeds.iloc[:,:7].copy()
# X = seeds.values[:,:7].copy()  # this also works, but the copy is a numpy.ndarray
X.shape
(210, 7)

X

         0      1       2      3      4      5      6
0    15.26  14.84  0.8710  5.763  3.312  2.221  5.220
1    14.88  14.57  0.8811  5.554  3.333  1.018  4.956
2    14.29  14.09  0.9050  5.291  3.337  2.699  4.825
3    13.84  13.94  0.8955  5.324  3.379  2.259  4.805
4    16.14  14.99  0.9034  5.658  3.562  1.355  5.175
..     ...    ...     ...    ...    ...    ...    ...
205  12.19  13.20  0.8783  5.137  2.981  3.631  4.870
206  11.23  12.88  0.8511  5.140  2.795  4.325  5.003
207  13.20  13.66  0.8883  5.236  3.232  8.315  5.056
208  11.84  13.21  0.8521  5.175  2.836  3.598  5.044
209  12.30  13.34  0.8684  5.243  2.974  5.637  5.063

210 rows × 7 columns

y = seeds.iloc[:,-1].copy()
# y = seeds.values[:,-1].copy()
y.shape
(210,)

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)

# Wrap the exploratory KNN run in a function
def knn_score(k,X,y):
    # Build the estimator
    knn = KNeighborsClassifier(n_neighbors = k)
    scores = []
    train_scores = []
    for i in range(100):
        # Split. Note: random_state=1 is fixed, so all 100 iterations produce
        # the same split and the mean equals the score of a single run.
        X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)
        # Train
        knn.fit(X_train,y_train)
        # Evaluate on the test set
        scores.append(knn.score(X_test,y_test))
        # Training-set (empirical) score
        train_scores.append(knn.score(X_train,y_train))
    return np.array(scores).mean(),np.array(train_scores).mean()

# Parameter tuning
result_dict = {}
k_list = [1,3,5,7,9,11]
for k in k_list:
    score,train_score = knn_score(k,X,y)
    result_dict[k] = [score,train_score]
result_dict

{1: [0.9047619047619047, 1.0],
 3: [0.9047619047619047, 0.9642857142857139],
 5: [0.8571428571428572, 0.9285714285714287],
 7: [0.8571428571428572, 0.9345238095238096],
 9: [0.8809523809523812, 0.9226190476190478],
 11: [0.8809523809523812, 0.9226190476190478]}

pd.DataFrame(result_dict).T

           0         1
1   0.904762  1.000000
3   0.904762  0.964286
5   0.857143  0.928571
7   0.857143  0.934524
9   0.880952  0.922619
11  0.880952  0.922619
result = pd.DataFrame(result_dict).T.copy()
result.columns = ['Test','Train']
result

        Test     Train
1   0.904762  1.000000
3   0.904762  0.964286
5   0.857143  0.928571
7   0.857143  0.934524
9   0.880952  0.922619
11  0.880952  0.922619
result.plot()
plt.xticks(k_list)
plt.show()
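The knn_score function above reuses random_state=1 for every split, so its 100 iterations all see the same train/test partition. The following is a minimal sketch of one way to average over genuinely different splits. It is not the original author's code: the function name knn_score_varied, the n_repeats parameter, and the make_blobs stand-in data (used only so the example runs without data/seeds.csv) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the seeds data (assumption): 210 samples, 7 features, 3 classes.
X_demo, y_demo = make_blobs(n_samples=210, n_features=7, centers=3,
                            cluster_std=3.0, random_state=0)

def knn_score_varied(k, X, y, n_repeats=20):
    """Average test and train accuracy over n_repeats different random splits."""
    test_scores, train_scores = [], []
    for seed in range(n_repeats):
        # A different random_state per iteration yields a different split each time.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        test_scores.append(knn.score(X_te, y_te))
        train_scores.append(knn.score(X_tr, y_tr))
    return np.mean(test_scores), np.mean(train_scores)

results = {k: knn_score_varied(k, X_demo, y_demo) for k in [1, 3, 5, 7, 9, 11]}
for k, (test_acc, train_acc) in results.items():
    print(k, round(test_acc, 3), round(train_acc, 3))
```

Seeding each split with the loop index keeps the run reproducible while still varying the partition, which is what makes the averaged score less dependent on one lucky or unlucky split.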

Advanced version

# z-score: (x - x.mean) / x.std, rescales to mean 0 and std 1 (the N(0,1) scale)
# MinMaxScaler: (x - x.min) / (x.max - x.min), rescales to the range 0-1
# Check for outliers, missing values, and the data distribution
X.shape
(210, 7)

# View the summary statistics
X.describe().T

   count       mean       std      min       25%       50%        75%      max
0  210.0  14.847524  2.909699  10.5900  12.27000  14.35500  17.305000  21.1800
1  210.0  14.559286  1.305959  12.4100  13.45000  14.32000  15.715000  17.2500
2  210.0   0.870999  0.023629   0.8081   0.85690   0.87345   0.887775   0.9183
3  210.0   5.628533  0.443063   4.8990   5.26225   5.52350   5.979750   6.6750
4  210.0   3.258605  0.377714   2.6300   2.94400   3.23700   3.561750   4.0330
5  210.0   3.700201  1.503557   0.7651   2.56150   3.59900   4.768750   8.4560
6  210.0   5.408071  0.491480   4.5190   5.04500   5.22300   5.877000   6.5500
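As a small check on the two scaling formulas noted in the comments, the sketch below compares a hand-written z-score and min-max transform with scikit-learn's StandardScaler and MinMaxScaler. The five input values are borrowed from the min, quartile, and max figures of column 0 in the describe() output purely as sample inputs; they are not the full column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative inputs only (assumption): five values shaped as one 2-D column.
col = np.array([[10.59], [12.27], [14.355], [17.305], [21.18]])

# z-score by hand; np.std defaults to the population formula (ddof=0),
# which is also what StandardScaler uses.
z_manual = (col - col.mean()) / col.std()
z_sklearn = StandardScaler().fit_transform(col)

# Min-max scaling by hand versus MinMaxScaler.
mm_manual = (col - col.min()) / (col.max() - col.min())
mm_sklearn = MinMaxScaler().fit_transform(col)

print(np.allclose(z_manual, z_sklearn))    # True
print(np.allclose(mm_manual, mm_sklearn))  # True
```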
def standard_X(X):
    X_copy = X.copy()  # Work on a copy of the data
    for col_name in X_copy.columns:  # Iterate over the column names
        col_data = X_copy[[col_name]]  # Double brackets keep the column two-dimensional, as the scaler requires
        # fit_transform
        stand_data = StandardScaler().fit_transform(col_data.values)  # Standardize
        X_copy[col_name] = stand_data  # Replace the column with its standardized values
    return X_copy

standard_X(X).describe([0.01,0.25,0.5,0.75,0.99]).T

   count          mean       std       min        1%       25%       50%       75%       99%       max
0  210.0 -5.392512e-17  1.002389 -1.466714 -1.397504 -0.887955 -0.169674  0.846599  2.072913  2.181534
1  210.0  9.146123e-17  1.002389 -1.649686 -1.474607 -0.851433 -0.183664  0.887069  2.023505  2.065260
2  210.0  1.322091e-15  1.002389 -2.668236 -2.588824 -0.598079  0.103993  0.711677  1.678118  2.006586
3  210.0 -2.182910e-15  1.002389 -1.650501 -1.464372 -0.828682 -0.237628  0.794595  2.154459  2.367533
4  210.0 -2.030122e-16  1.002389 -1.668209 -1.634930 -0.834907 -0.057335  0.804496  1.936725  2.055112
5  210.0 -3.679596e-16  1.002389 -1.956769 -1.857934 -0.759148 -0.067469  0.712379  2.519905  3.170590
6  210.0 -1.337554e-16  1.002389 -1.813288 -1.633810 -0.740495 -0.377459  0.956394  2.130797  2.328998
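Two details of the standardized output may be worth spelling out. StandardScaler already works column by column, so fitting it once on the whole frame should give the same result as the per-column loop. The std of 1.002389 rather than exactly 1 arises because StandardScaler divides by the population standard deviation (ddof=0), while pandas describe() reports the sample standard deviation (ddof=1); for 210 rows the ratio is sqrt(210/209) ≈ 1.002389. The sketch below checks both points on random stand-in data; the name standard_X_loop and the generated frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Random stand-in frame with the same shape as X (assumption).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(loc=5.0, scale=2.0, size=(210, 7)))

def standard_X_loop(X):
    """Per-column standardization, mirroring the loop-based function."""
    X_copy = X.copy()
    for col_name in X_copy.columns:
        scaled = StandardScaler().fit_transform(X_copy[[col_name]].values)
        X_copy[col_name] = scaled.ravel()
    return X_copy

# One call on the whole frame: each column is scaled independently.
X_all = pd.DataFrame(StandardScaler().fit_transform(X_demo),
                     columns=X_demo.columns, index=X_demo.index)

same_result = np.allclose(standard_X_loop(X_demo), X_all)
expected_std = np.sqrt(210 / 209)   # sample std of a column with population std 1
pandas_std = X_all.std()            # pandas uses ddof=1 by default

print(same_result)
print(round(expected_std, 6), pandas_std.round(6).tolist())
```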

Inspecting the data distribution

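Looking at the 99th percentile in the describe() output for the standardized data, the columns labelled 2 and 5 differ markedly from the rest: column 2 has the lowest 99th percentile (1.678) and column 5 the highest (2.520), with column 5's maximum reaching 3.171.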
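One way to make this kind of comparison explicit is to compute, per column, the gap between the maximum and the 99th percentile of the standardized values; a large gap suggests a long upper tail or a few extreme points. The sketch below does this on a small random frame with one planted extreme value. The column names, the planted value, and the choice of max minus 99th percentile as the measure are all assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Random stand-in data (assumption) with one deliberately extreme value in column "c".
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(210, 3)), columns=["a", "b", "c"])
demo.loc[0, "c"] = 8.0

# Standardize with the population std (ddof=0), as StandardScaler does.
z = (demo - demo.mean()) / demo.std(ddof=0)

# Gap between the maximum and the 99th percentile, largest first.
q99 = z.quantile(0.99)
tail_gap = (z.max() - q99).sort_values(ascending=False)
print(tail_gap)
```

Columns at the top of tail_gap are the ones whose distribution plots deserve a closer look.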

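stand_X = standard_X(X)
for col_name in stand_X.columns:
    # distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is the suggested replacement
    sns.distplot(stand_X[col_name])
    plt.title(col_name)
    plt.show()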

Binning

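Consider four incomes: 10, 3000, 5000, 10000000.

Using 5000 as the cut point, the values are split into low income and high income and mapped to category codes. This reduces the differences between individual data points.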
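The income example can be sketched with pd.cut and explicit bin edges. Treating an income of exactly 5000 as high income (hence right=False) is an assumption, chosen so the four values map to 0 0 1 1; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

income = pd.Series([10, 3000, 5000, 10000000])

# Two bins: [-inf, 5000) -> 0 (low income), [5000, inf) -> 1 (high income).
# right=False makes each bin closed on the left, so 5000 falls in the high bin.
level = pd.cut(income, bins=[-np.inf, 5000, np.inf], right=False, labels=[0, 1])

print(level.tolist())  # [0, 0, 1, 1]
```

After the mapping, 5000 and 10000000 carry the same code, so the huge numeric distance between them no longer dominates a distance-based model such as KNN.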

# The income example above maps to: 0 0 1 1
X[0] = pd.cut(X[0],bins = 5,labels = [0,1,2,3,4])  # Cut the data into bins to help prevent overfitting
X[0]

0      2
1      2
2      1
3      1
4      2
      ..
205    0
206    0
207    1
208    0
209    0
Name: 0, Length: 210, dtype: category
Categories (5, int64): [0 < 1 < 2 < 3 < 4]

sns.countplot(x = X[0])  # Pass the data as keyword x; positional use triggers a FutureWarning in seaborn 0.11

<AxesSubplot:xlabel='0', ylabel='count'>

# Bin every column
for col_name in X.columns:
    X[col_name] = pd.cut(X[col_name],bins = 5,labels = [0,1,2,3,4])
X

     0  1  2  3  4  5  6
0    2  2  2  2  2  0  1
1    2  2  3  1  2  0  1
2    1  1  4  1  2  1  0
3    1  1  3  1  2  0  0
4    2  2  4  2  3  0  1
..  .. .. .. .. .. .. ..
205  0  0  3  0  1  1  0
206  0  0  1  0  0  2  1
207  1  1  3  0  2  4  1
208  0  0  1  0  0  1  1
209  0  0  2  0  1  3  1

210 rows × 7 columns

knn = KNeighborsClassifier()
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state = 1)
knn.fit(X_train,y_train)

KNeighborsClassifier()

knn.score(X_train,y_train)

0.9166666666666666

knn.score(X_test,y_test)

0.9523809523809523
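In the code above the bin edges are computed from all 210 rows before the train/test split, so the test rows influence the binning. A possible alternative, sketched here under stated assumptions, is to put scikit-learn's KBinsDiscretizer in a pipeline so that the edges are learned from the training rows only. This is a different technique from pd.cut, although strategy="uniform" also produces equal-width bins; the make_blobs data is a stand-in for the seeds file, and the scores it produces are not comparable to those reported above.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-in for the seeds data (assumption).
X_demo, y_demo = make_blobs(n_samples=210, n_features=7, centers=3,
                            cluster_std=3.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1)

# Five equal-width bins per feature, encoded as ordinal codes 0-4,
# followed by KNN with its default settings.
model = make_pipeline(
    KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform"),
    KNeighborsClassifier(),
)
model.fit(X_tr, y_tr)  # bin edges come from the training rows only

train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)
print(round(train_acc, 3), round(test_acc, 3))
```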
