Handling Missing Data Values

Published: 2025/3/8
In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

In [2]:
# Toy frame with one missing value in each column
df = pd.DataFrame()
df['x0'] = [1, 2, 3, 4, 5, np.nan]
df['x1'] = [np.nan, 7, 8, 1, 2, 5]
df

Out[2]:
    x0   x1
0  1.0  NaN
1  2.0  7.0
2  3.0  8.0
3  4.0  1.0
4  5.0  2.0
5  NaN  5.0

Deletion

In [3]:
# Drop every row that contains at least one NaN
df_new = df.dropna()
df_new

Out[3]:
    x0   x1
1  2.0  7.0
2  3.0  8.0
3  4.0  1.0
4  5.0  2.0

Imputation

Statistical method

Fill each missing value with a statistic computed from the other rows or columns, such as the mean, the median, or the sum.

In [4]:
df_new = df.copy()
# x0 is filled with its mean, x1 with its median
df_new['x0'] = df_new['x0'].fillna(df_new['x0'].mean())
df_new['x1'] = df_new['x1'].fillna(df_new['x1'].median())
#df_new['x1'] = df_new['x1'].fillna(df_new['x1'].sum())
df_new

Out[4]:
    x0   x1
0  1.0  5.0
1  2.0  7.0
2  3.0  8.0
3  4.0  1.0
4  5.0  2.0
5  3.0  5.0

When the mean is taken, missing entries are not counted: the sum is divided by the number of non-missing values only. For x1 this gives (7 + 8 + 1 + 2 + 5) / 5 = 4.6.

In [5]:
df_new = np.array(df)
# SimpleImputer learns one mean per column, ignoring NaN
mean_imputer = SimpleImputer(strategy='mean')
mean_imputer = mean_imputer.fit(df_new)
imputed_df = mean_imputer.transform(df_new)
print(imputed_df)

[[1.  4.6]
 [2.  7. ]
 [3.  8. ]
 [4.  1. ]
 [5.  2. ]
 [3.  5. ]]

Clustering method

This approach uses unsupervised clustering. All samples are first partitioned with K-means, and the missing values in each cluster are then filled with that cluster's mean. In essence, it still fills missing values by looking for similar samples.

In [6]:
# TODO

Model-based method

Train a model on the other variables to predict the missing values.

Random forest

In [7]:
age_df = pd.DataFrame()
age_df['age'] = [np.nan, 7, 8, 9, np.nan, 6, 6, 7, 9]
age_df['height'] = [1, 2, 3, 4, 3, 1, 1, 2, 4]
age_df

Out[7]:
   age  height
0  NaN       1
1  7.0       2
2  8.0       3
3  9.0       4
4  NaN       3
5  6.0       1
6  6.0       1
7  7.0       2
8  9.0       4

In [8]:
# Split the rows by whether age is observed
known_age = age_df[age_df.age.notnull()].values
unknown_age = age_df[age_df.age.isnull()].values
unknown_age

Out[8]:
array([[nan,  1.],
       [nan,  3.]])

In [9]:
X = known_age[:, 1:]  # feature: height
y = known_age[:, 0]   # target: age

In [10]:
# The classifier treats each distinct age as a class label;
# the commented-out regressor is the alternative for a continuous target.
rfr = RandomForestClassifier(random_state=0, n_estimators=2000, n_jobs=-1)
#rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
rfr.fit(X, y)

Out[10]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=2000,
                       n_jobs=-1, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

In [11]:
# Predict age from height for the rows where age is missing
predictedAges = rfr.predict(unknown_age[:, 1:])
age_df["new_age"] = age_df["age"]
age_df.loc[(age_df.age.isnull()), 'new_age'] = predictedAges
age_df

Out[11]:
   age  height  new_age
0  NaN       1      6.0
1  7.0       2      7.0
2  8.0       3      8.0
3  9.0       4      9.0
4  NaN       3      8.0
5  6.0       1      6.0
6  6.0       1      6.0
7  7.0       2      7.0
8  9.0       4      9.0
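The clustering method is described in the notebook, but its cell is left as a TODO. The following is one possible sketch of it, not the author's implementation. It makes two assumptions: because KMeans cannot handle NaN, the rows are clustered on a temporary copy filled with column means, and a cluster that has no observed value in a column falls back to the global column mean. The helper name `kmeans_impute` and the choice of `n_clusters=2` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def kmeans_impute(df, n_clusters=2, random_state=0):
    """Fill NaNs with the mean of the K-means cluster each row falls in.

    Hypothetical helper for illustration only.
    """
    # Assumption: cluster on a temporary mean-filled copy, since KMeans
    # rejects input that contains NaN.
    col_means = df.mean()
    filled = df.fillna(col_means)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(filled)

    result = df.copy()
    for label in np.unique(labels):
        in_cluster = labels == label
        # Per-column mean of the observed values in this cluster; fall back
        # to the global column mean if the cluster has none for a column.
        cluster_means = df[in_cluster].mean().fillna(col_means)
        result.loc[in_cluster] = df[in_cluster].fillna(cluster_means)
    return result, labels


# Same toy frame as in the notebook
df = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5, np.nan],
    'x1': [np.nan, 7, 8, 1, 2, 5],
})
imputed, labels = kmeans_impute(df, n_clusters=2)
print(imputed)
```

Clustering on a mean-filled copy biases the rows with missing values toward the center of the data, so on real data it may be worth clustering only on fully observed columns, or repeating the cluster-and-fill step a few times.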
