

Kaggle: House Prices: Advanced Regression Techniques



This notebook is from https://www.kaggle.com/neviadomski/how-to-get-to-top-25-with-simple-model-sklearn

Workflow:

1. Import the data and inspect its structure and missing values (the full cell appears as In [9] below).
The key point is the idiom for summarizing missing values:
NAs = pd.concat([train.isnull().sum(), test.isnull().sum()], axis=1, keys=['train', 'test'])
NAs[NAs.sum(axis=1) > 0]

2. Preprocess the data (drop useless features, convert feature types, fill missing values, engineer new features, standardize numeric feature values, convert categoricals to dummies).
Q: Which features need type conversion?
A: Integer features whose values merely name a category, so the magnitude itself carries no meaning; these should be converted to dummies.
The key technique to study is converting features to dummies by hand (the situation here is slightly more complicated, because one logical feature can span two columns, e.g. Condition1 and Condition2); see the sketch after this list.

3. Shuffle the data, then split off training and test sets.

4. Build several individual models.

5. Ensemble the models.
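
The sketch below illustrates the two-column dummy idea from step 2 on toy data (the DataFrame here is invented for illustration; the notebook applies the same pattern to Condition1/Condition2 in cell In [17] and to Exterior1st/Exterior2nd in In [19] below):

import numpy as np
import pandas as pd

# Toy frame: one logical feature ("condition") spread over two columns
df = pd.DataFrame({'Condition1': ['Norm', 'Feedr', 'Norm'],
                   'Condition2': ['Artery', 'Norm', 'Norm']})

# Collect every category that appears in either column
conditions = sorted(set(df['Condition1']) | set(df['Condition2']))

# Start from an all-zero indicator matrix: one row per sample, one column per category
dummies = pd.DataFrame(np.zeros((len(df), len(conditions))),
                       index=df.index, columns=conditions)

# For each row, switch on the indicators named by BOTH columns at once
for i, cond in enumerate(zip(df['Condition1'], df['Condition2'])):
    dummies.loc[df.index[i], list(cond)] = 1

print(dummies)  # e.g. row 0 has Norm=1 and Artery=1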

Questions:

1. How do you judge whether a feature is useless?

2. What are the options for ensembling models, and why is the final prediction (np.exp(GB_model.predict(test_features)) + np.exp(ENS_model.predict(test_features_std))) / 2?

3. Why does a skewed label distribution need to be transformed?
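
Questions 2 and 3 are two sides of the same decision: the labels are log-transformed before training (cell In [14] below), so each model predicts log-prices; np.exp maps those predictions back to the price scale, and the two price-scale predictions are then averaged. A minimal sketch on synthetic data (the log-normal sample and the two toy "model outputs" are invented for illustration):

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# A right-skewed synthetic 'price' sample, much like SalePrice
prices = rng.lognormal(mean=12, sigma=0.4, size=1000)
print(skew(prices))          # clearly positive: long right tail
print(skew(np.log(prices)))  # near zero: the log transform removes the skew

# Two hypothetical models predicting on the log scale
log_pred_a = np.log(prices[:5]) + rng.normal(0, 0.05, 5)
log_pred_b = np.log(prices[:5]) + rng.normal(0, 0.05, 5)

# Undo the log transform first, then average on the original price scale
final = (np.exp(log_pred_a) + np.exp(log_pred_b)) / 2
print(final)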

In [33]:
# Kaggle: House Prices: Advanced Regression Techniques
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import ensemble, linear_model, tree
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.utils import shuffle

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

train = pd.read_csv('downloads/train.csv')
test = pd.read_csv('downloads/test.csv')

In [8]:
train.head()

Out[8]:
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0   1          60       RL         65.0     8450   Pave   NaN      Reg         Lvl    AllPub ...        0    NaN   NaN         NaN       0      2   2008       WD        Normal    208500
1   2          20       RL         80.0     9600   Pave   NaN      Reg         Lvl    AllPub ...        0    NaN   NaN         NaN       0      5   2007       WD        Normal    181500
2   3          60       RL         68.0    11250   Pave   NaN      IR1         Lvl    AllPub ...        0    NaN   NaN         NaN       0      9   2008       WD        Normal    223500
3   4          70       RL         60.0     9550   Pave   NaN      IR1         Lvl    AllPub ...        0    NaN   NaN         NaN       0      2   2006       WD       Abnorml    140000
4   5          60       RL         84.0    14260   Pave   NaN      IR1         Lvl    AllPub ...        0    NaN   NaN         NaN       0     12   2008       WD        Normal    250000

5 rows × 81 columns

In [9]:
# Check for missing values
NAs = pd.concat([train.isnull().sum(), test.isnull().sum()], axis=1, keys=['train', 'test'])  # sum() defaults to axis=0, i.e. down the rows
NAs[NAs.sum(axis=1) > 0]  # show only features that have missing values

Out[9]:
              train    test
Alley          1369  1352.0
BsmtCond         37    45.0
BsmtExposure     38    44.0
BsmtFinSF1        0     1.0
BsmtFinSF2        0     1.0
BsmtFinType1     37    42.0
BsmtFinType2     38    42.0
BsmtFullBath      0     2.0
BsmtHalfBath      0     2.0
BsmtQual         37    44.0
BsmtUnfSF         0     1.0
Electrical        1     0.0
Exterior1st       0     1.0
Exterior2nd       0     1.0
Fence          1179  1169.0
FireplaceQu     690   730.0
Functional        0     2.0
GarageArea        0     1.0
GarageCars        0     1.0
GarageCond       81    78.0
GarageFinish     81    78.0
GarageQual       81    78.0
GarageType       81    76.0
GarageYrBlt      81    78.0
KitchenQual       0     1.0
LotFrontage     259   227.0
MSZoning          0     4.0
MasVnrArea        8    15.0
MasVnrType        8    16.0
MiscFeature    1406  1408.0
PoolQC         1453  1456.0
SaleType          0     1.0
TotalBsmtSF       0     1.0
Utilities         0     2.0
In [10]:
# Print R2 and RMSE scores
def print_score(prediction, labels):
    print('R2: {}'.format(r2_score(prediction, labels)))
    print('RMSE: {}'.format(np.sqrt(mean_squared_error(prediction, labels))))

# Evaluate a given model, printing its training-set score and its test-set score
def train_test_score(estimator, x_train, x_test, y_train, y_test):
    train_predictions = estimator.predict(x_train)
    print('------------train-----------')
    print_score(train_predictions, y_train)
    print('------------test------------')
    test_predictions = estimator.predict(x_test)
    print_score(test_predictions, y_test)

In [11]:
# Separate the label from the training set
train_label = train.pop('SalePrice')

# Concatenate train and test features so useless features can be dropped from both at once
features = pd.concat([train, test], keys=['train', 'test'])

# Drop useless features (why these are considered useless is not explained)
features.drop(['Utilities', 'RoofMatl', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'Heating', 'LowQualFinSF',
               'BsmtFullBath', 'BsmtHalfBath', 'Functional', 'GarageYrBlt', 'GarageArea', 'GarageCond', 'WoodDeckSF',
               'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal'],
              axis=1, inplace=True)
print(features.shape)

(2919, 56)

In [12]:
# Convert certain Series to str.
# Q: Which columns need converting to str?
# A: Integer columns whose numbers merely name a category (the magnitude itself is meaningless);
#    once converted to str they will be turned into dummies later.
features['MSSubClass'] = features['MSSubClass'].astype(str)
# pandas offers two ways to access a column, .feature and ['feature']; they are equivalent, and the next line uses the .feature form
features.OverallCond = features.OverallCond.astype(str)
features['KitchenAbvGr'] = features['KitchenAbvGr'].astype(str)
features['YrSold'] = features['YrSold'].astype(str)
features['MoSold'] = features['MoSold'].astype(str)

# Fill missing values with the mode
features['MSZoning'] = features['MSZoning'].fillna(features['MSZoning'].mode()[0])
features['MasVnrType'] = features['MasVnrType'].fillna(features['MasVnrType'].mode()[0])
features['Electrical'] = features['Electrical'].fillna(features['Electrical'].mode()[0])
features['KitchenQual'] = features['KitchenQual'].fillna(features['KitchenQual'].mode()[0])
features['SaleType'] = features['SaleType'].fillna(features['SaleType'].mode()[0])

# Fill missing values with a specific value
features['LotFrontage'] = features['LotFrontage'].fillna(features['LotFrontage'].mean())
features['Alley'] = features['Alley'].fillna('NOACCESS')
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    features[col] = features[col].fillna('NoBSMT')
features['TotalBsmtSF'] = features['TotalBsmtSF'].fillna(0)
features['FireplaceQu'] = features['FireplaceQu'].fillna('NoFP')
for col in ('GarageType', 'GarageFinish', 'GarageQual'):
    features[col] = features[col].fillna('NoGRG')
features['GarageCars'] = features['GarageCars'].fillna(0.0)

# Engineer a new feature
features['TotalSF'] = features['TotalBsmtSF'] + features['1stFlrSF'] + features['2ndFlrSF']
features.drop(['TotalBsmtSF', '1stFlrSF', '2ndFlrSF'], axis=1, inplace=True)
print(features.shape)

(2919, 54)

In [13]:
# Look at the distribution of sale prices
ax = sns.distplot(train_label)

In [14]:
# The distribution is skewed: the mass piles up on the left with a long right tail, so apply a log transform
train_label = np.log(train_label)
ax = sns.distplot(train_label)

In [15]:
# Standardize the numeric features
num_features = features.loc[:, ['LotFrontage', 'LotArea', 'GrLivArea', 'TotalSF']]
num_features_standarized = (num_features - num_features.mean()) / num_features.std()
num_features_standarized.head()

Out[15]:
         LotFrontage   LotArea  GrLivArea   TotalSF
train 0    -0.202033 -0.217841   0.413476  0.022999
      1     0.501785 -0.072032  -0.471810 -0.029167
      2    -0.061269  0.137173   0.563659  0.196886
      3    -0.436639 -0.078371   0.427309 -0.092511
      4     0.689469  0.518814   1.377806  0.988072
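
As an aside, the manual (x - mean) / std above can also be done with sklearn's StandardScaler; here is a minimal sketch on a hypothetical numeric frame standing in for num_features. Two caveats: StandardScaler divides by the population standard deviation (ddof=0) while pandas .std() uses the sample version (ddof=1), so the values differ by a tiny factor; and the notebook computes the statistics over the combined train+test frame, which mildly leaks test-set information, whereas fitting the scaler on training rows only would avoid that.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric frame standing in for num_features above
num_features = pd.DataFrame({'LotFrontage': [65.0, 80.0, 68.0],
                             'LotArea': [8450, 9600, 11250]})

scaler = StandardScaler()
# fit_transform returns a numpy array; wrap it back into a DataFrame to keep the labels
num_scaled = pd.DataFrame(scaler.fit_transform(num_features),
                          index=num_features.index, columns=num_features.columns)
print(num_scaled)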
In [16]:
ax = sns.pairplot(num_features_standarized)

In [17]:
# Key technique: convert categorical data to dummies by hand.
# Record every distinct condition from both columns in a set
conditions = set([x for x in features['Condition1']] + [x for x in features['Condition2']])
# Build the dummy frame ourselves: one row per sample, one column per distinct condition
dummies = pd.DataFrame(data=np.zeros((len(features.index), len(conditions))),
                       index=features.index, columns=conditions)
# Walk over the samples and translate each row's condition info into dummy indicators.
# Note that cond can carry information from both Condition1 and Condition2, i.e. two cells
# of the dummy frame, so we index with both column labels at once rather than a single
# dummies[i, cond] lookup. (The original used dummies.ix[i, cond]; .ix has since been
# removed from pandas, so .loc with the explicit row label is used here.)
for i, cond in enumerate(zip(features['Condition1'], features['Condition2'])):
    dummies.loc[dummies.index[i], list(cond)] = 1
# Append the dummy columns to features, adding a prefix to their names
features = pd.concat([features, dummies.add_prefix('Cond_')], axis=1)
# Finally the original Condition features can be dropped
features.drop(['Condition1', 'Condition2'], axis=1, inplace=True)
print(features.shape)

(2919, 61)

In [18]:
features.head()

Out[18]:
         Id MSSubClass MSZoning LotFrontage LotArea Street     Alley LotShape LandContour LotConfig ... TotalSF Cond_PosA Cond_Artery Cond_PosN Cond_RRAn Cond_RRAe Cond_Feedr Cond_Norm Cond_RRNn Cond_RRNe
train 0   1         60       RL        65.0    8450   Pave  NOACCESS      Reg         Lvl    Inside ...  2566.0       0.0         0.0       0.0       0.0       0.0        0.0       1.0       0.0       0.0
      1   2         20       RL        80.0    9600   Pave  NOACCESS      Reg         Lvl       FR2 ...  2524.0       0.0         0.0       0.0       0.0       0.0        1.0       1.0       0.0       0.0
      2   3         60       RL        68.0   11250   Pave  NOACCESS      IR1         Lvl    Inside ...  2706.0       0.0         0.0       0.0       0.0       0.0        0.0       1.0       0.0       0.0
      3   4         70       RL        60.0    9550   Pave  NOACCESS      IR1         Lvl    Corner ...  2473.0       0.0         0.0       0.0       0.0       0.0        0.0       1.0       0.0       0.0
      4   5         60       RL        84.0   14260   Pave  NOACCESS      IR1         Lvl       FR2 ...  3343.0       0.0         0.0       0.0       0.0       0.0        0.0       1.0       0.0       0.0

5 rows × 61 columns
In [19]:
# Convert Exterior to dummies, same technique as for Condition above
Exterior = set([x for x in features['Exterior1st']] + [x for x in features['Exterior2nd']])
dummies = pd.DataFrame(data=np.zeros([len(features.index), len(Exterior)]),
                       index=features.index, columns=Exterior)
for i, ext in enumerate(zip(features['Exterior1st'], features['Exterior2nd'])):
    dummies.loc[dummies.index[i], list(ext)] = 1
features = pd.concat([features, dummies.add_prefix('Ext_')], axis=1)
features.drop(['Exterior1st', 'Exterior2nd', 'Ext_nan'], axis=1, inplace=True)
print(features.shape)

(2919, 78)

In [20]:
features.dtypes[features.dtypes == 'object'].index

Out[20]:
Index(['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour',
       'LotConfig', 'LandSlope', 'Neighborhood', 'BldgType', 'HouseStyle',
       'OverallCond', 'RoofStyle', 'MasVnrType', 'ExterQual', 'ExterCond',
       'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
       'BsmtFinType2', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenAbvGr',
       'KitchenQual', 'FireplaceQu', 'GarageType', 'GarageFinish',
       'GarageQual', 'PavedDrive', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition'],
      dtype='object')

In [21]:
# Idiom for iterating over columns of a particular dtype: for col in features.dtypes[features.dtypes == 'object'].index
# Convert all remaining categorical variables to dummies
for col in features.dtypes[features.dtypes == 'object'].index:
    for_dummy = features.pop(col)
    features = pd.concat([features, pd.get_dummies(for_dummy, prefix=col)], axis=1)
print(features.shape)

(2919, 263)

In [22]:
# Update a copy of features with the standardized numeric columns from earlier
features_standardized = features.copy()
features_standardized.update(num_features_standarized)

In [23]:
# Re-separate the training and test sets

# First the non-standardized features
train_features = features.loc['train'].drop(['Id'], axis=1).select_dtypes(include=[np.number]).values
test_features = features.loc['test'].drop(['Id'], axis=1).select_dtypes(include=[np.number]).values

# Then the standardized features
train_features_std = features_standardized.loc['train'].drop(['Id'], axis=1).select_dtypes(include=[np.number]).values
test_features_std = features_standardized.loc['test'].drop(['Id'], axis=1).select_dtypes(include=[np.number]).values
print(train_features.shape)
print(train_features_std.shape)

(1460, 262)
(1460, 262)

In [24]:
# Shuffle the training data
train_features_std, train_features, train_label = shuffle(train_features_std, train_features, train_label, random_state=5)

In [25]:
# Split into train and test folds
x_train, x_test, y_train, y_test = train_test_split(train_features, train_label, test_size=0.1, random_state=200)
x_train_std, x_test_std, y_train_std, y_test_std = train_test_split(train_features_std, train_label, test_size=0.1, random_state=200)

In [26]:
# First model: ElasticNet
ENSTest = linear_model.ElasticNetCV(alphas=[0.0001, 0.0005, 0.001, 0.01, 0.1, 1, 10],
                                    l1_ratio=[.01, .1, .5, .9, .99], max_iter=5000).fit(x_train_std, y_train_std)
train_test_score(ENSTest, x_train_std, x_test_std, y_train_std, y_test_std)

------------train-----------
R2: 0.9009283127352861
RMSE: 0.11921419084690392
------------test------------
R2: 0.8967299522701895
RMSE: 0.11097042840114624

In [27]:
# Cross-validation score of the model.
# Note: with no scoring argument, cross_val_score uses the estimator's default score,
# which for regressors is R2, so 'Accuracy' here really means mean R2.
score = cross_val_score(ENSTest, train_features_std, train_label, cv=5)
print('Accuracy: %0.2f +/- %0.2f' % (score.mean(), score.std()*2))

Accuracy: 0.88 +/- 0.10

In [28]:
# Second model: GradientBoosting
GB = ensemble.GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05, max_depth=3, max_features='sqrt',
                                        min_samples_leaf=15, min_samples_split=10, loss='huber').fit(x_train_std, y_train_std)
train_test_score(GB, x_train_std, x_test_std, y_train_std, y_test_std)

------------train-----------
R2: 0.9607778449577035
RMSE: 0.07698826081848897
------------test------------
R2: 0.9002871760789876
RMSE: 0.10793269100940146

In [29]:
# Cross-validation score of the GradientBoosting model.
# (The original post pasted the In [28] cell here again by mistake; given the printed
#  output, the cell must have computed a cross-validation score like In [27].)
score = cross_val_score(GB, train_features_std, train_label, cv=5)
print('Accuracy: %0.2f +/- %0.2f' % (score.mean(), score.std()*2))

Accuracy: 0.90 +/- 0.04

In [30]:
# Model ensembling: refit both models on the full training data
GB_model = GB.fit(train_features, train_label)
ENS_model = ENSTest.fit(train_features_std, train_label)

In [31]:
# Why does the ensemble formula look like this?
# The labels were log-transformed, so each model predicts log-prices; np.exp maps the
# predictions back to the original price scale, and the two models are then averaged.
Final_score = (np.exp(GB_model.predict(test_features)) + np.exp(ENS_model.predict(test_features_std))) / 2

In [32]:
# Write the submission to a csv file
pd.DataFrame({'Id': test.Id, 'SalePrice': Final_score}).to_csv('submit.csv', index=False)

Reposted from: https://www.cnblogs.com/RB26DETT/p/11566650.html
