Kaggle — House Prices Prediction: A Hands-On Project
House price prediction is a Kaggle competition and a common entry-level machine learning project. Kaggle competition page: link.
For a walkthrough of the general Kaggle competition workflow, see this post: link.
1. About Kaggle
Kaggle is a platform where developers and data scientists host machine learning competitions, share datasets, and write and publish code; it has attracted over 800,000 data scientists. It is an excellent hands-on platform for learning data mining and data analysis. Many competitions offer substantial prize money, and winning contestants often share their code and their experience analyzing and mining the data.
2. House Price Prediction
Competition link: link.
3. Data Analysis
Import the required libraries:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
(1) Download the dataset from the competition page.
(2) Load the data with pandas, print all column names and summary statistics for the target variable, and check whether it follows a normal distribution:
df_train=pd.read_csv('./train.csv')
print(df_train.columns) # print all column names
print(df_train['SalePrice'].describe()) # summary statistics for the target variable
sns.distplot(df_train['SalePrice']) # check whether it looks normally distributed
plt.show()
All column names:
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
'SaleCondition', 'SalePrice'], dtype='object')
房價變量的詳細(xì)信息
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000
Name: SalePrice, dtype: float64
SalePrice distribution curve
The distribution is clearly not a standard normal, so let us look at its kurtosis and skewness.
Kurtosis describes how peaked or flat a distribution is relative to the normal distribution:
kurtosis = 0: same peakedness as the normal distribution.
kurtosis > 0: more sharply peaked than the normal distribution.
kurtosis < 0: flatter than the normal distribution.
Skewness describes the asymmetry of a distribution:
skewness = 0: symmetric, like the normal distribution.
skewness > 0: positively (right) skewed, with a longer right tail.
skewness < 0: negatively (left) skewed, with a longer left tail.
print('Skewness:%f'%df_train['SalePrice'].skew())
print('Kurtosis:%f'%df_train['SalePrice'].kurt())
Skewness:1.882876
Kurtosis:6.536282
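As a quick aside (not part of the competition code), the behavior of these two statistics is easy to verify on synthetic data with scipy's `skew` and `kurtosis` functions:

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=100_000)        # symmetric, bell-shaped
lognormal_sample = rng.lognormal(size=100_000)  # long right tail, like SalePrice

# A normal sample has skewness ~0 and (Fisher) kurtosis ~0
print(skew(normal_sample), kurtosis(normal_sample))
# A right-skewed sample has clearly positive skewness
print(skew(lognormal_sample))
```

SalePrice's skewness of 1.88 and kurtosis of 6.54 place it firmly in right-skewed, heavy-tailed territory, which motivates the log transform applied later.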
(3) Look at how individual features relate to the target.
Above-ground living area:
var='GrLivArea'
data=pd.concat([df_train['SalePrice'],df_train[var]],axis=1)
data.plot.scatter(x=var,y='SalePrice',ylim=(0,800000))
plt.show()
The scatter plot shows that GrLivArea contains outliers.
Total basement area:
var='TotalBsmtSF'
data=pd.concat([df_train['SalePrice'],df_train[var]],axis=1)
data.plot.scatter(x=var,y='SalePrice',ylim=(0,800000))
plt.show()
整體材料與飾面質(zhì)量:(用箱型圖,可以查看離群值,均值,最值)
var='OverallQual'
data=pd.concat([df_train['SalePrice'],df_train[var]],axis=1)
f,ax=plt.subplots(figsize=(8,6))
fig=sns.boxplot(x=var,y='SalePrice',data=data)
fig.axis(ymin=0,ymax=800000)
plt.show()
Original construction year:
var='YearBuilt'
data=pd.concat([df_train['SalePrice'],df_train[var]],axis=1)
f,ax=plt.subplots(figsize=(16,8))
fig=sns.boxplot(x=var,y='SalePrice',data=data)
fig.axis(ymin=0,ymax=800000)
plt.xticks(rotation=90)
# plt.savefig('原施工日期.jpg') # save to the current directory
plt.show()
特征相關(guān)度熱度圖:
corrmat=df_train.corr()
f,ax=plt.subplots(figsize=(12,9))
sns.heatmap(corrmat,square=True,cmap='YlGnBu')
plt.savefig('熱力圖.jpg')
plt.show()
Pick the ten features most correlated with SalePrice for further analysis.
Plot pairwise relationships for the selected features (six of them, plus the target):
sns.set()
cols=['SalePrice','OverallQual','GrLivArea','GarageCars','TotalBsmtSF','FullBath','YearBuilt']
sns.pairplot(df_train[cols],size=2.5)
plt.savefig('相關(guān)性圖.jpg')
plt.show()
Checking for missing values:
total=df_train.isnull().sum().sort_values(ascending=False)
percent=(df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data=pd.concat([total,percent],axis=1,keys=['Total','Percent'])
print(missing_data.head(20))
Total Percent
PoolQC 1453 0.995205
MiscFeature 1406 0.963014
Alley 1369 0.937671
Fence 1179 0.807534
FireplaceQu 690 0.472603
LotFrontage 259 0.177397
GarageCond 81 0.055479
GarageType 81 0.055479
GarageYrBlt 81 0.055479
GarageFinish 81 0.055479
GarageQual 81 0.055479
BsmtExposure 38 0.026027
BsmtFinType2 38 0.026027
BsmtFinType1 37 0.025342
BsmtCond 37 0.025342
BsmtQual 37 0.025342
MasVnrArea 8 0.005479
MasVnrType 8 0.005479
Electrical 1 0.000685
Utilities 0 0.000000
四、數(shù)據(jù)處理
這個環(huán)節(jié)將對我們數(shù)據(jù)分析時候找出來的一些需要處理的數(shù)據(jù)進(jìn)行適當(dāng)?shù)臄?shù)據(jù)處理,數(shù)據(jù)處理部分因人而異,數(shù)據(jù)處理的好壞將直接影響模型的結(jié)果。
先將所用到的庫導(dǎo)入(包括后面建模的)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm,skew
from sklearn.preprocessing import LabelEncoder
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
導(dǎo)入數(shù)據(jù)集:
train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')
查看數(shù)據(jù)集的大小:
print('The train data size before dropping Id feature is :{}'.format(train.shape))
print('The test data size before dropping Id feature is :{}'.format(test.shape))
The train data size before dropping Id feature is :(1460, 81)
The test data size before dropping Id feature is :(1459, 80)
Save the Id column to a variable, drop it from both sets, and check the sizes again:
train_ID = train['Id']
test_ID = test['Id']
train.drop('Id', axis=1, inplace=True)
test.drop('Id', axis=1, inplace=True)
print('\nThe train data size after dropping Id feature is :{}'.format(train.shape))
print('The test data size after dropping Id feature is :{}'.format(test.shape))
The train data size after dropping Id feature is :(1460, 80)
The test data size after dropping Id feature is :(1459, 79)
由數(shù)據(jù)分析階段知道居住面積存在離群值,再次顯示出來(為下一步刪掉離群值提供刪除的范圍),刪除離群值,顯示刪除后的數(shù)據(jù)集:
fig, ax = plt.subplots()
ax.scatter(x=train['GrLivArea'], y=train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
# drop the outliers
train = train.drop(train[(train['GrLivArea'] > 4000) & (train['SalePrice'] < 300000)].index)
# view the data after removing the outliers
fig, ax = plt.subplots()
ax.scatter(x=train['GrLivArea'], y=train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
將目標(biāo)變量轉(zhuǎn)換為正態(tài)分布:
sns.distplot(train['SalePrice'], fit=norm)
plt.show()
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma={:.2f}\n'.format(mu, sigma))
mu = 180932.92 and sigma=79467.79
QQ plot (shows how far the data deviates from a normal distribution):
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
The plots show a large deviation from normality, so we apply a log transform:
train['SalePrice'] = np.log1p(train['SalePrice'])
sns.distplot(train['SalePrice'], fit=norm)
plt.show()
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu={:.2f} and sigma={:.2f}\n'.format(mu, sigma))
# QQ plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
mu=12.02 and sigma=0.40
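Because the target was transformed with np.log1p, predictions on the original price scale must be recovered with np.expm1 (which is why the prediction code later wraps model outputs in np.expm1). The pair is an exact inverse, shown here on three prices taken from the describe() output above:

```python
import numpy as np

prices = np.array([34900.0, 163000.0, 755000.0])  # min / median / max of SalePrice
logged = np.log1p(prices)    # log(1 + x): compresses the long right tail
restored = np.expm1(logged)  # exp(x) - 1: exact inverse, applied to test predictions
print(np.allclose(prices, restored))  # True
```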
Handling missing values:
Before filling missing values, we combine the training and test sets so they can be processed together:
ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print('all_data size is :{}'.format(all_data.shape))
all_data size is :(2917, 79)
打印缺失數(shù)據(jù):
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio': all_data_na})
print(missing_data)
Missing Ratio
PoolQC 99.691464
MiscFeature 96.400411
Alley 93.212204
Fence 80.425094
FireplaceQu 48.680151
LotFrontage 16.660953
GarageFinish 5.450806
GarageYrBlt 5.450806
GarageQual 5.450806
GarageCond 5.450806
GarageType 5.382242
BsmtExposure 2.811107
BsmtCond 2.811107
BsmtQual 2.776826
BsmtFinType2 2.742544
BsmtFinType1 2.708262
MasVnrType 0.822763
MasVnrArea 0.788481
MSZoning 0.137127
BsmtFullBath 0.068564
BsmtHalfBath 0.068564
Utilities 0.068564
Functional 0.068564
Exterior2nd 0.034282
Exterior1st 0.034282
SaleType 0.034282
BsmtFinSF1 0.034282
BsmtFinSF2 0.034282
BsmtUnfSF 0.034282
Electrical 0.034282
f, ax=plt.subplots(figsize=(15,12))
plt.xticks(rotation=90)
sns.barplot(x=all_data_na.index,y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
plt.show()
Fill in the missing values listed above:
# PoolQC: NA means no pool
# print(all_data['PoolQC'][:5])
all_data['PoolQC']=all_data['PoolQC'].fillna('None')
# print(all_data['PoolQC'][:5])
# MiscFeature: NA means no miscellaneous feature
# print(all_data['MiscFeature'][:10])
all_data['MiscFeature']=all_data['MiscFeature'].fillna('None')
# Alley: NA means no alley access
all_data['Alley']=all_data['Alley'].fillna('None')
# Fence: NA means no fence
all_data['Fence']=all_data['Fence'].fillna('None')
# FireplaceQu: NA means no fireplace
all_data['FireplaceQu']=all_data['FireplaceQu'].fillna('None')
# LotFrontage: fill with the median of the house's neighborhood
all_data['LotFrontage']=all_data.groupby('Neighborhood')['LotFrontage'].transform(lambda x:x.fillna(x.median()))
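To make the groupby/transform idiom concrete, here is a toy frame (hypothetical values, not from the dataset) showing that each NaN is filled with the median of its own neighborhood:

```python
import pandas as pd

toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B'],
    'LotFrontage':  [60.0, 80.0, None, 100.0, None],
})
# each group's NaNs are replaced by that group's own median
toy['LotFrontage'] = toy.groupby('Neighborhood')['LotFrontage'] \
                        .transform(lambda x: x.fillna(x.median()))
print(toy['LotFrontage'].tolist())  # [60.0, 80.0, 70.0, 100.0, 100.0]
```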
# Garage-related features
for col in ('GarageFinish', 'GarageQual', 'GarageCond', 'GarageType'):
    all_data[col] = all_data[col].fillna('None')
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)
# Basement-related features
for col in ('BsmtFullBath', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtHalfBath'):
    all_data[col] = all_data[col].fillna(0)
for col in ('BsmtExposure', 'BsmtCond', 'BsmtQual', 'BsmtFinType2', 'BsmtFinType1'):
    all_data[col] = all_data[col].fillna('None')
# Masonry veneer
all_data['MasVnrType'] = all_data['MasVnrType'].fillna('None')
all_data['MasVnrArea'] = all_data['MasVnrArea'].fillna(0)
# MSZoning (general zoning classification): fill with the mode
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
# Functional: the data description says NA means Typ
all_data['Functional'] = all_data['Functional'].fillna('Typ')
# Electrical
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
# KitchenQual
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
# Exterior coverings
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
# SaleType
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
# MSSubClass (building class)
all_data['MSSubClass'] = all_data['MSSubClass'].fillna('None')
# Utilities is nearly constant, so drop it
all_data = all_data.drop(['Utilities'], axis=1)
Check the missing values again after filling:
all_data_na=(all_data.isnull().sum()/len(all_data))*100
all_data_na=all_data_na.drop(all_data_na[all_data_na==0].index).sort_values(ascending=False)
missing_data=pd.DataFrame({'Missing Ratio':all_data_na})
print(missing_data)
Empty DataFrame
Columns: [Missing Ratio]
Index: []
All missing values have been filled.
Convert some columns that are numeric in type but categorical in meaning into string values:
all_data['MSSubClass']=all_data['MSSubClass'].apply(str)
all_data['OverallCond']=all_data['OverallCond'].astype(str)
all_data['YrSold']=all_data['YrSold'].astype(str)
all_data['MoSold']=all_data['MoSold'].astype(str)
使用sklearn進(jìn)行標(biāo)簽映射(使用sklearn的LabelEncoder方法將類別特征(離散型)編碼為0~n-1之間連續(xù)的特征數(shù)值):
cols=('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',\'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1',\'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',\'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond',\'YrSold', 'MoSold')for c in cols:lbl = LabelEncoder()lbl.fit(list(all_data[c].values))all_data[c] = lbl.transform(list(all_data[c].values))
# shape
print('Shape all_data: {}'.format(all_data.shape))
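A minimal illustration of what LabelEncoder does (the quality labels below are example values from this dataset): it assigns integer codes in sorted (alphabetical) order, so the codes are arbitrary categories rather than an ordinal scale; 'Ex' (excellent) does not necessarily get the highest code.

```python
from sklearn.preprocessing import LabelEncoder

lbl = LabelEncoder()
codes = lbl.fit_transform(['Ex', 'Gd', 'TA', 'Gd'])
print(list(lbl.classes_))  # ['Ex', 'Gd', 'TA'] -- sorted alphabetically
print(list(codes))         # [0, 1, 2, 1]
```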
House prices generally correlate with total floor area, so we engineer one extra feature that combines the area columns:
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
我們檢查數(shù)值型特征數(shù)據(jù)的偏度(skewness),但是要注意,object類型的數(shù)據(jù)無法計算skewness,因此計算的時候要過濾掉object數(shù)據(jù)。
umeric_feats = all_data.dtypes[all_data.dtypes != 'object'].index
# check the skew of all numerical features
skewed_feats = all_data[umeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
skewness = pd.DataFrame({'Skew': skewed_feats})
print(skewness.head(10))
Skew
MiscVal 21.939672
PoolArea 17.688664
LotArea 13.109495
LowQualFinSF 12.084539
3SsnPorch 11.372080
LandSlope 4.973254
KitchenAbvGr 4.300550
BsmtFinSF2 4.144503
EnclosedPorch 4.002344
ScreenPorch 3.945101
對于偏度過大的特征數(shù)據(jù)利用sklearn的box-cox轉(zhuǎn)換函數(shù),以降低數(shù)據(jù)的偏度
# box cox transformation of highly skewed features
# box cox轉(zhuǎn)換的知識可以google
skewness = skewness[abs(skewness) > 0.75]
print('there are {} skewed numerical features to Box Cox transform'.format(skewness.shape[0]))
from scipy.special import boxcox1p
skewed_feats_index = skewness.index
lam = 0.15
for feat in skewed_feats_index:all_data[feat] = boxcox1p(all_data[feat], lam)
Use pandas' get_dummies to one-hot encode the remaining categorical features, then split back into the final training and test sets:
# one-hot encode the remaining categorical features
all_data = pd.get_dummies(all_data)
print(all_data.shape)
# getting the new train and test sets
train = all_data[:ntrain]
test = all_data[ntrain:]
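pd.get_dummies expands every remaining object column into one 0/1 indicator column per category, which is why the column count jumps in the printed shape. A toy example with a single column:

```python
import pandas as pd

toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
dummies = pd.get_dummies(toy)
print(list(dummies.columns))  # ['Street_Grvl', 'Street_Pave']
print(dummies.shape)          # (3, 2): one indicator column per category
```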
Model Building
Import the required libraries.
We use sklearn's cross_val_score for cross-validation; since that function does not shuffle the data itself, we pass a KFold object to split the dataset randomly.
from sklearn.linear_model import ElasticNet, Lasso, BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb

n_folds = 5

def rmsle_cv(model):
    # pass the KFold object itself as cv so the shuffle actually takes effect
    # (calling .get_n_splits() would return a plain int and discard the shuffle)
    kf = KFold(n_folds, shuffle=True, random_state=42)
    rmse = np.sqrt(-cross_val_score(model, train.values, y_train,
                                    scoring="neg_mean_squared_error", cv=kf))
    return rmse
Lasso model:
lasso=make_pipeline(RobustScaler(),Lasso(alpha=0.0005,random_state=1))
ElasticNet model:
ENet=make_pipeline(RobustScaler(),ElasticNet(alpha=0.0005,l1_ratio=.9,random_state=3))
KernelRidge (ridge regression with a kernel function):
KRR=KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
GradientBoostingRegressor model:
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05, max_depth=4, max_features='sqrt',min_samples_leaf=15, min_samples_split=10, loss='huber', random_state=5)
XGBoost model:
xgb_model = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, learning_rate=0.05, max_depth=3,min_child_weight=1.7817, n_estimators=2200, reg_alpha=0.4640, reg_lambda=0.8571,subsample=0.5213, silent=1, random_state=7, nthread=-1)
LightGBM model:
lgb_model =lgb.LGBMRegressor(objective='regression',num_leaves=1000,learning_rate=0.05,n_estimators=350,reg_alpha=0.9)
Print each model's cross-validation score:
score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(ENet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(KRR)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(GBoost)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(xgb_model)
print("Xgboost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(lgb_model)
print("lightgbm score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Lasso score: 0.1115 (0.0074)
ElasticNet score: 0.1116 (0.0074)
Kernel Ridge score: 0.1153 (0.0075)
Gradient Boosting score: 0.1167 (0.0083)
Xgboost score: 0.1164 (0.0070)
lightgbm score: 0.1288 (0.0058)
Averaged model
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models

    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        # train cloned base models
        for model in self.models_:
            model.fit(X, y)
        return self

    # we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([model.predict(X) for model in self.models_])
        return np.mean(predictions, axis=1)
Here we average four models: ENet, GBoost, KRR, and lasso:
averaged_models = AveragingModels(models=(ENet, GBoost, KRR, lasso))
score_all = rmsle_cv(averaged_models)
print('Averaged base models score: {:.4f} ({:.4f})\n'.format(score_all.mean(), score_all.std()))
Averaged base models score: 0.1087 (0.0077)
Stacking model
Here we add a meta-model on top of the averaged base models and train it on out-of-fold predictions from those base models. The training procedure is:
1. Split the training set into two parts: train_a and train_b.
2. Train the base models on train_a.
3. Use the trained base models to predict on train_b.
4. Use the predictions from step 3 as inputs to train the meta-model.
Reference: link. We use five-fold stacking: the training set is split into 5 parts; each iteration trains the base models on 4 parts and predicts on the remaining one. After five iterations we have out-of-fold predictions for every training row, and these become the meta-model's input features (the target variable stays unchanged). At prediction time, we average the base models' predictions on the test data and feed that average to the meta-model.
class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds

    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        # Train cloned base models then create out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred
        # Now train the cloned meta-model using the out-of-fold predictions as new features
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self

    # Do the predictions of all base models on the test data and use the averaged
    # predictions as meta-features for the final prediction by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_])
        return self.meta_model_.predict(meta_features)
Using the previously defined ENet, GBoost, and KRR as base models and lasso as the meta-model, train, predict, and compute the score:
stacked_averaged_models = StackingAveragedModels(base_models = (ENet, GBoost, KRR), meta_model = lasso)
score_all_stacked = rmsle_cv(stacked_averaged_models)
print('Stacking Averaged base models score: {:.4f} ({:.4f})\n'.format(score_all_stacked.mean(), score_all_stacked.std()))
Stacking Averaged base models score: 0.1081 (0.0073)
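As an aside, newer scikit-learn versions (0.22+) ship a built-in StackingRegressor implementing the same out-of-fold scheme; a minimal sketch on synthetic data (not the house-price features) might look like:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

stack = StackingRegressor(
    estimators=[('ridge', Ridge()), ('krr', KernelRidge())],
    final_estimator=Lasso(alpha=0.0005),
    cv=5,  # the meta-model is trained on 5-fold out-of-fold predictions
)
stack.fit(X, y)
print(stack.predict(X[:3]).shape)  # (3,)
```

The hand-written class above remains useful for understanding what the built-in estimator does internally.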
Ensemble model
Finally we combine the previous models into a stronger ensemble (StackedRegressor, XGBoost, and LightGBM).
Define the evaluation function:
def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))
Train the StackedRegressor, XGBoost, and LightGBM models separately:
stacked_averaged_models.fit(train.values, y_train)
stacked_train_pred = stacked_averaged_models.predict(train.values)
stacked_test_pred = np.expm1(stacked_averaged_models.predict(test.values))
print(rmsle(y_train, stacked_train_pred))
xgb_model.fit(train, y_train)
xgb_train_pred = xgb_model.predict(train)
xgb_test_pred = np.expm1(xgb_model.predict(test))
print(rmsle(y_train, xgb_train_pred))
lgb_model.fit(train, y_train)
lgb_train_pred = lgb_model.predict(train)
lgb_test_pred = np.expm1(lgb_model.predict(test.values))
print(rmsle(y_train, lgb_train_pred))
0.07839506096666995
0.07876052198274874
0.05893922686966146
用加權(quán)來平均上述的xgboost和LightGBM和StackedRegressor模型:
print('RMSLE score on train data all models:')
print(rmsle(y_train, stacked_train_pred * 0.6 + xgb_train_pred * 0.20 +lgb_train_pred * 0.20 ))
# Ensemble prediction 集成預(yù)測
ensemble_result = stacked_test_pred * 0.60 + xgb_test_pred * 0.20 + lgb_test_pred *0.20
生成結(jié)果的提交
submission = pd.DataFrame()
submission['Id'] = test_ID
submission['SalePrice'] = ensemble_result
submission.to_csv(r'E:\fangjiayucejieguo\submission.csv', index=False)
kaggle官網(wǎng)提交結(jié)果