Kaggle HousePrice: a hands-on house-price prediction project
House-price prediction is a competition hosted on Kaggle and a classic entry-level machine-learning project. Kaggle site: link.
For a walkthrough of the general Kaggle competition workflow, see this post: link.
1. About Kaggle
Kaggle is a platform where companies and data scientists host machine-learning competitions, share datasets, and write and publish code; it has attracted over 800,000 data scientists. It is a rare hands-on learning platform for data mining and data analysis: many competitions carry large cash prizes, and many winners share their code and their experience analyzing and mining the data.
2. The House Prices competition
Competition link: link.
3. Data analysis
Import the required libraries:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
(1) Download the dataset from the competition page.
(2) Load the data with pandas, print all column names and detailed statistics of the target variable, and check whether it is normally distributed:
df_train = pd.read_csv('./train.csv')
print(df_train.columns)                 # all column names
print(df_train['SalePrice'].describe()) # summary of the target variable
sns.distplot(df_train['SalePrice'])     # check whether it looks normally distributed
plt.show()
All column names:
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')
Summary statistics of SalePrice:
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000
Name: SalePrice, dtype: float64
SalePrice distribution curve
The distribution is clearly not a standard normal, so let's look at its kurtosis and skewness.
Kurtosis measures how peaked or flat a distribution is relative to the normal:
kurtosis = 0: as peaked as the normal distribution;
kurtosis > 0: more sharply peaked than the normal;
kurtosis < 0: flatter than the normal.
Skewness measures the asymmetry of a distribution:
skewness = 0: as symmetric as the normal distribution;
skewness > 0: positively (right-) skewed, with a longer right tail;
skewness < 0: negatively (left-) skewed, with a longer left tail.
print('Skewness:%f'%df_train['SalePrice'].skew())
print('Kurtosis:%f'%df_train['SalePrice'].kurt())
Skewness:1.882876
Kurtosis:6.536282
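As a quick sanity check on these definitions, pandas' `skew()` and `kurt()` (excess kurtosis) can be compared on synthetic samples. This is a hypothetical illustration on generated data, not the competition data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
normal = pd.Series(rng.normal(size=100_000))
right_skewed = pd.Series(rng.lognormal(size=100_000))

# A normal sample has skewness ~ 0 and excess kurtosis ~ 0;
# a lognormal sample is strongly right-skewed with heavy tails.
print('normal:  skew=%.2f kurt=%.2f' % (normal.skew(), normal.kurt()))
print('lognorm: skew=%.2f kurt=%.2f' % (right_skewed.skew(), right_skewed.kurt()))
```

SalePrice, with skewness 1.88 and kurtosis 6.54, sits between these two extremes but is clearly far from normal.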
(3) Examine how individual features relate to the target.
Above-ground living area (GrLivArea):
var='GrLivArea'
data=pd.concat([df_train['SalePrice'],df_train[var]],axis=1)
data.plot.scatter(x=var,y='SalePrice',ylim=(0,800000))
plt.show()
The scatter plot shows that living area contains outliers.
Total basement area (TotalBsmtSF):
var='TotalBsmtSF'
data=pd.concat([df_train['SalePrice'],df_train[var]],axis=1)
data.plot.scatter(x=var,y='SalePrice',ylim=(0,800000))
plt.show()
Overall material and finish quality (OverallQual). A box plot shows the outliers, the median, and the extremes:
var='OverallQual'
data=pd.concat([df_train['SalePrice'],df_train[var]],axis=1)
f,ax=plt.subplots(figsize=(8,6))
fig=sns.boxplot(x=var,y='SalePrice',data=data)
fig.axis(ymin=0,ymax=800000)
plt.show()
Original construction year (YearBuilt):
var='YearBuilt'
data=pd.concat([df_train['SalePrice'],df_train[var]],axis=1)
f,ax=plt.subplots(figsize=(16,8))
fig=sns.boxplot(x=var,y='SalePrice',data=data)
fig.axis(ymin=0,ymax=800000)
plt.xticks(rotation=90)
# plt.savefig('原施工日期.jpg')  # save the figure to the current directory
plt.show()
Feature correlation heatmap:
corrmat=df_train.corr()
f,ax=plt.subplots(figsize=(12,9))
sns.heatmap(corrmat,square=True,cmap='YlGnBu')
plt.savefig('熱力圖.jpg')
plt.show()
Pick the ten features most strongly correlated with SalePrice for further analysis:
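The selection code itself isn't shown in the post; a common approach is `DataFrame.corr()` plus `nlargest`. Here is a hypothetical sketch on a toy frame standing in for `df_train` (on the real data you would use `k = 10`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train: a target plus a few numeric features.
rng = np.random.default_rng(42)
n = 500
quality = rng.normal(size=n)
area = rng.normal(size=n)
noise = rng.normal(size=n)
toy = pd.DataFrame({
    'OverallQual': quality,
    'GrLivArea': area,
    'RandomCol': noise,
    'SalePrice': 3 * quality + 2 * area + 0.1 * noise,
})

corrmat = toy.corr()
k = 3  # k = 10 on the real data
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
print(list(cols))  # → ['SalePrice', 'OverallQual', 'GrLivArea']
```

`nlargest(k, 'SalePrice')` sorts the correlation matrix by the SalePrice column, so the target itself always comes first (correlation 1 with itself), followed by its strongest predictors.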
Plot the chosen features pairwise against one another (six of them, plus SalePrice):
sns.set()
cols=['SalePrice','OverallQual','GrLivArea','GarageCars','TotalBsmtSF','FullBath','YearBuilt']
sns.pairplot(df_train[cols], height=2.5)  # 'size=2.5' in seaborn < 0.9
plt.savefig('相關性圖.jpg')
plt.show()
Checking for missing values:
total=df_train.isnull().sum().sort_values(ascending=False)
percent=(df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data=pd.concat([total,percent],axis=1,keys=['Total','Percent'])
print(missing_data.head(20))
              Total   Percent
PoolQC 1453 0.995205
MiscFeature 1406 0.963014
Alley 1369 0.937671
Fence 1179 0.807534
FireplaceQu 690 0.472603
LotFrontage 259 0.177397
GarageCond 81 0.055479
GarageType 81 0.055479
GarageYrBlt 81 0.055479
GarageFinish 81 0.055479
GarageQual 81 0.055479
BsmtExposure 38 0.026027
BsmtFinType2 38 0.026027
BsmtFinType1 37 0.025342
BsmtCond 37 0.025342
BsmtQual 37 0.025342
MasVnrArea 8 0.005479
MasVnrType 8 0.005479
Electrical 1 0.000685
Utilities 0 0.000000
4. Data preprocessing
In this stage we deal with the issues found during data analysis. Preprocessing choices vary from person to person, and their quality directly affects the model's results.
First import everything we'll need (including for the modeling later):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm,skew
from sklearn.preprocessing import LabelEncoder
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
Load the datasets:
train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')
Check their shapes:
print('The train data size before dropping Id feature is :{}'.format(train.shape))
print('The test data size before dropping Id feature is :{}'.format(test.shape))
The train data size before dropping Id feature is :(1460, 81)
The test data size before dropping Id feature is :(1459, 80)
Save the Id column into variables, drop it from both frames, and check the shapes again:
train_ID = train['Id']
test_ID = test['Id']
train.drop('Id', axis=1, inplace=True)
test.drop('Id', axis=1, inplace=True)
print('\nThe train data size after dropping Id feature is :{}'.format(train.shape))
print('The test data size after dropping Id feature is :{}'.format(test.shape))
The train data size after dropping Id feature is :(1460, 80)
The test data size after dropping Id feature is :(1459, 79)
The analysis stage showed that GrLivArea contains outliers. Plot it again (to choose the removal cutoff), drop the outliers, and inspect the cleaned data:
fig, ax = plt.subplots()
ax.scatter(x=train['GrLivArea'], y=train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

# drop the outliers
train = train.drop(train[(train['GrLivArea'] > 4000) & (train['SalePrice'] < 300000)].index)
# inspect the data with the outliers removed
fig, ax = plt.subplots()
ax.scatter(x=train['GrLivArea'], y=train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
Transform the target variable toward a normal distribution:
sns.distplot(train['SalePrice'], fit=norm)
plt.show()
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma={:.2f}\n'.format(mu, sigma))
mu = 180932.92 and sigma=79467.79
Q-Q plot (to see how far the data deviates from a normal distribution):
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
The plot still deviates substantially from the normal line, so apply a log transform:
train['SalePrice'] = np.log1p(train['SalePrice'])
sns.distplot(train['SalePrice'], fit=norm)
plt.show()
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu={:.2f} and sigma={:.2f}\n'.format(mu, sigma))
# Q-Q plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
mu=12.02 and sigma=0.40
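Note that `np.log1p` at training time must be paired with `np.expm1` when generating predictions (the submission code later in this post does exactly that). A quick round-trip check, using the min/median/max SalePrice values from the summary statistics above:

```python
import numpy as np

# min / median / max SalePrice from the describe() output above
prices = np.array([34900.0, 163000.0, 755000.0])
logged = np.log1p(prices)    # compress the long right tail
restored = np.expm1(logged)  # invert before writing a submission

# expm1 exactly undoes log1p (up to floating-point error)
assert np.allclose(restored, prices)
```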
Handling missing values:
Before filling anything, build a combined dataset (process the train and test sets together):
ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print('all_data size is :{}'.format(all_data.shape))
all_data size is :(2917, 79)
Print the missing-data ratios:
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio': all_data_na})
print(missing_data)
Missing Ratio
PoolQC 99.691464
MiscFeature 96.400411
Alley 93.212204
Fence 80.425094
FireplaceQu 48.680151
LotFrontage 16.660953
GarageFinish 5.450806
GarageYrBlt 5.450806
GarageQual 5.450806
GarageCond 5.450806
GarageType 5.382242
BsmtExposure 2.811107
BsmtCond 2.811107
BsmtQual 2.776826
BsmtFinType2 2.742544
BsmtFinType1 2.708262
MasVnrType 0.822763
MasVnrArea 0.788481
MSZoning 0.137127
BsmtFullBath 0.068564
BsmtHalfBath 0.068564
Utilities 0.068564
Functional 0.068564
Exterior2nd 0.034282
Exterior1st 0.034282
SaleType 0.034282
BsmtFinSF1 0.034282
BsmtFinSF2 0.034282
BsmtUnfSF 0.034282
Electrical 0.034282
f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation='90')
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
plt.show()
Fill the missing values found above:
# pool quality: NA means no pool
# print(all_data['PoolQC'][:5])
all_data['PoolQC'] = all_data['PoolQC'].fillna('None')
# print(all_data['PoolQC'][:5])
# MiscFeature
# print(all_data['MiscFeature'][:10])
all_data['MiscFeature'] = all_data['MiscFeature'].fillna('None')
# alley access
all_data['Alley'] = all_data['Alley'].fillna('None')
# fence
all_data['Fence'] = all_data['Fence'].fillna('None')
# fireplace quality
all_data['FireplaceQu'] = all_data['FireplaceQu'].fillna('None')
# distance to the street: fill with the median of the same neighborhood
all_data['LotFrontage'] = all_data.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.median()))
# garage features
for col in ('GarageFinish', 'GarageQual', 'GarageCond', 'GarageType'):
    all_data[col] = all_data[col].fillna('None')
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)
# basement features
for col in ('BsmtFullBath', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtHalfBath'):
    all_data[col] = all_data[col].fillna(0)
for col in ('BsmtExposure', 'BsmtCond', 'BsmtQual', 'BsmtFinType2', 'BsmtFinType1'):
    all_data[col] = all_data[col].fillna('None')
# masonry veneer
all_data['MasVnrType'] = all_data['MasVnrType'].fillna('None')
all_data['MasVnrArea'] = all_data['MasVnrArea'].fillna(0)
# general zoning classification: fill with the mode
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
# home functionality: the data description says NA means Typ
all_data['Functional'] = all_data['Functional'].fillna('Typ')
# electrical system
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
# kitchen quality
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
# exterior covering
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
# sale type
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
# building class
all_data['MSSubClass'] = all_data['MSSubClass'].fillna('None')
# Utilities is almost constant, so drop it
all_data = all_data.drop(['Utilities'], axis=1)
Check the missing values again after filling:
all_data_na=(all_data.isnull().sum()/len(all_data))*100
all_data_na=all_data_na.drop(all_data_na[all_data_na==0].index).sort_values(ascending=False)
missing_data=pd.DataFrame({'Missing Ratio':all_data_na})
print(missing_data)
Empty DataFrame
Columns: [Missing Ratio]
Index: []
No missing values remain.
Convert features that are really categories (though stored as numbers) into strings:
all_data['MSSubClass']=all_data['MSSubClass'].apply(str)
all_data['OverallCond']=all_data['OverallCond'].astype(str)
all_data['YrSold']=all_data['YrSold'].astype(str)
all_data['MoSold']=all_data['MoSold'].astype(str)
Label-encode the categorical (discrete) features with sklearn's LabelEncoder, which maps each category to an integer in 0..n-1:
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',
        'ExterQual', 'ExterCond', 'HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1',
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond',
        'YrSold', 'MoSold')
for c in cols:
    lbl = LabelEncoder()
    lbl.fit(list(all_data[c].values))
    all_data[c] = lbl.transform(list(all_data[c].values))
# shape
print('Shape all_data: {}'.format(all_data.shape))
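One thing to keep in mind: `LabelEncoder` assigns integers by sorted label order, not by actual quality order, so 'Ex' (excellent) does not necessarily get the highest code. A toy illustration:

```python
from sklearn.preprocessing import LabelEncoder

quality = ['Ex', 'Gd', 'TA', 'Fa', 'None', 'Gd']
lbl = LabelEncoder()
codes = lbl.fit_transform(quality)

# classes_ is sorted alphabetically, and codes follow that order
print(lbl.classes_.tolist())  # → ['Ex', 'Fa', 'Gd', 'None', 'TA']
print(codes.tolist())         # → [0, 2, 4, 1, 3, 2]
```

If the ordinal ranking matters to the model, an explicit quality-to-integer mapping would preserve it; tree-based models are fairly robust to the arbitrary ordering either way.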
House price generally relates to total area, so engineer one more feature that combines the main floor areas:
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
Next, check the skewness of the numeric features. Note that skewness cannot be computed for object columns, so they must be filtered out first.
numeric_feats = all_data.dtypes[all_data.dtypes != 'object'].index
# check the skew of all numerical features
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
skewness = pd.DataFrame({'Skew': skewed_feats})
print(skewness.head(10))
Skew
MiscVal 21.939672
PoolArea 17.688664
LotArea 13.109495
LowQualFinSF 12.084539
3SsnPorch 11.372080
LandSlope 4.973254
KitchenAbvGr 4.300550
BsmtFinSF2 4.144503
EnclosedPorch 4.002344
ScreenPorch 3.945101
For features with large skew, apply a Box-Cox transform (via scipy) to reduce it:
# Box-Cox transformation of highly skewed features
skewness = skewness[abs(skewness['Skew']) > 0.75]
print('there are {} skewed numerical features to Box Cox transform'.format(skewness.shape[0]))
from scipy.special import boxcox1p
skewed_feats_index = skewness.index
lam = 0.15
for feat in skewed_feats_index:
    all_data[feat] = boxcox1p(all_data[feat], lam)
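As a sanity check that `boxcox1p` with λ = 0.15 really pulls skew down, here is a hypothetical example on a synthetic right-skewed sample (not the competition data):

```python
import numpy as np
from scipy.special import boxcox1p
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=50_000)  # heavily right-skewed

lam = 0.15
x_bc = boxcox1p(x, lam)  # computes ((1 + x)**lam - 1) / lam

print('skew before: %.2f, after: %.2f' % (skew(x), skew(x_bc)))
```

With λ = 0 the transform reduces to `log1p`; λ = 0.15 is a conventional choice from popular kernels for this competition rather than a fitted value.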
Use pandas' get_dummies to one-hot encode the remaining categorical features, then build the final train and test sets:
# one-hot encode the categorical features
all_data = pd.get_dummies(all_data)
print(all_data.shape)
# getting the new train and test sets
train = all_data[:ntrain]
test = all_data[ntrain:]
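`get_dummies` one-hot encodes every remaining object column and leaves numeric columns untouched, which is why the column count jumps here. A toy illustration on a hypothetical mini-frame:

```python
import pandas as pd

toy = pd.DataFrame({'MSZoning': ['RL', 'RM', 'RL'],
                    'LotArea': [8450, 9600, 11250]})
encoded = pd.get_dummies(toy)

# numeric columns stay as-is; each category becomes its own indicator column
print(encoded.columns.tolist())  # → ['LotArea', 'MSZoning_RL', 'MSZoning_RM']
```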
5. Building the models
Import the required libraries:
We use sklearn's cross_val_score for cross-validation; since it does not shuffle the data itself, we add a KFold splitter to shuffle and split the dataset.
from sklearn.linear_model import ElasticNet, Lasso, BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb

n_folds = 5

def rmsle_cv(model):
    # shuffled k-fold CV; RMSE on the log-transformed target is the competition's RMSLE
    kf = KFold(n_folds, shuffle=True, random_state=42)
    rmse = np.sqrt(-cross_val_score(model, train.values, y_train,
                                    scoring='neg_mean_squared_error', cv=kf))
    return rmse
Lasso model:
lasso=make_pipeline(RobustScaler(),Lasso(alpha=0.0005,random_state=1))
ElasticNet model:
ENet=make_pipeline(RobustScaler(),ElasticNet(alpha=0.0005,l1_ratio=.9,random_state=3))
KernelRidge (ridge regression with a kernel):
KRR=KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
GradientBoostingRegressor model:
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05, max_depth=4, max_features='sqrt',min_samples_leaf=15, min_samples_split=10, loss='huber', random_state=5)
XGBoost model:
xgb_model = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, learning_rate=0.05, max_depth=3,min_child_weight=1.7817, n_estimators=2200, reg_alpha=0.4640, reg_lambda=0.8571,subsample=0.5213, silent=1, random_state=7, nthread=-1)
LightGBM model:
lgb_model =lgb.LGBMRegressor(objective='regression',num_leaves=1000,learning_rate=0.05,n_estimators=350,reg_alpha=0.9)
Print each model's cross-validation score:
score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(ENet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(KRR)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(GBoost)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(xgb_model)
print("Xgboost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(lgb_model)
print("lightgbm score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Lasso score: 0.1115 (0.0074)
ElasticNet score: 0.1116 (0.0074)
Kernel Ridge score: 0.1153 (0.0075)
Gradient Boosting score: 0.1167 (0.0083)
Xgboost score: 0.1164 (0.0070)
lightgbm score: 0.1288 (0.0058)
Averaging model
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models

    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        # train cloned base models
        for model in self.models_:
            model.fit(X, y)
        return self

    # we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([model.predict(X) for model in self.models_])
        return np.mean(predictions, axis=1)
Here we average the four models ENet, GBoost, KRR, and lasso:
averaged_models = AveragingModels(models=(ENet, GBoost, KRR, lasso))
score_all = rmsle_cv(averaged_models)
print('Averaged base models score: {:.4f} ({:.4f})\n'.format(score_all.mean(), score_all.std()))
Averaged base models score: 0.1087 (0.0077)
Stacking model
Next, add a meta-model on top of the base models.
On top of the averaged base models we add a meta-model, trained on out-of-fold predictions from the base models. The training steps are:
1. Split the training set into two parts: train_a and train_b.
2. Train the base models on train_a.
3. Use the trained base models to predict on train_b.
4. Use the predictions from step 3 as inputs to train the meta-model.
Reference: link. We use 5-fold stacking: split the training set into 5 parts, train on 4 of them and predict on the held-out part each time, and after 5 iterations use the 5 sets of out-of-fold predictions as the meta-model's training inputs (the target variable stays the same). At prediction time, we average the base models' predictions on the test data and feed that into the meta-model.
class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds

    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)

        # Train cloned base models, then create the out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred

        # Now train the cloned meta-model using the out-of-fold predictions as new features
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self

    # Do the predictions of all base models on the test data and use the averaged
    # predictions as meta-features for the final prediction by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_])
        return self.meta_model_.predict(meta_features)
Use the previously defined ENet, GBoost, and KRR as base models and lasso as the meta-model, then train, predict, and score:
stacked_averaged_models = StackingAveragedModels(base_models = (ENet, GBoost, KRR), meta_model = lasso)
score_all_stacked = rmsle_cv(stacked_averaged_models)
print('Stacking Averaged base models score: {:.4f} ({:.4f})\n'.format(score_all_stacked.mean(), score_all_stacked.std()))
Stacking Averaged base models score: 0.1081 (0.0073)
Ensembling
Finally, combine the previous models into a stronger one (an ensemble of the StackedRegressor, XGBoost, and LightGBM models).
Define the evaluation function:
def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))
Train the StackedRegressor, XGBoost, and LightGBM models separately:
stacked_averaged_models.fit(train.values, y_train)
stacked_train_pred = stacked_averaged_models.predict(train.values)
stacked_test_pred = np.expm1(stacked_averaged_models.predict(test.values))
print(rmsle(y_train, stacked_train_pred))
xgb_model.fit(train, y_train)
xgb_train_pred = xgb_model.predict(train)
xgb_test_pred = np.expm1(xgb_model.predict(test))
print(rmsle(y_train, xgb_train_pred))
lgb_model.fit(train, y_train)
lgb_train_pred = lgb_model.predict(train)
lgb_test_pred = np.expm1(lgb_model.predict(test.values))
print(rmsle(y_train, lgb_train_pred))
0.07839506096666995
0.07876052198274874
0.05893922686966146
Combine the StackedRegressor, XGBoost, and LightGBM predictions with a weighted average:
print('RMSLE score on train data all models:')
print(rmsle(y_train, stacked_train_pred * 0.60 + xgb_train_pred * 0.20 + lgb_train_pred * 0.20))
# Ensemble prediction 集成預測
ensemble_result = stacked_test_pred * 0.60 + xgb_test_pred * 0.20 + lgb_test_pred *0.20
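The 0.60/0.20/0.20 weights are hand-picked; one could instead grid-search the weights. A hypothetical sketch with synthetic predictions standing in for the three models:

```python
import numpy as np
from itertools import product

def rmsle(y, y_pred):
    return np.sqrt(np.mean((y - y_pred) ** 2))

# Synthetic stand-ins: three "model" predictions with different noise levels.
rng = np.random.default_rng(1)
y = rng.normal(12.0, 0.4, size=2000)
preds = [y + rng.normal(0, s, size=2000) for s in (0.08, 0.10, 0.12)]

# Search weights in steps of 0.05 that sum to 1.
best_w, best_score = None, np.inf
for i, j in product(range(21), range(21)):
    k = 20 - i - j
    if k < 0:
        continue
    w = (i / 20, j / 20, k / 20)
    score = rmsle(y, sum(wi * p for wi, p in zip(w, preds)))
    if score < best_score:
        best_w, best_score = w, score

print(best_w, round(best_score, 4))
```

In practice the score driving the search should come from out-of-fold predictions rather than in-sample fits, otherwise the weights overfit the training data.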
Generate the submission file:
submission = pd.DataFrame()
submission['Id'] = test_ID
submission['SalePrice'] = ensemble_result
submission.to_csv(r'E:\fangjiayucejieguo\submission.csv', index=False)
Finally, submit the result on the Kaggle site.