Kaggle Competition: House Prices Prediction
The overall setup of this competition: you are given 79 features describing each house and asked to predict its sale price (SalePrice). The features are a mix of categorical and continuous variables, and there are a lot of missing values. Fortunately the organizers provide data_description.txt, which explains what each feature means; once you understand it, most of the missing values can be imputed sensibly.
The first thing to do in any competition is to understand its evaluation metric; get that wrong at the start and all the later work may be wasted. House Prices is evaluated with root mean squared error (RMSE), a standard metric for regression problems:

\[\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i-\hat{y}_i\right)^2}\]

Note that the leaderboard computes this on the logarithm of the sale price, which is why the code below trains on a log-transformed target (y_log) and exponentiates the final predictions.
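As a quick toy check of the metric (all prices here are made up):

import numpy as np

# RMSE between log-prices, as the leaderboard scores it
y_true = np.log([208500.0, 181500.0, 223500.0])
y_pred = np.log([200000.0, 185000.0, 230000.0])
print(np.sqrt(np.mean((y_true - y_pred) ** 2)))  # ~0.031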
My current score is 0.11421.
Two things gave my score its biggest boosts:

Feature engineering: mainly ordinal encoding of categorical variables, feature combinations, and PCA.
Model ensembling: mainly weighted averaging and Stacking.

Both are explained below.
Contents:

Exploratory Visualization
Data Cleaning
Feature Engineering
Basic Modeling & Evaluation
Hyperparameter Tuning
Ensemble Methods
Exploratory Visualization
Since there are many raw features, only the construction year (YearBuilt) is visualized here:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 8))
sns.boxplot(x=train.YearBuilt, y=train.SalePrice)
New houses are generally assumed to be more expensive and old houses cheaper, and the plot roughly shows that trend. Since YearBuilt takes many distinct values (from 1872 to 2010), one-hot encoding it would produce very sparse data, so in feature engineering it is converted to integer codes instead (LabelEncoder), as sketched below.
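A minimal sketch of that encoding, assuming the combined full dataframe introduced in the next section and sklearn's LabelEncoder (the full code may apply this to a whole list of such columns):

from sklearn.preprocessing import LabelEncoder

# map each distinct year to an integer code
full['YearBuilt'] = LabelEncoder().fit_transform(full['YearBuilt'].astype(str))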
Data Cleaning
The main work here is handling missing values. First, count them per feature (full is the concatenation of the train and test sets, so the 1459 missing SalePrice values are simply the test rows):
aa = full.isnull().sum()
aa[aa>0].sort_values(ascending=False)
PoolQC 2908
MiscFeature 2812
Alley 2719
Fence 2346
SalePrice 1459
FireplaceQu 1420
LotFrontage 486
GarageQual 159
GarageCond 159
GarageFinish 159
GarageYrBlt 159
GarageType 157
BsmtExposure 82
BsmtCond 82
BsmtQual 81
BsmtFinType2 80
BsmtFinType1 79
MasVnrType 24
MasVnrArea 23
MSZoning 4
BsmtFullBath 2
BsmtHalfBath 2
Utilities 2
Functional 2
Electrical 1
BsmtUnfSF 1
Exterior1st 1
Exterior2nd 1
TotalBsmtSF 1
GarageCars 1
BsmtFinSF2 1
BsmtFinSF1 1
KitchenQual 1
SaleType 1
GarageArea 1
If we read data_description carefully, many of these missing values turn out to be meaningful. Take the first entry, PoolQC: it describes pool quality, and a missing value means the house has no pool at all, so it can be filled with "None".

All of the following features can be filled with "None":
cols1 = ["PoolQC" , "MiscFeature", "Alley", "Fence", "FireplaceQu", "GarageQual", "GarageCond", "GarageFinish", "GarageYrBlt", "GarageType", "BsmtExposure", "BsmtCond", "BsmtQual", "BsmtFinType2", "BsmtFinType1", "MasVnrType"]
for col in cols1:
    full[col].fillna("None", inplace=True)
The features below mostly represent areas or counts; for example TotalBsmtSF is the basement area, and if a house has no basement at all, the missing value should simply be 0.
cols=["MasVnrArea", "BsmtUnfSF", "TotalBsmtSF", "GarageCars", "BsmtFinSF2", "BsmtFinSF1", "GarageArea"]
for col in cols:
    full[col].fillna(0, inplace=True)
LotFrontage is strongly related to LotAreaCut and Neighborhood, so its missing values are imputed with the median within groups defined by those two features.
full['LotFrontage']=full.groupby(['LotAreaCut','Neighborhood'])['LotFrontage'].transform(lambda x: x.fillna(x.median()))
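LotAreaCut itself is not defined in these snippets; a minimal sketch, assuming it is simply LotArea binned into quantiles (the exact binning in the full code may differ):

# hypothetical: cut LotArea into 10 equal-frequency bins for grouping
full['LotAreaCut'] = pd.qcut(full.LotArea, 10, labels=False)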
Feature Engineering
Ordinal encoding of categorical features
Categorical features are usually converted to numbers with pandas get_dummies, but in this competition that alone may not be enough. So in addition, I group the data by each categorical feature, compute the mean and median of SalePrice for each of its values, and assign ordinal ranks based on that ordering. An example:
MSSubClass identifies the type of dwelling; group the data by it:
full.groupby(['MSSubClass'])[['SalePrice']].agg(['mean','median','count'])
and rank its values according to the resulting table:
'180': 1,
'30': 2, '45': 2,
'190': 3, '50': 3, '90': 3,
'85': 4, '40': 4, '160': 4,
'70': 5, '20': 5, '75': 5, '80': 5, '150': 5,
'120': 6, '60': 6
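Applied as a pandas map, this might look like the following (a hypothetical sketch; the o prefix matches the oMSZoning / oNeighborhood / oFunctional / oCondition1 column names that appear in the feature-combination code later):

full['oMSSubClass'] = full.MSSubClass.astype(str).map({
    '180': 1, '30': 2, '45': 2,
    '190': 3, '50': 3, '90': 3,
    '85': 4, '40': 4, '160': 4,
    '70': 5, '20': 5, '75': 5, '80': 5, '150': 5,
    '120': 6, '60': 6})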
I ranked 20-odd features this way in total; see the full code for the rest.
Feature combinations
Combining raw features often works surprisingly well, but this dataset has too many raw features to try every combination, so Lasso is used first to select the more important features, and those are the ones combined.
import pandas as pd
from sklearn.linear_model import Lasso

# X_scaled: the standardized feature matrix; y_log: log(SalePrice);
# data_pipe: the feature dataframe before scaling (defined in the full code)
lasso = Lasso(alpha=0.001)
lasso.fit(X_scaled, y_log)
FI_lasso = pd.DataFrame({"Feature Importance": lasso.coef_}, index=data_pipe.columns)

# plot the nonzero coefficients, sorted by magnitude
FI_lasso[FI_lasso["Feature Importance"] != 0].sort_values("Feature Importance").plot(kind="barh", figsize=(15, 25))
plt.xticks(rotation=90)
plt.show()
These are the features that were finally added (the list also reflects lots of other experiments along the way):
from sklearn.base import BaseEstimator, TransformerMixin

class add_feature(BaseEstimator, TransformerMixin):
    def __init__(self, additional=1):
        self.additional = additional

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.additional == 1:
            X["TotalHouse"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"]
            X["TotalArea"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"]
        else:
            X["TotalHouse"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"]
            X["TotalArea"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"]
            X["+_TotalHouse_OverallQual"] = X["TotalHouse"] * X["OverallQual"]
            X["+_GrLivArea_OverallQual"] = X["GrLivArea"] * X["OverallQual"]
            X["+_oMSZoning_TotalHouse"] = X["oMSZoning"] * X["TotalHouse"]
            X["+_oMSZoning_OverallQual"] = X["oMSZoning"] + X["OverallQual"]
            X["+_oMSZoning_YearBuilt"] = X["oMSZoning"] + X["YearBuilt"]
            X["+_oNeighborhood_TotalHouse"] = X["oNeighborhood"] * X["TotalHouse"]
            X["+_oNeighborhood_OverallQual"] = X["oNeighborhood"] + X["OverallQual"]
            X["+_oNeighborhood_YearBuilt"] = X["oNeighborhood"] + X["YearBuilt"]
            X["+_BsmtFinSF1_OverallQual"] = X["BsmtFinSF1"] * X["OverallQual"]
            X["-_oFunctional_TotalHouse"] = X["oFunctional"] * X["TotalHouse"]
            X["-_oFunctional_OverallQual"] = X["oFunctional"] + X["OverallQual"]
            X["-_LotArea_OverallQual"] = X["LotArea"] * X["OverallQual"]
            X["-_TotalHouse_LotArea"] = X["TotalHouse"] + X["LotArea"]
            X["-_oCondition1_TotalHouse"] = X["oCondition1"] * X["TotalHouse"]
            X["-_oCondition1_OverallQual"] = X["oCondition1"] + X["OverallQual"]
            X["Bsmt"] = X["BsmtFinSF1"] + X["BsmtFinSF2"] + X["BsmtUnfSF"]
            X["Rooms"] = X["FullBath"] + X["TotRmsAbvGrd"]
            X["PorchArea"] = X["OpenPorchSF"] + X["EnclosedPorch"] + X["3SsnPorch"] + X["ScreenPorch"]
            X["TotalPlace"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"] + X["OpenPorchSF"] + X["EnclosedPorch"] + X["3SsnPorch"] + X["ScreenPorch"]
        return X
PCA
PCA is a very important step here and helped the final score a lot. The features I added are all highly correlated with the raw features they were built from, which can cause strong multicollinearity, and PCA is exactly what removes that correlation. Since the goal here is not dimensionality reduction, n_components is set close to the original number of dimensions; this came out of a lot of experimentation: add the new features first, then project back down to roughly as many dimensions as before.
from sklearn.decomposition import PCA

pca = PCA(n_components=410)
X_scaled = pca.fit_transform(X_scaled)
test_X_scaled = pca.transform(test_X_scaled)
Basic Modeling & Evaluation
First, define a cross-validated RMSE metric:
import numpy as np
from sklearn.model_selection import cross_val_score

def rmse_cv(model, X, y):
    # cross_val_score returns negative MSE, so negate it and take the root
    rmse = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5))
    return rmse
Thirteen algorithms are evaluated as baselines with 5-fold cross-validation:
LinearRegression
Ridge
Lasso
Random Forest
Gradient Boosting Tree
Support Vector Regression
Linear Support Vector Regression
ElasticNet
Stochastic Gradient Descent
BayesianRidge
KernelRidge
ExtraTreesRegressor
XGBoost
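The models list itself is not shown in the post; a plausible definition matching these names might be the following (any hyperparameters shown are illustrative guesses, not the original values):

from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, SGDRegressor, BayesianRidge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.svm import SVR, LinearSVR
from sklearn.kernel_ridge import KernelRidge

models = [LinearRegression(), Ridge(), Lasso(alpha=0.01, max_iter=10000),
          RandomForestRegressor(), GradientBoostingRegressor(), SVR(), LinearSVR(),
          ElasticNet(alpha=0.001, max_iter=10000), SGDRegressor(max_iter=1000, tol=1e-3),
          BayesianRidge(), KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5),
          ExtraTreesRegressor(), XGBRegressor()]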
names = ["LR", "Ridge", "Lasso", "RF", "GBR", "SVR", "LinSVR", "Ela","SGD","Bay","Ker","Extra","Xgb"]
for name, model in zip(names, models):
score = rmse_cv(model, X_scaled, y_log)
print("{}: {:.6f}, {:.4f}".format(name,score.mean(),score.std()))
The results are below. Overall the tree models do worse than the linear models, probably because of the sparsity that get_dummies introduces; note that none of these models have been tuned yet.
LR: 1026870159.526766, 488528070.4534
Ridge: 0.117596, 0.0091
Lasso: 0.121474, 0.0060
RF: 0.140764, 0.0052
GBR: 0.124154, 0.0072
SVR: 0.112727, 0.0047
LinSVR: 0.121564, 0.0081
Ela: 0.111113, 0.0059
SGD: 0.159686, 0.0092
Bay: 0.110577, 0.0060
Ker: 0.109276, 0.0055
Extra: 0.136668, 0.0073
Xgb: 0.126614, 0.0070
Next, build a small helper class for hyperparameter tuning. Always keep in mind that the evaluation metric is RMSE, so the printed scores must be RMSE as well.
from sklearn.model_selection import GridSearchCV

class grid():
    def __init__(self, model):
        self.model = model

    def grid_get(self, X, y, param_grid):
        grid_search = GridSearchCV(self.model, param_grid, cv=5, scoring="neg_mean_squared_error")
        grid_search.fit(X, y)
        print(grid_search.best_params_, np.sqrt(-grid_search.best_score_))
        # convert the stored negative MSE back to RMSE before printing
        grid_search.cv_results_['mean_test_score'] = np.sqrt(-grid_search.cv_results_['mean_test_score'])
        print(pd.DataFrame(grid_search.cv_results_)[['params', 'mean_test_score', 'std_test_score']])
An example, tuning Lasso:
grid(Lasso()).grid_get(X_scaled,y_log,{'alpha': [0.0004,0.0005,0.0007,0.0006,0.0009,0.0008],'max_iter':[10000]})
{'max_iter': 10000, 'alpha': 0.0005} 0.111296607965
params mean_test_score std_test_score
0 {'max_iter': 10000, 'alpha': 0.0003} 0.111869 0.001513
1 {'max_iter': 10000, 'alpha': 0.0002} 0.112745 0.001753
2 {'max_iter': 10000, 'alpha': 0.0004} 0.111463 0.001392
3 {'max_iter': 10000, 'alpha': 0.0005} 0.111297 0.001339
4 {'max_iter': 10000, 'alpha': 0.0007} 0.111538 0.001284
5 {'max_iter': 10000, 'alpha': 0.0006} 0.111359 0.001315
6 {'max_iter': 10000, 'alpha': 0.0009} 0.111915 0.001206
7 {'max_iter': 10000, 'alpha': 0.0008} 0.111706 0.001229
After many long rounds of testing, these six models were finally chosen:
lasso = Lasso(alpha=0.0005, max_iter=10000)
ridge = Ridge(alpha=60)
svr = SVR(gamma=0.0004, kernel='rbf', C=13, epsilon=0.009)
ker = KernelRidge(alpha=0.2, kernel='polynomial', degree=3, coef0=0.8)
ela = ElasticNet(alpha=0.005, l1_ratio=0.08, max_iter=10000)
bay = BayesianRidge()
Ensemble Methods
Weighted averaging
Average the individual models' predictions with chosen weights:
from sklearn.base import BaseEstimator, RegressorMixin, clone

class AverageWeight(BaseEstimator, RegressorMixin):
    def __init__(self, mod, weight):
        self.mod = mod
        self.weight = weight

    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.mod]
        for model in self.models_:
            model.fit(X, y)
        return self

    def predict(self, X):
        w = list()
        pred = np.array([model.predict(X) for model in self.models_])
        # for every data point: multiply each single model's prediction by its weight, then sum
        for data in range(pred.shape[1]):
            single = [pred[model, data] * weight for model, weight in zip(range(pred.shape[0]), self.weight)]
            w.append(np.sum(single))
        return w
# w1..w6: hand-chosen weights (the values are not shown in the post)
weight_avg = AverageWeight(mod=[lasso, ridge, svr, ker, ela, bay], weight=[w1, w2, w3, w4, w5, w6])
score = rmse_cv(weight_avg, X_scaled, y_log)
print(score.mean())  # 0.10768459878025885
The score, 0.10768, is better than any single model. However, averaging only the two models SVR and KernelRidge turns out even better, so apparently the other models were dragging the average down:
weight_avg = AverageWeight(mod=[svr, ker], weight=[0.5, 0.5])
score = rmse_cv(weight_avg, X_scaled, y_log)
print(score.mean())  # 0.10668349587195189
Stacking
The principle of stacking is usually shown as a two-layer diagram (the figure from the original post is not reproduced here): the first layer holds 5 models and the second layer a single meta-model. The job of the first-layer models is to produce an \(\mathbb{R}^{n\times m}\) feature matrix on which the second-layer model is trained, where n is the number of training rows and m the number of first-layer models.
from sklearn.base import TransformerMixin
from sklearn.model_selection import KFold

class stacking(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, mod, meta_model):
        self.mod = mod
        self.meta_model = meta_model
        self.kf = KFold(n_splits=5, random_state=42, shuffle=True)

    def fit(self, X, y):
        self.saved_model = [list() for i in self.mod]
        oof_train = np.zeros((X.shape[0], len(self.mod)))

        for i, model in enumerate(self.mod):
            for train_index, val_index in self.kf.split(X, y):
                renew_model = clone(model)
                renew_model.fit(X[train_index], y[train_index])
                self.saved_model[i].append(renew_model)
                oof_train[val_index, i] = renew_model.predict(X[val_index])

        # the meta model is trained on the out-of-fold predictions
        self.meta_model.fit(oof_train, y)
        return self

    def predict(self, X):
        # for each base model, average the predictions of its 5 fold-wise copies
        whole_test = np.column_stack([np.column_stack([model.predict(X) for model in single_model]).mean(axis=1)
                                      for single_model in self.saved_model])
        return self.meta_model.predict(whole_test)

    def get_oof(self, X, y, test_X):
        oof = np.zeros((X.shape[0], len(self.mod)))
        test_single = np.zeros((test_X.shape[0], 5))  # 5 = n_splits
        test_mean = np.zeros((test_X.shape[0], len(self.mod)))
        for i, model in enumerate(self.mod):
            for j, (train_index, val_index) in enumerate(self.kf.split(X, y)):
                clone_model = clone(model)
                clone_model.fit(X[train_index], y[train_index])
                oof[val_index, i] = clone_model.predict(X[val_index])
                test_single[:, j] = clone_model.predict(test_X)
            test_mean[:, i] = test_single.mean(axis=1)
        return oof, test_mean
At first I used get_oof to extract the first-layer feature matrix and concatenate it with the original features; the CV score dropped to 0.1018, yet the leaderboard score got worse, so this approach apparently overfits:
# a, b: the scaled training matrix and the log target as arrays (names from the original code)
X_train_stack, X_test_stack = stack_model.get_oof(a, b, test_X_scaled)
X_train_add = np.hstack((a, X_train_stack))
X_test_add = np.hstack((test_X_scaled, X_test_stack))
print(rmse_cv(stack_model, X_train_add, b).mean())  # 0.101824682747
For the final submission I used Lasso, Ridge, SVR, KernelRidge, ElasticNet and BayesianRidge as the first-layer models, with KernelRidge as the second-layer meta-model.
stack_model = stacking(mod=[lasso, ridge, svr, ker, ela, bay], meta_model=ker)
stack_model.fit(a, b)
pred = np.exp(stack_model.predict(test_X_scaled))  # invert the log transform of the target
result = pd.DataFrame({'Id': test.Id, 'SalePrice': pred})
result.to_csv("submission.csv", index=False)