Kaggle Competition: House Prices Prediction
The overall setup of this competition: you are given 79 features describing each house and asked to predict its sale price (SalePrice). The features are a mix of categorical and continuous variables, and there are a lot of missing values. Fortunately the organizers provide data_description.txt, which explains what each feature means; once you understand it, most of the missing values can be imputed sensibly.
The first thing to do in any competition is to understand its evaluation metric; get that wrong at the start and all the later work may be wasted. House Prices is evaluated with root mean squared error (RMSE), a standard metric for regression problems:

\[\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i-\hat{y}_i\right)^2}\]

Note that the leaderboard computes this on the logarithm of the sale price, which is why the code below trains on a log-transformed target (y_log) and exponentiates the final predictions.
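As a quick toy check of the metric (all prices here are made up):

import numpy as np

# RMSE between log-prices, as the leaderboard scores it
y_true = np.log([208500.0, 181500.0, 223500.0])
y_pred = np.log([200000.0, 185000.0, 230000.0])
print(np.sqrt(np.mean((y_true - y_pred) ** 2)))  # ~0.031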
My current score is 0.11421.
Two things gave my score its biggest boosts:

Feature engineering: mainly ordinal encoding of categorical variables, feature combinations, and PCA.
Model ensembling: mainly weighted averaging and Stacking.

Both are explained below.
Contents:

Exploratory Visualization
Data Cleaning
Feature Engineering
Basic Modeling & Evaluation
Hyperparameter Tuning
Ensemble Methods
Exploratory Visualization
Since there are many raw features, only the construction year (YearBuilt) is visualized here:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 8))
sns.boxplot(x=train.YearBuilt, y=train.SalePrice)
New houses are generally assumed to be more expensive and old houses cheaper, and the plot roughly shows that trend. Since YearBuilt takes many distinct values (from 1872 to 2010), one-hot encoding it would produce very sparse data, so in feature engineering it is converted to integer codes instead (LabelEncoder), as sketched below.
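A minimal sketch of that encoding, assuming the combined full dataframe introduced in the next section and sklearn's LabelEncoder (the full code may apply this to a whole list of such columns):

from sklearn.preprocessing import LabelEncoder

# map each distinct year to an integer code
full['YearBuilt'] = LabelEncoder().fit_transform(full['YearBuilt'].astype(str))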
Data Cleaning
The main work here is handling missing values. First, count them per feature (full is the concatenation of the train and test sets, so the 1459 missing SalePrice values are simply the test rows):
aa = full.isnull().sum()
aa[aa>0].sort_values(ascending=False)
PoolQC 2908
MiscFeature 2812
Alley 2719
Fence 2346
SalePrice 1459
FireplaceQu 1420
LotFrontage 486
GarageQual 159
GarageCond 159
GarageFinish 159
GarageYrBlt 159
GarageType 157
BsmtExposure 82
BsmtCond 82
BsmtQual 81
BsmtFinType2 80
BsmtFinType1 79
MasVnrType 24
MasVnrArea 23
MSZoning 4
BsmtFullBath 2
BsmtHalfBath 2
Utilities 2
Functional 2
Electrical 1
BsmtUnfSF 1
Exterior1st 1
Exterior2nd 1
TotalBsmtSF 1
GarageCars 1
BsmtFinSF2 1
BsmtFinSF1 1
KitchenQual 1
SaleType 1
GarageArea 1
If we read data_description carefully, many of these missing values turn out to be meaningful. Take the first entry, PoolQC: it describes pool quality, and a missing value means the house has no pool at all, so it can be filled with "None".

All of the following features can be filled with "None":
cols1 = ["PoolQC" , "MiscFeature", "Alley", "Fence", "FireplaceQu", "GarageQual", "GarageCond", "GarageFinish", "GarageYrBlt", "GarageType", "BsmtExposure", "BsmtCond", "BsmtQual", "BsmtFinType2", "BsmtFinType1", "MasVnrType"]
for col in cols1:
    full[col].fillna("None", inplace=True)
The features below mostly represent areas or counts; for example TotalBsmtSF is the basement area, and if a house has no basement at all, the missing value should simply be 0.
cols=["MasVnrArea", "BsmtUnfSF", "TotalBsmtSF", "GarageCars", "BsmtFinSF2", "BsmtFinSF1", "GarageArea"]
for col in cols:
    full[col].fillna(0, inplace=True)
LotFrontage is strongly related to LotAreaCut and Neighborhood, so its missing values are imputed with the median within groups defined by those two features.
full['LotFrontage']=full.groupby(['LotAreaCut','Neighborhood'])['LotFrontage'].transform(lambda x: x.fillna(x.median()))
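LotAreaCut itself is not defined in these snippets; a minimal sketch, assuming it is simply LotArea binned into quantiles (the exact binning in the full code may differ):

# hypothetical: cut LotArea into 10 equal-frequency bins for grouping
full['LotAreaCut'] = pd.qcut(full.LotArea, 10, labels=False)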
Feature Engineering
Ordinal encoding of categorical features
Categorical features are usually converted to numbers with pandas get_dummies, but in this competition that alone may not be enough. So in addition, I group the data by each categorical feature, compute the mean and median of SalePrice for each of its values, and assign ordinal ranks based on that ordering. An example:
MSSubClass identifies the type of dwelling; group the data by it:
full.groupby(['MSSubClass'])[['SalePrice']].agg(['mean','median','count'])
and rank its values according to the resulting table:
'180': 1,
'30': 2, '45': 2,
'190': 3, '50': 3, '90': 3,
'85': 4, '40': 4, '160': 4,
'70': 5, '20': 5, '75': 5, '80': 5, '150': 5,
'120': 6, '60': 6
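Applied as a pandas map, this might look like the following (a hypothetical sketch; the o prefix matches the oMSZoning / oNeighborhood / oFunctional / oCondition1 column names that appear in the feature-combination code later):

full['oMSSubClass'] = full.MSSubClass.astype(str).map({
    '180': 1, '30': 2, '45': 2,
    '190': 3, '50': 3, '90': 3,
    '85': 4, '40': 4, '160': 4,
    '70': 5, '20': 5, '75': 5, '80': 5, '150': 5,
    '120': 6, '60': 6})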
I ranked 20-odd features this way in total; see the full code for the rest.
Feature combinations
Combining raw features often works surprisingly well, but this dataset has too many raw features to try every combination, so Lasso is used first to select the more important features, and those are the ones combined.
import pandas as pd
from sklearn.linear_model import Lasso

# X_scaled: the standardized feature matrix; y_log: log(SalePrice);
# data_pipe: the feature dataframe before scaling (defined in the full code)
lasso = Lasso(alpha=0.001)
lasso.fit(X_scaled, y_log)
FI_lasso = pd.DataFrame({"Feature Importance": lasso.coef_}, index=data_pipe.columns)

# plot the nonzero coefficients, sorted by magnitude
FI_lasso[FI_lasso["Feature Importance"] != 0].sort_values("Feature Importance").plot(kind="barh", figsize=(15, 25))
plt.xticks(rotation=90)
plt.show()
These are the features that were finally added (the list also reflects lots of other experiments along the way):
from sklearn.base import BaseEstimator, TransformerMixin

class add_feature(BaseEstimator, TransformerMixin):
    def __init__(self, additional=1):
        self.additional = additional

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.additional == 1:
            X["TotalHouse"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"]
            X["TotalArea"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"]
        else:
            X["TotalHouse"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"]
            X["TotalArea"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"]
            X["+_TotalHouse_OverallQual"] = X["TotalHouse"] * X["OverallQual"]
            X["+_GrLivArea_OverallQual"] = X["GrLivArea"] * X["OverallQual"]
            X["+_oMSZoning_TotalHouse"] = X["oMSZoning"] * X["TotalHouse"]
            X["+_oMSZoning_OverallQual"] = X["oMSZoning"] + X["OverallQual"]
            X["+_oMSZoning_YearBuilt"] = X["oMSZoning"] + X["YearBuilt"]
            X["+_oNeighborhood_TotalHouse"] = X["oNeighborhood"] * X["TotalHouse"]
            X["+_oNeighborhood_OverallQual"] = X["oNeighborhood"] + X["OverallQual"]
            X["+_oNeighborhood_YearBuilt"] = X["oNeighborhood"] + X["YearBuilt"]
            X["+_BsmtFinSF1_OverallQual"] = X["BsmtFinSF1"] * X["OverallQual"]
            X["-_oFunctional_TotalHouse"] = X["oFunctional"] * X["TotalHouse"]
            X["-_oFunctional_OverallQual"] = X["oFunctional"] + X["OverallQual"]
            X["-_LotArea_OverallQual"] = X["LotArea"] * X["OverallQual"]
            X["-_TotalHouse_LotArea"] = X["TotalHouse"] + X["LotArea"]
            X["-_oCondition1_TotalHouse"] = X["oCondition1"] * X["TotalHouse"]
            X["-_oCondition1_OverallQual"] = X["oCondition1"] + X["OverallQual"]
            X["Bsmt"] = X["BsmtFinSF1"] + X["BsmtFinSF2"] + X["BsmtUnfSF"]
            X["Rooms"] = X["FullBath"] + X["TotRmsAbvGrd"]
            X["PorchArea"] = X["OpenPorchSF"] + X["EnclosedPorch"] + X["3SsnPorch"] + X["ScreenPorch"]
            X["TotalPlace"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"] + X["OpenPorchSF"] + X["EnclosedPorch"] + X["3SsnPorch"] + X["ScreenPorch"]
        return X
PCA
PCA is a very important step here and helped the final score a lot. The features I added are all highly correlated with the raw features they were built from, which can cause strong multicollinearity, and PCA is exactly what removes that correlation. Since the goal here is not dimensionality reduction, n_components is set close to the original number of dimensions; this came out of a lot of experimentation: add the new features first, then project back down to roughly as many dimensions as before.
from sklearn.decomposition import PCA

pca = PCA(n_components=410)
X_scaled = pca.fit_transform(X_scaled)
test_X_scaled = pca.transform(test_X_scaled)
Basic Modeling & Evaluation
First, define a cross-validated RMSE metric:
import numpy as np
from sklearn.model_selection import cross_val_score

def rmse_cv(model, X, y):
    # cross_val_score returns negative MSE, so negate it and take the root
    rmse = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5))
    return rmse
Thirteen algorithms are evaluated as baselines with 5-fold cross-validation:
LinearRegression
Ridge
Lasso
Random Forest
Gradient Boosting Tree
Support Vector Regression
Linear Support Vector Regression
ElasticNet
Stochastic Gradient Descent
BayesianRidge
KernelRidge
ExtraTreesRegressor
XGBoost
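The models list itself is not shown in the post; a plausible definition matching these names might be the following (any hyperparameters shown are illustrative guesses, not the original values):

from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, SGDRegressor, BayesianRidge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.svm import SVR, LinearSVR
from sklearn.kernel_ridge import KernelRidge

models = [LinearRegression(), Ridge(), Lasso(alpha=0.01, max_iter=10000),
          RandomForestRegressor(), GradientBoostingRegressor(), SVR(), LinearSVR(),
          ElasticNet(alpha=0.001, max_iter=10000), SGDRegressor(max_iter=1000, tol=1e-3),
          BayesianRidge(), KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5),
          ExtraTreesRegressor(), XGBRegressor()]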
names = ["LR", "Ridge", "Lasso", "RF", "GBR", "SVR", "LinSVR", "Ela","SGD","Bay","Ker","Extra","Xgb"]
for name, model in zip(names, models):
score = rmse_cv(model, X_scaled, y_log)
print("{}: {:.6f}, {:.4f}".format(name,score.mean(),score.std()))
The results are below. Overall the tree models do worse than the linear models, probably because of the sparsity that get_dummies introduces; note that none of these models have been tuned yet.
LR: 1026870159.526766, 488528070.4534
Ridge: 0.117596, 0.0091
Lasso: 0.121474, 0.0060
RF: 0.140764, 0.0052
GBR: 0.124154, 0.0072
SVR: 0.112727, 0.0047
LinSVR: 0.121564, 0.0081
Ela: 0.111113, 0.0059
SGD: 0.159686, 0.0092
Bay: 0.110577, 0.0060
Ker: 0.109276, 0.0055
Extra: 0.136668, 0.0073
Xgb: 0.126614, 0.0070
Next, build a small helper class for hyperparameter tuning. Always keep in mind that the evaluation metric is RMSE, so the printed scores must be RMSE as well.
from sklearn.model_selection import GridSearchCV

class grid():
    def __init__(self, model):
        self.model = model

    def grid_get(self, X, y, param_grid):
        grid_search = GridSearchCV(self.model, param_grid, cv=5, scoring="neg_mean_squared_error")
        grid_search.fit(X, y)
        print(grid_search.best_params_, np.sqrt(-grid_search.best_score_))
        # convert the stored negative MSE back to RMSE before printing
        grid_search.cv_results_['mean_test_score'] = np.sqrt(-grid_search.cv_results_['mean_test_score'])
        print(pd.DataFrame(grid_search.cv_results_)[['params', 'mean_test_score', 'std_test_score']])
An example, tuning Lasso:
grid(Lasso()).grid_get(X_scaled,y_log,{'alpha': [0.0004,0.0005,0.0007,0.0006,0.0009,0.0008],'max_iter':[10000]})
{'max_iter': 10000, 'alpha': 0.0005} 0.111296607965
params mean_test_score std_test_score
0 {'max_iter': 10000, 'alpha': 0.0003} 0.111869 0.001513
1 {'max_iter': 10000, 'alpha': 0.0002} 0.112745 0.001753
2 {'max_iter': 10000, 'alpha': 0.0004} 0.111463 0.001392
3 {'max_iter': 10000, 'alpha': 0.0005} 0.111297 0.001339
4 {'max_iter': 10000, 'alpha': 0.0007} 0.111538 0.001284
5 {'max_iter': 10000, 'alpha': 0.0006} 0.111359 0.001315
6 {'max_iter': 10000, 'alpha': 0.0009} 0.111915 0.001206
7 {'max_iter': 10000, 'alpha': 0.0008} 0.111706 0.001229
After many long rounds of testing, these six models were finally chosen:
lasso = Lasso(alpha=0.0005, max_iter=10000)
ridge = Ridge(alpha=60)
svr = SVR(gamma=0.0004, kernel='rbf', C=13, epsilon=0.009)
ker = KernelRidge(alpha=0.2, kernel='polynomial', degree=3, coef0=0.8)
ela = ElasticNet(alpha=0.005, l1_ratio=0.08, max_iter=10000)
bay = BayesianRidge()
Ensemble Methods
Weighted averaging
Average the individual models' predictions with chosen weights:
from sklearn.base import BaseEstimator, RegressorMixin, clone

class AverageWeight(BaseEstimator, RegressorMixin):
    def __init__(self, mod, weight):
        self.mod = mod
        self.weight = weight

    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.mod]
        for model in self.models_:
            model.fit(X, y)
        return self

    def predict(self, X):
        w = list()
        pred = np.array([model.predict(X) for model in self.models_])
        # for every data point: multiply each single model's prediction by its weight, then sum
        for data in range(pred.shape[1]):
            single = [pred[model, data] * weight for model, weight in zip(range(pred.shape[0]), self.weight)]
            w.append(np.sum(single))
        return w
# w1..w6: hand-chosen weights (the values are not shown in the post)
weight_avg = AverageWeight(mod=[lasso, ridge, svr, ker, ela, bay], weight=[w1, w2, w3, w4, w5, w6])
score = rmse_cv(weight_avg, X_scaled, y_log)
print(score.mean())  # 0.10768459878025885
The score, 0.10768, is better than any single model. However, averaging only the two models SVR and KernelRidge turns out even better, so apparently the other models were dragging the average down:
weight_avg = AverageWeight(mod=[svr, ker], weight=[0.5, 0.5])
score = rmse_cv(weight_avg, X_scaled, y_log)
print(score.mean())  # 0.10668349587195189
Stacking
The principle of stacking is usually shown as a two-layer diagram (the figure from the original post is not reproduced here): the first layer holds 5 models and the second layer a single meta-model. The job of the first-layer models is to produce an \(\mathbb{R}^{n\times m}\) feature matrix on which the second-layer model is trained, where n is the number of training rows and m the number of first-layer models.
from sklearn.base import TransformerMixin
from sklearn.model_selection import KFold

class stacking(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, mod, meta_model):
        self.mod = mod
        self.meta_model = meta_model
        self.kf = KFold(n_splits=5, random_state=42, shuffle=True)

    def fit(self, X, y):
        self.saved_model = [list() for i in self.mod]
        oof_train = np.zeros((X.shape[0], len(self.mod)))

        for i, model in enumerate(self.mod):
            for train_index, val_index in self.kf.split(X, y):
                renew_model = clone(model)
                renew_model.fit(X[train_index], y[train_index])
                self.saved_model[i].append(renew_model)
                oof_train[val_index, i] = renew_model.predict(X[val_index])

        # the meta model is trained on the out-of-fold predictions
        self.meta_model.fit(oof_train, y)
        return self

    def predict(self, X):
        # for each base model, average the predictions of its 5 fold-wise copies
        whole_test = np.column_stack([np.column_stack([model.predict(X) for model in single_model]).mean(axis=1)
                                      for single_model in self.saved_model])
        return self.meta_model.predict(whole_test)

    def get_oof(self, X, y, test_X):
        oof = np.zeros((X.shape[0], len(self.mod)))
        test_single = np.zeros((test_X.shape[0], 5))  # 5 = n_splits
        test_mean = np.zeros((test_X.shape[0], len(self.mod)))
        for i, model in enumerate(self.mod):
            for j, (train_index, val_index) in enumerate(self.kf.split(X, y)):
                clone_model = clone(model)
                clone_model.fit(X[train_index], y[train_index])
                oof[val_index, i] = clone_model.predict(X[val_index])
                test_single[:, j] = clone_model.predict(test_X)
            test_mean[:, i] = test_single.mean(axis=1)
        return oof, test_mean
At first I used get_oof to extract the first-layer feature matrix and concatenate it with the original features; the CV score dropped to 0.1018, yet the leaderboard score got worse, so this approach apparently overfits:
# a, b: the scaled training matrix and the log target as arrays (names from the original code)
X_train_stack, X_test_stack = stack_model.get_oof(a, b, test_X_scaled)
X_train_add = np.hstack((a, X_train_stack))
X_test_add = np.hstack((test_X_scaled, X_test_stack))
print(rmse_cv(stack_model, X_train_add, b).mean())  # 0.101824682747
For the final submission I used Lasso, Ridge, SVR, KernelRidge, ElasticNet and BayesianRidge as the first-layer models, with KernelRidge as the second-layer meta-model.
stack_model = stacking(mod=[lasso, ridge, svr, ker, ela, bay], meta_model=ker)
stack_model.fit(a, b)
pred = np.exp(stack_model.predict(test_X_scaled))  # invert the log transform of the target
result = pd.DataFrame({'Id': test.Id, 'SalePrice': pred})
result.to_csv("submission.csv", index=False)