
Kaggle Competition: House Price Prediction (House Prices)

Published: 2024/7/5

The competition provides 79 features from which to predict the sale price (SalePrice). The features are a mix of categorical and continuous variables, and there are a large number of missing values. Fortunately, the organizers provide data_description.txt, which explains the meaning of each feature; once you understand it, most of the missing values can be imputed straightforwardly.

The first thing to do in any competition is to understand its evaluation metric; if you get that wrong at the start, all the work that follows may be wasted. House Prices is evaluated by root mean squared error (RMSE), a common metric for regression problems:

\[
\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{N}(y_i-\hat{y}_i)^2}{N}}
\]
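As a sanity check, the metric can be computed directly with NumPy. Note that the model later in this post trains on log-transformed prices (the final submission calls np.exp on the predictions), so in practice the values fed to this function are log-prices; the sample numbers below are hypothetical:

```python
import numpy as np

def rmse(y_true, y_pred):
    # root mean squared error, matching the formula above
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# toy check on made-up log-prices
print(rmse(np.log([200000, 150000]), np.log([190000, 160000])))
```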

My current score is 0.11421.

Two things contributed most to my score:

Feature engineering: mainly ordinal encoding of categorical variables, feature combination, and PCA

Model ensembling: mainly weighted averaging and Stacking

Both are covered below.

Contents:

Exploratory Visualization

Data Cleaning

Feature Engineering

Basic Modeling & Evaluation

Hyperparameter Tuning

Ensemble Methods

Exploratory Visualization

Since there are many raw features, only the construction year (YearBuilt) is visualized here:

plt.figure(figsize=(15,8))
sns.boxplot(x=train.YearBuilt, y=train.SalePrice)  # newer seaborn versions require keyword arguments

Newer houses are generally expected to be more expensive and older ones cheaper, and the plot roughly bears this out. Since YearBuilt takes many distinct values (1872 through 2010), one-hot encoding it would produce overly sparse data, so in the feature engineering step it is numerically encoded with LabelEncoder instead.
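A minimal sketch of that encoding on a toy frame (the column values are made up): LabelEncoder simply maps the sorted unique years to 0, 1, 2, ….

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"YearBuilt": [2003, 1976, 2003, 1915]})
# sorted unique years 1915 < 1976 < 2003 become codes 0, 1, 2
df["YearBuilt"] = LabelEncoder().fit_transform(df["YearBuilt"])
print(df["YearBuilt"].tolist())  # → [2, 1, 2, 0]
```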

Data Cleaning

The main work here is handling missing values. First, count the missing values of each feature:

aa = full.isnull().sum()

aa[aa>0].sort_values(ascending=False)

PoolQC 2908

MiscFeature 2812

Alley 2719

Fence 2346

SalePrice 1459

FireplaceQu 1420

LotFrontage 486

GarageQual 159

GarageCond 159

GarageFinish 159

GarageYrBlt 159

GarageType 157

BsmtExposure 82

BsmtCond 82

BsmtQual 81

BsmtFinType2 80

BsmtFinType1 79

MasVnrType 24

MasVnrArea 23

MSZoning 4

BsmtFullBath 2

BsmtHalfBath 2

Utilities 2

Functional 2

Electrical 1

BsmtUnfSF 1

Exterior1st 1

Exterior2nd 1

TotalBsmtSF 1

GarageCars 1

BsmtFinSF2 1

BsmtFinSF1 1

KitchenQual 1

SaleType 1

GarageArea 1

A careful read of data_description shows that many of the missing values are actually meaningful. Take PoolQC, the first entry in the table above: it describes pool quality, and a missing value means the house simply has no pool, so it can be filled with "None".

All of the following features can be filled with "None":

cols1 = ["PoolQC" , "MiscFeature", "Alley", "Fence", "FireplaceQu", "GarageQual", "GarageCond", "GarageFinish", "GarageYrBlt", "GarageType", "BsmtExposure", "BsmtCond", "BsmtQual", "BsmtFinType2", "BsmtFinType1", "MasVnrType"]

for col in cols1:
    full[col].fillna("None", inplace=True)

The features below mostly represent areas; for example, TotalBsmtSF is the basement area. If a house lacks that component, the missing value is filled with 0.

cols=["MasVnrArea", "BsmtUnfSF", "TotalBsmtSF", "GarageCars", "BsmtFinSF2", "BsmtFinSF1", "GarageArea"]

for col in cols:
    full[col].fillna(0, inplace=True)

LotFrontage correlates strongly with LotAreaCut and Neighborhood, so it is imputed with the group-wise median over those two features.

full['LotFrontage']=full.groupby(['LotAreaCut','Neighborhood'])['LotFrontage'].transform(lambda x: x.fillna(x.median()))
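Note that the construction of LotAreaCut is not shown in the snippets here; a plausible reconstruction (an assumption on my part, not the original code) is to bin LotArea into quantiles with pd.qcut:

```python
import pandas as pd

full = pd.DataFrame({"LotArea": [5000, 7200, 8450, 9600, 11200, 14000]})  # toy values
# the real code likely uses more bins (e.g. 10); 3 keeps this toy example readable
full["LotAreaCut"] = pd.qcut(full["LotArea"], 3, labels=False)
print(full["LotAreaCut"].tolist())  # → [0, 0, 1, 1, 2, 2]
```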

Feature Engineering

Ordinal encoding of categorical variables

Categorical features are usually numericalized with pandas' get_dummies, but in this competition that alone may not be enough. So my approach is to group the data by each feature, compute the mean and median of SalePrice for each level, and assign ordinal values based on that ranking. An example:

MSSubClass identifies the type of dwelling. Group the data by it:

full.groupby(['MSSubClass'])[['SalePrice']].agg(['mean','median','count'])

Ranked according to the table:

'180' : 1
'30' : 2, '45' : 2
'190' : 3, '50' : 3, '90' : 3
'85' : 4, '40' : 4, '160' : 4
'70' : 5, '20' : 5, '75' : 5, '80' : 5, '150' : 5
'120' : 6, '60' : 6

I encoded roughly twenty features this way; see the full code for details.
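As a sketch, the MSSubClass ranking above becomes a dict fed to pandas .map() (the Series values below are made up for illustration):

```python
import pandas as pd

# ordinal values taken from the ranking table above
mssubclass_order = {
    '180': 1,
    '30': 2, '45': 2,
    '190': 3, '50': 3, '90': 3,
    '85': 4, '40': 4, '160': 4,
    '70': 5, '20': 5, '75': 5, '80': 5, '150': 5,
    '120': 6, '60': 6,
}
col = pd.Series(['20', '60', '180', '90'])
print(col.map(mssubclass_order).tolist())  # → [5, 6, 1, 3]
```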

Feature combination

Combining raw features can often produce unexpected gains. However, this dataset has many raw features and trying every combination is infeasible, so I first run Lasso for feature selection and combine only the more important ones.

lasso=Lasso(alpha=0.001)

lasso.fit(X_scaled,y_log)

FI_lasso = pd.DataFrame({"Feature Importance":lasso.coef_}, index=data_pipe.columns)

FI_lasso[FI_lasso["Feature Importance"]!=0].sort_values("Feature Importance").plot(kind="barh",figsize=(15,25))

plt.xticks(rotation=90)

plt.show()

These are the features I ended up adding, the result of many assorted experiments:

class add_feature(BaseEstimator, TransformerMixin):
    def __init__(self, additional=1):
        self.additional = additional

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.additional == 1:
            X["TotalHouse"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"]
            X["TotalArea"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"]
        else:
            X["TotalHouse"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"]
            X["TotalArea"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"]

            X["+_TotalHouse_OverallQual"] = X["TotalHouse"] * X["OverallQual"]
            X["+_GrLivArea_OverallQual"] = X["GrLivArea"] * X["OverallQual"]
            X["+_oMSZoning_TotalHouse"] = X["oMSZoning"] * X["TotalHouse"]
            X["+_oMSZoning_OverallQual"] = X["oMSZoning"] + X["OverallQual"]
            X["+_oMSZoning_YearBuilt"] = X["oMSZoning"] + X["YearBuilt"]
            X["+_oNeighborhood_TotalHouse"] = X["oNeighborhood"] * X["TotalHouse"]
            X["+_oNeighborhood_OverallQual"] = X["oNeighborhood"] + X["OverallQual"]
            X["+_oNeighborhood_YearBuilt"] = X["oNeighborhood"] + X["YearBuilt"]
            X["+_BsmtFinSF1_OverallQual"] = X["BsmtFinSF1"] * X["OverallQual"]

            X["-_oFunctional_TotalHouse"] = X["oFunctional"] * X["TotalHouse"]
            X["-_oFunctional_OverallQual"] = X["oFunctional"] + X["OverallQual"]
            X["-_LotArea_OverallQual"] = X["LotArea"] * X["OverallQual"]
            X["-_TotalHouse_LotArea"] = X["TotalHouse"] + X["LotArea"]
            X["-_oCondition1_TotalHouse"] = X["oCondition1"] * X["TotalHouse"]
            X["-_oCondition1_OverallQual"] = X["oCondition1"] + X["OverallQual"]

            X["Bsmt"] = X["BsmtFinSF1"] + X["BsmtFinSF2"] + X["BsmtUnfSF"]
            X["Rooms"] = X["FullBath"] + X["TotRmsAbvGrd"]
            X["PorchArea"] = X["OpenPorchSF"] + X["EnclosedPorch"] + X["3SsnPorch"] + X["ScreenPorch"]
            X["TotalPlace"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"] + X["OpenPorchSF"] + X["EnclosedPorch"] + X["3SsnPorch"] + X["ScreenPorch"]

        return X

PCA

PCA is a crucial step here and contributes a lot to the final score. The features I added are all highly correlated with the raw features, which can introduce strong multicollinearity, and PCA removes exactly that correlation. Since the goal is not dimensionality reduction, n_components is kept close to the original number of dimensions. This came out of extensive experimentation: add the new features first, then project back down to roughly the same dimensionality.

pca = PCA(n_components=410)

X_scaled=pca.fit_transform(X_scaled)

test_X_scaled = pca.transform(test_X_scaled)

Basic Modeling & Evaluation

First, define a cross-validated RMSE evaluation function:

def rmse_cv(model, X, y):
    rmse = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5))
    return rmse

Thirteen algorithms were evaluated as baselines with 5-fold cross-validation:

LinearRegression

Ridge

Lasso

Random Forest

Gradient Boosting Tree

Support Vector Regression

Linear Support Vector Regression

ElasticNet

Stochastic Gradient Descent

BayesianRidge

KernelRidge

ExtraTreesRegressor

XgBoost
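The `models` list that pairs with `names` below is not shown in the post; a plausible reconstruction (an assumption, using mostly default hyperparameters since the original settings are unknown) would be:

```python
from sklearn.linear_model import (LinearRegression, Ridge, Lasso, ElasticNet,
                                  SGDRegressor, BayesianRidge)
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,
                              ExtraTreesRegressor)
from sklearn.svm import SVR, LinearSVR
from sklearn.kernel_ridge import KernelRidge

try:
    from xgboost import XGBRegressor
except ImportError:  # xgboost may not be installed; substitute so the list still builds
    XGBRegressor = GradientBoostingRegressor

# hyperparameters below are illustrative, not the author's originals
models = [LinearRegression(), Ridge(), Lasso(alpha=0.01, max_iter=10000),
          RandomForestRegressor(), GradientBoostingRegressor(), SVR(),
          LinearSVR(), ElasticNet(alpha=0.001, max_iter=10000),
          SGDRegressor(max_iter=1000, tol=1e-3), BayesianRidge(),
          KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5),
          ExtraTreesRegressor(), XGBRegressor()]
print(len(models))  # → 13, one per name
```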

names = ["LR", "Ridge", "Lasso", "RF", "GBR", "SVR", "LinSVR", "Ela","SGD","Bay","Ker","Extra","Xgb"]

for name, model in zip(names, models):
    score = rmse_cv(model, X_scaled, y_log)
    print("{}: {:.6f}, {:.4f}".format(name, score.mean(), score.std()))

The results are below. Overall the tree models do worse than the linear models, probably because of the sparsity introduced by get_dummies; note, though, that none of these models have been tuned yet.

LR: 1026870159.526766, 488528070.4534

Ridge: 0.117596, 0.0091

Lasso: 0.121474, 0.0060

RF: 0.140764, 0.0052

GBR: 0.124154, 0.0072

SVR: 0.112727, 0.0047

LinSVR: 0.121564, 0.0081

Ela: 0.111113, 0.0059

SGD: 0.159686, 0.0092

Bay: 0.110577, 0.0060

Ker: 0.109276, 0.0055

Extra: 0.136668, 0.0073

Xgb: 0.126614, 0.0070

Next, build a small helper class for hyperparameter tuning. Always keep in mind that the evaluation metric is RMSE, so the printed scores must be RMSE too.

class grid():
    def __init__(self, model):
        self.model = model

    def grid_get(self, X, y, param_grid):
        grid_search = GridSearchCV(self.model, param_grid, cv=5, scoring="neg_mean_squared_error")
        grid_search.fit(X, y)
        # convert negative MSE back to RMSE before printing
        print(grid_search.best_params_, np.sqrt(-grid_search.best_score_))
        grid_search.cv_results_['mean_test_score'] = np.sqrt(-grid_search.cv_results_['mean_test_score'])
        print(pd.DataFrame(grid_search.cv_results_)[['params', 'mean_test_score', 'std_test_score']])

Tuning Lasso, for example:

grid(Lasso()).grid_get(X_scaled,y_log,{'alpha': [0.0004,0.0005,0.0007,0.0006,0.0009,0.0008],'max_iter':[10000]})

{'max_iter': 10000, 'alpha': 0.0005} 0.111296607965

params mean_test_score std_test_score

0 {'max_iter': 10000, 'alpha': 0.0003} 0.111869 0.001513

1 {'max_iter': 10000, 'alpha': 0.0002} 0.112745 0.001753

2 {'max_iter': 10000, 'alpha': 0.0004} 0.111463 0.001392

3 {'max_iter': 10000, 'alpha': 0.0005} 0.111297 0.001339

4 {'max_iter': 10000, 'alpha': 0.0007} 0.111538 0.001284

5 {'max_iter': 10000, 'alpha': 0.0006} 0.111359 0.001315

6 {'max_iter': 10000, 'alpha': 0.0009} 0.111915 0.001206

7 {'max_iter': 10000, 'alpha': 0.0008} 0.111706 0.001229

After many long rounds of testing, I settled on these six models:

lasso = Lasso(alpha=0.0005,max_iter=10000)

ridge = Ridge(alpha=60)

svr = SVR(gamma= 0.0004,kernel='rbf',C=13,epsilon=0.009)

ker = KernelRidge(alpha=0.2 ,kernel='polynomial',degree=3 , coef0=0.8)

ela = ElasticNet(alpha=0.005,l1_ratio=0.08,max_iter=10000)

bay = BayesianRidge()

Ensemble Methods

Weighted averaging

Average the individual models' predictions according to a set of weights:

class AverageWeight(BaseEstimator, RegressorMixin):
    def __init__(self, mod, weight):
        self.mod = mod
        self.weight = weight

    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.mod]
        for model in self.models_:
            model.fit(X, y)
        return self

    def predict(self, X):
        w = list()
        pred = np.array([model.predict(X) for model in self.models_])
        # for every data point, multiply each single-model prediction by its weight, then sum
        for data in range(pred.shape[1]):
            single = [pred[model, data] * weight for model, weight in zip(range(pred.shape[0]), self.weight)]
            w.append(np.sum(single))
        return w
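Incidentally, the per-sample loop in predict is equivalent to a single weighted average along the model axis; a small sketch with made-up predictions:

```python
import numpy as np

# pred: (n_models, n_samples) matrix of per-model predictions (toy values)
pred = np.array([[11.0, 12.0],
                 [13.0, 14.0]])
weight = [0.25, 0.75]

# same result as summing prediction * weight per sample
w = np.average(pred, axis=0, weights=weight)
print(w)  # → [12.5 13.5]
```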

weight_avg = AverageWeight(mod = [lasso,ridge,svr,ker,ela,bay],weight=[w1,w2,w3,w4,w5,w6])

score = rmse_cv(weight_avg,X_scaled,y_log)

print(score.mean()) # 0.10768459878025885

The score is 0.10768, better than any single model.

However, using only the two models SVR and Kernel Ridge works even better; it seems the other models were dragging the average down:

weight_avg = AverageWeight(mod = [svr,ker],weight=[0.5,0.5])

score = rmse_cv(weight_avg,X_scaled,y_log)

print(score.mean()) # 0.10668349587195189

Stacking

The idea behind Stacking (the original post illustrated it with a diagram): in a two-layer setup with five models in the first layer and one meta-model in the second, the first-layer models are trained to produce an \(\mathbb{R}^{n×m}\) feature matrix that serves as the training input of the second-layer model, where n is the number of training rows and m the number of first-layer models.

class stacking(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, mod, meta_model):
        self.mod = mod
        self.meta_model = meta_model
        self.kf = KFold(n_splits=5, random_state=42, shuffle=True)

    def fit(self, X, y):
        self.saved_model = [list() for i in self.mod]
        oof_train = np.zeros((X.shape[0], len(self.mod)))
        for i, model in enumerate(self.mod):
            for train_index, val_index in self.kf.split(X, y):
                renew_model = clone(model)
                renew_model.fit(X[train_index], y[train_index])
                self.saved_model[i].append(renew_model)
                oof_train[val_index, i] = renew_model.predict(X[val_index])
        self.meta_model.fit(oof_train, y)
        return self

    def predict(self, X):
        # average each model's five fold-wise predictions, then feed the matrix to the meta-model
        # (np.column_stack needs a list, not a bare generator)
        whole_test = np.column_stack([np.column_stack([model.predict(X) for model in single_model]).mean(axis=1)
                                      for single_model in self.saved_model])
        return self.meta_model.predict(whole_test)

    def get_oof(self, X, y, test_X):
        oof = np.zeros((X.shape[0], len(self.mod)))
        test_single = np.zeros((test_X.shape[0], 5))
        test_mean = np.zeros((test_X.shape[0], len(self.mod)))
        for i, model in enumerate(self.mod):
            for j, (train_index, val_index) in enumerate(self.kf.split(X, y)):
                clone_model = clone(model)
                clone_model.fit(X[train_index], y[train_index])
                oof[val_index, i] = clone_model.predict(X[val_index])
                test_single[:, j] = clone_model.predict(test_X)
            test_mean[:, i] = test_single.mean(axis=1)
        return oof, test_mean

At first I used get_oof to extract the first-layer feature matrix, concatenated it with the original features, and got the CV score down to 0.1018. On the leaderboard, however, the score got worse, so this approach apparently overfits.

X_train_stack, X_test_stack = stack_model.get_oof(a,b,test_X_scaled)

X_train_add = np.hstack((a,X_train_stack))

X_test_add = np.hstack((test_X_scaled,X_test_stack))

print(rmse_cv(stack_model,X_train_add,b).mean()) # 0.101824682747

For the final submission I used Lasso, Ridge, SVR, Kernel Ridge, ElasticNet, and BayesianRidge as first-layer models and Kernel Ridge as the meta-model.

stack_model = stacking(mod=[lasso,ridge,svr,ker,ela,bay],meta_model=ker)

stack_model.fit(a,b)

pred = np.exp(stack_model.predict(test_X_scaled))

result=pd.DataFrame({'Id':test.Id, 'SalePrice':pred})

result.to_csv("submission.csv",index=False)
