
House Price Prediction: A Regression Problem

Published: 2025/4/16 99 豆豆


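Another common type of machine learning problem is regression, which predicts a continuous value rather than a discrete label. Examples include predicting tomorrow's temperature from meteorological data, or predicting how long a software project will take from its specification.

The data

Let's look at the data first. We want to predict the median price of homes in Boston in the 1970s. The features include the crime rate, the local property tax rate, and so on. The dataset is relatively small: only 506 data points, split into 404 training samples and 102 test samples. Each input feature has its own range of values. Some features are proportions between 0 and 1, some range from 1 to 12, others from 0 to 100, and so on.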

from keras.datasets import boston_housing

(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

train_data.shape
test_data.shape

The two shapes are, respectively:
(404, 13)
(102, 13)

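Each sample has 13 features, such as the crime rate, the average number of rooms per dwelling, and accessibility to highways.

The targets (the values we want the model to predict) are the median home prices, in thousands of dollars: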

train_targets

array([ 15.2, 42.3, 50. , 21.1, 17.7, 18.5, 11.3, 15.6, 15.6,
14.4, 12.1, 17.9, 23.1, 19.9, 15.7, 8.8, 50. , 22.5,
24.1, 27.5, 10.9, 30.8, 32.9, 24. , 18.5, 13.3, 22.9,…

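Preparing the data

Feeding values with very different ranges straight into a neural network is problematic. The network may be able to adapt to such heterogeneous data on its own, but learning directly from unprocessed data works poorly, because features with large values dominate the learning. So we first standardize each feature: subtract its mean and divide by its standard deviation, so that the feature is centered on 0 with a standard deviation of 1.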

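mean = train_data.mean(axis=0)  # axis=0 collapses the rows, i.e. computes the mean of each column
train_data -= mean
std = train_data.std(axis=0)
train_data /= std

# The test data is scaled with the mean and std computed on the training data.
test_data -= mean
test_data /= std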
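To see what this standardization does, here is a minimal sketch on made-up data. The arrays `toy_train` and `toy_test` are hypothetical stand-ins, not part of the Boston dataset. After scaling, each training feature should have a mean of about 0 and a standard deviation of about 1, and the test sample is scaled with the training statistics only.

```python
import numpy as np

# Made-up data for illustration: 4 samples, 2 features on very different scales.
toy_train = np.array([[0.1, 10.0],
                      [0.2, 40.0],
                      [0.3, 70.0],
                      [0.4, 100.0]])
toy_test = np.array([[0.25, 55.0]])

# The statistics come from the training data only.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)

toy_train_scaled = (toy_train - mean) / std
# The test data reuses the training mean and std.
toy_test_scaled = (toy_test - mean) / std

print(toy_train_scaled.mean(axis=0))  # approximately 0 for each feature
print(toy_train_scaled.std(axis=0))   # approximately 1 for each feature
print(toy_test_scaled)
```

Computing the statistics on the training data alone keeps information from the test set out of the workflow.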

Building the network

Because we have so few samples, we use a very small network with two hidden layers of 64 units each. In general, the less training data you have, the worse overfitting will be, and a smaller network is one way to reduce overfitting.

from keras import models
from keras import layers

def build_model():
    # Because we will need to instantiate
    # the same model multiple times,
    # we use a function to construct it.
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu',
                           input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model

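The last layer of the network has a single unit and no activation, so it is a linear layer. This is the typical setup for scalar regression (regression that predicts a single continuous value). Adding an activation function would constrain the range of the output. For example, with a sigmoid activation on the last layer, the network could only learn to predict values between 0 and 1. Because the last layer here is purely linear, the network is free to learn to predict values in any range.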
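The effect of the output activation can be checked with plain Python. This is only an illustrative sketch: `sigmoid` and the list `pre_activations` are names made up here, and the numbers stand in for whatever the last layer computes before any activation.

```python
import math

def sigmoid(x):
    """Logistic sigmoid; its output always lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical values produced by the last layer before any activation.
pre_activations = [-10.0, -1.0, 0.0, 1.0, 10.0, 50.0]

# A linear output layer passes these values through unchanged.
linear_outputs = list(pre_activations)
# A sigmoid output layer squashes them into the 0-1 range.
sigmoid_outputs = [sigmoid(x) for x in pre_activations]

print(linear_outputs)
print(sigmoid_outputs)
```

Since the targets in this dataset go up to 50 (thousand dollars), a sigmoid output could never reach them, while the linear output can.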

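We compile the network with the mse loss function (mean squared error), a loss commonly used for regression problems.

K-fold cross-validation

Because we have so few data points, the validation set would be very small (about 100 samples). As a result, the validation score could fluctuate a lot depending on which samples end up in the validation set and which in the training set. To get a more reliable estimate, we use K-fold cross-validation.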
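As a rough illustration of the two quantities the model reports (the mse loss and the mae metric passed to `compile`), here is a small pure-Python sketch. The functions `mse` and `mae` are written here for illustration and are not the Keras implementations; the predictions in `y_pred` are made up.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: the average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [15.2, 42.3, 50.0, 21.1]  # the first four values of train_targets shown above
y_pred = [17.2, 40.3, 46.0, 21.1]  # hypothetical predictions

print(mse(y_true, y_pred))  # about 6.0
print(mae(y_true, y_pred))  # about 2.0
```

The MAE is in the same unit as the targets, so a value of 2.0 here would mean being off by about $2,000 on average, which is why it is the easier number to interpret.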

import numpy as np

k = 4
num_val_samples = len(train_data) // k
num_epochs = 100
all_scores = []
for i in range(k):
    print('processing fold #', i)
    # Prepare the validation data: data from partition #i
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]

    # Prepare the training data: data from all other partitions.
    # np.concatenate with axis=0 joins the arrays row-wise (stacks them vertically).
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]],
        axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]],
        axis=0)

    # Build the Keras model (already compiled)
    model = build_model()
    # Train the model (in silent mode, verbose=0)
    model.fit(partial_train_data, partial_train_targets,
              epochs=num_epochs, batch_size=1, verbose=0)
    # Evaluate the model on the validation data
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    all_scores.append(val_mae)

Here we set the number of epochs to 100. The results are as follows:
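The slicing logic of the loop is easier to follow on a tiny example. The sketch below uses 12 integers as a stand-in for the samples (an assumption for illustration; the real training set has 404 samples, so each fold holds 404 // 4 = 101 of them) and prints which items land in the validation and training parts of each fold.

```python
import numpy as np

k = 4
data = np.arange(12)  # stand-in for 12 samples, labeled 0..11
num_val_samples = len(data) // k

folds = []
for i in range(k):
    # Fold i uses the i-th contiguous block as validation data...
    val = data[i * num_val_samples:(i + 1) * num_val_samples]
    # ...and everything before and after that block as training data.
    train = np.concatenate([data[:i * num_val_samples],
                            data[(i + 1) * num_val_samples:]], axis=0)
    folds.append((train, val))
    print('fold', i, 'val:', val, 'train:', train)
```

Every sample is used for validation exactly once and for training in the other k - 1 folds.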

all_scores

[2.0750808349930412, 2.117215852926273, 2.9140411863232605, 2.4288365227161068]

np.mean(all_scores)

2.3837935992396706
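
As we can see, after 4-fold cross-validation the predictions are off from the true prices by about $2,400 on average (the targets are in thousands of dollars).

Next we train for 500 epochs, and modify the last part of the code to record the validation score at every epoch: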
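The arithmetic behind that figure can be reproduced from the four scores listed above. The variable names `mean_mae` and `spread` are just illustrative.

```python
from statistics import mean

# The four validation MAE scores reported above, one per fold.
all_scores = [2.0750808349930412, 2.117215852926273,
              2.9140411863232605, 2.4288365227161068]

mean_mae = mean(all_scores)
# Difference between the best and the worst fold.
spread = max(all_scores) - min(all_scores)

# Targets are in thousands of dollars, so multiply by 1000 for dollars.
print(f"mean MAE: {mean_mae:.2f} (about ${mean_mae * 1000:,.0f})")
print(f"spread between folds: {spread:.2f}")
```

The gap between the best and worst fold is a reminder of why a single validation split would be unreliable on a dataset this small.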

from keras import backend as K

# Some memory clean-up
K.clear_session()

num_epochs = 500
all_mae_histories = []
for i in range(k):
    print('processing fold #', i)
    # Prepare the validation data: data from partition #i
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]

    # Prepare the training data: data from all other partitions
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]],
        axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]],
        axis=0)

    # Build the Keras model (already compiled)
    model = build_model()
    # Train the model (in silent mode, verbose=0)
    history = model.fit(partial_train_data, partial_train_targets,
                        validation_data=(val_data, val_targets),
                        epochs=num_epochs, batch_size=1, verbose=0)
    # Note: the key name depends on the Keras version; newer releases
    # store this metric under 'val_mae' instead.
    mae_history = history.history['val_mean_absolute_error']
    all_mae_histories.append(mae_history)

Now we can compute, for each epoch, the average of the validation MAE over all folds.

average_mae_history = [np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]
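The list comprehension above averages "down the columns" of the per-fold histories. As a small sanity check on made-up numbers (`toy_histories` is hypothetical: 4 folds, 3 epochs), the same result can be obtained with `np.mean(..., axis=0)`, which works here because every fold has the same number of epochs.

```python
import numpy as np

# Hypothetical validation MAE per epoch, one inner list per fold.
toy_histories = [
    [4.0, 3.0, 2.5],
    [5.0, 3.5, 2.0],
    [4.5, 3.0, 3.0],
    [4.5, 2.5, 2.5],
]
num_epochs = 3

# Same construction as in the article: average epoch i over all folds.
average_loop = [np.mean([x[i] for x in toy_histories]) for i in range(num_epochs)]
# Equivalent vectorized form: average over the fold axis.
average_axis = np.mean(toy_histories, axis=0)

print(average_loop)
print(average_axis)
```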

Plotting

import matplotlib.pyplot as plt

plt.plot(range(1, len(average_mae_history) + 1), average_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()

The vertical axis covers a wide range and the curve is very noisy, so the trend is hard to see in this plot. We therefore:

Drop the first 10 data points.
Replace each data point with an exponential moving average of the previous points, to obtain a smooth curve.

def smooth_curve(points, factor=0.9):
    smoothed_points = []
    for point in points:
        if smoothed_points:
            # Blend the previous smoothed value with the new point.
            previous = smoothed_points[-1]
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            # The first point has no predecessor, so keep it as-is.
            smoothed_points.append(point)
    return smoothed_points

smooth_mae_history = smooth_curve(average_mae_history[10:])

plt.plot(range(1, len(smooth_mae_history) + 1), smooth_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()
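To make the smoothing concrete, here is the same exponential moving average applied to three made-up points. The function is repeated so the snippet runs on its own, and `factor=0.5` is chosen only to keep the arithmetic easy to follow by hand (the article uses 0.9).

```python
def smooth_curve(points, factor=0.9):
    """Exponential moving average: each output mixes the previous
    smoothed value (weight `factor`) with the new point (weight 1 - factor)."""
    smoothed_points = []
    for point in points:
        if smoothed_points:
            previous = smoothed_points[-1]
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            smoothed_points.append(point)
    return smoothed_points

toy_points = [8.0, 4.0, 2.0]  # made-up values
smoothed = smooth_curve(toy_points, factor=0.5)
print(smoothed)  # [8.0, 6.0, 4.0]
```

The second value is 8.0 * 0.5 + 4.0 * 0.5 = 6.0 and the third is 6.0 * 0.5 + 2.0 * 0.5 = 4.0. A larger factor gives more weight to the history and so produces a smoother curve.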

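The plot shows that the validation MAE stops improving significantly after about 80 epochs. Past that point, the model starts to overfit.

Training the final model

This gives us the best value for the number-of-epochs hyperparameter, roughly 80. We now train a model on the entire training set: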
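The article reads this value off the plot. As one possible alternative, the epoch with the lowest smoothed validation MAE could also be located programmatically. The sketch below is an assumption-laden illustration on a made-up curve (`toy_smoothed`), and it adds back the 10 points that were dropped before smoothing.

```python
# Made-up smoothed validation MAE values (not the article's results).
toy_smoothed = [2.9, 2.7, 2.5, 2.4, 2.45, 2.6]
skipped = 10  # number of leading points dropped before smoothing

# Index of the smallest value in the smoothed curve.
best_index = min(range(len(toy_smoothed)), key=toy_smoothed.__getitem__)
# Epochs are counted from 1, and the curve starts after the skipped points.
best_epoch = best_index + skipped + 1

print(best_index, best_epoch)  # 3 14
```

On a noisy curve the exact minimum is not very meaningful, so treating the result as a rough guide (as the article does with "about 80") is the safer reading.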

# Get a fresh, compiled model.
model = build_model()
# Train it on the entirety of the data.
model.fit(train_data, train_targets,
          epochs=80, batch_size=16, verbose=0)
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)

The result:

test_mae_score

2.5532484335057877

The predicted prices are still off from the actual values by about $2,550 on average.
