Applied time series forecasting: residential housing in the US
1. Introduction
During these COVID-19 months, the housing sector has been rebounding rapidly after a downturn in the early months of the year. New residential construction fell to about 1 million units in April. As of July, 1.5 million new houses were under construction (for comparison, the figure for July 2019 was 1.3 million). The Census Bureau report released on August 18 shows some pretty good indicators for the housing market.
New house construction plays a significant role in the economy. Besides generating employment it simultaneously impacts timber, furniture and appliance markets. It’s also an important indicator of the overall health of the economy.
So one might ask: how will this crucial economic indicator play out over the next few months and years, after the COVID-19 shock?
Answering this question requires some forecasting.
The purpose of this article is to produce short- and medium-term forecasts of residential construction using a popular time series forecasting model called ARIMA.
Even if you are not much into the housing market but are interested in data science, this is a practical forecasting example that might help you understand how forecasting works and how to implement a model in real-world application cases.
2. Methods summary
The objective is to forecast the construction of residential housing units in the short and medium term using historical data from the census.gov database. Note that the Census Bureau database has several datasets on housing indicators, including “housing units started” and “housing units completed”; I’m using the latter for this article.
The Census Bureau is a great source of time series data on a large number of social, economic and business indicators. So if you are interested in time series analysis and modeling and want to avoid toy datasets, the Census Bureau is a great place to check out.
I am doing the modeling in the R programming environment. Python has great libraries for data science and machine learning, but in my opinion R has the best package for time series forecasting: fpp2, developed by Rob J Hyndman.
There are many methods for time series forecasting, and I have written about them in a few earlier articles, but for this analysis I am going to use ARIMA. Before settling on ARIMA I ran a couple of other models, Holt-Winters and ETS, but found that ARIMA performs better on this particular dataset.
3. Data preparation
The only library I am using is fpp2. If you install this library, all required dependencies are installed along with it.
After importing data in the R programming environment (RStudio) I call the head() function.
# import library
library(fpp2)

# import data
df = read.csv("../housecons.csv", skip=6)

# check out first few rows
head(df)
I noticed that the first few rows are empty, so I opened the CSV file outside of R to inspect the missing values manually and found that the first data point does not appear until January 1968. So I got rid of the missing values with a simple function, na.omit().
# remove missing values
df = na.omit(df)

# check out the rows again
head(df)
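To make the effect of na.omit() concrete, here is a minimal sketch on a made-up table. The column names Period and Value and all numbers are assumptions for illustration; they are not taken from the Census Bureau file.

```r
# Hypothetical stand-in for the imported CSV: the leading rows have no values yet
toy = data.frame(
  Period = c("Nov-1967", "Dec-1967", "Jan-1968", "Feb-1968"),
  Value  = c(NA, NA, 1257, 1174)   # made-up numbers
)

# na.omit() drops every row that contains at least one NA
toy_clean = na.omit(toy)

print(toy_clean)
nrow(toy_clean)   # number of rows left after dropping the NA rows
```

Because na.omit() removes a row if any column is NA, it is worth checking how many rows disappear before moving on.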
As you can see in the dataframe above, it has two columns: the time stamp and the corresponding value. You might think it is already time series data, so let’s go ahead and build the model. Not so fast! The dataframe may look like a time series, but it is not in a format that is compatible with the modeling package.
So we need to do some data processing.
As a side note, this is true not just of this dataset: any dataset you use for this kind of analysis, in any package, needs pre-processing. It is an extra step but a necessary one. After all, this is not a cleaned toy dataset of the kind you typically find on the internet!
# keep only `Value` column
df = df[, c(2)]

# convert values into a time series object
series = ts(df, start = 1968, frequency = 12)

# now check out the series
print(series)
The code above is mostly self-explanatory. Since we got rid of the “Period” column, I had to tell the program that the values start in January 1968 and that the series is monthly, with a frequency of 12 observations per year.
The original data was in long form; after processing, the series prints in a wide year-by-month layout, so you can see a lot of data in a small window.
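As a small illustration of how start and frequency define the time index, here is a sketch using base R only. The values are randomly generated placeholders, not the housing data.

```r
# Made-up monthly values for three years (36 observations), for illustration only
set.seed(1)
values = round(1200 + rnorm(36, sd = 50))

# start = 1968 means January 1968; frequency = 12 means 12 observations per year
toy_series = ts(values, start = 1968, frequency = 12)

print(toy_series)       # printed as a year-by-month grid
start(toy_series)       # first period: year 1968, month 1
end(toy_series)         # last period: year 1970, month 12
frequency(toy_series)   # 12

# window() extracts a sub-period using c(year, month) pairs
window(toy_series, start = c(1969, 1), end = c(1969, 12))
```

The same window() call is what makes it easy to subset the series by date later on.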
We are now done with data processing. Wasn’t that easy compared with preparing data for other machine learning algorithms? I bet it was.
Now that we have the data that we needed, shall we go ahead and build the model?
Not so fast!
4. Exploratory analysis
Exploratory data analysis (EDA) may not seem like a prerequisite, but in my opinion it is, for two reasons. First, without EDA you are flying blind: you have no idea what is going into the model. You need to know what raw material goes into the final product, don’t you?
The second reason is an important one. As you will see later, I had to test the model on two different input series to compare performance. I only took this extra step after discovering that the time series is not smooth: it has a structural break that influenced the model performance (check out the figure below; do you see the structural break?).
Visualizing the series
The nice thing about the fpp2 package is that you don’t have to install visualization libraries separately; plotting is already built in.
# plot the series
autoplot(series) +
  xlab(" ") +
  ylab("New residential construction '000") +
  ggtitle(" New residential construction") +
  theme(plot.title = element_text(size=8))

Figure: Monthly residential construction completed between 1968–2020
That is just a single plot, but there is a lot going on in it. If you are a data scientist, you could stop here, take a closer look and see how much information you can extract from this figure.
Here is my interpretation:
- the data has strong seasonality;
- it also shows some cyclic behavior until c. 1990, which then disappeared;
- the series remained relatively stable from 1992 until the housing crisis;
- the structural break due to the market shock is clearly visible around 2008;
- the market has been recovering since c. 2011 and growing steadily;
- 2020 brings yet another shock, from COVID-19. It is not clearly visible in this figure, but if you zoom in you can detect it.
That is a lot of information from one simple figure, and every bit of it helps build intuition before building models. That is why EDA is so essential in data science.
Trend
The overall trend in the series is already visible in the first figure, but if you want a clearer view of it you can remove the seasonality with a 12-month moving average.
# remove seasonality (monthly variation) to see yearly changes
series_ma = ma(series, 12)
autoplot(series_ma) +
  xlab("Time") + ylab("New residential construction '000") +
  ggtitle("Series after removing seasonality") +
  theme(plot.title = element_text(size=8))

Figure: Series after removing seasonality
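To show why a 12-month moving average removes monthly seasonality, here is a sketch in base R on a synthetic series. It assumes the centered "2x12" weighting, which, as far as I know, is what ma() applies by default for an even order; the series itself is made up (a linear trend plus a seasonal pattern that sums to zero).

```r
# Synthetic monthly series: a linear trend plus a seasonal pattern that sums to zero
months   = 1:120
trend    = 1000 + 2 * months
seasonal = rep(c(-60, -50, -20, 0, 20, 40, 50, 40, 20, 0, -10, -30), times = 10)
x = ts(trend + seasonal, start = c(2010, 1), frequency = 12)

# Centered 2x12 moving average: half weight on the two end months, full weight on
# the 11 months in between, so every calendar month gets a total weight of 1/12
w = c(0.5, rep(1, 11), 0.5) / 12
x_ma = stats::filter(x, filter = w, sides = 2)

# The first and last 6 values are NA because the window does not fit there
head(x_ma, 8)

# Away from the edges, the smoothed series matches the underlying trend
all.equal(as.numeric(x_ma)[7:114], trend[7:114])
```

Each calendar month contributes equally to every average, so a stable seasonal pattern cancels out and only the slower movement remains.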
Seasonality
After seeing the general annual trend, if you want to focus only on seasonality, you can do that with a seasonal sub-series plot.
# Seasonal sub-series plot
series_season = window(series, start=c(1968,1), end=c(2020,7))
ggsubseriesplot(series_season) +
  ylab(" ") +
  ggtitle("Seasonal subseries plot") +
  theme(plot.title = element_text(size=10))

Figure: Seasonal subseries plot
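A sub-series plot groups the observations by calendar month. The numeric counterpart is the average for each month, which can be sketched in base R as follows. The series is synthetic (a made-up seasonal pattern plus noise), not the housing data.

```r
# Synthetic monthly series with a known seasonal pattern (illustrative numbers only)
set.seed(42)
pattern = c(-60, -50, -20, 0, 20, 40, 50, 40, 20, 0, -10, -30)
x = ts(1200 + rep(pattern, times = 8) + rnorm(96, sd = 10),
       start = c(2012, 1), frequency = 12)

# cycle() gives the position of each observation within the year (1 = Jan, ..., 12 = Dec)
month_index = cycle(x)

# Average of all Januaries, all Februaries, and so on
monthly_means = tapply(as.numeric(x), as.integer(month_index), mean)
names(monthly_means) = month.abb
round(monthly_means)
```

Months whose averages sit well above or below the overall mean are the ones driving the seasonal pattern.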
Time series decomposition
There is a nice way to show everything in one figure: the decomposition plot. It is a composite of four pieces of information:
- the original series (i.e. data)
- trend
- seasonality
- random component
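# predictor series: the 2011-2020 subset that is also used for modeling in section 5
predictor_series = window(series, start=c(2011,1), end=c(2020,7))
autoplot(decompose(predictor_series)) +
  ggtitle("Decomposition of the predictor series") +
  theme(plot.title = element_text(size=8))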
The random component in this decomposition plot is the most interesting to me, since this component actually determines the uncertainty in forecasting. The smaller this random component, the better.
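To see how the components relate to each other, here is a sketch with base R’s decompose() on a synthetic series (made-up trend, seasonal pattern and noise). It assumes an additive decomposition, in which the data equals trend plus seasonal plus random.

```r
# Synthetic monthly series: trend + seasonal pattern + noise (made-up numbers)
set.seed(7)
n = 120
pattern = c(-60, -50, -20, 0, 20, 40, 50, 40, 20, 0, -10, -30)
x = ts(1000 + 2 * (1:n) + rep(pattern, times = n / 12) + rnorm(n, sd = 15),
       start = c(2011, 1), frequency = 12)

# Classical additive decomposition from base R
parts = decompose(x, type = "additive")

# The three components add back up to the data wherever the trend is defined
ok = !is.na(parts$trend)
rebuilt = parts$trend + parts$seasonal + parts$random
all.equal(as.numeric(rebuilt)[ok], as.numeric(x)[ok])

# Spread of the random component, a rough gauge of the leftover noise
sd(parts$random, na.rm = TRUE)
```

The standard deviation of the random component gives a single number to compare when judging how noisy two candidate input series are.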
Figure: Time series decomposition
Zooming in
We can also zoom in on a specific part of the data series. For example, below I zoom in on the good times (1993–2005) and the bad times (2005–2015) in the housing market.
# zooming in on high time
series_hightime = window(series, start=c(1993,1), end=c(2005,12))
autoplot(series_hightime) +
  xlab("Time") + ylab("New residential construction '000") +
  ggtitle(" New residential construction high time") +
  theme(plot.title = element_text(size=8))

# zooming in on down time
series_downtime = window(series, start=c(2005,1), end=c(2015,12))
autoplot(series_downtime) +
  xlab("Time") + ylab("New residential construction '000") +
  ggtitle(" New residential construction down time") +
  theme(plot.title = element_text(size=8))

Figure: Some good times and some bad times in the housing market
Enough exploratory analysis. Now let’s move on to the fun part, model building, shall we?
5. Forecasting with ARIMA
I already mentioned the rationale for choosing ARIMA for this forecast: I tested the data with two other models, and ARIMA performed better.
Once your data is preprocessed and ready to use, building the actual model is surprisingly easy. As an aside, this is also the case in most modeling exercises: writing code and running models is a small part of the whole process, which runs from data gathering and cleaning, to building intuition, to finding the right model.
I followed 5 simple steps for implementing ARIMA:
1) determining the predictor series
2) model parameterization
3) plotting forecast values
4) making a point forecast for a specific year
5) model evaluation/accuracy test
Below is all the code needed for model implementation.
# determine the predictor series (in case you choose a subset of the series)
predictor_series = window(series, start=c(2011,1), end=c(2020,7))
autoplot(predictor_series) +
  labs(caption = " ") +
  xlab("Time") + ylab("New residential construction '000") +
  ggtitle(" Predictor series") +
  theme(plot.title = element_text(size=8))

# decomposition
options(repr.plot.width = 6, repr.plot.height = 3)
autoplot(decompose(predictor_series)) +
  ggtitle("Decomposition of the predictor series") +
  theme(plot.title = element_text(size=8))

# model: fit ARIMA, then forecast 60 months (5 years) ahead
fit_arima = auto.arima(predictor_series, seasonal=TRUE, stepwise = FALSE, approximation = FALSE)
forecast_arima = forecast(fit_arima, h=60)

# plot
autoplot(series, series=" Whole series") +
  autolayer(predictor_series, series=" Predictor series") +
  autolayer(forecast_arima, series=" ARIMA Forecast") +
  ggtitle(" ARIMA forecasting") +
  theme(plot.title = element_text(size=8))

# point forecast: total of the last 12 forecast months (year 5)
point_forecast_arima = sum(tail(forecast_arima$mean, n=12))
cat("Forecast value ('000): ", round(point_forecast_arima), "\n")
cat(" Current value ('000): ", sum(tail(predictor_series, n=12)), "\n") # current value

# model description
forecast_arima['model']

# accuracy
accuracy(forecast_arima)
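For readers who want to try the forecasting and point-forecast steps without the housing file, here is a stand-in sketch using only base R. It is not the model above: it uses the built-in AirPassengers series as placeholder data and a fixed seasonal ARIMA(0,1,1)(0,1,1)[12] order chosen for illustration, whereas auto.arima() searches for the order itself.

```r
# Stand-in data: the built-in AirPassengers monthly series (not the housing data)
y = log(AirPassengers)

# Fixed seasonal ARIMA(0,1,1)(0,1,1)[12]; an assumption for illustration,
# whereas auto.arima() searches for the order itself
fit = arima(y, order = c(0, 1, 1),
            seasonal = list(order = c(0, 1, 1), period = 12))

# Forecast 60 months ahead; $pred holds point forecasts, $se their standard errors
fc = predict(fit, n.ahead = 60)

# Approximate 95% interval on the log scale; it widens with the horizon
upper = fc$pred + 1.96 * fc$se
lower = fc$pred - 1.96 * fc$se

# Annual total implied by the final 12 forecast months, back on the original scale
year5_total   = sum(exp(tail(fc$pred, 12)))
current_total = sum(tail(AirPassengers, 12))
cat("Year-5 total:", round(year5_total),
    " Last observed 12 months:", current_total, "\n")
```

Summing the last 12 forecast months mirrors the point-forecast step: it turns a monthly forecast path into one annual figure that can be set against the most recent 12 observed months.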
As I said before, I ran ARIMA with two different data series. The first was the whole series from 1968 to 2020. As you see below, the forecast values are rather flat (red series) and come with a lot of uncertainty.
The forecast looked a bit unrealistic to me given the trend over the last 10 years. You might think this is due to the COVID-19 impact. I don’t think so: the model shouldn’t have picked up that signal yet, since the COVID-19 period is a tiny part of the whole series.
Then I realized that this uncertainty comes from the historical features of the series, including the uneven cycles and trend and the structural break. So I decided to use only the last 10 years of data as the predictor series.
Figure: Forecasting with the whole time series
The figure below shows how the forecast looks once we use a subset of the series as the predictor. You can visually compare the forecast area and the associated uncertainty (in red) between the two figures.
Figure: Forecasting with recent data of the series
A more quantitative comparison of the two models, based on the two input series, tells the same story: both the AIC and the RMSE dropped dramatically for the second model, which performs solidly. One caveat: AIC is only strictly comparable between models fitted to the same data, so the RMSE is the more meaningful of the two numbers here.
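As a sketch of how such a comparison can be put together, the example below fits the same model to a whole series and to its recent subset and tabulates AIC and in-sample RMSE. It uses base R’s arima() with a fixed order and the built-in AirPassengers data as a stand-in; both are assumptions for illustration, not the setup used for the housing series.

```r
# Stand-in data: built-in AirPassengers, whole series versus the most recent years
y_whole  = log(AirPassengers)
y_recent = window(y_whole, start = c(1955, 1))

# Same fixed model order on both inputs (an assumption made for illustration)
fit_model = function(y) {
  arima(y, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12))
}
fit_whole  = fit_model(y_whole)
fit_recent = fit_model(y_recent)

# In-sample RMSE computed from the model residuals
rmse = function(fit) sqrt(mean(residuals(fit)^2))

data.frame(
  input = c("whole series", "recent subset"),
  n_obs = c(length(y_whole), length(y_recent)),
  AIC   = c(AIC(fit_whole), AIC(fit_recent)),
  RMSE  = c(rmse(fit_whole), rmse(fit_recent))
)
```

The n_obs column is included because the two fits see different amounts of data, which is worth keeping in view when reading the AIC column.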
Forecast values
Enough about the model building process; this article is about doing an actual forecast with a real-world dataset. Below are the forecast values for new residential construction in the United States.
Current value ('000 units): 1248
1-year forecast ('000 units): 1310
5-year forecast ('000 units): 1558
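To put these three numbers in perspective, the following sketch works out the implied changes. It assumes the values are annual totals in thousands of units (sums of 12 monthly values, as in the point-forecast code); the variable names are just for illustration.

```r
# Figures reported above, in thousands of units per year
current   = 1248
one_year  = 1310
five_year = 1558

# Absolute and relative change
change_5y = five_year - current             # 310 thousand units
pct_1y = 100 * (one_year / current - 1)     # about 5%
pct_5y = 100 * (five_year / current - 1)    # about 25%

# Implied compound annual growth rate over the five years
cagr = 100 * ((five_year / current)^(1 / 5) - 1)   # about 4.5%

round(c(change_5y = change_5y, pct_1y = pct_1y, pct_5y = pct_5y, cagr = cagr), 1)
```

In other words, the forecast implies annual completions growing by roughly 4 to 5 percent a year over the period.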
6. Conclusions
If residential construction continues along this trend, annual completions are expected to be roughly 300K units higher in five years than they are today (about 1,558K versus 1,248K). But this needs to be watched closely, as the impact of the COVID-19 shock might become more apparent in the next few months.
I probably could have gotten a better model by tuning parameters or trying another model, but I wanted to keep it simple. As the adage goes, all models are wrong but some are useful. Hopefully, this ARIMA model was useful in understanding some market dynamics.
Translated from: https://towardsdatascience.com/applied-time-series-forecasting-residential-housing-in-the-us-f8ab68e63f94