
Time Series Pattern Recognition with Air Quality Sensor Data


  • 1. Introduction
  • 2. Exploratory Data Analysis
    ° 2.1 Pattern Changes
    ° 2.2 Correlation Between Features
  • 3. Anomaly Detection and Pattern Recognition
    ° 3.1 Point Anomaly Detection (System Fault)
    ° 3.2 Collective Anomaly Detection (External Event)
    ° 3.3 Clustering and Pattern Recognition (External Event)
  • 4. Conclusion


Note: The detailed project report and the datasets used in this post can be found on my GitHub Page.


1. Introduction

This project was assigned to me by a client. No non-disclosure agreement is required, and the project does not contain any sensitive information, so I decided to make it public as part of my personal data science portfolio while anonymizing the client's information.


In this project, two datasets are provided, each consisting of one week of sensor readings, to accomplish the following four tasks:


1. Find anomalies in the data set to automatically flag events


2. Categorize anomalies as “System fault” or “external event”


3. Provide any other useful conclusions from the pattern in the data set


4. Visualize inter-dependencies of the features in the dataset


In this report, I briefly walk through the steps I used for data analysis, the visualization of feature correlations, the machine learning techniques used to automatically flag "system faults" and "external events", and my findings from the data.


2. Exploratory Data Analysis

My code and results in this section can be found here.


The dataset comes with two CSV files, both of which can be accessed from my GitHub Page. I first import and concatenate them into one Pandas dataframe in Python, then rearrange the dataframe to drop all columns except the 11 features that we are interested in (a minimal loading sketch follows the list):


  • Ozone
  • Hydrogen Sulfide
  • Total VOCs
  • Carbon Dioxide
  • PM 1
  • PM 2.5
  • PM 10
  • Temperature (Internal & External)
  • Humidity (Internal & External)
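As a rough illustration of the loading step, here is a minimal sketch. The file names, the "Timestamp" column name, and the exact feature column labels are assumptions; the real CSVs on the GitHub page may use different ones.

```python
import pandas as pd

# Hypothetical file and column names -- the actual CSVs may differ.
files = ["sensor_week1.csv", "sensor_week2.csv"]
feature_cols = [
    "Ozone", "Hydrogen Sulfide", "Total VOCs", "Carbon Dioxide",
    "PM 1", "PM 2.5", "PM 10",
    "Temperature (Internal)", "Temperature (External)",
    "Humidity (Internal)", "Humidity (External)",
]

# Read both weekly files, parse the timestamp column, and stack them into one dataframe.
frames = [pd.read_csv(f, parse_dates=["Timestamp"]) for f in files]
df = pd.concat(frames, ignore_index=True).sort_values("Timestamp")

# Keep only the timestamp index plus the 11 features of interest.
df = df.set_index("Timestamp")[feature_cols]
```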

The timestamps span from May 26 to June 9, 2020 (14 whole days in total) in the EDT (GMT-4) time zone. By differencing consecutive timestamps, the intervals between readings are found to vary widely, ranging from 7 seconds to 3552 seconds. The top 5 most frequent time intervals are listed below in Table 1; most of them are close to 59 and 60 seconds, so it can be concluded that the sensor reads roughly every minute. However, the inconsistency of the reading intervals might be worth looking into if no deliberate interference is involved, since it might cause trouble in later time series analysis.

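A short sketch of how these intervals can be computed, assuming the datetime-indexed dataframe `df` from the loading sketch above:

```python
# Gaps between consecutive readings, in seconds.
intervals = df.index.to_series().diff().dt.total_seconds().dropna()

print(intervals.min(), intervals.max())   # expected: 7.0 and 3552.0 seconds
print(intervals.value_counts().head(5))   # the five most frequent intervals (Table 1)
```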

Table 1: Top 5 Time Intervals of the Sensor Measurements

The time series data are on different scales for each of the features, so they are normalized for better visualization and machine learning efficiency. They are then plotted and visually inspected to discover any interesting patterns.

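Min-max scaling is one straightforward way to do this normalization; the snippet below is a sketch under that assumption (the report does not specify the exact scaling method).

```python
import matplotlib.pyplot as plt

# Min-max normalize every feature to [0, 1] so all series share one plotting scale.
df_norm = (df - df.min()) / (df.max() - df.min())

# Quick visual inspection of the 11 normalized series.
df_norm.plot(subplots=True, figsize=(12, 18))
plt.show()
```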

2.1 Pattern Changes

Some of the features seem to share similar pattern changes at specific time points. Three of the most significant ones (Temperature External, Humidity External, and Ozone) are shown below in Figure 1. It can be clearly seen that the areas highlighted with pink tend to have flat signals while the unhighlighted areas are sinusoidal.


Figure 1: Readings of Temperature (External), Humidity (External), and Ozone. They experience similar pattern changes at the same time points.

Since common sense says the outdoor temperature reaches its high point at noon and goes down at night, I started to wonder whether different test environments were involved during this 14-day period. To test the idea, Toronto weather data is queried from Canada Weather Stats [1]. The recorded temperature and relative humidity are overlaid and compared with the external temperature and humidity in this dataset, as plotted in Figure 2. It can be seen that the actual temperature and humidity fluctuate in a sinusoidal fashion. Most parts of the temperature and humidity readings correlate well with the weather data, while the areas highlighted in pink remain relatively invariant. I am not provided with any information about the environments in which the measurements were taken, but from the plot it can be reasonably inferred that the device was relocated between indoor and outdoor environments during the 14-day period. This is also tested later by the automatic anomaly detection in Section 3.3.


Figure 2: Temperature (External) and Humidity (External) overlaid with Toronto weather data. The pink highlighted areas remain relatively invariant compared to others.

2.2 Correlation Between Features

Correlation is a technique for investigating the relationship between two quantitative, continuous variables in order to represent their inter-dependencies. Among the different correlation techniques, Pearson's correlation is the most common one; it measures the strength of association between two variables. Its correlation coefficient ranges from -1 to 1, where 1 represents the strongest positive correlation, -1 the strongest negative correlation, and 0 no correlation. The correlation coefficients between each pair of features are calculated and plotted as a heatmap, shown in Table 2. The scatter matrix of selected features is also plotted and attached in the Appendix section.

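The heatmap in Table 2 can be reproduced with pandas and seaborn along these lines (a sketch; the figure size and color map are arbitrary choices):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlation between every pair of the 11 features.
corr = df_norm.corr(method="pearson")

plt.figure(figsize=(9, 7))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Pearson's correlation between features")
plt.tight_layout()
plt.show()
```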

The first thing to notice is that PM 1, PM 2.5, and PM 10 are highly correlated with each other, which means they always fluctuate in the same fashion. Ozone is negatively correlated with Carbon Dioxide and positively correlated with Temperature (Internal) and Temperature (External). On the other hand, it is surprising not to find any significant correlation between Temperature (Internal) and Temperature (External), possibly due to the superior thermal insulation of the instrument. However, since no relevant background knowledge is provided, no conclusion can be drawn on the reasonability of this finding. Besides Ozone, Temperature (Internal) is also negatively correlated with Carbon Dioxide, Hydrogen Sulfide, and the three particulate matter measures. On the contrary, Temperature (External) correlates positively with Humidity (Internal) and the three particulate matter measures, and negatively with Humidity (External), just as can be seen from the time series plots in Figure 1.


Table 2: Heatmap of Pearson's Correlation Coefficients between Features.

3. Anomaly Detection and Pattern Recognition

In this section, various anomaly detection methods are examined on the dataset. The data come without labels, so no prior knowledge or classification rule is provided to distinguish between "system faults", "external events" and others. No details of the instrument or the experiments are provided either. Therefore, my results in this section might deviate from expectations, but I try my best to make assumptions, define the problems, and then solve them based on my personal experience. The section consists of three parts: Point Anomaly Detection, Collective Anomaly Detection, and Clustering.


3.1 Point Anomaly Detection (System Fault)

Point anomalies, or global outliers, are data points that lie entirely outside the scope of the usual signals, without any support from close neighbors. They are usually caused by human or system error and need to be removed during data cleaning for better performance in predictive modeling. In this dataset, by assuming that "system faults" are equivalent to such point anomalies, there are several features worth examining, such as the examples shown below in Figure 3.


Figure 3: Time series signals before (left) and after (right) automatic flagging. The point anomalies (outliers) are marked in red. From top to bottom: Humidity (Internal), Total VOCs, and Carbon Dioxide.

Here, Humidity (Internal), Total VOCs, and Carbon Dioxide each represent a distinct level of complexity in the point anomaly detection task. In the first one, three outliers sit at the level of 0, so a simple boolean filter can do the job of flagging these data points. In the second one, the outliers deviate significantly from the signal that we are interested in, so linear thresholds can be used to separate out the outliers. Both cases are easy to implement because they can be handled by purely experience-based methods. In the third case, however, it is not possible to use a linear threshold to separate out the outliers, because even though they deviate from their neighbors, their values may not be as extreme as the usual signal at other time points.

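The first two cases can be flagged with a couple of lines; a minimal sketch is shown below, assuming the normalized dataframe `df_norm`. The VOC cutoff of 0.8 is purely illustrative, not the threshold used in the report.

```python
# Case 1 -- Humidity (Internal): readings stuck at exactly 0 are flagged.
humidity_fault = df_norm["Humidity (Internal)"] == 0

# Case 2 -- Total VOCs: readings above a fixed linear threshold are flagged.
voc_fault = df_norm["Total VOCs"] > 0.8  # illustrative cutoff

print(humidity_fault.sum(), "zero-level points;", voc_fault.sum(), "threshold outliers")
```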

For such cases, there are many possible approaches. One of the simplest is to calculate the rolling mean or median at each time point and test whether the actual value falls within a prediction interval obtained by adding and subtracting a certain fluctuation range around that central line. Since we are dealing with outliers here, the rolling median is more robust, so it is used for this dataset.


Figure 4: Zoomed-in time series plot of Carbon Dioxide, with its rolling median and prediction interval

Figure 4 demonstrates this approach more clearly: equal-distance prediction intervals are set around the rolling median. Here, the rolling window is set to 5, which means that for each data point we take four of its closest neighbors and calculate the median as the center of the prediction. A prediction interval of ±0.17 is then padded around the center, and any points outside it are considered outliers.

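In pandas this comes down to a centered rolling median plus a fixed band, roughly as sketched below. Note that pandas' centered window of 5 includes the point itself together with its four nearest neighbors, which is a close variant of the description above; the ±0.17 band is on the normalized scale.

```python
# Rolling-median flagging for Carbon Dioxide with a window of 5 and a +/-0.17 band.
co2 = df_norm["Carbon Dioxide"]

center = co2.rolling(window=5, center=True).median()
upper, lower = center + 0.17, center - 0.17

point_anomaly = (co2 > upper) | (co2 < lower)
print(point_anomaly.sum(), "points flagged as outliers")
```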

Detecting point anomalies with this approach is straightforward and efficient. However, it has deficiencies and may not be reliable enough for more complex data. In this model there are two parameters, the rolling window size and the prediction interval size, both of which are defined manually through experimentation with the given data. We are essentially solving the problem encountered when going from case 2 to case 3 in Figure 3 by letting the classification boundary between signal and outliers adjust itself over time. However, the bandwidth is fixed, so the method becomes less useful when the definition of a point anomaly itself changes with time. For example, the definition of a super cheap flight ticket might be totally different between a normal weekday and the holiday season.


That is where machine learning comes into play: the model can learn from the data to adjust the parameters by itself as time changes, so it knows whether a data point should be classified as a point anomaly at a specific time point. However, I cannot construct such a supervised machine learning model for this task, since the premise is to have labeled data, and ours is not labeled. I would still suggest going down this path in the future, because with such a model the most accurate results can be generated no matter how complex the system or the data are.


Even though supervised learning methods do not work for this task, there are unsupervised learning techniques that can be useful, such as clustering, which is also discussed in subsection 3.3. Clustering can group unlabeled data using similarity measures, such as the distance between data points in the vector space. The point anomalies can then be distinguished by selecting those far away from the cluster centers. However, in order to use clustering for point anomaly detection in this dataset, we have to assume that external events are defined over multiple features, instead of treating every single time series separately. More details are discussed in the following two subsections, 3.2 and 3.3.


3.2 Collective Anomaly Detection (External Event)

If we define a "system fault" as a point anomaly, then there are two directions to go with "external events". One of them is to define an external event as a collective anomaly that appears within each individual time series signal. The idea of a collective anomaly is the opposite of a point anomaly: point anomalies are isolated values that deviate greatly from the usual signal, while collective anomalies are usually continuous, but their values are out of expectation, such as a significant increase or decrease at some time points. In this subsection, each of the 11 features is treated separately as a single time series, and the task is to find the abrupt changes that happen in each of them.


For such a problem, one typical approach is to adapt time series forecasting: we fit a model to a certain preceding time period and predict the value that follows. The actual value is then compared against the prediction to see whether it falls inside the prediction interval. It is very similar to the rolling median method used in the previous subsection, except that only the previous time points are used here, instead of neighbors from both directions.


There are different options for the model as well. A traditional time series forecasting model like SARIMA is a good candidate, but it may not be complex enough to accommodate the "patterns" that I mentioned in Section 2.1 and Section 3.3. Another option is to train a supervised regression model on the time series, which is quite widely used nowadays.


The idea is simple: features are extracted from the time series using the concept of a sliding window, as seen in Table 3 and Figure 5. The sliding window size (blue) is set to the desired number of features k. Then, for each data point (orange) in the time series, the features are the data point values from its lag 1 to lag k. As a result, a time series with N samples can be transformed into a table of N-k observations and k features. Next, following the concept of "forward chaining", each point is predicted by a regression model trained on the observations that precede it. In addition to the main regression model, two more quantile regressors are trained with different significance levels to predict the upper and lower bounds of the prediction interval, with which we can tell whether the actual value lies above or below the interval band.

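A small sketch of the sliding-window feature extraction described above (the function name and shape convention are my own):

```python
import numpy as np

def sliding_window_features(series, k):
    """Turn a 1-D series into an (N-k) x k matrix of lagged values,
    with the value right after each window as the regression target."""
    values = np.asarray(series, dtype=float)
    X = np.array([values[i:i + k] for i in range(len(values) - k)])
    y = values[k:]
    return X, y
```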

Table 3: Feature extraction algorithm through the sliding window
Figure 5: Concepts of the sliding window and forward chaining [2]. The features and targets are extracted from the time series using a sliding window. The training process is based on "forward chaining".

This method is applied to the given dataset, and an example of the result is shown below in Figure 6. The Ozone time series is resampled hourly for faster training, and the extracted features are fed into three Gradient Boosting Regressor models (one main and two quantile regressors) using Scikit-Learn. The significance levels are chosen so that the prediction interval represents a 90% confidence interval (shown in green in Figure 6, top). The actual values are then compared with the prediction interval and flagged in red (unexpected increase) or blue (unexpected decrease) in Figure 6, bottom.

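The sketch below shows the gist of the quantile-regressor setup with Scikit-Learn, reusing the `sliding_window_features` helper from the previous sketch. For brevity it uses a single train/flag split rather than the full forward-chaining retraining loop, and the window size k=24 and the 5%/95% quantiles (for a rough 90% band) are illustrative assumptions.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Hourly-resampled Ozone series (column name and resampling rule are assumptions).
ozone = df_norm["Ozone"].resample("1h").mean().dropna()
X, y = sliding_window_features(ozone, k=24)

# One point estimator plus two quantile regressors for the prediction band.
mid = GradientBoostingRegressor()
low = GradientBoostingRegressor(loss="quantile", alpha=0.05)
high = GradientBoostingRegressor(loss="quantile", alpha=0.95)

split = len(X) // 2
for model in (mid, low, high):
    model.fit(X[:split], y[:split])

pred_low, pred_high = low.predict(X[split:]), high.predict(X[split:])
above = y[split:] > pred_high   # unexpected increases (red in Figure 6)
below = y[split:] < pred_low    # unexpected decreases (blue in Figure 6)
```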

Figure 6: Flagging results on the Ozone time series data (hourly sampled) using the quantile regressor method. Top: data and prediction interval; Bottom: data with flags showing points above or below the prediction interval.

The results may not be super impressive yet, because more work still needs to be done on regression model selection and hyperparameter fine-tuning. However, the model is already showing its capability of flagging the abrupt increases and decreases in all these spikes. One great thing about using machine learning models is that the model learns and evolves by itself as it is fed with data. From Figure 6 (bottom), it can be seen that the last three hills (after about 240 hours) have fewer flagged points than the previous ones. This is not only because their magnitudes are smaller, but also because the model is learning from previous experience and starting to adapt to the "idea" that it is now in the "mountains", where periodic fluctuations should be expected. Therefore, it is not hard to conclude that the model performance can get better and better as more data instances are fed in.


Beyond this quantile regression model, deep learning models such as LSTM might be able to achieve better performance. Long Short-Term Memory (LSTM) is a specialized recurrent neural network (RNN) that is one of the state-of-the-art choices for sequence modeling thanks to its special design of feedback connections. However, it takes much more time and effort to set up and fine-tune the network architecture, which exceeds the time allowance of this project, so it is not included as presentable content in this report.


On the other hand, as I mentioned in previous sections, the provided data does not come with labels showing which data points are considered anomalies. This does cause difficulties in the collective anomaly detection task discussed in this subsection by limiting the model choices and performance. In the future, if some labels are provided, this will become a semi-supervised or supervised learning problem, and better results will be easier to achieve.


3.3 Clustering and Pattern Recognition (External Event)

As mentioned in subsection 3.2 above, recognizing "external events" can be approached from two directions: one is to treat every time series separately and monitor any unexpected changes in the sensor signals; the other is to assume the events affect multiple features at the same time, so that we can distinguish the events by looking at the distinctive characteristics shown across different features. In this case, if we had labeled data, it would be a common classification problem, but even without labels we can still approach it with clustering.


Clustering is an unsupervised machine learning technique that finds similarities between data according to the characteristics and groups similar data objects into clusters. It can be used as a stand-alone tool to get insights into data distribution and can also be used as a preprocessing step for other algorithms. There are many distinct methods of clustering, and here I am using two of the most commonly used ones: K-Means and DBSCAN.


K-Means is one of the partitioning methods used for clustering. It randomly partitions objects into non-empty subsets and keeps reassigning objects and adjusting the centroids until a local minimum of the sum of squared distances between each object and its centroid is reached. Density-Based Spatial Clustering of Applications with Noise (DBSCAN), on the other hand, is a density-based method where a cluster is defined as a maximal set of density-connected points [3].

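Both clustering runs can be set up in a few lines with Scikit-Learn, as sketched below on the normalized feature matrix; the cluster count, eps, and min_samples values are illustrative rather than the tuned ones.

```python
from sklearn.cluster import KMeans, DBSCAN

# Each time point becomes one observation with the 11 normalized features as coordinates.
X_clust = df_norm.dropna().values

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_clust)
dbscan_labels = DBSCAN(eps=0.3, min_samples=20).fit_predict(X_clust)  # -1 marks noise points
```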

Figure 7: Clustering results projected onto the first 2 principal components. Left: K-Means; Right: DBSCAN
Figure 8: Clustering results projected onto the first 3 principal components. Left: K-Means; Right: DBSCAN
Figure 9: Clustering results on the Temperature (External) time series. Left: K-Means; Right: DBSCAN

Principal Component Analysis (PCA) is a dimension reduction technique that creates new uncorrelated variables in order to increase interpretability and minimize information loss. In this project, after applying the K-Means and DBSCAN algorithms to the normalized data, PCA is performed and the clustering results are plotted in both 2D (Figure 7) and 3D (Figure 8) using the first 2 and 3 principal components. In addition, to view the clustering results from another perspective, labeled time series plots are made; the Temperature (External) plot is shown in Figure 9 as an example.

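The 2D projection in Figure 7 can be reproduced roughly as follows (shown for the K-Means labels; swapping in `dbscan_labels` gives the right-hand panel):

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Project the 11-dimensional observations onto the first two principal components.
pcs = PCA(n_components=2).fit_transform(X_clust)

plt.scatter(pcs[:, 0], pcs[:, 1], c=kmeans_labels, s=5, cmap="tab10")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("K-Means clusters in PCA space")
plt.show()
```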

From the plots, it can be clearly seen that both methods are able to distinguish the indoor/outdoor pattern changes mentioned in Section 2.1. The main difference is that the partition-based K-Means method is more sensitive to the magnitude changes caused by day/night alternation. Many variables in the dataset are subject to such obvious, simultaneous sinusoidal changes, including Temperature (External and Internal), Humidity (External and Internal), and Ozone, and K-Means tends to treat the peaks and valleys differently. The density-based DBSCAN, on the other hand, cares less about magnitude differences and pays more attention to the density distribution, so it clusters the whole sinusoidal part as one mass cloud, as seen in Figure 7 and Figure 8.


It is not possible to say which clustering method is better at this stage, because they are distinctive enough to serve different interests. If we are more interested in treating the high and low portions of the sinusoidal signals differently, we would use K-Means; if we only want to distinguish between indoor and outdoor mode, then DBSCAN is better. In addition, since this is an unsupervised learning task, there is no way to quantify model performance except by visualizing and judging from experience. In the future, if some labeled data is provided, this can be turned into a semi-supervised learning task, and more intuition can be gained for model selection.


4. Conclusion

In this post, I briefly walk through the approaches and findings of the exploratory data analysis and correlation analysis, as well as the construction of three distinct modeling pipelines used for point anomaly detection, collective anomaly detection, and clustering.


In the exploratory data analysis section, the sensor reading intervals are found to vary severely. Even though most of them are around one minute, the inconsistency is still worth looking into, since it might lower the efficiency and performance of analytical tasks. The measuring environment is found to change based on the time series plots, which is later confirmed by aligning the readings with the actual Toronto weather data as well as the clustering results. In addition, the correlations between features are studied and exemplified. A few puzzles are raised, such as the strange relationship between Temperature (Internal) and Temperature (External), which needs to be studied through experiments or the device itself.


In the anomaly detection section, since "system fault" and "external event" are not clearly defined, I split the project into three different tasks. Point anomalies are defined as severely deviated, isolated data points; the rolling median method is used to successfully automate the process of labeling such point anomalies. Collective anomalies, on the other hand, are defined as deviated collections of data points, usually seen as abrupt increases or decreases; this task is accomplished by extracting features from the time series data and then training regression models. Clustering is also performed on the dataset using K-Means and DBSCAN, both of which play to their strengths and successfully cluster the data by leveraging its similar and dissimilar characteristics.


All of the anomaly detection models introduced in this project are only prototypes, without extensive model selection and fine-tuning. Each of them has great potential to evolve into a better form with more effort and more knowledge of the data. For point anomalies, there are many more machine-learning-based outlier detection techniques, such as isolation forests and local outlier factors, that accommodate more complex data forms. For collective anomalies, the state-of-the-art LSTM is worth putting effort into, especially for time series data and sequence modeling. For clustering, there are many other families of methods, such as hierarchical and grid-based clustering, which are capable of achieving similarly great performance.


Of course, these future directions are advised on the premise of having no labeled data. If experienced engineers or scientists are able to give their insights on which types of data are considered a "system fault" or an "external event", more exciting progress will surely be made by transforming the tasks into semi-supervised or supervised learning problems, where more tools will be available to choose from.


Translated from: https://towardsdatascience.com/time-series-pattern-recognition-with-air-quality-sensor-data-4b94710bb290
