【Paper Reading】Learning Traffic as Images: A Deep Convolutional Neural Network for Large-Scale Transportation Network Speed Prediction (2)
- 2. Methods
- 2.1. Converting Network Traffic to Images
- 2.2. CNN for Network Traffic Prediction
- 2.2.1. CNN Characteristics
- 2.2.2. CNN Architecture
- 2.2.3. Convolutional Layers and Pooling Layers of the CNN
- 2.2.4. CNN Optimization
- References
Note: to read the original paper, follow the link.
2. Methods
Traffic information with time and space dimensions should be jointly considered to predict network-wide traffic congestion. Let the $x$- and $y$-axis of a matrix represent time and space, respectively. The elements of the matrix are the values of traffic variables associated with a time and a place. The generated matrix can be viewed as a channel of an image, in the sense that every pixel in the image takes the corresponding value in the matrix. As a result, the image is $M$ pixels wide and $N$ pixels high, where $M$ and $N$ are the two dimensions of the matrix. A two-step methodology, converting network traffic to images and applying a CNN for network traffic prediction, is designed to learn from the matrix and make predictions.
2.1. Converting Network Traffic to Images
A vehicle trajectory recorded by a floating car with a dedicated GPS device provides specific information on vehicle speed and position at a certain time. From the trajectory, the spatiotemporal traffic information on each road segment can be estimated and further integrated into a time-space matrix that serves as a time-space image.
In the time dimension, time usually ranges from the beginning to the end of a day, and the time intervals, usually between 10 s and 5 min, depend on the sampling resolution of the GPS devices. Very narrow intervals, for example 10 s, are generally meaningless for traffic prediction. Thus, if the sampling resolution is high, the data may be aggregated into wider intervals, such as several minutes.
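As a concrete illustration of this aggregation step, the sketch below averages hypothetical 10-s floating-car speeds into 3-min intervals with NumPy; the sample count, speed range, and the 3-min window are assumptions for illustration, not values from the paper:

```python
import numpy as np

# Hypothetical example: one day of 10-s average speeds for a single
# road section (8640 samples), aggregated into 3-min intervals.
np.random.seed(0)
speeds_10s = np.random.uniform(20.0, 60.0, size=8640)  # km/h

samples_per_interval = 18  # 3 min / 10 s = 18 samples per interval
# Reshape to (num_intervals, samples_per_interval) and average each row.
speeds_3min = speeds_10s.reshape(-1, samples_per_interval).mean(axis=1)

print(speeds_3min.shape)  # (480,) -> 480 three-minute intervals per day
```

Each row mean is one wider interval, so the 8640 narrow samples collapse into 480 informative ones.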
In the space dimension, the selected trajectory is viewed as a sequence of dots with inner states, including vehicle position, average speed, etc. This sequence of dots could simply be ordered and fitted linearly onto the $y$-axis, but that may lead to high dimensionality and uninformative regions, because the sequence of dots is redundant and a large number of regions in the sequence are stable and lack variety. Therefore, to make the $y$-axis both compact and informative, the dots are grouped into sections, each representing a similar traffic state. The sections are then ordered spatially with reference to a predefined start point of a road, and fitted onto the $y$-axis.
Finally, a time-space matrix can be constructed using the time and space dimension information. Mathematically, we denote the time-space matrix by:
$$M=\begin{bmatrix}m_{11}&m_{12}&\cdots&m_{1N}\\m_{21}&m_{22}&\cdots&m_{2N}\\\vdots&\vdots&\ddots&\vdots\\m_{Q1}&m_{Q2}&\cdots&m_{QN}\end{bmatrix}$$
where $N$ is the number of time intervals and $Q$ is the number of road sections; the $i$th column vector of $M$ is the traffic speed of the transportation network at time $i$; and pixel $m_{ij}$ is the average traffic speed on section $i$ at time $j$. Matrix $M$ forms a channel of the image. Figure 1 illustrates the relations among the raw averaged floating-car speeds, the time-space matrix, and the final image.
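The construction of the time-space matrix $M$ can be sketched as follows, averaging hypothetical (section, interval, speed) records into a $Q \times N$ array; the record values and the zero-fill for empty cells are illustrative assumptions:

```python
import numpy as np

# Hypothetical GPS-derived records: (section index, time interval, speed km/h).
records = [
    (0, 0, 45.0), (0, 0, 47.0),   # two observations falling in the same cell
    (0, 1, 30.0),
    (1, 0, 55.0), (1, 1, 52.0),
]

Q, N = 2, 3  # number of road sections, number of time intervals

speed_sum = np.zeros((Q, N))
count = np.zeros((Q, N))
for section, interval, speed in records:
    speed_sum[section, interval] += speed
    count[section, interval] += 1

# Average speed per cell; cells with no observation stay 0 here
# (in practice they could be filled by interpolation).
M = np.divide(speed_sum, count, out=np.zeros_like(speed_sum), where=count > 0)

print(M[0, 0])  # 46.0 = (45 + 47) / 2
```

Each cell $m_{ij}$ is the mean of all observations on section $i$ during interval $j$, which is exactly the pixel definition above.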
Figure 1. An illustration of the traffic-to-image conversion on a network.
2.2. CNN for Network Traffic Prediction
2.2.1. CNN Characteristics
The CNN has exhibited a significant learning ability in image understanding because of its unique method of extracting critical features from images. Compared with other deep learning architectures, two salient characteristics contribute to the uniqueness of the CNN: (a) locally-connected layers, which means that output neurons in a layer are connected only to their nearby input neurons, rather than to all input neurons as in fully-connected layers. These layers can extract features from an image effectively, because every layer attempts to retrieve a different feature regarding the prediction problem [31]; and (b) a pooling mechanism, which largely reduces the number of parameters required to train the CNN while guaranteeing that the most important features are preserved.
Sharing these two salient characteristics, the CNN is modified in the following aspects to adapt to the context of transportation. First, the model inputs are different: the input images have only one channel, valued by the traffic speeds of all roads in a transportation network, and the pixel values range from zero to the maximum traffic speed or the speed limit of the network. In contrast, in the image classification problem, the input images commonly have three channels (RGB), and pixel values range from 0 to 255. Although these differences exist, the model inputs are normalized to prevent large input values from inflating the model weights and increasing the training difficulty. Second, the model outputs are different. In the context of transportation, the model outputs are the predicted traffic speeds on all road sections of a transportation network, whereas, in the image classification problem, the model outputs are image class labels. Third, the abstract features have different meanings. In the context of transportation, the abstract features extracted by the convolutional and pooling layers are the relations among road sections regarding traffic speeds. In the image classification problem, the abstract features can be shallow image edges and deep shapes of objects, in terms of the training objective. All of these abstract features are significant for a prediction problem [36]. Fourth, the training objectives differ because of the distinct model outputs. In the context of transportation, because the outputs are continuous traffic speeds, a continuous cost function should be adopted accordingly. In the image classification problem, cross-entropy cost functions are usually used.
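The single-channel input normalization described above might look like the following sketch, where an assumed speed limit plays the role that the value 255 plays for RGB images:

```python
import numpy as np

# Hypothetical one-channel speed image: values in [0, v_max] km/h.
v_max = 120.0  # assumed network speed limit (illustrative)
image = np.array([[30.0, 60.0],
                  [90.0, 120.0]])

# Scale pixel values into [0, 1], analogous to dividing RGB pixels by 255.
normalized = image / v_max

print(normalized)
```

After scaling, all inputs live on a common range, so the magnitudes of the learned weights are not driven up by raw speed values.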
2.2.2. CNN Architecture
Figure 2 shows the structure of the CNN in the context of transportation, with four main parts: model input, traffic feature extraction, prediction, and model output. Each of the parts is explained below.
First, the model input is the image generated from a transportation network with spatiotemporal characteristics. Let the lengths of the input and output time intervals be $F$ and $P$, respectively. The model input can be written as:
$$x^i=[m_i,m_{i+1},\dots,m_{i+P-1}],\quad i\in[1,N-P-F+1]$$
where $i$ is the sample index, $N$ is the number of time intervals, and $m_i$ is a column vector representing the traffic speeds of all roads in the transportation network within one time unit.
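A minimal sketch of building the samples $x^i$ from the columns of $M$, following the indexing in the formula above with 0-based indices; the sizes of $Q$, $N$, $F$, and $P$ are arbitrary toy values:

```python
import numpy as np

Q, N = 4, 12          # road sections, time intervals (toy sizes)
F, P = 3, 2           # input/output interval lengths (assumed values)
rng = np.random.default_rng(1)
M = rng.uniform(0.0, 100.0, size=(Q, N))   # time-space speed matrix

# x^i = [m_i, ..., m_{i+P-1}] for i in [1, N - P - F + 1] in the text's
# 1-based notation; each m_i is a column of M (0-based below).
num_samples = N - P - F + 1
samples = [M[:, i:i + P] for i in range(num_samples)]

print(len(samples), samples[0].shape)  # 8 windows of shape (4, 2)
```

Each window is a small time-space image slice, which is what the convolutional layers consume.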
Second, the extraction of traffic features is the combination of convolutional and pooling layers, and is the core part of the CNN model. The pooling procedure is denoted by $pool$, and $L$ denotes the depth of the CNN. Denote the input, output, and parameters of the $l$th layer by $x_l^j$, $o_l^j$, and $(W_l^j,b_l^j)$, respectively, where $j$ is the channel index, accounting for the multiple convolutional filters in a convolutional layer. The number of convolutional filters in the $l$th layer is denoted by $c_l$. The output of the first convolutional and pooling layer can be written as:
$$o_1^j=pool\big(\sigma(W_1^j x_1^j+b_1^j)\big),\quad j\in[1,c_1]$$
where $\sigma$ is the activation function, which will be discussed in the next section. The output of the $l$th $(l\neq 1,\ l\in[1,L])$ convolutional and pooling layer can be written as:
$$o_l^j=pool\Big(\sigma\Big(\sum_{k=1}^{c_{l-1}}\big(W_l^j x_l^k+b_l^j\big)\Big)\Big),\quad j\in[1,c_l]$$
The extraction of traffic features has the following characteristics: (a) convolution and pooling are processed in two dimensions, so this part can learn the spatiotemporal relations of the road sections with respect to the prediction task during model training; (b) unlike the layers with only four convolutional or pooling filters shown in Figure 2, in real applications the number of filters per layer is set to hundreds, which means hundreds of features can be learned by a CNN; and (c) the CNN transforms the model input into deep features through these layers.
In the model prediction, the features learned and output by the traffic feature extraction are concatenated into a dense vector that contains the final and most high-level features of the input transportation network. The dense vector can be written as:
$$o_L^{flatten}=flatten\big([o_L^1,o_L^2,\dots,o_L^j]\big),\quad j=c_L$$
where $L$ is the depth of the CNN and $flatten$ is the concatenation procedure discussed above.
Finally, the vector is transformed into the model output through a fully-connected layer. The model output can thus be written as:
$$\begin{aligned}\hat{y}&=W_f o_L^{flatten}+b_f\\ &=W_f\,flatten\Big(pool\Big(\sigma\Big(\sum_{k=1}^{c_{L-1}}\big(W_L^j x_L^k+b_L^j\big)\Big)\Big)\Big)+b_f\end{aligned}$$
where $W_f$ and $b_f$ are the parameters of the fully-connected layer, and $\hat{y}$ is the vector of predicted network-wide traffic speeds.
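The flatten-and-predict step can be sketched in NumPy as follows; the channel count, feature-map size, and output dimension are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical last-layer feature maps: c_L = 3 channels of shape (2, 2).
feature_maps = [rng.standard_normal((2, 2)) for _ in range(3)]

# flatten: concatenate all channels into one dense vector (length 12).
o_flat = np.concatenate([f.ravel() for f in feature_maps])

# Fully-connected layer mapping the dense vector to Q = 5 road-section speeds.
Q = 5
W_f = rng.standard_normal((Q, o_flat.size))
b_f = rng.standard_normal(Q)
y_hat = W_f @ o_flat + b_f    # predicted network-wide speeds

print(y_hat.shape)  # (5,)
```

The affine map `W_f @ o_flat + b_f` is exactly the $\hat{y}=W_f o_L^{flatten}+b_f$ expression above.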
Figure 2. Deep learning architecture of CNN in the context of transportation.
2.2.3. Convolutional Layers and Pooling Layers of the CNN
Before discussing the explicit layers, it should be noted that each layer is activated by an activation function. The benefits of employing the activation function are as follows: (a) the activation function transforms the output to a manageable and scaled data range, which is beneficial to model training; and (b) the composition of activation functions through layers can mimic very complex nonlinear functions, making the CNN powerful enough to handle the complexity of a transportation network. In this study, the ReLU function is applied, defined as follows:
$$g_1(x)=\begin{cases}x,&\text{if } x>0\\0,&\text{otherwise}\end{cases}$$
Convolutional layers differ from a traditional feedforward neural network, in which each input neuron is connected to each output neuron and the network is fully connected (a fully-connected layer). The CNN applies convolutional filters over its input layer and obtains local connections, in which only local input neurons are connected to an output neuron (a convolutional layer). Hundreds of filters are sometimes applied to the input, and the results are merged in each layer. One filter can extract one traffic feature from the input layer; thus, hundreds of filters can extract hundreds of traffic features. The extracted traffic features are further combined to extract higher-level and more abstract traffic features. This process reflects the compositionality of the CNN, meaning that each filter composes a local patch of lower-level features into higher-level features. When one convolutional filter $W_l^r$ is applied to the input, the output can be formulated as:
$$y_{conv}=\sum_{e=1}^{m}\sum_{f=1}^{n}\big(W_l^r\big)_{ef}\,d_{ef}$$
where $m$ and $n$ are the two dimensions of the filter, $d_{ef}$ is the value of the input matrix at position $(e,f)$, $(W_l^r)_{ef}$ is the coefficient of the convolutional filter at position $(e,f)$, and $y_{conv}$ is the output.
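The $y_{conv}$ formula, applied at every valid filter position, can be sketched as follows; the input matrix and filter values are hypothetical:

```python
import numpy as np

def conv2d_valid(d, W):
    """Apply one convolutional filter W over input d at all 'valid' positions.

    Each output element is sum_{e,f} W[e, f] * patch[e, f], i.e. the
    y_conv formula evaluated on one local patch of the input.
    """
    m, n = W.shape
    H, K = d.shape
    out = np.empty((H - m + 1, K - n + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(W * d[r:r + m, c:c + n])
    return out

# Toy 4x4 input and a 2x2 filter (hypothetical values).
d = np.arange(16.0).reshape(4, 4)
W = np.array([[1.0, 0.0],
              [0.0, -1.0]])

y_conv = conv2d_valid(d, W)
print(y_conv.shape)  # (3, 3)
```

Because the filter only sees an $m \times n$ patch at a time, each output neuron is connected to local input neurons only, which is the local-connectivity property discussed above.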
Pooling layers are designed to downsample and aggregate data, because they extract only the salient numbers from a specific region. The pooling layers guarantee that the CNN is locally invariant, which means that the CNN can always extract the same feature from the input, regardless of shifts, rotations, or scales of that feature [36]. Based on the above facts, the pooling layers not only reduce the network scale of the CNN, but also identify the most prominent features of the input layers. Taking the maximum operation as an example, the pooling layer can be formulated as:
$$y_{pool}=\max(d_{ef}),\quad e\in[1,\dots,p],\ f\in[1,\dots,q]$$
where $p$ and $q$ are the two dimensions of the pooling window, $d_{ef}$ is the value of the input matrix at position $(e,f)$, and $y_{pool}$ is the pooling output.
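A sketch of non-overlapping max pooling implementing the $y_{pool}$ formula; the $4\times4$ input and $2\times2$ window are toy assumptions:

```python
import numpy as np

def max_pool(d, p, q):
    """Non-overlapping p x q max pooling (input dims assumed divisible)."""
    H, K = d.shape
    # Split into (H//p, p, K//q, q) blocks and take each block's maximum.
    blocks = d.reshape(H // p, p, K // q, q)
    return blocks.max(axis=(1, 3))

d = np.array([[1.0, 3.0, 2.0, 0.0],
              [4.0, 2.0, 1.0, 5.0],
              [0.0, 1.0, 7.0, 2.0],
              [3.0, 2.0, 4.0, 6.0]])

y_pool = max_pool(d, 2, 2)
print(y_pool)   # [[4. 5.] [3. 7.]]
```

Only the largest value in each window survives, which halves each dimension here while keeping the most prominent responses.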
2.2.4. CNN Optimization
The predictions of the CNN are traffic speeds on different road sections, and the mean squared error (MSE) is employed to measure the distance between the predictions and the ground-truth traffic speeds. Thus, minimizing the MSE is taken as the training objective of the CNN. The MSE can be written as:
$$MSE=\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i-y_i)^2$$
Let the model parameters be $\Theta=(W_l^j,b_l^j,W_f,b_f)$; the optimal values of $\Theta$ can be determined with the standard backpropagation algorithm, as in other studies on CNNs [31], [36]:
$$\begin{aligned}\Theta&=\arg\min_{\Theta}\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i-y_i)^2\\ &=\arg\min_{\Theta}\frac{1}{N}\big\lVert W_f o_L^{flatten}+b_f-y\big\rVert^2\\ &=\arg\min_{\Theta}\frac{1}{N}\Big\lVert W_f\,flatten\Big(pool\Big(\sigma\Big(\sum_{k=1}^{c_{L-1}}\big(W_L^j x_L^k+b_L^j\big)\Big)\Big)\Big)+b_f-y\Big\rVert^2\end{aligned}$$
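To make the objective concrete, the sketch below minimizes the MSE by plain gradient descent on the fully-connected parameters alone, using a noise-free linear toy problem rather than full backpropagation through convolutional layers; all sizes and the learning rate are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: N samples of flattened features (dim 4) and target speeds (dim 2).
N, D, Q = 64, 4, 2
X = rng.standard_normal((N, D))
W_true = rng.standard_normal((Q, D))
Y = X @ W_true.T                      # ground-truth speeds (no noise)

W_f = np.zeros((Q, D))
b_f = np.zeros(Q)
lr = 0.1

def mse(pred, target):
    return np.mean(np.sum((pred - target) ** 2, axis=1))

loss_before = mse(X @ W_f.T + b_f, Y)
for _ in range(200):                  # plain gradient descent on the MSE
    pred = X @ W_f.T + b_f
    err = pred - Y                    # (N, Q) residuals
    W_f -= lr * (2.0 / N) * err.T @ X
    b_f -= lr * (2.0 / N) * err.sum(axis=0)
loss_after = mse(X @ W_f.T + b_f, Y)

print(loss_before > loss_after)  # True: training reduces the MSE
```

In the full CNN, backpropagation extends exactly these gradient updates through the pooling, activation, and convolution operations that produce $o_L^{flatten}$.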
References
31. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012.
36. LeCun, Y.; Bengio, Y. Convolutional networks for images, speech, and time series. In The Handbook of Brain Theory and Neural Networks; MIT Press: Cambridge, MA, USA, 1998; Volume 3361, pp. 255–258.