LSTM: A Translation and Commentary on "Long Short-Term Memory"
Contents
Long Short-Term Memory
Abstract
1 INTRODUCTION
2 PREVIOUS WORK
3 CONSTANT ERROR BACKPROP
3.1 EXPONENTIALLY DECAYING ERROR
3.2 CONSTANT ERROR FLOW: NAIVE APPROACH
4 LONG SHORT-TERM MEMORY
5 EXPERIMENTS
5.1 EXPERIMENT 1: EMBEDDED REBER GRAMMAR
Long Short-Term Memory
Original paper
Link 01: https://arxiv.org/pdf/1506.04214.pdf
Link 02: https://www.bioinf.jku.at/publications/older/2604.pdf
Abstract
| Learning to store information over extended time intervals via recurrent backpropagation takes a very long time, mostly due to insufficient, decaying error backflow. We briefly review Hochreiter's 1991 analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called "Long Short-Term Memory" (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through "constant error carrousels" within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with RTRL, BPTT, Recurrent Cascade-Correlation, Elman nets, and Neural Sequence Chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long time lag tasks that have never been solved by previous recurrent network algorithms. | 通過循環網絡的反向傳播學習在較長時間間隔內存儲信息需要很長的時間,這主要是由于不足的、不斷衰減的誤差回流造成的。我們簡要回顧了Hochreiter在1991年對這個問題的分析,然後介紹了一種新穎、高效、基於梯度的方法,稱為「長短期記憶」(LSTM)。通過在不造成損害的位置截斷梯度,LSTM可以借助特殊單元內的「恒定誤差輪轉」強制執行恒定誤差流,從而學習跨越超過1000個離散時間步長的最小時間滯後。乘性門單元學習打開和關閉對恒定誤差流的訪問。LSTM在空間和時間上都是局部的;其每時間步長、每權值的計算複雜度為O(1)。我們在人工數據上的實驗涉及局部的、分布式的、實值的和有噪聲的模式表示。在與RTRL、BPTT、循環級聯相關、Elman網絡和神經序列分塊的比較中,LSTM帶來了多得多的成功運行,並且學習速度快得多。LSTM還解決了以前的循環網絡算法從未解決過的復雜的人工長時滯任務。 |
?
1 INTRODUCTION
| Recurrent networks can in principle use their feedback connections to store representations of recent input events in form of activations ("short-term memory", as opposed to "long-term memory" embodied by slowly changing weights). This is potentially significant for many applications, including speech processing, non-Markovian control, and music composition (e.g., Mozer 1992). The most widely used algorithms for learning what to put in short-term memory, however, take too much time or do not work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, existing methods do not provide clear practical advantages over, say, backprop in feedforward nets with limited time windows. This paper will review an analysis of the problem and suggest a remedy. | 循環網絡原則上可以使用它們的反饋連接,以激活的形式存儲最近輸入事件的表示(「短期記憶」,與由緩慢變化的權重所體現的「長期記憶」相對)。這對許多應用都有潛在的重要意義,包括語音處理、非馬爾可夫控制和音樂作曲(例如Mozer 1992)。然而,最廣泛使用的、用於學習短期記憶中應存入什麼內容的算法,要么花費太多時間,要么根本不能很好地工作,尤其是在輸入和相應教師信號之間的最小時滯很長時。雖然理論上很吸引人,但現有的方法並沒有提供明顯的實際優勢,例如相對於具有有限時間窗口的前饋網絡中的反向傳播。本文將回顧對這一問題的分析,並提出補救辦法。 |
| The problem. With conventional "Back-Propagation Through Time" (BPTT, e.g., Williams and Zipser 1992, Werbos 1988) or "Real-Time Recurrent Learning" (RTRL, e.g., Robinson and Fallside 1987), error signals "flowing backwards in time" tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error exponentially depends on the size of the weights (Hochreiter 1991). Case (1) may lead to oscillating weights, while in case (2) learning to bridge long time lags takes a prohibitive amount of time, or does not work at all (see Section 3). | 問題。使用傳統的「通過時間的反向傳播」(BPTT,例如Williams and Zipser 1992、Werbos 1988)或「實時循環學習」(RTRL,例如Robinson and Fallside 1987)時,「在時間上向後流動」的誤差信號往往要么(1)爆炸,要么(2)消失:反向傳播誤差的時間演化指數地依賴於權值的大小(Hochreiter 1991)。情形(1)可能導致權值振蕩,而在情形(2)中,學習橋接長時滯需要多得無法接受的時間,或者根本不起作用(參見第3節)。 |
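This exponential dependence can be sketched numerically. The snippet below is an illustration added by this commentary, not code from the paper; it tracks how an error signal is scaled over q backward steps through a single self-recurrent sigmoid unit with weight w and activation derivative f':

```python
# Illustrative sketch (not from the paper): error scaling over q backward
# steps through one self-recurrent unit. Each step multiplies the error
# by f'(net) * w, so the total factor is (f'(net) * w) ** q.

def backprop_scaling(w, q, fprime=0.25):
    """Error scaling factor after q backward steps.

    fprime=0.25 is the maximum derivative of the logistic sigmoid,
    attained at net input 0 (an assumed, best-case value here)."""
    return (fprime * w) ** q

# |f'(net) * w| < 1.0  ->  the error vanishes exponentially
vanishing = backprop_scaling(w=3.0, q=100)   # 0.75**100, astronomically small
# |f'(net) * w| > 1.0  ->  the error blows up exponentially
exploding = backprop_scaling(w=5.0, q=100)   # 1.25**100, astronomically large
```

Either way, after 100 steps the backpropagated signal is useless for learning, which is exactly the dilemma the paper sets out to resolve.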
| The remedy. This paper presents "Long Short-Term Memory" (LSTM), a novel recurrent network architecture in conjunction with an appropriate gradient-based learning algorithm. LSTM is designed to overcome these error backflow problems. It can learn to bridge time intervals in excess of 1000 steps even in case of noisy, incompressible input sequences, without loss of short time lag capabilities. This is achieved by an efficient, gradient-based algorithm for an architecture enforcing constant (thus neither exploding nor vanishing) error flow through internal states of special units (provided the gradient computation is truncated at certain architecture-specific points; this does not affect long-term error flow though). | 補救措施。本文提出了「長短期記憶」(LSTM),一種新穎的循環網絡結構,並配以適當的基於梯度的學習算法。LSTM正是為克服這些誤差回流問題而設計的。即使在有噪聲、不可壓縮的輸入序列的情況下,它也可以學習橋接超過1000步的時間間隔,且不損失短時滯能力。這是通過一種高效的、基於梯度的算法實現的,該算法針對的結構通過特殊單元的內部狀態強制執行恒定(因此既不爆炸也不消失)的誤差流(前提是梯度計算在某些與結構相關的特定位置被截斷;不過這並不影響長期誤差流)。 |
| Outline of paper. Section 2 will briefly review previous work. Section 3 begins with an outline of the detailed analysis of vanishing errors due to Hochreiter (1991). It will then introduce a naive approach to constant error backprop for didactic purposes, and highlight its problems concerning information storage and retrieval. These problems will lead to the LSTM architecture as described in Section 4. Section 5 will present numerous experiments and comparisons with competing methods. LSTM outperforms them, and also learns to solve complex, artificial tasks no other recurrent net algorithm has solved. Section 6 will discuss LSTM's limitations and advantages. The appendix contains a detailed description of the algorithm (A.1), and explicit error flow formulae (A.2). | 本文大綱。第2節將簡要回顧以前的工作。第3節以Hochreiter(1991)對消失誤差的詳細分析的概要開始,然後出於教學目的介紹一種實現恒定誤差反向傳播的樸素方法,並指出它在信息存儲和檢索方面的問題。這些問題將引出第4節所描述的LSTM結構。第5節將給出大量的實驗以及與競爭方法的比較。LSTM的表現優於它們,並且還學會了解決其他任何循環網絡算法都未曾解決的復雜人工任務。第6節將討論LSTM的局限性和優點。附錄包含算法的詳細描述(A.1)以及顯式的誤差流公式(A.2)。 |
2 PREVIOUS WORK
| This section will focus on recurrent nets with time-varying inputs (as opposed to nets with stationary inputs and fixpoint-based gradient calculations, e.g., Almeida 1987, Pineda 1987). | 本節將集中討論具有時變輸入的循環網絡(而不是具有固定輸入和基於不動點的梯度計算的網絡,例如Almeida 1987和Pineda 1987)。 |
| Gradient-descent variants. The approaches of Elman (1988), Fahlman (1991), Williams (1989), Schmidhuber (1992a), Pearlmutter (1989), and many of the related algorithms in Pearlmutter's comprehensive overview (1995) suffer from the same problems as BPTT and RTRL (see Sections 1 and 3). | 梯度下降法變體。Elman(1988)、Fahlman(1991)、Williams(1989)、Schmidhuber(1992a)、Pearlmutter(1989)的方法,以及Pearlmutter的綜合綜述(1995)中的許多相關算法,都存在與BPTT和RTRL相同的問題(見第1節和第3節)。 |
| Time-delays. Other methods that seem practical for short time lags only are Time-Delay Neural Networks (Lang et al. 1990) and Plate's method (Plate 1993), which updates unit activations based on a weighted sum of old activations (see also de Vries and Principe 1991). Lin et al. (1995) propose variants of time-delay networks called NARX networks. | 時間延遲。其他似乎只適用於短時滯的方法有時滯神經網絡(Lang et al. 1990)和Plate法(Plate 1993),後者基於舊激活的加權和更新單元激活(參見de Vries和Principe 1991)。Lin等人(1995)提出了時延網絡的變體NARX網絡。 |
| Time constants. To deal with long time lags, Mozer (1992) uses time constants influencing changes of unit activations (deVries and Principe's above-mentioned approach (1991) may in fact be viewed as a mixture of TDNN and time constants). For long time lags, however, the time constants need external fine tuning (Mozer 1992). Sun et al.'s alternative approach (1993) updates the activation of a recurrent unit by adding the old activation and the (scaled) current net input. The net input, however, tends to perturb the stored information, which makes long-term storage impractical. | 時間常數。為了處理長時滯,Mozer(1992)使用影響單元激活變化的時間常數(deVries和Principe上述的方法(1991)實際上可以看作是TDNN和時間常數的混合)。然而,對於長時滯,時間常數需要外部的精細調節(Mozer 1992)。Sun等人的替代方法(1993)通過將舊的激活與(縮放後的)當前淨輸入相加來更新循環單元的激活。然而,淨輸入往往會干擾所存儲的信息,這使得長期存儲變得不切實際。 |
| Ring's approach. Ring (1993) also proposed a method for bridging long time lags. Whenever a unit in his network receives conflicting error signals, he adds a higher order unit influencing appropriate connections. Although his approach can sometimes be extremely fast, to bridge a time lag involving 100 steps may require the addition of 100 units. Also, Ring's net does not generalize to unseen lag durations. | Ring的方法。Ring(1993)也提出了一種橋接長時滯的方法。每當他的網絡中的某個單元接收到相互衝突的誤差信號時,他就添加一個影響相應連接的更高階單元。雖然他的方法有時非常快,但要橋接涉及100步的時滯可能需要添加100個單元。此外,Ring的網絡不能推廣到未見過的時滯長度。 |
| Bengio et al.'s approaches. Bengio et al. (1994) investigate methods such as simulated annealing, multi-grid random search, time-weighted pseudo-Newton optimization, and discrete error propagation. Their "latch" and "2-sequence" problems are very similar to problem 3a with minimal time lag 100 (see Experiment 3). Bengio and Frasconi (1994) also propose an EM approach for propagating targets. With n so-called "state networks", at a given time, their system can be in one of only n different states. See also beginning of Section 5. But to solve continuous problems such as the "adding problem" (Section 5.4), their system would require an unacceptable number of states (i.e., state networks). | Bengio等人的方法。Bengio等人(1994)研究了模擬退火、多網格隨機搜索、時間加權偽牛頓優化和離散誤差傳播等方法。他們的「閂鎖」和「2-序列」問題與最小時滯為100的3a問題非常相似(見實驗3)。Bengio和Frasconi(1994)也提出了一種傳播目標的EM方法。對於n個所謂的「狀態網絡」,在給定時刻,他們的系統只能處於n種不同狀態中的一種。另見第5節的開頭。但是,為了解決諸如「加法問題」(第5.4節)之類的連續問題,他們的系統將需要不可接受數量的狀態(即狀態網絡)。 |
| Kalman filters. Puskorius and Feldkamp (1994) use Kalman filter techniques to improve recurrent net performance. Since they use "a derivative discount factor imposed to decay exponentially the effects of past dynamic derivatives," there is no reason to believe that their Kalman Filter Trained Recurrent Networks will be useful for very long minimal time lags. Second order nets. We will see that LSTM uses multiplicative units (MUs) to protect error flow from unwanted perturbations. It is not the first recurrent net method using MUs though. For instance, Watrous and Kuhn (1992) use MUs in second order nets. Some differences to LSTM are:
|
|
| Simple weight guessing. To avoid long time lag problems of gradient-based approaches we may simply randomly initialize all network weights until the resulting net happens to classify all training sequences correctly. In fact, recently we discovered (Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997) that simple weight guessing solves many of the problems in (Bengio 1994, Bengio and Frasconi 1994, Miller and Giles 1993, Lin et al. 1995) faster than the algorithms proposed therein. This does not mean that weight guessing is a good algorithm. It just means that the problems are very simple. More realistic tasks require either many free parameters (e.g., input weights) or high weight precision (e.g., for continuous-valued parameters), such that guessing becomes completely infeasible. | 簡單的權值猜測。為了避免基於梯度方法的長時滯問題,我們可以簡單地隨機初始化所有網絡權值,直到得到的網絡恰好正確地對所有訓練序列進行分類。事實上,最近我們發現(Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997),簡單的權值猜測比其中提出的算法更快地解決了(Bengio 1994, Bengio and Frasconi 1994, Miller and Giles 1993, Lin et al. 1995)中的許多問題。這並不意味著權值猜測是一個好算法,只是意味著這些問題非常簡單。更現實的任務要么需要許多自由參數(例如輸入權值),要么需要很高的權值精度(例如對於連續值參數),使得猜測變得完全不可行。 |
| Adaptive sequence chunkers. Schmidhuber's hierarchical chunker systems (1992b, 1993) do have a capability to bridge arbitrary time lags, but only if there is local predictability across the subsequences causing the time lags (see also Mozer 1992). For instance, in his postdoctoral thesis (1993), Schmidhuber uses hierarchical recurrent nets to rapidly solve certain grammar learning tasks involving minimal time lags in excess of 1000 steps. The performance of chunker systems, however, deteriorates as the noise level increases and the input sequences become less compressible. LSTM does not suffer from this problem. | 自適應序列分塊器。Schmidhuber的分層分塊系統(1992b, 1993)確實具有橋接任意時滯的能力,但前提是造成時滯的子序列之間存在局部可預測性(參見Mozer 1992)。例如,在他的博士後論文(1993)中,Schmidhuber使用分層循環網絡快速解決了某些最小時滯超過1000步的語法學習任務。然而,隨著噪聲水平的提高和輸入序列可壓縮性的降低,分塊系統的性能會下降。LSTM不存在這個問題。 |
3 CONSTANT ERROR BACKPROP 恒定誤差反向傳播
3.1 EXPONENTIALLY DECAYING ERROR 指數衰減誤差
| Conventional BPTT (e.g. Williams and Zipser 1992). Output unit k's target at time t is denoted by d_k(t). Using mean squared error, k's error signal is | 傳統的BPTT(例如Williams和Zipser 1992)。輸出單元k在t時刻的目標記為d_k(t)。使用均方誤差,k的誤差信號為 |
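The displayed equations did not survive extraction. A reconstruction consistent with the standard BPTT formulation the paragraph describes (ϑ denotes the backpropagated error signal, f' the activation function's derivative, y^k unit k's activation):

```latex
% output unit k's error signal under mean squared error:
\vartheta_k(t) = f'_k\bigl(\mathrm{net}_k(t)\bigr)\,\bigl(d_k(t) - y^k(t)\bigr)
% non-output unit j's error signal, backpropagated from time t+1:
\vartheta_j(t) = f'_j\bigl(\mathrm{net}_j(t)\bigr)\sum_i w_{ij}\,\vartheta_i(t+1)
```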
| The corresponding contribution to w_jl's total weight update is α ϑ_j(t) y^l(t-1), where α is the learning rate, and l stands for an arbitrary unit connected to unit j. Outline of Hochreiter's analysis (1991, pages 19-21). Suppose we have a fully connected net whose non-input unit indices range from 1 to n. Let us focus on local error flow from unit u to unit v (later we will see that the analysis immediately extends to global error flow). The error occurring at an arbitrary unit u at time step t is propagated "back into time" for q time steps, to an arbitrary unit v. This will scale the error by the following factor: | 對w_jl的總權值更新的相應貢獻是 α ϑ_j(t) y^l(t-1),其中α為學習率,l表示與單元j相連的任意單元。Hochreiter分析概要(1991年,第19-21頁)。假設我們有一個全連接網絡,其非輸入單元的下標範圍為1到n。讓我們關注從單元u到單元v的局部誤差流(稍後我們將看到該分析可立即擴展到全局誤差流)。在時間步t發生在任意單元u上的誤差被「傳播回時間中」q個時間步,到達任意單元v。這會將誤差縮放以下列因子: |
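The scaling factor itself was lost in extraction; a reconstruction of Hochreiter's recursion as the surrounding text describes it:

```latex
\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} =
\begin{cases}
f'_v\bigl(\mathrm{net}_v(t-1)\bigr)\, w_{uv} & q = 1,\\[4pt]
f'_v\bigl(\mathrm{net}_v(t-q)\bigr)\displaystyle\sum_{l=1}^{n}
  \frac{\partial \vartheta_l(t-q+1)}{\partial \vartheta_u(t)}\, w_{lv} & q > 1.
\end{cases}
```

Unrolling the recursion yields a sum over paths of length q, each path contributing a product of q factors of the form f'_{l_m}(net_{l_m}(t-m)) w_{l_m l_{m-1}}: if the magnitude of every factor is below 1.0 the error vanishes exponentially, and if above 1.0 it blows up.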
3.2 CONSTANT ERROR FLOW: NAIVE APPROACH 恒定誤差流:樸素方法
| A single unit. To avoid vanishing error signals, how can we achieve constant error flow through a single unit j with a single connection to itself? According to the rules above, at time t, j's local error backflow is ϑ_j(t) = f'_j(net_j(t)) ϑ_j(t+1) w_jj. To enforce constant error flow through j, we require f'_j(net_j(t)) w_jj = 1.0. | 單個單元。為了避免誤差信號消失,我們如何通過一個只與自身相連的單個單元j實現恒定的誤差流?根據上面的規則,在時刻t,j的局部誤差回流為 ϑ_j(t) = f'_j(net_j(t)) ϑ_j(t+1) w_jj。為了通過j強制實現恒定誤差流,我們要求 f'_j(net_j(t)) w_jj = 1.0。 |
| In the experiments, this will be ensured by using the identity function f_j: f_j(x) = x, ∀x, and by setting w_jj = 1.0. We refer to this as the constant error carrousel (CEC). CEC will be LSTM's central feature (see Section 4). Of course unit j will not only be connected to itself but also to other units. This invokes two obvious, related problems (also inherent in all other gradient-based approaches): | 在實驗中,這通過使用恒等函數 f_j: f_j(x) = x(對所有x成立)並設置 w_jj = 1.0 來保證。我們稱之為恒定誤差輪轉(CEC)。CEC將是LSTM的核心特性(參見第4節)。當然,單元j不僅與自身相連,還與其他單元相連。這引出了兩個明顯的、相關的問題(也是所有其他基於梯度的方法所固有的): |
| Input weight conflict: the same incoming weight has to be used both for storing certain inputs and for ignoring others, which produces conflicting weight update signals. | 輸入權值衝突:同一個輸入權值既要用於存儲某些輸入,又要用於忽略其他輸入,從而產生相互衝突的權值更新信號。 |
| Output weight conflict: the same outgoing weight has to be used both for retrieving the stored content at certain times and for preventing unit j from disturbing other units at other times, again producing conflicting update signals. | 輸出權值衝突:同一個輸出權值既要在某些時刻用於讀取所存儲的內容,又要在其他時刻防止單元j干擾其他單元,同樣會產生相互衝突的更新信號。 |
| Of course, input and output weight conflicts are not specific for long time lags, but occur for short time lags as well. Their effects, however, become particularly pronounced in the long time lag case: as the time lag increases, (1) stored information must be protected against perturbation for longer and longer periods, and, especially in advanced stages of learning, (2) more and more already correct outputs also require protection against perturbation. Due to the problems above the naive approach does not work well except in case of certain simple problems involving local input/output representations and non-repeating input patterns (see Hochreiter 1991 and Silva et al. 1996). The next section shows how to do it right. | 當然,輸入和輸出權值衝突並不是長時滯所特有的,短時滯也會出現。然而,它們的影響在長時滯情形中尤為明顯:隨著時滯的增加,(1)存儲的信息必須在越來越長的時期內防止擾動,並且,尤其是在學習的後期階段,(2)越來越多已經正確的輸出也需要防止擾動。由於上述問題,除了某些涉及局部輸入/輸出表示和非重復輸入模式的簡單問題外(見Hochreiter 1991和Silva et al. 1996),樸素方法並不能很好地工作。下一節將展示正確的做法。 |
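The constant error carrousel above can be checked numerically. This sketch (an illustration by this commentary, not the paper's code) compares error backflow through a self-recurrent unit with a sigmoid-like derivative against the identity activation with self-weight w_jj = 1.0:

```python
# Illustrative sketch (not from the paper): local error after flowing
# back q steps through one self-recurrent unit; each step multiplies
# the error by f'(net) * w, per Section 3.2.

def backflow(q, w, fprime):
    """Error signal magnitude after q backward steps (initial error 1.0)."""
    err = 1.0
    for _ in range(q):
        err *= fprime * w
    return err

# Sigmoid-like unit: f'(net) <= 0.25, so the error vanishes over 1000 steps.
sigmoid_case = backflow(q=1000, w=1.0, fprime=0.25)
# CEC: identity activation (f' = 1.0) and self-weight 1.0 keep it constant.
cec_case = backflow(q=1000, w=1.0, fprime=1.0)
```

The CEC case returns exactly 1.0 for any q, which is precisely the "constant error flow" the naive approach tries to enforce.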
4 LONG SHORT-TERM MEMORY
| Memory cells and gate units. To construct an architecture that allows for constant error flow through special, self-connected units without the disadvantages of the naive approach, we extend the constant error carrousel CEC embodied by the self-connected, linear unit j from Section 3.2 by introducing additional features. A multiplicative input gate unit is introduced to protect the memory contents stored in j from perturbation by irrelevant inputs. Likewise, a multiplicative output gate unit is introduced which protects other units from perturbation by currently irrelevant memory contents stored in j. | 記憶單元和門單元。為了構建一個允許誤差恒定地流過特殊的自連接單元、同時又不帶有樸素方法缺點的結構,我們通過引入額外的特性,擴展了3.2節中由自連接線性單元j所體現的恒定誤差輪轉CEC。引入一個乘性輸入門單元,以保護存儲在j中的記憶內容不受無關輸入的干擾。同樣,引入一個乘性輸出門單元,它保護其他單元不受當前無關的、存儲在j中的記憶內容的干擾。 |
| Figure 1: Architecture of memory cell c_j (the box) and its gate units in_j, out_j. The self-recurrent connection (with weight 1.0) indicates feedback with a delay of 1 time step. It builds the basis of the "constant error carrousel" CEC. The gate units open and close access to CEC. See text and appendix A.1 for details. | 圖1:記憶單元c_j(方框)及其門單元in_j、out_j的結構。自循環連接(權值為1.0)表示延遲1個時間步的反饋。它構成了「恒定誤差輪轉」CEC的基礎。門單元打開和關閉對CEC的訪問。詳情見正文和附錄A.1。 |
| Why gate units? To avoid input weight conflicts, in_j controls the error flow to memory cell c_j's input connections w_{c_j i}. To circumvent c_j's output weight conflicts, out_j controls the error flow from unit j's output connections. In other words, the net can use in_j to decide when to keep or override information in memory cell c_j, and out_j to decide when to access memory cell c_j and when to prevent other units from being perturbed by c_j (see Figure 1). | 為什麼需要門單元?為了避免輸入權值衝突,in_j控制流向記憶單元c_j的輸入連接w_{c_j i}的誤差流。為了規避c_j的輸出權值衝突,out_j控制來自單元j的輸出連接的誤差流。換句話說,網絡可以使用in_j來決定何時在記憶單元c_j中保留或覆蓋信息,使用out_j來決定何時訪問記憶單元c_j以及何時防止其他單元受到c_j的干擾(參見圖1)。 |
| Error signals trapped within a memory cell's CEC cannot change, but different error signals flowing into the cell (at different times) via its output gate may get superimposed. The output gate will have to learn which errors to trap in its CEC, by appropriately scaling them. The input gate will have to learn when to release errors, again by appropriately scaling them. Essentially, the multiplicative gate units open and close access to constant error flow through CEC. | 被困在記憶單元CEC中的誤差信號不會改變,但(在不同時刻)經由其輸出門流入該單元的不同誤差信號可能會疊加。輸出門必須通過適當地縮放誤差,學習將哪些誤差困在其CEC中。輸入門必須學習何時釋放誤差,同樣是通過適當地縮放它們。本質上,乘性門單元打開和關閉對流經CEC的恒定誤差流的訪問。 |
| Distributed output representations typically do require output gates. Not always are both gate types necessary, though; one may be sufficient. For instance, in Experiments 2a and 2b in Section 5, it will be possible to use input gates only. In fact, output gates are not required in case of local output encoding: preventing memory cells from perturbing already learned outputs can be done by simply setting the corresponding weights to zero. Even in this case, however, output gates can be beneficial: they prevent the net's attempts at storing long time lag memories (which are usually hard to learn) from perturbing activations representing easily learnable short time lag memories. (This will prove quite useful in Experiment 1, for instance.) | 分布式輸出表示通常確實需要輸出門。不過並非總是兩種門都必需;一種可能就足夠了。例如,在第5節的實驗2a和2b中,將可以只使用輸入門。事實上,在局部輸出編碼的情況下不需要輸出門:只要將相應的權值設為零,就可以防止記憶單元干擾已經學會的輸出。然而,即使在這種情況下,輸出門也可能是有益的:它們阻止網絡存儲長時滯記憶(通常很難學習)的嘗試去干擾那些代表容易學習的短時滯記憶的激活。(例如,這在實驗1中將被證明非常有用。) |
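To make the cell concrete, here is a minimal one-cell forward step in the spirit of Figure 1. This is an illustrative sketch by this commentary, with assumed scalar weights and tanh standing in for the paper's scaled squashing functions g and h; it is not the paper's exact formulation:

```python
# Illustrative sketch of a 1997-style LSTM memory cell (no forget gate):
# the internal state s is a pure accumulator with self-weight 1.0 (the CEC),
# and the input/output gates multiplicatively guard write and read access.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell_step(x, s_prev, w_in, w_out, w_c):
    """One forward step. x: scalar input; s_prev: previous internal state;
    w_in, w_out, w_c: assumed scalar weights of the gates and cell input."""
    y_in = sigmoid(w_in * x)       # input gate activation
    y_out = sigmoid(w_out * x)     # output gate activation
    g = math.tanh(w_c * x)         # squashed cell input
    s = s_prev + y_in * g          # CEC update: state carried with weight 1.0
    h = math.tanh(s)               # squashed internal state
    y_c = y_out * h                # gated cell output
    return s, y_c

# With the input gate driven shut (large negative net input), the stored
# state passes through the step essentially unchanged:
s, _ = lstm_cell_step(x=1.0, s_prev=0.5, w_in=-100.0, w_out=0.0, w_c=1.0)
```

The closed-gate call illustrates the write-protection role of in_j: the irrelevant input x cannot perturb the stored state.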
| Network topology. We use networks with one input layer, one hidden layer, and one output layer. The (fully) self-connected hidden layer contains memory cells and corresponding gate units (for convenience, we refer to both memory cells and gate units as being located in the hidden layer). The hidden layer may also contain "conventional" hidden units providing inputs to gate units and memory cells. All units (except for gate units) in all layers have directed connections (serve as inputs) to all units in the layer above (or to all higher layers; Experiments 2a and 2b). Memory cell blocks. S memory cells sharing the same input gate and the same output gate form a structure called a "memory cell block of size S". Memory cell blocks facilitate information storage; as with conventional neural nets, it is not so easy to code a distributed input within a single cell. Since each memory cell block has as many gate units as a single memory cell (namely two), the block architecture can be even slightly more efficient (see paragraph "computational complexity"). A memory cell block of size 1 is just a simple memory cell. In the experiments (Section 5), we will use memory cell blocks of various sizes. | 網絡拓撲結構。我們使用具有一個輸入層、一個隱藏層和一個輸出層的網絡。(完全)自連接的隱藏層包含記憶單元和相應的門單元(為方便起見,我們將記憶單元和門單元都視為位於隱藏層中)。隱藏層還可以包含為門單元和記憶單元提供輸入的「常規」隱藏單元。所有層中的所有單元(門單元除外)都有指向上一層所有單元(或所有更高層;實驗2a和2b)的有向連接(作為輸入)。 記憶單元塊。共享同一個輸入門和同一個輸出門的S個記憶單元構成一個稱為「大小為S的記憶單元塊」的結構。記憶單元塊有助於信息存儲;與傳統神經網絡一樣,在單個單元內編碼分布式輸入並不容易。由於每個記憶單元塊的門單元數量與單個記憶單元相同(即兩個),塊結構甚至可以稍微更高效一些(參見「計算複雜度」一段)。大小為1的記憶單元塊就是一個簡單的記憶單元。在實驗(第5節)中,我們將使用不同大小的記憶單元塊。 |
| Learning. We use a variant of RTRL (e.g., Robinson and Fallside 1987) which properly takes into account the changed, multiplicative dynamics caused by input and output gates. However, to ensure non-decaying error backprop through internal states of memory cells, as with truncated BPTT (e.g., Williams and Peng 1990), errors arriving at "memory cell net inputs" (for cell c_j, this includes net_{c_j}, net_{in_j}, net_{out_j}) do not get propagated back further in time (although they do serve to change the incoming weights). Only within memory cells are errors propagated back through previous internal states s_{c_j}. To visualize this: once an error signal arrives at a memory cell output, it gets scaled by the output gate activation and h'. Then it is within the memory cell's CEC, where it can flow back indefinitely without ever being scaled. Only when it leaves the memory cell through the input gate and g is it scaled once more, by the input gate activation and g'. It then serves to change the incoming weights before it is truncated (see appendix for explicit formulae). | 學習。我們使用RTRL的一個變體(例如Robinson and Fallside 1987),它恰當地考慮了由輸入門和輸出門引起的、改變了的乘性動力學。然而,為了確保誤差通過記憶單元的內部狀態進行不衰減的反向傳播,與截斷BPTT(例如Williams and Peng 1990)類似,到達「記憶單元淨輸入」(對於單元c_j,這包括net_{c_j}、net_{in_j}、net_{out_j})的誤差不會再向更早的時刻傳播(儘管它們確實用於改變輸入權值)。只有在記憶單元內部,誤差才會通過之前的內部狀態s_{c_j}向後傳播。為了直觀理解這一點:一旦一個誤差信號到達記憶單元的輸出,它就會被輸出門激活和h'縮放。隨後它位於記憶單元的CEC中,在那裡它可以無限地回流而不被縮放。只有當它通過輸入門和g離開記憶單元時,它才被輸入門激活和g'再次縮放。然後,它在被截斷之前用於改變傳入的權值(顯式公式見附錄)。 |
| Computational complexity. As with Mozer's focused recurrent backprop algorithm (Mozer 1989), only the derivatives ∂s_{c_j}/∂w_{il} need to be stored and updated. Hence the LSTM algorithm is very efficient, with an excellent update complexity of O(W), where W is the number of weights (see details in appendix A.1). Hence, LSTM and BPTT for fully recurrent nets have the same update complexity per time step (while RTRL's is much worse). Unlike full BPTT, however, LSTM is local in space and time: there is no need to store activation values observed during sequence processing in a stack with potentially unlimited size. | 計算複雜度。與Mozer的聚焦循環反向傳播算法(Mozer 1989)一樣,只需要存儲和更新導數∂s_{c_j}/∂w_{il}。因此LSTM算法非常高效,更新複雜度僅為O(W),其中W為權值的數量(詳見附錄A.1)。因此,對於全循環網絡,LSTM和BPTT每個時間步的更新複雜度相同(而RTRL的要差得多)。但是,與完整的BPTT不同,LSTM在空間和時間上是局部的:不需要將序列處理期間觀察到的激活值存儲在可能無限大的棧中。 |
| Abuse problem and solutions. In the beginning of the learning phase, error reduction may be possible without storing information over time. The network will thus tend to abuse memory cells, e.g., as bias cells (i.e., it might make their activations constant and use the outgoing connections as adaptive thresholds for other units). The potential difficulty is: it may take a long time to release abused memory cells and make them available for further learning. A similar "abuse problem" appears if two memory cells store the same (redundant) information. There are at least two solutions to the abuse problem: (1) Sequential network construction (e.g., Fahlman 1991): a memory cell and the corresponding gate units are added to the network whenever the error stops decreasing (see Experiment 2 in Section 5). (2) Output gate bias: each output gate gets a negative initial bias, to push initial memory cell activations towards zero. Memory cells with more negative bias automatically get "allocated" later (see Experiments 1, 3, 4, 5, 6 in Section 5). | 濫用問題及解決方法。在學習階段的開始,不必長時間存儲信息也可能減少誤差。因此,網絡會傾向於濫用記憶單元,例如把它們用作偏置單元(即,它可能使它們的激活保持恆定,並將輸出連接用作其他單元的自適應閾值)。潛在的困難是:釋放被濫用的記憶單元並使其可用於進一步學習,可能需要很長時間。如果兩個記憶單元存儲相同的(冗余的)信息,也會出現類似的「濫用問題」。濫用問題至少有兩種解決辦法:(1)順序網絡構建(例如Fahlman 1991):每當誤差停止下降時,就向網絡添加一個記憶單元和相應的門單元(見第5節的實驗2)。(2)輸出門偏置:每個輸出門獲得一個負的初始偏置,以將初始記憶單元激活推向零。偏置更負的記憶單元會自動在更晚的時候被「分配」(見第5節中的實驗1、3、4、5、6)。 |
| Internal state drift and remedies. If memory cell c_j's inputs are mostly positive or mostly negative, then its internal state s_j will tend to drift away over time. This is potentially dangerous, for h'(s_j) will then adopt very small values, and the gradient will vanish. One way to circumvent this problem is to choose an appropriate function h. But h(x) = x, for instance, has the disadvantage of unrestricted memory cell output range. Our simple but effective way of solving drift problems at the beginning of learning is to initially bias the input gate in_j towards zero. Although there is a trade-off between the magnitudes of h'(s_j) on the one hand and of y_inj and f'_inj on the other, the potential negative effect of input gate bias is negligible compared to the one of the drifting effect. With logistic sigmoid activation functions, there appears to be no need for fine-tuning the initial bias, as confirmed by Experiments 4 and 5 in Section 5.4. | 內部狀態漂移及補救措施。如果記憶單元c_j的輸入大多為正或大多為負,那麼其內部狀態s_j會隨著時間的推移而漂移。這是潛在的危險,因為h'(s_j)將取非常小的值,梯度將消失。規避這個問題的一種方法是選擇合適的函數h,但例如h(x) = x的缺點是不限制記憶單元的輸出範圍。我們在學習之初解決漂移問題的簡單而有效的方法,是使輸入門in_j的初始偏置趨向於零。雖然在h'(s_j)的量級與y_inj和f'_inj的量級之間存在權衡,但與漂移效應相比,輸入門偏置的潛在負面影響可以忽略不計。對於logistic sigmoid激活函數,似乎不需要對初始偏置進行精細調節,正如5.4節中的實驗4和實驗5所證實的那樣。 |
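The drift remedy can be sketched numerically. This illustration by this commentary (assumed constant inputs and weights, not the paper's setup) shows that with mostly-positive cell inputs and an open input gate the internal state grows without bound, while a negative initial input-gate bias suppresses the drift:

```python
# Illustrative sketch of internal state drift: the state s accumulates
# y_in * (cell input) each step; a negative input-gate bias keeps y_in,
# and hence the drift, near zero at the start of training.
import math

def drift(steps, gate_bias):
    """Internal state after `steps` updates with constant positive cell input."""
    s = 0.0
    y_in = 1.0 / (1.0 + math.exp(-gate_bias))  # input gate, constant net input
    for _ in range(steps):
        s += y_in * 0.5                        # mostly positive cell input
    return s

unbiased = drift(1000, gate_bias=0.0)    # gate half open: s drifts to 250
biased = drift(1000, gate_bias=-6.0)     # gate nearly shut: s stays near 1
```

With s drifted far from zero, h'(s) is tiny and the gradient through h vanishes, which is exactly the danger the paragraph describes.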
5 EXPERIMENTS 實驗
| Introduction. Which tasks are appropriate to demonstrate the quality of a novel long time lag algorithm? First of all, minimal time lags between relevant input signals and corresponding teacher signals must be long for all training sequences. In fact, many previous recurrent net algorithms sometimes manage to generalize from very short training sequences to very long test sequences. See, e.g., Pollack (1991). But a real long time lag problem does not have any short time lag exemplars in the training set. For instance, Elman's training procedure, BPTT, offline RTRL, online RTRL, etc., fail miserably on real long time lag problems. See, e.g., Hochreiter (1991) and Mozer (1992). A second important requirement is that the tasks should be complex enough such that they cannot be solved quickly by simple-minded strategies such as random weight guessing. | 介紹。哪些任務適合用來展示一個新的長時滯算法的質量?首先,對於所有訓練序列,相關輸入信號與相應教師信號之間的最小時滯必須很長。事實上,許多以前的循環網絡算法有時能夠從非常短的訓練序列推廣到非常長的測試序列,參見Pollack(1991)。但是一個真正的長時滯問題在訓練集中沒有任何短時滯的樣例。例如,Elman的訓練過程、BPTT、離線RTRL、在線RTRL等,在真正的長時滯問題上都徹底失敗。參見Hochreiter(1991)和Mozer(1992)。第二個重要的要求是,任務應該足夠複雜,以致不能被諸如隨機權值猜測之類的簡單策略快速解決。 |
| Guessing can outperform many long time lag algorithms. Recently we discovered (Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997) that many long time lag tasks used in previous work can be solved more quickly by simple random weight guessing than by the proposed algorithms. For instance, guessing solved a variant of Bengio and Frasconi's "parity problem" (1994) much faster than the seven methods tested by Bengio et al. (1994) and Bengio and Frasconi (1994). Similarly for some of Miller and Giles' problems (1993). Of course, this does not mean that guessing is a good algorithm. It just means that some previously used problems are not extremely appropriate to demonstrate the quality of previously proposed algorithms. | 猜測可以勝過許多長時滯算法。最近我們發現(Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997),以前工作中使用的許多長時滯任務,用簡單的隨機權值猜測比用所提出的算法能更快地解決。例如,猜測解決Bengio和Frasconi的「奇偶校驗問題」(1994)的一個變體,比Bengio等人(1994)和Bengio和Frasconi(1994)測試的七種方法快得多。Miller和Giles(1993)的一些問題也類似。當然,這並不意味著猜測是一個好算法,而只是意味著以前使用的一些問題並不非常適合用來展示以前所提出算法的質量。 |
| What's common to Experiments 1-6. All our experiments (except for Experiment 1) involve long minimal time lags; there are no short time lag training exemplars facilitating learning. Solutions to most of our tasks are sparse in weight space. They require either many parameters/inputs or high weight precision, such that random weight guessing becomes infeasible. | 實驗1-6的共同點。我們所有的實驗(除了實驗1)都涉及長的最小時滯;沒有促進學習的短時滯訓練樣例。我們大多數任務的解在權值空間中是稀疏的。它們要么需要許多參數/輸入,要么需要很高的權值精度,使得隨機權值猜測變得不可行。 |
| We always use on-line learning (as opposed to batch learning), and logistic sigmoids as activation functions. For Experiments 1 and 2, initial weights are chosen in the range [-0.2, 0.2], for the other experiments in [-0.1, 0.1]. Training sequences are generated randomly according to the various task descriptions. In slight deviation from the notation in Appendix A1, each discrete time step of each input sequence involves three processing steps:
| 我們總是使用在線學習(而不是批量學習),並使用logistic sigmoid作為激活函數。實驗1和實驗2的初始權值在[-0.2, 0.2]範圍內選取,其他實驗在[-0.1, 0.1]範圍內選取。訓練序列根據各任務描述隨機生成。與附錄A1中的記號略有偏差,每個輸入序列的每個離散時間步都涉及三個處理步驟:
|
| For comparisons with recurrent nets taught by gradient descent, we give results only for RTRL, except for comparison 2a, which also includes BPTT. Note, however, that untruncated BPTT (see, e.g., Williams and Peng 1990) computes exactly the same gradient as offline RTRL. With long time lag problems, offline RTRL (or BPTT) and the online version of RTRL (no activation resets, online weight changes) lead to almost identical, negative results (as confirmed by additional simulations in Hochreiter 1991; see also Mozer 1992). This is because offline RTRL, online RTRL, and full BPTT all suffer badly from exponential error decay. | 在與用梯度下降法訓練的循環網絡的比較中,我們只給出RTRL的結果,比較2a除外,其中也包括了BPTT。但請注意,未截斷的BPTT(參見Williams和Peng 1990)計算的梯度與離線RTRL完全相同。對於長時滯問題,離線RTRL(或BPTT)和在線版本的RTRL(不重置激活,在線改變權值)導致幾乎相同的負面結果(正如Hochreiter 1991中的額外模擬所證實的;另見Mozer 1992)。這是因為離線RTRL、在線RTRL和完整BPTT都嚴重受到指數誤差衰減的影響。 |
| Our LSTM architectures are selected quite arbitrarily. If nothing is known about the complexity of a given problem, a more systematic approach would be: start with a very small net consisting of one memory cell. If this does not work, try two cells, etc. Alternatively, use sequential network construction (e.g., Fahlman 1991). | 我們的LSTM結構是相當任意地選擇的。如果對給定問題的複雜性一無所知,那麼更系統的方法是:從一個由單個記憶單元組成的非常小的網絡開始。如果這不起作用,嘗試兩個單元,依此類推。或者,使用順序網絡構建(例如Fahlman 1991)。 |
Outline of experiments 實驗大綱
Subsection 5.7 will provide a detailed summary of experimental conditions in two tables for reference.
|
?
5.1 EXPERIMENT 1: EMBEDDED REBER GRAMMAR 實驗1:嵌入式Reber語法
| Task. Our first task is to learn the "embedded Reber grammar", e.g. Smith and Zipser (1989), Cleeremans et al. (1989), and Fahlman (1991). Since it allows for training sequences with short time lags (of only 9 steps), it is not a long time lag problem. We include it for two reasons: | 任務。我們的第一個任務是學習「嵌入式Reber語法」,參見Smith and Zipser(1989)、Cleeremans等人(1989)和Fahlman(1991)。由於它允許具有短時滯(只有9步)的訓練序列,它不是一個長時滯問題。我們引入它有兩個原因: |
|