
Paper: Translation and Interpretation of "Generating Sequences With Recurrent Neural Networks"


Contents

Generating Sequences With Recurrent Neural Networks

Abstract

1 Introduction

2 Prediction Network

2.1 Long Short-Term Memory

3 Text Prediction

3.1 Penn Treebank Experiments

3.2 Wikipedia Experiments

4 Handwriting Prediction

4.1 Mixture Density Outputs

4.2 Experiments

4.3 Samples

5 Handwriting Synthesis

5.1 Synthesis Network

5.2 Experiments

5.3 Unbiased Sampling

5.4 Biased Sampling

5.5 Primed Sampling

6 Conclusions and Future Work

Acknowledgements

References


Generating Sequences With Recurrent Neural Networks

Original paper: Generating Sequences With Recurrent Neural Networks

Author: Alex Graves, Department of Computer Science, University of Toronto (graves@cs.toronto.edu)

Abstract

This paper shows how Long Short-term Memory recurrent neural networks can be used to generate complex sequences with long-range structure, simply by predicting one data point at a time. The approach is demonstrated for text (where the data are discrete) and online handwriting (where the data are real-valued). It is then extended to handwriting synthesis by allowing the network to condition its predictions on a text sequence. The resulting system is able to generate highly realistic cursive handwriting in a wide variety of styles.


1 Introduction

Recurrent neural networks (RNNs) are a rich class of dynamic models that have been used to generate sequences in domains as diverse as music [6, 4], text [30] and motion capture data [29]. RNNs can be trained for sequence generation by processing real data sequences one step at a time and predicting what comes next. Assuming the predictions are probabilistic, novel sequences can be generated from a trained network by iteratively sampling from the network’s output distribution, then feeding in the sample as input at the next step. In other words, by making the network treat its inventions as if they were real, much like a person dreaming. Although the network itself is deterministic, the stochasticity injected by picking samples induces a distribution over sequences. This distribution is conditional, since the internal state of the network, and hence its predictive distribution, depends on the previous inputs.

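To make the generation loop concrete, here is a minimal sketch in PyTorch (not from the paper; the `CharRNN` module, its sizes, and the categorical output are illustrative assumptions) of sampling from a trained next-step prediction network and feeding each sample back in as the next input:

```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    """Hypothetical next-step prediction model: an LSTM with a softmax output."""
    def __init__(self, num_classes: int, hidden_size: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(num_classes, hidden_size, batch_first=True)
        self.proj = nn.Linear(hidden_size, num_classes)

    def forward(self, x, state=None):
        h, state = self.lstm(x, state)
        return self.proj(h), state  # logits parameterise Pr(x_{t+1} | y_t)

def generate(model: CharRNN, num_classes: int, steps: int) -> list:
    """Iteratively sample from the output distribution, then feed the sample
    back in at the next step, so the network treats its inventions as real."""
    x = torch.zeros(1, 1, num_classes)  # null start vector (all entries zero)
    state, outputs = None, []
    for _ in range(steps):
        logits, state = model(x, state)
        k = torch.distributions.Categorical(logits=logits[0, -1]).sample()
        outputs.append(k.item())
        x = torch.zeros(1, 1, num_classes)
        x[0, 0, k] = 1.0                # one-hot encode the sampled class
    return outputs
```

Although the model itself is deterministic, the `sample()` call is where the stochasticity enters, inducing a distribution over whole sequences.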

RNNs are ‘fuzzy’ in the sense that they do not use exact templates from the training data to make predictions, but rather—like other neural networks—use their internal representation to perform a high-dimensional interpolation between training examples. This distinguishes them from n-gram models and compression algorithms such as Prediction by Partial Matching [5], whose predictive distributions are determined by counting exact matches between the recent history and the training set. The result—which is immediately apparent from the samples in this paper—is that RNNs (unlike template-based algorithms) synthesise and reconstitute the training data in a complex way, and rarely generate the same thing twice. Furthermore, fuzzy predictions do not suffer from the curse of dimensionality, and are therefore much better at modelling real-valued or multivariate data than exact matches.


In principle a large enough RNN should be sufficient to generate sequences of arbitrary complexity. In practice however, standard RNNs are unable to store information about past inputs for very long [15]. As well as diminishing their ability to model long-range structure, this ‘amnesia’ makes them prone to instability when generating sequences. The problem (common to all conditional generative models) is that if the network’s predictions are only based on the last few inputs, and these inputs were themselves predicted by the network, it has little opportunity to recover from past mistakes. Having a longer memory has a stabilising effect, because even if the network cannot make sense of its recent history, it can look further back in the past to formulate its predictions. The problem of instability is especially acute with real-valued data, where it is easy for the predictions to stray from the manifold on which the training data lies. One remedy that has been proposed for conditional models is to inject noise into the predictions before feeding them back into the model [31], thereby increasing the model’s robustness to surprising inputs. However we believe that a better memory is a more profound and effective solution.


Long Short-term Memory (LSTM) [16] is an RNN architecture designed to be better at storing and accessing information than standard RNNs. LSTM has recently given state-of-the-art results in a variety of sequence processing tasks, including speech and handwriting recognition [10, 12]. The main goal of this paper is to demonstrate that LSTM can use its memory to generate complex, realistic sequences containing long-range structure.

Figure 1: Deep recurrent neural network prediction architecture. The circles represent network layers, the solid lines represent weighted connections and the dashed lines represent predictions.


Section 2 defines a ‘deep’ RNN composed of stacked LSTM layers, and explains how it can be trained for next-step prediction and hence sequence generation. Section 3 applies the prediction network to text from the Penn Treebank and Hutter Prize Wikipedia datasets. The network’s performance is competitive with state-of-the-art language models, and it works almost as well when predicting one character at a time as when predicting one word at a time. The highlight of the section is a generated sample of Wikipedia text, which showcases the network’s ability to model long-range dependencies. Section 4 demonstrates how the prediction network can be applied to real-valued data through the use of a mixture density output layer, and provides experimental results on the IAM Online Handwriting Database. It also presents generated handwriting samples proving the network’s ability to learn letters and short words direct from pen traces, and to model global features of handwriting style. Section 5 introduces an extension to the prediction network that allows it to condition its outputs on a short annotation sequence whose alignment with the predictions is unknown. This makes it suitable for handwriting synthesis, where a human user inputs a text and the algorithm generates a handwritten version of it. The synthesis network is trained on the IAM database, then used to generate cursive handwriting samples, some of which cannot be distinguished from real data by the naked eye. A method for biasing the samples towards higher probability (and greater legibility) is described, along with a technique for ‘priming’ the samples on real data and thereby mimicking a particular writer’s style. Finally, concluding remarks and directions for future work are given in Section 6.


2 Prediction Network

Fig. 1 illustrates the basic recurrent neural network prediction architecture used in this paper. An input vector sequence $x = (x_1, \dots, x_T)$ is passed through weighted connections to a stack of $N$ recurrently connected hidden layers to compute first the hidden vector sequences $h^n = (h^n_1, \dots, h^n_T)$ and then the output vector sequence $y = (y_1, \dots, y_T)$. Each output vector $y_t$ is used to parameterise a predictive distribution $\Pr(x_{t+1} \mid y_t)$ over the possible next inputs $x_{t+1}$. The first element $x_1$ of every input sequence is always a null vector whose entries are all zero; the network therefore emits a prediction for $x_2$, the first real input, with no prior information. The network is ‘deep’ in both space and time, in the sense that every piece of information passing either vertically or horizontally through the computation graph will be acted on by multiple successive weight matrices and nonlinearities.


Note the ‘skip connections’ from the inputs to all hidden layers, and from all hidden layers to the outputs. These make it easier to train deep networks, by reducing the number of processing steps between the bottom of the network and the top, and thereby mitigating the ‘vanishing gradient’ problem [1]. In the special case that $N = 1$ the architecture reduces to an ordinary, single layer next step prediction RNN.
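The skip-connection pattern can be illustrated with a short sketch (an assumption-laden simplification using standard PyTorch LSTM layers rather than the paper's exact cells): every hidden layer sees the raw input, and every hidden layer feeds the output directly.

```python
import torch
import torch.nn as nn

class DeepRNN(nn.Module):
    """Stack of N recurrent layers with input-to-all-hidden and
    all-hidden-to-output skip connections, as in Fig. 1 (sketch only)."""
    def __init__(self, input_size, hidden_size, output_size, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.LSTM(input_size + (hidden_size if n > 0 else 0),
                    hidden_size, batch_first=True)
            for n in range(num_layers)
        ])
        # every hidden layer connects directly to the output
        self.out = nn.Linear(hidden_size * num_layers, output_size)

    def forward(self, x):
        hiddens, below = [], None
        for layer in self.layers:
            inp = x if below is None else torch.cat([x, below], dim=-1)
            below, _ = layer(inp)       # layer n also receives layer n-1
            hiddens.append(below)
        return self.out(torch.cat(hiddens, dim=-1))
```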

The hidden layer activations are computed by iterating the following equations from $t = 1$ to $T$ and from $n = 2$ to $N$:

$$h^1_t = \mathcal{H}\left(W_{ih^1} x_t + W_{h^1h^1} h^1_{t-1} + b^1_h\right)$$
$$h^n_t = \mathcal{H}\left(W_{ih^n} x_t + W_{h^{n-1}h^n} h^{n-1}_t + W_{h^nh^n} h^n_{t-1} + b^n_h\right)$$

where the $W$ terms denote weight matrices (e.g. $W_{ih^n}$ is the weight matrix connecting the inputs to the $n$-th hidden layer, $W_{h^1h^1}$ is the recurrent connection at the first hidden layer, and so on), the $b$ terms denote bias vectors (e.g. $b_y$ is the output bias vector) and $\mathcal{H}$ is the hidden layer function.

Given the hidden sequences, the output sequence is computed as follows:

$$\hat{y}_t = b_y + \sum_{n=1}^{N} W_{h^n y}\, h^n_t, \qquad y_t = \mathcal{Y}(\hat{y}_t)$$

where $\mathcal{Y}$ is the output layer function. The complete network therefore defines a function, parameterised by the weight matrices, from input histories $x_{1:t}$ to output vectors $y_t$.

The output vectors $y_t$ are used to parameterise the predictive distribution $\Pr(x_{t+1} \mid y_t)$ for the next input. The form of $\Pr(x_{t+1} \mid y_t)$ must be chosen carefully to match the input data. In particular, finding a good predictive distribution for high-dimensional, real-valued data (usually referred to as density modelling) can be very challenging.

The probability given by the network to the input sequence $x$ is

$$\Pr(x) = \prod_{t=1}^{T} \Pr(x_{t+1} \mid y_t)$$

and the sequence loss used to train the network is its negative logarithm:

$$\mathcal{L}(x) = -\sum_{t=1}^{T} \log \Pr(x_{t+1} \mid y_t)$$

The partial derivatives of the loss with respect to the network weights can be efficiently calculated with backpropagation through time [33] applied to the computation graph shown in Fig. 1, and the network can then be trained with gradient descent.

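A sketch of the corresponding training step (hypothetical names, reusing the `CharRNN` sketch above): the loss is the negative log-probability of the sequence, and `loss.backward()` performs backpropagation through time over the unrolled computation graph.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, inputs, targets) -> float:
    """inputs:  (batch, T, K) one-hot vectors x_1..x_T (x_1 is the null vector)
    targets: (batch, T) class indices of the next inputs x_2..x_{T+1}"""
    logits, _ = model(inputs)                       # outputs y_1..y_T
    # L(x) = -sum_t log Pr(x_{t+1} | y_t)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), reduction="sum")
    optimizer.zero_grad()
    loss.backward()                                 # backpropagation through time
    optimizer.step()
    return loss.item()
```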

2.1 Long Short-Term Memory

Figure 2: Long Short-term Memory Cell

In most RNNs the hidden layer function $\mathcal{H}$ is an elementwise application of a sigmoid function. However we have found that the Long Short-Term Memory (LSTM) architecture [16], which uses purpose-built memory cells to store information, is better at finding and exploiting long range dependencies in the data. Fig. 2 illustrates a single LSTM memory cell. For the version of LSTM used in this paper [7], $\mathcal{H}$ is implemented by the following composite function:

$$i_t = \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1}\right)$$
$$f_t = \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1}\right)$$
$$c_t = f_t\, c_{t-1} + i_t \tanh\left(W_{xc} x_t + W_{hc} h_{t-1}\right)$$
$$o_t = \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t\right)$$
$$h_t = o_t \tanh(c_t)$$

where $\sigma$ is the logistic sigmoid function, and $i$, $f$, $o$ and $c$ are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the hidden vector $h$. The weight matrix subscripts have the obvious meaning, for example $W_{hi}$ is the hidden-input gate matrix, $W_{xo}$ is the input-output gate matrix etc. The weight matrices from the cell to gate vectors (e.g. $W_{ci}$) are diagonal, so element $m$ in each gate vector only receives input from element $m$ of the cell vector. The bias terms (which are added to $i$, $f$, $c$ and $o$) have been omitted for clarity.
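The composite function transcribes almost line for line into code. The sketch below (numpy; the weight layout is an illustrative assumption) treats the diagonal cell-to-gate matrices as vectors acting elementwise, and includes the bias terms that the equations above omit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, W, b):
    """One step of the LSTM composite function (sketch).
    W["ci"], W["cf"], W["co"] are the diagonal cell-to-gate weights, stored as
    vectors so that they act elementwise on the cell state."""
    i = sigmoid(W["xi"] @ x + W["hi"] @ h_prev + W["ci"] * c_prev + b["i"])
    f = sigmoid(W["xf"] @ x + W["hf"] @ h_prev + W["cf"] * c_prev + b["f"])
    c = f * c_prev + i * np.tanh(W["xc"] @ x + W["hc"] @ h_prev + b["c"])
    o = sigmoid(W["xo"] @ x + W["ho"] @ h_prev + W["co"] * c + b["o"])
    h = o * np.tanh(c)
    return h, c
```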


The original LSTM algorithm used a custom designed approximate gradient calculation that allowed the weights to be updated after every timestep [16]. However the full gradient can instead be calculated with backpropagation through time [11], the method used in this paper. One difficulty when training LSTM with the full gradient is that the derivatives sometimes become excessively large, leading to numerical problems. To prevent this, all the experiments in this paper clipped the derivative of the loss with respect to the network inputs to the LSTM layers (before the sigmoid and tanh functions are applied) to lie within a predefined range.


3 Text Prediction

Text data is discrete, and is typically presented to neural networks using ‘one-hot’ input vectors. That is, if there are $K$ text classes in total, and class $k$ is fed in at time $t$, then $x_t$ is a length $K$ vector whose entries are all zero except for the $k$-th, which is one. $\Pr(x_{t+1} \mid y_t)$ is therefore a multinomial distribution, which can be naturally parameterised by a softmax function at the output layer:

$$\Pr(x_{t+1} = k \mid y_t) = y_t^k = \frac{\exp\left(\hat{y}_t^k\right)}{\sum_{k'=1}^{K} \exp\left(\hat{y}_t^{k'}\right)}$$

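Concretely (an illustrative sketch; the vocabulary size and logits are placeholders), the one-hot encoding and softmax parameterisation look like this:

```python
import numpy as np

K = 49                       # e.g. the character vocabulary size used later
k = 7                        # class fed in at time t
x_t = np.zeros(K)
x_t[k] = 1.0                 # one-hot input vector

y_hat = np.random.randn(K)   # stand-in for the raw network outputs
probs = np.exp(y_hat - y_hat.max())
probs /= probs.sum()         # softmax: a valid multinomial Pr(x_{t+1} | y_t)
next_k = np.random.choice(K, p=probs)
```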

The only thing that remains to be decided is which set of classes to use. In most cases, text prediction (usually referred to as language modelling) is performed at the word level. $K$ is therefore the number of words in the dictionary. This can be problematic for realistic tasks, where the number of words (including variant conjugations, proper names, etc.) often exceeds 100,000. As well as requiring many parameters to model, having so many classes demands a huge amount of training data to adequately cover the possible contexts for the words. In the case of softmax models, a further difficulty is the high computational cost of evaluating all the exponentials during training (although several methods have been devised to make training large softmax layers more efficient, including tree-based models [25, 23], low rank approximations [27] and stochastic derivatives [26]). Furthermore, word-level models are not applicable to text data containing non-word strings, such as multi-digit numbers or web addresses.


Character-level language modelling with neural networks has recently been considered [30, 24], and found to give slightly worse performance than equivalent word-level models. Nonetheless, predicting one character at a time is more interesting from the perspective of sequence generation, because it allows the network to invent novel words and strings. In general, the experiments in this paper aim to predict at the finest granularity found in the data, so as to maximise the generative flexibility of the network.


3.1 Penn Treebank Experiments

The first set of text prediction experiments focused on the Penn Treebank portion of the Wall Street Journal corpus [22]. This was a preliminary study whose main purpose was to gauge the predictive power of the network, rather than to generate interesting sequences.


Although a relatively small text corpus (a little over a million words in total), the Penn Treebank data is widely used as a language modelling benchmark. The training set contains 930,000 words, the validation set contains 74,000 words and the test set contains 82,000 words. The vocabulary is limited to 10,000 words, with all other words mapped to a special ‘unknown word’ token. The end-of-sentence token was included in the input sequences, and was counted in the sequence loss. The start-of-sentence marker was ignored, because its role is already fulfilled by the null vectors that begin the sequences (c.f. Section 2).

The experiments compared the performance of word and character-level LSTM predictors on the Penn corpus. In both cases, the network architecture was a single hidden layer with 1000 LSTM units. For the character-level network the input and output layers were size 49, giving approximately 4.3M weights in total, while the word-level network had 10,000 inputs and outputs and around 54M weights. The comparison is therefore somewhat unfair, as the word-level network had many more parameters. However, as the dataset is small, both networks were easily able to overfit the training data, and it is not clear whether the character-level network would have benefited from more weights. All networks were trained with stochastic gradient descent, using a learn rate of 0.0001 and a momentum of 0.99. The LSTM derivatives were clipped in the range [-1, 1] (c.f. Section 2.1).


Neural networks are usually evaluated on test data with fixed weights. For prediction problems however, where the inputs are the targets, it is legitimate to allow the network to adapt its weights as it is being evaluated (so long as it only sees the test data once). Mikolov refers to this as dynamic evaluation. Dynamic evaluation allows for a fairer comparison with compression algorithms, for which there is no division between training and test sets, as all data is only predicted once.

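A sketch of dynamic evaluation under these assumptions (hypothetical `model` and data iterator; the key point is that each segment is scored before the weights are updated on it, so every data point is still predicted exactly once):

```python
import torch
import torch.nn.functional as F

def dynamic_evaluation(model, test_batches, lr=1e-4) -> float:
    """Evaluate while adapting: score each segment, then train on it."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    total_loss, total_targets = 0.0, 0
    for inputs, targets in test_batches:
        logits, _ = model(inputs)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1), reduction="sum")
        total_loss += loss.item()
        total_targets += targets.numel()
        optimizer.zero_grad()
        loss.backward()              # adapt the weights to the data just seen
        optimizer.step()
    return total_loss / total_targets    # average nats per prediction
```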

Table 1: Penn Treebank Test Set Results. ‘BPC’ is bits-per-character. ‘Error’ is next-step classification error rate, for either characters or words.

Since both networks overfit the training data, we also experiment with two types of regularisation: weight noise [18] with a std. deviation of 0.075 applied to the network weights at the start of each training sequence, and adaptive weight noise [8], where the variance of the noise is learned along with the weights using a Minimum Description Length (or equivalently, variational inference) loss function. When weight noise was used, the network was initialised with the final weights of the unregularised network. Similarly, when adaptive weight noise was used, the weights were initialised with those of the network trained with weight noise. We have found that retraining with iteratively increased regularisation is considerably faster than training from random weights with regularisation. Adaptive weight noise was found to be prohibitively slow for the word-level network, so it was regularised with fixed-variance weight noise only. One advantage of adaptive weight noise is that early stopping is not needed (the network can safely be stopped at the point of minimum total ‘description length’ on the training data). However, to keep the comparison fair, the same training, validation and test sets were used for all experiments.


The results are presented with two equivalent metrics: bits-per-character (BPC), which is the average value of $-\log_2 \Pr(x_{t+1} \mid y_t)$ over the whole test set; and perplexity, which is two to the power of the average number of bits per word (the average word length on the test set is about 5.6 characters, so perplexity $\approx 2^{5.6\,\mathrm{BPC}}$). Perplexity is the usual performance measure for language modelling.

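For example (illustrative numbers, not results from Table 1), converting between the two metrics is a one-liner:

```python
bpc = 1.26                    # hypothetical bits-per-character score
chars_per_word = 5.6          # average word length on the test set
perplexity = 2 ** (bpc * chars_per_word)
print(round(perplexity, 1))   # ~133.1: two to the power of bits-per-word
```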

Table 1 shows that the word-level RNN performed better than the character-level network, but the gap appeared to close when regularisation was used. Overall the results compare favourably with those collected in Tomas Mikolov’s thesis [23]. For example, he records a perplexity of 141 for a 5-gram with Kneser-Ney smoothing, 141.8 for a word level feedforward neural network, 131.1 for the state-of-the-art compression algorithm PAQ8 and 123.2 for a dynamically evaluated word-level RNN. However by combining multiple RNNs, a 5-gram and a cache model in an ensemble, he was able to achieve a perplexity of 89.4. Interestingly, the benefit of dynamic evaluation was far more pronounced here than in Mikolov’s thesis (he records a perplexity improvement from 124.7 to 123.2 with word-level RNNs). This suggests that LSTM is better at rapidly adapting to new data than ordinary RNNs.


3.2 Wikipedia Experiments

In 2006 Marcus Hutter, Jim Bowery and Matt Mahoney organised the following challenge, commonly known as Hutter prize [17]: to compress the first 100 million bytes of the complete English Wikipedia data (as it was at a certain time on March 3rd 2006) to as small a file as possible. The file had to include not only the compressed data, but also the code implementing the compression algorithm. Its size can therefore be considered a measure of the minimum description length [13] of the data using a two part coding scheme.


Wikipedia data is interesting from a sequence generation perspective because it contains not only a huge range of dictionary words, but also many character sequences that would not be included in text corpora traditionally used for language modelling. For example foreign words (including letters from non-Latin alphabets such as Arabic and Chinese), indented XML tags used to define meta-data, website addresses, and markup used to indicate page formatting such as headings, bullet points etc. An extract from the Hutter prize dataset is shown in Figs. 3 and 4.


The first 96M bytes in the data were evenly split into sequences of 100 bytes and used to train the network, with the remaining 4M used for validation. The data contains a total of 205 one-byte unicode symbols. The total number of characters is much higher, since many characters (especially those from non-Latin languages) are defined as multi-symbol sequences. In keeping with the principle of modelling the smallest meaningful units in the data, the network predicted a single byte at a time, and therefore had size 205 input and output layers.
Wikipedia contains long-range regularities, such as the topic of an article, which can span many thousand words. To make it possible for the network to capture these, its internal state (that is, the output activations ht of the hidden layers, and the activations ct of the LSTM cells within the layers) were only reset every 100 sequences. Furthermore the order of the sequences was not shuffled during training, as it usually is for neural networks. The network was therefore able to access information from up to 10K characters in the past when making predictions. The error terms were only backpropagated to the start of each 100 byte sequence, meaning that the gradient calculation was approximate. This form of truncated backpropagation has been considered before for RNN language modelling [23], and found to speed up training (by reducing the sequence length and hence increasing the frequency of stochastic weight updates) without affecting the network’s ability to learn long-range dependencies.

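A sketch of this training regime (hypothetical model exposing its recurrent state; the 100-byte sequences and 100-sequence reset interval follow the text): gradients are truncated at every sequence boundary, but the state itself persists.

```python
def train_wikipedia(model, optimizer, loss_fn, batches):
    """Truncated backpropagation through time with a persistent state.
    `batches` yields contiguous 100-byte (inputs, targets) chunks in order."""
    state = None
    for i, (inputs, targets) in enumerate(batches):
        if i % 100 == 0:
            state = None                              # reset every 100 sequences
        if state is not None:
            state = tuple(s.detach() for s in state)  # truncate the gradient
        logits, state = model(inputs, state)
        loss = loss_fn(logits, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```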

A much larger network was used for this data than the Penn data (reflecting the greater size and complexity of the training set) with seven hidden layers of 700 LSTM cells, giving approximately 21.3M weights. The network was trained with stochastic gradient descent, using a learn rate of 0.0001 and a momentum of 0.9. It took four training epochs to converge. The LSTM derivatives were clipped in the range [-1, 1].


As with the Penn data, we tested the network on the validation data with and without dynamic evaluation (where the weights are updated as the data is predicted). As can be seen from Table 2 performance was much better with dynamic evaluation. This is probably because of the long range coherence of Wikipedia data; for example, certain words are much more frequent in some articles than others, and being able to adapt to this during evaluation is advantageous. It may seem surprising that the dynamic results on the validation set were substantially better than on the training set. However this is easily explained by two factors: firstly, the network underfit the training data, and secondly some portions of the data are much more difficult than others (for example, plain text is harder to predict than XML tags).


To put the results in context, the current winner of the Hutter Prize (a variant of the PAQ-8 compression algorithm [20]) achieves 1.28 BPC on the same data (including the code required to implement the algorithm), mainstream compressors such as zip generally get more than 2, and a character level RNN applied to a text-only version of the data (i.e. with all the XML, markup tags etc. removed) achieved 1.54 on held-out data, which improved to 1.47 when the RNN was combined with a maximum entropy model [24].


A four page sample generated by the prediction network is shown in Figs. 5 to 8. The sample shows that the network has learned a lot of structure from the data, at a wide range of different scales. Most obviously, it has learned a large vocabulary of dictionary words, along with a subword model that enables it to invent feasible-looking words and names: for example “Lochroom River”, “Mughal Ralvaldens”, “submandration”, “swalloped”. It has also learned basic punctuation, with commas, full stops and paragraph breaks occurring at roughly the right rhythm in the text blocks.

Being able to correctly open and close quotation marks and parentheses is a clear indicator of a language model’s memory, because the closure cannot be predicted from the intervening text, and hence cannot be modelled with short-range context [30]. The sample shows that the network is able to balance not only parentheses and quotes, but also formatting marks such as the equals signs used to denote headings, and even nested XML tags and indentation.


The network generates non-Latin characters such as Cyrillic, Chinese and Arabic, and seems to have learned a rudimentary model for languages other than English (e.g. it generates “es:Geotnia slago” for the Spanish ‘version’ of an article, and “nl:Rodenbaueri” for the Dutch one). It also generates convincing-looking internet addresses (none of which appear to be real).

The network generates distinct, large-scale regions, such as XML headers, bullet-point lists and article text. Comparison with Figs. 3 and 4 suggests that these regions are a fairly accurate reflection of the constitution of the real data (although the generated versions tend to be somewhat shorter and more jumbled together). This is significant because each region may span hundreds or even thousands of timesteps. The fact that the network is able to remain coherent over such large intervals (even putting the regions in an approximately correct order, such as having headers at the start of articles and bullet-pointed ‘see also’ lists at the end) is testament to its long-range memory.


As with all text generated by language models, the sample does not make sense beyond the level of short phrases. The realism could perhaps be improved with a larger network and/or more data. However, it seems futile to expect meaningful language from a machine that has never been exposed to the sensory world to which language refers.

Lastly, the network’s adaptation to recent sequences during training (which allows it to benefit from dynamic evaluation) can be clearly observed in the extract. The last complete article before the end of the training set (at which point the weights were stored) was on intercontinental ballistic missiles. The influence of this article on the network’s language model can be seen from the profusion of missile-related terms. Other recent topics include ‘Individual Anarchism’, the Italian writer Italo Calvino and the International Organization for Standardization (ISO), all of which make themselves felt in the network’s vocabulary.


4 Handwriting Prediction

To test whether the prediction network could also be used to generate convincing real-valued sequences, we applied it to online handwriting data (online in this context means that the writing is recorded as a sequence of pen-tip locations, as opposed to offline handwriting, where only the page images are available). Online handwriting is an attractive choice for sequence generation due to its low dimensionality (two real numbers per data point) and ease of visualisation.


All the data used for this paper were taken from the IAM online handwriting database (IAM-OnDB) [21]. IAM-OnDB consists of handwritten lines collected from 221 different writers using a ‘smart whiteboard’. The writers were asked to write forms from the Lancaster-Oslo-Bergen text corpus [19], and the position of their pen was tracked using an infra-red device in the corner of the board. Samples from the training data are shown in Fig. 9. The original input data consists of the x and y pen co-ordinates and the points in the sequence when the pen is lifted off the whiteboard. Recording errors in the x, y data were corrected by interpolating to fill in for missing readings, and removing steps whose length exceeded a certain threshold. Beyond that, no preprocessing was used and the network was trained to predict the x, y co-ordinates and the end-of-stroke markers one point at a time. This contrasts with most approaches to handwriting recognition and synthesis, which rely on sophisticated preprocessing and feature-extraction techniques. We eschewed such techniques because they tend to reduce the variation in the data (e.g. by normalising the character size, slant, skew and so-on) which we wanted the network to model. Predicting the pen traces one point at a time gives the network maximum flexibility to invent novel handwriting, but also requires a lot of memory, with the average letter occupying more than 25 timesteps and the average line occupying around 700. Predicting delayed strokes (such as dots for ‘i’s or crosses for ‘t’s that are added after the rest of the word has been written) is especially demanding.


IAM-OnDB is divided into a training set, two validation sets and a test set, containing respectively 5364, 1438, 1518 and 3859 handwritten lines taken from 775, 192, 216 and 544 forms. For our experiments, each line was treated as a separate sequence (meaning that possible dependencies between successive lines were ignored). In order to maximise the amount of training data, we used the training set, test set and the larger of the validation sets for training and the smaller validation set for early-stopping.


Figure 9: Training samples from the IAM online handwriting database. Notice the wide range of writing styles, the variation in line angle and character sizes, and the writing and recording errors, such as the scribbled out letters in the first line and the repeated word in the final line.


The lack of an independent test set means that the recorded results may be somewhat overfit on the validation set; however the validation results are of secondary importance, since no benchmark results exist and the main goal was to generate convincing-looking handwriting. The principal challenge in applying the prediction network to online handwriting data was determining a predictive distribution suitable for real-valued inputs. The following section describes how this was done.


4.1 Mixture Density Outputs

The idea of mixture density networks [2, 3] is to use the outputs of a neural network to parameterise a mixture distribution. A subset of the outputs are used to define the mixture weights, while the remaining outputs are used to parameterise the individual mixture components. The mixture weight outputs are normalised with a softmax function to ensure they form a valid discrete distribution, and the other outputs are passed through suitable functions to keep their values within meaningful range (for example the exponential function is typically applied to outputs used as scale parameters, which must be positive).


Mixture density networks are trained by maximising the log probability density of the targets under the induced distributions. Note that the densities are normalised (up to a fixed constant) and are therefore straightforward to differentiate and pick unbiased samples from, in contrast with restricted Boltzmann machines [14] and other undirected models. Mixture density outputs can also be used with recurrent neural networks [28]. In this case the output distribution is conditioned not only on the current input, but on the history of previous inputs. Intuitively, the number of components is the number of choices the network has for the next output given the inputs so far.

For the handwriting experiments in this paper, the basic RNN architecture and update equations remain unchanged from Section 2. Each input vector $x_t$ consists of a real-valued pair $x_1, x_2$ that defines the pen offset from the previous input, along with a binary $x_3$ that has value 1 if the vector ends a stroke (that is, if the pen was lifted off the board before the next vector was recorded) and value 0 otherwise. A mixture of bivariate Gaussians was used to predict $x_1$ and $x_2$, while a Bernoulli distribution was used for $x_3$. Each output vector $y_t$ therefore consists of the end of stroke probability $e$, along with a set of means $\mu^j$, standard deviations $\sigma^j$, correlations $\rho^j$ and mixture weights $\pi^j$ for the $M$ mixture components. That is,

$$y_t = \left(e_t,\ \{\pi^j_t, \mu^j_t, \sigma^j_t, \rho^j_t\}_{j=1}^{M}\right)$$

where the raw network outputs are squashed so that each parameter stays in a valid range:

$$e_t = \frac{1}{1 + \exp(\hat{e}_t)}, \qquad \pi^j_t = \frac{\exp(\hat{\pi}^j_t)}{\sum_{j'=1}^{M} \exp(\hat{\pi}^{j'}_t)}, \qquad \mu^j_t = \hat{\mu}^j_t, \qquad \sigma^j_t = \exp(\hat{\sigma}^j_t), \qquad \rho^j_t = \tanh(\hat{\rho}^j_t)$$

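Under the assumption that the normalisations reconstructed above are applied to a flat output vector, extracting the mixture parameters might look like this (a sketch; the split layout follows the 121-output example of Section 4.2 with $M = 20$):

```python
import torch

def mixture_params(y_hat: torch.Tensor, M: int = 20):
    """Split the raw output vector (size 6M + 1) into mixture parameters,
    squashing each through the function that keeps it in a valid range."""
    e_hat, pi_hat, mu, sigma_hat, rho_hat = torch.split(
        y_hat, [1, M, 2 * M, 2 * M, M], dim=-1)
    e = torch.sigmoid(-e_hat)           # end-of-stroke prob: 1 / (1 + exp(ê))
    pi = torch.softmax(pi_hat, dim=-1)  # mixture weights sum to one
    sigma = torch.exp(sigma_hat)        # standard deviations must be positive
    rho = torch.tanh(rho_hat)           # correlations lie in (-1, 1)
    return e, pi, mu, sigma, rho
```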

This can be substituted into Eq. (6) to determine the sequence loss (up to a constant that depends only on the quantisation of the data and does not influence network training). The predictive distribution combines the Gaussian mixture over the pen offsets with the Bernoulli end-of-stroke term:

$$\Pr(x_{t+1} \mid y_t) = \sum_{j=1}^{M} \pi^j_t\, \mathcal{N}\!\left(x_{t+1} \mid \mu^j_t, \sigma^j_t, \rho^j_t\right) \times \begin{cases} e_t & \text{if } (x_{t+1})_3 = 1 \\ 1 - e_t & \text{otherwise} \end{cases}$$


Figure 10: Mixture density outputs for handwriting prediction. The top heatmap shows the sequence of probability distributions for the predicted pen locations as the word ‘under’ is written. The densities for successive predictions are added together, giving high values where the distributions overlap.


Two types of prediction are visible from the density map: the small blobs that spell out the letters are the predictions as the strokes are being written, the three large blobs are the predictions at the ends of the strokes for the first point in the next stroke. The end-of-stroke predictions have much higher variance because the pen position was not recorded when it was off the whiteboard, and hence there may be a large distance between the end of one stroke and the start of the next.

The bottom heatmap shows the mixture component weights during the same sequence. The stroke ends are also visible here, with the most active components switching off in three places, and other components switching on: evidently end-of-stroke predictions use a different set of mixture components from in-stroke predictions.


4.2 Experiments

Each point in the data sequences consisted of three numbers: the x and y offset from the previous point, and the binary end-of-stroke feature. The network input layer was therefore size 3. The co-ordinate offsets were normalised to mean 0, std. dev. 1 over the training set. 20 mixture components were used to model the offsets, giving a total of 120 mixture parameters per timestep (20 weights, 40 means, 40 standard deviations and 20 correlations). A further parameter was used to model the end-of-stroke probability, giving an output layer of size 121. Two network architectures were compared for the hidden layers: one with three hidden layers, each consisting of 400 LSTM cells, and one with a single hidden layer of 900 LSTM cells. Both networks had around 3.4M weights. The three layer network was retrained with adaptive weight noise [8], with all std. devs. initialised to 0.075. Training with fixed variance weight noise proved ineffective, probably because it prevented the mixture density layer from using precisely specified weights.


The networks were trained with rmsprop, a form of stochastic gradient descent where the gradients are divided by a running average of their recent magnitude [32]. Define $\epsilon_i = \frac{\partial \mathcal{L}(x)}{\partial w_i}$, where $w_i$ is network weight $i$. The weight update equations were:

$$n_i = 0.95\, n_i + 0.05\, \epsilon_i^2$$
$$g_i = 0.95\, g_i + 0.05\, \epsilon_i$$
$$\Delta_i = 0.9\, \Delta_i - 0.0001\, \frac{\epsilon_i}{\sqrt{n_i - g_i^2 + 10^{-4}}}$$
$$w_i = w_i + \Delta_i$$

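The update rule transcribes directly (a sketch in numpy, with the constants taken from the reconstruction above):

```python
import numpy as np

def rmsprop_update(w, grad, n, g, delta,
                   avg=0.95, momentum=0.9, lr=1e-4, eps=1e-4):
    """One rmsprop step: the gradient is divided by a running estimate of
    its recent standard deviation, then applied with momentum."""
    n = avg * n + (1 - avg) * grad ** 2        # running mean of squared grads
    g = avg * g + (1 - avg) * grad             # running mean of grads
    delta = momentum * delta - lr * grad / np.sqrt(n - g ** 2 + eps)
    w = w + delta
    return w, n, g, delta
```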

The output derivatives $\frac{\partial \mathcal{L}(x)}{\partial \hat{y}_t}$ were clipped in the range [-100, 100], and the LSTM derivatives were clipped in the range [-10, 10]. Clipping the output gradients proved vital for numerical stability; even so, the networks sometimes had numerical problems late on in training, after they had started overfitting on the training data.

Table 3 shows that the three layer network had an average per-sequence loss 15.3 nats lower than the one layer net. However the sum-squared error was slightly lower for the single layer network. The use of adaptive weight noise reduced the loss by another 16.7 nats relative to the unregularised three layer network, but did not significantly change the sum-squared error. The adaptive weight noise network appeared to generate the best samples.


4.3 Samples

Fig. 11 shows handwriting samples generated by the prediction network. The network has clearly learned to model strokes, letters and even short words (especially common ones such as ‘of’ and ‘the’). It also appears to have learned a basic character-level language model, since the words it invents (‘eald’, ‘bryoes’, ‘lenrest’) look somewhat plausible in English. Given that the average character occupies more than 25 timesteps, this again demonstrates the network’s ability to generate coherent long-range structures.


5 Handwriting Synthesis

Handwriting synthesis is the generation of handwriting for a given text. Clearly the prediction networks we have described so far are unable to do this, since there is no way to constrain which letters the network writes. This section describes an augmentation that allows a prediction network to generate data sequences conditioned on some high-level annotation sequence (a character string, in the case of handwriting synthesis). The resulting sequences are sufficiently convincing that they often cannot be distinguished from real handwriting. Furthermore, this realism is achieved without sacrificing the diversity in writing style demonstrated in the previous section.


The main challenge in conditioning the predictions on the text is that the two sequences are of very different lengths (the pen trace being on average twenty five times as long as the text), and the alignment between them is unknown until the data is generated. This is because the number of co-ordinates used to write each character varies greatly according to style, size, pen speed etc. One neural network model able to make sequential predictions based on two sequences of different length and unknown alignment is the RNN transducer [9]. However preliminary experiments on handwriting synthesis with RNN transducers were not encouraging. A possible explanation is that the transducer uses two separate RNNs to process the two sequences, then combines their outputs to make decisions, when it is usually more desirable to make all the information available to a single network. This work proposes an alternative model, where a ‘soft window’ is convolved with the text string and fed in as an extra input to the prediction network. The parameters of the window are output by the network at the same time as it makes the predictions, so that it dynamically determines an alignment between the text and the pen locations. Put simply, it learns to decide which character to write next.


5.1 Synthesis Network

Fig. 12 illustrates the network architecture used for handwriting synthesis. As with the prediction network, the hidden layers are stacked on top of each other, each feeding up to the layer above, and there are skip connections from the inputs to all hidden layers and from all hidden layers to the outputs. The difference is the added input from the character sequence, mediated by the window layer.

Given a length $U$ character sequence $c$ and a length $T$ data sequence $x$, the soft window $w_t$ into $c$ at timestep $t$ ($1 \le t \le T$) is defined by the following discrete convolution with a mixture of $K$ Gaussian functions:

$$\phi(t, u) = \sum_{k=1}^{K} \alpha^k_t \exp\left(-\beta^k_t \left(\kappa^k_t - u\right)^2\right)$$
$$w_t = \sum_{u=1}^{U} \phi(t, u)\, c_u$$

where the window parameters are emitted by the first hidden layer and squashed to valid ranges:

$$(\hat{\alpha}_t, \hat{\beta}_t, \hat{\kappa}_t) = W_{h^1 p}\, h^1_t + b_p, \qquad \alpha_t = \exp(\hat{\alpha}_t), \qquad \beta_t = \exp(\hat{\beta}_t), \qquad \kappa_t = \kappa_{t-1} + \exp(\hat{\kappa}_t)$$

Intuitively, $\kappa_t$ controls the location of the window, $\beta_t$ its width, and $\alpha_t$ the importance of the window within the mixture.


Note that the location parameters $\kappa_t$ are defined as offsets from the previous locations $\kappa_{t-1}$, and that the size of the offset is constrained to be greater than zero. Intuitively, this means that the network learns how far to slide each window at each step, rather than an absolute location. Using offsets was essential to getting the network to align the text with the pen trace.

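A sketch of the window computation (assuming `c` is a (U, V) matrix of one-hot character vectors, `h1` is the first hidden layer's activation, and `W_p`, `b_p` are the hypothetical parameters of the linear map emitting the window coefficients):

```python
import numpy as np

def soft_window(h1, c, kappa_prev, W_p, b_p, K=10):
    """Gaussian attention window over the character sequence (sketch)."""
    p = W_p @ h1 + b_p                        # size-3K window parameter vector
    alpha = np.exp(p[:K])                     # importance of each component
    beta = np.exp(p[K:2 * K])                 # (inverse) width of each component
    kappa = kappa_prev + np.exp(p[2 * K:])    # location: positive offsets only
    U = c.shape[0]
    u = np.arange(1, U + 1)                   # character positions 1..U
    phi = (alpha[:, None]
           * np.exp(-beta[:, None] * (kappa[:, None] - u[None, :]) ** 2)).sum(0)
    w_t = phi @ c                             # soft window w_t into the text
    return w_t, kappa, phi
```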

The $w_t$ vectors are passed to the second and third hidden layers at time $t$, and the first hidden layer at time $t+1$ (to avoid creating a cycle in the processing graph). The update equations for the hidden layers therefore become:

$$h^1_t = \mathcal{H}\left(W_{ih^1} x_t + W_{h^1h^1} h^1_{t-1} + W_{wh^1} w_{t-1} + b^1_h\right)$$
$$h^n_t = \mathcal{H}\left(W_{ih^n} x_t + W_{h^{n-1}h^n} h^{n-1}_t + W_{h^nh^n} h^n_{t-1} + W_{wh^n} w_t + b^n_h\right)$$


5.2 Experiments

The synthesis network was applied to the same input data as the handwriting prediction network in the previous section. The character-level transcriptions from the IAM-OnDB were now used to define the character sequences $c$. The full transcriptions contain 80 distinct characters (capital letters, lower case letters, digits, and punctuation). However we used only a subset of 57, with all digits and most of the punctuation characters replaced with a generic ‘non-letter’ label.


The network architecture was as similar as possible to the best prediction network: three hidden layers of 400 LSTM cells each, 20 bivariate Gaussian mixture components at the output layer and a size 3 input layer. The character sequence was encoded with one-hot vectors, and hence the window vectors were size 57. A mixture of 10 Gaussian functions was used for the window parameters, requiring a size 30 parameter vector. The total number of weights was increased to approximately 3.7M.


The network was trained with rmsprop, using the same parameters as in the previous section. The network was retrained with adaptive weight noise, initial standard deviation 0.075, and the output and LSTM gradients were again clipped in the range [-100, 100] and [-10, 10] respectively.


Table 4 shows that adaptive weight noise gave a considerable improvement in log-loss (around 31.3 nats) but no significant change in sum-squared error. The regularised network appears to generate slightly more realistic sequences, although the difference is hard to discern by eye. Both networks performed considerably better than the best prediction network. In particular the sum-squared error was reduced by 44%. This is likely due in large part to the improved predictions at the ends of strokes, where the error is largest.


5.3 Unbiased Sampling

Given $c$, an unbiased sample can be picked from $\Pr(x \mid c)$ by iteratively drawing $x_{t+1}$ from $\Pr(x_{t+1} \mid y_t)$, just as for the prediction network. The only difference is that we must also decide when the synthesis network has finished writing the text and should stop making any future decisions. To do this, we use the following heuristic: as soon as $\phi(t, U+1) > \phi(t, u)\ \forall\, 1 \le u \le U$, the current input $x_t$ is defined as the end of the sequence and sampling ends. Examples of unbiased synthesis samples are shown in Fig. 15. These and all subsequent figures were generated using the synthesis network retrained with adaptive weight noise. Notice how stylistic traits, such as character size, slant, cursiveness etc. vary widely between the samples, but remain more-or-less consistent within them. This suggests that the network identifies the traits early on in the sequence, then remembers them until the end. By looking through enough samples for a given text, it appears to be possible to find virtually any combination of stylistic traits, which suggests that the network models them independently both from each other and from the text.

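The stopping heuristic translates directly (a sketch; `phi_with_end` is assumed to hold the window weights $\phi(t, u)$ evaluated at positions $1 \dots U+1$):

```python
import numpy as np

def is_finished(phi_with_end: np.ndarray) -> bool:
    """Stop sampling once the window has moved past the final character,
    i.e. phi(t, U+1) exceeds phi(t, u) for every u <= U."""
    return bool(np.all(phi_with_end[-1] > phi_with_end[:-1]))
```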

‘Blind taste tests’ carried out by the author during presentations suggest that at least some unbiased samples cannot be distinguished from real handwriting by the human eye. Nonetheless the network does make mistakes we would not expect a human writer to make, often involving missing, confused or garbled letters; this suggests that the network sometimes has trouble determining the alignment between the characters and the trace. The number of mistakes increases markedly when less common words or phrases are included in the character sequence. Presumably this is because the network learns an implicit character-level language model from the training set that gets confused when rare or unknown transitions occur.

5.4 Biased Sampling

One problem with unbiased samples is that they tend to be difficult to read (partly because real handwriting is difficult to read, and partly because the network is an imperfect model). Intuitively, we would expect the network to give higher probability to good handwriting because it tends to be smoother and more predictable than bad handwriting. If this is true, we should aim to output more probable elements of $\Pr(x \mid c)$ if we want the samples to be easier to read. A principled search for high probability samples could lead to a difficult inference problem, as the probability of every output depends on all previous outputs. However a simple heuristic, where the sampler is biased towards more probable predictions at each step independently, generally gives good results. Define the probability bias $b$ as a real number greater than or equal to zero. Before drawing a sample from $\Pr(x_{t+1} \mid y_t)$, each standard deviation $\sigma^j_t$ in the Gaussian mixture is recalculated from Eq. (21) to

$$\sigma^j_t = \exp\left(\hat{\sigma}^j_t - b\right)$$

and each mixture weight is recalculated from Eq. (19) to

$$\pi^j_t = \frac{\exp\left(\hat{\pi}^j_t (1 + b)\right)}{\sum_{j'=1}^{M} \exp\left(\hat{\pi}^{j'}_t (1 + b)\right)}$$


This artificially reduces the variance in both the choice of component from the mixture, and in the distribution of the component itself. When $b = 0$ unbiased sampling is recovered, and as $b \to \infty$ the variance in the sampling disappears and the network always outputs the mode of the most probable component in the mixture (which is not necessarily the mode of the mixture, but at least a reasonable approximation). Fig. 16 shows the effect of progressively increasing the bias, and Fig. 17 shows samples generated with a low bias for the same texts as Fig. 15.

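Applying the bias before each sampling step is a two-line change (a sketch following the recalculations above; `b = 0` recovers unbiased sampling):

```python
import torch

def biased_mixture(pi_hat, sigma_hat, b=1.0):
    """Reduce variance in both the component choice and the components
    themselves; larger b concentrates probability on likelier predictions."""
    pi = torch.softmax(pi_hat * (1.0 + b), dim=-1)  # sharpen mixture weights
    sigma = torch.exp(sigma_hat - b)                # shrink std. deviations
    return pi, sigma
```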

5.5 Primed Sampling

Another reason to constrain the sampling would be to generate handwriting in the style of a particular writer (rather than in a randomly selected style). The easiest way to do this would be to retrain it on that writer only. But even without retraining, it is possible to mimic a particular style by ‘priming’ the network with a real sequence, then generating an extension with the real sequence still in the network’s memory. This can be achieved for a real $x$, $c$ and a synthesis character string $s$ by setting the character sequence to $c' = c + s$ and clamping the data inputs to $x$ for the first $T$ timesteps, then sampling as usual until the sequence ends. Examples of primed samples are shown in Figs. 18 and 19. The fact that priming works proves that the network is able to remember stylistic features identified earlier on in the sequence. This technique appears to work better for sequences in the training data than those the network has never seen.


Primed sampling and reduced variance sampling can also be combined. As shown in Figs. 20 and 21 this tends to produce samples in a ‘cleaned up’ version of the priming style, with overall stylistic traits such as slant and cursiveness retained, but the strokes appearing smoother and more regular. A possible application would be the artificial enhancement of poor handwriting.
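In terms of the two sketches above, the combination is simply the priming loop driven by the biased sampler; for instance (again with the same illustrative names):

def primed_biased_sample(step, initial_state, x_real, c, s, n_extra, b, rng):
    # Prime on a real trace, then extend with probability bias b, assuming each
    # per-step distribution is the tuple (pi_hat, mu, sigma_hat) from the
    # mixture output layer, as in the earlier sketches.
    biased = lambda dist: sample_step(*dist, b=b, rng=rng)
    return primed_sample(step, biased, initial_state, x_real, c, s, n_extra)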

6 Conclusions and Future Work

This paper has demonstrated the ability of Long Short-Term Memory recurrent neural networks to generate both discrete and real-valued sequences with complex, long-range structure using next-step prediction. It has also introduced a novel convolutional mechanism that allows a recurrent network to condition its predictions on an auxiliary annotation sequence, and used this approach to synthesise diverse and realistic samples of online handwriting. Furthermore, it has shown how these samples can be biased towards greater legibility, and how they can be modelled on the style of a particular writer.



Several directions for future work suggest themselves. One is the application of the network to speech synthesis, which is likely to be more challenging than handwriting synthesis due to the greater dimensionality of the data points. Another is to gain a better insight into the internal representation of the data, and to use this to manipulate the sample distribution directly. It would also be interesting to develop a mechanism to automatically extract high-level annotations from sequence data. In the case of handwriting, this could allow for more nuanced annotations than just text, for example stylistic features, different forms of the same letter, information about stroke order and so on.

Acknowledgements

Thanks to Yichuan Tang, Ilya Sutskever, Navdeep Jaitly, Geoffrey Hinton and other colleagues at the University of Toronto for numerous useful comments and suggestions. This work was supported by a Global Scholarship from the Canadian Institute for Advanced Research.


References

[1] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, March 1994.
[2] C. Bishop. Mixture density networks. Technical report, 1994.
[3] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Inc., 1995.
[4] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the Twenty-ninth International Conference on Machine Learning (ICML’12), 2012.
[5] J. G. Cleary and I. H. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32:396–402, 1984.
[6] D. Eck and J. Schmidhuber. A first look at music composition using LSTM recurrent neural networks. Technical report, IDSIA / USI-SUPSI, Istituto Dalle Molle.
[7] F. Gers, N. Schraudolph, and J. Schmidhuber. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3:115–143, 2002.
[8] A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, volume 24, pages 2348–2356. 2011.
[9] A. Graves. Sequence transduction with recurrent neural networks. In ICML Representation Learning Workshop, 2012.
[10] A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Proc. ICASSP, 2013.
[11] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:602–610, 2005.
[12] A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in Neural Information Processing Systems, volume 21, 2008.
[13] P. D. Grünwald. The Minimum Description Length Principle (Adaptive Computation and Machine Learning). The MIT Press, 2007.
[14] G. Hinton. A Practical Guide to Training Restricted Boltzmann Machines. Technical report, 2010.
[15] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-term Dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. 2001.
[16] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
[17] M. Hutter. The Human Knowledge Compression Contest, 2012.
[18] K.-C. Jim, C. Giles, and B. Horne. An analysis of noise in recurrent neural networks: convergence and generalization. IEEE Transactions on Neural Networks, 7(6):1424–1438, 1996.
[19] S. Johansson, R. Atwell, R. Garside, and G. Leech. The tagged LOB corpus: user’s manual. Norwegian Computing Centre for the Humanities, 1986.
[20] B. Knoll and N. de Freitas. A machine learning perspective on predictive coding with PAQ. CoRR, abs/1108.3298, 2011.
[21] M. Liwicki and H. Bunke. IAM-OnDB - an on-line English sentence database acquired from handwritten text on a whiteboard. In Proc. 8th Int. Conf. on Document Analysis and Recognition, volume 2, pages 956–961, 2005.
[22] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[23] T. Mikolov. Statistical Language Models based on Neural Networks. PhD thesis, Brno University of Technology, 2012.
[24] T. Mikolov, I. Sutskever, A. Deoras, H. Le, S. Kombrink, and J. Cernocky. Subword language modeling with neural networks. Technical report, Unpublished Manuscript, 2012.
[25] A. Mnih and G. Hinton. A Scalable Hierarchical Distributed Language Model. In Advances in Neural Information Processing Systems, volume 21, 2008.
[26] A. Mnih and Y. W. Teh. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning, pages 1751–1758, 2012.
[27] T. N. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Proc. ICASSP, 2013.
[28] M. Schuster. Better generative models for sequential data problems: Bidirectional recurrent mixture density networks. In Advances in Neural Information Processing Systems, pages 589–595. The MIT Press, 1999.
[29] I. Sutskever, G. E. Hinton, and G. W. Taylor. The recurrent temporal restricted Boltzmann machine. In Advances in Neural Information Processing Systems, pages 1601–1608, 2008.
[30] I. Sutskever, J. Martens, and G. Hinton. Generating text with recurrent neural networks. In ICML, 2011.
[31] G. W. Taylor and G. E. Hinton. Factored conditional restricted Boltzmann machines for modeling motion style. In Proc. 26th Annual International Conference on Machine Learning, pages 1025–1032, 2009.
[32] T. Tieleman and G. Hinton. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude, 2012.
[33] R. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications, pages 433–486. 1995.
