
Paper: Translation and Interpretation of "Generating Sequences With Recurrent Neural Networks"


Contents

Generating Sequences With Recurrent Neural Networks

Abstract

1 Introduction

2 Prediction Network

2.1 Long Short-Term Memory

3 Text Prediction

3.1 Penn Treebank Experiments

3.2 Wikipedia Experiments

4 Handwriting Prediction

4.1 Mixture Density Outputs

4.2 Experiments

4.3 Samples

5 Handwriting Synthesis

5.1 Synthesis Network

5.2 Experiments

5.3 Unbiased Sampling

5.4 Biased Sampling

5.5 Primed Sampling

6 Conclusions and Future Work

Acknowledgements

References


Generating Sequences With Recurrent Neural Networks

Original paper: Generating Sequences With Recurrent Neural Networks

Author: Alex Graves, Department of Computer Science, University of Toronto (graves@cs.toronto.edu)

Abstract

This paper shows how Long Short-term Memory recurrent neural networks can be used to generate complex sequences with long-range structure, simply by predicting one data point at a time. The approach is demonstrated for text (where the data are discrete) and online handwriting (where the data are real-valued). It is then extended to handwriting synthesis by allowing the network to condition its predictions on a text sequence. The resulting system is able to generate highly realistic cursive handwriting in a wide variety of styles.


1 Introduction

Recurrent neural networks (RNNs) are a rich class of dynamic models that have been used to generate sequences in domains as diverse as music [6, 4], text [30] and motion capture data [29]. RNNs can be trained for sequence generation by processing real data sequences one step at a time and predicting what comes next. Assuming the predictions are probabilistic, novel sequences can be generated from a trained network by iteratively sampling from the network’s output distribution, then feeding in the sample as input at the next step. In other words, by making the network treat its inventions as if they were real, much like a person dreaming. Although the network itself is deterministic, the stochasticity injected by picking samples induces a distribution over sequences. This distribution is conditional, since the internal state of the network, and hence its predictive distribution, depends on the previous inputs.

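To make the generation loop concrete, here is a minimal sketch in PyTorch (not from the paper; the `CharRNN` module, its sizes, and the categorical output are illustrative assumptions) of sampling from a trained next-step prediction network and feeding each sample back in as the next input:

```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    """Hypothetical next-step prediction model: an LSTM with a softmax output."""
    def __init__(self, num_classes: int, hidden_size: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(num_classes, hidden_size, batch_first=True)
        self.proj = nn.Linear(hidden_size, num_classes)

    def forward(self, x, state=None):
        h, state = self.lstm(x, state)
        return self.proj(h), state  # logits parameterise Pr(x_{t+1} | y_t)

def generate(model: CharRNN, num_classes: int, steps: int) -> list:
    """Iteratively sample from the output distribution, then feed the sample
    back in at the next step, so the network treats its inventions as real."""
    x = torch.zeros(1, 1, num_classes)  # null start vector (all entries zero)
    state, outputs = None, []
    for _ in range(steps):
        logits, state = model(x, state)
        k = torch.distributions.Categorical(logits=logits[0, -1]).sample()
        outputs.append(k.item())
        x = torch.zeros(1, 1, num_classes)
        x[0, 0, k] = 1.0                # one-hot encode the sampled class
    return outputs
```

Although the model itself is deterministic, the `sample()` call is where the stochasticity enters, inducing a distribution over whole sequences.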

RNNs are ‘fuzzy’ in the sense that they do not use exact templates from the training data to make predictions, but rather—like other neural networks—use their internal representation to perform a high-dimensional interpolation between training examples. This distinguishes them from n-gram models and compression algorithms such as Prediction by Partial Matching [5], whose predictive distributions are determined by counting exact matches between the recent history and the training set. The result—which is immediately apparent from the samples in this paper—is that RNNs (unlike template-based algorithms) synthesise and reconstitute the training data in a complex way, and rarely generate the same thing twice. Furthermore, fuzzy predictions do not suffer from the curse of dimensionality, and are therefore much better at modelling real-valued or multivariate data than exact matches.


In principle a large enough RNN should be sufficient to generate sequences of arbitrary complexity. In practice however, standard RNNs are unable to store information about past inputs for very long [15]. As well as diminishing their ability to model long-range structure, this ‘amnesia’ makes them prone to instability when generating sequences. The problem (common to all conditional generative models) is that if the network’s predictions are only based on the last few inputs, and these inputs were themselves predicted by the network, it has little opportunity to recover from past mistakes. Having a longer memory has a stabilising effect, because even if the network cannot make sense of its recent history, it can look further back in the past to formulate its predictions. The problem of instability is especially acute with real-valued data, where it is easy for the predictions to stray from the manifold on which the training data lies. One remedy that has been proposed for conditional models is to inject noise into the predictions before feeding them back into the model [31], thereby increasing the model’s robustness to surprising inputs. However we believe that a better memory is a more profound and effective solution.


Long Short-term Memory (LSTM) [16] is an RNN architecture designed to be better at storing and accessing information than standard RNNs. LSTM has recently given state-of-the-art results in a variety of sequence processing tasks, including speech and handwriting recognition [10, 12]. The main goal of this paper is to demonstrate that LSTM can use its memory to generate complex, realistic sequences containing long-range structure.

Figure 1: Deep recurrent neural network prediction architecture. The circles represent network layers, the solid lines represent weighted connections and the dashed lines represent predictions.


Section 2 defines a ‘deep’ RNN composed of stacked LSTM layers, and explains how it can be trained for next-step prediction and hence sequence generation. Section 3 applies the prediction network to text from the Penn Treebank and Hutter Prize Wikipedia datasets. The network’s performance is competitive with state-of-the-art language models, and it works almost as well when predicting one character at a time as when predicting one word at a time. The highlight of the section is a generated sample of Wikipedia text, which showcases the network’s ability to model long-range dependencies. Section 4 demonstrates how the prediction network can be applied to real-valued data through the use of a mixture density output layer, and provides experimental results on the IAM Online Handwriting Database. It also presents generated handwriting samples proving the network’s ability to learn letters and short words direct from pen traces, and to model global features of handwriting style. Section 5 introduces an extension to the prediction network that allows it to condition its outputs on a short annotation sequence whose alignment with the predictions is unknown. This makes it suitable for handwriting synthesis, where a human user inputs a text and the algorithm generates a handwritten version of it. The synthesis network is trained on the IAM database, then used to generate cursive handwriting samples, some of which cannot be distinguished from real data by the naked eye. A method for biasing the samples towards higher probability (and greater legibility) is described, along with a technique for ‘priming’ the samples on real data and thereby mimicking a particular writer’s style. Finally, concluding remarks and directions for future work are given in Section 6.


2 Prediction Network

Fig. 1 illustrates the basic recurrent neural network prediction architecture used in this paper. An input vector sequence $x = (x_1, \dots, x_T)$ is passed through weighted connections to a stack of $N$ recurrently connected hidden layers to compute first the hidden vector sequences $h^n = (h^n_1, \dots, h^n_T)$ and then the output vector sequence $y = (y_1, \dots, y_T)$. Each output vector $y_t$ is used to parameterise a predictive distribution $\Pr(x_{t+1} \mid y_t)$ over the possible next inputs $x_{t+1}$. The first element $x_1$ of every input sequence is always a null vector whose entries are all zero; the network therefore emits a prediction for $x_2$, the first real input, with no prior information. The network is ‘deep’ in both space and time, in the sense that every piece of information passing either vertically or horizontally through the computation graph will be acted on by multiple successive weight matrices and nonlinearities.


Note the ‘skip connections’ from the inputs to all hidden layers, and from all hidden layers to the outputs. These make it easier to train deep networks, by reducing the number of processing steps between the bottom of the network and the top, and thereby mitigating the ‘vanishing gradient’ problem [1]. In the special case that $N = 1$ the architecture reduces to an ordinary, single layer next step prediction RNN.
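The skip-connection pattern can be illustrated with a short sketch (an assumption-laden simplification using standard PyTorch LSTM layers rather than the paper's exact cells): every hidden layer sees the raw input, and every hidden layer feeds the output directly.

```python
import torch
import torch.nn as nn

class DeepRNN(nn.Module):
    """Stack of N recurrent layers with input-to-all-hidden and
    all-hidden-to-output skip connections, as in Fig. 1 (sketch only)."""
    def __init__(self, input_size, hidden_size, output_size, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.LSTM(input_size + (hidden_size if n > 0 else 0),
                    hidden_size, batch_first=True)
            for n in range(num_layers)
        ])
        # every hidden layer connects directly to the output
        self.out = nn.Linear(hidden_size * num_layers, output_size)

    def forward(self, x):
        hiddens, below = [], None
        for layer in self.layers:
            inp = x if below is None else torch.cat([x, below], dim=-1)
            below, _ = layer(inp)       # layer n also receives layer n-1
            hiddens.append(below)
        return self.out(torch.cat(hiddens, dim=-1))
```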

The hidden layer activations are computed by iterating the following equations from $t = 1$ to $T$ and from $n = 2$ to $N$:

$$h^1_t = \mathcal{H}\left(W_{ih^1} x_t + W_{h^1h^1} h^1_{t-1} + b^1_h\right)$$
$$h^n_t = \mathcal{H}\left(W_{ih^n} x_t + W_{h^{n-1}h^n} h^{n-1}_t + W_{h^nh^n} h^n_{t-1} + b^n_h\right)$$

where the $W$ terms denote weight matrices (e.g. $W_{ih^n}$ is the weight matrix connecting the inputs to the $n$-th hidden layer, $W_{h^1h^1}$ is the recurrent connection at the first hidden layer, and so on), the $b$ terms denote bias vectors (e.g. $b_y$ is the output bias vector) and $\mathcal{H}$ is the hidden layer function.

Given the hidden sequences, the output sequence is computed as follows:

$$\hat{y}_t = b_y + \sum_{n=1}^{N} W_{h^n y}\, h^n_t, \qquad y_t = \mathcal{Y}(\hat{y}_t)$$

where $\mathcal{Y}$ is the output layer function. The complete network therefore defines a function, parameterised by the weight matrices, from input histories $x_{1:t}$ to output vectors $y_t$.

The output vectors $y_t$ are used to parameterise the predictive distribution $\Pr(x_{t+1} \mid y_t)$ for the next input. The form of $\Pr(x_{t+1} \mid y_t)$ must be chosen carefully to match the input data. In particular, finding a good predictive distribution for high-dimensional, real-valued data (usually referred to as density modelling) can be very challenging.

The probability given by the network to the input sequence $x$ is

$$\Pr(x) = \prod_{t=1}^{T} \Pr(x_{t+1} \mid y_t)$$

and the sequence loss used to train the network is its negative logarithm:

$$\mathcal{L}(x) = -\sum_{t=1}^{T} \log \Pr(x_{t+1} \mid y_t)$$

The partial derivatives of the loss with respect to the network weights can be efficiently calculated with backpropagation through time [33] applied to the computation graph shown in Fig. 1, and the network can then be trained with gradient descent.

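A sketch of the corresponding training step (hypothetical names, reusing the `CharRNN` sketch above): the loss is the negative log-probability of the sequence, and `loss.backward()` performs backpropagation through time over the unrolled computation graph.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, inputs, targets) -> float:
    """inputs:  (batch, T, K) one-hot vectors x_1..x_T (x_1 is the null vector)
    targets: (batch, T) class indices of the next inputs x_2..x_{T+1}"""
    logits, _ = model(inputs)                       # outputs y_1..y_T
    # L(x) = -sum_t log Pr(x_{t+1} | y_t)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), reduction="sum")
    optimizer.zero_grad()
    loss.backward()                                 # backpropagation through time
    optimizer.step()
    return loss.item()
```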

2.1 Long Short-Term Memory

Figure 2: Long Short-term Memory Cell

In most RNNs the hidden layer function $\mathcal{H}$ is an elementwise application of a sigmoid function. However we have found that the Long Short-Term Memory (LSTM) architecture [16], which uses purpose-built memory cells to store information, is better at finding and exploiting long range dependencies in the data. Fig. 2 illustrates a single LSTM memory cell. For the version of LSTM used in this paper [7], $\mathcal{H}$ is implemented by the following composite function:

$$i_t = \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1}\right)$$
$$f_t = \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1}\right)$$
$$c_t = f_t\, c_{t-1} + i_t \tanh\left(W_{xc} x_t + W_{hc} h_{t-1}\right)$$
$$o_t = \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t\right)$$
$$h_t = o_t \tanh(c_t)$$

where $\sigma$ is the logistic sigmoid function, and $i$, $f$, $o$ and $c$ are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the hidden vector $h$. The weight matrix subscripts have the obvious meaning, for example $W_{hi}$ is the hidden-input gate matrix, $W_{xo}$ is the input-output gate matrix etc. The weight matrices from the cell to gate vectors (e.g. $W_{ci}$) are diagonal, so element $m$ in each gate vector only receives input from element $m$ of the cell vector. The bias terms (which are added to $i$, $f$, $c$ and $o$) have been omitted for clarity.
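The composite function transcribes almost line for line into code. The sketch below (numpy; the weight layout is an illustrative assumption) treats the diagonal cell-to-gate matrices as vectors acting elementwise, and includes the bias terms that the equations above omit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, W, b):
    """One step of the LSTM composite function (sketch).
    W["ci"], W["cf"], W["co"] are the diagonal cell-to-gate weights, stored as
    vectors so that they act elementwise on the cell state."""
    i = sigmoid(W["xi"] @ x + W["hi"] @ h_prev + W["ci"] * c_prev + b["i"])
    f = sigmoid(W["xf"] @ x + W["hf"] @ h_prev + W["cf"] * c_prev + b["f"])
    c = f * c_prev + i * np.tanh(W["xc"] @ x + W["hc"] @ h_prev + b["c"])
    o = sigmoid(W["xo"] @ x + W["ho"] @ h_prev + W["co"] * c + b["o"])
    h = o * np.tanh(c)
    return h, c
```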


The original LSTM algorithm used a custom designed approximate gradient calculation that allowed the weights to be updated after every timestep [16]. However the full gradient can instead be calculated with backpropagation through time [11], the method used in this paper. One difficulty when training LSTM with the full gradient is that the derivatives sometimes become excessively large, leading to numerical problems. To prevent this, all the experiments in this paper clipped the derivative of the loss with respect to the network inputs to the LSTM layers (before the sigmoid and tanh functions are applied) to lie within a predefined range.


3 Text Prediction

Text data is discrete, and is typically presented to neural networks using ‘one-hot’ input vectors. That is, if there are $K$ text classes in total, and class $k$ is fed in at time $t$, then $x_t$ is a length $K$ vector whose entries are all zero except for the $k$-th, which is one. $\Pr(x_{t+1} \mid y_t)$ is therefore a multinomial distribution, which can be naturally parameterised by a softmax function at the output layer:

$$\Pr(x_{t+1} = k \mid y_t) = y_t^k = \frac{\exp\left(\hat{y}_t^k\right)}{\sum_{k'=1}^{K} \exp\left(\hat{y}_t^{k'}\right)}$$

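Concretely (an illustrative sketch; the vocabulary size and logits are placeholders), the one-hot encoding and softmax parameterisation look like this:

```python
import numpy as np

K = 49                       # e.g. the character vocabulary size used later
k = 7                        # class fed in at time t
x_t = np.zeros(K)
x_t[k] = 1.0                 # one-hot input vector

y_hat = np.random.randn(K)   # stand-in for the raw network outputs
probs = np.exp(y_hat - y_hat.max())
probs /= probs.sum()         # softmax: a valid multinomial Pr(x_{t+1} | y_t)
next_k = np.random.choice(K, p=probs)
```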

The only thing that remains to be decided is which set of classes to use. In most cases, text prediction (usually referred to as language modelling) is performed at the word level. $K$ is therefore the number of words in the dictionary. This can be problematic for realistic tasks, where the number of words (including variant conjugations, proper names, etc.) often exceeds 100,000. As well as requiring many parameters to model, having so many classes demands a huge amount of training data to adequately cover the possible contexts for the words. In the case of softmax models, a further difficulty is the high computational cost of evaluating all the exponentials during training (although several methods have been devised to make training large softmax layers more efficient, including tree-based models [25, 23], low rank approximations [27] and stochastic derivatives [26]). Furthermore, word-level models are not applicable to text data containing non-word strings, such as multi-digit numbers or web addresses.


Character-level language modelling with neural networks has recently been considered [30, 24], and found to give slightly worse performance than equivalent word-level models. Nonetheless, predicting one character at a time is more interesting from the perspective of sequence generation, because it allows the network to invent novel words and strings. In general, the experiments in this paper aim to predict at the finest granularity found in the data, so as to maximise the generative flexibility of the network.


3.1 Penn Treebank Experiments

The first set of text prediction experiments focused on the Penn Treebank portion of the Wall Street Journal corpus [22]. This was a preliminary study whose main purpose was to gauge the predictive power of the network, rather than to generate interesting sequences.


Although a relatively small text corpus (a little over a million words in total), the Penn Treebank data is widely used as a language modelling benchmark. The training set contains 930,000 words, the validation set contains 74,000 words and the test set contains 82,000 words. The vocabulary is limited to 10,000 words, with all other words mapped to a special ‘unknown word’ token. The end-of-sentence token was included in the input sequences, and was counted in the sequence loss. The start-of-sentence marker was ignored, because its role is already fulfilled by the null vectors that begin the sequences (c.f. Section 2).

The experiments compared the performance of word and character-level LSTM predictors on the Penn corpus. In both cases, the network architecture was a single hidden layer with 1000 LSTM units. For the character-level network the input and output layers were size 49, giving approximately 4.3M weights in total, while the word-level network had 10,000 inputs and outputs and around 54M weights. The comparison is therefore somewhat unfair, as the word-level network had many more parameters. However, as the dataset is small, both networks were easily able to overfit the training data, and it is not clear whether the character-level network would have benefited from more weights. All networks were trained with stochastic gradient descent, using a learn rate of 0.0001 and a momentum of 0.99. The LSTM derivatives were clipped in the range [-1, 1] (c.f. Section 2.1).


Neural networks are usually evaluated on test data with fixed weights. For prediction problems however, where the inputs are the targets, it is legitimate to allow the network to adapt its weights as it is being evaluated (so long as it only sees the test data once). Mikolov refers to this as dynamic evaluation. Dynamic evaluation allows for a fairer comparison with compression algorithms, for which there is no division between training and test sets, as all data is only predicted once.

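A sketch of dynamic evaluation under these assumptions (hypothetical `model` and data iterator; the key point is that each segment is scored before the weights are updated on it, so every data point is still predicted exactly once):

```python
import torch
import torch.nn.functional as F

def dynamic_evaluation(model, test_batches, lr=1e-4) -> float:
    """Evaluate while adapting: score each segment, then train on it."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    total_loss, total_targets = 0.0, 0
    for inputs, targets in test_batches:
        logits, _ = model(inputs)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1), reduction="sum")
        total_loss += loss.item()
        total_targets += targets.numel()
        optimizer.zero_grad()
        loss.backward()              # adapt the weights to the data just seen
        optimizer.step()
    return total_loss / total_targets    # average nats per prediction
```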

Table 1: Penn Treebank Test Set Results. ‘BPC’ is bits-per-character. ‘Error’ is next-step classification error rate, for either characters or words.

Since both networks overfit the training data, we also experiment with two types of regularisation: weight noise [18] with a std. deviation of 0.075 applied to the network weights at the start of each training sequence, and adaptive weight noise [8], where the variance of the noise is learned along with the weights using a Minimum Description Length (or equivalently, variational inference) loss function. When weight noise was used, the network was initialised with the final weights of the unregularised network. Similarly, when adaptive weight noise was used, the weights were initialised with those of the network trained with weight noise. We have found that retraining with iteratively increased regularisation is considerably faster than training from random weights with regularisation. Adaptive weight noise was found to be prohibitively slow for the word-level network, so it was regularised with fixed-variance weight noise only. One advantage of adaptive weight noise is that early stopping is not needed (the network can safely be stopped at the point of minimum total ‘description length’ on the training data). However, to keep the comparison fair, the same training, validation and test sets were used for all experiments.


The results are presented with two equivalent metrics: bits-per-character (BPC), which is the average value of $-\log_2 \Pr(x_{t+1} \mid y_t)$ over the whole test set; and perplexity, which is two to the power of the average number of bits per word (the average word length on the test set is about 5.6 characters, so perplexity $\approx 2^{5.6\,\mathrm{BPC}}$). Perplexity is the usual performance measure for language modelling.

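For example (illustrative numbers, not results from Table 1), converting between the two metrics is a one-liner:

```python
bpc = 1.26                    # hypothetical bits-per-character score
chars_per_word = 5.6          # average word length on the test set
perplexity = 2 ** (bpc * chars_per_word)
print(round(perplexity, 1))   # ~133.1: two to the power of bits-per-word
```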

Table 1 shows that the word-level RNN performed better than the character-level network, but the gap appeared to close when regularisation was used. Overall the results compare favourably with those collected in Tomas Mikolov’s thesis [23]. For example, he records a perplexity of 141 for a 5-gram with Kneser-Ney smoothing, 141.8 for a word level feedforward neural network, 131.1 for the state-of-the-art compression algorithm PAQ8 and 123.2 for a dynamically evaluated word-level RNN. However by combining multiple RNNs, a 5-gram and a cache model in an ensemble, he was able to achieve a perplexity of 89.4. Interestingly, the benefit of dynamic evaluation was far more pronounced here than in Mikolov’s thesis (he records a perplexity improvement from 124.7 to 123.2 with word-level RNNs). This suggests that LSTM is better at rapidly adapting to new data than ordinary RNNs.


3.2 Wikipedia Experiments

In 2006 Marcus Hutter, Jim Bowery and Matt Mahoney organised the following challenge, commonly known as Hutter prize [17]: to compress the first 100 million bytes of the complete English Wikipedia data (as it was at a certain time on March 3rd 2006) to as small a file as possible. The file had to include not only the compressed data, but also the code implementing the compression algorithm. Its size can therefore be considered a measure of the minimum description length [13] of the data using a two part coding scheme.


Wikipedia data is interesting from a sequence generation perspective because it contains not only a huge range of dictionary words, but also many character sequences that would not be included in text corpora traditionally used for language modelling. For example foreign words (including letters from non-Latin alphabets such as Arabic and Chinese), indented XML tags used to define meta-data, website addresses, and markup used to indicate page formatting such as headings, bullet points etc. An extract from the Hutter prize dataset is shown in Figs. 3 and 4.


The first 96M bytes in the data were evenly split into sequences of 100 bytes and used to train the network, with the remaining 4M used for validation. The data contains a total of 205 one-byte unicode symbols. The total number of characters is much higher, since many characters (especially those from non-Latin languages) are defined as multi-symbol sequences. In keeping with the principle of modelling the smallest meaningful units in the data, the network predicted a single byte at a time, and therefore had size 205 input and output layers.
Wikipedia contains long-range regularities, such as the topic of an article, which can span many thousand words. To make it possible for the network to capture these, its internal state (that is, the output activations ht of the hidden layers, and the activations ct of the LSTM cells within the layers) were only reset every 100 sequences. Furthermore the order of the sequences was not shuffled during training, as it usually is for neural networks. The network was therefore able to access information from up to 10K characters in the past when making predictions. The error terms were only backpropagated to the start of each 100 byte sequence, meaning that the gradient calculation was approximate. This form of truncated backpropagation has been considered before for RNN language modelling [23], and found to speed up training (by reducing the sequence length and hence increasing the frequency of stochastic weight updates) without affecting the network’s ability to learn long-range dependencies.

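A sketch of this training regime (hypothetical model exposing its recurrent state; the 100-byte sequences and 100-sequence reset interval follow the text): gradients are truncated at every sequence boundary, but the state itself persists.

```python
def train_wikipedia(model, optimizer, loss_fn, batches):
    """Truncated backpropagation through time with a persistent state.
    `batches` yields contiguous 100-byte (inputs, targets) chunks in order."""
    state = None
    for i, (inputs, targets) in enumerate(batches):
        if i % 100 == 0:
            state = None                              # reset every 100 sequences
        if state is not None:
            state = tuple(s.detach() for s in state)  # truncate the gradient
        logits, state = model(inputs, state)
        loss = loss_fn(logits, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```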

A much larger network was used for this data than the Penn data (reflecting the greater size and complexity of the training set) with seven hidden layers of 700 LSTM cells, giving approximately 21.3M weights. The network was trained with stochastic gradient descent, using a learn rate of 0.0001 and a momentum of 0.9. It took four training epochs to converge. The LSTM derivatives were clipped in the range [-1, 1].


As with the Penn data, we tested the network on the validation data with and without dynamic evaluation (where the weights are updated as the data is predicted). As can be seen from Table 2 performance was much better with dynamic evaluation. This is probably because of the long range coherence of Wikipedia data; for example, certain words are much more frequent in some articles than others, and being able to adapt to this during evaluation is advantageous. It may seem surprising that the dynamic results on the validation set were substantially better than on the training set. However this is easily explained by two factors: firstly, the network underfit the training data, and secondly some portions of the data are much more difficult than others (for example, plain text is harder to predict than XML tags).


To put the results in context, the current winner of the Hutter Prize (a variant of the PAQ-8 compression algorithm [20]) achieves 1.28 BPC on the same data (including the code required to implement the algorithm), mainstream compressors such as zip generally get more than 2, and a character level RNN applied to a text-only version of the data (i.e. with all the XML, markup tags etc. removed) achieved 1.54 on held-out data, which improved to 1.47 when the RNN was combined with a maximum entropy model [24].


A four page sample generated by the prediction network is shown in Figs. 5 to 8. The sample shows that the network has learned a lot of structure from the data, at a wide range of different scales. Most obviously, it has learned a large vocabulary of dictionary words, along with a subword model that enables it to invent feasible-looking words and names: for example “Lochroom River”, “Mughal Ralvaldens”, “submandration”, “swalloped”. It has also learned basic punctuation, with commas, full stops and paragraph breaks occurring at roughly the right rhythm in the text blocks.

Being able to correctly open and close quotation marks and parentheses is a clear indicator of a language model’s memory, because the closure cannot be predicted from the intervening text, and hence cannot be modelled with short-range context [30]. The sample shows that the network is able to balance not only parentheses and quotes, but also formatting marks such as the equals signs used to denote headings, and even nested XML tags and indentation.


The network generates non-Latin characters such as Cyrillic, Chinese and Arabic, and seems to have learned a rudimentary model for languages other than English (e.g. it generates “es:Geotnia slago” for the Spanish ‘version’ of an article, and “nl:Rodenbaueri” for the Dutch one). It also generates convincing-looking internet addresses (none of which appear to be real).

The network generates distinct, large-scale regions, such as XML headers, bullet-point lists and article text. Comparison with Figs. 3 and 4 suggests that these regions are a fairly accurate reflection of the constitution of the real data (although the generated versions tend to be somewhat shorter and more jumbled together). This is significant because each region may span hundreds or even thousands of timesteps. The fact that the network is able to remain coherent over such large intervals (even putting the regions in an approximately correct order, such as having headers at the start of articles and bullet-pointed ‘see also’ lists at the end) is testament to its long-range memory.


As with all text generated by language models, the sample does not make sense beyond the level of short phrases. The realism could perhaps be improved with a larger network and/or more data. However, it seems futile to expect meaningful language from a machine that has never been exposed to the sensory world to which language refers.

Lastly, the network’s adaptation to recent sequences during training (which allows it to benefit from dynamic evaluation) can be clearly observed in the extract. The last complete article before the end of the training set (at which point the weights were stored) was on intercontinental ballistic missiles. The influence of this article on the network’s language model can be seen from the profusion of missile-related terms. Other recent topics include ‘Individual Anarchism’, the Italian writer Italo Calvino and the International Organization for Standardization (ISO), all of which make themselves felt in the network’s vocabulary.


4 Handwriting Prediction

To test whether the prediction network could also be used to generate convincing real-valued sequences, we applied it to online handwriting data (online in this context means that the writing is recorded as a sequence of pen-tip locations, as opposed to offline handwriting, where only the page images are available). Online handwriting is an attractive choice for sequence generation due to its low dimensionality (two real numbers per data point) and ease of visualisation.


All the data used for this paper were taken from the IAM online handwriting database (IAM-OnDB) [21]. IAM-OnDB consists of handwritten lines collected from 221 different writers using a ‘smart whiteboard’. The writers were asked to write forms from the Lancaster-Oslo-Bergen text corpus [19], and the position of their pen was tracked using an infra-red device in the corner of the board. Samples from the training data are shown in Fig. 9. The original input data consists of the x and y pen co-ordinates and the points in the sequence when the pen is lifted off the whiteboard. Recording errors in the x, y data were corrected by interpolating to fill in for missing readings, and removing steps whose length exceeded a certain threshold. Beyond that, no preprocessing was used and the network was trained to predict the x, y co-ordinates and the end-of-stroke markers one point at a time. This contrasts with most approaches to handwriting recognition and synthesis, which rely on sophisticated preprocessing and feature-extraction techniques. We eschewed such techniques because they tend to reduce the variation in the data (e.g. by normalising the character size, slant, skew and so-on) which we wanted the network to model. Predicting the pen traces one point at a time gives the network maximum flexibility to invent novel handwriting, but also requires a lot of memory, with the average letter occupying more than 25 timesteps and the average line occupying around 700. Predicting delayed strokes (such as dots for ‘i’s or crosses for ‘t’s that are added after the rest of the word has been written) is especially demanding.


IAM-OnDB is divided into a training set, two validation sets and a test set, containing respectively 5364, 1438, 1518 and 3859 handwritten lines taken from 775, 192, 216 and 544 forms. For our experiments, each line was treated as a separate sequence (meaning that possible dependencies between successive lines were ignored). In order to maximise the amount of training data, we used the training set, test set and the larger of the validation sets for training and the smaller validation set for early-stopping.


Figure 9: Training samples from the IAM online handwriting database. Notice the wide range of writing styles, the variation in line angle and character sizes, and the writing and recording errors, such as the scribbled out letters in the first line and the repeated word in the final line.


The lack of an independent test set means that the recorded results may be somewhat overfit on the validation set; however the validation results are of secondary importance, since no benchmark results exist and the main goal was to generate convincing-looking handwriting. The principal challenge in applying the prediction network to online handwriting data was determining a predictive distribution suitable for real-valued inputs. The following section describes how this was done.


4.1 Mixture Density Outputs

The idea of mixture density networks [2, 3] is to use the outputs of a neural network to parameterise a mixture distribution. A subset of the outputs are used to define the mixture weights, while the remaining outputs are used to parameterise the individual mixture components. The mixture weight outputs are normalised with a softmax function to ensure they form a valid discrete distribution, and the other outputs are passed through suitable functions to keep their values within meaningful range (for example the exponential function is typically applied to outputs used as scale parameters, which must be positive).


Mixture density networks are trained by maximising the log probability density of the targets under the induced distributions. Note that the densities are normalised (up to a fixed constant) and are therefore straightforward to differentiate and pick unbiased samples from, in contrast with restricted Boltzmann machines [14] and other undirected models. Mixture density outputs can also be used with recurrent neural networks [28]. In this case the output distribution is conditioned not only on the current input, but on the history of previous inputs. Intuitively, the number of components is the number of choices the network has for the next output given the inputs so far.

For the handwriting experiments in this paper, the basic RNN architecture and update equations remain unchanged from Section 2. Each input vector $x_t$ consists of a real-valued pair $x_1, x_2$ that defines the pen offset from the previous input, along with a binary $x_3$ that has value 1 if the vector ends a stroke (that is, if the pen was lifted off the board before the next vector was recorded) and value 0 otherwise. A mixture of bivariate Gaussians was used to predict $x_1$ and $x_2$, while a Bernoulli distribution was used for $x_3$. Each output vector $y_t$ therefore consists of the end of stroke probability $e$, along with a set of means $\mu^j$, standard deviations $\sigma^j$, correlations $\rho^j$ and mixture weights $\pi^j$ for the $M$ mixture components. That is,

$$y_t = \left(e_t,\ \{\pi^j_t, \mu^j_t, \sigma^j_t, \rho^j_t\}_{j=1}^{M}\right)$$

where the raw network outputs are squashed so that each parameter stays in a valid range:

$$e_t = \frac{1}{1 + \exp(\hat{e}_t)}, \qquad \pi^j_t = \frac{\exp(\hat{\pi}^j_t)}{\sum_{j'=1}^{M} \exp(\hat{\pi}^{j'}_t)}, \qquad \mu^j_t = \hat{\mu}^j_t, \qquad \sigma^j_t = \exp(\hat{\sigma}^j_t), \qquad \rho^j_t = \tanh(\hat{\rho}^j_t)$$

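Under the assumption that the normalisations reconstructed above are applied to a flat output vector, extracting the mixture parameters might look like this (a sketch; the split layout follows the 121-output example of Section 4.2 with $M = 20$):

```python
import torch

def mixture_params(y_hat: torch.Tensor, M: int = 20):
    """Split the raw output vector (size 6M + 1) into mixture parameters,
    squashing each through the function that keeps it in a valid range."""
    e_hat, pi_hat, mu, sigma_hat, rho_hat = torch.split(
        y_hat, [1, M, 2 * M, 2 * M, M], dim=-1)
    e = torch.sigmoid(-e_hat)           # end-of-stroke prob: 1 / (1 + exp(ê))
    pi = torch.softmax(pi_hat, dim=-1)  # mixture weights sum to one
    sigma = torch.exp(sigma_hat)        # standard deviations must be positive
    rho = torch.tanh(rho_hat)           # correlations lie in (-1, 1)
    return e, pi, mu, sigma, rho
```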

This can be substituted into Eq. (6) to determine the sequence loss (up to a constant that depends only on the quantisation of the data and does not influence network training). The predictive distribution combines the Gaussian mixture over the pen offsets with the Bernoulli end-of-stroke term:

$$\Pr(x_{t+1} \mid y_t) = \sum_{j=1}^{M} \pi^j_t\, \mathcal{N}\!\left(x_{t+1} \mid \mu^j_t, \sigma^j_t, \rho^j_t\right) \times \begin{cases} e_t & \text{if } (x_{t+1})_3 = 1 \\ 1 - e_t & \text{otherwise} \end{cases}$$


Figure 10: Mixture density outputs for handwriting prediction. The top heatmap shows the sequence of probability distributions for the predicted pen locations as the word ‘under’ is written. The densities for successive predictions are added together, giving high values where the distributions overlap.


Two types of prediction are visible from the density map: the small blobs that spell out the letters are the predictions as the strokes are being written, the three large blobs are the predictions at the ends of the strokes for the first point in the next stroke. The end-of-stroke predictions have much higher variance because the pen position was not recorded when it was off the whiteboard, and hence there may be a large distance between the end of one stroke and the start of the next.

The bottom heatmap shows the mixture component weights during the same sequence. The stroke ends are also visible here, with the most active components switching off in three places, and other components switching on: evidently end-of-stroke predictions use a different set of mixture components from in-stroke predictions.


4.2 Experiments

Each point in the data sequences consisted of three numbers: the x and y offset from the previous point, and the binary end-of-stroke feature. The network input layer was therefore size 3. The co-ordinate offsets were normalised to mean 0, std. dev. 1 over the training set. 20 mixture components were used to model the offsets, giving a total of 120 mixture parameters per timestep (20 weights, 40 means, 40 standard deviations and 20 correlations). A further parameter was used to model the end-of-stroke probability, giving an output layer of size 121. Two network architectures were compared for the hidden layers: one with three hidden layers, each consisting of 400 LSTM cells, and one with a single hidden layer of 900 LSTM cells. Both networks had around 3.4M weights. The three layer network was retrained with adaptive weight noise [8], with all std. devs. initialised to 0.075. Training with fixed variance weight noise proved ineffective, probably because it prevented the mixture density layer from using precisely specified weights.


The networks were trained with rmsprop, a form of stochastic gradient descent where the gradients are divided by a running average of their recent magnitude [32]. Define $\epsilon_i = \frac{\partial \mathcal{L}(x)}{\partial w_i}$, where $w_i$ is network weight $i$. The weight update equations were:

$$n_i = 0.95\, n_i + 0.05\, \epsilon_i^2$$
$$g_i = 0.95\, g_i + 0.05\, \epsilon_i$$
$$\Delta_i = 0.9\, \Delta_i - 0.0001\, \frac{\epsilon_i}{\sqrt{n_i - g_i^2 + 10^{-4}}}$$
$$w_i = w_i + \Delta_i$$

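The update rule transcribes directly (a sketch in numpy, with the constants taken from the reconstruction above):

```python
import numpy as np

def rmsprop_update(w, grad, n, g, delta,
                   avg=0.95, momentum=0.9, lr=1e-4, eps=1e-4):
    """One rmsprop step: the gradient is divided by a running estimate of
    its recent standard deviation, then applied with momentum."""
    n = avg * n + (1 - avg) * grad ** 2        # running mean of squared grads
    g = avg * g + (1 - avg) * grad             # running mean of grads
    delta = momentum * delta - lr * grad / np.sqrt(n - g ** 2 + eps)
    w = w + delta
    return w, n, g, delta
```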

The output derivatives $\frac{\partial \mathcal{L}(x)}{\partial \hat{y}_t}$ were clipped in the range [-100, 100], and the LSTM derivatives were clipped in the range [-10, 10]. Clipping the output gradients proved vital for numerical stability; even so, the networks sometimes had numerical problems late on in training, after they had started overfitting on the training data.

Table 3 shows that the three layer network had an average per-sequence loss 15.3 nats lower than the one layer net. However the sum-squared error was slightly lower for the single layer network. The use of adaptive weight noise reduced the loss by another 16.7 nats relative to the unregularised three layer network, but did not significantly change the sum-squared error. The adaptive weight noise network appeared to generate the best samples.


4.3 Samples

Fig. 11 shows handwriting samples generated by the prediction network. The network has clearly learned to model strokes, letters and even short words (especially common ones such as ‘of’ and ‘the’). It also appears to have learned a basic character-level language model, since the words it invents (‘eald’, ‘bryoes’, ‘lenrest’) look somewhat plausible in English. Given that the average character occupies more than 25 timesteps, this again demonstrates the network’s ability to generate coherent long-range structures.


5 Handwriting Synthesis

Handwriting synthesis is the generation of handwriting for a given text. Clearly the prediction networks we have described so far are unable to do this, since there is no way to constrain which letters the network writes. This section describes an augmentation that allows a prediction network to generate data sequences conditioned on some high-level annotation sequence (a character string, in the case of handwriting synthesis). The resulting sequences are sufficiently convincing that they often cannot be distinguished from real handwriting. Furthermore, this realism is achieved without sacrificing the diversity in writing style demonstrated in the previous section.


The main challenge in conditioning the predictions on the text is that the two sequences are of very different lengths (the pen trace being on average twenty five times as long as the text), and the alignment between them is unknown until the data is generated. This is because the number of co-ordinates used to write each character varies greatly according to style, size, pen speed etc. One neural network model able to make sequential predictions based on two sequences of different length and unknown alignment is the RNN transducer [9]. However preliminary experiments on handwriting synthesis with RNN transducers were not encouraging. A possible explanation is that the transducer uses two separate RNNs to process the two sequences, then combines their outputs to make decisions, when it is usually more desirable to make all the information available to a single network. This work proposes an alternative model, where a ‘soft window’ is convolved with the text string and fed in as an extra input to the prediction network. The parameters of the window are output by the network at the same time as it makes the predictions, so that it dynamically determines an alignment between the text and the pen locations. Put simply, it learns to decide which character to write next.


5.1 Synthesis Network

Fig. 12 illustrates the network architecture used for handwriting synthesis. As with the prediction network, the hidden layers are stacked on top of each other, each feeding up to the layer above, and there are skip connections from the inputs to all hidden layers and from all hidden layers to the outputs. The difference is the added input from the character sequence, mediated by the window layer.

Given a length $U$ character sequence $c$ and a length $T$ data sequence $x$, the soft window $w_t$ into $c$ at timestep $t$ ($1 \le t \le T$) is defined by the following discrete convolution with a mixture of $K$ Gaussian functions:

$$\phi(t, u) = \sum_{k=1}^{K} \alpha^k_t \exp\left(-\beta^k_t \left(\kappa^k_t - u\right)^2\right)$$
$$w_t = \sum_{u=1}^{U} \phi(t, u)\, c_u$$

where the window parameters are emitted by the first hidden layer and squashed to valid ranges:

$$(\hat{\alpha}_t, \hat{\beta}_t, \hat{\kappa}_t) = W_{h^1 p}\, h^1_t + b_p, \qquad \alpha_t = \exp(\hat{\alpha}_t), \qquad \beta_t = \exp(\hat{\beta}_t), \qquad \kappa_t = \kappa_{t-1} + \exp(\hat{\kappa}_t)$$

Intuitively, $\kappa_t$ controls the location of the window, $\beta_t$ its width, and $\alpha_t$ the importance of the window within the mixture.


Note that the location parameters $\kappa_t$ are defined as offsets from the previous locations $\kappa_{t-1}$, and that the size of the offset is constrained to be greater than zero. Intuitively, this means that the network learns how far to slide each window at each step, rather than an absolute location. Using offsets was essential to getting the network to align the text with the pen trace.

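A sketch of the window computation (assuming `c` is a (U, V) matrix of one-hot character vectors, `h1` is the first hidden layer's activation, and `W_p`, `b_p` are the hypothetical parameters of the linear map emitting the window coefficients):

```python
import numpy as np

def soft_window(h1, c, kappa_prev, W_p, b_p, K=10):
    """Gaussian attention window over the character sequence (sketch)."""
    p = W_p @ h1 + b_p                        # size-3K window parameter vector
    alpha = np.exp(p[:K])                     # importance of each component
    beta = np.exp(p[K:2 * K])                 # (inverse) width of each component
    kappa = kappa_prev + np.exp(p[2 * K:])    # location: positive offsets only
    U = c.shape[0]
    u = np.arange(1, U + 1)                   # character positions 1..U
    phi = (alpha[:, None]
           * np.exp(-beta[:, None] * (kappa[:, None] - u[None, :]) ** 2)).sum(0)
    w_t = phi @ c                             # soft window w_t into the text
    return w_t, kappa, phi
```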

The $w_t$ vectors are passed to the second and third hidden layers at time $t$, and the first hidden layer at time $t+1$ (to avoid creating a cycle in the processing graph). The update equations for the hidden layers therefore become:

$$h^1_t = \mathcal{H}\left(W_{ih^1} x_t + W_{h^1h^1} h^1_{t-1} + W_{wh^1} w_{t-1} + b^1_h\right)$$
$$h^n_t = \mathcal{H}\left(W_{ih^n} x_t + W_{h^{n-1}h^n} h^{n-1}_t + W_{h^nh^n} h^n_{t-1} + W_{wh^n} w_t + b^n_h\right)$$


5.2 Experiments

The synthesis network was applied to the same input data as the handwriting prediction network in the previous section. The character-level transcriptions from the IAM-OnDB were now used to define the character sequences $c$. The full transcriptions contain 80 distinct characters (capital letters, lower case letters, digits, and punctuation). However we used only a subset of 57, with all digits and most of the punctuation characters replaced with a generic ‘non-letter’ label.


The network architecture was as similar as possible to the best prediction network: three hidden layers of 400 LSTM cells each, 20 bivariate Gaussian mixture components at the output layer and a size 3 input layer. The character sequence was encoded with one-hot vectors, and hence the window vectors were size 57. A mixture of 10 Gaussian functions was used for the window parameters, requiring a size 30 parameter vector. The total number of weights was increased to approximately 3.7M.


The network was trained with rmsprop, using the same parameters as in the previous section. The network was retrained with adaptive weight noise, initial standard deviation 0.075, and the output and LSTM gradients were again clipped in the range [-100, 100] and [-10, 10] respectively.


Table 4 shows that adaptive weight noise gave a considerable improvement in log-loss (around 31.3 nats) but no significant change in sum-squared error. The regularised network appears to generate slightly more realistic sequences, although the difference is hard to discern by eye. Both networks performed considerably better than the best prediction network. In particular the sum-squared error was reduced by 44%. This is likely due in large part to the improved predictions at the ends of strokes, where the error is largest.


5.3 Unbiased Sampling

Given $c$, an unbiased sample can be picked from $\Pr(x \mid c)$ by iteratively drawing $x_{t+1}$ from $\Pr(x_{t+1} \mid y_t)$, just as for the prediction network. The only difference is that we must also decide when the synthesis network has finished writing the text and should stop making any future decisions. To do this, we use the following heuristic: as soon as $\phi(t, U+1) > \phi(t, u)\ \forall\, 1 \le u \le U$, the current input $x_t$ is defined as the end of the sequence and sampling ends. Examples of unbiased synthesis samples are shown in Fig. 15. These and all subsequent figures were generated using the synthesis network retrained with adaptive weight noise. Notice how stylistic traits, such as character size, slant, cursiveness etc. vary widely between the samples, but remain more-or-less consistent within them. This suggests that the network identifies the traits early on in the sequence, then remembers them until the end. By looking through enough samples for a given text, it appears to be possible to find virtually any combination of stylistic traits, which suggests that the network models them independently both from each other and from the text.

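The stopping heuristic translates directly (a sketch; `phi_with_end` is assumed to hold the window weights $\phi(t, u)$ evaluated at positions $1 \dots U+1$):

```python
import numpy as np

def is_finished(phi_with_end: np.ndarray) -> bool:
    """Stop sampling once the window has moved past the final character,
    i.e. phi(t, U+1) exceeds phi(t, u) for every u <= U."""
    return bool(np.all(phi_with_end[-1] > phi_with_end[:-1]))
```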

‘Blind taste tests’ carried out by the author during presentations suggest that at least some unbiased samples cannot be distinguished from real handwriting by the human eye. Nonetheless the network does make mistakes we would not expect a human writer to make, often involving missing, confused or garbled letters; this suggests that the network sometimes has trouble determining the alignment between the characters and the trace. The number of mistakes increases markedly when less common words or phrases are included in the character sequence. Presumably this is because the network learns an implicit character-level language model from the training set that gets confused when rare or unknown transitions occur.

5.4 Biased Sampling

One problem with unbiased samples is that they tend to be difficult to read (partly because real handwriting is difficult to read, and partly because the network is an imperfect model). Intuitively, we would expect the network to give higher probability to good handwriting because it tends to be smoother and more predictable than bad handwriting. If this is true, we should aim to output more probable elements of $\Pr(x \mid c)$ if we want the samples to be easier to read. A principled search for high probability samples could lead to a difficult inference problem, as the probability of every output depends on all previous outputs. However a simple heuristic, where the sampler is biased towards more probable predictions at each step independently, generally gives good results. Define the probability bias $b$ as a real number greater than or equal to zero. Before drawing a sample from $\Pr(x_{t+1} \mid y_t)$, each standard deviation $\sigma^j_t$ in the Gaussian mixture is recalculated from Eq. (21) to

$$\sigma^j_t = \exp\left(\hat{\sigma}^j_t - b\right)$$

and each mixture weight is recalculated from Eq. (19) to

$$\pi^j_t = \frac{\exp\left(\hat{\pi}^j_t (1 + b)\right)}{\sum_{j'=1}^{M} \exp\left(\hat{\pi}^{j'}_t (1 + b)\right)}$$


This artificially reduces the variance in both the choice of component from the mixture, and in the distribution of the component itself. When $b = 0$ unbiased sampling is recovered, and as $b \to \infty$ the variance in the sampling disappears and the network always outputs the mode of the most probable component in the mixture (which is not necessarily the mode of the mixture, but at least a reasonable approximation). Fig. 16 shows the effect of progressively increasing the bias, and Fig. 17 shows samples generated with a low bias for the same texts as Fig. 15.

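Applying the bias before each sampling step is a two-line change (a sketch following the recalculations above; `b = 0` recovers unbiased sampling):

```python
import torch

def biased_mixture(pi_hat, sigma_hat, b=1.0):
    """Reduce variance in both the component choice and the components
    themselves; larger b concentrates probability on likelier predictions."""
    pi = torch.softmax(pi_hat * (1.0 + b), dim=-1)  # sharpen mixture weights
    sigma = torch.exp(sigma_hat - b)                # shrink std. deviations
    return pi, sigma
```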

5.5 Primed Sampling

Another reason to constrain the sampling would be to generate handwriting in the style of a particular writer (rather than in a randomly selected style). The easiest way to do this would be to retrain it on that writer only. But even without retraining, it is possible to mimic a particular style by ‘priming’ the network with a real sequence, then generating an extension with the real sequence still in the network’s memory. This can be achieved for a real $x$, $c$ and a synthesis character string $s$ by setting the character sequence to $c' = c + s$ and clamping the data inputs to $x$ for the first $T$ timesteps, then sampling as usual until the sequence ends. Examples of primed samples are shown in Figs. 18 and 19. The fact that priming works proves that the network is able to remember stylistic features identified earlier on in the sequence. This technique appears to work better for sequences in the training data than those the network has never seen.


Primed sampling and reduced variance sampling can also be combined. As shown in Figs. 20 and 21 this tends to produce samples in a ‘cleaned up’ version of the priming style, with overall stylistic traits such as slant and cursiveness retained, but the strokes appearing smoother and more regular. A possible application would be the artificial enhancement of poor handwriting.
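In terms of the two sketches above, the combination is simply the priming loop driven by the biased sampler; for instance (again with the same illustrative names):

def primed_biased_sample(step, initial_state, x_real, c, s, n_extra, b, rng):
    # Prime on a real trace, then extend with probability bias b, assuming each
    # per-step distribution is the tuple (pi_hat, mu, sigma_hat) from the
    # mixture output layer, as in the earlier sketches.
    biased = lambda dist: sample_step(*dist, b=b, rng=rng)
    return primed_sample(step, biased, initial_state, x_real, c, s, n_extra)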

6 Conclusions and Future Work

This paper has demonstrated the ability of Long Short-Term Memory recurrent neural networks to generate both discrete and real-valued sequences with complex, long-range structure using next-step prediction. It has also introduced a novel convolutional mechanism that allows a recurrent network to condition its predictions on an auxiliary annotation sequence, and used this approach to synthesise diverse and realistic samples of online handwriting. Furthermore, it has shown how these samples can be biased towards greater legibility, and how they can be modelled on the style of a particular writer.



Several directions for future work suggest themselves. One is the application of the network to speech synthesis, which is likely to be more challenging than handwriting synthesis due to the greater dimensionality of the data points. Another is to gain a better insight into the internal representation of the data, and to use this to manipulate the sample distribution directly. It would also be interesting to develop a mechanism to automatically extract high-level annotations from sequence data. In the case of handwriting, this could allow for more nuanced annotations than just text, for example stylistic features, different forms of the same letter, information about stroke order and so on.

Acknowledgements

Thanks to Yichuan Tang, Ilya Sutskever, Navdeep Jaitly, Geoffrey Hinton and other colleagues at the University of Toronto for numerous useful comments and suggestions. This work was supported by a Global Scholarship from the Canadian Institute for Advanced Research.


References

[1] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, March 1994.
[2] C. Bishop. Mixture density networks. Technical report, 1994.
[3] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Inc., 1995.
[4] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the Twenty-ninth International Conference on Machine Learning (ICML’12), 2012.
[5] J. G. Cleary and I. H. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32:396–402, 1984.
[6] D. Eck and J. Schmidhuber. A first look at music composition using LSTM recurrent neural networks. Technical report, IDSIA / USI-SUPSI, Istituto Dalle Molle.
[7] F. Gers, N. Schraudolph, and J. Schmidhuber. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3:115–143, 2002.
[8] A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, volume 24, pages 2348–2356. 2011.
[9] A. Graves. Sequence transduction with recurrent neural networks. In ICML Representation Learning Workshop, 2012.
[10] A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Proc. ICASSP, 2013.
[11] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:602–610, 2005.
[12] A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in Neural Information Processing Systems, volume 21, 2008.
[13] P. D. Grünwald. The Minimum Description Length Principle (Adaptive Computation and Machine Learning). The MIT Press, 2007.
[14] G. Hinton. A Practical Guide to Training Restricted Boltzmann Machines. Technical report, 2010.
[15] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-term Dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. 2001.
[16] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
[17] M. Hutter. The Human Knowledge Compression Contest, 2012.
[18] K.-C. Jim, C. Giles, and B. Horne. An analysis of noise in recurrent neural networks: convergence and generalization. IEEE Transactions on Neural Networks, 7(6):1424–1438, 1996.
[19] S. Johansson, R. Atwell, R. Garside, and G. Leech. The tagged LOB corpus: user’s manual. Norwegian Computing Centre for the Humanities, 1986.
[20] B. Knoll and N. de Freitas. A machine learning perspective on predictive coding with PAQ. CoRR, abs/1108.3298, 2011.
[21] M. Liwicki and H. Bunke. IAM-OnDB - an on-line English sentence database acquired from handwritten text on a whiteboard. In Proc. 8th Int. Conf. on Document Analysis and Recognition, volume 2, pages 956–961, 2005.
[22] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[23] T. Mikolov. Statistical Language Models based on Neural Networks. PhD thesis, Brno University of Technology, 2012.
[24] T. Mikolov, I. Sutskever, A. Deoras, H. Le, S. Kombrink, and J. Cernocky. Subword language modeling with neural networks. Technical report, Unpublished Manuscript, 2012.
[25] A. Mnih and G. Hinton. A Scalable Hierarchical Distributed Language Model. In Advances in Neural Information Processing Systems, volume 21, 2008.
[26] A. Mnih and Y. W. Teh. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning, pages 1751–1758, 2012.
[27] T. N. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Proc. ICASSP, 2013.
[28] M. Schuster. Better generative models for sequential data problems: Bidirectional recurrent mixture density networks. In Advances in Neural Information Processing Systems, pages 589–595. The MIT Press, 1999.
[29] I. Sutskever, G. E. Hinton, and G. W. Taylor. The recurrent temporal restricted Boltzmann machine. In Advances in Neural Information Processing Systems, pages 1601–1608, 2008.
[30] I. Sutskever, J. Martens, and G. Hinton. Generating text with recurrent neural networks. In ICML, 2011.
[31] G. W. Taylor and G. E. Hinton. Factored conditional restricted Boltzmann machines for modeling motion style. In Proc. 26th Annual International Conference on Machine Learning, pages 1025–1032, 2009.
[32] T. Tieleman and G. Hinton. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude, 2012.
[33] R. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications, pages 433–486. 1995.
