Deep Learning: Paper Translation and Notes

Published: 2025/3/15

Paper title: Deep Learning
Source: Deep Learning, Nature, 2015
Translator: 莫陌莫墨

Deep Learning

Yann LeCun, Yoshua Bengio, Geoffrey Hinton

深度學習

Yann LeCun, Yoshua Bengio, Geoffrey Hinton

Abstract

Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.

摘要

深度學習允許由多個處理層組成的計算模型學習具有多個抽象級別的數據表示。這些方法極大地提升了語音識別、視覺目標識別、目標檢測以及許多其他領域的最新技術，例如藥物發現和基因組學。深度學習通過使用反向傳播算法來指示機器應如何更新其內部參數（從上一層的表示形式計算每一層的表示形式），從而發現大型數據集中的復雜結構。深層卷積網絡在處理圖像、視頻、語音和音頻方面帶來了突破，而遞歸網絡則對諸如文本和語音之類的順序數據有所啟發。

正文

Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users’ interests, and select relevant results of search. Increasingly, these applications make use of a class of techniques called deep learning.

Conventional machine-learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input.

Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations. An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.

Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition to beating records in image recognition and speech recognition, it has beaten other machine-learning techniques at predicting the activity of potential drug molecules, analysing particle accelerator data, reconstructing brain circuits, and predicting the effects of mutations in non-coding DNA on gene expression and disease. Perhaps more surprisingly, deep learning has produced extremely promising results for various tasks in natural language understanding, particularly topic classification, sentiment analysis, question answering and language translation.

We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures that are currently being developed for deep neural networks will only accelerate this progress.

機器學習技術為現代社會的各個方面提供了強大的支持：從網絡搜索到社交網絡上的內容過濾再到電子商務網站上的推薦，并且它越來越多地出現在諸如相機和智能手機之類的消費產品中。機器學習系統用于識別圖像中的目標，語音轉錄為文本，新聞標題、帖子或具有用戶興趣的產品匹配，以及選擇相關的搜索結果。這些應用程序越來越多地使用一類稱為深度學習的技術。

傳統的機器學習技術在處理原始格式的自然數據方面的能力受到限制。幾十年來，構建模式識別或機器學習系統需要認真的工程設計和相當多的領域專業知識，才能設計特征提取器，以將原始數據（例如圖像的像素值）轉換為合適的內部表示或特征向量，學習子系統（通常是分類器）可以對輸入的圖片進行檢測或分類。

表示學習是一組方法，這些方法允許向機器提供原始數據并自動發現檢測或分類所需的表示。深度學習方法是具有表示形式的多層次的表示學習方法，它是通過組合簡單但非線性的模塊而獲得的，每個模塊都將一個級別（從原始輸入開始）的表示形式轉換為更高、更抽象的級別的表示形式。有了足夠多的此類轉換，就可以學習非常復雜的功能。對于分類任務，較高的表示層會放大輸入中對區分非常重要的方面，并抑制不相關的變化。例如，圖像以像素值序列的形式出現，并且在表示的第一層中學習的特征通常表示圖像中特定方向和位置上是否存在邊緣。第二層通常通過發現邊緣的特定布置來檢測圖案，而與邊緣位置的微小變化無關。第三層可以將圖案組裝成與熟悉的對象的各個部分相對應的較大組合，并且隨后的層將這些部分的組合作為目標進行檢測。深度學習的關鍵在于每層的功能不是由人類工程師設計的，而是通用訓練過程從數據中學習的。

深度學習在解決多年來抵制人工智能界最大嘗試的問題方面取得了重大進展。事實證明，它非常善于發現高維數據中的復雜結構，因此適用于科學、商業和政府的許多領域。除了打破圖像識別和語音識別中的記錄，它在預測潛在藥物分子的活性、分析粒子加速器數據，重建腦回路和預測非編碼DNA突變對基因表達和疾病的影響方面還優于其他機器學習技術。更令人驚訝的是，深度學習在自然語言理解中的各種任務上產生了非常有希望的結果，尤其是主題分類、情感分析、問答系統和語言翻譯。

由于深度學習只需要極少的人工操作，我們認為其在不久的將來會取得更多的成功，因此可以輕松地利用增加的可用計算量和數據量的優勢。目前正在為深度神經網絡開發的新學習算法和體系結構只會加速這一進展。
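
Summary: the paper first describes how widely machine learning is used, then points out the limitation of conventional machine learning: raw data has to be processed into features before it can serve as input, and that processing is a craft that takes a lot of experience and algorithmic knowledge. This motivates representation learning.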

Supervised learning

The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as ‘knobs’ that define the input–output function of the machine. In a typical deep-learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine.

To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector.
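The paragraph above can be illustrated with a minimal Python sketch. It assumes a toy squared-error objective for a linear model on a single example; the names `objective`, `numerical_gradient` and `step_against_gradient`, the learning rate and the numbers are all made up for illustration and do not come from the paper. Each gradient component is estimated by nudging one weight by a tiny amount and measuring how the error changes, and the weights are then moved in the opposite direction.

```python
def objective(weights, inputs, target):
    """Squared error of a linear model on one example (toy objective, an assumption)."""
    output = sum(w * x for w, x in zip(weights, inputs))
    return (output - target) ** 2


def numerical_gradient(weights, inputs, target, eps=1e-6):
    """Estimate d(error)/d(weight) for each weight from a tiny perturbation."""
    grad = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += eps
        down[i] -= eps
        change = objective(up, inputs, target) - objective(down, inputs, target)
        grad.append(change / (2 * eps))
    return grad


def step_against_gradient(weights, grad, learning_rate=0.1):
    """Adjust the weight vector in the direction opposite to the gradient."""
    return [w - learning_rate * g for w, g in zip(weights, grad)]


weights = [0.5, -0.3]
inputs, target = [1.0, 2.0], 1.0

grad = numerical_gradient(weights, inputs, target)
new_weights = step_against_gradient(weights, grad)

error_before = objective(weights, inputs, target)
error_after = objective(new_weights, inputs, target)
print(grad, error_before, error_after)
```

In real systems the gradient is computed analytically rather than by perturbing weights one at a time, since perturbation needs two evaluations of the objective per weight.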

The objective function, averaged over all the training examples, can be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average.

In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine — its ability to produce sensible answers on new inputs that it has never seen during training.
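A minimal sketch of the SGD loop described above, assuming a one-dimensional linear model and synthetic data (the data, the batch size, the learning rate and the function names are illustrative assumptions, not values from the paper). Each step draws a few examples, averages their gradient, and adjusts the weights; the result is then measured on held-out examples that play the role of the test set.

```python
import random


def sgd_linear_regression(examples, learning_rate=0.05, batch_size=4, steps=2000, seed=0):
    """Fit y = w * x + b by minibatch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        batch = rng.sample(examples, batch_size)  # a few examples from the training set
        # Average gradient of (w * x + b - y)^2 over the batch: a noisy
        # estimate of the gradient over all training examples.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / batch_size
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / batch_size
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mean_squared_error(examples, w, b):
    """Average squared error of the fitted line on a set of examples."""
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)


# Synthetic data drawn from y = 3x - 1 plus small noise (made up for illustration).
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(200)]
dataset = [(x, 3.0 * x - 1.0 + data_rng.gauss(0.0, 0.05)) for x in xs]
train_set, test_set = dataset[:150], dataset[150:]

w, b = sgd_linear_regression(train_set)
test_error = mean_squared_error(test_set, w, b)
print(w, b, test_error)
```

A fixed number of steps is used here for simplicity; the stopping rule in the text (stop when the average objective no longer decreases) would replace the `for` loop bound.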

Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category.
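As a rough illustration, the two-class linear classifier above fits in a few lines of Python. The feature values, weights and the small search grid are assumptions chosen for the example. The sketch also searches a grid of weights and thresholds for a classifier that reproduces logical AND and one that reproduces XOR; XOR is a standard textbook case rather than an example from the paper, and no weighted sum with a single threshold can realize it, because such a classifier can only separate its inputs with a hyperplane.

```python
def linear_classify(features, weights, threshold):
    """Return True when the weighted sum of the features exceeds the threshold."""
    weighted_sum = sum(w * f for w, f in zip(weights, features))
    return weighted_sum > threshold


def realizable_on_grid(target_outputs, grid):
    """Search a small grid of weights and thresholds for a matching classifier."""
    points = [(a, b) for a in (0, 1) for b in (0, 1)]
    for w1 in grid:
        for w2 in grid:
            for threshold in grid:
                outputs = [linear_classify(p, [w1, w2], threshold) for p in points]
                if outputs == target_outputs:
                    return True
    return False


grid = [i / 2 for i in range(-8, 9)]  # -4.0, -3.5, ..., 4.0

# Target outputs are listed for the inputs (0,0), (0,1), (1,0), (1,1).
and_found = realizable_on_grid([False, False, False, True], grid)
xor_found = realizable_on_grid([False, True, True, False], grid)
print(and_found, xor_found)
```

The grid search is only a demonstration: finding nothing on a grid is not a proof, although for XOR the result does hold for all real weights.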

Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane. But problems such as image and speech recognition require the input–output function to be insensitive to irrelevant variations of the input, such as variations in position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other. A linear classifier, or any other ‘shallow’ classifier operating on raw pixels could not possibly distinguish the latter two, while putting the former two in the same category. This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma: one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal. To make classifiers more powerful, one can use generic non-linear features, as with kernel methods, but generic features such as those arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples. The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.

A deep-learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input–output mappings. Each module in the stack transforms its input to increase both the selectivity and the invariance of the representation. With multiple non-linear layers, say a depth of 5 to 20, a system can implement extremely intricate functions of its inputs that are simultaneously sensitive to minute details (distinguishing Samoyeds from white wolves) and insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects.

監督學習

不論深度與否，機器學習最常見的形式都是監督學習。想象一下，我們想建立一個可以將圖像分類為包含房屋、汽車、人或寵物的系統。我們首先收集大量的房屋、汽車、人和寵物的圖像數據集，每個圖像均標有類別。在訓練過程中，機器將顯示一張圖像，并輸出一個分數向量，每個類別一個。我們希望所需的類別在所有類別中得分最高，但這不太可能在訓練之前發生。我們計算一個目標函數，該函數測量輸出得分與期望得分模式之間的誤差（或距離）。然后機器修改其內部可更新參數以減少此誤差。這些可更新的參數（通常稱為權重）是實數，可以看作是定義機器輸入輸出功能的“旋鈕”。在典型的深度學習系統中，可能會有數以億計的可更新權重，以及數億個帶有標簽的實例，用于訓練模型。

為了適當地更新權重向量，學習算法計算一個梯度向量，針對每個權重，該梯度向量表明，如果權重增加很小的量，誤差將增加或減少的相應的量。然后沿與梯度向量相反的方向更新權重向量。

在所有訓練示例中平均的目標函數可以在權重值的高維空間中被視為一種丘陵地形。負梯度矢量指示此地形中最陡下降的方向，使其更接近最小值，其中輸出誤差平均較低。

在實踐中，大多數從業者使用一種稱為隨機梯度下降（SGD）的算法。這包括顯示幾個示例的輸入向量，計算輸出和誤差，計算這些示例的平均梯度以及相應地更新權重。對訓練集中的許多小樣本示例重復此過程，直到目標函數的平均值停止下降。之所以稱其為隨機的，是因為每個小的示例集都會給出所有示例中平均梯度的噪聲估計。與更復雜的優化技術相比[18]，這種簡單的過程通常會出乎意料地快速找到一組良好的權重。訓練后，系統的性能將在稱為測試集的不同示例集上進行測量。這用于測試機器的泛化能力：機器在新的輸入數據上產生好的效果的能力，這些輸入數據在訓練集上是沒有的。

機器學習的許多當前實際應用都在人工設計的基礎上使用線性分類器。兩類別線性分類器計算特征向量分量的加權和。如果加權和大于閾值，則將輸入分為特定類別。

自二十世紀六十年代以來，我們就知道線性分類器只能將其輸入空間劃分為非常簡單的區域，即由超平面分隔的對半空間。但是，諸如圖像和語音識別之類的問題要求輸入輸出功能對輸入的不相關變化不敏感，例如目標的位置、方向或照明的變化，或語音的音高或口音的變化。對特定的微小變化敏感（例如，白狼與薩摩耶之間的差異，薩摩耶是很像狼的白狗）。在像素級別，兩幅處于不同姿勢和不同環境中的薩摩耶圖像可能差別很大，而兩幅位于相同位置且背景相似的薩摩耶和狼的圖像可能非常相似。線性分類器或其他任何在其上運行的“淺”分類器無法區分后兩幅圖片，而將前兩幅圖像歸為同一類別。這就是為什么淺分類器需要一個好的特征提取器來解決選擇性不變性難題的原因。提取器可以產生對圖像中對于辨別重要的方面具有選擇性但對不相關方面（例如動物的姿態）不變的表示形式。為了使分類器更強大，可以使用通用的非線性特征，如核方法，但是諸如高斯核所產生的那些通用特征，使學習者無法從訓練示例中很好地概括。傳統的選擇是人工設計好的特征提取器，這需要大量的工程技術和領域專業知識。但是，如果可以使用通用學習過程自動學習好的功能，則可以避免所有這些情況。這是深度學習的關鍵優勢。

深度學習架構是簡單模塊的多層堆疊，所有模塊（或大多數模塊）都需要學習，并且其中許多模塊都會計算非線性的輸入-輸出映射。堆疊中的每個模塊都會轉換其輸入，以增加表示的選擇性和不變性。系統具有多個非線性層（例如深度為5到20），可以實現極為復雜的輸入功能，這些功能同時對細小的細節敏感（區分薩摩耶犬與白狼），并且對不相關的大變化不敏感，例如背景功能、姿勢、燈光和周圍物體。
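
2. Supervised learning: we show the network many pictures of dogs (the training set) and tell it that each one is a dog (the label). After many repetitions, when we show the trained network a new picture of a dog, it responds on its own with the result: this is a dog. So how is it actually trained? We define an objective function that measures the distance between the prediction and the label. The task of learning is to reduce the value of this objective function, that is, to bring the predictions ever closer to the true values. The objective function is a function of the weights. The paper then briefly mentions one optimization method, stochastic gradient descent (SGD): at each step, pick a small random set of examples from the training set (in the simplest case, a single example), compute the partial derivatives (the gradient) of the objective function with respect to the weights, and move the weights in the negative gradient direction, which is the direction in which the objective function, and hence the error, decreases. This is repeated many times. Once the objective function stops changing or becomes very small, we stop iterating, and the weights at that point are what the network has learned. The result is a trained architecture with its own learned representation, which can judge whether a new object is one it has learned before and output a result.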

Backpropagation to train multilayer architectures

From the earliest days of pattern recognition, the aim of researchers has been to replace hand-engineered features with trainable multilayer networks, but despite its simplicity, the solution was not widely understood until the mid 1980s. As it turns out, multilayer architectures can be trained by simple stochastic gradient descent. As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. The idea that this could be done, and that it worked, was discovered independently by several different groups during the 1970s and 1980s.

The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module) (Fig. 1). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module.

Many applications of deep learning use feedforward neural network architectures (Fig. 1), which learn to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each of several categories). To go from one layer to the next, a set of units compute a weighted sum of their inputs from the previous layer and pass the result through a non-linear function. At present, the most popular non-linear function is the rectified linear unit (ReLU), which is simply the half-wave rectifier $f(z) = \max(0, z)$. In past decades, neural nets used smoother non-linearities, such as $\tanh(z)$ or $1 / (1 + \exp(-z))$, but the ReLU typically learns much faster in networks with many layers, allowing training of a deep supervised network without unsupervised pre-training. Units that are not in the input or output layer are conventionally called hidden units. The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer (Fig. 1).
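To make the chain-rule computation concrete, here is a minimal sketch of backpropagation through a tiny feedforward network with one hidden ReLU layer and a single linear output. The network size, the weights, the input and the squared-error loss are illustrative assumptions, not values from the paper. The backward pass starts from the gradient at the output and works back to the weights of each layer; one entry is then checked against a finite-difference estimate.

```python
def relu(z):
    """Half-wave rectifier f(z) = max(0, z)."""
    return max(0.0, z)


def forward(x, W1, w2):
    """Hidden layer: ReLU of weighted sums; output: weighted sum of hidden units."""
    pre = [sum(w * xj for w, xj in zip(row, x)) for row in W1]
    hidden = [relu(z) for z in pre]
    output = sum(w * h for w, h in zip(w2, hidden))
    return pre, hidden, output


def loss(x, target, W1, w2):
    """Squared-error objective on one example (an assumption for this sketch)."""
    return 0.5 * (forward(x, W1, w2)[2] - target) ** 2


def backward(x, target, W1, w2):
    """Apply the chain rule from the output back towards the input."""
    pre, hidden, output = forward(x, W1, w2)
    d_output = output - target                          # dE/d(output)
    grad_w2 = [d_output * h for h in hidden]            # dE/d(w2)
    d_hidden = [d_output * w for w in w2]               # dE/d(hidden)
    d_pre = [dh if z > 0 else 0.0 for dh, z in zip(d_hidden, pre)]  # through the ReLU
    grad_W1 = [[dp * xj for xj in x] for dp in d_pre]   # dE/d(W1)
    return grad_W1, grad_w2


x, target = [1.0, -2.0], 0.5
W1 = [[0.3, -0.1], [-0.4, 0.2], [0.5, 0.6]]
w2 = [0.7, -0.2, 0.1]

grad_W1, grad_w2 = backward(x, target, W1, w2)

# Finite-difference check of a single weight, W1[0][1].
eps = 1e-6
W1_up = [row[:] for row in W1]
W1_down = [row[:] for row in W1]
W1_up[0][1] += eps
W1_down[0][1] -= eps
numeric = (loss(x, target, W1_up, w2) - loss(x, target, W1_down, w2)) / (2 * eps)
print(grad_W1[0][1], numeric)
```

Hidden units whose weighted sum is negative output zero, so no gradient flows through them; that is why two of the three rows of `grad_W1` are zero for this input.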

In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage, feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima, that is, weight configurations for which no small change would reduce the average error.

In practice, poor local minima are rarely a problem with large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the remainder. The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.
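As a small worked example of a saddle point (a textbook function chosen for illustration, not one taken from the paper), consider an objective with two weights:

$$
f(w_1, w_2) = w_1^2 - w_2^2, \qquad
\nabla f = (2 w_1,\; -2 w_2), \qquad
\nabla^2 f = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix}.
$$

The gradient vanishes at $(0, 0)$, yet the surface curves up along $w_1$ (curvature $+2$) and down along $w_2$ (curvature $-2$), so the origin is neither a minimum nor a maximum. In a network with many weights, the analogous points have many upward-curving directions and only a few downward-curving ones.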

Interest in deep feedforward networks was revived around 2006 by a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR). The researchers introduced unsupervised learning procedures that could create layers of feature detectors without requiring labelled data. The objective in learning each layer of feature detectors was to be able to reconstruct or model the activities of feature detectors (or raw inputs) in the layer below. By ‘pre-training’ several layers of progressively more complex feature detectors using this reconstruction objective, the weights of a deep network could be initialized to sensible values. A final layer of output units could then be added to the top of the network and the whole deep system could be fine-tuned using standard backpropagation. This worked remarkably well for recognizing handwritten digits or for detecting pedestrians, especially when the amount of labelled data was very limited [36].

The first major application of this pre-training approach was in speech recognition, and it was made possible by the advent of fast graphics processing units (GPUs) that were convenient to program and allowed researchers to train networks 10 or 20 times faster. In 2009, the approach was used to map short temporal windows of coefficients extracted from a sound wave to a set of probabilities for the various fragments of speech that might be represented by the frame in the centre of the window. It achieved record-breaking results on a standard speech recognition benchmark that used a small vocabulary and was quickly developed to give record-breaking results on a large vocabulary task. By 2012, versions of the deep net from 2009 were being developed by many of the major speech groups and were already being deployed in Android phones. For smaller data sets, unsupervised pre-training helps to prevent overfitting, leading to significantly better generalization when the number of labelled examples is small, or in a transfer setting where we have lots of examples for some ‘source’ tasks but very few for some ‘target’ tasks. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.

There was, however, one particular type of deep, feedforward network that was much easier to train and generalized much better than networks with full connectivity between adjacent layers. This was the convolutional neural network (ConvNet). It achieved many practical successes during the period when neural networks were out of favour and it has recently been widely adopted by the computer-vision community.

反向傳播訓練多層架構

從模式識別的早期開始，研究人員的目的一直是用可訓練的多層網絡代替手工設計的功能，但是盡管它很簡單，但直到二十世紀八十年代中期才廣泛了解該解決方案。事實證明，可以通過簡單的隨機梯度下降來訓練多層體系結構。只要模塊是其輸入及其內部權重的相對平滑函數，就可以使用反向傳播過程來計算梯度。二十世紀七十年代和二十世紀八十年代，幾個不同的小組獨立地發現了可以做到這一點并且起作用的想法。

反向傳播程序用于計算目標函數相對于模塊多層堆疊權重的梯度，無非是導數鏈規則的實際應用。關鍵的見解是，相對于模塊輸入的目標的導數（或梯度）可以通過相對于該模塊的輸出（或后續模塊的輸入）的梯度進行反運算來計算（圖1）。反向傳播方程式可以反復應用，以通過所有模塊傳播梯度，從頂部的輸出（網絡產生其預測）一直到底部的輸出（外部輸入被饋送）。一旦計算出這些梯度，就可以相對于每個模塊的權重來計算梯度。

深度學習的許多應用都使用前饋神經網絡體系結構（圖1），該體系結構會將固定大小的輸入（例如圖像）映射到固定大小的輸出（例如幾個類別中的每一個的概率）。為了從一層到下一層，一組單元計算它們來自上一層的輸入的加權和，并將結果傳遞給非線性函數。目前，最流行的非線性函數是整流線性單元（ReLU），即半波整流器 $f(z) = \max(0, z)$ 。在過去的幾十年中，神經網絡使用了更平滑的非線性，例如 $\tanh(z)$ 或 $1 / (1 + \exp(-z))$ ，但ReLU通常在具有多個層的網絡中學習得更快，從而無需無監督預訓練即可訓練深度監督網絡。不在輸入或輸出層中的單元通常稱為隱藏單元。隱藏的層可以被視為以非線性方式使輸入失真，以便類別可以由最后一層實現線性分別（圖1）。

在二十世紀九十年代后期，神經網絡和反向傳播在很大程度上被機器學習領域拋棄，而被計算機視覺和語音識別領域所忽略。人們普遍認為，在沒有先驗知識的情況下學習有用的多階段特征提取器是不可行的。特別是，通常認為簡單的梯度下降會陷入不良的局部極小值——權重配置，對其進行很小的變化將減少平均誤差。

實際上，較差的局部最小值在大型網絡中很少出現問題。不管初始條件如何，該系統幾乎總是能獲得效果非常相似的解決方案。最近的理論和經驗結果強烈表明，局部極小值通常不是一個嚴重的問題。取而代之的是，景觀中堆積了許多鞍點，其中梯度為零，并且曲面在大多數維度上都向上彎曲，而在其余維度上則向下彎曲。分析似乎表明，只有少數幾個向下彎曲方向的鞍點存在很多，但幾乎所有鞍點的目標函數值都非常相似。因此，算法陷入這些鞍點中的哪一個都沒關系。

加拿大高級研究所（CIFAR）召集的一組研究人員在2006年左右恢復了對深層前饋網絡的興趣。研究人員介紹了無需監督的學習程序，這些程序可以創建特征檢測器層，而無需標記數據。學習特征檢測器每一層的目的是能夠在下一層中重建或建模特征檢測器（或原始輸入）的活動。通過使用此重建目標“預訓練”幾層逐漸復雜的特征檢測器，可以將深度網絡的權重初始化為合理的值。然后可以將輸出單元的最后一層添加到網絡的頂部，并且可以使用標準反向傳播對整個深度系統進行微調。這對于識別手寫數字或檢測行人非常有效，特別是在標記數據量非常有限的情況下。

這種預訓練方法的第一個主要應用是語音識別，而快速圖形處理單元（GPU）的出現使編程成為可能，并且使研究人員訓練網絡的速度提高了10或20倍，從而使之成為可能。在2009年，該方法用于將從聲波提取的系數的短暫時間窗口映射到可能由窗口中心的幀表示的各種語音片段的一組概率。它在使用少量詞匯的標準語音識別基準上取得了創紀錄的結果，并迅速發展為大型詞匯任上取得了創紀錄的結果。到2012年，許多主要的語音組織都在開發2009年以來的深度網絡版本，并且已經在Android手機中進行了部署。對于較小的數據集，無監督的預訓練有助于防止過擬合，從而在標記的示例數量較少時或在轉移設置中，對于一些“源”任務，我們有很多示例，而對于某些“源”任務卻很少，這會導致泛化效果更好“目標”任務。恢復深度學習后，事實證明，僅對于小型數據集才需要進行預訓練。

但是，存在一種特定類型的深層前饋網絡，它比相鄰層之間具有完全連接的網絡更容易訓練和推廣。這就是卷積神經網絡（ConvNet）。在神經網絡未受關注期間，它取得了許多實際的成功，并且最近被計算機視覺界廣泛采用。
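
3. The backpropagation algorithm (Backpropagation to train multilayer architectures): the core idea is the chain rule for derivatives. The paper then says that neural networks and backpropagation were largely abandoned by the machine-learning community, and that interest revived around 2006 when a group of researchers introduced unsupervised learning procedures. As the name suggests, an unsupervised learning procedure builds layers of feature detectors from unlabelled data. Its strength is pre-training, which initializes the weights to sensible values. For small data sets, unsupervised pre-training also helps to prevent overfitting.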

Convolutional neural networks

ConvNets are designed to process data that come in the form of multiple arrays, for example a colour image composed of three 2D arrays containing pixel intensities in the three colour channels. Many data modalities are in the form of multiple arrays: 1D for signals and sequences, including language; 2D for images or audio spectrograms; and 3D for video or volumetric images. There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.

The architecture of a typical ConvNet (Fig. 2) is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. The result of this local weighted sum is then passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank. Different feature maps in a layer use different filter banks. The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.
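The weight sharing described above can be sketched in plain Python. This is an illustration under stated assumptions: a single hand-written 2×2 filter (real filter banks are learned), a tiny synthetic image, and no padding. Like most deep-learning code, the sketch slides the filter without flipping it (cross-correlation); a strict discrete convolution would flip the kernel first. Because every output unit uses the same weights, the same edge is detected wherever it appears.

```python
def conv2d_valid(image, kernel):
    """Slide one shared filter over every local patch of the image (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu_map(feature_map):
    """Pass each local weighted sum through the ReLU non-linearity."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def image_with_edge(col, size=6):
    """Synthetic image: dark on the left, bright from column `col` onwards."""
    return [[1.0 if c >= col else 0.0 for c in range(size)] for _ in range(size)]


# Hypothetical hand-written filter that responds to dark-to-bright vertical edges.
vertical_edge = [[-1.0, 1.0], [-1.0, 1.0]]

map_a = relu_map(conv2d_valid(image_with_edge(2), vertical_edge))
map_b = relu_map(conv2d_valid(image_with_edge(4), vertical_edge))
print(map_a[0], map_b[0])
```

Moving the edge two columns to the right moves the response in the feature map by two columns, with no change to the weights.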

Although the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one. Because the relative positions of the features forming a motif can vary somewhat, reliably detecting the motif can be done by coarse-graining the position of each feature. A typical pooling unit computes the maximum of a local patch of units in one feature map (or in a few feature maps). Neighbouring pooling units take input from patches that are shifted by more than one row or column, thereby reducing the dimension of the representation and creating an invariance to small shifts and distortions. Two or three stages of convolution, non-linearity and pooling are stacked, followed by more convolutional and fully-connected layers. Backpropagating gradients through a ConvNet is as simple as through a regular deep network, allowing all the weights in all the filter banks to be trained.
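A minimal sketch of a max-pooling layer, assuming 2×2 patches taken with a stride of 2 and a tiny hand-made feature map (the sizes and values are illustrative, not from the paper). It shows both effects mentioned above: the pooled map is smaller than the input, and a small shift of an activation that stays inside one pooling patch leaves the output unchanged.

```python
def max_pool(feature_map, size=2, stride=2):
    """Take the maximum over local patches that are shifted by `stride` rows or columns."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, stride):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, stride):
            patch = [feature_map[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(max(patch))
        pooled.append(row)
    return pooled


# A 4x4 feature map with one active unit, and the same map shifted by one column.
original = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
shifted = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

pooled_original = max_pool(original)
pooled_shifted = max_pool(shifted)
print(pooled_original, pooled_shifted)
```

The invariance is only local: a shift that crosses a patch boundary does change the pooled output, which is one reason several stages are stacked.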

Deep neural networks exploit the property that many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones. In images, local combinations of edges form motifs, motifs assemble into parts, and parts form objects. Similar hierarchies exist in speech and text from sounds to phones, phonemes, syllables, words and sentences. The pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.

The convolutional and pooling layers in ConvNets are directly inspired by the classic notions of simple cells and complex cells in visual neuroscience, and the overall architecture is reminiscent of the LGN–V1–V2–V4–IT hierarchy in the visual cortex ventral pathway. When ConvNet models and monkeys are shown the same picture, the activations of high-level units in the ConvNet explain half of the variance of random sets of 160 neurons in the monkey’s inferotemporal cortex. ConvNets have their roots in the neocognitron, the architecture of which was somewhat similar, but did not have an end-to-end supervised-learning algorithm such as backpropagation. A primitive 1D ConvNet called a time-delay neural net was used for the recognition of phonemes and simple words.

There have been numerous applications of convolutional networks going back to the early 1990s, starting with time-delay neural networks for speech recognition and document reading. The document reading system used a ConvNet trained jointly with a probabilistic model that implemented language constraints. By the late 1990s this system was reading over 10% of all the cheques in the United States. A number of ConvNet-based optical character recognition and handwriting recognition systems were later deployed by Microsoft. ConvNets were also experimented with in the early 1990s for object detection in natural images, including faces and hands and for face recognition.

卷積神經網絡

ConvNets被設計為處理以多個陣列形式出現的數據，例如，由三個二維通道組成的彩色圖像，其中三個二維通道在三個彩色通道中包含像素強度。許多數據形式以多個數組的形式出現：一維用于信號和序列，包括語言；2D用于圖像或音頻頻譜圖；和3D視頻或體積圖像。ConvNets有四個利用自然信號屬性的關鍵思想：局部連接，共享權重，池化和多層使用。

典型的ConvNet的體系結構（圖2）由一系列階段構成。前幾個階段由兩種類型的層組成：卷積層和池化層。卷積層中的單元組織在特征圖中，其中每個單元通過稱為濾波器組的一組權重連接到上一層特征圖中的局部塊。然后，該局部加權和的結果將通過非線性（如ReLU）傳遞。特征圖中的所有單元共享相同的過濾器組。圖層中的不同要素圖使用不同的濾鏡庫。這種體系結構的原因有兩個。首先，在諸如圖像的陣列數據中，局部的值通常高度相關，從而形成易于檢測的獨特局部圖案。其次，圖像和其他信號的局部統計量對于位置是不變的。換句話說，如果圖形可以出現在圖像的一部分中，則它可以出現在任何位置，因此，位于不同位置的單元在數組的不同部分共享相同的權重并檢測相同的圖案。在數學上，由特征圖執行的過濾操作是離散卷積，因此得名。

盡管卷積層的作用是檢測上一層的特征的局部連接，但池化層的作用是將語義相似的要素合并為一個。由于形成圖案的特征的相對位置可能會略有變化，因此可以通過對每個特征的位置進行粗粒度來可靠地檢測圖案。一個典型的池化單元計算一個特征圖中（或幾個特征圖中）的局部塊的最大值。相鄰的池化單元從移動了不止一個行或一列的色塊中獲取輸入，從而減小了表示的尺寸，并為小幅度的移位和失真創建了不變性。卷積、非線性和池化的兩個或三個階段被堆疊，隨后是更多卷積和全連接的層。通過ConvNet進行反向傳播的梯度與通過常規深度網絡一樣簡單，從而可以訓練所有濾波器組中的所有權重。

深度神經網絡利用了許多自然信號是成分層次結構的特性，其中通過組合較低層的特征獲得較高層的特征。在圖像中，邊緣的局部組合形成圖案，圖案組裝成零件，而零件形成對象。從聲音到電話，音素，音節，單詞和句子，語音和文本中也存在類似的層次結構。當上一層中的元素的位置和外觀變化時，池化使表示形式的變化很小。

卷積網絡中的卷積和池化層直接受到視覺神經科學中簡單細胞和復雜細胞的經典概念的啟發，整個架構讓人聯想到視覺皮層腹側通路中的LGN-V1-V2-V4-IT層次結構。當ConvNet模型和猴子顯示相同的圖片時，ConvNet中高層單元的激活解釋了猴子下顳葉皮層中160個神經元隨機集合的一半方差。 ConvNets的根源是新認知器，其架構有些相似，但沒有反向傳播等端到端監督學習算法。稱為時延神經網絡的原始一維ConvNet用于識別音素和簡單單詞。

卷積網絡的大量應用可以追溯到二十世紀九十年代初，首先是用于語音識別和文檔閱讀的時延神經網絡。該文檔閱讀系統使用了一個ConvNet，并與一個實現語言約束的概率模型一起進行了培訓。到二十世紀九十年代后期，該系統已讀取了美國所有支票的10％以上。 Microsoft隨后部署了許多基于ConvNet的光學字符識別和手寫識別系統。在二十世紀九十年代初，還對ConvNets進行試驗，以檢測自然圖像中的物體，包括面部和手部，以及面部識別。
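
Convolutional neural networks: the four key ideas behind ConvNets are:
Local connections: a unit does not need to perceive the whole image. It only needs to perceive a local region, and higher layers combine the local information into global information.
Shared weights: weight sharing (that is, the convolution operation) reduces the number of weights and the complexity of the network, and can be seen as a way of extracting features. The underlying principle is that the statistics of one part of an image are the same as those of any other part, so features learned in one part can be used in another, and the same learned features can be applied at every position in the image.
Pooling: once features have been obtained by convolution, the next step is to use them for classification. One could train a classifier, such as a softmax classifier, on all the extracted features, but this is computationally expensive and prone to overfitting. So, to describe a large image, the features at different positions are aggregated with a statistic such as the mean or the maximum (mean-pooling and max-pooling).
The use of many layers.
The paper then describes the structure of a typical ConvNet: convolution, non-linearity and pooling. Stacking this structure several times forms the hidden layers of the ConvNet. It then explains the inspiration behind the design of the convolutional and pooling layers.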

Image understanding with deep convolutional networks

Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation and recognition of objects and regions in images. These were all tasks in which labelled data was relatively abundant, such as traffic sign recognition, the segmentation of biological images particularly for connectomics, and the detection of faces, text, pedestrians and human bodies in natural images. A major recent practical success of ConvNets is face recognition.

Importantly, images can be labelled at the pixel level, which will have applications in technology, including autonomous mobile robots and self-driving cars. Companies such as Mobileye and NVIDIA are using such ConvNet-based methods in their upcoming vision systems for cars. Other applications gaining importance involve natural language understanding and speech recognition.

Despite these successes, ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a data set of about a million images from the web that contained 1,000 different classes, they achieved spectacular results, almost halving the error rates of the best competing approaches. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout, and techniques to generate more training examples by deforming the existing ones. This success has brought about a revolution in computer vision; ConvNets are now the dominant approach for almost all recognition and detection tasks and approach human performance on some tasks. A recent stunning demonstration combines ConvNets and recurrent net modules for the generation of image captions (Fig. 3).

Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such large networks could have taken weeks only two years ago, progress in hardware, software and algorithm parallelization have reduced training times to a few hours.

The performance of ConvNet-based vision systems has caused most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly growing number of start-ups to initiate research and development projects and to deploy ConvNet-based image understanding products and services.

ConvNets are easily amenable to efficient hardware implementations in chips or field-programmable gate arrays. A number of companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.

深度卷積網絡的圖像理解

自二十一世紀初以來，ConvNets已成功應用于圖像中對象和區域的檢測、分割和識別。這些都是標記數據相對豐富的任務，例如交通標志識別、生物圖像分割、尤其是用于連接組學，以及在自然圖像中檢測人臉、文字、行人和人體。ConvNets最近在實踐中取得的主要成功是面部識別[59]。

重要的是，可以在像素級別標記圖像，這將在技術中得到應用，包括自動駕駛機器人和自動駕駛汽車。 Mobileye和NVIDIA等公司正在其即將推出的汽車視覺系統中使用基于ConvNet的方法。其他日益重要的應用包括自然語言理解和語音識別。

盡管取得了這些成功，但ConvNet在很大程度上被主流計算機視覺和機器學習領域棄用，直到2012年ImageNet競賽為止。當深度卷積網絡應用于來自網絡的大約一百萬個圖像的數據集時，其中包含1000個不同的類別，取得了驚人的成績，幾乎使最佳競爭方法的錯誤率降低了一半。成功的原因是有效利用了GPU、ReLU、一種稱為dropout的新正則化技術，以及通過使現有示例變形而生成更多訓練示例的技術。這一成功帶來了計算機視覺的一場革命。現在，ConvNets是幾乎所有識別和檢測任務的主要方法，并且在某些任務上達到了人類水平。最近的一次令人震驚的演示結合了ConvNets和遞歸網絡模塊以生成圖像字幕（圖3）。

最新的ConvNet架構具有10到20層ReLU，數億個權重以及單元之間的數十億個連接。盡管培訓如此大型的網絡可能僅在兩年前才花了幾周的時間，但是硬件、軟件和算法并行化方面的進步已將培訓時間減少到幾個小時。

基于ConvNet的視覺系統的性能已促使大多數主要技術公司（包括Google、Facebook、Microsoft、IBM、Yahoo!、Twitter和Adobe）以及數量迅速增長的初創公司啟動研究和開發項目，并部署基于ConvNet的圖像理解產品和服務。

卷積網絡很容易適應芯片或現場可編程門陣列中的高效硬件實現。 NVIDIA、Mobileye、英特爾、高通和三星等多家公司正在開發ConvNet芯片，以支持智能手機、相機、機器人和自動駕駛汽車中的實時視覺應用。
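Note on "Image understanding with deep convolutional networks": this section has little technical content. It mainly describes the impressive results that the major internet companies have achieved with deep networks.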

Distributed representations and language processing

Deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations. Both of these advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure. First, learning distributed representations enables generalization to new combinations of the values of learned features beyond those seen during training (for example, $2^n$ combinations are possible with $n$ binary features). Second, composing layers of representation in a deep net brings the potential for another exponential advantage (exponential in the depth).
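
The $2^n$ figure can be checked by brute force. The following is a minimal sketch in Python; the function name `feature_combinations` is an illustrative assumption and does not come from the paper.

```python
from itertools import product

def feature_combinations(n):
    """Enumerate every assignment of n binary features (hypothetical helper)."""
    return list(product([0, 1], repeat=n))

# With 3 binary features there are 2**3 = 8 distinct patterns,
# whereas a one-of-n (local) code with 3 units can mark only 3 items.
combos = feature_combinations(3)
print(len(combos))
```

The count doubles with every added feature, which is the sense in which a distributed code is exponentially more expressive than a code with one unit per item.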

The hidden layers of a multilayer neural network learn to represent the network’s inputs in a way that makes it easy to predict the target outputs. This is nicely demonstrated by training a multilayer neural network to predict the next word in a sequence from a local context of earlier words. Each word in the context is presented to the network as a one-of-N vector, that is, one component has a value of 1 and the rest are 0. In the first layer, each word creates a different pattern of activations, or word vectors (Fig. 4). In a language model, the other layers of the network learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability for any word in the vocabulary to appear as the next word. The network learns word vectors that contain many active components each of which can be interpreted as a separate feature of the word, as was first demonstrated in the context of learning distributed representations for symbols. These semantic features were not explicitly present in the input. They were discovered by the learning procedure as a good way of factorizing the structured relationships between the input and output symbols into multiple ‘micro-rules’. Learning word vectors turned out to also work very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable. When trained to predict the next word in a news story, for example, the learned word vectors for Tuesday and Wednesday are very similar, as are the word vectors for Sweden and Norway. Such representations are called distributed representations because their elements (the features) are not mutually exclusive and their many configurations correspond to the variations seen in the observed data. These word vectors are composed of learned features that were not determined ahead of time by experts, but automatically discovered by the neural network. Vector representations of words learned from text are now very widely used in natural language applications.
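
A small sketch may help make the one-of-N input and the first-layer word vectors concrete. It assumes a four-word vocabulary and a hand-written 2-dimensional embedding table; the numbers are invented for illustration and are not learned values from the paper. Multiplying a one-of-N vector by the first-layer weight matrix simply selects one row, which is the word vector.

```python
import math

def one_hot(index, size):
    """One-of-N vector: a single component is 1, the rest are 0."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def row_times_matrix(vec, matrix):
    """Multiply a row vector (length V) by a V x D matrix."""
    dim = len(matrix[0])
    return [sum(vec[i] * matrix[i][d] for i in range(len(vec))) for d in range(dim)]

vocab = ["Tuesday", "Wednesday", "Sweden", "Norway"]
# Hypothetical first-layer weights (one row per word); a real model learns these.
embeddings = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
]

def word_vector(word):
    """First-layer activation pattern for a word."""
    return row_times_matrix(one_hot(vocab.index(word), len(vocab)), embeddings)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Related words sit close together; unrelated words do not.
print(cosine(word_vector("Tuesday"), word_vector("Wednesday")))
print(cosine(word_vector("Tuesday"), word_vector("Sweden")))
```

In this toy table the two weekdays and the two countries were placed near each other by hand; in a trained language model that proximity emerges from the next-word prediction objective.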

The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use; and to reason with symbols, they must be bound to the variables in judiciously chosen rules of inference. By contrast, neural networks just use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast ‘intuitive’ inference that underpins effortless commonsense reasoning.

Before the introduction of neural language models, the standard approach to statistical modelling of language did not exploit distributed representations: it was based on counting frequencies of occurrences of short symbol sequences of length up to $N$ (called $N$-grams). The number of possible $N$-grams is on the order of $V^N$, where $V$ is the vocabulary size, so taking into account a context of more than a handful of words would require very large training corpora. $N$-grams treat each word as an atomic unit, so they cannot generalize across semantically related sequences of words, whereas neural language models can because they associate each word with a vector of real valued features, and semantically related words end up close to each other in that vector space (Fig. 4).
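
The counting approach and the $V^N$ blow-up can be sketched in a few lines. The toy sentence, the helper names and the vocabulary size of 10,000 are assumptions chosen for illustration only.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def possible_ngrams(vocab_size, n):
    """Upper bound V**n on the number of distinct n-grams."""
    return vocab_size ** n

tokens = "the cat sat on the mat the cat ran".split()
bigrams = ngram_counts(tokens, 2)

print(bigrams[("the", "cat")])   # seen twice in the toy text
print(bigrams[("the", "dog")])   # never seen, so the count is 0
print(possible_ngrams(10_000, 3))
```

The zero count for the unseen pair shows the limitation described above: because each word is an atomic symbol, evidence about "cat" says nothing about "dog", however similar the two words are in meaning.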

分布式表示和語言處理

深度學習理論表明，與不使用分布式表示的經典學習算法相比，深網具有兩個不同的指數優勢。這兩個優點都來自于組合的力量，并取決于具有適當組件結構的底層數據生成分布。首先，學習分布式表示可以泛化到訓練期間未見過的、已學習特征值的新組合（例如使用 $n$ 個二進制特征可以有 $2^n$ 種組合）。其次，在一個深層網絡中構成表示層會帶來另一個指數優勢（相對于深度呈指數級）。

多層神經網絡的隱藏層學習以易于預測目標輸出的方式來表示網絡的輸入。通過訓練多層神經網絡從較早單詞的局部上下文中預測序列中的下一個單詞，可以很好地證明這一點。上下文中的每個單詞都以 one-of-N 向量的形式呈現給網絡，也就是說，一個分量的值為1，其余均為0。在第一層中，每個單詞都會創建不同的激活模式，即詞向量（圖4）。在語言模型中，網絡的其他層學習將輸入的單詞矢量轉換為預測的下一個單詞的輸出單詞矢量，這可用于預測詞匯表中任何單詞出現為下一個單詞的概率。網絡學習包含許多有效成分的單詞向量，每個成分都可以解釋為單詞的一個獨立特征，如在學習符號的分布式表示形式時首先證明的那樣。這些語義特征未在輸入中明確顯示。它們是由學習過程發現的，是將輸入和輸出符號之間的結構化關系分解為多個“微規則”的好方法。當單詞序列來自大量的真實文本并且單個微規則不可靠時，學習單詞向量也可以很好地工作。例如，在訓練以預測新聞故事中的下一個單詞時，學到的周二和周三的單詞向量非常相似，瑞典和挪威的單詞向量也非常相似。這樣的表示稱為分布式表示，因為它們的元素（特征）不是互斥的，并且它們的許多配置對應于在觀察到的數據中看到的變化。這些詞向量由專家事先未確定但由神經網絡自動發現的學習特征組成。從文本中學到的單詞的矢量表示現在已在自然語言應用中得到廣泛使用。

表示問題是邏輯啟發和神經網絡啟發的認知范式之間爭論的核心。在邏輯啟發范式中，符號實例是某些事物，其唯一屬性是它與其他符號實例相同或不同。它沒有與其使用相關的內部結構；為了用符號進行推理，必須將它們綁定到明智選擇的推理規則中的變量。相比之下，神經網絡僅使用較大的活動矢量，較大的權重矩陣和標量非線性來執行快速的“直覺”推斷類型，從而支持毫不費力的常識推理。

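在引入神經語言模型之前，語言統計建模的標準方法并未利用分布式表示形式：它是基于對長度不超過 $N$ 的短符號序列（稱為 $N$-grams）的出現頻率進行計數。可能的 $N$-grams 的數量在 $V^N$ 的數量級上，其中 $V$ 是詞匯量，因此若要考慮超過少數幾個單詞的上下文，將需要非常大的訓練語料庫。 $N$-grams 將每個單詞視為一個原子單元，因此它們無法在語義上相關的單詞序列之間進行泛化，而神經語言模型則可以，因為它們將每個單詞與一個實值特征向量相關聯，語義相關的單詞最終在該向量空間中彼此靠近（圖4）。
Note on "Distributed representations and language processing": deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations. Both come from the power of composition and depend on the data-generating distribution having an appropriate componential structure. The section then describes distributed representations and how their applications have developed.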

Recurrent neural networks

When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs). For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs (Fig. 5). RNNs process an input sequence one element at a time, maintaining in their hidden units a ‘state vector’ that implicitly contains information about the history of all the past elements of the sequence. When we consider the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network (Fig. 5, right), it becomes clear how we can apply backpropagation to train RNNs.

RNNs are very powerful dynamic systems, but training them has proved to be problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish.
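
The exploding and vanishing behaviour is easiest to see in the simplest possible case. The sketch below assumes a scalar linear recurrence $h_t = w\,h_{t-1} + x_t$, which is a simplification chosen for illustration and not the general RNN discussed in the text; for it, the gradient of the final state with respect to the initial state is exactly $w^T$ after $T$ steps.

```python
def gradient_through_time(w, steps):
    """d h_T / d h_0 for the scalar linear recurrence h_t = w * h_(t-1) + x_t."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # each time step multiplies the backpropagated gradient by w
    return grad

for w in (0.5, 1.0, 1.5):
    print(w, gradient_through_time(w, 50))
```

With a recurrent weight below 1 the gradient shrinks towards zero, above 1 it grows without bound, and only a weight of exactly 1 preserves it over many steps.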

Thanks to advances in their architecture and ways of training them, RNNs have been found to be very good at predicting the next character in the text or the next word in a sequence, but they can also be used for more complex tasks. For example, after reading an English sentence one word at a time, an English ‘encoder’ network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This thought vector can then be used as the initial hidden state of (or as extra input to) a jointly trained French ‘decoder’ network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and provided as input to the decoder network it will then output a probability distribution for the second word of the translation and so on until a full stop is chosen. Overall, this process generates sequences of French words according to a probability distribution that depends on the English sentence. This rather naive way of performing machine translation has quickly become competitive with the state-of-the-art, and this raises serious doubts about whether understanding a sentence requires anything like the internal symbolic expressions that are manipulated by using inference rules. It is more compatible with the view that everyday reasoning involves many simultaneous analogies that each contribute plausibility to a conclusion.

Instead of translating the meaning of a French sentence into an English sentence, one can learn to ‘translate’ the meaning of an image into an English sentence (Fig. 3). The encoder here is a deep ConvNet that converts the pixels into an activity vector in its last hidden layer. The decoder is an RNN similar to the ones used for machine translation and neural language modelling. There has been a surge of interest in such systems recently.

RNNs, once unfolded in time (Fig. 5), can be seen as very deep feedforward networks in which all the layers share the same weights. Although their main purpose is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult to learn to store information for very long.

To correct for that, one idea is to augment the network with an explicit memory. The first proposal of this kind is the long short-term memory (LSTM) networks that use special hidden units, the natural behaviour of which is to remember inputs for a long time. A special unit called the memory cell acts like an accumulator or a gated leaky neuron: it has a connection to itself at the next time step that has a weight of one, so it copies its own real-valued state and accumulates the external signal, but this self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory.
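
The accumulator behaviour of the memory cell can be sketched as follows. This is a deliberately reduced illustration: the gate values are supplied by hand, whereas in an LSTM they are produced by learned units, and a full LSTM has additional gates and non-linearities that are omitted here. The function name `run_memory_cell` is an assumption.

```python
def run_memory_cell(inputs, keep_gates):
    """Gated accumulator: state_t = keep_t * state_(t-1) + input_t.

    keep = 1 copies the previous state through the weight-one self-connection;
    keep = 0 clears the stored content before the new input is added.
    """
    state = 0.0
    history = []
    for x, keep in zip(inputs, keep_gates):
        state = keep * state + x
        history.append(state)
    return history

# The third step closes the gate, so the accumulated 3.0 is discarded.
print(run_memory_cell([1.0, 2.0, 0.5, 4.0], [1.0, 1.0, 0.0, 1.0]))
```

Because the self-connection has weight one while the gate is open, the stored value passes through unchanged from step to step, which is what lets the cell remember inputs for a long time.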

LSTM networks have subsequently proved to be more effective than conventional RNNs, especially when they have several layers for each time step, enabling an entire speech recognition system that goes all the way from acoustics to the sequence of characters in the transcription. LSTM networks or related forms of gated units are also currently used for the encoder and decoder networks that perform so well at machine translation.

Over the past year, several authors have made different proposals to augment RNNs with a memory module. Proposals include the Neural Turing Machine in which the network is augmented by a ‘tape-like’ memory that the RNN can choose to read from or write to, and memory networks, in which a regular network is augmented by a kind of associative memory. Memory networks have yielded excellent performance on standard question-answering benchmarks. The memory is used to remember the story about which the network is later asked to answer questions.

Beyond simple memorization, neural Turing machines and memory networks are being used for tasks that would normally require reasoning and symbol manipulation. Neural Turing machines can be taught ‘algorithms’. Among other things, they can learn to output a sorted list of symbols when their input consists of an unsorted sequence in which each symbol is accompanied by a real value that indicates its priority in the list. Memory networks can be trained to keep track of the state of the world in a setting similar to a text adventure game and after reading a story, they can answer questions that require complex inference [90]. In one test example, the network is shown a 15-sentence version of The Lord of the Rings and correctly answers questions such as “where is Frodo now?”.

遞歸神經網絡

首次引入反向傳播時，其最令人興奮的用途是訓練循環神經網絡（RNN）。對于涉及順序輸入的任務，例如語音和語言，通常最好使用RNN（圖5）。 RNN一次處理一個輸入序列的一個元素，在其隱藏的單元中維護一個“狀態向量”，該“狀態向量”隱式包含有關該序列的所有過去元素的歷史信息。當我們將隱藏單位在不同離散時間步長的輸出視為是深層多層網絡中不同神經元的輸出時（圖5右），顯然我們可以如何應用反向傳播來訓練RNN。

RNN是非常強大的動態系統，但是事實證明，訓練它們是有問題的，因為反向傳播的梯度在每個時間步長都會增大或縮小，因此在許多時間步長上它們通常會爆炸或消失。

由于其結構和訓練方法的進步，人們發現RNN非常擅長預測文本中的下一個字符或序列中的下一個單詞，但它們也可以用于更復雜的任務。例如，一次讀一個單詞的英語句子后，可以訓練英語的“編碼器”網絡，使其隱藏單元的最終狀態向量很好地表示了該句子表達的思想。然后，可以將此思想向量用作聯合訓練的法語“解碼器”網絡的初始隱藏狀態（或作為其額外輸入），該網絡將輸出法語翻譯的第一個單詞的概率分布。如果從該分布中選擇了一個特定的第一個單詞，并將其作為輸入提供給解碼器網絡，則它將輸出翻譯的第二個單詞的概率分布，依此類推，直到選擇了句號。總體而言，此過程根據取決于英語句子的概率分布生成法語單詞序列。這種相當幼稚的執行機器翻譯的方式已迅速變得可與最新技術相競爭，這引起了人們對理解句子是否需要諸如通過使用推理規則操縱的內部符號表達式之類的東西的嚴重質疑。這與如下觀點更加兼容：日常推理涉及許多同時進行的類比，每個類比都為結論提供了合理性。

與其將法語句子的含義翻譯成英語句子，不如學習將圖像的含義“翻譯”成英語句子（圖3）。這里的編碼器是一個深層的ConvNet，可將像素轉換為其最后一個隱藏層中的活動矢量。解碼器是一個RNN，類似于用于機器翻譯和神經語言建模的RNN。近年來，對此類系統的興趣激增。

RNNs隨時間展開（圖5），可以看作是非常深的前饋網絡，其中所有層共享相同的權重。盡管它們的主要目的是學習長期依賴關系，但理論和經驗證據表明，很難長期存儲信息。

為了解決這個問題，一個想法是用顯式內存擴展網絡。此類第一種建議是使用特殊隱藏單元的長短期記憶（LSTM）網絡，其自然行為是長時間記住輸入。稱為存儲單元的特殊單元的作用類似于累加器或門控泄漏神經元：它在下一時間步與其自身具有連接，其權重為1，因此它復制自己的實值狀態并累積外部信號，但是此自連接是由另一個單元乘法控制的，該單元學會確定何時清除內存內容。

LSTM網絡隨后被證明比常規RNN更有效，特別是當它們在每個時間步都有多層時，使整個語音識別系統從聲學到轉錄中的字符序列都一路走來。LSTM網絡或相關形式的門控單元目前也用于編碼器和解碼器網絡，它們在機器翻譯方面表現出色。

在過去的一年中，幾位作者提出了不同的建議，以使用內存模塊擴展RNN。建議包括神經圖靈機，其中網絡由RNN可以選擇讀取或寫入的“像帶”存儲器來增強，以及存儲網絡，其中常規網絡由一種關聯性存儲器來增強。內存網絡在標準問答基準方面已表現出出色的性能。存儲器用于記住故事，有關該故事后來被要求網絡回答問題。

除了簡單的記憶外，神經圖靈機和存儲網絡還用于執行通常需要推理和符號操作的任務。可以向神經圖靈機教授“算法”。除其他事項外，當輸入由未排序的序列組成，且其中每個符號都帶有一個指示其在列表中優先級的實數值時，它們可以學習輸出已排序的符號列表。可以訓練記憶網絡，使其在類似于文字冒險游戲的環境中跟蹤世界狀況，閱讀故事后，它們可以回答需要復雜推理的問題。在一個測試示例中，向該網絡展示了15句話版本的《指環王》，它正確回答了諸如“Frodo現在在哪里？”之類的問題。

Note on "Recurrent neural networks": sections 6 and 7 both cover the development of deep learning in text and language processing.

The future of deep learning

Unsupervised learning had a catalytic effect in reviving interest in deep learning, but has since been overshadowed by the successes of purely supervised learning. Although we have not focused on it in this Review, we expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.

Human vision is an active process that sequentially samples the optic array in an intelligent, task-specific way using a small, high-resolution fovea with a large, low-resolution surround. We expect much of the future progress in vision to come from systems that are trained end-to-end and combine ConvNets with RNNs that use reinforcement learning to decide where to look. Systems combining deep learning and reinforcement learning are in their infancy, but they already outperform passive vision systems [99] at classification tasks and produce impressive results in learning to play many different video games.

Natural language understanding is another area in which deep learning is poised to make a large impact over the next few years. We expect systems that use RNNs to understand sentences or whole documents will become much better when they learn strategies for selectively attending to one part at a time.

Ultimately, major progress in artificial intelligence will come about through systems that combine representation learning with complex reasoning. Although deep learning and simple reasoning have been used for speech and handwriting recognition for a long time, new paradigms are needed to replace rule-based manipulation of symbolic expressions by operations on large vectors.

深度學習的未來

無監督學習在恢復對深度學習的興趣方面起了催化作用，但此后被監督學習的成功所掩蓋。盡管我們在本評論中并未對此進行關注，但我們希望從長遠來看，無監督學習將變得越來越重要。人類和動物的學習在很大程度上不受監督：我們通過觀察來發現世界的結構，而不是通過告知每個物體的名稱來發現世界的結構。

人的視覺是一個主動的過程，它使用帶有大范圍低分辨率周邊的小型高分辨率中央凹，以智能的、針對特定任務的方式對光學陣列進行順序采樣。我們期望在視覺上未來的許多進步都將來自端到端訓練的系統，并將ConvNets與RNN結合起來，后者使用強化學習來決定在哪里看。結合了深度學習和強化學習的系統尚處于起步階段，但在分類任務上它們已經超過了被動視覺系統，并且在學習玩許多不同的視頻游戲方面產生了令人印象深刻的結果。

自然語言理解是深度學習必將在未來幾年產生巨大影響的另一個領域。我們希望使用RNN理解句子或整個文檔的系統在學習一次選擇性地關注一部分的策略時會變得更好。

最終，人工智能的重大進步將通過將表示學習與復雜推理相結合的系統來實現。盡管長期以來，深度學習和簡單推理已被用于語音和手寫識別，但仍需要新的范例來通過對大向量進行運算來代替基于規則的符號表達操縱。

Note on "The future of deep learning": this section mainly offers a very good outlook on the development of unsupervised learning.
