SWAP: Softmax-Weighted Average Pooling
Blake Elias is a Researcher at the New England Complex Systems Institute. Shawn Jain is an AI Resident at Microsoft Research.
Our method, softmax-weighted average pooling (SWAP), applies average-pooling, but re-weights the inputs by the softmax of each window.
We present a pooling method for convolutional neural networks as an alternative to max-pooling or average pooling. Our method, softmax-weighted average pooling (SWAP), applies average-pooling, but re-weights the inputs by the softmax of each window. While the forward-pass values are nearly identical to those of max-pooling, SWAP’s backward pass has the property that all elements in the window receive a gradient update, rather than just the maximum one. We hypothesize that these richer, more accurate gradients can improve the learning dynamics. Here, we instantiate this idea and investigate learning behavior on the CIFAR-10 dataset. We find that SWAP neither allows us to increase learning rate nor yields improved model performance.
Origins
While watching James Martens’ lecture on optimization, from DeepMind / UCL’s Deep Learning course, we noted his point that as learning progresses, you must either lower the learning rate or increase batch size to ensure convergence. Either of these techniques results in a more accurate estimate of the gradient. This got us thinking about the need for accurate gradients. Separately, we had been doing an in-depth review of how backpropagation computes gradients for all types of layers. In doing this exercise for convolution and pooling, we noted that max-pooling only computes a gradient with respect to the maximum value in a window. This discards information — how can we make this better? Could we get a more accurate estimate of the gradient by using all the information?
Max-pooling discards gradient information — how can we make this better?
Further Background
Max-Pooling is typically used in CNNs for vision tasks as a downsampling method. For example, AlexNet used 3x3 Max-Pooling. [cite]
In vision applications, max-pooling takes a feature map as input, and outputs a smaller feature map. If the input image is 4x4, a 2x2 max-pooling operator with a stride of 2 (no overlap) will output a 2x2 feature map. The 2x2 kernel of the max-pooling operator has 2x2 non-overlapping ‘positions’ on the input feature map. For each position, the maximum value in the 2x2 window is selected as the value in the output feature map. The other values are discarded.
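A minimal sketch of this operation, using PyTorch's `max_pool2d` (the input values here are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

# A 4x4 input feature map, reshaped to PyTorch's (batch, channel, H, W) layout.
x = torch.tensor([[ 1.,  2.,  5.,  6.],
                  [ 3.,  4.,  7.,  8.],
                  [ 9., 10., 13., 14.],
                  [11., 12., 15., 16.]]).reshape(1, 1, 4, 4)

# 2x2 max-pooling with stride 2: each non-overlapping window keeps only its maximum.
out = F.max_pool2d(x, kernel_size=2, stride=2)
print(out.reshape(2, 2))
# tensor([[ 4.,  8.],
#         [12., 16.]])
```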
The implicit assumption is that “bigger values are better,” i.e., larger values are more important to the final output. This modelling decision is motivated by our intuition, although it may not be absolutely correct. [Ed.: Maybe the other values matter as well! In a near-tie situation, propagating gradients to the second-largest value could make it the largest value, which may change the trajectory the model takes as it learns. Updating the second-largest value as well could be the better learning trajectory to follow.]
You might be wondering, is this differentiable? After all, deep learning requires that all operations in the model be differentiable, in order to compute gradients. In the purely mathematical sense, this is not a differentiable operation. In practice, in the backward pass, all positions corresponding to the maximum simply copy the inbound gradients; all the non-maximum positions simply set their gradients to zero. PyTorch implements this as a custom CUDA kernel (this function invokes this function).
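This behavior is easy to check empirically. A small sketch with PyTorch's autograd (illustrative values of our own choosing):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 2.],
                  [3., 4.]], requires_grad=True)

out = F.max_pool2d(x.reshape(1, 1, 2, 2), kernel_size=2)
out.sum().backward()  # send an inbound gradient of 1 into the pooling output

# Only the position of the max (the 4) copies the inbound gradient;
# every non-max position gets zero.
print(x.grad)
# tensor([[0., 0.],
#         [0., 1.]])
```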
In other words, Max-Pooling generates sparse gradients. And it works! From AlexNet [cite] to ResNet [cite] to Reinforcement Learning [cite cite], it’s widely used.
Many variants have been developed. Average-Pooling outputs the average, instead of the max, over the window. Dilated Max-Pooling makes the window non-contiguous; instead, it uses a checkerboard-like pattern.
[Image source: arXiv (via StackOverflow).]

Controversially, Geoff Hinton doesn’t like Max-Pooling:
The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.
If the pools do not overlap, pooling loses valuable information about where things are. We need this information to detect precise relationships between the parts of an object. Its [sic] true that if the pools overlap enough, the positions of features will be accurately preserved by “coarse coding” (see my paper on “distributed representations” in 1986 for an explanation of this effect). But I no longer believe that coarse coding is the best way to represent the poses of objects relative to the viewer (by pose I mean position, orientation, and scale).
[Source: Geoff Hinton on Reddit.]
Motivation
Max-Pooling generates sparse gradients. With better gradient estimates, could we take larger steps by increasing learning rate, and therefore converge faster?
Sparse gradients discard too much information. With better gradient estimates, could we take larger steps by increasing learning rate, and therefore converge faster?
Although the outbound gradients generated by Max-Pool are sparse, this operation is typically used in a Conv → Max-Pool chain of operations. Notice that the trainable parameters (i.e., the filter values, F) are all in the Conv operator. Note also, that:
dL/dF = Conv(X, dL/dO), where:
dL/dF are the gradients with respect to the convolutional filter
dL/dO is the outbound gradient from Max-Pool, and
X is the input to Conv (forward).
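This identity can be checked numerically. A sketch in PyTorch (single channel, random input, our own construction rather than the authors' code; the sparse dL/dO below mimics what Max-Pool would send back):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 4, 4)                       # X: the input to Conv
w = torch.randn(1, 1, 2, 2, requires_grad=True)   # F: the trainable filter
o = F.conv2d(x, w)                                # forward pass; output is 3x3

dL_dO = torch.zeros_like(o)
dL_dO[0, 0, 1, 1] = 1.0       # a sparse outbound gradient, as Max-Pool would produce
o.backward(dL_dO)

# Autograd's dL/dF matches Conv(X, dL/dO): correlating the input
# with the outbound gradient, treated as a 3x3 "filter".
manual = F.conv2d(x, dL_dO)
print(torch.allclose(w.grad, manual))  # True
print((w.grad != 0).all())             # every filter position receives a gradient
```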
As a result, all positions in the convolutional filter F get gradients. However, those gradients are computed from a sparse matrix dL/dO instead of a dense matrix. (The degree of sparsity depends on the Max-Pool window size.)
Forward:
Backward:
Figure 3: Max pooling generates sparse gradients. (Authors’ image)

Note also that dL/dF is not sparse, as each non-zero entry of dL/dO sends a gradient value back to all entries of dL/dF.
But this raises a question. While dL/dF is not sparse itself, its entries are calculated based on an averaging of sparse inputs. If its input (dL/dO, the outbound gradient of Max-Pool) were dense, could dL/dF be a better estimate of the true gradient? How can we make dL/dO dense while still retaining the “bigger values are better” assumption of Max-Pool?
One solution is Average-Pooling. There, all activations pass a gradient backwards, rather than just the max in each window. However, it violates Max-Pool’s assumption that “bigger values are better.”
Enter Softmax-Weighted Average-Pooling (SWAP). The forward pass is best explained as pseudo-code:
average_pool(O, weights=softmax_per_window(O))
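The pseudo-code above can be sketched concretely in PyTorch, using `unfold` to expose each pooling window (this is our own implementation, not the authors' released code; the 2x2/stride-2 configuration is an assumption):

```python
import torch
import torch.nn.functional as F

def swap_pool(x, kernel_size=2, stride=2):
    """Softmax-weighted average pooling over non-overlapping windows."""
    n, c, h, w = x.shape
    # Lay out each pooling window as a column of kernel_size**2 values.
    cols = F.unfold(x, kernel_size, stride=stride)       # (n, c*k*k, num_windows)
    cols = cols.reshape(n, c, kernel_size * kernel_size, -1)
    weights = torch.softmax(cols, dim=2)                 # softmax within each window
    out = (cols * weights).sum(dim=2)                    # weighted average per window
    return out.reshape(n, c, h // stride, w // stride)

x = torch.tensor([[0., 10.],
                  [0.,  0.]]).reshape(1, 1, 2, 2)
print(swap_pool(x))   # ≈ 9.9986: nearly identical to max-pool's 10
```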
Figure 4: SWAP produces a value almost the same as max-pooling, but passes gradients back to all entries in the window. (Authors’ image)

The softmax operator normalizes the values into a probability distribution; however, it heavily favors large values. This gives it a max-pool-like effect.
On the backward pass, dL/dO is dense, because each outbound activation in A depends on all activations in its window — not just the max value. Non-max values in O now receive relatively small, but non-zero, gradients. Bingo!
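The contrast with max-pool's backward pass can be seen on a single 2x2 window (an illustrative sketch of our own; the softmax-weighted sum below is SWAP restricted to one window):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 2.],
                  [3., 4.]])

# Max-pool: only the max position receives a gradient.
a = x.clone().requires_grad_()
F.max_pool2d(a.reshape(1, 1, 2, 2), kernel_size=2).sum().backward()

# SWAP on the same window: a softmax-weighted average of all four values.
b = x.clone().requires_grad_()
vals = b.reshape(-1)
(torch.softmax(vals, dim=0) * vals).sum().backward()

print(a.grad)  # sparse: zeros everywhere except at the 4
print(b.grad)  # dense: every entry receives a small but non-zero gradient
```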
Experimental Setup
We conducted our experiments on CIFAR10. Our code is available here. We fixed the architecture of the network to:
We tested three different variants of the “Pool” layer: two baselines (Max-Pool and Average-Pool), in addition to SWAP. Models were trained for 100 epochs using SGD, LR=1e-3 (unless otherwise mentioned).
We also trained SWAP with a {25, 50, 400}% increase in LR. This was to test the idea that, with more accurate gradients, we could take larger steps, and with larger steps the model would converge faster.
Results
Discussion
SWAP shows worse performance compared to both baselines. We do not understand why this is the case. An increase in LR provided no benefit; generally, worse performance vs. baseline was observed as LR increased. We attribute the 400% increase in LR performing better than the 50% increase to randomness; we tested with only a single random seed and reported only a single trial. Another possible explanation for the 400% increase performing better is simply the ability to “cover more ground” with a higher LR.
An increase in LR provided no benefit; generally, worse performance vs baseline was observed as LR increased.
Future Work and Conclusion
While SWAP did not show improvement, we still want to try several experiments:
Overlapping pool windows. One possibility is to use overlapping pool windows (i.e., stride = 1), rather than the disjoint windows we used here (with stride = 2). Modern convolutional architectures like AlexNet and ResNet both use overlapping pool windows. So, for a fair comparison, it would be sensible to compare with something closer to the state of the art, rather than the architecture we used here for simplicity. Indeed, Hinton’s critique of max-pooling is most stringent in the case of non-overlapping pool windows, with the reasoning that this throws out spatial information.
Histogram of activations. We would like to try Max-Pool & SWAP with the exact same initialization, train both, and compare the distributions of gradients. Investigating the difference in gradients may offer a better understanding of the stark contrast in training behavior.
Improving gradient accuracy is still an exciting area. How else can we modify the model or the gradient computation to improve gradient accuracy?
[Source: https://towardsdatascience.com/swap-softmax-weighted-average-pooling-70977a69791b]