I summarized the methods of 70 papers to help you thoroughly understand neural network pruning

Published: 2023/12/14

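Whether in computer vision, natural language processing or image generation, deep neural networks currently deliver state-of-the-art performance. However, their cost in computing power, memory or energy consumption can be prohibitive, to the point that most companies, with their limited hardware resources, simply cannot afford to train them. Yet many domains would benefit from neural networks, hence the need to reduce their cost while maintaining their performance.

This is the whole point of neural network compression. The field includes several families of methods, such as quantization [11], factorization [13] and distillation [32]. The focus of this article is pruning.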

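Neural network pruning revolves around the intuitive idea of removing the superfluous parts of a network that performs well but costs a lot of resources. Although large neural networks have proven their ability to learn countless times, it turns out that not all of their parts remain useful once training is over. The idea is to eliminate these superfluous parts without affecting the network's performance.

Unfortunately, the dozens (if not hundreds) of papers published every year reveal the hidden complexity of this supposedly straightforward idea. Indeed, a quick look at the literature shows that there are countless ways to identify these useless parts, or to remove them, before, during or after training. Above all, not all kinds of pruning actually speed up a neural network, which is supposed to be the whole point.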

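The goal of this article is to provide a solid footing for the questions that surround neural network pruning. We will review in turn three questions that seem to be at the core of the whole field: "What kind of part should I prune?", "How do I tell which parts can be pruned?" and "How do I prune parts without harming the network?". In short, we will detail pruning structures, pruning criteria and pruning methods.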

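1 - Pruning structures

1.1 - Unstructured pruning

When talking about the cost of a neural network, the number of parameters is surely one of the most widely used metrics, along with FLOPs (the number of floating-point operations). Networks that display an astronomical number of weights (GPT-3 has 175 billion parameters) are indeed intimidating. Pruning connections is in fact one of the most widespread paradigms in the literature, enough to be considered the default framework when dealing with pruning. The seminal work of Han et al. [26] presented this kind of pruning and served as the basis for many contributions [18, 21, 25].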

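Directly pruning parameters has many advantages. First, it is simple: replacing the value of a weight with zero in the parameter tensor is enough to prune a connection. Widespread deep learning frameworks such as PyTorch give easy access to all the parameters of a network, which makes it very simple to implement. Still, the greatest advantage of pruning connections is that they are the smallest, most fundamental elements of a network, and they are therefore numerous enough to be pruned in large quantities without affecting performance. Such a fine granularity allows pruning very subtle patterns, down to individual parameters within convolution kernels, for example. Because pruning weights is not bound by any constraint at all and is the finest-grained way to prune a network, this paradigm is called unstructured pruning.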

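However, this method has one major, fatal drawback: most frameworks and hardware cannot accelerate sparse matrix computations, meaning that no matter how many zeros you fill the parameter tensors with, the actual cost of the network does not change. What does change it is pruning in a way that directly alters the architecture of the network, which any framework can handle.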

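The difference between unstructured (left) and structured (right) pruning: structured pruning removes convolution filters and rows of kernels instead of just pruning connections. This leads to fewer feature maps in the intermediate representations.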
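To make the drawback of unstructured pruning concrete, here is a minimal sketch in plain Python. The helper names `dense_matvec` and `zero_out` are hypothetical and not part of any framework, and the sketch assumes an ordinary dense kernel that does not skip zeros. It counts the multiply-adds of a matrix-vector product before and after zeroing some weights.

```python
def dense_matvec(weights, x):
    """Dense matrix-vector product that also counts the multiply-adds it performs."""
    macs = 0
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            acc += w * xi  # executed even when w == 0.0
            macs += 1
        out.append(acc)
    return out, macs


def zero_out(weights, positions):
    """Return a copy of `weights` with the given (row, col) entries set to 0."""
    pruned = [list(row) for row in weights]
    for r, c in positions:
        pruned[r][c] = 0.0
    return pruned


weights = [[0.5, -0.01, 0.3], [0.02, 0.8, -0.03]]
x = [1.0, 2.0, 3.0]
pruned = zero_out(weights, [(0, 1), (1, 0), (1, 2)])

_, macs_dense = dense_matvec(weights, x)
_, macs_pruned = dense_matvec(pruned, x)
# macs_dense and macs_pruned are both 6: the zeros saved no computation,
# and the pruned matrix still has the same 2 x 3 shape in memory.
```

Half of the weights are gone, yet the storage layout and the amount of work are unchanged, which is why a dedicated sparse format or kernel is needed to benefit from this kind of pruning.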

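1.2 - Structured pruning

This is why many works focus on pruning larger structures, such as whole neurons [36] or, their direct equivalent in more modern deep convolutional networks, convolution filters [40, 41, 66]. Since large networks tend to include many convolution layers, each with up to hundreds or thousands of filters, filter pruning offers a granularity that is exploitable yet fine enough. Removing such structures not only yields sparse layers that can be directly instantiated as thinner ones, it also eliminates the feature maps that these filters output.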

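Such networks are therefore not only lighter to store, thanks to having fewer parameters, but they also require less computation and generate lighter intermediate representations, hence needing less memory at runtime. In fact, reducing bandwidth is sometimes more beneficial than reducing the parameter count. Indeed, for tasks that involve large images, such as semantic segmentation or object detection, intermediate representations may consume a prohibitive amount of memory, far more than the network itself. For these reasons, filter pruning is now regarded as the default kind of structured pruning.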

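However, a few aspects deserve attention when applying this kind of pruning. Consider how a convolution layer is built: for C_in input channels and C_out output channels, the layer is made of C_out filters, each containing C_in kernels; each filter outputs one feature map and, within each filter, one kernel is dedicated to each input channel. Given this architecture, pruning a whole filter removes the feature map it outputs, which in turn makes the corresponding kernels in the following layer useless, so they are pruned too. This means that, when pruning filters, one may actually prune up to twice the number of parameters initially thought to be removed.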
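The bookkeeping described above can be sketched as a small calculation. The function names are hypothetical, and the sketch assumes a plain chain of two convolution layers with no bias, no batch normalization and no residual connection; real architectures need more careful accounting.

```python
def conv_weight_count(c_in, c_out, k):
    """Weights of a k x k convolution: c_out filters, each with c_in kernels of k * k values."""
    return c_out * c_in * k * k


def removed_by_filter_pruning(n_filters, c_in, k, next_c_out, next_k):
    """Weights removed when n_filters filters are pruned from a layer that feeds another convolution."""
    # Each pruned filter holds c_in kernels in the current layer.
    in_current = n_filters * c_in * k * k
    # Each pruned feature map makes one kernel useless in every filter of the next layer.
    in_next = n_filters * next_c_out * next_k * next_k
    return in_current, in_next


# Two 3x3 layers with 64 -> 64 -> 64 channels, pruning 16 filters of the first one.
in_current, in_next = removed_by_filter_pruning(16, c_in=64, k=3, next_c_out=64, next_k=3)
total_before = 2 * conv_weight_count(64, 64, 3)
total_removed = in_current + in_next
```

With these made-up sizes, 9216 weights disappear from the pruned layer and another 9216 from the next one, so the real reduction is twice what a count restricted to the pruned layer would report.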

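Let us also consider what happens when a whole layer gets pruned (which tends to happen because of layer collapse [62], but does not always break the network, depending on the architecture): the outputs of the previous layer are now left completely unconnected and are therefore pruned too. Pruning a whole layer may thus prune all the preceding layers whose outputs are not connected elsewhere (through residual connections [28] or whole parallel paths [61]). Therefore, when pruning filters, one should compute the exact number of parameters actually pruned. Indeed, depending on how the filters are distributed in the architecture, pruning the same number of filters may not remove the same number of parameters, which makes results impossible to compare.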

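Before moving on, let us mention that, albeit a minority, some works focus on pruning convolution kernels, intra-kernel structures [2, 24, 46] or even specific parameter-wise structures. However, such structures need special implementations to yield any speedup (as is the case for unstructured pruning). Another kind of exploitable structure consists in pruning all but one parameter in each kernel, which turns the convolutions into "shift layers" that can be summed up as a combination of a shift operation and a 1×1 convolution [24].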

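The danger of structured pruning: altering the input and output dimensions of layers can lead to discrepancies. On the left, both layers output the same number of feature maps, which can then be summed without trouble; on the right, pruning produces intermediate representations of different dimensions that cannot be summed without processing them.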

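2 - Pruning criteria

Once the kind of structure to prune has been decided, the next question is: "Now, how do I figure out which structures to keep and which ones to prune?". Answering it requires a proper pruning criterion, which ranks the relative importance of the parameters, filters or other structures.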

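2.1 - Weight magnitude criterion

A very intuitive and surprisingly effective criterion is to prune the weights with the smallest absolute value (or "magnitude"). Indeed, under the constraint of weight decay, the weights that do not contribute significantly to the function are expected to see their magnitude shrink during training. The superfluous weights are therefore taken to be those of smallest absolute value [8]. Despite its simplicity, the magnitude criterion is still widely used in recent methods [21, 26, 58], making it a staple of the field.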
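As an illustration of the magnitude criterion, the following sketch builds a binary mask that drops a given fraction of a flat list of weights. The names `magnitude_mask` and `apply_mask` are hypothetical, and a flat Python list stands in for a parameter tensor.

```python
def magnitude_mask(weights, rate):
    """Binary mask that drops the `rate` fraction of weights with the smallest |w|."""
    n_prune = int(len(weights) * rate)
    # Indices sorted from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask


def apply_mask(weights, mask):
    """Zero out the weights whose mask entry is 0."""
    return [w * m for w, m in zip(weights, mask)]


weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
mask = magnitude_mask(weights, rate=0.5)
pruned = apply_mask(weights, mask)
# mask is [1, 0, 1, 0, 1, 0]: the three weights closest to zero are removed.
```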

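However, while this criterion seems trivial to implement for unstructured pruning, one may wonder how to adapt it to structured pruning. A straightforward approach is to rank filters by their norm (L1 or L2, for example) [40, 70]. Simple as this is, one may also want to encapsulate several sets of parameters in a single measure: for example a convolution filter, its bias and its batch-normalization parameters, or even corresponding filters in parallel layers whose outputs are then fused.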
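Ranking filters by their norm can be sketched as follows. The layout is an assumption made for the example: a layer is a list of filters, each filter a list of kernels, each kernel a flat list of floats. The function names are hypothetical.

```python
def filter_norm(filt, p=1):
    """L_p norm over all the weights of one filter (a list of kernels)."""
    total = sum(abs(w) ** p for kernel in filt for w in kernel)
    return total ** (1.0 / p)


def filters_to_prune(layer, n_prune, p=1):
    """Indices of the n_prune filters with the smallest L_p norm."""
    ranked = sorted(range(len(layer)), key=lambda i: filter_norm(layer[i], p))
    return sorted(ranked[:n_prune])


# Three filters, two input channels, 2 x 2 kernels stored flat.
layer = [
    [[0.5, -0.5, 0.5, -0.5], [0.1, 0.1, 0.1, 0.1]],
    [[0.01, 0.0, -0.02, 0.0], [0.0, 0.03, 0.0, 0.0]],
    [[0.2, 0.2, -0.2, 0.2], [0.0, 0.0, 0.1, 0.0]],
]
weakest = filters_to_prune(layer, n_prune=1)  # filter 1 has by far the smallest L1 norm
```

Passing `p=2` switches the ranking to the L2 norm; on this toy layer both norms agree on which filter is the weakest.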

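One way to do so, without having to compute a combined norm of these parameters, is to insert a learnable multiplicative parameter (a gate) for each feature map after each group of layers to prune. When this gate is reduced to zero, it effectively prunes the whole set of parameters responsible for that channel, and its magnitude accounts for the importance of all of them. The method then consists in pruning the gates of smallest magnitude [36, 41].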

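2.2 - Gradient magnitude criterion

Weight magnitude is not the only popular criterion (or family of criteria). Another major one that has lasted to this day is the magnitude of the gradient. Indeed, back in the 1980s, some fundamental works [37, 53] theorized, through a Taylor decomposition of the impact of removing a parameter on the loss, that metrics derived from the back-propagated gradient could be a good way to determine which parameters can be pruned without damaging the network.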

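More recent implementations of this criterion [4, 50] accumulate gradients over a mini-batch of training data and prune according to the product of this gradient with the corresponding weight of each parameter. The criterion can also be applied to the gates described above [49].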
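The gradient-times-weight score can be illustrated on a toy problem. The linear model y = w . x and the mean squared error used here are assumptions chosen so that the gradient can be written by hand; the cited works apply the same idea to deep networks through back-propagation. Function names are hypothetical.

```python
def accumulate_gradients(weights, batch):
    """Gradient of the mean squared error of y = w . x, accumulated over a mini-batch."""
    grads = [0.0] * len(weights)
    for x, y in batch:
        pred = sum(w * xi for w, xi in zip(weights, x))
        err = pred - y
        for j, xj in enumerate(x):
            grads[j] += 2.0 * err * xj / len(batch)
    return grads


def gradient_weight_scores(weights, grads):
    """First-order importance |w * dL/dw|: an estimate of the loss change if w is set to 0."""
    return [abs(w * g) for w, g in zip(weights, grads)]


weights = [1.0, 0.5, -2.0]
batch = [([1.0, 0.0, 1.0], 0.0), ([0.0, 1.0, 1.0], -1.0)]
grads = accumulate_gradients(weights, batch)
scores = gradient_weight_scores(weights, grads)
least_important = scores.index(min(scores))
```

Here the accumulated gradient is [-1.0, -0.5, -1.5] and the scores are [1.0, 0.25, 3.0], so the second weight would be pruned first.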

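2.3 - Global or local pruning

One last aspect to consider is whether the chosen criterion is applied globally to all the parameters or filters of the network, or computed independently for each layer. While global pruning has repeatedly been shown to yield better results, it can lead to layer collapse [62]. A simple way to avoid this problem is to resort to layer-wise local pruning, that is, pruning every layer at the same rate, when the method in use cannot prevent layer collapse.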

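The difference between local pruning (left) and global pruning (right): local pruning applies the same rate to each layer, whereas global pruning applies it to the whole network at once.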
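The contrast between the two strategies, and the risk of layer collapse, can be reproduced on a toy network. This is a sketch under simplifying assumptions: each layer is a flat list of weights, the criterion is weight magnitude, and the function names are hypothetical.

```python
def local_masks(layers, rate):
    """Prune the same fraction of smallest-magnitude weights inside every layer."""
    masks = []
    for layer in layers:
        n_prune = int(len(layer) * rate)
        order = sorted(range(len(layer)), key=lambda i: abs(layer[i]))
        dropped = set(order[:n_prune])
        masks.append([0 if i in dropped else 1 for i in range(len(layer))])
    return masks


def global_masks(layers, rate):
    """Prune the smallest-magnitude weights of the whole network, wherever they are."""
    entries = sorted(
        (abs(w), li, wi) for li, layer in enumerate(layers) for wi, w in enumerate(layer)
    )
    n_prune = int(len(entries) * rate)
    dropped = {(li, wi) for _, li, wi in entries[:n_prune]}
    return [
        [0 if (li, wi) in dropped else 1 for wi in range(len(layer))]
        for li, layer in enumerate(layers)
    ]


# The second layer happens to have much smaller weights than the first.
layers = [[0.9, -0.8, 0.7, 0.6], [0.05, -0.04, 0.03, 0.02]]
local = local_masks(layers, rate=0.5)
global_ = global_masks(layers, rate=0.5)
```

At a 50% rate, local pruning keeps two weights in each layer, while global pruning removes the second layer entirely: a small-scale instance of layer collapse.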

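3 - Pruning methods

Now that we have our pruning structure and criterion, the only thing left to settle is which method to use to prune a network. This is actually the most confusing topic in the literature, as each paper brings its own quirks and gimmicks, so much so that one may get lost between what is methodically relevant and what is just a specificity of a given paper.

This is why we will review, by theme, some of the most popular families of pruning methods, in a way that highlights how the use of sparsity during training has evolved.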

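3.1 - The classic framework: train, prune and fine-tune

The first basic framework to know is the train, prune and fine-tune method, which obviously involves 1) training the network, 2) pruning it by setting to 0 all the parameters targeted by the pruning structure and criterion (these parameters cannot recover afterwards) and 3) training the network for a few extra epochs at the lowest learning rate, to give it a chance to recover from the loss in performance induced by pruning. The last two steps are usually iterated, with the pruning rate increased each time.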

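The method of Han et al. [26] applies exactly this, with 5 iterations between pruning and fine-tuning, to perform weight pruning. Iterating has been shown to improve performance, at the cost of extra computation and training time. This simple framework is the basis of many methods [26, 40, 41, 50, 66] and can be seen as the default on which all other works build.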
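The three steps and their iteration can be sketched end to end on a toy problem. Everything here is an assumption made to keep the example self-contained: the model is a linear regression trained by gradient descent, the criterion is weight magnitude, and the learning rates and epoch counts are arbitrary. It is not the setup of Han et al. [26].

```python
def sgd(weights, mask, data, lr, epochs):
    """Gradient descent on the mean squared error of y = w . x; masked weights stay at 0."""
    w = [wi * m for wi, m in zip(weights, mask)]
    for _ in range(epochs):
        grads = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grads[j] += 2.0 * err * xj / len(data)
        w = [(wi - lr * g) * m for wi, g, m in zip(w, grads, mask)]
    return w


def prune_smallest(weights, mask, n_total_pruned):
    """Extend `mask` so that n_total_pruned weights are removed, smallest |w| first."""
    alive = sorted((i for i, m in enumerate(mask) if m), key=lambda i: abs(weights[i]))
    n_new = n_total_pruned - (len(mask) - len(alive))
    new_mask = list(mask)
    for i in alive[:max(n_new, 0)]:
        new_mask[i] = 0
    return new_mask


def train_prune_finetune(data, n_weights, pruned_counts, lr=0.1, fine_tune_lr=0.01):
    mask = [1] * n_weights
    w = sgd([0.0] * n_weights, mask, data, lr, epochs=200)   # 1) train
    for n_pruned in pruned_counts:                            # iterate 2) and 3)
        mask = prune_smallest(w, mask, n_pruned)              # 2) prune (never undone)
        w = sgd(w, mask, data, fine_tune_lr, epochs=50)       # 3) fine-tune at a low rate
    return w, mask


toy_data = [
    ([1.0, 0.0, 0.0, 0.0], 2.0),
    ([0.0, 1.0, 0.0, 0.0], 0.1),
    ([0.0, 0.0, 1.0, 0.0], -1.5),
    ([0.0, 0.0, 0.0, 1.0], 0.05),
]
final_w, final_mask = train_prune_finetune(toy_data, 4, pruned_counts=[1, 2])
```

The list `pruned_counts=[1, 2]` plays the role of an increasing pruning rate: one weight is removed in the first round and a second one in the next, with a fine-tuning phase after each.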

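3.2 - Extending the classic framework

Without straying too far from it, some methods bring significant modifications to the classic framework of Han et al. [26]. Gale et al. [21] push the principle of iterating further by progressively removing more and more weights throughout training, which brings the benefits of iterating while removing the fine-tuning phase entirely. He et al. [29] reduce the prunable filters to 0 at each epoch without preventing them from learning and being updated afterwards, which lets their weights grow back after pruning while enforcing sparsity during training.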
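Gradual removal needs a schedule that says how sparse the network should be at each training step. The polynomial ramp below is one common choice and is an assumption made for illustration; the text above does not specify which schedule [21] uses, and the function name and default exponent are hypothetical.

```python
def gradual_sparsity(step, total_steps, final_sparsity, initial_sparsity=0.0, power=3):
    """Sparsity target at `step`, ramping from initial_sparsity to final_sparsity."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Fast pruning early on, when the network is still very redundant, then slower.
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power


schedule = [gradual_sparsity(s, 100, final_sparsity=0.9) for s in range(0, 101, 25)]
```

At every step, a magnitude mask would be recomputed so that the fraction of zeroed weights matches the scheduled value, so that training and pruning happen in a single run.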

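Finally, the method of Renda et al. [58] involves fully retraining the network once it is pruned. Unlike fine-tuning, which is performed at the lowest learning rate, retraining follows the same learning-rate schedule as the original training, hence its name: "Learning-Rate Rewinding". This retraining has been shown to yield better performance than mere fine-tuning, at a significantly higher cost.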
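The difference between the two recovery strategies comes down to which learning rates the pruned network sees. The sketch below assumes a made-up step schedule (90 epochs, decay by 10 at epochs 30 and 60); these values and the function names are hypothetical and only serve to show the contrast.

```python
def step_schedule(epoch, base_lr=0.1, milestones=(30, 60), gamma=0.1):
    """Learning rate of the original training run at a given epoch."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr


def fine_tuning_lrs(n_epochs, total_train_epochs=90):
    """Fine-tuning: stay at the final, lowest learning rate of the original schedule."""
    final_lr = step_schedule(total_train_epochs - 1)
    return [final_lr] * n_epochs


def lr_rewinding_lrs(n_epochs, total_train_epochs=90):
    """Learning-rate rewinding: replay the last n_epochs of the original schedule."""
    start = total_train_epochs - n_epochs
    return [step_schedule(e) for e in range(start, total_train_epochs)]


fine_tune = fine_tuning_lrs(90)
rewind = lr_rewinding_lrs(90)  # full retraining: the whole schedule is replayed
```

With `n_epochs` equal to the length of the original run, rewinding starts again from the highest learning rate, which is where the extra cost mentioned above comes from.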

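3.3 - Pruning at initialization

To speed up training, avoid fine-tuning and prevent any alteration of the architecture during or after training, multiple works have focused on pruning before training. In the wake of SNIP [39], many works have studied the use of the criteria of Le Cun et al. [37] or of Mozer and Smolensky [53] to prune at initialization [12, 64], including intensive theoretical studies [27, 38, 62]. However, Optimal Brain Damage [37] relies on multiple approximations, including an "extremal" approximation that "assumes that parameter deletion will be performed after training has converged" [37]; this fact is rarely mentioned, even among the works that build on it. Some works have raised reservations about the ability of such methods to generate masks that are more relevant than random masks with a similar per-layer distribution [20].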

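Another family of methods that studies the relationship between pruning and initialization revolves around the "Lottery Ticket Hypothesis" [18]. This hypothesis states that "a randomly-initialized, dense neural network contains a subnetwork that is initialized such that, when trained in isolation, it can match the test accuracy of the original network after training for at most the same number of iterations". In practice, this literature studies how well a pruning mask, defined using an already converged network, works when applied to the network as it was at initialization. Multiple works have extended, stabilized or studied this hypothesis [14, 19, 45, 51, 69]. However, several works tend to question the validity of the hypothesis and of the method used to study it [21, 42], and some even suggest that its benefits come from the principle of fully training with the definitive mask rather than from a would-be "Winning Ticket" [58].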

Comparison between the classic "train, prune and fine-tune" framework [26], the lottery ticket experiment [18] and Learning-Rate Rewinding [58].
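The lottery ticket experiment, as described above, can be sketched in a few lines: the mask comes from the converged network, but it is applied to the weights as they were at initialization. The function `toy_train` is a hypothetical stand-in for a real training run, and the magnitude criterion is an assumption of this sketch.

```python
def toy_train(weights, mask):
    """Hypothetical stand-in for training: moves each unmasked weight toward a fixed target."""
    targets = [2.0, 0.0, -1.0, 0.1]
    return [(w + 0.9 * (t - w)) * m for w, t, m in zip(weights, targets, mask)]


def lottery_ticket(init_weights, train_fn, rate):
    """Train dense, derive a magnitude mask, then rewind the survivors to their initial values."""
    trained = train_fn(list(init_weights), [1] * len(init_weights))
    n_prune = int(len(trained) * rate)
    order = sorted(range(len(trained)), key=lambda i: abs(trained[i]))
    dropped = set(order[:n_prune])
    mask = [0 if i in dropped else 1 for i in range(len(trained))]
    ticket = [w * m for w, m in zip(init_weights, mask)]
    return ticket, mask


init = [0.3, -0.4, 0.2, 0.5]
ticket, mask = lottery_ticket(init, toy_train, rate=0.5)
retrained = toy_train(ticket, mask)  # the sparse subnetwork is then trained in isolation
```

Note that the mask is not the one magnitude pruning would give at initialization (which would drop the weights 0.2 and 0.3): it depends on where training took each weight.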

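3.4 - Sparse training

The methods mentioned above all seem to share one underlying theme: training under sparsity constraints. This principle is at the core of a family of methods called sparse training, which consists in enforcing a constant rate of sparsity during training while its distribution varies and is progressively adjusted. Introduced by Mocanu et al. [47], it involves: 1) initializing the network with a random mask that prunes a certain proportion of it, 2) training this pruned network for one epoch, 3) pruning a certain amount of the weights of lowest magnitude and 4) regrowing the same amount of random weights.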

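This way, the pruning mask, random at first, is progressively adjusted to target the least important weights, while sparsity is enforced throughout training. The sparsity level can be the same for each layer [47] or global [52]. Other methods have extended sparse training by using a criterion to choose which weights to regrow instead of picking them at random [15, 17].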

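Sparse training periodically prunes and regrows different weights during training, which leads to an adjusted mask that should target only the relevant parameters.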
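Steps 3) and 4) of sparse training, the prune-and-regrow update applied between epochs, can be sketched as below. Several details are assumptions of this sketch rather than facts from the cited works: regrowth only picks positions that were inactive before the update, regrown weights restart from a small random value, and the function name is hypothetical.

```python
import random


def prune_and_regrow(weights, mask, fraction, rng):
    """Drop the weakest active weights, then re-activate as many inactive positions at random."""
    active = [i for i, m in enumerate(mask) if m]
    inactive_before = [i for i, m in enumerate(mask) if not m]
    n_swap = min(int(len(active) * fraction), len(inactive_before))

    new_weights, new_mask = list(weights), list(mask)
    # 3) prune the active weights of lowest magnitude
    for i in sorted(active, key=lambda i: abs(weights[i]))[:n_swap]:
        new_mask[i] = 0
        new_weights[i] = 0.0
    # 4) regrow the same number of connections, chosen at random
    for i in rng.sample(inactive_before, n_swap):
        new_mask[i] = 1
        new_weights[i] = rng.uniform(-0.01, 0.01)
    return new_weights, new_mask


weights = [0.9, 0.0, -0.05, 0.0, 0.4, 0.0, -0.01, 0.7]
mask = [1, 0, 1, 0, 1, 0, 1, 1]
new_weights, new_mask = prune_and_regrow(weights, mask, fraction=0.4, rng=random.Random(0))
```

The number of active weights is the same before and after the update, so the sparsity rate stays constant while the mask itself moves. Replacing the random draw in step 4) with a criterion gives the variants mentioned above [15, 17].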

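3.5 - Mask learning

Instead of relying on an arbitrary criterion to prune or regrow weights, multiple methods focus on learning the pruning mask during training. Two kinds of approaches seem to prevail in this area: 1) mask learning through separate networks or layers and 2) mask learning through auxiliary parameters. Several strategies fit the first kind: training separate agents to prune as many filters of a layer as possible while maximizing accuracy [33], inserting attention-based layers [68] or using reinforcement learning [30]. The second kind treats pruning as an optimization problem that tends to minimize both the L0 norm of the network and its supervised loss.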

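Since the L0 norm is not differentiable, the various methods mostly work around the problem by penalizing auxiliary parameters that are multiplied with their corresponding parameters during the forward pass [59, 23]. Many methods [44, 60, 67] rely on an approach analogous to "Binary Connect" [11]: they apply stochastic gates to the parameters, whose values are each randomly drawn from a Bernoulli distribution with its own parameter p, which is learned using the "Straight Through Estimator" [3] or by other means [44].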
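The mechanism of stochastic gates can be sketched on a toy linear model. This is a heavily simplified illustration and not the formulation of any cited paper: the loss is a squared error plus a penalty on the sum of the gate probabilities (the expected number of open gates), and the straight-through estimator is taken in its simplest form, where the derivative of a sampled gate with respect to its probability is treated as 1. All names are hypothetical.

```python
import random


def sample_gates(probs, rng):
    """Draw one Bernoulli gate per weight: gate_i = 1 with probability probs[i]."""
    return [1.0 if rng.random() < p else 0.0 for p in probs]


def gate_gradients(weights, gates, x, y, l0_penalty):
    """Gradient of (pred - y)^2 + l0_penalty * sum(probs) w.r.t. probs, via straight-through."""
    pred = sum(w * g * xi for w, g, xi in zip(weights, gates, x))
    err = pred - y
    # d loss / d gate_i = 2 * err * w_i * x_i, passed unchanged to the probability.
    return [2.0 * err * w * xi + l0_penalty for w, xi in zip(weights, x)]


def update_probs(probs, grads, lr):
    """Gradient step on the gate probabilities, clipped to [0, 1]."""
    return [min(1.0, max(0.0, p - lr * g)) for p, g in zip(probs, grads)]


rng = random.Random(0)
weights, probs = [2.0, 0.5], [0.5, 0.5]
gates = sample_gates(probs, rng)
grads = gate_gradients(weights, gates, x=[1.0, 1.0], y=2.0, l0_penalty=0.1)
probs = update_probs(probs, grads, lr=0.1)
```

When the prediction is already correct, only the penalty term remains and every probability is pushed down; a gate survives only if closing it would hurt the supervised loss more than the penalty saves.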

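3.6 - Penalty-based methods

Rather than pruning connections manually or penalizing auxiliary parameters, many methods apply various kinds of penalties to the weights themselves to make them progressively shrink toward 0. This notion is actually quite old [57], since weight decay was already an essential ingredient of the weight magnitude criterion. Beyond plain weight decay, multiple works, even back then, focused on crafting penalties designed specifically to enforce sparsity [55, 65]. Today, various methods apply different regularizations on top of weight decay to further increase sparsity (typically using the L1 norm [41]).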
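To see why an L1 penalty produces exact zeros, one standard way to apply it is a proximal (soft-thresholding) step after each gradient update. This particular update rule is an assumption chosen for illustration; the works cited above may apply their penalties differently, and the function names are hypothetical.

```python
def soft_threshold(w, t):
    """Shrink w toward 0 by t, and snap it to exactly 0 if |w| <= t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def l1_proximal_step(weights, grads, lr, l1_strength):
    """Gradient step on the task loss followed by the proximal step of the L1 penalty."""
    return [soft_threshold(w - lr * g, lr * l1_strength) for w, g in zip(weights, grads)]


# With zero task gradient, only the penalty acts on the weights.
shrunk = l1_proximal_step([1.0, 0.05, -0.3], [0.0, 0.0, 0.0], lr=0.1, l1_strength=1.0)
```

Every weight loses the same amount of magnitude regardless of its size, so small weights reach exactly 0 while large ones are barely affected, unlike weight decay, which shrinks weights proportionally and never quite zeroes them.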

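Among recent approaches, multiple methods rely on LASSO [22, 31, 66] to prune weights or groups of weights. Other methods craft penalties that target weak connections, to widen the gap between the parameters to keep and those to prune so that removing the latter has less impact [7, 16]. Some methods show that targeting a subset of the weights with a penalty that grows throughout training can progressively prune them and make their removal seamless [6, 9, 63]. The literature also counts a whole range of methods built around the principle of "Variational Dropout" [34], a method based on variational inference [5] applied to deep learning [35]. Used as a pruning method [48], it gave rise to multiple works that adapt its principle to structured pruning [43, 54].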
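For pruning groups rather than single weights, the usual Group-LASSO penalty sums the L2 norms of the groups. The sketch below assumes one group per filter, stored as a flat list of floats; the function names and the pruning threshold are hypothetical.

```python
import math


def group_norm(group):
    """L2 norm of one group of weights (for instance, one filter)."""
    return math.sqrt(sum(w * w for w in group))


def group_lasso_penalty(groups, strength):
    """Penalty added to the loss: strength times the sum of the group L2 norms."""
    return strength * sum(group_norm(g) for g in groups)


def prunable_groups(groups, threshold):
    """Indices of the groups whose norm has been driven below `threshold`."""
    return [i for i, g in enumerate(groups) if group_norm(g) < threshold]


filters = [[3.0, 4.0], [0.001, -0.002]]
penalty = group_lasso_penalty(filters, strength=0.1)
to_remove = prunable_groups(filters, threshold=0.01)
```

Because the norm is taken per group and not squared, the penalty pushes whole groups toward zero together, which is what makes the result usable for structured pruning.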

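4 - Available frameworks

While most of these methods have to be implemented from scratch (or can be reused from the source code provided with each paper), the following frameworks either apply basic methods out of the box or make such implementations easier.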

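4.1 - PyTorch

PyTorch [56] provides a few basic pruning methods, such as global or local pruning, whether structured or unstructured. Structured pruning can be applied along any dimension of the weight tensors, which allows pruning filters, rows of kernels or even some rows and columns inside kernels. These built-in basic methods also allow pruning at random or according to various norms.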

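4.2 - TensorFlow

The Keras [10] library of TensorFlow [1] provides some basic tools to prune the weights of lowest magnitude. As in the work of Han et al. [25], the efficiency of pruning is measured by how much the redundancy introduced by all the inserted zeros helps compress the model (which combines well with quantization).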

4.3 - ShrinkBench

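Blalock et al. [4] provide a custom library in their work, to help the community normalize how pruning algorithms are compared. Built on PyTorch, ShrinkBench aims to make pruning methods easier to implement while normalizing the conditions under which they are trained and tested. It provides several baselines, such as random pruning, global or layer-wise pruning, and weight-magnitude or gradient-magnitude pruning.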

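5 - A brief recap of the methods

Many different papers were cited in this article. Here is a simple table that roughly summarizes what they do and what sets them apart (the dates given are those of first publication):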

Article	Date	Structure	Criterion	Method	Remark	Sources

Classic methods
Han et al.	2015	weights	weights magnitude	train, prune and fine-tune	prototypical pruning method	none
Gale et al.	2019	weights	weights magnitude	gradual removal	-	none
Renda et al.	2020	weights	weights magnitude	train, prune and re-train (“LR-Rewinding”)	-	yes
Li et al.	2016	filters	L1 norm of weights	train, prune and fine-tune	-	none
Molchanov et al.	2016	filters	gradient magnitude	train, prune and fine-tune	-	none
Liu et al.	2017	filters	magnitude of batchnorm parameters	train, prune and fine-tune	gates-based structured pruning	none
He et al.	2018	filters	L2 norm of weights	soft pruning	zeroes out filters without removal until the end	yes
Molchanov et al.	2019	filters	gradient magnitude	train, prune and fine-tune	inserts gates to prune filters	none
Pruning at initialization
Lee et al.	2018	weights	gradient magnitude	prune and train	“SNIP”	yes
Lee et al.	2019	weights	“dynamical isometry”	prune and train	dataless method	yes
Wang et al.	2020	weights	second-order derivative	prune and train	“GraSP”: like SNIP but with a criterion closer to that of Le Cun et al.	yes
Tanaka et al.	2020	weights	“synaptic flow”	prune and train	“SynFlow”: dataless method	yes
Frankle et al.	2018	weights	weights magnitude	train, rewind, prune and retrain	“lottery ticket”	none
Sparse training
Mocanu et al.	2018	weights	weights magnitude	sparse training	random regrowth of pruned weights	yes
Mostafa and Wang	2019	weights	weights magnitude	sparse training	like Mocanu et al. but global instead of layer-wise	none
Dettmers and Zettlemoyer	2019	weights	weights magnitude	sparse training	regrowth and layer-wise pruning rate depending on momentum	yes
Evci et al.	2019	weights	weights magnitude	sparse training	regrowth on gradient magnitude	yes
Mask learning
Huang et al.	2018	filters	N/A	train, prune and fine-tune	trains pruning agents that target filters to prune	none
He et al.	2018	filters	N/A	train, prune and fine-tune	uses reinforcement learning to target filters to prune	yes
Yamamoto and Maeno	2018	filters	N/A	train, prune and fine-tune	“PCAS”: uses attention modules to target filters to prune	none
Guo et al.	2016	weights	weight magnitude	mask learning	updates a mask depending on two different thresholds on the magnitude of weights	yes
Srinivas et al.	2016	weights	N/A	mask learning	similar to Binary Connect, applied to auxiliary parameters	none
Louizos et al.	2017	weights	N/A	mask learning	variant of Binary Connect, applied to auxiliary parameters, that avoids resorting to the Straight Through Estimator	yes
Xiao et al.	2019	weights	N/A	mask learning	like Binary Connect but alters the gradient propagated to the auxiliary parameters	none
Savarese et al.	2019	weights	N/A	mask learning	approximates L0 with a Heaviside function, which is itself approximated by a sigmoid of increasing temperature over auxiliary parameters	yes
Penalty-based methods
Wen et al.	2016	filters	N/A	Group-LASSO regularization	-	yes
He et al.	2017	filters	N/A	Group-LASSO regularization	also reconstructs the outputs of pruned layers by least squares	yes
Gao et al.	2019	filters	N/A	Group-LASSO regularization	prunes matching filters across layers and penalizes variance of weights	none
Chang and Sha	2018	weights	weight magnitude	global penalty	modifies the weight decay to make it induce more sparsity	none
Molchanov et al.	2017	weights	N/A	“Variational Dropout”	application of variational inference on pruning	none
Neklyudov et al.	2017	filters	N/A	“Variational Dropout”	structured version of variational dropout	yes
Louizos et al.	2017	filters	N/A	“Variational Dropout”	another structured version of variational dropout	none
Ding et al.	2018	filters	weight magnitude	targeted penalty	penalizes or stimulates filters depending on the distance of their L2 norm to a given threshold	none
Choi et al.	2018	weights	weight magnitude	targeted penalty	at each step penalizes weights of least magnitude by their L2 norm, with an importance that is learned throughout training	none
Carreira-Perpiñán and Idelbayev	2018	weights	weight magnitude	targeted penalty	defines a mask depending on weights of least magnitudes and penalizes them toward zero	none
Tessier et al.	2020	any	any (weight magnitude)	targeted penalty	at each step penalizes prunable weights or filters by their L2 norm, with an importance that grows exponentially throughout training	yes

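6 - Conclusion

In this quick overview of the literature, we saw that 1) pruning structures define what kind of gain to expect from pruning, 2) pruning criteria are based on various theoretical or practical justifications and 3) pruning methods tend to revolve around introducing sparsity during training to reconcile performance and cost. We also saw that, even though its founding works date back to the late 1980s, neural network pruning is a very dynamic field that is still experiencing fundamental discoveries and new basic concepts today.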

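Despite the daily contributions to the field, there still seems to be plenty of room for exploration and innovation. While each sub-family of methods can be seen as an attempt to answer a question ("How to regrow pruned weights?", "How to learn pruning masks through optimization?", "How to relax weight removal by softer means?"...), the evolution of the literature seems to point in one direction: sparsity throughout training. This direction raises many questions of its own, such as: "Do pruning criteria work well on networks that have not converged yet?" or "How to tell the benefit of choosing which weights to prune apart from that of training with any kind of sparsity from the start?"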

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC), 13(3):1–18, 2017.

[3] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

[4] Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of neural network pruning? arXiv preprint arXiv:2003.03033, 2020.

[5] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, 2017.

[6] Miguel A Carreira-Perpiñán and Yerlan Idelbayev. “learning-compression” algorithms for neural net pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8532–8541, 2018.

[7] Jing Chang and Jin Sha. Prune deep neural networks with the modified L1/2 penalty. IEEE Access, 7:2273–2280, 2018.

[8] Yves Chauvin. A back-propagation algorithm with optimal use of hidden units. In NIPS, volume 1, pages 519–526, 1988.

[9] Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. Compression of deep convolutional neural networks under joint sparsity constraints. arXiv preprint arXiv:1805.08303, 2018.

[10] Francois Chollet et al. Keras, 2015.

[11] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In NIPS, 2015.

[12] Pau de Jorge, Amartya Sanyal, Harkirat S Behl, Philip HS Torr, Gregory Rogez, and Puneet K Dokania. Progressive skeletonization: Trimming more fat from a network at initialization. arXiv preprint arXiv:2006.09081, 2020.

[13] Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In 28th Annual Conference on Neural Information Processing Systems 2014, NIPS 2014, pages 1269–1277. Neural information processing systems foundation, 2014.

[14] Shrey Desai, Hongyuan Zhan, and Ahmed Aly. Evaluating lottery tickets under distributional shifts. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pages 153–162, 2019.

[15] Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance. arXiv preprint arXiv:1907.04840, 2019.

[16] Xiaohan Ding, Guiguang Ding, Xiangxin Zhou, Yuchen Guo, Jungong Han, and Ji Liu. Global sparse momentum sgd for pruning very deep neural networks. arXiv preprint arXiv:1909.12778, 2019.

[17] Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning, pages 2943–2952. PMLR, 2020.

[18] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.

[19] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. Stabilizing the lottery ticket hypothesis. arXiv preprint arXiv:1903.01611, 2019.

[20] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. Pruning neural networks at initialization: Why are we missing the mark? arXiv preprint arXiv:2009.08576, 2020.

[21] Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.

[22] Susan Gao, Xin Liu, Lung-Sheng Chien, William Zhang, and Jose M Alvarez. Vacl: Variance-aware cross-layer regularization for pruning deep residual networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019.

[23] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In NIPS, 2016.

[24] Ghouthi Boukli Hacene, Carlos Lassance, Vincent Gripon, Matthieu Courbariaux, and Yoshua Bengio. Attention based pruning for shift networks. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 4054–4061. IEEE, 2021.

[25] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.

[26] Song Han, Jeff Pool, John Tran, and William J Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.

[27] Soufiane Hayou, Jean-Francois Ton, Arnaud Doucet, and Yee Whye Teh. Robust pruning at initialization.

[28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[29] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866, 2018.

[30] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pages 784–800, 2018.

[31] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.

[32] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. stat, 1050:9, 2015.

[33] Qiangui Huang, Kevin Zhou, Suya You, and Ulrich Neumann. Learning to prune filters in convolutional neural networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 709–718. IEEE, 2018.

[34] Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. stat, 1050:8, 2015.

[35] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. stat, 1050:1, 2014.

[36] John K Kruschke and Javier R Movellan. Benefits of gain: Speeded learning and minimal hidden layers in back-propagation networks. IEEE Transactions on systems, Man, and Cybernetics, 21(1):273–280, 1991.

[37] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems, pages 598–605, 1990.

[38] Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, and Philip HS Torr. A signal propagation perspective for pruning neural networks at initialization. In International Conference on Learning Representations, 2019.

[39] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip: Single-shot network pruning based on connection sensitivity. International Conference on Learning Representations, ICLR, 2019.

[40] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.

[41] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pages 2736–2744, 2017.

[42] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2018.

[43] C Louizos, K Ullrich, and M Welling. Bayesian compression for deep learning. In 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA., 2017.

[44] Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through l 0 regularization. arXiv preprint arXiv:1712.01312, 2017.

[45] Eran Malach, Gilad Yehudai, Shai Shalev-Schwartz, and Ohad Shamir. Proving the lottery ticket hypothesis: Pruning is all you need. In International Conference on Machine Learning, pages 6682–6691. PMLR, 2020.

[46] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J Dally. Exploring the regularity of sparse structure in convolutional neural networks. arXiv preprint arXiv:1705.08922, 2017.

[47] Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature communications, 9(1):1–12, 2018.

[48] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning, pages 2498–2507. PMLR, 2017.

[49] Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation for neural network pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11264–11272, 2019.

[50] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016.

[51] Ari S Morcos, Haonan Yu, Michela Paganini, and Yuandong Tian. One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. stat, 1050:6, 2019.

[52] Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning, pages 4646–4655. PMLR, 2019.

[53] Michael C Mozer and Paul Smolensky. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in neural information processing systems, pages 107–115, 1989.

[54] Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Structured bayesian pruning via log-normal multiplicative noise. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6778–6787, 2017.

[55] Steven J Nowlan and Geoffrey E Hinton. Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4):473–493, 1992.

[56] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.

[57] Russell Reed. Pruning algorithms-a survey. IEEE transactions on Neural Networks, 4(5):740–747, 1993.

[58] Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. arXiv preprint arXiv:2003.02389, 2020.

[59] Pedro Savarese, Hugo Silva, and Michael Maire. Winning the lottery with continuous sparsification. Advances in Neural Information Processing Systems, 33, 2020.

[60] Suraj Srinivas, Akshayvarun Subramanya, and R Venkatesh Babu. Training sparse neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 138–145, 2017.

[61] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.

[62] Hidenori Tanaka, Daniel Kunin, Daniel L Yamins, and Surya Ganguli. Pruning neural networks without any data by iteratively conserving synaptic flow. Advances in Neural Information Processing Systems, 33, 2020.

[63] Hugo Tessier, Vincent Gripon, Mathieu Léonardon, Matthieu Arzel, Thomas Hannagan, and David Bertrand. Rethinking weight decay for efficient neural network pruning. 2021.

[64] Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. In International Conference on Learning Representations, 2019.

[65] Andreas S Weigend, David E Rumelhart, and Bernardo A Huberman. Generalization by weight-elimination with application to forecasting. In Advances in neural information processing systems, pages 875–882, 1991.

[66] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.

[67] Xia Xiao, Zigeng Wang, and Sanguthevar Rajasekaran. Autoprune: Automatic network pruning by regularizing auxiliary parameters. Advances in neural information processing systems, 32, 2019.

[68] Kohei Yamamoto and Kurato Maeno. Pcas: Pruning channels with attention statistics for deep network compression. arXiv preprint arXiv:1806.05382, 2018.

[69] Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. Deconstructing lottery tickets: Zeros, signs, and the supermask. arXiv preprint arXiv:1905.01067, 2019.

[70] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jin-Hui Zhu. Discrimination-aware channel pruning for deep neural networks. In NeurIPS, 2018.

Author: Hugo Tessier
