RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.
Error message:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn’t able to locate the output tensors in the return value of your module’s forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
There can be many causes for this error. I will not dwell on the obvious fixes such as setting find_unused_parameters=True on torch.nn.parallel.DistributedDataParallel; the error message already spells those out clearly.
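For completeness, the flag mentioned in the error message is passed when wrapping the model. Below is a minimal sketch (not from the original post: the model, names, and sizes are made up, and the single-process "gloo" setup exists only so the snippet runs on a CPU):

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process "gloo" process group, just to make the sketch runnable on CPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

class TwoHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = nn.Linear(4, 4)
        self.head_a = nn.Linear(4, 2)
        self.head_b = nn.Linear(4, 2)  # never called in forward below

    def forward(self, x):
        return self.head_a(self.shared(x))  # head_b produces no output

# find_unused_parameters=True makes DDP traverse the autograd graph each
# iteration and mark head_b's parameters as ready, instead of erroring out.
model = DDP(TwoHead(), find_unused_parameters=True)
loss = model(torch.randn(3, 4)).sum()
loss.backward()  # completes even though head_b took no part in the loss
dist.destroy_process_group()
```

Note this flag adds a per-iteration graph scan, so it trades speed for convenience; if you know which outputs are unused, fixing the forward/loss is the better cure.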
If flipping a flag were enough to solve your problem, you would not have ended up on this post ^^.
One solution
The last part of the error message is actually worth a closer look:
If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
But the first time you hit this problem, the official hint alone may still leave you puzzled, so I will share my own understanding and how I resolved it.
Put simply, it comes down to one sentence: make sure that every output of every forward function is used in computing the loss.
Note that this is not only about your model's forward function: your loss itself may also be computed through a forward function. In other words, every output of the forward of every module that inherits from nn.Module (not just the model itself) must participate in the loss computation.
In my case, in a multi-task learning setup, the loss was computed by a module that itself inherits from nn.Module, but the total loss returned from its forward was missing one task's loss, which triggered this error.
    class multi_task_loss(nn.Module):
        def __init__(self, device, batch_size):
            super().__init__()
            self.ce_loss_func = nn.CrossEntropyLoss()
            self.l1_loss_func = nn.L1Loss()
            self.contra_loss_func = ContrastiveLoss(batch_size, device)

        def forward(self, rot_p, rot_t, pert_p, pert_t, emb_o, emb_h, emb_p,
                    original_imgs, rect_imgs):
            rot_loss = self.ce_loss_func(rot_p, rot_t)
            pert_loss = self.ce_loss_func(pert_p, pert_t)
            contra_loss = self.contra_loss_func(emb_o, emb_h) \
                + self.contra_loss_func(emb_o, emb_p) \
                + self.contra_loss_func(emb_p, emb_h)
            rect_loss = self.l1_loss_func(original_imgs, rect_imgs)
            # tol_loss = rot_loss + pert_loss + rect_loss  # bug: contra_loss missing from the sum, yet all losses were returned
            tol_loss = rot_loss + pert_loss + contra_loss + rect_loss  # fixed: works after this change
            return tol_loss, (rot_loss, pert_loss, contra_loss, rect_loss)

Check your entire computation (not just the model itself) and verify that every output of every forward function is used in computing the loss.
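To locate the offending parameters, one debugging trick (not from the original post; the toy model below is made up) is to run a single forward/backward pass on one process, without DDP, and list every parameter whose .grad is still None:

```python
import torch
import torch.nn as nn

# Hypothetical model in which one head is accidentally left out of the loss.
class ToyMultiHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.used_head = nn.Linear(4, 2)
        self.skipped_head = nn.Linear(4, 2)  # output never reaches the loss

    def forward(self, x):
        return self.used_head(x)  # skipped_head is never called

model = ToyMultiHead()
model(torch.randn(3, 4)).sum().backward()

# After backward(), any parameter whose .grad is still None took no part in
# the loss -- exactly the situation DDP is complaining about.
unused = [name for name, p in model.named_parameters() if p.grad is None]
print(unused)
```

Every name this prints belongs to a module whose forward output (or the module itself) never fed into the loss; fixing those is usually enough to make the DDP error disappear.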
Ref:
https://discuss.pytorch.org/t/need-help-runtimeerror-expected-to-have-finished-reduction-in-the-prior-iteration-before-starting-a-new-one/119247