「Distributed Training」A Guide to Single-Machine Multi-GPU Parallelism with DDP
Under the influence of the current trend, I have recently started dabbling in large models. Because of the limited compute resources in our lab, programs have to be run with single-machine multi-GPU parallelism. Using the BLOOM-560m model as an example, this post demonstrates how to fine-tune it for a downstream task with single-machine multi-GPU DDP parallelism.
Table of Contents
- 0. Basics
- - Two ways to do distributed training
- - Data parallelism & model parallelism
- 1. Code changes
- 1.1 Import the key packages
- 1.2 Define the key functions
- 1.3 Program entry point
- 1.4 The main() function
- 1.5 The get_dataloader() function
- 1.6 The train() function
- 1.7 The validate() function
- 1.8 The test() function
- 2. Running the program
- 2.1 Launching with mp.spawn()
- 2.2 Launching with torchrun
- 2.3 Launching with torch.distributed.launch()
- 3. Debugging log
- Problem 1: Gathering results computed by multiple processes
- Problem 2: Missing parameters when loading the model
- Problem 3: Parameter type conversion error
- Problem 4: Leaked semaphore objects
0. Basics
- Two ways to do distributed training
Note: PyTorch distributed currently only supports Linux. There are two main ways to parallelize a program, DataParallel and DistributedDataParallel:
- DataParallel (DP): simple to implement, requires little code, and launches slightly faster. However, it is slower overall and suffers from load imbalance. Single process, multi-threaded: the main GPU uses far more memory than the others. It does not support Apex mixed-precision training. It is PyTorch's long-standing official solution and is constrained by the Python GIL. DP works by splitting one batch of input data evenly across multiple GPUs and computing on each in parallel (note that the batch size must be larger than the number of GPUs for the split to work).
- DistributedDataParallel (DDP): an all-reduce scheme. It was designed for distributed training (multiple machines, multiple GPUs) but also works for single-machine multi-GPU. Configuration is slightly more involved. Multi-process, with a more balanced distribution of data and memory. It is the newer generation of multi-GPU training and is implemented with the torch.distributed library, which provides distributed support for both GPU and CPU training and exposes an MPI-like interface for exchanging tensor data across machine networks, with several backends and initialization methods. DDP improves communication efficiency through ring all-reduce and works around the Python GIL by launching multiple processes, which speeds up training. A minimal sketch contrasting the two wrappers follows this list.
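This snippet is not from the original post; it only illustrates the difference in usage. The DDP line assumes the process group has already been initialized and that local_rank identifies the current process's GPU:

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# DataParallel: one process; the wrapper splits each batch across the visible GPUs
model_dp = nn.DataParallel(model.cuda())

# DistributedDataParallel: one process per GPU; requires dist.init_process_group() first
model_ddp = DDP(model.cuda(local_rank), device_ids=[local_rank], output_device=local_rank)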
- How DDP multi-GPU training works
- Copy the model onto each GPU;
- Split the total batch of data evenly across the GPUs (with the order shuffled); each process loads its own data from disk;
- During training, the forward pass and the loss computation run independently on each GPU, so the network outputs do not need to be gathered. During the backward pass, each process exchanges its gradients with the others via ring all-reduce, obtaining the average gradient over all processes; gradient descent is then performed on every GPU with this value, so at the end of the backward pass each GPU ends up with an identical copy of the averaged gradient;
- Each process updates its own parameters with the averaged gradient. Since all processes start from the same initial parameters and apply the same gradient update, the updated parameters stay identical across processes.
- Data parallelism & model parallelism
- Data parallelism: multiple GPUs hold identical copies of the model but train on different data from the same batch.
- Model parallelism: multiple GPUs consume the same batch of data, but each trains a different part of the model.
A simple way to remember it: parallelism splits whatever is being parallelized, in order to improve computational efficiency.
1. Code changes
This guide uses DDP to parallelize the program, following this tutorial, to replicate the model across GPUs and parallelize the data.
1.1 Import the key packages
The packages below are used throughout the modified program; dist handles multi-GPU communication, and DDP handles wrapping and synchronizing the model.
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.cuda.amp import GradScaler
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP

1.2 Define the key functions
- init_ddp(local_rank)
Initializes each process, using the nccl backend with env as the initialization method.
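The post does not reproduce the body of init_ddp() here, so the following is only a minimal sketch consistent with the description above (nccl backend, env:// initialization, one process per GPU). It assumes that MASTER_ADDR, MASTER_PORT, and WORLD_SIZE have already been exported by the launcher (see section 1.3):

import os
import torch
import torch.distributed as dist

def init_ddp(local_rank):
    '''Initialize this process; on a single machine, rank == local_rank.'''
    os.environ['RANK'] = str(local_rank)
    torch.cuda.set_device(local_rank)  # bind this process to its own GPU
    dist.init_process_group(backend='nccl', init_method='env://')

After it has run, the current process can query its own rank and the world size at any point: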
local_rank = dist.get_rank()
world_size = dist.get_world_size()
Once this initialization is done, local_rank and world_size can be obtained wherever they are needed, without passing them down as extra arguments from main(). For example, when printing, logging, or saving the model, only one process needs to do the work, because all processes hold identical replicas:
if local_rank == 0:
    print(f'begin validating')
......
if local_rank == 0:
    save_model(actual_epoch, model, scaler, args['model_save_dir'] + '/best_macro_model_DDP_direct.pt')

- reduce_tensor(tensor)
Aggregates values computed by the individual processes, such as the loss or evaluation metrics.
def reduce_tensor(tensor: torch.Tensor):
    '''Average a tensor value that each process has computed (e.g. the loss).'''
    rt = tensor.clone()  # e.g. tensor(9.1429, device='cuda:1')
    dist.all_reduce(rt, op=dist.reduce_op.SUM)
    rt /= dist.get_world_size()
    return rt

- get_ddp_generator(seed)
Used during training to strengthen the randomness of training.
def get_ddp_generator(seed=3407):
    '''Give each process a different random seed to strengthen the randomness of training.'''
    local_rank = dist.get_rank()
    g = torch.Generator()
    g.manual_seed(seed + local_rank)
    return g

1.3 Program entry point
Inside if __name__ == '__main__':, DDP is launched with mp.spawn(). Its main arguments are fn (the function each process runs, main() here), args (the extra arguments passed to fn; local_rank is prepended automatically), and nprocs (the number of processes to start, usually the number of GPUs).
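The full entry block is not shown here; a minimal sketch consistent with the launch call that appears in the error logs later (mp.spawn(main, args=(args, ), nprocs=world_size)) might look like the following, where the address, port, and the contents of args are placeholders, and main() is assumed to be defined as in section 1.4:

import os
import torch
import torch.multiprocessing as mp

if __name__ == '__main__':
    args = {'batch_size': 4, 'num_workers': 2}   # placeholder hyper-parameters
    os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'   # which GPUs to use
    os.environ['MASTER_ADDR'] = 'localhost'      # single machine
    os.environ['MASTER_PORT'] = '19198'          # any free port
    world_size = torch.cuda.device_count()
    os.environ['WORLD_SIZE'] = str(world_size)
    # fn=main: the function each process runs; args: extra arguments passed to it
    # (local_rank is prepended automatically); nprocs: number of processes to start
    mp.spawn(main, args=(args,), nprocs=world_size)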
1.4 The main() function
Here, main() is the function passed as the first argument to mp.spawn() above. The key changes to the code are listed below (a minimal sketch of the modified main() follows the list):
- Update the parameter list: add an extra local_rank parameter. It does not need to be passed through mp.spawn(); the system assigns it automatically;
- Initialize the process: call init_ddp();
- Synchronize the BN layers: call convert_sync_batchnorm() to compute batch normalization synchronously across GPUs, mimicking the single-GPU setting as closely as possible. This lowers GPU utilization but can improve the model's performance in the multi-GPU setting (see this blog post for details). Whether SyncBN is necessary depends on the per-GPU batch size: if it is small, SyncBN can improve performance; if it is large, SyncBN is unnecessary, since the extra inter-GPU communication slows training down;
- Data parallelism: wrap the model with DistributedDataParallel();
- Enable mixed-precision training: create a GradScaler() and pass it as an argument to train();
- Configure the training sampler: set a different sampling order for every epoch;
- Avoid replicas doing duplicate work: guard such code with an if local_rank == 0: statement;
- Tear down the process group: call destroy_process_group().
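Putting these points together, a minimal sketch of the modified main() might look like the following. It reuses the helper functions defined elsewhere in this post (init_ddp, load_model, get_dataloader, train, validate, save_model); the argument keys 'pretrained_path', 'traincsvpath', 'validcsvpath', 'lr' and 'epochs', as well as the optimizer, scheduler, and criterion choices, are illustrative assumptions rather than the post's exact code:

def main(local_rank, args):  # local_rank is supplied automatically by mp.spawn()
    init_ddp(local_rank)                                     # process initialization (section 1.2)
    model, tokenizer = load_model(args['pretrained_path'], args['modelname'], args['num_labels'])
    model.cuda()
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)   # synchronize the BN layers
    model = DDP(model, device_ids=[local_rank], output_device=local_rank)  # data parallelism
    scaler = GradScaler()                                    # mixed-precision scaler, passed to train()
    train_dataloader = get_dataloader(args['traincsvpath'], args, tokenizer, train=True)
    valid_dataloader = get_dataloader(args['validcsvpath'], args, tokenizer, train=False)
    optimizer = torch.optim.AdamW(model.parameters(), lr=args['lr'])
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
    criterion = nn.BCEWithLogitsLoss()
    for epoch in range(args['epochs']):
        train_dataloader.sampler.set_epoch(epoch)            # a different sampling order every epoch
        train(model, train_dataloader, optimizer, scheduler, criterion, epoch, scaler, args)
        macro = validate(model, valid_dataloader, criterion, epoch, args)
        if local_rank == 0:                                  # only one replica prints and saves
            print(f'epoch {epoch} macro-F1: {macro.item():.4f}')
            save_model(epoch, model, scaler, args['model_save_dir'] + '/best_macro_model_DDP_direct.pt')
    dist.destroy_process_group()                             # tear down the process group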
- In addition to the changes above, the three functions used inside main(), namely get_dataloader(), train(), and validate(), also need corresponding updates; each is described below.
1.5 The get_dataloader() function
This function mainly changes the DataLoader() call. A train_sampler and a test_sampler are defined for the training and testing phases respectively: train_sampler samples randomly, while test_sampler samples sequentially. In addition, in the training phase, get_ddp_generator() is used to pass a generator argument to DataLoader() (it acts on the different workers); without it, the randomness of training is weakened.
def get_dataloader(path, args, tokenizer, train: bool):
    '''Load the data from the given path and build a DataLoader for later training and evaluation.
    path: where the data is stored
    tokenizer: the tokenizer
    train: whether this is the training phase
    '''
    texts, labels = load_dataset(path, args['num_labels'])
    texts = tokenizer(texts, padding='max_length', truncation=True, return_tensors='pt', max_length=args['max_length'])
    data = TensorDataset(texts['input_ids'], texts['attention_mask'], torch.tensor(labels))
    if train:
        train_sampler = DistributedSampler(data, shuffle=True)  # create a random sampler
        g = get_ddp_generator()
        dataloader = DataLoader(dataset=data,
                                batch_size=args['batch_size'],
                                num_workers=args['num_workers'],
                                pin_memory=True,
                                shuffle=False,
                                sampler=train_sampler,  # use the random sampler
                                generator=g)
    else:
        test_sampler = DistributedSampler(data, shuffle=False)  # create a sequential sampler
        dataloader = DataLoader(dataset=data,
                                batch_size=args['batch_size'],
                                num_workers=args['num_workers'],
                                pin_memory=True,
                                shuffle=False,
                                sampler=test_sampler)  # use the sequential sampler
    return dataloader

1.6 The train() function
This function mainly averages the loss across processes with reduce_tensor(), and changes the backward pass: the scaler scales the gradients to prevent the loss from underflowing under mixed precision, and the scaler's own state is updated as well. All parallel processes share the same scaler. When saving the model, if training will continue later (for example a pretrain-then-finetune workflow), it is best to save the scaler state together with the model and load the two together during the later fine-tuning.
def train(model, train_dataloader, optimizer, scheduler, criterion, actual_epoch, scaler, args):
    model.train()
    tr_loss = 0
    num_train_samples = 0
    for step, batch in enumerate(train_dataloader):
        batch = tuple(t.cuda(non_blocking=True) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        with torch.cuda.amp.autocast():
            output = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)  # GPU memory grows at this line
            loss = criterion(output.logits.view(-1, args['num_labels']),
                             b_labels.type_as(output.logits).view(-1, args['num_labels']))
        reduced_loss = reduce_tensor(loss.data)  # average the losses computed by the parallel processes
        if dist.get_rank() == 0:  # avoid duplicated output
            print("\nOutput Loss: ", reduced_loss.item())
        tr_loss += reduced_loss.item()
        # Update in parallel: each process backpropagates the loss it computed itself
        optimizer.zero_grad()
        scaler.scale(loss).backward()
        scaler.step(optimizer)  # GPU memory grows at this line
        # The four lines below only need to be executed once by the multiple processes
        scheduler.step()
        scaler.update()
        num_train_samples += b_labels.size(0)  # add the number of samples in this batch to num_train_samples
        torch.cuda.empty_cache()  # release GPU reserved memory
    epoch_train_loss = tr_loss / num_train_samples  # num_train_samples is the per-process sample count; since the loss was already averaged above, the denominator does not need to be multiplied by the number of processes
    if dist.get_rank() == 0:
        print("\nTrain loss after Epoch {} : {}".format(actual_epoch, epoch_train_loss))

1.7 The validate() function
@torch.no_grad()
def validate(model, valid_dataloader, criterion, epoch, args, threshold=0.5):
    model.eval()
    eval_loss = 0.0
    num_eval_samples = 0
    pred_labels = []
    true_labels = []
    for step, batch in enumerate(valid_dataloader):
        batch = tuple(t.cuda(non_blocking=True) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        with torch.no_grad():
            with torch.cuda.amp.autocast():
                output = model(b_input_ids, attention_mask=b_input_mask)
                logits = output.logits
                loss = criterion(logits.view(-1, args['num_labels']),
                                 b_labels.type_as(logits).view(-1, args['num_labels']))
        reduced_loss = reduce_tensor(loss.data)
        eval_loss += reduced_loss.item()
        pred_label = torch.sigmoid(logits)
        pred_label = pred_label.to('cpu').numpy()
        b_labels = b_labels.to('cpu').numpy()
        pred_labels.append(pred_label)
        true_labels.append(b_labels)
        num_eval_samples += b_labels.shape[0]  # per-process sample count
    epoch_eval_loss = eval_loss / num_eval_samples
    if dist.get_rank() == 0:
        print("Validation loss after Epoch {} : {}".format(epoch, epoch_eval_loss))
    # Every parallel process runs the computation below and gets its own macro metric
    pred_labels = [item for sublist in pred_labels for item in sublist]
    true_labels = [item for sublist in true_labels for item in sublist]
    pred_bools = [pl > threshold for pl in pred_labels]
    true_bools = [tl == 1 for tl in true_labels]
    macro = f1_score(true_bools, pred_bools, average='macro')
    # Aggregate the results of the different processes
    macro = reduce_tensor(torch.tensor(macro).cuda())
    return macro

1.8 The test() function
As noted in section 1.3, this program separates the "training & validation" stage from the "testing" stage: the former saves the model, and the latter evaluates it. The changes needed in the test() function are therefore described separately here; they involve loading the model from a checkpoint. For ways to speed up inference, see this blog post.
@torch.no_grad()
def test(local_rank, args):
    init_ddp(local_rank)  # process initialization
    pred_labels = []
    true_labels = []
    if local_rank == 0:
        print(f'begin testing')
    save_path = args['model_save_dir'] + '/best_macro_model_DDP_direct.pt'
    model, tokenizer = load_model(save_path, args['modelname'], args['num_labels'])
    model.cuda()
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)  # convert the model's BN layers
    num_gpus = torch.cuda.device_count()
    if num_gpus > 1 and local_rank == 0:
        print('use {} gpus!'.format(num_gpus))
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)  # wrap with DDP
    model.eval()
    test_dataloader = get_dataloader(args['testcsvpath'], args, tokenizer, train=False)
    for idx, batch in enumerate(test_dataloader):  # iterate over the test dataloader
        ......
    dist.destroy_process_group()  # tear down the process group

Note: the test phase must also be run in parallel, otherwise the program fails (here with whole-model saving as the example):
python /data/gluo/CMLTES/codes/BLOOM_DDP_direct.py -mode "test"

torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/data/gluo/CMLTES/codes/BLOOM_DDP_direct.py", line 449, in test
    output = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)  # get the model output
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 1030, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 727, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper__index_select)

With that, all the code changes are done!
2. Running the program
The sections below introduce the different ways of launching DDP on multiple GPUs.
2.1 Launching with mp.spawn()
This program is launched with mp.spawn(); the mp module is a wrapper around the multiprocessing library and is not specific to DDP.
At first I ran the program in parallel on two 2080 Ti cards, but shortly after epoch 0 started it kept failing with RuntimeError: CUDA out of memory., as follows:
Traceback (most recent call last):
  File "/data/CMLTES_codes/experiment/bloom/BLOOM_DDP.py", line 690, in <module>
    mp.spawn(main, args=(args, ), nprocs=world_size)
  File "/root/anaconda3/envs/pytorch77/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/anaconda3/envs/pytorch77/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/root/anaconda3/envs/pytorch77/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/pytorch77/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data/CMLTES_codes/experiment/bloom/BLOOM_DDP.py", line 603, in main
    epoch_train_loss = train(model, train_dataloader, optimizer, scheduler, loss_func, actual_epoch, scaler, args)
  File "/data/CMLTES_codes/experiment/bloom/BLOOM_DDP.py", line 336, in train
    scaler.scale(loss).backward()  ###
  File "/root/anaconda3/envs/pytorch77/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/anaconda3/envs/pytorch77/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/root/anaconda3/envs/pytorch77/lib/python3.9/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/root/anaconda3/envs/pytorch77/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 188, in backward
    tmp = bloom_gelu_back(grad_output, input)
  File "/root/anaconda3/envs/pytorch77/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 175, in bloom_gelu_back
    ff = 0.5 * x * ((1 - tanh_out * tanh_out) * (0.79788456 + 0.1070322243 * x * x)) + 0.5 * (1 + tanh_out)
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 1; 10.76 GiB total capacity; 8.83 GiB already allocated; 28.56 MiB free; 8.94 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

After much head-scratching, I tried running the program, completely unchanged, on a 3090, and it worked! The lesson learned here is that per-GPU memory also matters a great deal for parallel training: fine-tuning a model of about 1.2 GB required roughly 40 GB of intermediate variables for the backward pass, which I honestly had not expected.
2.2 Launching with torchrun
Compared with launching via mp.spawn(), torchrun manages some of the environment variables automatically and is therefore more convenient. We only need to set os.environ['CUDA_VISIBLE_DEVICES'] (if unset, it defaults to all GPUs on the machine); there is no need to set os.environ['MASTER_ADDR'] and the like. In addition, main() no longer needs the local_rank parameter. The program entry point becomes:
if __name__ == '__main__':
    ......
    time_start = time.time()
    main(args)
    time_elapsed = time.time() - time_start
    local_rank = int(os.environ['LOCAL_RANK'])
    if local_rank == 0:
        print(f'\ntime elapsed: {time_elapsed:.2f} seconds')

The command used to run the script changes from python to torchrun, as follows:
torchrun --standalone --nproc_per_node=2 ddp_main_torchrun.py --gpu 0,1

Once the program runs successfully, a few remaining details still need attention; they are addressed one by one below.
When launching the program this way, the following error appeared:
ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /opt/conda/envs/CMLTES/lib/python3.9/site-packages/google/protobuf/pyext/_message.cpython-39-x86_64-linux-gnu.so)

Solution: replace the /usr/lib/x86_64-linux-gnu/libstdc++.so.6 that is being used; see this blog post for details.
2.3 Launching with torch.distributed.launch()
This approach needs even less code and launches faster.
python -m torch.distributed.launch --nproc_per_node 8 xxx.py
# -m means "run library module as a script"
# --nproc_per_node is the number of processes per machine

PS: this method is being phased out:
/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun.

3. Debugging log
The problems encountered while using DDP are analyzed below, together with their solutions.
Problem 1: Gathering results computed by multiple processes
Since the model here is replicated onto two GPUs for data parallelism, the results computed by the different processes need to be gathered and averaged when summarizing. This is where the all_reduce() collection function from section 1.2 comes in.
Note: for non-tensor values such as Python floats, to average them across processes you must first convert them to a tensor with torch.tensor() and move them to the GPU with .cuda(), and only then call the all_reduce() collection function. See how the macro variable is gathered in the validate() function in section 1.7. Without this conversion, an error is raised.
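As a small illustration of that pattern (taken from validate() in section 1.7; the surrounding variables are assumed to exist as defined there):

macro = f1_score(true_bools, pred_bools, average='macro')   # a plain Python float, computed independently by each process
macro = reduce_tensor(torch.tensor(macro).cuda())           # convert to a GPU tensor, then all_reduce and average
print(macro.item())                                         # back to a Python float if needed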
A follow-on point: during backpropagation each process trains on different data, so each process must still update based on the loss it computed itself, not on the value obtained from the collection function; doing otherwise raises an error and makes no logical sense anyway.
Problem 2: Missing parameters when loading the model
In the main() function of section 1.4, the model is stored by saving only its parameters. In the test phase, loading it the corresponding way fails with:
(CMLTES) ? CMLTES git:(master) ? python /data/gluo/CMLTES/codes/BLOOM_DDP.py -mode "test"
Model directory for bloom and batch size 4 already exists!
TEST FOR bloom and Batch Size4
[W socket.cpp:558] [c10d] The client socket has failed to connect to [localhost]:19198 (errno: 99 - Cannot assign requested address).
begin testing
Some weights of BloomForSequenceClassification were not initialized from the model checkpoint at /data/gluo/CMLTES/bloom_PRE and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BloomForSequenceClassification were not initialized from the model checkpoint at /data/gluo/CMLTES/bloom_PRE and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "/data/gluo/CMLTES/codes/BLOOM_DDP.py", line 586, in <module>
    mp.spawn(test, args=(args, ), nprocs=world_size)
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/data/gluo/CMLTES/codes/BLOOM_DDP.py", line 450, in test
    model, tokenizer = load_model(save_path, args['modelname'], args['num_labels'])  # load the model
  File "/data/gluo/CMLTES/codes/BLOOM_DDP.py", line 95, in load_model
    model.load_state_dict(model_state_dict)  # , strict=False)  # load the model parameters
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for BloomForSequenceClassification:
    Missing key(s) in state_dict: "transformer.word_embeddings.weight", "transformer.word_embeddings_layernorm.weight", "transformer.word_embeddings_layernorm.bias", "transformer.h.0.input_layernorm.weight", "transformer.h.0.input_layernorm.bias", "transformer.h.0.self_attention.query_key_value.weight", "transformer.h.0.self_attention.query_key_value.bias", "transformer.h.0.self_attention.dense.weight", "transformer.h.0.self_attention.dense.bias",

Following this blog post, the temporary workaround is to change the load_state_dict() call to model.load_state_dict(model_state_dict, strict=False), i.e. set the strict argument to False. strict=False means the keys in the state_dict are not strictly required to match the keys returned by the module.
This workaround sidesteps the missing-parameter problem for now, but it may hurt model performance to some degree; a proper fix is left for later.
PS: the two ways of saving and loading a model
According to this blog post, there are two ways to save a model: save the complete model, or save only its parameters. The corresponding loading methods naturally differ as well; a hedged sketch of both follows the list below.
- Save the complete model
- Save only the model parameters
Unlike the first method, this one requires first defining a model with the same structure as the saved one, and then loading the parameters into it.
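A minimal sketch of the two approaches (this is not the post's exact save_model/load_model implementation; the model.module unwrapping under DDP, the from_pretrained reconstruction, the file names, and the pretrained_path/num_labels variables are illustrative assumptions):

import torch
from transformers import BloomForSequenceClassification

# Method 1: save / load the whole model object
torch.save(model, 'whole_model.pt')
model = torch.load('whole_model.pt')

# Method 2: save / load only the parameters (state_dict)
# Under DDP the underlying model is model.module; saving its state_dict avoids
# every key being prefixed with 'module.'
torch.save(model.module.state_dict(), 'model_params.pt')

# Loading requires rebuilding the same architecture first, then loading the weights
model = BloomForSequenceClassification.from_pretrained(pretrained_path, num_labels=num_labels)
model.load_state_dict(torch.load('model_params.pt'))  # use strict=False only as a last resort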
Problem 3: Parameter type conversion error
The model is still stored by saving only its parameters (section 1.4). After applying the strict=False workaround above to get past the missing keys, loading the model in the test phase then fails with the next error, shown below.
Traceback (most recent call last):
  File "/data/gluo/CMLTES/codes/BLOOM_DDP.py", line 587, in <module>
    mp.spawn(test, args=(args, ), nprocs=world_size)
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/data/gluo/CMLTES/codes/BLOOM_DDP.py", line 459, in test
    model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)  # wrap with DDP
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/opt/conda/envs/CMLTES/lib/python3.9/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: value cannot be converted to type int without overflow

The deeper cause remains to be investigated.
Problem 4: Leaked semaphore objects
Warning log:
UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
This warning is produced because the program was interrupted with Ctrl+C. The deeper cause remains to be investigated.
Note: when PyTorch is set up to load data with multiple workers, what actually happens in the background is that N child processes with consecutive PIDs are spawned to simulate multi-threading. So when the program finishes, or the main process is killed midway, the child processes' GPU memory is not released automatically; they have to be killed manually, one by one.
> Topics not covered in this post: dist.barrier(), gradient accumulation, and mixed-precision / distributed training with Apex.
> Afterword: this post is the result of my own continual exploration; if anything is phrased poorly or unclearly, please do not hesitate to point it out, so that we can improve together!
References
The following series of resources all come from the same author and amount to a very detailed tutorial on data parallelism!