Pytorch分布式训练错误-CFANZ编程社区

Pytorch分布式训练错误

参考 PyTorch分布式训练简介 - 云+社区 - 腾讯云

参考 Pytorch分布式训练错误 - 云+社区 - 腾讯云

subprocess.CalledProcessError: Command ‘[’/home/labpos/anaconda3/envs/idr/bin/python’, ‘-u’, ‘main_distribute.py’, ‘–local_rank=1’]’ returned non-zero exit status 1.

pytorch DistributedDataParallel训练时遇到的问题

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn’t able to locate the output tensors in the return value of your module’s forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /opt/conda/conda-

model = torch.nn.parallel.DistributedDataParallel(model,device_ids=[args.local_rank],output_device=args.local_rank, find_unused_parameters=True)

0 条评论