Basic operations of the autograd.Variable module in PyTorch

仲秋花似锦 2023-06-30

Problem description:

While training a model, loading the previously saved model weights fails with the following error:

RuntimeError: [enforce fail at inline_container.cc:222] . file not found: archive/data/94167620909552

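According to a blog post I consulted, this error occurs because, during parallel training with DistributedDataParallel, the process on every GPU saves the model and they all write to the same weights file at once, which leaves the file corrupted.

The fix is to let only one designated process save the model during training, by adding a rank check before the torch.save() call, as follows: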

# dist is torch.distributed; net is the DistributedDataParallel-wrapped model.
if dist.get_rank() == 0:
    torch.save(net.module.state_dict(), 'model.pth')
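The same rank check can be wrapped in a small helper. The following is only a minimal sketch, not the original author's code: the names current_rank and save_on_rank0 are hypothetical, and the dist.barrier() call is an added assumption, so that the other ranks wait until rank 0 has finished writing before any of them reads the file. When no process group has been initialized (for example in a single-GPU run), the helper treats the process as rank 0.

```python
import os
import tempfile

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel


def current_rank() -> int:
    """Return this process's rank, or 0 when no process group is initialized."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return 0


def save_on_rank0(model: torch.nn.Module, path: str) -> bool:
    """Save the state dict from rank 0 only.

    Returns True if this process wrote the file, False otherwise.
    (Hypothetical helper, for illustration only.)
    """
    # Unwrap DistributedDataParallel so the saved keys carry no 'module.' prefix.
    to_save = model.module if isinstance(model, DistributedDataParallel) else model

    wrote = False
    if current_rank() == 0:
        torch.save(to_save.state_dict(), path)
        wrote = True

    # Assumption: every rank calls this function, so a barrier is safe here.
    # It keeps the other ranks from loading a half-written file.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
    return wrote
```

In a training loop this would be called as save_on_rank0(net, 'model.pth') on every process; only rank 0 touches the file, and the saved weights can later be loaded with torch.load('model.pth', map_location='cpu').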
