Cluster Environment
Pay particular attention to the CUDA version.
【Step1】 Try starting the container
docker run --name yen_v3 --runtime=nvidia --gpus=all -v ~/workspace/NeRF:/zj -tid 172.20.208.7/zhaojing_repo/nerf:yen37_v2 bash
This fails with:
docker: Error response from daemon: Unknown runtime specified nvidia.
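This error means the Docker daemon has no runtime registered under the name nvidia. As a rough sketch of how to confirm that, the snippet below lists the runtime names reported by docker info --format '{{json .Runtimes}}'. The helper names runtime_names and docker_runtimes_json are made up for illustration, the code assumes the docker CLI is on PATH, and the sample JSON string is hand-written rather than captured from this machine.

```python
import json
import subprocess


def runtime_names(runtimes_json):
    """Return the sorted runtime names from the JSON printed by
    `docker info --format '{{json .Runtimes}}'`."""
    return sorted(json.loads(runtimes_json))


def docker_runtimes_json():
    """Ask the local Docker daemon for its runtimes (assumes docker is on PATH)."""
    result = subprocess.run(
        ["docker", "info", "--format", "{{json .Runtimes}}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Hand-written sample of a daemon that only knows the default runtime.
sample = '{"runc": {"path": "runc"}}'
print("nvidia registered:", "nvidia" in runtime_names(sample))
```

On a real host, runtime_names(docker_runtimes_json()) would be checked instead of the sample string; if nvidia is missing from the list, --runtime=nvidia cannot work until the runtime is registered.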
Attempted fix 1:
Edit /etc/docker/daemon.json so that it contains the following:
{
    "registry-mirrors": ["https://f1z25q5p.mirror.aliyuncs.com"],
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Then run:
sudo systemctl daemon-reload
sudo systemctl restart docker
Running sudo systemctl restart docker failed with: Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
Cause: a mistake had been made while editing daemon.json in the previous step.
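An editing mistake of this kind can usually be caught before restarting Docker by parsing the file as JSON first. The following is a minimal sketch using only Python's standard json module; check_daemon_json is a hypothetical helper name, and the BAD string (with a trailing comma) is just one example of a mistake, not necessarily the one made here.

```python
import json


def check_daemon_json(text):
    """Return (True, parsed_config) if text is valid JSON,
    otherwise (False, message pointing at the error position)."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"line {exc.lineno}, column {exc.colno}: {exc.msg}"


# A valid fragment, and the same fragment with a stray trailing comma.
GOOD = '{"runtimes": {"nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}}}'
BAD = '{"runtimes": {"nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []},}}'

for label, text in (("GOOD", GOOD), ("BAD", BAD)):
    ok, detail = check_daemon_json(text)
    print(label, "valid" if ok else f"invalid ({detail})")
```

For a quick check without writing any code, python3 -m json.tool /etc/docker/daemon.json performs the same parse and prints an error if the file is not valid JSON.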
After correcting the file, I re-ran 【Step1】 and it succeeded.
【Step2】 The environment inside the container is configured as follows:
Package                 Version
----------------------- -----------
absl-py                 1.0.0
cachetools              5.0.0
certifi                 2021.10.8
charset-normalizer      2.0.12
ConfigArgParse          1.5.3
cycler                  0.11.0
dataclasses             0.6
fonttools               4.31.2
future                  0.18.2
google-auth             2.6.2
google-auth-oauthlib    0.4.6
grpcio                  1.44.0
idna                    3.3
imageio                 2.16.1
imageio-ffmpeg          0.4.5
importlib-metadata      4.11.3
kiwisolver              1.4.0
Markdown                3.3.6
matplotlib              3.5.1
numpy                   1.21.5
oauthlib                3.2.0
opencv-python           4.5.5.64
packaging               21.3
Pillow                  9.0.1
pip                     22.0.4
protobuf                3.19.4
pyasn1                  0.4.8
pyasn1-modules          0.2.8
pyparsing               3.0.7
python-dateutil         2.8.2
requests                2.27.1
requests-oauthlib       1.3.1
rsa                     4.8
setuptools              57.5.0
six                     1.16.0
tensorboard             2.8.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit  1.8.1
torch                   1.7.0+cu110
torchaudio              0.7.0
torchvision             0.8.1+cu110
tqdm                    4.63.1
typing_extensions       4.1.1
urllib3                 1.26.9
Werkzeug                2.0.3
wheel                   0.37.1
zipp                    3.7.0
Output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80       Driver Version: 460.80       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0  On |                  N/A |
| 30%   42C    P8    12W / 250W |    240MiB / 11014MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
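Since the CUDA version is the point to watch, it can help to compare the CUDA build tag of the installed torch wheel (cu110, i.e. CUDA 11.0) with the "CUDA Version" that nvidia-smi reports, which is the highest CUDA version the driver supports (11.2 here). The sketch below encodes only the rough rule "driver CUDA version >= wheel CUDA version"; the function names are invented for this example, and real compatibility has more nuances than this simple comparison.

```python
def parse_cuda_tag(torch_version):
    """Extract the CUDA build from a torch version string.
    '1.7.0+cu110' -> (11, 0); returns None if there is no cuXYZ tag."""
    _, _, local = torch_version.partition("+")
    if not local.startswith("cu"):
        return None
    digits = local[2:]
    if len(digits) < 2 or not digits.isdigit():
        return None
    # The last digit is the minor version, the rest is the major version.
    return int(digits[:-1]), int(digits[-1])


def parse_cuda_version(text):
    """Turn a dotted version such as '11.2' into (11, 2)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def driver_supports(driver_cuda, build_cuda):
    """Rough rule of thumb: the driver's CUDA version must not be older
    than the CUDA version the wheel was built against."""
    return driver_cuda >= build_cuda


build = parse_cuda_tag("1.7.0+cu110")    # from the package list
driver = parse_cuda_version("11.2")      # from the nvidia-smi header
print("wheel CUDA:", build, "driver CUDA:", driver,
      "compatible:", driver_supports(driver, build))
```

The same comparison can be repeated with the nvidia-smi header of any other machine the image is meant to run on.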
【Step3】 Try running locally: python run_nerf.py --config configs/fern.txt
It ran successfully.
【Step4】 Commit the environment as an image and push it
docker commit -m "hh" yen_v3 172.20.208.7/zhaojing_repo/nerf:yen_cu110
docker push 172.20.208.7/zhaojing_repo/nerf:yen_cu110
【Step5】 Test run on the server
It failed.
I have not found a solution yet; I will look into it further when I have time.
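Because the cause is still unknown, one low-cost first step might be to print the same handful of environment facts inside the container on both the local machine and the server, then compare them. The script below is only a sketch of that idea: environment_report is a hypothetical helper, it is not a fix for the failure, and it degrades gracefully if torch is not importable.

```python
import importlib
import platform


def environment_report():
    """Collect a few facts about Python, torch and CUDA visibility."""
    report = {"python": platform.python_version()}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # torch is missing in this interpreter; report that and stop.
        report["torch"] = None
        return report
    report["torch"] = torch.__version__
    report["torch_cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["device_0"] = torch.cuda.get_device_name(0)
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

If cuda_available is True locally but False on the server, the difference more likely lies in the server's driver or container runtime setup than in the image itself, although that is only a guess until the outputs are compared.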