GPU Crashes During Model Training

Problem

I am using a modded NVIDIA 2080 Ti with 22 GB of VRAM. Partway through model training, the run died with the following error:

Traceback (most recent call last):
  File "/home/laumy/lerobot/./src/lerobot/scripts/train.py", line 291, in <module>
    train()
  File "/home/laumy/lerobot/src/lerobot/configs/parser.py", line 226, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "/home/laumy/lerobot/./src/lerobot/scripts/train.py", line 212, in train
    train_tracker, output_dict = update_policy(
  File "/home/laumy/lerobot/./src/lerobot/scripts/train.py", line 101, in update_policy
    train_metrics.loss = loss.item()
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Running nvidia-smi afterwards showed that the GPU could no longer be found at all.

nvidia-smi

No devices were found
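
The error message above suggests passing CUDA_LAUNCH_BLOCKING=1. The following is a minimal sketch of rerunning a training command with that variable set, so that CUDA errors are reported at the call that caused them rather than at a later one. The helper names blocking_env and rerun are made up for illustration. This will not fix a hardware-level fault, but it can make the traceback more trustworthy.

```python
import os
import subprocess


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # With this set, each kernel launch blocks until it finishes, so an error
    # surfaces at the Python line that triggered it (at the cost of speed).
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun(cmd):
    """Run `cmd` (a list of strings) with the modified environment.

    Returns the process exit code.
    """
    return subprocess.run(cmd, env=blocking_env()).returncode
```

For example, rerun(["python", "./src/lerobot/scripts/train.py"]) plus whatever arguments the run normally takes. Training will be noticeably slower in this mode, so it is only worth using while debugging.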

Troubleshooting

I rebooted the machine, restarted training, and ran the following command in parallel to keep an eye on the GPU.

watch -n 1 nvidia-smi

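During training the temperature climbed very quickly. My initial suspicion is that the card was being pushed to its limit and the over-temperature protection kicked in.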
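
To capture how temperature, power draw and core clock move together instead of eyeballing them, something like the sketch below could log them once per second. It relies on nvidia-smi's --query-gpu interface. The function names and the 85 °C warning threshold are my own assumptions for illustration, not values taken from the card's specification.

```python
import subprocess
import time

# Fields understood by `nvidia-smi --query-gpu`.
QUERY = "temperature.gpu,power.draw,clocks.sm"


def _to_float(field):
    """Convert one CSV field to float; return None for values like '[N/A]'."""
    try:
        return float(field.strip())
    except ValueError:
        return None


def parse_sample(line):
    """Parse one CSV line such as '83, 248.31, 1905' into a dict."""
    temp, power, clock = (_to_float(x) for x in line.split(","))
    return {"temp_c": temp, "power_w": power, "sm_mhz": clock}


def read_samples():
    """Query every visible GPU once and return one dict per GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_sample(l) for l in out.splitlines() if l.strip()]


def log_forever(interval_s=1.0, warn_temp_c=85.0):
    """Print one line per GPU per interval; warn_temp_c is an assumed threshold."""
    while True:
        for idx, s in enumerate(read_samples()):
            hot = s["temp_c"] is not None and s["temp_c"] >= warn_temp_c
            flag = "  <-- hot" if hot else ""
            print(f"GPU{idx}: {s['temp_c']} C, {s['power_w']} W, "
                  f"{s['sm_mhz']} MHz{flag}", flush=True)
        time.sleep(interval_s)
```

Calling log_forever() in a second terminal while training runs, with the output redirected to a file, leaves a record of the last readings before the card disappears.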

Fix

Limit the GPU's power and core clock

sudo nvidia-smi -pl 150  # set the power limit to 150 W

sudo nvidia-smi -lgc 1000,1000  # lock the core clock at 1000 MHz

With these limits in place, training ran without any further problems. You can also use nvidia-smi -a to inspect the parameters in detail.
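
As far as I know, these limits do not survive a reboot, so they have to be reapplied each time. Below is a rough sketch of a helper that applies the same two limits and reads the power limit back to confirm it took effect. The function names are hypothetical, and the defaults (GPU index 0, 150 W, 1000 MHz) simply mirror the commands above.

```python
import subprocess


def limit_commands(gpu_index=0, power_w=150, clock_mhz=1000):
    """Build the nvidia-smi commands for the power and core-clock limits."""
    gpu = str(gpu_index)
    return [
        ["nvidia-smi", "-i", gpu, "-pl", str(power_w)],
        # -lgc takes "min,max"; using the same value for both pins the clock.
        ["nvidia-smi", "-i", gpu, "-lgc", f"{clock_mhz},{clock_mhz}"],
    ]


def apply_limits(gpu_index=0, power_w=150, clock_mhz=1000):
    """Apply both limits through sudo; raises CalledProcessError on failure."""
    for cmd in limit_commands(gpu_index, power_w, clock_mhz):
        subprocess.run(["sudo", *cmd], check=True)


def current_power_limit(gpu_index=0):
    """Read back the active power limit in watts."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=power.limit", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip())
```

After apply_limits(), current_power_limit() should report 150.0 if the setting was accepted. To undo the clock lock later, sudo nvidia-smi -rgc resets the core clocks to their defaults.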

For real-time GPU monitoring, you can also use

nvtop