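GPU Crashes During Model Training
Problem
I am using a modified NVIDIA 2080 Ti with 22 GB of VRAM. Partway through model training, the run fails with the following error: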
Traceback (most recent call last):
  File "/home/laumy/lerobot/./src/lerobot/scripts/train.py", line 291, in <module>
    train()
  File "/home/laumy/lerobot/src/lerobot/configs/parser.py", line 226, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "/home/laumy/lerobot/./src/lerobot/scripts/train.py", line 212, in train
    train_tracker, output_dict = update_policy(
  File "/home/laumy/lerobot/./src/lerobot/scripts/train.py", line 101, in update_policy
    train_metrics.loss = loss.item()
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
After the crash, nvidia-smi can no longer find the GPU at all:
nvidia-smi
No devices were found
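When the card disappears like this, the NVIDIA kernel driver often leaves "NVRM: Xid" lines in the kernel log, and these can help narrow down the cause. The following is a minimal sketch for pulling those lines out. The helper name xid_lines is made up for illustration, and whether this machine logged any Xid event is not something the post records.

```shell
#!/usr/bin/env bash
# xid_lines: keep only NVIDIA driver Xid reports from kernel log text on stdin.
# (Hypothetical helper name; prints nothing if no Xid line is present.)
xid_lines() {
  grep -F 'NVRM: Xid' || true
}

# Usage (reading the kernel log usually needs root):
#   sudo dmesg | xid_lines
#   sudo journalctl -k -b -1 | xid_lines   # previous boot, if the journal is persistent
```

Because the GPU is only recovered by a reboot here, the journalctl form for the previous boot is probably the more useful one, assuming persistent journaling is enabled.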
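Troubleshooting
Reboot the machine, restart training, and run the following command alongside it to watch the GPU:
watch -n 1 nvidia-smi
During training the temperature climbs very quickly. My initial suspicion is that the card is running flat out and its over-temperature protection kicks in.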
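Watching the screen makes it easy to miss the moment just before the crash, so it may help to log the readings to a file and inspect them afterwards. The sketch below samples temperature, power draw and SM clock once per second. The function names log_gpu and max_temp and the file name gpu_log.csv are assumptions for illustration, not part of the original setup.

```shell
#!/usr/bin/env bash
# log_gpu FILE: append one CSV sample per second to FILE until interrupted.
log_gpu() {
  nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,clocks.sm \
             --format=csv,noheader,nounits -l 1 >> "$1"
}

# max_temp: print the highest temperature (2nd CSV column) found on stdin.
max_temp() {
  awk -F', *' 'NR == 1 || $2 + 0 > max { max = $2 + 0 } END { if (NR) print max }'
}

# Usage:
#   log_gpu gpu_log.csv &      # start before training, kill it afterwards
#   max_temp < gpu_log.csv     # peak temperature reached before the crash
```

If the last samples before the failure show a high temperature together with high power draw, that would support the overheating theory; if not, the cause may lie elsewhere.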
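Solution
Limit the GPU's power and core clock
sudo nvidia-smi -pl 150 # set the power limit to 150 W
sudo nvidia-smi -lgc 1000,1000 # lock the core clock to 1000 MHz
With these limits in place, training runs without problems. You can also use nvidia-smi -a to inspect the current parameters in detail.
If you also want real-time GPU monitoring, you can use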
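Before picking a power limit, it can be worth checking that the value lies inside the range the driver reports for the card, since nvidia-smi is expected to refuse values outside it. This is a rough sketch: the helper names in_range and check_power_limit are hypothetical, and GPU index 0 is an assumption.

```shell
#!/usr/bin/env bash
# in_range VALUE MIN MAX: succeed if MIN <= VALUE <= MAX (decimals allowed).
in_range() {
  awk -v v="$1" -v lo="$2" -v hi="$3" \
      'BEGIN { exit !(v + 0 >= lo + 0 && v + 0 <= hi + 0) }'
}

# check_power_limit WATTS: compare WATTS with the limits reported for GPU 0.
check_power_limit() {
  nvidia-smi --query-gpu=power.min_limit,power.max_limit \
             --format=csv,noheader,nounits -i 0 |
    { IFS=', ' read -r lo hi && in_range "$1" "$lo" "$hi"; }
}

# Usage:
#   check_power_limit 150 && sudo nvidia-smi -pl 150
#   sudo nvidia-smi -rgc     # undo the -lgc clock lock later
```

As far as I know, neither the power limit nor the clock lock survives a reboot, so they need to be reapplied after each boot if the limits should stay in effect.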
nvtop
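Switch the GPU to high-performance mode
Turn off runtime PM for the GPU and PCIe ASPM. The script below applies both settings and is meant to run automatically at boot.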
sudo vim /usr/local/sbin/fix-nvidia-runtime-pm.sh
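#!/bin/bash
# Must run as root (the systemd unit takes care of this).
# Force ASPM off so the PCIe link does not drop into a low-power state
echo performance | tee /sys/module/pcie_aspm/parameters/policy
# Find the PCIe address of the first NVIDIA device
GPU_BDF="$(lspci | awk '/NVIDIA/{print $1; exit}')"
if [ -z "$GPU_BDF" ]; then
    echo "NVIDIA GPU not found, exiting script."
    exit 1
fi
# Assumes the GPU is in PCI domain 0000
GPU_DEV="0000:${GPU_BDF}"
# Set power/control to 'on' for the GPU and every upstream bridge so they never runtime-suspend
CUR="/sys/bus/pci/devices/${GPU_DEV}"
while [ -n "$CUR" ] && [ -e "$CUR" ]; do
    if [ -w "$CUR/power/control" ]; then
        echo on | sudo tee "$CUR/power/control"
    fi
    # Move up to the parent bridge
    PARENT="$(readlink -f "$CUR/..")"
    [[ "$PARENT" == "/sys/devices" ]] && break
    CUR="$PARENT"
done
# Enable NVIDIA persistence mode so the driver stays loaded while no process is using the GPU
nvidia-smi -pm 1
# Report that the script finished
echo "GPU power control set to 'on' and ASPM disabled."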
Make the script executable: sudo chmod +x /usr/local/sbin/fix-nvidia-runtime-pm.sh
Then create a systemd unit for it:
sudo vim /etc/systemd/system/fix-nvidia-runtime-pm.service
[Unit]
Description=Force NVIDIA GPU power control 'on' and disable ASPM
After=multi-user.target
Wants=multi-user.target
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/fix-nvidia-runtime-pm.sh
RemainAfterExit=true
[Install]
WantedBy=multi-user.target
Then enable the unit at boot and start it right away:
sudo systemctl daemon-reload
sudo systemctl enable fix-nvidia-runtime-pm.service
sudo systemctl start fix-nvidia-runtime-pm.service
After rebooting, verify the result:
cat /sys/bus/pci/devices/0000:$(lspci | awk '/NVIDIA/{print $1; exit}')/power/control
If this prints on, the runtime PM setting took effect.
cat /sys/module/pcie_aspm/parameters/policy
The active policy is shown in square brackets, so the output should contain [performance].
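The policy file lists every available policy on one line, which makes the active one easy to misread. A small sketch for extracting just the bracketed entry follows. The helper name active_policy is hypothetical.

```shell
#!/usr/bin/env bash
# active_policy: print the bracketed (active) entry of an ASPM policy line on stdin.
active_policy() {
  sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Usage:
#   active_policy < /sys/module/pcie_aspm/parameters/policy
#   nvidia-smi --query-gpu=persistence_mode --format=csv,noheader
```

The second command in the usage note checks the persistence mode that the boot script enables with nvidia-smi -pm 1.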