GPU Crashes During Model Training

🕒 2025-07-25 📁 lerobot 👤 laumy

Problem

I am using a modded NVIDIA 2080 Ti with 22 GB of VRAM. Partway through a training run, the job failed with the following error:

Traceback (most recent call last):
  File "/home/laumy/lerobot/./src/lerobot/scripts/train.py", line 291, in <module>
    train()
  File "/home/laumy/lerobot/src/lerobot/configs/parser.py", line 226, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "/home/laumy/lerobot/./src/lerobot/scripts/train.py", line 212, in train
    train_tracker, output_dict = update_policy(
  File "/home/laumy/lerobot/./src/lerobot/scripts/train.py", line 101, in update_policy
    train_metrics.loss = loss.item()
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Running nvidia-smi afterwards showed that the GPU could no longer be found at all.

nvidia-smi

No devices were found
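When the driver loses the GPU like this, the kernel log often records the reason as an NVRM Xid message. Below is a minimal sketch for pulling those lines out, assuming the NVIDIA driver writes to the kernel ring buffer on this machine. The helper name xid_lines is made up for illustration.

```shell
#!/bin/bash

# Hypothetical helper: keep only NVIDIA driver Xid lines from kernel log text on stdin.
xid_lines() {
  grep -E 'NVRM: Xid'
}

# Usage (reading the kernel log may require root):
#   sudo dmesg | xid_lines
```

If nothing is printed, the driver logged no Xid event since boot, which is itself a useful data point.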

Troubleshooting

I rebooted the machine, restarted training, and ran the following command alongside it to watch the GPU.

watch -n 1 nvidia-smi

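During training the temperature climbed very quickly. My initial suspicion was that the GPU was being pushed to its limit and overheated until thermal protection kicked in.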
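To confirm a thermal suspicion like this, it helps to log the readings rather than watch them scroll by. The following is a rough sketch, not what was actually run here: log_gpu and hot_samples are hypothetical helper names, and the default threshold of 80 °C is an arbitrary assumption.

```shell
#!/bin/bash

# Hypothetical helper: print timestamp, temperature (C), power draw (W) and SM clock (MHz)
# once per second as CSV without header or units.
log_gpu() {
  nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,clocks.sm \
             --format=csv,noheader,nounits -l 1
}

# Hypothetical helper: pass through only the samples whose temperature (2nd column)
# is at or above the given limit. The default of 80 is an assumed value.
hot_samples() {
  local limit="${1:-80}"
  awk -F', ' -v limit="$limit" '$2 + 0 >= limit'
}

# Usage: keep a full log and print the hot samples as they appear
#   log_gpu | tee gpu.log | hot_samples 80
```

The last lines of gpu.log before a crash show whether temperature or power draw was spiking at that moment.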

Solution

Limit the GPU power and core clock

sudo nvidia-smi -pl 150  # cap the power limit at 150 W

sudo nvidia-smi -lgc 1000,1000  # lock the core clock to 1000 MHz

With these limits in place, training ran without further problems. You can also use nvidia-smi -a to inspect the GPU's parameters in detail.
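A power limit is only accepted if it falls inside the range the board reports, so it can be worth checking that range before picking a value. Here is a small sketch under that assumption. show_power_limits and limit_in_range are hypothetical helper names, and the 150 W figure is simply the value used above.

```shell
#!/bin/bash

# Hypothetical helper: print min, max, default and current power limit in watts
# as one CSV line per GPU.
show_power_limits() {
  nvidia-smi --query-gpu=power.min_limit,power.max_limit,power.default_limit,power.limit \
             --format=csv,noheader,nounits
}

# Hypothetical helper: succeed only if the requested limit lies within [min, max].
# Arguments: requested min max
limit_in_range() {
  awk -v r="$1" -v lo="$2" -v hi="$3" 'BEGIN { exit !(r >= lo && r <= hi) }'
}

# Usage (first GPU only):
#   IFS=', ' read -r lo hi def cur < <(show_power_limits)
#   limit_in_range 150 "$lo" "$hi" && sudo nvidia-smi -pl 150
#
# To undo the clock lock later:
#   sudo nvidia-smi -rgc
```

To return to the stock power limit, pass the reported default value to nvidia-smi -pl.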

For live GPU monitoring, you can also use

nvtop

Switch the GPU to high-performance mode

Turn off runtime PM for the GPU and PCIe ASPM. The script below does this and is set up to run at boot.

sudo vim /usr/local/sbin/fix-nvidia-runtime-pm.sh

#!/bin/bash

# Set the PCIe ASPM policy to "performance" so links do not drop into low-power states
echo performance | tee /sys/module/pcie_aspm/parameters/policy

# Find the PCI address (bus:device.function) of the first NVIDIA device
GPU_BDF="$(lspci | awk '/NVIDIA/{print $1; exit}')"

if [ -z "$GPU_BDF" ]; then
  echo "NVIDIA GPU not found, exiting script."
  exit 1
fi

# sysfs device name; assumes PCI domain 0000
GPU_DEV="0000:${GPU_BDF}"

# Set power/control to 'on' for the GPU and every bridge above it so none of them runtime-suspend
CUR="/sys/bus/pci/devices/${GPU_DEV}"
while [ -n "$CUR" ] && [ -e "$CUR" ]; do
  if [ -w "$CUR/power/control" ]; then
    echo on | tee "$CUR/power/control"
  fi
  # Move up to the parent bridge
  PARENT=$(readlink -f "$CUR/..")
  [[ "$PARENT" == "/sys/devices" ]] && break
  CUR="$PARENT"
done

# Enable NVIDIA persistence mode so the GPU stays initialized when no process is using it
nvidia-smi -pm 1

# Report completion
echo "GPU power control set to 'on' and ASPM disabled."

Make the script executable:

sudo chmod +x /usr/local/sbin/fix-nvidia-runtime-pm.sh

Then create a systemd unit so that it runs at boot:

sudo vim /etc/systemd/system/fix-nvidia-runtime-pm.service

[Unit]
Description=Force NVIDIA GPU power control 'on' and disable ASPM
After=multi-user.target
Wants=multi-user.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/fix-nvidia-runtime-pm.sh
RemainAfterExit=true

[Install]
WantedBy=multi-user.target

Reload systemd, then enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable fix-nvidia-runtime-pm.service
sudo systemctl start fix-nvidia-runtime-pm.service

After rebooting, verify the settings:

cat /sys/bus/pci/devices/0000:$(lspci | awk '/NVIDIA/{print $1; exit}')/power/control

If this prints on, the setting took effect.

cat /sys/module/pcie_aspm/parameters/policy


The output should show [performance]; the active policy is the one in square brackets.
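The policy file lists every available policy on one line and marks the active one with brackets, so a script checking it needs to pull out the bracketed entry. A minimal sketch follows. active_aspm_policy is a hypothetical helper name, and the optional file argument exists only so the function can be tried on a copy of the file.

```shell
#!/bin/bash

# Hypothetical helper: print the active ASPM policy, i.e. the entry in square brackets.
# Reads the sysfs policy file by default; a different path can be passed as $1.
active_aspm_policy() {
  sed -n 's/.*\[\([^]]*\)\].*/\1/p' "${1:-/sys/module/pcie_aspm/parameters/policy}"
}

# Usage:
#   [ "$(active_aspm_policy)" = "performance" ] && echo "ASPM policy OK"
```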
