Running large language models on the Jetson Orin Nano with llama.cpp

🕒 2026-01-22 📁 AI Applications 👤 laumy

Installation

First, check whether /usr/local/cuda/bin/nvcc exists. If it does, set up the environment:

# takes effect in the current terminal only
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# verify that nvcc can be found
nvcc --version
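
To make these settings persist across sessions, you can append them to ~/.bashrc (a minimal sketch; adjust the paths if your CUDA installation lives elsewhere):

# persist the CUDA paths for future shells
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc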

If nvcc is missing, install it:

sudo -E apt install cuda-toolkit

Download the code

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Install the ModelScope download tool

pip install modelscope -i https://mirrors.aliyun.com/pypi/simple/
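
A quick sanity check that the tool installed correctly (assuming a standard pip install put the CLI on your PATH):

pip show modelscope     # confirm the package is installed
modelscope --help       # confirm the CLI entry point works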

Build

mkdir build && cd build

cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j $(nproc)
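
To confirm the CUDA backend actually got compiled in, one option is to inspect the binary's linked libraries from the repo root (a sketch, assuming the default shared-library build):

# the binary should pull in CUDA runtime libraries
ldd build/bin/llama-cli | grep -i cuda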

Running models

Low-power mode

Download the model

mkdir model_zoo
modelscope download --model Qwen/Qwen2.5-1.5B-Instruct-GGUF qwen2.5-1.5b-instruct-q4_0.gguf --local_dir ./model_zoo
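
Confirm the file landed where expected:

ls -lh ./model_zoo/qwen2.5-1.5b-instruct-q4_0.gguf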

Test the model

./build/bin/llama-cli -m ../model_zoo/qwen2.5-1.5b-instruct-q4_0.gguf --perf --show-timings -lv 3


Asking it to tell a different joke:

slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.952 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist 
slot launch_slot_: id  0 | task 184 | processing task, is_child = 0
slot update_slots: id  0 | task 184 | new prompt, n_ctx_slot = 4096, n_keep = 0, task.n_tokens = 248
slot update_slots: id  0 | task 184 | n_tokens = 236, memory_seq_rm [236, end)
slot update_slots: id  0 | task 184 | prompt processing progress, n_tokens = 248, batch.n_tokens = 12, progress = 1.000000
slot update_slots: id  0 | task 184 | prompt done, n_tokens = 248, batch.n_tokens = 12
slot init_sampler: id  0 | task 184 | init sampler, took 0.11 ms, tokens: text = 248, total = 248
OK, here's a different joke:
Why doesn't a computer breathe?
Because it only has one input port and one output port. slot print_timing: id  0 | task 184 | 
prompt eval time =     214.26 ms /    12 tokens (   17.85 ms per token,    56.01 tokens per second)
       eval time =    1691.48 ms /    22 tokens (   76.89 ms per token,    13.01 tokens per second)
      total time =    1905.74 ms /    34 tokens
slot      release: id  0 | task 184 | stop processing: n_tokens = 269, truncated = 0
srv  update_slots: all slots are idle


[ Prompt: 56.0 t/s | Generation: 13.0 t/s ]

Not very fast.

Let's try a 3B model.
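
The 3B GGUF can be fetched the same way (assuming Qwen publishes it on ModelScope under the analogous repo and file name):

modelscope download --model Qwen/Qwen2.5-3B-Instruct-GGUF qwen2.5-3b-instruct-q4_0.gguf --local_dir ./model_zoo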

./build/bin/llama-cli -m ../model_zoo/qwen2.5-3b-instruct-q4_0.gguf --perf --show-timings -lv 3

prompt eval time =    1412.32 ms /    15 tokens (   94.15 ms per token,    10.62 tokens per second)
       eval time =   48064.10 ms /   244 tokens (  196.98 ms per token,     5.08 tokens per second)
      total time =   49476.41 ms /   259 tokens
slot      release: id  0 | task 148 | stop processing: n_tokens = 469, truncated = 0
srv  update_slots: all slots are idle


[ Prompt: 10.6 t/s | Generation: 5.1 t/s ]

Still not great.

Let's rebuild with an extra flag that targets the GPU's CUDA compute capability. By default, llama.cpp's CMake build may not enable deep optimizations for a specific GPU architecture; the Orin's compute capability is 8.7.

rm -rf build && mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=87
# or: cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native
cmake --build . --config Release -j$(nproc)
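
You can verify that the architecture flag was picked up by inspecting the CMake cache (run from inside build/):

grep -i cuda_architectures CMakeCache.txt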

Test again:

./build/bin/llama-cli -m ../model_zoo/qwen2.5-3b-instruct-q4_0.gguf --perf --show-timings -lv 3 -ngl 99

prompt eval time =     201.92 ms /    13 tokens (   15.53 ms per token,    64.38 tokens per second)
       eval time =   39903.36 ms /   386 tokens (  103.38 ms per token,     9.67 tokens per second)
      total time =   40105.28 ms /   399 tokens
slot      release: id  0 | task 159 | stop processing: n_tokens = 598, truncated = 0
srv  update_slots: all slots are idle


[ Prompt: 64.4 t/s | Generation: 9.7 t/s ]

Inference speed is up, now holding steady at around 10 tokens/s.

Now run llama-bench:

./build/bin/llama-bench -m ../model_zoo/qwen2.5-3b-instruct-q4_0.gguf -ngl 99
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Orin, compute capability 8.7, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 3B Q4_0                  |   1.86 GiB |     3.40 B | CUDA       |  99 |           pp512 |        259.35 ± 0.46 |
| qwen2 3B Q4_0                  |   1.86 GiB |     3.40 B | CUDA       |  99 |           tg128 |          8.98 ± 0.03 |

build: 37c35f0e1 (7787)
Then enable maximum power mode, lock the clocks, and rerun the benchmark:

sudo nvpmodel -m 0        # enable maximum power mode
sudo jetson_clocks        # lock CPU/GPU/memory clocks to their maximum

./build/bin/llama-bench -m ../model_zoo/qwen2.5-3b-instruct-q4_0.gguf -ngl 99
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Orin, compute capability 8.7, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 3B Q4_0                  |   1.86 GiB |     3.40 B | CUDA       |  99 |           pp512 |        264.42 ± 0.27 |
| qwen2 3B Q4_0                  |   1.86 GiB |     3.40 B | CUDA       |  99 |           tg128 |          9.17 ± 0.01 |

build: 37c35f0e1 (7787)
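
While a benchmark is running, you can watch GPU utilization and clocks with tegrastats (a standard Jetson tool) to verify the power settings took effect:

sudo tegrastats --interval 1000    # print utilization and clocks once per second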

Super performance mode

The numbers above are still slow. According to the Jetson documentation, getting full performance requires setting the power mode to super mode: click the NVIDIA icon on the right side of the Ubuntu top bar and pick a power mode.

Power mode
    0: 15W
    1: 25W
    2: MAXN SUPER

If MAXN SUPER is not listed, run the following commands to install it:

sudo apt remove nvidia-l4t-bootloader
sudo apt install nvidia-l4t-bootloader
sudo reboot

If it still cannot be configured, refer to: super mode
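
The mode can also be switched from the command line (assuming MAXN SUPER is mode 2, as in the list above):

sudo nvpmodel -m 2     # select MAXN SUPER
sudo nvpmodel -q       # query the active mode to confirm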

Now run the models again.

./build/bin/llama-cli -m ../model_zoo_gguf/model_zoo/qwen2.5-1.5b-instruct-q4_0.gguf --perf --show-timings -lv 3 -ngl 99

./build/bin/llama-bench -m ../model_zoo_gguf/model_zoo/qwen2.5-1.5b-instruct-q4_0.gguf -ngl 99

Then the 3B model:

./build/bin/llama-bench -m ../model_zoo_gguf/model_zoo/qwen2.5-3b-instruct-q4_0.gguf -ngl 99

./build/bin/llama-cli -m ../model_zoo_gguf/model_zoo/qwen2.5-3b-instruct-q4_0.gguf -ngl 99 --perf -lv 3
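
If you'd rather expose the model over HTTP than use the interactive CLI, llama-server accepts the same model and offload flags (a sketch; host and port here are arbitrary choices):

./build/bin/llama-server -m ../model_zoo_gguf/model_zoo/qwen2.5-3b-instruct-q4_0.gguf -ngl 99 --host 0.0.0.0 --port 8080

You can then query the OpenAI-compatible /v1/chat/completions endpoint with curl or any HTTP client.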
