Ollama Qwen3 性能测试

最近年假期间，尝试使用 Claude Code 结合 OpenAI 兼容模型接口，摸索 AI Spec 编程模式。之前看到有网友调侃，claude code 是成本杀手还没有直观感觉。这几天，换成了自己的百炼账号和 Openrouter，看到余额在肉眼可见的减少，才意识到这个成本有多高。为此，我决定尝试下，看看能不能找到一个性价比高的替代方案。首先想到的是，利用手里已有的计算资源，用 Ollama 来跑 Qwen3 Coder 模型，看看效果如何。

先看下手里现有的计算资源：首先可用的，是一台分配了 1/3 L20 GPU 的虚拟机；其次，是手里日常使用的 mpb m4；最后，是闲置下来作为备用机的 mbp m1。具体的机器配置信息如下：

机器	CPU	GPU	内存	硬盘
虚拟机	AMD EPYC 9K84 96-Core Processor7	1/3 L20 GPU (14GB 显存)	64 GB	-
MBP M4	Apple M4 Pro	-	48 GB	1TB SSD
MBP M1	Apple M1 Pro	-	32 GB	512GB SSD

环境准备

首先，安装 Ollama，安装过程略过。为了方便测试，修改 ollama 的启动变量 OLLAMA_HOST=0.0.0.0:18443。安装完成后，拉取并启动模型：

1
ollama run qwen3-coder:30b

压测工具，我选择魔搭开源的 EvalScope。运行压测任务。

环境准备上，我在 m4 机器上，使用 Rust 编写 python 包管理工具 UV，初始化一个 aiperf 项目的 venv 环境。这里面有个小坑，uv 不会使用 home 目录下配置的 pip 镜像源，需要在 uv 自己的配置文件中，添加镜像源。在 ~/.config/uv/uv.toml 中，添加如下配置，才能使用清华的 tuna 源加速下载安装包：

1
index-url = "https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple"

最后，运行如下命令，安装 EvalScope：

1
uv pip install "evalscope[perf]" -U

压测脚本

压测脚本如下：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
from evalscope.perf.main import run_perf_benchmark
from evalscope.perf.arguments import Arguments

task_cfg = Arguments(
    parallel=[1, 5, 10],
    number=[10, 15, 20],
    model='qwen3-coder:30b',
    url='http://modelhost.internal.com:18443/v1/chat/completions',
    api='openai',
    dataset='random',
    min_tokens=1024,
    max_tokens=1024,
    prefix_length=0,
    min_prompt_length=1024,
    max_prompt_length=1024,
    tokenizer_path='Qwen/Qwen3-Coder-30B-A3B-Instruct',
    extra_args={'ignore_eos': True}
)
results = run_perf_benchmark(task_cfg)

压测结果

从压测结果来看，虚拟机上的 L20 GPU，性能表现最差，M4 和 M1 的表现差不多，M4 大致有 M1 1.6倍的性能。L20 的表现，比 M4 和 M1 差了 10 倍以上，是意料之外的。

下面是压测报告详情：

L20-0.3-14GB

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
2025-08-24 21:43:36,766 - evalscope - INFO - Save the summary to: ./outputs/20250824_205110/qwen3-coder_30b/parallel_10_number_20
╭──────────────────────────────────────────────────────────╮
│ Performance Test Summary Report                          │
╰──────────────────────────────────────────────────────────╯

Basic Information:
┌───────────────────────┬──────────────────────────────────┐
│ Model                 │ qwen3-coder:30b                  │
│ Total Generated       │ 21,464.0004 tokens               │
│ Total Test Time       │ 3100.21 seconds                  │
│ Avg Output Rate       │ 6.92 tokens/sec                  │
└───────────────────────┴──────────────────────────────────┘


                                    Detailed Performance Metrics
┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃      ┃      ┃      Avg ┃      P99 ┃    Gen. ┃      Avg ┃     P99 ┃      Avg ┃     P99 ┃   Success┃
┃Conc. ┃  RPS ┃  Lat.(s) ┃  Lat.(s) ┃  toks/s ┃  TTFT(s) ┃ TTFT(s) ┃  TPOT(s) ┃ TPOT(s) ┃      Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│    1 │ 0.01 │   78.645 │  145.346 │    6.62 │    8.018 │  12.765 │    0.136 │   0.149 │    100.0%│
│    5 │ 0.01 │  310.401 │  423.056 │    6.94 │  246.326 │ 358.330 │    0.136 │   0.148 │    100.0%│
│   10 │ 0.01 │  483.161 │  653.454 │    7.09 │  418.198 │ 594.951 │    0.135 │   0.143 │     95.0%│
└──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘


               Best Performance Configuration
 Highest RPS         Concurrency 1 (0.01 req/sec)
 Lowest Latency      Concurrency 1 (78.645 seconds)

Performance Recommendations:
• Consider lowering concurrency, current load may be too high

M4-48GB

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
2025-08-24 20:47:36,775 - evalscope - INFO - Save the summary to: ./outputs/20250824_203929/qwen3-coder_30b/parallel_10_number_20
╭──────────────────────────────────────────────────────────╮
│ Performance Test Summary Report                          │
╰──────────────────────────────────────────────────────────╯

Basic Information:
┌───────────────────────┬──────────────────────────────────┐
│ Model                 │ qwen3-coder:30b                  │
│ Total Generated       │ 21,144.0 tokens                  │
│ Total Test Time       │ 464.34 seconds                   │
│ Avg Output Rate       │ 45.54 tokens/sec                 │
└───────────────────────┴──────────────────────────────────┘


                                    Detailed Performance Metrics
┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃      ┃      ┃      Avg ┃      P99 ┃    Gen. ┃      Avg ┃     P99 ┃      Avg ┃     P99 ┃   Success┃
┃Conc. ┃  RPS ┃  Lat.(s) ┃  Lat.(s) ┃  toks/s ┃  TTFT(s) ┃ TTFT(s) ┃  TPOT(s) ┃ TPOT(s) ┃      Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│    1 │ 0.09 │   11.578 │   20.408 │   46.19 │    2.027 │   2.403 │    0.018 │   0.018 │    100.0%│
│    5 │ 0.11 │   39.557 │   54.246 │   43.97 │   32.636 │  43.541 │    0.018 │   0.018 │    100.0%│
│   10 │ 0.09 │   80.286 │  122.025 │   46.14 │   71.383 │ 115.365 │    0.018 │   0.018 │    100.0%│
└──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘


               Best Performance Configuration
 Highest RPS         Concurrency 5 (0.11 req/sec)
 Lowest Latency      Concurrency 1 (11.578 seconds)

Performance Recommendations:
• Optimal concurrency range is around 5

M1-32GB

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
2025-08-24 21:04:05,915 - evalscope - INFO - Save the summary to: ./outputs/20250824_204952/qwen3-coder_30b/parallel_10_number_20
╭──────────────────────────────────────────────────────────╮
│ Performance Test Summary Report                          │
╰──────────────────────────────────────────────────────────╯

Basic Information:
┌───────────────────────┬──────────────────────────────────┐
│ Model                 │ qwen3-coder:30b                  │
│ Total Generated       │ 20,063.0 tokens                  │
│ Total Test Time       │ 713.89 seconds                   │
│ Avg Output Rate       │ 28.10 tokens/sec                 │
└───────────────────────┴──────────────────────────────────┘


                                    Detailed Performance Metrics
┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃      ┃      ┃      Avg ┃      P99 ┃    Gen. ┃      Avg ┃     P99 ┃      Avg ┃     P99 ┃   Success┃
┃Conc. ┃  RPS ┃  Lat.(s) ┃  Lat.(s) ┃  toks/s ┃  TTFT(s) ┃ TTFT(s) ┃  TPOT(s) ┃ TPOT(s) ┃      Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│    1 │ 0.05 │   18.984 │   32.272 │   28.93 │    3.719 │   3.987 │    0.028 │   0.029 │    100.0%│
│    5 │ 0.07 │   63.660 │   88.960 │   27.17 │   53.220 │  76.165 │    0.027 │   0.028 │    100.0%│
│   10 │ 0.06 │  119.327 │  179.827 │   28.23 │  107.219 │ 171.498 │    0.027 │   0.028 │    100.0%│
└──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘


               Best Performance Configuration
 Highest RPS         Concurrency 5 (0.07 req/sec)
 Lowest Latency      Concurrency 1 (18.984 seconds)

Performance Recommendations:
• Optimal concurrency range is around 5

L20-0.3-14GB 二测

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
2025-08-25 01:16:41,232 - evalscope - INFO - Save the summary to: ./outputs/20250825_002641/qwen3-coder_30b/parallel_10_number_20
╭──────────────────────────────────────────────────────────╮
│ Performance Test Summary Report                          │
╰──────────────────────────────────────────────────────────╯

Basic Information:
┌───────────────────────┬──────────────────────────────────┐
│ Model                 │ qwen3-coder:30b                  │
│ Total Generated       │ 21,113.999499999998 tokens       │
│ Total Test Time       │ 2971.01 seconds                  │
│ Avg Output Rate       │ 7.11 tokens/sec                  │
└───────────────────────┴──────────────────────────────────┘


                                    Detailed Performance Metrics
┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃      ┃      ┃      Avg ┃      P99 ┃    Gen. ┃      Avg ┃     P99 ┃      Avg ┃     P99 ┃   Success┃
┃Conc. ┃  RPS ┃  Lat.(s) ┃  Lat.(s) ┃  toks/s ┃  TTFT(s) ┃ TTFT(s) ┃  TPOT(s) ┃ TPOT(s) ┃      Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│    1 │ 0.01 │   76.279 │   96.921 │    7.11 │    3.263 │   3.387 │    0.135 │   0.141 │    100.0%│
│    5 │ 0.02 │  298.099 │  410.941 │    7.08 │  236.417 │ 326.334 │    0.135 │   0.141 │    100.0%│
│   10 │ 0.01 │  471.041 │  638.173 │    7.13 │  401.407 │ 569.030 │    0.134 │   0.146 │     85.0%│
└──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘


               Best Performance Configuration
 Highest RPS         Concurrency 5 (0.02 req/sec)
 Lowest Latency      Concurrency 1 (76.279 seconds)

Performance Recommendations:
• Optimal concurrency range is around 5
• Success rate is low at high concurrency, check system resources or reduce concurrency