
Autom Mate Local LLM GPT-OSS 120B & 20B Performance Test Report and API Feature Report

GPT-OSS 120B & 20B Performance Test Report

Executive Summary

When evaluating the GPT-OSS 20B and GPT-OSS 120B models, the core question is whether your organization should prioritize scalability and cost-efficiency or deep reasoning accuracy. This report provides the context, a side-by-side comparison, and a decision framework tailored for business leaders.

| Criteria | 20B Model | 120B Model | Recommendation |
|---|---|---|---|
| Concurrent Users (Scalability) | ✅ SLA-compliant up to 15 users; ❌ fails at 25 users | ✅ SLA-compliant up to 10 users; ❌ fails at 15+ users | For higher user volumes → 20B |
| Speed (Response Time) | 17–28% lower average response time in the tests | Slower; degrades as load increases | If speed is critical → 20B |
| Reasoning Accuracy | Adequate for daily operations | Much stronger reasoning, better for complex logic | For complex problem solving → 120B |
| Error Rate | Slightly higher at low load (11%); 0% errors at 15 users | Low at light load (6.7%); error rate steadily increases under load | For stability under load → 20B |
| Hardware Requirements | Runs on a single 16 GB GPU; edge-device friendly | Requires 2 × 48 GB GPUs minimum | If infra cost is a concern → 20B |
| Cost | Low hardware & energy cost | Very high infra & energy cost | Budget-sensitive cases → 20B |
| Use Cases | Customer support, daily ops, reporting | Strategic decision support, deep analysis | Broad adoption → 20B; niche high-accuracy needs → 120B |
| User Experience (UX) | Faster, more stable → high satisfaction | Slower, potential delays → lower satisfaction | If UX is a priority → 20B |
| Risk Management | Predictable, low risk under scaling | Higher risk: SLA violations beyond 10 users | Risk-averse strategy → 20B |
| Return on Investment (ROI) | Lower cost → fast ROI | Higher cost, but long-term value from reasoning accuracy | Short-term ROI → 20B; long-term insight → 120B |
| Application Fit | Operational tasks, customer services, mid-scale analytics | Management insights, critical analysis, low-concurrency workloads | Operations → 20B; strategy → 120B |
| Future Growth / Scaling | Easily deployable on existing or edge infrastructure | Requires significant GPU investment to scale | Flexible growth path → 20B |
| Adaptability (Flexibility) | Hardware-agnostic, works in varied environments | Hardware-dependent, limited flexibility | For flexible deployment → 20B |

When deciding between the GPT-OSS 20B and GPT-OSS 120B models, the choice depends on whether your priority is scalability and cost-efficiency or deep reasoning accuracy.

  • 20B Model

    • Best suited for daily operations, customer support, and mid-scale workloads.

    • Handles up to 15 users at once reliably, with faster response times and lower costs.

    • Can run on a single GPU, making it practical for both cloud and edge deployments.

    • Offers quick ROI and better user experience thanks to speed and stability.

  • 120B Model

    • Designed for complex, strategic, and high-accuracy tasks where reasoning quality is critical.

    • Works best with 10 users or fewer; beyond that, performance and stability drop.

    • Requires powerful (and costly) hardware, making it more suitable for niche, mission-critical scenarios.

    • Provides long-term strategic value, but with significantly higher upfront and operational costs.

📌 Bottom Line:

  • Choose 20B if you want speed, cost savings, scalability, and broader adoption across your teams.

  • Choose 120B only if you need superior reasoning accuracy for critical decisions with limited users.


Test Environment

  • CPU: Intel i9-10900X

  • GPU: 2 × NVIDIA RTX A6000 (48 GB each, NVLink bridge)

  • RAM: 128 GB DDR4

  • Storage: 2 TB NVMe M.2 SSD

  • OS & Framework: Ubuntu, NVIDIA GPU Drivers, NVIDIA Docker, PyTorch/TensorFlow

  • Model Format: MXFP4 quantized (Hugging Face)


Comparative Results Table

| Model | Users | Avg (ms) | Median (ms) | P90 (ms) | P95 (ms) | P99 (ms) | Error Rate | Throughput (/min) | SLA (≤90s) |
|---|---|---|---|---|---|---|---|---|---|
| 120B | 5 | 41,802 | 43,437 | 54,304 | 60,111 | 114,040 | 6.7% | 7.0 | ✅ |
| 120B | 10 | 78,347 | 88,515 | 108,509 | 111,669 | 113,800 | 12.2% | 7.0 | ✅ |
| 120B | 15 | 97,493 | 117,635 | 150,114 | 160,119 | 163,594 | 14.3% | 8.3 | ❌ |
| 120B | 25 | 124,093 | 142,777 | 206,793 | 217,145 | 221,432 | 17.7% | 10.0 | ❌ |
| 20B | 5 | 34,355 | 38,462 | 47,163 | 49,897 | 55,427 | 11.0% | 8.4 | ✅ |
| 20B | 10 | 56,421 | 64,825 | 75,231 | 77,109 | 79,759 | 9.0% | 10.3 | ✅ |
| 20B | 15 | 73,574 | 85,901 | 108,422 | 110,204 | 112,986 | 0.0% | 11.4 | ✅ |
| 20B | 25 | 102,460 | 125,987 | 188,772 | 195,114 | 203,937 | 18.8% | 10.0 | ❌ |

📊 Analysis

The table shows how the two models diverge as concurrency increases:

  • 5 Users: Both models are SLA compliant. 20B (34s) is 18% faster than 120B (42s), giving a smoother experience under small workloads.

  • 10 Users: Both models remain SLA compliant. 20B (56s) is 28% faster than 120B (78s) and has a lower error rate (9% vs 12%), so 20B holds up better as load rises.

  • 15 Users: Critical divergence. 20B (73s, 0% errors) is SLA compliant, whereas 120B (97s, 14% errors) breaches the SLA. 20B therefore stays stable at the highest concurrency level either model passed.

  • 25 Users: Both exceed the SLA. 20B (102s) is still 17% faster than 120B (124s), but error rates reach about 18–19% for both. Scaling to this level is not possible without additional GPUs or optimization.
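The percentage gaps quoted above follow from the average response times in the Comparative Results Table. The short Python sketch below reproduces that arithmetic. The names `AVG_MS` and `percent_faster` are illustrative, and "faster" is assumed to mean the relative reduction in average response time, which appears to be how the report derives its figures.

```python
# Average response times (ms) copied from the Comparative Results Table.
AVG_MS = {
    "120B": {5: 41_802, 10: 78_347, 15: 97_493, 25: 124_093},
    "20B": {5: 34_355, 10: 56_421, 15: 73_574, 25: 102_460},
}


def percent_faster(users: int) -> float:
    """Relative reduction in average latency of 20B versus 120B, in percent."""
    slow = AVG_MS["120B"][users]
    fast = AVG_MS["20B"][users]
    return (slow - fast) / slow * 100


for users in (5, 10, 15, 25):
    print(f"{users:>2} users: 20B is {percent_faster(users):.1f}% faster")
```

Under this definition the gap is widest at 10 users and narrowest at 25 users.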

📌 Summary:

  • 20B → Faster, more stable, and cost-efficient for small to medium workloads.

  • 120B → Higher reasoning accuracy but limited scalability, suitable only for low-concurrency scenarios.


Model-Based Analyses

GPT-OSS-120B – Strengthened Analysis

Strengths:

  • High reasoning capacity: With 120B parameters, it offers higher accuracy in complex logical queries.

  • SLA compliant up to 10 users: Avg response times 41s (5 users) and 78s (10 users) remain under SLA.

  • Low error rate at light load (6.7%): Reliable under small concurrency.

Weaknesses:

  • SLA violation from 15 users onward: Avg 97s, exceeding SLA, with error rate rising to 14%.

  • Critical degradation at 25 users: Avg 124s, error rate 18% → proves scalability is limited.

  • High hardware cost: Cannot run without dual A6000 GPUs (48GB each), raising operational costs significantly.

📌 Conclusion (Evidence-Based): The 120B model maintains SLA compliance up to 10 users but fails beyond that. At 25 users, it reaches 124s avg with 18% errors, making it unstable. Thus, 120B should only be used for scenarios requiring high reasoning accuracy with low concurrency. Scaling requires extra GPUs or optimization.


GPT-OSS-20B – Strengthened Analysis

Strengths:

  • SLA compliant up to 15 users: Avg response times 34s, 56s, and 73s for 5/10/15 users respectively, all under SLA. At 15 users, 0% error rate demonstrates exceptional stability.

  • Faster than 120B:

    • 10 users → 20B: 56s, 120B: 78s → 28% faster

    • 15 users → 20B: 73s, 120B: 97s → 25% faster

  • Cost efficiency: Runs on a single 16GB GPU, making it ideal for edge deployments and cost-sensitive environments.

Weaknesses:

  • Fails SLA at 25 users: Avg 102s, error rate 18%.

  • Relatively high error rate (11%) at 5 users: Indicates need for inference pipeline tuning.

📌 Conclusion (Evidence-Based): The 20B model is SLA compliant up to 15 users, 17–28% faster than 120B on average response time, and significantly cheaper to operate. It is the best choice for small-to-medium workloads, edge deployments, and cost-driven environments, but requires scaling strategies at 25 users and beyond.


SLA Compliance Table (90s Avg Criterion)

| Model | 5 Users | 10 Users | 15 Users | 25 Users |
|---|---|---|---|---|
| 120B | ✅ | ✅ | ❌ | ❌ |
| 20B | ✅ | ✅ | ✅ | ❌ |
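The compliance marks above can be reproduced by comparing each average response time with the 90-second threshold. The sketch below assumes the SLA is judged on the average alone, as the table title states. The names `SLA_MS`, `sla_compliant`, and `max_compliant_users` are illustrative.

```python
SLA_MS = 90_000  # 90 s average-response criterion from the table title

# Average response times (ms) copied from the Comparative Results Table.
AVG_MS = {
    "120B": {5: 41_802, 10: 78_347, 15: 97_493, 25: 124_093},
    "20B": {5: 34_355, 10: 56_421, 15: 73_574, 25: 102_460},
}


def sla_compliant(model: str, users: int) -> bool:
    """True when the measured average stays at or below the SLA threshold."""
    return AVG_MS[model][users] <= SLA_MS


def max_compliant_users(model: str) -> int:
    """Highest tested user count whose average meets the SLA (0 if none)."""
    return max((u for u in AVG_MS[model] if sla_compliant(model, u)), default=0)


for model in AVG_MS:
    print(model, "is compliant up to", max_compliant_users(model), "users")
```

Only the four tested concurrency levels are covered; user counts between them were not measured.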


Visual Comparisons

Average Response Time by Users

Analysis:

  • GPT-OSS-20B consistently maintains lower average response times compared to GPT-OSS-120B.

  • At 5 and 10 users, 20B stays well below the SLA threshold (90 seconds), whereas 120B approaches the limit by 10 users.

  • At 15 users, 120B exceeds the SLA (97s), while 20B still complies (73s).

  • At 25 users, both models fail the SLA, with 120B deteriorating more severely.

📌 Proof: The SLA line at 90s shows the clear break point: 20B is compliant through 15 users, while 120B already fails at 15 users.


Median Response Time by Users

Analysis:

  • Median values confirm the trend seen in the averages, with 120B degrading faster under load.

  • For 15 users, the median for 120B is 117s, far beyond SLA, while 20B is 85s, just under SLA.

  • At 25 users, both medians are far above the threshold: 120B exceeds 140s, compared to 125s for 20B. The gap between the two is narrower here than at 15 users.

📌 Proof: The median results show that even the “typical” (not worst-case) user experience with 120B deteriorates faster than with 20B under concurrent load.


Error Rate by Users

Analysis:

  • GPT-OSS-20B has higher error rate at 5 users (11%), suggesting inference optimizations are needed for low-load conditions.

  • At 15 users, 20B achieves 0% error rate, proving stability under moderate concurrency.

  • GPT-OSS-120B, however, sees error rates climb steadily: 6.7% → 12.2% → 14.3% → 17.7%.

  • At 25 users, both models reach unacceptable levels (≈18%).

📌 Proof: This demonstrates that 20B is more resilient up to 15 users, while 120B deteriorates consistently as load increases.


Throughput by Users

Analysis:

  • GPT-OSS-20B delivers higher throughput at 5, 10, and 15 users, peaking at 11.4 req/min at 15 users.

  • GPT-OSS-120B stays at 7.0 req/min for 5 and 10 users and rises to 8.3–10.0 req/min at 15 and 25 users, but only while breaching the SLA.

  • At 25 users, 20B throughput falls back from 11.4 to 10.0 req/min, matching 120B, which indicates saturation on the test hardware.

📌 Proof: For organizations requiring higher requests per minute under moderate concurrency (up to 15 users), 20B provides better scalability and efficiency.


Strategic Insights

  1. 20B → safer and more cost-effective:

    • SLA compliant up to 15 users (73s avg, 0% errors).

    • 17–28% faster than 120B on average response time (e.g., 10 users: 56s vs 78s).

    • Single GPU operation → ideal for edge devices and cost-sensitive environments.

  2. 120B → higher accuracy but limited scalability:

    • SLA compliant up to 10 users (78s avg, 12% errors).

    • SLA violation starts at 15 users (97s avg, 14% errors).

    • At 25 users: 124s avg, 18% errors → critical instability.

    • Dual GPU requirement → significantly higher operational cost.

  3. Scaling beyond 25 users requires investment:

    • Both models exceed SLA at 25 users (20B: 102s, 120B: 124s).

    • Error rates (~18%) prove that without GPU expansion, batch optimization, or pipeline parallelization, scaling is infeasible.

  4. Recommended approach:

    • 20B → Daily operations, mid-scale concurrency, edge deployments.

    • 120B → High reasoning accuracy, low concurrency scenarios.

    • 25+ users → Mandatory scaling via hardware and optimization.

📌 Final Decision Statement: The 20B model is the most efficient choice for small-to-medium workloads, delivering faster, more stable, and cost-effective results. The 120B model should only be deployed where reasoning accuracy is critical and concurrency is low.


API Feature Report – GPT-OSS 20B & 120B

This report provides a clear feature overview of the APIs, aligned with the performance SLA analysis of GPT-OSS-20B and GPT-OSS-120B. It highlights which endpoints are ready, which are under development, and how the models fit into operational scaling.

Runtime (Assistant Runtime) – ✅ Ready

  • Synchronous Q&A (Ask): POST /assistants/{assistant_id}/ask

  • Streaming Response: POST /assistants/{assistant_id}/stream (reduces perceived latency)

  • Session Management:

    • Create/List: POST/GET /assistants/{assistant_id}/sessions

  • User Management:

    • List users: GET /assistants/{assistant_id}/users

  • File Management (RAG support):

    • Upload: POST /assistants/{assistant_id}/files

    • List: GET /assistants/{assistant_id}/files

    • Delete: DELETE /assistants/{assistant_id}/files/{file_id}

  • Vector Store Management (RAG):

    • Create: POST /assistants/{assistant_id}/vectorstores

    • List: GET /assistants/{assistant_id}/vectorstores

  • System Health & Version:

    • Health: GET /health

    • Version: GET /version
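As an illustration of how a client might call the synchronous Ask endpoint listed above, the sketch below builds the HTTP request with Python's standard library. Only the path and method come from this report. The base URL, the `question` field in the JSON body, and the helper name `build_ask_request` are assumptions made for the example and should be checked against the actual API collection.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:8000"  # hypothetical; replace with your deployment


def build_ask_request(assistant_id: str, question: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the synchronous Ask endpoint."""
    # The JSON body shape is an assumption; the report lists only the path.
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/assistants/{quote(assistant_id, safe='')}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_ask_request("demo-assistant", "What is our refund policy?")
print(req.get_method(), req.full_url)
```

The request could then be sent with `urllib.request.urlopen(req, timeout=...)`. Given the measured response times, a timeout comfortably above the 90-second SLA target is probably sensible, and the streaming endpoint may be the better fit for interactive use.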


Administration (Admin Orchestrator) – ✅ Ready

  • Assistant Creation:

    • Auto-ID: POST /admin/assistants/auto

    • Explicit: POST /admin/assistants

  • Assistant Listing / Retrieval:

    • List: GET /admin/assistants

    • Detail: GET /admin/assistants/{assistant_id}

  • Traffic Management:

    • Enable: POST /admin/assistants/{assistant_id}/enable

    • Disable: POST /admin/assistants/{assistant_id}/disable

  • Deletion: DELETE /admin/assistants/{assistant_id}
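One way to read the admin endpoints above is as an assistant lifecycle: create, verify, enable traffic, disable traffic, delete. The sketch below lists that sequence as method and path pairs. The ordering and the helper name `admin_lifecycle` are assumptions for illustration; the report itself does not prescribe a sequence or any request bodies.

```python
from urllib.parse import quote


def admin_lifecycle(assistant_id: str) -> list[tuple[str, str]]:
    """Return (HTTP method, path) pairs for one possible assistant lifecycle."""
    aid = quote(assistant_id, safe="")  # keep the ID safe inside a URL path
    return [
        ("POST", "/admin/assistants"),                 # explicit creation
        ("GET", f"/admin/assistants/{aid}"),           # confirm it exists
        ("POST", f"/admin/assistants/{aid}/enable"),   # start routing traffic
        ("POST", f"/admin/assistants/{aid}/disable"),  # stop routing traffic
        ("DELETE", f"/admin/assistants/{aid}"),        # remove the assistant
    ]


for method, path in admin_lifecycle("demo-assistant"):
    print(f"{method:<6} {path}")
```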


Observability & Operations – 🛠 In Development

  • Metrics (performance & load): GET /metrics

  • Telemetry (deep monitoring): GET /telemetry

  • Admin Health Proxy (management-side health checks)

  • Assistant Update (model/vector settings): POST /admin/assistants/{assistant_id}


Architectural Flow (High-Level)

  • Autom Mate Chat → RAG check (files & vector store) →

    • If relevant data exists: context is injected into LLM

    • If no relevant data: query is forwarded directly to LLM

  • LLM model prepares the answer → Response returned to user

(This flow fully matches the runtime endpoints and RAG capabilities in the collections.)
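The flow above can be sketched as a small routing function. This is a minimal illustration, not the product's implementation: the `retrieve` and `llm` callables, the prompt layout, and the stand-in functions `fake_retrieve` and `echo_llm` are all hypothetical.

```python
from typing import Callable, List


def answer(
    query: str,
    retrieve: Callable[[str], List[str]],
    llm: Callable[[str], str],
) -> str:
    """Route a chat query through the RAG check before calling the LLM."""
    passages = retrieve(query)  # RAG check against files and vector store
    if passages:
        # Relevant data exists: inject it as context ahead of the question.
        context = "\n".join(passages)
        prompt = f"Context:\n{context}\n\nQuestion: {query}"
    else:
        # No relevant data: forward the query unchanged.
        prompt = query
    return llm(prompt)


# Stand-ins for the real retriever and model, for illustration only.
def fake_retrieve(q: str) -> List[str]:
    return ["Refunds are issued within 14 days."] if "refund" in q.lower() else []


def echo_llm(prompt: str) -> str:
    return f"LLM received: {prompt}"


print(answer("Hello there", fake_retrieve, echo_llm))
print(answer("What is the refund policy?", fake_retrieve, echo_llm))
```

Passing the retriever and the model in as callables keeps the routing logic testable without a GPU or a running vector store.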


SLA & Scaling (Aligned with Performance Report)

  • SLA target: Average response ≤ 90s.

  • 20B model: SLA-compliant up to 15 concurrent users; can run on a single GPU.

  • 120B model: SLA-compliant up to 10 concurrent users; requires 2× A6000 GPUs.

  • 25 users: Both models breach SLA → requires GPU scaling, batching, or parallelization.
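The limits above can be turned into a simple routing rule. The sketch below is one possible encoding, assuming the tested user counts are used as hard cut-offs. The function name `choose_model`, its return labels, and the conservative handling of untested load levels are assumptions for the example.

```python
def choose_model(concurrent_users: int, needs_deep_reasoning: bool) -> str:
    """Pick a model using the SLA limits measured in this report."""
    if concurrent_users > 15:
        # Both models breach the SLA at 25 users. Levels between 15 and 25
        # were not measured, so anything above 15 is treated conservatively.
        return "scale-out"
    if needs_deep_reasoning and concurrent_users <= 10:
        # 120B met the SLA only up to 10 concurrent users.
        return "120B"
    # 20B met the SLA up to 15 concurrent users.
    return "20B"


print(choose_model(8, needs_deep_reasoning=True))
print(choose_model(15, needs_deep_reasoning=True))
print(choose_model(25, needs_deep_reasoning=False))
```

A production router would more likely act on live latency and error metrics than on a fixed user count.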


Model Feature Summary

GPT-OSS-20B

  • Strengths:

    • SLA-compliant up to 15 users.

    • 17–28% faster than 120B on average response time in side-by-side tests.

    • Runs on a single GPU (16 GB VRAM) → cost-effective, edge-friendly.

  • Weaknesses:

    • SLA breach at 25 users (~102s avg, 18% error).

    • At 5 users, 11% error rate (requires inference tuning).

GPT-OSS-120B

  • Strengths:

    • Higher reasoning accuracy due to larger parameter size.

    • SLA-compliant up to 10 users (avg 41s and 78s).

    • Low error rate (6.7%) under small concurrency.

  • Weaknesses:

    • SLA breach at 15 users (~97s avg, 14% error).

    • Severe degradation at 25 users (~124s avg, 18% error).

    • Requires dual A6000 GPUs (48 GB each) → high infra cost.


Deployment Readiness – Quick View

  • Core runtime & admin functions: ✅ Ready

  • Observability (metrics, telemetry): 🛠 In development

  • Recommended path:

    • Short term: Production possible with current endpoints.

    • Medium term: Complete metrics/telemetry for monitoring.

    • Long term: Implement health-gating & automated traffic routing for scalability.
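Health-gating could, for example, combine the existing GET /health check with the admin enable and disable endpoints. The sketch below shows only the decision step and is an assumption about how such a gate might work. The helper name `gating_action` is hypothetical, and how a /health response maps to a `healthy` flag is not specified in this report.

```python
from typing import Optional


def gating_action(assistant_id: str, healthy: bool, enabled: bool) -> Optional[str]:
    """Return the admin path to POST to, or None if no change is needed."""
    base = f"/admin/assistants/{assistant_id}"
    if healthy and not enabled:
        return f"{base}/enable"   # service recovered: resume traffic
    if not healthy and enabled:
        return f"{base}/disable"  # service unhealthy: stop traffic
    return None                   # state already matches health


print(gating_action("demo-assistant", healthy=False, enabled=True))
print(gating_action("demo-assistant", healthy=True, enabled=True))
```

A scheduler would call this after each health poll and issue the returned request, if any.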
