Autom Mate Local LLM GPT-OSS 120B & 20B Performance Test Report and API Feature Report
GPT-OSS 120B & 20B Performance Test Report
Executive Summary
When evaluating the GPT-OSS 20B and GPT-OSS 120B models, the core question is whether your organization should prioritize scalability and cost-efficiency or deep reasoning accuracy. This report provides the context, a side-by-side comparison, and a decision framework tailored for business leaders.
| Criteria | 20B Model | 120B Model | Recommendation |
| --- | --- | --- | --- |
| Concurrent Users (Scalability) | ✅ SLA-compliant up to 15 users; ❌ fails at 25 users | ✅ SLA-compliant up to 10 users; ❌ fails at 15+ users | For higher user volumes → 20B |
| Speed (Response Time) | About 17–28% faster by average response time | Slower; degrades as load increases | If speed is critical → 20B |
| Reasoning Accuracy | Adequate for daily operations | Much stronger reasoning, better for complex logic | For complex problem solving → 120B |
| Error Rate | Slightly higher at low load (11%); 0% errors at 15 users | Low at light load (6.7%); error rate steadily increases under load | For stability under load → 20B |
| Hardware Requirements | Runs on a single 16 GB GPU; edge-device friendly | Requires 2 × 48 GB GPUs minimum | If infra cost is a concern → 20B |
| Cost | Low hardware & energy cost | Very high infra & energy cost | Budget-sensitive cases → 20B |
| Use Cases | Customer support, daily ops, reporting | Strategic decision support, deep analysis | Broad adoption → 20B; niche high-accuracy needs → 120B |
| User Experience (UX) | Faster, more stable → high satisfaction | Slower, potential delays → lower satisfaction | If UX is a priority → 20B |
| Risk Management | Predictable, low risk under scaling | Higher risk: SLA violations beyond 10 users | Risk-averse strategy → 20B |
| Return on Investment (ROI) | Lower cost → fast ROI | Higher cost, but long-term value from reasoning accuracy | Short-term ROI → 20B; long-term insight → 120B |
| Application Fit | Operational tasks, customer services, mid-scale analytics | Management insights, critical analysis, low-concurrency workloads | Operations → 20B; strategy → 120B |
| Future Growth / Scaling | Easily deployable on existing or edge infrastructure | Requires significant GPU investment to scale | Flexible growth path → 20B |
| Adaptability (Flexibility) | Hardware-agnostic, works in varied environments | Hardware-dependent, limited flexibility | For flexible deployment → 20B |
When deciding between the GPT-OSS 20B and GPT-OSS 120B models, the choice depends on whether your priority is scalability and cost-efficiency or deep reasoning accuracy.
20B Model
Best suited for daily operations, customer support, and mid-scale workloads.
Handles up to 15 users at once reliably, with faster response times and lower costs.
Can run on a single GPU, making it practical for both cloud and edge deployments.
Offers quick ROI and better user experience thanks to speed and stability.
120B Model
Designed for complex, strategic, and high-accuracy tasks where reasoning quality is critical.
Works best with 10 users or fewer; beyond that, performance and stability drop.
Requires powerful (and costly) hardware, making it more suitable for niche, mission-critical scenarios.
Provides long-term strategic value, but with significantly higher upfront and operational costs.
Bottom Line:
Choose 20B if you want speed, cost savings, scalability, and broader adoption across your teams.
Choose 120B only if you need superior reasoning accuracy for critical decisions with limited users.
Test Environment
CPU: Intel i9-10900X
GPU: 2 × NVIDIA RTX A6000 (48 GB each, NVLink bridge)
RAM: 128 GB DDR4
Storage: 2 TB NVMe M.2 SSD
OS & Framework: Ubuntu, NVIDIA GPU Drivers, NVIDIA Docker, PyTorch/TensorFlow
Model Format: MXFP4 quantized (Hugging Face)
The 20B model can also run on a single GPU; these tests used two GPUs connected by an NVLink bridge.
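Comparative Results Table
| Model | Users | Avg (ms) | Median (ms) | (unlabeled, ms) | (unlabeled, ms) | (unlabeled, ms) | Error rate | Throughput (req/min) | SLA (avg ≤ 90 s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 120B | 5 | 41,802 | 43,437 | 54,304 | 60,111 | 114,040 | 6.7% | 7.0 | ✅ |
| 120B | 10 | 78,347 | 88,515 | 108,509 | 111,669 | 113,800 | 12.2% | 7.0 | ✅ |
| 120B | 15 | 97,493 | 117,635 | 150,114 | 160,119 | 163,594 | 14.3% | 8.3 | ❌ |
| 120B | 25 | 124,093 | 142,777 | 206,793 | 217,145 | 221,432 | 17.7% | 10.0 | ❌ |
| 20B | 5 | 34,355 | 38,462 | 47,163 | 49,897 | 55,427 | 11.0% | 8.4 | ✅ |
| 20B | 10 | 56,421 | 64,825 | 75,231 | 77,109 | 79,759 | 9.0% | 10.3 | ✅ |
| 20B | 15 | 73,574 | 85,901 | 108,422 | 110,204 | 112,986 | 0.0% | 11.4 | ✅ |
| 20B | 25 | 102,460 | 125,987 | 188,772 | 195,114 | 203,937 | 18.8% | 10.0 | ❌ |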
Analysis
The table shows how the two models diverge at different concurrency levels:
5 Users: Both models are SLA compliant. However, 20B (34s) is 18% faster than 120B (42s), offering a smoother experience in small workloads.
10 Users: Both models remain SLA compliant. 20B (56s) is 28% faster than 120B (78s), with a lower error rate (9% vs 12%). This suggests 20B stays more stable as load rises.
15 Users: Critical divergence. 20B (73s, 0% errors) is SLA compliant, whereas 120B (97s, 14% errors) breaches SLA. This indicates 20B can remain stable at this concurrency level.
25 Users: Both exceed SLA. 20B (102s) is still faster than 120B (124s) by 18%, but error rates reach ~18% for both. This indicates that further scaling requires additional GPUs or optimization.
Summary:
20B → Faster, more stable, and cost-efficient for small to medium workloads.
120B → Higher reasoning accuracy but limited scalability, suitable only for low-concurrency scenarios.
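The percentage figures in the analysis can be reproduced from the average response times in the results table. The following is a minimal sketch, assuming "faster" means the relative reduction in average response time; the function names are illustrative and not part of any Autom Mate tooling.

```python
# Average response times in milliseconds, copied from the results table.
AVG_MS = {
    "120B": {5: 41_802, 10: 78_347, 15: 97_493, 25: 124_093},
    "20B": {5: 34_355, 10: 56_421, 15: 73_574, 25: 102_460},
}


def percent_faster(fast_ms: float, slow_ms: float) -> float:
    """Return how much shorter fast_ms is than slow_ms, as a percentage."""
    return (slow_ms - fast_ms) / slow_ms * 100.0


def speedup_by_users(avg_ms: dict) -> dict:
    """Map each concurrency level to the 20B-vs-120B latency reduction."""
    return {
        users: round(percent_faster(avg_ms["20B"][users], slow_ms), 1)
        for users, slow_ms in avg_ms["120B"].items()
    }


if __name__ == "__main__":
    for users, pct in speedup_by_users(AVG_MS).items():
        print(f"{users:>2} users: 20B is {pct}% faster on average")
```

With the unrounded millisecond values this gives 17.8%, 28.0%, 24.5%, and 17.4% for 5, 10, 15, and 25 users. The small differences from the figures quoted in the text come from rounding the averages to whole seconds first.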
Model-Based Analyses
GPT-OSS-120B: Strengthened Analysis
Strengths:
High reasoning capacity: With 120B parameters, it offers higher accuracy in complex logical queries.
SLA compliant up to 10 users: Avg response times 41s (5 users) and 78s (10 users) remain under SLA.
Low error rate at light load (6.7%): Reliable under small concurrency.
Weaknesses:
SLA violation from 15 users onward: Avg 97s, exceeding SLA, with error rate rising to 14%.
Critical degradation at 25 users: Avg 124s, error rate 18%, which indicates limited scalability.
High hardware cost: Cannot run without dual A6000 GPUs (48GB each), raising operational costs significantly.
Conclusion (Evidence-Based): The 120B model maintains SLA compliance up to 10 users but fails beyond that. At 25 users, it reaches 124s avg with 18% errors, making it unstable. Thus, 120B should only be used for scenarios requiring high reasoning accuracy with low concurrency. Scaling requires extra GPUs or optimization.
GPT-OSS-20B: Strengthened Analysis
Strengths:
SLA compliant up to 15 users: Avg response times 34s, 56s, and 73s for 5/10/15 users respectively, all under SLA. At 15 users, 0% error rate demonstrates exceptional stability.
Faster than 120B:
10 users: 20B 56s vs 120B 78s → 28% faster
15 users: 20B 73s vs 120B 97s → 25% faster
Cost efficiency: Runs on a single 16GB GPU, making it ideal for edge deployments and cost-sensitive environments.
Weaknesses:
Fails SLA at 25 users: Avg 102s, error rate 18%.
Relatively high error rate (11%) at 5 users: Indicates need for inference pipeline tuning.
Conclusion (Evidence-Based): The 20B model is SLA compliant up to 15 users, about 17–28% faster than 120B by average response time, and significantly cheaper to operate. It is the best choice for small-to-medium workloads, edge deployments, and cost-driven environments, but requires scaling strategies at 25 users and above.
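SLA Compliance Table (90s Avg Criterion)
| Model | 5 users | 10 users | 15 users | 25 users |
| --- | --- | --- | --- | --- |
| 120B | ✅ | ✅ | ❌ | ❌ |
| 20B | ✅ | ✅ | ✅ | ❌ |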
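The compliance marks follow directly from comparing each average response time with the 90-second criterion. A minimal sketch of that check is shown here, assuming the SLA is judged on the average alone; `max_compliant_users` is a hypothetical helper name.

```python
from typing import Dict, Optional

SLA_AVG_LIMIT_MS = 90_000  # 90 s average-response criterion used in this report

# Average response times in milliseconds, copied from the results table.
AVG_MS = {
    "120B": {5: 41_802, 10: 78_347, 15: 97_493, 25: 124_093},
    "20B": {5: 34_355, 10: 56_421, 15: 73_574, 25: 102_460},
}


def sla_compliant(avg_ms: float, limit_ms: float = SLA_AVG_LIMIT_MS) -> bool:
    """True when the average response time meets the SLA limit."""
    return avg_ms <= limit_ms


def max_compliant_users(avg_by_users: Dict[int, float]) -> Optional[int]:
    """Largest tested user count that is compliant along with every lower one."""
    best = None
    for users in sorted(avg_by_users):
        if not sla_compliant(avg_by_users[users]):
            break  # stop at the first breach; higher levels do not count
        best = users
    return best


if __name__ == "__main__":
    for model, series in AVG_MS.items():
        print(model, "is SLA compliant up to", max_compliant_users(series), "users")
```

Only the tested levels (5, 10, 15, 25) are known, so the real breaking point for each model lies somewhere between its last compliant level and the next one tested.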
Visual Comparisons
Average Response Time by Users

Analysis:
GPT-OSS-20B consistently maintains lower average response times compared to GPT-OSS-120B.
At 5 and 10 users, 20B stays well below the SLA threshold (90 seconds), whereas 120B approaches the limit by 10 users.
At 15 users, 120B exceeds the SLA (97s), while 20B still complies (73s).
At 25 users, both models fail the SLA, with 120B deteriorating more severely.
Proof: The SLA line at 90s shows the clear break point: 20B is compliant until 15 users, while 120B fails earlier.
Median Response Time by Users

Analysis:
Median values confirm the trend in averages, but reveal greater variance for 120B under load.
For 15 users, the median for 120B is 117s, far beyond SLA, while 20B is 85s, just under SLA.
The gap widens at 25 users, where 120B exceeds 140s median, compared to 125s for 20B.
Proof: The median results show that even the "typical" (not worst-case) user experience with 120B deteriorates faster than with 20B under concurrent load.
Error Rate by Users

Analysis:
GPT-OSS-20B has a higher error rate at 5 users (11%), suggesting inference optimizations are needed for low-load conditions.
At 15 users, 20B achieves a 0% error rate, indicating stability under moderate concurrency.
GPT-OSS-120B, however, sees error rates climb steadily: 6.7% → 12.2% → 14.3% → 17.7%.
At 25 users, both models reach unacceptable levels (≈18%).
Proof: This demonstrates that 20B is more resilient at scale, while 120B deteriorates consistently as load increases.
Throughput by Users

Analysis:
GPT-OSS-20B delivers higher throughput at 5, 10, and 15 users, peaking at 11.4 req/min at 15 users.
GPT-OSS-120B ranges from 7.0 to 10.0 req/min, reaching its highest value at 25 users.
At 25 users, 20B throughput drops from 11.4 to 10.0 req/min and both models converge at 10.0 req/min, which suggests saturation on this hardware.
Proof: For organizations requiring higher requests per minute under moderate concurrency, 20B provides better scalability and efficiency.
Strategic Insights
20B → safer and more cost-effective:
SLA compliant up to 15 users (73s avg, 0% errors).
About 17–28% faster than 120B by average response time (e.g., 10 users: 56s vs 78s, 28% faster).
Single GPU operation → ideal for edge devices and cost-sensitive environments.
120B → higher accuracy but limited scalability:
SLA compliant up to 10 users (78s avg, 12% errors).
SLA violation starts at 15 users (97s avg, 14% errors).
At 25 users: 124s avg, 18% errors → critical instability.
Dual GPU requirement → significantly higher operational cost.
Scaling to 25 users and beyond requires investment:
Both models exceed SLA at 25 users (20B: 102s, 120B: 124s).
Error rates (~18%) indicate that scaling is not feasible without GPU expansion, batch optimization, or pipeline parallelization.
Recommended approach:
20B → Daily operations, mid-scale concurrency, edge deployments.
120B → High reasoning accuracy, low concurrency scenarios.
25+ users → Mandatory scaling via hardware and optimization.
Final Decision Statement: The 20B model is the most efficient choice for small-to-medium workloads, delivering faster, more stable, and cost-effective results. The 120B model should only be deployed where reasoning accuracy is critical and concurrency is low.
API Feature Report: GPT-OSS 20B & 120B
This report provides a clear feature overview of the APIs, aligned with the performance SLA analysis of GPT-OSS-20B and GPT-OSS-120B. It highlights which endpoints are ready, which are under development, and how the models fit into operational scaling.
Runtime (Assistant Runtime): ✅ Ready
Synchronous Q&A (Ask):
POST /assistants/{assistant_id}/ask
Streaming Response:
POST /assistants/{assistant_id}/stream (reduces perceived latency)
Session Management:
Create/List:
GET/POST /assistants/{assistant_id}/sessions
User Management:
List users:
GET /assistants/{assistant_id}/users
File Management (RAG support):
Upload:
POST /assistants/{assistant_id}/files
List:
GET /assistants/{assistant_id}/files
Delete:
DELETE /assistants/{assistant_id}/files/{file_id}
Vector Store Management (RAG):
Create:
POST /assistants/{assistant_id}/vectorstores
List:
GET /assistants/{assistant_id}/vectorstores
System Health & Version:
Health:
GET /health
Version:
GET /version
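As an illustration of how a client might call the synchronous Ask endpoint, here is a minimal sketch using only the Python standard library. The report does not specify a host, authentication, or request body schema, so `BASE_URL`, the `question` field, and the 120-second timeout are all assumptions made for the example.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: the report does not give a host


def build_ask_request(assistant_id: str, question: str,
                      base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for /assistants/{assistant_id}/ask."""
    safe_id = urllib.parse.quote(assistant_id, safe="")
    url = f"{base_url}/assistants/{safe_id}/ask"
    # Hypothetical body schema; the real field names are not documented here.
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(assistant_id: str, question: str, timeout_s: float = 120.0) -> str:
    """Send the request and return the raw response body as text."""
    request = build_ask_request(assistant_id, question)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.read().decode("utf-8")
```

The timeout is deliberately longer than the 90-second SLA average, since individual responses in the tests often took longer than the average. A client that needs lower perceived latency would use the streaming endpoint instead.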
Administration (Admin Orchestrator): ✅ Ready
Assistant Creation:
Auto-ID:
POST /admin/assistants/auto
Explicit:
POST /admin/assistants
Assistant Listing / Retrieval:
List:
GET /admin/assistants
Detail:
GET /admin/assistants/{assistant_id}
Traffic Management:
Enable:
POST /admin/assistants/{assistant_id}/enable
Disable:
POST /admin/assistants/{assistant_id}/disable
Deletion:
DELETE /admin/assistants/{assistant_id}
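The admin operations that take no request body can be captured in a small route table. This is a sketch under stated assumptions: `BASE_URL` and the `admin_request` helper are hypothetical, authentication is omitted because the report does not describe it, and the creation endpoints are left out because their request bodies are not documented here.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: the report does not give a host

# Method and path for each body-less admin operation listed in the report.
ADMIN_ROUTES = {
    "list": ("GET", "/admin/assistants"),
    "detail": ("GET", "/admin/assistants/{assistant_id}"),
    "enable": ("POST", "/admin/assistants/{assistant_id}/enable"),
    "disable": ("POST", "/admin/assistants/{assistant_id}/disable"),
    "delete": ("DELETE", "/admin/assistants/{assistant_id}"),
}


def admin_request(action: str, assistant_id: str = "",
                  base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the request for one admin action."""
    method, template = ADMIN_ROUTES[action]
    if "{assistant_id}" in template and not assistant_id:
        raise ValueError(f"{action!r} needs an assistant_id")
    path = template.format(assistant_id=urllib.parse.quote(assistant_id, safe=""))
    return urllib.request.Request(base_url + path, method=method)
```

For example, `urllib.request.urlopen(admin_request("disable", "a1"))` would take the assistant `a1` out of traffic, assuming the server is reachable at `BASE_URL`.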
Observability & Operations: In Development
Metrics (performance & load):
GET /metrics
Telemetry (deep monitoring):
GET /telemetry
Admin Health Proxy (management-side health checks)
Assistant Update (model/vector settings):
POST /admin/assistants/{assistant_id}
Architectural Flow (High-Level)
Autom Mate Chat → RAG check (files & vector store) →
If relevant data exists: context is injected into the LLM prompt
If no relevant data: query is forwarded directly to the LLM
The LLM prepares the answer → Response returned to user
(This flow fully matches the runtime endpoints and RAG capabilities in the collections.)
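The branching in this flow can be sketched in a few lines. The retrieval and model calls are passed in as plain functions because the report does not describe their real interfaces; the `retrieve` and `llm` parameters and the prompt layout are assumptions for illustration only.

```python
from typing import Callable, List


def answer(query: str,
           retrieve: Callable[[str], List[str]],
           llm: Callable[[str], str]) -> str:
    """Route a query through the RAG check, then to the LLM."""
    passages = retrieve(query)  # RAG check against files and vector store
    if passages:
        # Relevant data exists: inject it as context ahead of the question.
        context = "\n".join(passages)
        prompt = f"Context:\n{context}\n\nQuestion: {query}"
    else:
        # No relevant data: forward the query to the LLM unchanged.
        prompt = query
    return llm(prompt)


if __name__ == "__main__":
    # Stand-in functions so the sketch runs without a server.
    def fake_retrieve(q: str) -> List[str]:
        return ["Refunds take 5 days."] if "refund" in q else []

    def echo_llm(prompt: str) -> str:
        return prompt

    print(answer("How long does a refund take?", fake_retrieve, echo_llm))
    print(answer("Hello", fake_retrieve, echo_llm))
```

Keeping retrieval and generation as injected callables makes the routing logic testable on its own, separate from whichever model (20B or 120B) is serving the request.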
SLA & Scaling (Aligned with Performance Report)
SLA target: Average response ≤ 90s.
20B model: SLA-compliant up to 15 concurrent users; can run on a single GPU.
120B model: SLA-compliant up to 10 concurrent users; requires 2× A6000 GPUs.
25 users: Both models breach SLA → requires GPU scaling, batching, or parallelization.
Model Feature Summary
GPT-OSS-20B
Strengths:
SLA-compliant up to 15 users.
About 17–28% faster than 120B by average response time in side-by-side tests.
Runs on a single GPU (16 GB VRAM) → cost-effective, edge-friendly.
Weaknesses:
SLA breach at 25 users (~102s avg, 18% error).
At 5 users, 11% error rate (requires inference tuning).
GPT-OSS-120B
Strengths:
Higher reasoning accuracy due to larger parameter size.
SLA-compliant up to 10 users (avg 41s and 78s).
Low error rate (6.7%) under small concurrency.
Weaknesses:
SLA breach at 15 users (~97s avg, 14% error).
Severe degradation at 25 users (~124s avg, 18% error).
Requires dual A6000 GPUs (48 GB each) → high infra cost.
Deployment Readiness β Quick View
Core runtime & admin functions: ✅ Ready
Observability (metrics, telemetry): In development
Recommended path:
Short term: Production possible with current endpoints.
Medium term: Complete metrics/telemetry for monitoring.
Long term: Implement health-gating & automated traffic routing for scalability.
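The long-term item on health-gating could combine the health check with the enable/disable traffic endpoints. The report does not describe how this will be built, so the following is only a sketch of one possible approach: the `HealthGate` class, its failure threshold, and the injected callables are all hypothetical.

```python
from typing import Callable


class HealthGate:
    """Disable an assistant after repeated failed checks; re-enable on recovery."""

    def __init__(self,
                 check_health: Callable[[], bool],
                 enable: Callable[[str], None],
                 disable: Callable[[str], None],
                 failure_threshold: int = 3) -> None:
        self.check_health = check_health  # e.g. wraps GET /health
        self.enable = enable              # e.g. wraps POST .../enable
        self.disable = disable            # e.g. wraps POST .../disable
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.enabled = True  # assumption: traffic is on when the gate starts

    def tick(self, assistant_id: str) -> bool:
        """Run one health check; return whether traffic is enabled afterwards."""
        if self.check_health():
            self.failures = 0
            if not self.enabled:
                self.enable(assistant_id)
                self.enabled = True
        else:
            self.failures += 1
            if self.enabled and self.failures >= self.failure_threshold:
                self.disable(assistant_id)
                self.enabled = False
        return self.enabled
```

Calling `tick` on a schedule would take an assistant out of traffic after three consecutive failed checks and restore it after the first successful one. Requiring several failures before disabling avoids flapping on a single slow response.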