# Autom Mate Local LLM GPT-OSS 120B & 20B Performance Test Report and API Feature Report

## GPT-OSS 120B & 20B Performance Test Report

### **Executive Summary**

When evaluating the GPT-OSS 20B and GPT-OSS 120B models, the core question is whether your organization should prioritize **scalability and cost-efficiency** or **deep reasoning accuracy**. This report provides the context, a side-by-side comparison, and a decision framework tailored for business leaders.

| **Criteria**                       | **20B Model**                                               | **120B Model**                                                    | **Recommendation**                                            |
| ---------------------------------- | ----------------------------------------------------------- | ----------------------------------------------------------------- | ------------------------------------------------------------- |
| **Concurrent Users (Scalability)** | ✅ SLA-compliant up to **15 users**; ❌ fails at 25 users     | ✅ SLA-compliant up to **10 users**; ❌ fails at 15+ users          | For higher user volumes → **20B**                             |
| **Speed (Response Time)**          | **17–28% lower** average response time in these tests       | Slower; degrades as load increases                                | If speed is critical → **20B**                                |
| **Reasoning Accuracy**             | Adequate for daily operations                               | Much stronger reasoning, better for complex logic                 | For complex problem solving → **120B**                        |
| **Error Rate**                     | Higher at low load (11% at 5 users); **0% errors at 15 users** | Low at light load (6.7%); error rate rises steadily under load | For stability under load → **20B**                            |
| **Hardware Requirements**          | Runs on a **single 16 GB GPU**; edge-device friendly        | Requires **2 × 48 GB GPUs** minimum                               | If infra cost is a concern → **20B**                          |
| **Cost**                           | Low hardware & energy cost                                  | Very high infra & energy cost                                     | Budget-sensitive cases → **20B**                              |
| **Use Cases**                      | Customer support, daily ops, reporting                      | Strategic decision support, deep analysis                         | Broad adoption → **20B**; niche high-accuracy needs → **120B** |
| **User Experience (UX)**           | Faster, more stable → **high satisfaction**                 | Slower, potential delays → **lower satisfaction**                 | If UX is a priority → **20B**                                 |
| **Risk Management**                | Predictable, low risk under scaling                         | Higher risk: SLA violations beyond 10 users                       | Risk-averse strategy → **20B**                                |
| **Return on Investment (ROI)**     | Lower cost → **fast ROI**                                   | Higher cost, but long-term value from reasoning accuracy          | Short-term ROI → **20B**; long-term insight → **120B**        |
| **Application Fit**                | Operational tasks, customer services, mid-scale analytics   | Management insights, critical analysis, low-concurrency workloads | Operations → **20B**; strategy → **120B**                     |
| **Future Growth / Scaling**        | Easily deployable on existing or edge infrastructure        | Requires significant GPU investment to scale                      | Flexible growth path → **20B**                                |
| **Adaptability (Flexibility)**     | Hardware-agnostic, works in varied environments             | Hardware-dependent, limited flexibility                           | For flexible deployment → **20B**                             |

When deciding between the **GPT-OSS 20B** and **GPT-OSS 120B** models, the choice depends on whether your priority is **scalability and cost-efficiency** or **deep reasoning accuracy**.

* **20B Model**
  * Best suited for **daily operations, customer support, and mid-scale workloads**.
  * Handles up to **15 users at once** reliably, with faster response times and lower costs.
  * Can run on a single GPU, making it practical for both cloud and edge deployments.
  * Offers **quick ROI** and **better user experience** thanks to speed and stability.
* **120B Model**
  * Designed for **complex, strategic, and high-accuracy tasks** where reasoning quality is critical.
  * Works best with **10 users or fewer**; beyond that, performance and stability drop.
  * Requires powerful (and costly) hardware, making it more suitable for niche, mission-critical scenarios.
  * Provides **long-term strategic value**, but with significantly higher upfront and operational costs.

📌 **Bottom Line:**

* Choose **20B** if you want **speed, cost savings, scalability, and broader adoption** across your teams.
* Choose **120B** only if you need **superior reasoning accuracy** for **critical decisions with limited users**.

***

### **Test Environment**

* **CPU:** Intel i9-10900X
* **GPU:** 2 × NVIDIA RTX A6000 (48 GB each, NVLink bridge)
* **RAM:** 128 GB DDR4
* **Storage:** 2 TB NVMe M.2 SSD
* **OS & Framework:** Ubuntu, NVIDIA GPU Drivers, NVIDIA Docker, PyTorch/TensorFlow
* **Model Format:** MXFP4 quantized (Hugging Face)

{% hint style="warning" %}
The 20B model can also run on a single GPU; these tests used 2 GPUs (NVLink bridge).
{% endhint %}

***

### **Comparative Results Table**

<table><thead><tr><th width="102">Model</th><th width="105">Users</th><th width="118">Avg (ms)</th><th width="116">Median (ms)</th><th width="120">P90 (ms)</th><th width="90">P95 (ms)</th><th width="101">P99 (ms)</th><th width="96">Error Rate</th><th width="93">Throughput (req/min)</th><th>SLA (avg ≤ 90s)</th></tr></thead><tbody><tr><td><strong>120B</strong></td><td>5</td><td>41,802</td><td>43,437</td><td>54,304</td><td>60,111</td><td>114,040</td><td>6.7%</td><td>7.0</td><td>✅</td></tr><tr><td></td><td>10</td><td>78,347</td><td>88,515</td><td>108,509</td><td>111,669</td><td>113,800</td><td>12.2%</td><td>7.0</td><td>✅</td></tr><tr><td></td><td>15</td><td>97,493</td><td>117,635</td><td>150,114</td><td>160,119</td><td>163,594</td><td>14.3%</td><td>8.3</td><td>❌</td></tr><tr><td></td><td>25</td><td>124,093</td><td>142,777</td><td>206,793</td><td>217,145</td><td>221,432</td><td>17.7%</td><td>10.0</td><td>❌</td></tr><tr><td><strong>20B</strong></td><td>5</td><td>34,355</td><td>38,462</td><td>47,163</td><td>49,897</td><td>55,427</td><td>11.0%</td><td>8.4</td><td>✅</td></tr><tr><td></td><td>10</td><td>56,421</td><td>64,825</td><td>75,231</td><td>77,109</td><td>79,759</td><td>9.0%</td><td>10.3</td><td>✅</td></tr><tr><td></td><td>15</td><td>73,574</td><td>85,901</td><td>108,422</td><td>110,204</td><td>112,986</td><td>0.0%</td><td>11.4</td><td>✅</td></tr><tr><td></td><td>25</td><td>102,460</td><td>125,987</td><td>188,772</td><td>195,114</td><td>203,937</td><td>18.8%</td><td>10.0</td><td>❌</td></tr></tbody></table>

#### 📊 Analysis

The table clearly demonstrates how the two models diverge under different concurrency levels:

* **5 Users:** Both models are SLA compliant. However, **20B (34s)** is **18% faster than 120B (42s)**, offering a smoother experience in small workloads.
* **10 Users:** Both models remain SLA compliant. **20B (56s) vs 120B (78s)** → **28% faster**, with a lower error rate (9.0% vs 12.2%). This suggests 20B stays more stable as load increases.
* **15 Users:** Critical divergence. **20B (73s, 0% errors)** is SLA compliant, whereas **120B (97s, 14% errors)** breaches the SLA. In these tests, **20B sustained stability at a concurrency level where 120B did not**.
* **25 Users:** Both exceed the SLA. **20B (102s)** is still about 17% faster than **120B (124s)**, but error rates reach \~18% for both. This indicates that **scaling beyond this point is not possible without additional GPUs or optimization**.
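The percentages in this analysis can be recomputed from the average response times in the comparative table. The following is a minimal sketch of that arithmetic; the dictionary layout and the function name `pct_faster` are illustrative and not part of the test tooling.

```python
# Average response times (ms), copied from the Comparative Results Table.
AVG_MS = {
    "120B": {5: 41_802, 10: 78_347, 15: 97_493, 25: 124_093},
    "20B": {5: 34_355, 10: 56_421, 15: 73_574, 25: 102_460},
}


def pct_faster(users: int) -> float:
    """Percent reduction in average response time of 20B relative to 120B."""
    slow = AVG_MS["120B"][users]
    fast = AVG_MS["20B"][users]
    return (slow - fast) / slow * 100


for users in sorted(AVG_MS["20B"]):
    print(f"{users:>2} users: 20B is {pct_faster(users):.1f}% faster")
```

With the table's values this yields 17.8%, 28.0%, 24.5%, and 17.4% for 5, 10, 15, and 25 users respectively.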

📌 **Summary:**

* **20B** → Faster, more stable, and cost-efficient for small to medium workloads.
* **120B** → Higher reasoning accuracy but limited scalability, suitable only for low-concurrency scenarios.

***

### **Model-Based Analyses**

#### GPT-OSS-120B – Strengthened Analysis

**Strengths:**

* **High reasoning capacity:** With 120B parameters, it offers higher accuracy in complex logical queries.
* **SLA compliant up to 10 users:** Avg response times **42s (5 users) and 78s (10 users)** remain under SLA.
* **Low error rate at light load (6.7%):** Reliable under small concurrency.

**Weaknesses:**

* **SLA violation from 15 users onward:** Avg **97s**, exceeding SLA, with error rate rising to 14%.
* **Critical degradation at 25 users:** Avg **124s**, error rate 18% → indicates that scalability is limited.
* **High hardware cost:** Requires dual 48 GB GPUs (2 × A6000 in this setup), raising operational costs significantly.

**📌 Conclusion (Evidence-Based):**\
The 120B model maintains SLA compliance up to 10 users but fails beyond that. At 25 users, it reaches **124s avg with 18% errors**, making it unstable. Thus, **120B should only be used for scenarios requiring high reasoning accuracy with low concurrency. Scaling requires extra GPUs or optimization.**

***

#### GPT-OSS-20B – Strengthened Analysis

**Strengths:**

* **SLA compliant up to 15 users:** Avg response times **34s, 56s, and 73s** for 5/10/15 users respectively, all under SLA. At 15 users, **0% error rate** demonstrates exceptional stability.
* **Faster than 120B:**
  * 10 users → **20B: 56s, 120B: 78s** → **28% faster**
  * 15 users → **20B: 73s, 120B: 97s** → **25% faster**
* **Cost efficiency:** Runs on **a single 16GB GPU**, making it ideal for edge deployments and cost-sensitive environments.

**Weaknesses:**

* **Fails SLA at 25 users:** Avg **102s**, error rate 18.8%.
* **Relatively high error rate (11%) at 5 users:** Indicates need for inference pipeline tuning.

**📌 Conclusion (Evidence-Based):**\
The 20B model is **SLA compliant up to 15 users, 17–28% faster than 120B on average response time, and significantly cheaper to operate**. It is the best choice for **small-to-medium workloads, edge deployments, and cost-driven environments**, but requires scaling strategies at 25 users and beyond.

***

### **SLA Compliance Table (90s Avg Criterion)**

| Model    | 5 Users | 10 Users | 15 Users | 25 Users |
| -------- | ------- | -------- | -------- | -------- |
| **120B** | ✅       | ✅        | ❌        | ❌        |
| **20B**  | ✅       | ✅        | ✅        | ❌        |
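The compliance table follows directly from comparing each average response time with the 90-second criterion. A short sketch of that check, assuming only the averages from the comparative table (the names `sla_row` and `max_compliant_users` are illustrative):

```python
SLA_MS = 90_000  # SLA criterion used in this report: average response <= 90 s

# Average response times (ms), copied from the Comparative Results Table.
AVG_MS = {
    "120B": {5: 41_802, 10: 78_347, 15: 97_493, 25: 124_093},
    "20B": {5: 34_355, 10: 56_421, 15: 73_574, 25: 102_460},
}


def sla_row(model: str) -> dict[int, bool]:
    """Map each tested user count to True if the average meets the SLA."""
    return {users: avg <= SLA_MS for users, avg in AVG_MS[model].items()}


def max_compliant_users(model: str) -> int:
    """Highest tested user count whose average stays within the SLA (0 if none)."""
    return max((u for u, ok in sla_row(model).items() if ok), default=0)


for model in AVG_MS:
    marks = " ".join("✅" if ok else "❌" for ok in sla_row(model).values())
    print(f"{model:>4}: {marks}  (max compliant: {max_compliant_users(model)} users)")
```

Note that the check uses the average only; the percentile columns are not part of the SLA criterion in this report.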

***

### **Visual Comparisons**

#### Average Response Time by Users

<figure><img src="/files/1VlGAdgnmGwVgLnWb16U" alt=""><figcaption></figcaption></figure>

**Analysis:**

* GPT-OSS-20B consistently maintains lower average response times compared to GPT-OSS-120B.
* At **5 and 10 users**, 20B stays well below the SLA threshold (90 seconds), whereas 120B approaches the limit by 10 users.
* At **15 users**, 120B exceeds the SLA (97s), while 20B still complies (73s).
* At **25 users**, both models fail the SLA, with 120B deteriorating more severely.

📌 **Proof:** The SLA line at 90s shows the clear break point: 20B stays compliant through 15 users, while 120B already fails at 15.

***

#### Median Response Time by Users

<figure><img src="/files/znEgGdHjkZh9CUB5M54w" alt=""><figcaption></figcaption></figure>

**Analysis:**

* Median values confirm the trend in averages: 120B degrades faster than 20B under load.
* For **15 users**, the median for 120B is **117s**, far beyond the 90s threshold, while 20B is **85s**, just under it.
* At **25 users**, the gap narrows but remains: the 120B median exceeds 140s, compared to about 126s for 20B.

📌 **Proof:** The median results show that even the “typical” (not worst-case) user experience with 120B deteriorates faster than with 20B under concurrent load.

***

#### Error Rate by Users

<figure><img src="/files/FB05ZUbob8W5aF4AKMV2" alt=""><figcaption></figcaption></figure>

**Analysis:**

* GPT-OSS-20B has a higher error rate than 120B at **5 users (11% vs 6.7%)**, suggesting inference optimizations are needed for low-load conditions.
* At **15 users**, 20B achieves **0% error rate**, proving stability under moderate concurrency.
* GPT-OSS-120B, however, sees error rates climb steadily: **6.7% → 12.2% → 14.3% → 17.7%**.
* At **25 users**, both models reach unacceptable levels (≈18%).

📌 **Proof:** This demonstrates that 20B is more resilient at scale, while 120B deteriorates consistently as load increases.

***

#### Throughput by Users

<figure><img src="/files/CuzXEvmT1KUA9aVpX6GC" alt=""><figcaption></figcaption></figure>

**Analysis:**

* GPT-OSS-20B delivers **higher throughput at 5, 10, and 15 users**, peaking at **11.4 req/min at 15 users**.
* GPT-OSS-120B rises from **7.0 to 10.0 req/min**, matching 20B only at 25 users, where it already breaches the SLA.
* 20B's throughput drops from 11.4 to 10.0 req/min at **25 users**, which suggests saturation; at that load both models deliver the same throughput, with 20B at lower latency.

📌 **Proof:** For organizations requiring **higher requests per minute under moderate concurrency**, 20B provides better scalability and efficiency.

***

### **Strategic Insights**

1. **20B → safer and more cost-effective:**
   * SLA compliant up to 15 users (**73s avg, 0% errors**).
   * 17–28% faster than 120B on average response time (e.g., 10 users: 56s vs 78s).
   * Single GPU operation → **ideal for edge devices and cost-sensitive environments**.
2. **120B → higher accuracy but limited scalability:**
   * SLA compliant up to 10 users (**78s avg, 12% errors**).
   * SLA violation starts at 15 users (**97s avg, 14% errors**).
   * At 25 users: **124s avg, 18% errors** → critical instability.
   * Dual GPU requirement → significantly higher operational cost.
3. **Scaling beyond 25 users requires investment:**
   * Both models exceed SLA at 25 users (20B: 102s, 120B: 124s).
   * Error rates (\~18%) indicate that scaling is not feasible without **GPU expansion, batch optimization, or pipeline parallelization**.
4. **Recommended approach:**
   * **20B** → Daily operations, mid-scale concurrency, edge deployments.
   * **120B** → High reasoning accuracy, low concurrency scenarios.
   * **25+ users** → Mandatory scaling via hardware and optimization.
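The recommended approach can be read as a simple decision rule. The sketch below encodes it under two assumptions that go beyond the report: the function name and return values are hypothetical, and user counts between 16 and 24 (which were not tested) are conservatively treated as requiring scaling.

```python
# Highest tested concurrency at which each model met the 90 s average SLA.
MAX_SLA_USERS = {"20B": 15, "120B": 10}


def recommend(concurrent_users: int, needs_deep_reasoning: bool) -> str:
    """Suggest a model for a workload, following the report's recommendations."""
    if needs_deep_reasoning and concurrent_users <= MAX_SLA_USERS["120B"]:
        return "120B"
    if concurrent_users <= MAX_SLA_USERS["20B"]:
        # Also reached when deep reasoning is wanted but 120B would breach the SLA.
        return "20B"
    # Beyond the tested SLA-compliant range: hardware scaling or optimization.
    return "scale-out required"


print(recommend(8, needs_deep_reasoning=True))
print(recommend(12, needs_deep_reasoning=True))
print(recommend(25, needs_deep_reasoning=False))
```

Falling back to 20B when a reasoning-heavy workload exceeds 10 users is a design choice in this sketch; a deployment that cannot trade accuracy for latency would scale the 120B hardware instead.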

📌 **Final Decision Statement**\
The **20B model is the most efficient choice for small-to-medium workloads**, delivering faster, more stable, and cost-effective results.\
The **120B model should only be deployed where reasoning accuracy is critical and concurrency is low.**

***

## API Feature Report – GPT-OSS 20B & 120B

This report provides a **clear feature overview of the APIs**, aligned with the **performance SLA analysis** of GPT-OSS-20B and GPT-OSS-120B. It highlights which endpoints are ready, which are under development, and how the models fit into operational scaling.

### **Runtime (Assistant Runtime) – ✅ Ready**

* **Synchronous Q\&A (Ask):** `POST /assistants/{assistant_id}/ask`
* **Streaming Response:** `POST /assistants/{assistant_id}/stream` (reduces perceived latency)
* **Session Management:**
  * Create: `POST /assistants/{assistant_id}/sessions`
  * List: `GET /assistants/{assistant_id}/sessions`
* **User Management:**
  * List users: `GET /assistants/{assistant_id}/users`
* **File Management (RAG support):**
  * Upload: `POST /assistants/{assistant_id}/files`
  * List: `GET /assistants/{assistant_id}/files`
  * Delete: `DELETE /assistants/{assistant_id}/files/{file_id}`
* **Vector Store Management (RAG):**
  * Create: `POST /assistants/{assistant_id}/vectorstores`
  * List: `GET /assistants/{assistant_id}/vectorstores`
* **System Health & Version:**
  * Health: `GET /health`
  * Version: `GET /version`
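As an illustration of calling the runtime, the sketch below builds a request for the synchronous Ask endpoint using only the Python standard library. The base URL, the `question` field in the JSON body, and the absence of authentication headers are all assumptions; this report lists the endpoint paths but not their request or response schemas.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:8000"  # hypothetical; replace with your deployment


def build_ask_request(assistant_id: str, question: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /assistants/{assistant_id}/ask."""
    # The "question" field name is an assumption, not a documented schema.
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/assistants/{quote(assistant_id, safe='')}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_ask_request("demo-assistant", "What is our refund policy?")
print(req.get_method(), req.full_url)
# To send it against a running instance:
#   with urllib.request.urlopen(req, timeout=120) as resp:
#       print(resp.read().decode("utf-8"))
```

Given the response times measured in this report, a client timeout above the 90-second SLA is a sensible starting point; the 120-second value in the comment is only an example.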

***

### **Administration (Admin Orchestrator) – ✅ Ready**

* **Assistant Creation:**
  * Auto-ID: `POST /admin/assistants/auto`
  * Explicit: `POST /admin/assistants`
* **Assistant Listing / Retrieval:**
  * List: `GET /admin/assistants`
  * Detail: `GET /admin/assistants/{assistant_id}`
* **Traffic Management:**
  * Enable: `POST /admin/assistants/{assistant_id}/enable`
  * Disable: `POST /admin/assistants/{assistant_id}/disable`
* **Deletion:** `DELETE /admin/assistants/{assistant_id}`

***

### **Observability & Operations – 🛠 In Development**

* **Metrics (performance & load):** `GET /metrics`
* **Telemetry (deep monitoring):** `GET /telemetry`
* **Admin Health Proxy** (management-side health checks)
* **Assistant Update (model/vector settings):** `POST /admin/assistants/{assistant_id}`

***

### **Architectural Flow (High-Level)**

* **Autom Mate Chat** → **RAG check** (files & vector store) →
  * If **relevant data exists**: context is injected into LLM
  * If **no relevant data**: query is forwarded directly to LLM
* **LLM model** prepares the answer → **Response returned to user**

*(This flow fully matches the runtime endpoints and RAG capabilities in the collections.)*
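The routing decision in this flow can be sketched in a few lines. Everything here other than the flow itself is hypothetical: the relevance threshold, the prompt layout, and the stand-in retriever and LLM are placeholders so that the example runs without a vector store or a model.

```python
from typing import Callable, Sequence

RELEVANCE_THRESHOLD = 0.5  # hypothetical cut-off, not taken from the report

# A retriever returns (chunk, relevance score) pairs for a query.
Retriever = Callable[[str], Sequence[tuple[str, float]]]


def answer(query: str, retrieve: Retriever, llm: Callable[[str], str]) -> str:
    """Run the RAG check, inject context if any chunk is relevant, then call the LLM."""
    relevant = [chunk for chunk, score in retrieve(query) if score >= RELEVANCE_THRESHOLD]
    if relevant:
        context = "\n".join(relevant)
        prompt = f"Context:\n{context}\n\nQuestion: {query}"
    else:
        prompt = query  # no relevant data: forward the query unchanged
    return llm(prompt)


# Stand-ins so the sketch runs on its own.
def fake_retrieve(query: str) -> list[tuple[str, float]]:
    if "refund" in query.lower():
        return [("Refunds are accepted within 30 days.", 0.9)]
    return [("Unrelated chunk.", 0.1)]


def echo_llm(prompt: str) -> str:
    return prompt  # echoes the prompt so the routing is visible


print(answer("What is the refund policy?", fake_retrieve, echo_llm))
print(answer("Hello", fake_retrieve, echo_llm))
```

Passing the retriever and the LLM as callables keeps the routing logic independent of the actual file, vector store, and model backends.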

***

### **SLA & Scaling (Aligned with Performance Report)**

* **SLA target:** Average response ≤ **90s**.
* **20B model:** SLA-compliant up to **15 concurrent users**; can run on a **single GPU**.
* **120B model:** SLA-compliant up to **10 concurrent users**; requires **2× A6000 GPUs**.
* **25 users:** Both models breach SLA → requires GPU scaling, batching, or parallelization.

***

### **Model Feature Summary**

#### GPT-OSS-20B

* **Strengths:**
  * SLA-compliant up to 15 users.
  * **17–28% faster** than 120B on average response time in side-by-side tests.
  * Runs on **a single GPU (16 GB VRAM)** → cost-effective, edge-friendly.
* **Weaknesses:**
  * SLA breach at 25 users (\~102s avg, 18.8% error).
  * At 5 users, 11% error rate (requires inference tuning).

#### GPT-OSS-120B

* **Strengths:**
  * Higher **reasoning accuracy** due to larger parameter size.
  * SLA-compliant up to 10 users (avg 42s and 78s).
  * Low error rate (6.7%) under small concurrency.
* **Weaknesses:**
  * SLA breach at 15 users (\~97s avg, 14% error).
  * Severe degradation at 25 users (\~124s avg, 18% error).
  * Requires **dual A6000 GPUs (48 GB each)** → high infra cost.

***

### **Deployment Readiness – Quick View**

* **Core runtime & admin functions:** ✅ Ready
* **Observability (metrics, telemetry):** 🛠 In development
* **Recommended path:**
  * **Short term:** Production possible with current endpoints.
  * **Medium term:** Complete metrics/telemetry for monitoring.
  * **Long term:** Implement **health-gating & automated traffic routing** for scalability.
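One possible shape for the long-term health-gating step is sketched below. It combines the `GET /health` probe with the admin enable/disable endpoints listed earlier, but the gating logic itself is an assumption: the report does not describe how health-gating will be implemented, nor the response format of `/health`, so the probe and the HTTP call are injected as callables.

```python
from typing import Callable


def admin_path(assistant_id: str, healthy: bool) -> str:
    """Pick the admin traffic endpoint that matches the health state."""
    action = "enable" if healthy else "disable"
    return f"/admin/assistants/{assistant_id}/{action}"


def gate_traffic(
    assistant_id: str,
    check_health: Callable[[], bool],
    post: Callable[[str], None],
) -> str:
    """Run one gating step: probe health, then POST enable or disable."""
    path = admin_path(assistant_id, check_health())
    post(path)
    return path


# Dry run with stand-ins: an unhealthy probe and a recorder instead of HTTP.
sent: list[str] = []
print(gate_traffic("demo-assistant", lambda: False, sent.append))
print(gate_traffic("demo-assistant", lambda: True, sent.append))
```

In a real deployment, `check_health` would wrap a call to `GET /health` and `post` would issue the request to the Admin Orchestrator; running the step on a schedule would approximate automated traffic routing.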

