Benchmarking LLM context scaling on a local RTX 5090 setup

Why I ran this benchmark

Large context windows are one of the most advertised features of modern LLMs.

Most model cards focus on maximum supported context lengths, often quoting numbers like 128k or even 1M tokens.

What is rarely discussed is the real-world latency cost.

I wanted to measure how context scaling affects interactive usage on consumer hardware and where the practical limits start to appear for local coding assistants, RAG systems and agent workflows.

The benchmark was performed locally on a single RTX 5090 using Ollama.

Hardware setup

RTX 5090 with 32GB VRAM
AMD Ryzen 9 9950X
64GB DDR5 RAM
Samsung 9100 Pro Gen5 SSD
Ollama CUDA runtime
Ubuntu Linux
Remote execution over SSH from macOS

Models tested

qwen3.6:27b

Dense model with all parameters active for every token.

qwen3.6:35b-a3b

Mixture-of-Experts model where only a subset of experts are activated per token.

The benchmark compares how both architectures behave as context grows.

Benchmark methodology

Each benchmark was executed three times.

The same prompt was used for every run.

The prompt consisted of:

A synthetic TypeScript monorepo
Approximately 20,000 generated source file entries
A coding task requiring implementation of a typed EventEmitter

Example task:

Implement a typed EventEmitter with:
- on
- off
- once
- emit
Include explanation and edge cases.

Generation settings:

{
  "temperature": 0,
  "num_predict": 1024
}

Context sizes tested:

32k
64k
96k
128k

Metrics collected:

TTFT, time to first token
Total generation time
End-to-end throughput
Relative context scaling behavior

All reported values are averages across three runs.

One important note: token throughput is approximate. The script estimates output tokens using len(text) // 4, so the results should be read as practical benchmark numbers rather than exact tokenizer-level measurements.

Results

qwen3.6:27b

Context	TTFT	tok/s	Total Time
32k	14.9s	23.9	40.8s
64k	31.4s	15.2	63.7s
96k	52.2s	10.8	92.8s
128k	80.1s	6.8	147.6s

qwen3.6:35b-a3b

Context	TTFT	tok/s	Total Time
32k	8.1s	49.8	19.6s
64k	15.7s	32.5	30.9s
96k	24.6s	21.8	44.1s
128k	50.0s	10.7	93.5s

Throughput degradation

qwen3.6:27b

Context	Relative Speed
32k	100%
64k	64%
96k	45%
128k	28%

qwen3.6:35b-a3b

Context	Relative Speed
32k	100%
64k	65%
96k	44%
128k	22%

Interestingly, both models follow almost the same degradation curve.

That suggests the bottleneck is less about model architecture and more about the cost of maintaining and attending over a larger context window, especially KV-cache growth and memory bandwidth pressure.

Key findings

1. The MoE model is dramatically faster

At 32k context:

qwen3.6:35b-a3b reaches about 50 tok/s
qwen3.6:27b reaches about 24 tok/s

The MoE model delivers roughly double the throughput despite having a larger total parameter count.

This was the largest performance difference observed during the benchmark.

2. Context scaling is expensive

Increasing context size has a clear cost.

TTFT grows aggressively as context increases:

Context	27B TTFT	35B-A3B TTFT
32k	14.9s	8.1s
64k	31.4s	15.7s
96k	52.2s	24.6s
128k	80.1s	50.0s

Throughput also drops significantly, but TTFT is what affects the interactive feeling the most.

For local coding and agent workflows, the model can still be generating at an acceptable speed while the initial wait starts to feel expensive.

3. 64k is the sweet spot, but 96k is still usable

For this setup:

32k is very fast
64k feels like the best balance
96k is still usable and honestly impressive on consumer hardware
128k works, but the UX tradeoff becomes much more noticeable

This is the main practical takeaway.

The question is not whether a model can technically run a large context window. It can.

The better question is whether that context window still feels good to use interactively.

On this RTX 5090 setup, 64k feels like the practical sweet spot for responsiveness. 96k is still useful for larger analysis tasks. 128k is where the latency starts becoming harder to justify unless the workflow is more batch-oriented.

4. Bigger context does not make the model smarter

One common misconception around LLMs is that larger context windows improve reasoning.

They do not.

A larger context window gives the model more information to attend over, but the model itself does not become more capable.

The tradeoff is simple:

More context gives more workspace, but it also increases latency.

So the useful question is not:

“How much context does the model support?”

The more useful question is:

“How much context can I afford before the user experience becomes too slow?”

Conclusion

For local coding assistants running on RTX 5090-class hardware:

32k is ideal when speed matters most
64k is the best overall balance
96k is still very usable for larger codebase or analysis-heavy tasks
128k works, but is better suited for slower, batch-style workflows

The most interesting result was not simply that 128k became slower.

That was expected.

The more useful finding was where the practical boundary appears on consumer hardware.

A single RTX 5090 can handle surprisingly large context windows, especially with qwen3.6:35b-a3b. But the cost grows quickly, and responsiveness becomes the limiting factor before the model becomes unusable.

Bigger context is useful.

But only if the interaction still feels responsive.

Mohammed Hammoud

Why I ran this benchmark

Hardware setup

Models tested

qwen3.6:27b

qwen3.6:35b-a3b

Benchmark methodology

Results

qwen3.6:27b

qwen3.6:35b-a3b

Throughput degradation

qwen3.6:27b

qwen3.6:35b-a3b

Key findings

1. The MoE model is dramatically faster

2. Context scaling is expensive

3. 64k is the sweet spot, but 96k is still usable

4. Bigger context does not make the model smarter

Conclusion