Local LLM gets dumber over time on RTX 5090 due to context window and memory issues
At a glance:
- Running Qwen 3.6 27B at 256K context on an RTX 5090 causes performance degradation, not model deterioration
- The hybrid architecture reduces KV cache requirements, but the total footprint still exceeds the card's 32GB of VRAM once system overhead is included
- Memory spilling to system RAM over PCIe creates slowdowns that make the model appear less capable
The experiment that revealed the problem
While local large language models have reached a point where they're useful for most coding tasks, the author discovered an unexpected issue when running Qwen 3.6 27B at Q4_K_M quantization inside LM Studio on an Nvidia RTX 5090. Over extended conversations, the model appeared to become less coherent, with answers drifting, token generation slowing, and overall performance declining even during idle periods.
This wasn't an isolated incident tied to a specific model or server setup. Testing with smaller models and alternative setups like vLLM produced the same degradation patterns. Initial suspicion fell on context length calculations, but the root cause proved to be more nuanced than simple napkin math would suggest.
The KV cache miscalculation
The usual rule of thumb for VRAM usage doesn't apply to Qwen 3.6 27B, which uses a hybrid transformer architecture. A standard transformer would need approximately 64GB of KV cache for a 256K context window at fp16 precision, but Qwen 3.6 applies full attention in only 16 of its 64 layers. Because only those layers store keys and values for every token, the KV cache requirement drops to roughly 16GB, a quarter of the expected 64GB.
However, this doesn't solve the memory problem entirely. The model weights consume 16.8GB, so weights plus KV cache already come to about 32.8GB before anything else is counted. Windows 11 system overhead, browser processes, the vision encoder, and CUDA buffers all compete for the same VRAM. The total footprint exceeds the RTX 5090's 32GB capacity, forcing the GPU driver to silently offload data to system RAM over the PCIe bus.
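The arithmetic behind those figures can be sketched as follows. The 16-of-64 layer split and the 256K context come from the text above; the 8 KV heads and head dimension of 128 are assumed values, chosen only because they reproduce the quoted 64GB baseline, and may not match the model's real configuration.

```python
GIB = 2**30

def kv_cache_bytes(tokens, attn_layers, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `tokens` positions.

    kv_heads and head_dim are assumed values (not from the article);
    bytes_per_elem=2 corresponds to fp16.
    """
    # Factor of 2: one key vector and one value vector per head, per layer, per token.
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_elem * tokens

context = 256 * 1024  # 262,144 tokens

full = kv_cache_bytes(context, attn_layers=64)    # every layer caches K/V
hybrid = kv_cache_bytes(context, attn_layers=16)  # only full-attention layers cache K/V

print(f"all 64 layers: {full / GIB:.0f} GiB")
print(f"16 of 64 layers: {hybrid / GIB:.0f} GiB")
```

The sketch counts only the per-token cache of the full-attention layers; any state kept by the remaining 48 layers is ignored here.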
| Context length (tokens) | KV cache (fp16) | KV cache + 16.8 GB weights | Fits in 32 GB? |
|---|---|---|---|
| 262K (the 256K setting) | ~16 GB | ~33 GB + overhead | No, just over; spills to system RAM |
| 128K | ~8 GB | ~25 GB | Yes |
| 64K | ~4 GB | ~21 GB | Yes, easily |
| 32K | ~2 GB | ~19 GB | Yes, lots of room |
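The rows above follow from scaling the KV cache linearly with context length and adding the weights. A minimal sketch of that budget check, using the article's figures; the `overhead_gb` parameter is a hypothetical placeholder for OS, browser, and CUDA usage, which the article does not quantify:

```python
WEIGHTS_GB = 16.8       # Q4_K_M weights, from the text
VRAM_GB = 32.0          # RTX 5090 capacity
KV_GB_AT_FULL = 16.0    # approximate fp16 KV cache at the full context
FULL_CONTEXT = 262_144  # tokens in the 256K setting

def footprint_gb(tokens, overhead_gb=0.0):
    """Weights plus linearly scaled KV cache plus optional overhead."""
    kv = KV_GB_AT_FULL * tokens / FULL_CONTEXT
    return WEIGHTS_GB + kv + overhead_gb

def fits(tokens, overhead_gb=0.0):
    """True when the estimated footprint stays within dedicated VRAM."""
    return footprint_gb(tokens, overhead_gb) <= VRAM_GB

for tokens in (262_144, 131_072, 65_536, 32_768):
    verdict = "fits" if fits(tokens) else "spills"
    print(f"{tokens:>7} tokens: {footprint_gb(tokens):.1f} GB -> {verdict}")
```

Passing a non-zero `overhead_gb` shows how much headroom each context length really leaves once other processes claim part of the card.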
Why the model isn't actually getting worse
The fundamental misunderstanding here is attributing performance degradation to the model itself. The model weights remain frozen during inference—there's no continual learning or bad habit acquisition happening. What's deteriorating is the computational environment surrounding the model.
Context window management becomes critical as conversations extend. Each turn feeds the full history back into the model, so the token count grows and the transformer has to process ever longer sequences. This runs into a well-documented weakness of transformer models: they're measurably worse at recalling information from the middle of a long context window, and tend to favor the beginning and end of the conversation.
Qwen 3.6 compounds this issue as a reasoning model with hidden think traces that consume context rapidly. The 256K context window, which initially seems advantageous, becomes a liability as it fills with conversation history that the model struggles to effectively utilize.
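A rough sketch of how much faster the window fills when reasoning traces are part of the history. All per-turn token counts are hypothetical, and the sketch assumes every turn's prompt, reasoning, and answer stay in context, which depends on the frontend:

```python
def turns_until_full(context_limit, prompt_tokens, answer_tokens, think_tokens):
    """Count how many whole turns fit before history exceeds the limit.

    Per-turn sizes are illustrative assumptions, not measurements.
    """
    per_turn = prompt_tokens + answer_tokens + think_tokens
    used = 0
    turns = 0
    while used + per_turn <= context_limit:
        used += per_turn
        turns += 1
    return turns

limit = 262_144  # the 256K setting

without_thinking = turns_until_full(limit, prompt_tokens=500, answer_tokens=800, think_tokens=0)
with_thinking = turns_until_full(limit, prompt_tokens=500, answer_tokens=800, think_tokens=3000)

print(f"turns without reasoning traces: {without_thinking}")
print(f"turns with reasoning traces:    {with_thinking}")
```

With these made-up sizes the reasoning traces cut the number of turns that fit by more than two thirds; real workloads with pasted code or long files will fill the window sooner still.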
The silent performance killer: Windows GPU drivers
Using an Nvidia GPU on Windows introduces a specific problem: the driver doesn't fail when VRAM becomes full. Instead of throwing an error, it silently offloads the excess memory to system RAM via PCIe. This creates a bottleneck that slows the entire system, making even an RTX 5090 perform like an underpowered machine.
This differs from what developers might expect from server-grade GPU setups, where memory pressure typically produces an explicit out-of-memory error rather than silent degradation.
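Because the driver gives no error, one way to notice the problem early is to watch how close dedicated VRAM is to full. A minimal sketch using the `nvidia-smi` query interface; the 95% threshold and the function names are assumptions for illustration, and a high reading signals risk of spilling rather than proving it:

```python
import subprocess

def parse_vram(csv_line):
    """Parse one 'used, total' line (values in MiB) from nvidia-smi CSV output."""
    used, total = (int(part.strip()) for part in csv_line.split(","))
    return used, total

def read_vram(gpu_index=0):
    """Query dedicated VRAM usage via nvidia-smi; requires an Nvidia driver."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_vram(out.strip().splitlines()[0])

def near_full(used_mib, total_mib, threshold=0.95):
    """True when dedicated VRAM use is at or above the chosen threshold."""
    return used_mib / total_mib >= threshold
```

On a machine with the driver installed, `near_full(*read_vram())` returning `True` while generation slows down is a hint that the context setting is too large for the card. This query reports dedicated memory only; the shared GPU memory figure in Windows Task Manager is where spilled data shows up.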
The simple fix that works
The solution to this apparent intelligence degradation is surprisingly straightforward: restart the conversation or reload the model entirely. Opening a fresh chat, restarting LM Studio, or manually reloading the model clears the KV cache and restores performance.
This reveals that the issue isn't cumulative model degradation but rather cache saturation. As VRAM headroom disappears over time or across multiple sessions, the system increasingly relies on slower system memory, accelerating the performance decline.
While rebooting might seem like a primitive solution for AI systems, treating local LLMs as computational tools rather than magical intelligences leads to more practical usage patterns. Regular cache management becomes as important as any prompt engineering technique.
What this means for local LLM users
The degradation in local LLM performance over time stems from a combination of context window management, memory pressure, and system-level bottlenecks rather than any change in the model itself. Users should expect that longer conversations will eventually hit performance walls, and that restarting sessions can restore optimal performance.
For those with unified memory architectures like Apple's M-series chips, these issues may be less pronounced since the unified memory design handles memory pressure more gracefully than discrete GPU setups with PCIe offloading.
The key takeaway is that local LLMs are tools with physical constraints, not infinitely scalable intelligences. Understanding these limitations—particularly around memory management and context handling—enables more effective usage patterns. Just as traditional software benefits from periodic restarts, local LLMs perform better when treated as finite computational resources rather than magical black boxes.
Prepared by the editorial stack from public data and external sources.