Full transparency
No marketing copy. No jargon. Every single optimization — what it is, why it matters, and exactly how it works. Ordered by how much it actually helps.
All 14 optimizations
Click any item below to jump to the full explanation.
Right-sizes the KV cache to each request — frees 300+ MB per call
Adjusts context + KV precision in real time as RAM changes
Pins system prompt in KV — never re-evaluated after turn 1
Model stays in RAM — eliminates 1–4s cold-reload between sessions
Snaps context to stable sizes so Ollama reuses Metal buffers — no thrashing
F16 → Q8 under pressure — halves KV footprint with negligible quality impact
Reduces peak activation memory during prefill — zero quality impact
Pre-flight RAM check; graduates context down before any swap risk
num_batch=1024 → fewer GPU passes for long prompts
Checks model fits before loading; suggests lighter quants if not
QOS class, process priority, GC disable, CPU governor around inference
Watches live metrics; takes graduated action before performance degrades
Compresses old messages in tiers when approaching context ceiling
Saves sessions locally; injects semantically relevant past context
Foundation
Almost every optimization in autotune touches the KV cache in some way. Understanding it takes two minutes and makes everything else make sense.
When a language model generates text, it doesn't work one word at a time in isolation. Every new word it produces requires it to “attend to” — look back at — every previous word in the conversation. That backward look is the attention mechanism, and it's what makes LLMs coherent and context-aware.
The problem: that backward look is expensive. Without caching, producing word #500 would mean re-processing words #1 through #499 from scratch — hundreds of matrix multiplications repeated for every single token. A 1,000-word reply would rack up on the order of a billion redundant calculations.
The solution is caching. When the model processes token #1, it computes two tables of numbers that represent “what this token contributes to future attention.” These tables are called K (keys) and V (values). By storing — caching — them in RAM, the model can skip recomputing them for every future token. Token #500 just reads from the cache instead of redoing all that work.
That cache is the KV cache. It lives in your computer's RAM (or GPU VRAM — on Apple Silicon, they're the same pool). Its size is mathematically predictable:
KV cache size formula
KV cache bytes = num_layers × 2 (K and V) × num_kv_heads × head_dim × num_ctx × bytes_per_element
Every term is fixed by the model's architecture except num_ctx. If you allocate a 4,096-token context when your prompt is only 200 tokens, you're wasting roughly 95% of that memory. Ollama does this by default. autotune fixes it.
Real example: qwen3:8b (48 layers, 8 KV heads, 128 head_dim)
Every freed megabyte goes back to your system's available pool — your browser, other apps, and macOS all benefit. The model weights themselves don't change in size.
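As a quick illustration of that formula in code — the architecture values are the qwen3:8b numbers quoted above; treat the output as a ballpark, since autotune's reported savings also depend on the bucket and KV precision it actually picks:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_ctx: int, bytes_per_element: int = 2) -> int:
    """One K and one V entry per layer, per KV head, per token in the context."""
    return num_layers * 2 * num_kv_heads * head_dim * num_ctx * bytes_per_element

# qwen3:8b as quoted above: 48 layers, 8 KV heads, head_dim 128, F16 KV (2 bytes)
default_cache = kv_cache_bytes(48, 8, 128, num_ctx=4096)   # Ollama's default
right_sized   = kv_cache_bytes(48, 8, 128, num_ctx=1536)   # a typical autotune bucket
print(f"freed ≈ {(default_cache - right_sized) / 2**20:.0f} MiB")
```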
Memory — highest impact
RAM is the single most important resource for local LLM inference. Running out means your OS starts writing to your SSD (swap), which drops generation speed from 30+ tok/s to under 5 tok/s and makes your whole computer sluggish. autotune has four independent systems for keeping memory under control — ordered here from most to least impactful.
Ollama allocates the full KV cache before generating the first token. With the default num_ctx=4096, it zeros and initializes a 4,096-token buffer even if your prompt is 50 words. That initialization is part of what you wait for before the first token appears.
autotune computes the minimum context that actually fits this specific request:
For a typical balanced-profile chat message: ~22-token prompt + 1024 max reply + 256 buffer = 1,302 tokens. That maps to the 1,536 bucket (see optimization #05 below). On qwen3:8b, this frees 381 MB before a single token is generated — on every single request.
RAM freed per request — qwen3:8b
As conversations grow longer, the needed context grows too — the math always reflects the actual history. No tokens are ever dropped. The full context window expands organically as you chat.
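A minimal sketch of that sizing step — the helper below is illustrative, and real prompt-token counts come from the model's own tokenizer:

```python
def required_context(prompt_tokens: int, max_reply_tokens: int = 1024,
                     safety_buffer: int = 256) -> int:
    """Smallest context this request actually needs: prompt + reply room + buffer."""
    return prompt_tokens + max_reply_tokens + safety_buffer

required_context(22)   # 22 + 1024 + 256 = 1302 -> snapped up to the 1536 bucket (see #05)
```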
Right-sizing the KV cache at request time is the foundation. But RAM usage on your machine is dynamic: a browser tab loads, Xcode compiles in the background, a background process wakes up. autotune reads the system's actual RAM usage before every single request and applies two independent levers — context window size and KV precision — automatically, without any user action.
Automatic adjustments by RAM tier — checked live, every request
KV precision switching (F16 → Q8) cuts the KV cache's RAM footprint in half at the cost of negligible quality degradation — each attention value goes from 2 bytes to 1 byte. The difference in model output is undetectable in practice. You get a notice in the chat interface when an adjustment fires: “RAM 88% — context 8,192→6,144 tokens, KV F16→Q8”.
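To get a feel for the two levers, here's an illustrative tier check — the thresholds and step sizes below are assumptions rather than autotune's exact values, and psutil stands in for however autotune reads live RAM:

```python
import psutil  # assumed here just to read live memory usage

def adjust_for_pressure(num_ctx: int, kv_precision: str = "f16") -> tuple[int, str]:
    """Illustrative tiers: shrink the context and/or quantize the KV cache."""
    used_pct = psutil.virtual_memory().percent
    if used_pct >= 92:                               # critical: pull both levers hard
        return max(2048, num_ctx // 2), "q8_0"
    if used_pct >= 85:                               # elevated: trim context, halve KV bytes
        return max(2048, num_ctx * 3 // 4), "q8_0"   # e.g. 8192 -> 6144, as in the notice
    return num_ctx, kv_precision                     # comfortable: leave the plan alone
```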
Beyond the live pressure response above, each profile has a deliberate default KV precision setting. F16 (16-bit float) uses 2 bytes per KV element. Q8 (8-bit quantized) uses 1 byte — half the KV memory at the same context size, with negligible quality impact.
This is separate from model quantization (Q4_K_M, Q5_K_M, etc.), which applies to the model's weights. KV precision only affects the temporary computation cache, not the model itself.
Before sending any request to Ollama, autotune runs a pre-flight check: will this KV allocation fit in available RAM without causing swap?
On Apple Silicon, when RAM fills up macOS starts compressing memory pages, then pages them to your NVMe drive. Either path is catastrophic for inference — generation speed drops from 30+ tok/s to under 5 tok/s, and the whole machine becomes sluggish. Ollama doesn't prevent this — it allocates what it's told and lets the OS handle the consequences. autotune runs in front and checks first.
Reduction levels — applied in order until it fits
autotune keeps a 1.5 GB safety margin — macOS starts compressing memory at around 85% utilization, so staying below that threshold prevents any degradation at all. The model's architecture (layers, KV heads, head dimension) is queried once from Ollama and cached, so every calculation is exact, not estimated.
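A sketch of that pre-flight walk-down, assuming an illustrative bucket ladder and the 1.5 GB margin from the text:

```python
SAFETY_MARGIN = int(1.5 * 2**30)   # ~1.5 GB headroom, per the text

def preflight_context(needed_ctx: int, available_bytes: int, arch: dict) -> int:
    """Step the context down, bucket by bucket, until the KV cache fits in RAM."""
    buckets = [8192, 6144, 4096, 3072, 2048, 1536]        # illustrative ladder
    candidates = [b for b in buckets if b <= needed_ctx] or [buckets[-1]]
    for ctx in candidates:                                 # largest first
        kv = arch["layers"] * 2 * arch["kv_heads"] * arch["head_dim"] * ctx * 2
        if kv + SAFETY_MARGIN <= available_bytes:
            return ctx
    return buckets[-1]   # still tight: smallest bucket; KV precision drops next
```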
Before a model is loaded into memory, autotune runs a complete RAM analysis: will this model fit without causing swap?
The analysis estimates the total memory requirement — model weights, plus the KV cache at the requested context, plus runtime overhead — and compares it with the RAM actually available. It classifies the result as one of four states: SAFE (under 85% RAM), MARGINAL (85–92%), SWAP RISK (92–100%), or OOM (over 100%).
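The classification itself is simple; a sketch using the thresholds quoted above (the exact memory terms autotune adds up are an assumption):

```python
def classify_fit(weights_bytes: int, kv_bytes: int, overhead_bytes: int,
                 total_ram_bytes: int) -> str:
    """SAFE / MARGINAL / SWAP RISK / OOM, using the thresholds from the text."""
    projected_pct = (weights_bytes + kv_bytes + overhead_bytes) / total_ram_bytes * 100
    if projected_pct < 85:
        return "SAFE"
    if projected_pct < 92:
        return "MARGINAL"
    if projected_pct <= 100:
        return "SWAP RISK"
    return "OOM"
```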
Speed — high impact
TTFT — time to first token — is what you feel as the “thinking pause” before the model starts responding. autotune reduces it through five distinct techniques, all working simultaneously, ordered here from most to least impactful.
In any multi-turn conversation, the system prompt — “You are a helpful assistant. You prefer concise answers” — is identical on every single turn. By default, Ollama re-processes (re-evaluates through every layer of the model) this entire system prompt from scratch on every message.
autotune counts the system prompt's tokens and tells Ollama: keep these first N tokens in the KV cache permanently. The Ollama parameter for this is num_keep. Once set, those tokens are evaluated exactly once — at the start of the conversation — and never again.
In agentic workloads where a session has 10+ turns, this compounding effect means TTFT actually decreases as the session grows — the opposite of what raw Ollama shows.
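In API terms this is one option on each request. A sketch against Ollama's local endpoint — the token count for the system prompt (12 here) is a made-up example; autotune counts it properly:

```python
import requests

response = requests.post("http://localhost:11434/api/chat", json={
    "model": "qwen3:8b",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant. You prefer concise answers."},
        {"role": "user", "content": "Summarize this repo."},
    ],
    "options": {
        "num_ctx": 1536,
        "num_keep": 12,   # pin the system prompt's tokens in the KV cache permanently
    },
    "stream": False,
})
```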
By default, Ollama unloads a model from RAM after 5 minutes of idle. The next time you send a message — even seconds later — it reads the entire model file from disk, loads it into GPU/Metal memory, and warms up the runtime. On a 5 GB model, this costs 1–4 seconds before your first token appears.
autotune sets keep_alive="-1" (keep forever) on every request. The model stays in RAM between conversations.
To opt out, run autotune config set keep_alive_enabled false.

After computing the minimum context size needed, autotune rounds it up to the nearest “bucket” from a fixed list of stable context sizes.
Here's why this matters enormously: Ollama caches the KV buffer for the most recently used context length. If num_ctx changes between requests — say 1,286 then 1,157 then 1,308 — Ollama must reallocate the Metal buffer on every single call, even if the model is already loaded. This “KV thrashing” adds 100–300 ms of overhead per request and completely negates the benefit of smaller context windows.
By snapping to buckets, prompts of 50–200 tokens all map to bucket 1,536. Ollama allocates it once and reuses the buffer on every subsequent request — zero reallocation cost. All bucket sizes are multiples of 256, which aligns with Metal's memory alignment boundaries for F16 tensors.
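A sketch of the snap — the bucket list here is illustrative (the text only guarantees that buckets are multiples of 256 and that 1,536 is among them):

```python
BUCKETS = [1536, 2048, 3072, 4096, 6144, 8192, 12288, 16384]  # illustrative list

def snap_to_bucket(needed_ctx: int) -> int:
    """Round the computed context up to a stable size Ollama can keep reusing."""
    for bucket in BUCKETS:
        if bucket >= needed_ctx:
            return bucket
    return BUCKETS[-1]    # cap at the largest bucket

snap_to_bucket(1302)      # -> 1536, the example from optimization #01
```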
Standard attention computes the full attention matrix in memory. For a context window of N tokens, this requires O(N²) memory — it grows fast and causes large memory spikes during the initial prompt processing phase.
Flash attention is a mathematically identical algorithm that computes attention in tiles (blocks) rather than materializing the full matrix at once. It needs only O(N) memory for the same computation — the peak activation memory spike during prefill (the initial prompt processing) is dramatically smaller.
autotune passes flash_attn: true on every request. Models and Ollama builds that support it use it; those that don't silently ignore the flag. Zero quality impact — it's purely an implementation optimization, not an approximation.
During “prefill” — when the model processes your entire prompt before generating anything — tokens are fed through the model in chunks called batches. Ollama's default is 512 tokens per chunk.
autotune sets num_batch=1024. For a 700-token prompt: the default takes 2 GPU passes (0→512, 512→700). With 1024, it takes 1 pass. Fewer passes means fewer Metal kernel dispatches, which directly cuts prefill time for any prompt longer than 512 tokens.
For short prompts (under 512 tokens), llama.cpp automatically caps the actual batch at the prompt length — so there's no extra memory allocation for short messages. At critical RAM pressure, autotune drops this back to 256 to reduce the peak activation tensor footprint.
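Taken together, the speed-side settings ride along on every request as plain Ollama options. A representative payload — flash_attn as a per-request option is as described in the text, and the numbers are example values:

```python
def build_payload(messages: list[dict]) -> dict:
    """Options autotune attaches to a typical chat request (example values)."""
    return {
        "model": "qwen3:8b",
        "messages": messages,
        "keep_alive": -1,          # never unload the model between sessions
        "options": {
            "num_ctx": 1536,       # right-sized, bucket-snapped context
            "num_keep": 12,        # pinned system-prompt tokens (example value)
            "num_batch": 1024,     # bigger prefill chunks -> fewer GPU passes
            "flash_attn": True,    # ignored where unsupported, per the text
        },
    }
```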
Adaptive intelligence
Static settings only get you so far. These two systems watch what's actually happening on your machine and respond in real time.
Before each inference call, autotune makes real changes to how your operating system schedules the inference process. After the call completes, everything is restored to normal.
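The individual changes are listed next; as a rough, assumed sketch of the wrap pattern itself (the macOS QoS class and CPU governor need platform-specific calls and are only noted in comments):

```python
import gc
import os

def run_with_priority(generate):
    """Boost scheduling priority and pause Python's GC around one inference call."""
    previous = os.getpriority(os.PRIO_PROCESS, 0)
    try:
        os.setpriority(os.PRIO_PROCESS, 0, -10)   # raising priority may need root
        gc.disable()                              # no collector pauses mid-generation
        # The QoS class (USER_INTERACTIVE) and the performance CPU governor are
        # set through platform-specific APIs; both are skipped in this sketch.
        return generate()
    finally:
        gc.enable()
        os.setpriority(os.PRIO_PROCESS, 0, previous)   # restore everything afterwards
```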
QoS class: USER_INTERACTIVE — the highest scheduling priority macOS offers (the same class used for scrolling animations and direct UI responses). The inference process literally gets more CPU time than background tasks during generation.
CPU governor: performance mode, disabling frequency scaling so the CPU runs at full clock speed during inference (requires root; silently skipped otherwise).

During a session, the adaptive advisor continuously watches RAM usage, swap activity, tokens per second, and time to first token. It compares live metrics to a baseline it builds from your first few requests, and acts if things degrade.
Health score (0–100) — updated every 30 seconds
When the score drops below a threshold, the advisor takes graduated actions, ordered from least to most disruptive.
There's a 20-second cooldown between actions to avoid thrashing, and the advisor waits 90 seconds of sustained stability before considering a scale-up. It also attributes performance changes — it knows whether a RAM spike was caused by loading a new model, KV growth, or a background application.
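An illustrative control loop using the timings from the text (20-second cooldown, 90-second stability window, 30-second score updates); the threshold and the actions themselves are stand-ins:

```python
import time

COOLDOWN_S, STABILITY_S, THRESHOLD = 20, 90, 70   # threshold is an assumed value

def advisor_loop(read_health_score, apply_next_action, scale_back_up):
    """Graduated response: act under pressure, relax only after sustained calm."""
    last_action = 0.0
    stable_since = time.monotonic()
    while True:
        score = read_health_score()            # recomputed every 30 s
        now = time.monotonic()
        if score < THRESHOLD:
            stable_since = now                 # reset the stability clock
            if now - last_action >= COOLDOWN_S:
                apply_next_action()            # least disruptive option first
                last_action = now
        elif now - stable_since >= STABILITY_S:
            scale_back_up()                    # cautiously restore settings
            stable_since = now
        time.sleep(30)
```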
Context & conversation
Every conversation turn adds tokens to the history. Without management, long sessions hit the context ceiling and either lose old messages or require a larger (more expensive) context window. autotune handles both automatically.
As conversation history grows and approaches the context limit, autotune selectively compresses older messages to make room — without deleting them entirely or losing their meaning.
Context budget tiers
Compression is applied in order, from the lightest tier to the most aggressive.
Code blocks are always preserved first — they carry the most information per token and losing them would make the context misleading. All truncation happens at sentence or paragraph boundaries, never mid-sentence.
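A rough sketch of the tiering idea — the count_tokens helper and the "keep the first two sentences" rule are illustrative, not autotune's exact tiers:

```python
import re

def compress_history(messages: list[str], token_budget: int, count_tokens) -> list[str]:
    """Shrink the oldest messages first, keep code blocks verbatim,
    and cut only at sentence boundaries."""
    out = list(messages)
    for i, text in enumerate(out):                      # oldest message first
        if sum(count_tokens(m) for m in out) <= token_budget:
            break                                       # back under the ceiling
        code_blocks = re.findall(r"```.*?```", text, flags=re.DOTALL)
        prose = re.sub(r"```.*?```", "", text, flags=re.DOTALL).strip()
        sentences = re.split(r"(?<=[.!?])\s+", prose)
        summary = " ".join(sentences[:2])               # lightest tier shown here
        out[i] = "\n\n".join([summary] + code_blocks)
    return out
```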
Every conversation you have is automatically saved to a local SQLite database on your machine — not sent anywhere. At the start of each new conversation, autotune searches your history for context that's relevant to what you're asking about now, and quietly injects it as a note in the system prompt.
If you asked about FastAPI authentication three sessions ago, and now you're asking a related question, the model will have that prior context available without you having to re-explain it.
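As a sketch of how recall could work against a local store — the table layout, embedding model name, and ranking here are assumptions; the Ollama embeddings endpoint is real:

```python
import json
import os
import sqlite3

import requests

def embed(text: str) -> list[float]:
    r = requests.post("http://localhost:11434/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return r.json()["embedding"]

def recall(query: str, top_k: int = 3, db_path: str = "~/.autotune/recall.db"):
    """Return the top_k stored snippets most similar to the new question."""
    q = embed(query)
    conn = sqlite3.connect(os.path.expanduser(db_path))
    rows = conn.execute("SELECT content, embedding FROM messages").fetchall()

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return dot / norm if norm else 0.0

    scored = sorted(((cosine(q, json.loads(e)), c) for c, e in rows), reverse=True)
    return [content for _, content in scored[:top_k]]
```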
The database lives at ~/.autotune/recall.db on your machine. Nothing is sent to any server. The embedding model runs entirely in Ollama on your hardware.

Honesty
We'd rather be transparent about limitations than have you discover them yourself.
Every number in our benchmarks is reproducible. One command runs a 30-second head-to-head using Ollama's own internal timers on your hardware.