Full transparency

Everything autotune does,
explained plainly.

No marketing copy. No jargon. Every single optimization — what it is, why it matters, and exactly how it works. Ordered by how much it actually helps.

Foundation

What is the KV cache?

Almost every optimization in autotune touches the KV cache in some way. Understanding it takes two minutes and makes everything else make sense.

When a language model generates text, it doesn't work one word at a time in isolation. Every new word it produces requires it to “attend to” — look back at — every previous word in the conversation. That backward look is the attention mechanism, and it's what makes LLMs coherent and context-aware.

The problem: that backward look is expensive. Producing word #500 would require re-processing words #1 through #499 from scratch — hundreds of matrix multiplications, repeated, for every single token. A 1,000-word reply would require a billion redundant calculations.

The solution is caching. When the model processes token #1, it computes two tables of numbers that represent “what this token contributes to future attention.” These tables are called K (keys) and V (values). By storing — caching — them in RAM, the model can skip recomputing them for every future token. Token #500 just reads from the cache instead of redoing all that work.

That cache is the KV cache. It lives in your computer's RAM (or GPU VRAM — on Apple Silicon, they're the same pool). Its size is mathematically predictable:

KV cache size formula

2 × n_layers × kv_heads × head_dim × num_ctx × bytes
n_layers: How many transformer layers the model has (e.g. 32 for a 7B model)
kv_heads: Number of KV attention heads (often fewer than total heads, via GQA)
head_dim: Dimension of each attention head (embedding size ÷ total heads)
num_ctx: How many tokens the context window is sized for — this is the big lever
bytes: F16 = 2 bytes per element, Q8 = 1 byte — halving this halves the cache
The key insight: The KV cache scales linearly with num_ctx. If you allocate a 4,096-token context when your prompt is only 200 tokens, you're wasting 95% of that memory. Ollama does this by default. autotune fixes it.

Real example: qwen3:8b (36 layers, 8 KV heads, 128 head_dim)

4,096 ctx (Ollama default): 576 MB KV — 576 MB allocated
2,048 ctx (autotune, typical): 288 MB KV — 288 MB freed
1,536 ctx (short message): 216 MB KV — 360 MB freed

Every freed megabyte goes back to your system's available pool — your browser, other apps, and macOS all benefit. The model weights themselves don't change in size.
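
The arithmetic is easy to verify yourself. Here is a minimal Python sketch of the formula above (the helper name is illustrative, not part of autotune):

def kv_cache_bytes(n_layers, kv_heads, head_dim, num_ctx, bytes_per_elem=2):
    # the factor of 2 covers the two tables (K and V); bytes_per_elem is 2 for F16, 1 for Q8
    return 2 * n_layers * kv_heads * head_dim * num_ctx * bytes_per_elem

# qwen3:8b: 36 layers, 8 KV heads, head_dim 128, F16 KV
for num_ctx in (4096, 2048, 1536):
    print(num_ctx, kv_cache_bytes(36, 8, 128, num_ctx) / 2**20, "MiB")  # 576, 288, 216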

Memory — highest impact

How autotune manages RAM.

RAM is the single most important resource for local LLM inference. Running out means your OS starts writing to your SSD (swap), which drops generation speed from 30+ tok/s to under 5 tok/s and makes your whole computer sluggish. autotune has five independent systems for keeping memory under control — ordered here from most to least impactful.

📐
#01 · Highest impact · Every request

Precise KV allocation — dynamic context sizing

Ollama allocates the full KV cache before generating the first token. With the default num_ctx=4096, it zeros and initializes a 4,096-token buffer even if your prompt is 50 words. That initialization is part of what you wait for before the first token appears.

autotune computes the minimum context that actually fits this specific request:

num_ctx = clamp(input_tokens + max_new_tokens + 256, 512, profile_max)

For a typical balanced-profile chat message: ~22-token prompt + 1024 max reply + 256 buffer = 1,302 tokens. That maps to the 1,536 bucket (see optimization #05 below). On qwen3:8b, this frees 381 MB before a single token is generated — on every single request.
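
In code, the sizing rule is roughly the following sketch (parameter names are illustrative, not autotune's internals):

def request_num_ctx(input_tokens, max_new_tokens, profile_max, buffer=256, floor=512):
    # smallest context that fits this request, clamped to the profile's ceiling
    return max(floor, min(input_tokens + max_new_tokens + buffer, profile_max))

request_num_ctx(22, 1024, profile_max=8192)  # -> 1302, later snapped to the 1536 bucket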

RAM freed per request — qwen3:8b

Short message: 576 MB (4k default) → 197 MB — 379 MB freed
Medium message: 576 MB (4k default) → 288 MB — 288 MB freed
Long document: 576 MB (4k default) → 576 MB — 0 MB freed

As conversations grow longer, the needed context grows too — the math always reflects the actual history. No tokens are ever dropped. The full context window expands organically as you chat.

📊
#02 · Highest impact · Every request

Live memory pressure response — real-time adaptation

Right-sizing the KV cache at request time is the foundation. But RAM usage on your machine is dynamic: a browser tab loads, Xcode compiles in the background, a background process wakes up. autotune reads the system's actual RAM usage before every single request and applies two independent levers — context window size and KV precision — automatically, without any user action.

Automatic adjustments by RAM tier — checked live, every request

under 80% RAM: full-size context window · profile-default KV precision
80–88%: context window −10% · KV precision unchanged
88–93%: context window −25% · KV F16 → Q8 (−50% KV RAM)
over 93%: context window halved · forced Q8

KV precision switching (F16 → Q8) cuts the KV cache's RAM footprint in half at the cost of negligible quality degradation — each attention value goes from 2 bytes to 1 byte. The difference in model output is undetectable in practice. You get a notice in the chat interface when an adjustment fires: “RAM 88% — context 8,192→6,144 tokens, KV F16→Q8”.
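
A rough sketch of how that tier table can be applied per request, reading live RAM usage via psutil (the function and the "q8_0" label are illustrative, not autotune's internals):

import psutil

def pressure_adjust(num_ctx, kv_precision):
    used = psutil.virtual_memory().percent  # live RAM usage, checked before every request
    if used < 80:
        return num_ctx, kv_precision              # full size, profile default
    if used < 88:
        return int(num_ctx * 0.90), kv_precision  # context -10%, precision unchanged
    if used < 93:
        return int(num_ctx * 0.75), "q8_0"        # context -25%, KV F16 -> Q8
    return num_ctx // 2, "q8_0"                   # context halved, forced Q8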

🎚️
#06 · High impact · Profile-aware

KV cache precision control

Beyond the live pressure response above, each profile has a deliberate default KV precision setting. F16 (16-bit float) uses 2 bytes per KV element. Q8 (8-bit quantized) uses 1 byte — half the KV memory at the same context size, with negligible quality impact.

This is separate from model quantization (Q4_K_M, Q5_K_M, etc.), which applies to the model's weights. KV precision only affects the temporary computation cache, not the model itself.

fast profile: Always Q8 · Priority: lowest latency
balanced profile: F16 → Q8 under pressure · Priority: quality + stability
quality profile: F16 → Q8 under pressure · Priority: best output
🛡️
#08 · Safety · Every request

NoSwapGuard — pre-flight RAM check

Before sending any request to Ollama, autotune runs a pre-flight check: will this KV allocation fit in available RAM without causing swap?

On Apple Silicon, when RAM fills up, macOS starts compressing memory pages, then pages them to your NVMe drive. Either path is catastrophic for inference — generation speed drops from 30+ tok/s to under 5 tok/s, and the whole machine becomes sluggish. Ollama doesn't prevent this — it allocates what it's told and lets the OS handle the consequences. autotune runs in front and checks first.

Reduction levels — applied in order until it fits

Level 0: Fits comfortably — no change
Level 1: Trim context 25%
Level 2: Halve context
Level 3: Halve context + switch to Q8 KV (saves ~50% more)
Level 4: Quarter context + Q8
Level 5: Minimum (512 tokens) + Q8 — emergency floor

autotune keeps a 1.5 GB safety margin — macOS starts compressing memory at around 85% utilization, so staying below that threshold prevents any degradation at all. The model's architecture (layers, KV heads, head dimension) is queried once from Ollama and cached, so every calculation is exact, not estimated.
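
A simplified sketch of that ladder, reusing the KV-size formula from the foundation section (the function shape is illustrative; the 1.5 GB margin, level scaling, and 512-token floor are the ones described above):

def noswap_guard(avail_bytes, num_ctx, n_layers, kv_heads, head_dim, margin=1.5 * 2**30):
    # levels 0-4 as (context scale, bytes per KV element): F16 first, then Q8
    for scale, elem_bytes in [(1.00, 2), (0.75, 2), (0.50, 2), (0.50, 1), (0.25, 1)]:
        ctx = max(512, int(num_ctx * scale))
        kv = 2 * n_layers * kv_heads * head_dim * ctx * elem_bytes
        if kv + margin <= avail_bytes:
            return ctx, "f16" if elem_bytes == 2 else "q8_0"
    return 512, "q8_0"  # level 5: emergency floor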

🔍
#10 · Safety · Before loading

Pre-flight model fit analysis

Before a model is loaded into memory, autotune runs a complete RAM analysis: will this model fit without causing swap?

The analysis calculates the total memory requirement:

total = model_weights + kv_cache(context, precision) + runtime_overhead (400 MB)

It classifies the result as one of four states: SAFE (under 85% RAM), MARGINAL (85–92%), SWAP RISK (92–100%), or OOM (over 100%).

If the model is tight but workable, autotune automatically caps the context window to a safe maximum and recommends Q8 KV precision. You get the best performance the hardware can deliver without ever touching swap.
If the model is too heavy, autotune suggests a lighter quantization: “Model requires ~14 GB but only 11 GB available. Pull Q4_K_M instead (~9 GB).” No guessing — the recommendation is calculated from the exact model architecture.
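
The classification itself is a straightforward threshold check. A sketch, using the 400 MB overhead and the cut-offs quoted above:

def classify_fit(model_bytes, kv_bytes, total_ram_bytes, overhead_bytes=400 * 2**20):
    pct = 100 * (model_bytes + kv_bytes + overhead_bytes) / total_ram_bytes
    if pct < 85:
        return "SAFE"
    if pct < 92:
        return "MARGINAL"
    if pct <= 100:
        return "SWAP RISK"
    return "OOM"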

Speed — high impact

Five ways autotune reduces latency.

TTFT — time to first token — is what you feel as the “thinking pause” before the model starts responding. autotune reduces it through five distinct techniques, all working simultaneously, ordered here from most to least impactful.

📌
#03 · High impact · Multi-turn

System prompt prefix caching

In any multi-turn conversation, the system prompt — “You are a helpful assistant. You prefer concise answers” — is identical on every single turn. By default, Ollama re-processes (re-evaluates through every layer of the model) this entire system prompt from scratch on every message.

autotune counts the system prompt's tokens and tells Ollama: keep these first N tokens in the KV cache permanently. The Ollama parameter for this is num_keep. Once set, those tokens are evaluated exactly once — at the start of the conversation — and never again.

Without prefix caching
Turn 1: process system prompt (100 tokens) + message
Turn 2: process system prompt (100 tokens) + turn 1 + message
Turn 3: process system prompt (100 tokens) + turn 1-2 + message
System prompt re-processed every turn
With prefix caching
Turn 1: process system prompt (100 tokens) + message
Turn 2: skip system prompt ← new tokens only
Turn 3: skip system prompt ← new tokens only
Savings compound with every turn

In agentic workloads where a session has 10+ turns, this compounding effect means TTFT actually decreases as the session grows — the opposite of what raw Ollama shows.
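
On the wire, this is one extra option on a normal Ollama /api/chat call. A minimal sketch (the 100-token count is illustrative; autotune derives it from the actual system prompt):

import requests

system_prompt_tokens = 100  # counted from the real system prompt

requests.post("http://localhost:11434/api/chat", json={
    "model": "qwen3:8b",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant. You prefer concise answers."},
        {"role": "user", "content": "Summarize the README for me."},
    ],
    "options": {"num_keep": system_prompt_tokens},  # pin the first N tokens in the KV cache
})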

♾️
#04 · High impact · Session start

Model keep-alive

By default, Ollama unloads a model from RAM after 5 minutes of idle. The next time you send a message — even seconds later — it reads the entire model file from disk, loads it into GPU/Metal memory, and warms up the runtime. On a 5 GB model, this costs 1–4 seconds before your first token appears.

autotune sets keep_alive="-1" (keep forever) on every request. The model stays in RAM between conversations.

On the RAM question: The model's weights were already taking up RAM from the moment it was loaded. Setting keep-alive to forever means Ollama doesn't release and re-acquire that same RAM between sessions. It doesn't cost more memory — it just keeps the memory committed, which eliminates the reload time. You can disable this via autotune config set keep_alive_enabled false.
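
In API terms this is a single field on each request. A minimal sketch against the standard /api/chat endpoint:

import requests

requests.post("http://localhost:11434/api/chat", json={
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "hello again"}],
    "keep_alive": -1,  # never unload; Ollama's default is 5 minutes
})
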
🎯
#05 · High impact · Every request

Context bucket snapping

After computing the minimum context size needed, autotune rounds it up to the nearest “bucket” from a fixed list:

Buckets: 512 · 768 · 1024 · 1536 · 2048 · 3072 · 4096 · 6144 · 8192 · 12288 · 16384 · 32768

Here's why this matters enormously: Ollama caches the KV buffer for the most recently used context length. If num_ctx changes between requests — say 1,286 then 1,157 then 1,308 — Ollama must reallocate the Metal buffer on every single call, even if the model is already loaded. This “KV thrashing” adds 100–300 ms of overhead per request and completely negates the benefit of smaller context windows.

By snapping to buckets, prompts of 50–200 tokens all map to bucket 1,536. Ollama allocates it once and reuses the buffer on every subsequent request — zero reallocation cost. All bucket sizes are multiples of 256, which aligns with Metal's memory alignment boundaries for F16 tensors.
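
The snapping step itself is a small lookup over that fixed list. A sketch:

from bisect import bisect_left

BUCKETS = [512, 768, 1024, 1536, 2048, 3072, 4096, 6144, 8192, 12288, 16384, 32768]

def snap_to_bucket(num_ctx):
    # round up to the nearest bucket; anything beyond the list caps at the largest bucket
    return BUCKETS[min(bisect_left(BUCKETS, num_ctx), len(BUCKETS) - 1)]

snap_to_bucket(1302)  # -> 1536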

#07 · Medium impact · Every request

Flash attention

Standard attention computes the full attention matrix in memory. For a context window of N tokens, this requires O(N²) memory — it grows fast and causes large memory spikes during the initial prompt processing phase.

Flash attention is a mathematically identical algorithm that computes attention in tiles (blocks) rather than materializing the full matrix at once. It needs only O(N) memory for the same computation — the peak activation memory spike during prefill (the initial prompt processing) is dramatically smaller.

autotune passes flash_attn: true on every request. Models and Ollama builds that support it use it; those that don't silently ignore the flag. Zero quality impact — it's purely an implementation optimization, not an approximation.

🚀
#09 · Medium impact · Long prompts

Larger prefill batch size

During “prefill” — when the model processes your entire prompt before generating anything — tokens are fed through the model in chunks called batches. Ollama's default is 512 tokens per chunk.

autotune sets num_batch=1024. For a 700-token prompt: the default takes 2 GPU passes (0→512, 512→700). With 1024, it takes 1 pass. Fewer passes means fewer Metal kernel dispatches, which directly cuts prefill time for any prompt longer than 512 tokens.

For short prompts (under 512 tokens), llama.cpp automatically caps the actual batch at the prompt length — so there's no extra memory allocation for short messages. At critical RAM pressure, autotune drops this back to 256 to reduce the peak activation tensor footprint.
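
The pass count is just a ceiling division, which makes the claim easy to sanity-check:

import math

prompt_tokens = 700
math.ceil(prompt_tokens / 512)   # 2 prefill passes at Ollama's default num_batch
math.ceil(prompt_tokens / 1024)  # 1 pass with num_batch=1024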

Adaptive intelligence

Systems that watch and respond.

Static settings only get you so far. These two systems watch what's actually happening on your machine and respond in real time.

🔧
#11 · Moderate impact · During inference

Hardware tuner — OS-level scheduling

Before each inference call, autotune makes real changes to how your operating system schedules the inference process. After the call completes, everything is restored to normal.

macOS QOS class
Sets the thread to USER_INTERACTIVE — the highest scheduling priority macOS offers (the same class used for scrolling animations and direct UI responses). The inference process literally gets more CPU time than background tasks during generation.
Python GC disabled
Python's garbage collector runs “stop the world” pauses where all Python code stops — potentially for tens of milliseconds. During streaming generation, these create visible hitches in output. autotune disables GC during inference (collecting first to clean up) and re-enables it after.
Process priority (nice)
Raises both the autotune process priority and — where permitted — the Ollama server process priority on macOS and Linux. The OS scheduler gives higher-priority processes more CPU time slices, directly improving inference throughput when the system is under load from other apps.
Linux CPU governor
On Linux, attempts to set the CPU frequency governor to performance mode, disabling frequency scaling so the CPU runs at full clock speed during inference (requires root; silently skipped otherwise).
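
A stripped-down sketch of the GC and priority part (the macOS QoS call goes through a native API and is omitted here; the nice value is illustrative):

import contextlib
import gc
import os

@contextlib.contextmanager
def inference_priority(nice_delta=-5):
    gc.collect()   # clean up once so nothing piles up while GC is paused
    gc.disable()   # no stop-the-world pauses during streaming
    try:
        os.nice(nice_delta)  # raising priority usually needs elevated privileges
    except OSError:
        pass                 # silently skip, as described above
    try:
        yield
    finally:
        gc.enable()  # GC resumes after the call; priority restoration is not shown here
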
🧠
#12 · Moderate impact · Live monitoring

Adaptive session advisor

During a session, the adaptive advisor continuously watches RAM usage, swap activity, tokens per second, and time to first token. It compares live metrics to a baseline it builds from your first few requests, and acts if things degrade.

Health score (0–100) — updated every 30 seconds

90–100: Running smoothly
75–89: Moderate load — watching closely
55–74: Memory pressure building
35–54: Stressed — action recommended
0–34: Critical — immediate action needed

When the score drops below a threshold, the advisor takes actions in order from least to most disruptive:

1. Reduce concurrency: Fewer parallel requests → less simultaneous KV pressure
2. Reduce context window: Smaller window → smaller KV cache → RAM freed immediately
3. Lower KV precision: F16 → Q8 → frees ~50% of KV memory at once
4. Enable prompt caching: Forces prefix caching if not already active
5. Disable speculative decoding: Frees draft-model memory if applicable
6. Lower quantization: Suggests pulling a lighter model variant
7. Switch to smaller model: Last resort — model is simply too large for current RAM

There's a 20-second cooldown between actions to avoid thrashing, and the advisor waits 90 seconds of sustained stability before considering a scale-up. It also attributes performance changes — it knows whether a RAM spike was caused by loading a new model, KV growth, or a background application.
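
A sketch of that escalation (the action order and 20-second cooldown are from the list above; the trigger threshold and bookkeeping shown here are assumptions for illustration):

import time

ACTIONS = [
    "reduce_concurrency", "reduce_context_window", "lower_kv_precision",
    "enable_prompt_caching", "disable_speculative_decoding",
    "lower_quantization", "switch_to_smaller_model",
]

def next_action(health_score, last_action_index, last_action_time, trigger=55, cooldown=20):
    if health_score >= trigger or time.time() - last_action_time < cooldown:
        return None  # healthy enough, or still cooling down from the previous action
    return ACTIONS[min(last_action_index + 1, len(ACTIONS) - 1)]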

Context & conversation

Two systems for managing long conversations.

Every conversation turn adds tokens to the history. Without management, long sessions hit the context ceiling and either lose old messages or require a larger (more expensive) context window. autotune handles both automatically.

🗜️
#13 · Moderate impact · Long sessions

Context compressor

As conversation history grows and approaches the context limit, autotune selectively compresses older messages to make room — without deleting them entirely or losing their meaning.

Context budget tiers

< 55%: FULL — all turns verbatim, no compression
55–75%: RECENT+FACTS — last 8 turns + fact summary for older
75–90%: COMPRESSED — last 6 turns (light compression) + compact summary
> 90%: EMERGENCY — last 4 turns (compressed) + one-line summary

Compression is applied in order from lightest to most aggressive:

Strip noise: Remove extra blank lines, trailing whitespace — lossless
Compress JSON blobs: {"key1": ..., "key2": ...} → {/* 12 keys: key1, key2… */}
Shorten tool output: Keep first 12 lines + last 6 lines, mark middle as omitted
Trim assistant messages: Keep first paragraph + up to 2 code blocks + last paragraph
Trim user messages: Preserve first ~600 characters (the intent), trim repetition

Code blocks are always preserved first — they carry the most information per token and losing them would make the context misleading. All truncation happens at sentence or paragraph boundaries, never mid-sentence.
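
Tier selection is a simple ratio check against the context budget. A sketch of the table above:

def compression_tier(history_tokens, context_budget_tokens):
    usage = history_tokens / context_budget_tokens
    if usage < 0.55:
        return "FULL"
    if usage < 0.75:
        return "RECENT+FACTS"
    if usage < 0.90:
        return "COMPRESSED"
    return "EMERGENCY"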

💾
#14 · Selective impact · Across sessions

Conversation memory & recall

Every conversation you have is automatically saved to a local SQLite database on your machine — not sent anywhere. At the start of each new conversation, autotune searches your history for context that's relevant to what you're asking about now, and quietly injects it as a note in the system prompt.

If you asked about FastAPI authentication three sessions ago, and now you're asking a related question, the model will have that prior context available without you having to re-explain it.

How search works
Vector search (primary): Uses a local embedding model (nomic-embed-text, ~274 MB) to find semantically similar past exchanges — even if they use different words.
FTS5 keyword search (fallback): If the embedding model isn't available, falls back to full-text search across all stored conversations.
Injection threshold
Only injects if the best match has a cosine similarity above 0.38 — a deliberately conservative threshold. The rule is: it's better to show no context than irrelevant noise. Up to 3 relevant memories are injected, capped at 1,200 characters total to avoid bloating the system prompt.
Privacy: All data stays local. The SQLite database lives at ~/.autotune/recall.db on your machine. Nothing is sent to any server. The embedding model runs entirely in Ollama on your hardware.
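
A sketch of the selection step: cosine similarity against stored embeddings, with the 0.38 floor, 3-memory cap, and 1,200-character budget described above (the data layout is illustrative):

import numpy as np

def select_memories(query_vec, stored, threshold=0.38, max_items=3, max_chars=1200):
    # stored: list of (text, embedding_vector) pairs pulled from the local database
    scored = []
    for text, vec in stored:
        sim = float(np.dot(query_vec, vec) /
                    (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
        if sim >= threshold:  # below 0.38: better to inject nothing than noise
            scored.append((sim, text))
    picked, used = [], 0
    for _, text in sorted(scored, reverse=True)[:max_items]:
        if used + len(text) > max_chars:
            break
        picked.append(text)
        used += len(text)
    return picked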

Honesty

What autotune doesn't change.

We'd rather be transparent about limitations than have you discover them yourself.

Generation speed
Token throughput (tokens per second) is Metal GPU-bound on Apple Silicon and CUDA-bound on NVIDIA. autotune doesn't touch the generation loop. Benchmarks show ±2% variance — that's measurement noise, not a real difference.
Unchanged
Model weights
autotune changes context window size, KV cache precision, and scheduling. It never modifies the model's actual weights. Output quality is identical — autotune changes how memory is managed, not what the model knows.
Unchanged
First turn in agentic sessions
autotune pre-allocates a larger KV window for a full agentic session upfront. Turn 1 is ~80% slower as a result. From turn 2 onward, prefix-cache savings compound and total wall time comes out ~46% lower. Worth it for sessions with 3+ turns.
Turn 1 is slower
Swap on severely low RAM
NoSwapGuard prevents swap when RAM is adequate. If your machine is running critically low (e.g. multiple large models loaded simultaneously), the guard may not be able to reduce context enough to fit — it will tell you explicitly.
Can't prevent everything
No cloud or external dependency
autotune runs entirely locally. There's no API key, no account, no cloud service required. Anonymous telemetry is opt-in and off by default. Everything — inference, memory recall, embedding — runs on your hardware.
Fully local
Output is identical
The same prompt, the same model, the same temperature — you will get equivalent responses with or without autotune. prompt_eval_count (how many tokens Ollama actually processed) is identical in both conditions. We just do it with less RAM.
Zero quality tradeoff

Run the proof on your machine.

Every number in our benchmarks is reproducible. One command runs a 30-second head-to-head using Ollama's own internal timers on your hardware.

autotune proof -m qwen3:8b