Command reference

All autotune commands

Every command listed with examples and flags. New? Start with autotune recommend to get a hardware-matched model, then autotune proof to verify the improvement is real.
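
To see the whole flow, run the two back to back (autotune proof auto-selects an installed model when none is given):
Terminal
autotune recommend
autotune proof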

Get started

autotune chat (Quick start)
Jump straight into a conversation

Opens chat immediately — no pre-flight check. autotune still applies all real-time optimizations (adaptive RAM monitoring, KV cache manager, dynamic context compression). Use this when you already know the model fits, or for HuggingFace and MLX models.

Usage
Terminal
autotune chat --model qwen3:8b
Examples
Basic chat
Terminal
autotune chat --model qwen3:8b
With a system prompt
Terminal
autotune chat --model qwen3:8b --system "You are a concise coding assistant"
Quality mode (slower but smarter responses)
Terminal
autotune chat --model qwen3:8b --profile quality
Resume a previous conversation
Terminal
autotune chat --model qwen3:8b --conv-id abc123
Flags
--model, -m    Model to use (required). Run `autotune ls` for available models.
--profile, -p  fast / balanced / quality. Default: balanced.
--system, -s   Custom system prompt.
--conv-id      Resume a saved conversation by ID.
autotune run (First time with a model)
Memory check, then chat

Analyzes the model's memory requirements before loading it — checks if it fits, how tight it is, and auto-selects the safest profile and context window. Then opens the same optimized chat. Use this the first time you try a model, or any time you're unsure if it fits your RAM.

Usage
Terminal
autotune run qwen3:8b
Examples
Terminal
autotune run qwen3:8b
Terminal
autotune run qwen2.5-coder:14b --profile balanced
With a custom system prompt
Terminal
autotune run llama3.2 --system "You are a helpful assistant"
Override if autotune says the model is too large
Terminal
autotune run qwen3:8b --force
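Pull context from past conversations into the session (the --recall flag is described under Flags)
Terminal
autotune run qwen3:8b --recall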
Flags
MODEL          Model name (positional argument, required).
--profile, -p  fast / balanced / quality / auto. Default: auto (autotune picks for you).
--system, -s   Custom system prompt.
--force        Override the memory check and start anyway.
--recall       Inject relevant context from past conversations.
autotune hardware
See what your machine can do

Scans your CPU, RAM, and GPU. Shows which AI models fit in your available memory and which apps are consuming the most RAM right now. If closing one app would let you run a larger model, autotune will tell you.

Usage
Terminal
autotune hardware
Examples
Full report with RAM tips
Terminal
autotune hardware
Hardware only (skip RAM tips)
Terminal
autotune hardware --no-ram-tips
Flags
--ram-tips / --no-ram-tips  Show top RAM consumers and model-unlock suggestions. Default: on.

Manage models

autotune ls
List downloaded models with fit scores

Shows every model you've downloaded and whether it fits in your available RAM. Includes the safe context window size, recommended profile, and a warning if quantization is too aggressive for your hardware.

Usage
Terminal
autotune ls
Examples
Terminal
autotune ls
Machine-readable JSON output
Terminal
autotune ls --json
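The JSON output is handy for scripting, e.g. piped to jq (assumes jq is installed)
Terminal
autotune ls --json | jq .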
Flags
--json  Output as JSON instead of a table.
autotune ps
See which models are loaded in memory right now

Shows every model currently loaded in RAM across Ollama, MLX, and LM Studio. Use this to check if a model is still occupying memory after you stopped chatting with it.

Usage
Terminal
autotune ps
autotune pull
Download a model

Download any Ollama model directly from within autotune. Run without a model name to browse popular recommendations for your hardware.

Usage
Terminal
autotune pull qwen3:8b
Examples
Download a specific model
Terminal
autotune pull qwen3:8b
Browse popular models for your hardware
Terminal
autotune pull
Download and immediately chat
Terminal
autotune pull qwen3:8b
autotune chat --model qwen3:8b
autotune models
List all available models

Shows all models installed on this machine across Ollama, MLX, and LM Studio — with size on disk, architecture, quantization level, and quality tier based on public benchmarks (MMLU, HumanEval).

Usage
Terminal
autotune models
Examples
Local models
Terminal
autotune models
View autotune's pre-configured model registry
Terminal
autotune models --registry
Flags
--registry  Show autotune's internal model list instead of locally downloaded models.
autotune unload
Free RAM immediately

Releases a model from memory without opening a chat session. Useful after a heavy session — frees RAM for other apps right away. If you omit the model name, you'll see an interactive picker.

Usage
Terminal
autotune unload qwen3:8b
Examples
Unload a specific model
Terminal
autotune unload qwen3:8b
Interactive picker (shows all loaded models)
Terminal
autotune unload

Deploy & integrate

autotune serve
Start an OpenAI-compatible API server

Starts a local API server that any OpenAI-compatible tool can connect to — Continue.dev, Open WebUI, LangChain, a Python script, anything. All autotune optimizations apply to every request automatically.

Usage
Terminal
autotune serve
Examples
Default (localhost:8765)
Terminal
autotune serve
Connect from Python
Python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8765/v1", api_key="autotune")
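A minimal chat request through the local endpoint, assuming the standard OpenAI chat-completions route and a locally installed model (qwen3:8b here):
Python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="autotune")

# Standard OpenAI chat-completions call; the model name must match
# one of your installed models (see autotune ls).
resp = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp.choices[0].message.content)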
Connect via curl
Terminal
curl http://localhost:8765/v1/models
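Chat via curl, assuming the standard OpenAI /v1/chat/completions route and an installed model:
Terminal
curl http://localhost:8765/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3:8b", "messages": [{"role": "user", "content": "Hello"}]}'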
Apple Silicon — also enable MLX backend
Terminal
autotune serve --mlx
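Expose the server to other machines on your network (combines the --host and --port flags below)
Terminal
autotune serve --host 0.0.0.0 --port 9000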
Flags
--host    Bind address. Default: 127.0.0.1 (local only).
--port    Port. Default: 8765.
--mlx     Enable MLX backend on Apple Silicon (~10–40% higher throughput, disables tool calling).
--reload  Auto-reload on code changes (dev mode).
autotune recommend
Get the best model for your hardware

Profiles your hardware and recommends the best model and settings for your machine. Shows alternatives for fastest, balanced, and best-quality modes — including exact `ollama pull` commands to get started instantly.

Usage
Terminal
autotune recommend
Examples
Terminal
autotune recommend
Show only balanced recommendations
Terminal
autotune recommend --mode balanced
Check a specific model
Terminal
autotune recommend --model qwen3:8b
Show top 5 alternatives per mode
Terminal
autotune recommend --top 5
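Combine flags for a single quick answer (all flags documented below)
Terminal
autotune recommend --mode fastest --top 1 --no-show-hardware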
Flags
--mode, -m          fastest / balanced / best_quality / all. Default: all.
--model             Restrict recommendations to a single model.
--top N             Number of alternatives to show per mode. Default: 3.
--no-show-hardware  Skip the hardware profile section.

Benchmarking & proof

Verify that autotune is actually helping on your specific machine. All timings come from Ollama's own Go nanosecond timers — nothing estimated.

autotune proof (Run this first)
Quick head-to-head benchmark (~30 seconds)

Runs a fast, honest benchmark comparing raw Ollama defaults against autotune. Measures TTFT (time to first token), KV cache RAM usage, RAM headroom, swap events, and generation speed. Includes a long-context test that shows TTFT improvements by comparing KV buffer allocation at 4096 tokens vs autotune's dynamically sized buffer. Results are saved to JSON.

Usage
Terminal
autotune proof --model qwen3:8b
Examples
Auto-select the best installed model
Terminal
autotune proof
Benchmark a specific model
Terminal
autotune proof --model qwen3:8b
More runs for stabler numbers
Terminal
autotune proof --model qwen3:8b --runs 3
Save results to a file
Terminal
autotune proof --model qwen3:8b --output results.json
See which models are installed
Terminal
autotune proof --list-models
Flags
--model, -m    Ollama model to benchmark. Auto-selects if omitted.
--runs, -r     Runs per condition. 2 is fast (~30s); 3+ gives more stable numbers. Default: 2.
--profile, -p  fast / balanced / quality. The autotune profile to test against raw Ollama. Default: balanced.
--output, -o   Save JSON results to this path. Defaults to proof_<model>.json.
--list-models  List locally installed Ollama models and exit.
autotune proof-suite (Deep analysis)
Multi-model statistical benchmark

Runs a curated 5-prompt suite (factual, code, analysis, conversation, long output) through both raw Ollama and autotune across multiple models. Reports Ollama-internal timing, process-isolated RAM, and full statistical significance: Wilcoxon signed-rank test, Cohen's d effect size, and 95% confidence intervals.

Usage
Terminal
autotune proof-suite
Examples
Default 3-model run
Terminal
autotune proof-suite
Specific models
Terminal
autotune proof-suite -m llama3.2:3b -m qwen3:8b
More runs per prompt for tighter stats
Terminal
autotune proof-suite -m qwen3:8b --runs 5
Save full results
Terminal
autotune proof-suite --output results.json
Flags
--models, -m   Ollama model IDs to benchmark. Repeat for multiple. Default: llama3.2:3b, qwen3:8b.
--runs, -n     Inference runs per condition per prompt. Minimum 3 for statistics. Default: 3.
--profile, -p  autotune profile to compare against raw Ollama. Default: balanced.
--output, -o   Save full results to a JSON file.
--list-models  List locally installed Ollama models and exit.
autotune bench (Advanced)
Intensive multi-prompt benchmark

Runs a full benchmark suite with multiple prompts across different task types (short, code, reasoning, analysis). Use this when you want detailed per-prompt breakdowns or want to compare two different autotune profiles against each other.

Usage
Terminal
autotune bench --model qwen3:8b
Examples
Standard benchmark
Terminal
autotune bench --model qwen3:8b
Duel mode: compare two profiles head-to-head
Terminal
autotune bench --model qwen3:8b --duel
Raw Ollama only (no autotune)
Terminal
autotune bench --model qwen3:8b --raw
Compare autotune vs raw
Terminal
autotune bench --model qwen3:8b --compare
More runs for stable results
Terminal
autotune bench --model qwen3:8b --runs 5
Flags
--model, -m    Ollama model to benchmark (required).
--runs, -r     Runs per prompt per mode. Default: 3.
--profile, -p  autotune profile to use. Default: balanced.
--duel         Compare two profiles against each other.
--raw          Run raw Ollama only (no autotune).
--compare      Run both raw and autotune and show a side-by-side diff.
--output, -o   Save results JSON to this path.
autotune user-bench
Real-world user experience benchmark

Measures what users actually feel — not raw throughput. Runs autotune head-to-head against raw Ollama across realistic laptop workflows: background queries, sustained chat, agent loops, and code debugging. Reports in user-friendly language: swap events, RAM headroom, TTFT consistency, CPU spikes, and a 0–100 background impact score. Can run in the background (survives terminal close) with a desktop notification when done.

Usage
Terminal
autotune user-bench --model qwen3:8b
Examples
Standard benchmark (~30 min)
Terminal
autotune user-bench --model qwen3:8b
Quick mode: 2 scenarios (~10–15 min)
Terminal
autotune user-bench --model qwen3:8b --quick
Run in background (keeps running after terminal close)
Terminal
autotune user-bench --model qwen3:8b --background
Run on every locally installed model
Terminal
autotune user-bench --all-models --runs 2
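Write result files to a dedicated directory instead of the current one (uses the --output-dir flag below; the path is an example)
Terminal
autotune user-bench --model qwen3:8b --output-dir ~/benchmarks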
Flags
--model, -m    Ollama model to benchmark. Auto-selects the first installed model if omitted.
--profile, -p  autotune profile to use. Default: balanced.
--runs, -r     Runs per scenario per condition. Default: 3.
--quick, -q    Quick mode: 2 scenarios instead of 4 (~10–15 min).
--all-models   Run on every locally installed Ollama model.
--background   Fork to background — survives terminal close, sends a desktop notification when done.
--output-dir   Directory for result JSON files. Default: current directory.
autotune agent-bench
Agentic multi-turn benchmark

Tests autotune on 5 realistic agentic tasks: code debugging, research synthesis, step planning, adversarial context, and extended sessions. The key story is TTFT growth curves — in raw Ollama, TTFT grows linearly with each conversation turn as the full 4096-token KV buffer fills. autotune's dynamic context sizing keeps TTFT flat by sizing the window to actual usage.

Usage
Terminal
autotune agent-bench
Examples
Default run (all 5 tasks, 5 trials each)
Terminal
autotune agent-bench
Specific models
Terminal
autotune agent-bench -m llama3.2:3b -m qwen3:8b
Quick mode: 3 tasks, 2 trials (~20–30 min)
Terminal
autotune agent-bench --quick
Specific tasks only
Terminal
autotune agent-bench --tasks code_debugger,extended_session
Save results
Terminal
autotune agent-bench -m qwen3:8b --output agent_results.json
Flags
--models, -m   Ollama model IDs to benchmark. Repeat for multiple. Default: llama3.2:3b, qwen3:8b.
--trials, -n   Trials per condition per task. Minimum 3 recommended. Default: 5.
--tasks, -t    Comma-separated task IDs. Options: code_debugger, research_synth, step_planner, adversarial_context, extended_session.
--profile, -p  autotune profile to test. Default: balanced.
--quick, -q    Quick mode: 3 tasks, 2 trials (~20–30 min).
--output, -o   Save full results JSON to this path.

Conversation memory

autotune stores past conversations locally (SQLite + optional vector embeddings) so future chat sessions can surface relevant context automatically.

autotune memory list
Browse stored memories

List recently stored conversation memories with timestamps, model names, and a preview of each chunk. Use this to find a memory ID before deleting it.

Usage
Terminal
autotune memory list
Examples
Terminal
autotune memory list
autotune memory list --days 7
autotune memory list --model qwen3:8b --limit 50
Flags
--limit, -n  Number of memories to show. Default: 20.
--days       Only show memories from the last N days.
--model      Filter by model (e.g. qwen3:8b).
autotune memory stats
Memory store statistics

Show statistics about the local memory store: total chunks, how many have vector embeddings, database size, date range, breakdown by model, and whether semantic search is active.

Usage
Terminal
autotune memory stats
autotune memory forget
Delete memories

Delete one memory chunk, all memories for a specific conversation, or wipe the entire store. Use autotune memory list first to find the memory ID.

Usage
Terminal
autotune memory forget <memory-id>
Examples
Terminal
autotune memory forget 42
autotune memory forget --conv-id abc123
autotune memory forget --all
autotune memory forget --all --yes   # skip confirmation
Flags
MEMORY_ID  ID of the specific memory chunk to delete (see autotune memory list).
--all      Delete ALL memories (asks for confirmation unless --yes is passed).
--conv-id  Delete all memories for a specific conversation ID.
--yes, -y  Skip confirmation prompt.
autotune memory setup
Enable semantic search

Pull nomic-embed-text from Ollama (~274 MB) to enable semantic vector search across your conversation history. Without this, autotune uses FTS5 keyword search instead.

Usage
Terminal
autotune memory setup
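Once the embedding model is downloaded, check that semantic search is active:
Terminal
autotune memory stats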

Apple Silicon (MLX)

MLX runs LLMs entirely on-chip using Apple's unified memory and Metal GPU kernels — typically 10–40% faster than Ollama on the same model. Requires an Apple Silicon Mac (M1/M2/M3/M4).
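
The subcommands below chain into a simple workflow; for example (model name is illustrative):
Terminal
autotune mlx resolve qwen3:8b   # check which HuggingFace ID will be used
autotune mlx pull qwen3:8b      # download that MLX variant
autotune mlx list               # confirm it's cached locally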

autotune mlx list
Show cached MLX models

List all MLX-format models already downloaded locally, with size on disk.

Usage
Terminal
autotune mlx list
autotune mlx pull
Download an MLX model

Download an MLX-quantized model from the mlx-community on HuggingFace. You can use an Ollama model name (e.g. qwen3:8b) and autotune will resolve the correct MLX variant automatically.

Usage
Terminal
autotune mlx pull <model>
Examples
Terminal
autotune mlx pull qwen3:8b
autotune mlx pull llama3.2:3b
autotune mlx pull qwen2.5-coder:14b --quant 8bit
Flags
MODEL        Model name (required). Ollama name (e.g. qwen3:8b) or full HuggingFace ID.
--quant, -q  Quantization level: 4bit / 8bit / bf16. Default: 4bit.
autotune mlx resolve
Look up the MLX model ID

Show which MLX HuggingFace model ID would be used for a given Ollama model name. Useful to check before pulling.

Usage
Terminal
autotune mlx resolve <model>
Examples
Terminal
autotune mlx resolve qwen3:8b
autotune mlx resolve llama3.2

Settings

autotune telemetry
View performance history and manage data collection

Shows a table of all recent inference runs with TTFT, throughput, RAM/swap pressure, CPU load, and completion status. Also manages opt-in/out for anonymous telemetry collection (hardware fingerprint + performance data sent to the autotune team to improve defaults).

Usage
Terminal
autotune telemetry
Examples
View recent runs
Terminal
autotune telemetry
Filter by model
Terminal
autotune telemetry --model qwen3:8b
View individual telemetry events
Terminal
autotune telemetry --events --model qwen3:8b
Check consent status
Terminal
autotune telemetry --status
Opt in to anonymous telemetry
Terminal
autotune telemetry --enable
Opt out
Terminal
autotune telemetry --disable
Flags
--model    Filter to a specific model ID.
--limit    Number of recent runs to show. Default: 20.
--events   Show individual telemetry events (RAM spikes, slow tokens, errors) instead of run history.
--status   Show current telemetry consent status.
--enable   Opt in to anonymous telemetry collection.
--disable  Opt out — no further data will be sent.
autotune storage
Manage local SQLite performance data

Enable or disable local SQLite storage of performance observations, telemetry events, and agent benchmark results. Model metadata is always stored regardless of this setting. Run without an argument to see the current status.

Usage
Terminal
autotune storage [on|off|status]
Examples
Check current setting
Terminal
autotune storage status
Enable local storage (default)
Terminal
autotune storage on
Disable storage (e.g. shared / ephemeral machines)
Terminal
autotune storage off

Diagnose

autotune doctor
Check your installation

Runs a full health check: Python version, required packages, whether Ollama and other backends are reachable, RAM and swap headroom, and database health. Every check shows a clear pass/fail so you know exactly what needs fixing.

Usage
Terminal
autotune doctor
autotune upgrade
Update to the latest version

Checks PyPI for a newer version of autotune, shows what changed, and upgrades with one keypress. Because autotune is updated frequently, run this any time something seems off or you want the latest improvements.

Usage
Terminal
autotune upgrade
autotune upgrade --yes   # skip confirmation

Ollama commands you'll use

autotune runs on top of Ollama. Here are the Ollama commands that pair with autotune.

ollama serve
Start the Ollama background service (must be running before autotune chat)
ollama pull qwen3:8b
Download a model (use autotune pull instead to get hardware-aware recommendations)
ollama list
List downloaded models (autotune ls shows more detail)
ollama ps
See models in memory (autotune ps shows more detail)
ollama rm qwen3:8b
Delete a model from disk
ollama --version
Confirm Ollama is installed

New to autotune?

Follow the install guide to get running in 5 minutes.

Install guide →