Command reference
Every command listed with examples and flags. New? Start with autotune recommend to get a hardware-matched model, then autotune proof to verify the improvement is real.
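A minimal first session might look like this (the model name is illustrative — use whatever autotune recommend suggests for your hardware):

  autotune recommend                 # profile hardware, get a matched model
  autotune pull qwen3:8b             # download the suggested model
  autotune proof --model qwen3:8b    # verify the improvement is real
  autotune chat --model qwen3:8b     # start an optimized chat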
autotune chat (Quick start)

Opens chat immediately — no pre-flight check. autotune still applies all real-time optimizations (adaptive RAM monitoring, KV cache manager, dynamic context compression). Use this when you already know the model fits, or for HuggingFace and MLX models.
Usage
  autotune chat --model qwen3:8b
  autotune chat --model qwen3:8b --system "You are a concise coding assistant"
  autotune chat --model qwen3:8b --profile quality
  autotune chat --model qwen3:8b --conv-id abc123

Flags
  --model, -m    Model to use (required). Run `autotune ls` for available models.
  --profile, -p  fast / balanced / quality. Default: balanced.
  --system, -s   Custom system prompt.
  --conv-id      Resume a saved conversation by ID.

autotune run (First time with a model)

Analyzes the model's memory requirements before loading it — checks if it fits, how tight it is, and auto-selects the safest profile and context window. Then opens the same optimized chat. Use this the first time you try a model, or any time you're unsure if it fits your RAM.
Usage
  autotune run qwen3:8b
  autotune run qwen2.5-coder:14b --profile balanced
  autotune run llama3.2 --system "You are a helpful assistant"
  autotune run qwen3:8b --force

Arguments and flags
  MODEL          Model name (positional argument, required).
  --profile, -p  fast / balanced / quality / auto. Default: auto (autotune picks for you).
  --system, -s   Custom system prompt.
  --force        Override the memory check and start anyway.
  --recall       Inject relevant context from past conversations.

autotune hardware

Scans your CPU, RAM, and GPU. Shows which AI models fit in your available memory and which apps are consuming the most RAM right now. If closing one app would let you run a larger model, autotune will tell you.
Usage
  autotune hardware
  autotune hardware --no-ram-tips

Flags
  --ram-tips / --no-ram-tips  Show top RAM consumers and model-unlock suggestions. Default: on.

autotune ls

Shows every model you've downloaded and whether it fits in your available RAM. Includes the safe context window size, recommended profile, and a warning if quantization is too aggressive for your hardware.
Usage
  autotune ls
  autotune ls --json

Flags
  --json  Output as JSON instead of a table.
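The --json flag makes the output scriptable. A minimal sketch, just pretty-printing it (the JSON schema isn't documented here, so verify any field names before scripting against them):

  autotune ls --json | python3 -m json.tool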
autotune ps

Shows every model currently loaded in RAM across Ollama, MLX, and LM Studio. Use this to check if a model is still occupying memory after you stopped chatting with it.

Usage
  autotune ps

autotune pull

Download any Ollama model directly from within autotune. Run without a model name to browse popular recommendations for your hardware.
Usage
  autotune pull qwen3:8b
  autotune pull                      # browse popular recommendations

  # Pull a model, then chat with it:
  autotune pull qwen3:8b
  autotune chat --model qwen3:8b

autotune models

Shows all models installed on this machine across Ollama, MLX, and LM Studio — with size on disk, architecture, quantization level, and quality tier based on public benchmarks (MMLU, HumanEval).
Usage
  autotune models
  autotune models --registry

Flags
  --registry  Show autotune's internal model list instead of locally downloaded models.

autotune unload

Releases a model from memory without opening a chat session. Useful after a heavy session — frees RAM for other apps right away. If you omit the model name, you'll see an interactive picker.
Usage
  autotune unload qwen3:8b
  autotune unload                    # interactive picker
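A common pairing is to check what's still resident and then free it (model name illustrative):

  autotune ps
  autotune unload qwen3:8b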
autotune serve

Starts a local API server that any OpenAI-compatible tool can connect to — Continue.dev, Open WebUI, LangChain, a Python script, anything. All autotune optimizations apply to every request automatically.

Usage
  autotune serve
  autotune serve --mlx

Point any OpenAI-compatible client at it:

  from openai import OpenAI
  client = OpenAI(base_url="http://localhost:8765/v1", api_key="autotune")

Or check it from the command line:

  curl http://localhost:8765/v1/models

Flags
  --host    Bind address. Default: 127.0.0.1 (local only).
  --port    Port. Default: 8765.
  --mlx     Enable MLX backend on Apple Silicon (~10–40% higher throughput, disables tool calling).
  --reload  Auto-reload on code changes (dev mode).
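Only the /v1/models route is shown above; here is a sketch of a chat request assuming the server also exposes the standard OpenAI /v1/chat/completions route (model name illustrative):

  curl http://localhost:8765/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3:8b", "messages": [{"role": "user", "content": "Say hi"}]}'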
autotune recommend

Profiles your hardware and recommends the best model and settings for your machine. Shows alternatives for fastest, balanced, and best-quality modes — including exact `ollama pull` commands to get started instantly.

Usage
  autotune recommend
  autotune recommend --mode balanced
  autotune recommend --model qwen3:8b
  autotune recommend --top 5

Flags
  --mode, -m          fastest / balanced / best_quality / all. Default: all.
  --model             Restrict recommendations to a single model.
  --top N             Number of alternatives to show per mode. Default: 3.
  --no-show-hardware  Skip the hardware profile section.

Verify that autotune is actually helping on your specific machine. All timings come from Ollama's own Go nanosecond timers — nothing estimated.
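A reasonable ladder is to start cheap and escalate only when you want more rigor (model name illustrative):

  autotune proof --model qwen3:8b        # fast sanity check
  autotune proof-suite -m qwen3:8b       # statistical significance tests
  autotune user-bench --model qwen3:8b   # what users actually feel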
autotune proof (Run this first)

Runs a fast, honest benchmark comparing raw Ollama defaults against autotune. Measures TTFT (time to first token), KV cache RAM usage, RAM headroom, swap events, and generation speed. Includes a long-context test that shows TTFT improvements by comparing KV buffer allocation at 4096 tokens vs autotune's dynamically sized buffer. Results are saved to JSON.
Usage
  autotune proof
  autotune proof --model qwen3:8b
  autotune proof --model qwen3:8b --runs 3
  autotune proof --model qwen3:8b --output results.json
  autotune proof --list-models

Flags
  --model, -m    Ollama model to benchmark. Auto-selects if omitted.
  --runs, -r     Runs per condition. 2 is fast (~30s); 3+ gives more stable numbers. Default: 2.
  --profile, -p  fast / balanced / quality. autotune profile to test against raw. Default: balanced.
  --output, -o   Save JSON results to this path. Defaults to proof_<model>.json.
  --list-models  List locally installed Ollama models and exit.

autotune proof-suite (Deep analysis)

Runs a curated 5-prompt suite (factual, code, analysis, conversation, long output) through both raw Ollama and autotune across multiple models. Reports Ollama-internal timing, process-isolated RAM, and full statistical significance: Wilcoxon signed-rank test, Cohen's d effect size, and 95% confidence intervals.
Usage
  autotune proof-suite
  autotune proof-suite -m llama3.2:3b -m qwen3:8b
  autotune proof-suite -m qwen3:8b --runs 5
  autotune proof-suite --output results.json

Flags
  --models, -m   Ollama model IDs to benchmark. Repeat for multiple. Default: llama3.2:3b, qwen3:8b.
  --runs, -n     Inference runs per condition per prompt. Minimum 3 for statistics. Default: 3.
  --profile, -p  autotune profile to compare against raw Ollama. Default: balanced.
  --output, -o   Save full results to a JSON file.
  --list-models  List locally installed Ollama models and exit.

autotune bench (Advanced)

Runs a full benchmark suite with multiple prompts across different task types (short, code, reasoning, analysis). Use this when you want detailed per-prompt breakdowns or want to compare two different autotune profiles against each other.
Usage
  autotune bench --model qwen3:8b
  autotune bench --model qwen3:8b --duel
  autotune bench --model qwen3:8b --raw
  autotune bench --model qwen3:8b --compare
  autotune bench --model qwen3:8b --runs 5

Flags
  --model, -m    Ollama model to benchmark (required).
  --runs, -r     Runs per prompt per mode. Default: 3.
  --profile, -p  autotune profile to use. Default: balanced.
  --duel         Compare two profiles against each other.
  --raw          Run raw Ollama only (no autotune).
  --compare      Run both raw and autotune and show a side-by-side diff.
  --output, -o   Save results JSON to this path.

autotune user-bench

Measures what users actually feel — not raw throughput. Runs autotune head-to-head against raw Ollama across realistic laptop workflows: background queries, sustained chat, agent loops, and code debugging. Reports in user-friendly language: swap events, RAM headroom, TTFT consistency, CPU spikes, and a 0–100 background impact score. Can run in the background (survives terminal close) with a desktop notification when done.
Usage
  autotune user-bench --model qwen3:8b
  autotune user-bench --model qwen3:8b --quick
  autotune user-bench --model qwen3:8b --background
  autotune user-bench --all-models --runs 2

Flags
  --model, -m    Ollama model to benchmark. Auto-selects first installed model if omitted.
  --profile, -p  autotune profile to use. Default: balanced.
  --runs, -r     Runs per scenario per condition. Default: 3.
  --quick, -q    Quick mode: 2 scenarios instead of 4 (~10–15 min).
  --all-models   Run on every locally installed Ollama model.
  --background   Fork to background — survives terminal close, sends a desktop notification when done.
  --output-dir   Directory for result JSON files. Default: current directory.

autotune agent-bench

Tests autotune on 5 realistic agentic tasks: code debugging, research synthesis, step planning, adversarial context, and extended sessions. The key story is TTFT growth curves — in raw Ollama, TTFT grows linearly with each conversation turn as the full 4096-token KV buffer fills. autotune's dynamic context sizing keeps TTFT flat by sizing the window to actual usage.
Usage
  autotune agent-bench
  autotune agent-bench -m llama3.2:3b -m qwen3:8b
  autotune agent-bench --quick
  autotune agent-bench --tasks code_debugger,extended_session
  autotune agent-bench -m qwen3:8b --output agent_results.json

Flags
  --models, -m   Ollama model IDs to benchmark. Repeat for multiple. Default: llama3.2:3b, qwen3:8b.
  --trials, -n   Trials per condition per task. Min 3 recommended. Default: 5.
  --tasks, -t    Comma-separated task IDs. Options: code_debugger, research_synth, step_planner, adversarial_context, extended_session.
  --profile, -p  autotune profile to test. Default: balanced.
  --quick, -q    Quick mode: 3 tasks, 2 trials (~20–30 min).
  --output, -o   Save full results JSON to this path.

autotune stores past conversations locally (SQLite + optional vector embeddings) so future chat sessions can surface relevant context automatically.
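A typical memory workflow ties the subcommands below together (model name illustrative):

  autotune memory setup                        # one-time: enable semantic search
  autotune run qwen3:8b --recall               # chat, injecting past context
  autotune memory search "postgres migration"  # find earlier discussions
  autotune memory stats                        # check what's stored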
autotune memory search (Search past conversations)

Search your conversation history by semantic meaning or keywords. Uses vector search (cosine similarity) when an embedding model is available, otherwise falls back to FTS5 full-text keyword search.
Usage
  autotune memory search "your query here"
  autotune memory search "postgres migration"
  autotune memory search "FastAPI authentication" --top 10
  autotune memory search "React hooks" --min-score 0.4

Arguments and flags
  QUERY        Search query (required).
  --top, -n    Number of results to return. Default: 5.
  --min-score  Minimum similarity score (0–1). Ignored for FTS5 fallback. Default: 0.20.

autotune memory list (Browse stored memories)

List recently stored conversation memories with timestamps, model names, and a preview of each chunk. Use this to find a memory ID before deleting it.
Usage
  autotune memory list
  autotune memory list --days 7
  autotune memory list --model qwen3:8b --limit 50

Flags
  --limit, -n  Number of memories to show. Default: 20.
  --days       Only show memories from the last N days.
  --model      Filter by model (e.g. qwen3:8b).

autotune memory stats (Memory store statistics)

Show statistics about the local memory store: total chunks, how many have vector embeddings, database size, date range, breakdown by model, and whether semantic search is active.
Usage
  autotune memory stats

autotune memory forget (Delete memories)

Delete one memory chunk, all memories for a specific conversation, or wipe the entire store. Use autotune memory list first to find the memory ID (see the example after the flag list below).
Usage
  autotune memory forget <memory-id>
  autotune memory forget 42
  autotune memory forget --conv-id abc123
  autotune memory forget --all
  autotune memory forget --all --yes   # skip confirmation

Arguments and flags
  MEMORY_ID  ID of the specific memory chunk to delete (see autotune memory list).
  --all      Delete ALL memories (asks for confirmation unless --yes is passed).
  --conv-id  Delete all memories for a specific conversation ID.
  --yes, -y  Skip confirmation prompt.
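For example, to locate and delete a single chunk (the ID 42 is illustrative):

  autotune memory list --days 7   # note the ID of the chunk you want gone
  autotune memory forget 42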
autotune memory setup (Enable semantic search)

Pull nomic-embed-text from Ollama (~274 MB) to enable semantic vector search across your conversation history. Without this, autotune uses FTS5 keyword search instead.

Usage
  autotune memory setup

MLX runs LLMs entirely on-chip using Apple's unified memory and Metal GPU kernels — typically 10–40% faster than Ollama on the same model. Requires an Apple Silicon Mac (M1/M2/M3/M4).
autotune mlx list (Show cached MLX models)

List all MLX-format models already downloaded locally, with size on disk.
Usage
  autotune mlx list

autotune mlx pull (Download an MLX model)

Download an MLX-quantized model from the mlx-community on HuggingFace. You can use an Ollama model name (e.g. qwen3:8b) and autotune will resolve the correct MLX variant automatically.

Usage
  autotune mlx pull <model>
  autotune mlx pull qwen3:8b
  autotune mlx pull llama3.2
  autotune mlx pull qwen2.5-coder:14b --quant 8bit

Arguments and flags
  MODEL        Model name (required). Ollama name (e.g. qwen3:8b) or full HuggingFace ID.
  --quant, -q  Quantization level: 4bit / 8bit / bf16. Default: 4bit.

autotune mlx resolve (Look up the MLX model ID)

Show which MLX HuggingFace model ID would be used for a given Ollama model name. Useful to check before pulling.
Usage
  autotune mlx resolve <model>
  autotune mlx resolve qwen3:8b
  autotune mlx resolve llama3.2
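Putting the MLX commands together (model name illustrative):

  autotune mlx resolve qwen3:8b   # check which HuggingFace ID will be used
  autotune mlx pull qwen3:8b      # download the 4-bit MLX variant
  autotune serve --mlx            # serve with the MLX backend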
autotune telemetry

Shows a table of all recent inference runs with TTFT, throughput, RAM/swap pressure, CPU load, and completion status. Also manages opt-in/out for anonymous telemetry collection (hardware fingerprint + performance data sent to the autotune team to improve defaults).

Usage
  autotune telemetry
  autotune telemetry --model qwen3:8b
  autotune telemetry --events --model qwen3:8b
  autotune telemetry --status
  autotune telemetry --enable
  autotune telemetry --disable

Flags
  --model    Filter to a specific model ID.
  --limit    Number of recent runs to show. Default: 20.
  --events   Show individual telemetry events (RAM spikes, slow tokens, errors) instead of run history.
  --status   Show current telemetry consent status.
  --enable   Opt in to anonymous telemetry collection.
  --disable  Opt out — no further data will be sent.

autotune storage

Enable or disable local SQLite storage of performance observations, telemetry events, and agent benchmark results. Model metadata is always stored regardless of this setting. Run without an argument to see the current status.
Usage
  autotune storage [on|off|status]
  autotune storage status
  autotune storage on
  autotune storage off

autotune doctor

Runs a full health check: Python version, required packages, whether Ollama and other backends are reachable, RAM and swap headroom, and database health. Every check shows a clear pass/fail so you know exactly what needs fixing.
Usage
  autotune doctor
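When something seems off, a sensible first move is to pair it with the upgrade command below (the ordering is a suggestion, not a requirement):

  autotune doctor    # pinpoint what's failing
  autotune upgrade   # then pick up the latest fixes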
autotune upgrade

Checks PyPI for a newer version of autotune, shows what changed, and upgrades with one keypress. Because autotune is updated frequently, run this any time something seems off or you want the latest improvements.

Usage
  autotune upgrade
  autotune upgrade --yes   # skip confirmation

autotune runs on top of Ollama. Here are the Ollama commands that pair with autotune.
  ollama serve           # start the Ollama server
  ollama pull qwen3:8b   # download a model
  ollama list            # list installed models
  ollama ps              # show models currently loaded in memory
  ollama rm qwen3:8b     # delete a downloaded model
  ollama --version       # print the installed Ollama version

Follow the install guide to get running in 5 minutes.