Command reference
Every command listed with examples and flags. New? Start with autotune recommend to get a hardware-matched model, then autotune proof to verify the improvement is real.
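A minimal first session might look like this (the model name is illustrative — use whatever autotune recommend suggests for your hardware):

  autotune recommend                 # profile hardware, get a matched model
  autotune pull qwen3:8b             # download the suggested model
  autotune proof --model qwen3:8b    # verify the improvement is real
  autotune chat --model qwen3:8b     # start an optimized chat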
autotune chat (Quick start)

Opens chat immediately — no pre-flight check. autotune still applies all real-time optimizations (adaptive RAM monitoring, KV cache manager, dynamic context compression). Use this when you already know the model fits, or for HuggingFace and MLX models.
Usage
  autotune chat --model qwen3:8b
  autotune chat --model qwen3:8b --system "You are a concise coding assistant"
  autotune chat --model qwen3:8b --profile quality
  autotune chat --model qwen3:8b --conv-id abc123

Flags
  --model, -m    Model to use (required). Run `autotune ls` for available models.
  --profile, -p  fast / balanced / quality. Default: balanced.
  --system, -s   Custom system prompt.
  --conv-id      Resume a saved conversation by ID.

autotune run (First time with a model)

Analyzes the model's memory requirements before loading it — checks if it fits, how tight it is, and auto-selects the safest profile and context window. Then opens the same optimized chat. Use this the first time you try a model, or any time you're unsure if it fits your RAM.
Usage
  autotune run qwen3:8b
  autotune run qwen2.5-coder:14b --profile balanced
  autotune run llama3.2 --system "You are a helpful assistant"
  autotune run qwen3:8b --force

Arguments and flags
  MODEL          Model name (positional argument, required).
  --profile, -p  fast / balanced / quality / auto. Default: auto (autotune picks for you).
  --system, -s   Custom system prompt.
  --force        Override the memory check and start anyway.
  --recall       Inject relevant context from past conversations.

autotune hardware

Scans your CPU, RAM, and GPU. Shows which AI models fit in your available memory and which apps are consuming the most RAM right now. If closing one app would let you run a larger model, autotune will tell you.
Usage
  autotune hardware
  autotune hardware --no-ram-tips

Flags
  --ram-tips / --no-ram-tips  Show top RAM consumers and model-unlock suggestions. Default: on.

autotune ls

Shows every model you've downloaded and whether it fits in your available RAM. Includes the safe context window size, recommended profile, and a warning if quantization is too aggressive for your hardware.
Usage
  autotune ls
  autotune ls --json

Flags
  --json  Output as JSON instead of a table.
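The --json flag makes the output scriptable. A minimal sketch, just pretty-printing it (the JSON schema isn't documented here, so verify any field names before scripting against them):

  autotune ls --json | python3 -m json.tool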
autotune ps

Shows every model currently loaded in RAM across Ollama, MLX, and LM Studio. Use this to check if a model is still occupying memory after you stopped chatting with it.

Usage
  autotune ps

autotune pull

Download any Ollama model directly from within autotune. Run without a model name to browse popular recommendations for your hardware.
Usage
  autotune pull qwen3:8b
  autotune pull                      # browse popular recommendations

  # Pull a model, then chat with it:
  autotune pull qwen3:8b
  autotune chat --model qwen3:8b

autotune models

Shows all models installed on this machine across Ollama, MLX, and LM Studio — with size on disk, architecture, quantization level, and quality tier based on public benchmarks (MMLU, HumanEval).
Usage
  autotune models
  autotune models --registry

Flags
  --registry  Show autotune's internal model list instead of locally downloaded models.

autotune unload

Releases a model from memory without opening a chat session. Useful after a heavy session — frees RAM for other apps right away. If you omit the model name, you'll see an interactive picker.
Usage
  autotune unload qwen3:8b
  autotune unload                    # interactive picker
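A common pairing is to check what's still resident and then free it (model name illustrative):

  autotune ps
  autotune unload qwen3:8b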
autotune serve

Starts a local API server that any OpenAI-compatible tool can connect to — Continue.dev, Open WebUI, LangChain, a Python script, anything. All autotune optimizations apply to every request automatically.

Usage
  autotune serve
  autotune serve --mlx

Point any OpenAI-compatible client at it:

  from openai import OpenAI
  client = OpenAI(base_url="http://localhost:8765/v1", api_key="autotune")

Or check it from the command line:

  curl http://localhost:8765/v1/models

Flags
  --host    Bind address. Default: 127.0.0.1 (local only).
  --port    Port. Default: 8765.
  --mlx     Enable MLX backend on Apple Silicon (~10–40% higher throughput, disables tool calling).
  --reload  Auto-reload on code changes (dev mode).
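Only the /v1/models route is shown above; here is a sketch of a chat request assuming the server also exposes the standard OpenAI /v1/chat/completions route (model name illustrative):

  curl http://localhost:8765/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3:8b", "messages": [{"role": "user", "content": "Say hi"}]}'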
autotune recommend

Profiles your hardware and recommends the best model and settings for your machine. Shows alternatives for fastest, balanced, and best-quality modes — including exact `ollama pull` commands to get started instantly.

Usage
  autotune recommend
  autotune recommend --mode balanced
  autotune recommend --model qwen3:8b
  autotune recommend --top 5

Flags
  --mode, -m          fastest / balanced / best_quality / all. Default: all.
  --model             Restrict recommendations to a single model.
  --top N             Number of alternatives to show per mode. Default: 3.
  --no-show-hardware  Skip the hardware profile section.

Verify that autotune is actually helping on your specific machine. All timings come from Ollama's own Go nanosecond timers — nothing estimated.
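A reasonable ladder is to start cheap and escalate only when you want more rigor (model name illustrative):

  autotune proof --model qwen3:8b        # fast sanity check
  autotune proof-suite -m qwen3:8b       # statistical significance tests
  autotune user-bench --model qwen3:8b   # what users actually feel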
autotune proof (Run this first)

Runs a fast, honest benchmark comparing raw Ollama defaults against autotune. Measures TTFT (time to first token), KV cache RAM usage, RAM headroom, swap events, and generation speed. Includes a long-context test that shows TTFT improvements by comparing KV buffer allocation at 4096 tokens vs autotune's dynamically sized buffer. Results are saved to JSON.
Usage
  autotune proof
  autotune proof --model qwen3:8b
  autotune proof --model qwen3:8b --runs 3
  autotune proof --model qwen3:8b --output results.json
  autotune proof --list-models

Flags
  --model, -m    Ollama model to benchmark. Auto-selects if omitted.
  --runs, -r     Runs per condition. 2 is fast (~30s); 3+ gives more stable numbers. Default: 2.
  --profile, -p  fast / balanced / quality. autotune profile to test against raw. Default: balanced.
  --output, -o   Save JSON results to this path. Defaults to proof_<model>.json.
  --list-models  List locally installed Ollama models and exit.

autotune proof-suite (Deep analysis)

Runs a curated 5-prompt suite (factual, code, analysis, conversation, long output) through both raw Ollama and autotune across multiple models. Reports Ollama-internal timing, process-isolated RAM, and full statistical significance: Wilcoxon signed-rank test, Cohen's d effect size, and 95% confidence intervals.
Usage
  autotune proof-suite
  autotune proof-suite -m llama3.2:3b -m qwen3:8b
  autotune proof-suite -m qwen3:8b --runs 5
  autotune proof-suite --output results.json

Flags
  --models, -m   Ollama model IDs to benchmark. Repeat for multiple. Default: llama3.2:3b, qwen3:8b.
  --runs, -n     Inference runs per condition per prompt. Minimum 3 for statistics. Default: 3.
  --profile, -p  autotune profile to compare against raw Ollama. Default: balanced.
  --output, -o   Save full results to a JSON file.
  --list-models  List locally installed Ollama models and exit.

autotune bench (Advanced)

Runs a full benchmark suite with multiple prompts across different task types (short, code, reasoning, analysis). Use this when you want detailed per-prompt breakdowns or want to compare two different autotune profiles against each other.
Usage
  autotune bench --model qwen3:8b
  autotune bench --model qwen3:8b --duel
  autotune bench --model qwen3:8b --raw
  autotune bench --model qwen3:8b --compare
  autotune bench --model qwen3:8b --runs 5

Flags
  --model, -m    Ollama model to benchmark (required).
  --runs, -r     Runs per prompt per mode. Default: 3.
  --profile, -p  autotune profile to use. Default: balanced.
  --duel         Compare two profiles against each other.
  --raw          Run raw Ollama only (no autotune).
  --compare      Run both raw and autotune and show a side-by-side diff.
  --output, -o   Save results JSON to this path.

autotune user-bench

Measures what users actually feel — not raw throughput. Runs autotune head-to-head against raw Ollama across realistic laptop workflows: background queries, sustained chat, agent loops, and code debugging. Reports in user-friendly language: swap events, RAM headroom, TTFT consistency, CPU spikes, and a 0–100 background impact score. Can run in the background (survives terminal close) with a desktop notification when done.
Usage
  autotune user-bench --model qwen3:8b
  autotune user-bench --model qwen3:8b --quick
  autotune user-bench --model qwen3:8b --background
  autotune user-bench --all-models --runs 2

Flags
  --model, -m    Ollama model to benchmark. Auto-selects first installed model if omitted.
  --profile, -p  autotune profile to use. Default: balanced.
  --runs, -r     Runs per scenario per condition. Default: 3.
  --quick, -q    Quick mode: 2 scenarios instead of 4 (~10–15 min).
  --all-models   Run on every locally installed Ollama model.
  --background   Fork to background — survives terminal close, sends a desktop notification when done.
  --output-dir   Directory for result JSON files. Default: current directory.

autotune agent-bench

Tests autotune on 5 realistic agentic tasks: code debugging, research synthesis, step planning, adversarial context, and extended sessions. The key story is TTFT growth curves — in raw Ollama, TTFT grows linearly with each conversation turn as the full 4096-token KV buffer fills. autotune's dynamic context sizing keeps TTFT flat by sizing the window to actual usage.
Usage
  autotune agent-bench
  autotune agent-bench -m llama3.2:3b -m qwen3:8b
  autotune agent-bench --quick
  autotune agent-bench --tasks code_debugger,extended_session
  autotune agent-bench -m qwen3:8b --output agent_results.json

Flags
  --models, -m   Ollama model IDs to benchmark. Repeat for multiple. Default: llama3.2:3b, qwen3:8b.
  --trials, -n   Trials per condition per task. Min 3 recommended. Default: 5.
  --tasks, -t    Comma-separated task IDs. Options: code_debugger, research_synth, step_planner, adversarial_context, extended_session.
  --profile, -p  autotune profile to test. Default: balanced.
  --quick, -q    Quick mode: 3 tasks, 2 trials (~20–30 min).
  --output, -o   Save full results JSON to this path.

autotune stores past conversations locally (SQLite + optional vector embeddings) so future chat sessions can surface relevant context automatically.
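A typical memory workflow ties the subcommands below together (model name illustrative):

  autotune memory setup                        # one-time: enable semantic search
  autotune run qwen3:8b --recall               # chat, injecting past context
  autotune memory search "postgres migration"  # find earlier discussions
  autotune memory stats                        # check what's stored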
autotune memory search (Search past conversations)

Search your conversation history by semantic meaning or keywords. Uses vector search (cosine similarity) when an embedding model is available, otherwise falls back to FTS5 full-text keyword search.
Usage
  autotune memory search "your query here"
  autotune memory search "postgres migration"
  autotune memory search "FastAPI authentication" --top 10
  autotune memory search "React hooks" --min-score 0.4

Arguments and flags
  QUERY        Search query (required).
  --top, -n    Number of results to return. Default: 5.
  --min-score  Minimum similarity score (0–1). Ignored for FTS5 fallback. Default: 0.20.

autotune memory list (Browse stored memories)

List recently stored conversation memories with timestamps, model names, and a preview of each chunk. Use this to find a memory ID before deleting it.
Usage
  autotune memory list
  autotune memory list --days 7
  autotune memory list --model qwen3:8b --limit 50

Flags
  --limit, -n  Number of memories to show. Default: 20.
  --days       Only show memories from the last N days.
  --model      Filter by model (e.g. qwen3:8b).

autotune memory stats (Memory store statistics)

Show statistics about the local memory store: total chunks, how many have vector embeddings, database size, date range, breakdown by model, and whether semantic search is active.
Usage
  autotune memory stats

autotune memory forget (Delete memories)

Delete one memory chunk, all memories for a specific conversation, or wipe the entire store. Use autotune memory list first to find the memory ID (see the example after the flag list below).
Usage
  autotune memory forget <memory-id>
  autotune memory forget 42
  autotune memory forget --conv-id abc123
  autotune memory forget --all
  autotune memory forget --all --yes   # skip confirmation

Arguments and flags
  MEMORY_ID  ID of the specific memory chunk to delete (see autotune memory list).
  --all      Delete ALL memories (asks for confirmation unless --yes is passed).
  --conv-id  Delete all memories for a specific conversation ID.
  --yes, -y  Skip confirmation prompt.
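For example, to locate and delete a single chunk (the ID 42 is illustrative):

  autotune memory list --days 7   # note the ID of the chunk you want gone
  autotune memory forget 42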
autotune memory setup (Enable semantic search)

Pull nomic-embed-text from Ollama (~274 MB) to enable semantic vector search across your conversation history. Without this, autotune uses FTS5 keyword search instead.

Usage
  autotune memory setup

MLX runs LLMs entirely on-chip using Apple's unified memory and Metal GPU kernels — typically 10–40% faster than Ollama on the same model. Requires an Apple Silicon Mac (M1/M2/M3/M4).
autotune mlx list (Show cached MLX models)

List all MLX-format models already downloaded locally, with size on disk.
Usage
  autotune mlx list

autotune mlx pull (Download an MLX model)

Download an MLX-quantized model from the mlx-community on HuggingFace. You can use an Ollama model name (e.g. qwen3:8b) and autotune will resolve the correct MLX variant automatically.

Usage
  autotune mlx pull <model>
  autotune mlx pull qwen3:8b
  autotune mlx pull llama3.2
  autotune mlx pull qwen2.5-coder:14b --quant 8bit

Arguments and flags
  MODEL        Model name (required). Ollama name (e.g. qwen3:8b) or full HuggingFace ID.
  --quant, -q  Quantization level: 4bit / 8bit / bf16. Default: 4bit.

autotune mlx resolve (Look up the MLX model ID)

Show which MLX HuggingFace model ID would be used for a given Ollama model name. Useful to check before pulling.
Usage
  autotune mlx resolve <model>
  autotune mlx resolve qwen3:8b
  autotune mlx resolve llama3.2
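Putting the MLX commands together (model name illustrative):

  autotune mlx resolve qwen3:8b   # check which HuggingFace ID will be used
  autotune mlx pull qwen3:8b      # download the 4-bit MLX variant
  autotune serve --mlx            # serve with the MLX backend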
autotune telemetry

Shows a table of all recent inference runs with TTFT, throughput, RAM/swap pressure, CPU load, and completion status. Also manages opt-in/out for anonymous telemetry collection (hardware fingerprint + performance data sent to the autotune team to improve defaults).

Usage
  autotune telemetry
  autotune telemetry --model qwen3:8b
  autotune telemetry --events --model qwen3:8b
  autotune telemetry --status
  autotune telemetry --enable
  autotune telemetry --disable

Flags
  --model    Filter to a specific model ID.
  --limit    Number of recent runs to show. Default: 20.
  --events   Show individual telemetry events (RAM spikes, slow tokens, errors) instead of run history.
  --status   Show current telemetry consent status.
  --enable   Opt in to anonymous telemetry collection.
  --disable  Opt out — no further data will be sent.

autotune storage

Enable or disable local SQLite storage of performance observations, telemetry events, and agent benchmark results. Model metadata is always stored regardless of this setting. Run without an argument to see the current status.
Usage
  autotune storage [on|off|status]
  autotune storage status
  autotune storage on
  autotune storage off

autotune doctor

Runs a full health check: Python version, required packages, whether Ollama and other backends are reachable, RAM and swap headroom, and database health. Every check shows a clear pass/fail so you know exactly what needs fixing.
Usage
  autotune doctor
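When something seems off, a sensible first move is to pair it with the upgrade command below (the ordering is a suggestion, not a requirement):

  autotune doctor    # pinpoint what's failing
  autotune upgrade   # then pick up the latest fixes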
autotune upgrade

Checks PyPI for a newer version of autotune, shows what changed, and upgrades with one keypress. Because autotune is updated frequently, run this any time something seems off or you want the latest improvements.

Usage
  autotune upgrade
  autotune upgrade --yes   # skip confirmation

autotune runs on top of Ollama. Here are the Ollama commands that pair with autotune.
  ollama serve           # start the Ollama server
  ollama pull qwen3:8b   # download a model
  ollama list            # list installed models
  ollama ps              # show models currently loaded in memory
  ollama rm qwen3:8b     # delete a downloaded model
  ollama --version       # print the installed Ollama version

Follow the install guide to get running in 5 minutes.