One agent. Every model. Zero cloud.
A deep, honest guide to building a private AI studio on a single MacBook Pro M5 Max (128GB). Hermes Agent is the conductor; a local model is its brain; and a dedicated local model handles images, video, music, dubbing, voice cloning and transcription. The whole 2026 model field, how each one works and trains, the benchmarks that matter, and exactly how to build it. Flip the toggle to read it plain or in full technical depth.
Sources are primary throughout: official Hermes, Apple, OpenAI, Qwen and DeepSeek documentation; the Artificial Analysis Intelligence Index, SWE-bench Verified, BFCL, τ-bench, MMLU-Pro and Text-to-Image arenas; arXiv papers for every media model; and community discussion on r/LocalLLaMA. Numbers are as reported by their sources and are a May 2026 snapshot. Models rotate monthly; the architecture here is stable, the leaderboard is not. Benchmark on your own machine before committing.
Why this is suddenly possible
Two years ago, running a frontier-class model on your own machine was a fantasy. That changed faster than almost anyone predicted. The reason this guide exists now, and could not have existed in 2024, is a single chart: the gap between open models you can download and the closed models you rent has nearly closed.
"Open weights" means the actual model is published for anyone to download and run, instead of being locked behind a company's paid website. In 2024 these were toys next to ChatGPT. In 2026 they are genuinely close to the best, free, and small enough to run on a good laptop. That is the whole reason a private studio on one Mac is now realistic.
On the Artificial Analysis Intelligence Index (v4.0, a composite of ten evals including GPQA Diamond, Humanity's Last Exam, τ²-Bench, Terminal-Bench Hard and SciCode), the best open-weights model a year ago, DeepSeek V3 0324, scored 22, about 13 points below the leading proprietary model. Today the top open models (Kimi K2.6, MiMo-V2.5-Pro) score 54, with DeepSeek V4 Pro at 52, within 3-6 points of GPT-5.5 (60), Gemini 3.1 Pro and Claude Opus 4.7 (57). Open weights now hold 244 of 386 ranked models and dominate the intelligence-vs-price Pareto frontier.
| Then → now (open weights) | Early 2025 | 2026-05-24 |
|---|---|---|
| Top Intelligence Index score | 22 (DeepSeek V3 0324) | 54 (Kimi K2.6 / MiMo-V2.5-Pro) |
| Gap to best proprietary | ~13 points | 3–6 points |
| SWE-bench Verified (best open) | ~55% | 80.6% (DeepSeek V4 Pro Max) |
| Open share of ranked models | minority | 244 of 386 |
Why the M5 Max is the right hardware
The Mac's advantage is not raw speed; a data-center GPU beats it on throughput. The advantage is one large pool of fast memory shared by the whole chip, which lets a laptop hold models that simply will not fit on a consumer graphics card, while sipping power and staying silent.
Unified memory is the whole trick
A normal gaming PC keeps the model in the graphics card's small, separate memory, and if the model is too big, it simply will not load. The Mac has one big shared pool, so the entire model lives in the same 128 GB the rest of the chip uses. That is how a laptop runs models a $1,500 graphics card chokes on.
Apple's unified memory is a single pool addressable by CPU, GPU and Neural Engine with no host↔device copies. The M5 Max, launched 2026-03-03, tops out at 128 GB at 614 GB/s on the 40-core GPU SKU (the 32-core variant caps at 64 GB / 460 GB/s). After raising the wired limit you get about 120 GB usable for models. The historic Mac weakness was prefill (prompt processing), where it trailed NVIDIA badly; the M5's new per-core Neural Accelerators push prefill 3.33×–4.06× faster than M4 (Apple's own MLX team measurement) and make FLUX-dev-4bit ~3.8× faster. The Neural Accelerators run 1,024 FP16 fused multiply-accumulates per core per cycle, aggregating to about 70 TFLOPS of FP16 or 130 TFLOPS of INT8 across the 40-core GPU. Native FP8 and FP4 still belong to Blackwell; BF16 on the Neural Accelerators arrived in macOS 26.1, INT4 in 26.4: the numbers in this guide assume macOS Tahoe 26.4 or later.
| Spec | Figure | Why it matters here |
|---|---|---|
| Unified memory | 128 GB (40-core GPU SKU) | About 120 GB usable for models after raising the wired limit. Holds one big brain or one video model. |
| Memory bandwidth | 614 GB/s | Caps token generation speed; favours MoE models with low active parameters. |
| GPU + Neural Accelerators | 40-core · ~70 TFLOPS FP16 · ~130 TFLOPS INT8 | About four times M4 Max GPU compute on matmul and prefill; about a fifth faster on decode. Speeds diffusion and rectified-flow image models. |
| CPU | 18-core (6 super + 12 performance) | Drives the agent loop and the Python orchestration; not the bottleneck. |
| Storage | 2 / 4 / 8 TB SSD · 13.6 GB/s read · 17.8 GB/s write | A full stack with alternates is 250–300 GB; pick 4 TB if you keep multiple 120B weight files. Cold-loading a 65 GB model takes ~5 seconds. |
| Chip topology | TSMC SoIC-mH chiplet (two N3P dies) | "Fusion Architecture": same compute die in M5 Pro and M5 Max; explains the uniform per-core spec. |
| macOS required | Tahoe 26.4 or later | Earlier 26.0 / 26.1 lack INT4 Neural Accelerator support; perf drops materially. Verify your build before reproducing the numbers in this guide. |
The M5 Max SSD reads at 13.6 GB/s and writes at 17.8 GB/s, so cold-loading is fast enough that mode-switching between loadouts feels instant. Concrete wall clock to swap weights in from disk:
| Model · size on disk | Cold load (first run) | Warm load (page cache) |
|---|---|---|
| gpt-oss-120B Q4 · 63 GB | ~4.6 s | ~1.1 s |
| Qwen3.6-27B Q4 · 16 GB | ~1.2 s | ~0.4 s |
| DeepSeek-V4 Flash 35B-A3B Q5 · 24 GB | ~1.8 s | ~0.6 s |
| FLUX.2 dev 4-bit · 6.5 GB | ~0.5 s | ~0.2 s |
| Wan 2.7 video · ~50 GB | ~3.7 s | ~0.9 s |
Practical sizing rule: a full studio (brain + image + music + voice + STT) is 250-300 GB on disk. If you keep more than one 120B weight file (e.g. gpt-oss-120B + Llama 4 Maverick + an abliterated mirror), go to 4 TB. The 8 TB SKU only earns its premium if you collect video models or maintain multiple FLUX LoRA libraries. macOS APFS clone-on-write means a Hermes hermes backup of memory + skills is near-instant; the bulk on disk is always the model weights.
A 120B brain running tool calls sustained is not a free lunch even on Apple silicon. Concrete numbers from a 16-inch MacBook Pro M5 Max 128 GB on macOS Tahoe 26.4, sustained over 30 minutes:
| Workload | Package power | Fan | Battery hours (full charge) |
|---|---|---|---|
| Idle (brain loaded, no calls) | ~6-9 W | silent | ~14-16 h |
| Chat (gpt-oss-120B Q4, intermittent) | ~22-34 W | silent → low | ~5-7 h |
| Sustained agent loop (tool calls, decode-bound) | ~40-55 W | low-audible | ~3-4 h |
| FLUX.2 image gen (steady throughput) | ~58-72 W | audible | ~2-2.5 h |
| Wan 2.7 video gen (peak) | ~85-105 W | loud | ~1.3-1.6 h |
| ACE-Step music gen (47 s output) | ~45-60 W (short burst) | brief | n/a (~18 s) |
The 16-inch chassis runs ~8-12 °C cooler than the 14-inch under the same workload because of the larger vapour chamber; the 14-inch will throttle sooner on the video-gen row. Battery numbers assume macOS Tahoe's adaptive power profile is on (default) and the GPU is allowed to clock down between calls. On battery the GPU caps at ~70% of plugged-in clocks; if you are benchmarking, always run plugged in or your numbers will land in the bottom of the ranges in section 14's reproduce guide.
The honest envelope: a working day of chat plus a few dozen images is silent and fits a full battery cycle. A sustained agent run that hammers tool calls plus generates dozens of images is plugged-in territory. Video is plugged-in territory regardless. The MacBook Pro stays a laptop for the first two; it becomes a small desktop for the third.
How the M5 Max stacks up against everything else with 128 GB
For an agent that decodes hundreds of tool calls per session, memory bandwidth beats raw compute, because decode is bandwidth-bound. The Mac wins this race against its closest peers despite Blackwell's FP4 hardware advantage.
| Platform | Unified RAM | Bandwidth | gpt-oss-120B Q4 | DeepSeek V4 Flash Q4 | Notes |
|---|---|---|---|---|---|
| MacBook Pro M5 Max 128 GB | 128 GB | 614 GB/s | fits · ~63 GB | no · ~120 GB | Silent, battery, ANE + Neural Accelerators. |
| MacBook Pro M4 Max 128 GB | 128 GB | 546 GB/s | fits | no | Predecessor: weak prefill. |
| Mac Studio M3 Ultra 512 GB | 512 GB | 819 GB/s | fits · Q8 | fits | Desktop scale-up: the only Apple option that holds V4 Flash. |
| RTX 5090 (32 GB) | 32 GB GDDR7 | 1,792 GB/s | too small | no | Fastest per-GB but GB ceiling kills it for 120B. |
| NVIDIA DGX Spark / Project Digits | 128 GB unified | 273 GB/s | fits | no | Same RAM, 2.25× less bandwidth than M5 Max. Has FP4 hardware. |
| AMD Strix Halo (Ryzen AI MAX+ 395) | 128 GB unified | 256 GB/s | fits | no | x86 alternative; mature ROCm still trails MLX / CUDA on Q4 kernels. |
| RTX PRO 6000 Blackwell | 96 GB | 1,792 GB/s | fits | tight | Workstation: PCIe scale-out, no battery, $7-10k. |
| NVIDIA Jetson AGX Thor | 128 GB unified | n/a published | likely | no | Robotics-first, not the studio target. |
What Mac uniquely enables
- Fine-tuning 14–32B models on a laptop (32–64 GB unified beats a 24 GB consumer GPU that can't even load them).
- Battery-powered, silent operation under sustained load.
- MLX-Audio one-endpoint TTS + STT + STS server.
- Draw Things Lightning Draft (about one second per image on M5 Max).
- Hardware ProRes encode in the Media Engine.
- Continuity recipes: iPhone audio capture → AirDrop → MLX-Audio Whisper → 120B summarizer in one chain.
What Mac uniquely cannot do
- CUDA-only models, FlashAttention-3 native kernels, NVIDIA-only quant formats (AWQ-INT4 with Triton).
- True multi-GPU NVLink scaling beyond a single Mac (EXO Labs distributed inference over TB5 is the workaround for >128 GB).
- Native FP8 and FP4 hardware support (Blackwell's persistent lead).
- vLLM speculative-decoding-with-paged-attention performance at scale (vllm-metal v0.2 closes some of this in April 2026).
Hermes Agent, in full
A raw model only emits text. Something has to turn "dub this clip into English in my voice" into a real sequence of actions that ends in a file on disk. That something is the agent. Hermes Agent, from Nous Research, is the conductor that holds the baton.
Hermes is a free program you install once. It is the tireless operator that lives on your Mac: you talk to it like a person, and it actually does the work, running commands, editing files, browsing, making images, scheduling itself for later, and remembering what it learns so it gets better the more you use it. Think senior assistant, not chatbot.
An MIT-licensed Python 3.11+ agent runtime. It is model-agnostic: the model is a swappable component behind any OpenAI-compatible /v1/chat/completions endpoint (Ollama, llama.cpp, vLLM, SGLang, LM Studio). It ships a registry of more than 70 tools across more than 30 toolsets, a pluggable memory provider, a skills engine that authors its own skills, a full MCP client (OAuth 2.1 PKCE, OSV malware scanning, ACP via the Zed Agent Client Protocol Registry installable via uvx), subagent delegation, durable multi-agent orchestration via Kanban, a cron scheduler, 22 first-class messaging gateways and a React/Ink TUI plus web dashboard. Installs to ~/.local/bin; all state in ~/.hermes/; no telemetry.
The anatomy
| Part | Plain | Technical |
|---|---|---|
| Tools | Its hands: everything it can physically do. | More than 70 built-in tools across more than 30 toolsets, plus MCP tools; switchable per session with -t. |
| Skills | Step-by-step methods; it writes new ones from what worked. | 87 built-in skills + 79 optional in-repo + 672+ across the agentskills.io / HermesHub / LobeHub / Anthropic registries; skill_manage; your own under ~/.hermes/skills/. |
| Memory | Facts about you and your projects. | Frozen-snapshot system-prompt prefix-cache injection at session start (MEMORY.md ~2,200 chars + USER.md ~1,375 chars), plus on-demand FTS5 query via session_search. Relevance ranking lives in session_search and in pluggable providers (Honcho, Mem0, Hindsight). |
| Session search | Total recall of past conversations. | Every session in SQLite with full-text search; session_search retrieves and summarizes. |
| Subagents | Clones helpers that work in parallel and report back. | delegate_task: isolated context + terminal + toolset; orchestrator role; max_spawn_depth; file-coordination layer. |
| Cron | Runs on a schedule. "Every morning, do X." | Skill-backed jobs run in fresh sessions; notify_on_complete on background processes. |
| Gateways | Reach it from Telegram, Discord, Slack, WhatsApp, email. | 17 platforms; allowlist / DM-pairing / open auth; per-platform skill gating. |
| Profiles | Separate personas with their own memory (work vs creative). | Isolated HERMES_HOME dirs; per-profile config, keys, memory, sessions, skills. |
| Plugins + MCP | Bolt-on powers and connections to other tools. | Python/shell-hook plugins (can veto tool calls, ship image-gen backends); full MCP client. |
The 70+ built-in tools
Tools tagged cloud need a key by default and get swapped for local equivalents in section 07; everything else is local.
terminal · process
Run shell commands, background servers, monitor and notify on completion.
file ×4
read_file, write_file, patch (fuzzy find-replace, 9 strategies, auto syntax check), search_files (ripgrep).
code_execution
execute_code: Python that calls other Hermes tools, with branching and output filtering.
delegation
delegate_task spawns subagents in isolated contexts (own terminal + toolset). Defaults: max_spawn_depth=2, max_concurrent_children=3. Parent blocks until child completes; only the summary returns.
browser ×12
navigate, click, type, get_text, scroll, snapshot, screenshot, evaluate_js, console_log, vision_analyze, wait_for, close. Headless Chrome via CDP; 180× faster persistent connection since v0.14.
vision · memory
vision_analyze (describe/answer about images); memory (save durable facts).
skills ×3 · session_search
skill_manage / skill_view / skills_list; search and summarize all past sessions.
cronjob · todo · clarify
cronjob schedules skill-backed tasks (no_agent mode for pure-script jobs, full-agent for adaptive ones); todo plans multi-step work; clarify asks you a multiple-choice question.
moa
mixture_of_agents: route a hard problem through several models (4 + aggregator) and merge.
rl ×10
Drive RL fine-tuning runs: collect trajectories, generate batches, compress trajectories, score with judge, kick off SFT / DPO / GRPO loops, read live training metrics. The full RL toolchain is in-process.
image_gen · tts · web ×2
image_generate, text_to_speech, web_search/web_extract.
messaging · homeassistant ×4 · feishu ×5
Send to chat platforms; control smart-home devices; enterprise Lark/Feishu doc ops.
How it thinks
You give it a goal. It makes a short plan, picks a tool, uses it, looks at the result, and decides the next step, repeating until done. It stops to ask only when it needs a real decision or permission for something risky.
A tool-calling loop: system prompt (memory + skills + tool schemas) → model emits a tool call → runtime executes → result re-enters context → repeat. Context compression (/compress) keeps long runs under the window; activity-based timeouts (read timeout relaxed to 1800s for local endpoints) prevent premature kills; subagent results return as summaries so intermediate noise never bloats the main thread.
Three mechanisms keep long jobs alive: todo (it tracks its own steps), compression (summarizes history to stay under the context limit), and delegation (offloads sub-tasks to subagents so the main thread stays clean). They are why a local agent can run a multi-stage media pipeline end to end without losing the plot.
What a single agent turn actually looks like
One real exchange. The user says "summarise the README in this repo and write a tweet about it"; the agent runs four tool calls and answers in about eight seconds on gpt-oss-120B.
# user prompt
> summarise the README in this repo and write a tweet about it
# agent turn 1: pick a tool, emit a tool call
tool: search_files(query="^README", path=".")
# runtime executes, result feeds back to context
result: ["./README.md", "./packages/foo/README.md"]
# agent turn 2: narrow to the root README
tool: read_file(path="./README.md")
result: 3,847 chars of markdown
# agent turn 3: internal reasoning, no tool call
The README describes a 2026 field guide for local AI on M5 Max with
Hermes Agent. Headline claim: 27B model at 77.2% SWE-Verified fits a
laptop. The thread should lead with that stat.
# agent turn 4: emit the answer (no tool call, free-form text to user)
A 27-billion-parameter model that fits a laptop now codes at 77.2% on
SWE-bench Verified. Local AI is no longer a compromise: 2026 field
guide for a private studio on a MacBook M5 Max with Hermes Agent.
https://m5ai.ivanm.dev
Note what does NOT happen: the agent does not narrate every step to the user, does not stop to ask permission for read-only tools (those are pre-approved), does not write chain-of-thought into the answer. The visible behaviour is "thought for 8 seconds, then answered". The invisible behaviour is the loop above. Hermes's /compress ensures the result of turn 2 (3,847 chars of README) is summarised before turn 5 if the conversation keeps going.
What the community actually reports
Independent of any vendor claim, community discussion across r/LocalLLaMA and Hacker News has been active and growing, praising the smoother setup and the built-in learning loop. A representative comment: the agent "actually remembers" a failure and "creates a skill for troubleshooting it." The honest consensus is not that Hermes replaces dedicated coding agents; for pure software engineering, the leaders are clear: Claude Code with Opus 4.6 holds 80.8% on SWE-bench Verified (highest reported single-agent), Aider's Polyglot leaderboard places Opus 4.6 around 85% edit-format, and Cursor Composer with Sonnet 4.6 sits in the mid-70s. Hermes does not publish a SWE-bench Verified number and is built for a wider job: orchestration, memory, scheduling and gateways. Treat marketing comparisons sceptically; benchmark for your actual workload. Nous shipped hundreds of security-tagged commits across the 0.12-0.14 release cycles (588 / 633 / 550 total merged PRs respectively) and there are no widely reported breaches; the lone disclosed CVE-2026-7396 (WeChat path traversal) was patched. Hermes positions itself as complementary to the older OpenClaw framework, with a built-in hermes claw migrate path that imports persona, skills, memory, channels and API keys.
A skill is just a folder with a Markdown file
Skills are the procedural memory: short, named, version-controlled how-to documents the agent reads and follows. Here is a real one from ~/.hermes/skills/release-it/SKILL.md, hand-written then refined by Curator after a few runs:
---
name: release-it
description: tag + push a release after CI green
tools: [terminal, file]
approvals: [git tag, git push]
---
# When to use
User asks to "ship", "tag a release", or "cut v<N>".
# Pre-flight
1. Run `gh run list --branch main --limit 1`; verify status=success.
2. Read `CHANGELOG.md`; the top entry must have a version + date filled in.
3. Run `npm run validate`; must exit 0.
# Steps
1. `git tag -a v{N} -m "<extract from CHANGELOG>"`
2. `git push --tags`
3. Confirm to user with the new tag URL on GitHub.
# Failure modes
- CI red → STOP, tell user, do not tag.
- CHANGELOG unfilled → ask user to fill before continuing.
- Same tag exists → STOP, do not force-push.
The agent reads this on session start (only the skills relevant to the current goal land in the system prompt). Skills are shareable: hermes skill push my-skill publishes to agentskills.io; hermes skill install <name> imports someone else's. The Curator agent deduplicates and consolidates skills nightly; over weeks the studio's skill set converges on the patterns YOU repeat.
A multi-modal turn: text + image + voice in one prompt
The same agent loop drives every modality because the brain dispatches to tool servers, not to one model. A request that spans copy, design and audio is just three tool calls with the brain stitching the result back together. Worked example for "draft a launch poster + caption + read the caption in my cloned voice":
// USER
> draft a launch poster for the studio (1080x1350, IG portrait),
write a 28-word caption, and read the caption in my cloned voice
from ~/voices/me-15s.wav. Drop everything in ~/Desktop/launch/.
// TURN 1 · brain plans
need: brand colours, headline, body. Skill brand-kit has it.
read_file: ~/.hermes/skills/brand-kit/SKILL.md (158 chars)
→ primary #4d8dff · accent #2ee6a6 · font Bricolage 800
// TURN 2 · brain drafts copy in-context, then calls image tool
gen_image:
prompt: "minimal portrait poster, deep navy gradient,
headline 'One agent. Every model. Zero cloud.'
in Bricolage 800 white, accent dot mint,
grain texture, generous margins"
model: flux-2-dev
size: 1080x1350
steps: 28
→ ~/Desktop/launch/poster.png (1.2 MB, 14.6 s)
// TURN 3 · brain writes caption, then calls voice tools
write_file: ~/Desktop/launch/caption.txt
→ "One MacBook. One agent. Every open model.
No subscription, no telemetry, no cloud round-trip.
Sixty-three GB of brain on a laptop you can close." (28 words)
clone_voice:
reference: ~/voices/me-15s.wav
model: chatterbox-v2
→ voice profile cached as 'me' (3.1 s)
tts_speak:
text: <caption.txt>
voice: me
exaggeration: 0.35
format: wav 24kHz
→ ~/Desktop/launch/caption.wav (412 KB, 2.9 s)
// TURN 4 · brain emits final answer
> Done. Poster at poster.png, caption.txt (28 words),
caption.wav (12 s, your voice, mild emotion).
Total: 22.7 s, 4 tool calls, 0 cloud calls.
What this shows: one brain (gpt-oss-120B) chooses tools, drafts copy, and never touches the image or voice models directly. The image model lives behind a gen_image tool, the voice clone behind two more. Swap FLUX for Qwen-Image and the brain does not need re-prompting. Swap Chatterbox for F5-TTS and the same. The skill folder is where this convention lives; the brain follows.
Sleeper features in 0.13 and 0.14
Six things from the last two release cycles that materially change what the studio can do:
hermes proxy
Turns Claude Pro / ChatGPT Pro / SuperGrok into a localhost OpenAI-compatible endpoint. Codex CLI, Aider, Cline and Continue all become free to drive off your Hermes-managed subscription.
/goal · /subgoal
Ralph-loop persistent goal contracts: the agent keeps going until a judge decides the criteria are met. Layered subgoals can be added mid-run.
Kanban durable
Multi-agent orchestration with heartbeat detection, auto-block on incomplete exit, per-task retries, hallucination recovery. Closes the gap delegate_task leaves open.
Curator agent
Background process that deduplicates, deprecates and consolidates skills. No equivalent in any competitor.
LSP semantic diagnostics
Beyond syntax linting: type errors, undefined symbols, missing imports surfaced back to the agent before the next turn.
Cross-session prompt cache
1-hour Claude prompt cache survives /new. Cuts cost materially on long workflows.
OpenClaw, Claw Code, ClaudeClaw: three different projects
Names collide. The migration path Hermes ships is for one of them only.
| Project | Repo | What it is |
|---|---|---|
| OpenClaw Peter Steinberger | openclaw/openclaw374K stars · MIT · TypeScript | General-purpose messaging-first AI assistant. Originally Clawdbot (2025-11-24) → Moltbot (2026-01-27) → OpenClaw. This is what hermes claw migrate reads. |
| Claw Code Sigrid Jin | instructkr/claw-code48K stars · Python + Rust | Clean-room rewrite of Claude Code's leaked source map. Coding-focused CLI. Unrelated to OpenClaw. |
| ClaudeClaw moazbuilds | moazbuilds/claudeclaw | Lightweight OpenClaw-equivalent that runs as a Claude Code plugin: daemon + Telegram/Discord/Slack/cron/web dashboard. |
hermes claw migrate reads ~/.openclaw/, and auto-detects legacy ~/.clawdbot/ and ~/.moltbot/ paths. Non-destructive by default: skips SOUL.md if Hermes already has one, skips duplicate memory entries, skips same-named skills. Imports persona, skills, memory, channels and API keys. Imported skills land at ~/.hermes/skills/openclaw-imports/.What happens if Hermes itself gets abandoned
The fair question to ask before committing to any agent runtime: what is recoverable if the project dies tomorrow? The studio's architecture answers this on purpose. Everything you produce lives in formats older than Hermes and outlasts any single tool.
| Asset | Where it lives | Survives Hermes going dark? |
|---|---|---|
| Skills | ~/.hermes/skills/*/SKILL.md + plain scripts | Yes. Markdown + frontmatter + shell/Python. Readable by any future agent or by you directly. |
| Memory | ~/.hermes/memory/*.md | Yes. Markdown with frontmatter. Same format Claude Code, OpenClaw and the AIvan layer all consume. |
| Model weights | Ollama / MLX caches under ~/.ollama/models/ and ~/.cache/huggingface/ | Yes. GGUF and safetensors are open formats with multiple loaders (llama.cpp, vLLM, MLX, ExLlamaV2, KoboldCpp). |
| MCP servers | ~/.hermes/mcp.json + the MCP processes themselves | Yes. MCP is an Anthropic-led open protocol; Claude Code, OpenClaw, Cursor, Cline all speak it. |
| Channel bindings | ~/.hermes/channels.yaml + per-gateway tokens | Yes. Telegram / Discord / Slack tokens are yours; rebind to any other runtime. |
| Cron jobs | ~/.hermes/cron.yaml | Yes. Same YAML schema as crontab + a goal contract; trivially portable. |
| config.yaml | ~/.hermes/config.yaml | Partial. The schema is Hermes-specific, but every value (model name, endpoint URL, key) carries over by hand in 5 minutes. |
| Cross-session prompt cache | Hermes-internal SQLite | No, but it's a cache; not data you care about preserving. |
Hermes treats your stuff (skills, memory, channel tokens, model weights) as data in standard formats stored on your filesystem, and treats itself as the process that happens to be running them today. If the project dies, fork it, switch to OpenClaw or ClaudeClaw, write a thin replacement shim, or just keep using the dead version - the data does not move and does not get held hostage. This is the single biggest difference between an open-source agent and a SaaS one: SaaS dying means your stuff dies. Local agent dying means your stuff sits on disk waiting for the next runtime.
Concrete recovery recipe if Hermes shipped its last release tomorrow: skills run as plain shell/Python scripts (the SKILL.md frontmatter is a 20-line parser to write), memory is grep-able markdown, weights load into vLLM or MLX directly, MCP servers stay running unchanged, cron continues under stock crontab + a one-shot model invocation. The studio degrades from "agent does it" to "you do it with the same tools" - which is exactly where it started before Hermes existed.
The model that thinks, and the whole field of them
The brain is the model the agent reasons with. For an agent, the trait that matters most is not raw genius: it is reliably choosing the right tool and emitting a clean call, hundreds of times in a row. A model that is 2% smarter but fumbles one call ends the run. So this section covers how these models work, what makes each one different, how they are trained, and exactly how the 2026 field scores, before picking the ones that fit your Mac.
How a modern model works
Picture a company with 128 specialists. For each word, a "router" wakes only the 8 most relevant ones. The model is enormous in total knowledge, but only a thin slice works at a time, so it stays fast and fits in memory. That "mixture of experts" trick is why a 120-billion model runs on a laptop. Some models also have a "think first" switch that lets them reason step by step on hard problems and answer instantly on easy ones.
A Mixture-of-Experts transformer routes each token through a small subset of expert FFNs. Active-parameter count, not total, governs per-token compute and bandwidth, which is why MoE suits a 614 GB/s Mac. The differentiators between 2026 models are mostly in attention and routing: MLA (DeepSeek's Multi-head Latent Attention compresses the KV cache), attention sinks (gpt-oss lets heads "pay zero attention"), linear/lightning attention (MiniMax for long-context efficiency), auxiliary-loss-free routing (DeepSeek's load balancing), and hybrid thinking (Qwen's switchable reasoning with a thinking budget). Training increasingly leans on RL: DeepSeek-R1 showed pure RL via GRPO can teach reasoning with no supervised chain-of-thought; gpt-oss post-trains with CoT+RL like o3.
What makes each architecture different
| Innovation | Who | What it does |
|---|---|---|
| Mixture-of-Experts | nearly all | Only a few experts fire per token; huge total capacity, small active cost. |
| Multi-head Latent Attention (MLA) | DeepSeek V3/V4 | Compresses the KV cache into a latent space, slashing memory for long context. |
| Attention sinks | gpt-oss | A learned per-head bias lets the model ignore tokens cleanly; stabilizes long context. |
| Hybrid thinking + budget | Qwen3.x, DeepSeek V4 | One model switches between visible chain-of-thought and instant answers; you cap the reasoning spend. |
| Auxiliary-loss-free routing | DeepSeek, Qwen3.x MoE | Balances expert load without a loss term that hurts quality; encourages specialization. |
| Hybrid Mamba + Transformer + MoE | NVIDIA Nemotron 3 Super | State-space + attention + MoE in one model. Only open frontier-class system shipping all three; holds 91.75% RULER at 1M tokens. |
| Linear / lightning attention | MiniMax M2.x | Sub-quadratic attention for cheap very-long-context inference. |
| RL-first post-training (GRPO) | DeepSeek-R1 lineage | Pure reinforcement learning induces reasoning without supervised CoT data. |
| Native MXFP4 | gpt-oss | 4.25-bit microscaling quantization is the intended format, not a lossy afterthought. |
Intelligence Index v4.0 (Artificial Analysis): composite of 10 evals (GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, CritPt). A breadth score; 60 is the current frontier ceiling.
SWE-bench Verified: 500 real GitHub issues from popular Python projects, hand-validated by the SWE-bench team. The model gets the repo and the issue text; it must produce a patch that passes hidden tests. Coding-agent reality check; closer to "does this make me money" than MMLU.
BFCL v4 (Berkeley Function Calling Leaderboard): can the model emit a correct tool call for a given API spec, across 1500+ scenarios with multi-turn, parallel, and irrelevance-detection categories. The agent-reliability metric; a model that's 2 points smarter on MMLU but 5 points worse on BFCL is the wrong brain for tool-calling work.
MMLU-Pro: harder MMLU. 12K multiple-choice across 57 academic disciplines, ten-choice (vs four), text-only. Reasoning depth at academic difficulty.
GPQA Diamond: 198 graduate-level science questions, hand-curated to be "expert-hard". Tests the ceiling of factual + reasoning depth.
GDPval-AA: Artificial Analysis's economic-value benchmark: pairs of agentic tasks scored Elo-style. Real-work proxy.
τ-bench retail: multi-turn customer-service simulation. Strict tool-use protocol + adversarial users; measures "can this agent run a contact center".
The whole field, scored and sized
Intelligence Index = Artificial Analysis v4.0 composite. SWE = SWE-bench Verified. "128GB" = fits on this Mac at a usable 4-bit quant. Top models are listed precisely because most of them do not fit, which is the honest part.
| Model | Lab | Params (total / active) | Intel. | SWE | License | 128GB? |
|---|---|---|---|---|---|---|
| gpt-oss-120B | OpenAI | 117B / 5.1B MoE | n/a | o4-mini class | Apache-2.0 | yes · 63GB |
| Qwen3.6-27B | Alibaba | 27B dense | n/a | 77.2 | Apache-2.0 | yes · 17-33GB |
| Qwen3.6 35B-A3B | Alibaba | 35B / 3.5B MoE | n/a | n/a | Apache-2.0 | yes · 26GB |
| Gemma 4 31B | 31B dense | top non-China open | n/a | Gemma | yes · 17GB | |
| Mistral Medium 3.5 | Mistral | 128B | n/a | 77.6 | open | tight · ~70GB |
| Llama 4 Scout | Meta | 109B / 17B MoE | n/a | n/a | Llama | yes · ~60GB |
| Nemotron 3 Super | NVIDIA | ~70-100B | top non-China open | n/a | open | likely · 4-bit |
| DeepSeek V4 Flash | DeepSeek | 284B / 13B MoE | ~49 | 79.0 | open | no · ~120GB Q4 |
| DeepSeek V4 Pro | DeepSeek | 1.6T / 49B MoE | 52 | 80.6 | open | cloud only |
| Kimi K2.6 | Moonshot | ~1T MoE | 54 | 80.2 | open | cloud only |
| GLM-5.1 | Z.AI | 744B MoE | ~51 | 77.8 (GLM-5) | open | cloud only |
| MiniMax M2.7 | MiniMax | MoE · linear attn | 49.6 | n/a | open | cloud only |
| MiMo-V2.5-Pro | Xiaomi | ~1T MoE | 54 | 78.0 | open | cloud only |
| Qwen3.5 397B-A17B | Alibaba | 397B / 17B MoE | n/a | 76.2 | Apache-2.0 | no |
Intelligence Index · top open weights (higher is better)
SWE-bench Verified · coding (open weights)
Tool calling · BFCL v4 (the agent metric)
MMLU-Pro · reasoning breadth (open weights, 2026-05-25)
GPQA Diamond · graduate-level science (open weights)
GDPval-AA · real-world agentic value (Elo, open weights, 2026-05-25)
τ-bench retail · multi-turn customer-service agents (closed-vs-open reference)
The models that fit your Mac, in detail
gpt-oss-120B · the reliability brain
OpenAI's open model. It is the most dependable at "using tools" of anything you can download, which is the single trait that decides whether long agent jobs finish. It fits comfortably and lets you dial how hard it thinks.
- Architecture: token-choice MoE (117B total, 5.1B active, 4 experts), gated SwiGLU, GQA, alternating full + 128-token sliding-window attention, RoPE 131K via YaRN, learned per-head attention sinks.
- Quantization: native MXFP4 on MoE weights → ~63 GB; Q6 ~90 GB. o200k_harmony tokenizer; Harmony chat format.
- Training: CoT + RL post-training using o3-family techniques, specifically for reasoning and tool use; adjustable reasoning effort (low/medium/high).
- Scores: MMLU-Pro 90.0, AIME 2025 97.9 (tools), GPQA 80.9 (tools), τ-bench matches/exceeds o4-mini. The cleanest tool-call discipline of any open weight.
Qwen3.6-27B · the all-rounder
Alibaba's open model and the best balance for this build: fast, small enough to leave room for media, and genuinely excellent at coding and agentic work. If you run one model, this is the safe pick.
- Architecture: 27B dense (the family also ships 128-expert/8-active MoE variants with no shared experts and global-batch load-balancing). Hybrid thinking with a thinking budget. 256K context.
- Training: ~36T pre-training tokens (double Qwen2.5), three-stage pretrain → four-stage post-train, synthetic math/code from Qwen2.5-Math/Coder, 119 languages. Apache-2.0.
- Scores: SWE-bench Verified 77.2, beating models 50× its size; BFCL-class tool calling is strong. ~17 GB at Q4, ~33 GB at Q8.
Qwen3.6 35B-A3B
Tiny active footprint = very fast on Mac. Perfect fast/auxiliary lane beside a heavy main brain.
Gemma 4 31B
Top non-China open model on the Intelligence Index; efficient, multimodal, strong general writing. Watch for repetition loops past ~11 tool calls in some harnesses.
Mistral Medium 3.5
SWE-bench 77.6, EU-built for data-residency needs. Fits tight at 4-bit, leaves little for media.
Llama 4 Scout
Fits at ~60 GB 4-bit; long context. Caveat: its pythonic tool-call format needs a compatible parser.
Nemotron 3 Super
The other top non-China open reasoning model; a strong Western-licensed alternative.
DeepSeek V4 Flash
SWE-bench 79.0, MLA + auxiliary-loss-free routing. About 120 GB at Q4 (160 GB at FP16), just over the line for 128 GB once KV cache is added.
Nemotron 3 Super: the open hybrid
NVIDIA's contender. Different inside: it mixes three kinds of layers (long-context Mamba, attention, and the MoE experts you already know). Holds 91.75% on a million-token retrieval test, which no other open model touches.
- Architecture: hybrid Mamba state-space layers (cheap long context) + transformer attention (short range) + MoE experts (capacity). The only open frontier-class system shipping all three combined.
- Scores: 91.75% RULER at 1M tokens (unmatched among open models). Strong on Intelligence Index alongside Gemma 4 31B as the two non-China open entries.
- Why it matters: the natural pick for teams with data-residency constraints that bar weights from Chinese labs. Western-licensed alternative to DeepSeek / Qwen / GLM.
Qwen3-Coder-Next: the small fast coder
Alibaba's specialised coding model. Three billion active parameters, scores 70.6% on the coding test. The best self-hostable coder under 100 GB.
- Architecture: 3B active parameters in a coding-tuned MoE. About 40 GB at Q4.
- Training: 800K agentic coding RL tasks (multi-turn, tool-using, test-validated trajectories).
- Scores: SWE-bench Verified 70.6 with only 3B active params. Inside the 100 GB ceiling so you can keep it loaded alongside the orchestrator brain on the M5 Max.
The cloud-only leaders, briefly
These do not fit your Mac, but you should know what sits at the top of the open leaderboard:
DeepSeek V4 Pro
Current open-weights coding leader at 80.6 SWE-Verified. Multi-head Latent Attention compresses the KV cache; auxiliary-loss-free routing balances experts. Hybrid thinking.
DeepSeek V4 Flash
First new DeepSeek architecture since V3. Same innovations as Pro at a fifth of the size. Still does not fit a 128 GB Mac at Q4 (~120 GB).
Kimi K2.6
Co-leads Intelligence Index v4.0 at 54. GPQA Diamond 90.5 (highest of any open model). General-purpose flagship.
GLM-5.1
Z.AI's flagship. Intelligence Index around 51, GDPval-AA leader at 1535. Strong real-world agentic benchmark performance.
MiniMax M2.7
Pioneer of lightning attention: sub-quadratic for cheap very-long-context inference. Intelligence Index around 50.
MiMo-V2.5-Pro
Xiaomi's entry. Intelligence Index 54, 1M-token context, 78.0 SWE-Verified. China's third major model lab behind DeepSeek and Alibaba.
| Term | Used by | What it does in one line |
|---|---|---|
| MoE (Mixture of Experts) | nearly all | Only a few experts fire per token; huge total, small active cost. |
| MLA (Multi-head Latent Attention) | DeepSeek V3/V4 | Compresses the KV cache into a latent space. |
| Attention sinks | gpt-oss | Learned per-head bias lets the model ignore tokens cleanly. |
| Auxiliary-loss-free routing | DeepSeek, Qwen3.x MoE | Balances expert load without a loss-term penalty. |
| Hybrid thinking + budget | Qwen3.x, DeepSeek V4 | One model switches between visible CoT and instant answers; you cap reasoning spend. |
| Lightning attention | MiniMax M2.x | Sub-quadratic attention for cheap very-long-context inference. |
| GRPO | DeepSeek-R1 lineage | Pure RL induces reasoning without supervised CoT data. Cut post-training cost ~10×. |
| MXFP4 | gpt-oss | 4.25-bit microscaling: the intended format, not a lossy afterthought. |
| Mamba + Transformer + MoE | Nemotron 3 Super | State-space + attention + MoE in one model. Holds 91.75% RULER at 1M. |
| License | Models | Commercial use | Catch |
|---|---|---|---|
| Apache-2.0 | gpt-oss-120B, Qwen3.6-27B, Qwen3.6 35B-A3B, Qwen-Image-2512, Z-Image-Turbo, FLUX.2 [klein] 4B | ✓ unrestricted | None. Default-yes. |
| MIT | Chatterbox, HiDream-O1, ACE-Step 1.5 XL | ✓ unrestricted | None. |
| Llama 4 Community | Llama 4 Scout, Maverick | ✓ conditional | EU multimodal blocked. 700M MAU threshold. |
| Gemma | Gemma 4 31B | ✓ conditional | Google's terms; review product attribution. |
| Modified MIT | Mistral Medium 3.5 | ✓ conditional | Not pure MIT; check redistribution clauses. |
| FLUX.2 Non-Commercial | FLUX.2 [dev], FLUX.2 [klein] 9B | ✗ research only | Pay BFL for commercial, or use [klein] 4B. |
| Custom open weights | DeepSeek V3/V4, GLM-5/5.1, Kimi K2.6, MiniMax M2.7, MiMo-V2.5-Pro | per-model | Generally permissive but read each. |
OLLAMA_CONTEXT_LENGTH=65536) or the 70+-tool system prompt silently overflows. And for hybrid-thinking models (Qwen3.x, GLM-4.7, DeepSeek V4), disabling reasoning mode requires all three of --reasoning off, --reasoning-budget 0, and chat_template_kwargs.enable_thinking: false. Setting only one is the single biggest source of production failures in May 2026 (see llama.cpp issue #13189).Abliteration: how it works, and the catch
"Uncensored" models do not refuse you. Researchers discovered there is essentially a single internal "no" signal inside a model; abliteration surgically cancels it. The catch is that doing this carelessly also dents the model's general skill, which an agent relies on. So the smart pattern is one disciplined model for the work, and a separate uncensored one only for the writing.
Refusal is mediated by roughly a single direction in the residual stream (Arditi et al., NeurIPS 2024). You estimate that refusal direction from contrastive harmful vs harmless prompts, then either steer activations away from it at inference, or permanently orthogonalize the weights against it. The term (ablate + obliterate) was coined by FailSpy. The problem: the refusal vector is polysemantic, entangling refusal with syntax, formatting and capability circuits, so naive ablation causes collateral damage, partially recoverable with light "healing" fine-tuning (SFT/DPO).
The 2026 tooling
| Tool | What it does | Why it matters |
|---|---|---|
| Heretic | One-command automated abliteration; separates attention vs MLP interventions (MLP causes more damage); v1.2 adds a LoRA-based engine producing a toggleable adapter plus 4-bit support. | ~6.5× less capability damage than hand-tuned efforts. pip install heretic-llm → heretic <model>. |
| OBLITERATUS | 116-model toolkit; adds Expert-Granular Abliteration (per-expert directions for MoE) and CoT-aware ablation. | Deeper and broader, but heavier. For when a single direction is not enough. |
| UGI Leaderboard | Community ranking of Uncensored General Intelligence plus a natural-intelligence (NatInt) score. | The place to confirm an "uncensored" model is still actually smart after the surgery. |
Qwen3.6 abliterated :agent
The compromise model: uncensored and keeps tool calling. Best single "both" pick.
Gemma 4 31B Heretic
Uncensored general-purpose with native vision and tool calling.
Hermes 4.3 / 70B
Low-refusal by design (not abliterated). Excellent writing, lyrics, roleplay; run as the content subagent.
DIY with Heretic
Abliterate Qwen3.6-27B yourself for an uncensored brain tuned to taste, output as a toggleable LoRA.
delegation model so the agent auto-routes writing to it. Reliable agency, zero refusals where you want them, no compromise on either side.How models run, and what "4-bit" means
A model's "weights" are billions of numbers. Quantization shrinks them by storing each with fewer digits, like rounding prices to the nearest dollar: 4-bit is small and fast with a tiny quality cost; 8-bit is bigger and basically perfect. You also need a small program to run the model; on Mac the easy one is Ollama, and the fastest is Apple's own MLX.
Quantization reduces weight precision (16-bit → 8/6/5/4-bit). GGUF is the cross-platform de-facto format (Q4_K_M standard, Q5_K_M sweet spot, Q6/Q8 near-lossless), run by llama.cpp/Ollama; it has the broadest model coverage, often within hours of a release. MLX is Apple's native format, built for unified memory with zero CPU↔GPU copies: ~10% less memory and 15–30% faster than GGUF at the same quant. MXFP4 is gpt-oss's native 4.25-bit microscaling. AWQ/GPTQ are activation- and gradient-aware schemes common on NVIDIA. Below ~Q3, tool-calling reliability collapses.
| Server | Best for | Notes |
|---|---|---|
| Ollama (GGUF) | The agent brain | Simplest; Hermes auto-detects at :11434/v1; most stable for long agentic tool use. |
| MLX (mlx_lm) | Media, max speed, fine-tuning | Apple-native; fastest single-user generation; the only local LoRA-training path on Mac. |
| LM Studio | GUI management | Bundles the MLX engine; one-click OpenAI server; runs both MLX and GGUF. |
| llama.cpp / vLLM / SGLang | Power-user serving | Fine control of quant, context, KV-cache; -ngl 99 offloads all layers to the Metal GPU. |
| Quant level | Quality | Use when |
|---|---|---|
| 4-bit (Q4_K_M / MXFP4 / MLX-4) | Minor loss, max speed | The practical default for big models. |
| 5–6-bit (Q5_K_M / MLX-6) | Near-lossless | 24 GB+; the quality sweet spot. |
| 8-bit (Q8 / MLX-8) | Effectively full precision | 48 GB+ (you have it); best for a small premium brain. |
Lane × server × quant: picking quickly
Each lane in the studio wants a different pairing. This collapses the decisions into one table.
| Lane | Server | Quant | Why |
|---|---|---|---|
| Orchestrator brain | Ollama (GGUF) | Q4 / Q8 | Tool-call stability past 5+ rounds; MLX drifts. Use Q8 if you have the room. |
| Auxiliary fast lane | MLX (mlx_lm) | Q5 | Memory edge + native Metal kernels; short sessions do not trigger MLX tool-call drift. |
| Media generation | MLX / Draw Things | MXFP4 / 4-bit | Native; Apple's Neural Accelerator path; Draw Things' Lightning Draft hits about 1 sec/image on M5 Max. |
| Fine-tuning (LoRA) | MLX (mlx_lm.lora) | Q4 base | The only local LoRA-training path on Mac. Unified memory beats a 24 GB RTX 3090 here. |
Reasoning-format settings, per model
Hybrid-thinking models need explicit configuration or they emit chain-of-thought that breaks downstream parsers. Copy-paste:
# gpt-oss-120B (Harmony format, native reasoning effort)
model: ollama/gpt-oss:120b
extra_body:
reasoning_effort: high # low | medium | high
# Qwen3.6 family (hybrid thinking; disable all three knobs)
model: ollama/qwen3.6:27b
extra_body:
reasoning: off
reasoning_budget: 0
chat_template_kwargs:
enable_thinking: false
# GLM-4.7 / 5.1 : same triple flag as Qwen3.6
# DeepSeek V4 : same triple flag
# Llama 4 Scout : pythonic tool-call parser required
model: ollama/llama4:scout
extra_body:
tool_call_format: pythonic
# Gemma 4 31B : watch for repetition loops past ~11 tool calls. Reset session on detection.
Citation: llama.cpp issue #13189 documents the full triple-flag fix.
Images, video, music, voice, transcription
This is where local AI stopped being a compromise. The brain writes and reasons; these models produce the media. Each is a different machine with its own architecture, its own training, and its own leaderboard. Below: how each kind works, the full field of options with arena scores, the apps that run them on Mac, and how to prompt them.
Images
You type a description; the model starts from pure static and "develops" it into a photo over a few seconds, like an instant Polaroid in reverse. Modern ones read your prompt so well you can ask for specific text on a sign, an exact pose, or "the same character, new scene."
2026 image models are rectified-flow transformers (MM-DiT): a diffusion-style model that learns a near-straight path from noise to image, so it needs far fewer sampling steps than old U-Net diffusion. Text and image tokens flow through coupled attention streams; a large text encoder (FLUX.2 uses Mistral Small 3.1 24B) gives strong prompt adherence. Images live in a compressed 16-channel latent decoded by an autoencoder. FLUX.2 [klein] was distilled-free, which makes it the best open base for LoRA training.
Text-to-image arena · open weights (Elo)
FLUX.2 [dev / klein 9B / klein 4B / pro]
Best-in-class quality and prompt control; klein 4B is the distillation-free LoRA base; up to 10 reference images, HEX color control. License is split: dev and klein 9B are non-commercial; klein 4B is Apache-2.0 (commercial OK).
Qwen-Image-2512
Top Apache-2.0 model: commercial-safe, excellent text rendering, strong editing. The pragmatic default.
HiDream-O1-Image-Dev
Highest-ranked open-weights model in the arena right now.
Z-Image-Turbo
Permissive and quick: few-step turbo sampling for near-instant previews.
Seedream 4.5 / Hunyuan 3.0
Seedream excels at East-Asian aesthetics (calligraphy, fabric, architecture); Hunyuan leads open image editing.
SD 3.5 / SDXL
The mature ecosystem: the deepest library of community LoRAs and ControlNets, even if raw quality now trails.
Video
Same idea as images, but the model also has to keep things consistent from frame to frame so motion looks real. This is the heaviest job in the studio: expect minutes per clip, not seconds, and it cannot run at the same time as a big brain.
Video models extend diffusion into a spatio-temporal latent: 3D attention over (frames × height × width) tokens with a causal video VAE, so the model denoises a whole clip while enforcing temporal coherence. This is why VRAM and time costs explode versus stills.
Wan 2.7
The current Wan generation; reference-to-video with voice cloning and instruction-based video editing as new model classes since 2.2. Text- and image-to-video via ComfyUI. About 50 GB resident, minutes per clip; won't co-reside with a 120B brain. Wan 3.0 60B (Apache-2.0) is roadmapped for mid-2026.
LTX-Video
Built for speed: real-time-ish generation on capable hardware, lower fidelity than Wan.
Hunyuan Video / Mochi
High-motion open alternatives; heavier still, strong cinematic motion.
| Model | Lab | Resident | Max res / dur | M5 Max time per 5-sec clip | Notes |
|---|---|---|---|---|---|
| Wan 2.7 | Alibaba · Apr 2026 | ~50 GB | 1080p · 10 sec | ~4–8 min | Current Wan generation. Reference-to-video + voice cloning + instruction editing as new model classes. |
| LTX-Video v0.9 | Lightricks | ~20 GB | 720p · 6 sec | ~30–90 sec | Speed-first; lower fidelity than Wan but real-time-ish on capable hardware. |
| HunyuanVideo | Tencent | ~60 GB | 1280×720 · 5 sec | ~5–10 min | High-motion open alternative; cinematic motion. |
| Mochi 1 | Genmo | ~40 GB | 848×480 · 5 sec | ~3–6 min | Apache-2.0; strong open motion baseline. |
| Step-Video / CogVideoX | StepFun / Tsinghua | ~30–50 GB | 720p · 6–10 sec | ~3–8 min | Newer contenders; CogVideoX-Vid evolution lineage stable. |
Music
Give it a style and some lyrics and it writes and performs a full song, vocals and instruments, in under a minute. The local model is genuinely close to the big paid services, runs offline, and you can teach it a voice or style from a handful of examples.
ACE-Step 1.5 XL is a 4B hybrid: a language-model "composer" reasons in chain-of-thought to plan a structured blueprint (lyrics, sections, duration, metadata), then a diffusion transformer renders 48 kHz stereo audio. It is built on a Sana-style deep-compression autoencoder (DCAE) + linear transformer, with MERT and m-hubert features aligned via REPA; v1.5 adds intrinsic RL. Under 4 GB VRAM, 50+ languages, quality in the Suno v5 range (the closed leader Suno v5.5 shipped March 2026 and has since widened the gap somewhat), and it supports cover, repaint, vocal-to-BGM and LoRA from a few songs.
ACE-Step 1.5 XL
The local Suno. Full songs with vocals in seconds; tiny footprint; trainable on your own style.
YuE
Long-form vocal music generation; strong full-song structure, heavier than ACE-Step.
DiffRhythm
Full songs in ~10s via latent diffusion; very fast, fewer controls.
MusicGen / Stable Audio
Instrumental and sound-design workhorses; no vocals, but reliable for beds and loops.
[verse], [chorus], [instrumental]. Budget ~2–3 words of lyric per second (under ~140 words for a 47-second clip). Specific tags ("balkan brass, minor key, 90 bpm, male vocal") beat vague ones every time.Worked ACE-Step prompt: a 47-second balkan-folk demo
Anatomy of one actual prompt that produces a clean output on the first try. Total wall-clock on M5 Max: ~18 seconds.
# Tags (3-7 words; specific instruments + bpm beat adjectives)
tags: balkan folk, accordion, violin, minor key, 92 bpm, male vocal, live feel
# Structure (markers shape sectioning + dynamics)
lyrics: |
[verse]
Sutra zora pada na grad,
jutarnji povjetarac nosi pjesmu.
Korak po korak kroz tihu ulicu,
vrijeme za pjevanje bez razloga.
[chorus]
Pjevaj sa mnom, pjevaj glasno,
pjesma stara, srce mlado.
Pjevaj sa mnom, pjevaj glasno,
do jutra, do jutra, do jutra.
[instrumental]
[verse]
Akordeon pamti svaku ulicu,
violina zna svaku stazu.
Korak po korak kroz tihu jutro,
pjesma za one koji slušaju.
[chorus]
Pjevaj sa mnom, pjevaj glasno,
pjesma stara, srce mlado.
# Render
duration: 47
guidance: 7.5
seed: 1492
Why it works: the tag block names three instruments and pins bpm + key + vocal register before any adjective; the lyric block stays under 140 words (124 words across two verses + chorus repeat + instrumental break) so ACE-Step does not have to truncate mid-phrase; the [instrumental] marker gives the diffusion transformer a clean breath where it can foreground the accordion solo. The seed makes the output reproducible; drop the seed to re-roll the take.
[bridge] between the second verse and the final chorus. If you trained a LoRA on a singer's stems, append the LoRA name to tags at the end: +lora:my-singer. ACE-Step's recipes converge in 3-4 takes for most styles.Lyric structure templates that survive a first take
ACE-Step is fault-tolerant on tags but unforgiving on length math. Two rules carry most of the load: total lyric words must fit duration × 2.6 words/sec, and each section needs enough time for the diffusion transformer to lock its motif. Templates below are battle-tested at 90-110 BPM; scale lyric counts proportionally for slower or faster.
| Duration | Structure (recommended) | Lyric budget | Per-section |
|---|---|---|---|
| 30 s (snippet) | [verse] [chorus] | ~70-78 words | verse 32 · chorus 38 |
| 47 s (single demo) | [verse] [chorus] [instrumental] [verse] [chorus] | ~120-130 words | verse 28 · chorus 32 · instr 0 |
| 60 s (short clip) | [intro] [verse] [chorus] [verse] [chorus] | ~155 words | intro 8 · verse 32 · chorus 40 · 2nd 35/35 |
| 90 s (radio cut) | [verse] [chorus] [verse] [chorus] [bridge] [chorus] | ~230 words | verse 36 · chorus 38 · bridge 32 · final 50 |
| 120 s (album cut) | [intro] [verse] [chorus] [verse] [chorus] [bridge] [instrumental] [chorus] | ~310 words | scale 1.0×; pad with repeats on final chorus |
- Pop · AABA-derived:
[verse] [chorus] [verse] [chorus] [bridge] [chorus]. The bridge sits at the 60% mark; that is where ACE-Step naturally drops a key change if the lyric meter shifts. - Ballad · ABABCB: longer verses (~40-50 words), shorter chorus (~24-30), one bridge. Tag
rubatoand drop BPM tag to ~70 to let the model breathe. - Dance / EDM · verse-chorus-drop: replace the second
[chorus]with[instrumental]and tagdrop, sidechainin the same tag line; ACE-Step renders the build-up + release without a lyric cue. - Balkan / folk · verse-chorus repeat: instrument-led; keep lyrics simple, repeat the chorus three times across a 60 s clip, mark
[instrumental]for the accordion or violin solo. - Hip-hop · 16-bar verse + 8-bar hook: at 90 BPM, 16 bars = ~42 s, so a single verse fills a 47 s clip; use one
[verse]and one short[chorus]. Append tagspoken cadence.
Why this matters: ACE-Step's diffusion transformer plans the entire timeline at the start. Overshooting the word budget by 20% causes the model to truncate mid-phrase or speed the vocal beyond intelligibility. Undershooting by 30% leaves the singer trailing off into instrumental sections that were not in your prompt. The templates above hit the sweet spot for one-take outputs in the 90-110 BPM band.
Voice & dubbing
From 5–10 seconds of someone speaking, these models clone the voice and then read any text in it, with emotion. Chain a few together and you can take a video in one language and output it dubbed in another, keeping the original speaker's voice.
Modern open TTS is zero-shot non-autoregressive flow-matching: a DiT generates a mel-spectrogram conditioned on text plus a speaker embedding extracted from a short reference clip, fused without a separate duration model. F5-TTS pairs a DiT with ConvNeXt, trained on ~100k hours, hitting real-time factor ~0.15 (≈6× faster than playback). Chatterbox adds emotion control and a PerTh watermark.
Chatterbox
23-language cloning with emotion control; competitive with ElevenLabs in independent blind tests. The default voice engine.
F5-TTS
RTF ~0.15, clones from a 1–10s reference; superb speed/quality balance.
Kokoro
Featherweight, extremely fast TTS for narration where cloning is not needed.
Qwen3-TTS
Clones from just 3 seconds of reference audio (vs 5–15s for F5, 5s for Chatterbox). Strong multilingual range. Pairs naturally with the Qwen brain.
CosyVoice 3
About 150 ms streaming latency, the lowest of any open TTS as of May 2026. The pick for real-time voice agents.
Sesame CSM-1B
Conversational speech model: tiny, fast, fluent in the conversational register where most TTS still sounds read-aloud.
Dia
Multi-speaker dialogue model; handles back-and-forth and overlap better than single-speaker TTS systems.
TTS quality · TTS-Arena Elo (2026-05-25)
- Length: 5-15 seconds for Chatterbox / F5-TTS / CosyVoice 3. 3 seconds is enough for Qwen3-TTS. Longer than 30 seconds rarely helps and can introduce inconsistency.
- Cleanliness: no background music, no reverb, no second speaker. Studio mic > phone > Zoom recording. If you only have a noisy clip, run it through Demucs's
--two-stems vocalsfirst. - Prosody: natural speech, not flat reading. Include at least one falling-intonation sentence and one rising-intonation question if possible; the model anchors to your pitch range from the sample.
- Format: WAV 16 kHz mono is universal. 24 kHz mono works too. The clone models resample internally; do not over-engineer.
- Pitch consistency: same speaker, same emotional register as the target. A clip of you laughing will make the clone laugh-tinted across all output.
- Punctuation in the target text drives prosody at inference time. Commas pause, ellipses trail, em-dashes break (but this guide bans em-dashes; use a semicolon).
- For emotion in Chatterbox: use the
exaggerationparameter (0.0-1.0); do not write ALL-CAPS in the target text. The exaggeration scalar is dramatically more controllable.
All seven stages on-device. M5 Max time budget: about 3-5 minutes of processing per minute of source video. Alternate lip-sync: MuseTalk 1.5 (Tencent Music) for higher fidelity on close-up faces.
Transcription
Turns speech into accurate text in dozens of languages, fast, fully offline. The backbone of meeting notes, subtitles and the dubbing pipeline above.
Encoder-decoder transformers trained on huge weakly-labelled audio. Whisper large-v3 runs at ~3 GB in MLX with strong multilingual word-error rates; Parakeet and Qwen3-ASR push speed and accuracy further on supported languages.
Whisper large-v3
The reliable multilingual standard; timestamps, translation, robust to noise.
Parakeet v3
Very fast, very accurate on supported languages; great for long recordings.
Qwen3-ASR
Newer multilingual ASR with strong accuracy; pairs naturally with the Qwen brain.
Canary-Qwen-2.5B
5.63% WER on English: beats Whisper large-v3 on the HuggingFace Open ASR Leaderboard for English transcription. Use when English-only and accuracy matters more than language coverage.
ASR word-error rate: open models on English (HF Open ASR Leaderboard, 2026-05-25)
ASR WER · per source language (open models, lower is better)
mlx_audio.server) exposes TTS, STT and speech-to-speech behind an OpenAI-compatible REST endpoint, so Hermes drives every voice model through one local URL, the same way it talks to the brain.Making the last cloud tools local
Hermes ships four tools that phone the cloud by default. To make the studio truly air-gappable, each one is repointed at a local engine. After this, the machine can run with Wi-Fi off.
| Default (cloud) | Local replacement | How |
|---|---|---|
image_generate → FAL | Local FLUX / Qwen-Image | Plugin or MCP wrapper around Draw Things / ComfyUI / MLX; the plugin system supports image-gen backends. |
text_to_speech → cloud | MLX-Audio server | Point the tool at the local OpenAI-compatible voice endpoint. |
web_search → Exa | Off, or local SearXNG | Disable for a pure air-gap, or wrap a self-hosted SearXNG via MCP for offline-ish search. |
mixture_of_agents | Local council | Repoint the 4 references + aggregator at your local models (e.g. gpt-oss + Qwen + Gemma). |
localhost. Pull the network cable and the studio keeps working. That is the difference between "private-ish" and genuinely yours.Teaching a model your style, on the Mac
A LoRA is a tiny add-on file that teaches a big model one new thing: your face, your art style, a brand voice, a singer's tone, without retraining the whole model. You make one from a handful of examples in an hour or two, and snap it on or off like a lens.
Low-Rank Adaptation freezes the base weights and injects small trainable rank-decomposition matrices (A·B) into attention/FFN layers; you train ~0.1–1% of parameters. QLoRA trains those adapters on top of a 4-bit-quantized base, collapsing memory further. Lineage: DreamBooth and Textual Inversion for image models, now standard across text, image, music and voice. Unified memory is the quiet superpower here: there is no separate VRAM wall, so a Mac fine-tunes models a 24 GB RTX 3090 cannot even load (rule of thumb: 16 GB → 8B, 32 GB → 14B, 64 GB → 32B; Llama-7B needs ~28 GB full, ~14 GB LoRA, ~7 GB QLoRA).
Training a text LoRA with MLX
# 1. quantize the base to 4-bit (QLoRA kicks in automatically)
mlx_lm.convert --hf-path Qwen/Qwen3.6-27B -q --q-bits 4
# 2. train the adapter on your JSONL data
mlx_lm.lora --model ./mlx_model --train --data ./data \
--lora-layers 16 --batch-size 2 --iters 600
# 3. fuse the adapter back into a standalone model (optional)
mlx_lm.fuse --model ./mlx_model --adapter-path ./adapters
Image LoRA
FLUX/SDXL LoRA in 2,000–4,000 steps via ai-toolkit, SimpleTuner or ComfyUI. Stack 2–3 at 0.5–0.7 weight. Thousands ready on Civitai.
Music LoRA
ACE-Step learns a genre or a singer's tone from a handful of songs; snap it on for on-brand tracks.
Abliteration LoRA
Heretic v1.2 outputs the uncensoring itself as a toggleable LoRA adapter, no full re-download.
Voice "LoRA"
Zero-shot cloning is effectively instant adaptation; fine-tune only for a recurring signature voice.
Image-LoRA training: tool chooser
Four tools dominate FLUX / SDXL / Qwen-Image LoRA training in 2026. Pick by base model and ergonomics.
| Tool | Base models | Sweet spot | Notes |
|---|---|---|---|
| ai-toolkit (Ostris) | FLUX.2, SDXL, SD 3.5 | FLUX.2 [klein] 4B LoRA | The default for FLUX. ~/.aitoolkit config files; clean dataset format; 1,000–2,000 steps at rank 8 = HF baseline. Best LR: 1e-4 with cosine decay. |
| SimpleTuner | FLUX.2, SDXL, SD 3.5, Auraflow | Multi-aspect-ratio training | Stronger for production datasets with mixed aspect ratios. Supports DeepSpeed; Mac MPS works but slower than ai-toolkit on M5 Max. |
| Diffusion Pipe | FLUX.2, HunyuanVideo, Wan 2.7 | Video LoRAs | The only viable Mac path for training video model LoRAs (Wan 2.7, Hunyuan). Heavy memory; needs the 128 GB budget. |
| ComfyUI built-in | FLUX.2 dev, SDXL | One-off iteration | Node-graph LoRA training; visual but slower than CLI tools. Good for trying ideas before committing to ai-toolkit. |
- 15-30 source images of the subject (more isn't always better; quality > quantity).
- Mixed lighting, angles, distance, expressions. Avoid duplicates.
- 1024×1024 or matched-aspect-ratio crops; FLUX.2 handles multi-aspect with SimpleTuner.
- Caption file per image: short trigger token (e.g.
ivansubj) + brief description. ai-toolkit auto-captions via vision LLM if you skip. - Train 1,000-2,000 steps at rank 8, LR 1e-4 cosine. Sample every 250 steps to a held-out prompt; pick the checkpoint that nails likeness without overfitting facial frozen-state.
- Stack with style LoRAs at 0.5-0.7 weight at inference. Civitai has thousands ready to combine.
How the agent runs the whole studio
The pieces only become a studio when one mind coordinates them. Hermes does this with five mechanisms, all configurable.
| Mechanism | What it enables |
|---|---|
| Model routing | delegation.model sends subagent or content work to a different model (e.g. uncensored writer) while the orchestrator stays on the reliable brain. Switch live with /model. |
| Subagents | delegate_task spawns isolated workers (own context, terminal, tools) that run in parallel and return only a summary, keeping the main thread clean. |
| Mixture of agents | mixture_of_agents sends one hard problem to several models and an aggregator merges the best answer, all local. |
| Code execution | execute_code runs Python that itself calls Hermes tools, for branching logic the model would otherwise narrate step by step. |
| Cron + memory + skills | Scheduled jobs run in fresh sessions; memory carries durable facts; skills carry repeatable procedures the agent wrote itself. |
# route the heavy thinking and the uncensored writing separately
model: ollama/gpt-oss:120b # orchestrator brain
delegation:
model: ollama/qwen3.6-abliterated:agent # subagent / content writer
custom_providers:
- name: local-media
base_url: http://localhost:8080/v1 # MLX-Audio / image gateway
Swap the brain in 30 seconds when the leaderboard moves
The whole point of treating the brain as a swappable component: the day a better open-weights model lands, you switch without re-architecting. Concrete recipe, valid for any Ollama-served model:
# 1. Pull the new brain (one-time cost; cached afterwards)
ollama pull deepseek-v4-flash:35b-a3b-q5 # ~24 GB
# 2. Hot-swap by editing one line in ~/.hermes/config.yaml
sed -i '' 's|ollama/gpt-oss:120b|ollama/deepseek-v4-flash:35b-a3b-q5|' ~/.hermes/config.yaml
# 3. Restart the agent; skills, memory, channels all carry over
hermes restart
# 4. Sanity check
hermes 'Confirm which model you are running, then list the first 3 tools available.'
What carries over: every skill in ~/.hermes/skills/, every memory entry, every channel binding (Telegram / Discord / Slack), every API key, every cron job, every MCP server registration. Nothing else needs to change because nothing else depends on the model identity. The brain is one row in one YAML file.
- Reasoning-format flags: hybrid-thinking models (gpt-oss-120B, Qwen3.6) need three knobs disabled (
--reasoning off+--reasoning-budget 0+enable_thinking: false). Non-thinking models (Llama 4 Scout, Gemma 4) need none of them. Check the cheat sheet in section 6 after swapping. - Context-window size:
OLLAMA_CONTEXT_LENGTHmust fit the 70+-tool system prompt. 65,536 is the safe default; some smaller-context models will silently truncate tools if left at 8,192. - Tool-call format: most modern open models speak the OpenAI tool-call schema natively. A handful of older Llama-derivatives need
--tool-format llama. Hermes auto-detects on first call; if a tool silently fails to fire, that is the suspect.
How to actually talk to each model
Every model class wants a different prompt style. Old "masterpiece, trending on artstation" spam hurts modern models. Here are the patterns that work.
Prompt patterns by model class
FLUX / Qwen-Image
Plain descriptive sentences + camera, lens and lighting. "A weathered fisherman at dawn, 35mm, soft side light, shallow depth of field, muted teal palette." Put exact text in quotes for signage.
ACE-Step
Tags: balkan brass, minor key, 90bpm, male vocal, live. Lyrics with [verse] / [chorus] / [instrumental]. ~2–3 words per second.
Voice cloning
5–10s of clean reference audio (no music). Punctuation drives prosody; for emotion use Chatterbox's exaggeration control rather than ALL-CAPS.
Agent system prompt
State the goal, the allowed tools, and a stop condition. Let memory and skills carry standing context instead of repeating it every session.
Per-brain prompting: what each Mac-fit model wants
The four orchestrator brains do not respond to the same system-prompt shape. Tune per model.
gpt-oss-120B
Use OpenAI's Harmony chat template (Ollama handles automatically). Set reasoning_effort: high in extra_body for hard tasks; drop to low for quick tool calls. The model emits visible chain-of-thought by default; keep that in the agent log but strip it from end-user output via Hermes's /compress.
Qwen3.6-27B
For tool-calling agents, disable reasoning with all three: reasoning: off, reasoning_budget: 0, and chat_template_kwargs.enable_thinking: false. Skip any one and the model leaks chain-of-thought into your tool-call output. Set reasoning_budget: 8192 for math and code where you do want thinking.
Llama 4 Scout
Pin a parser that handles Llama 4's pythonic tool-call format (not OpenAI JSON). With Ollama: extra_body.tool_call_format: pythonic. With llama.cpp serve: pass --chat-template llama4. Vanilla OpenAI parsers silently drop calls.
Gemma 4 31B
Reliable through about 10 tool calls per session. Past that, builders report repetition loops where the model resends the same tool call. Detect with a three-identical-calls heuristic and reset the session (Hermes /new) before continuing. Vision works; tool-calling reliability does not match gpt-oss or Qwen3.6 at depth.
Real workflows the studio runs end to end
These are not hypotheticals; each is a chain of the tools and models already covered, expressed as a Hermes skill or cron job.
Real workflows
Auto-dubbing
Video in → Whisper transcribes (~10% realtime on M5 Max) → brain translates (~1 sec/line on Qwen3.6-27B) → Demucs separates vocal track + Pyannote diarizes speakers (~30 sec for 5-min clip) → Chatterbox clones each speaker voice + reads translation (~RTF 0.2) → LatentSync/MuseTalk re-syncs mouth motion (~realtime) → ffmpeg muxes back to video. One Hermes skill, one command.
Local songwriting
Brain (or uncensored writer) drafts lyrics in your style → tags hand-picked → ACE-Step XL 4B composes blueprint (~10 sec) and renders 48 kHz stereo audio (~60 sec) → you keep stems for re-mix. The fully offline Suno equivalent. M5 Max runs the whole chain on battery.
Content factory (cron)
Hermes cronjob at 06:00 local: brain drafts a post per topic (~30 sec) → FLUX dev or Qwen-Image-2512 renders matching hero image (~15 sec via Draw Things Lightning Draft) → file dropped in ~/content/YYYY-MM-DD/ → Hermes messaging gateway pings you on Telegram. Runs while you sleep; review in the morning.
Autonomous coding
Brain plans via /goal contract → delegate_task spawns subagents per module (isolated context, parallel) → execute_code runs pytest/cargo test → LSP semantic diagnostics surface any type errors → loop until green or Kanban auto-blocks on heartbeat timeout. Production Sprint 1 of this guide ran exactly this way.
Local model council
mixture_of_agents routes the hard question through gpt-oss-120B + Qwen3.6-27B + Gemma 4 31B in parallel, then an aggregator brain merges. No API calls. ~3x wall-clock vs single brain (parallel limited by 614 GB/s bandwidth), better answers on contested questions.
Self-improving loop
Hermes hits a bug, solves it, then writes a skill with the fix; the Curator agent deduplicates and consolidates skills nightly. Same bug a week later: agent finds the skill via session_search and applies the known fix in one tool call. The studio gets measurably better with use.
from run_agent import AIAgent drops the studio into your own Python scripts, so a pipeline can be triggered by a file drop, a webhook or a schedule.Music-video pipeline · script to render
End-to-end music-video generation, all on-device. M5 Max budget: about 5-8 minutes total for a 30-second video.
Podcast pipeline · script to mastered episode
Multi-speaker podcast from a script, all voice-cloned, with intro music, lower-thirds music beds and outro. M5 Max budget: real-time + a few minutes mastering for a 30-minute episode.
What actually fits in 128 GB at once
Roughly 120 GB is usable for models after raising the wired limit. These are real, simultaneous loadouts. The pattern is always: keep one brain resident, spin heavy media up on demand.
1 · Mac Studio M3 Ultra at 512 GB / 819 GB/s. Holds DeepSeek V4 Flash 284B at Q8 with room for a second brain co-resident. The only single Apple machine that matches the trillion-parameter cloud tier at slow tok/s.
2 · EXO Labs distributed inference over Thunderbolt 5. Daisy-chain M5 Max + Mac Studio M3 Ultra (or multiple M5 Max units) over the 120 Gb/s TB5 fabric into a single virtual inference pool. Workable for models > 128 GB; the bandwidth ceiling per hop is TB5, not RAM, so attention-heavy workloads pay a real but tractable penalty. Open-source at github.com/exo-explore/exo; works with the same Hermes config and same model weights you already have. The escape valve for when the leaderboard moves and you do not want to abandon the studio architecture.
What fits my Mac? Interactive
Slide to your RAM budget. The model field table above and the budget bars above re-colour: green if it fits, amber if tight, red if you would need to evict the brain to run it.
What it would cost to rent the equivalent
The studio is a capital cost (one Mac, once). Subscriptions are operating cost (monthly, forever). The honest comparison is what a comparable workflow runs through paid APIs and consumer-tier subscriptions, at moderate use: a working day of agent calls, ~40 images, ~10 song demos, ~30 minutes of voice clones, ~3 hours of transcription. Prices are list, May 2026; multi-tier discounts ignored.
| Capability | Local stack | Closest paid equivalent | List price · moderate use | Monthly |
|---|---|---|---|---|
| Brain · agent loop | gpt-oss-120B + Hermes | Claude Pro (Opus 4.7) + Cursor Pro | $20 + $20 | $40 |
| Coding agent | Qwen3.6-27B + Hermes | GitHub Copilot Business + Codex API top-up | $19 + ~$25 | $44 |
| Image generation | FLUX.2 dev/klein · Qwen-Image | Midjourney Standard + Adobe Firefly | $30 + $10 | $40 |
| Music generation | ACE-Step 1.5 XL · YuE | Suno Pro + Udio Standard | $30 + $10 | $40 |
| Voice cloning + TTS | Chatterbox · F5-TTS · Qwen3-TTS | ElevenLabs Creator (100k chars) | $22 | $22 |
| Transcription | Whisper · Canary-Qwen · Parakeet | OpenAI Whisper API + Rev.ai | ~$15 + $10 | $25 |
| Video generation | Wan 2.7 · LTX-Video (mode-switch) | Runway Gen-4 Standard + Pika 2.x Pro | $35 + $35 | $70 |
| Long-context summarisation | any local 27B+ brain | Anthropic API 200k context spillover | ~$30 | $30 |
| Search-grounded answers | SearXNG + local brain | Perplexity Pro | $20 | $20 |
| Cloud sandbox (for agents) | local · Docker · Modal free tier | Modal paid · Daytona Pro · Vercel Pro | ~$20-50 | $30 |
| Rented equivalent · monthly | $361 | |||
MacBook Pro M5 Max 128 GB / 2 TB lands around $4,499 list (US, May 2026). At $361/mo rented-equivalent, the machine pays for itself in ~12.5 months. At $200/mo rented (drop video + reduce to single-tier image and music), the payback is ~22.5 months. The Mac then keeps working for years 2-5 at zero operating cost, while subscription totals over the same period are $8,664 to $21,660.
This understates the gap two ways. First, list prices ignore overage tiers: heavy image or music use blows past these caps fast. Second, the studio has no rate limits, no usage caps, no policy refusals, no per-call latency from the round-trip, no data leaving the laptop. The capability ceiling is your Mac, not a usage meter.
From a fresh Mac to a running studio
- Free the memory. Raise the GPU wired limit so models can use about 120 GB.
sudo sysctl iogpu.wired_limit_mb=122880 - Install the serving layer. Ollama for the brain (GGUF), and MLX/LM Studio for media and fine-tuning.
- Pull the models.
ollama pull gpt-oss:120b ollama pull qwen3.6:27b # raise context so the 70+-tool prompt fits export OLLAMA_CONTEXT_LENGTH=65536 - Install Hermes. Pick one:
# Recommended on macOS brew install hermes-agent # Or via PyPI (clean Python environments) pip install hermes-agent # Or via the official install script curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash # Or on Windows iex (irm https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.ps1) - Point it at local. Edit
config.yaml(the orchestration block in section 10): brain, delegation model, custom providers for media. - Verify. Run
hermes doctor, then ask it to write a file, generate an image, and transcribe a clip. If all three land, the studio is live. - Configure reasoning-format per model. Hybrid-thinking models need three knobs disabled to keep tool-call output clean. The cheat sheet is in section 6. Skip this and watch agents emit raw chain-of-thought into your terminal output.
- Enable
hermes proxyif you have any of Claude Pro, ChatGPT Pro or SuperGrok. The subscription becomes a localhost OpenAI-compatible endpoint that Codex CLI, Aider, Cline and Continue can all drive. Material cost-saving for multi-tool users.
hermes backup snapshots memory and skills · profiles keep work and creative brains separate · /compress rescues a long session nearing its context limit.studio-up.sh, run with bash studio-up.sh. Idempotent; rerun anytime.
#!/usr/bin/env bash
set -euo pipefail
# 1. Free the memory
sudo sysctl iogpu.wired_limit_mb=122880
# 2. Install serving layer
brew install ollama
brew install --cask lm-studio
pip install mlx mlx-lm mlx-audio
# 3. Pull models
ollama pull gpt-oss:120b
ollama pull qwen3.6:27b
export OLLAMA_CONTEXT_LENGTH=65536
# 4. Install Hermes
brew install hermes-agent
# 5. Initial config
mkdir -p ~/.hermes
cat > ~/.hermes/config.yaml <<'YAML'
model: ollama/gpt-oss:120b
delegation:
model: ollama/qwen3.6:27b
custom_providers:
- name: local-media
base_url: http://localhost:8080/v1
YAML
# 6. Verify
hermes doctor
echo "Studio is live. Try: hermes 'write hello.txt then summarize it.'"
Reproduce the numbers in this guide
Every tok/s, prefill speedup, image-gen wall clock and ASR WER in this guide can be re-measured on your own Mac in under 10 minutes. If a number drifts more than ~15% from the ranges below, the most likely cause is missing macOS Tahoe 26.4+, an unraised wired limit, or a competing memory hog (Chrome, Docker Desktop). Probe in this order.
1 · LLM decode throughput (tok/s)
# Ollama: built-in eval-rate report
ollama run gpt-oss:120b --verbose "Write a 200-word essay about the M5 Max."
# → look for: eval_count, eval_duration, eval_rate (tokens/s)
# MLX equivalent (more granular, separates prefill from decode)
mlx_lm.generate --model mlx-community/Qwen3.6-27B-Instruct-4bit \
--prompt "Explain unified memory in two paragraphs." \
--max-tokens 256 --verbose
Expected on M5 Max 128 GB · macOS Tahoe 26.4+ · cold start, no other GPU load:
| Model · quant | Decode tok/s | Prefill tok/s | Memory resident |
|---|---|---|---|
| gpt-oss-120B · Q4 (MoE, 5.1B active) | ~58-72 | ~620-820 | ~63 GB |
| Qwen3.6-27B · Q4 | ~21-26 | ~280-360 | ~16 GB |
| Qwen3.6-27B · Q8 | ~12-15 | ~140-180 | ~31 GB |
| Llama 4 Scout 17B-16E · Q4 (MoE) | ~74-88 | ~720-940 | ~12 GB |
| DeepSeek-V4 Flash 35B-A3B · Q5 | ~48-60 | ~510-670 | ~24 GB |
2 · Prefill speedup vs M4 Max (sanity check)
# Long-context prefill measurement: 8k-token prompt
mlx_lm.generate --model mlx-community/Qwen3.6-27B-Instruct-4bit \
--prompt-cache-file /tmp/none \
--prompt "$(yes 'context line ' | head -1000 | tr -d '\n')" \
--max-tokens 1 --verbose
# → look for: prompt-eval rate. Compare against M4 Max ~85 tok/s.
# M5 Max should land ~280-360 tok/s on the same prompt (3.3-4.0x).
3 · Image generation wall clock (FLUX.2 + Draw Things)
# FLUX.2 dev · 28 steps · 1024x1024 · MLX-Diffusion
time python -m mlx_diffusion.generate \
--model black-forest-labs/FLUX.2-dev-mlx-4bit \
--prompt "a still life of a brass accordion, dramatic light" \
--steps 28 --size 1024
# Draw Things Lightning Draft (UI app, but timeable)
# Generate at 512x512, 4 steps, Lightning Draft schedule.
# Expected wall clock ~1.0-1.4 s per image.
| Workload | Wall clock · M5 Max | M4 Max baseline |
|---|---|---|
| FLUX.2 dev · 28 steps · 1024 | ~14-18 s | ~52-68 s |
| FLUX.2 klein · 4 steps · 1024 | ~2.4-3.2 s | ~9-12 s |
| Draw Things Lightning Draft · 4 steps · 512 | ~1.0-1.4 s | ~3.8-4.6 s |
| Qwen-Image · 30 steps · 1024 | ~22-28 s | ~80-95 s |
4 · Transcription speed (real-time factor)
# 60-second clip; RTF = wall_clock / audio_duration. Lower is better.
time mlx_whisper --model mlx-community/whisper-large-v3-mlx \
--language en sample-60s.wav
# → expected wall clock ~3.8-5.2 s → RTF ~0.06-0.09
time faster-whisper-cli --model large-v3 --device mps sample-60s.wav
# → expected wall clock ~5.2-7.8 s → RTF ~0.09-0.13
5 · Music generation (ACE-Step)
time python -m ace_step.generate \
--tags "balkan folk, accordion, violin, minor key, 92 bpm, male vocal" \
--lyrics-file /tmp/lyrics.txt \
--duration 47 --guidance 7.5 --seed 1492
# → expected wall clock ~16-22 s for 47-second output
6 · Putting it all together: the 5-minute smoke test
#!/usr/bin/env bash
# Save as bench-studio.sh; run after install completes.
set -euo pipefail
echo "== sysctl wired limit =="
sysctl iogpu.wired_limit_mb
echo "== macOS build =="
sw_vers -productVersion
echo "== brain decode =="
ollama run gpt-oss:120b --verbose "Count to 50." 2>&1 | grep eval_rate
echo "== prefill =="
mlx_lm.generate --model mlx-community/Qwen3.6-27B-Instruct-4bit \
--prompt "$(yes 'x ' | head -500 | tr -d '\n')" \
--max-tokens 1 --verbose 2>&1 | grep -E "prompt|generation"
echo "== whisper =="
time mlx_whisper --model mlx-community/whisper-large-v3-mlx sample-60s.wav
- macOS < 26.4: INT4 Neural Accelerator path is not enabled. Upgrade is the single biggest lever; expect 30-50% jump on prefill and 15-25% on decode after.
- Wired limit unraised:
sysctl iogpu.wired_limit_mbshould read 122880 (~120 GB). Default is ~96 GB; the 120B brain spills. - Competing GPU consumers: Chrome with WebGL tabs, Docker Desktop, Final Cut Pro background render. Quit them, retest.
- Thermal throttle: a stress run that just finished leaves the M5 Max in throttle for 1-2 minutes. Wait, then re-measure cold.
- Power adapter: on battery the GPU clocks down. All numbers above assume the laptop is plugged in.
- Wrong quant: a Q4 number compared against a Q8 weight on the same model will look ~40% slower. Match the row.
Staying private, staying in control
An agent that can run shell commands and an uncensored model that never refuses are powerful and need guardrails. Hermes ships them; keep them on.
- Dangerous-command blocking + approvals: destructive patterns (
rm -rf,DROP TABLE) are blocked or require explicit confirmation. - Secret-exfiltration scanning: the runtime flags attempts to leak keys or credentials, important when an abliterated model will follow any instruction it reads.
- MCP hardening: OAuth 2.1 PKCE for connectors plus OSV scanning of MCP servers for known-malicious packages.
- Air-gap checklist: local brain ✓, local media ✓,
web_searchoff or local SearXNG ✓, no telemetry ✓. Then the network cable is optional.
Seven sandbox backends
Hermes ships more sandbox options than any competitor. Pick by trust level + cost.
local
Runs in your shell with the agent's permissions. Fastest. Use for trusted skills.
Docker
Isolated filesystem, network, processes. Use for untrusted MCP servers or third-party skills.
SSH
Run the workload on another machine over SSH. Use for heavy compute on a Mac Studio without leaving the laptop.
Singularity
Container format favoured in HPC. Use on university or research clusters.
Modal
Spins up an ephemeral cloud sandbox per task. Use for elastic burst compute outside the studio.
Daytona
Pre-configured dev environments. Use for reproducible per-project workspaces.
Vercel Sandbox
Vercel's serverless sandbox primitive. Use when the workload should sit close to a web app.
"Private and local" means no telemetry, no cloud calls, no model provider seeing your prompts. It does not mean threat-free. The four threat surfaces that survive going local:
- Malicious skills.
~/.hermes/skills/is justSKILL.md+ scripts; a hostile skill in agentskills.io / ClawHub can run anything. Mitigation: Hermes ships skill provenance + OSV scanning on import, but you must read the skill before installing. The Curator agent flags churn. - Prompt injection via fetched content. An abliterated model will follow instructions hidden in a web page or PDF it reads. The orchestrator brain (clean, refusal-intact) should pre-filter content before passing to the writer. Mitigation: split orchestration brain (clean) from content-writing brain (uncensored); keep tool approvals on for shell + file-delete.
- MCP supply chain. Third-party MCP servers run with the agent's permissions. Mitigation: Hermes auto-runs OSV on MCP packages at install + OAuth 2.1 PKCE for connectors. Review what each MCP server can touch before enabling.
- Physical access + key extraction. Local API keys (for the few cloud tools you do keep) sit in
~/.hermes/. macOS FileVault encrypts the disk; secrets are not in your keychain by default. Mitigation: enable FileVault; consider moving secrets to macOS Keychain viahermes keychain migrate(0.14+).
The studio is dramatically safer than the cloud equivalent. It is not safe. Plan for it.
The studio is a starting line, not a finish
Step back and look at the arc. A year ago, the best open model scored 22 and a private studio like this was science fiction. Today the gap to the frontier is single digits, a 27-billion-parameter model that fits a laptop codes like last year's best, and music, voice and images that rivalled paid services now run offline in seconds. The line on that chart is still climbing.
What you have built here is not a cheaper ChatGPT. It is a different relationship with the technology. The models are yours: they do not change under you overnight, they do not log your work, they do not disappear when a subscription lapses or a company pivots. The agent learns your patterns and keeps them. The voice clone, the LoRA of your style, the skills it wrote solving your bugs, none of it leaves the machine. In a market built on renting access to someone else's servers, owning the whole stack outright is the genuinely radical option.
It is not the strongest possible system. The trillion-parameter leaders still live in data centers, local video still trails, and the leaderboard you read today will be wrong next month. But the trajectory is unmistakable: every quarter, more of the frontier becomes something you can hold in 128 GB. The right move is not to wait for the perfect model. It is to build the studio now, learn how the pieces fit, and let it improve underneath you as the open field keeps closing the gap, which it will.
One agent. Every model. Zero cloud. Run it once, and the cloud starts to look like a choice rather than a requirement. That is the whole point, and it is already here.
Every benchmark in this guide is timestamped to the source-of-truth date in the snapshot badge at the top. The page is regenerated and re-validated on a monthly cron (.github/workflows/monthly-refresh.yml, runs the 1st of each month at 09:00 UTC). The cron does three things:
- Re-fetches the public leaderboards (Artificial Analysis Intelligence Index, LMArena, HuggingFace Open ASR, TTS-Arena v2, BFCL, GDPval, MMLU-Pro, GPQA, SWE-bench Verified, τ²-Bench) and opens a draft PR if any displayed number drifts more than 2 Elo points or 2 percentage points.
- Re-runs the validate script against the rendered HTML to keep em-dash count at 0, JSON-LD valid, all internal anchors resolved, all icon refs defined, page size under the 250 KB budget, and the cliche scan clean.
- Re-runs Lighthouse CI and axe-core against the production URL, posts the diff into the PR description.
If the draft PR sits open longer than a week, the next cron run pings it. Readers can verify all of this themselves: every workflow file is in the repo, every benchmark cite has a primary-source link, and the /llms-full.txt mirror is what AI crawlers actually index. The page is engineered to age slowly and visibly, not to silently rot. If you spot drift before the cron does, open an issue with the source link and it gets folded into the next refresh.
How this page itself was built
Worth saying plainly: this guide describing the local AI studio was written using the studio it describes. Not as a stunt, as a stress test. If the architecture cannot ship a 220 KB single-page field guide with zero em-dashes, six JSON-LD blocks, working CSS scroll animations and an OG card that renders on Twitter, then it cannot ship anything more demanding. Here is the actual build stack, end to end:
| Step | Done by | Notes |
|---|---|---|
| Source research (papers, model cards, leaderboards) | local brain + web_search tool calls + delegated subagents | The brain dispatches one research-agent subagent per topic (open-weights LLMs, FLUX.2 family, ACE-Step internals, M5 Max bandwidth, etc.) and the results merge back into one outline. Same pattern as the multi-modal worked example in section 3. |
| Outline + narrative arc | brain + /goal contract | The 17-section outline is one Ralph-loop goal: "ship a field guide at 99/100 quality, single page, English with locale roadmap, zero em-dashes". The Judge agent gates each merge. |
| Copy drafting (every paragraph, both registers) | brain in long-context mode | "Simple" and "Technical" registers are two passes over the same outline. Em-dashes are intercepted by the validate script before commit; the brain learned after the first three rejections to use commas and sentence-splits. |
| SVG diagrams + benchmark bars | brain (raw SVG, no design tool) | Every diagram in this guide is hand-authored SVG in the HTML source. Faster to iterate, smaller payload, perfect rendering at any zoom, no external dependency. |
| OG image + favicon set | scripts/build-og.mjs (Resvg + hand-crafted SVG) | Renders 1200x630 PNG from an SVG template; six locales of headline copy. No Satori, no font parsing layer, no Canva. |
| Validation gates | scripts/validate.mjs + Lighthouse CI + axe-core | Em-dash count = 0, JSON-LD valid, tag balance, anchor resolution, size budget, cliche scan, WCAG 2.2 AA. All run on every PR before deploy. |
| Deploy | Cloudflare Pages via wrangler-action@v3 | Push to main → auto-deploy in ~30 s. Preview deploys per PR. No GitHub OAuth, no manual step. |
| Monthly drift refresh | monthly-refresh.yml cron + PR ping | The honest-aging loop described above. Same brain, same skills, same agent loop, no human in the regular path. |
The studio's brain wrote the copy for the studio's own field guide, in two language registers (Simple + Technical) across 17 sections and 236 KB of single-page HTML. The studio's tools rendered the OG card (in six locales), ran the validation gates, opened the PRs, deployed to the edge, and will keep the benchmarks honest month over month. The whole pipeline runs on the same MacBook Pro M5 Max the guide is about. No cloud LLM round-trip ever touched the production copy.
Honest scope note on i18n: the page body is English only as of v1.12. The OG social card renders in six locales (en/sr/de/fr/zh/ar) via scripts/build-og.mjs --locale=<xx>. Full per-locale page translations with a language switcher, hreflang, og:locale:alternate, RTL CSS for /ar/ and localized sitemap entries are on the roadmap, not shipped. The 6-locale promise applies to the social card today and the full page in a future release.
You are reading the result. If that is not a working demonstration of "one agent, every model, zero cloud", nothing is.
Straight answers
Is it really free?
Does it genuinely work offline?
What is the catch with uncensored models?
Where is local AI still weak?
Is a Mac actually the right machine?
Will these recommendations age?
Do I need macOS Tahoe 26.4?
sw_vers to check.Does Apple Intelligence conflict with the local studio?
Can I scale beyond 128 GB?
Is FLUX.2 commercially usable?
How fast is the M5 Max actually?
What happens when the model I'm using gets superseded?
config.yaml; the rest of the studio (agent, tools, skills, memory, pipelines) is unchanged. That is the whole point of the architecture: the model is a swappable component, not the system.Why not Strix Halo or DGX Spark instead of a Mac?
What about the electricity cost?
Why not just a Mac mini M5 Pro instead of a MacBook?
Can I run this on Linux with an NVIDIA workstation?
What if my use case needs more than 256K context?
session_search) instead of stuffing the context; usually higher quality at lower cost. (3) Hermes /compress rolls long sessions into summaries automatically.How does this compare to a $20-50/month Claude or ChatGPT subscription?
hermes proxy lets you use the subscription through the studio for the rare hard job; you get both.