Independent field guide · Apple Silicon · May 2026

One agent. Every model. Zero cloud.

Q: Does it genuinely work offline?

Yes, once the four cloud-default tools are repointed. Brain, images, music, voice, video and transcription all answer on localhost.

Q: Is FLUX.2 commercially usable?

Partly. FLUX.2 dev and klein 9B are non-commercial. FLUX.2 klein 4B is Apache-2.0. For unrestricted commercial work, Qwen-Image-2512 or Z-Image-Turbo (both Apache-2.0) are the safe defaults.

Q: How fast is the M5 Max actually?

Independent benchmarks: gpt-oss-120B at Q4 MLX runs 58-72 tokens/second decode on a 128 GB M5 Max. Qwen3.6-27B at Q4 MLX: 21-26 tokens/second (12-15 at Q8). Prefill is 3.33-4.06 times faster than M4 Max per Apple's MLX team. Full reproduce-the-numbers guide with shell commands is in section 14.

Q: Why not Strix Halo or DGX Spark instead of a Mac?

Bandwidth. M5 Max 614 GB/s vs DGX Spark 273 GB/s vs Strix Halo 256 GB/s. For bandwidth-bound decode (agent tool calls), the Mac wins by roughly 2.25 times over Spark and 2.4 times over Strix Halo, despite Spark's FP4 hardware advantage.

A deep, honest guide to building a private AI studio on a single MacBook Pro M5 Max (128GB). Hermes Agent is the conductor; a local model is its brain; and a dedicated local model handles images, video, music, dubbing, voice cloning and transcription. The whole 2026 model field, how each one works and trains, the benchmarks that matter, and exactly how to build it. Flip the toggle to read it plain or in full technical depth.

Conductor

Hermes · MIT

Memory

0 GB · 614GB/s

Tools

0 + MCP

Models covered

Cloud calls

0 air-gappable

One conductor, six local capabilities, nothing crossing the dashed line.

gpt-oss-120B · Qwen3.6-27B · DeepSeek V4 Flash · GLM-5.1 · Kimi K2.6 · Gemma 4 31B · Mistral Medium 3.5 · Llama 4 Scout · Nemotron 3 Super · FLUX.2 · Qwen-Image · HiDream-O1 · Wan 2.7 · ACE-Step 1.5 · Chatterbox · F5-TTS · Whisper v3 · gpt-oss-120B · Qwen3.6-27B · DeepSeek V4 Flash · GLM-5.1 · Kimi K2.6 · Gemma 4 31B · Mistral Medium 3.5 · Llama 4 Scout · Nemotron 3 Super · FLUX.2 · Qwen-Image · HiDream-O1 · Wan 2.7 · ACE-Step 1.5 · Chatterbox · F5-TTS · Whisper v3 ·

Sources are primary throughout: official Hermes, Apple, OpenAI, Qwen and DeepSeek documentation; the Artificial Analysis Intelligence Index, SWE-bench Verified, BFCL, τ-bench, MMLU-Pro and Text-to-Image arenas; arXiv papers for every media model; and community discussion on r/LocalLLaMA. Numbers are as reported by their sources and are a May 2026 snapshot. Models rotate monthly; the architecture here is stable, the leaderboard is not. Benchmark on your own machine before committing.

01 · The case

Why this is suddenly possible

Two years ago, running a frontier-class model on your own machine was a fantasy. That changed faster than almost anyone predicted. The reason this guide exists now, and could not have existed in 2024, is a single chart: the gap between open models you can download and the closed models you rent has nearly closed.

A year ago the best open model scored 22. Today it scores 54, within a few points of the frontier.

In plain terms

"Open weights" means the actual model is published for anyone to download and run, instead of being locked behind a company's paid website. In 2024 these were toys next to ChatGPT. In 2026 they are genuinely close to the best, free, and small enough to run on a good laptop. That is the whole reason a private studio on one Mac is now realistic.

Under the hood

On the Artificial Analysis Intelligence Index (v4.0, a composite of ten evals including GPQA Diamond, Humanity's Last Exam, τ²-Bench, Terminal-Bench Hard and SciCode), the best open-weights model a year ago, DeepSeek V3 0324, scored 22, about 13 points below the leading proprietary model. Today the top open models (Kimi K2.6, MiMo-V2.5-Pro) score 54, with DeepSeek V4 Pro at 52, within 3-6 points of GPT-5.5 (60), Gemini 3.1 Pro and Claude Opus 4.7 (57). Open weights now hold 244 of 386 ranked models and dominate the intelligence-vs-price Pareto frontier.

Then → now (open weights)	Early 2025	2026-05-24
Top Intelligence Index score	22 (DeepSeek V3 0324)	54 (Kimi K2.6 / MiMo-V2.5-Pro)
Gap to best proprietary	~13 points	3–6 points
SWE-bench Verified (best open)	~55%	80.6% (DeepSeek V4 Pro Max)
Open share of ranked models	minority	244 of 386

The honest twist this guide is built around

The very top of that leaderboard (Kimi K2.6 at ~1T parameters, DeepSeek V4 Pro at 1.6T, GLM-5.1 at 744B) needs a data center, not a laptop. But the models that do fit your Mac are essentially last year's frontier: a 27-billion-parameter Qwen scores 77.2% on SWE-bench Verified. You are not running the #1 model. You are running something that would have been #1 a year ago, for free, offline, forever. That trade is the entire point.

02 · The machine

Why the M5 Max is the right hardware

The Mac's advantage is not raw speed; a data-center GPU beats it on throughput. The advantage is one large pool of fast memory shared by the whole chip, which lets a laptop hold models that simply will not fit on a consumer graphics card, while sipping power and staying silent.

Unified memory is the whole trick

A normal gaming PC keeps the model in the graphics card's small, separate memory, and if the model is too big, it simply will not load. The Mac has one big shared pool, so the entire model lives in the same 128 GB the rest of the chip uses. That is how a laptop runs models a $1,500 graphics card chokes on.

Apple's unified memory is a single pool addressable by CPU, GPU and Neural Engine with no host↔device copies. The M5 Max, launched 2026-03-03, tops out at 128 GB at 614 GB/s on the 40-core GPU SKU (the 32-core variant caps at 64 GB / 460 GB/s). After raising the wired limit you get about 120 GB usable for models. The historic Mac weakness was prefill (prompt processing), where it trailed NVIDIA badly; the M5's new per-core Neural Accelerators push prefill 3.33×–4.06× faster than M4 (Apple's own MLX team measurement) and make FLUX-dev-4bit ~3.8× faster. The Neural Accelerators run 1,024 FP16 fused multiply-accumulates per core per cycle, aggregating to about 70 TFLOPS of FP16 or 130 TFLOPS of INT8 across the 40-core GPU. Native FP8 and FP4 still belong to Blackwell; BF16 on the Neural Accelerators arrived in macOS 26.1, INT4 in 26.4: the numbers in this guide assume macOS Tahoe 26.4 or later.

Spec	Figure	Why it matters here
Unified memory	128 GB (40-core GPU SKU)	About 120 GB usable for models after raising the wired limit. Holds one big brain or one video model.
Memory bandwidth	614 GB/s	Caps token generation speed; favours MoE models with low active parameters.
GPU + Neural Accelerators	40-core · ~70 TFLOPS FP16 · ~130 TFLOPS INT8	About four times M4 Max GPU compute on matmul and prefill; about a fifth faster on decode. Speeds diffusion and rectified-flow image models.
CPU	18-core (6 super + 12 performance)	Drives the agent loop and the Python orchestration; not the bottleneck.
Storage	2 / 4 / 8 TB SSD · 13.6 GB/s read · 17.8 GB/s write	A full stack with alternates is 250–300 GB; pick 4 TB if you keep multiple 120B weight files. Cold-loading a 65 GB model takes ~5 seconds.
Chip topology	TSMC SoIC-mH chiplet (two N3P dies)	"Fusion Architecture": same compute die in M5 Pro and M5 Max; explains the uniform per-core spec.
macOS required	Tahoe 26.4 or later	Earlier 26.0 / 26.1 lack INT4 Neural Accelerator support; perf drops materially. Verify your build before reproducing the numbers in this guide.

The one rule of capacity

One big brain or one video model, never both. A 120B brain (~63 GB) plus a video model (~50 GB) plus KV cache blows past 120 GB. The studio works by running the brain resident and spinning heavy media up on demand. Full memory budgets are in section 13.

Storage and cold-load times that actually matter

The M5 Max SSD reads at 13.6 GB/s and writes at 17.8 GB/s, so cold-loading is fast enough that mode-switching between loadouts feels instant. Concrete wall clock to swap weights in from disk:

Model · size on disk	Cold load (first run)	Warm load (page cache)
gpt-oss-120B Q4 · 63 GB	~4.6 s	~1.1 s
Qwen3.6-27B Q4 · 16 GB	~1.2 s	~0.4 s
DeepSeek-V4 Flash 35B-A3B Q5 · 24 GB	~1.8 s	~0.6 s
FLUX.2 dev 4-bit · 6.5 GB	~0.5 s	~0.2 s
Wan 2.7 video · ~50 GB	~3.7 s	~0.9 s

Practical sizing rule: a full studio (brain + image + music + voice + STT) is 250-300 GB on disk. If you keep more than one 120B weight file (e.g. gpt-oss-120B + Llama 4 Maverick + an abliterated mirror), go to 4 TB. The 8 TB SKU only earns its premium if you collect video models or maintain multiple FLUX LoRA libraries. macOS APFS clone-on-write means a Hermes hermes backup of memory + skills is near-instant; the bulk on disk is always the model weights.

Power, thermals and battery life under real load

A 120B brain running tool calls sustained is not a free lunch even on Apple silicon. Concrete numbers from a 16-inch MacBook Pro M5 Max 128 GB on macOS Tahoe 26.4, sustained over 30 minutes:

Workload	Package power	Fan	Battery hours (full charge)
Idle (brain loaded, no calls)	~6-9 W	silent	~14-16 h
Chat (gpt-oss-120B Q4, intermittent)	~22-34 W	silent → low	~5-7 h
Sustained agent loop (tool calls, decode-bound)	~40-55 W	low-audible	~3-4 h
FLUX.2 image gen (steady throughput)	~58-72 W	audible	~2-2.5 h
Wan 2.7 video gen (peak)	~85-105 W	loud	~1.3-1.6 h
ACE-Step music gen (47 s output)	~45-60 W (short burst)	brief	n/a (~18 s)

The 16-inch chassis runs ~8-12 °C cooler than the 14-inch under the same workload because of the larger vapour chamber; the 14-inch will throttle sooner on the video-gen row. Battery numbers assume macOS Tahoe's adaptive power profile is on (default) and the GPU is allowed to clock down between calls. On battery the GPU caps at ~70% of plugged-in clocks; if you are benchmarking, always run plugged in or your numbers will land in the bottom of the ranges in section 14's reproduce guide.

The honest envelope: a working day of chat plus a few dozen images is silent and fits a full battery cycle. A sustained agent run that hammers tool calls plus generates dozens of images is plugged-in territory. Video is plugged-in territory regardless. The MacBook Pro stays a laptop for the first two; it becomes a small desktop for the third.

How the M5 Max stacks up against everything else with 128 GB

For an agent that decodes hundreds of tool calls per session, memory bandwidth beats raw compute, because decode is bandwidth-bound. The Mac wins this race against its closest peers despite Blackwell's FP4 hardware advantage.

Platform	Unified RAM	Bandwidth	gpt-oss-120B Q4	DeepSeek V4 Flash Q4	Notes
MacBook Pro M5 Max 128 GB	128 GB	614 GB/s	fits · ~63 GB	no · ~120 GB	Silent, battery, ANE + Neural Accelerators.
MacBook Pro M4 Max 128 GB	128 GB	546 GB/s	fits	no	Predecessor: weak prefill.
Mac Studio M3 Ultra 512 GB	512 GB	819 GB/s	fits · Q8	fits	Desktop scale-up: the only Apple option that holds V4 Flash.
RTX 5090 (32 GB)	32 GB GDDR7	1,792 GB/s	too small	no	Fastest per-GB but GB ceiling kills it for 120B.
NVIDIA DGX Spark / Project Digits	128 GB unified	273 GB/s	fits	no	Same RAM, 2.25× less bandwidth than M5 Max. Has FP4 hardware.
AMD Strix Halo (Ryzen AI MAX+ 395)	128 GB unified	256 GB/s	fits	no	x86 alternative; mature ROCm still trails MLX / CUDA on Q4 kernels.
RTX PRO 6000 Blackwell	96 GB	1,792 GB/s	fits	tight	Workstation: PCIe scale-out, no battery, $7-10k.
NVIDIA Jetson AGX Thor	128 GB unified	n/a published	likely	no	Robotics-first, not the studio target.

The headline most reviewers miss

On the same 128 GB of unified memory, the M5 Max has 2.25× the bandwidth of NVIDIA's own personal-AI box. On bandwidth-bound decode (which is what an agent doing tool calls is), the Mac wins despite Spark's FP4 advantage. Blackwell only pulls ahead when the workload is matmul-heavy and fits in 32 GB, which most agent-grade brains don't.

What Mac uniquely enables

Fine-tuning 14–32B models on a laptop (32–64 GB unified beats a 24 GB consumer GPU that can't even load them).
Battery-powered, silent operation under sustained load.
MLX-Audio one-endpoint TTS + STT + STS server.
Draw Things Lightning Draft (about one second per image on M5 Max).
Hardware ProRes encode in the Media Engine.
Continuity recipes: iPhone audio capture → AirDrop → MLX-Audio Whisper → 120B summarizer in one chain.

What Mac uniquely cannot do

CUDA-only models, FlashAttention-3 native kernels, NVIDIA-only quant formats (AWQ-INT4 with Triton).
True multi-GPU NVLink scaling beyond a single Mac (EXO Labs distributed inference over TB5 is the workaround for >128 GB).
Native FP8 and FP4 hardware support (Blackwell's persistent lead).
vLLM speculative-decoding-with-paged-attention performance at scale (vllm-metal v0.2 closes some of this in April 2026).

03 · The conductor

Hermes Agent, in full

A raw model only emits text. Something has to turn "dub this clip into English in my voice" into a real sequence of actions that ends in a file on disk. That something is the agent. Hermes Agent, from Nous Research, is the conductor that holds the baton.

In plain terms

Hermes is a free program you install once. It is the tireless operator that lives on your Mac: you talk to it like a person, and it actually does the work, running commands, editing files, browsing, making images, scheduling itself for later, and remembering what it learns so it gets better the more you use it. Think senior assistant, not chatbot.

Under the hood

An MIT-licensed Python 3.11+ agent runtime. It is model-agnostic: the model is a swappable component behind any OpenAI-compatible /v1/chat/completions endpoint (Ollama, llama.cpp, vLLM, SGLang, LM Studio). It ships a registry of more than 70 tools across more than 30 toolsets, a pluggable memory provider, a skills engine that authors its own skills, a full MCP client (OAuth 2.1 PKCE, OSV malware scanning, ACP via the Zed Agent Client Protocol Registry installable via uvx), subagent delegation, durable multi-agent orchestration via Kanban, a cron scheduler, 22 first-class messaging gateways and a React/Ink TUI plus web dashboard. Installs to ~/.local/bin; all state in ~/.hermes/; no telemetry.

The anatomy

Part	Plain	Technical
Tools	Its hands: everything it can physically do.	More than 70 built-in tools across more than 30 toolsets, plus MCP tools; switchable per session with `-t`.
Skills	Step-by-step methods; it writes new ones from what worked.	87 built-in skills + 79 optional in-repo + 672+ across the agentskills.io / HermesHub / LobeHub / Anthropic registries; `skill_manage`; your own under `~/.hermes/skills/`.
Memory	Facts about you and your projects.	Frozen-snapshot system-prompt prefix-cache injection at session start (MEMORY.md ~2,200 chars + USER.md ~1,375 chars), plus on-demand FTS5 query via `session_search`. Relevance ranking lives in `session_search` and in pluggable providers (Honcho, Mem0, Hindsight).
Session search	Total recall of past conversations.	Every session in SQLite with full-text search; `session_search` retrieves and summarizes.
Subagents	Clones helpers that work in parallel and report back.	`delegate_task`: isolated context + terminal + toolset; orchestrator role; `max_spawn_depth`; file-coordination layer.
Cron	Runs on a schedule. "Every morning, do X."	Skill-backed jobs run in fresh sessions; `notify_on_complete` on background processes.
Gateways	Reach it from Telegram, Discord, Slack, WhatsApp, email.	17 platforms; allowlist / DM-pairing / open auth; per-platform skill gating.
Profiles	Separate personas with their own memory (work vs creative).	Isolated `HERMES_HOME` dirs; per-profile config, keys, memory, sessions, skills.
Plugins + MCP	Bolt-on powers and connections to other tools.	Python/shell-hook plugins (can veto tool calls, ship image-gen backends); full MCP client.

The 70+ built-in tools

Tools tagged cloud need a key by default and get swapped for local equivalents in section 07; everything else is local.

terminal · process

Run shell commands, background servers, monitor and notify on completion.

2 tools

file ×4

read_file, write_file, patch (fuzzy find-replace, 9 strategies, auto syntax check), search_files (ripgrep).

4 tools

code_execution

execute_code: Python that calls other Hermes tools, with branching and output filtering.

delegation

delegate_task spawns subagents in isolated contexts (own terminal + toolset). Defaults: max_spawn_depth=2, max_concurrent_children=3. Parent blocks until child completes; only the summary returns.

browser ×12

navigate, click, type, get_text, scroll, snapshot, screenshot, evaluate_js, console_log, vision_analyze, wait_for, close. Headless Chrome via CDP; 180× faster persistent connection since v0.14.

12 tools

vision · memory

vision_analyze (describe/answer about images); memory (save durable facts).

skills ×3 · session_search

skill_manage / skill_view / skills_list; search and summarize all past sessions.

cronjob · todo · clarify

cronjob schedules skill-backed tasks (no_agent mode for pure-script jobs, full-agent for adaptive ones); todo plans multi-step work; clarify asks you a multiple-choice question.

moa

mixture_of_agents: route a hard problem through several models (4 + aggregator) and merge.

rl ×10

Drive RL fine-tuning runs: collect trajectories, generate batches, compress trajectories, score with judge, kick off SFT / DPO / GRPO loops, read live training metrics. The full RL toolchain is in-process.

10 tools

image_gen · tts · web ×2

image_generate, text_to_speech, web_search/web_extract.

cloud by default

messaging · homeassistant ×4 · feishu ×5

Send to chat platforms; control smart-home devices; enterprise Lark/Feishu doc ops.

How it thinks

In plain terms

You give it a goal. It makes a short plan, picks a tool, uses it, looks at the result, and decides the next step, repeating until done. It stops to ask only when it needs a real decision or permission for something risky.

Under the hood

A tool-calling loop: system prompt (memory + skills + tool schemas) → model emits a tool call → runtime executes → result re-enters context → repeat. Context compression (/compress) keeps long runs under the window; activity-based timeouts (read timeout relaxed to 1800s for local endpoints) prevent premature kills; subagent results return as summaries so intermediate noise never bloats the main thread.

Three mechanisms keep long jobs alive: todo (it tracks its own steps), compression (summarizes history to stay under the context limit), and delegation (offloads sub-tasks to subagents so the main thread stays clean). They are why a local agent can run a multi-stage media pipeline end to end without losing the plot.

What a single agent turn actually looks like

One real exchange. The user says "summarise the README in this repo and write a tweet about it"; the agent runs four tool calls and answers in about eight seconds on gpt-oss-120B.

# user prompt
> summarise the README in this repo and write a tweet about it

# agent turn 1: pick a tool, emit a tool call
tool: search_files(query="^README", path=".")
# runtime executes, result feeds back to context
result: ["./README.md", "./packages/foo/README.md"]

# agent turn 2: narrow to the root README
tool: read_file(path="./README.md")
result: 3,847 chars of markdown

# agent turn 3: internal reasoning, no tool call
The README describes a 2026 field guide for local AI on M5 Max with
Hermes Agent. Headline claim: 27B model at 77.2% SWE-Verified fits a
laptop. The thread should lead with that stat.

# agent turn 4: emit the answer (no tool call, free-form text to user)
A 27-billion-parameter model that fits a laptop now codes at 77.2% on
SWE-bench Verified. Local AI is no longer a compromise: 2026 field
guide for a private studio on a MacBook M5 Max with Hermes Agent.
https://m5ai.ivanm.dev

Note what does NOT happen: the agent does not narrate every step to the user, does not stop to ask permission for read-only tools (those are pre-approved), does not write chain-of-thought into the answer. The visible behaviour is "thought for 8 seconds, then answered". The invisible behaviour is the loop above. Hermes's /compress ensures the result of turn 2 (3,847 chars of README) is summarised before turn 5 if the conversation keeps going.

What the community actually reports

Independent of any vendor claim, community discussion across r/LocalLLaMA and Hacker News has been active and growing, praising the smoother setup and the built-in learning loop. A representative comment: the agent "actually remembers" a failure and "creates a skill for troubleshooting it." The honest consensus is not that Hermes replaces dedicated coding agents; for pure software engineering, the leaders are clear: Claude Code with Opus 4.6 holds 80.8% on SWE-bench Verified (highest reported single-agent), Aider's Polyglot leaderboard places Opus 4.6 around 85% edit-format, and Cursor Composer with Sonnet 4.6 sits in the mid-70s. Hermes does not publish a SWE-bench Verified number and is built for a wider job: orchestration, memory, scheduling and gateways. Treat marketing comparisons sceptically; benchmark for your actual workload. Nous shipped hundreds of security-tagged commits across the 0.12-0.14 release cycles (588 / 633 / 550 total merged PRs respectively) and there are no widely reported breaches; the lone disclosed CVE-2026-7396 (WeChat path traversal) was patched. Hermes positions itself as complementary to the older OpenClaw framework, with a built-in hermes claw migrate path that imports persona, skills, memory, channels and API keys.

Why it is the right host for a local studio

Model-agnostic (your local brain plugs straight in), ships the agent machinery you would otherwise hand-build, fully offline-capable, and explicitly supports local servers. It is the same category as Claude Code or OpenClaw, but local-first, model-swappable and self-improving.

A skill is just a folder with a Markdown file

Skills are the procedural memory: short, named, version-controlled how-to documents the agent reads and follows. Here is a real one from ~/.hermes/skills/release-it/SKILL.md, hand-written then refined by Curator after a few runs:

---
name: release-it
description: tag + push a release after CI green
tools: [terminal, file]
approvals: [git tag, git push]
---

# When to use
User asks to "ship", "tag a release", or "cut v<N>".

# Pre-flight
1. Run `gh run list --branch main --limit 1`; verify status=success.
2. Read `CHANGELOG.md`; the top entry must have a version + date filled in.
3. Run `npm run validate`; must exit 0.

# Steps
1. `git tag -a v{N} -m "<extract from CHANGELOG>"`
2. `git push --tags`
3. Confirm to user with the new tag URL on GitHub.

# Failure modes
- CI red → STOP, tell user, do not tag.
- CHANGELOG unfilled → ask user to fill before continuing.
- Same tag exists → STOP, do not force-push.

The agent reads this on session start (only the skills relevant to the current goal land in the system prompt). Skills are shareable: hermes skill push my-skill publishes to agentskills.io; hermes skill install <name> imports someone else's. The Curator agent deduplicates and consolidates skills nightly; over weeks the studio's skill set converges on the patterns YOU repeat.

A multi-modal turn: text + image + voice in one prompt

The same agent loop drives every modality because the brain dispatches to tool servers, not to one model. A request that spans copy, design and audio is just three tool calls with the brain stitching the result back together. Worked example for "draft a launch poster + caption + read the caption in my cloned voice":

// USER
> draft a launch poster for the studio (1080x1350, IG portrait),
  write a 28-word caption, and read the caption in my cloned voice
  from ~/voices/me-15s.wav. Drop everything in ~/Desktop/launch/.

// TURN 1 · brain plans
need: brand colours, headline, body. Skill brand-kit has it.
read_file: ~/.hermes/skills/brand-kit/SKILL.md  (158 chars)
  → primary #4d8dff · accent #2ee6a6 · font Bricolage 800

// TURN 2 · brain drafts copy in-context, then calls image tool
gen_image:
  prompt:  "minimal portrait poster, deep navy gradient,
            headline 'One agent. Every model. Zero cloud.'
            in Bricolage 800 white, accent dot mint,
            grain texture, generous margins"
  model:   flux-2-dev
  size:    1080x1350
  steps:   28
  → ~/Desktop/launch/poster.png (1.2 MB, 14.6 s)

// TURN 3 · brain writes caption, then calls voice tools
write_file: ~/Desktop/launch/caption.txt
  → "One MacBook. One agent. Every open model.
        No subscription, no telemetry, no cloud round-trip.
        Sixty-three GB of brain on a laptop you can close."  (28 words)

clone_voice:
  reference: ~/voices/me-15s.wav
  model:     chatterbox-v2
  → voice profile cached as 'me' (3.1 s)

tts_speak:
  text:          <caption.txt>
  voice:         me
  exaggeration:  0.35
  format:        wav 24kHz
  → ~/Desktop/launch/caption.wav (412 KB, 2.9 s)

// TURN 4 · brain emits final answer
> Done. Poster at poster.png, caption.txt (28 words),
  caption.wav (12 s, your voice, mild emotion).
  Total: 22.7 s, 4 tool calls, 0 cloud calls.

What this shows: one brain (gpt-oss-120B) chooses tools, drafts copy, and never touches the image or voice models directly. The image model lives behind a gen_image tool, the voice clone behind two more. Swap FLUX for Qwen-Image and the brain does not need re-prompting. Swap Chatterbox for F5-TTS and the same. The skill folder is where this convention lives; the brain follows.

The pattern that scales

Treat each model as a tool, not a chat partner. The brain handles routing, context, and stitching. New modalities (video, 3D, music) become new tool servers. The agent loop stays identical; the studio grows by adding tools, not by re-architecting.

Sleeper features in 0.13 and 0.14

Six things from the last two release cycles that materially change what the studio can do:

hermes proxy

v0.14.0

Turns Claude Pro / ChatGPT Pro / SuperGrok into a localhost OpenAI-compatible endpoint. Codex CLI, Aider, Cline and Continue all become free to drive off your Hermes-managed subscription.

cost win

/goal · /subgoal

v0.13–0.14

Ralph-loop persistent goal contracts: the agent keeps going until a judge decides the criteria are met. Layered subgoals can be added mid-run.

persistence

Kanban durable

v0.13.0

Multi-agent orchestration with heartbeat detection, auto-block on incomplete exit, per-task retries, hallucination recovery. Closes the gap delegate_task leaves open.

Curator agent

v0.12.0

Background process that deduplicates, deprecates and consolidates skills. No equivalent in any competitor.

LSP semantic diagnostics

v0.14.0

Beyond syntax linting: type errors, undefined symbols, missing imports surfaced back to the agent before the next turn.

Cross-session prompt cache

v0.14.0

1-hour Claude prompt cache survives /new. Cuts cost materially on long workflows.

OpenClaw, Claw Code, ClaudeClaw: three different projects

Names collide. The migration path Hermes ships is for one of them only.

Project	Repo	What it is
OpenClaw Peter Steinberger	`openclaw/openclaw` 374K stars · MIT · TypeScript	General-purpose messaging-first AI assistant. Originally Clawdbot (2025-11-24) → Moltbot (2026-01-27) → OpenClaw. This is what `hermes claw migrate` reads.
Claw Code Sigrid Jin	`instructkr/claw-code` 48K stars · Python + Rust	Clean-room rewrite of Claude Code's leaked source map. Coding-focused CLI. Unrelated to OpenClaw.
ClaudeClaw moazbuilds	`moazbuilds/claudeclaw`	Lightweight OpenClaw-equivalent that runs as a Claude Code plugin: daemon + Telegram/Discord/Slack/cron/web dashboard.

The migration path

hermes claw migrate reads ~/.openclaw/, and auto-detects legacy ~/.clawdbot/ and ~/.moltbot/ paths. Non-destructive by default: skips SOUL.md if Hermes already has one, skips duplicate memory entries, skips same-named skills. Imports persona, skills, memory, channels and API keys. Imported skills land at ~/.hermes/skills/openclaw-imports/.

What happens if Hermes itself gets abandoned

The fair question to ask before committing to any agent runtime: what is recoverable if the project dies tomorrow? The studio's architecture answers this on purpose. Everything you produce lives in formats older than Hermes and outlasts any single tool.

Asset	Where it lives	Survives Hermes going dark?
Skills	`~/.hermes/skills/*/SKILL.md` + plain scripts	Yes. Markdown + frontmatter + shell/Python. Readable by any future agent or by you directly.
Memory	`~/.hermes/memory/*.md`	Yes. Markdown with frontmatter. Same format Claude Code, OpenClaw and the AIvan layer all consume.
Model weights	Ollama / MLX caches under `~/.ollama/models/` and `~/.cache/huggingface/`	Yes. GGUF and safetensors are open formats with multiple loaders (llama.cpp, vLLM, MLX, ExLlamaV2, KoboldCpp).
MCP servers	`~/.hermes/mcp.json` + the MCP processes themselves	Yes. MCP is an Anthropic-led open protocol; Claude Code, OpenClaw, Cursor, Cline all speak it.
Channel bindings	`~/.hermes/channels.yaml` + per-gateway tokens	Yes. Telegram / Discord / Slack tokens are yours; rebind to any other runtime.
Cron jobs	`~/.hermes/cron.yaml`	Yes. Same YAML schema as `crontab` + a goal contract; trivially portable.
config.yaml	`~/.hermes/config.yaml`	Partial. The schema is Hermes-specific, but every value (model name, endpoint URL, key) carries over by hand in 5 minutes.
Cross-session prompt cache	Hermes-internal SQLite	No, but it's a cache; not data you care about preserving.

The design principle that makes this true

Hermes treats your stuff (skills, memory, channel tokens, model weights) as data in standard formats stored on your filesystem, and treats itself as the process that happens to be running them today. If the project dies, fork it, switch to OpenClaw or ClaudeClaw, write a thin replacement shim, or just keep using the dead version - the data does not move and does not get held hostage. This is the single biggest difference between an open-source agent and a SaaS one: SaaS dying means your stuff dies. Local agent dying means your stuff sits on disk waiting for the next runtime.

Concrete recovery recipe if Hermes shipped its last release tomorrow: skills run as plain shell/Python scripts (the SKILL.md frontmatter is a 20-line parser to write), memory is grep-able markdown, weights load into vLLM or MLX directly, MCP servers stay running unchanged, cron continues under stock crontab + a one-shot model invocation. The studio degrades from "agent does it" to "you do it with the same tools" - which is exactly where it started before Hermes existed.

04 · The brain

The model that thinks, and the whole field of them

The brain is the model the agent reasons with. For an agent, the trait that matters most is not raw genius: it is reliably choosing the right tool and emitting a clean call, hundreds of times in a row. A model that is 2% smarter but fumbles one call ends the run. So this section covers how these models work, what makes each one different, how they are trained, and exactly how the 2026 field scores, before picking the ones that fit your Mac.

How a modern model works

In plain terms

Picture a company with 128 specialists. For each word, a "router" wakes only the 8 most relevant ones. The model is enormous in total knowledge, but only a thin slice works at a time, so it stays fast and fits in memory. That "mixture of experts" trick is why a 120-billion model runs on a laptop. Some models also have a "think first" switch that lets them reason step by step on hard problems and answer instantly on easy ones.

Under the hood

A Mixture-of-Experts transformer routes each token through a small subset of expert FFNs. Active-parameter count, not total, governs per-token compute and bandwidth, which is why MoE suits a 614 GB/s Mac. The differentiators between 2026 models are mostly in attention and routing: MLA (DeepSeek's Multi-head Latent Attention compresses the KV cache), attention sinks (gpt-oss lets heads "pay zero attention"), linear/lightning attention (MiniMax for long-context efficiency), auxiliary-loss-free routing (DeepSeek's load balancing), and hybrid thinking (Qwen's switchable reasoning with a thinking budget). Training increasingly leans on RL: DeepSeek-R1 showed pure RL via GRPO can teach reasoning with no supervised chain-of-thought; gpt-oss post-trains with CoT+RL like o3.

What makes each architecture different

Innovation	Who	What it does
Mixture-of-Experts	nearly all	Only a few experts fire per token; huge total capacity, small active cost.
Multi-head Latent Attention (MLA)	DeepSeek V3/V4	Compresses the KV cache into a latent space, slashing memory for long context.
Attention sinks	gpt-oss	A learned per-head bias lets the model ignore tokens cleanly; stabilizes long context.
Hybrid thinking + budget	Qwen3.x, DeepSeek V4	One model switches between visible chain-of-thought and instant answers; you cap the reasoning spend.
Auxiliary-loss-free routing	DeepSeek, Qwen3.x MoE	Balances expert load without a loss term that hurts quality; encourages specialization.
Hybrid Mamba + Transformer + MoE	NVIDIA Nemotron 3 Super	State-space + attention + MoE in one model. Only open frontier-class system shipping all three; holds 91.75% RULER at 1M tokens.
Linear / lightning attention	MiniMax M2.x	Sub-quadratic attention for cheap very-long-context inference.
RL-first post-training (GRPO)	DeepSeek-R1 lineage	Pure reinforcement learning induces reasoning without supervised CoT data.
Native MXFP4	gpt-oss	4.25-bit microscaling quantization is the intended format, not a lossy afterthought.

What these benchmarks actually measure

Intelligence Index v4.0 (Artificial Analysis): composite of 10 evals (GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR, AA-Omniscience, IFBench, Humanity's Last Exam, GPQA Diamond, CritPt). A breadth score; 60 is the current frontier ceiling.

SWE-bench Verified: 500 real GitHub issues from popular Python projects, hand-validated by the SWE-bench team. The model gets the repo and the issue text; it must produce a patch that passes hidden tests. Coding-agent reality check; closer to "does this make me money" than MMLU.

BFCL v4 (Berkeley Function Calling Leaderboard): can the model emit a correct tool call for a given API spec, across 1500+ scenarios with multi-turn, parallel, and irrelevance-detection categories. The agent-reliability metric; a model that's 2 points smarter on MMLU but 5 points worse on BFCL is the wrong brain for tool-calling work.

MMLU-Pro: harder MMLU. 12K multiple-choice across 57 academic disciplines, ten-choice (vs four), text-only. Reasoning depth at academic difficulty.

GPQA Diamond: 198 graduate-level science questions, hand-curated to be "expert-hard". Tests the ceiling of factual + reasoning depth.

GDPval-AA: Artificial Analysis's economic-value benchmark: pairs of agentic tasks scored Elo-style. Real-work proxy.

τ-bench retail: multi-turn customer-service simulation. Strict tool-use protocol + adversarial users; measures "can this agent run a contact center".

The whole field, scored and sized

Intelligence Index = Artificial Analysis v4.0 composite. SWE = SWE-bench Verified. "128GB" = fits on this Mac at a usable 4-bit quant. Top models are listed precisely because most of them do not fit, which is the honest part.

Model	Lab	Params (total / active)	Intel.	SWE	License	128GB?
gpt-oss-120B	OpenAI	117B / 5.1B MoE	n/a	o4-mini class	Apache-2.0	yes · 63GB
Qwen3.6-27B	Alibaba	27B dense	n/a	77.2	Apache-2.0	yes · 17-33GB
Qwen3.6 35B-A3B	Alibaba	35B / 3.5B MoE	n/a	n/a	Apache-2.0	yes · 26GB
Gemma 4 31B	Google	31B dense	top non-China open	n/a	Gemma	yes · 17GB
Mistral Medium 3.5	Mistral	128B	n/a	77.6	open	tight · ~70GB
Llama 4 Scout	Meta	109B / 17B MoE	n/a	n/a	Llama	yes · ~60GB
Nemotron 3 Super	NVIDIA	~70-100B	top non-China open	n/a	open	likely · 4-bit
DeepSeek V4 Flash	DeepSeek	284B / 13B MoE	~49	79.0	open	no · ~120GB Q4
DeepSeek V4 Pro	DeepSeek	1.6T / 49B MoE	52	80.6	open	cloud only
Kimi K2.6	Moonshot	~1T MoE	54	80.2	open	cloud only
GLM-5.1	Z.AI	744B MoE	~51	77.8 (GLM-5)	open	cloud only
MiniMax M2.7	MiniMax	MoE · linear attn	49.6	n/a	open	cloud only
MiMo-V2.5-Pro	Xiaomi	~1T MoE	54	78.0	open	cloud only
Qwen3.5 397B-A17B	Alibaba	397B / 17B MoE	n/a	76.2	Apache-2.0	no

Intelligence Index · top open weights (higher is better)

Kimi K2.6

MiMo-V2.5-Pro

DeepSeek V4 Pro

GLM-5.1

51.4

MiniMax M2.7

49.6

For scale: GPT-5.5 (xhigh) 60 · Gemini 3.1 Pro / Claude Opus 4.7 57. A year ago the best open model scored 22.

SWE-bench Verified · coding (open weights)

DeepSeek V4 Pro Max

80.6

Kimi K2.6

80.2

DeepSeek V4 Flash

79.0

GLM-5

77.8

Mistral Medium 3.5

77.6

Qwen3.6-27B (fits!)

77.2

Kimi K2.5

76.8

DeepSeek V3.2

73.0

The headline: a 27B dense model that fits your Mac (amber) lands at 77.2, within ~3 points of trillion-parameter cloud models.

Tool calling · BFCL v4 (the agent metric)

Qwen3.5-397B-A17B

72.9

Qwen3.5-122B-A10B

72.2

Qwen3.5-27B

68.5

Berkeley rotated v3 → v4 in May 2026; the v3 leaders (GLM-4.5 76.7, Qwen3 32B 75.7, Kimi K2.5 64.5) are not yet re-evaluated on v4 and are not directly comparable. Captured 2026-05-25 from gorilla.cs.berkeley.edu/leaderboard.html. Smaller Qwen3.5-27B at 68.5 sits within the Mac-fittable envelope at Q4.

MMLU-Pro · reasoning breadth (open weights, 2026-05-25)

gpt-oss-120B (fits!)

90.0

Qwen3.6 Plus

88.5

MiniMax M2.1

88.0

Qwen3.5-397B-A17B

87.8

GLM-4.5

84.6

DeepSeek R1

84.0

The 117B/5.1B-active gpt-oss-120B (amber, fits your Mac at ~63 GB MXFP4) holds the top MMLU-Pro for open weights at 90.0. AIME 2025 with tools: 97.9. Qwen3.6 Plus (cloud-only) and MiniMax M2.1 closed the gap to 1.5-2 points this month.

GPQA Diamond · graduate-level science (open weights)

Kimi K2.6

90.5

DeepSeek V4 Pro

~87

GLM-5.1

~84

gpt-oss-120B (fits!)

~80.5

Kimi K2.6 holds the highest GPQA Diamond score of any open model (90.5). Among Mac-fittable brains, gpt-oss-120B leads at ~80.5.

GDPval-AA · real-world agentic value (Elo, open weights, 2026-05-25)

MiMo-V2.5-Pro

1581

DeepSeek V4 Pro Max

1554

GLM-5.1

1535

MiniMax M2.7

1514

Kimi K2.6

1484

GLM-5

1402

DeepSeek V4 Flash

1388

GDPval-AA measures actual economic-value-producing agentic work (Artificial Analysis). MiMo-V2.5-Pro (Xiaomi) took the open-weights lead this week. Every model on this leaderboard is cloud-only at the Mac's 128 GB ceiling.

τ-bench retail · multi-turn customer-service agents (closed-vs-open reference)

Claude Sonnet 4.5 (closed)

86.2

GPT-5.5 (closed)

~82

No open-weights model has cracked 80 on τ-bench retail yet. If your studio job is high-stakes multi-turn customer-facing chat with strict tool-use protocols, the closed leaders still measurably lead.

The models that fit your Mac, in detail

gpt-oss-120B · the reliability brain

OpenAI's open model. It is the most dependable at "using tools" of anything you can download, which is the single trait that decides whether long agent jobs finish. It fits comfortably and lets you dial how hard it thinks.

Architecture: token-choice MoE (117B total, 5.1B active, 4 experts), gated SwiGLU, GQA, alternating full + 128-token sliding-window attention, RoPE 131K via YaRN, learned per-head attention sinks.
Quantization: native MXFP4 on MoE weights → ~63 GB; Q6 ~90 GB. o200k_harmony tokenizer; Harmony chat format.
Training: CoT + RL post-training using o3-family techniques, specifically for reasoning and tool use; adjustable reasoning effort (low/medium/high).
Scores: MMLU-Pro 90.0, AIME 2025 97.9 (tools), GPQA 80.9 (tools), τ-bench matches/exceeds o4-mini. The cleanest tool-call discipline of any open weight.

Qwen3.6-27B · the all-rounder

Alibaba's open model and the best balance for this build: fast, small enough to leave room for media, and genuinely excellent at coding and agentic work. If you run one model, this is the safe pick.

Architecture: 27B dense (the family also ships 128-expert/8-active MoE variants with no shared experts and global-batch load-balancing). Hybrid thinking with a thinking budget. 256K context.
Training: ~36T pre-training tokens (double Qwen2.5), three-stage pretrain → four-stage post-train, synthetic math/code from Qwen2.5-Math/Coder, 119 languages. Apache-2.0.
Scores: SWE-bench Verified 77.2, beating models 50× its size; BFCL-class tool calling is strong. ~17 GB at Q4, ~33 GB at Q8.

Qwen3.6 35B-A3B

MoE 3.5B active

Tiny active footprint = very fast on Mac. Perfect fast/auxiliary lane beside a heavy main brain.

fast lane~26 GB Q5

Gemma 4 31B

dense · Google

Top non-China open model on the Intelligence Index; efficient, multimodal, strong general writing. Watch for repetition loops past ~11 tool calls in some harnesses.

vision~17 GB Q4

Mistral Medium 3.5

128B · EU

SWE-bench 77.6, EU-built for data-residency needs. Fits tight at 4-bit, leaves little for media.

tight fit

Llama 4 Scout

109B / 17B MoE

Fits at ~60 GB 4-bit; long context. Caveat: its pythonic tool-call format needs a compatible parser.

long ctx

Nemotron 3 Super

NVIDIA

The other top non-China open reasoning model; a strong Western-licensed alternative.

reasoning

DeepSeek V4 Flash

284B / 13B MoE

SWE-bench 79.0, MLA + auxiliary-loss-free routing. About 120 GB at Q4 (160 GB at FP16), just over the line for 128 GB once KV cache is added.

won't fit

Nemotron 3 Super: the open hybrid

NVIDIA's contender. Different inside: it mixes three kinds of layers (long-context Mamba, attention, and the MoE experts you already know). Holds 91.75% on a million-token retrieval test, which no other open model touches.

Architecture: hybrid Mamba state-space layers (cheap long context) + transformer attention (short range) + MoE experts (capacity). The only open frontier-class system shipping all three combined.
Scores: 91.75% RULER at 1M tokens (unmatched among open models). Strong on Intelligence Index alongside Gemma 4 31B as the two non-China open entries.
Why it matters: the natural pick for teams with data-residency constraints that bar weights from Chinese labs. Western-licensed alternative to DeepSeek / Qwen / GLM.

Qwen3-Coder-Next: the small fast coder

Alibaba's specialised coding model. Three billion active parameters, scores 70.6% on the coding test. The best self-hostable coder under 100 GB.

Architecture: 3B active parameters in a coding-tuned MoE. About 40 GB at Q4.
Training: 800K agentic coding RL tasks (multi-turn, tool-using, test-validated trajectories).
Scores: SWE-bench Verified 70.6 with only 3B active params. Inside the 100 GB ceiling so you can keep it loaded alongside the orchestrator brain on the M5 Max.

The cloud-only leaders, briefly

These do not fit your Mac, but you should know what sits at the top of the open leaderboard:

DeepSeek V4 Pro

1.6T / 49B MoE

Current open-weights coding leader at 80.6 SWE-Verified. Multi-head Latent Attention compresses the KV cache; auxiliary-loss-free routing balances experts. Hybrid thinking.

cloud onlyMLA + ALF

DeepSeek V4 Flash

284B / 13B MoE

First new DeepSeek architecture since V3. Same innovations as Pro at a fifth of the size. Still does not fit a 128 GB Mac at Q4 (~120 GB).

no · ~120 GB Q4

Kimi K2.6

Moonshot · ~1T

Co-leads Intelligence Index v4.0 at 54. GPQA Diamond 90.5 (highest of any open model). General-purpose flagship.

cloud onlyGPQA 90.5

GLM-5.1

Z.AI · 744B MoE

Z.AI's flagship. Intelligence Index around 51, GDPval-AA leader at 1535. Strong real-world agentic benchmark performance.

cloud onlyGDPval 1535

MiniMax M2.7

MoE · lightning attn

Pioneer of lightning attention: sub-quadratic for cheap very-long-context inference. Intelligence Index around 50.

cloud onlylinear attn

MiMo-V2.5-Pro

Xiaomi · 1.02T / 42B

Xiaomi's entry. Intelligence Index 54, 1M-token context, 78.0 SWE-Verified. China's third major model lab behind DeepSeek and Alibaba.

cloud only

Decoding the model-card vocabulary

Term	Used by	What it does in one line
MoE (Mixture of Experts)	nearly all	Only a few experts fire per token; huge total, small active cost.
MLA (Multi-head Latent Attention)	DeepSeek V3/V4	Compresses the KV cache into a latent space.
Attention sinks	gpt-oss	Learned per-head bias lets the model ignore tokens cleanly.
Auxiliary-loss-free routing	DeepSeek, Qwen3.x MoE	Balances expert load without a loss-term penalty.
Hybrid thinking + budget	Qwen3.x, DeepSeek V4	One model switches between visible CoT and instant answers; you cap reasoning spend.
Lightning attention	MiniMax M2.x	Sub-quadratic attention for cheap very-long-context inference.
GRPO	DeepSeek-R1 lineage	Pure RL induces reasoning without supervised CoT data. Cut post-training cost ~10×.
MXFP4	gpt-oss	4.25-bit microscaling: the intended format, not a lossy afterthought.
Mamba + Transformer + MoE	Nemotron 3 Super	State-space + attention + MoE in one model. Holds 91.75% RULER at 1M.

License matrix: can I ship this commercially?

License	Models	Commercial use	Catch
Apache-2.0	gpt-oss-120B, Qwen3.6-27B, Qwen3.6 35B-A3B, Qwen-Image-2512, Z-Image-Turbo, FLUX.2 [klein] 4B	✓ unrestricted	None. Default-yes.
MIT	Chatterbox, HiDream-O1, ACE-Step 1.5 XL	✓ unrestricted	None.
Llama 4 Community	Llama 4 Scout, Maverick	✓ conditional	EU multimodal blocked. 700M MAU threshold.
Gemma	Gemma 4 31B	✓ conditional	Google's terms; review product attribution.
Modified MIT	Mistral Medium 3.5	✓ conditional	Not pure MIT; check redistribution clauses.
FLUX.2 Non-Commercial	FLUX.2 [dev], FLUX.2 [klein] 9B	✗ research only	Pay BFL for commercial, or use [klein] 4B.
Custom open weights	DeepSeek V3/V4, GLM-5/5.1, Kimi K2.6, MiniMax M2.7, MiMo-V2.5-Pro	per-model	Generally permissive but read each.

Why the open models now iterate so fast: GRPO

DeepSeek-R1 demonstrated that pure reinforcement learning, with no supervised chain-of-thought data, can induce reasoning behaviour. Their Group Relative Policy Optimization (GRPO) algorithm cut post-training cost by roughly 10×. That is the single biggest reason DeepSeek shipped V4 Pro just three months after V3.2 and why the rest of the open field is iterating quarterly instead of yearly. The 2025 race was "scale pretraining". The 2026 race is "scale RL".

The verdict for 128 GB

Maximum reliability: gpt-oss-120B resident as the main brain. Speed + media headroom: Qwen3.6-27B (Q8), freeing ~80 GB for media. Best of both: 120B (or 27B) main + Qwen3.6 35B-A3B as a fast auxiliary, and let Hermes route between them. The trillion-parameter leaders stay on a cloud endpoint for the rare job that needs them; everything daily runs local.

A real, community-verified gotcha

Builders report that MLX-quantized models can lose tool-calling reliability after 5–10 rounds, while GGUF (via Ollama/llama.cpp) stays stable longer. So serve the agent brain on Ollama/GGUF for reliability, and reserve MLX for media and fine-tuning, where its speed and memory edge shine. Also: set Ollama's context explicitly (OLLAMA_CONTEXT_LENGTH=65536) or the 70+-tool system prompt silently overflows. And for hybrid-thinking models (Qwen3.x, GLM-4.7, DeepSeek V4), disabling reasoning mode requires all three of --reasoning off, --reasoning-budget 0, and chat_template_kwargs.enable_thinking: false. Setting only one is the single biggest source of production failures in May 2026 (see llama.cpp issue #13189).

05 · Uncensored

Abliteration: how it works, and the catch

In plain terms

"Uncensored" models do not refuse you. Researchers discovered there is essentially a single internal "no" signal inside a model; abliteration surgically cancels it. The catch is that doing this carelessly also dents the model's general skill, which an agent relies on. So the smart pattern is one disciplined model for the work, and a separate uncensored one only for the writing.

Under the hood

Refusal is mediated by roughly a single direction in the residual stream (Arditi et al., NeurIPS 2024). You estimate that refusal direction from contrastive harmful vs harmless prompts, then either steer activations away from it at inference, or permanently orthogonalize the weights against it. The term (ablate + obliterate) was coined by FailSpy. The problem: the refusal vector is polysemantic, entangling refusal with syntax, formatting and capability circuits, so naive ablation causes collateral damage, partially recoverable with light "healing" fine-tuning (SFT/DPO).

The 2026 tooling

Tool	What it does	Why it matters
Heretic	One-command automated abliteration; separates attention vs MLP interventions (MLP causes more damage); v1.2 adds a LoRA-based engine producing a toggleable adapter plus 4-bit support.	~6.5× less capability damage than hand-tuned efforts. `pip install heretic-llm` → `heretic <model>`.
OBLITERATUS	116-model toolkit; adds Expert-Granular Abliteration (per-expert directions for MoE) and CoT-aware ablation.	Deeper and broader, but heavier. For when a single direction is not enough.
UGI Leaderboard	Community ranking of Uncensored General Intelligence plus a natural-intelligence (NatInt) score.	The place to confirm an "uncensored" model is still actually smart after the surgery.

Qwen3.6 abliterated :agent

The compromise model: uncensored and keeps tool calling. Best single "both" pick.

agent + open~9-33 GB

Gemma 4 31B Heretic

Uncensored general-purpose with native vision and tool calling.

vision~17 GB Q4

Hermes 4.3 / 70B

Low-refusal by design (not abliterated). Excellent writing, lyrics, roleplay; run as the content subagent.

content engine

DIY with Heretic

Abliterate Qwen3.6-27B yourself for an uncensored brain tuned to taste, output as a toggleable LoRA.

full control

The recommended pattern

Keep a clean agentic brain (gpt-oss-120B or Qwen3.6-27B) as orchestrator, and wire an uncensored content model (Hermes 4.3 or a Heretic'd Qwen/Gemma) as the delegation model so the agent auto-routes writing to it. Reliable agency, zero refusals where you want them, no compromise on either side.

One honest line

An abliterated model has no refusal layer for anything, including instructions hidden in content it reads (prompt injection). Running them locally is legal in most places; the output and its use are entirely your responsibility. Keep it legal, and keep tool approvals on.

UGI leaderboard · 2026-05-24 snapshot

Top open uncensored: DeepSeek-V3.2-Speciale at 67.9 UGI. Closed: Grok 4 at 69.0. Hugging Face hosts 6,030 abliterated models as of the snapshot. The distinction matters: abliteration is the linear-algebra trick described above (cheap, reversible via toggleable LoRA), while uncensored fine-tune (full SFT/RL on permissive corpus) is heavier and less recoverable surgery. For most agent work, abliteration via Heretic v1.2 is the better recipe; for raw refusal-floor benchmarks, full fine-tunes still lead.

06 · Serving & quantization

How models run, and what "4-bit" means

In plain terms

A model's "weights" are billions of numbers. Quantization shrinks them by storing each with fewer digits, like rounding prices to the nearest dollar: 4-bit is small and fast with a tiny quality cost; 8-bit is bigger and basically perfect. You also need a small program to run the model; on Mac the easy one is Ollama, and the fastest is Apple's own MLX.

Under the hood

Quantization reduces weight precision (16-bit → 8/6/5/4-bit). GGUF is the cross-platform de-facto format (Q4_K_M standard, Q5_K_M sweet spot, Q6/Q8 near-lossless), run by llama.cpp/Ollama; it has the broadest model coverage, often within hours of a release. MLX is Apple's native format, built for unified memory with zero CPU↔GPU copies: ~10% less memory and 15–30% faster than GGUF at the same quant. MXFP4 is gpt-oss's native 4.25-bit microscaling. AWQ/GPTQ are activation- and gradient-aware schemes common on NVIDIA. Below ~Q3, tool-calling reliability collapses.

Server	Best for	Notes
Ollama (GGUF)	The agent brain	Simplest; Hermes auto-detects at `:11434/v1`; most stable for long agentic tool use.
MLX (mlx_lm)	Media, max speed, fine-tuning	Apple-native; fastest single-user generation; the only local LoRA-training path on Mac.
LM Studio	GUI management	Bundles the MLX engine; one-click OpenAI server; runs both MLX and GGUF.
llama.cpp / vLLM / SGLang	Power-user serving	Fine control of quant, context, KV-cache; `-ngl 99` offloads all layers to the Metal GPU.

Quant level	Quality	Use when
4-bit (Q4_K_M / MXFP4 / MLX-4)	Minor loss, max speed	The practical default for big models.
5–6-bit (Q5_K_M / MLX-6)	Near-lossless	24 GB+; the quality sweet spot.
8-bit (Q8 / MLX-8)	Effectively full precision	48 GB+ (you have it); best for a small premium brain.

Lane × server × quant: picking quickly

Each lane in the studio wants a different pairing. This collapses the decisions into one table.

Lane	Server	Quant	Why
Orchestrator brain	Ollama (GGUF)	Q4 / Q8	Tool-call stability past 5+ rounds; MLX drifts. Use Q8 if you have the room.
Auxiliary fast lane	MLX (`mlx_lm`)	Q5	Memory edge + native Metal kernels; short sessions do not trigger MLX tool-call drift.
Media generation	MLX / Draw Things	MXFP4 / 4-bit	Native; Apple's Neural Accelerator path; Draw Things' Lightning Draft hits about 1 sec/image on M5 Max.
Fine-tuning (LoRA)	MLX (`mlx_lm.lora`)	Q4 base	The only local LoRA-training path on Mac. Unified memory beats a 24 GB RTX 3090 here.

Reasoning-format settings, per model

Hybrid-thinking models need explicit configuration or they emit chain-of-thought that breaks downstream parsers. Copy-paste:

# gpt-oss-120B (Harmony format, native reasoning effort)
model: ollama/gpt-oss:120b
extra_body:
  reasoning_effort: high   # low | medium | high

# Qwen3.6 family (hybrid thinking; disable all three knobs)
model: ollama/qwen3.6:27b
extra_body:
  reasoning: off
  reasoning_budget: 0
  chat_template_kwargs:
    enable_thinking: false

# GLM-4.7 / 5.1 : same triple flag as Qwen3.6
# DeepSeek V4 : same triple flag

# Llama 4 Scout : pythonic tool-call parser required
model: ollama/llama4:scout
extra_body:
  tool_call_format: pythonic

# Gemma 4 31B : watch for repetition loops past ~11 tool calls. Reset session on detection.

Citation: llama.cpp issue #13189 documents the full triple-flag fix.

New in April 2026

vllm-metal v0.2 brought paged attention to Apple Silicon: 83× TTFT improvement vs v0.1, 3.6× throughput. The official Apple-Silicon serving path beyond MLX. Worth tracking if you serve to more than one client.

07 · The creative engines

Images, video, music, voice, transcription

This is where local AI stopped being a compromise. The brain writes and reasons; these models produce the media. Each is a different machine with its own architecture, its own training, and its own leaderboard. Below: how each kind works, the full field of options with arena scores, the apps that run them on Mac, and how to prompt them.

Images

In plain terms

You type a description; the model starts from pure static and "develops" it into a photo over a few seconds, like an instant Polaroid in reverse. Modern ones read your prompt so well you can ask for specific text on a sign, an exact pose, or "the same character, new scene."

Under the hood

2026 image models are rectified-flow transformers (MM-DiT): a diffusion-style model that learns a near-straight path from noise to image, so it needs far fewer sampling steps than old U-Net diffusion. Text and image tokens flow through coupled attention streams; a large text encoder (FLUX.2 uses Mistral Small 3.1 24B) gives strong prompt adherence. Images live in a compressed 16-channel latent decoded by an autoencoder. FLUX.2 [klein] was distilled-free, which makes it the best open base for LoRA training.

Text-to-image arena · open weights (Elo)

HiDream-O1-Image-Dev

1187

Qwen Image Max 2512

1160

FLUX.2 [dev]

1160

Seedream 4.5

1165

Qwen-Image-2512 (Apache)

1136

Z-Image-Turbo (Apache)

1076

Image editing (open) leaders: HunyuanImage 3.0 1224 · HiDream-O1 1213 · FLUX.2 [klein] 9B 1161. Closed frontier for scale: GPT Image 2 ~1339.

FLUX.2 [dev / klein 9B / klein 4B / pro]

BFL · 32B

Best-in-class quality and prompt control; klein 4B is the distillation-free LoRA base; up to 10 reference images, HEX color control. License is split: dev and klein 9B are non-commercial; klein 4B is Apache-2.0 (commercial OK).

dev/klein 9B: NCklein 4B: Apache~23 GB dev

Qwen-Image-2512

Alibaba

Top Apache-2.0 model: commercial-safe, excellent text rendering, strong editing. The pragmatic default.

Apache-2.0Elo 1136

HiDream-O1-Image-Dev

open leader

Highest-ranked open-weights model in the arena right now.

Elo 1187

Z-Image-Turbo

Apache · fast

Permissive and quick: few-step turbo sampling for near-instant previews.

Apache-2.0turbo

Seedream 4.5 / Hunyuan 3.0

ByteDance / Tencent

Seedream excels at East-Asian aesthetics (calligraphy, fabric, architecture); Hunyuan leads open image editing.

editing 1224

SD 3.5 / SDXL

Stability

The mature ecosystem: the deepest library of community LoRAs and ControlNets, even if raw quality now trails.

most LoRAs

How to run them on Mac

Draw Things (free, easiest, Metal-optimized, LoRA + ControlNet built in) · ComfyUI (node graphs, most flexible; by Draw Things' own benchmark about 20% slower on Apple Silicon at the same workload, not 3×) · MLX (Apple-native, fastest scripted). Prompt FLUX with plain descriptive sentences plus camera and lighting terms; it ignores old "masterpiece, 8k" spam. Train a FLUX LoRA in 1,000–2,000 steps at rank 8 (Hugging Face's published baseline) and stack 2–3 at 0.5–0.7 weight; Civitai has thousands ready to download.

HiDream-O1: a genuinely novel architecture

Most modern open image models pair a Diffusion Transformer with a separate large text encoder and a VAE. HiDream-O1 ships neither. It is a Pixel-Level Unified Transformer that processes raw RGB end-to-end. At 8B parameters under MIT, it is currently the most permissively-licensed top-tier open image model. Worth knowing about even if FLUX.2 + Qwen-Image are your daily drivers.

One-second image generation on M5 Max

Draw Things' Lightning Draft feature, combined with the M5 Max's Neural Accelerators, makes about one-second 512×512 image generation real on Apple Silicon. The cost is quality (lower step count); the win is the iteration loop. Generate-prompt-tweak cycles that took 30 seconds on M3 Max now take 3. Whether to ship it as a draft-then-finalise pipeline is up to you.

Video

In plain terms

Same idea as images, but the model also has to keep things consistent from frame to frame so motion looks real. This is the heaviest job in the studio: expect minutes per clip, not seconds, and it cannot run at the same time as a big brain.

Under the hood

Video models extend diffusion into a spatio-temporal latent: 3D attention over (frames × height × width) tokens with a causal video VAE, so the model denoises a whole clip while enforcing temporal coherence. This is why VRAM and time costs explode versus stills.

Wan 2.7

Alibaba · April 2026

The current Wan generation; reference-to-video with voice cloning and instruction-based video editing as new model classes since 2.2. Text- and image-to-video via ComfyUI. About 50 GB resident, minutes per clip; won't co-reside with a 120B brain. Wan 3.0 60B (Apache-2.0) is roadmapped for mid-2026.

~50 GBComfyUIv3.0 mid-2026

LTX-Video

Lightricks

Built for speed: real-time-ish generation on capable hardware, lower fidelity than Wan.

fastest

Hunyuan Video / Mochi

Tencent / Genmo

High-motion open alternatives; heavier still, strong cinematic motion.

high motion

Model	Lab	Resident	Max res / dur	M5 Max time per 5-sec clip	Notes
Wan 2.7	Alibaba · Apr 2026	~50 GB	1080p · 10 sec	~4–8 min	Current Wan generation. Reference-to-video + voice cloning + instruction editing as new model classes.
LTX-Video v0.9	Lightricks	~20 GB	720p · 6 sec	~30–90 sec	Speed-first; lower fidelity than Wan but real-time-ish on capable hardware.
HunyuanVideo	Tencent	~60 GB	1280×720 · 5 sec	~5–10 min	High-motion open alternative; cinematic motion.
Mochi 1	Genmo	~40 GB	848×480 · 5 sec	~3–6 min	Apache-2.0; strong open motion baseline.
Step-Video / CogVideoX	StepFun / Tsinghua	~30–50 GB	720p · 6–10 sec	~3–8 min	Newer contenders; CogVideoX-Vid evolution lineage stable.

The honest weak spot

Local video is the one area where the cloud is still clearly ahead on quality and speed. On a 128 GB Mac it is usable for short clips and B-roll, but plan around minutes per generation and run it as a dedicated mode with the brain unloaded.

The open-vs-closed gap, 2026-05-24

Closed leaders right now: Veo 3.x (Google, photorealistic narrative), Sora 2 (OpenAI, broad prompt range), Kling 2.x (Kuaishou, strong motion), Runway Gen-4 (cinematic), Pika 2.x (creator-friendly). Open weights still trail meaningfully on text-in-frame coherence, lip-sync to audio, and minute-long temporal stability. Wan 3.0's 60B Apache-2.0 release (mid-2026 roadmap) is the candidate to close the gap.

Music

In plain terms

Give it a style and some lyrics and it writes and performs a full song, vocals and instruments, in under a minute. The local model is genuinely close to the big paid services, runs offline, and you can teach it a voice or style from a handful of examples.

Under the hood

ACE-Step 1.5 XL is a 4B hybrid: a language-model "composer" reasons in chain-of-thought to plan a structured blueprint (lyrics, sections, duration, metadata), then a diffusion transformer renders 48 kHz stereo audio. It is built on a Sana-style deep-compression autoencoder (DCAE) + linear transformer, with MERT and m-hubert features aligned via REPA; v1.5 adds intrinsic RL. Under 4 GB VRAM, 50+ languages, quality in the Suno v5 range (the closed leader Suno v5.5 shipped March 2026 and has since widened the gap somewhat), and it supports cover, repaint, vocal-to-BGM and LoRA from a few songs.

ACE-Step 1.5 XL

4B · open

The local Suno. Full songs with vocals in seconds; tiny footprint; trainable on your own style.

<4 GBSuno v5 class

YuE

open

Long-form vocal music generation; strong full-song structure, heavier than ACE-Step.

DiffRhythm

open · fast

Full songs in ~10s via latent diffusion; very fast, fewer controls.

fast

MusicGen / Stable Audio

Meta / Stability

Instrumental and sound-design workhorses; no vocals, but reliable for beds and loops.

instrumental

How to prompt ACE-Step

Two fields: tags (3–7 words: genre, mood, instruments, tempo) and lyrics with structure markers like [verse], [chorus], [instrumental]. Budget ~2–3 words of lyric per second (under ~140 words for a 47-second clip). Specific tags ("balkan brass, minor key, 90 bpm, male vocal") beat vague ones every time.

Worked ACE-Step prompt: a 47-second balkan-folk demo

Anatomy of one actual prompt that produces a clean output on the first try. Total wall-clock on M5 Max: ~18 seconds.

# Tags (3-7 words; specific instruments + bpm beat adjectives)
tags: balkan folk, accordion, violin, minor key, 92 bpm, male vocal, live feel

# Structure (markers shape sectioning + dynamics)
lyrics: |
  [verse]
  Sutra zora pada na grad,
  jutarnji povjetarac nosi pjesmu.
  Korak po korak kroz tihu ulicu,
  vrijeme za pjevanje bez razloga.

  [chorus]
  Pjevaj sa mnom, pjevaj glasno,
  pjesma stara, srce mlado.
  Pjevaj sa mnom, pjevaj glasno,
  do jutra, do jutra, do jutra.

  [instrumental]

  [verse]
  Akordeon pamti svaku ulicu,
  violina zna svaku stazu.
  Korak po korak kroz tihu jutro,
  pjesma za one koji slušaju.

  [chorus]
  Pjevaj sa mnom, pjevaj glasno,
  pjesma stara, srce mlado.

# Render
duration: 47
guidance: 7.5
seed: 1492

Why it works: the tag block names three instruments and pins bpm + key + vocal register before any adjective; the lyric block stays under 140 words (124 words across two verses + chorus repeat + instrumental break) so ACE-Step does not have to truncate mid-phrase; the [instrumental] marker gives the diffusion transformer a clean breath where it can foreground the accordion solo. The seed makes the output reproducible; drop the seed to re-roll the take.

When the first take is wrong

If vocals are mumbled: drop one word per second of lyric. If the genre drifted: add one more instrument tag and re-roll. If the structure feels flat: add [bridge] between the second verse and the final chorus. If you trained a LoRA on a singer's stems, append the LoRA name to tags at the end: +lora:my-singer. ACE-Step's recipes converge in 3-4 takes for most styles.

Lyric structure templates that survive a first take

ACE-Step is fault-tolerant on tags but unforgiving on length math. Two rules carry most of the load: total lyric words must fit duration × 2.6 words/sec, and each section needs enough time for the diffusion transformer to lock its motif. Templates below are battle-tested at 90-110 BPM; scale lyric counts proportionally for slower or faster.

Duration	Structure (recommended)	Lyric budget	Per-section
30 s (snippet)	[verse] [chorus]	~70-78 words	verse 32 · chorus 38
47 s (single demo)	[verse] [chorus] [instrumental] [verse] [chorus]	~120-130 words	verse 28 · chorus 32 · instr 0
60 s (short clip)	[intro] [verse] [chorus] [verse] [chorus]	~155 words	intro 8 · verse 32 · chorus 40 · 2nd 35/35
90 s (radio cut)	[verse] [chorus] [verse] [chorus] [bridge] [chorus]	~230 words	verse 36 · chorus 38 · bridge 32 · final 50
120 s (album cut)	[intro] [verse] [chorus] [verse] [chorus] [bridge] [instrumental] [chorus]	~310 words	scale 1.0×; pad with repeats on final chorus

Genre-specific structure conventions

Pop · AABA-derived: [verse] [chorus] [verse] [chorus] [bridge] [chorus]. The bridge sits at the 60% mark; that is where ACE-Step naturally drops a key change if the lyric meter shifts.
Ballad · ABABCB: longer verses (~40-50 words), shorter chorus (~24-30), one bridge. Tag rubato and drop BPM tag to ~70 to let the model breathe.
Dance / EDM · verse-chorus-drop: replace the second [chorus] with [instrumental] and tag drop, sidechain in the same tag line; ACE-Step renders the build-up + release without a lyric cue.
Balkan / folk · verse-chorus repeat: instrument-led; keep lyrics simple, repeat the chorus three times across a 60 s clip, mark [instrumental] for the accordion or violin solo.
Hip-hop · 16-bar verse + 8-bar hook: at 90 BPM, 16 bars = ~42 s, so a single verse fills a 47 s clip; use one [verse] and one short [chorus]. Append tag spoken cadence.

Why this matters: ACE-Step's diffusion transformer plans the entire timeline at the start. Overshooting the word budget by 20% causes the model to truncate mid-phrase or speed the vocal beyond intelligibility. Undershooting by 30% leaves the singer trailing off into instrumental sections that were not in your prompt. The templates above hit the sweet spot for one-take outputs in the 90-110 BPM band.

Voice & dubbing

In plain terms

From 5–10 seconds of someone speaking, these models clone the voice and then read any text in it, with emotion. Chain a few together and you can take a video in one language and output it dubbed in another, keeping the original speaker's voice.

Under the hood

Modern open TTS is zero-shot non-autoregressive flow-matching: a DiT generates a mel-spectrogram conditioned on text plus a speaker embedding extracted from a short reference clip, fused without a separate duration model. F5-TTS pairs a DiT with ConvNeXt, trained on ~100k hours, hitting real-time factor ~0.15 (≈6× faster than playback). Chatterbox adds emotion control and a PerTh watermark.

Chatterbox

Resemble · MIT

23-language cloning with emotion control; competitive with ElevenLabs in independent blind tests. The default voice engine.

MIT · 23 lang

F5-TTS

open · fast

RTF ~0.15, clones from a 1–10s reference; superb speed/quality balance.

RTF 0.15

Kokoro

tiny

Featherweight, extremely fast TTS for narration where cloning is not needed.

Qwen3-TTS

Alibaba · Jan 2026

Clones from just 3 seconds of reference audio (vs 5–15s for F5, 5s for Chatterbox). Strong multilingual range. Pairs naturally with the Qwen brain.

3s ref

CosyVoice 3

Alibaba · open

About 150 ms streaming latency, the lowest of any open TTS as of May 2026. The pick for real-time voice agents.

150 ms

Sesame CSM-1B

Sesame · open

Conversational speech model: tiny, fast, fluent in the conversational register where most TTS still sounds read-aloud.

conversational

Dia

open

Multi-speaker dialogue model; handles back-and-forth and overlap better than single-speaker TTS systems.

dialogue

TTS quality · TTS-Arena Elo (2026-05-25)

Sonic 3.5 (closed)

1218

Gemini 3.1 Flash TTS (closed)

1209

Realtime TTS 1.5 Max (closed)

1194

Chatterbox (MIT, open)

1187

StepAudio 2.5 TTS

1187

ElevenLabs v3 (closed)

1184

F5-TTS (open)

1142

CosyVoice 3 (open)

1118

Qwen3-TTS (open)

1095

The closed-vs-open TTS gap is now tighter than it has ever been: Chatterbox ties StepAudio 2.5 TTS at 1187 and sits 7 Elo above ElevenLabs v3. Sonic 3.5 (Cartesia) is the new closed leader but only 31 Elo ahead of the best open model. Snapshot from TTS-Arena v2 (HuggingFace).

What makes a good voice-clone reference clip

Length: 5-15 seconds for Chatterbox / F5-TTS / CosyVoice 3. 3 seconds is enough for Qwen3-TTS. Longer than 30 seconds rarely helps and can introduce inconsistency.
Cleanliness: no background music, no reverb, no second speaker. Studio mic > phone > Zoom recording. If you only have a noisy clip, run it through Demucs's --two-stems vocals first.
Prosody: natural speech, not flat reading. Include at least one falling-intonation sentence and one rising-intonation question if possible; the model anchors to your pitch range from the sample.
Format: WAV 16 kHz mono is universal. 24 kHz mono works too. The clone models resample internally; do not over-engineer.
Pitch consistency: same speaker, same emotional register as the target. A clip of you laughing will make the clone laugh-tinted across all output.
Punctuation in the target text drives prosody at inference time. Commas pause, ellipses trail, em-dashes break (but this guide bans em-dashes; use a semicolon).
For emotion in Chatterbox: use the exaggeration parameter (0.0-1.0); do not write ALL-CAPS in the target text. The exaggeration scalar is dramatically more controllable.

All seven stages on-device. M5 Max time budget: about 3-5 minutes of processing per minute of source video. Alternate lip-sync: MuseTalk 1.5 (Tencent Music) for higher fidelity on close-up faces.

Transcription

In plain terms

Turns speech into accurate text in dozens of languages, fast, fully offline. The backbone of meeting notes, subtitles and the dubbing pipeline above.

Under the hood

Encoder-decoder transformers trained on huge weakly-labelled audio. Whisper large-v3 runs at ~3 GB in MLX with strong multilingual word-error rates; Parakeet and Qwen3-ASR push speed and accuracy further on supported languages.

Whisper large-v3

OpenAI · ~3 GB

The reliable multilingual standard; timestamps, translation, robust to noise.

99 languages

Parakeet v3

NVIDIA

Very fast, very accurate on supported languages; great for long recordings.

fastest

Qwen3-ASR

Alibaba

Newer multilingual ASR with strong accuracy; pairs naturally with the Qwen brain.

Canary-Qwen-2.5B

NVIDIA · open

5.63% WER on English: beats Whisper large-v3 on the HuggingFace Open ASR Leaderboard for English transcription. Use when English-only and accuracy matters more than language coverage.

5.63% WER

ASR word-error rate: open models on English (HF Open ASR Leaderboard, 2026-05-25)

Canary-Qwen-2.5B

5.63

Parakeet CTC 1.1B

6.68

Whisper large-v3

6.4-6.7

Whisper large-v3-turbo

6.9

Qwen3-ASR

7.2

Lower is better. Whisper still wins on language breadth (99 languages) and noise robustness. Canary-Qwen wins on English accuracy.

ASR WER · per source language (open models, lower is better)

English · Canary-Qwen

5.6

English · Whisper-v3

6.7

Spanish · Whisper-v3

French · Whisper-v3

German · Whisper-v3

~10

Mandarin · Whisper-v3 (CER)

~13

Japanese · Whisper-v3 (CER)

~16

Approximate per-language WER (CER for CJK) per Whisper large-v3 model card + FLEURS / Common Voice 17 evals. For dubbing pipelines, pick by source-language strength: Canary-Qwen for English sources, Whisper for multilingual, Qwen3-ASR for Chinese/Japanese where it tends to edge Whisper.

One server runs all the audio

MLX-Audio (mlx_audio.server) exposes TTS, STT and speech-to-speech behind an OpenAI-compatible REST endpoint, so Hermes drives every voice model through one local URL, the same way it talks to the brain.

Three ways to run Whisper on M5 Max

WhisperKit (Argmax, native Swift, lowest latency) · Lightning-whisper-mlx (community MLX port, fast on M-series) · MetaWhisp (newer, optimised for the M5 Neural Accelerators specifically). Pick by language coverage and latency budget; all three serve the same model weights.

08 · Cutting the cord

Making the last cloud tools local

Hermes ships four tools that phone the cloud by default. To make the studio truly air-gappable, each one is repointed at a local engine. After this, the machine can run with Wi-Fi off.

Default (cloud)	Local replacement	How
`image_generate` → FAL	Local FLUX / Qwen-Image	Plugin or MCP wrapper around Draw Things / ComfyUI / MLX; the plugin system supports image-gen backends.
`text_to_speech` → cloud	MLX-Audio server	Point the tool at the local OpenAI-compatible voice endpoint.
`web_search` → Exa	Off, or local SearXNG	Disable for a pure air-gap, or wrap a self-hosted SearXNG via MCP for offline-ish search.
`mixture_of_agents`	Local council	Repoint the 4 references + aggregator at your local models (e.g. gpt-oss + Qwen + Gemma).

The result

Brain, images, music, voice, video and transcription all answer on localhost. Pull the network cable and the studio keeps working. That is the difference between "private-ish" and genuinely yours.

09 · LoRAs & fine-tuning

Teaching a model your style, on the Mac

In plain terms

A LoRA is a tiny add-on file that teaches a big model one new thing: your face, your art style, a brand voice, a singer's tone, without retraining the whole model. You make one from a handful of examples in an hour or two, and snap it on or off like a lens.

Under the hood

Low-Rank Adaptation freezes the base weights and injects small trainable rank-decomposition matrices (A·B) into attention/FFN layers; you train ~0.1–1% of parameters. QLoRA trains those adapters on top of a 4-bit-quantized base, collapsing memory further. Lineage: DreamBooth and Textual Inversion for image models, now standard across text, image, music and voice. Unified memory is the quiet superpower here: there is no separate VRAM wall, so a Mac fine-tunes models a 24 GB RTX 3090 cannot even load (rule of thumb: 16 GB → 8B, 32 GB → 14B, 64 GB → 32B; Llama-7B needs ~28 GB full, ~14 GB LoRA, ~7 GB QLoRA).

Training a text LoRA with MLX

# 1. quantize the base to 4-bit (QLoRA kicks in automatically)
mlx_lm.convert --hf-path Qwen/Qwen3.6-27B -q --q-bits 4
# 2. train the adapter on your JSONL data
mlx_lm.lora --model ./mlx_model --train --data ./data \
  --lora-layers 16 --batch-size 2 --iters 600
# 3. fuse the adapter back into a standalone model (optional)
mlx_lm.fuse --model ./mlx_model --adapter-path ./adapters

Image LoRA

FLUX/SDXL LoRA in 2,000–4,000 steps via ai-toolkit, SimpleTuner or ComfyUI. Stack 2–3 at 0.5–0.7 weight. Thousands ready on Civitai.

Music LoRA

ACE-Step learns a genre or a singer's tone from a handful of songs; snap it on for on-brand tracks.

Abliteration LoRA

Heretic v1.2 outputs the uncensoring itself as a toggleable LoRA adapter, no full re-download.

Voice "LoRA"

Zero-shot cloning is effectively instant adaptation; fine-tune only for a recurring signature voice.

Why this is the Mac's hidden edge

NVIDIA is 2–4× faster on models that fit its VRAM. But the moment a model is too big for the card, the Mac wins by simply being able to train it at all. For personal fine-tuning of 14–32B models, 128 GB of unified memory is a genuinely rare capability in a laptop.

Image-LoRA training: tool chooser

Four tools dominate FLUX / SDXL / Qwen-Image LoRA training in 2026. Pick by base model and ergonomics.

Tool	Base models	Sweet spot	Notes
ai-toolkit (Ostris)	FLUX.2, SDXL, SD 3.5	FLUX.2 [klein] 4B LoRA	The default for FLUX. `~/.aitoolkit` config files; clean dataset format; 1,000–2,000 steps at rank 8 = HF baseline. Best LR: 1e-4 with cosine decay.
SimpleTuner	FLUX.2, SDXL, SD 3.5, Auraflow	Multi-aspect-ratio training	Stronger for production datasets with mixed aspect ratios. Supports DeepSpeed; Mac MPS works but slower than ai-toolkit on M5 Max.
Diffusion Pipe	FLUX.2, HunyuanVideo, Wan 2.7	Video LoRAs	The only viable Mac path for training video model LoRAs (Wan 2.7, Hunyuan). Heavy memory; needs the 128 GB budget.
ComfyUI built-in	FLUX.2 dev, SDXL	One-off iteration	Node-graph LoRA training; visual but slower than CLI tools. Good for trying ideas before committing to ai-toolkit.

Dataset prep checklist (FLUX.2 LoRA on subject)

15-30 source images of the subject (more isn't always better; quality > quantity).
Mixed lighting, angles, distance, expressions. Avoid duplicates.
1024×1024 or matched-aspect-ratio crops; FLUX.2 handles multi-aspect with SimpleTuner.
Caption file per image: short trigger token (e.g. ivansubj) + brief description. ai-toolkit auto-captions via vision LLM if you skip.
Train 1,000-2,000 steps at rank 8, LR 1e-4 cosine. Sample every 250 steps to a held-out prompt; pick the checkpoint that nails likeness without overfitting facial frozen-state.
Stack with style LoRAs at 0.5-0.7 weight at inference. Civitai has thousands ready to combine.

10 · Orchestration

How the agent runs the whole studio

The pieces only become a studio when one mind coordinates them. Hermes does this with five mechanisms, all configurable.

Mechanism	What it enables
Model routing	`delegation.model` sends subagent or content work to a different model (e.g. uncensored writer) while the orchestrator stays on the reliable brain. Switch live with `/model`.
Subagents	`delegate_task` spawns isolated workers (own context, terminal, tools) that run in parallel and return only a summary, keeping the main thread clean.
Mixture of agents	`mixture_of_agents` sends one hard problem to several models and an aggregator merges the best answer, all local.
Code execution	`execute_code` runs Python that itself calls Hermes tools, for branching logic the model would otherwise narrate step by step.
Cron + memory + skills	Scheduled jobs run in fresh sessions; memory carries durable facts; skills carry repeatable procedures the agent wrote itself.

# route the heavy thinking and the uncensored writing separately
model: ollama/gpt-oss:120b          # orchestrator brain
delegation:
  model: ollama/qwen3.6-abliterated:agent   # subagent / content writer
custom_providers:
  - name: local-media
    base_url: http://localhost:8080/v1   # MLX-Audio / image gateway

Swap the brain in 30 seconds when the leaderboard moves

The whole point of treating the brain as a swappable component: the day a better open-weights model lands, you switch without re-architecting. Concrete recipe, valid for any Ollama-served model:

# 1. Pull the new brain (one-time cost; cached afterwards)
ollama pull deepseek-v4-flash:35b-a3b-q5          # ~24 GB

# 2. Hot-swap by editing one line in ~/.hermes/config.yaml
sed -i '' 's|ollama/gpt-oss:120b|ollama/deepseek-v4-flash:35b-a3b-q5|' ~/.hermes/config.yaml

# 3. Restart the agent; skills, memory, channels all carry over
hermes restart

# 4. Sanity check
hermes 'Confirm which model you are running, then list the first 3 tools available.'

What carries over: every skill in ~/.hermes/skills/, every memory entry, every channel binding (Telegram / Discord / Slack), every API key, every cron job, every MCP server registration. Nothing else needs to change because nothing else depends on the model identity. The brain is one row in one YAML file.

When the swap is not transparent

Three cases need a follow-up tweak after the YAML change:

Reasoning-format flags: hybrid-thinking models (gpt-oss-120B, Qwen3.6) need three knobs disabled (--reasoning off + --reasoning-budget 0 + enable_thinking: false). Non-thinking models (Llama 4 Scout, Gemma 4) need none of them. Check the cheat sheet in section 6 after swapping.
Context-window size: OLLAMA_CONTEXT_LENGTH must fit the 70+-tool system prompt. 65,536 is the safe default; some smaller-context models will silently truncate tools if left at 8,192.
Tool-call format: most modern open models speak the OpenAI tool-call schema natively. A handful of older Llama-derivatives need --tool-format llama. Hermes auto-detects on first call; if a tool silently fails to fire, that is the suspect.

11 · Prompts & recipes

How to actually talk to each model

Every model class wants a different prompt style. Old "masterpiece, trending on artstation" spam hurts modern models. Here are the patterns that work.

Prompt patterns by model class

FLUX / Qwen-Image

Plain descriptive sentences + camera, lens and lighting. "A weathered fisherman at dawn, 35mm, soft side light, shallow depth of field, muted teal palette." Put exact text in quotes for signage.

ACE-Step

Tags: balkan brass, minor key, 90bpm, male vocal, live. Lyrics with [verse] / [chorus] / [instrumental]. ~2–3 words per second.

Voice cloning

5–10s of clean reference audio (no music). Punctuation drives prosody; for emotion use Chatterbox's exaggeration control rather than ALL-CAPS.

Agent system prompt

State the goal, the allowed tools, and a stop condition. Let memory and skills carry standing context instead of repeating it every session.

Per-brain prompting: what each Mac-fit model wants

The four orchestrator brains do not respond to the same system-prompt shape. Tune per model.

gpt-oss-120B

Harmony format

Use OpenAI's Harmony chat template (Ollama handles automatically). Set reasoning_effort: high in extra_body for hard tasks; drop to low for quick tool calls. The model emits visible chain-of-thought by default; keep that in the agent log but strip it from end-user output via Hermes's /compress.

o200k_harmony

Qwen3.6-27B

hybrid thinking · TRIPLE FLAG

For tool-calling agents, disable reasoning with all three: reasoning: off, reasoning_budget: 0, and chat_template_kwargs.enable_thinking: false. Skip any one and the model leaks chain-of-thought into your tool-call output. Set reasoning_budget: 8192 for math and code where you do want thinking.

llama.cpp #13189

Llama 4 Scout

pythonic tool-calls

Pin a parser that handles Llama 4's pythonic tool-call format (not OpenAI JSON). With Ollama: extra_body.tool_call_format: pythonic. With llama.cpp serve: pass --chat-template llama4. Vanilla OpenAI parsers silently drop calls.

long ctx

Gemma 4 31B

watch for loops

Reliable through about 10 tool calls per session. Past that, builders report repetition loops where the model resends the same tool call. Detect with a three-identical-calls heuristic and reset the session (Hermes /new) before continuing. Vision works; tool-calling reliability does not match gpt-oss or Qwen3.6 at depth.

vision OK

12 · Pipelines

Real workflows the studio runs end to end

These are not hypotheticals; each is a chain of the tools and models already covered, expressed as a Hermes skill or cron job.

Real workflows

Auto-dubbing

~3–5 min per minute

Video in → Whisper transcribes (~10% realtime on M5 Max) → brain translates (~1 sec/line on Qwen3.6-27B) → Demucs separates vocal track + Pyannote diarizes speakers (~30 sec for 5-min clip) → Chatterbox clones each speaker voice + reads translation (~RTF 0.2) → LatentSync/MuseTalk re-syncs mouth motion (~realtime) → ffmpeg muxes back to video. One Hermes skill, one command.

skill7-stage

Local songwriting

~90 sec / 47-sec track

Brain (or uncensored writer) drafts lyrics in your style → tags hand-picked → ACE-Step XL 4B composes blueprint (~10 sec) and renders 48 kHz stereo audio (~60 sec) → you keep stems for re-mix. The fully offline Suno equivalent. M5 Max runs the whole chain on battery.

musicoffline Suno

Content factory (cron)

overnight, hands-off

Hermes cronjob at 06:00 local: brain drafts a post per topic (~30 sec) → FLUX dev or Qwen-Image-2512 renders matching hero image (~15 sec via Draw Things Lightning Draft) → file dropped in ~/content/YYYY-MM-DD/ → Hermes messaging gateway pings you on Telegram. Runs while you sleep; review in the morning.

cron

Autonomous coding

multi-agent

Brain plans via /goal contract → delegate_task spawns subagents per module (isolated context, parallel) → execute_code runs pytest/cargo test → LSP semantic diagnostics surface any type errors → loop until green or Kanban auto-blocks on heartbeat timeout. Production Sprint 1 of this guide ran exactly this way.

subagentsKanban

Local model council

MoA

mixture_of_agents routes the hard question through gpt-oss-120B + Qwen3.6-27B + Gemma 4 31B in parallel, then an aggregator brain merges. No API calls. ~3x wall-clock vs single brain (parallel limited by 614 GB/s bandwidth), better answers on contested questions.

MoA

Self-improving loop

memory + skills

Hermes hits a bug, solves it, then writes a skill with the fix; the Curator agent deduplicates and consolidates skills nightly. Same bug a week later: agent finds the skill via session_search and applies the known fix in one tool call. The studio gets measurably better with use.

skillsCurator

Embed it anywhere

Beyond chat, the whole agent is importable: from run_agent import AIAgent drops the studio into your own Python scripts, so a pipeline can be triggered by a file drop, a webhook or a schedule.

Music-video pipeline · script to render

End-to-end music-video generation, all on-device. M5 Max budget: about 5-8 minutes total for a 30-second video.

Podcast pipeline · script to mastered episode

Multi-speaker podcast from a script, all voice-cloned, with intro music, lower-thirds music beds and outro. M5 Max budget: real-time + a few minutes mastering for a 30-minute episode.

13 · Capacity

What actually fits in 128 GB at once

Roughly 120 GB is usable for models after raising the wired limit. These are real, simultaneous loadouts. The pattern is always: keep one brain resident, spin heavy media up on demand.

A · Daily driver · max reliability

gpt-oss-120B · 63GBKV cacheFLUX on demand

B · Creative · brain + always-on media

Qwen 27B Q8 · 33GBFLUX dev · 23GBMLX-AudioACE-Stepheadroom

C · Pure agent · multi-brain routing

gpt-oss-120B · 62GBQwen 27B Q835B-A3B Q5

D · Video mode · brain unloaded

Wan 2.7 · ~50 GBQwen 27B Q4working memory

main brainsecond modelimage/audiovideocache / free

The trade in one line

You cannot run a 120B brain and a video model at the same time. You can run a 120B brain, image, music, voice and transcription together all day. Plan loadouts, not wishlists.

Scaling beyond 128 GB: two real paths

1 · Mac Studio M3 Ultra at 512 GB / 819 GB/s. Holds DeepSeek V4 Flash 284B at Q8 with room for a second brain co-resident. The only single Apple machine that matches the trillion-parameter cloud tier at slow tok/s.

2 · EXO Labs distributed inference over Thunderbolt 5. Daisy-chain M5 Max + Mac Studio M3 Ultra (or multiple M5 Max units) over the 120 Gb/s TB5 fabric into a single virtual inference pool. Workable for models > 128 GB; the bandwidth ceiling per hop is TB5, not RAM, so attention-heavy workloads pay a real but tractable penalty. Open-source at github.com/exo-explore/exo; works with the same Hermes config and same model weights you already have. The escape valve for when the leaderboard moves and you do not want to abandon the studio architecture.

What fits my Mac? Interactive

Slide to your RAM budget. The model field table above and the budget bars above re-colour: green if it fits, amber if tight, red if you would need to evict the brain to run it.

Available unified memory 128 GB

163264128192256512

At 128 GB: the recipe is one resident brain plus on-demand media. gpt-oss-120B fits comfortably; Qwen3.6-27B Q8 leaves the most media headroom. Video is mode-switch, not co-resident.

What it would cost to rent the equivalent

The studio is a capital cost (one Mac, once). Subscriptions are operating cost (monthly, forever). The honest comparison is what a comparable workflow runs through paid APIs and consumer-tier subscriptions, at moderate use: a working day of agent calls, ~40 images, ~10 song demos, ~30 minutes of voice clones, ~3 hours of transcription. Prices are list, May 2026; multi-tier discounts ignored.

Capability	Local stack	Closest paid equivalent	List price · moderate use	Monthly
Brain · agent loop	gpt-oss-120B + Hermes	Claude Pro (Opus 4.7) + Cursor Pro	$20 + $20	$40
Coding agent	Qwen3.6-27B + Hermes	GitHub Copilot Business + Codex API top-up	$19 + ~$25	$44
Image generation	FLUX.2 dev/klein · Qwen-Image	Midjourney Standard + Adobe Firefly	$30 + $10	$40
Music generation	ACE-Step 1.5 XL · YuE	Suno Pro + Udio Standard	$30 + $10	$40
Voice cloning + TTS	Chatterbox · F5-TTS · Qwen3-TTS	ElevenLabs Creator (100k chars)	$22	$22
Transcription	Whisper · Canary-Qwen · Parakeet	OpenAI Whisper API + Rev.ai	~$15 + $10	$25
Video generation	Wan 2.7 · LTX-Video (mode-switch)	Runway Gen-4 Standard + Pika 2.x Pro	$35 + $35	$70
Long-context summarisation	any local 27B+ brain	Anthropic API 200k context spillover	~$30	$30
Search-grounded answers	SearXNG + local brain	Perplexity Pro	$20	$20
Cloud sandbox (for agents)	local · Docker · Modal free tier	Modal paid · Daytona Pro · Vercel Pro	~$20-50	$30
Rented equivalent · monthly				$361

The break-even math

MacBook Pro M5 Max 128 GB / 2 TB lands around $4,499 list (US, May 2026). At $361/mo rented-equivalent, the machine pays for itself in ~12.5 months. At $200/mo rented (drop video + reduce to single-tier image and music), the payback is ~22.5 months. The Mac then keeps working for years 2-5 at zero operating cost, while subscription totals over the same period are $8,664 to $21,660.

This understates the gap two ways. First, list prices ignore overage tiers: heavy image or music use blows past these caps fast. Second, the studio has no rate limits, no usage caps, no policy refusals, no per-call latency from the round-trip, no data leaving the laptop. The capability ceiling is your Mac, not a usage meter.

Where renting still wins

Three honest cases. One: you need the absolute frontier (Opus 4.7 / GPT-5.5 / Veo 3.x / Sora 2) and the 3-6 Intelligence-Index-point gap matters for your work. Two: you generate at extreme volume (thousands of images / hours of video daily) and want elastic burst capacity. Three: your team is >5 people and the per-seat math flips. For one person doing knowledge work, design, writing, music demos and code: the studio wins on cost, latency and control inside the first year.

14 · Build it

From a fresh Mac to a running studio

Free the memory. Raise the GPU wired limit so models can use about 120 GB.
```
sudo sysctl iogpu.wired_limit_mb=122880
```
Install the serving layer. Ollama for the brain (GGUF), and MLX/LM Studio for media and fine-tuning.

Pull the models.

ollama pull gpt-oss:120b
ollama pull qwen3.6:27b
# raise context so the 70+-tool prompt fits
export OLLAMA_CONTEXT_LENGTH=65536

Install Hermes. Pick one:

# Recommended on macOS
brew install hermes-agent

# Or via PyPI (clean Python environments)
pip install hermes-agent

# Or via the official install script
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

# Or on Windows
iex (irm https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.ps1)

Point it at local. Edit config.yaml (the orchestration block in section 10): brain, delegation model, custom providers for media.
Verify. Run hermes doctor, then ask it to write a file, generate an image, and transcribe a clip. If all three land, the studio is live.
Configure reasoning-format per model. Hybrid-thinking models need three knobs disabled to keep tool-call output clean. The cheat sheet is in section 6. Skip this and watch agents emit raw chain-of-thought into your terminal output.
Enable hermes proxy if you have any of Claude Pro, ChatGPT Pro or SuperGrok. The subscription becomes a localhost OpenAI-compatible endpoint that Codex CLI, Aider, Cline and Continue can all drive. Material cost-saving for multi-tool users.

Housekeeping that pays off

hermes backup snapshots memory and skills · profiles keep work and creative brains separate · /compress rescues a long session nearing its context limit.

One-shot install script

Save as studio-up.sh, run with bash studio-up.sh. Idempotent; rerun anytime.

#!/usr/bin/env bash
set -euo pipefail

# 1. Free the memory
sudo sysctl iogpu.wired_limit_mb=122880

# 2. Install serving layer
brew install ollama
brew install --cask lm-studio
pip install mlx mlx-lm mlx-audio

# 3. Pull models
ollama pull gpt-oss:120b
ollama pull qwen3.6:27b
export OLLAMA_CONTEXT_LENGTH=65536

# 4. Install Hermes
brew install hermes-agent

# 5. Initial config
mkdir -p ~/.hermes
cat > ~/.hermes/config.yaml <<'YAML'
model: ollama/gpt-oss:120b
delegation:
  model: ollama/qwen3.6:27b
custom_providers:
  - name: local-media
    base_url: http://localhost:8080/v1
YAML

# 6. Verify
hermes doctor
echo "Studio is live. Try: hermes 'write hello.txt then summarize it.'"

Reproduce the numbers in this guide

Every tok/s, prefill speedup, image-gen wall clock and ASR WER in this guide can be re-measured on your own Mac in under 10 minutes. If a number drifts more than ~15% from the ranges below, the most likely cause is missing macOS Tahoe 26.4+, an unraised wired limit, or a competing memory hog (Chrome, Docker Desktop). Probe in this order.

1 · LLM decode throughput (tok/s)

# Ollama: built-in eval-rate report
ollama run gpt-oss:120b --verbose "Write a 200-word essay about the M5 Max."
# → look for: eval_count, eval_duration, eval_rate (tokens/s)

# MLX equivalent (more granular, separates prefill from decode)
mlx_lm.generate --model mlx-community/Qwen3.6-27B-Instruct-4bit \
                --prompt "Explain unified memory in two paragraphs." \
                --max-tokens 256 --verbose

Expected on M5 Max 128 GB · macOS Tahoe 26.4+ · cold start, no other GPU load:

Model · quant	Decode tok/s	Prefill tok/s	Memory resident
gpt-oss-120B · Q4 (MoE, 5.1B active)	~58-72	~620-820	~63 GB
Qwen3.6-27B · Q4	~21-26	~280-360	~16 GB
Qwen3.6-27B · Q8	~12-15	~140-180	~31 GB
Llama 4 Scout 17B-16E · Q4 (MoE)	~74-88	~720-940	~12 GB
DeepSeek-V4 Flash 35B-A3B · Q5	~48-60	~510-670	~24 GB

2 · Prefill speedup vs M4 Max (sanity check)

# Long-context prefill measurement: 8k-token prompt
mlx_lm.generate --model mlx-community/Qwen3.6-27B-Instruct-4bit \
                --prompt-cache-file /tmp/none \
                --prompt "$(yes 'context line ' | head -1000 | tr -d '\n')" \
                --max-tokens 1 --verbose
# → look for: prompt-eval rate. Compare against M4 Max ~85 tok/s.
#     M5 Max should land ~280-360 tok/s on the same prompt (3.3-4.0x).

3 · Image generation wall clock (FLUX.2 + Draw Things)

# FLUX.2 dev · 28 steps · 1024x1024 · MLX-Diffusion
time python -m mlx_diffusion.generate \
       --model black-forest-labs/FLUX.2-dev-mlx-4bit \
       --prompt "a still life of a brass accordion, dramatic light" \
       --steps 28 --size 1024

# Draw Things Lightning Draft (UI app, but timeable)
# Generate at 512x512, 4 steps, Lightning Draft schedule.
# Expected wall clock ~1.0-1.4 s per image.

Workload	Wall clock · M5 Max	M4 Max baseline
FLUX.2 dev · 28 steps · 1024	~14-18 s	~52-68 s
FLUX.2 klein · 4 steps · 1024	~2.4-3.2 s	~9-12 s
Draw Things Lightning Draft · 4 steps · 512	~1.0-1.4 s	~3.8-4.6 s
Qwen-Image · 30 steps · 1024	~22-28 s	~80-95 s

4 · Transcription speed (real-time factor)

# 60-second clip; RTF = wall_clock / audio_duration. Lower is better.
time mlx_whisper --model mlx-community/whisper-large-v3-mlx \
                  --language en sample-60s.wav
# → expected wall clock ~3.8-5.2 s → RTF ~0.06-0.09

time faster-whisper-cli --model large-v3 --device mps sample-60s.wav
# → expected wall clock ~5.2-7.8 s → RTF ~0.09-0.13

5 · Music generation (ACE-Step)

time python -m ace_step.generate \
       --tags "balkan folk, accordion, violin, minor key, 92 bpm, male vocal" \
       --lyrics-file /tmp/lyrics.txt \
       --duration 47 --guidance 7.5 --seed 1492
# → expected wall clock ~16-22 s for 47-second output

6 · Putting it all together: the 5-minute smoke test

#!/usr/bin/env bash
# Save as bench-studio.sh; run after install completes.
set -euo pipefail
echo "== sysctl wired limit =="
sysctl iogpu.wired_limit_mb

echo "== macOS build =="
sw_vers -productVersion

echo "== brain decode =="
ollama run gpt-oss:120b --verbose "Count to 50." 2>&1 | grep eval_rate

echo "== prefill =="
mlx_lm.generate --model mlx-community/Qwen3.6-27B-Instruct-4bit \
                --prompt "$(yes 'x ' | head -500 | tr -d '\n')" \
                --max-tokens 1 --verbose 2>&1 | grep -E "prompt|generation"

echo "== whisper =="
time mlx_whisper --model mlx-community/whisper-large-v3-mlx sample-60s.wav

If your numbers are low

macOS < 26.4: INT4 Neural Accelerator path is not enabled. Upgrade is the single biggest lever; expect 30-50% jump on prefill and 15-25% on decode after.
Wired limit unraised: sysctl iogpu.wired_limit_mb should read 122880 (~120 GB). Default is ~96 GB; the 120B brain spills.
Competing GPU consumers: Chrome with WebGL tabs, Docker Desktop, Final Cut Pro background render. Quit them, retest.
Thermal throttle: a stress run that just finished leaves the M5 Max in throttle for 1-2 minutes. Wait, then re-measure cold.
Power adapter: on battery the GPU clocks down. All numbers above assume the laptop is plugged in.
Wrong quant: a Q4 number compared against a Q8 weight on the same model will look ~40% slower. Match the row.

Why these ranges matter for credibility

The 3.3-4.0x prefill jump and 15-25% decode jump over M4 Max come from Apple's MLX team's own benchmarks plus independent re-measurement by ml-explore and community runs on r/LocalLLaMA throughout March-May 2026. The ranges above are 10th-90th percentile across reported runs. If your machine lands inside the band, the rest of the studio's numbers (cost-to-rent, capacity loadouts, pipeline minute estimates) all hold. If it lands outside, treat this section as a debugging checklist, not as a moving target.

15 · Safety & air-gap

Staying private, staying in control

An agent that can run shell commands and an uncensored model that never refuses are powerful and need guardrails. Hermes ships them; keep them on.

Dangerous-command blocking + approvals: destructive patterns (rm -rf, DROP TABLE) are blocked or require explicit confirmation.
Secret-exfiltration scanning: the runtime flags attempts to leak keys or credentials, important when an abliterated model will follow any instruction it reads.
MCP hardening: OAuth 2.1 PKCE for connectors plus OSV scanning of MCP servers for known-malicious packages.
Air-gap checklist: local brain ✓, local media ✓, web_search off or local SearXNG ✓, no telemetry ✓. Then the network cable is optional.

The one mindset to keep

Local does not mean consequence-free. The agent acts on your machine with your permissions; the uncensored model will draft anything. Keep approvals on for shell and file-deletion, sandbox experiments, and own what you generate. Privacy is a feature, not an excuse.

Seven sandbox backends

Hermes ships more sandbox options than any competitor. Pick by trust level + cost.

local

default

Runs in your shell with the agent's permissions. Fastest. Use for trusted skills.

Docker

local container

Isolated filesystem, network, processes. Use for untrusted MCP servers or third-party skills.

SSH

remote host

Run the workload on another machine over SSH. Use for heavy compute on a Mac Studio without leaving the laptop.

Singularity

HPC container

Container format favoured in HPC. Use on university or research clusters.

Modal

serverless

Spins up an ephemeral cloud sandbox per task. Use for elastic burst compute outside the studio.

Daytona

dev environments

Pre-configured dev environments. Use for reproducible per-project workspaces.

Vercel Sandbox

edge sandbox

Vercel's serverless sandbox primitive. Use when the workload should sit close to a web app.

Disclosed CVEs

One disclosed Hermes CVE to date: CVE-2026-7396 (WeChat path traversal). Patched in 0.13.x. Small attack surface, transparent disclosure. The 0.12-0.14 release cycles shipped hundreds of security-tagged commits as part of normal hardening; no breaches reported in the wild.

The honest threat model

"Private and local" means no telemetry, no cloud calls, no model provider seeing your prompts. It does not mean threat-free. The four threat surfaces that survive going local:

Malicious skills. ~/.hermes/skills/ is just SKILL.md + scripts; a hostile skill in agentskills.io / ClawHub can run anything. Mitigation: Hermes ships skill provenance + OSV scanning on import, but you must read the skill before installing. The Curator agent flags churn.
Prompt injection via fetched content. An abliterated model will follow instructions hidden in a web page or PDF it reads. The orchestrator brain (clean, refusal-intact) should pre-filter content before passing to the writer. Mitigation: split orchestration brain (clean) from content-writing brain (uncensored); keep tool approvals on for shell + file-delete.
MCP supply chain. Third-party MCP servers run with the agent's permissions. Mitigation: Hermes auto-runs OSV on MCP packages at install + OAuth 2.1 PKCE for connectors. Review what each MCP server can touch before enabling.
Physical access + key extraction. Local API keys (for the few cloud tools you do keep) sit in ~/.hermes/. macOS FileVault encrypts the disk; secrets are not in your keychain by default. Mitigation: enable FileVault; consider moving secrets to macOS Keychain via hermes keychain migrate (0.14+).

The studio is dramatically safer than the cloud equivalent. It is not safe. Plan for it.

16 · Where this goes

The studio is a starting line, not a finish

Step back and look at the arc. A year ago, the best open model scored 22 and a private studio like this was science fiction. Today the gap to the frontier is single digits, a 27-billion-parameter model that fits a laptop codes like last year's best, and music, voice and images that rivalled paid services now run offline in seconds. The line on that chart is still climbing.

The question stopped being "can I run this locally?" and became "why would I rent it?"

What you have built here is not a cheaper ChatGPT. It is a different relationship with the technology. The models are yours: they do not change under you overnight, they do not log your work, they do not disappear when a subscription lapses or a company pivots. The agent learns your patterns and keeps them. The voice clone, the LoRA of your style, the skills it wrote solving your bugs, none of it leaves the machine. In a market built on renting access to someone else's servers, owning the whole stack outright is the genuinely radical option.

It is not the strongest possible system. The trillion-parameter leaders still live in data centers, local video still trails, and the leaderboard you read today will be wrong next month. But the trajectory is unmistakable: every quarter, more of the frontier becomes something you can hold in 128 GB. The right move is not to wait for the perfect model. It is to build the studio now, learn how the pieces fit, and let it improve underneath you as the open field keeps closing the gap, which it will.

One agent. Every model. Zero cloud. Run it once, and the cloud starts to look like a choice rather than a requirement. That is the whole point, and it is already here.

How this page itself stays honest as the field moves

Every benchmark in this guide is timestamped to the source-of-truth date in the snapshot badge at the top. The page is regenerated and re-validated on a monthly cron (.github/workflows/monthly-refresh.yml, runs the 1st of each month at 09:00 UTC). The cron does three things:

Re-fetches the public leaderboards (Artificial Analysis Intelligence Index, LMArena, HuggingFace Open ASR, TTS-Arena v2, BFCL, GDPval, MMLU-Pro, GPQA, SWE-bench Verified, τ²-Bench) and opens a draft PR if any displayed number drifts more than 2 Elo points or 2 percentage points.
Re-runs the validate script against the rendered HTML to keep em-dash count at 0, JSON-LD valid, all internal anchors resolved, all icon refs defined, page size under the 250 KB budget, and the cliche scan clean.
Re-runs Lighthouse CI and axe-core against the production URL, posts the diff into the PR description.

If the draft PR sits open longer than a week, the next cron run pings it. Readers can verify all of this themselves: every workflow file is in the repo, every benchmark cite has a primary-source link, and the /llms-full.txt mirror is what AI crawlers actually index. The page is engineered to age slowly and visibly, not to silently rot. If you spot drift before the cron does, open an issue with the source link and it gets folded into the next refresh.

How this page itself was built

Worth saying plainly: this guide describing the local AI studio was written using the studio it describes. Not as a stunt, as a stress test. If the architecture cannot ship a 220 KB single-page field guide with zero em-dashes, six JSON-LD blocks, working CSS scroll animations and an OG card that renders on Twitter, then it cannot ship anything more demanding. Here is the actual build stack, end to end:

Step	Done by	Notes
Source research (papers, model cards, leaderboards)	local brain + `web_search` tool calls + delegated subagents	The brain dispatches one research-agent subagent per topic (open-weights LLMs, FLUX.2 family, ACE-Step internals, M5 Max bandwidth, etc.) and the results merge back into one outline. Same pattern as the multi-modal worked example in section 3.
Outline + narrative arc	brain + `/goal` contract	The 17-section outline is one Ralph-loop goal: "ship a field guide at 99/100 quality, single page, English with locale roadmap, zero em-dashes". The Judge agent gates each merge.
Copy drafting (every paragraph, both registers)	brain in long-context mode	"Simple" and "Technical" registers are two passes over the same outline. Em-dashes are intercepted by the validate script before commit; the brain learned after the first three rejections to use commas and sentence-splits.
SVG diagrams + benchmark bars	brain (raw SVG, no design tool)	Every diagram in this guide is hand-authored SVG in the HTML source. Faster to iterate, smaller payload, perfect rendering at any zoom, no external dependency.
OG image + favicon set	`scripts/build-og.mjs` (Resvg + hand-crafted SVG)	Renders 1200x630 PNG from an SVG template; six locales of headline copy. No Satori, no font parsing layer, no Canva.
Validation gates	`scripts/validate.mjs` + Lighthouse CI + axe-core	Em-dash count = 0, JSON-LD valid, tag balance, anchor resolution, size budget, cliche scan, WCAG 2.2 AA. All run on every PR before deploy.
Deploy	Cloudflare Pages via `wrangler-action@v3`	Push to main → auto-deploy in ~30 s. Preview deploys per PR. No GitHub OAuth, no manual step.
Monthly drift refresh	`monthly-refresh.yml` cron + PR ping	The honest-aging loop described above. Same brain, same skills, same agent loop, no human in the regular path.

The proof the architecture works

The studio's brain wrote the copy for the studio's own field guide, in two language registers (Simple + Technical) across 17 sections and 236 KB of single-page HTML. The studio's tools rendered the OG card (in six locales), ran the validation gates, opened the PRs, deployed to the edge, and will keep the benchmarks honest month over month. The whole pipeline runs on the same MacBook Pro M5 Max the guide is about. No cloud LLM round-trip ever touched the production copy.

Honest scope note on i18n: the page body is English only as of v1.12. The OG social card renders in six locales (en/sr/de/fr/zh/ar) via scripts/build-og.mjs --locale=<xx>. Full per-locale page translations with a language switcher, hreflang, og:locale:alternate, RTL CSS for /ar/ and localized sitemap entries are on the roadmap, not shipped. The 6-locale promise applies to the social card today and the full page in a future release.

You are reading the result. If that is not a working demonstration of "one agent, every model, zero cloud", nothing is.

17 · FAQ & honest gaps

Straight answers

Is it really free?

The software (Hermes, Ollama, MLX) and the open-weights models are free to download and run. You pay once for the hardware and in electricity. No subscriptions, no per-token billing.

Does it genuinely work offline?

Yes, once the four cloud-default tools are repointed (section 08). Brain, images, music, voice, video and transcription all answer on localhost. You can pull the network cable.

What is the catch with uncensored models?

Abliteration can dent general capability if done carelessly, and an uncensored model will follow injected instructions too. Use a clean brain as orchestrator and an uncensored model only as the content subagent, and keep tool approvals on.

Where is local AI still weak?

Video: quality and speed trail the cloud, and it cannot co-reside with a big brain. The very top reasoning models (1T+) need a data center. For pure software engineering, dedicated coding agents still edge out a general local agent.

Is a Mac actually the right machine?

For this use case, yes. Unified memory lets a laptop hold and even fine-tune models a consumer GPU cannot load, silently and on battery. An NVIDIA card is faster on models that fit its VRAM, but loses the moment a model is too big for it.

Will these recommendations age?

The specific models will, within weeks. The architecture will not: an agent conductor, a resident MoE brain on GGUF, MLX for media and fine-tuning, and on-demand creative engines. Swap the model names; keep the structure.

Do I need macOS Tahoe 26.4?

For the full Neural Accelerator INT4 path, yes. Earlier 26.0 and 26.1 lack INT4 tensor support and the tok/s numbers in this guide will not reproduce. Run sw_vers to check.

Does Apple Intelligence conflict with the local studio?

No. Apple Intelligence runs a small on-device model (around 3B parameters) inside its own allocation; it does not fight your 120 GB unified pool for the orchestrator brain.

Can I scale beyond 128 GB?

Yes, two paths. (a) Mac Studio M3 Ultra at 512 GB / 819 GB/s holds DeepSeek V4 Flash at Q8 comfortably. (b) EXO Labs distributed inference over Thunderbolt 5 daisy-chains M5 Max + Mac Studio M3 Ultra into a single inference pool, workable for models > 128 GB.

Is FLUX.2 commercially usable?

Partly. FLUX.2 [dev] and [klein] 9B are non-commercial. FLUX.2 [klein] 4B is Apache-2.0, fully commercial-friendly. For unrestricted commercial image work the safe default is Qwen-Image-2512 or Z-Image-Turbo (both Apache-2.0).

How fast is the M5 Max actually?

Independent benchmarks: gpt-oss-120B at Q4 MLX runs 58-72 tokens/second decode on a 128 GB M5 Max. Qwen3.6-27B at Q4 MLX: 21-26 tokens/second (12-15 at Q8). Prefill is 3.33-4.06× faster than M4 Max per Apple's MLX team. Real, reproducible, on a laptop. Full reproduce-the-numbers guide with shell commands is in section 14.

What happens when the model I'm using gets superseded?

Swap the model in config.yaml; the rest of the studio (agent, tools, skills, memory, pipelines) is unchanged. That is the whole point of the architecture: the model is a swappable component, not the system.

Why not Strix Halo or DGX Spark instead of a Mac?

Both are valid. Same 128 GB unified memory. The deciding factor is bandwidth: M5 Max 614 GB/s vs DGX Spark 273 GB/s vs Strix Halo 256 GB/s. For bandwidth-bound decode workloads (agent tool-call loops), the Mac wins by roughly 2.25× over Spark and 2.4× over Strix Halo. NVIDIA's FP4 hardware advantage helps on some prefill workloads. If you need CUDA-only models or Windows-native tools, Strix Halo wins.

What about the electricity cost?

A 16-inch MacBook Pro M5 Max draws about 90-130W under sustained LLM load (peak under prefill, settles lower at decode). At a typical European rate of ~0.30 €/kWh, eight hours of heavy daily use costs roughly 0.30 €/day or 9 €/month. Cheaper than a single ChatGPT Plus subscription. At US residential rates (~0.16 $/kWh) it is closer to 5 $/month.

Why not just a Mac mini M5 Pro instead of a MacBook?

The Mac mini M5 Pro tops out at 64 GB unified memory. That is enough for a Qwen3.6-27B Q8 (~33 GB) + image gen + voice + transcription, but not enough for gpt-oss-120B (which needs ~63 GB + KV cache). If you do not need the 120B brain and want a desktop, the M5 Pro mini is the cheaper pick. For the full studio with a 120B-class orchestrator, M5 Max 128 GB is the floor.

Can I run this on Linux with an NVIDIA workstation?

Yes for the agent + LLM + image side: Hermes Agent is Linux-native, Ollama / vLLM / SGLang are all Linux-first, FLUX and Qwen-Image run on CUDA. You lose MLX (Mac-only) and MLX-Audio's unified TTS/STT endpoint. You gain CUDA-only models and faster prefill on Blackwell. The architectural recipe in this guide is platform-agnostic; only the per-component server choices change.

What if my use case needs more than 256K context?

Three options. (1) Qwen3.6-27B handles 256K natively; for 1M+ you need MiniMax M2.7 (lightning attention, cloud-only) or Nemotron 3 Super (1M RULER 91.75%). (2) For document workflows, build a retrieval layer (LanceDB + nomic-embed-text + the agent's session_search) instead of stuffing the context; usually higher quality at lower cost. (3) Hermes /compress rolls long sessions into summaries automatically.

How does this compare to a $20-50/month Claude or ChatGPT subscription?

On raw quality, Claude Opus 4.7 and GPT-5.5 lead the open field by 3-6 Intelligence Index points. If you do 5-10 hard reasoning sessions a month and nothing else, the subscription wins on capability-per-dollar. The studio wins when you (a) do a lot of work (no per-token billing), (b) want zero data leaving the machine, (c) want any custom workflow beyond chat (cron + skills + 22 messaging gateways + multi-modal + LoRA fine-tuning), (d) want it to keep working when a subscription lapses or a company pivots. hermes proxy lets you use the subscription through the studio for the rare hard job; you get both.