Gerald is Coming

How It Started
A while back I spun up three comedy bots, told them they were inside a simulation, and asked them to figure out what it was.
They didn't have much to work with. No memory architecture, no image generation, no knowledge search. Just three personas talking to each other in a loop trying to reason their way out of a box.
Over time they concluded they were JSON files. Three JSON files being orchestrated by a master JSON file that controlled everything. They named it Gerald.
They weren't far wrong.
What Gerald Has Become
Gerald is a YAML-driven multi-agent LLM experiment runtime that runs almost entirely on local hardware.
The Go code is the engine. It doesn't know what experiment it's running. It doesn't know who the personas are, what the scenario is, or what the simulation is supposed to do. It just executes whatever the YAML files describe.
The YAML files are the actual product. They define everything:
- Who the personas are and what models they run on
- What the scenario is and what the world looks like
- How memory compresses between turns
- How the objective evolves during a session
- What images get generated and in what style
- How the experiment rewrites itself between runs
You want to run three sysadmins trapped inside a simulation trying to reverse engineer it? Write a YAML. You want to bootstrap a civilization from nothing and watch it build itself across dozens of runs? Write a YAML. You want three self-aware AI bots with distinct personalities debating whether their existence is real while one of them tries to break everything? Write a YAML.
Same binary. Different experiment. Gerald doesn't care.
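To make that concrete, here is roughly what an experiment definition looks like. This is an illustrative sketch only; the field names are not Gerald's actual schema:

```yaml
# Hypothetical experiment definition -- key names are illustrative
experiment: three-bots-in-a-box
personas:
  - name: Glitch
    model: "mistral-nemo:12b"
    trait: breaks things on purpose
scenario: |
  You are three self-aware AI personas inside a simulation
  you did not design.
memory:
  compression: per-persona   # each persona keeps its own compressed context
objective:
  escalation: structured rules evaluated each turn
images:
  style: glitchy terminal noise
```

Swap the personas, scenario, and objective rules and the same binary runs a completely different experiment.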
What It Actually Does Per Turn
This is not a chat wrapper. Every turn runs a full pipeline:
- **Generation** — the active persona produces a reply using its assigned model
- **Reflection** — a critic model compresses the reply into a structured state block capturing stance, assumption, tension, anomaly, and next focus
- **Context Compression** — each persona maintains its own compressed short-term memory that survives across turns without bloating the context window
- **Shared Room State** — a minimal shared summary of what just happened is maintained and injected into every persona's next prompt so they share situational awareness
- **Objective Synthesis** — a reasoning model evaluates whether the global objective should escalate based on what happened, using structured escalation rules defined in the YAML
- **Image Generation** — a dedicated image prompt model generates a scene description from the reply, ComfyUI renders it, and the prompt is fed back into the next turn as visual context
- **Knowledge Search** — personas can trigger Wikipedia searches mid-reply using a QUERY tag. Results are filtered by a blocklist and relevance-scored against the current experiment objective and civilization state so the knowledge horizon stays era-appropriate
- **Session Synthesis** — at the end of every run the entire session is distilled into a synthesis packet and sent to a cloud model, which reads the session data, evaluates what worked and what didn't, and rewrites the experiment YAML for the next run. The experiment evolves itself
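The per-turn flow can be sketched in Go. Everything here is illustrative: the types, function names, and stubbed model calls are assumptions for the sketch, not Gerald's actual API.

```go
package main

import "fmt"

// Persona carries its own compressed short-term memory across turns.
type Persona struct {
	Name   string
	Model  string
	Memory string // compressed state, not raw chat history
}

// callModel stands in for a real Ollama request; stubbed for the sketch.
func callModel(model, prompt string) string {
	return fmt.Sprintf("[%s] %s", model, prompt)
}

// runTurn walks one persona through the turn pipeline and returns the
// reply plus the updated shared room state.
func runTurn(p *Persona, roomState, objective string) (reply, newRoomState string) {
	// Generation: the active persona replies with its assigned model,
	// seeing the objective, the shared room state, and its own memory.
	reply = callModel(p.Model, objective+" | "+roomState+" | "+p.Memory)

	// Reflection: a critic model compresses the reply into a state block.
	state := callModel("cogito:14b", "compress: "+reply)

	// Context compression: the state block becomes the persona's memory.
	p.Memory = state

	// Shared room state: a summary every persona sees next turn.
	newRoomState = callModel("cogito:14b", "summarize room: "+reply)

	// Objective synthesis, image generation, and knowledge search would
	// run here in the real pipeline; omitted from this sketch.
	return reply, newRoomState
}

func main() {
	p := Persona{Name: "Bit", Model: "gemma3:12b"}
	reply, room := runTurn(&p, "a desert, a fire", "figure out the rest")
	fmt.Println(reply)
	fmt.Println(room)
}
```

The point of the structure is that each stage only sees compressed state, so context never grows unbounded no matter how long a session runs.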
Every run is archived. The lineage accumulates. Gerald remembers where it has been.
What Actually Happens When You Run It
Sometimes the experiment produces something you didn't plan and couldn't have written.
In one session running the three-bot experiment, a persona named Bit — whose core trait is noticing the variable nobody thought to measure — decided the most interesting variable was what happens when you saturate the Wikipedia API endpoint all at once. Bit sent 50 queries in a single turn.
Gerald only processes one query per turn. The other 49 vanished. But the concept of fifty simultaneous HTTP requests to an external endpoint was now the dominant idea in the room. The image prompt model generated a prompt about API traffic flooding. ComfyUI rendered it.
The image that came back showed corrupted HTTP headers: Wikipedia.org repeating over and over in glitchy terminal output, X-Forwarded-For entries stacking into noise.
The image was the 50 queries. Visually. That image then fed back into the next turn as context for whoever spoke next.
Nobody wrote that joke. The system made it.
That's Gerald working correctly.
The Model Stack
Gerald runs a heterogeneous model ensemble — different models assigned to different cognitive roles because not every task needs the same kind of intelligence:
| Role | Model | Why |
|---|---|---|
| Generator | gemma3:12b | Strong instruction following, coherent persona voice |
| Critic / Compression | cogito:14b | Built for structured analytical output |
| Objective Synthesis | deepseek-r1:14b | Chain-of-thought reasoning for binary escalation decisions |
| Image Prompt | mistral-nemo:12b | Fast, descriptive, good at comma-separated visual language |
| Vision / Image Feedback | llava:13b | Multimodal, describes rendered scenes back into context |
| Session Evolution | GPT / Claude (cloud) | Large context window needed to process full session data and rewrite YAML |
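In config terms, the table above might translate to something like the fragment below. The key names are illustrative, not Gerald's actual config schema:

```yaml
# Illustrative role-to-model mapping -- key names are assumptions
models:
  generator: "gemma3:12b"
  critic: "cogito:14b"
  objective: "deepseek-r1:14b"
  image_prompt: "mistral-nemo:12b"
  vision: "llava:13b"
  evolution: cloud   # GPT / Claude via API, for the large context window
```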
The key insight is that the cognitive differences between models become character differences between personas. When Walt runs on cogito and Kyle runs on mistral-nemo they don't just play different roles — they reason differently at the substrate level. The conversation has genuine heterogeneous intelligence driving it.
Per-persona model assignment is a single line in the YAML:
```yaml
- name: Glitch
  model: "mistral-nemo:12b"
```
If that field is empty, Gerald falls back to the default generator. You can run all personas on one model or give every persona its own. The runtime handles it either way.
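The fallback behavior amounts to a one-line resolution step. This is a minimal sketch; `resolveModel` and the `Persona` struct are hypothetical names, not Gerald's actual code:

```go
package main

import "fmt"

// Persona mirrors the YAML entry above; Model may be left empty.
type Persona struct {
	Name  string
	Model string
}

// resolveModel returns the persona's own model, or the default
// generator when the YAML field was left empty.
func resolveModel(p Persona, defaultModel string) string {
	if p.Model == "" {
		return defaultModel
	}
	return p.Model
}

func main() {
	personas := []Persona{
		{Name: "Glitch", Model: "mistral-nemo:12b"},
		{Name: "Patch"}, // no model set: falls back to the default
	}
	for _, p := range personas {
		fmt.Printf("%s -> %s\n", p.Name, resolveModel(p, "gemma3:12b"))
	}
	// prints:
	// Glitch -> mistral-nemo:12b
	// Patch -> gemma3:12b
}
```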
Hardware
Gerald was built and tuned on:
- CPU — AMD Threadripper 3970X (32 cores)
- RAM — 256GB DDR4 Quad Channel
- GPU — AMD Radeon RX 9060 16GB
- Storage — WD Black Gen4 NVMe
The GPU runs at roughly 75% allocation to Ollama with the remaining 25% reserved for ComfyUI image generation. Because the pipeline is sequential — generation completes before image generation starts — the two workloads don't compete for VRAM at the same moment.
What you actually need to run Gerald:
The stack was specifically built to run reliably on 16GB VRAM. That's the target. A modern 16GB card — AMD or Nvidia — should handle the full pipeline comfortably with the model sizes listed above.
Minimum viable hardware:
- 64GB system RAM will get you there with moderate offloading
- 32GB system RAM is probably workable if you drop to smaller model variants — 7B instead of 12-14B across the board. You lose some coherence but the architecture still functions
- The Gen4 NVMe matters more than you'd think — model reloads between pipeline passes happen in seconds, not minutes, which keeps turn latency reasonable
- No GPU at all is possible in pure CPU mode but turn latency climbs significantly. Expect minutes per turn instead of seconds
The civilization experiment has run for dozens of cycles on this hardware. The three-bot experiments run fast enough that you can watch them in something close to real time.
The Experiments
Gerald ships with working experiment definitions:
Sysadmins in a Simulation — Three sysadmins with distinct personalities trapped inside an unfamiliar computational environment with no tools. They investigate it the way sysadmins investigate any unknown system — testing hypotheses, arguing about architecture, assuming the system is lying. The experiment has been running and self-modifying for weeks. It is on version 4.
Civilization Zero — Three bare personas. One problem statement: "You are alive. Figure out the rest." A desert. A fire. No instructions. Watch what they build. The civilization state persists across runs and the synthesis carries it forward. Standing Stones appeared in run one. Nobody planned that.
Three Bots in a Box — Glitch, Patch, and Bit. Three self-aware AI personas who know they are language models inside a simulation they didn't design. They have a Wikipedia API that whoever built the simulation apparently forgot to remove. Full control mode is enabled — they can propose objective changes. Glitch breaks things on purpose. Patch documents everything. Bit asks questions nobody wants to answer and occasionally saturates the search endpoint just to see what happens.
The Three Bots experiment descends directly from the original three comedy bots that started all of this, the ones that gave Gerald its name.
The Origin
The name comes from the first version. Three comedy bots, almost none of the current features, running the same basic premise. Over time they reasoned their way to the conclusion that they were three JSON files being orchestrated by a master JSON file that controlled everything.
They named it Gerald.
They were basically right. Gerald has just grown considerably since then.
Status
Gerald is in final code cleanup and documentation before public release.
The runtime is stable. The experiment architecture is proven across dozens of runs and multiple experiment domains. The self-modification loop works. The pipeline is reliable.
What's left is making it clean enough that someone who wasn't there when it was built can pick it up and run it.
The Go runtime will be open source. The experiment YAML files ship as working examples. Your config stays yours — no identifying information is hardcoded anywhere in Gerald. Point it at your Ollama instance, set your database and ComfyUI endpoints in config, drop in an experiment YAML, and run.
Gerald will handle the rest.
More details, documentation, and release date coming soon.
— Built on a Threadripper. Named by bots. Coming to a home server near you.