Program 02 · FrankenLLM · Cognitive Organism

A stitched intelligence. Built from operated models.

Not a single model. An assembled organism: one refined reasoning brain, a colony of small specialist organs, a structured memory spine, a verifier, and a loop that turns every failure into surgical training data.

Brain + Organs + Memory + Verifier · Black-Dog reinforcement
Brain: 7B reasoning · Organs: 8 active

About

We do not build a bigger parrot. We assemble a body.

A single language model is powerful, but fragile. It has no stable organs, no bloodstream, no memory of its own failures, no nervous system that tells it when it has lied. FrankenLLM is our answer to that limitation.

At the top sits a refined reasoning model — the brain. Below it sits a colony of compact sub-billion-parameter specialists: organs for code skeletons, JSON repair, claim extraction, contradiction analysis, rendering, critique, cache matching, and wound repair. Around them sits a native runtime that routes tasks, executes checks, writes traces, and records what happened.

Every output passes through a verifier. Every failure becomes poison. Every success becomes food. The Black-Dog reinforcement loop updates routing conductance so the system does not repeat the same mistake forever.

This is not a prompt chain. It is a stitched runtime organism.

The organism is not finished. Its advantage is not that every organ is already strong. Its advantage is that weak organs can be found, measured, wounded, operated on, repacked, and improved.

Current state

The architecture works. The routing loop is wired. The surgery pipeline runs end-to-end: a 0.5B code organ that solved 6 out of 100 MBPP tasks before surgery solved 13 out of 100 after — with no 7B fallback calls. That is not a large number. It is proof the mechanism is real.

The scores themselves are not the point. The point is that weak organs can be located, operated on, and improved: the organism learns from its own failures. That is what separates it from a chatbot.

The Stack

Seven systems. One body.

01 · THE BRAIN · TOP-LEVEL REASONING

A refined top-level reasoning model.

The brain is the high-capacity model that handles synthesis, judgment, and difficult reasoning. It is not expected to do everything. It receives work from the lower organs and is called last, not first, when the smaller specialists cannot solve the task alone.

Brain-first routing is the failure mode. If the 7B model handles every task by default, the organism is just an expensive wrapper. The brain exists to rescue, synthesize, and decide — not to answer every question before an organ gets a chance.

  • Role: Top-level synthesis, judgment, rescue when organs fail.
  • Donor line: Qwen / DeepSeek-class open models — auditable weights, deployable locally.
  • Policy: 7B last, organs first. Calling the brain is the fallback, not the default.
Model class: 7B · Calls: last resort · Policy: organ-first
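The organ-first policy fits in a few lines. Here is a minimal sketch in operating-theatre Python; the `Organ` shape, the `route` function, and the verifier signature are all illustrative assumptions, not the native runtime's API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Organ:
    """A narrow specialist: one job, one acceptance predicate."""
    name: str
    accepts: Callable[[str], bool]
    run: Callable[[str], str]

def route(task: str, organs: list[Organ],
          brain: Callable[[str], str],
          verify: Callable[[str, str], bool]) -> tuple[str, str]:
    """Organ-first routing: the brain is the fallback, never the default."""
    for organ in organs:
        if not organ.accepts(task):
            continue
        answer = organ.run(task)
        if verify(task, answer):      # hard gate: unverified output is not an answer
            return answer, organ.name
    return brain(task), "brain"       # rescue path: organs could not carry the load
```

The brain is only reached when every matching organ has declined or failed verification, which is exactly why fallback-call counts are a meaningful metric.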

02 · THE ORGANS · SPECIALIST COLONY

A colony of small specialist models.

FrankenLLM uses compact 0.5B-class specialists as lower organs. Each organ has one narrow job: produce code skeletons, repair structured output, extract claims, identify contradictions, render commands, critique failures, match memory forms, or patch wounded outputs.

An organ is not decorative. If it is never called, or never improves score, latency, or reliability, it is tissue for the next surgery pass. An organ audit found that five out of eight registered organs had never been called. That finding changed the routing architecture.

  • Population: 8+ specialist organs — each with one narrow job.
  • Class: Sub-billion parameter models — fast, auditable, locally runnable.
  • Criterion: An uncalled organ is a dead organ. Liveness is a hard requirement.
In surgery: 1

03 · THE BLACK DOG · REINFORCEMENT LOOP

A loop that feeds and starves pathways.

Every route in the system receives food or poison. A successful organ chain becomes easier to choose next time — its conductance increases. A failed route loses conductance. Repeated failures are harvested into surgical training data.

The Black Dog is the system's pain memory. It is how the organism learns what not to do again. When the BD6 surgery pass applied this to the 0.5B code organ — harvesting failed traces, training a QLoRA adapter, merging, repacking, and rerunning — the benchmark score doubled. The mechanism works.

  • Signal: Food on success, poison on failure — conductance updated per route.
  • Memory: Conductance per route — not a global loss function. Surgical and specific.
  • Output: Surgery datasets — failed traces become QLoRA training material.
Signal: food / poison
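The food/poison update can be sketched as a per-route multiplicative rule. This is an assumption for illustration: the constants, the floor, and the function names are invented, not the system's actual update.

```python
# Illustrative constants — not the real ones.
FOOD, POISON, FLOOR = 1.25, 0.5, 0.01

def feed(conductance: dict, route_key: str, success: bool) -> None:
    """Food on success, poison on failure. Conductance is per route, not a global loss."""
    c = conductance.get(route_key, 1.0)
    conductance[route_key] = c * FOOD if success else max(c * POISON, FLOOR)

def choose(conductance: dict, candidates: list) -> str:
    """A well-fed route is easier to choose next time."""
    return max(candidates, key=lambda r: conductance.get(r, 1.0))
```

The floor keeps a starved route selectable in principle, so a repaired organ can earn its conductance back after surgery.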

04 · MEMORY SPINE · PERSISTENT RECALL

Persistent recall. Not chat history.

FrankenLLM does not rely on conversation context as memory. It uses a structured archive of indexed records, reports, decisions, and execution traces. A claim can be tied back to where it came from. A previous failure can be retrieved and used as training material.

Memory is not decoration. It is anatomical continuity. Without it, the organism forgets its wounds. The same failure repeats. The same wrong route gets chosen again. The spine is what gives the system a history it can act on.

  • Recall: Volume / line / record precise — not approximate semantic search.
  • Contents: Indexed records, execution traces, decisions, failure logs, repair histories.
  • Use: Source grounding, failure harvest, route memory, audit trail.
Recall: line-precise · Traces: indexed
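Line-precise recall is an addressing discipline, not a retrieval model. A minimal sketch, assuming a record-id plus line-number scheme (the class and method names are invented):

```python
class Spine:
    """Structured archive with exact addressing: record id + line, not similarity."""

    def __init__(self):
        self._records: dict = {}

    def write(self, record_id: str, text: str) -> None:
        """Store a record as indexed lines."""
        self._records[record_id] = text.splitlines()

    def recall(self, record_id: str, line: int) -> str:
        """Exact recall: a claim can point back to the record and line it came from."""
        return self._records[record_id][line - 1]   # lines are 1-indexed, like the archive
```

Because the address is exact, a source pointer in a verified answer can be checked by anyone holding the archive.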

05 · THE VERIFIER · IMMUNE LAYER

A hard gate against hallucination.

The verifier is the system's immune layer. Code is compiled. JSON is parsed. Terminal tasks run in capsules. Claims require source pointers. If an answer cannot pass the relevant check, it is not treated as complete.

A model can guess. The verifier decides whether the guess survives. The default stance is suspicious. An answer that cannot be verified is not an answer — it is raw material for the wound system.

  • Default: Suspicious. No output is accepted without passing its check.
  • Checks: Compile, parse, execute, hash, source-pointer. Task-specific.
  • Output: Pass / fail / evidence. Failures route directly to the wound system.
Default: suspicious
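Two of the checks (JSON parse, code compile) can be sketched directly. The dispatch shape and evidence strings below are assumptions; the point is the contract: pass/fail plus evidence, suspicious by default.

```python
import json

def verify(kind: str, output: str) -> tuple[bool, str]:
    """Return pass/fail plus evidence. The default stance is suspicious."""
    if kind == "json":
        try:
            json.loads(output)
            return True, "parsed"
        except json.JSONDecodeError as e:
            return False, f"parse error at char {e.pos}"
    if kind == "python":
        try:
            compile(output, "<organ output>", "exec")
            return True, "compiled"
        except SyntaxError as e:
            return False, f"syntax error on line {e.lineno}"
    # No check registered means no verdict — and no verdict means no answer.
    return False, f"no check registered for kind '{kind}'"
```

Note the failure branch carries evidence, not just a boolean: that evidence is what the wound system records.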

06 · THE WOUND SYSTEM · FAILURE → SURGERY

Failure becomes training material.

When an organ fails, the system does not simply discard the output. It records the task, the organ response, the verifier error, the stderr, the expected behavior, and the eventual repair. These wounds become the dataset for the next surgery pass.

This is the central difference between a chatbot and an organism: failure is metabolized. The BD6 surgery pass proved this in practice — a 0.5B code organ improved from 6/100 to 13/100 on MBPP after one wound-harvest-train-repack cycle. The numbers are still small. The mechanism is real.

  • Input: Failed traces — task, response, error, stderr, expected output.
  • Process: Poison harvest → QLoRA → merge → repack. One full surgical loop.
  • Output: A stronger organ. Benchmarked before and after. No exceptions.
BD6 result: MBPP 6→13 / 100 · HumanEval 2→6 / 164
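The wound record and its harvest into training pairs can be sketched like this. The field names and the pair format are illustrative, not the actual dataset schema:

```python
from dataclasses import dataclass

@dataclass
class Wound:
    """One recorded failure: everything needed to learn from it."""
    task: str
    response: str      # what the organ actually produced
    error: str         # the verifier's evidence of failure
    expected: str      # the reference repair, if one exists

def harvest(wounds: list) -> list:
    """Poison becomes food: failed traces joined with reference solutions."""
    return [
        {"prompt": w.task, "rejected": w.response, "chosen": w.expected}
        for w in wounds
        if w.expected    # a wound without a known repair is not yet trainable
    ]
```

A harvested pair keeps both the failure and the repair, which is what makes the next QLoRA pass surgical rather than generic fine-tuning.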

07 · THE BODY · NATIVE RUNTIME

A compiled runtime that ships the organism.

FrankenLLM is not intended to live as Python glue. The live path belongs in a compiled native runtime: routing, model loading, verification hooks, DAG traces, capsules, and memory all sit under one local body. Python may exist in the operating theatre as a temporary surgical tool. It does not become the organism.

Compile what you ship. Research tooling can use Python. Production inference, routing, verification hooks, packs, and memory belong in the native runtime. The body is what makes the organism sovereign — deployable on consumer hardware, without cloud dependency, without telemetry.

  • Runtime: C++ / CUDA / native packs — compiled, not interpreted.
  • Target: Local consumer hardware — sovereign deployment, no cloud dependency.
  • Doctrine: Compile what you ship. Python is for the operating theatre only.
Runtime: C++ / CUDA · Target: consumer GPU · Dependency: none

Selected Work

What happened when we ran the numbers.

Case 01 · BD6 Surgery

The first organ surgery loop.

We measured the raw 0.5B code organ against public coding tasks and found the truth: it was fast, but weak. On MBPP it solved 6 out of 100 tasks. On HumanEval it solved 2 out of 164. That failure was not hidden. It became surgical material.

The Black-Dog pipeline harvested the failed traces, joined them with reference solutions, trained a QLoRA adapter, merged it back into the organ, repacked into the native format, and reran the same benchmarks — with no 7B fallback allowed.

Organ: phys05_code_skeleton
MBPP before: 6 / 100
MBPP after BD6: 13 / 100
HumanEval before: 2 / 164
HumanEval after BD6: 6 / 164
7B fallback calls: 0
Case 02 · A/B/C Truth Table

Organ truth table: three routing modes.

Before surgery, we forced the system into three modes: A (7B only), B (0.5B organ only), C (organ-first with 7B fallback). The result was uncomfortable and useful. On MBPP: A scored 60, B scored 6, C scored 60. On HumanEval: A scored 81, B scored 2, C scored 81.

The conclusion was clear: the runtime was wired correctly, but the small organs were not yet strong enough to improve on the top brain. That finding became the reason for BD6.

MBPP — 7B only (A): 60 / 100
MBPP — organ only (B): 6 / 100
MBPP — organ+fallback (C): 60 / 100
HumanEval — 7B only (A): 81 / 164
HumanEval — organ only (B): 2 / 164
HumanEval — organ+fallback (C): 81 / 164
Conclusion: Organs needed surgery
Case 03 · Organ Audit

The organ liveness audit.

A traffic audit showed that earlier runs were too often 7B-monolithic. In the last 500 DAG entries before rewiring, every call went through the 7B chat path. Five of eight registered 0.5B organs had never been called at all. The wound organ did not exist yet.

That audit changed the architecture. FrankenLLM now treats organ liveness as a hard requirement: if an organ is not called, logged, scored, and improved, it is not part of the body.

DAG entries audited: 500
7B-only routes: 500 / 500
Dead organs found: 5 / 8
Wound organ status: Did not exist
Correction: Organ-first rewiring + surgery
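The liveness audit itself is a small computation over the DAG trace. A sketch, with entry fields like `organ` and `route` as assumed names:

```python
def audit(dag_entries: list, registered: set) -> dict:
    """Which registered organs were never called, and how brain-heavy was traffic?"""
    called = {e["organ"] for e in dag_entries if e.get("organ")}
    brain_only = sum(1 for e in dag_entries if e.get("route") == "7b_chat")
    return {
        "dead_organs": sorted(registered - called),   # uncalled = dead
        "brain_only_routes": brain_only,
        "entries_audited": len(dag_entries),
    }
```

Run over the trace archive, a report like this is what turned "5 of 8 organs are dead" from a suspicion into a measured fact.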
Case 04 · Proof-Carrying Execution

Answers that carry their own evidence.

FrankenLLM does not merely write shell commands or code — it can execute them inside controlled capsules, capture stdout and stderr, verify artifacts, hash outputs, and preserve a replay recipe. This turns answers into evidence.

When a route passes, the system can show exactly what ran. When it fails, the wound becomes training material. An output without a verifier trace is only text. FrankenLLM is built to return not just an answer, but the evidence that produced it.

Execution: Capsule-based, sandboxed
Evidence: stdout / stderr / hashes
Replay: Preserved per task
Failure use: Wound → surgery dataset
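In operating-theatre Python, the evidence capture could be sketched with a plain subprocess. A real capsule would add sandboxing and resource limits; the function name and return shape here are invented for illustration:

```python
import hashlib
import subprocess
import sys

def run_capsule(code: str, timeout: float = 10.0) -> dict:
    """Run code in a child process; capture stdout, stderr, an output hash, and a replay recipe."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "sha256": hashlib.sha256(proc.stdout.encode()).hexdigest(),  # artifact hash
        "replay": [sys.executable, "-c", code],                      # recipe preserved per task
    }
```

The returned dict is the evidence bundle: on pass it shows exactly what ran, and on fail its stderr feeds the wound record.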

Current Metrics

The numbers, unedited.

Axis · Result · Status
MBPP, 7B only · 60 / 100 · baseline
MBPP, 0.5B organ — before surgery · 6 / 100 · weak
MBPP, 0.5B organ — after BD6 · 13 / 100 · improving
HumanEval, 7B only · 81 / 164 · baseline
HumanEval, 0.5B organ — before surgery · 2 / 164 · weak
HumanEval, 0.5B organ — after BD6 · 6 / 164 · improving
LiveCodeBench easy · 0 / 50 · not solved yet
Organ audit — dead organs found · 5 / 8 · corrected
7B fallback calls after BD6 · 0 · organ-first holds
Native runtime doctrine · C++ / CUDA · active

Principles

Five rules that govern the organism.

01

Organs must earn their place.

A specialist model that is never called is dead. A specialist that does not improve score, latency, or reliability is tissue for the next surgery pass. Organ liveness is a hard requirement, not a preference.

02

The brain is last, not first.

The top model should not handle every task by default. Lower organs attempt narrow work first. The brain synthesizes, judges, or rescues only when organs cannot carry the load. Brain-first routing is the failure mode.

03

Failure is not waste.

Every failed output is a training example waiting to be harvested. The verifier produces the wound. The Black Dog marks it as poison. Surgery turns it into a stronger organ. This cycle is the organism's immune system.

04

No proof, no claim.

An answer without a verifier trace is only text. FrankenLLM is built to return not only an output, but the evidence that produced it. An unverified output is raw material, not a result.

05

Native is the body.

Research tooling can use Python. The live organism cannot depend on it. Production inference, routing, verification hooks, packs, and memory belong in the native runtime. Compile what you ship.