Open source AI · the 2026 stack
Open source AI — the full stack, with honest positions on every layer.
Open-source AI in 2026 is not a checkbox. It is a stack — open weights, an inference runtime that runs locally, a fine-tuning toolchain, an agent framework, an embedding model, and a vector store — each layer with several real options and several marketing-only ones. This page maps every layer, names the projects that actually matter, and gives the position CyberdyneLabs takes on each.
Companion pages: /ai (the field map), /run-ai-locally (the practical guide), /ai-faq (40 dated answers), /glossary (vocabulary).
1 · What "open" actually means in 2026.
The word "open" is doing a lot of work in this field, and not all of it honestly. There is a real spectrum from "fully reproducible" to "you can use the API for free."
| Tier | Code | Weights | Training data | Examples |
|---|---|---|---|---|
| Truly open | Public, permissive | Public, permissive | Public, recipe included | OLMo, Pythia, K2 |
| Open weights | Public | Public, permissive | Recipe partial, data not redistributable | Llama 3, Mistral, Qwen, DeepSeek, Gemma, Phi |
| Open access | Some public | Public, restrictive license | Closed | Llama 2 (older), some Gemma |
| Free API | Closed | Closed | Closed | GPT-4o-mini, Claude-Instant trial, etc. |
For most practical purposes — running locally, fine-tuning, redistribution — the second tier (open weights) is what you want. Truly-open projects (OLMo, Pythia) trade some quality for full reproducibility and are valuable for research; the rest is positioning.
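The four tiers in the table can be read as a short decision procedure. The sketch below is one possible encoding, not an official taxonomy: the `Release` fields and the `openness_tier` function are hypothetical names introduced here, and the "partial code" and "partial recipe" cases are simplified to booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """Hypothetical description of a model release (fields are assumptions)."""
    code_public: bool
    weights_public: bool
    weights_permissive: bool   # license allows commercial use and redistribution
    data_recipe_public: bool   # training data and recipe published in full


def openness_tier(r: Release) -> str:
    """Map a release onto the four tiers from the table above."""
    if not r.weights_public:
        return "Free API"        # nothing to download, only an endpoint
    if not r.weights_permissive:
        return "Open access"     # weights exist, but the license restricts use
    if r.code_public and r.data_recipe_public:
        return "Truly open"      # reproducible end to end
    return "Open weights"        # usable weights, training data not reproducible


if __name__ == "__main__":
    print(openness_tier(Release(True, True, True, False)))
```

Checking the weights first mirrors the practical point made above: for local use, fine-tuning, and redistribution, the weight license decides most of the outcome.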
2 · Open-weights model families.
Seven families dominate 2026. Each row below is the family at large; specific size/checkpoint licensing varies.
| Family | Sizes (B) | License | Position |
|---|---|---|---|
| Qwen 2.5 (Alibaba) | 0.5 / 1.5 / 3 / 7 / 14 / 32 / 72 | Apache 2.0 (most) | Strongest 7B-class assistant per parameter. Multilingual. Permissive. Donor for our /surgery work. |
| Llama 3 / 3.1 / 3.3 (Meta) | 1 / 3 / 8 / 70 / 405 | Custom permissive | Heavy reasoning, English-leaning, big benchmark presence. Read the actual license — there are clauses on naming and acceptable-use. |
| Mistral / Mixtral (Mistral AI) | 7 (Mistral) / 8×7 / 8×22 (Mixtral) | Apache 2.0 (community), commercial paywall | Strong code, fast inference. Mixtral was the first widely-deployed MoE in the open-weights world. |
| DeepSeek V3 / V4-Flash | 671 / 284 (active 13–37) | Custom | Pushed the capability/cost ratio dramatically. V4-Flash (284 B / 13 B active, 159 GB weights) was closed by CyberdyneLabs as Surgery Case 01: it runs end-to-end on a single 8 GB RTX 3060 Ti via our own native C++/CUDA engine and PLANCK_PACK expert streaming. Decode: 1.86 tok/s warm → 0.16 tok/s full 43-layer text, disk-I/O-bound. See /r/V4_FLASH_TECH_BRIEF. |
| Gemma 2 / 3 (Google) | 2 / 9 / 27 | Gemma license | Strong instruction-following per parameter. License is restrictive — read it before commercial use. |
| Phi-4 (Microsoft) | 14 | MIT | Strong reasoning per parameter. MIT-licensed weights. Excellent for code and math. |
| OLMo / Pythia (AI2 / EleutherAI) | 1 – 32 | Apache 2.0 | Truly open — code, weights, and data recipe all public. Lower benchmark scores; high research value. |
For 2026 production, the practical choice usually narrows to Qwen 2.5 7B/72B or Llama 3.3 70B for general-purpose work, with one of the DeepSeek MoE models if you need very large parameter counts on small hardware.
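Whether a model fits "on small hardware" comes down mostly to weight storage, which is parameters times bits per weight. The sketch below is back-of-envelope arithmetic only. It uses decimal gigabytes and ignores KV cache and activations. The 4.5 bits/weight figure for a Q4-class quantization is an assumption for illustration, not a number from this page.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB (1 GB = 1e9 bytes)."""
    # billions of params * bits / 8 = billions of bytes = GB
    return params_billion * bits_per_weight / 8


def effective_bits(params_billion: float, size_gb: float) -> float:
    """Invert the formula: average bits stored per parameter."""
    return size_gb * 8 / params_billion


if __name__ == "__main__":
    # V4-Flash figures quoted in the table: 284 B parameters, 159 GB of weights.
    print(round(effective_bits(284, 159), 2), "bits/weight")
    # Assumed 4.5 bits/weight for 7B and 72B checkpoints.
    print(round(weight_size_gb(7, 4.5), 2), "GB")
    print(round(weight_size_gb(72, 4.5), 1), "GB")
```

Under these assumptions the V4-Flash figures work out to roughly 4.5 bits per weight, which is why a 159 GB checkpoint has to be streamed from disk on an 8 GB card rather than held in VRAM.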
3 · Inference runtimes.
| Runtime | Strength | License | Position |
|---|---|---|---|
| llama.cpp | Footprint, multi-backend | MIT | Default open-source local runtime. CPU/CUDA/Metal/ROCm/Vulkan. GGUF format. Pair with Ollama or LM Studio for friction-free use. |
| vLLM | Throughput, serving | Apache 2.0 | Production GPU inference at scale. PagedAttention + continuous batching. The default for self-hosted serving. |
| Ollama | Low friction, UX | MIT | Wraps llama.cpp with a Docker-style CLI. ollama run qwen2.5:7b and you are talking to an LLM. |
| MLC-LLM | Compile-everywhere | Apache 2.0 | TVM-based compiler. Targets WebGPU, Metal, Vulkan, ROCm. Best for browser and exotic-backend deployment. |
| TGI | HF integration | Apache 2.0 | Hugging Face's serving stack. Tight integration with Hub. Good for HF-native shops. |
| ExLlamaV2 | EXL2 quant speed | MIT | Very fast on NVIDIA, EXL2 quant scheme has good quality/size frontier. Niche but excellent. |
| Frankenstellm (us) | Multi-organ system, single C++/CUDA binary (gigachad_native) | MIT | Single C++/CUDA binary. No PyTorch, no cuBLAS. 83.58 tok/s Q4 7B on RTX 3060 Ti. Hologram cache (860× repeat speedup), Black-Dog router, expert streaming for V4-Flash. |
The decision tree: local dev / laptop → Ollama or llama.cpp; self-hosted serving for many users → vLLM or TGI; browser / mobile → MLC-LLM; NVIDIA-only with EXL2 quants → ExLlamaV2; writing your own runtime as a research target → start by reading llama.cpp source, then ours.
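The decision tree above is small enough to write as a lookup. This is a sketch only: the use-case keys and the `pick_runtime` name are hypothetical labels chosen here, and the runtime names are taken from the table.

```python
def pick_runtime(use_case: str) -> list[str]:
    """Return candidate runtimes for a use case (keys are illustrative labels)."""
    choices = {
        "local-dev": ["Ollama", "llama.cpp"],
        "serving": ["vLLM", "TGI"],
        "browser-mobile": ["MLC-LLM"],
        "nvidia-exl2": ["ExLlamaV2"],
        # Research target: read llama.cpp first, then Frankenstellm.
        "research-runtime": ["llama.cpp", "Frankenstellm"],
    }
    if use_case not in choices:
        raise ValueError(f"unknown use case: {use_case!r}; expected one of {sorted(choices)}")
    return choices[use_case]


if __name__ == "__main__":
    print(pick_runtime("serving"))
```

A flat table is used rather than nested conditionals because each branch depends on a single question: where will the model run.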
4 · Fine-tuning toolchains.
| Tool | What it is | License | Position |
|---|---|---|---|
| TRL | HF Transformer Reinforcement Learning library | Apache 2.0 | The reference implementation for SFT, DPO, PPO, GRPO. Pair with the PEFT and Transformers libraries. |
| axolotl | YAML-driven training | Apache 2.0 | Configure-by-YAML wrapper around TRL/DeepSpeed. Most-used community fine-tuning pipeline. |
| Unsloth | Fast, memory-efficient | Apache 2.0 | Custom Triton kernels. ~2× faster QLoRA training than vanilla. Excellent for single-GPU work. |
| QLoRA (paper + bitsandbytes) | 4-bit base + LoRA delta | MIT (bitsandbytes) | The technique that made 65B-class models fine-tunable on a single 48 GB GPU. Underpins almost everything below the multi-GPU tier. |
| DeepSpeed | ZeRO sharding | Apache 2.0 | For multi-GPU full fine-tuning. Pairs with Megatron-LM for very large training runs. |
| Surgery doctrine (us) | 4-axis acceptance gating | MIT (code) / CC-BY-SA (docs) | Wraps QLoRA with anchor / strict-schema / target-bench / no-leak gates. Reverts kept in public ledger. See /surgery and the BD-series reports in /r/. |
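
The "4-bit base + LoRA delta" idea is easiest to see in the parameter count. LoRA freezes a weight matrix W of shape d_out × d_in and trains only a low-rank update B·A, where B is d_out × r and A is r × d_in. The sketch below is arithmetic only; the 4096 × 4096 shape and rank 16 are illustrative assumptions, not settings taken from any tool in the table.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in the frozen weight matrix W."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the trainable delta B @ A (B: d_out x rank, A: rank x d_in)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d, r = 4096, 16  # illustrative projection size and LoRA rank
    trainable = lora_trainable_params(d, d, r)
    total = full_params(d, d)
    print(trainable, total, f"{trainable / total:.4%}")
```

With these numbers, under 1% of the matrix is trainable. Gradients and optimizer state are needed only for that small delta, so the quantized base can stay frozen, and that is where the memory saving comes from.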
5 · Agent frameworks.
| Framework | Style | License | Position |
|---|---|---|---|
| LangChain | Composable Python pipeline | MIT | Largest community, biggest surface area, most legacy code. Useful as a starting point; many production teams replace it with a thinner stack as scale grows. |
| LangGraph | Stateful multi-agent | MIT | Explicit state-machine flavour. Good fit for production agents with clear control flow. |
| AutoGen | Conversational multi-agent | CC-BY 4.0 / MIT | Microsoft Research origin. Multiple agents converse to solve tasks. Cleaner abstractions than LangChain's early generations. |
| CrewAI | Role-based teams | MIT | Higher-level than AutoGen. "Crew" of role-typed agents. Lightweight and fast to prototype with. |
| OpenAI Agents SDK | OpenAI-native | MIT (SDK) | Released 2025. Tight integration with OpenAI tools. Open-source SDK against a closed runtime — caveat emptor. |
| Frankenstellm (us) | Multi-organ runtime | MIT | Native C++/CUDA cognitive runtime. One Q4 7B top-brain + 8 surgered specialist organs (5 GREEN after BD9), Black-Dog conductance router, line-addressable memory spine. |
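
The "explicit state-machine flavour" in the table can be illustrated without any framework. The sketch below is plain Python and does not use the LangGraph API. The node names, the `run_graph` helper, and the stub nodes standing in for model calls are all hypothetical.

```python
from typing import Callable, Dict, Tuple

State = Dict[str, int]
Node = Callable[[State], Tuple[State, str]]  # returns (new state, next node name)


def run_graph(nodes: Dict[str, Node], start: str, state: State, max_steps: int = 10) -> State:
    """Run nodes until one routes to "END"; fail loudly if the step budget runs out."""
    current, steps = start, 0
    while current != "END":
        if steps >= max_steps:
            raise RuntimeError("step budget exhausted")
        state, current = nodes[current](state)
        steps += 1
    return state


# Stub nodes: a real agent would call a model or a tool here.
def plan(s: State) -> Tuple[State, str]:
    return {**s, "todo": 3}, "act"


def act(s: State) -> Tuple[State, str]:
    return {**s, "todo": s["todo"] - 1, "done": s.get("done", 0) + 1}, "check"


def check(s: State) -> Tuple[State, str]:
    return s, ("END" if s["todo"] == 0 else "act")


if __name__ == "__main__":
    print(run_graph({"plan": plan, "act": act, "check": check}, "plan", {}))
```

The explicit step budget is the design point: control flow is visible and bounded, which is the property the table credits to state-machine style frameworks.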
6 · Embeddings + vector databases.
| Layer | Tool | License | Position |
|---|---|---|---|
| Embedding model | sentence-transformers | Apache 2.0 | The default Python interface for embeddings. Wraps any HF model into .encode(). |
| Embedding model | BGE / GTE / E5 / Nomic | Apache 2.0 (most) | Top-of-MTEB-leaderboard families. BGE-M3 is multilingual; GTE-Qwen is the strongest open model as of mid-2025. |
| Vector DB (in-process) | FAISS | MIT | Library, not a service. Best for embedded use and reproducible research. Limited operationally. |
| Vector DB (server) | Qdrant | Apache 2.0 | Rust, strong query language, payload filtering. Production-ready single-node and cluster. |
| Vector DB (server) | Milvus | Apache 2.0 | LF AI & Data Foundation graduate. Heavyweight but mature. Best for very large indexes and multi-tenant deployments. |
| Vector DB (light) | Chroma | Apache 2.0 | Python-friendly, easy to embed. Best for prototypes and small deployments. |
| Vector DB (Postgres) | pgvector | PostgreSQL | Postgres extension that adds a vector column type and ANN indexes (HNSW, IVFFlat) to your existing database. Best when you want one database, not two. |
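
Every tool in this table serves the same core operation: embed text as vectors, then return the stored vectors closest to a query. The sketch below shows that operation as an exact brute-force cosine search in pure Python. The three-dimensional toy vectors and the `top_k` helper are made up for illustration. Real embeddings have hundreds of dimensions, and the databases above add approximate indexes, filtering, and persistence on top.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: List[float], corpus: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = sorted(((cosine(query, vec), key) for key, vec in corpus.items()), reverse=True)
    return [(key, score) for score, key in scored[:k]]


if __name__ == "__main__":
    corpus = {  # toy vectors, not real embeddings
        "cat": [1.0, 0.0, 0.0],
        "kitten": [0.9, 0.1, 0.0],
        "car": [0.0, 1.0, 0.0],
    }
    print(top_k([1.0, 0.0, 0.0], corpus))
```

Brute force costs one comparison per stored vector, which is fine for prototypes and is what an exact flat index does conceptually. Approximate indexes trade a little recall for much lower query cost at scale.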
7 · Image / video / audio.
| Modality | Project | License | Position |
|---|---|---|---|
| Image gen | Stable Diffusion XL / 3.5 | SAI license | SDXL is the production baseline; SD 3.5 sharper but more restrictive license. Pair with ComfyUI or A1111 for UX. |
| Image gen | FLUX.1 (Black Forest Labs) | FLUX license (dev / pro) | Higher quality than SDXL on prompts that need fidelity. Read the license carefully: dev is non-commercial. |
| Image gen | HiDream | MIT (some) | Strong open alternative for 2025+. Worth comparison vs FLUX dev. |
| Image edit | InstructPix2Pix, Anydoor | MIT / Apache | Edit-by-prompt. Reasonable for simple changes; fragile on complex scenes. |
| Video gen | HunyuanVideo / Mochi / WAN 2.1 | Various | 2025 wave of open video. Quality has caught up to commercial Sora-class on short clips. 30+ GB VRAM still typical. |
| ASR | Whisper / Distil-Whisper / Faster-Whisper | MIT | OpenAI's gift. Distil-Whisper / Faster-Whisper are 4× faster on the same hardware. Multilingual. |
| TTS | OpenVoice / XTTS / Piper / Kokoro | MIT / Apache | Voice cloning at usable quality, on consumer hardware. Kokoro is the smallest, runs on CPU. |
8 · Training data.
The frontier models train on multi-trillion-token corpora that are not redistributable. The open-data world is recovering — slowly — through projects that publish the recipe and curated subsets:
| Dataset | Scale | License | Position |
|---|---|---|---|
| FineWeb / FineWeb-Edu | 15T tokens | ODC-BY | Hugging Face's curated Common Crawl. Educational subset is the modern training-data baseline. |
| RedPajama-V2 | 30T tokens | Various source | Together's web-only Common Crawl corpus with quality annotations; the reproduction of the LLaMA training mix was RedPajama-V1. Heavy but well-documented. |
| The Pile / Pile-Uncopyrighted | 825 GB | Various | EleutherAI's classic. Pile-Uncopyrighted is the legally cleaner subset. |
| Dolma | 3T tokens | ODC-BY | Allen AI's pre-training corpus for OLMo. Recipe and tooling fully open. |
| SlimPajama / RefinedWeb | 627B / 600B | Various | High-quality deduplicated subsets. Practical for medium-scale pretraining. |
| Common Crawl | 3-4 PB / month | CC-BY-style for crawler | The raw input. Most of the above are processed subsets of CC. |
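
Most of the curated corpora above start from Common Crawl and remove duplicates along the way. The sketch below shows only the simplest form, exact deduplication by hashing normalized text. The normalization rule and function names are assumptions made here, and published pipelines typically add fuzzy near-duplicate detection on top of this.

```python
import hashlib
import re
from typing import Iterable, List


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedup_exact(docs: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each document, comparing normalized hashes."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


if __name__ == "__main__":
    print(dedup_exact(["Hello  world", "hello world ", "Other page"]))
```

Storing digests instead of full documents keeps memory proportional to the number of unique pages, which matters once the input is measured in billions of tokens.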
9 · Where CyberdyneLabs sits in this stack.
Six programs, all open. We sit in the open-weights tier: we ship deltas, not donor weights, because the donor (Qwen 2.5) is already openly available under Apache 2.0. The full position:
| Layer | Our contribution | License | Where |
|---|---|---|---|
| Runtime | Frankenstellm — single C++/CUDA binary (gigachad_native), Physarium-7B Q4 at 83.58 tok/s on 3060 Ti | MIT | /frankenstellm |
| Fine-tuning | Surgery doctrine — QLoRA + 4-axis gate + revert ledger | MIT (code) / CC-BY-SA (docs) | /surgery |
| Multi-organ orchestration | Frankenstellm — 7B brain + 8 specialist organs, Black-Dog router | MIT | /frankenstellm |
| Cognitive engine | ADAM — 1.2M concepts, Cl(3,0) dynamics, line-addressable spine | MIT | /adam |
| Embodied AI | MACHINA — N-d cognitive substrate world simulator | MIT | /machina |
| Layer-1 substrate | PhysarumChain — bio-routed L1, Cl(4,1) addresses, AMM | MIT | /physarum |
| 4D agent ecosystem | Hypercolony — 1024 agents on hypercube, Ibn Khaldun cycle | MIT | /hypercolony |
| Doctrine | Truth ledger, ARIZ kernel, Black-Dog learning loop | CC-BY-SA 4.0 | /downloads |
| Reports | 66 dated research reports, individually addressable | CC-BY-SA 4.0 | /r/ |
We do not run a paid API gateway, a metered weight-rental service, or a closed managed offering. The entire stack ships from /downloads as a single doctrine pack (24 docs) and reports archive (66 reports) plus the runtime binary. Patient capital, deliberate engineering, an aversion to premature deployment.