An open frontier-grade model, deployed on a single consumer GPU.
We took DeepSeek V4-Flash, an open-weight Mixture-of-Experts foundation model with 284 billion total parameters, 13 billion active per token, a one-million-token context window, and an MIT license. Its inference profile ordinarily requires multiple data-centre-class GPUs, and we refined it down to a form that runs end-to-end on consumer hardware.
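For scale, a quick footprint check. The parameter counts are the ones above; the precision widths and the 24 GB card are assumptions about typical consumer hardware, not details of our deployment:

```python
# Weight footprint of a 284B-total / 13B-active MoE model at common
# precisions. The 24 GB VRAM figure is an assumed consumer card.
TOTAL, ACTIVE, VRAM_GB = 284e9, 13e9, 24

for name, bytes_per_param in [("bf16", 2), ("int8", 1), ("int4", 0.5)]:
    total_gb = TOTAL * bytes_per_param / 1e9
    active_gb = ACTIVE * bytes_per_param / 1e9
    fits = "fits" if total_gb <= VRAM_GB else "does not fit"
    print(f"{name}: all weights {total_gb:.0f} GB ({fits}); "
          f"active per token {active_gb:.1f} GB")
```

Even at 4 bits the full weight set is several times a consumer card's VRAM, while the per-token active set fits with room to spare; that gap is what makes offloading the rest of the model plausible at all.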
The refinement preserved the model's reasoning and removed almost everything else. The resulting system reasons through full-length tasks on a single workstation, behind a corporate firewall, with no remote dependencies. What the exercise showed us, more than the deployment itself, was that the bottleneck of large-model inference is not the reasoning at all: it is the cost of paging the model's experts in and out of GPU memory.
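To make that claim concrete, here is a minimal sketch of expert paging with an LRU cache in VRAM. It treats all 284 billion parameters as expert weights for simplicity; the expert count, top-k routing width, 4-bit quantization, bus bandwidth, VRAM budget, and skewed routing distribution are all illustrative assumptions, not measurements from our deployment:

```python
from collections import OrderedDict
import random

# Minimal simulation of expert paging over the host-to-GPU bus.
# 284B is the post's figure; everything below it is an assumption.
N_EXPERTS = 256                            # assumed expert count
EXPERT_BYTES = 284e9 * 0.5 / N_EXPERTS     # ~0.55 GB each at 4-bit
EXPERTS_PER_TOKEN = 8                      # assumed top-k routing
BUS_BYTES_PER_S = 25e9                     # assumed effective PCIe rate
VRAM_BUDGET = 20e9                         # assumed bytes free for experts

class ExpertCache:
    """LRU set of experts resident in GPU memory; misses page over the bus."""
    def __init__(self, budget_bytes):
        self.capacity = int(budget_bytes // EXPERT_BYTES)   # ~36 experts
        self.resident = OrderedDict()
        self.bytes_moved = 0.0

    def fetch(self, expert_id):
        if expert_id in self.resident:           # hit: weights already on GPU
            self.resident.move_to_end(expert_id)
            return
        if len(self.resident) >= self.capacity:  # evict least recently used
            self.resident.popitem(last=False)
        self.resident[expert_id] = True
        self.bytes_moved += EXPERT_BYTES         # miss: copy weights in

random.seed(0)
cache = ExpertCache(VRAM_BUDGET)
weights = [1 / (rank + 1) for rank in range(N_EXPERTS)]  # a few hot experts
TOKENS = 1000
for _ in range(TOKENS):
    for e in random.choices(range(N_EXPERTS), weights=weights,
                            k=EXPERTS_PER_TOKEN):
        cache.fetch(e)

stall_ms = cache.bytes_moved / TOKENS / BUS_BYTES_PER_S * 1e3
print(f"average transfer stall: {stall_ms:.1f} ms per token")
```

Under these assumptions a single expert miss costs roughly 22 ms of bus traffic, so the hit rate of whatever cache sits between host memory and VRAM, not arithmetic throughput, is what sets the tokens-per-second ceiling.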