Open-weight · Diffusion Language Model
The first open-weight diffusion LM
Every major language model today generates text one token at a time. Harold generates entire sequences in parallel, which enables faster inference, variable-depth reasoning, and a fundamentally different approach to how AI produces language.
01 — A different architecture
Autoregressive models generate one token, then the next, then the next — a sequential bottleneck baked into every major LLM. Harold removes it entirely.
Instead of producing tokens sequentially, Harold refines the entire output at once — from noise to coherent text in a single pass. This is diffusion applied to language: the same principle behind the best image generators, now generating text.
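To make parallel refinement concrete, here is a minimal toy sketch in Python. It is not Harold's decoder. It assumes a masked-diffusion style of decoding in which every position is proposed in parallel and the most confident proposals are committed at each step. The names `toy_denoiser` and `parallel_decode` and the `steps` parameter are hypothetical, and the "model" simply knows the target sentence. With `steps=1` the loop collapses to the single-pass case described above.

```python
import random

MASK = "<mask>"

def toy_denoiser(tokens, target, rng):
    """Hypothetical stand-in for a trained model (assumption, not Harold's API).

    Returns a (token, confidence) proposal for every position in one call,
    which is the parallel part of the idea.
    """
    return [(target[i], rng.random()) for i in range(len(tokens))]

def parallel_decode(target, steps=4, seed=0):
    """Start from an all-masked sequence and unmask it over `steps` rounds."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    per_step = -(-len(target) // steps)  # ceiling division
    history = []
    for _ in range(steps):
        proposals = toy_denoiser(tokens, target, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Commit the most confident proposals among still-masked positions.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = proposals[i][0]
        history.append(list(tokens))
    return tokens, history

if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    final, trace = parallel_decode(sentence, steps=3)
    for step, snapshot in enumerate(trace, start=1):
        print(step, " ".join(snapshot))
```

Each printed line shows the whole sequence at one refinement step, with positions filled in an order set by confidence rather than left to right.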
Harold uses a looped architecture: the same layers are reused multiple times, with the model deciding how many passes each input needs. Hard problems get more compute. Easy ones finish faster. Compute scales with problem difficulty, not with model size.
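The looping idea can be sketched as one shared function applied repeatedly until a halting rule fires. The halting rule below (stop once the state changes by less than `tol`) is an assumption chosen for illustration, since the page does not say how Harold decides its depth. `shared_block` and `run_looped` are hypothetical names.

```python
def shared_block(state):
    """Hypothetical shared layer: the same function is reused on every loop.

    Here it simply halves each value, so the state contracts toward zero.
    """
    return [0.5 * v for v in state]

def run_looped(state, block=shared_block, max_loops=16, tol=1e-3):
    """Apply `block` until the update is smaller than `tol` or loops run out.

    Returns the final state and the number of loops actually used.
    """
    loops = 0
    for loops in range(1, max_loops + 1):
        new_state = block(state)
        delta = max(abs(a - b) for a, b in zip(new_state, state))
        state = new_state
        if delta < tol:  # assumed halting rule: the state has settled
            break
    return state, loops

if __name__ == "__main__":
    _, easy_loops = run_looped([0.001])  # already close to the fixed point
    _, hard_loops = run_looped([1.0])    # far from the fixed point
    print("easy input loops:", easy_loops)
    print("hard input loops:", hard_loops)
```

The parameter count is fixed because only `shared_block` exists. What varies per input is how many times it runs.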
Only ~25% of Harold's parameters activate per forward pass. A mixture-of-experts architecture routes each input to the most relevant specialists: 3.2B total parameters, roughly 0.8B of them active at a time, with the inference cost of a much smaller model.
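A minimal sketch of top-k expert routing shows where a figure like 25% can come from. The choice of 8 experts with 2 active per input is an assumption picked so that the ratio works out to 25%. Harold's real expert count, router, and gating function are not stated here, and `top_k_route` and `moe_forward` are hypothetical names.

```python
import math

def top_k_route(scores, k):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, scores, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    weights = top_k_route(scores, k)
    y = sum(w * experts[i](x) for i, w in weights.items())
    return y, sorted(weights)

if __name__ == "__main__":
    # Eight toy experts; expert i multiplies its input by (i + 1).
    experts = [lambda x, s=s: s * x for s in range(1, 9)]
    # Router scores would come from a learned layer; these are made up.
    scores = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]
    y, active = moe_forward(2.0, experts, scores, k=2)
    print("active experts:", active)
    print("active fraction:", len(active) / len(experts))
    print("output:", y)
```

Only the two selected experts are ever called, so the other six contribute no compute for this input even though their parameters still count toward the total.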
02 — Harold v0.8
A 3.2B-parameter diffusion language model with a hybrid SSM-Attention backbone, mixture-of-experts routing, and a looped architecture that provides variable-depth reasoning at fixed parameter cost. Open weights will be released when pretraining completes.
03 — Early results
Harold's hybrid architecture matches dense Transformers at short sequences and pulls ahead as context grows — where most real-world applications live.
Harold v0.6 · 1.5B params · bfloat16 · single GPU · crossover at 4096 tokens · v0.8 benchmarks pending
04 — Why Minya
Every frontier LLM generates text the same way: one token at a time, left to right. It works — but it's not the only way, and it's not the fastest way.
Harold is a different kind of model. It generates entire sequences in parallel, allocates more compute to harder problems automatically, and activates only a fraction of its parameters per query. These aren't optimizations bolted onto an existing design; they're foundational choices that compound.
The category is already proven. Inception Labs raised $50M to build Mercury, the first commercial diffusion LLM, running at 1,100+ tokens per second on H100 GPUs. Harold is the open-weight counterpart — same paradigm, novel architecture, fully open under Apache 2.0.
Independent ML researcher. Building Harold since 2025. Published v0.6 (1.51B params, hybrid architecture) in April 2026. v0.8 is the active development frontier. Codeberg · Hugging Face · Contact
Harold v0.8 weights, inference runtime, and API — launching when pretraining completes.