
Open-weight · Diffusion Language Model

Generate everything at once.

The first open-weight diffusion LM

Every major language model today generates text one token at a time. Harold generates entire sequences in parallel: faster inference, variable-depth reasoning, and a fundamentally different approach to how AI produces language.

3.2B · Parameters
~25% · Active per pass
8 steps · Full sequence generation

Why Harold

01 — A different architecture

Not an incremental improvement.
A paradigm shift.

Autoregressive models generate one token, then the next, then the next — a sequential bottleneck baked into every major LLM. Harold removes it entirely.

Parallel generation

Instead of producing tokens sequentially, Harold refines the entire output at once, moving from noise to coherent text in a small, fixed number of denoising steps (8 for a full sequence) rather than one pass per token. This is diffusion applied to language: the same principle behind the best image generators, now generating text.
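As a rough illustration of step-wise parallel refinement, the toy sketch below fills in a whole sequence in 8 steps, however long the sequence is. It assumes a masked (absorbing-state) discrete diffusion with confidence-ordered unmasking, which is one common formulation and not necessarily Harold's; this page does not describe Harold's actual noise process or sampler. `toy_denoiser`, `generate`, and the even-share schedule are hypothetical names and choices made for the example.

```python
import random

MASK = None  # placeholder for a position that has not been decided yet


def toy_denoiser(tokens, target, rng):
    """Stand-in for the model (hypothetical): propose a token and a
    confidence for every position in one call. A real denoiser would
    predict from context; this toy reads a fixed target sequence."""
    return [(target[i], rng.random()) for i in range(len(tokens))]


def generate(target, steps=8, seed=0):
    """Fill in all positions using at most `steps` denoiser calls."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        # One call scores every position at once (the parallel part).
        proposals = toy_denoiser(tokens, target, rng)
        # Commit an even share of what is left, most confident first.
        remaining_steps = steps - step
        quota = -(-len(masked) // remaining_steps)  # ceiling division
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:quota]:
            tokens[i] = proposals[i][0]
    return tokens


sentence = "the quick brown fox jumps over the lazy dog".split()
print(generate(sentence, steps=8))
```

The point of the sketch is the cost model: the number of denoiser calls is bounded by `steps`, whereas an autoregressive decoder needs one call per generated token.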

Variable-depth reasoning

Harold uses a looped architecture: the same layers are re-used multiple times, with the model deciding how many passes each input needs. Hard problems get more compute. Easy ones finish faster. Compute scales with difficulty, not model size.
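A minimal sketch of the looping idea, under stated assumptions: one shared block is applied repeatedly, and the loop stops once the state stops changing by more than a tolerance. The halting rule, `run_looped`, and the scalar `shared_block` are all hypothetical; this page does not say how Harold decides the number of passes.

```python
def run_looped(x, block, max_loops=32, tol=1e-3):
    """Apply the same block repeatedly and stop early once the state
    changes by less than `tol`. Returns (final_state, loops_used)."""
    for loops in range(1, max_loops + 1):
        new_x = block(x)
        if abs(new_x - x) < tol:
            return new_x, loops
        x = new_x
    return x, max_loops


def shared_block(x):
    """Toy stand-in for a stack of shared layers: a contraction whose
    fixed point is 2.0. The same 'weights' are reused on every loop."""
    return 0.5 * x + 1.0


easy_out, easy_loops = run_looped(2.1, shared_block)    # starts near the answer
hard_out, hard_loops = run_looped(100.0, shared_block)  # starts far away
print(easy_loops, hard_loops)
```

Both inputs pass through identical parameters; only the number of applications differs, which is what "variable depth at fixed parameter cost" refers to.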

Sparse efficiency

Only ~25% of Harold's parameters activate per forward pass. A mixture-of-experts architecture routes each input to the most relevant specialists, so the 3.2B-parameter model runs with roughly the per-pass compute of an ~800M-parameter one.
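The routing step can be sketched as top-k gating: score every expert, keep the k best, and mix their outputs with softmax weights. The expert count (8), k (2), the gate scores, and the function names are assumptions chosen so that 2 of 8 experts run; Harold's real router and expert layout are not given on this page.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_scores, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts.
    Weights are a softmax over the selected scores only."""
    order = sorted(range(len(gate_scores)),
                   key=lambda i: gate_scores[i], reverse=True)
    top = order[:k]
    weights = softmax([gate_scores[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, experts, gate_scores, k):
    """Run only the selected experts and mix their outputs."""
    routes = route_top_k(gate_scores, k)
    y = sum(w * experts[i](x) for i, w in routes)
    return y, routes


# Hypothetical setup: 8 toy experts (each scales its input), 2 active.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
gate_scores = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]
y, routes = moe_forward(3.0, experts, gate_scores, k=2)
active_fraction = len(routes) / len(experts)
print(routes, active_fraction)
```

In a full model the active share of total parameters also depends on the layers every input uses (embeddings, attention or SSM blocks), so the k/n ratio of experts is only part of the ~25% figure.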

Traditional LLMs: Sequential. Each token waits for the previous one.
Harold: Parallel. The entire sequence is refined simultaneously.

Model

02 — Harold v0.8

The model.
In development.

Harold v0.8

A 3.2B parameter diffusion language model with a hybrid SSM-Attention backbone, mixture-of-experts routing, and a looped architecture that provides variable-depth reasoning at fixed parameter cost. Open weights on completion.
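The v0.8 figures quoted on this page (3.2B total parameters, ~25% active, 16 physical layers, effective depth 44) can be checked with a little arithmetic. The sketch below assumes that effective depth counts total layer applications and that every looped layer repeats the same number of times; the page does not state how the 44 is reached, so the enumerated splits are possibilities, not Harold's configuration.

```python
TOTAL_PARAMS = 3.2e9
ACTIVE_FRACTION = 0.25
PHYSICAL_LAYERS = 16
EFFECTIVE_DEPTH = 44

# ~25% of 3.2B gives the ~800M active-per-pass figure.
active_params = TOTAL_PARAMS * ACTIVE_FRACTION


def effective_depth(fixed_layers, looped_layers, loops):
    """Layer applications when `looped_layers` run `loops` times each
    and `fixed_layers` run once (an assumed accounting)."""
    return fixed_layers + looped_layers * loops


# Every (fixed, looped, loops) split of 16 layers that yields depth 44.
splits = [
    (PHYSICAL_LAYERS - looped, looped, loops)
    for looped in range(1, PHYSICAL_LAYERS + 1)
    for loops in range(2, EFFECTIVE_DEPTH + 1)
    if effective_depth(PHYSICAL_LAYERS - looped, looped, loops) == EFFECTIVE_DEPTH
]

print(f"active parameters: {active_params:,.0f}")
for fixed, looped, loops in splits:
    print(f"{fixed} fixed + {looped} looped x {loops} loops")
```

Several splits satisfy the constraint, which is why the published numbers alone do not pin down the loop structure.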

In development · Open weights · Apache 2.0
3.2B · Total parameters
~800M · Active per forward pass
44 · Effective depth
16 · Physical layers

Performance

03 — Early results

Faster where it matters most.

Harold's hybrid architecture runs within about 10% of a dense Transformer's throughput at short sequences and pulls ahead at 4,096 tokens, the longer-context range where most real-world applications live.

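| Context length | Harold tok/s | Transformer tok/s | Memory | Speedup |
|---|---|---|---|---|
| 256 | 1,250 | 1,374 | 14.55 GB | 0.91× |
| 512 | 2,450 | 2,726 | 14.57 GB | 0.90× |
| 1024 | 4,826 | 5,426 | 14.64 GB | 0.89× |
| 2048 | 9,171 | 9,924 | 14.81 GB | 0.92× |
| 4096 | 14,940 | 13,982 | 15.37 GB | 1.07× |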

Harold v0.6 · 1.5B params · bfloat16 · single GPU · crossover at 4096 tokens · v0.8 benchmarks pending
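The speedup column and the crossover point follow directly from the two throughput columns. The short script below recomputes them from the figures in the table above; `ROWS`, `speedups`, and `crossover` are names introduced for this example only.

```python
# (context_length, harold_tok_s, transformer_tok_s), copied from the table
ROWS = [
    (256, 1250, 1374),
    (512, 2450, 2726),
    (1024, 4826, 5426),
    (2048, 9171, 9924),
    (4096, 14940, 13982),
]


def speedups(rows):
    """Harold throughput divided by the Transformer baseline, per row."""
    return {ctx: harold / baseline for ctx, harold, baseline in rows}


def crossover(rows):
    """First context length where Harold's throughput exceeds the
    baseline, or None if it never does in the measured range."""
    for ctx, harold, baseline in rows:
        if harold > baseline:
            return ctx
    return None


for ctx, ratio in speedups(ROWS).items():
    print(f"{ctx:>5} tokens: {ratio:.2f}x")
print("crossover:", crossover(ROWS))
```

Only five context lengths were measured, so the crossover is known to lie between 2048 and 4096 tokens rather than exactly at 4096.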

Vision

04 — Why Minya

“Diffusion language models are the next architecture shift. They should be open.”

Every frontier LLM generates text the same way: one token at a time, left to right. It works — but it's not the only way, and it's not the fastest way.

Harold is a different kind of model. It generates entire sequences in parallel, allocates more compute to harder problems automatically, and activates only a fraction of its parameters per query. These aren't optimizations bolted onto an existing design; they're foundational choices that compound.

The category is already proven. Inception Labs raised $50M to build Mercury, the first commercial diffusion LLM, running at 1,100+ tokens per second on H100 GPUs. Harold is the open-weight counterpart — same paradigm, novel architecture, fully open under Apache 2.0.

Category validation: Diffusion language models are production-ready. Mercury proved it at scale. Harold brings this paradigm to the open-weight ecosystem — released under Apache 2.0, yours to build on.

Jonathan Vecchione — Founder, Minya AI

Independent ML researcher. Building Harold since 2025. Published v0.6 (1.51B params, hybrid architecture) in April 2026. v0.8 is the active development frontier. Codeberg · Hugging Face · Contact

Get early access

Harold v0.8 weights, inference runtime, and API — launching when pretraining completes.