Open-weight · Diffusion Language Model

Generate everything
at once.

The first open-weight diffusion LM

Every major language model today generates text one token at a time. Harold generates entire sequences in parallel: faster inference at long context, variable-depth reasoning, and a fundamentally different approach to how AI produces language.

3.2B · Parameters

~25% · Active per pass

8 steps · Full sequence generation

Request early access Explore Harold →


01 — A different architecture

Not an incremental improvement.
A paradigm shift.

Autoregressive models generate one token, then the next, then the next — a sequential bottleneck baked into every major LLM. Harold removes it entirely.


Parallel generation

Instead of producing tokens sequentially, Harold refines the entire output at once, going from noise to coherent text in a small, fixed number of refinement steps (8 for a full sequence). This is diffusion applied to language: the same principle behind the best image generators, now applied to text.
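
The idea of refining a whole sequence in a few parallel steps can be illustrated with a toy iterative-unmasking loop, one common way to decode discrete diffusion models. This is a minimal sketch and not Harold's sampler: `toy_denoiser` is a hypothetical stand-in that already knows the target text and attaches random confidences, where a real model would predict both.

```python
import random

MASK = "_"

def toy_denoiser(tokens, target, rng):
    """Hypothetical stand-in for a trained model: proposes a token and a
    confidence for every masked position in one parallel call."""
    return [(i, target[i], rng.random())
            for i, tok in enumerate(tokens) if tok == MASK]

def generate(target, steps=8, seed=0):
    """Start fully masked and commit the most confident proposals each step."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    history = []
    for step in range(steps):
        proposals = toy_denoiser(tokens, target, rng)
        if not proposals:
            break  # nothing left to refine
        remaining_steps = steps - step
        # Commit enough positions per step to finish within the step budget.
        k = -(-len(proposals) // remaining_steps)  # ceiling division
        proposals.sort(key=lambda p: p[2], reverse=True)
        for i, tok, _ in proposals[:k]:
            tokens[i] = tok
        history.append(" ".join(tokens))
    return tokens, history

if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog today".split()
    _, trace = generate(sentence, steps=8)
    for line in trace:
        print(line)
```

Every step touches all positions at once, so the number of model calls is set by the step budget (8 here) rather than by the sequence length.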


Variable-depth reasoning

Harold uses a looped architecture: the same layers are reused multiple times, with the model deciding how many passes each input needs. Hard problems get more compute. Easy ones finish faster. Compute scales with difficulty, not model size.
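
A looped model with input-dependent depth can be sketched as one shared block applied repeatedly until a halting rule fires. The page does not say how Harold decides when to stop, so the rule below (stop when the state barely changes) is an assumption, and `shared_block`, `tol`, and `max_loops` are hypothetical names and values chosen for illustration.

```python
def shared_block(state, x):
    """One pass through the reused layers; here a toy update that moves
    the state halfway toward the input."""
    return [0.5 * s + 0.5 * xi for s, xi in zip(state, x)]

def run_looped(x, tol=1e-3, max_loops=16):
    """Apply the same block until the state stops changing (assumed halting
    rule) or the loop budget runs out. Expects a non-empty input."""
    state = [0.0] * len(x)
    passes = 0
    for passes in range(1, max_loops + 1):
        new_state = shared_block(state, x)
        delta = max(abs(n - s) for n, s in zip(new_state, state))
        state = new_state
        if delta < tol:
            break  # converged: this input needed fewer passes
    return state, passes

if __name__ == "__main__":
    _, easy_passes = run_looped([0.01])
    _, hard_passes = run_looped([10.0])
    print("easy input passes:", easy_passes)
    print("hard input passes:", hard_passes)
```

The parameter count is fixed by `shared_block`; only the number of passes, and therefore the compute, varies per input.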


Sparse efficiency

Only ~25% of Harold's parameters activate per forward pass. A mixture-of-experts architecture routes each input to the most relevant specialists — 3.2B total parameters with the inference cost of a much smaller model.
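
Top-k routing is the usual mechanism behind this kind of sparsity, and a minimal sketch is shown below. The expert count, the value of k, and the toy expert functions are assumptions for illustration; the page does not state Harold's routing configuration, and real models also carry shared (non-expert) parameters that this sketch ignores.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of router scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(scores, k=1):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

# Four hypothetical experts; each is a trivial function of the input.
EXPERTS = [
    lambda x: x + 1.0,
    lambda x: x * 2.0,
    lambda x: x - 3.0,
    lambda x: x * x,
]

def moe_forward(x, router_scores, k=1):
    """Run only the selected experts and report the fraction that was active."""
    weights = route(router_scores, k)
    output = sum(w * EXPERTS[i](x) for i, w in weights.items())
    active_fraction = len(weights) / len(EXPERTS)
    return output, active_fraction

if __name__ == "__main__":
    out, frac = moe_forward(3.0, [0.1, 2.0, -1.0, 0.5], k=1)
    print("output:", out, "active fraction:", frac)
```

With one of four experts selected, a quarter of the expert parameters run per input; applied to the page's figures, 25% of 3.2B is about 800M active parameters per forward pass.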

Traditional LLMs

Sequential. Each token waits for the previous one.

Harold

Parallel. The entire sequence is refined simultaneously.

02 — Harold v0.8

The model.
In development.

Harold v0.8

A 3.2B-parameter diffusion language model with a hybrid SSM-Attention backbone, mixture-of-experts routing, and a looped architecture that provides variable-depth reasoning at fixed parameter cost. Open weights on completion.

In development · Open weights · Apache 2.0

3.2B · Total parameters

~800M · Active per forward pass

44 · Effective depth

16 · Physical layers

03 — Early results

Faster where
it matters most.

Harold's hybrid architecture stays within roughly 10% of a dense Transformer's throughput at short sequences and pulls ahead at 4,096 tokens, the longer-context regime where most real-world applications live.

Context length   Harold tok/s   Transformer tok/s   Memory     Speedup
256              1,250          1,374               14.55 GB   0.91×
512              2,450          2,726               14.57 GB   0.90×
1024             4,826          5,426               14.64 GB   0.89×
2048             9,171          9,924               14.81 GB   0.92×
4096             14,940         13,982              15.37 GB   1.07×

Harold v0.6 · 1.5B params · bfloat16 · single GPU · crossover at 4096 tokens · v0.8 benchmarks pending
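
The speedup column is simply Harold's throughput divided by the Transformer's at each context length. The short check below recomputes it from the v0.6 figures reported above; the variable and function names are illustrative.

```python
# (context length, Harold tok/s, Transformer tok/s) from the v0.6 benchmark
ROWS = [
    (256, 1250, 1374),
    (512, 2450, 2726),
    (1024, 4826, 5426),
    (2048, 9171, 9924),
    (4096, 14940, 13982),
]

def speedup(harold_tps, transformer_tps):
    """Throughput ratio, rounded to two decimals as in the table."""
    return round(harold_tps / transformer_tps, 2)

def crossover(rows):
    """First context length at which Harold is faster, or None."""
    for context, harold_tps, transformer_tps in rows:
        if harold_tps > transformer_tps:
            return context
    return None

if __name__ == "__main__":
    for context, h, t in ROWS:
        print(context, speedup(h, t))
    print("crossover at", crossover(ROWS), "tokens")
```

A ratio below 1 means the Transformer is faster at that length; the ratio first exceeds 1 at 4096 tokens, which is the crossover noted in the caption.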

04 — Why Minya

“Diffusion language models are the next architecture shift. They should be open.”

Every frontier LLM generates text the same way: one token at a time, left to right. It works — but it's not the only way, and it's not the fastest way.

Harold is a different kind of model. It generates entire sequences in parallel, allocates more compute to harder problems automatically, and activates only a fraction of its parameters per query. These aren't optimizations bolted onto an existing design; they're foundational choices that compound.

The category is already proven. Inception Labs raised $50M to build Mercury, the first commercial diffusion LLM, running at 1,100+ tokens per second on H100 GPUs. Harold is the open-weight counterpart — same paradigm, novel architecture, fully open under Apache 2.0.

Category validation: Diffusion language models are production-ready. Mercury proved it at scale. Harold brings this paradigm to the open-weight ecosystem — released under Apache 2.0, yours to build on.

Jonathan Vecchione — Founder, Minya AI

Independent ML researcher. Building Harold since 2025. Published v0.6 (1.51B params, hybrid architecture) in April 2026. v0.8 is the active development frontier. Codeberg · Hugging Face · Contact

Get early access

Harold v0.8 weights, inference runtime, and API — launching when pretraining completes.
