Parvus in Quenya · Runs on device. No cloud required.

Edge AI,
reimagined.

Open-source foundation models for constrained hardware

Existing small LLMs are dense Transformers shrunk down.
Harold is built differently — a hybrid SSM-Attention architecture
with sparse MoE and continuous diffusion, designed from first principles
to run locally on smartphones, microcontrollers and IoT devices.

View on GitHub Explore Harold →
Architecture

01 — How it works

Four ideas.
One architecture.

I
Mamba2 SSM

3 out of every 4 layers use Mamba2 State Space Models instead of attention — linear complexity instead of quadratic. The compute advantage compounds at longer sequences, exactly where IoT workloads live.

O(n) complexity
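The scaling claim above can be made concrete with a back-of-the-envelope cost model. This is an illustrative sketch, not profiled numbers: the `state_dim` and `head_dim` constants are stand-in values, and the point is only the asymptotic shape — constant work per token for an SSM versus all-pairs comparison for attention.

```python
def ssm_cost(seq_len: int, state_dim: int = 128) -> int:
    # One recurrent state update per token: O(n) in sequence length.
    return seq_len * state_dim

def attention_cost(seq_len: int, head_dim: int = 128) -> int:
    # Every token attends to every token: O(n^2) in sequence length.
    return seq_len * seq_len * head_dim

# The attention/SSM cost ratio grows linearly with sequence length,
# which is why the advantage compounds on long IoT-style streams.
ratio_short = attention_cost(256) / ssm_cost(256)    # 256x
ratio_long = attention_cost(4096) / ssm_cost(4096)   # 4096x
```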
II
Sparse MoE

2 shared + 16 routed experts, top-2 selection per token. Harold has 3.2B total parameters but activates roughly 800M per forward pass — the inference cost of a much smaller model with the capacity of a larger one.

~25% params active
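The routing scheme above reduces to a small amount of arithmetic. The sketch below shows top-2 selection over the 16 routed experts and the resulting active-parameter fraction; the router scores are made-up example values, not Harold's real activations.

```python
TOTAL_PARAMS = 3.2e9    # total parameters
ACTIVE_PARAMS = 800e6   # activated per forward pass
ROUTED_EXPERTS = 16     # plus 2 shared experts that are always on

def top2(scores):
    """Return indices of the two highest-scoring routed experts."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]

# Example: one token's router scores over 16 routed experts.
scores = [0.1] * ROUTED_EXPERTS
scores[5], scores[11] = 0.9, 0.7
chosen = top2(scores)  # experts 5 and 11 fire; the other 14 are skipped

# The headline fraction: ~800M of 3.2B parameters active per token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS  # 0.25
```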
III
Continuous Diffusion

Harold uses Flow Matching instead of autoregressive next-token prediction. The entire sequence is refined in parallel from noise — enabling parallel decoding and native infill without tricks.

x0-prediction CFM
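A minimal sketch of the x0-prediction CFM training view, assuming the common linear-interpolation path x_t = (1 − t)·x0 + t·noise: the model regresses the clean sequence x0 directly from the noisy x_t, and every position is denoised in parallel. The "model" and shapes here are stand-ins, not Harold's actual network.

```python
def interpolate(x0, noise, t):
    """Noisy sample on the straight path between data (t=0) and noise (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, noise)]

def x0_prediction_loss(model, x0, noise, t):
    """Squared error between the model's x0 estimate and the true x0."""
    x_t = interpolate(x0, noise, t)
    pred = model(x_t, t)
    return sum((p - a) ** 2 for p, a in zip(pred, x0)) / len(x0)

# A perfect oracle recovers x0 exactly from any x_t, so its loss is zero.
data = [1.0, -2.0, 0.5]
oracle = lambda x_t, t: data
assert x0_prediction_loss(oracle, data, [0.3, 0.1, -0.4], 0.5) == 0.0
```

Because the target is the whole clean sequence at once, decoding can refine all positions simultaneously, and infill is just conditioning on the known positions.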
IV
Hash MoE Routing

THOR-style deterministic hash routing replaces learned routing — eliminating router overhead entirely with no convergence penalty. Benchmarked at +5% throughput over learnable routing with identical val loss.

+5% throughput
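Deterministic hash routing in the spirit described above can be sketched in a few lines: each token id maps to a fixed pair of routed experts, so no learned router runs at all. The hashing scheme below is an illustrative choice, not the exact one Harold uses.

```python
import hashlib

NUM_ROUTED_EXPERTS = 16

def hash_route(token_id: int) -> list:
    """Pick 2 distinct experts deterministically from the token id."""
    digest = hashlib.sha256(str(token_id).encode()).digest()
    first = digest[0] % NUM_ROUTED_EXPERTS
    # Offset in [1, 15] guarantees the second expert differs from the first.
    second = (first + 1 + digest[1] % (NUM_ROUTED_EXPERTS - 1)) % NUM_ROUTED_EXPERTS
    return [first, second]

# Routing is a pure function of the token id: same token, same experts,
# zero learned parameters, and no routing compute beyond one hash.
assert hash_route(42) == hash_route(42)
assert len(set(hash_route(42))) == 2
```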
Model

02 — Harold v0.7

The model.
In training.

Harold v0.7

A 3.2B-parameter hybrid Jamba-style diffusion language model: 40 layers in the pattern [Mamba2×3, Attention]×10, GQA 4:1, DeepSeek-style MoE, YaRN RoPE with seq_len=4096. Currently completing a 100k-iteration pretraining run on 8×B200 GPUs.

Pretraining active · FineWeb + SlimPajama · 8×B200 (Vast.ai) · Open weights
3.2B Total parameters
~800M Active per forward pass
40 Layers
4096 Context length
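The stated layer schedule can be reconstructed as a quick sanity check: repeating [Mamba2, Mamba2, Mamba2, Attention] ten times yields the 40-layer stack with the 3:1 SSM-to-attention ratio from section 01. This is a hypothetical reconstruction of the pattern string, not Harold's config file.

```python
# [Mamba2×3, Attention]×10 → 40 layers, 30 SSM + 10 attention.
pattern = (["mamba2"] * 3 + ["attention"]) * 10

assert len(pattern) == 40
assert pattern.count("mamba2") == 30
assert pattern.count("attention") == 10
```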
Benchmarks

03 — Throughput results

Harold v0.6 vs
pure Transformer baseline.

Measured on Harold v0.6 (1.5B params) against an equivalent dense Transformer. The Mamba2 advantage compounds beyond 4096 tokens — the crossover point predicted by theory.

| Seq Len | Harold tok/s | Transformer tok/s | Harold mem | Speedup |
| 256     | 1,250        | 1,374             | 14.55 GB   | 0.91×   |
| 512     | 2,450        | 2,726             | 14.57 GB   | 0.90×   |
| 1024    | 4,826        | 5,426             | 14.64 GB   | 0.89×   |
| 2048    | 9,171        | 9,924             | 14.81 GB   | 0.92×   |
| 4096    | 14,940       | 13,982            | 15.37 GB   | 1.07×   |

Harold v0.6 · 1.5B params · bfloat16 · single GPU · seq_len crossover at 4096 tokens

Mission

04 — Why Minya

"The IoT AI market has no dominant open-source foundation model. That is the gap Harold fills."

Every major small LLM today — Phi-3, Gemma, Qwen — is a dense autoregressive Transformer. They were optimized for benchmark scores, not for running on a Raspberry Pi, a Jetson, or an Android device.

Harold is built differently. The hybrid Mamba2+Attention backbone is subquadratic. Sparse MoE means only a fraction of parameters activate per token. Continuous diffusion enables parallel decoding. These aren't optimizations — they're architectural choices that compound on constrained hardware.

Minya releases Harold weights openly with a native runtime for on-device deployment — no cloud, no latency, no data leaving the device. Enterprise licensing is available for commercial integration in automotive, industrial, and healthcare products.

Built in Naples. Designed for the edge of everything.

Join the early access list

Harold v0.7 weights and on-device runtime launching Q3 2026.