Parvus in Quenya · Runs on device. No cloud required.
Open-source foundation models for constrained hardware
Existing small LLMs are dense Transformers shrunk down.
Harold is built differently — a hybrid SSM-Attention architecture
with sparse MoE and continuous diffusion, designed from first principles
to run locally on smartphones, microcontrollers and IoT devices.
01 — How it works
3 out of every 4 layers use Mamba2 State Space Models instead of attention — linear complexity instead of quadratic. The compute advantage compounds at longer sequences, exactly where IoT workloads live.
O(n) complexity

2 shared + 16 routed experts, top-2 selection per token. Harold has 3.2B total parameters but activates roughly 800M per forward pass — the inference cost of a much smaller model with the capacity of a larger one.
~25% params active

Harold uses Flow Matching instead of autoregressive next-token prediction. The entire sequence is refined in parallel from noise — enabling parallel decoding and native infill without tricks.
x0-prediction CFM

THOR-style deterministic hash routing replaces learned routing — eliminating router overhead entirely with no convergence penalty. Benchmarked at +5% throughput over learnable routing with identical val loss.
+5% throughput

02 — Harold v0.7
A 3.2B parameter hybrid Jamba diffusion language model. 40 layers in pattern [Mamba2×3, Attention]×10, GQA 4:1, DeepSeek-style MoE, YaRN RoPE seq_len=4096. Currently completing a 100k-iteration pretraining run on 8×B200 GPUs.
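The layer layout, routing scheme, and active-parameter split can be sanity-checked in a few lines. This is a minimal sketch: the counts come from the spec above, while the hash function and all helper names are illustrative, not Harold's actual implementation.

```python
# Sanity check of the spec above: 40 layers in [Mamba2 x3, Attention] x10,
# 2 shared + 16 routed experts with deterministic top-2 routing, ~25% of
# parameters active per token. Hash function and names are illustrative.

def layer_pattern(n_blocks: int = 10) -> list[str]:
    """Three Mamba2 layers followed by one attention layer, repeated."""
    return ["mamba2", "mamba2", "mamba2", "attention"] * n_blocks

layers = layer_pattern()
assert len(layers) == 40 and layers.count("mamba2") == 30  # 3 of every 4 layers

def route(token_id: int, n_routed: int = 16) -> list[int]:
    """THOR-style sketch: expert choice is a pure hash of the token,
    so there are no router weights to train or evaluate."""
    h = (token_id * 2654435761) % 2**32           # multiplicative hash (illustrative)
    first = h % n_routed
    second = (first + 1 + (h >> 8) % (n_routed - 1)) % n_routed
    return [first, second]                         # always two distinct routed experts

assert all(len(set(route(t))) == 2 for t in range(10_000))

# 3.2B total parameters, ~800M active per forward pass:
print(f"active fraction: {0.8e9 / 3.2e9:.0%}")     # prints "active fraction: 25%"
```

Because expert choice is a pure function of the token, the router adds no parameters and no learned-routing overhead, which is where the throughput saving described in section 01 comes from.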
03 — Throughput results
Measured on Harold v0.6 (1.5B params) vs an equivalent dense Transformer. The Mamba2 advantage compounds beyond 4096 tokens — the crossover point predicted by theory.
Harold v0.6 · 1.5B params · bfloat16 · single GPU · seq_len crossover at 4096 tokens
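The scaling behind that crossover is easy to illustrate: per layer, self-attention compute grows quadratically in sequence length while an SSM scan grows linearly. A back-of-the-envelope sketch follows; the constants are invented, chosen only so the curves cross at n = 4096, and none of them are Harold's real FLOP counts.

```python
# Back-of-the-envelope scaling: per layer, self-attention costs O(n^2 * d)
# in sequence length n, while an SSM scan costs O(n * d). The constants
# below are illustrative, picked so the two curves cross at n = 4096.

D = 2048                        # model width (illustrative)
SSM_COST_PER_TOKEN = 4096 * D   # flat per-token SSM cost (illustrative)

def attention_flops(n: int) -> int:
    return n * n * D            # quadratic in sequence length

def ssm_flops(n: int) -> int:
    return n * SSM_COST_PER_TOKEN  # linear in sequence length

for n in (1024, 4096, 16384):
    print(f"n={n:6d}  attention/ssm ratio: {attention_flops(n) / ssm_flops(n):.2f}")
# n=  1024  attention/ssm ratio: 0.25
# n=  4096  attention/ssm ratio: 1.00
# n= 16384  attention/ssm ratio: 4.00
```

Past the crossover the ratio keeps growing linearly in n, which is why the advantage compounds at longer sequences instead of staying a fixed constant.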
04 — Why Minya
Every major small LLM today — Phi-3, Gemma, Qwen — is a dense autoregressive Transformer. They were optimized for benchmark scores, not for running on a Raspberry Pi, a Jetson, or an Android device.
Harold is built differently. The hybrid Mamba2+Attention backbone is subquadratic. Sparse MoE means only a fraction of parameters activate per token. Continuous diffusion enables parallel decoding. These aren't optimizations — they're architectural choices that compound on constrained hardware.
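The parallel-decoding point is the sharpest contrast with autoregressive models: a diffusion decoder refines every position of the sequence at each of a small, fixed number of steps, instead of emitting one token at a time. A toy Euler sampler for an x0-prediction flow shows the control flow; all names and the stand-in "model" are illustrative, not Harold's actual sampler.

```python
import random

def sample(x0_predict, seq_len: int = 8, steps: int = 4) -> list[float]:
    """Euler sampler for an x0-prediction flow: every position in the
    sequence is updated in parallel at each of a fixed number of steps."""
    x = [random.gauss(0.0, 1.0) for _ in range(seq_len)]  # start from pure noise
    dt = 1.0 / steps
    t = 1.0
    for _ in range(steps):
        x0_hat = x0_predict(x, t)                             # denoised estimate
        v = [(xi - x0i) / t for xi, x0i in zip(x, x0_hat)]    # velocity, linear path
        x = [xi - dt * vi for xi, vi in zip(x, v)]            # one Euler step, all positions
        t -= dt
    return x

# Toy "model": always predicts the all-ones sequence as the clean signal.
result = sample(lambda x, t: [1.0] * len(x))
print([round(v, 3) for v in result])   # converges to all ones
```

Generation cost is a fixed number of refinement passes regardless of sequence length, and because every position is denoised jointly, infilling a masked middle span needs no special machinery.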
Minya releases Harold weights openly with a native runtime for on-device deployment — no cloud, no latency, no data leaving the device. Enterprise licensing for commercial integration in automotive, industrial, and healthcare products.
Built in Naples. Designed for the edge of everything.
Harold v0.7 weights and on-device runtime launching Q3 2026.