Standard attention is O(N²) — doubling the context quadruples the cost. The KV-cache grows without bound, making long-context deployment a systems nightmare.
Existing linear alternatives fix compute but destroy memory. Information gets overwritten with no mechanism to recover it. You trade one problem for another.
VLA frames linear attention as probabilistic inference on a hidden Markov model, then solves it in closed form.
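The contrast above can be sketched in a few lines of NumPy. This is an illustrative toy, not VLA's actual formulation: the feature map `phi` and all names are assumptions. Standard attention materialises an N×N score matrix, while linear attention carries only a fixed d×d state that each token overwrites in place.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: the N x N score matrix makes this O(N^2)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    """Linear attention: a fixed d x d state updated once per token, O(N).
    New outer products pile onto S with no way to undo them, which is the
    lossy overwriting the text describes."""
    d = Q.shape[-1]
    phi = lambda x: np.maximum(x, 0.0) + 1e-6  # illustrative positive feature map
    S = np.zeros((d, d))                       # running sum of k v^T outer products
    z = np.zeros(d)                            # running normaliser
    out = np.zeros_like(V)
    for t in range(Q.shape[0]):
        q, k, v = phi(Q[t]), phi(K[t]), V[t]
        S += np.outer(k, v)
        z += k
        out[t] = (q @ S) / (q @ z)
    return out
```

Both functions map (N, d) inputs to (N, d) outputs; only the second keeps memory constant as N grows.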
Verified on synthetic retrieval, associative recall, and symbolic reasoning tasks.
| Result | Metric | Detail |
|---|---|---|
| 91.8% | Learned λ accuracy | Retrieval with learned forgetting |
| 100% | Rank-2 recovery | Full recovery with rank-2 updates |
| +37% | Symbolic reasoning | 73.5% vs 53.5% baseline |
| O(N) | Runtime | Linear in sequence length |
| Model | Complexity | Memory | Long-range | Verdict |
|---|---|---|---|---|
| Standard Transformer | O(N²) | Quadratic KV-cache | Full | Doesn't scale |
| Linear Attention | O(N) | Fixed state | Lossy | Forgets everything |
| Mamba / SSMs | O(N) | Fixed state | Partial | No principled recovery |
| VLA (ours) | O(N) | Dynamic state | Preserved | Best of both worlds |
When your data doesn't fit in a context window, VLA just keeps reading.
Process entire 200-page contracts in a single pass. Extract clauses, detect conflicts, and summarise terms without chunking or sliding windows.
Read full patient histories in one pass, surfacing drug interactions and longitudinal patterns that chunked models miss entirely.
Analyse years of market data, earnings calls, and filings. Constant memory means you can stream live data without recomputing.
No KV-cache means no memory explosion. Deploy long-context models on mobile, IoT, and resource-constrained hardware.
VLA emerges from applying variational inference to the linear attention recurrence. Every component — the penalty matrix, the inverse tracker, the recovery step — is derived, not designed.
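The recurrence VLA builds on can be sketched generically. The scalar `lam` below stands in for the learned forgetting parameter λ; the penalty matrix, inverse tracker, and recovery step from the actual derivation are not reproduced in this document, so this shows only the generic decayed form and the constant-memory property.

```python
import numpy as np

def decayed_step(S, k, v, lam=0.99):
    """One linear-attention recurrence step with forgetting factor lam.
    lam is a stand-in for VLA's learned λ; the exact VLA update is not
    shown here, so this is only the generic decayed recurrence."""
    return lam * S + np.outer(k, v)

d = 16
S = np.zeros((d, d))
rng = np.random.default_rng(0)
for _ in range(10_000):  # an arbitrarily long token stream
    k, v = rng.standard_normal(d), rng.standard_normal(d)
    S = decayed_step(S, k, v)
# The state is still d x d: memory stays constant in sequence length.
```

Whatever the precise update, the fixed-size state is what makes O(N) runtime and streaming deployment possible.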
Variational Linear Attention
Full implementation + benchmarks
Triton-optimised forward pass
We're onboarding design partners for private beta. Tell us what you're building.