Research preview

Attention that scales to
infinite context

Variational Linear Attention replaces the KV-cache with a learned forgetting prior: O(N) runtime, constant memory, and a principled mechanism for recovering what other linear models forget.

O(N)

Complexity

Constant

Memory

91.8%

Retrieval accuracy

The problem

Transformers don't scale.
Linear attention forgets.

Standard attention is O(N²) — doubling the context quadruples the cost. The KV-cache grows without bound, making long-context deployment a systems nightmare.

Existing linear alternatives fix compute but destroy memory. Information gets overwritten with no mechanism to recover it. You trade one problem for another.
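The trade-off above can be made concrete with a minimal numpy sketch (illustrative only, not VLA itself): standard attention materialises an N × N score matrix, while plain linear attention compresses everything into a single d × d state whose size never grows with the sequence.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: the score matrix is N x N, so time and
    # memory grow quadratically with sequence length N.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                    # (N, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                         # (N, d)

def linear_attention(Q, K, V):
    # Plain linear attention: a single d x d state replaces the
    # KV-cache, so memory is constant in N -- but every write lands
    # on the same state, and old information can be overwritten.
    N, d = Q.shape
    S = np.zeros((d, d))
    out = np.empty_like(V)
    for t in range(N):
        S += np.outer(K[t], V[t])   # write token t into the state
        out[t] = S.T @ Q[t]         # read the state with query t
    return out

rng = np.random.default_rng(0)
N, d = 512, 16
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(softmax_attention(Q, K, V).shape)   # (512, 16)
print(linear_attention(Q, K, V).shape)    # (512, 16)
```

Both readouts have the same shape, but only the first pays quadratic cost, and only the second risks losing information to overwrites.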

The solution

Three innovations,
one architecture

VLA frames linear attention as probabilistic inference on a hidden Markov model, then solves it in closed form.

Step 01

Observe token

Step 02

Update state with penalty

Step 03

Recover memory
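The three steps above can be sketched in a few lines. The paper is still in preparation, so this is a hypothetical illustration: a scalar decay `lam` stands in for the learned penalty matrix, and a plain readout stands in for the derived recovery step with its inverse tracker.

```python
import numpy as np

def vla_style_update(S, k, v, lam=0.95):
    # Step 01 - observe token: form a rank-1 write from key k and value v.
    write = np.outer(k, v)
    # Step 02 - update state with penalty: the forgetting prior (here a
    # scalar lam; the paper's penalty matrix is richer) decays the old
    # state before the new write is added.
    return lam * S + write

def vla_style_read(S, q):
    # Step 03 - recover memory: read the state with the query. The
    # paper's recovery step is derived, not shown here; plain readout
    # is a stand-in.
    return S.T @ q

d = 8
rng = np.random.default_rng(0)
k, v, q = (rng.standard_normal(d) for _ in range(3))
S = vla_style_update(np.zeros((d, d)), k, v)
o = vla_style_read(S, q)   # a d-dimensional readout
```

The point of the sketch is the shape of the loop, not the exact operators: state size stays d × d regardless of how many tokens have been observed.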

Benchmarks

Numbers, not narratives

Verified on synthetic retrieval, associative recall, and symbolic reasoning tasks.

91.8%

Learned λ accuracy

Retrieval with learned forgetting

100%

Rank-2 recovery

Full recovery with rank-2 updates

+37%

Symbolic reasoning

73.5% vs 53.5% baseline

O(N)

Runtime

Linear in sequence length

Model                 | Complexity | Memory             | Long-range | Verdict
Standard Transformer  | O(N²)      | Quadratic KV-cache | Full       | Doesn't scale
Linear Attention      | O(N)       | Fixed state        | Lossy      | Forgets everything
Mamba / SSMs          | O(N)       | Fixed state        | Partial    | No principled recovery
VLA (ours)            | O(N)       | Dynamic state      | Preserved  | Best of both worlds
Use cases

Built for the longest contexts

When your data doesn't fit in a context window, VLA just keeps reading.

Legal & contract analysis

Process entire 200-page contracts in a single pass. Extract clauses, detect conflicts, and summarise terms without chunking or context-window limits.

Clinical notes & EMR

Read full patient histories in one pass, surfacing drug interactions and longitudinal patterns that chunked models miss entirely.

Financial analytics

Analyse years of market data, earnings calls, and filings. Constant memory means you can stream live data without recomputing.

Edge & on-device AI

No KV-cache means no memory explosion. Deploy long-context models on mobile, IoT, and resource-constrained hardware.

Research

Open science,
rigorous maths

VLA emerges from applying variational inference to the linear attention recurrence. Every component — the penalty matrix, the inverse tracker, the recovery step — is derived, not designed.

Paper

Variational Linear Attention

In preparation
Code

Full implementation + benchmarks

Open source
Kernels

Triton-optimised forward pass

Coming soon

Get early access

We're onboarding design partners for private beta. Tell us what you're building.

All requests reviewed manually. No spam.