Standard attention is O(N²) — doubling the context quadruples the cost. The KV-cache grows without bound, making long-context deployment a systems nightmare.
Existing linear alternatives fix compute but destroy memory. Information gets overwritten with no mechanism to recover it. You trade one problem for another.
VLA frames linear attention as probabilistic inference on a hidden Markov model, then solves it in closed form.
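The contrast above can be sketched in a few lines of NumPy. This is an illustrative toy, not VLA's actual formulation: the feature map `phi` and all names are assumptions. Standard attention materialises an N×N score matrix, while linear attention carries only a fixed d×d state that each token overwrites in place.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: the N x N score matrix makes this O(N^2)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    """Linear attention: a fixed d x d state updated once per token, O(N).
    New outer products pile onto S with no way to undo them, which is the
    lossy overwriting the text describes."""
    d = Q.shape[-1]
    phi = lambda x: np.maximum(x, 0.0) + 1e-6  # illustrative positive feature map
    S = np.zeros((d, d))                       # running sum of k v^T outer products
    z = np.zeros(d)                            # running normaliser
    out = np.zeros_like(V)
    for t in range(Q.shape[0]):
        q, k, v = phi(Q[t]), phi(K[t]), V[t]
        S += np.outer(k, v)
        z += k
        out[t] = (q @ S) / (q @ z)
    return out
```

Both functions map (N, d) inputs to (N, d) outputs; only the second keeps memory constant as N grows.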
Verified on synthetic retrieval, associative recall, and symbolic reasoning tasks.
| Result | Metric | Detail |
|---|---|---|
| 91.8% | Learned λ accuracy | Retrieval with learned forgetting |
| 100% | Rank-2 recovery | Full recovery with rank-2 updates |
| +37% | Symbolic reasoning | 73.5% vs 53.5% baseline |
| O(N) | Runtime | Linear in sequence length |
| Model | Complexity | Memory | Long-range | Verdict |
|---|---|---|---|---|
| Standard Transformer | O(N²) | Quadratic KV-cache | Full | Doesn't scale |
| Linear Attention | O(N) | Fixed state | Lossy | Forgets everything |
| Mamba / SSMs | O(N) | Fixed state | Partial | No principled recovery |
| VLA (ours) | O(N) | Dynamic state | Preserved | Best of both worlds |
When your data doesn't fit in a context window, VLA just keeps reading.
Process entire 200-page contracts in a single pass. Extract clauses, detect conflicts, and summarise terms without chunking or sliding windows.
Read full patient histories in one pass, surfacing drug interactions and longitudinal patterns that chunked models miss entirely.
Analyse years of market data, earnings calls, and filings. Constant memory means you can stream live data without recomputing.
No KV-cache means no memory explosion. Deploy long-context models on mobile, IoT, and resource-constrained hardware.
VLA emerges from applying variational inference to the linear attention recurrence. Every component — the penalty matrix, the inverse tracker, the recovery step — is derived, not designed.
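The recurrence VLA builds on can be sketched generically. The scalar `lam` below stands in for the learned forgetting parameter λ; the penalty matrix, inverse tracker, and recovery step from the actual derivation are not reproduced in this document, so this shows only the generic decayed form and the constant-memory property.

```python
import numpy as np

def decayed_step(S, k, v, lam=0.99):
    """One linear-attention recurrence step with forgetting factor lam.
    lam is a stand-in for VLA's learned λ; the exact VLA update is not
    shown here, so this is only the generic decayed recurrence."""
    return lam * S + np.outer(k, v)

d = 16
S = np.zeros((d, d))
rng = np.random.default_rng(0)
for _ in range(10_000):  # an arbitrarily long token stream
    k, v = rng.standard_normal(d), rng.standard_normal(d)
    S = decayed_step(S, k, v)
# The state is still d x d: memory stays constant in sequence length.
```

Whatever the precise update, the fixed-size state is what makes O(N) runtime and streaming deployment possible.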
Variational Linear Attention
Full implementation + benchmarks
Triton-optimised forward pass
We're onboarding design partners for private beta. Tell us what you're building.