§ 01 What it is

Strata is an embedded database for the AI era.

A branch-aware, MVCC, log-structured key-value store — KV, JSON, events, vectors, and graph as capabilities over one substrate.

§ 02 Market Position

Every class of database comes in two forms: one runs as a server, one is embedded in the application.

Era          Server                                        Embedded
Relational   Oracle · Db2 · SQL Server · Postgres          SQLite · H2
Analytics    Snowflake · Databricks · BigQuery · Redshift  DuckDB
Vector       Pinecone · Milvus                             LanceDB · Chroma
Agent state  Redis · Mem0 · Zep · Letta                    Strata

§ 03 Ideal Customer Profile

01

Agentic AI developers

Their agents accumulate state and need it to survive restarts. They also need to branch it and roll back a failed path.

02

Vibecoders

They build fast with AI coding tools and would rather not run a database. Strata is just a library in the app, with nothing to set up.

03

Embedded & edge developers

Their software runs on-device or at the edge, often offline. They need a real database in the same process, down to a Raspberry Pi.

§ 04 Capabilities

Fig 01

Branch & fork

An agent can fork the whole database and work on the branch without touching the parent. The branch is kept if the change works out, and dropped if it doesn't. Forks are cheap, so an agent can keep several going at once.

copy-on-write inheritance · fork is O(metadata), not O(data)

Fig 02

Time-travel & diff

Every version of every value is kept, so the database can be read as of any past point. That reproduces the state an agent saw at an earlier step, or the value a key held before a later write.

one versioned row chain · as-of reads from ordering (MVCC)

Fig 03

Multi-modal state

Key-value, JSON, events, vectors, and graph are all served from one store, and a single read sees them in one consistent snapshot.

five primitives · capabilities over one substrate

Fig 04

Embedded

Strata links into the application and runs in-process, with no separate server and no network round-trip. The same build runs on a Raspberry Pi and on server-class hardware.

self-contained Rust · flash- and disk-friendly

§ 05 Product Architecture

Strata is built as five crates. At the top, the executor is the command surface; at the bottom, core holds the shared identity and ordering types everything builds on. In between sit inference for model calls, the engine and its five primitives, and storage, which the next section opens up into nine layers.

five crates · executor → core

Executor · command boundary

The public surface. Serializable commands in, results out.

Inference · optional

Inference runtime and a provider API for local, Anthropic, OpenAI, and Google models. Called by the executor.

Engine · the primitives

Builds the five primitives over the storage boundary.

KV · JSON · Event · Vector · Graph

Storage · the substrate

Durable, branch-isolated, versioned persistence.

↓ opened up next, in Storage Architecture

Core · shared atoms

Identity and ordering: branch, commit version, timestamp.

§ 06 Storage Architecture

A branch-isolated, MVCC, log-structured key-value store, built in nine layers from backend I/O up to the public API. The storage layers handle durability and versioning; the engine adds the five primitives on top.

L6

Branch-isolated MVCC runtime

A branch is a copy-on-write fork of one versioned key space. Forking copies the branch's metadata and points it at the parent's existing data layers, so a new branch is cheap to create regardless of how much data it holds. Each key keeps a single chain of versions, ordered newest-first; latest, as-of, and history reads all walk that one chain.

Figure: branch main and a fork of it, showing the versions of key k. main holds one key k with three versions, newest first. A read returns the newest (main reads k = c).

fork = O(metadata)  ·  child shares parent layers by reference  ·  inherited visible = min(req, fork_version)
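The fork rule above can be sketched in a few lines. This is a minimal in-memory model, not Strata's implementation: the names `Store`, `Branch`, and `read_as_of` are hypothetical, and it assumes versions come from one global counter, so every write made on a child is newer than its fork version.

```rust
use std::collections::{BTreeMap, HashMap};

// Hypothetical identifiers for this sketch; not Strata's API.
type BranchId = u32;
type Version = u64;

#[derive(Default)]
struct Branch {
    /// Parent branch and the version at which this branch forked from it.
    parent: Option<(BranchId, Version)>,
    /// Rows written on this branch only: key -> (version -> value).
    own: BTreeMap<String, BTreeMap<Version, String>>,
}

#[derive(Default)]
struct Store {
    branches: HashMap<BranchId, Branch>,
}

impl Store {
    /// A store with a single root branch, id 0.
    fn with_main() -> Self {
        let mut store = Store::default();
        store.branches.insert(0, Branch::default());
        store
    }

    fn put(&mut self, branch: BranchId, key: &str, version: Version, value: &str) {
        let b = self.branches.get_mut(&branch).expect("unknown branch");
        b.own.entry(key.to_string()).or_default().insert(version, value.to_string());
    }

    /// Fork copies no rows: the child records only its parent and fork version.
    fn fork(&mut self, parent: BranchId, child: BranchId, fork_version: Version) {
        let branch = Branch { parent: Some((parent, fork_version)), own: BTreeMap::new() };
        self.branches.insert(child, branch);
    }

    /// Newest value of `key` visible on `branch` at or below version `req`.
    fn read_as_of(&self, branch: BranchId, key: &str, req: Version) -> Option<&str> {
        let b = self.branches.get(&branch)?;
        // Rows written on this branch win when one is visible at `req`.
        if let Some((_, value)) = b.own.get(key).and_then(|chain| chain.range(..=req).next_back()) {
            return Some(value.as_str());
        }
        // Otherwise read through to the parent, capped at the fork version:
        // inherited visible = min(req, fork_version).
        let (parent, fork_version) = b.parent?;
        self.read_as_of(parent, key, req.min(fork_version))
    }
}
```

In this model a write to main after the fork never shows up on the child, because the read-through is capped at the fork version, and a write on the child never touches main.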

L5

Table runtime — the LSM

Writes land in an in-memory table, then flush to immutable L0 tables and merge down the levels through compaction. The runtime keeps no retention policy of its own. The caller passes in which versions a compaction is allowed to drop, so versioning and garbage collection are decided by the layer above it.

Figure: mem → flush → L0 (overlapping) → compaction (policy: prune · tombstone · TTL) → L1 (sorted) → L2. An LSM batches writes in memory, then flushes sorted tables to disk.

compaction(prune · tombstone · TTL): caller-supplied  ·  L0 overlaps · L1+ sorted by range
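A toy version of this write path shows where the retention decision enters. It is a sketch under assumptions: the `Lsm` type, the `mem_limit` flush trigger, and the `may_drop` callback are hypothetical names, and a single sorted map stands in for every level below L0.

```rust
use std::collections::BTreeMap;

// (user key, version) -> value; `None` marks a tombstone.
type Table = BTreeMap<(String, u64), Option<String>>;

struct Lsm {
    mem: Table,       // mutable in-memory table
    l0: Vec<Table>,   // flushed tables, immutable, key ranges may overlap
    l1: Table,        // one sorted run standing in for L1 and below
    mem_limit: usize, // hypothetical flush trigger, in entries
}

impl Lsm {
    fn new(mem_limit: usize) -> Self {
        Lsm { mem: Table::new(), l0: Vec::new(), l1: Table::new(), mem_limit }
    }

    fn put(&mut self, key: &str, version: u64, value: Option<&str>) {
        self.mem.insert((key.to_string(), version), value.map(str::to_string));
        if self.mem.len() >= self.mem_limit {
            self.flush();
        }
    }

    /// Freeze the in-memory table as a new L0 table.
    fn flush(&mut self) {
        if !self.mem.is_empty() {
            self.l0.push(std::mem::take(&mut self.mem));
        }
    }

    /// Merge every L0 table into L1. The runtime has no retention policy of
    /// its own: `may_drop` is the caller's decision about each version.
    fn compact(&mut self, may_drop: impl Fn(&str, u64) -> bool) {
        for table in self.l0.drain(..) {
            self.l1.extend(table);
        }
        self.l1.retain(|(key, version), _| !may_drop(key.as_str(), *version));
    }
}
```

A caller that wants to prune everything older than version 2 would pass `|_key, version| version < 2`; a caller that needs full history would pass `|_, _| false`.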

L4

Durability spine

Visibility and durability are tracked as separate states. A write can become visible before its durability is confirmed, and if the process crashes during that window, recovery resolves each affected write to a specific outcome — durable, or visible with durability unconfirmed — and reports which is which.

write → fsync → rename → dir-fsync  ·  crash in the gap → VisibleDurabilityUnconfirmed / VisibilityUnknown
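On a local filesystem, the write → fsync → rename → dir-fsync sequence is commonly written as below. This is a sketch using only the Rust standard library, not Strata's code; the function name `publish` and the `.tmp` suffix are assumptions, and the final directory sync relies on Unix behaviour (opening a directory as a file does not work on Windows).

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Publish `bytes` as `dir/name` so that a crash leaves either the old
/// object or the new one under that name, never a partial write.
fn publish(dir: &Path, name: &str, bytes: &[u8]) -> io::Result<()> {
    let tmp = dir.join(format!("{}.tmp", name));
    let dst = dir.join(name);

    let mut file = File::create(&tmp)?;
    file.write_all(bytes)?; // write
    file.sync_all()?;       // fsync: contents are durable under the temp name
    fs::rename(&tmp, &dst)?; // rename: swap the name in one step
    // dir-fsync: make the renamed directory entry durable (Unix only).
    File::open(dir)?.sync_all()?;
    Ok(())
}
```

A crash between the rename and the directory sync is the kind of gap the paragraph above describes: the new name may already be visible to readers while its durability is not yet confirmed.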

L7

Commit runtime

A commit is a single batch, assigned one version and one timestamp. It is written to the WAL and made durable before it is applied and made visible, so a commit that reached the WAL before a crash is replayed during recovery. Every commit is recorded on a timeline, which is what as-of reads resolve against.

one (version, timestamp) per commit  ·  WAL durable before visible · crash after WAL → replays
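The ordering described here (log first, then apply, with a timeline for as-of reads) can be modelled in memory. The sketch below is illustrative only: `Db`, `Commit`, `version_as_of`, and `read_at` are hypothetical names, a `Vec` stands in for the durable WAL, and timestamps are assumed to be non-decreasing.

```rust
use std::collections::BTreeMap;

struct Commit {
    version: u64,
    timestamp: u64,
    writes: Vec<(String, String)>,
}

#[derive(Default)]
struct Db {
    wal: Vec<Commit>,                      // stand-in for the durable log
    rows: BTreeMap<(String, u64), String>, // visible state: (key, version) -> value
    timeline: BTreeMap<u64, u64>,          // commit timestamp -> commit version
}

/// Make one logged commit visible and record it on the timeline.
fn apply(rows: &mut BTreeMap<(String, u64), String>, timeline: &mut BTreeMap<u64, u64>, c: &Commit) {
    for (key, value) in &c.writes {
        rows.insert((key.clone(), c.version), value.clone());
    }
    timeline.insert(c.timestamp, c.version);
}

impl Db {
    /// One batch gets one version and one timestamp; it is logged before it is applied.
    fn commit(&mut self, timestamp: u64, writes: Vec<(String, String)>) -> u64 {
        let version = self.wal.len() as u64 + 1;
        self.wal.push(Commit { version, timestamp, writes });
        let logged = self.wal.last().expect("just pushed");
        apply(&mut self.rows, &mut self.timeline, logged);
        version
    }

    /// Recovery replays every commit that reached the log.
    fn recover(wal: Vec<Commit>) -> Db {
        let mut db = Db::default();
        for c in &wal {
            apply(&mut db.rows, &mut db.timeline, c);
        }
        db.wal = wal;
        db
    }

    /// Resolve a wall-clock timestamp to the newest commit at or before it.
    fn version_as_of(&self, timestamp: u64) -> Option<u64> {
        self.timeline.range(..=timestamp).next_back().map(|(_, v)| *v)
    }

    /// Newest value of `key` at or below `version`.
    fn read_at(&self, key: &str, version: u64) -> Option<&str> {
        let lo = (key.to_string(), 0);
        let hi = (key.to_string(), version);
        self.rows.range(lo..=hi).next_back().map(|(_, v)| v.as_str())
    }
}
```

An as-of read is then two lookups: timestamp to version on the timeline, then version to value on the key's chain.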

L8

Lifecycle & back-pressure

The engine moves through a typed lifecycle and reports a recovery-health status, so an open that recovered with possible data loss comes back marked degraded. Under heavy write load, flushing takes priority over compaction, and writers are slowed before memory is exhausted to keep the engine within its memory budget.

RecoveryHealth::Degraded { DataLoss }  ·  flush preempts compaction · writers stall under back-pressure

L9

Storage API boundary

This is the only boundary the engine uses: open, commit, read, getv, as_of, scan, history, and fork. It keeps the engine out of backend I/O, byte formats, the WAL, manifests, and compaction. Storage provides these operations over keys and bytes, and the engine builds the five primitives on top of them.

StorageRuntime: open · commit · read · getv · as_of · scan · history · fork  ·  engine builds the primitives on top

L3

Durable format

Owns the durable byte formats: record headers, frames, checksums, and the row key that every layer above sorts on. A key is branch · space · storage_space_id · user_key · commit, with the commit id stored bitwise-inverted so the newest version of a key sorts first. These formats are defined in this one layer, shared by the WAL, table, and recovery code.

record = len · crc32 · row  ·  InternalKey = branch | space | storage_space_id | user_key | !commit_id  ·  !commit_id = ~version
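The effect of storing the commit id bitwise-inverted is easy to check with a byte-wise sort. The sketch below follows the field order given above, but the field widths and the big-endian encoding are assumptions, since the source names the fields without giving their sizes. It only compares versions of one user key; ordering across variable-length user keys would need a length-safe encoding that is not shown here.

```rust
/// Field widths are assumptions for this sketch.
struct InternalKey<'a> {
    branch: u32,
    space: u16,
    storage_space_id: u32,
    user_key: &'a [u8],
    commit_id: u64,
}

impl InternalKey<'_> {
    /// branch | space | storage_space_id | user_key | !commit_id, all big-endian
    /// so that byte-wise comparison matches numeric comparison per field.
    fn encode(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(4 + 2 + 4 + self.user_key.len() + 8);
        out.extend_from_slice(&self.branch.to_be_bytes());
        out.extend_from_slice(&self.space.to_be_bytes());
        out.extend_from_slice(&self.storage_space_id.to_be_bytes());
        out.extend_from_slice(self.user_key);
        // Inverting the commit id makes larger (newer) ids sort first.
        out.extend_from_slice(&(!self.commit_id).to_be_bytes());
        out
    }
}

/// Recover the commit id from the last eight bytes of an encoded key.
fn commit_id_of(encoded: &[u8]) -> u64 {
    let mut tail = [0u8; 8];
    tail.copy_from_slice(&encoded[encoded.len() - 8..]);
    !u64::from_be_bytes(tail)
}
```

Sorting the encoded keys for commits 1, 2, and 3 of the same user key puts commit 3 first, so a forward scan that stops at the first match returns the newest version.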

L2

Object layout

One canonical object namespace under a database root — manifest, wal, tables, snapshots, quarantine, locks — built from typed constructors so no layer assembles its own paths. The layer defines the names; the bytes and policy live above it. The database manifest sits at a fixed path, manifest/current, and an object can be deleted only once it is proven unreachable from there.

families manifest · wal · tables · snapshots · quarantine · locks  ·  reachability proven from manifest/current

L1

Backend IO

The portability layer. Everything above it works in terms of named objects under a database root, through a small set of operations — read, read-range, write, append — rather than files, descriptors, or fsync calls. The same contract backs a durable local-filesystem implementation and a browser/cache one, and leaves room for an object-storage backend later.

read · range · write · append · publish_object  ·  object-first, never POSIX paths
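One way to picture this contract is a trait over named objects with an in-memory implementation behind it. The trait name, method signatures, and object names below are assumptions made for illustration; only the four operations (read, read-range, write, append) come from the text, and `publish_object` is left out because the source does not describe it.

```rust
use std::collections::HashMap;
use std::io;

/// Hypothetical shape of the object-level contract; signatures are assumed.
trait Backend {
    fn read(&self, name: &str) -> io::Result<Vec<u8>>;
    fn read_range(&self, name: &str, offset: usize, len: usize) -> io::Result<Vec<u8>>;
    fn write(&mut self, name: &str, bytes: &[u8]) -> io::Result<()>;
    fn append(&mut self, name: &str, bytes: &[u8]) -> io::Result<()>;
}

/// In-memory implementation: objects are byte vectors keyed by name.
#[derive(Default)]
struct MemBackend {
    objects: HashMap<String, Vec<u8>>,
}

fn not_found(name: &str) -> io::Error {
    io::Error::new(io::ErrorKind::NotFound, format!("no object named {}", name))
}

impl Backend for MemBackend {
    fn read(&self, name: &str) -> io::Result<Vec<u8>> {
        self.objects.get(name).cloned().ok_or_else(|| not_found(name))
    }

    fn read_range(&self, name: &str, offset: usize, len: usize) -> io::Result<Vec<u8>> {
        let object = self.objects.get(name).ok_or_else(|| not_found(name))?;
        let end = offset
            .checked_add(len)
            .filter(|&end| end <= object.len())
            .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "range past end of object"))?;
        Ok(object[offset..end].to_vec())
    }

    fn write(&mut self, name: &str, bytes: &[u8]) -> io::Result<()> {
        self.objects.insert(name.to_string(), bytes.to_vec());
        Ok(())
    }

    fn append(&mut self, name: &str, bytes: &[u8]) -> io::Result<()> {
        self.objects.entry(name.to_string()).or_default().extend_from_slice(bytes);
        Ok(())
    }
}
```

Code written against the trait never sees a path or a file descriptor, which is what lets a local-filesystem backend and a browser or cache backend sit behind the same calls.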

§ 07 Vector Architecture

The vector index is a per-slab derived artifact on the LSM. The index type depends on the level: brute-force in the memtable, flat in L0, HNSW in the compacted levels. Exact search over the full vectors is authoritative: the index only generates candidates, and an exact rerank produces the final ordering. A missing or corrupt index is rebuilt from the slab and cannot change a result.

vector index · per slab · escalates by level

memtable newest writes brute-force · exact

↓ flush · build the cheap index

L0 fresh slabs flat

↓ compaction · rebuild over live vectors · clean graph

L1 settled per-slab HNSW
L2 bulk per-slab HNSW

Below a few thousand live vectors, the graph is skipped in favor of brute force. Recall improves as slabs merge into larger graphs.

MVCC visibility and the branch fork cap are enforced in the merge, so the index never returns a row the read shouldn't see.
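The split between candidate generation and exact rerank can be sketched as follows. This is an illustration, not Strata's merge: `Row`, `rerank`, and the `visible_up_to` cutoff are hypothetical, and the candidate list stands in for whatever a flat or HNSW index would return. Passing every row index as a candidate gives the brute-force case.

```rust
struct Row {
    id: u64,
    version: u64,
    vector: Vec<f32>,
}

/// Exact cosine similarity over the full vectors; zero vectors score 0.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Score index candidates exactly, drop rows the read may not see, keep the top k.
fn rerank(query: &[f32], rows: &[Row], candidates: &[usize], visible_up_to: u64, k: usize) -> Vec<(u64, f32)> {
    let mut scored: Vec<(u64, f32)> = candidates
        .iter()
        .filter_map(|&i| rows.get(i))               // ignore stale candidate slots
        .filter(|row| row.version <= visible_up_to) // visibility enforced in the merge
        .map(|row| (row.id, cosine(query, &row.vector)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));     // highest similarity first
    scored.truncate(k);
    scored
}
```

Because the final ordering comes from the exact scores, a weaker candidate list can only omit rows; it cannot reorder the ones it returns or surface a row the snapshot should not see.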

Deletes and compaction

An HNSW graph degrades as vectors are deleted or updated and must be rebuilt periodically. Compaction already rebuilds each slab over its live rows, so the graph is reconstructed without accumulated tombstones, at a cost that stays small at this scale.

Storage boundary

Storage manages the slab lifecycle and persists an opaque per-slab artifact without interpreting it; the engine builds the index from the slab's rows and returns the bytes to be committed with it. The index is derived state, rebuilt from the slab if lost.

Index choice by scale

At the embedded scale of thousands to tens of millions of vectors per collection, the data fits in memory and per-slab graphs are inexpensive to build, so HNSW is used by default. A centroid (IVF) index is available for larger deployments, where rebuild cost dominates.

§ 08 Testing Methodology

SQLite is the gold standard for testing an embedded database, and this harness is modeled on it: one bug class at a time, each with the technique that exposes it, every test checked against one recovery oracle.

Recovery correctness

After any crash, what the database recovers must be a prefix of what it acknowledged: it may lose the most recent commits, but it must never leave a gap earlier in the history. A shadow model records every acknowledged commit, and one oracle checks the recovered state against that prefix after every fault the classes below inject.

the oracle · every test verifies through it
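The prefix rule reduces to a short comparison. The sketch below assumes commits are identified by their versions and that the shadow model records them in acknowledgement order; `check_prefix` is a hypothetical name, not the harness's API.

```rust
/// Compare recovered commits against the acknowledged history.
/// Ok(n): recovered is a prefix, and the newest n acknowledged commits were lost.
/// Err(i): position i in the recovered history is a gap, a reorder, or a
/// commit the shadow model never acknowledged.
fn check_prefix(acknowledged: &[u64], recovered: &[u64]) -> Result<usize, usize> {
    for (i, commit) in recovered.iter().enumerate() {
        if acknowledged.get(i) != Some(commit) {
            return Err(i);
        }
    }
    Ok(acknowledged.len() - recovered.len())
}
```

With acknowledged commits `[1, 2, 3, 4]`, a recovered history of `[1, 2, 3]` passes with one commit lost, while `[1, 2, 4]` fails at index 2 because commit 3 is missing from the middle.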

Error paths

Fail the Nth backend operation, check that the engine recovers, then increment N until a full pass injects no new failure. Covers every reachable I/O call, plus disk-full and memory exhaustion.

fail-once · fail-continuously
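The fail-once sweep can be sketched against a toy backend. Everything here is hypothetical scaffolding (`FaultyBackend`, `workload`, `sweep`): the real harness runs the engine and its recovery path, whereas this model only checks that the surviving log matches what the workload was told had succeeded. The fail-continuously variant is not shown.

```rust
/// Toy backend that fails exactly one operation: the `fail_at`-th (1-based).
struct FaultyBackend {
    ops: usize,
    fail_at: usize,
    injected: bool,
    log: Vec<u8>,
}

impl FaultyBackend {
    fn append(&mut self, byte: u8) -> Result<(), &'static str> {
        self.ops += 1;
        if self.ops == self.fail_at {
            self.injected = true;
            return Err("injected I/O failure");
        }
        self.log.push(byte);
        Ok(())
    }
}

/// Workload under test: append each item, stop at the first error,
/// and return what was acknowledged.
fn workload(backend: &mut FaultyBackend, items: &[u8]) -> Vec<u8> {
    let mut acknowledged = Vec::new();
    for &item in items {
        if backend.append(item).is_err() {
            break;
        }
        acknowledged.push(item);
    }
    acknowledged
}

/// Fail op 1, then op 2, and so on, until a full pass injects no failure.
/// Returns how many distinct fault points were exercised.
fn sweep(items: &[u8]) -> usize {
    let mut n = 1;
    loop {
        let mut backend = FaultyBackend { ops: 0, fail_at: n, injected: false, log: Vec::new() };
        let acknowledged = workload(&mut backend, items);
        assert_eq!(backend.log, acknowledged, "state diverged when failing op {}", n);
        if !backend.injected {
            return n - 1;
        }
        n += 1;
    }
}
```

The loop terminates on its own: once N exceeds the number of backend calls the workload makes, no fault fires and the sweep has covered every reachable call.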

Hostile disks

Run on a disk model that injects torn writes, reordered appends, non-atomic renames, and garbage in an unsynced tail, then verify recovery under each.

four filesystem models

Rare interleavings

A seeded scheduler randomises task order, the clock, and fault timing over the real lifecycle. Any failure prints its seed and replays bit-for-bit.

deterministic simulation
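The replay property rests on every random choice flowing from one seed. A minimal sketch, assuming a hand-rolled SplitMix64 generator so the example has no dependencies; the `schedule` function and the task names are hypothetical, and a real simulator would also drive the clock and fault timing from the same generator.

```rust
/// SplitMix64: a small deterministic generator, used here so the sketch
/// needs no external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Pick the next runnable task at random until none remain.
/// The same seed always yields the same order.
fn schedule(seed: u64, tasks: &[&str]) -> Vec<String> {
    let mut rng = SplitMix64(seed);
    let mut pending: Vec<&str> = tasks.to_vec();
    let mut order = Vec::with_capacity(pending.len());
    while !pending.is_empty() {
        let pick = (rng.next_u64() % pending.len() as u64) as usize;
        order.push(pending.swap_remove(pick).to_string());
    }
    order
}
```

A failing run then only needs to report its seed: calling `schedule` again with that seed reproduces the same interleaving.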

Compound failures

Inject a second fault while recovery, compaction, or a checkpoint is still running. The engine must reach a valid state or return a typed, resumable error.

fail · resume · succeed

Config divergence

Run the same workload under every mode, schedule, and budget; the results must be identical, and progress must stay bounded.

model-parity oracle

Process drift

Merge-blocking CI: Miri and sanitizers, coverage and mutation thresholds that only ratchet up, nightly fuzzing, and a guard that fails if the test map goes stale.

miri · mutants · charter guard

§ 09 Benchmarks

Strata is a key-value engine with a vector index, so it runs the standard suite for each: YCSB against RocksDB and LMDB, and ANN-Benchmarks at one, ten, and a hundred million vectors.

illustrative · a benchmark run is in progress; these figures are placeholders, to be replaced with measured data.

Key-value engine

YCSB

Workload               Strata  RocksDB  LMDB
A · update-heavy       412 K   430 K    120 K
B · read-heavy         760 K   800 K    1.42 M
C · read-only          905 K   950 K    1.95 M
D · read-latest        815 K   845 K    1.30 M
E · short scan         290 K   320 K    480 K
F · read-modify-write  355 K   370 K    105 K

throughput · ops / sec · higher is better
8 vCPU · NVMe SSD · 100 M records · 1 KB · Zipfian

Vector search

ANN-Benchmarks · as it scales

Metric          1 M     10 M    100 M
recall@10       0.95    0.95    0.94
throughput q/s  8,600   3,200   1,100
p99 latency     1.4 ms  3.8 ms  9.5 ms
build time      4 m     45 m    8 h
index size      1.4 GB  14 GB   140 GB

ANN-Benchmarks protocol · 768-dim · cosine · top-10 · recall@10 ≈ 0.95
one index, measured at 1 M · 10 M · 100 M vectors