STORAGE & IDENTITY

How Albex remembers things.

Three primitives keep the engine honest across sessions and across repeated uploads: OPFS for snapshots, IndexedDB as the fallback, and a 64-bit content hash for deduplication. None of them are configurable surface area — they're internal contracts you should understand when reasoning about what the engine actually does.

PERSISTENCE BACKEND

OPFS & IndexedDB — why both

When you call engine.save("my-corpus"), the engine serialises the entire index (chunks, document table, text pool, Bloom filters) to a single binary blob and writes it to browser storage. engine.load("my-corpus") reads it back and memcpy's it into the BSS arrays. A 16 MB corpus restores in tens of milliseconds — orders of magnitude faster than re-parsing 50 DOCXs.

API | Available since | Write 16 MB | Why we picked it
OPFS | Chrome 102 · Safari 15.2 · Firefox 111 | ~20 ms | Zero-copy FileSystemWritableFileStream.write(Uint8Array)
IndexedDB | All browsers since 2015 | ~80 ms | Structured-clone of the Uint8Array

The router lives in src/persistence.ts and detects navigator.storage.getDirectory at runtime. If present → OPFS path. Otherwise → IndexedDB. Both paths satisfy the same contract (savePersisted / loadPersisted / deletePersisted / listPersisted), so calling code is identical.
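
The routing decision described above can be sketched roughly as follows. This is not the contents of src/persistence.ts: the four function names come from the text, but their signatures, the NavigatorLike type, and the selectBackend helper are assumptions made for illustration.

```typescript
// The four-function contract named in the text; the exact signatures are assumptions.
interface PersistenceBackend {
  savePersisted(name: string, bytes: Uint8Array): Promise<void>;
  loadPersisted(name: string): Promise<Uint8Array | null>;
  deletePersisted(name: string): Promise<void>;
  listPersisted(): Promise<string[]>;
}

// Hypothetical structural subset of `navigator`, so the check can also run outside a browser.
interface NavigatorLike {
  storage?: { getDirectory?: unknown };
}

type BackendKind = "opfs" | "indexeddb";

/** Runtime feature detection: OPFS if navigator.storage.getDirectory exists, else IndexedDB. */
function detectBackend(nav: NavigatorLike | undefined): BackendKind {
  return typeof nav?.storage?.getDirectory === "function" ? "opfs" : "indexeddb";
}

/** Hypothetical helper: build whichever backend was detected; callers only see the contract. */
function selectBackend(
  nav: NavigatorLike | undefined,
  backends: Record<BackendKind, () => PersistenceBackend>
): PersistenceBackend {
  return backends[detectBackend(nav)]();
}
```

In a browser the first argument would simply be the global navigator. Because both factories return the same interface, code that calls savePersisted or loadPersisted never needs to know which store is underneath.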

Two distinct uses of OPFS

The persistence backend serialises the engine state: chunks, Bloom filters, doc table. TieredStore uses OPFS differently — it stores the original file blobs, so an evicted document can be promoted back without asking the user to re-pick the file. Two stores, two purposes, same OPFS API.

CONTENT IDENTITY

FNV-1a 64-bit — why this hash

Every call to engine.indexFile(file) hashes the raw bytes before any parsing. If a document with that hash already lives in the index, the engine returns the existing entry and skips the work. This makes indexFile idempotent and safe to call repeatedly — drag the same DOCX twice, the index stays clean.
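
A minimal sketch of the hash-then-dedup step, assuming the standard FNV-1a 64-bit constants. The DedupIndex class and its indexBytes method are hypothetical stand-ins for the engine, and the BigInt arithmetic is chosen for readability rather than speed; the engine's own implementation may look quite different.

```typescript
// Standard FNV-1a 64-bit parameters.
const FNV_OFFSET_BASIS = BigInt("0xcbf29ce484222325");
const FNV_PRIME = BigInt("0x100000001b3");

/** FNV-1a, 64-bit: XOR each byte into the state, then multiply by the prime modulo 2^64. */
function fnv1a64Hex(bytes: Uint8Array): string {
  let hash = FNV_OFFSET_BASIS;
  for (let i = 0; i < bytes.length; i++) {
    hash ^= BigInt(bytes[i]);
    hash = BigInt.asUintN(64, hash * FNV_PRIME);
  }
  // Always 16 hex characters, left-padded with zeros.
  return hash.toString(16).padStart(16, "0");
}

// Hypothetical document record; only contentHash is named in the text.
interface IndexedDocument {
  contentHash: string;
  name: string;
}

/** Hypothetical stand-in for the engine's document table, keyed by content hash. */
class DedupIndex {
  private readonly docs = new Map<string, IndexedDocument>();

  /** Hash the raw bytes first; return the existing entry if these bytes were seen before. */
  indexBytes(name: string, bytes: Uint8Array): IndexedDocument {
    const contentHash = fnv1a64Hex(bytes);
    const existing = this.docs.get(contentHash);
    if (existing) return existing; // duplicate upload: skip parsing entirely
    const doc: IndexedDocument = { contentHash, name };
    this.docs.set(contentHash, doc);
    return doc;
  }

  get size(): number {
    return this.docs.size;
  }
}
```

Indexing the same bytes twice returns the first entry both times and leaves size at 1, which is the idempotence the paragraph above describes.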

Hash | Throughput in JS | Output | Trade-off
FNV-1a 64-bit | ~100 MB/s | 16 hex chars | Non-cryptographic; ~10⁻¹⁵ collision probability at 128 docs.
SHA-256 | ~30 MB/s | 64 hex chars | Cryptographic; 3× slower, no value added for our use case.
MurmurHash3 | ~150 MB/s | Variable | Slightly faster; less portable across language ecosystems.
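
The collision figure quoted for FNV-1a can be sanity-checked with the usual birthday approximation, assuming the hash behaves like a uniform random 64-bit function (an idealisation that no real hash fully meets). The function name below is hypothetical.

```typescript
/**
 * Birthday approximation: with n documents there are n(n-1)/2 pairs,
 * and each pair collides with probability 2^-bits under a uniform hash.
 * Valid while the result is much smaller than 1.
 */
function birthdayCollisionProbability(docs: number, hashBits: number): number {
  const pairs = (docs * (docs - 1)) / 2;
  return pairs / Math.pow(2, hashBits);
}

// 128 docs, 64-bit hash: 8128 pairs / 2^64, roughly 4.4e-16
const p = birthdayCollisionProbability(128, 64);
```

That result sits within an order of magnitude of the ~10⁻¹⁵ quoted for FNV-1a at 128 documents.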

Why non-cryptographic is fine here

We don't need adversarial collision resistance. The threat model for a content hash in a deduplication store is "the user accidentally dragged the same file twice", not "an attacker is crafting two distinct files that hash the same to confuse the index". If an attacker could exploit a deliberate collision, the worst case would be that their second file does not get indexed — there is no path to elevation of privilege, data leak or denial of service.

SHA-256 would cut throughput to roughly a third for zero added safety in this scenario, so we chose throughput. Non-cryptographic hashes are a common choice for deduplication and hash tables whenever the inputs are not adversarial. Git is the instructive contrast: it identifies objects with a cryptographic hash (SHA-1, with a SHA-256 transition under way) precisely because its case is adversarial.

Where the hash surfaces

Two places. First, internally during indexFile, as the dedup check. Second, on the contentHash field of the returned IndexedDocument. The host can pass that hash to engine.removeDocument(hash) as a stable identifier that survives a rename: if your user uploads same-contract-renamed.pdf and you want to remove the previously uploaded contract.pdf that contains the same bytes, the hash matches even though the names don't.
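
One way a host might use the hash as its removal key is sketched below. Only indexFile, removeDocument and contentHash come from the text; the EngineLike and FileLike shapes, the synchronous signatures, and the UploadLedger class are assumptions made so the example stays self-contained, and the real calls may well return promises.

```typescript
// Assumed shapes. Only contentHash, indexFile and removeDocument are named in the text.
interface IndexedDocument {
  contentHash: string;
}

interface FileLike {
  name: string;
}

interface EngineLike {
  indexFile(file: FileLike): IndexedDocument; // shown synchronously for brevity
  removeDocument(contentHash: string): void;
}

/** Hypothetical host-side bookkeeping: which display names resolved to which bytes. */
class UploadLedger {
  private readonly namesByHash = new Map<string, Set<string>>();
  private readonly engine: EngineLike;

  constructor(engine: EngineLike) {
    this.engine = engine;
  }

  /** Index a file and record its name under the content hash the engine reports. */
  upload(file: FileLike): string {
    const { contentHash } = this.engine.indexFile(file);
    let names = this.namesByHash.get(contentHash);
    if (!names) {
      names = new Set<string>();
      this.namesByHash.set(contentHash, names);
    }
    names.add(file.name);
    return contentHash;
  }

  /** Remove by content, not by name; returns every name that pointed at those bytes. */
  remove(contentHash: string): string[] {
    this.engine.removeDocument(contentHash);
    const names = this.namesByHash.get(contentHash);
    this.namesByHash.delete(contentHash);
    return names ? Array.from(names) : [];
  }
}
```

Keying the ledger on the hash means contract.pdf and same-contract-renamed.pdf collapse into a single entry, so one remove call covers both names.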
