Three primitives keep the engine honest across sessions and across repeated uploads: OPFS for snapshots, IndexedDB as the fallback, and a 64-bit content hash for deduplication. None of them are configurable surface area — they're internal contracts you should understand when reasoning about what the engine actually does.
When you call engine.save("my-corpus"), the engine serialises the entire index (chunks, document table, text pool, Bloom filters) to a single binary blob and writes it to browser storage. engine.load("my-corpus") reads it back and memcpy's it into the BSS arrays. A 16 MB corpus restores in tens of milliseconds — orders of magnitude faster than re-parsing 50 DOCXs.
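To make the "single binary blob" idea concrete, here is a minimal sketch of packing several byte sections into one buffer and restoring them with bulk copies. The layout (a section count followed by length-prefixed sections) and the names `packSections` / `unpackSections` are assumptions for illustration; the engine's actual on-disk format is not described here.

```typescript
// Hypothetical layout: [count: u32] then, per section, [length: u32][bytes].
// This is an illustrative format, not the engine's real one.
function packSections(sections: Uint8Array[]): Uint8Array {
  const total = 4 + sections.reduce((n, s) => n + 4 + s.byteLength, 0);
  const out = new Uint8Array(total);
  const view = new DataView(out.buffer);
  view.setUint32(0, sections.length, true);
  let offset = 4;
  for (const s of sections) {
    view.setUint32(offset, s.byteLength, true);
    offset += 4;
    out.set(s, offset); // bulk copy, the typed-array analogue of memcpy
    offset += s.byteLength;
  }
  return out;
}

function unpackSections(blob: Uint8Array): Uint8Array[] {
  const view = new DataView(blob.buffer, blob.byteOffset, blob.byteLength);
  const count = view.getUint32(0, true);
  const sections: Uint8Array[] = [];
  let offset = 4;
  for (let i = 0; i < count; i++) {
    const len = view.getUint32(offset, true);
    offset += 4;
    sections.push(blob.slice(offset, offset + len)); // copy the section out
    offset += len;
  }
  return sections;
}
```

Restoring involves no parsing beyond reading the length prefixes, which is why loading a snapshot is so much cheaper than re-reading the source documents.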
The router lives in src/persistence.ts and detects navigator.storage.getDirectory at runtime. If it is present, the engine takes the OPFS path and writes through FileSystemWritableFileStream.write(Uint8Array); otherwise it falls back to IndexedDB. Both paths satisfy the same contract (savePersisted / loadPersisted / deletePersisted / listPersisted), so calling code is identical.
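The runtime check and the shared contract could look roughly like the sketch below. Only the four method names and the `navigator.storage.getDirectory` probe come from the text. The signatures, the `PersistenceBackend` and `NavigatorLike` types, and the `detectBackend` helper are assumptions, not the contents of src/persistence.ts.

```typescript
// Method names are from the text; parameter and return types are assumed.
interface PersistenceBackend {
  savePersisted(name: string, blob: Uint8Array): Promise<void>;
  loadPersisted(name: string): Promise<Uint8Array | null>;
  deletePersisted(name: string): Promise<void>;
  listPersisted(): Promise<string[]>;
}

type BackendKind = "opfs" | "indexeddb";

// Minimal shape of the parts of `navigator` that the probe touches.
interface NavigatorLike {
  storage?: { getDirectory?: unknown };
}

// Feature-detect OPFS; anything else falls back to IndexedDB.
function detectBackend(nav: NavigatorLike | undefined): BackendKind {
  return typeof nav?.storage?.getDirectory === "function" ? "opfs" : "indexeddb";
}
```

In a browser the helper would be called with the global `navigator`. Taking it as a parameter keeps the check testable outside a browser.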
The persistence backend serialises the engine state: chunks, Bloom filters, doc table. TieredStore uses OPFS differently — it stores the original file blobs, so an evicted document can be promoted back without asking the user to re-pick the file. Two stores, two purposes, same OPFS API.
Every call to engine.indexFile(file) hashes the raw bytes before any parsing. If a document with that hash already lives in the index, the engine returns the existing entry and skips the work. This makes indexFile idempotent and safe to call repeatedly — drag the same DOCX twice, the index stays clean.
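The hash-then-skip behaviour can be sketched as follows. The engine's actual 64-bit hash function is not named here, so this example substitutes FNV-1a (64-bit) as a stand-in. `DedupIndex`, `indexBytes`, `parseCount`, and the hex-string form of `contentHash` are likewise hypothetical.

```typescript
// Stand-in 64-bit hash: FNV-1a. The engine's real hash is not specified in the text.
const FNV_OFFSET = 0xcbf29ce484222325n;
const FNV_PRIME = 0x100000001b3n;
const MASK_64 = 0xffffffffffffffffn;

function hash64(bytes: Uint8Array): bigint {
  let h = FNV_OFFSET;
  for (let i = 0; i < bytes.length; i++) {
    h ^= BigInt(bytes[i]);
    h = (h * FNV_PRIME) & MASK_64; // keep the state within 64 bits
  }
  return h;
}

interface DocEntry {
  name: string;
  contentHash: string; // assumed representation: 16 hex characters
}

class DedupIndex {
  private docs = new Map<string, DocEntry>();
  parseCount = 0; // counts how often the expensive path actually ran

  indexBytes(name: string, bytes: Uint8Array): DocEntry {
    // Hash the raw bytes before any parsing.
    const contentHash = hash64(bytes).toString(16).padStart(16, "0");
    const existing = this.docs.get(contentHash);
    if (existing) return existing; // same bytes already indexed: skip the work
    this.parseCount++; // placeholder for parsing and chunking
    const entry: DocEntry = { name, contentHash };
    this.docs.set(contentHash, entry);
    return entry;
  }

  get size(): number {
    return this.docs.size;
  }
}
```

Indexing the same bytes under two different names yields one entry and one parse, which is the idempotence property described above.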
We don't need adversarial collision resistance. The threat model for a content hash in a deduplication store is "the user accidentally dragged the same file twice", not "an attacker is crafting two distinct files that hash the same to confuse the index". If an attacker could exploit a deliberate collision, the worst case would be that their second file does not get indexed — there is no path to elevation of privilege, data leak or denial of service.
SHA-256 would run at roughly a third of the throughput for zero added safety in this scenario. We chose throughput. Identifying content by its hash is a well-worn pattern: Git identifies objects by content hash (SHA-1, now transitioning to SHA-256, because Git's case is adversarial), and many storage systems deduplicate the same way.
The hash surfaces in two places. First, internally during indexFile, for the dedup check. Second, on the returned IndexedDocument.contentHash field. The host can pass that hash to engine.removeDocument(hash) as a stable identifier that survives a rename: if your user uploads same-contract-renamed.pdf and you want to remove the previously uploaded contract.pdf that contains the same bytes, the hash matches even though the names don't.
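On the host side, keying bookkeeping by content hash rather than by file name might look like the sketch below. Only the `contentHash` field name comes from the text. The `name` field, the hex-string representation, and the `UploadLedger` class are assumptions made for illustration.

```typescript
// Assumed shape: the text names only the contentHash field.
interface IndexedDocument {
  name: string;
  contentHash: string; // assumed to be a hex string
}

// Host-side ledger of uploads, keyed by content hash instead of file name.
class UploadLedger {
  private byHash = new Map<string, IndexedDocument>();

  record(doc: IndexedDocument): void {
    // Keep the first-seen entry; a renamed copy maps to the same key.
    if (!this.byHash.has(doc.contentHash)) {
      this.byHash.set(doc.contentHash, doc);
    }
  }

  // Find the earlier upload that shares these bytes, whatever it was called.
  find(contentHash: string): IndexedDocument | undefined {
    return this.byHash.get(contentHash);
  }

  // Drop the entry once the document has been removed from the engine.
  forget(contentHash: string): boolean {
    return this.byHash.delete(contentHash);
  }
}
```

After `find` returns the earlier upload, the host would pass the same hash to engine.removeDocument(hash) and then call `forget`, so the ledger and the index stay in step.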