How Albex works inside.
Albex is a full-text search engine that lives entirely in the user's browser. That sentence is the thesis. Every other choice — the algorithms, the memory model, the absence of an allocator — is a corollary of it. This page walks through how the pieces fit together.
Four axioms follow:
- Zero server. No back-end ever. Documents never transit the network.
- Human latency. Searches must answer within ~100 ms — the perception threshold for "instant".
- Reduced footprint. Bytes shipped to the browser are a tax on every page load.
- Adapts to the machine. The same package runs on a Chromebook and a Mac Pro, picking what each can support.
The layer map
The codebase is split into six layers. Dependencies flow strictly upward: a layer never calls down into the one below it. This is what lets the algorithm layer be testable without WASM, and the adaptive layer be evolved without touching Rust.
Bloom filter, Bitap matcher, light Spanish stemmer, Unicode fold tables. No heap, no allocator, no I/O. Testable on host without WASM.
XML byte state machines for DOCX and XLSX. Same constraints as layer 0. New Rust formats live here.
BSS arrays, C ABI exports, EngineState bookkeeping, resumable search, tier feature flags. Pegs layers 0 and 1 to a JS-facing interface.
pdf-extract wrapper. Its own WASM binary (~1 MB) loaded lazily on first PDF — never paid by users who do not touch PDFs.
AlbexEngine API, format indexers in TS, query parser, typed errors, persistence layer, content hashing.
Profile detection, resource awareness, worker pool, WebGPU runtime, tiered storage. Strictly opt-in; the base engine works without any of it.
The whole machine, end to end
Every box below is a real piece of Albex — from the browser drop zone, through the JS↔WASM scratchpad bridge, into the Rust core and each algorithm, out to the PDF/OCR path, persistence, workers and the GPU. The animated edges trace the hot search cascade: Bloom → trigram → Bitap → score → top-K. Click any node to read how that piece works in detail.
From query to ranked results in four stages
Why this order matters
The cost ladder is deliberate: the cheapest filter runs first on every chunk; the expensive ones only on what survives. A typical search tests 100 000 chunks but runs the Bitap matcher on under 1 % of them. The Bloom filter is the gatekeeper that keeps total search latency under 10 ms even on big corpora.
64-bit probabilistic gate
A Bloom filter answers "could the pattern be here?" in two machine instructions: an AND and a compare. We use a single u64 per chunk, with each character mapped to one of 64 buckets by c & 0x3F.
The function hash is intentionally trivial. A cryptographic hash would give better distribution but cost 10–100× more CPU. The Bloom runs on every chunk in the corpus — sometimes 100 000 times per query — so raw speed wins over perfect distribution.
For typical European text the filter rejects 80–95 % of chunks before the more expensive Bitap stage. False positives are unavoidable but harmless: Bitap discards them at no extra cost.
Bit-parallel fuzzy match
The Shift-Or / Wu-Manber algorithm keeps the matching state inside a single u64 register. Each text byte advances the state with a shift + OR + AND — no backtracking, no per-character allocations, no branches dependent on the data.
We extend it to k errors (substitution, insertion, deletion) by maintaining k+1 parallel registers. Albex caps k at 3: more produces noise on short queries and multiplies the linear cost. Pattern length is capped at 64 bytes — the width of the register. Longer queries are truncated with a truncated flag the host can surface in its UI.
BSS-only, no allocator
The default wasm32-unknown-unknown target has no OS to call for memory. Pulling in std would drag in dlmalloc or wee_alloc, inflate the binary, add allocation overhead, and introduce a failure mode (OOM) that is hard to surface cleanly to JavaScript.
Instead, every storage region is a zero-initialised static array. Because the BSS segment is described in the binary but not stored, a .wasm with 16 MB of static arrays still weighs ~33 KB on disk. At runtime the WebAssembly linear memory grows to accommodate the layout (~20 MB for std tier), but this grow happens once at instantiation — no per-call allocation, no fragmentation, no OOM during a search.
The trade-off: capacity is a fixed ceiling. Exceeding it is a silent stop, not an error. That is acceptable because the user already chose a tier; a "give me more space" feature would mean an allocator, which would mean fragmentation and unpredictable latency.
Same package, different machines
Albex ships six variants of the main WASM binary (three capacity tiers, each with and without SIMD) plus the lazy PDF module. The browser fetches one variant per init() call, picked from the device profile.
deviceMemory ≤ 1 GBdeviceMemory 2–7 GBdeviceMemory ≥ 8 GBSIMD on top of every tier
The runtime probes for WebAssembly v128 by attempting to validate a tiny module that uses it. If it validates, the engine fetches albex_wasm_<tier>_simd.wasm — a binary built with -C target-feature=+simd128. The inner loop of the Bloom batch is branchless precisely so LLVM can auto-vectorise it.
Why "8 GB" appears as a ceiling everywhere
The W3C navigator.deviceMemory spec caps reported memory at 8 GB for privacy. A Mac Pro with 192 GB reports 8 just like an iPad with 16 GB. We can't distinguish them. The pro tier (128 MB pool) is the largest default because the working set must still fit comfortably in a browser tab — Chrome starts flagging pages as "high memory" around 500 MB.
For genuine archive-scale corpora the right answer is the TieredStore (LRU eviction with OPFS-persisted blobs) or the AlbexPool sharded across N workers, not a tier with 1 GB pool.
Bloom on the GPU when it pays off
The Bloom check on 100 000 chunks is bashfully parallel: each test is independent. A WGSL compute shader can run all 100 000 of them in a single dispatch. The result is a packed bitset of candidates; the CPU then runs Bitap only on those.
Engaging the GPU has a fixed cost (upload + dispatch + readback, ~3–5 ms). Below ~20 000 chunks the CPU wins. Above it, the GPU gives a 5–10× speedup. The default threshold is exactly that break-even point, but it is exposed as gpuThreshold for tuning.
The candidate mask is single-shot: it applies to the next searchBegin and is automatically cleared at the end of the last slice. That prevents stale masks from silently corrupting an unrelated next search if anything ever goes wrong in the JS layer.
When each language wins
Some parsers live in Rust; others live in TypeScript. The rule is not "Rust for important things". It is: which primitive of the host environment wins for this operation?
DOCX, XLSX and PDF go to Rust because the source files can be hundreds of megabytes and streaming a state machine is the only way to handle them. Markdown, HTML, JSON, CSV, EML and RTF go to TypeScript because they fit comfortably in memory and V8 gives us String.prototype.replace with JIT-compiled regex, which beats any hand-written WASM parser for documents of typical size.
Where to read more
This page summarises the parts that come up most often in conversation. For the full pseudocode, layout tables and side-by-side cost comparisons, check the in-repo Maintainer's guide (MAINTAINER.md) and Technical deep-dive (TECHNICAL.md).