Runs entirely in the browser — no server, no telemetry, no network call after load. Streaming parsers, accent‑insensitive fuzzy matching. Your documents are processed locally and never transmitted anywhere.
Image‑only PDFs are now searchable through the optional @albex/ocr companion. Tesseract.js, lazy by language, zero cost when not enabled.
Opt‑in alwaysExtractEmbeddedImages OCRs embedded images on top of vector text. For contracts with scanned signatures, reports with screenshot tables.
Per‑document content hash now persisted. Content‑hash dedup survives the save/load round trip. v1 snapshots still load.
When pdf-extract traps on an unusual PDF, the engine falls back to lopdf‑only image extraction. Many "unsupported" PDFs become searchable through OCR.
CSV strips the UTF‑8 BOM. EML decodes base64 and quoted‑printable bodies, walks nested multipart. RTF reads \'XX (cp1252) and \uN ? Unicode escapes.
searchStream → searchCooperative. The old name implied incremental streaming the method never did. Deprecated alias preserved until 0.4.0.
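The rename reflects cooperation with the scheduler rather than streamed results. As a rough illustration of frame-budgeted slicing, here is a minimal TypeScript sketch. It is not the albex implementation: `searchInSlices`, `yieldToHost`, and the `scoreDoc` callback are hypothetical names, and the `setTimeout` fallback is an assumption for hosts that lack `scheduler.yield()`.

```typescript
// Hypothetical illustration only; none of these names come from the albex API.
type Scored = { id: number; score: number };

// Hand control back to the host so the UI can paint.
// Falls back to a macrotask when scheduler.yield() is not available.
async function yieldToHost(): Promise<void> {
  const sched = (globalThis as any).scheduler;
  if (sched && typeof sched.yield === "function") {
    await sched.yield();
  } else {
    await new Promise<void>((resolve) => setTimeout(resolve, 0));
  }
}

// Score documents one by one, yielding whenever the current slice
// has used up its time budget.
async function searchInSlices<T>(
  docs: readonly T[],
  scoreDoc: (doc: T) => number,
  frameBudgetMs: number,
): Promise<{ hits: Scored[]; yields: number }> {
  const hits: Scored[] = [];
  let yields = 0;
  let id = 0;
  let sliceStart = performance.now();
  for (const doc of docs) {
    const score = scoreDoc(doc);
    if (score > 0) hits.push({ id, score });
    id++;
    if (performance.now() - sliceStart >= frameBudgetMs) {
      await yieldToHost();
      yields++;
      sliceStart = performance.now();
    }
  }
  hits.sort((a, b) => b.score - a.score); // best score first
  return { hits, yields };
}
```

A smaller budget yields more often and keeps the page responsive at the cost of total search time; a larger budget does the opposite.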
The file enters the sandbox.
Streaming parsers, BSS pool.
Bloom skips, Bitap finishes.
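The "Bitap finishes" step refers to the Shift-Or family of bit-parallel matchers. The sketch below shows the textbook Wu-Manber extension for approximate matching, written in the Shift-And form for readability. It is a TypeScript illustration under stated assumptions, not the engine's WASM code: the function name `bitapFind` is hypothetical, patterns are limited to 31 UTF-16 code units so the state fits in one 32-bit integer, and no accent folding is applied.

```typescript
// Returns the index of the last character of the first substring of `text`
// that is within `maxEdits` edits of `pattern`, or -1 if there is none.
function bitapFind(text: string, pattern: string, maxEdits: number): number {
  const m = pattern.length;
  if (m === 0 || m > 31) throw new RangeError("pattern length must be 1..31");
  if (maxEdits < 0 || maxEdits >= m) throw new RangeError("maxEdits must be 0..m-1");

  // masks[ch] has bit j set when pattern[j] === ch.
  const masks = new Map<string, number>();
  for (let j = 0; j < m; j++) {
    const ch = pattern.charAt(j);
    masks.set(ch, (masks.get(ch) ?? 0) | (1 << j));
  }

  const accept = 1 << (m - 1);
  // state[d] bit j: pattern[0..j] matches a suffix of the text read so far
  // with at most d edits. Initially d pattern characters may be skipped.
  const state = new Int32Array(maxEdits + 1);
  for (let d = 0; d <= maxEdits; d++) state[d] = (1 << d) - 1;

  for (let i = 0; i < text.length; i++) {
    const mask = masks.get(text.charAt(i)) ?? 0;
    let prevOld = state[0]; // state[d - 1] before this character
    state[0] = ((prevOld << 1) | 1) & mask;
    for (let d = 1; d <= maxEdits; d++) {
      const old = state[d];
      state[d] =
        (((old << 1) | 1) & mask) |   // exact character match
        prevOld |                     // extra character in the text
        ((prevOld << 1) | 1) |        // substituted character
        ((state[d - 1] << 1) | 1);    // pattern character missing from the text
      prevOld = old;
    }
    if ((state[maxEdits] & accept) !== 0) return i;
  }
  return -1;
}
```

With this sketch, searching "la claúsula novena" for "clausula" finds a match when `maxEdits` is 1 and returns -1 when it is 0, because the accented character counts as one substitution.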
wasm: Entire engine lives in a 33 KB WASM binary. No server, no cloud, no network call after load.
bitap: Bitap (Shift‑Or / Wu‑Manber) up to 3 edits. Finds "clausula" even when typed "claúsula".
unicode: Latin‑1 + Latin‑A fold. ES / FR / DE / IT / PT / PL / CZ / TR transparently.
dsl: Phrase, OR, fuzzy, mixed. "a b" | c syntax. Up to 4 tokens per simple query.
parsers: DOCX · XLSX · PDF · HTML · MD · JSON · CSV · EML · RTF · TXT · XML. Heavy ones stream; the lite ones (CSV, EML, RTF) handle BOM, base64, cp1252.
ocr: @albex/ocr drops Tesseract.js next to the engine. Six languages auto‑loaded by demand. Scanned PDFs become searchable.
hybrid: Opt‑in: native PDFs get their embedded images OCR’d too. For contracts with scanned signatures, reports with screenshots.
no-alloc: Static BSS region only. No heap fragmentation, no GC pressure, no OOM mid‑search.
tiers: 3 tier binaries (mini / std / pro) × SIMD variants. Auto‑picked from deviceMemory and WASM probes.
frame-budget: searchCooperative() with frameBudgetMs yields to scheduler.yield() between slices. UI thread keeps a chance to paint.
pool: AlbexPool shards documents across N workers. Map‑reduce search merges global top‑K.
gpu: WGSL compute shader runs Bloom in parallel for large corpora. Experimental; opt‑in via gpu auto‑select.
tiered: TieredStore evicts cold docs to OPFS, promotes on demand. Search archives that exceed RAM.
persistence: Per‑document content hash persisted. Save the index to OPFS in milliseconds; reload across sessions with dedup intact.

| Pattern | Meaning | Example match |
|---|---|---|
| word | Single fuzzy token | The word in context |
| a b c | All tokens near each other | found a then b then c |
| "a b" | Exact phrase | matched a b exactly |
| a \| b | Either token | has a or b |
| "a b" \| c | Phrase OR token | has a b or just c |
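To make the query syntax in the table concrete, here is a small TypeScript sketch of how such a string could be split into alternatives and terms. The grammar is inferred from the table only; `parseQuery` and the `Term` type are hypothetical names, and the real parser may differ (for example, this sketch does not enforce the 4-token limit).

```typescript
// A term is either a quoted exact phrase or a bare fuzzy token.
type Term = { kind: "phrase" | "token"; text: string };

// Splits a query into OR-alternatives; each alternative is a list of terms
// that must all match near each other.
function parseQuery(query: string): Term[][] {
  const alternatives: Term[][] = [[]];
  // Group 1: quoted phrase, group 2: OR bar, group 3: bare word.
  const tokenRe = /"([^"]*)"|(\|)|([^\s"|]+)/g;
  let match: RegExpExecArray | null;
  while ((match = tokenRe.exec(query)) !== null) {
    const current = alternatives[alternatives.length - 1];
    if (match[1] !== undefined) {
      const text = match[1].trim();
      if (text.length > 0) current.push({ kind: "phrase", text });
    } else if (match[2] !== undefined) {
      alternatives.push([]); // start a new OR branch
    } else {
      current.push({ kind: "token", text: match[3] });
    }
  }
  // Drop empty branches such as the one left by a trailing "|".
  return alternatives.filter((alt) => alt.length > 0);
}
```

Under this reading, `"a b" | c` produces two alternatives (one phrase, one token), while `a b c` produces a single alternative with three tokens.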
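import { AlbexEngine } from "albex";

// Assumption: the page contains an <input type="file" multiple> element.
const input = document.querySelector('input[type="file"]');

const engine = await AlbexEngine.create();

// Index every selected file locally.
for (const file of input.files) {
  await engine.indexFile(file);
}

// Exact phrase OR fuzzy token.
const hits = engine.search('"clausula novena" | rescisión');
for (const h of hits) {
  console.log(h.documentName, h.location, h.score, h.snippet);
}

npm install albex
pnpm add albex
deno add npm:albex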
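The query in the usage example mixes unaccented ("clausula") and accented ("rescisión") spellings, which works because matching is accent-insensitive. The TypeScript sketch below shows one common way to get that behaviour: Unicode NFD normalization followed by stripping combining marks. This is a swapped-in technique for illustration, not the engine's own Latin-1 + Latin-A fold; `foldAccents` and `EXTRA_FOLDS` are hypothetical names, and the extra map covers only a few letters that NFD does not decompose.

```typescript
// Letters without a canonical decomposition need an explicit mapping.
const EXTRA_FOLDS: Record<string, string> = {
  "\u0142": "l",  // ł
  "\u0141": "L",  // Ł
  "\u0131": "i",  // ı (dotless i)
  "\u00df": "ss", // ß
  "\u00f8": "o",  // ø
  "\u00d8": "O",  // Ø
  "\u0111": "d",  // đ
  "\u0110": "D",  // Đ
};

// Folds accented Latin letters to their base letters and lowercases the result.
function foldAccents(input: string): string {
  return input
    .normalize("NFD")                 // split base letters from accents
    .replace(/[\u0300-\u036f]/g, "")  // drop combining diacritical marks
    .replace(/[\u0142\u0141\u0131\u00df\u00f8\u00d8\u0111\u0110]/g, (ch) => EXTRA_FOLDS[ch] ?? ch)
    .toLowerCase();
}
```

Folding both the indexed text and the query with the same function makes "claúsula" and "clausula" compare equal before any fuzzy matching is applied.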