OPEN SOURCE · MIT · ZERO‑CONFIG · v0.3.0

Document search, shipped as 33 KB. Zero allocator, zero backend.

Runs entirely in the browser — no server, no telemetry, no network call after load. Streaming parsers, accent‑insensitive fuzzy matching. Your documents are processed locally and never transmitted anywhere.

Quick start →
$ npm i albex
GitHub
RELEASE · v0.3.0 · 2026‑05‑30

Scanned‑PDF OCR

Image‑only PDFs are now searchable through the optional @albex/ocr companion. Tesseract.js, lazy by language, zero cost when not enabled.

Hybrid PDF mode

Opt‑in alwaysExtractEmbeddedImages OCRs embedded images on top of vector text. For contracts with scanned signatures, reports with screenshot tables.

Snapshot v2

Per‑document content hash now persisted. Content‑hash dedup survives the save/load round trip. v1 snapshots still load.

Parser‑crash recovery

When pdf-extract traps on an unusual PDF, the engine falls back to lopdf‑only image extraction. Many "unsupported" PDFs become searchable through OCR.

Hardened lite parsers

CSV strips the UTF‑8 BOM. EML decodes base64 and quoted‑printable bodies, walks nested multipart. RTF reads \'XX (cp1252) and \uN ? Unicode escapes.
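
To make the BOM and quoted‑printable handling above concrete, here is a minimal sketch of both steps. The function names stripUtf8Bom and decodeQuotedPrintable are hypothetical and are not part of the albex API; the real parsers live inside the WASM binary and may work differently.

```typescript
// Illustrative helpers only; names are hypothetical, not albex internals.

/** Drop a leading UTF-8 byte order mark (EF BB BF) if present. */
function stripUtf8Bom(bytes: Uint8Array): Uint8Array {
  const hasBom =
    bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf;
  return hasBom ? bytes.subarray(3) : bytes;
}

/** Decode a quoted-printable body (RFC 2045) into raw bytes. */
function decodeQuotedPrintable(input: string): Uint8Array {
  const out: number[] = [];
  for (let i = 0; i < input.length; i++) {
    if (input[i] !== "=") {
      out.push(input.charCodeAt(i) & 0xff);
      continue;
    }
    // Soft line break: "=" followed by CRLF or LF is removed entirely.
    if (input.startsWith("\r\n", i + 1)) { i += 2; continue; }
    if (input[i + 1] === "\n") { i += 1; continue; }
    const hex = input.slice(i + 1, i + 3);
    if (/^[0-9A-Fa-f]{2}$/.test(hex)) {
      out.push(parseInt(hex, 16)); // "=C3" becomes the byte 0xC3
      i += 2;
    } else {
      out.push(0x3d); // Malformed escape: keep the "=" literally.
    }
  }
  return Uint8Array.from(out);
}

const qpBytes = decodeQuotedPrintable("rescisi=C3=B3n del con=\r\ntrato");
const qpText = new TextDecoder("utf-8").decode(qpBytes);
console.log(qpText); // "rescisión del contrato"
```

Decoding to bytes first and to text second matters: a multi‑byte UTF‑8 character such as ó arrives as two separate escapes.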

Honest API rename

searchStream → searchCooperative. The old name implied incremental streaming the method never did. Deprecated alias preserved until 0.4.0.

Input · Parsers · Scratchpad · Indexer · Store · Search pipeline · Results
8 nodes · 7 edges / O(N log k) search

Runtime topology — interactive

CORE BUNDLE
33 KB
Main WASM binary
800k chunks
Index capacity (pro tier)
<5 ms
Typical query
0 deps
Core + ingest crates

Three steps, no server.

STEP 01 · DROP

The file enters the sandbox.

File ──▶ sandbox boundary
no network · no disk
STEP 02 · INDEX

Streaming parsers, BSS pool.

stream ──▶ parser ──▶ BSS pool
zero-copy · 64 KB scratch
STEP 03 · SEARCH

Bloom skips, Bitap finishes.

bloom filter ──▶ bitap
O(N log k) · heap top-K
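
As a rough illustration of "Bloom skips", the sketch below builds a 64‑bit signature per chunk and rejects chunks whose signature cannot cover the query. The bigram hashing scheme and the names signature and mayContain are assumptions made for this example; the page does not describe how albex derives its bits or how the filter interacts with fuzzy matching.

```typescript
// Hypothetical sketch; the bigram hashing scheme is an assumption, not albex's.
type Sig64 = [number, number]; // [low 32 bits, high 32 bits]

/** Hash a character bigram to a bit position 0..63 (FNV-1a style mixing). */
function bigramBit(a: number, b: number): number {
  let h = 0x811c9dc5;
  h = Math.imul(h ^ a, 0x01000193);
  h = Math.imul(h ^ b, 0x01000193);
  return (h >>> 0) % 64;
}

/** OR together one bit per bigram of the text. */
function signature(text: string): Sig64 {
  const sig: Sig64 = [0, 0];
  for (let i = 0; i + 1 < text.length; i++) {
    const bit = bigramBit(text.charCodeAt(i), text.charCodeAt(i + 1));
    if (bit < 32) sig[0] |= 1 << bit;
    else sig[1] |= 1 << (bit - 32);
  }
  return sig;
}

/** false: the chunk cannot contain the query verbatim. true: it might. */
function mayContain(chunk: Sig64, query: Sig64): boolean {
  return (chunk[0] & query[0]) === query[0] && (chunk[1] & query[1]) === query[1];
}

const chunkSig = signature("la clausula novena del contrato");
console.log(mayContain(chunkSig, signature("clausula"))); // true: a substring always passes
console.log(mayContain(chunkSig, signature("zzqx")));     // a true here would be a false positive
```

A filter like this has no false negatives for exact substrings, so only the chunks that pass need the more expensive Bitap scan.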

Four stages, one pass.

01
Bloom filter
64‑bit probabilistic set
02
Bitap (Shift‑Or)
Wu‑Manber · up to 3 edits
03
Rich scoring
5 components · capped at 1000
exact 400 · fuzzy 240 · pos 180 · prox 140 · freq 40
04
Min‑heap top‑K
27× faster than insertion sort
1000 · 842 · 791 · 680 · 624 · 557 · 503
TEXT_POOL 16 MB
CHUNKS 3.2 MB
scratchpad
doc names
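
The fourth stage can be sketched as a bounded binary min‑heap: it keeps only the k best hits seen so far, which gives the O(N log k) bound quoted above. The Hit shape and the topK function are hypothetical names for this example, not the engine's internal types.

```typescript
// Minimal sketch of min-heap top-K selection; names are illustrative.
interface Hit { doc: string; score: number }

function topK(hits: readonly Hit[], k: number): Hit[] {
  if (k <= 0) return [];
  const heap: Hit[] = []; // heap[0] is always the lowest score currently kept

  const swap = (i: number, j: number): void => {
    const tmp = heap[i]; heap[i] = heap[j]; heap[j] = tmp;
  };
  const siftUp = (i: number): void => {
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (heap[parent].score <= heap[i].score) break;
      swap(i, parent);
      i = parent;
    }
  };
  const siftDown = (i: number): void => {
    for (;;) {
      const left = 2 * i + 1;
      const right = left + 1;
      let smallest = i;
      if (left < heap.length && heap[left].score < heap[smallest].score) smallest = left;
      if (right < heap.length && heap[right].score < heap[smallest].score) smallest = right;
      if (smallest === i) return;
      swap(i, smallest);
      i = smallest;
    }
  };

  for (const hit of hits) {
    if (heap.length < k) {
      heap.push(hit);
      siftUp(heap.length - 1);
    } else if (hit.score > heap[0].score) {
      heap[0] = hit; // evict the current minimum
      siftDown(0);
    }
  }
  return heap.sort((a, b) => b.score - a.score); // best first
}

const sampleScores = [557, 1000, 503, 842, 624, 791, 680];
const best = topK(sampleScores.map((score, i) => ({ doc: `doc-${i}`, score })), 3);
console.log(best.map((h) => h.score)); // [1000, 842, 791]
```

Each candidate costs at most one comparison against the heap root, and only candidates that beat the current minimum pay the log k sift.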

Built for the browser edge.

Zero backend

Entire engine lives in a 33 KB WASM binary. No server, no cloud, no network call after load.

wasm
Fuzzy & typo‑tolerant

Bitap (Shift‑Or / Wu‑Manber) up to 3 edits. Finds "clausula" even when typed "claúsula".

bitap
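
For readers unfamiliar with Bitap, the following is a small sketch of the Wu‑Manber extension that tolerates up to maxErrors edits. It uses the Shift‑And formulation (the bit‑inverted twin of Shift‑Or) and 32‑bit JavaScript integers, so patterns are limited to 31 characters here, whereas the page states the engine uses a u64 register. The function name bitapSearch is hypothetical.

```typescript
// Illustrative Wu-Manber Bitap (Shift-And form); not the engine's implementation.

/** Returns the index of the last character of the first match, or -1. */
function bitapSearch(text: string, pattern: string, maxErrors: number): number {
  const m = pattern.length;
  if (m === 0 || m > 31) throw new RangeError("pattern must be 1..31 chars in this sketch");

  // masks.get(c) has bit j set when pattern[j] === c.
  const masks = new Map<string, number>();
  for (let j = 0; j < m; j++) {
    masks.set(pattern[j], (masks.get(pattern[j]) ?? 0) | (1 << j));
  }

  // state[d], bit j: pattern[0..j] matches a suffix of the text read so far
  // with at most d edits. Initially d pattern characters may be skipped.
  const state: number[] = [];
  for (let d = 0; d <= maxErrors; d++) state.push((1 << d) - 1);
  const accept = 1 << (m - 1);

  for (let i = 0; i < text.length; i++) {
    const mask = masks.get(text[i]) ?? 0;
    let prevOld = state[0]; // state[d - 1] as it was before this character
    state[0] = ((state[0] << 1) | 1) & mask;
    for (let d = 1; d <= maxErrors; d++) {
      const old = state[d];
      state[d] =
        (((old << 1) | 1) & mask) |        // character matches
        prevOld |                          // extra character in the text
        ((prevOld | state[d - 1]) << 1) |  // substitution, or skipped pattern character
        1;                                 // first pattern character absorbed by one edit
      prevOld = old;
    }
    if ((state[maxErrors] & accept) !== 0) return i;
  }
  return -1;
}

console.log(bitapSearch("la clausula novena", "clausula", 0)); // 10
console.log(bitapSearch("la clausla novena", "clausula", 0));  // -1
console.log(bitapSearch("la clausla novena", "clausula", 1));  // 9
```

The whole automaton state for each error level fits in one machine word, which is why the pattern length is bounded by the register width.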
Accent‑insensitive

Latin‑1 + Latin‑A fold. ES / FR / DE / IT / PT / PL / CZ / TR transparently.

unicode
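
One way to approximate accent folding in plain TypeScript is Unicode decomposition plus a small table for letters that do not decompose. This is a swapped‑in technique for illustration: the page says the engine folds Latin‑1 and Latin‑A, but does not say how, and foldAccents and EXTRA_FOLDS are hypothetical names.

```typescript
// Sketch using Unicode NFD; the engine's own fold tables are not shown on this page.

// Letters without a canonical decomposition need explicit entries.
const EXTRA_FOLDS: Record<string, string> = {
  "ł": "l", // Polish
  "ø": "o", // Danish / Norwegian
  "ß": "ss", // German
  "ı": "i", // Turkish dotless i
  "đ": "d", // Croatian / Vietnamese
};

function foldAccents(text: string): string {
  return text
    .toLowerCase()                    // lowercase first so "İ" becomes "i" + combining dot
    .normalize("NFD")                 // split letters from their combining marks
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining marks
    .replace(/[łøßıđ]/g, (ch) => EXTRA_FOLDS[ch] ?? ch);
}

console.log(foldAccents("Cláusula"));  // "clausula"
console.log(foldAccents("rescisión")); // "rescision"
console.log(foldAccents("Łódź"));      // "lodz"
console.log(foldAccents("Straße"));    // "strasse"
```

Folding both the indexed text and the query with the same function is what makes the match accent‑insensitive.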
Query DSL

Phrase, OR, fuzzy, mixed. "a b" | c syntax. Up to 4 tokens per simple query.

dsl
11 formats

DOCX · XLSX · PDF · HTML · MD · JSON · CSV · EML · RTF · TXT · XML. Heavy ones stream; the lite ones (CSV, EML, RTF) handle BOM, base64, cp1252.

parsers
+ @albex/ocr
OCR companion

@albex/ocr drops Tesseract.js next to the engine. Six languages are loaded automatically on demand. Scanned PDFs become searchable.

ocr
+ @albex/ocr
Hybrid PDF mode

Opt‑in: native PDFs get their embedded images OCR’d too. For contracts with scanned signatures, reports with screenshots.

hybrid
Zero allocator

Static BSS region only. No heap fragmentation, no GC pressure, no OOM mid‑search.

no-alloc
Adapts to the host

3 tier binaries (mini / std / pro) × SIMD variants. Auto‑picked from deviceMemory and WASM probes.

tiers
Cooperative search

searchCooperative() with frameBudgetMs yields via scheduler.yield() between slices, so the UI thread keeps a chance to paint.

frame-budget
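
The frame‑budget idea can be sketched independently of the engine: do work until the budget is spent, then yield to the host before continuing. The helper names yieldToHost and cooperativeFilter are hypothetical, and this is not the signature of searchCooperative(), which the page does not spell out. The sketch falls back to setTimeout where scheduler.yield() is unavailable.

```typescript
// Generic cooperative loop; an illustration of the pattern, not albex's API.

/** Yield to the host: scheduler.yield() where available, otherwise a macrotask. */
function yieldToHost(): Promise<void> {
  const sched = (globalThis as unknown as {
    scheduler?: { yield?: () => Promise<void> };
  }).scheduler;
  if (sched && typeof sched.yield === "function") return sched.yield();
  return new Promise<void>((resolve) => setTimeout(resolve, 0));
}

async function cooperativeFilter<T>(
  items: readonly T[],
  predicate: (item: T) => boolean,
  frameBudgetMs: number,
): Promise<{ matches: T[]; yields: number }> {
  const matches: T[] = [];
  let yields = 0;
  let sliceStart = performance.now();
  for (const item of items) {
    if (predicate(item)) matches.push(item);
    // Budget spent: let the host paint or handle input, then start a new slice.
    if (performance.now() - sliceStart >= frameBudgetMs) {
      await yieldToHost();
      yields++;
      sliceStart = performance.now();
    }
  }
  return { matches, yields };
}

const chunks = ["clausula novena", "anexo", "clausula decima"];
const pending = cooperativeFilter(chunks, (c) => c.includes("clausula"), 4);
pending.then((r) => console.log(r.matches));
```

A budget of a few milliseconds leaves most of a 16 ms frame for rendering; a budget of 0 yields after every item.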
Worker pool

AlbexPool shards documents across N workers. Map‑reduce search merges global top‑K.

pool
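
The reduce half of the map‑reduce search is simple enough to sketch. Each shard returns its own top‑K, and the global top‑K is always contained in the union of those lists, so merging them loses nothing. ShardHit and mergeTopK are hypothetical names, and the scores below are made‑up sample values; how AlbexPool communicates with its workers is not covered here.

```typescript
// Sketch of the reduce step only; worker messaging is out of scope.
interface ShardHit { doc: string; score: number }

/** Merge per-shard top-K lists into one global top-K, best first. */
function mergeTopK(shardResults: readonly ShardHit[][], k: number): ShardHit[] {
  const all: ShardHit[] = [];
  for (const hits of shardResults) all.push(...hits);
  return all.sort((a, b) => b.score - a.score).slice(0, Math.max(0, k));
}

const merged = mergeTopK(
  [
    [{ doc: "contrato.pdf", score: 910 }, { doc: "anexo.docx", score: 400 }],
    [{ doc: "acta.eml", score: 760 }, { doc: "notas.txt", score: 120 }],
  ],
  2,
);
console.log(merged.map((h) => h.doc)); // ["contrato.pdf", "acta.eml"]
```

With N workers the merge handles at most N × K hits, which is small compared with the corpus each worker scans.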
WebGPU pre‑filter

WGSL compute shader runs Bloom in parallel for large corpora. Experimental; opt‑in via gpu auto‑select.

gpu
Tiered storage

TieredStore evicts cold docs to OPFS, promotes on demand. Search archives that exceed RAM.

tiered
Snapshot v2

Per‑document content hash persisted. Save the index to OPFS in milliseconds; reload across sessions with dedup intact.

persistence

Query language.

Pattern      Meaning                       Example match
word         Single fuzzy token            The word in context
a b c        All tokens near each other    found a then b then c
"a b"        Exact phrase                  matched a b exactly
a | b        Either token                  has a or b
"a b" | c    Phrase OR token               has a b or just c
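
To show how the patterns above decompose, here is a small parser sketch that splits a query into OR‑alternatives and classifies each as a phrase or a token group. The Clause type and parseQuery are hypothetical; the engine's real grammar may differ, and this sketch does not handle a phrase mixed with loose tokens inside one alternative or enforce the token limit.

```typescript
// Illustrative parser for the table above; not the engine's actual grammar.
type Clause =
  | { kind: "phrase"; text: string }
  | { kind: "tokens"; tokens: string[] };

function parseQuery(query: string): Clause[] {
  const clauses: Clause[] = [];
  // Split on "|" characters that sit outside double quotes.
  const parts = query.match(/(?:"[^"]*"|[^|])+/g) ?? [];
  for (const raw of parts) {
    const part = raw.trim();
    if (part === "") continue;
    const phrase = /^"([^"]*)"$/.exec(part);
    if (phrase) {
      clauses.push({ kind: "phrase", text: phrase[1] });
    } else {
      clauses.push({ kind: "tokens", tokens: part.split(/\s+/) });
    }
  }
  return clauses;
}

console.log(JSON.stringify(parseQuery('"clausula novena" | rescisión')));
// [{"kind":"phrase","text":"clausula novena"},{"kind":"tokens","tokens":["rescisión"]}]
```

Each returned clause is one alternative; a document matches the query when it matches any of them.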

Up in under a minute.

index.ts
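import { AlbexEngine } from "albex";

// Assumes the page contains an <input type="file" multiple> element.
const input = document.querySelector<HTMLInputElement>('input[type="file"]')!;
const engine = await AlbexEngine.create();

// Index each selected file; parsing happens locally in the browser.
for (const file of input.files ?? []) {
  await engine.indexFile(file);
}

// Phrase OR token: matches the exact phrase or the single word.
const hits = engine.search('"clausula novena" | rescisión');
for (const h of hits) {
  console.log(h.documentName, h.location, h.score, h.snippet);
}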

What goes in.

.docx            Word document     Streaming XML parser, paragraph + table extraction    rust
.xlsx            Excel workbook    Shared strings + inline strings streaming             rust
.pdf             PDF document      Text stream extraction, lazy loaded                   lazy · ~1 MB
.md .markdown    Markdown          CommonMark stripped, paragraphs preserved
.html .htm       HTML              <script>/<style> skipped, block-level paragraphs
.json            JSON              Recursive walk; keys + leaf strings indexed
.csv             CSV               RFC 4180 lite; one row per chunk
.eml             Email (MIME)      From/To/Subject + first text/plain body part
.rtf             RTF               Control words and groups stripped, text preserved
.txt             Plain text        Direct UTF-8 pass-through
.xml             XML               Tag-stripped, entity-decoded

Default limits.

16 MB       Text pool         std tier; 4 MB mini · 128 MB pro
100k        Chunk capacity    std tier; 25k mini · 800k pro
128 docs    Document limit    std tier; 32 mini · 1 024 pro
64 chars    Query length      Bitap u64 register width

Ship document search today.

No backend required. No data leaves the browser.

npm     npm install albex
pnpm    pnpm add albex
deno    deno add npm:albex