OPEN SOURCE · MIT · ZERO‑CONFIG · v0.3.0

Document search, shipped as 33 KB. Zero allocator, zero backend.

Runs entirely in the browser — no server, no telemetry, no network call after load. Streaming parsers, accent‑insensitive fuzzy matching. Your documents are processed locally and never transmitted anywhere.

Quick start →

$ npm i albex

GitHub

RELEASE · v0.3.0 · 2026‑05‑30

▣

Scanned‑PDF OCR

Image‑only PDFs are now searchable through the optional @albex/ocr companion. Tesseract.js, lazy by language, zero cost when not enabled.

⌷

Hybrid PDF mode

Opt‑in alwaysExtractEmbeddedImages OCRs embedded images on top of vector text. For contracts with scanned signatures, reports with screenshot tables.

◇

Snapshot v2

Per‑document content hash now persisted. Content‑hash dedup survives the save/load round trip. v1 snapshots still load.

⌬

Parser‑crash recovery

When pdf-extract traps on an unusual PDF, the engine falls back to lopdf‑only image extraction. Many "unsupported" PDFs become searchable through OCR.

⊞

Hardened lite parsers

CSV strips the UTF‑8 BOM. EML decodes base64 and quoted‑printable bodies, walks nested multipart. RTF reads \'XX (cp1252) and \uN ? Unicode escapes.
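The BOM and quoted‑printable handling can be sketched in a few lines. This is an illustration of the two techniques, not the engine's parser code; `stripBom` and `decodeQuotedPrintable` are hypothetical names.

```typescript
// Illustrative sketch only: function names are assumptions, not Albex APIs.

/** Drop a leading U+FEFF, which is how a UTF-8 BOM appears after decoding. */
function stripBom(text: string): string {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

/** Decode a quoted-printable body into raw bytes. */
function decodeQuotedPrintable(input: string): number[] {
  // "=" at the end of a line is a soft line break and carries no data.
  const body = input.replace(/=\r?\n/g, "");
  const bytes: number[] = [];
  for (let i = 0; i < body.length; i++) {
    const hex = body.slice(i + 1, i + 3);
    if (body.charAt(i) === "=" && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16)); // "=C3" becomes the byte 0xC3
      i += 2;
    } else {
      bytes.push(body.charCodeAt(i) & 0xff); // literal character
    }
  }
  return bytes;
}
```

The decoded bytes still need a charset decode (UTF‑8, cp1252, and so on) before indexing.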

≋

Honest API rename

searchStream → searchCooperative. The old name implied incremental streaming the method never did. Deprecated alias preserved until 0.4.0.

Full CHANGELOG →

Input→Parsers→Scratchpad→Indexer→Store→Search pipeline→Results

8 nodes · 7 edges · O(N log k) search

CORE BUNDLE

33 KB

Main WASM binary

800k chunks

Index capacity (pro tier)

<5 ms

Typical query

0 deps

Core + ingest crates

HOW IT WORKS

Three steps, no server.

STEP 01 · DROP

The file enters the sandbox.

File ──▶ sandbox boundary
no network · no disk

STEP 02 · INDEX

Streaming parsers, BSS pool.

stream ──▶ parser ──▶ BSS pool
zero-copy · 64 KB scratch

STEP 03 · SEARCH

Bloom skips, Bitap finishes.

bloom filter ──▶ bitap
O(N log k) · heap top-K

PIPELINE

Four stages, one pass.

Bloom filter

64‑bit probabilistic set
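A minimal sketch of how a 64‑bit signature can rule out chunks before the expensive match runs. The trigram granularity, the FNV‑1a hash, and the names `signature` and `mayContain` are assumptions for illustration; the engine's actual filter may differ, and a filter that must admit fuzzy matches would need a looser test than the exact one shown here.

```typescript
// Illustrative sketch only: hash, trigram granularity, and names are assumptions.
interface Sig64 { lo: number; hi: number } // two 32-bit halves of one 64-bit set

/** FNV-1a hash of a short string, reduced to a bit position in [0, 64). */
function bitPosition(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h % 64;
}

/** Set one bit per trigram of the text. */
function signature(text: string): Sig64 {
  const sig: Sig64 = { lo: 0, hi: 0 };
  for (let i = 0; i + 3 <= text.length; i++) {
    const bit = bitPosition(text.slice(i, i + 3));
    if (bit < 32) sig.lo |= 1 << bit;
    else sig.hi |= 1 << (bit - 32);
  }
  return sig;
}

/** false: some query trigram is certainly absent. true: the chunk may match. */
function mayContain(chunk: Sig64, query: Sig64): boolean {
  return (chunk.lo & query.lo) === query.lo && (chunk.hi & query.hi) === query.hi;
}
```

A chunk whose signature fails the test is skipped without being scanned; a chunk that passes still has to be verified, since different trigrams can share a bit.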

Bitap (Shift‑Or)

Wu‑Manber · up to 3 edits
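The approximate matcher can be sketched as follows, using the textbook Wu‑Manber recurrence in its Shift‑And form. This sketch keeps state in 32‑bit numbers, so it caps patterns at 32 characters; a 64‑bit register, as in the capacity table, would allow 64. The function name is hypothetical.

```typescript
// Illustrative sketch only: textbook Wu-Manber bitap, not the engine's code.
// Bit i of r[d] is set when pattern[0..i] matches a suffix of the text read
// so far with at most d edits (insert, delete, substitute).
function bitapFuzzyContains(text: string, pattern: string, maxEdits: number): boolean {
  const m = pattern.length;
  if (m <= maxEdits) return true; // the whole pattern can be edited away
  if (m > 32) throw new RangeError("pattern exceeds the 32-bit register");

  // One bitmask per character: bit i is set where pattern[i] equals it.
  const masks = new Map<string, number>();
  for (let i = 0; i < m; i++) {
    const c = pattern.charAt(i);
    masks.set(c, (masks.get(c) ?? 0) | (1 << i));
  }

  const accept = 1 << (m - 1);
  const r: number[] = [];
  for (let d = 0; d <= maxEdits; d++) r.push((1 << d) - 1);

  for (let j = 0; j < text.length; j++) {
    const mask = masks.get(text.charAt(j)) ?? 0;
    let prevOld = r[0]; // value of r[d - 1] before this character
    r[0] = ((r[0] << 1) | 1) & mask;
    for (let d = 1; d <= maxEdits; d++) {
      const old = r[d];
      r[d] =
        (((old << 1) | 1) & mask) |          // characters match
        prevOld |                            // extra character in the text
        (((prevOld | r[d - 1]) << 1) | 1);   // substitution or missing character
      prevOld = old;
    }
    if ((r[maxEdits] & accept) !== 0) return true;
  }
  return false;
}
```

Each text character costs a handful of bitwise operations per allowed edit, which is why the register width bounds the query length.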

Rich scoring

5 components · capped at 1000

exact 400 · fuzzy 240 · pos 180 · prox 140 · freq

Min‑heap top‑K

27× faster than insertion sort

Example top‑K scores: 1000 · 842 · 791 · 680 · 624 · 557 · 503
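Top‑K selection with a bounded min‑heap can be sketched like this. The heap never grows past k entries, so N candidates cost O(N log k). The `Hit` shape and the function name are assumptions for illustration.

```typescript
// Illustrative sketch only: the Hit shape and topK name are assumptions.
interface Hit { doc: string; score: number }

function topK(hits: readonly Hit[], k: number): Hit[] {
  if (k <= 0) return [];
  const heap: Hit[] = []; // min-heap on score: heap[0] is the weakest kept hit

  const swap = (i: number, j: number): void => {
    const t = heap[i];
    heap[i] = heap[j];
    heap[j] = t;
  };
  const siftUp = (i: number): void => {
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (heap[parent].score <= heap[i].score) break;
      swap(parent, i);
      i = parent;
    }
  };
  const siftDown = (i: number): void => {
    for (;;) {
      const left = 2 * i + 1;
      const right = left + 1;
      let min = i;
      if (left < heap.length && heap[left].score < heap[min].score) min = left;
      if (right < heap.length && heap[right].score < heap[min].score) min = right;
      if (min === i) break;
      swap(i, min);
      i = min;
    }
  };

  for (const hit of hits) {
    if (heap.length < k) {
      heap.push(hit);
      siftUp(heap.length - 1);
    } else if (hit.score > heap[0].score) {
      heap[0] = hit; // evict the current weakest hit
      siftDown(0);
    }
  }
  return heap.sort((a, b) => b.score - a.score); // best first
}
```

Most candidates lose the single comparison against `heap[0]` and never touch the heap, which is where the advantage over insertion into a sorted list comes from.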

Memory layout: TEXT_POOL 16 MB · CHUNKS 3.2 MB · scratchpad · doc names

FEATURES

Built for the browser edge.

◬

Zero backend

Entire engine lives in a 33 KB WASM binary. No server, no cloud, no network call after load.

wasm

≈

Fuzzy & typo‑tolerant

Bitap (Shift‑Or / Wu‑Manber) up to 3 edits. Finds "clausula" even when typed "claúsula".

bitap

Accent‑insensitive

Latin‑1 + Latin‑A fold. ES / FR / DE / IT / PT / PL / CZ / TR transparently.
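Accent folding can be approximated in plain TypeScript as below. Note the swap: this sketch uses Unicode NFD normalization rather than a Latin‑1 + Latin‑A table, and patches a few letters that do not decompose. `foldAccents` is a hypothetical name.

```typescript
// Illustrative sketch only: uses NFD normalization in place of a fold table.
function foldAccents(s: string): string {
  return s
    .normalize("NFD")                 // split letters from combining marks
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining marks
    .replace(/ł/g, "l")               // letters with no decomposition
    .replace(/Ł/g, "L")
    .replace(/ß/g, "ss")
    .replace(/ı/g, "i")
    .toLowerCase();
}
```

Folding both the indexed text and the query with the same function makes "cláusula" and "clausula" compare equal.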

unicode

⌘

Query DSL

Phrase, OR, fuzzy, mixed. "a b" | c syntax. Up to 4 tokens per simple query.

dsl

∎

11 formats

DOCX · XLSX · PDF · HTML · MD · JSON · CSV · EML · RTF · TXT · XML. Heavy ones stream; the lite ones (CSV, EML, RTF) handle BOM, base64, cp1252.

parsers

+ @albex/ocr

▣

OCR companion

@albex/ocr drops Tesseract.js next to the engine. Six languages auto‑loaded by demand. Scanned PDFs become searchable.

ocr

+ @albex/ocr

⌷

Hybrid PDF mode

Opt‑in: native PDFs get their embedded images OCR’d too. For contracts with scanned signatures, reports with screenshots.

hybrid

⊘

Zero allocator

Static BSS region only. No heap fragmentation, no GC pressure, no OOM mid‑search.

no-alloc

⧉

Adapts to the host

3 tier binaries (mini / std / pro) × SIMD variants. Auto‑picked from deviceMemory and WASM probes.

tiers

◷

Cooperative search

searchCooperative() with frameBudgetMs yields to scheduler.yield() between slices. UI thread keeps a chance to paint.
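The frame‑budget idea can be sketched independently of the engine: do work until the budget is spent, then hand control back to the host. `scanCooperative` and `yieldToHost` are hypothetical helpers; `scheduler.yield()` is used when the host provides it, with a `setTimeout` fallback otherwise.

```typescript
// Illustrative sketch only: helper names are assumptions, not Albex APIs.
async function yieldToHost(): Promise<void> {
  const s = (globalThis as unknown as { scheduler?: { yield?: () => Promise<void> } }).scheduler;
  if (s && typeof s.yield === "function") {
    await s.yield(); // lets the browser paint and handle input
  } else {
    await new Promise<void>((resolve) => setTimeout(resolve, 0));
  }
}

/** Visit every item, yielding whenever a slice exceeds frameBudgetMs. */
async function scanCooperative<T>(
  items: readonly T[],
  visit: (item: T) => void,
  frameBudgetMs: number,
): Promise<number> {
  let yields = 0;
  let sliceStart = Date.now();
  for (const item of items) {
    visit(item);
    if (Date.now() - sliceStart >= frameBudgetMs) {
      await yieldToHost();
      yields++;
      sliceStart = Date.now(); // start a fresh slice
    }
  }
  return yields; // number of times control was handed back
}
```

A budget of a few milliseconds keeps each slice well inside a 16 ms frame at the cost of a slightly longer total search.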

frame-budget

⚒

Worker pool

AlbexPool shards documents across N workers. Map‑reduce search merges global top‑K.
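The reduce step of a sharded search can be sketched as a merge of per‑worker result lists. This assumes each worker returns its own top‑K with globally comparable scores; `ShardHit` and `mergeTopK` are hypothetical names.

```typescript
// Illustrative sketch only: names and shapes are assumptions, not Albex APIs.
interface ShardHit { doc: string; score: number }

/** Merge the top-K lists returned by each worker into one global top-K. */
function mergeTopK(perWorker: readonly ShardHit[][], k: number): ShardHit[] {
  const all: ShardHit[] = [];
  for (const hits of perWorker) {
    for (const h of hits) all.push(h);
  }
  all.sort((a, b) => b.score - a.score); // best first
  return all.slice(0, Math.max(0, k));
}
```

With N workers the merge handles at most N × K hits, so it stays cheap however large each shard is.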

pool

⌬

WebGPU pre‑filter

WGSL compute shader runs Bloom in parallel for large corpora. Experimental; opt‑in via gpu auto‑select.

gpu

⊟

Tiered storage

TieredStore evicts cold docs to OPFS, promotes on demand. Search archives that exceed RAM.

tiered

◇

Snapshot v2

Per‑document content hash persisted. Save the index to OPFS in milliseconds; reload across sessions with dedup intact.
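Why persisting the hash matters can be shown with a small sketch: if the set of content hashes is saved with the snapshot, a reloaded index still recognizes documents it has already seen. FNV‑1a and the `ContentDedup` class are assumptions for illustration; the source does not name the hash the engine uses.

```typescript
// Illustrative sketch only: FNV-1a and ContentDedup are assumptions.
function fnv1a(bytes: Uint8Array): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < bytes.length; i++) {
    h ^= bytes[i];
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

class ContentDedup {
  private hashes: Set<number>;

  constructor(hashes: number[] = []) {
    this.hashes = new Set(hashes);
  }

  /** Returns true when the content is new and should be indexed. */
  admit(bytes: Uint8Array): boolean {
    const h = fnv1a(bytes);
    if (this.hashes.has(h)) return false;
    this.hashes.add(h);
    return true;
  }

  /** Serialize the hash set alongside the rest of a snapshot. */
  save(): string {
    return JSON.stringify(Array.from(this.hashes));
  }

  static load(snapshot: string): ContentDedup {
    return new ContentDedup(JSON.parse(snapshot) as number[]);
  }
}
```

A 32‑bit hash is enough for a sketch; a production index would want a wider hash to keep accidental collisions negligible.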

persistence

SYNTAX

Query language.

Pattern	Meaning	Example match
`word`	Single fuzzy token	The word in context
`a b c`	All tokens near each other	found a then b then c
`"a b"`	Exact phrase	matched a b exactly
`a | b`	Either token	has a or b
`"a b" | c`	Phrase OR token	has a b or just c
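The grammar in the table is small enough to sketch a parser for. This version splits a query into OR‑alternatives, each holding tokens and quoted phrases; the `Term` and `Query` types and `parseQuery` are hypothetical, and limits such as the token cap are not enforced here.

```typescript
// Illustrative sketch only: types and function name are assumptions.
type Term = { kind: "token" | "phrase"; text: string };
type Query = Term[][]; // outer array: OR alternatives; inner array: terms that go together

function parseQuery(input: string): Query {
  const groups: Term[][] = [[]];
  // Alternatives: a quoted phrase, a pipe, or a bare token.
  const re = /"([^"]*)"|\||([^\s"|]+)/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(input)) !== null) {
    const current = groups[groups.length - 1];
    if (m[0] === "|") {
      groups.push([]); // start the next alternative
    } else if (m[1] !== undefined) {
      current.push({ kind: "phrase", text: m[1] });
    } else {
      current.push({ kind: "token", text: m[2] });
    }
  }
  return groups.filter((g) => g.length > 0); // ignore empty alternatives
}
```

For example, `"clausula novena" | rescisión` parses into two alternatives: one phrase and one token.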

QUICK START

Up in under a minute.

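index.ts

import { AlbexEngine } from "albex";

// Assumes the page contains an <input type="file" multiple> element.
const input = document.querySelector<HTMLInputElement>('input[type="file"]')!;

const engine = await AlbexEngine.create();

// Index each selected file locally in the browser.
for (const file of input.files ?? []) {
  await engine.indexFile(file);
}

// Phrase OR token: matches the exact phrase or the single word.
const hits = engine.search('"clausula novena" | rescisión');
for (const h of hits) {
  console.log(h.documentName, h.location, h.score, h.snippet);
}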

FORMATS

What goes in.

.docx

Word document

Streaming XML parser, paragraph + table extraction

rust

.xlsx

Excel workbook

Shared strings + inline strings streaming

rust

.pdf

PDF document

Text stream extraction, lazy loaded

lazy · ~1 MB

.md .markdown

Markdown

CommonMark stripped, paragraphs preserved

.html .htm

HTML

.json

JSON

Recursive walk; keys + leaf strings indexed

.csv

CSV

RFC 4180 lite; one row per chunk

.eml

Email (MIME)

From/To/Subject + first text/plain body part

.rtf

RTF

Control words and groups stripped, text preserved

.txt

Plain text

Direct UTF-8 pass-through

.xml

XML

Tag-stripped, entity-decoded

CAPACITY

Default limits.

16 MB

Text pool

std tier; 4 MB mini · 128 MB pro

100k

Chunk capacity

std tier; 25k mini · 800k pro

128 docs

Document limit

std tier; 32 mini · 1 024 pro

64 chars

Query length

Bitap u64 register width

Ship document search today.

No backend required. No data leaves the browser.

GitHub → · Read the docs

npm: npm install albex
pnpm: pnpm add albex
deno: deno add npm:albex
