Technical Documentation
LavinHash is a high-performance fuzzy hashing library that generates compact fingerprints for file similarity detection. Unlike cryptographic hashes that change completely with any modification, fuzzy hashes allow computing similarity scores between files.
Traditional hashes (MD5, SHA-256) answer "Are these files identical?" LavinHash answers "How similar are these files?", making it ideal for malware detection, plagiarism analysis, and deduplication.
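The contrast is easy to demonstrate with a cryptographic hash from the Node.js standard library: flipping a single byte of a 1 KiB buffer yields a completely unrelated SHA-256 digest, which is why similarity scoring needs a fuzzy hash instead. (A minimal sketch; it does not use LavinHash itself.)

```typescript
import { createHash } from 'crypto';

const sha256 = (data: Uint8Array): string =>
  createHash('sha256').update(data).digest('hex');

const a = new Uint8Array(1024);          // 1 KiB of zeros
const b = Uint8Array.from(a);
b[0] = 1;                                // identical except for one byte

// The two digests share no structural relationship at all
console.log(sha256(a) === sha256(b));    // false
```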
Quick Start Example
```typescript
// Node.js / TypeScript: functional pipeline approach
import { hash, compare } from 'lavinhash';
import { readFile } from 'fs/promises';

// Read each file and compute its fingerprint plus basic metadata
const analyzeFiles = async (paths: string[]) =>
  Promise.all(
    paths.map(async (path) => {
      const data = new Uint8Array(await readFile(path));
      return {
        path,
        fingerprint: hash(data),
        metadata: { size: data.length, timestamp: Date.now() }
      };
    })
  );

// Compare with a functional fold (reduce): keep the highest-scoring candidate
const findMostSimilar = (target: any, candidates: any[]) =>
  candidates.reduce(
    (best, current) => {
      const sim = compare(target.fingerprint, current.fingerprint);
      return sim > best.similarity
        ? { candidate: current, similarity: sim }
        : best;
    },
    { candidate: null as any, similarity: 0 }
  );

// Usage: detect malware variants
const [target, ...database] = await analyzeFiles([
  'unknown_sample.exe',
  'known_trojan_v1.exe',
  'known_trojan_v2.exe'
]);
const match = findMostSimilar(target, database);
console.log(`Match: ${match.candidate.path} (${match.similarity}%)`);
```

DLAH Algorithm: Dual-Layer Adaptive Hashing
The DLAH algorithm analyzes files in two orthogonal dimensions, combining them to produce a robust similarity metric resistant to both structural and content modifications.
Layer 1: Structural Fingerprinting (30% weight)
Captures the file's topology using Shannon entropy analysis. This layer detects structural changes like data reorganization, compression changes, or block-level modifications.
Algorithm Steps:
- Divide input into fixed-size blocks (default: 256 bytes)
- For each block, calculate Shannon entropy: H(X) = -Σ p(x) log₂ p(x)
- Quantize entropy to 4-bit nibbles (0-15 range)
- Concatenate nibbles to form structural vector
- Compare vectors using Levenshtein distance (edit distance)
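The steps above can be sketched in TypeScript (an illustrative reimplementation, not the library's Rust internals; the exact quantization scheme is an assumption, mapping byte entropy 0..8 onto nibbles 0..15):

```typescript
// Shannon entropy of one block: H(X) = -Σ p(x) log₂ p(x), in bits (0..8 for bytes)
const shannonEntropy = (block: Uint8Array): number => {
  const counts = new Array(256).fill(0);
  for (const b of block) counts[b]++;
  let h = 0;
  for (const c of counts) {
    if (c === 0) continue;
    const p = c / block.length;
    h -= p * Math.log2(p);
  }
  return h;
};

// Quantize entropy (0..8) to a 4-bit nibble (0..15)
const toNibble = (entropy: number): number =>
  Math.min(15, Math.floor((entropy / 8) * 16));

// Divide input into fixed-size blocks and concatenate the nibbles
const structuralVector = (data: Uint8Array, blockSize = 256): number[] => {
  const nibbles: number[] = [];
  for (let i = 0; i < data.length; i += blockSize) {
    nibbles.push(toNibble(shannonEntropy(data.subarray(i, i + blockSize))));
  }
  return nibbles;
};
```

A block containing every byte value equally often has maximum entropy (8 bits, nibble 15); an all-zero block has entropy 0 (nibble 0).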
Example: 256-byte block → entropy 4.83 → quantized to nibble 12 (0xC)

Layer 2: Content-Based Hashing (70% weight)
Extracts semantic features using a rolling hash over a sliding window. This layer detects content similarity even when data is moved, inserted, or partially modified.
Algorithm Steps:
- Initialize BuzHash with 64-byte window
- Slide window byte-by-byte, computing rolling hash
- When hash ≡ 0 (mod M), extract feature (adaptive trigger)
- Insert feature into 8192-bit Bloom filter using 3 hash functions
- Compare Bloom filters using Jaccard similarity: |A ∩ B| / |A ∪ B|
Adaptive Modulus Selection:
M = min(file_size / 256, 8192)

This automatically adjusts feature density based on file size.
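The Bloom-filter side of this layer can be sketched as follows (illustrative only: the three derived hash functions and helper names are assumptions, and a simple multiplicative mix stands in for the library's real BuzHash-driven feature hashes):

```typescript
const BLOOM_BITS = 8192; // fixed filter size: 8192 bits = 1024 bytes

// Adaptive modulus: M = min(file_size / 256, 8192), clamped to at least 1
const adaptiveModulus = (fileSize: number): number =>
  Math.max(1, Math.min(Math.floor(fileSize / 256), BLOOM_BITS));

const popcount = (x: number): number => {
  let n = 0;
  while (x) { n += x & 1; x >>= 1; }
  return n;
};

// Insert one extracted feature using 3 cheap derived hash functions
const bloomInsert = (filter: Uint8Array, feature: number): void => {
  for (const h of [feature, feature * 31, feature * 131]) {
    const bit = Math.abs(h) % BLOOM_BITS;
    filter[bit >> 3] |= 1 << (bit & 7);
  }
};

// Jaccard similarity of two equal-size filters: |A ∩ B| / |A ∪ B|
const jaccard = (a: Uint8Array, b: Uint8Array): number => {
  let inter = 0, union = 0;
  for (let i = 0; i < a.length; i++) {
    inter += popcount(a[i] & b[i]);
    union += popcount(a[i] | b[i]);
  }
  return union === 0 ? 1 : inter / union;
};
```

Two filters built from the same feature set compare as 1.0; disjoint feature sets approach 0 (subject to Bloom-filter false positives).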
Combined Similarity Score

Similarity = (α · S_structural + (1 − α) · S_content) × 100

Where:
- α = 0.3 (structural weight, configurable)
- S_structural = normalized Levenshtein similarity of the structural vectors
- S_content = Jaccard similarity of the Bloom filters
- Result ∈ [0, 100] (percentage similarity)
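The weighted blend is a one-liner; with the default α = 0.3, a perfect content match alone contributes at most 70 points:

```typescript
// Blend the two layer scores (each in [0, 1]) into a percentage in [0, 100]
const combinedScore = (
  sStructural: number,
  sContent: number,
  alpha = 0.3
): number => (alpha * sStructural + (1 - alpha) * sContent) * 100;
```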
System Architecture
Core Library (Rust)
```text
src/
├── lib.rs             # Public API and FFI exports
├── algo/
│   ├── entropy.rs     # Shannon entropy (SIMD optimized)
│   ├── buzhash.rs     # BuzHash rolling hash
│   └── bloom.rs       # Fixed 8192-bit Bloom filter
├── model/
│   └── fingerprint.rs # Fingerprint struct (repr(C))
└── utils/
    └── mem.rs         # Zero-copy FFI helpers
```

Binary Wire Format
| Offset | Field | Type | Size |
|---|---|---|---|
| 0x00 | Magic | u8 | 1 byte |
| 0x01 | Version | u8 | 1 byte |
| 0x02-0x03 | Struct Length | u16 (LE) | 2 bytes |
| 0x04-0x403 | Content Bloom | u8[1024] | 1024 bytes |
| 0x404+ | Structural Data | u8[n] | Variable |
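A hypothetical decoder for this layout (field offsets taken from the table above; the `parseFingerprint` name and `Fingerprint` shape are illustrative, and the expected magic value is not specified in this document, so it is not validated here):

```typescript
interface Fingerprint {
  version: number;          // u8 at offset 0x01
  structLength: number;     // little-endian u16 at offset 0x02
  contentBloom: Uint8Array; // 1024 bytes (8192-bit Bloom filter) at 0x04
  structuralData: Uint8Array; // variable-length tail starting at 0x404
}

const parseFingerprint = (buf: Uint8Array): Fingerprint => {
  if (buf.length < 0x404) throw new Error('buffer too short');
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);
  return {
    version: buf[0x01],
    structLength: view.getUint16(0x02, /* littleEndian */ true),
    contentBloom: buf.subarray(0x04, 0x404),
    structuralData: buf.subarray(0x404)
  };
};
```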
Performance Characteristics
| Metric | Characteristic |
|---|---|
| Time complexity | O(n), linear in file size |
| Space complexity | O(1), constant memory |
| Fingerprint size | Fixed, independent of file size |
| Throughput | Measured single-threaded (Intel i7) |
Optimization Techniques
- SIMD Entropy Calculation: AVX2 intrinsics for parallel processing of 8 blocks
- Rayon Parallelization: Automatic multi-threading for files > 1MB
- Cache-Friendly Design: Bloom filter fits in L1/L2 cache (1KB)
- Zero-Copy FFI: No memory duplication across language boundaries
- Lazy Evaluation: Iterator-based processing, minimal allocations
Production Use Cases
Practical Example: Malware Variant Detection
```typescript
// Functional malware classification with a discriminated-union result type
import { readFile } from 'fs/promises';
import { hash, compare } from 'lavinhash';

type MalwareFamily = { name: string; fingerprint: any; severity: string };

type ClassificationResult =
  | { type: 'match'; family: string; similarity: number; severity: string }
  | { type: 'unknown'; candidates: Array<{ family: string; similarity: number }> };

// Pure function: partially apply the database, then classify samples
const classifyMalware = (database: MalwareFamily[]) =>
  (unknownSample: Uint8Array): ClassificationResult => {
    const unknownFP = hash(unknownSample);
    const matches = database
      .map(({ name, fingerprint, severity }) => ({
        family: name,
        similarity: compare(unknownFP, fingerprint),
        severity
      }))
      .sort((a, b) => b.similarity - a.similarity);
    const best = matches[0];
    return best && best.similarity >= 70
      ? { type: 'match', ...best }
      : { type: 'unknown', candidates: matches.slice(0, 3) };
  };

// Database as an immutable data structure (fp1..fp3: precomputed fingerprints)
const malwareDB: MalwareFamily[] = [
  { name: 'Trojan.Emotet', fingerprint: fp1, severity: 'critical' },
  { name: 'Ransomware.WannaCry', fingerprint: fp2, severity: 'critical' },
  { name: 'Backdoor.Cobalt', fingerprint: fp3, severity: 'high' }
];

// Pipeline with async/await
const analyzeSample = async (path: string) => {
  const data = new Uint8Array(await readFile(path));
  return classifyMalware(malwareDB)(data);
};

// Usage: narrow on the result's discriminant
const result = await analyzeSample('suspicious.exe');
const message = result.type === 'match'
  ? `⚠️ ${result.family} detected (${result.similarity}%, ${result.severity})`
  : `Unknown sample (top: ${result.candidates[0].family} at ${result.candidates[0].similarity}%)`;
console.log(message);
```

Malware Detection & Classification
Identify variants of known malware families by comparing samples. Polymorphic malware often retains 70-90% similarity despite obfuscation.
File Deduplication
Find near-duplicate files in large datasets. Unlike exact-match dedup, catches slightly modified versions (renamed variables, reformatted code).
Plagiarism Detection
Detect copied code or documents with cosmetic changes. Resistant to identifier renaming, whitespace changes, and minor refactoring.
Version Control & Change Tracking
Determine if files are related versions. Delta detection feature shows exact changes between similar files.
Implementation Details
Rust Core Library Example
```rust
use lavinhash_core::{hash, compare, Config};
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read files
    let data1 = fs::read("file1.bin")?;
    let data2 = fs::read("file2.bin")?;

    // Generate fingerprints with the default configuration
    let fp1 = hash(&data1, None)?;
    let fp2 = hash(&data2, None)?;

    // Compare: returns a percentage in [0, 100]
    let similarity = compare(&fp1, &fp2, None);
    println!("Similarity: {}%", similarity);

    // Custom configuration
    let config = Config::new()
        .with_alpha(0.4)        // Adjust structural weight
        .with_window_size(128); // Rolling-hash window in bytes
    let _fp3 = hash(&data1, Some(&config))?;

    Ok(())
}
```

Memory Safety Guarantees
- No unsafe code except in FFI boundary (fully audited)
- All allocations tracked, no memory leaks
- Validated with Miri (Rust interpreter for undefined behavior detection)
- ASAN/MSAN clean on all platforms
Cross-Platform Determinism
Identical files produce identical fingerprints across all platforms:
- Linux x86_64, ARM64
- Windows x86_64
- macOS x86_64, ARM64 (M1/M2)
- WebAssembly (wasm32)
Determinism is achieved through explicit endianness handling and deterministic hash seeding.