The information is already there
Advanced lossless text compression using a neural language model and arithmetic coding. Runs on GPU or CPU via llama.cpp — no configuration needed.
Prediction-based arithmetic coding driven by a neural language model
Uses SmolLM2-135M with a ~49K token vocabulary to assign probability distributions over the next token. The better the prediction, the fewer bits are needed.
Inference runs via llama.cpp, which is ~7× faster than PyTorch. Automatically uses GPU if available, falls back to CPU — no extra configuration required.
Mathematically optimal encoding: each token is encoded in proportion to its predicted probability. A token predicted at 99% costs ~0.014 bits; only truly surprising tokens are expensive.
Input is split into chunks and distributed across multiple workers running concurrently. Each worker operates an independent LLM + arithmetic coding pipeline for maximum throughput.
Binary files are segmented into text-like and binary chunks. Text chunks use the neural pipeline; binary blobs use lzma or gzip. The result is always at least as good as compressing the whole file with lzma.
Both sides run the exact same model with the exact same weights, producing identical probability distributions. Decompressed output matches the original byte-for-byte, always.
In our experiments, Nacrith produces the strongest compression among all evaluated systems.
Bits per byte — lower is better
| System | Model / Notes | alice29.txt (bpb) | enwik8 (bpb) |
|---|---|---|---|
| gzip -9 | — | 2.851 | 2.916 |
| CMIX v21 | LSTM + 2,000+ models | 1.635 | 1.170 |
| NNCP v3 | Transformer-XL (online) | 3.960 | ~1.190 |
| PAQ8px -8L | Context mixing | 1.728 | ~1.270 |
| ts_zip | RWKV-169M | ~1.142 | ~1.110 |
| FineZip | LLaMA-3-8B (fine-tuned) | — | 1.024 |
| Nacrith | SmolLM2-135M + llama.cpp | 0.918 | 0.9389 |
In our experiments on enwik8 (95.4 MB of Wikipedia text), Nacrith achieves 0.9389 bpb — outperforming ts_zip (~1.11 bpb) by 15%, FineZip (1.024 bpb) by 8% with a 60× smaller model and no fine-tuning, and CMIX v21 (1.17 bpb) by 20%.
[Charts: compression ratio (Compressed / Original, lower is better) and space saved (%, higher is better) on alice29.txt (148.5 KB, Canterbury Corpus) and enwik8 (95.4 MB, Wikipedia).]
~8–12% compression ratio on English text — roughly 3× better than gzip and 2.5× better than bzip2 in our experiments.
On alice29.txt, Nacrith achieves 0.918 bpb — 44% better than CMIX v21 and 20% better than ts_zip among evaluated systems.
Space savings of 88–92% consistently across small, medium, and large files.
On enwik8 (95.4 MB), Nacrith achieves 0.9389 bpb — the strongest result among evaluated systems.
Uses ~1.2 GB VRAM for the first worker, plus ~660 MB per additional worker. Parallel workers substantially improve throughput.
All results are fully lossless — decompressed output matches the original byte-for-byte.
Measured on a 100 KB English text sample. Shannon limits represent theoretical lower bounds for compressors of that order.
| Method | Compressed Size | Bits / Byte |
|---|---|---|
| Shannon 0th-order limit | 60.3 KB | 4.8025 |
| Shannon 1st-order limit | 44.2 KB | 3.5213 |
| gzip -9 | 39.0 KB | 3.1082 |
| xz -9 | 35.5 KB | 2.8257 |
| Shannon 2nd-order limit | 34.4 KB | 2.7373 |
| Nacrith | 9.6 KB | 0.7635 |
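The 0th-order limit in the table is simply the entropy of the file's byte histogram. A minimal sketch of how such a figure is computed (standard-library Python, not Nacrith's code):

```python
import math
from collections import Counter

def order0_entropy_bits_per_byte(data: bytes) -> float:
    """0th-order Shannon limit: entropy of the byte frequency
    distribution, ignoring all context between bytes."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

sample = b"the quick brown fox jumps over the lazy dog " * 100
print(f"{order0_entropy_bits_per_byte(sample):.4f} bits/byte")
```

Higher-order limits are computed the same way, but on the conditional distribution of each byte given its preceding 1 or 2 bytes.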
Nacrith achieves 0.76 bits/byte — 84% below the 0th-order Shannon limit, 78% below the 1st-order limit, and 72% below the 2nd-order limit. This is no contradiction: those limits only bound compressors that condition on 0, 1, or 2 preceding bytes, while the language model conditions on thousands of tokens of context.
Prediction-based arithmetic coding using a neural language model
Nacrith exploits the deep connection between prediction and compression (Shannon, 1948): a good predictor of text can be turned into a good compressor.
A transformer neural network with a ~49K token vocabulary. Given a sequence of tokens, it outputs a probability distribution over the entire vocabulary for what comes next. It captures grammar, common phrases, semantic relationships, and world knowledge — far beyond simple byte-pattern matching. Inference runs via llama.cpp, which is ~7× faster than PyTorch, and automatically targets GPU or CPU.
A mathematically optimal encoding scheme that maps a sequence of symbols to a single number in [0, 1). For each symbol, it narrows the interval proportionally to that symbol's probability. High-probability symbols barely shrink the interval (costing almost zero bits), while unlikely symbols shrink it a lot.
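The interval narrowing can be sketched in a few lines. This toy version uses unbounded-precision floats and a fixed two-symbol model (a real coder, including Nacrith's, renormalizes and emits bits incrementally):

```python
def encode_interval(symbols, model):
    """Narrow [low, high) once per symbol. `model` maps a symbol to its
    (cumulative_low, cumulative_high) slice of [0, 1)."""
    low, high = 0.0, 1.0
    for s in symbols:
        c_lo, c_hi = model[s]
        span = high - low
        high = low + span * c_hi
        low = low + span * c_lo
    return low, high

# Static model: P(a) = 0.9, P(b) = 0.1
model = {"a": (0.0, 0.9), "b": (0.9, 1.0)}
low, high = encode_interval("aab", model)
# Final width = product of symbol probabilities = 0.9 * 0.9 * 0.1 = 0.081
```

Any number inside the final interval identifies the sequence; its width is the product of the symbol probabilities, so the bit cost is the sum of each symbol's -log2(p).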
The LLM provides the probabilities; the arithmetic coder turns them into bits. A token predicted at 99% confidence costs ~0.014 bits. A token at 50% costs 1 bit. Only truly surprising tokens are expensive.
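The per-token figures above follow directly from the ideal code length -log2(p):

```python
import math

def token_cost_bits(p: float) -> float:
    """Ideal arithmetic-coding cost of a token predicted with probability p."""
    return -math.log2(p)

for p in (0.99, 0.5, 0.01):
    print(f"P = {p}: {token_cost_bits(p):.4f} bits")
```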
Key: Both sides run the exact same model with the exact same weights, producing identical probability distributions. This symmetry guarantees perfect lossless reconstruction.
Pattern matching on raw bytes within a sliding window. Only exploits local, literal repetitions.
Captures semantic and syntactic structure. It knows that after "The President of the United", the word "States" is extremely likely — even if that phrase never appeared recently. This deep understanding of language produces far better predictions, which directly translates to fewer bits.
Nacrith also supports compressing binary files such as PDFs, executables, and other non-UTF-8 data using a hybrid chunked approach. Binary mode is activated automatically when the input file is not valid UTF-8.
Every byte is classified as text-like (printable ASCII, tab/LF/CR) or binary. Contiguous runs are grouped, with short text runs (< 64 bytes) demoted to binary and small binary gaps bridged to keep text chunks contiguous.
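The classification step can be sketched as follows. This is an illustrative reimplementation of the description above, not Nacrith's actual code; the 64-byte threshold comes from the text, and the merge behavior stands in for gap bridging:

```python
# Printable ASCII plus tab/LF/CR count as text-like.
TEXT_BYTES = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}

def segment(data: bytes, min_text_run: int = 64):
    """Split data into ("text"|"binary", start, end) runs. Text runs
    shorter than min_text_run are demoted to binary; adjacent runs of
    the same type are merged, which bridges small gaps."""
    runs, i = [], 0
    while i < len(data):
        is_text = data[i] in TEXT_BYTES
        j = i
        while j < len(data) and (data[j] in TEXT_BYTES) == is_text:
            j += 1
        kind = "text" if is_text and (j - i) >= min_text_run else "binary"
        if runs and runs[-1][0] == kind:
            runs[-1] = (kind, runs[-1][1], j)   # merge same-type neighbors
        else:
            runs.append((kind, i, j))
        i = j
    return runs
```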
All binary chunks are merged into a single blob and compressed with lzma (≥ 4 KB) or gzip (smaller blobs). If neither reduces size, raw bytes are stored as-is.
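The codec selection with its raw-bytes fallback is straightforward; a sketch using Python's stdlib `lzma` and `gzip` (the 4 KB threshold is from the text, the function name is illustrative):

```python
import gzip
import lzma

def compress_blob(blob: bytes, lzma_threshold: int = 4096) -> tuple[str, bytes]:
    """lzma for blobs >= 4 KB, gzip for smaller ones; fall back to
    storing raw bytes when compression does not reduce size."""
    if len(blob) >= lzma_threshold:
        name, packed = "lzma", lzma.compress(blob)
    else:
        name, packed = "gzip", gzip.compress(blob)
    if len(packed) >= len(blob):
        return "raw", blob
    return name, packed
```

Storing which codec was used alongside the payload lets the decompressor undo the right transform per blob.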
Each text chunk is split across workers and compressed using the full LLM + arithmetic coding pipeline. Workers operate concurrently on their sub-chunks for maximum throughput.
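The fan-out pattern looks roughly like this. The per-chunk compressor here is a zlib stand-in so the sketch stays runnable — the real workers each run an independent LLM + arithmetic coding pipeline:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_chunk(chunk: bytes) -> bytes:
    # Stand-in for a worker's LLM + arithmetic coding pipeline.
    return zlib.compress(chunk)

def parallel_compress(data: bytes, workers: int = 4) -> list[bytes]:
    """Split data into roughly equal sub-chunks and compress them
    concurrently; the ordered result list allows exact reassembly."""
    size = max(1, -(-len(data) // workers))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compress_chunk, chunks))
```

Each sub-chunk is independent, so decompression can also proceed per chunk and the pieces are concatenated in order.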
Note: Binary files are rarely pure binary — they often contain significant amounts of embedded text (strings, metadata, markup, code). Nacrith exploits this by segmenting the input into text and binary chunks, then compressing each with an appropriate method for its type.
Runs on any CUDA-capable GPU with at least 2 GB of VRAM (~1.2 GB for the first worker, ~660 MB per additional worker). Falls back to CPU automatically via llama.cpp when no GPU is available. Benchmarks were run on a low-end NVIDIA GTX 1050 Ti — a modern GPU will be significantly faster.