The information is already there
Advanced lossless text compression using a neural language model and arithmetic coding. Runs on GPU or CPU via llama.cpp — no configuration needed.
Prediction-based arithmetic coding driven by a neural language model
Uses SmolLM2-135M with a ~49K token vocabulary to assign probability distributions over the next token. The better the prediction, the fewer bits are needed.
Inference runs via llama.cpp, which is ~7× faster than PyTorch. Automatically uses GPU if available, falls back to CPU — no extra configuration required.
Mathematically optimal encoding: each token is encoded in proportion to its predicted probability. A token predicted at 99% costs ~0.014 bits; only truly surprising tokens are expensive.
Input is split into chunks and distributed across multiple workers running concurrently. Each worker operates an independent LLM + arithmetic coding pipeline for maximum throughput.
Binary files are segmented into text-like and binary chunks. Text chunks use the neural pipeline; binary blobs use lzma or gzip. The result is always at least as good as compressing the whole file with lzma.
Both sides run the exact same model with the exact same weights, producing identical probability distributions. Decompressed output matches the original byte-for-byte, always.
In our experiments, Nacrith produces the strongest compression among all evaluated systems.
Bits per byte — lower is better
| System | Model / Notes | alice29.txt (bpb) | enwik8 (bpb) |
|---|---|---|---|
| gzip -9 | — | 2.851 | 2.916 |
| CMIX v21 | LSTM + 2,000+ models | 1.635 | 1.170 |
| NNCP v3 | Transformer-XL (online) | 3.960 | ~1.190 |
| PAQ8px -8L | Context mixing | 1.728 | ~1.270 |
| ts_zip | RWKV-169M | ~1.142 | ~1.110 |
| FineZip | LLaMA-3-8B (fine-tuned) | — | 1.024 |
| Nacrith | SmolLM2-135M + llama.cpp | 0.918 | 0.9389 |
In our experiments on enwik8 (95.4 MB of Wikipedia text), Nacrith achieves 0.9389 bpb — outperforming ts_zip (~1.11 bpb) by 15%, FineZip (1.024 bpb) by 8% with a 60× smaller model and no fine-tuning, and CMIX v21 (1.17 bpb) by 20%.
[Charts: compression ratio (Compressed / Original, lower is better) and space saved (%, higher is better) on alice29.txt (148.5 KB, Canterbury Corpus) and enwik8 (95.4 MB, Wikipedia).]
~8–12% compression ratio on English text — roughly 3× better than gzip and 2.5× better than bzip2 in our experiments.
On alice29.txt, Nacrith achieves 0.918 bpb — 44% better than CMIX v21 and 20% better than ts_zip among evaluated systems.
Space savings of 88–92% consistently across small, medium, and large files.
On enwik8 (95.4 MB), Nacrith achieves 0.9389 bpb — the strongest result among evaluated systems.
Uses ~1.2 GB VRAM for the first worker, plus ~660 MB per additional worker. Parallel workers substantially improve throughput.
All results are fully lossless — decompressed output matches the original byte-for-byte.
Measured on a 100 KB English text sample. Shannon limits represent theoretical lower bounds for compressors of that order.
| Method | Compressed Size | Bits / Byte |
|---|---|---|
| Shannon 0th-order limit | 60.3 KB | 4.8025 |
| Shannon 1st-order limit | 44.2 KB | 3.5213 |
| gzip -9 | 39.0 KB | 3.1082 |
| xz -9 | 35.5 KB | 2.8257 |
| Shannon 2nd-order limit | 34.4 KB | 2.7373 |
| Nacrith | 9.6 KB | 0.7635 |
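The 0th-order limit in the table is simply the entropy of the file's byte histogram. A minimal sketch of how such a figure is computed (standard-library Python, not Nacrith's code):

```python
import math
from collections import Counter

def order0_entropy_bits_per_byte(data: bytes) -> float:
    """0th-order Shannon limit: entropy of the byte frequency
    distribution, ignoring all context between bytes."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

sample = b"the quick brown fox jumps over the lazy dog " * 100
print(f"{order0_entropy_bits_per_byte(sample):.4f} bits/byte")
```

Higher-order limits are computed the same way, but on the conditional distribution of each byte given its preceding 1 or 2 bytes.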
Nacrith achieves 0.76 bits/byte — 84% below the 0th-order Shannon limit, 78% below the 1st-order limit, and 72% below the 2nd-order limit. This is no contradiction: those limits only bound compressors that condition on 0, 1, or 2 preceding bytes, while the language model conditions on thousands of tokens of context.
Prediction-based arithmetic coding using a neural language model
Nacrith exploits the deep connection between prediction and compression (Shannon, 1948): a good predictor of text can be turned into a good compressor.
A transformer neural network with a ~49K token vocabulary. Given a sequence of tokens, it outputs a probability distribution over the entire vocabulary for what comes next. It captures grammar, common phrases, semantic relationships, and world knowledge — far beyond simple byte-pattern matching. Inference runs via llama.cpp, which is ~7× faster than PyTorch, and automatically targets GPU or CPU.
A mathematically optimal encoding scheme that maps a sequence of symbols to a single number in [0, 1). For each symbol, it narrows the interval proportionally to that symbol's probability. High-probability symbols barely shrink the interval (costing almost zero bits), while unlikely symbols shrink it a lot.
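The interval narrowing can be sketched in a few lines. This toy version uses unbounded-precision floats and a fixed two-symbol model (a real coder, including Nacrith's, renormalizes and emits bits incrementally):

```python
def encode_interval(symbols, model):
    """Narrow [low, high) once per symbol. `model` maps a symbol to its
    (cumulative_low, cumulative_high) slice of [0, 1)."""
    low, high = 0.0, 1.0
    for s in symbols:
        c_lo, c_hi = model[s]
        span = high - low
        high = low + span * c_hi
        low = low + span * c_lo
    return low, high

# Static model: P(a) = 0.9, P(b) = 0.1
model = {"a": (0.0, 0.9), "b": (0.9, 1.0)}
low, high = encode_interval("aab", model)
# Final width = product of symbol probabilities = 0.9 * 0.9 * 0.1 = 0.081
```

Any number inside the final interval identifies the sequence; its width is the product of the symbol probabilities, so the bit cost is the sum of each symbol's -log2(p).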
The LLM provides the probabilities; the arithmetic coder turns them into bits. A token predicted at 99% confidence costs ~0.014 bits. A token at 50% costs 1 bit. Only truly surprising tokens are expensive.
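The per-token figures above follow directly from the ideal code length -log2(p):

```python
import math

def token_cost_bits(p: float) -> float:
    """Ideal arithmetic-coding cost of a token predicted with probability p."""
    return -math.log2(p)

for p in (0.99, 0.5, 0.01):
    print(f"P = {p}: {token_cost_bits(p):.4f} bits")
```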
Key: Both sides run the exact same model with the exact same weights, producing identical probability distributions. This symmetry guarantees perfect lossless reconstruction.
Pattern matching on raw bytes within a sliding window. Only exploits local, literal repetitions.
Captures semantic and syntactic structure. It knows that after "The President of the United", the word "States" is extremely likely — even if that phrase never appeared recently. This deep understanding of language produces far better predictions, which directly translates to fewer bits.
Nacrith also supports compressing binary files such as PDFs, executables, and other non-UTF-8 data using a hybrid chunked approach. Binary mode is activated automatically when the input file is not valid UTF-8.
Every byte is classified as text-like (printable ASCII, tab/LF/CR) or binary. Contiguous runs are grouped, with short text runs (< 64 bytes) demoted to binary and small binary gaps bridged to keep text chunks contiguous.
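The classification step can be sketched as follows. This is an illustrative reimplementation of the description above, not Nacrith's actual code; the 64-byte threshold comes from the text, and the merge behavior stands in for gap bridging:

```python
# Printable ASCII plus tab/LF/CR count as text-like.
TEXT_BYTES = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}

def segment(data: bytes, min_text_run: int = 64):
    """Split data into ("text"|"binary", start, end) runs. Text runs
    shorter than min_text_run are demoted to binary; adjacent runs of
    the same type are merged, which bridges small gaps."""
    runs, i = [], 0
    while i < len(data):
        is_text = data[i] in TEXT_BYTES
        j = i
        while j < len(data) and (data[j] in TEXT_BYTES) == is_text:
            j += 1
        kind = "text" if is_text and (j - i) >= min_text_run else "binary"
        if runs and runs[-1][0] == kind:
            runs[-1] = (kind, runs[-1][1], j)   # merge same-type neighbors
        else:
            runs.append((kind, i, j))
        i = j
    return runs
```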
All binary chunks are merged into a single blob and compressed with lzma (≥ 4 KB) or gzip (smaller blobs). If neither reduces size, raw bytes are stored as-is.
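The codec selection with its raw-bytes fallback is straightforward; a sketch using Python's stdlib `lzma` and `gzip` (the 4 KB threshold is from the text, the function name is illustrative):

```python
import gzip
import lzma

def compress_blob(blob: bytes, lzma_threshold: int = 4096) -> tuple[str, bytes]:
    """lzma for blobs >= 4 KB, gzip for smaller ones; fall back to
    storing raw bytes when compression does not reduce size."""
    if len(blob) >= lzma_threshold:
        name, packed = "lzma", lzma.compress(blob)
    else:
        name, packed = "gzip", gzip.compress(blob)
    if len(packed) >= len(blob):
        return "raw", blob
    return name, packed
```

Storing which codec was used alongside the payload lets the decompressor undo the right transform per blob.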
Each text chunk is split across workers and compressed using the full LLM + arithmetic coding pipeline. Workers operate concurrently on their sub-chunks for maximum throughput.
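The fan-out pattern looks roughly like this. The per-chunk compressor here is a zlib stand-in so the sketch stays runnable — the real workers each run an independent LLM + arithmetic coding pipeline:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_chunk(chunk: bytes) -> bytes:
    # Stand-in for a worker's LLM + arithmetic coding pipeline.
    return zlib.compress(chunk)

def parallel_compress(data: bytes, workers: int = 4) -> list[bytes]:
    """Split data into roughly equal sub-chunks and compress them
    concurrently; the ordered result list allows exact reassembly."""
    size = max(1, -(-len(data) // workers))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compress_chunk, chunks))
```

Each sub-chunk is independent, so decompression can also proceed per chunk and the pieces are concatenated in order.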
Note: Binary files are rarely pure binary — they often contain significant amounts of embedded text (strings, metadata, markup, code). Nacrith exploits this by segmenting the input into text and binary chunks, then compressing each with an appropriate method for its type.
Runs on any CUDA-capable GPU with at least 2 GB of VRAM (~1.2 GB for the first worker, ~660 MB per additional worker). Falls back to CPU automatically via llama.cpp when no GPU is available. Benchmarks were run on a low-end NVIDIA GTX 1050 Ti — a modern GPU will be significantly faster.