SplitZip: Ultra-Fast Lossless KV Compression for Disaggregated LLM Serving

News

2026-06-05: This repository is released.

Abstract

SplitZip is a GPU-friendly lossless compressor for KV cache transfer in prefill-decode disaggregated LLM serving. It preserves BF16 KV tensors bitwise while reducing transfer volume and keeping both compression and decompression on the latency-critical GPU path.

The key observation is that BF16 KV activations have highly redundant exponent values. SplitZip encodes the most frequent exponent values with fixed 4-bit codes, keeps sign and mantissa bits exact, and routes rare exponent values through a sparse escape stream. The released artifact includes the public Triton codec and reproduction scripts for the BF16 exponent analysis and codec throughput measurements used in the paper.

Highlights

Lossless BF16 KV cache compression with bitwise round-trip recovery.
Offline-calibrated, Top-16 exponent codebooks with chunk-local sparse escape metadata.
GPU encode and decode kernels implemented in Triton.
Reproduction scripts for WikiText-2/Qwen3-32B exponent statistics and codec-path throughput.
Paper-reported codec-only throughput on real BF16 KV activations: 613.3 GB/s encode and 2181.8 GB/s decode.
Exercised in an out-of-tree SGLang disaggregated-serving path using Mooncake for KV-cache transfer.

Installation

SplitZip requires a CUDA-capable GPU and a PyTorch/Triton stack compatible with that GPU. The public scripts were testing with Python 3.12 and the dependencies outlined in requirements.txt.

conda create -n splitzip python=3.12 -y
conda activate splitzip
pip install -r requirements.txt

If you already maintain a PyTorch CUDA environment, install the packages from requirements.txt there instead of creating a new environment.

Codec API

import torch

from codec_gpu import ChunkLocalSplitZipGPU

x = torch.randn(1024, 4096, dtype=torch.bfloat16, device="cuda")

codec = ChunkLocalSplitZipGPU(device="cuda", chunk_size=1024)
coverage = codec.calibrate(x)

encoded = codec.encode(x)
decoded = codec.decode(encoded)

assert torch.equal(x.view(torch.int16), decoded.view(torch.int16))
print(coverage, encoded.compressed_bytes)

The current public codec API is stateful. The encoded objects are intended to be decoded with the same calibrated codec state. For paper-style experiments, calibrate the codebook on a separate calibration set rather than on the benchmark tensor itself.

Quick Start

Run a small codec smoke test with synthetic BF16 data:

python bench_codec_throughput.py \
  --device cuda:0 \
  --synthetic \
  --calibrate-on-input \
  --rows 1024 \
  --hidden-dim 4096 \
  --warmup 3 \
  --iters 10 \
  --repeats 2 \
  --output splitzip_smoke.json

The benchmark verifies bitwise lossless recovery before reporting compression ratio and encode/decode throughput.

Release Scope

This initial release includes:

Standalone Triton encode/decode kernels for BF16 CUDA tensors.
A stateful calibrated codec API for bitwise BF16 round-trip recovery.
BF16 exponent-statistics analysis for Qwen3-32B/WikiText-2 style runs.
Codec-path throughput benchmarking.

Reproducing Paper Artifact Scripts

The paper protocol uses WikiText-2 with Qwen3-32B:

codebook calibration: wikitext/wikitext-2-raw-v1:train;
entropy/statistics evaluation: wikitext/wikitext-2-raw-v1:test;
codec throughput tensor source: wikitext/wikitext-2-raw-v1:test.

Analyze BF16 KV exponent entropy:

python analyze_bf16_exponent_entropy_qwen32.py \
  --model Qwen/Qwen3-32B \
  --device-map auto \
  --max-prompts 4 \
  --output qwen3_32b_bf16_exponent_entropy.json

Run the public codec throughput benchmark:

python bench_codec_throughput.py \
  --device cuda:0 \
  --rows 65536 \
  --hidden-dim 4096 \
  --chunk-size 1024 \
  --warmup 10 \
  --iters 50 \
  --repeats 5 \
  --output splitzip_codec_throughput.json

For offline reproduction from saved BF16 tensors:

python bench_codec_throughput.py \
  --device cuda:0 \
  --input kv_tensor.pt \
  --calibration-input qwen32_wikitext2_train_kv.pt \
  --chunk-size 1024 \
  --output splitzip_codec_throughput.json

Use --synthetic and --calibrate-on-input only for kernel sanity checks; they are not the paper protocol.

Integration Status

SplitZip has been exercised in an out-of-tree SGLang disaggregated-serving path using Mooncake for KV-cache transfer. This repository currently releases the standalone Triton codec, BF16 exponent-analysis script, and codec-throughput benchmark.

The SGLang/Mooncake integration code is not included in this initial public release. We plan to document, publish, or upstream that integration path separately.

Repository Contents

codec_gpu.py: public ChunkLocalSplitZipGPU Triton codec.
bench_codec_throughput.py: encode/decode throughput benchmark for the public codec API.
analyze_bf16_exponent_entropy_qwen32.py: BF16 KV exponent entropy and Top-16 coverage analysis.
requirements.txt: minimal Python runtime dependencies.

Roadmap

Publish, or upstream the out-of-tree SGLang/Mooncake KV-transfer integration.
Add integration-facing codebook/payload metadata for portable transfer boundaries.
Add CI after the initial artifact release.

Citation

If you find SplitZip useful in your research, please cite:

@misc{guo2026splitzipultrafastlossless,
      title={SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving}, 
      author={Yipin Guo and Siddharth Joshi},
      year={2026},
      eprint={2605.01708},
      archivePrefix={arXiv},
      primaryClass={cs.DC},
      url={https://arxiv.org/abs/2605.01708}, 
}

License

This project is released under the MIT License. See LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
figs		figs
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
analyze_bf16_exponent_entropy_qwen32.py		analyze_bf16_exponent_entropy_qwen32.py
bench_codec_throughput.py		bench_codec_throughput.py
codec_gpu.py		codec_gpu.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SplitZip: Ultra-Fast Lossless KV Compression for Disaggregated LLM Serving

News

Abstract

Highlights

Installation

Codec API

Quick Start

Release Scope

Reproducing Paper Artifact Scripts

Integration Status

Repository Contents

Roadmap

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SplitZip: Ultra-Fast Lossless KV Compression for Disaggregated LLM Serving

News

Abstract

Highlights

Installation

Codec API

Quick Start

Release Scope

Reproducing Paper Artifact Scripts

Integration Status

Repository Contents

Roadmap

Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages