Overview

The problem

A redistricting sampler, such as GerryChain’s ReCom, ForestReCom, or a Sequential Monte Carlo routine, explores the space of legal districting plans by emitting a long sequence of plans. Serious analyses want many plans: tens of thousands to millions.

The natural way to store them is JSONL (JSON Lines), one plan per line:

{"assignment": [1, 1, 2, 2, 3, 3, ...], "sample": 1}
{"assignment": [1, 1, 2, 2, 3, 1, ...], "sample": 2}
...

This is simple and portable, but it does not scale. A 50,000-plan ensemble on Colorado’s ~140,000 census blocks is 13.5 GB of JSONL. Most of that is redundancy: each assignment is mostly long runs of the same district id, and consecutive plans differ only slightly.
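To make the JSONL layout concrete, here is a minimal sketch that writes and reads records of this shape using only Python's standard library. The helper names `write_jsonl` and `read_jsonl` are hypothetical and are not part of BEN's tooling; the two short assignments are made-up illustrative data.

```python
import io
import json


def write_jsonl(plans, fh):
    """Write one JSON object per line; samples are numbered from 1."""
    for sample, assignment in enumerate(plans, start=1):
        record = {"assignment": assignment, "sample": sample}
        fh.write(json.dumps(record) + "\n")


def read_jsonl(fh):
    """Yield (sample, assignment) pairs from a JSONL stream."""
    for line in fh:
        line = line.strip()
        if line:  # skip blank lines
            record = json.loads(line)
            yield record["sample"], record["assignment"]


# Two tiny illustrative plans over six units (not real data).
plans = [[1, 1, 2, 2, 3, 3], [1, 1, 2, 2, 3, 1]]

buffer = io.StringIO()
write_jsonl(plans, buffer)
jsonl_text = buffer.getvalue()

round_trip = list(read_jsonl(io.StringIO(jsonl_text)))
```

Every unit in every plan is spelled out as text, which is where the redundancy described above comes from.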

What BEN does

BEN (Binary-Ensemble) is a binary format that wrings out that redundancy. The core compression is deliberately simple and works in two stages:

  1. Run-length encoding (RLE): [1, 1, 1, 2, 2, 2, 2, 3] becomes [(1, 3), (2, 4), (3, 1)]. Districting plans are mostly long runs, so this is a big win, especially when nearby geographic units sit next to each other in the node ordering.

  2. Bit-packing — each run’s value and length are stored in the minimum number of bits, not padded out to whole bytes.
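The two stages above can be sketched in a few lines of Python. This is an illustration of the general technique only, under stated assumptions: it uses one fixed bit width for all values and one for all lengths, sized to the largest of each, and it does not reproduce BEN's actual on-disk layout or headers. The function names are hypothetical.

```python
def rle_encode(assignment):
    """Collapse consecutive equal ids into (value, run_length) pairs."""
    runs = []
    for value in assignment:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(v, n) for v, n in runs]


def pack_runs(runs):
    """Bit-pack a non-empty list of runs with fixed minimal widths (assumed scheme)."""
    value_bits = max(max(v for v, _ in runs).bit_length(), 1)
    length_bits = max(max(n for _, n in runs).bit_length(), 1)
    acc = 0
    for v, n in runs:
        acc = (acc << value_bits) | v
        acc = (acc << length_bits) | n
    total_bits = len(runs) * (value_bits + length_bits)
    n_bytes = (total_bits + 7) // 8
    acc <<= n_bytes * 8 - total_bits  # zero-pad the final byte
    return acc.to_bytes(n_bytes, "big"), value_bits, length_bits


def unpack_runs(payload, value_bits, length_bits, n_runs):
    """Invert pack_runs, reading pairs from the most significant bits down."""
    pair_bits = value_bits + length_bits
    acc = int.from_bytes(payload, "big") >> (len(payload) * 8 - n_runs * pair_bits)
    runs = []
    for i in range(n_runs):
        pair = (acc >> ((n_runs - 1 - i) * pair_bits)) & ((1 << pair_bits) - 1)
        runs.append((pair >> length_bits, pair & ((1 << length_bits) - 1)))
    return runs


runs = rle_encode([1, 1, 1, 2, 2, 2, 2, 3])
payload, value_bits, length_bits = pack_runs(runs)
restored = unpack_runs(payload, value_bits, length_bits, len(runs))
```

In this sketch the eight-unit example needs 2 bits per value and 3 bits per length, so its three runs fit in 15 bits, which rounds up to 2 bytes.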

On top of that, the XBEN format adds LZMA2 compression to exploit the repetition across plans, and several encoding variants specialize for how a particular sampler produces its plans.
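To see why a general-purpose compressor helps across plans, the following sketch builds a synthetic chain in which each plan differs only slightly from the previous one, then compresses it with Python's standard `lzma` module (whose `.xz` output uses the LZMA2 filter). It is an assumption-laden toy: the plans are stored as one raw byte per unit rather than as BEN, the data are randomly generated, and XBEN's real container framing is not reproduced.

```python
import lzma
import random

rng = random.Random(0)  # fixed seed so the toy data are reproducible
n_units, n_plans, n_districts = 1000, 200, 4

# Start from four equal contiguous blocks of units.
plan = [1 + (i * n_districts) // n_units for i in range(n_units)]

chunks = []
for _ in range(n_plans):
    # Each new plan reassigns only a handful of units.
    for _ in range(5):
        plan[rng.randrange(n_units)] = rng.randint(1, n_districts)
    chunks.append(bytes(plan))  # one byte per unit (toy encoding, not BEN)

raw = b"".join(chunks)
compressed = lzma.compress(raw, format=lzma.FORMAT_XZ)
ratio = len(raw) / len(compressed)
```

Because each plan is nearly a copy of the one before it, the compressor can encode most of a plan as back-references into the previous plan, which is the cross-plan repetition the paragraph above refers to.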

The headline result

That 13.5 GB Colorado JSONL ensemble, reordered by GEOID20, becomes a ~280 MB BEN stream, and then a 5.6 MB XBEN file — a >2,400× reduction, completely lossless. The biggest single lever is node reordering; see Why reordering shrinks files.
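The effect of node ordering on run-length encoding can be illustrated with a toy grid. The sketch below is not Colorado data and does not reproduce GEOID20 sorting; it assumes a 6 by 8 grid of units split into four vertical-stripe districts, and simply counts runs under two different orderings of the same plan.

```python
def count_runs(seq):
    """Number of maximal runs of equal consecutive values."""
    return sum(1 for i, v in enumerate(seq) if i == 0 or v != seq[i - 1])


rows, cols, n_districts = 6, 8, 4
stripe_width = cols // n_districts

# District of each grid cell: vertical stripes, two columns wide.
district = {
    (r, c): 1 + c // stripe_width
    for r in range(rows)
    for c in range(cols)
}

# Same plan, two node orderings.
row_major = [district[(r, c)] for r in range(rows) for c in range(cols)]
col_major = [district[(r, c)] for c in range(cols) for r in range(rows)]

runs_row_major = count_runs(row_major)
runs_col_major = count_runs(col_major)
```

Here the row-major ordering crosses every district boundary in every row and yields 24 runs, while the column-major ordering keeps each district contiguous and yields 4. The plan is identical; only the order in which units are listed changes.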

The format family

BEN comes as three on-disk containers, each suited to a different job:

Container   What it is                                                     Use it for
.ben        A plain BEN stream                                             Working with an ensemble: reading, replaying, subsampling
.xben       A BEN stream wrapped in LZMA2                                  Long-term storage and transferring ensembles
.bendl      A bundle: a BEN/XBEN stream plus the dual graph and metadata   The recommended default: one self-describing file

Formats: BEN vs XBEN vs BENDL covers the trade-offs in detail.

How the Python API is organized

The Python package mirrors the project’s CLI tools:

See The API map for when to reach for each, and the Vocabulary page for the precise meaning of plan, assignment, sample, and ensemble.

For the invariants that must hold across a real run — assignment length, graph node order, JSONL shape, and bundle assets — see The data contract.

For operational guidance after the basics, see Performance guide, Graph ordering deep dive, Limitations and invariants, and Compatibility and stability.