Plain BEN & XBEN streams

For storing and sharing an ensemble, the bundle tutorial covers what we generally recommend: the self-describing .bendl file. Underneath every bundle, though, sits a plain BEN/XBEN stream, and sometimes that lower layer is all you need:

  • you have an existing JSONL ensemble to convert,

  • someone handed you a bare .ben / .xben file,

  • or you are working alongside the ben CLI and want the same conversions from Python.

This notebook covers that layer: the whole-file converters in binary_ensemble.codec (JSONL ↔ BEN ↔ XBEN) and the streaming BenEncoder / BenDecoder classes.

The quick mental model: an ensemble usually starts life as JSONL, with one {"assignment": [...], "sample": n} line per plan. That format is simple, but a large ensemble can quickly balloon to tens of gigabytes. BEN shrinks it losslessly; XBEN adds LZMA2 compression on top for archival use. See Formats for the full picture.
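
To make the JSONL layout concrete, here is a minimal sketch that writes and re-reads three made-up plans using only the standard library. The plan values and the in-memory buffer are illustrative assumptions; only the {"assignment": [...], "sample": n} line shape comes from the description above.

```python
import io
import json

# Three toy plans on four nodes; each value is a district label (made up for illustration).
plans = [[1, 1, 2, 2], [1, 2, 2, 2], [1, 1, 1, 2]]

# Write one JSON object per line, numbering samples from 1.
buffer = io.StringIO()
for sample, assignment in enumerate(plans, start=1):
    buffer.write(json.dumps({"assignment": assignment, "sample": sample}) + "\n")

# Read it back line by line, the way a streaming consumer would.
buffer.seek(0)
records = [json.loads(line) for line in buffer]

print(records[0])
print(f"{len(buffer.getvalue())} characters for {len(records)} tiny plans")
```

Every node label is spelled out as text on every line, which is why the format grows so quickly with the number of nodes and plans.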

Setup: generate a small ensemble

So that this notebook stays self-contained and reproducible, we generate a small ensemble instead of downloading one: a short GerryChain ReCom chain on a 16×16 grid (256 nodes), written out as a JSONL file. binary-ensemble only ever sees lists of integers, so any sampler — or any JSONL file you already have lying around — works exactly the same way.

import json
import os
from functools import partial
from pathlib import Path

import networkx as nx
from gerrychain import Graph, MarkovChain, Partition, accept, constraints, updaters
from gerrychain.proposals import recom

Path("example_data").mkdir(exist_ok=True)

# A 16x16 grid with unit population and stripe districts -> a contiguous start state.
SIDE, N_DISTRICTS = 16, 4
grid = nx.grid_2d_graph(SIDE, SIDE)
grid = nx.convert_node_labels_to_integers(grid, ordering="sorted")
for node in grid.nodes:
    _row, col = divmod(node, SIDE)
    grid.nodes[node]["TOTPOP"] = 1
    grid.nodes[node]["district"] = col // (SIDE // N_DISTRICTS)

gc_graph = Graph.from_networkx(grid)
node_order = list(gc_graph.nodes)  # the order we write each assignment in
initial = Partition(
    gc_graph,
    assignment="district",
    updaters={"population": updaters.Tally("TOTPOP", alias="population")},
)
ideal = sum(initial["population"].values()) / len(initial)
chain = MarkovChain(
    proposal=partial(recom, pop_col="TOTPOP", pop_target=ideal, epsilon=0.05, node_repeats=2),
    constraints=[constraints.contiguous],
    accept=accept.always_accept,
    initial_state=initial,
    total_steps=200,
)

with open("example_data/small_example.jsonl", "w") as f:
    for i, partition in enumerate(chain, start=1):
        assignment = partition.assignment.to_series().loc[node_order].astype(int).tolist()
        f.write(json.dumps({"assignment": assignment, "sample": i}) + "\n")

jsonl_size = os.path.getsize("example_data/small_example.jsonl")
print(
    f"wrote example_data/small_example.jsonl: 200 plans on {SIDE * SIDE} nodes, {jsonl_size} bytes"
)
wrote example_data/small_example.jsonl: 200 plans on 256 nodes, 159892 bytes

Converting between file types

The binary_ensemble.codec helpers convert whole files in a single call. (These are the same conversions the ben CLI tool performs — see CLI parity for the mapping.)

from binary_ensemble.codec import (
    encode_jsonl_to_ben,
    encode_jsonl_to_xben,
    encode_ben_to_xben,
    decode_ben_to_jsonl,
    decode_xben_to_jsonl,
    decode_xben_to_ben,
)

JSONL → BEN

BEN is the quickest format to produce. encode_jsonl_to_ben reads the JSONL ensemble and writes a compact .ben stream in a single pass — watch the size drop:

encode_jsonl_to_ben(
    in_file="example_data/small_example.jsonl",
    out_file="example_data/small_example.ben",
    overwrite=True,
)
print(f"BEN bytes: {os.path.getsize('example_data/small_example.ben')}")
BEN bytes: 4366

By default the conversion functions refuse to overwrite an existing output file — pass overwrite=True when you actually mean to replace it:

try:
    encode_jsonl_to_ben(
        in_file="example_data/small_example.jsonl",
        out_file="example_data/small_example.ben",
    )
except OSError as e:
    print(f"refused to overwrite: {e}")
refused to overwrite: Output file example_data/small_example.ben already exists (use overwrite=True to replace).

Encoding variants

A BEN stream is encoded with one of three variants, chosen with variant=:

  • "twodelta" (the default) delta-encodes pairwise ReCom moves — ideal for ReCom chains like ours.

  • "mkv_chain" collapses identical consecutive plans — for full MCMC chains with rejections.

  • "standard" stores each plan independently — a simple baseline.

You never specify the variant when reading — decoding auto-detects it. Here’s how much the choice matters for a ReCom ensemble (see Encoding variants for how to choose on other samplers):

for variant in ["standard", "mkv_chain", "twodelta"]:
    encode_jsonl_to_ben(
        "example_data/small_example.jsonl",
        f"example_data/small_example.{variant}.ben",
        overwrite=True,
        variant=variant,
    )
    print(f"{variant:>10}: {os.path.getsize(f'example_data/small_example.{variant}.ben'):>6} bytes")
  standard:   8359 bytes
 mkv_chain:   8759 bytes
  twodelta:   4366 bytes
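
The idea behind "mkv_chain" (collapsing identical consecutive plans) can be sketched in a few lines. This is a toy illustration of the concept only, not the BEN byte format; the chain below is made up and the (plan, count) pairs are a hypothetical representation.

```python
from itertools import groupby

# A toy chain in which a rejected proposal repeats the current plan.
chain = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],  # rejection: same plan again
    [1, 1, 2, 2],  # rejection: same plan again
    [1, 2, 2, 2],
    [1, 1, 1, 2],
    [1, 1, 1, 2],  # rejection: same plan again
]

# Collapse each run of identical consecutive plans into (plan, count).
collapsed = [(plan, sum(1 for _ in run)) for plan, run in groupby(chain)]

# Expanding the runs recovers the original chain, so nothing is lost.
expanded = [plan for plan, count in collapsed for _ in range(count)]

print(collapsed)
print(expanded == chain)
```

Run collapsing only pays off when plans actually repeat. That is presumably why mkv_chain comes out slightly larger than standard in the sizes above: this chain accepts every proposal, so there are few, if any, repeated plans to collapse.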

BEN → XBEN

Once a file is done changing, XBEN wraps the BEN stream in LZMA2 for a much smaller file, at the cost of slower compression. The XBEN encoders accept n_threads and compression_level (0 is fastest, 9 is smallest):

encode_ben_to_xben(
    in_file="example_data/small_example.ben",
    out_file="example_data/small_example.xben",
    overwrite=True,
    compression_level=9,
)

# You can also go straight from JSONL to XBEN in one step.
encode_jsonl_to_xben(
    in_file="example_data/small_example.jsonl",
    out_file="example_data/small_example.direct.xben",
    overwrite=True,
)

for name in ["small_example.jsonl", "small_example.ben", "small_example.xben"]:
    print(f"{name:>22}: {os.path.getsize('example_data/' + name):>7} bytes")
   small_example.jsonl:  159892 bytes
     small_example.ben:    4366 bytes
    small_example.xben:    2104 bytes
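
As a quick piece of arithmetic on the sizes just printed, the snippet below turns the byte counts into compression ratios. The numbers are copied from the output above; they describe this 200-plan toy ensemble only, and other ensembles will compress differently.

```python
# Byte counts copied from the cell output above.
sizes = {"jsonl": 159892, "ben": 4366, "xben": 2104}

# How many times smaller each format is than the JSONL original.
ratios = {name: sizes["jsonl"] / sizes[name] for name in ("ben", "xben")}

for name, ratio in ratios.items():
    percent = 100 * sizes[name] / sizes["jsonl"]
    print(f"{name:>4}: {ratio:5.1f}x smaller than JSONL ({percent:.1f}% of the original)")
```

For this example BEN is roughly 37 times smaller than the JSONL file, and XBEN roughly halves the BEN file again.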

Decoding

The decoders mirror the encoders, and all of them take (in_file, out_file, overwrite=False):

decode_ben_to_jsonl(
    "example_data/small_example.ben", "example_data/roundtrip.jsonl", overwrite=True
)
decode_xben_to_jsonl(
    "example_data/small_example.xben", "example_data/from_xben.jsonl", overwrite=True
)
decode_xben_to_ben("example_data/small_example.xben", "example_data/from_xben.ben", overwrite=True)
print("decoded BEN -> JSONL, XBEN -> JSONL, and XBEN -> BEN")
decoded BEN -> JSONL, XBEN -> JSONL, and XBEN -> BEN

Encoding is lossless — decoding a BEN stream back to JSONL recovers the original plans exactly:

def load(path):
    with open(path) as f:
        return [json.loads(line)["assignment"] for line in f]


identical = load("example_data/small_example.jsonl") == load("example_data/roundtrip.jsonl")
print(f"round-trip identical: {identical}")
round-trip identical: True
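
The load helper above reads both files fully into memory, which is fine at this size. For a larger ensemble, a streaming comparison may be preferable. The following is a minimal sketch using only the standard library; jsonl_assignments_equal is a hypothetical helper name, not part of binary_ensemble, and the demo files are throwaway data.

```python
import json
import tempfile
from itertools import zip_longest
from pathlib import Path


def jsonl_assignments_equal(path_a, path_b):
    """Compare two JSONL ensembles plan by plan without loading either file whole."""
    with open(path_a) as fa, open(path_b) as fb:
        # zip_longest pads the shorter file with None, so a length mismatch fails too.
        for line_a, line_b in zip_longest(fa, fb):
            if line_a is None or line_b is None:
                return False
            if json.loads(line_a)["assignment"] != json.loads(line_b)["assignment"]:
                return False
    return True


# Demo on two throwaway files so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    a, b = Path(tmp) / "a.jsonl", Path(tmp) / "b.jsonl"
    rows = [{"assignment": [1, 1, 2, 2], "sample": 1}, {"assignment": [1, 2, 2, 2], "sample": 2}]
    a.write_text("".join(json.dumps(r) + "\n" for r in rows))
    b.write_text("".join(json.dumps(r) + "\n" for r in rows[:1]))  # one plan short
    same_file = jsonl_assignments_equal(a, a)
    truncated = jsonl_assignments_equal(a, b)

print(same_file, truncated)
```

Like load, this sketch compares only the "assignment" field and ignores the sample numbers.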

Streaming with BenEncoder / BenDecoder

When plans are produced one at a time — by a chain, not a file — there’s no reason to stage them through JSONL. BenEncoder writes assignments as they arrive; it’s a context manager, and the stream is flushed and finished on exit. BenDecoder reads a stream back the same way, one assignment at a time. (If this pattern looks familiar, it should: the bundle’s stream() writer and iterator are built on the same machinery.)

from binary_ensemble import BenEncoder, BenDecoder

plans = [[1, 1, 2, 2], [1, 2, 2, 2], [1, 1, 1, 2]]
with BenEncoder("example_data/tiny.ben", overwrite=True) as encoder:
    for plan in plans:
        encoder.write(plan)

decoder = BenDecoder("example_data/tiny.ben")
print(f"samples: {len(decoder)}")
for assignment in decoder:
    print(assignment)
samples: 3
[1, 1, 2, 2]
[1, 2, 2, 2]
[1, 1, 1, 2]

Subsampling

BenDecoder can yield just a subset of plans without materializing the rest, using the same three subsample_* methods you saw on the bundle decoder (one machinery, two decoders; the bundle tutorial has the full tour). Indices are 1-based, and subsample_range includes both endpoints, so subsample_range(50, 53) below yields four plans. A decoder is also reusable: each call rewinds the stream and applies the new selection, so there’s no need to open a fresh decoder per subsample.

One thing worth knowing at this layer: how cheap a skipped sample is depends on the variant. standard and mkv_chain frames are skipped wholesale, while twodelta — the default — replays from the nearest snapshot checkpoint.

ben_file = "example_data/small_example.ben"
decoder = BenDecoder(ben_file)  # one decoder, reused for every subsample below

picked = [assignment[:6] for assignment in decoder.subsample_indices([1, 100, 200])]
print(f"indices [1, 100, 200] -> {picked}")

ranged = [assignment[:6] for assignment in decoder.subsample_range(50, 53)]
print(f"range(50, 53)         -> {ranged}")

print(f"every 50th            -> {sum(1 for _ in decoder.subsample_every(50))} plans")

# The same decoder rewinds and re-selects on each call, so you can run subsamples
# repeatedly without building a new decoder:
again = [assignment[:6] for assignment in decoder.subsample_indices([1, 100, 200])]
print(f"indices again         -> {again}")
indices [1, 100, 200] -> [[0, 0, 0, 0, 1, 1], [3, 3, 3, 3, 3, 3], [2, 2, 2, 2, 2, 2]]
range(50, 53)         -> [[2, 2, 2, 2, 2, 2], [2, 2, 2, 2, 2, 2], [2, 2, 2, 2, 2, 2], [0, 0, 0, 0, 0, 0]]
every 50th            -> 4 plans
indices again         -> [[0, 0, 0, 0, 1, 1], [3, 3, 3, 3, 3, 3], [2, 2, 2, 2, 2, 2]]
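
To see why skipping is not free for a delta-based variant, here is a toy model of "snapshot plus deltas" decoding. It is a sketch of the general idea only: the frame layout, the checkpoint spacing, and the function names are all assumptions made for illustration and do not describe the actual twodelta byte format.

```python
CHECKPOINT_EVERY = 4  # hypothetical spacing, chosen only for this demo


def encode(plans):
    """Store a full snapshot every CHECKPOINT_EVERY plans and a delta otherwise."""
    frames = []
    for i, plan in enumerate(plans):
        if i % CHECKPOINT_EVERY == 0:
            frames.append(("snapshot", list(plan)))
        else:
            prev = plans[i - 1]
            changed = {pos: new for pos, (old, new) in enumerate(zip(prev, plan)) if old != new}
            frames.append(("delta", changed))
    return frames


def decode_one(frames, index):
    """Rebuild the plan at a 1-based index; also report how many frames were read."""
    target = index - 1
    start = target - target % CHECKPOINT_EVERY  # nearest snapshot at or before target
    plan = list(frames[start][1])
    # Every frame between the snapshot and the target is a delta to replay in order.
    for _kind, changed in frames[start + 1 : target + 1]:
        for pos, value in changed.items():
            plan[pos] = value
    return plan, target - start + 1


plans = [[0, 0, 1, 1], [0, 1, 1, 1], [0, 1, 1, 0], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
frames = encode(plans)
for index in (1, 4, 5):
    plan, frames_read = decode_one(frames, index)
    print(index, plan, frames_read)
```

In this model, plan 4 needs the snapshot plus three deltas, while plan 5 sits on a snapshot and needs a single frame. A format that stores each plan independently would read one frame in every case.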

The same methods work on an XBEN stream — pass mode="xben". Reading XBEN pays a one-time decompression startup cost, so if you’ll be subsampling repeatedly, extract to BEN first with decode_xben_to_ben:

for assignment in BenDecoder("example_data/small_example.xben", mode="xben").subsample_range(1, 3):
    print(assignment[:6])
[0, 0, 0, 0, 1, 1]
[0, 0, 0, 0, 2, 2]
[0, 0, 0, 0, 2, 2]
UserWarning: XBEN may take a second to start decoding.

Where to next

  • Working with .bendl files — if you skipped it, go back: one self-describing file with the graph, metadata, and checksums built in is almost always what you want. Everything here (plus graph reordering for much better compression) is available at the bundle level.

  • Concepts — the formats, variants, and how the compression works.

  • API reference — every public class and function.