Plain BEN & XBEN streams¶
For storing and sharing an ensemble, the bundle tutorial covers what we
generally recommend: the self-describing .bendl file. Underneath every bundle, though, sits a
plain BEN/XBEN stream, and sometimes that lower layer is all you need:
you have an existing JSONL ensemble to convert,
someone handed you a bare .ben/.xben file, or
you are working alongside the ben CLI and want the same conversions from Python.
This notebook covers that layer: the whole-file converters in binary_ensemble.codec
(JSONL ↔ BEN ↔ XBEN) and the streaming BenEncoder / BenDecoder classes.
The quick mental model: an ensemble usually starts life as JSONL — one
{"assignment": [...], "sample": n} line per plan — which is simple but balloons to tens of
gigabytes quickly. BEN shrinks that losslessly; XBEN adds LZMA2 on top for archival-grade
compression. See Formats for the full picture.
Setup: generate a small ensemble¶
So that this notebook stays self-contained and reproducible, we generate a small ensemble
instead of downloading one: a short GerryChain
ReCom chain on a 16×16 grid (256 nodes), written out as a JSONL file.
binary-ensemble only ever sees lists of integers, so any sampler — or any JSONL
file you already have lying around — works exactly the same way.
import json
import os
from functools import partial
from pathlib import Path
import networkx as nx
from gerrychain import Graph, MarkovChain, Partition, accept, constraints, updaters
from gerrychain.proposals import recom
Path("example_data").mkdir(exist_ok=True)
# A 16x16 grid with unit population and stripe districts -> a contiguous start state.
SIDE, N_DISTRICTS = 16, 4
grid = nx.grid_2d_graph(SIDE, SIDE)
grid = nx.convert_node_labels_to_integers(grid, ordering="sorted")
for node in grid.nodes:
    _row, col = divmod(node, SIDE)
    grid.nodes[node]["TOTPOP"] = 1
    grid.nodes[node]["district"] = col // (SIDE // N_DISTRICTS)
gc_graph = Graph.from_networkx(grid)
node_order = list(gc_graph.nodes) # the order we write each assignment in
initial = Partition(
    gc_graph,
    assignment="district",
    updaters={"population": updaters.Tally("TOTPOP", alias="population")},
)
ideal = sum(initial["population"].values()) / len(initial)
chain = MarkovChain(
    proposal=partial(recom, pop_col="TOTPOP", pop_target=ideal, epsilon=0.05, node_repeats=2),
    constraints=[constraints.contiguous],
    accept=accept.always_accept,
    initial_state=initial,
    total_steps=200,
)
with open("example_data/small_example.jsonl", "w") as f:
    for i, partition in enumerate(chain, start=1):
        assignment = partition.assignment.to_series().loc[node_order].astype(int).tolist()
        f.write(json.dumps({"assignment": assignment, "sample": i}) + "\n")
jsonl_size = os.path.getsize("example_data/small_example.jsonl")
print(
    f"wrote example_data/small_example.jsonl: 200 plans on {SIDE * SIDE} nodes, {jsonl_size} bytes"
)
wrote example_data/small_example.jsonl: 200 plans on 256 nodes, 159892 bytes
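The byte count printed above can be checked by hand, which makes the JSONL overhead concrete. The sketch below assumes the default json.dumps separators used by the setup cell and relies on every district label being a single digit; the helper name jsonl_line and the stand-in plan are illustrative only.

```python
import json


def jsonl_line(assignment, sample):
    """Serialize one plan in the {"assignment": [...], "sample": n} shape."""
    return json.dumps({"assignment": assignment, "sample": sample}) + "\n"


# Any plan with single-digit district labels serializes to the same length,
# so a stand-in stripe plan on 256 nodes is enough to count bytes.
plan = [(node % 16) // 4 for node in range(256)]

bytes_first_line = len(jsonl_line(plan, 1).encode("utf-8"))
total_bytes = sum(len(jsonl_line(plan, i).encode("utf-8")) for i in range(1, 201))

print(bytes_first_line)  # 798
print(total_bytes)       # 159892
```

That is roughly 3 bytes per node per plan for single-digit labels, growing linearly in both nodes and plans. By the same arithmetic, a million plans on a 10,000-node graph would land around 30 GB of JSONL.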
Converting between file types¶
The binary_ensemble.codec helpers convert whole files in a single call. (These are
the same conversions the ben CLI tool performs — see
CLI parity for the mapping.)
from binary_ensemble.codec import (
    encode_jsonl_to_ben,
    encode_jsonl_to_xben,
    encode_ben_to_xben,
    decode_ben_to_jsonl,
    decode_xben_to_jsonl,
    decode_xben_to_ben,
)
JSONL → BEN¶
BEN is the quickest format to produce. encode_jsonl_to_ben reads the JSONL ensemble and
writes a compact .ben stream in a single pass — watch the size drop:
encode_jsonl_to_ben(
    in_file="example_data/small_example.jsonl",
    out_file="example_data/small_example.ben",
    overwrite=True,
)
print(f"BEN bytes: {os.path.getsize('example_data/small_example.ben')}")
BEN bytes: 4366
By default the conversion functions refuse to overwrite an existing output file — pass
overwrite=True when you actually mean to replace it:
try:
    encode_jsonl_to_ben(
        in_file="example_data/small_example.jsonl",
        out_file="example_data/small_example.ben",
    )
except OSError as e:
    print(f"refused to overwrite: {e}")
refused to overwrite: Output file example_data/small_example.ben already exists (use overwrite=True to replace).
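If you write your own pipeline steps around these converters, the same refuse-by-default convention is easy to mirror with the standard library. This is a minimal sketch: open_output is a hypothetical helper, not part of binary_ensemble, and it says nothing about how the library implements its own check.

```python
import os
import tempfile


def open_output(path, overwrite=False):
    """Open `path` for binary writing, refusing to clobber unless asked.

    Hypothetical helper for illustration; not part of binary_ensemble.
    """
    # Mode "x" raises FileExistsError (an OSError subclass) if the file
    # already exists; mode "w" truncates it.
    return open(path, "wb" if overwrite else "xb")


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "out.ben")
    with open_output(target) as f:
        f.write(b"first")

    try:
        open_output(target)  # second attempt without overwrite=True
        refused = False
    except OSError:
        refused = True

    with open_output(target, overwrite=True) as f:
        f.write(b"second")
    with open(target, "rb") as f:
        final = f.read()

print(refused, final)
```

Catching OSError, as the cell above does, also covers FileExistsError, so calling code does not need to care which of the two is raised.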
Encoding variants¶
A BEN stream is encoded with one of three variants, chosen with variant=:
"twodelta" (the default) delta-encodes pairwise ReCom moves; ideal for ReCom chains like ours.
"mkv_chain" collapses identical consecutive plans; for full MCMC chains with rejections.
"standard" stores each plan independently; a simple baseline.
You never specify the variant when reading — decoding auto-detects it. Here’s how much the choice matters for a ReCom ensemble (see Encoding variants for how to choose on other samplers):
for variant in ["standard", "mkv_chain", "twodelta"]:
    encode_jsonl_to_ben(
        "example_data/small_example.jsonl",
        f"example_data/small_example.{variant}.ben",
        overwrite=True,
        variant=variant,
    )
    print(f"{variant:>10}: {os.path.getsize(f'example_data/small_example.{variant}.ben'):>6} bytes")
standard: 8359 bytes
mkv_chain: 8759 bytes
twodelta: 4366 bytes
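To build intuition for what the three variants trade off, here is a toy model in plain Python. It only mimics the ideas named above (independent plans, collapsed repeats, deltas between consecutive plans); it is not the BEN byte layout, and the plans and the helper name deltas are made up for illustration.

```python
from itertools import groupby

plans = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],  # a rejected proposal: the chain repeats the plan
    [0, 1, 1, 1],  # one node moves from district 0 to district 1
    [0, 1, 1, 1],
]

# "standard"-like: every plan is stored on its own.
standard = [list(plan) for plan in plans]

# "mkv_chain"-like: identical consecutive plans collapse to (plan, count).
collapsed = [(plan, sum(1 for _ in run)) for plan, run in groupby(plans)]


def deltas(plans):
    """Delta-like: first plan in full, then only (node, new_district) changes."""
    prev = plans[0]
    out = [("full", list(prev))]
    for plan in plans[1:]:
        changed = [(node, new) for node, (old, new) in enumerate(zip(prev, plan)) if old != new]
        out.append(("delta", changed))
        prev = plan
    return out


print(collapsed)      # [([0, 0, 1, 1], 2), ([0, 1, 1, 1], 2)]
print(deltas(plans))  # [('full', [0, 0, 1, 1]), ('delta', []), ('delta', [(1, 1)]), ('delta', [])]
```

Collapsing repeats only pays off when plans actually repeat, and deltas only pay off when consecutive plans differ in a few places, which is why the best variant depends on the sampler.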
BEN → XBEN¶
When a file is done changing, XBEN wraps the BEN stream in LZMA2 for much smaller
files, at the cost of slower compression. The XBEN encoders accept n_threads and
compression_level (0 fastest … 9 smallest):
encode_ben_to_xben(
    in_file="example_data/small_example.ben",
    out_file="example_data/small_example.xben",
    overwrite=True,
    compression_level=9,
)
# You can also go straight from JSONL to XBEN in one step.
encode_jsonl_to_xben(
    in_file="example_data/small_example.jsonl",
    out_file="example_data/small_example.direct.xben",
    overwrite=True,
)
for name in ["small_example.jsonl", "small_example.ben", "small_example.xben"]:
    print(f"{name:>22}: {os.path.getsize('example_data/' + name):>7} bytes")
small_example.jsonl: 159892 bytes
small_example.ben: 4366 bytes
small_example.xben: 2104 bytes
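The level-versus-size trade-off can be explored with Python's own lzma module, whose .xz container uses the LZMA2 filter and the same 0 to 9 preset scale. This sketch does not read or write XBEN; the payload is a made-up stand-in, and whether compression_level maps directly onto an xz preset is an assumption.

```python
import lzma

# Stand-in payload: highly repetitive bytes, loosely like a stream of similar plans.
payload = bytes(i % 7 for i in range(50_000)) * 4

sizes = {}
for preset in (0, 6, 9):
    compressed = lzma.compress(payload, format=lzma.FORMAT_XZ, preset=preset)
    assert lzma.decompress(compressed) == payload  # lossless at every level
    sizes[preset] = len(compressed)
    print(f"preset {preset}: {len(compressed)} bytes from {len(payload)}")
```

Higher presets spend more time and memory searching for matches; on real data the size gain from 6 to 9 is often modest, so it is worth measuring on your own ensemble before settling on a level.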
Decoding¶
The decoders mirror the encoders, and all of them take
(in_file, out_file, overwrite=False):
decode_ben_to_jsonl(
    "example_data/small_example.ben", "example_data/roundtrip.jsonl", overwrite=True
)
decode_xben_to_jsonl(
    "example_data/small_example.xben", "example_data/from_xben.jsonl", overwrite=True
)
decode_xben_to_ben("example_data/small_example.xben", "example_data/from_xben.ben", overwrite=True)
print("decoded BEN -> JSONL, XBEN -> JSONL, and XBEN -> BEN")
decoded BEN -> JSONL, XBEN -> JSONL, and XBEN -> BEN
Encoding is lossless — decoding a BEN stream back to JSONL recovers the original plans exactly:
def load(path):
    with open(path) as f:
        return [json.loads(line)["assignment"] for line in f]
identical = load("example_data/small_example.jsonl") == load("example_data/roundtrip.jsonl")
print(f"round-trip identical: {identical}")
round-trip identical: True
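The load helper above reads both files fully into memory, which is fine for 200 plans but not for a multi-gigabyte ensemble. A streaming comparison is one alternative; the sketch below uses only the standard library, and same_assignments is a hypothetical helper name. The temporary files exist only to make the example self-contained.

```python
import json
import os
import tempfile
from itertools import zip_longest


def same_assignments(path_a, path_b):
    """Compare two JSONL ensembles plan by plan without loading either fully."""
    with open(path_a) as a, open(path_b) as b:
        for line_a, line_b in zip_longest(a, b):
            if line_a is None or line_b is None:
                return False  # one file has more plans than the other
            if json.loads(line_a)["assignment"] != json.loads(line_b)["assignment"]:
                return False
    return True


def write_jsonl(path, plans):
    with open(path, "w") as f:
        for i, plan in enumerate(plans, start=1):
            f.write(json.dumps({"assignment": plan, "sample": i}) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "a.jsonl")
    copy = os.path.join(tmp, "b.jsonl")
    shorter = os.path.join(tmp, "c.jsonl")
    write_jsonl(first, [[1, 1, 2], [1, 2, 2]])
    write_jsonl(copy, [[1, 1, 2], [1, 2, 2]])
    write_jsonl(shorter, [[1, 1, 2]])
    equal = same_assignments(first, copy)
    truncated_equal = same_assignments(first, shorter)

print(equal, truncated_equal)  # True False
```

With the files from this notebook, the call would be same_assignments("example_data/small_example.jsonl", "example_data/roundtrip.jsonl").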
Streaming with BenEncoder / BenDecoder¶
When plans are produced one at a time — by a chain, not a file — there’s no reason
to stage them through JSONL. BenEncoder writes assignments as they arrive; it’s a
context manager, and the stream is flushed and finished on exit. BenDecoder reads a
stream back the same way, one assignment at a time. (If this pattern looks familiar,
it should: the bundle’s stream() writer and iterator are built on the same
machinery.)
from binary_ensemble import BenEncoder, BenDecoder
plans = [[1, 1, 2, 2], [1, 2, 2, 2], [1, 1, 1, 2]]
with BenEncoder("example_data/tiny.ben", overwrite=True) as encoder:
    for plan in plans:
        encoder.write(plan)
decoder = BenDecoder("example_data/tiny.ben")
print(f"samples: {len(decoder)}")
for assignment in decoder:
    print(assignment)
samples: 3
[1, 1, 2, 2]
[1, 2, 2, 2]
[1, 1, 1, 2]
Subsampling¶
BenDecoder can yield just a subset of plans without materializing the rest, using
the same three subsample_* methods you saw on the bundle decoder (one machinery,
two decoders; the bundle tutorial has the full tour). Indices
are 1-based, and subsample_range includes both endpoints (note the four plans returned for
subsample_range(50, 53) below). A decoder is reusable: each call rewinds the stream and applies
the new selection, so there’s no need to open a fresh decoder per subsample.
One thing worth knowing at this layer: how cheap a skipped sample is depends on the
variant. standard and mkv_chain frames are skipped wholesale, while twodelta —
the default — replays from the nearest snapshot checkpoint.
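Why a delta-encoded stream needs checkpoints can be seen in a toy model: to reach a given sample, a reader starts from the nearest earlier full snapshot and replays the deltas after it. Everything below is an illustrative assumption (the class name ToyDeltaStore, the snapshot interval, the in-memory layout); it is not how BEN lays out bytes on disk.

```python
class ToyDeltaStore:
    """Toy snapshot-plus-delta storage; NOT the BEN on-disk format."""

    def __init__(self, plans, snapshot_every=4):
        self.snapshot_every = snapshot_every
        self.snapshots = {}  # 0-based position -> full plan
        self.deltas = {}     # 0-based position -> [(node, new_district), ...]
        prev = None
        for i, plan in enumerate(plans):
            if i % snapshot_every == 0:
                self.snapshots[i] = list(plan)
            else:
                self.deltas[i] = [
                    (node, new) for node, (old, new) in enumerate(zip(prev, plan)) if old != new
                ]
            prev = plan

    def get(self, sample):
        """Return (plan, replayed) for a 1-based sample number."""
        i = sample - 1
        start = i - i % self.snapshot_every  # nearest snapshot at or before i
        plan = list(self.snapshots[start])
        for j in range(start + 1, i + 1):
            for node, district in self.deltas[j]:
                plan[node] = district
        return plan, i - start


# Ten made-up plans, each flipping one node relative to the previous one.
plans = [[0, 0, 1, 1]]
for step in range(1, 10):
    nxt = list(plans[-1])
    nxt[step % 4] = 1 - nxt[step % 4]
    plans.append(nxt)

store = ToyDeltaStore(plans, snapshot_every=4)
print(store.get(5))  # sits on a snapshot: 0 deltas replayed
print(store.get(8))  # 3 deltas replayed from the snapshot at sample 5
```

The cost of reaching a sample is bounded by the snapshot interval rather than by its position in the stream, which is the property that keeps subsampling a delta-encoded file practical.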
ben_file = "example_data/small_example.ben"
decoder = BenDecoder(ben_file) # one decoder, reused for every subsample below
picked = [assignment[:6] for assignment in decoder.subsample_indices([1, 100, 200])]
print(f"indices [1, 100, 200] -> {picked}")
ranged = [assignment[:6] for assignment in decoder.subsample_range(50, 53)]
print(f"range(50, 53) -> {ranged}")
print(f"every 50th -> {sum(1 for _ in decoder.subsample_every(50))} plans")
# The same decoder rewinds and re-selects on each call, so you can run subsamples
# repeatedly without building a new decoder:
again = [assignment[:6] for assignment in decoder.subsample_indices([1, 100, 200])]
print(f"indices again -> {again}")
indices [1, 100, 200] -> [[0, 0, 0, 0, 1, 1], [3, 3, 3, 3, 3, 3], [2, 2, 2, 2, 2, 2]]
range(50, 53) -> [[2, 2, 2, 2, 2, 2], [2, 2, 2, 2, 2, 2], [2, 2, 2, 2, 2, 2], [0, 0, 0, 0, 0, 0]]
every 50th -> 4 plans
indices again -> [[0, 0, 0, 0, 1, 1], [3, 3, 3, 3, 3, 3], [2, 2, 2, 2, 2, 2]]
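The selection semantics visible in the output above (1-based indices, a range that includes both endpoints) can be reproduced on any Python iterable. This is a sketch of the semantics only, with hypothetical helper names; it assumes results come back in stream order and does not reflect how BenDecoder skips frames internally.

```python
from itertools import islice


def pick_indices(stream, indices):
    """Yield the items at the given 1-based positions, in stream order."""
    wanted = set(indices)
    if not wanted:
        return
    last = max(wanted)
    for position, item in enumerate(stream, start=1):
        if position in wanted:
            yield item
        if position >= last:
            break  # nothing past the last requested index is needed


def pick_range(stream, start, stop):
    """Yield items start..stop, 1-based and inclusive at both ends."""
    return islice(stream, start - 1, stop)


def plans():
    # Stand-in for a 200-plan stream; a fresh generator per call.
    return (f"plan-{n}" for n in range(1, 201))


print(list(pick_indices(plans(), [1, 100, 200])))  # ['plan-1', 'plan-100', 'plan-200']
print(list(pick_range(plans(), 50, 53)))           # ['plan-50', 'plan-51', 'plan-52', 'plan-53']
```

Both helpers are lazy: they never hold more than one item at a time, which mirrors the point of subsampling without materializing the rest.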
The same methods work on an XBEN stream — pass mode="xben". Reading XBEN pays a
one-time decompression startup cost, so if you’ll be subsampling repeatedly, extract
to BEN first with decode_xben_to_ben:
for assignment in BenDecoder("example_data/small_example.xben", mode="xben").subsample_range(1, 3):
    print(assignment[:6])
[0, 0, 0, 0, 1, 1]
[0, 0, 0, 0, 2, 2]
[0, 0, 0, 0, 2, 2]
UserWarning: XBEN may take a second to start decoding.
Where to next¶
Working with .bendl files: if you skipped it, go back. One self-describing file with the graph, metadata, and checksums built in is almost always what you want. Everything here (plus graph reordering for much better compression) is available at the bundle level.
Concepts: the formats, variants, and how the compression works.
API reference: every public class and function.