Working with `.bendl` files¶

BENDL is a portmanteau of “BEN” (Binary-ENsemble) and “bundle”, and is the generally recommended format for storage and transmission of ensembles of districting plans. It is intended to be a single self-describing file that contains a districting ensemble, its associated graph, and any metadata / accompanying artifacts that the user would like to include.

The BENDL format was created to alleviate two common pain points for redistricting analysts:

How do you store millions of districting plans efficiently?
Which graph did you use to make that ensemble again? There are like 7 in this folder…

The purpose of this tutorial is to re-work one of the classic ReCom tutorials from GerryChain, but with the additional support of binary-ensemble, to demonstrate the intended workflow.

Note: There is also a companion notebook on the plain BEN/XBEN streams: the central layer of every bundle that encodes a redistricting ensemble.

What goes in a bundle?¶

A plain .ben file is just the assignment stream: a long sequence of districting plans and nothing else. A .bendl file wraps that stream together with assets:

the dual graph (graph.json), so the file explains itself;
an optional node_permutation_map.json, recording any reordering applied to the graph for better compression;
metadata.json, for run provenance (seed, parameters, generator, …);
arbitrary custom assets: notes, analysis results, plots, even geospatial blobs.

Keeping everything in one file means an ensemble can be shared and reproduced with no side files, the chain parameters travel with the plans, and analysis results can be appended to a finished bundle later without rewriting the stream. Bundles also have a natural lifecycle: work in BEN (fast) while a project is active, then recompress to XBEN for long-term archival with every asset preserved.

Setup¶

First we need a dual graph to draw plans on. Rather than download a multi-megabyte real-world graph, we generate a SIDE × SIDE grid (here 32×32 = 1024 nodes): big enough to behave like a real ensemble, small enough to run in seconds, and fully reproducible. Each node gets unit population (TOTPOP = 1) and an initial district label of vertical stripes, which gives ReCom a contiguous, balanced starting partition.

Then we deliberately shuffle the node order. Real-world dual graphs rarely arrive in a compression-friendly order (census blocks listed by GEOID, nodes in whatever order the shapefile happened to have), so the stored order usually has no relationship to graph locality. Shuffling reproduces that situation, and it is exactly the one where reordering before encoding pays off, as we will see below. The graph is written out as NetworkX adjacency JSON under example_data/, the shape a bundle stores.

import json
import random

import networkx as nx
from pathlib import Path

Path("example_data").mkdir(exist_ok=True)

SIDE, N_DISTRICTS = 32, 4  # 1024 nodes; SIDE must be divisible by N_DISTRICTS
GRAPH_PATH = Path("example_data/grid.json")


def build_grid_graph(side, n_districts, shuffle_seed=0):
    """A side*side grid with unit population, stripe districts, and a shuffled order."""
    g = nx.grid_2d_graph(side, side)
    g = nx.convert_node_labels_to_integers(g, ordering="sorted")  # row-major ints
    cols_per_district = side // n_districts
    for node in g.nodes:
        _row, col = divmod(node, side)
        g.nodes[node]["TOTPOP"] = 1
        g.nodes[node]["district"] = col // cols_per_district
    # Rebuild with nodes inserted in a random order, so the *stored* order has no
    # spatial locality (attributes and edges are preserved untouched).
    shuffled = list(g.nodes)
    random.Random(shuffle_seed).shuffle(shuffled)
    h = nx.Graph()
    h.add_nodes_from((node, g.nodes[node]) for node in shuffled)
    h.add_edges_from(g.edges)
    return h


grid = build_grid_graph(SIDE, N_DISTRICTS)
GRAPH_PATH.write_text(json.dumps(nx.readwrite.json_graph.adjacency_data(grid)))
print(f"graph file: {GRAPH_PATH} ({GRAPH_PATH.stat().st_size} bytes, {SIDE * SIDE} nodes)")

graph file: example_data/grid.json (95290 bytes, 1024 nodes)

The public surface¶

Everything bundle-related is re-exported from the top-level package, but it lives in two submodules:

binary_ensemble.bundle — BendlEncoder, BendlDecoder, compress_stream
binary_ensemble.graph — reorder, reorder_multi_level_cluster, reorder_reverse_cuthill_mckee, reorder_by_key

Note: The plain-stream BenEncoder / BenDecoder and the whole-file encode_* / decode_* codec helpers are covered in the streams notebook.

from binary_ensemble import BendlDecoder, BendlEncoder, compress_stream
from binary_ensemble import graph as bgraph

Adding in GerryChain¶

As per usual, our example ReCom chain will use a ReCom proposal and a couple of updaters. The standard chain recipe is independent of how the nodes are ordered, so we factor it into a helper that builds a fresh chain on whatever graph we hand it. We will call this once per bundle and stream each plan to disk as the chain produces it.

from functools import partial

from gerrychain import Graph, MarkovChain, Partition, accept, updaters
from gerrychain.proposals import recom


def make_chain(gc_graph, steps):
    """Build a fresh ReCom MarkovChain over ``gc_graph`` (a gerrychain.Graph)."""
    chain_updaters = {
        "population": updaters.Tally("TOTPOP", alias="population"),
        "cut_edges": updaters.cut_edges,
    }
    initial = Partition(gc_graph, assignment="district", updaters=chain_updaters)
    ideal_pop = sum(initial["population"].values()) / len(initial)
    return MarkovChain(
        proposal=partial(
            recom, pop_col="TOTPOP", pop_target=ideal_pop, epsilon=0.05, node_repeats=2
        ),
        constraints=[],
        accept=accept.always_accept,
        initial_state=initial,
        total_steps=steps,
    )

Writing your first BENDL¶

When working with the BENDL format, the general workflow is as follows:

create the encoder and add the graph (and any other assets),
open the single-use ben_stream(...) in a with block,
loop through the chain and write each plan to the stream,
when the with enc.ben_stream(...) block exits, the bundle is finalized on disk.

It is important to note that, when writing to a BENDL file, every assignment must be in a fixed, known node order. GerryChain makes no ordering promise, so we pin the order to the graph’s node iteration order and reindex each plan to it.
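As a minimal sketch of what "reindexing to a fixed order" means, independent of GerryChain and pandas: suppose a plan arrives as a node-to-district mapping in arbitrary order. The toy `plan`, `node_order`, and the helper `to_assignment` below are made up for illustration and are not part of binary-ensemble.

```python
# Toy data (hypothetical): a plan keyed by node label, in no particular order.
plan = {"c": 1, "a": 0, "d": 1, "b": 0}

# The pinned write order, e.g. the stored graph's node iteration order.
node_order = ["a", "b", "c", "d"]


def to_assignment(plan, node_order):
    """Return district labels as a list aligned with node_order."""
    missing = [node for node in node_order if node not in plan]
    if missing:
        raise KeyError(f"plan has no district for nodes: {missing}")
    return [int(plan[node]) for node in node_order]


assignment = to_assignment(plan, node_order)
print(assignment)  # [0, 0, 1, 1]
```

The list carries no node labels of its own, which is why the order has to be agreed on before the first plan is written.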

To take full advantage of the BENDL format, we need to first store the graph that we intend to pass to our Markov chain in the bundle itself. The BendlEncoder class provides a sort parameter that you can modify to make the ensemble storage more efficient (more on that later).

The add_graph method accepts most of the standard formats that you would expect:

a networkx.Graph instance (subclasses such as gerrychain.Graph count),
adjacency-format JSON as a parsed dict or list, raw bytes, or a file-like object with .read(),
a str / os.PathLike path to a JSON file. A plain str is interpreted as a path.

The add_graph method also returns the embedded graph (as a NetworkX graph) for immediate use, e.g. for building the GerryChain graph, and the write order is guaranteed to match what gets stored.

For this first bundle we pass sort=None to store the graph in its raw order. We will show shortly some optional pre-processing steps that can decrease the bundle size.

encoder = BendlEncoder("example_data/basic.bendl", overwrite=True)
stored_graph = encoder.add_graph(GRAPH_PATH, sort=None)
gc_graph = Graph.from_networkx(stored_graph)
node_order = list(gc_graph.nodes)  # the order stored == the order we write assignments in

with encoder.ben_stream() as stream:  # only the stream is context-managed
    for partition in make_chain(gc_graph, steps=1000):
        series = partition.assignment.to_series()
        stream.write(series.loc[node_order].astype(int).tolist())
# the bundle is finalized now that the stream context has closed and can no longer be updated

print("wrote example_data/basic.bendl")

wrote example_data/basic.bendl

Because we embedded the graph before the stream, the encoder knows the node count and checks every write against it. A wrong-length assignment raises immediately instead of silently corrupting the file, and since the exception escapes the stream context, the bundle is left unfinalized rather than stamped complete (more on that at the end):

encoder = BendlEncoder("example_data/willfail.bendl", overwrite=True)
encoder.add_graph(GRAPH_PATH, sort=None)
try:
    with encoder.ben_stream() as stream:
        stream.write([0, 1, 2])  # too short
except ValueError as e:
    print(f"rejected as expected: {e}")

rejected as expected: assignment length 3 does not match graph node count 1024

Reordering for compression (the default)¶

BEN and XBEN compress runs of equal adjacent labels, so a node ordering that keeps neighbouring nodes near each other in the stream compresses much better. Our grid’s stored order is shuffled, which makes the raw basic.bendl above close to a worst case. Fixing this is the encoder’s default behaviour: add_graph reorders the graph with multi-level clustering (sort="mlc") unless you opt out with sort=None. Reordering:

reorders the graph — sort="mlc" (default), sort="rcm", or sort="key" with key="<attribute>" (e.g. key="GEOID") to sort by a node attribute,
stores both the reordered graph.json and a node_permutation_map.json,
and returns the reordered graph.

That last point is what makes the workflow tidy: build the entire ReCom chain on the returned graph, and the chain’s natural node order already equals the stored order, so streaming needs no extra bookkeeping.
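To build some intuition for why order matters, we can count runs of equal adjacent labels for one stripe plan under different node orders. This is only a rough proxy, assuming that fewer runs means a smaller stream; it does not reproduce the actual BEN encoding or the MLC ordering, and the helper `count_runs` is hypothetical.

```python
import random

SIDE, N_DISTRICTS = 32, 4
cols_per_district = SIDE // N_DISTRICTS

# Vertical-stripe plan on a SIDE x SIDE grid: the label depends only on the column.
cells = [(row, col) for row in range(SIDE) for col in range(SIDE)]
label = {(row, col): col // cols_per_district for row, col in cells}


def count_runs(order):
    """Number of maximal runs of equal adjacent labels when cells are listed in `order`."""
    labels = [label[cell] for cell in order]
    return 1 + sum(1 for a, b in zip(labels, labels[1:]) if a != b)


row_major = cells
column_major = sorted(cells, key=lambda cell: (cell[1], cell[0]))
shuffled = cells[:]
random.Random(0).shuffle(shuffled)

runs_row_major = count_runs(row_major)        # 4 runs per row, 32 rows: 128
runs_column_major = count_runs(column_major)  # one run per stripe: 4
runs_shuffled = count_runs(shuffled)          # far more runs, no locality left
print(runs_row_major, runs_column_major, runs_shuffled)
```

The same plan costs 4 runs in one order and well over a hundred in another, which is the effect a locality-preserving reordering is trying to capture across a whole ensemble.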

Note: For graphs composed of census blocks, sorting by the “GEOID” attribute generally produces the best compression.

Note: Reordering is pre-stream only — it decides the write order — so add_graph(...) must come before ben_stream().

This will be the “real” bundle for the rest of the tutorial, so we also stamp in metadata and a custom asset while we are here.

from datetime import datetime

encoder = BendlEncoder("example_data/mlc_reordered.bendl", overwrite=True)

# add_graph reorders with MLC by default; build the chain on the returned graph.
reordered_graph = encoder.add_graph(GRAPH_PATH)
gc_graph = Graph.from_networkx(reordered_graph)
write_order = list(gc_graph.nodes)

# Provenance + extra assets (covered in detail in the next section).
encoder.add_metadata(
    {
        "generator": "gerrychain",
        "proposal": "recom",
        "epsilon": 0.05,
        "seed": 1234,
        "created_by": "me",
        "created_at": datetime(1970, 1, 1).isoformat(),
        "description": "ReCom ensemble on a 32x32 grid, MLC-reordered.",
    }
)
encoder.add_asset("readme.txt", "ReCom ensemble on a 32x32 grid, MLC-reordered.", "text")

with encoder.ben_stream() as stream:
    for partition in make_chain(gc_graph, steps=1000):
        series = partition.assignment.to_series()
        stream.write(series.loc[write_order].astype(int).tolist())

print("wrote example_data/mlc_reordered.bendl")

wrote example_data/mlc_reordered.bendl

Did reordering actually help?¶

It is tempting to just compare basic.bendl against mlc_reordered.bendl, but that is not a fair fight: they hold different ensembles (each was streamed live from its own independent ReCom run), so their stream sizes mix the ordering effect with run-to-run randomness. We’ll take a look anyway, and then do the comparison properly. The bundle header records the exact byte length of the embedded assignment stream, so stream_size() reads it back without decoding or copying anything:

basic_size = BendlDecoder("example_data/basic.bendl").stream_size()
mlc_size = BendlDecoder("example_data/mlc_reordered.bendl").stream_size()

print(f"basic.bendl         (raw, run A): {basic_size:>8} bytes")
print(f"mlc_reordered.bendl (mlc, run B): {mlc_size:>8} bytes")

basic.bendl         (raw, run A):   135872 bytes
mlc_reordered.bendl (mlc, run B):    40143 bytes

For an apples-to-apples measurement we need the same plans in two orderings. We can get that without running a second chain by relabeling basic.bendl’s exact ensemble into MLC order. relabel_bundle does the whole thing in one call: it reorders the stored graph, rewrites every assignment into the new node order, and stores a node_permutation_map.json so the change stays reversible (metadata and custom assets come along too). It is the bundle-level form of the CLI’s ben relabel ordering step:

from binary_ensemble import relabel_bundle

# overwrite=True replaces any copy left behind by a previous run of this notebook.
relabel_bundle(
    "example_data/basic.bendl",
    out_file="example_data/basic_mlc_relabeled.bendl",
    sort="mlc",
    overwrite=True,
)

raw_bytes = BendlDecoder("example_data/basic.bendl").stream_size()
mlc_bytes = BendlDecoder("example_data/basic_mlc_relabeled.bendl").stream_size()
print(f"same ensemble, raw order: {raw_bytes:>8} bytes")
print(f"same ensemble, MLC order: {mlc_bytes:>8} bytes")
print(f"-> {raw_bytes / mlc_bytes:.1f}x smaller from reordering alone")

same ensemble, raw order:   135872 bytes
same ensemble, MLC order:    40256 bytes
-> 3.4x smaller from reordering alone

Now the only thing that changed is the node ordering, so that ratio is the real compression win from MLC. On more complicated dual graphs the savings can be substantial, and they matter most right before an expensive XBEN recompress.

Note: On a graph that already arrives in a locality-friendly order the gain is smaller, and the extra node_permutation_map.json can even make a tiny file net-larger. Reordering is cheap and rarely hurts, though, so the encoder does it unless you ask for raw order with sort=None.

Reordering under the hood: the standalone utilities¶

add_graph(..., sort=..., key=...) is built on the binary_ensemble.graph utilities, which you can also call directly. This is handy when you want to compute an ordering once and reuse it across several bundles, or inspect the permutation map before committing to anything. Each returns (reordered_graph, node_permutation_map): a live NetworkX graph plus the map dict.

reordered, permutation_map = bgraph.reorder(GRAPH_PATH, sort="rcm")
print(f"reorder(sort='rcm') -> {type(reordered).__name__} with {reordered.number_of_nodes()} nodes")

# Sort by a node attribute with sort="key" + key=... (on real data this is how you'd order by,
# say, "GEOID"; here the grid only has "district"/"id"):
graph_mlc, _ = bgraph.reorder(GRAPH_PATH, sort="mlc")
graph_rcm, _ = bgraph.reorder(GRAPH_PATH, sort="rcm")
graph_by_district, _ = bgraph.reorder(GRAPH_PATH, sort="key", key="district")

# reorder_multi_level_cluster / reorder_reverse_cuthill_mckee / reorder_by_key are thin convenience
# wrappers over these.
print("orderings: sort='mlc', sort='rcm', or sort='key' with key='<attribute>'")

# The permutation map is what makes a reordering reversible: its required field
# `node_permutation_old_to_new` maps original 0-based node positions -> new ones.
old_to_new = permutation_map["node_permutation_old_to_new"]
is_bijection = sorted(old_to_new.values()) == list(range(reordered.number_of_nodes()))
provenance = {k: permutation_map[k] for k in ("ordering_method", "key")}

print(f"old_to_new is a bijection over [0, n): {is_bijection}")
print(f"provenance fields: {provenance}")

reorder(sort='rcm') -> Graph with 1024 nodes
orderings: sort='mlc', sort='rcm', or sort='key' with key='<attribute>'
old_to_new is a bijection over [0, n): True
provenance fields: {'ordering_method': 'reverse-cuthill-mckee', 'key': None}
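To see why an old-to-new map is enough to make a reordering reversible, here is a small sketch on a hypothetical 5-node map. It assumes integer keys and values that are 0-based positions (a map loaded from JSON may carry string keys that need converting first), and the helpers `relabel` and `invert` are illustrative names, not library functions.

```python
# Hypothetical 5-node map: original position -> new position (both 0-based).
old_to_new = {0: 3, 1: 0, 2: 4, 3: 1, 4: 2}


def relabel(assignment, mapping):
    """Move the label at each source position to its mapped target position."""
    out = [None] * len(assignment)
    for source, target in mapping.items():
        out[target] = assignment[source]
    return out


def invert(mapping):
    """Build the new -> old map that undoes the reordering."""
    return {target: source for source, target in mapping.items()}


original = [10, 11, 12, 13, 14]       # one label per node, original order
relabeled = relabel(original, old_to_new)
restored = relabel(relabeled, invert(old_to_new))

print(relabeled)  # [11, 13, 14, 10, 12]
print(restored)   # [10, 11, 12, 13, 14]
```

Because the map is a bijection, inverting it is a dictionary flip, and applying the inverse recovers the original order exactly.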

Metadata and custom assets¶

We already used both of these while building mlc_reordered.bendl. add_metadata writes the canonical metadata.json (provenance). add_asset stores a custom asset under a name you choose, and the content_type tells the facade how to treat the payload:

"json" — a dict/list (serialized for you) or a JSON string; the decoder auto-parses it on the way out.
"text" — any UTF-8 string.
"binary" — raw bytes, stored verbatim: plots, pickles, zipped shapefiles, anything.
"file" — a path (str or pathlib.Path); the file’s bytes are read in and stored. This is the easy way to ship an existing file inside the bundle.

Every asset carries a CRC32C checksum, and payloads of 1 KiB or more are xz-compressed on disk (both are transparent on read). The facade validates payloads up front, so a malformed "json" asset is caught at write time rather than discovered by a collaborator at read time.

Assets may be added before or after the stream; only the stream itself is single-use. Post-stream adds commit immediately (one directory rewrite each), so use them sparingly. Here we tack a couple of assets onto the already-finalized mlc_reordered.bendl to show all of this at once:

# Reopen the finalized bundle in append mode. Each add_* commits immediately,
# so there is nothing to finalize afterwards.
appender = BendlEncoder.append("example_data/mlc_reordered.bendl")
appender.add_asset("params.json", {"node_repeats": 2}, "json")  # dicts are fine

# Ship an existing file (here a stand-in for, say, a geopackage) straight off disk:
Path("example_data/tracts.gpkg").write_bytes(b"GPKG\x00stand-in geospatial bytes")
appender.add_asset("tracts.gpkg", "example_data/tracts.gpkg", "file")

# Validation in action — a "json" asset that isn't JSON is rejected up front:
encoder = BendlEncoder("example_data/tmp.bendl", overwrite=True)
try:
    encoder.add_asset("bad.json", "this is not json", "json")
except ValueError as e:
    print(f"rejected as expected: {e}")

rejected as expected: content_type='json' requires valid UTF-8 JSON: Expecting value: line 1 column 1 (char 0)
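The two behaviours described in this section, validating "json" payloads up front and xz-compressing payloads of 1 KiB or more, can be modeled with the standard library alone. This is a sketch of the described policy under our own assumptions, not the facade's actual code; `prepare_json_asset` and `XZ_THRESHOLD` are hypothetical names.

```python
import json
import lzma

XZ_THRESHOLD = 1024  # 1 KiB, per the description above


def prepare_json_asset(payload):
    """Validate a JSON payload and return (stored_bytes, is_xz_compressed)."""
    if isinstance(payload, (dict, list)):
        raw = json.dumps(payload).encode("utf-8")
    else:
        json.loads(payload)  # raises json.JSONDecodeError (a ValueError) if malformed
        raw = payload.encode("utf-8")
    if len(raw) >= XZ_THRESHOLD:
        return lzma.compress(raw), True  # lzma's default container is .xz
    return raw, False


small, small_is_xz = prepare_json_asset({"node_repeats": 2})
big, big_is_xz = prepare_json_asset({"values": list(range(2000))})

try:
    prepare_json_asset("this is not json")
    rejected = False
except ValueError:
    rejected = True

print(len(small), small_is_xz, big_is_xz, rejected)
```

Failing at write time keeps a malformed payload from ever reaching the file, so a reader never has to handle it.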

Reading a bundle¶

BendlDecoder(path) opens a bundle. The canonical getters pull the well-known assets back out in convenient form:

read_graph() → a live NetworkX graph (or None if absent),
read_metadata() → parsed metadata.json (or None),
read_node_permutation_map() → parsed map dict (or None).

The important detail: read_graph() returns the graph in the node order the assignments were written in. Since we built the chain on the reordered graph, everything will line up with the stream automatically:

decoder = BendlDecoder("example_data/mlc_reordered.bendl")

packaged_graph = decoder.read_graph()
graph_desc = f"{type(packaged_graph).__name__} with {packaged_graph.number_of_nodes()} nodes"
has_map = "node_permutation_old_to_new" in decoder.read_node_permutation_map()

print(f"read_graph() -> {graph_desc}")
print(f"read_metadata() -> {decoder.read_metadata()}")
print(f"read_node_permutation_map() has old_to_new: {has_map}")

read_graph() -> Graph with 1024 nodes
read_metadata() -> {'generator': 'gerrychain', 'proposal': 'recom', 'epsilon': 0.05, 'seed': 1234, 'created_by': 'me', 'created_at': '1970-01-01T00:00:00', 'description': 'ReCom ensemble on a 32x32 grid, MLC-reordered.'}
read_node_permutation_map() has old_to_new: True

The generic accessors reach any asset by name:

read_asset_bytes(name) → raw bytes,
read_json_asset(name) → parsed JSON.

Note: read_json_asset("graph.json") gives you the raw adjacency dict, in case you want the JSON rather than the rebuilt NetworkX object.

graph_keys = list(decoder.read_json_asset("graph.json").keys())

print(f"readme.txt   -> {decoder.read_asset_bytes('readme.txt')}")
print(f"params.json  -> {decoder.read_json_asset('params.json')}")
print(f"tracts.gpkg  -> {decoder.read_asset_bytes('tracts.gpkg')}")
print(f"graph.json (raw dict) top-level keys: {graph_keys}")

readme.txt   -> b'ReCom ensemble on a 32x32 grid, MLC-reordered.'
params.json  -> {'node_repeats': 2}
tracts.gpkg  -> b'GPKG\x00stand-in geospatial bytes'
graph.json (raw dict) top-level keys: ['directed', 'multigraph', 'graph', 'nodes', 'adjacency']

Inspecting a bundle¶

Before (or instead of) reading payloads, you can inspect structure. This is handy for tooling, debugging, or deciding whether a file is what you think it is:

version() → (major, minor) format version,
is_complete() → was it finalized cleanly,
assignment_format() → "ben" or "xben",
stream_size() → byte length of the embedded stream, straight from the header,
asset_size(name) → stored byte length of one asset, straight from the directory (the compressed size for xz-flagged assets),
asset_names() → directory names in order,
list_assets() → the full directory: name, type, offset, len, flag tags,
len(dec) / count_samples() → number of plans in the stream.

decoder = BendlDecoder("example_data/mlc_reordered.bendl")
print(f"format version:    {decoder.version()}")
print(f"is_complete:       {decoder.is_complete()}")
print(f"assignment_format: {decoder.assignment_format()}")
print(f"stream_size:       {decoder.stream_size()} bytes")
print(f"sample count:      {len(decoder)}")
print(f"asset_names:       {decoder.asset_names()}")
print("full directory:")
for entry in decoder.list_assets():
    print(f"    {entry}")

format version:    (1, 0)
is_complete:       True
assignment_format: ben
stream_size:       40143 bytes
sample count:      1000
asset_names:       ['graph.json', 'node_permutation_map.json', 'metadata.json', 'readme.txt', 'params.json', 'tracts.gpkg']
full directory:
    {'name': 'graph.json', 'type': 2, 'offset': 64, 'len': 6788, 'flags': ['json', 'xz', 'checksum']}
    {'name': 'node_permutation_map.json', 'type': 3, 'offset': 6852, 'len': 2964, 'flags': ['json', 'xz', 'checksum']}
    {'name': 'metadata.json', 'type': 1, 'offset': 9816, 'len': 201, 'flags': ['json', 'checksum']}
    {'name': 'readme.txt', 'type': 4, 'offset': 10017, 'len': 46, 'flags': ['checksum']}
    {'name': 'params.json', 'type': 4, 'offset': 50396, 'len': 19, 'flags': ['json', 'checksum']}
    {'name': 'tracts.gpkg', 'type': 4, 'offset': 50648, 'len': 30, 'flags': ['checksum']}

Trust, but verify¶

Notice the 'checksum' flag on every directory entry above: the writer checksums each asset and the embedded stream (CRC32C) as it goes. Reading, iterating, and subsampling do not re-verify those checksums — a partial read cannot prove a whole-stream checksum, and you do not want to pay for full verification on every loop. When integrity actually matters (you just downloaded a bundle, or you are about to archive one), ask for it explicitly. verify() checks every asset checksum and the stream checksum against the raw bytes on disk, and raises on the first mismatch:

decoder = BendlDecoder("example_data/mlc_reordered.bendl")
decoder.verify()
print("Every byte accounted for; assets and stream both check out!")

Every byte accounted for; assets and stream both check out!
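For readers curious what a CRC32C check involves, the following is a plain, bit-at-a-time implementation of the CRC-32C (Castagnoli) checksum. It is an illustrative sketch only: the library's own implementation is not shown in this tutorial and is presumably much faster, and the `crc32c` function here is our own.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; fold in the polynomial when the low bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


# 0xE3069283 is the standard CRC-32C check value for b"123456789".
assert crc32c(b"123456789") == 0xE3069283

payload = b"GPKG\x00stand-in geospatial bytes"
stored_checksum = crc32c(payload)      # what a writer would record
corrupted = b"X" + payload[1:]         # one byte damaged in transit

print(crc32c(payload) == stored_checksum)    # True
print(crc32c(corrupted) == stored_checksum)  # False
```

Verification is a full pass over the bytes, which is why it makes sense as an explicit step rather than something every read pays for.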

Iterating the stream and reconstructing plans¶

A BendlDecoder iterates its embedded stream, yielding each assignment as a list[int]. Combined with read_graph(), that is enough to rebuild GerryChain Partitions straight from the bundle, which eliminates the need to remember the correct graph file or node order.

import pandas as pd

decoder = BendlDecoder("example_data/mlc_reordered.bendl")
packaged_graph = decoder.read_graph()
order = pd.Index(packaged_graph.nodes)  # matches the written assignment order

cut_edge_counts = []
for assignment in decoder:
    partition = Partition(
        packaged_graph,
        assignment=pd.Series(assignment, index=order),
        updaters={"cut_edges": updaters.cut_edges},
    )
    cut_edge_counts.append(len(partition["cut_edges"]))

print(f"reconstructed {len(cut_edge_counts)} partitions from the bundle alone")
print(f"first five cut-edge counts: {cut_edge_counts[:5]}")

reconstructed 1000 partitions from the bundle alone
first five cut-edge counts: [96, 88, 113, 125, 113]
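The cut-edge count itself needs nothing more than an assignment list, the matching node order, and the edge list. The sketch below shows this on a hypothetical 4×4 grid built by hand, without GerryChain; all names are illustrative.

```python
SIDE = 4

# Fixed node order (row-major) and the grid's edges.
nodes = [(row, col) for row in range(SIDE) for col in range(SIDE)]
edges = []
for row, col in nodes:
    if col + 1 < SIDE:
        edges.append(((row, col), (row, col + 1)))  # horizontal neighbour
    if row + 1 < SIDE:
        edges.append(((row, col), (row + 1, col)))  # vertical neighbour

# An assignment as it would come off a stream: one label per node, in node order.
assignment = [col // 2 for _row, col in nodes]  # two vertical stripes

district_of = dict(zip(nodes, assignment))
cut_edges = [(u, v) for u, v in edges if district_of[u] != district_of[v]]

print(len(edges), len(cut_edges))  # 24 4
```

The `zip(nodes, assignment)` step is where a wrong node order would silently produce wrong answers, which is the problem that storing the graph alongside the stream avoids.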

Subsampling¶

When winnowing a large ensemble you rarely want every plan. BendlDecoder has three native subsamplers; each returns the decoder set up to yield only the chosen plans, so you still just iterate. Indices are 1-based (plan 1 is the first sample):

subsample_indices([...]) — exactly these 1-based indices (sorted, unique),
subsample_range(start, end) — the 1-based inclusive range [start, end],
subsample_every(step, offset=1) — every step-th plan starting at offset (“thinning”).

bundle_file = "example_data/mlc_reordered.bendl"
decoder = BendlDecoder(bundle_file)  # one decoder, reused for every subsample below

picked = [assignment[:4] for assignment in decoder.subsample_indices([1, 500, 1000])]
print(f"indices [1, 500, 1000] -> {picked}")

ranged = [assignment[:4] for assignment in decoder.subsample_range(100, 104)]
print(f"range(100, 104)        -> {ranged}")  # plans 100..104 inclusive = 5 plans

print(f"every 250th            -> {sum(1 for _ in decoder.subsample_every(250))} plans")

# The same decoder rewinds and re-selects on each call, so you can run subsamples
# repeatedly without building a new decoder:
again = [assignment[:4] for assignment in decoder.subsample_indices([1, 500, 1000])]
print(f"indices again          -> {again}")

indices [1, 500, 1000] -> [[2, 2, 2, 2], [1, 1, 1, 1], [3, 3, 3, 3]]
range(100, 104)        -> [[3, 2, 3, 3], [3, 3, 3, 3], [3, 3, 3, 3], [3, 3, 3, 3], [1, 1, 1, 1]]
every 250th            -> 4 plans
indices again          -> [[2, 2, 2, 2], [1, 1, 1, 1], [3, 3, 3, 3]]
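The 1-based conventions can be modeled on a plain Python list, which makes the off-by-one translation explicit. These functions are hypothetical stand-ins that share names with the decoder methods; they assume that "every step-th plan starting at offset" means plans offset, offset + step, and so on, which is consistent with the 4 plans reported above.

```python
def subsample_indices(plans, indices):
    """Exactly these 1-based indices."""
    return [plans[i - 1] for i in indices]


def subsample_range(plans, start, end):
    """The 1-based inclusive range [start, end]."""
    return plans[start - 1 : end]


def subsample_every(plans, step, offset=1):
    """Every step-th plan, starting at the 1-based offset."""
    return plans[offset - 1 :: step]


plans = list(range(1, 1001))  # stand-in ensemble: plan k is just the integer k

picked = subsample_indices(plans, [1, 500, 1000])
ranged = subsample_range(plans, 100, 104)
thinned = subsample_every(plans, 250)

print(picked)   # [1, 500, 1000]
print(ranged)   # [100, 101, 102, 103, 104]
print(thinned)  # [1, 251, 501, 751]
```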

Extracting the raw stream¶

Sometimes you want the bare assignment stream back out, e.g. to hand it to the plain-stream tools or a different pipeline. extract_stream copies the embedded stream region verbatim to a standalone .ben/.xben file, which the stream-only BenDecoder (covered in the streams notebook) opens directly:

from binary_ensemble import BenDecoder

decoder = BendlDecoder("example_data/mlc_reordered.bendl")
decoder.extract_stream("example_data/extracted.ben", overwrite=True)

# Open the extracted file with the plain stream decoder (mode matches the bundle).
ben = BenDecoder("example_data/extracted.ben", mode=decoder.assignment_format())
print(f"extracted stream yields {sum(1 for _ in ben)} plans")

extracted stream yields 1000 plans

Appending analysis back onto the bundle¶

A finished, finalized bundle is not frozen. BendlEncoder.append(path) reopens it to add more assets later — say, the cut-edge summary we just computed, so the analysis travels with the plans it describes. The stream is not re-opened (it is already written); each add_* commits immediately to disk:

Note: Assets can also be removed with remove_asset(name). Since the asset name becomes free again, remove-then-add is the way to replace an asset (e.g. to update metadata.json).

appender = BendlEncoder.append("example_data/mlc_reordered.bendl")
appender.add_asset(
    "cut_edge_summary.json",
    {
        "mean": sum(cut_edge_counts) / len(cut_edge_counts),
        "min": min(cut_edge_counts),
        "max": max(cut_edge_counts),
    },
    "json",
)

decoder = BendlDecoder("example_data/mlc_reordered.bendl")
print(f"assets after append: {decoder.asset_names()}")
print(f"appended summary: {decoder.read_json_asset('cut_edge_summary.json')}")

assets after append: ['graph.json', 'node_permutation_map.json', 'metadata.json', 'readme.txt', 'params.json', 'tracts.gpkg', 'cut_edge_summary.json']
appended summary: {'mean': 133.274, 'min': 88, 'max': 189}

Assets-only bundles (no stream)¶

Technically, the BENDL format does not require an ensemble stream at all, so it is possible to make an assets-only bundle. This is useful for shipping a graph + metadata package on its own. The “stream” section of such a bundle still decodes to an empty iteration with len == 0, rather than raising a spurious “missing stream” error:

encoder = BendlEncoder("example_data/assets_only.bendl", overwrite=True)
encoder.add_graph(GRAPH_PATH, sort=None)
encoder.add_metadata({"note": "graph package, no plans"})
encoder.close()  # no stream was opened, so finalize explicitly

decoder = BendlDecoder("example_data/assets_only.bendl")
print(
    f"assets-only: is_complete = {decoder.is_complete()} "
    f"| len = {len(decoder)} | assets = {decoder.asset_names()}"
)

assets-only: is_complete = True | len = 0 | assets = ['graph.json', 'metadata.json']

Recompressing to XBEN for archival¶

BEN is the best working format for an active project, but when the project wraps up, XBEN is a better compression mechanism. We don’t use it as a working format since the CPU overhead of encoding is significant, but the savings are substantial enough to be worth the cost of encoding once (fortunately, decoding is very fast from XBEN). compress_stream repackages a bundle’s BEN stream as XBEN, preserving every asset: graph, metadata, permutation map, custom blobs. Pass out_file=... to write a new bundle and leave the original untouched (overwrite=True replaces an existing copy); with no out_file the bundle is recompressed in place, atomically.

# Write a fresh XBEN copy, original preserved. overwrite=True replaces any copy
# left behind by a previous run of this notebook.
compress_stream(
    "example_data/mlc_reordered.bendl",
    out_file="example_data/archive.bendl",
    overwrite=True,
)

xben_decoder = BendlDecoder("example_data/archive.bendl")
n_plans_before = len(BendlDecoder("example_data/mlc_reordered.bendl"))
print(f"recompressed format: {xben_decoder.assignment_format()}")
print(f"assets preserved:    {xben_decoder.asset_names()}")
print(f"metadata preserved:  {xben_decoder.read_metadata()}")
print(f"plans unchanged:     {len(xben_decoder)} == {n_plans_before}")

recompressed format: xben
assets preserved:    ['graph.json', 'node_permutation_map.json', 'metadata.json', 'readme.txt', 'params.json', 'tracts.gpkg', 'cut_edge_summary.json']
metadata preserved:  {'generator': 'gerrychain', 'proposal': 'recom', 'epsilon': 0.05, 'seed': 1234, 'created_by': 'me', 'created_at': '1970-01-01T00:00:00', 'description': 'ReCom ensemble on a 32x32 grid, MLC-reordered.'}
plans unchanged:     1000 == 1000

/tmp/claude-1000/ipykernel_2893316/2142953642.py:9: UserWarning: XBEN may take a second to start decoding.
  xben_decoder = BendlDecoder("example_data/archive.bendl")

Note: Bundles with XBEN streams emit a one-time startup warning on decode, since opening them does real decompression work. The in-place mode writes to a temp file and swaps it over the original only on success, so an interrupted run cannot corrupt the bundle.
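The write-to-temp-then-swap pattern mentioned in the note is a general technique, sketched here with the standard library. This is not binary-ensemble's implementation; `atomic_rewrite` is a hypothetical helper, and the sketch relies on `os.replace` being atomic when the temp file lives on the same filesystem as the target.

```python
import os
import tempfile
from pathlib import Path


def atomic_rewrite(path, transform):
    """Replace the contents of `path` with transform(old_bytes), all or nothing."""
    path = Path(path)
    # Create the temp file next to the target so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(transform(path.read_bytes()))
        os.replace(tmp_name, path)  # the swap happens only after a complete write
    except BaseException:
        os.unlink(tmp_name)  # clean up; the original file was never touched
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "demo.bin"
    target.write_bytes(b"original")

    def failing(_data):
        raise RuntimeError("simulated failure")

    try:
        atomic_rewrite(target, failing)
    except RuntimeError:
        pass
    after_failure = target.read_bytes()

    atomic_rewrite(target, bytes.upper)
    after_success = target.read_bytes()
    leftovers = sorted(p.name for p in Path(workdir).iterdir())

print(after_failure, after_success, leftovers)
```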

Lifecycle and failure semantics¶

One last guarantee, subtle but important: if an exception escapes the ben_stream() context (the chain throws, the write logic has a bug, the process dies partway, and so on), the bundle is left unfinalized rather than stamped complete over a half-written stream. You can detect this (is_complete() is False) and still recover everything that was written via extract_stream(..., allow_unfinalized=True):

encoder = BendlEncoder("example_data/partial.bendl", overwrite=True)
stored_graph = encoder.add_graph(GRAPH_PATH, sort=None)
gc_graph = Graph.from_networkx(stored_graph)
write_order = list(gc_graph.nodes)
try:
    with encoder.ben_stream() as stream:
        for i, partition in enumerate(make_chain(gc_graph, steps=1000)):
            if i == 50:
                raise RuntimeError("simulated crash mid-stream")
            series = partition.assignment.to_series()
            stream.write(series.loc[write_order].astype(int).tolist())
except RuntimeError as e:
    print(f"caught: {e}")

decoder = BendlDecoder("example_data/partial.bendl")
print(f"is_complete: {decoder.is_complete()} (left unfinalized, as intended)")
decoder.extract_stream("example_data/partial.ben", overwrite=True, allow_unfinalized=True)
recovered = sum(1 for _ in BenDecoder("example_data/partial.ben", mode="ben"))
print(f"recovered {recovered} plans written before the crash")

caught: simulated crash mid-stream
is_complete: False (left unfinalized, as intended)
recovered 50 plans written before the crash
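The "finalize only on a clean exit" behaviour follows a common context-manager pattern, shown below with a toy class. `ToyStream` is purely illustrative and assumes nothing about how the real stream writer is built; it only mirrors the observable behaviour described above.

```python
class ToyStream:
    """Collects plans in memory and marks itself complete only on a clean exit."""

    def __init__(self):
        self.plans = []
        self.complete = False

    def write(self, assignment):
        self.plans.append(list(assignment))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.complete = True  # no exception escaped: safe to finalize
        return False  # never swallow the exception


clean = ToyStream()
with clean as stream:
    stream.write([0, 0, 1, 1])

crashed = ToyStream()
try:
    with crashed as stream:
        stream.write([0, 0, 1, 1])
        raise RuntimeError("simulated crash mid-stream")
except RuntimeError:
    pass

print(clean.complete, crashed.complete, len(crashed.plans))  # True False 1
```

The plans written before the failure are still there; only the "complete" stamp is withheld, which is what makes recovery possible.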

Recap — when to reach for what¶

BendlEncoder / BendlDecoder are the default for storing an ensemble: one self-describing file, graph + metadata included, encoded live as the chain runs. Only the ben_stream() writer needs a with block — closing it finalizes the bundle (use close() for an assets-only bundle).
add_graph(graph) before the stream (MLC-reordered by default; pass sort="rcm", sort="key", key="GEOID", or sort=None for raw), then build the chain on the returned graph — a compression win and a write order that already matches the stored graph.
add_metadata / add_asset to stamp provenance and ship anything else alongside the plans ("json" takes dicts directly, "binary" takes raw bytes, "file" reads a path off disk); append to add results to a finished bundle, and remove_asset to drop one (remove-then-add replaces an asset).
verify() when integrity matters: after a download, before an archive.
relabel_bundle to reorder an existing BEN-stream bundle and rewrite its stream to match (in place or to a new file), e.g. to optimize a bundle you received raw before archiving it.
binary_ensemble.graph.reorder* when you want the reordering standalone, e.g. to reuse one ordering across several bundles.
compress_stream to graduate an active bundle from an embedded BEN stream to an embedded XBEN stream without losing any asset.
The plain stream layer (via extract_stream and the streams notebook) is there for when you specifically need the bare stream and are tracking the graph and node order yourself.
