Working with .bendl files¶
BENDL is a portmanteau of “BEN” (Binary-ENsemble) and “bundle”, and is the generally recommended format for storage and transmission of ensembles of districting plans. It is intended to be a single self-describing file that contains a districting ensemble, its associated graph, and any metadata / accompanying artifacts that the user would like to include.
The BENDL format was created to alleviate two common pain points for redistricting analysts:
How do you store millions of districting plans efficiently?
Which graph did you use to make that ensemble again? There are like 7 in this folder…
The purpose of this tutorial is to re-work one of the classic ReCom tutorials from
GerryChain, but with the additional support of
binary-ensemble, to demonstrate the intended workflow.
Note: There is also a companion notebook on the plain BEN/XBEN streams: the central layer of every bundle that encodes a redistricting ensemble.
What goes in a bundle?¶
A plain .ben file is just the assignment stream: a long sequence of districting plans and
nothing else. A .bendl file wraps that stream together with assets:
the dual graph (graph.json), so the file explains itself;
an optional node_permutation_map.json, recording any reordering applied to the graph for better compression;
metadata.json, for run provenance (seed, parameters, generator, …);
arbitrary custom assets: notes, analysis results, plots, even geospatial blobs.
Keeping everything in one file means an ensemble can be shared and reproduced with no side files, the chain parameters travel with the plans, and analysis results can be appended to a finished bundle later without rewriting the stream. Bundles also have a natural lifecycle: work in BEN (fast) while a project is active, then recompress to XBEN for long-term archival with every asset preserved.
Setup¶
First we need a dual graph to draw plans on. Rather than download a multi-megabyte
real-world graph, we generate a SIDE × SIDE grid (here 32×32 = 1024 nodes): big enough to
behave like a real ensemble, small enough to run in seconds, and fully reproducible. Each
node gets unit population (TOTPOP = 1) and an initial district label of vertical
stripes, which gives ReCom a contiguous, balanced starting partition.
Then we deliberately shuffle the node order. Real-world dual graphs rarely arrive in a
compression-friendly order (census blocks listed by GEOID, nodes in whatever order the
shapefile happened to have), so the stored order usually has no relationship to graph
locality. Shuffling reproduces that situation, and it is exactly the one where reordering
before encoding pays off, as we will see below. The graph is written out as NetworkX
adjacency JSON under example_data/, the shape a bundle stores.
import json
import random
import networkx as nx
from pathlib import Path
Path("example_data").mkdir(exist_ok=True)
SIDE, N_DISTRICTS = 32, 4 # 1024 nodes; SIDE must be divisible by N_DISTRICTS
GRAPH_PATH = Path("example_data/grid.json")
def build_grid_graph(side, n_districts, shuffle_seed=0):
    """A side*side grid with unit population, stripe districts, and a shuffled order."""
    g = nx.grid_2d_graph(side, side)
    g = nx.convert_node_labels_to_integers(g, ordering="sorted")  # row-major ints
    cols_per_district = side // n_districts
    for node in g.nodes:
        _row, col = divmod(node, side)
        g.nodes[node]["TOTPOP"] = 1
        g.nodes[node]["district"] = col // cols_per_district
    # Rebuild with nodes inserted in a random order, so the *stored* order has no
    # spatial locality (attributes and edges are preserved untouched).
    shuffled = list(g.nodes)
    random.Random(shuffle_seed).shuffle(shuffled)
    h = nx.Graph()
    h.add_nodes_from((node, g.nodes[node]) for node in shuffled)
    h.add_edges_from(g.edges)
    return h
grid = build_grid_graph(SIDE, N_DISTRICTS)
GRAPH_PATH.write_text(json.dumps(nx.readwrite.json_graph.adjacency_data(grid)))
print(f"graph file: {GRAPH_PATH} ({GRAPH_PATH.stat().st_size} bytes, {SIDE * SIDE} nodes)")
graph file: example_data/grid.json (95290 bytes, 1024 nodes)
The public surface¶
Everything bundle-related is re-exported from the top-level package, but it lives in two submodules:
binary_ensemble.bundle: BendlEncoder, BendlDecoder, compress_stream
binary_ensemble.graph: reorder, reorder_multi_level_cluster, reorder_reverse_cuthill_mckee, reorder_by_key
Note: The plain-stream BenEncoder/BenDecoder and the whole-file encode_*/decode_* codec helpers are covered in the streams notebook.
from binary_ensemble import BendlDecoder, BendlEncoder, compress_stream
from binary_ensemble import graph as bgraph
Adding in GerryChain¶
As per usual, our example ReCom chain will use a ReCom proposal and a couple of updaters. The standard chain recipe is independent of how the nodes are ordered, so we factor it into a helper that builds a fresh chain on whatever graph we hand it. We will call this once per bundle and stream each plan to disk as the chain produces it.
from functools import partial
from gerrychain import Graph, MarkovChain, Partition, accept, updaters
from gerrychain.proposals import recom
def make_chain(gc_graph, steps):
    """Build a fresh ReCom MarkovChain over ``gc_graph`` (a gerrychain.Graph)."""
    chain_updaters = {
        "population": updaters.Tally("TOTPOP", alias="population"),
        "cut_edges": updaters.cut_edges,
    }
    initial = Partition(gc_graph, assignment="district", updaters=chain_updaters)
    ideal_pop = sum(initial["population"].values()) / len(initial)
    return MarkovChain(
        proposal=partial(
            recom, pop_col="TOTPOP", pop_target=ideal_pop, epsilon=0.05, node_repeats=2
        ),
        constraints=[],
        accept=accept.always_accept,
        initial_state=initial,
        total_steps=steps,
    )
Writing your first BENDL¶
When working with the BENDL format, the general workflow is as follows:
create the encoder and add the graph (and any other assets),
open the single-use ben_stream(...) in a with block,
loop through the chain and write each plan to the stream,
when the with enc.ben_stream(...) block exits, the bundle is finalized on disk.
It is important to note that, when writing to a BENDL file, every assignment must be in a fixed, known node order. GerryChain makes no ordering promise, so we pin the order to the graph’s node iteration order and reindex each plan to it.
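The reindexing step can be pictured without any library at all. The following is a minimal sketch, assuming a plan arrives as a plain dict from node to district label; the helper name to_fixed_order is hypothetical and is not part of binary-ensemble or GerryChain.

```python
def to_fixed_order(assignment_by_node, node_order):
    """Return district labels as a list aligned with node_order.

    Hypothetical helper for illustration only: it pins a dict-style plan
    to one fixed node order, which is what a stream writer needs.
    """
    missing = [node for node in node_order if node not in assignment_by_node]
    if missing:
        raise KeyError(f"{len(missing)} node(s) missing from the assignment")
    if len(assignment_by_node) != len(node_order):
        raise ValueError("assignment contains nodes that are not in node_order")
    return [int(assignment_by_node[node]) for node in node_order]


# The stored order is whatever the graph iterates in; here it is deliberately scrambled.
node_order = [2, 0, 3, 1]
plan = {0: 1, 1: 0, 2: 1, 3: 0}

row = to_fixed_order(plan, node_order)
print(row)  # [1, 1, 0, 0]
```

The same plan written in a different node order would produce a different list, which is why the order has to be pinned once and reused for every write.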
To take full advantage of the BENDL format, we need to first store the graph that we intend to pass
to our Markov chain in the bundle itself. The BendlEncoder class provides a sort parameter that
you can modify to make the ensemble storage more efficient (more on that later).
The add_graph method accepts most of the standard formats that you would expect:
a networkx.Graph instance (subclasses such as gerrychain.Graph count),
adjacency-format JSON as a parsed dict or list, raw bytes, or a file-like object with .read(),
a str/os.PathLike path to a JSON file. A plain str is interpreted as a path.
The add_graph method also returns the embedded graph (as a NetworkX graph) for immediate use,
e.g. for building the GerryChain graph, and the write order is guaranteed to match what gets stored.
For this first bundle we pass sort=None to store the graph in its raw order. Shortly, we will show
some optional pre-processing steps that can decrease the bundle size.
encoder = BendlEncoder("example_data/basic.bendl", overwrite=True)
stored_graph = encoder.add_graph(GRAPH_PATH, sort=None)
gc_graph = Graph.from_networkx(stored_graph)
node_order = list(gc_graph.nodes) # the order stored == the order we write assignments in
with encoder.ben_stream() as stream:  # only the stream is context-managed
    for partition in make_chain(gc_graph, steps=1000):
        series = partition.assignment.to_series()
        stream.write(series.loc[node_order].astype(int).tolist())
# the stream context has closed, so the bundle is finalized and the stream can no longer be written to
print("wrote example_data/basic.bendl")
wrote example_data/basic.bendl
Because we embedded the graph before the stream, the encoder knows the node count and
checks every write against it. A wrong-length assignment raises immediately instead of
silently corrupting the file, and since the exception escapes the stream context, the
bundle is left unfinalized rather than stamped complete (more on that at the end):
encoder = BendlEncoder("example_data/willfail.bendl", overwrite=True)
encoder.add_graph(GRAPH_PATH, sort=None)
try:
    with encoder.ben_stream() as stream:
        stream.write([0, 1, 2])  # too short
except ValueError as e:
    print(f"rejected as expected: {e}")
rejected as expected: assignment length 3 does not match graph node count 1024
Reordering for compression (the default)¶
BEN and XBEN compress runs of equal adjacent labels, so a node ordering that keeps
neighbouring nodes near each other in the stream compresses much better. Our grid’s stored
order is shuffled, which makes the raw basic.bendl above close to a worst case. Fixing
this is the encoder’s default behaviour: add_graph reorders the graph with multi-level
clustering (sort="mlc") unless you opt out with sort=None. Reordering:
reorders the graph: sort="mlc" (default), sort="rcm", or sort="key" with key="<attribute>" (e.g. key="GEOID") to sort by a node attribute,
stores both the reordered graph.json and a node_permutation_map.json,
and returns the reordered graph.
That last point is what makes the workflow tidy: build the entire ReCom chain on the returned graph, and the chain’s natural node order already equals the stored order, so streaming needs no extra bookkeeping.
Note: For graphs composed of census blocks, sorting by the “GEOID” attribute generally produces the best compression.
Note: Reordering is pre-stream only (it decides the write order), so add_graph(...) must come before ben_stream().
This will be the “real” bundle for the rest of the tutorial, so we also stamp in metadata and a custom asset while we are here.
from datetime import datetime
encoder = BendlEncoder("example_data/mlc_reordered.bendl", overwrite=True)
# add_graph reorders with MLC by default; build the chain on the returned graph.
reordered_graph = encoder.add_graph(GRAPH_PATH)
gc_graph = Graph.from_networkx(reordered_graph)
write_order = list(gc_graph.nodes)
# Provenance + extra assets (covered in detail in the next section).
encoder.add_metadata(
    {
        "generator": "gerrychain",
        "proposal": "recom",
        "epsilon": 0.05,
        "seed": 1234,
        "created_by": "me",
        "created_at": datetime(1970, 1, 1).isoformat(),
        "description": "ReCom ensemble on a 32x32 grid, MLC-reordered.",
    }
)
encoder.add_asset("readme.txt", "ReCom ensemble on a 32x32 grid, MLC-reordered.", "text")
with encoder.ben_stream() as stream:
    for partition in make_chain(gc_graph, steps=1000):
        series = partition.assignment.to_series()
        stream.write(series.loc[write_order].astype(int).tolist())
print("wrote example_data/mlc_reordered.bendl")
wrote example_data/mlc_reordered.bendl
Did reordering actually help?¶
It is tempting to just compare basic.bendl against mlc_reordered.bendl, but that is not a fair
fight: they hold different ensembles (each was streamed live from its own independent ReCom
run), so their stream sizes mix the ordering effect with run-to-run randomness. We’ll take a
look anyway, and then do the comparison properly. The bundle header records the exact byte
length of the embedded assignment stream, so stream_size() reads it back without decoding
or copying anything:
basic_size = BendlDecoder("example_data/basic.bendl").stream_size()
mlc_size = BendlDecoder("example_data/mlc_reordered.bendl").stream_size()
print(f"basic.bendl (raw, run A): {basic_size:>8} bytes")
print(f"mlc_reordered.bendl (mlc, run B): {mlc_size:>8} bytes")
basic.bendl (raw, run A): 135872 bytes
mlc_reordered.bendl (mlc, run B): 40143 bytes
For an apples-to-apples measurement we need the same plans in two orderings. We can get
that without running a second chain by relabeling basic.bendl’s exact ensemble into MLC
order. relabel_bundle does the whole thing in one call: it reorders the stored graph,
rewrites every assignment into the new node order, and stores a
node_permutation_map.json so the change stays reversible (metadata and custom assets come
along too). It is the bundle-level form of the CLI’s ben relabel ordering step:
from binary_ensemble import relabel_bundle
# overwrite=True replaces any copy left behind by a previous run of this notebook.
relabel_bundle(
    "example_data/basic.bendl",
    out_file="example_data/basic_mlc_relabeled.bendl",
    sort="mlc",
    overwrite=True,
)
raw_bytes = BendlDecoder("example_data/basic.bendl").stream_size()
mlc_bytes = BendlDecoder("example_data/basic_mlc_relabeled.bendl").stream_size()
print(f"same ensemble, raw order: {raw_bytes:>8} bytes")
print(f"same ensemble, MLC order: {mlc_bytes:>8} bytes")
print(f"-> {raw_bytes / mlc_bytes:.1f}x smaller from reordering alone")
same ensemble, raw order: 135872 bytes
same ensemble, MLC order: 40256 bytes
-> 3.4x smaller from reordering alone
Now the only thing that changed is the node ordering, so that ratio is the real compression win from MLC. On more complicated dual graphs, the savings can be very significant, and they matter most right before an expensive XBEN recompress.
Note: On a graph that already arrives in a locality-friendly order the gain is smaller, and the extra node_permutation_map.json can even make a tiny file net-larger. Reordering is cheap and rarely hurts, though, so the encoder does it unless you ask for raw order with sort=None.
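To build intuition for why ordering matters, it helps to count runs of equal adjacent labels directly. The sketch below is a toy model, not the BEN encoder: it assumes only that fewer runs means less to store, and it uses a column-major walk of the grid as a stand-in for a locality-friendly order (it is not MLC). The helper count_runs is a name introduced here for illustration.

```python
import random


def count_runs(labels):
    """Number of maximal runs of equal adjacent labels in a sequence."""
    if not labels:
        return 0
    return 1 + sum(1 for a, b in zip(labels, labels[1:]) if a != b)


SIDE, N_DISTRICTS = 32, 4
cols_per_district = SIDE // N_DISTRICTS

# Column-major walk: every node of column 0, then column 1, and so on.
column_major = [(row, col) for col in range(SIDE) for row in range(SIDE)]
# Vertical-stripe plan, matching the tutorial's starting partition.
stripe = {(row, col): col // cols_per_district for (row, col) in column_major}

ordered_labels = [stripe[node] for node in column_major]

shuffled = column_major[:]
random.Random(0).shuffle(shuffled)
shuffled_labels = [stripe[node] for node in shuffled]

runs_ordered = count_runs(ordered_labels)
runs_shuffled = count_runs(shuffled_labels)
print(f"locality-friendly order: {runs_ordered} runs")
print(f"shuffled order:          {runs_shuffled} runs")
```

In the locality-friendly order the stripe plan collapses to one run per district, while the shuffled order breaks the same plan into hundreds of short runs. Real ReCom plans are less tidy than stripes, so the gap on a real ensemble is smaller than in this toy case.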
Reordering under the hood: the standalone utilities¶
add_graph(..., sort=..., key=...) is built on the binary_ensemble.graph utilities,
which you can also call directly. This is handy when you want to compute an ordering once
and reuse it across several bundles, or inspect the permutation map before committing to
anything. Each returns (reordered_graph, node_permutation_map): a live NetworkX graph
plus the map dict.
reordered, permutation_map = bgraph.reorder(GRAPH_PATH, sort="rcm")
print(f"reorder(sort='rcm') -> {type(reordered).__name__} with {reordered.number_of_nodes()} nodes")
# Sort by a node attribute with sort="key" + key=... (on real data this is how you'd order by,
# say, "GEOID"; here the grid only has "district"/"id"):
graph_mlc, _ = bgraph.reorder(GRAPH_PATH, sort="mlc")
graph_rcm, _ = bgraph.reorder(GRAPH_PATH, sort="rcm")
graph_by_district, _ = bgraph.reorder(GRAPH_PATH, sort="key", key="district")
# reorder_multi_level_cluster / reorder_reverse_cuthill_mckee / reorder_by_key are thin convenience
# wrappers over these.
print("orderings: sort='mlc', sort='rcm', or sort='key' with key='<attribute>'")
# The permutation map is what makes a reordering reversible: its required field
# `node_permutation_old_to_new` maps original 0-based node positions -> new ones.
old_to_new = permutation_map["node_permutation_old_to_new"]
is_bijection = sorted(old_to_new.values()) == list(range(reordered.number_of_nodes()))
provenance = {k: permutation_map[k] for k in ("ordering_method", "key")}
print(f"old_to_new is a bijection over [0, n): {is_bijection}")
print(f"provenance fields: {provenance}")
reorder(sort='rcm') -> Graph with 1024 nodes
orderings: sort='mlc', sort='rcm', or sort='key' with key='<attribute>'
old_to_new is a bijection over [0, n): True
provenance fields: {'ordering_method': 'reverse-cuthill-mckee', 'key': None}
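What "reversible" means here can be sketched in a few lines. This is an illustration under stated assumptions: the map is treated as a plain dict from old position to new position, the helper names apply_permutation and invert_permutation are hypothetical, and the int() casts are there because keys may come back as strings after a JSON round trip.

```python
def apply_permutation(assignment, old_to_new):
    """Move the label at each old position to its new position."""
    reordered = [None] * len(assignment)
    for old_pos, new_pos in old_to_new.items():
        reordered[int(new_pos)] = assignment[int(old_pos)]
    return reordered


def invert_permutation(old_to_new):
    """Swap the direction of the map: new position -> old position."""
    return {int(new_pos): int(old_pos) for old_pos, new_pos in old_to_new.items()}


# Toy 4-node map (made-up values, not taken from a real bundle).
old_to_new = {0: 2, 1: 0, 2: 3, 3: 1}
original = [10, 11, 12, 13]

reordered = apply_permutation(original, old_to_new)
restored = apply_permutation(reordered, invert_permutation(old_to_new))

print(reordered)  # [11, 13, 10, 12]
print(restored)   # [10, 11, 12, 13]
```

Because the map is a bijection, applying it and then its inverse returns every label to its original position.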
Metadata and custom assets¶
We already used both of these while building mlc_reordered.bendl. add_metadata writes the
canonical metadata.json (provenance). add_asset stores a custom asset under a name you
choose, and the content_type tells the facade how to treat the payload:
"json": a dict/list (serialized for you) or a JSON string; the decoder auto-parses it on the way out.
"text": any UTF-8 string.
"binary": raw bytes, stored verbatim (plots, pickles, zipped shapefiles, anything).
"file": a path (str or pathlib.Path); the file’s bytes are read in and stored. This is the easy way to ship an existing file inside the bundle.
Every asset carries a CRC32C checksum, and payloads of 1 KiB or more are xz-compressed on
disk (both are transparent on read). The facade validates payloads up front, so a malformed
"json" asset is caught at write time rather than discovered by a collaborator at read
time.
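The validate-then-compress behaviour described above can be approximated with the standard library. The following is a rough sketch, not the library's implementation: prepare_asset is a hypothetical helper, the flag list is simplified to a single "xz" tag, and only the 1 KiB threshold and the up-front JSON check are taken from the text.

```python
import json
import lzma

XZ_THRESHOLD = 1024  # 1 KiB, the cutoff quoted in the text


def prepare_asset(payload, content_type):
    """Hypothetical helper: validate a payload, encode it, and xz large ones."""
    if content_type == "json":
        if isinstance(payload, str):
            try:
                json.loads(payload)  # reject malformed JSON before storing anything
            except json.JSONDecodeError as exc:
                raise ValueError(f"content_type='json' requires valid JSON: {exc}") from exc
            data = payload.encode("utf-8")
        else:
            data = json.dumps(payload).encode("utf-8")
    elif content_type == "text":
        data = payload.encode("utf-8")
    elif content_type == "binary":
        data = bytes(payload)
    else:
        raise ValueError(f"unsupported content_type: {content_type!r}")

    if len(data) >= XZ_THRESHOLD:
        return lzma.compress(data), ["xz"]
    return data, []


small_bytes, small_flags = prepare_asset({"node_repeats": 2}, "json")
big_bytes, big_flags = prepare_asset("x" * 5000, "text")
print(small_flags, len(small_bytes))
print(big_flags, len(big_bytes) < 5000)
```

Small payloads are stored as-is because compression overhead would outweigh the savings; that is the usual reason for a size threshold like this one.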
Assets may be added before or after the stream; only the stream itself is single-use.
Post-stream adds commit immediately (one directory rewrite each), so use them sparingly.
Here we tack a couple of assets onto the already-finalized mlc_reordered.bendl to show all of this
at once:
# Reopen the finalized bundle in append mode. Each add_* commits immediately,
# so there is nothing to finalize afterwards.
appender = BendlEncoder.append("example_data/mlc_reordered.bendl")
appender.add_asset("params.json", {"node_repeats": 2}, "json") # dicts are fine
# Ship an existing file (here a stand-in for, say, a geopackage) straight off disk:
Path("example_data/tracts.gpkg").write_bytes(b"GPKG\x00stand-in geospatial bytes")
appender.add_asset("tracts.gpkg", "example_data/tracts.gpkg", "file")
# Validation in action — a "json" asset that isn't JSON is rejected up front:
encoder = BendlEncoder("example_data/tmp.bendl", overwrite=True)
try:
    encoder.add_asset("bad.json", "this is not json", "json")
except ValueError as e:
    print(f"rejected as expected: {e}")
rejected as expected: content_type='json' requires valid UTF-8 JSON: Expecting value: line 1 column 1 (char 0)
Reading a bundle¶
BendlDecoder(path) opens a bundle. The canonical getters pull the well-known assets back
out in convenient form:
read_graph() → a live NetworkX graph (or None if absent),
read_metadata() → parsed metadata.json (or None),
read_node_permutation_map() → parsed map dict (or None).
The important detail: read_graph() returns the graph in the node order the assignments
were written in. Since we built the chain on the reordered graph, everything will line up with the
stream automatically:
decoder = BendlDecoder("example_data/mlc_reordered.bendl")
packaged_graph = decoder.read_graph()
graph_desc = f"{type(packaged_graph).__name__} with {packaged_graph.number_of_nodes()} nodes"
has_map = "node_permutation_old_to_new" in decoder.read_node_permutation_map()
print(f"read_graph() -> {graph_desc}")
print(f"read_metadata() -> {decoder.read_metadata()}")
print(f"read_node_permutation_map() has old_to_new: {has_map}")
read_graph() -> Graph with 1024 nodes
read_metadata() -> {'generator': 'gerrychain', 'proposal': 'recom', 'epsilon': 0.05, 'seed': 1234, 'created_by': 'me', 'created_at': '1970-01-01T00:00:00', 'description': 'ReCom ensemble on a 32x32 grid, MLC-reordered.'}
read_node_permutation_map() has old_to_new: True
The generic accessors reach any asset by name:
read_asset_bytes(name) → raw bytes,
read_json_asset(name) → parsed JSON.
Note: read_json_asset("graph.json") gives you the raw adjacency dict, in case you want the JSON rather than the rebuilt NetworkX object.
graph_keys = list(decoder.read_json_asset("graph.json").keys())
print(f"readme.txt -> {decoder.read_asset_bytes('readme.txt')}")
print(f"params.json -> {decoder.read_json_asset('params.json')}")
print(f"tracts.gpkg -> {decoder.read_asset_bytes('tracts.gpkg')}")
print(f"graph.json (raw dict) top-level keys: {graph_keys}")
readme.txt -> b'ReCom ensemble on a 32x32 grid, MLC-reordered.'
params.json -> {'node_repeats': 2}
tracts.gpkg -> b'GPKG\x00stand-in geospatial bytes'
graph.json (raw dict) top-level keys: ['directed', 'multigraph', 'graph', 'nodes', 'adjacency']
Inspecting a bundle¶
Before (or instead of) reading payloads, you can inspect structure. This is handy for tooling, debugging, or deciding whether a file is what you think it is:
version() → (major, minor) format version,
is_complete() → was it finalized cleanly,
assignment_format() → "ben" or "xben",
stream_size() → byte length of the embedded stream, straight from the header,
asset_size(name) → stored byte length of one asset, straight from the directory (the compressed size for xz-flagged assets),
asset_names() → directory names in order,
list_assets() → the full directory: name, type, offset, len, flag tags,
len(dec) / count_samples() → number of plans in the stream.
decoder = BendlDecoder("example_data/mlc_reordered.bendl")
print(f"format version: {decoder.version()}")
print(f"is_complete: {decoder.is_complete()}")
print(f"assignment_format: {decoder.assignment_format()}")
print(f"stream_size: {decoder.stream_size()} bytes")
print(f"sample count: {len(decoder)}")
print(f"asset_names: {decoder.asset_names()}")
print("full directory:")
for entry in decoder.list_assets():
    print(f"  {entry}")
format version: (1, 0)
is_complete: True
assignment_format: ben
stream_size: 40143 bytes
sample count: 1000
asset_names: ['graph.json', 'node_permutation_map.json', 'metadata.json', 'readme.txt', 'params.json', 'tracts.gpkg']
full directory:
{'name': 'graph.json', 'type': 2, 'offset': 64, 'len': 6788, 'flags': ['json', 'xz', 'checksum']}
{'name': 'node_permutation_map.json', 'type': 3, 'offset': 6852, 'len': 2964, 'flags': ['json', 'xz', 'checksum']}
{'name': 'metadata.json', 'type': 1, 'offset': 9816, 'len': 201, 'flags': ['json', 'checksum']}
{'name': 'readme.txt', 'type': 4, 'offset': 10017, 'len': 46, 'flags': ['checksum']}
{'name': 'params.json', 'type': 4, 'offset': 50396, 'len': 19, 'flags': ['json', 'checksum']}
{'name': 'tracts.gpkg', 'type': 4, 'offset': 50648, 'len': 30, 'flags': ['checksum']}
Trust, but verify¶
Notice the 'checksum' flag on every directory entry above: the writer checksums each
asset and the embedded stream (CRC32C) as it goes. Reading, iterating, and subsampling do
not re-verify those checksums — a partial read cannot prove a whole-stream checksum,
and you do not want to pay for full verification on every loop. When integrity actually
matters (you just downloaded a bundle, or you are about to archive one), ask for it
explicitly. verify() checks every asset checksum and the stream checksum against the raw
bytes on disk, and raises on the first mismatch:
decoder = BendlDecoder("example_data/mlc_reordered.bendl")
decoder.verify()
print("Every byte accounted for; assets and stream both check out!")
Every byte accounted for; assets and stream both check out!
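For readers curious about what a CRC32C check involves, here is a pure-Python sketch of the checksum and a verify-style comparison. It illustrates the general CRC-32C (Castagnoli) algorithm only; how the library computes or stores its checksums is not shown in this tutorial, and the helper names crc32c and check_payload are introduced here for illustration. Note that zlib.crc32 in the standard library is a different polynomial and would not match.

```python
def _build_crc32c_table():
    """Lookup table for the reflected Castagnoli polynomial 0x82F63B78."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _build_crc32c_table()


def crc32c(data: bytes) -> int:
    """CRC-32C of data (initial value and final XOR are both 0xFFFFFFFF)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def check_payload(data: bytes, expected: int) -> None:
    """Raise on a mismatch, mirroring the raise-on-first-failure idea."""
    actual = crc32c(data)
    if actual != expected:
        raise ValueError(f"checksum mismatch: {actual:#010x} != {expected:#010x}")


payload = b"123456789"
stored_checksum = crc32c(payload)
print(hex(stored_checksum))  # 0xe3069283, the standard CRC-32C check value

check_payload(payload, stored_checksum)  # passes silently
```

A single flipped byte changes the checksum, which is what lets a whole-file verification catch corruption that a partial read would never notice.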
Iterating the stream and reconstructing plans¶
A BendlDecoder iterates its embedded stream, yielding each assignment as a list[int].
Combined with read_graph(), that is enough to rebuild GerryChain Partitions straight
from the bundle, which eliminates the need to remember the correct graph file or node order.
import pandas as pd
decoder = BendlDecoder("example_data/mlc_reordered.bendl")
packaged_graph = decoder.read_graph()
order = pd.Index(packaged_graph.nodes) # matches the written assignment order
cut_edge_counts = []
for assignment in decoder:
    partition = Partition(
        packaged_graph,
        assignment=pd.Series(assignment, index=order),
        updaters={"cut_edges": updaters.cut_edges},
    )
    cut_edge_counts.append(len(partition["cut_edges"]))
print(f"reconstructed {len(cut_edge_counts)} partitions from the bundle alone")
print(f"first five cut-edge counts: {cut_edge_counts[:5]}")
reconstructed 1000 partitions from the bundle alone
first five cut-edge counts: [96, 88, 113, 125, 113]
Subsampling¶
When winnowing a large ensemble you rarely want every plan. BendlDecoder has three native
subsamplers; each returns the decoder set up to yield only the chosen plans, so you still
just iterate. Indices are 1-based (plan 1 is the first sample):
subsample_indices([...]): exactly these 1-based indices (sorted, unique),
subsample_range(start, end): the 1-based inclusive range [start, end],
subsample_every(step, offset=1): every step-th plan starting at offset (“thinning”).
bundle_file = "example_data/mlc_reordered.bendl"
decoder = BendlDecoder(bundle_file) # one decoder, reused for every subsample below
picked = [assignment[:4] for assignment in decoder.subsample_indices([1, 500, 1000])]
print(f"indices [1, 500, 1000] -> {picked}")
ranged = [assignment[:4] for assignment in decoder.subsample_range(100, 104)]
print(f"range(100, 104) -> {ranged}") # plans 100..104 inclusive = 5 plans
print(f"every 250th -> {sum(1 for _ in decoder.subsample_every(250))} plans")
# The same decoder rewinds and re-selects on each call, so you can run subsamples
# repeatedly without building a new decoder:
again = [assignment[:4] for assignment in decoder.subsample_indices([1, 500, 1000])]
print(f"indices again -> {again}")
indices [1, 500, 1000] -> [[2, 2, 2, 2], [1, 1, 1, 1], [3, 3, 3, 3]]
range(100, 104) -> [[3, 2, 3, 3], [3, 3, 3, 3], [3, 3, 3, 3], [3, 3, 3, 3], [1, 1, 1, 1]]
every 250th -> 4 plans
indices again -> [[2, 2, 2, 2], [1, 1, 1, 1], [3, 3, 3, 3]]
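The 1-based conventions are easy to get wrong by one, so here is a small stand-alone model of them built on itertools.islice. It is a sketch of the selection semantics as described above and as reflected in the printed counts, not the decoder's implementation; the helper names every and inclusive_range are hypothetical, and plain integers stand in for plans.

```python
from itertools import islice


def every(plans, step, offset=1):
    """Every step-th plan, starting at the 1-based position offset."""
    if step < 1 or offset < 1:
        raise ValueError("step and offset must be >= 1")
    return islice(plans, offset - 1, None, step)


def inclusive_range(plans, start, end):
    """Plans at 1-based positions start..end, both ends included."""
    if start < 1 or end < start:
        raise ValueError("need 1 <= start <= end")
    return islice(plans, start - 1, end)


plans = range(1, 1001)  # stand-in ensemble: plan k is just the integer k

thinned = list(every(plans, 250))
ranged = list(inclusive_range(plans, 100, 104))

print(thinned)  # [1, 251, 501, 751]
print(ranged)   # [100, 101, 102, 103, 104]
```

With 1000 plans and the default offset of 1, thinning by 250 selects positions 1, 251, 501, and 751, which is why the count is 4 and plan 1000 is not included.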
Extracting the raw stream¶
Sometimes you want the bare assignment stream back out, e.g. to hand it to the plain-stream
tools or a different pipeline. extract_stream copies the embedded stream region verbatim
to a standalone .ben/.xben file, which the stream-only BenDecoder (covered in the
streams notebook) opens directly:
from binary_ensemble import BenDecoder
decoder = BendlDecoder("example_data/mlc_reordered.bendl")
decoder.extract_stream("example_data/extracted.ben", overwrite=True)
# Open the extracted file with the plain stream decoder (mode matches the bundle).
ben = BenDecoder("example_data/extracted.ben", mode=decoder.assignment_format())
print(f"extracted stream yields {sum(1 for _ in ben)} plans")
extracted stream yields 1000 plans
Appending analysis back onto the bundle¶
A finished, finalized bundle is not frozen. BendlEncoder.append(path) reopens it to add
more assets later — say, the cut-edge summary we just computed, so the analysis travels
with the plans it describes. The stream is not re-opened (it is already written); each
add_* commits immediately to disk:
Note: Assets can also be removed with remove_asset(name). Since the asset name becomes free again, remove-then-add is the way to replace an asset (e.g. to update metadata.json).
appender = BendlEncoder.append("example_data/mlc_reordered.bendl")
appender.add_asset(
    "cut_edge_summary.json",
    {
        "mean": sum(cut_edge_counts) / len(cut_edge_counts),
        "min": min(cut_edge_counts),
        "max": max(cut_edge_counts),
    },
    "json",
)
decoder = BendlDecoder("example_data/mlc_reordered.bendl")
print(f"assets after append: {decoder.asset_names()}")
print(f"appended summary: {decoder.read_json_asset('cut_edge_summary.json')}")
assets after append: ['graph.json', 'node_permutation_map.json', 'metadata.json', 'readme.txt', 'params.json', 'tracts.gpkg', 'cut_edge_summary.json']
appended summary: {'mean': 133.274, 'min': 88, 'max': 189}
Assets-only bundles (no stream)¶
Technically, the BENDL format does not require an ensemble stream at all, so it is possible to make
an assets-only bundle. This is useful for shipping a graph + metadata package on its own. The
“stream” section still decodes to an empty iteration with len == 0 rather than raising a spurious
“missing stream” error:
encoder = BendlEncoder("example_data/assets_only.bendl", overwrite=True)
encoder.add_graph(GRAPH_PATH, sort=None)
encoder.add_metadata({"note": "graph package, no plans"})
encoder.close() # no stream was opened, so finalize explicitly
decoder = BendlDecoder("example_data/assets_only.bendl")
print(
    f"assets-only: is_complete = {decoder.is_complete()} "
    f"| len = {len(decoder)} | assets = {decoder.asset_names()}"
)
assets-only: is_complete = True | len = 0 | assets = ['graph.json', 'metadata.json']
Recompressing to XBEN for archival¶
BEN is the best working format for an active project, but when the project
wraps up, XBEN is a better compression mechanism. We don’t use it as a working format since
the CPU overhead of encoding is significant, but the savings are substantial enough to be worth the
cost of encoding once (fortunately, decoding is very fast from XBEN). compress_stream repackages a
bundle’s BEN stream as XBEN, preserving every asset: graph, metadata, permutation map, custom
blobs. Pass out_file=... to write a new bundle and leave the original untouched
(overwrite=True replaces an existing copy); with no out_file the bundle is recompressed in
place, atomically.
# Write a fresh XBEN copy, original preserved. overwrite=True replaces any copy
# left behind by a previous run of this notebook.
compress_stream(
    "example_data/mlc_reordered.bendl",
    out_file="example_data/archive.bendl",
    overwrite=True,
)
xben_decoder = BendlDecoder("example_data/archive.bendl")
n_plans_before = len(BendlDecoder("example_data/mlc_reordered.bendl"))
print(f"recompressed format: {xben_decoder.assignment_format()}")
print(f"assets preserved: {xben_decoder.asset_names()}")
print(f"metadata preserved: {xben_decoder.read_metadata()}")
print(f"plans unchanged: {len(xben_decoder)} == {n_plans_before}")
recompressed format: xben
assets preserved: ['graph.json', 'node_permutation_map.json', 'metadata.json', 'readme.txt', 'params.json', 'tracts.gpkg', 'cut_edge_summary.json']
metadata preserved: {'generator': 'gerrychain', 'proposal': 'recom', 'epsilon': 0.05, 'seed': 1234, 'created_by': 'me', 'created_at': '1970-01-01T00:00:00', 'description': 'ReCom ensemble on a 32x32 grid, MLC-reordered.'}
plans unchanged: 1000 == 1000
/tmp/claude-1000/ipykernel_2893316/2142953642.py:9: UserWarning: XBEN may take a second to start decoding.
xben_decoder = BendlDecoder("example_data/archive.bendl")
Note: Bundles with XBEN streams emit a one-time startup warning on decode, since opening them does real decompression work. The in-place mode writes to a temp file and swaps it over the original only on success, so an interrupted run cannot corrupt the bundle.
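The temp-file-and-swap pattern mentioned in the note is a general technique that can be sketched with the standard library. This is an illustration of the pattern only, under the assumption that the temp file lives in the same directory as the target; atomic_rewrite is a hypothetical helper, and nothing here is taken from the library's source.

```python
import os
import tempfile
from pathlib import Path


def atomic_rewrite(path, transform):
    """Replace path's contents with transform(old_bytes), all-or-nothing."""
    path = Path(path)
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(transform(path.read_bytes()))
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, path)  # the swap happens only after a full, flushed write
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)  # clean up; the original file was never touched
        raise


def failing_transform(_old_bytes):
    raise RuntimeError("simulated interruption")


with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "toy.bin"
    target.write_bytes(b"original")

    atomic_rewrite(target, bytes.upper)
    after_success = target.read_bytes()

    try:
        atomic_rewrite(target, failing_transform)
    except RuntimeError:
        pass
    after_failure = target.read_bytes()
    leftovers = sorted(p.name for p in Path(workdir).iterdir())

print(after_success, after_failure, leftovers)
```

The failed second rewrite leaves the file exactly as the first rewrite left it, with no stray temp file behind.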
Lifecycle and failure semantics¶
One last guarantee, subtle but important: if an exception escapes the ben_stream() context (e.g.
the chain throws, the write logic has a bug, the process dies partway, etc.) the bundle is left
unfinalized rather than stamped complete over a half-written stream. You can detect this
(is_complete() is False) and still recover everything that was written via
extract_stream(..., allow_unfinalized=True):
encoder = BendlEncoder("example_data/partial.bendl", overwrite=True)
stored_graph = encoder.add_graph(GRAPH_PATH, sort=None)
gc_graph = Graph.from_networkx(stored_graph)
write_order = list(gc_graph.nodes)
try:
    with encoder.ben_stream() as stream:
        for i, partition in enumerate(make_chain(gc_graph, steps=1000)):
            if i == 50:
                raise RuntimeError("simulated crash mid-stream")
            series = partition.assignment.to_series()
            stream.write(series.loc[write_order].astype(int).tolist())
except RuntimeError as e:
    print(f"caught: {e}")
decoder = BendlDecoder("example_data/partial.bendl")
print(f"is_complete: {decoder.is_complete()} (left unfinalized, as intended)")
decoder.extract_stream("example_data/partial.ben", overwrite=True, allow_unfinalized=True)
recovered = sum(1 for _ in BenDecoder("example_data/partial.ben", mode="ben"))
print(f"recovered {recovered} plans written before the crash")
caught: simulated crash mid-stream
is_complete: False (left unfinalized, as intended)
recovered 50 plans written before the crash
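The "finalize only on a clean exit" behaviour is a common context-manager pattern, and a toy version makes the mechanics concrete. The classes below are stand-ins written for this illustration (ToyBundle and toy_stream are hypothetical names); they model the semantics described above and say nothing about how the encoder is implemented internally.

```python
from contextlib import contextmanager


class ToyBundle:
    """In-memory stand-in for a bundle: a list of plans and a completeness flag."""

    def __init__(self):
        self.plans = []
        self.complete = False


@contextmanager
def toy_stream(bundle):
    """Yield a write function; mark the bundle complete only on a clean exit."""
    yield bundle.plans.append
    # Reached only if the with-body finished without raising: an exception in
    # the body is re-raised at the yield above, so this line is skipped.
    bundle.complete = True


crashed = ToyBundle()
try:
    with toy_stream(crashed) as write:
        for i in range(1000):
            if i == 50:
                raise RuntimeError("simulated crash mid-stream")
            write([0, 1, 2, 3])
except RuntimeError:
    pass

clean = ToyBundle()
with toy_stream(clean) as write:
    write([0, 1, 2, 3])

print(crashed.complete, len(crashed.plans))  # False 50
print(clean.complete, len(clean.plans))      # True 1
```

Everything written before the failure is still there, but the completeness flag is never set, so a reader can tell a partial stream from a finished one.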
Recap — when to reach for what¶
BendlEncoder/BendlDecoder are the default for storing an ensemble: one self-describing file, graph + metadata included, encoded live as the chain runs. Only the ben_stream() writer needs a with block; closing it finalizes the bundle (use close() for an assets-only bundle).
add_graph(graph) before the stream (MLC-reordered by default; pass sort="rcm", sort="key" with key="GEOID", or sort=None for raw), then build the chain on the returned graph: a compression win and a write order that already matches the stored graph.
add_metadata/add_asset to stamp provenance and ship anything else alongside the plans ("json" takes dicts directly, "binary" takes raw bytes, "file" reads a path off disk); append to add results to a finished bundle, and remove_asset to drop one (remove-then-add replaces an asset).
verify() when integrity matters: after a download, before an archive.
relabel_bundle to reorder an existing BEN-stream bundle and rewrite its stream to match (in place or to a new file), e.g. to optimize a bundle you received raw before archiving it.
binary_ensemble.graph.reorder* when you want the reordering standalone, e.g. to reuse one ordering across several bundles.
compress_stream to graduate an active bundle from an embedded BEN stream to an embedded XBEN stream without losing any asset.
The plain stream layer (via extract_stream and the streams notebook) is there for when you specifically need the bare stream and are tracking the graph and node order yourself.