Serialization & Serving Formats¶
Besides EDF/BDF, biosigIO can serialize a Recording to three columnar or array
formats: Parquet, Arrow/Feather, and Zarr. Each carries enough information to be
read back without any side files, and all of them round-trip through the same
entry point, Recording.from_file, which auto-detects the format from the file
extension (.parquet, .feather, .arrow, .zarr).
```python
from biosigio import Recording

rec = Recording.from_file('recording.edf')
rec.to_parquet('recording.parquet')  # analytics
rec.to_arrow('recording.feather')    # fast IPC
rec.to_zarr('recording.zarr')        # cloud-native serving
```
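Extension-based auto-detection can be pictured as a small lookup on the path suffix. The sketch below is only an illustration of that idea, not biosigio's actual dispatch code: the `EXTENSION_TO_FORMAT` table, the format labels, and the `detect_format` helper are hypothetical names, and only the four extensions come from the text above.

```python
from pathlib import Path

# Hypothetical lookup table for illustration; the extensions are the ones
# listed above, the labels on the right are made up for this sketch.
EXTENSION_TO_FORMAT = {
    ".parquet": "parquet",
    ".feather": "arrow",
    ".arrow": "arrow",
    ".zarr": "zarr",
}


def detect_format(path):
    """Return a format label from the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        # Fail loudly instead of guessing a format for an unknown extension.
        raise ValueError(f"unrecognized serialization extension: {suffix!r}") from None


print(detect_format("recording.parquet"))
print(detect_format("data/REC.FEATHER"))
```

Both `.feather` and `.arrow` map to the same label here because the text treats them as one Arrow/Feather format.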
When to Use Which¶
The three formats solve different jobs. Parquet and Arrow/Feather are lossless columnar tables for analysis and interchange; Zarr is a derived, downsampled and quantized serving store, not an archival source.
| Format | Best for | Lossless | Notes |
|---|---|---|---|
| Parquet | Analytics (DuckDB, Polars, pandas, Spark) | Yes | Self-describing columnar table; one file per recording |
| Arrow/Feather | Fast zero-copy inter-process communication (IPC) | Yes | Same schema as Parquet, optimized for speed over compression |
| Zarr | Cloud-native serving: one store serves viewing, inference, and training | No | Derived, downsampled and quantized serving copy; the Brain Imaging Data Structure (BIDS) and EDF stay authoritative |
Parquet and Arrow keep the full signal: channels become columns, the time index is preserved, and channels, events, and recording metadata travel in the file's schema. Use them whenever you want the recording back exactly as it was.
Zarr is the serving format. A single conversion produces one cloud-native store
that all three downstream consumers read directly from object storage: a viewer
streams a min/max render pyramid, inference reads the anti-aliased base signal,
and training streams shards of it. Because it is downsampled per modality and
stored as scaled int16 by default, it is a derived copy and not the source of
truth. The original BIDS dataset (or EDF) remains the authoritative, citable
artifact. See the Zarr store contract for the full layout.
Lossless vs Derived Round-Trip¶
The two families round-trip differently, and the difference is intentional.
Parquet and Arrow are lossless. Reading a .parquet or .feather file back
restores the Recording exactly: signals (with their time index), channels,
events, and recording metadata. The file holds everything needed to rebuild the
object, so import is a faithful inverse of export.
Zarr is reconstructed, not restored. The importer rebuilds a Recording
from the store's canonical level 0 signal, which is the anti-aliased,
per-modality downsampled inference signal, so the reconstructed sampling rate is
the store's canonical rate rather than the original acquisition rate. The
scale/offset quantization parameters are applied during dequantization, but the
Zarr-only per-channel attributes (those parameters and the
usable_for_inference flags) are not surfaced back onto the reconstructed
channels. The min/max view pyramid (view/*) is render-only and is never read on
import.
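The reason a scaled int16 copy cannot be restored exactly is easiest to see with numbers. The sketch below assumes a linear convention, value ≈ code × scale + offset, which is a common choice but is not confirmed by this page; the `quantize` and `dequantize` helpers and the example scale are hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def quantize(values, scale, offset):
    """Map floats to int16 codes, assuming value ~= code * scale + offset."""
    codes = []
    for v in values:
        code = round((v - offset) / scale)
        # Clip so every code fits in a signed 16-bit integer.
        codes.append(max(INT16_MIN, min(INT16_MAX, code)))
    return codes


def dequantize(codes, scale, offset):
    """Invert the mapping; rounding error from quantize() is not recoverable."""
    return [c * scale + offset for c in codes]


original = [0.12, -3.47, 10.06]
scale, offset = 0.5, 0.0  # hypothetical per-channel parameters

codes = quantize(original, scale, offset)
restored = dequantize(codes, scale, offset)
errors = [abs(a - b) for a, b in zip(original, restored)]
print(codes, restored, errors)
```

Under this assumption, each dequantized sample sits within half a quantization step (scale / 2) of the original as long as no clipping occurs, so the values are close but not identical.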
For both families, source_file is reset to the path you read from, while the
original source_format provenance is preserved (a re-imported serialization
file keeps the format it was originally converted from rather than being
relabeled as tabular or zarr).
Self-Describing Schema¶
Both families embed a versioned biosigio metadata blob so a reader can
recognize and version-check the file before trusting it.
- Parquet and Arrow carry the blob in the Arrow schema metadata
  (tabular_schema.FORMAT = "biosigio-tabular", FORMAT_VERSION = 1). It holds the recording metadata, per-channel info, and events as one JSON object.
- Zarr carries an equivalent blob in the store's root attributes
  (FORMAT = "biosigio-zarr", FORMAT_VERSION = 2), reusing the same metadata encoding so both formats record state the same way.
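A reader-side check of such a blob might look like the sketch below. The format names and version numbers are the ones listed above; the JSON key names (`format`, `format_version`), the `check_blob` helper, and the exact-match version policy are assumptions made for illustration and may not match what biosigio does.

```python
import json

# Format names and versions as stated above; the exact-match policy is an assumption.
SUPPORTED = {"biosigio-tabular": 1, "biosigio-zarr": 2}


def check_blob(raw):
    """Parse a metadata blob and refuse unknown formats or versions."""
    blob = json.loads(raw)
    fmt = blob.get("format")              # hypothetical key name
    version = blob.get("format_version")  # hypothetical key name
    if fmt not in SUPPORTED:
        raise ValueError(f"not a biosigio file: format={fmt!r}")
    if version != SUPPORTED[fmt]:
        raise ValueError(f"unsupported {fmt} version: {version!r}")
    return blob


raw = json.dumps({"format": "biosigio-tabular", "format_version": 1, "events": []})
blob = check_blob(raw)
print(blob["format"], blob["format_version"])
```

Checking the identifier and version before touching the payload lets a reader reject a foreign or newer file early instead of misinterpreting it.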
The encoding is lossless for values that are not native to JSON: datetimes and dates become a typed envelope and are reconstructed to their original Python type on read, and numpy scalars/arrays are converted to their Python equivalents. Metadata that cannot be serialized raises an error rather than being silently coerced to a string, because metadata loss is data loss.
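The typed-envelope idea can be sketched with the standard library alone. The envelope keys (`__type__`, `value`) and the helper names below are hypothetical; they only illustrate how a datetime can survive a trip through JSON with its Python type intact, and how an unsupported value can raise instead of being stringified.

```python
import json
from datetime import date, datetime


def encode(value):
    """json.dumps fallback: wrap non-JSON types in a typed envelope."""
    # datetime is a subclass of date, so it must be checked first.
    if isinstance(value, datetime):
        return {"__type__": "datetime", "value": value.isoformat()}
    if isinstance(value, date):
        return {"__type__": "date", "value": value.isoformat()}
    # Refuse anything else rather than silently coercing it to a string.
    raise TypeError(f"cannot serialize metadata of type {type(value).__name__}")


def decode(obj):
    """json.loads object_hook: unwrap envelopes back to their original type."""
    kind = obj.get("__type__")
    if kind == "datetime":
        return datetime.fromisoformat(obj["value"])
    if kind == "date":
        return date.fromisoformat(obj["value"])
    return obj


meta = {
    "start": datetime(2024, 1, 2, 3, 4, 5),
    "birth": date(1990, 5, 17),
    "subject": "sub-01",
}
text = json.dumps(meta, default=encode)
restored = json.loads(text, object_hook=decode)
print(restored == meta)
```

Because the fallback raises `TypeError` for anything it does not recognize, an unserializable value stops the export rather than ending up in the file as a lossy string.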
Optional Dependencies¶
These formats are not part of the core install; install the matching extra.
| Format | Extra | Install |
|---|---|---|
| Parquet, Arrow/Feather | arrow (pyarrow) | uv sync --extra arrow or uv pip install 'biosigio[arrow]' |
| Zarr | zarr (zarr v3) | uv sync --extra zarr or uv pip install 'biosigio[zarr]' |
If the extra is missing, the exporter and importer raise an ImportError with
the exact install command, so you never get a partial or silent failure.
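A guard of this kind is commonly written as a small wrapper around the import. The sketch below shows the general pattern only: the `require_extra` helper and the message wording are hypothetical, and the install command simply mirrors the table above.

```python
import importlib


def require_extra(module_name, extra):
    """Import an optional dependency or raise ImportError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Hypothetical wording; the command mirrors the install table above.
        raise ImportError(
            f"{module_name} is required for this format. "
            f"Install it with: uv pip install 'biosigio[{extra}]'"
        ) from exc


# A module that is present imports normally.
json_module = require_extra("json", "arrow")

# A missing module produces an actionable message instead of a bare failure.
try:
    require_extra("module_that_is_not_installed_xyz", "arrow")
except ImportError as err:
    message = str(err)
    print(message)
```

Running the check before any file is opened or written is what prevents a half-finished export when the dependency is absent.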
Round-Trip Examples¶
Parquet¶
```python
from biosigio import Recording

rec = Recording.from_file('recording.edf')
rec.to_parquet('recording.parquet')

# Auto-detected from the .parquet extension; restored exactly.
restored = Recording.from_file('recording.parquet')
```
Arrow/Feather¶
```python
from biosigio import Recording

rec = Recording.from_file('recording.edf')
rec.to_arrow('recording.feather')

restored = Recording.from_file('recording.feather')
```
Zarr¶
```python
from biosigio import Recording

rec = Recording.from_file('recording.edf')
rec.to_zarr('recording.zarr')

# Reconstructed at the store's canonical (downsampled) level-0 rate.
served = Recording.from_file('recording.zarr')
```
A Zarr store can hold several (modality, rate) groups that cannot share a
single time grid. When there is more than one group, pass group= to choose
which one to reconstruct.
These examples are deliberately minimal. For the full store layout, tuning knobs, and the serving model, see the Zarr store contract; for end-to-end walk-throughs, see the serialization examples.
See Also¶
- Zarr Serving Store: the on-disk contract, signal rules, and serving model.
- Serialization Examples: worked round-trip examples.