spine.io.write.StageHDF5Writer
- class spine.io.write.StageHDF5Writer(file_name: str | None = None, directory: str | None = None, prefix: str | list[str] | None = None, suffix: str = 'stage', stage: str | None = None, keys: list[str] | None = None, skip_keys: list[str] | None = None, split: bool = True, lite: bool = False, keep_open: bool = True, flush_frequency: int | None = None, overwrite: bool = False)
Write additive stage caches to one HDF5 file per source file.
This writer is intended for sequential cache materialization workflows where each processing stage writes a self-contained set of products under /stages/<stage> while preserving previously completed stages. Cache files are split by source-file provenance automatically.
Unlike HDF5Writer, this class does not use one flat product namespace for the entire file. Each stage owns its own events dataset and product datasets, which allows failed later stages to be rewritten without modifying earlier completed stages.
Methods
DataFormat([dtype, class_name, width, ...]) Data structure to hold writing parameters.
StageState(keys, type_dict, object_dtypes[, ...]) In-memory description of one stage schema.
__call__(data[, cfg]) Append one batch to the configured stage.
append_entry(out_file, data, batch_id) Stores one entry.
append_key(out_file, event, data, key, batch_id) Stores one data key in a specific dataset of an HDF5 file.
close() Close any persistent cache-file handles.
create(data[, cfg, append]) Initialize the output file structure based on the data dictionary.
ensure_source_group(out_file, data, file_path) Create or validate the top-level source provenance group.
finalize() Mark the configured stage as complete across touched cache files.
finalize_stage(stage) Mark one stage as complete in every touched cache file.
flush() Flush all persistent HDF5 output handles to disk.
get_batch_source_info(data) Extract cache-file source provenance from one normalized batch.
get_data_type(data, key) Identify the dtype and shape of the objects to be dealt with.
get_data_types(data, keys) Get the data type information for each key.
get_file_names([file_name, prefix, suffix, ...]) Build output file name(s) from an explicit name or input prefix(es).
get_object_dtype(obj) Loop over the attributes of a class to figure out what to store.
get_output_path(source_info[, multiple_sources]) Resolve the cache-file path for one source file.
get_stored_keys(data) Get the list of data product keys to store.
initialize_datasets(out_file, type_dict) Create placeholders for all the datasets to be filled.
Return the union of stage-group names across touched cache files.
split_batch_by_source(data) Split one normalized batch into one subset per source file.
store(out_file, event, key, array) Stores an ndarray in the file and stores its mapping in the event dataset.
store_flat(out_file, event, key, array_list) Stores a concatenated list of arrays in the file and stores the index mapping needed to split them again in the event dataset.
store_jagged(out_file, event, key, array_list) Stores a jagged list of arrays in the file and stores an index mapping for each array element in the event dataset.
store_objects(out_file, event, key, array, ...) Stores a list of objects with understandable attributes in the file and stores its mapping in the event dataset.
with_source_provenance(data) Return a data dictionary augmented with persisted source provenance.
write_stage(stage, data[, cfg, attrs, ...]) Append one batch of products to a named stage.
- __init__(file_name: str | None = None, directory: str | None = None, prefix: str | list[str] | None = None, suffix: str = 'stage', stage: str | None = None, keys: list[str] | None = None, skip_keys: list[str] | None = None, split: bool = True, lite: bool = False, keep_open: bool = True, flush_frequency: int | None = None, overwrite: bool = False) -> None
Initialize the stage-cache writer.
- Parameters:
file_name (str, optional) – Output cache file name. When directory is not provided, this path also provides the parent directory for source-derived cache files. If omitted, the base output path is built from prefix and suffix using the same naming rules as HDF5Writer.
directory (str, optional) – Output directory used for all source-derived cache files. When provided, it overrides the directory encoded in file_name.
prefix (str or list[str], optional) – Input file prefix used to derive the base staged-cache file name when file_name is not specified.
suffix (str, default "stage") – Suffix appended to source file basenames when deriving split cache file names.
stage (str, optional) – Stage name to use for the standard driver-facing writer contract. When provided, __call__() writes to this stage and finalize() marks it complete. If omitted, use write_stage() and finalize_stage() directly.
keys (list[str], optional) – List of data-product keys to persist in each stage. If omitted, store every product present in the batch apart from administrative source-file metadata.
skip_keys (list[str], optional) – List of data-product keys to exclude from each stage.
split (bool, default True) – Stage caches are always written one file per source file. This argument is accepted for compatibility with generic writer configuration, but it must remain True.
lite (bool, default False) – If True, store lite object representations when applicable.
keep_open (bool, default True) – If True, keep one append handle open per process.
flush_frequency (int, optional) – Flush the file after this many appended entries per stage. If None, only flush on explicit requests or close/finalize.
overwrite (bool, default False) – If True, replace the entire cache file if it already exists.
Attributes
source_index_keys
- name = 'stage_hdf5'
- class StageState(keys: set[str], type_dict: dict[str, DataFormat], object_dtypes: list[list[tuple[str, type]]], event_dtype: dtype | list[tuple[str, Any]] | None = None, entries_since_flush: int = 0)
In-memory description of one stage schema.
The regular HDF5Writer stores one flat schema for the whole file. Stage caches need one schema per stage, so this small dataclass carries the state required to keep appending consistently to a given stage group.
- Attributes:
- event_dtype
- keys: set[str]
- type_dict: dict[str, DataFormat]
- event_dtype: dtype | list[tuple[str, Any]] | None = None
- entries_since_flush: int = 0
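How entries_since_flush could interact with flush_frequency is illustrated by the minimal, standard-library sketch below. StageCounter and record_append are hypothetical names that do not exist in spine, and the reset-on-flush rule is an assumption; the writer's actual bookkeeping may differ.

```python
from dataclasses import dataclass


@dataclass
class StageCounter:
    """Hypothetical stand-in for the flush bookkeeping of one stage."""

    entries_since_flush: int = 0


def record_append(state, flush_frequency):
    """Count one appended entry and report whether a flush is due.

    Assumption: a flush is due once `flush_frequency` entries have been
    appended since the previous flush, and never when it is None.
    """
    state.entries_since_flush += 1
    if flush_frequency is not None and state.entries_since_flush >= flush_frequency:
        state.entries_since_flush = 0  # reset after the (simulated) flush
        return True
    return False


state = StageCounter()
# Append seven entries with a flush requested every three entries.
flush_due = [record_append(state, flush_frequency=3) for _ in range(7)]
```

Keeping one counter per stage means a busy stage can trigger flushes without being affected by how many entries other stages have written.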
- close() -> None
Close any persistent cache-file handles.
This only affects handles cached in the current process and may be called repeatedly.
- get_batch_source_info(data: dict[str, Any]) -> dict[str, Any]
Extract cache-file source provenance from one normalized batch.
- Parameters:
data (dict) – Normalized batch dictionary prepared for writing.
- Returns:
File-level source identity stored under the cache file’s top-level /source group.
- Return type:
dict[str, Any]
- ensure_source_group(out_file: File, data: dict[str, Any], file_path: str) -> None
Create or validate the top-level source provenance group.
This enforces the one-cache-file-per-source-file contract. If a later stage attempts to write into an existing cache file with mismatched source provenance, the writer raises immediately.
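The create-or-validate contract described above can be sketched with a plain dictionary standing in for an open cache file. The function ensure_source, the "file_name" provenance field, and the choice of ValueError are all assumptions made for illustration; the documentation does not state which exception the writer raises or which fields it compares.

```python
def ensure_source(cache, source_info, file_path):
    """Create the 'source' entry on first use, otherwise validate it.

    `cache` is a plain dict standing in for an open cache file.
    """
    # setdefault stores a copy on first use and returns the stored value after.
    stored = cache.setdefault("source", dict(source_info))
    if stored != source_info:
        raise ValueError(
            f"{file_path}: source provenance mismatch "
            f"(stored {stored!r}, got {source_info!r})"
        )


cache = {}
ensure_source(cache, {"file_name": "run1.root"}, "run1_stage.h5")  # creates
ensure_source(cache, {"file_name": "run1.root"}, "run1_stage.h5")  # validates

try:
    # A later stage tries to write products from a different source file.
    ensure_source(cache, {"file_name": "run2.root"}, "run1_stage.h5")
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Failing before any dataset is touched keeps a mismatched write from mixing products of two source files in one cache file.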
- get_output_path(source_info: dict[str, Any], multiple_sources: bool = False) -> str
Resolve the cache-file path for one source file.
- Parameters:
source_info (dict) – File-level source identity returned by get_batch_source_info().
multiple_sources (bool, default False) – If True, derive one output path from the source file basename. Otherwise reuse self.file_name directly unless this writer is already in source-routed mode.
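The two resolution modes can be illustrated with a standard-library sketch. The helper names and the `<basename>_<suffix>.h5` pattern are assumptions; the actual naming rules are those of HDF5Writer, which this page does not spell out. The sketch also ignores the "already in source-routed mode" case.

```python
import os


def derive_cache_path(source_file, directory, suffix="stage"):
    """Build '<directory>/<source basename>_<suffix>.h5' (assumed pattern)."""
    base = os.path.splitext(os.path.basename(source_file))[0]
    return os.path.join(directory, f"{base}_{suffix}.h5")


def resolve_path(source_file, file_name, multiple_sources=False, suffix="stage"):
    """Reuse `file_name` for a single source, else derive one path per source."""
    if not multiple_sources:
        return file_name
    # The parent directory of `file_name` hosts the source-derived files.
    directory = os.path.dirname(file_name)
    return derive_cache_path(source_file, directory, suffix)


single = resolve_path("/data/run1.root", "out/cache.h5")
routed = resolve_path("/data/run1.root", "out/cache.h5", multiple_sources=True)
```

Here `single` is the explicit file name unchanged, while `routed` lands next to it with a name derived from the source file.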
- split_batch_by_source(data: dict[str, Any]) -> list[tuple[str, dict[str, Any], dict[str, Any]]]
Split one normalized batch into one subset per source file.
- Returns:
One tuple per source file containing the resolved output file path, the batch subset that belongs to that source file, and the file-level source provenance dictionary.
- Return type:
list[tuple[str, dict, dict]]
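The partitioning idea can be sketched in a few lines of plain Python. This is not the spine implementation: the "source_file" key and the list-per-key batch layout are hypothetical, and the sketch returns (source, subset) pairs rather than the documented three-element tuples with output path and provenance dictionary.

```python
def split_by_source(batch, source_key="source_file"):
    """Group per-entry values by source file.

    `batch` maps each key to a list with one element per entry;
    `source_key` (hypothetical) names the list of source files.
    Returns (source_file, subset) tuples in first-seen order.
    """
    groups = {}  # source file -> entry indices; dicts keep insertion order
    for index, source in enumerate(batch[source_key]):
        groups.setdefault(source, []).append(index)

    return [
        (source, {key: [values[i] for i in indices] for key, values in batch.items()})
        for source, indices in groups.items()
    ]


batch = {
    "source_file": ["a.root", "b.root", "a.root"],
    "energy": [1.0, 2.0, 3.0],
}
parts = split_by_source(batch)
```

Each subset keeps the original entry order within its source file, so appending the subsets preserves the per-file event sequence.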
- write_stage(stage: str, data: dict[str, Any], cfg: dict[str, Any] | None = None, attrs: dict[str, Any] | None = None, overwrite_stage: bool = False) -> None
Append one batch of products to a named stage.
- Parameters:
stage (str) – Stage group name under /stages.
data (dict) – Dictionary of batched data products.
cfg (dict, optional) – Configuration to store alongside this stage.
attrs (dict, optional) – Additional stage metadata to persist under stage/info.attrs.
overwrite_stage (bool, default False) – If True, delete any existing stage group with the same name and rebuild it from the provided data.
Notes
The input batch may span multiple source files. In that case the batch is partitioned by source provenance and written into one cache file per source file automatically.
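The additive behaviour of write_stage() and finalize_stage(), including the effect of overwrite_stage, can be pictured with a nested dictionary standing in for the /stages hierarchy of one cache file. The functions below are illustrative stand-ins, not the spine methods, and the "events" list and "complete" flag are assumed representations of the stage contents and completion marker.

```python
def write_stage(cache, stage, entries, overwrite_stage=False):
    """Append `entries` to cache['stages'][stage], optionally rebuilding it."""
    stages = cache.setdefault("stages", {})
    if overwrite_stage:
        stages.pop(stage, None)  # drop only this stage; others are untouched
    group = stages.setdefault(stage, {"events": [], "complete": False})
    group["events"].extend(entries)


def finalize_stage(cache, stage):
    """Mark one stage as complete."""
    cache["stages"][stage]["complete"] = True


cache = {}
write_stage(cache, "reco", [1, 2])
finalize_stage(cache, "reco")

# A later stage fails midway, leaving partial output behind ...
write_stage(cache, "analysis", [10])
# ... and is then rebuilt from scratch without touching "reco".
write_stage(cache, "analysis", [11, 12], overwrite_stage=True)
finalize_stage(cache, "analysis")
```

Because every stage owns its own group, rebuilding "analysis" leaves the completed "reco" stage exactly as it was.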