DataConverter

class pyvisgen.io.dataconverter.DataConverter

Bases: object

Convert datasets between HDF5, WebDataset, and PyTorch formats.

This class loads datasets from any of the supported formats and converts them to a target format. Where available or required, metadata is read from or added to the respective datasets.

Examples

Convert WebDataset to HDF5:

>>> converter = DataConverter.from_wds("./data/visibilities")
>>> converter.to("./data/output", output_format="h5")

Convert HDF5 train split to WebDataset:

>>> converter = DataConverter.from_h5("./data/visibilities", dataset_split="train")
>>> converter.to("~/data/output", output_format="wds", compress=True)

Methods Summary

from_h5(data_dir[, dataset_split])

Create a DataConverter instance from HDF5 files.

from_pt(data_dir[, dataset_split])

Create a DataConverter instance from PyTorch pickle (.pt) files.

from_wds(data_dir[, dataset_split])

Create a DataConverter instance from WebDataset files.

to(output_dir[, output_format, amp_phase, ...])

Convert the loaded dataset to the specified output format.

Methods Documentation

classmethod from_h5(data_dir, dataset_split='all') → Self

Create a DataConverter instance from HDF5 files.

Parameters:
data_dir : str or Path

Directory containing HDF5 files.

dataset_split : str or list

Dataset split to load. If "all", loads the train, valid, and test splits. Default: "all"

Returns:
DataConverter

Configured DataConverter instance with HDF5 source files.

classmethod from_pt(data_dir, dataset_split='all')

Create a DataConverter instance from PyTorch pickle (.pt) files.

Parameters:
data_dir : str or Path

Directory containing .pt files.

dataset_split : str or list

Dataset split to load. If "all", loads the train, valid, and test splits. Default: "all"

Returns:
DataConverter

Configured DataConverter instance with PyTorch pickle source files.

classmethod from_wds(data_dir, dataset_split='all') → Self

Create a DataConverter instance from WebDataset files.

Parameters:
data_dir : str or Path

Directory containing WebDataset .tar(.gz) files.

dataset_split : str or list

Dataset split to load. If "all", loads the train, valid, and test splits. Default: "all"

Returns:
DataConverter

Configured DataConverter instance with WebDataset source files.

Raises:
ImportError

If the webdataset package is not installed.
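Because from_wds raises ImportError when the optional dependency is missing, a caller may want to check for it up front. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of pyvisgen.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found without importing it.

    Hypothetical helper for illustration; not part of pyvisgen.
    """
    return importlib.util.find_spec(package) is not None


# Decide on a loader before constructing the converter.
if is_installed("webdataset"):
    print("webdataset is available; DataConverter.from_wds can be used")
else:
    print("webdataset is missing; install it or use from_h5 / from_pt")
```

Catching the ImportError around the from_wds call works as well; the check above only avoids the exception.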

to(output_dir: str | Path, output_format: str = 'h5', amp_phase: bool = True, shard_pattern: str = '%06d.tar', compress: bool = True, bundle_size: int = 100, convert_representation: bool = False) → None

Convert the loaded dataset to the specified output format.

Parameters:
output_dir : str or Path

Directory to write converted files to.

output_format : str, optional

Target format for conversion. One of "h5", "wds", or "pt". Default: "h5"

amp_phase : bool, optional

Whether the input data is in amplitude/phase representation (True) or real/imaginary representation (False). Default: True

shard_pattern : str, optional

Naming pattern for WebDataset shards (only applies to wds output). Default: "%06d.tar"

compress : bool, optional

Whether to compress WebDataset shards (only applies to wds output). Default: True

bundle_size : int, optional

Bundle size for HDF5 and WebDataset shards when converting from PyTorch pickle files. Default: 100

convert_representation : bool, optional

If True, convert the data from amplitude/phase representation to real/imaginary representation or vice versa. Note that amp_phase must match the actual representation of the input data, because it determines the direction in which the conversion is applied. Default: False
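The relationship between the two representations can be sketched for a single complex visibility as follows. This is an illustration of the underlying math only, not pyvisgen's implementation: the function names are hypothetical, the phase is assumed to be in radians, and the `amp_phase` flag is used here simply to pick the conversion direction.

```python
import cmath


def to_real_imag(amp: float, phase: float) -> tuple[float, float]:
    """Amplitude/phase (phase in radians, an assumption) to real/imaginary."""
    z = cmath.rect(amp, phase)
    return z.real, z.imag


def to_amp_phase(real: float, imag: float) -> tuple[float, float]:
    """Real/imaginary to amplitude/phase (phase in radians)."""
    return cmath.polar(complex(real, imag))


def convert(a: float, b: float, amp_phase: bool) -> tuple[float, float]:
    """Hypothetical helper: convert away from the representation named by amp_phase.

    amp_phase=True  -> (a, b) is (amplitude, phase); returns (real, imag).
    amp_phase=False -> (a, b) is (real, imag); returns (amplitude, phase).
    """
    return to_real_imag(a, b) if amp_phase else to_amp_phase(a, b)
```

If `amp_phase` does not describe the input correctly, the wrong branch is applied and the output is meaningless, which is why the flag has to match the data.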

Raises:
RuntimeError

If source and target formats are identical.
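To make the shard_pattern and bundle_size parameters more concrete, the sketch below shows how a printf-style pattern such as "%06d.tar" expands to file names and how samples could be grouped into bundles of a fixed size. The helper names, the zero-based shard index, and the grouping strategy are assumptions for illustration; the source does not specify how pyvisgen numbers or fills its output files.

```python
from itertools import islice
from typing import Iterable, Iterator


def bundles(samples: Iterable, bundle_size: int = 100) -> Iterator[list]:
    """Yield consecutive lists of at most bundle_size samples (hypothetical helper)."""
    it = iter(samples)
    while chunk := list(islice(it, bundle_size)):
        yield chunk


def shard_names(
    n_samples: int, bundle_size: int = 100, shard_pattern: str = "%06d.tar"
) -> list[str]:
    """Expand shard_pattern once per bundle, assuming indices start at 0."""
    n_shards = -(-n_samples // bundle_size)  # ceiling division
    return [shard_pattern % index for index in range(n_shards)]


# 250 samples with the default bundle size need three output files.
print(shard_names(250))
print([len(chunk) for chunk in bundles(range(250))])
```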