DataConverter
- class pyvisgen.io.dataconverter.DataConverter
Bases: object
Convert datasets between HDF5, WebDataset, and PyTorch formats.
This class loads datasets from any of these formats and converts them to a target format. Where available or required, metadata is read from or added to the respective datasets.
Examples
Convert WebDataset to HDF5:
>>> converter = DataConverter.from_wds("./data/visibilities")
>>> converter.to("./data/output", output_format="h5")
Convert HDF5 train split to WebDataset:
>>> converter = DataConverter.from_h5("./data/visibilities", dataset_split="train")
>>> converter.to("~/data/output", output_format="wds", compress=True)
Methods Summary
from_h5(data_dir[, dataset_split]): Create a DataConverter instance from HDF5 files.
from_pt(data_dir[, dataset_split]): Create a DataConverter instance from PyTorch pickle (.pt) files.
from_wds(data_dir[, dataset_split]): Create a DataConverter instance from WebDataset files.
to(output_dir[, output_format, amp_phase, ...]): Convert the loaded dataset to the specified output format.
Methods Documentation
- classmethod from_h5(data_dir, dataset_split='all') -> Self
Create a DataConverter instance from HDF5 files.
- Parameters:
- data_dir : str or Path
Directory containing HDF5 files.
- dataset_split : str or list
Dataset split to load. If "all", loads train, valid, and test. Default: "all"
- Returns:
- DataConverter
Configured DataConverter instance with HDF5 source files.
- classmethod from_pt(data_dir, dataset_split='all')
Create a DataConverter instance from PyTorch pickle (.pt) files.
- Parameters:
- data_dir : str or Path
Directory containing .pt files.
- dataset_split : str or list
Dataset split to load. If "all", loads train, valid, and test. Default: "all"
- Returns:
- DataConverter
Configured DataConverter instance with PyTorch pickle source files.
- classmethod from_wds(data_dir, dataset_split='all') -> Self
Create a DataConverter instance from WebDataset files.
- Parameters:
- data_dir : str or Path
Directory containing WebDataset .tar(.gz) files.
- dataset_split : str or list
Dataset split to load. If "all", loads train, valid, and test. Default: "all"
- Returns:
- DataConverter
Configured DataConverter instance with WebDataset source files.
- Raises:
- ImportError
If the webdataset package is not installed.
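The dataset_split argument shared by the three constructors accepts either a single split name or a list of names. A minimal sketch of how such an argument could be normalized is given below. `resolve_splits` and `ALL_SPLITS` are hypothetical names used for illustration only; they are not part of pyvisgen, and the library's internal handling may differ.

```python
from __future__ import annotations

from collections.abc import Sequence

# Assumption: the three split names listed in the parameter description.
ALL_SPLITS = ("train", "valid", "test")


def resolve_splits(dataset_split: str | Sequence[str] = "all") -> list[str]:
    """Return the split names to load (hypothetical helper, not pyvisgen API)."""
    if isinstance(dataset_split, str):
        if dataset_split == "all":
            # "all" expands to every known split.
            return list(ALL_SPLITS)
        # A single split name is wrapped in a list.
        dataset_split = [dataset_split]
    unknown = [s for s in dataset_split if s not in ALL_SPLITS]
    if unknown:
        raise ValueError(f"Unknown dataset split(s): {unknown}")
    return list(dataset_split)
```

With this sketch, `resolve_splits("train")` yields `["train"]`, and the default yields all three split names.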
- to(output_dir: str | Path, output_format: str = 'h5', amp_phase: bool = True, shard_pattern: str = '%06d.tar', compress: bool = True, bundle_size: int = 100, convert_representation: bool = False) -> None
Convert the loaded dataset to the specified output format.
- Parameters:
- output_dir : str or Path
Directory to write converted files to.
- output_format : str, optional
Target format for conversion. One of "h5", "wds", or "pt". Default: "h5"
- amp_phase : bool, optional
Whether the input data is in amplitude/phase representation (True) or real/imaginary representation (False). Default: True
- shard_pattern : str, optional
Naming pattern for WebDataset shards (only applies to wds output). Default: "%06d.tar"
- compress : bool, optional
Whether to compress WebDataset shards (only applies to wds output). Default: True
- bundle_size : int, optional
Bundle size for HDF5 and WebDataset shards when converting from PyTorch pickle files. Default: 100
- convert_representation : bool, optional
If True, convert the data from amplitude/phase to real/imaginary representation or vice versa. Note that amp_phase must match the actual representation of the input data, because it determines the direction in which the conversion is applied. Default: False
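The interplay of amp_phase and convert_representation can be illustrated on a single complex visibility. The sketch below uses only the standard library and is not pyvisgen's implementation. `convert_pair` is a hypothetical helper, and it assumes the phase is given in radians, which the parameter description does not state.

```python
from __future__ import annotations

import cmath


def convert_pair(a: float, b: float, amp_phase: bool) -> tuple[float, float]:
    """Convert one value pair to the other representation (hypothetical helper).

    amp_phase=True:  (a, b) is (amplitude, phase) and (real, imag) is returned.
    amp_phase=False: (a, b) is (real, imag) and (amplitude, phase) is returned.
    Assumption: phase is in radians.
    """
    if amp_phase:
        # Polar -> Cartesian: z = a * exp(i * b)
        z = cmath.rect(a, b)
        return z.real, z.imag
    # Cartesian -> polar: returns (|z|, arg(z))
    return cmath.polar(complex(a, b))
```

If the flag does not match the input, the conversion runs in the wrong direction without raising an error, which is why amp_phase has to describe the data as it is stored.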
- Raises:
- RuntimeError
If source and target formats are identical.
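The default shard_pattern, "%06d.tar", is a printf-style template, and bundle_size sets how many samples are grouped into one output file when converting from PyTorch pickle files. A rough sketch of how the two could combine is shown below. `plan_shards` is a hypothetical helper, not part of pyvisgen, and it assumes the pattern is filled with a zero-based shard index.

```python
def plan_shards(n_samples, bundle_size=100, shard_pattern="%06d.tar"):
    """Map shard file names to the sample index ranges they would hold.

    Hypothetical helper for illustration; assumes zero-based shard indices.
    """
    if bundle_size < 1:
        raise ValueError("bundle_size must be a positive integer")
    plan = {}
    for shard_idx, start in enumerate(range(0, n_samples, bundle_size)):
        # The last shard may hold fewer than bundle_size samples.
        stop = min(start + bundle_size, n_samples)
        plan[shard_pattern % shard_idx] = range(start, stop)
    return plan


plan = plan_shards(250)
print(list(plan))  # ['000000.tar', '000001.tar', '000002.tar']
```

Here 250 samples with the default bundle size produce three shards, the last of which holds the remaining 50 samples.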