pyUSID.io.usi_data.USIDataset¶
- class pyUSID.io.usi_data.USIDataset(h5_ref, sort_dims=False)[source]¶
Bases: h5py.Dataset
A class that simplifies slicing, visualization, reshaping, reduction, etc. of USID datasets in HDF5 files. This class extends h5py.Dataset.
- Parameters:
h5_ref (h5py.Dataset) – The dataset which is actually a USID Main dataset
sort_dims (bool, optional. Default = False) – If set to True, dimensions will be sorted from slowest to fastest. Else, dimensions will be arranged as they appear in the ancillary datasets
- self.get_current_sorting()¶
- self.toggle_sorting()¶
- self.get_pos_values()¶
- self.get_spec_values()¶
- self.get_n_dim_form()¶
- self.slice()¶
- self.h5_spec_vals¶
Associated Spectroscopic Values dataset
- Type:
h5py.Dataset
- self.h5_spec_inds¶
Associated Spectroscopic Indices dataset
- Type:
h5py.Dataset
- self.h5_pos_vals¶
Associated Position Values dataset
- Type:
h5py.Dataset
- self.h5_pos_inds¶
Associated Position Indices dataset
- Type:
h5py.Dataset
Notes
The order of all labels and sizes attributes is determined by the current value of sort_dims.
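A minimal construction sketch; the file name and the internal path to the Main dataset below are hypothetical and will differ for your data:
>>> import h5py
>>> from pyUSID.io.usi_data import USIDataset
>>> h5_file = h5py.File('data.h5', mode='r')  # hypothetical file
>>> pd = USIDataset(h5_file['/Measurement_000/Channel_000/Raw_Data'])  # hypothetical path
>>> print(pd.pos_dim_labels, pd.spec_dim_labels)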
Methods
- asstr() – Get a wrapper to read string data as Python strings.
- astype() – Get a wrapper allowing you to perform reads to a different destination type.
- fields() – Get a wrapper to read a subset of fields from a compound data type.
- flush() – Flush the dataset data and metadata to the file.
- get_current_sorting() – Prints the current sorting method.
- get_n_dim_form() – Reshapes the dataset to an N-dimensional array.
- get_pos_values() – Extract the reference values for the specified position dimension.
- get_spec_values() – Extract the values for the specified spectroscopic dimension.
- iter_chunks() – Return chunk iterator.
- len() – The size of the first axis.
- make_scale() – Make this dataset an HDF5 dimension scale.
- read_direct() – Read data directly from HDF5 into an existing NumPy array.
- reduce() – Reduce the dataset along the specified position and/or spectroscopic dimensions.
- refresh() – Refresh the dataset metadata by reloading from the file.
- resize() – Resize the dataset, or the specified axis.
- slice() – Slice the dataset based on an input dictionary of 'str': slice pairs.
- slice_to_dataset() – Slices the dataset, writes its output back to the HDF5 file, and returns a USIDataset object.
- to_csv() – Output this USIDataset and position + spectroscopic values to a csv file.
- toggle_sorting() – Toggles between sorting from the fastest changing dimension to the slowest and sorting based on the order of the labels.
- virtual_sources() – Get a list of the data mappings for a virtual dataset.
- visualize() – Interactive visualization of this dataset.
- write_direct() – Write data directly to HDF5 from a NumPy array.
Attributes
- attrs – Attributes attached to this object
- chunks – Dataset chunks (or None)
- compression – Compression strategy (or None)
- compression_opts – Compression setting.
- dims – Access dimension scales attached to this dataset.
- dtype – Numpy dtype representing the datatype
- external – External file settings.
- file – Return a File instance associated with this object
- fillvalue – Fill value for this dataset (0 by default)
- fletcher32 – Fletcher32 filter is present (T/F)
- id – Low-level identifier appropriate for this object
- is_scale – Return True if this dataset is also a dimension scale.
- is_virtual – Check if this is a virtual dataset
- maxshape – Shape up to which this dataset can be resized.
- name – Return the full name of this object.
- nbytes – Numpy-style attribute giving the raw dataset size as the number of bytes
- ndim – Numpy-style attribute giving the number of dimensions
- parent – Return the parent group of this object.
- ref – An (opaque) HDF5 reference to this object
- regionref – Create a region reference (Datasets only).
- scaleoffset – Scale/offset filter settings.
- shape – Numpy-style shape tuple giving dataset dimensions
- shuffle – Shuffle filter present (T/F)
- size – Numpy-style attribute giving the total dataset size
- __array__(dtype=None)¶
Create a Numpy array containing the whole dataset. DON’T THINK THIS MEANS DATASETS ARE INTERCHANGEABLE WITH ARRAYS. For one thing, you have to read the whole dataset every time this method is called.
- __getitem__(args, new_dtype=None)¶
Read a slice from the HDF5 dataset.
Takes slices and recarray-style field names (more than one is allowed!) in any order. Obeys basic NumPy rules, including broadcasting.
Also supports:
Boolean “mask” array indexing
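A hedged sketch of both read styles (note that the boolean mask read below materializes the full dataset first):
>>> sub = dataset[0:10, 0:5]           # basic NumPy-style slicing
>>> positive = dataset[dataset[:] > 0] # boolean "mask" indexing; yields a 1-D result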
- __getnewargs__()¶
Disable pickle.
Handles for HDF5 objects can’t be reliably deserialised, because the recipient may not have access to the same files. So we do this to fail early.
If you really want to pickle h5py objects and can live with some limitations, look at the h5pickle project on PyPI.
- __iter__()¶
Iterate over the first axis. TypeError if scalar.
BEWARE: Modifications to the yielded data are NOT written to file.
- __len__()¶
The size of the first axis. TypeError if scalar.
Limited to 2**32 on 32-bit systems; Dataset.len() is preferred.
- __setitem__(args, val)¶
Write to the HDF5 dataset from a Numpy array.
NumPy’s broadcasting rules are honored, for “simple” indexing (slices and integers). For advanced indexing, the shapes must match.
- asstr(encoding=None, errors='strict')¶
Get a wrapper to read string data as Python strings:
>>> str_array = dataset.asstr()[:]
The parameters have the same meaning as in bytes.decode(). If encoding is unspecified, it will use the encoding in the HDF5 datatype (either ascii or utf-8).
- astype(dtype)¶
Get a wrapper allowing you to perform reads to a different destination type, e.g.:
>>> double_precision = dataset.astype('f8')[0:100:2]
- property attrs¶
Attributes attached to this object
- property chunks¶
Dataset chunks (or None)
- property compression¶
Compression strategy (or None)
- property compression_opts¶
Compression setting. Int(0-9) for gzip, 2-tuple for szip.
- property dims¶
Access dimension scales attached to this dataset.
- property dtype¶
Numpy dtype representing the datatype
- property external¶
External file settings. Returns a list of tuples of (name, offset, size) for each external file entry, or returns None if no external files are used.
- fields(names, *, _prior_dtype=None)¶
Get a wrapper to read a subset of fields from a compound data type:
>>> coords_2d = dataset.fields(['x', 'y'])[:]
If names is a string, a single field is extracted, and the resulting arrays will have that dtype. Otherwise, it should be an iterable, and the read data will have a compound dtype.
- property file¶
Return a File instance associated with this object
- property fillvalue¶
Fill value for this dataset (0 by default)
- property fletcher32¶
Fletcher32 filter is present (T/F)
- flush()¶
Flush the dataset data and metadata to the file. If the dataset is chunked, raw data chunks are written to the file.
This is part of the SWMR features and only exists when the HDF5 library version is >= 1.9.178.
- get_n_dim_form(as_scalar=False, lazy=False)[source]¶
Reshapes the dataset to an N-dimensional array
- Parameters:
as_scalar (bool, optional. Default = False) – If False, the data is returned in its original (complex, compound) dtype. Else, the data is flattened to a real-valued dataset
lazy (bool, optional. Default = False) – If set to False, n_dim_data will be a numpy.ndarray. Else, the returned object is a dask.array.core.Array
- Returns:
n_dim_data – N-dimensional form of the dataset
- Return type:
numpy.ndarray or dask.array.core.Array
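For example, with pd being the USIDataset instance from the construction sketch near the top of this page:
>>> n_dim = pd.get_n_dim_form()                # numpy.ndarray by default
>>> n_dim_lazy = pd.get_n_dim_form(lazy=True)  # dask.array.core.Array instead
>>> print(n_dim.shape)                         # one axis per position / spectroscopic dimension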
- get_pos_values(dim_name)[source]¶
Extract the reference values for the specified position dimension
- Parameters:
dim_name (str) – Name of one of the dimensions in self.pos_dim_labels
- Returns:
dim_values – Array containing the unit values of the dimension dim_name
- Return type:
numpy.ndarray
- get_spec_values(dim_name)[source]¶
Extract the values for the specified spectroscopic dimension
- Parameters:
dim_name (str) – Name of one of the dimensions in self.spec_dim_labels
- Returns:
dim_values – Array containing the unit values of the dimension dim_name
- Return type:
numpy.ndarray
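For example (the dimension names 'X' and 'Bias' are hypothetical; pick labels from self.pos_dim_labels and self.spec_dim_labels):
>>> x_vals = pd.get_pos_values('X')         # unit values of position dimension 'X'
>>> bias_vals = pd.get_spec_values('Bias')  # unit values of spectroscopic dimension 'Bias'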
- property id¶
Low-level identifier appropriate for this object
- property is_scale¶
Return True if this dataset is also a dimension scale. Return False otherwise.
- property is_virtual¶
Check if this is a virtual dataset
- iter_chunks(sel=None)¶
Return chunk iterator. If set, the sel argument is a slice or tuple of slices that defines the region to be used. If not set, the entire dataspace will be used for the iterator.
For each chunk within the given region, the iterator yields a tuple of slices that gives the intersection of the given chunk with the selection area.
A TypeError will be raised if the dataset is not chunked.
A ValueError will be raised if the selection region is invalid.
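A minimal sketch, assuming the dataset is chunked:
>>> for chunk_sel in dataset.iter_chunks():
...     block = dataset[chunk_sel]  # read one chunk-aligned region at a time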
- len()¶
The size of the first axis. TypeError if scalar.
Use of this method is preferred to len(dset), as Python’s built-in len() cannot handle values greater than 2**32 on 32-bit systems.
- make_scale(name='')¶
Make this dataset an HDF5 dimension scale.
You can then attach it to dimensions of other datasets like this:
other_ds.dims[0].attach_scale(ds)
You can optionally pass a name to associate with this scale.
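A short sketch combining both steps (ds and other_ds stand for two hypothetical datasets in the same file):
>>> ds.make_scale('time')              # mark ds as a dimension scale named 'time'
>>> other_ds.dims[0].attach_scale(ds)  # attach it to the first axis of other_ds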
- property maxshape¶
Shape up to which this dataset can be resized. Axes with value None have no resize limit.
- property name¶
Return the full name of this object. None if anonymous.
- property nbytes¶
Numpy-style attribute giving the raw dataset size as the number of bytes
- property ndim¶
Numpy-style attribute giving the number of dimensions
- property parent¶
Return the parent group of this object.
This is always equivalent to obj.file[posixpath.dirname(obj.name)]. ValueError if this object is anonymous.
- read_direct(dest, source_sel=None, dest_sel=None)¶
Read data directly from HDF5 into an existing NumPy array.
The destination array must be C-contiguous and writable. Selections must be the output of numpy.s_[<args>].
Broadcasting is supported for simple indexing.
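A hedged sketch (the shapes are hypothetical; the destination must be C-contiguous and writable):
>>> import numpy as np
>>> dest = np.empty((10,), dtype=dataset.dtype)
>>> dataset.read_direct(dest, source_sel=np.s_[0, 0:10], dest_sel=np.s_[0:10])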
- reduce(dims, ufunc=<function mean>, to_hdf5=False, dset_name=None, verbose=False)[source]¶
- Parameters:
dims (str or list of str) – Names of the position and/or spectroscopic dimensions that need to be reduced
ufunc (callable, optional. Default = dask.array.mean) – Reduction function such as dask.array.mean available in dask.array
to_hdf5 (bool, optional. Default = False) – Whether or not to write the reduced data back to a new dataset
dset_name (str (optional)) – Name of the new USID Main dataset in the HDF5 file that will contain the reduced data. Default - the reduced dataset takes the same name as this source dataset
verbose (bool, optional. Default = False) – Whether or not to print any debugging statements to stdout
- Returns:
reduced_nd (dask.array object) – Dask array object containing the reduced data. Call compute() on this object to get the equivalent numpy array
h5_main_red (USIDataset) – USIDataset reference if to_hdf5 was set to True. Otherwise - None.
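A hedged sketch, assuming a spectroscopic dimension named 'Bias' (hypothetical) and dask installed:
>>> import dask.array as da
>>> reduced_nd, h5_red = pd.reduce('Bias', ufunc=da.mean)  # h5_red is None since to_hdf5=False
>>> averaged = reduced_nd.compute()  # materialize the dask result as a numpy array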
- property ref¶
An (opaque) HDF5 reference to this object
- refresh()¶
Refresh the dataset metadata by reloading from the file.
This is part of the SWMR features and only exists when the HDF5 library version is >= 1.9.178.
- property regionref¶
Create a region reference (Datasets only).
The syntax is regionref[<slices>]. For example, dset.regionref[…] creates a region reference in which the whole dataset is selected.
Can also be used to determine the shape of the referenced dataset (via .shape property), or the shape of the selection (via the .selection property).
- resize(size, axis=None)¶
Resize the dataset, or the specified axis.
The dataset must be stored in chunked format; it can be resized up to the “maximum shape” (keyword maxshape) specified at creation time. The rank of the dataset cannot be changed.
“Size” should be a shape tuple, or if an axis is specified, an integer.
BEWARE: This functions differently than the NumPy resize() method! The data is not “reshuffled” to fit in the new shape; each axis is grown or shrunk independently. The coordinates of existing data are fixed.
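A hedged sketch, assuming the dataset was created with a maxshape that permits growth:
>>> dataset.resize((200, dataset.shape[1]))  # grow axis 0 to 200, via a full shape tuple
>>> dataset.resize(300, axis=0)              # or grow a single axis by passing an integer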
- property scaleoffset¶
Scale/offset filter settings. For integer data types, this is the number of bits stored, or 0 for auto-detected. For floating point data types, this is the number of decimal places retained. If the scale/offset filter is not in use, this is None.
- property shape¶
Numpy-style shape tuple giving dataset dimensions
- property shuffle¶
Shuffle filter present (T/F)
- property size¶
Numpy-style attribute giving the total dataset size
- slice(slice_dict, ndim_form=True, as_scalar=False, verbose=False, lazy=False)[source]¶
Slice the dataset based on an input dictionary of ‘str’: slice pairs. Each string should correspond to a dimension label. The slices can be array-likes or slice objects.
- Parameters:
slice_dict (dict) – Dictionary of array-likes for any dimension one needs to slice
ndim_form (bool, optional) – Whether or not to return the slice in its N-dimensional form. Default = True
as_scalar (bool, optional) – Should the data be returned as scalar values only.
verbose (bool, optional) – Whether or not to print debugging statements
lazy (bool, optional. Default = False) – If set to False, data_slice will be a numpy.ndarray. Else, the returned object is a dask.array.core.Array
- Returns:
data_slice (numpy.ndarray or dask.array.core.Array) – Slice of the dataset. The dataset has been reshaped to N dimensions if success is True, reshaped only by position dimensions if success is 'Positions', or not reshaped at all if success is False.
success (str or bool) – Informs the user as to how the data_slice has been shaped.
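A hedged sketch, again with the hypothetical dimension labels 'X' and 'Bias':
>>> data_slice, success = pd.slice({'X': slice(0, 10), 'Bias': [0, 5, 10]})
>>> print(success)  # True, 'Positions', or False, depending on how the slice was reshaped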
- slice_to_dataset(slice_dict, dset_name=None, verbose=False, **kwargs)[source]¶
Slices the dataset, writes its output back to the HDF5 file, and returns a USIDataset object
- Parameters:
slice_dict (dict) – Dictionary to slice one or more dimensions of the dataset by indices
dset_name (str (optional)) – Name of the new USID Main dataset in the HDF5 file that will contain the sliced data. Default - the sliced dataset takes the same name as this source dataset
verbose (bool (optional)) – Whether or not to print debugging statements to stdout. Default = False
kwargs (keyword arguments) – keyword arguments that will be passed on to write_main_data()
- Returns:
h5_trunc – USIDataset containing the sliced data
- Return type:
USIDataset
- to_csv(output_path=None, force=False)[source]¶
Output this USIDataset and position + spectroscopic values to a csv file. This should ideally be limited to small datasets only
- Parameters:
output_path (str, optional. Default = None) – Path of the csv file to be written
force (bool, optional. Default = False) – Whether or not to force the write even for large datasets
- Returns:
output_file (str)
Author - Daniel Streater, Suhas Somnath
- toggle_sorting()[source]¶
Toggles between sorting from the fastest changing dimension to the slowest and sorting based on the order of the labels
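For example:
>>> pd.get_current_sorting()  # prints the current sorting method
>>> pd.toggle_sorting()       # labels and sizes attributes now follow the other ordering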
- virtual_sources()¶
Get a list of the data mappings for a virtual dataset
- visualize(slice_dict=None, verbose=False, **kwargs)[source]¶
Interactive visualization of this dataset. Only available in Jupyter notebooks.
- Parameters:
slice_dict (dictionary, optional) – Slicing instructions
verbose (bool, optional) – Whether or not to print debugging statements. Default = Off
- Returns:
fig (matplotlib.figure handle) – Handle for the figure object
axis (matplotlib.Axes.axis object) – Axis within which the data was plotted. Note - the interactive visualizer does not return this object
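A hedged sketch for use within a Jupyter notebook (the dimension label 'Bias' is hypothetical):
>>> fig, axes = pd.visualize(slice_dict={'Bias': [0]})  # static plot of one slice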
- write_direct(source, source_sel=None, dest_sel=None)¶
Write data directly to HDF5 from a NumPy array.
The source array must be C-contiguous. Selections must be the output of numpy.s_[<args>].
Broadcasting is supported for simple indexing.
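A hedged sketch mirroring read_direct (the dataset must be writable; the shapes are hypothetical):
>>> import numpy as np
>>> source = np.zeros((10,), dtype=dataset.dtype)  # C-contiguous source array
>>> dataset.write_direct(source, source_sel=np.s_[0:10], dest_sel=np.s_[0, 0:10])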