pyUSID.io.usi_data.USIDataset

class pyUSID.io.usi_data.USIDataset(h5_ref, sort_dims=False)[source]

Bases: Dataset

A class that simplifies slicing, visualization, reshaping, reduction etc. of USID datasets in HDF5 files.

This class extends h5py.Dataset.

Parameters:
  • h5_ref (h5py.Dataset) – The HDF5 dataset, which must be a USID Main dataset

  • sort_dims (bool, Optional. Default=False) – If True, dimensions will be sorted from slowest to fastest varying. If False, dimensions will be arranged as they appear in the ancillary datasets.
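
For example, a minimal usage sketch (the file name and internal HDF5 path below are hypothetical; any h5py.Dataset that satisfies the USID Main dataset requirements will do):

>>> import h5py
>>> from pyUSID.io.usi_data import USIDataset
>>> h5_f = h5py.File('example.h5', mode='r')                # hypothetical file
>>> h5_main = h5_f['Measurement_000/Channel_000/Raw_Data']  # hypothetical path to a USID Main dataset
>>> usi_main = USIDataset(h5_main)
>>> print(usi_main.pos_dim_labels, usi_main.spec_dim_labels)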

Key methods:

self.get_current_sorting()
self.toggle_sorting()
self.get_pos_values()
self.get_spec_values()
self.get_n_dim_form()
self.slice()
self.h5_spec_vals

Associated Spectroscopic Values dataset

Type:

h5py.Dataset

self.h5_spec_inds

Associated Spectroscopic Indices dataset

Type:

h5py.Dataset

self.h5_pos_vals

Associated Position Values dataset

Type:

h5py.Dataset

self.h5_pos_inds

Associated Position Indices dataset

Type:

h5py.Dataset

self.pos_dim_labels

The labels for the position dimensions.

Type:

list of str

self.spec_dim_labels

The labels for the spectroscopic dimensions.

Type:

list of str

self.n_dim_labels

The labels for the n-dimensional dataset.

Type:

list of str

self.pos_dim_sizes

A list of the sizes of each position dimension.

Type:

list of int

self.spec_dim_sizes

A list of the sizes of each spectroscopic dimension.

Type:

list of int

self.n_dim_sizes

A list of the sizes of each dimension.

Type:

list of int

Notes

The order of all labels and sizes attributes is determined by the current value of sort_dims.

Methods

asstr

Get a wrapper to read string data as Python strings.

astype

Get a wrapper allowing you to perform reads to a different destination type.

fields

Get a wrapper to read a subset of fields from a compound data type.

flush

Flush the dataset data and metadata to the file.

get_current_sorting

Prints the current sorting method.

get_n_dim_form

Reshapes the dataset to an N-dimensional array

get_pos_values

Extract the reference values for the specified position dimension

get_spec_values

Extract the values for the specified spectroscopic dimension

iter_chunks

Return chunk iterator.

len

The size of the first axis.

make_scale

Make this dataset an HDF5 dimension scale.

read_direct

Read data directly from HDF5 into an existing NumPy array.

reduce

Reduce the dataset along the named position and/or spectroscopic dimensions using a given reduction function.

refresh

Refresh the dataset metadata by reloading from the file.

resize

Resize the dataset, or the specified axis.

slice

Slice the dataset based on an input dictionary of 'str': slice pairs.

slice_to_dataset

Slices the dataset, writes the result back to the HDF5 file, and returns a new USIDataset object.

to_csv

Output this USIDataset and position + spectroscopic values to a csv file.

toggle_sorting

Toggles between sorting from the fastest-changing dimension to the slowest and sorting based on the order of the labels.

virtual_sources

Get a list of the data mappings for a virtual dataset

visualize

Interactive visualization of this dataset.

write_direct

Write data directly to HDF5 from a NumPy array.

Attributes

attrs

Attributes attached to this object

chunks

Dataset chunks (or None)

compression

Compression strategy (or None)

compression_opts

Compression setting.

dims

Access dimension scales attached to this dataset.

dtype

Numpy dtype representing the datatype

external

External file settings.

file

Return a File instance associated with this object

fillvalue

Fill value for this dataset (0 by default)

fletcher32

Fletcher32 filter is present (T/F)

id

Low-level identifier appropriate for this object

is_scale

Return True if this dataset is also a dimension scale.

is_virtual

Check if this is a virtual dataset

maxshape

Shape up to which this dataset can be resized.

name

Return the full name of this object.

nbytes

Numpy-style attribute giving the raw dataset size as the number of bytes

ndim

Numpy-style attribute giving the number of dimensions

parent

Return the parent group of this object.

ref

An (opaque) HDF5 reference to this object

regionref

Create a region reference (Datasets only).

scaleoffset

Scale/offset filter settings.

shape

Numpy-style shape tuple giving dataset dimensions

shuffle

Shuffle filter present (T/F)

size

Numpy-style attribute giving the total dataset size

__array__(dtype=None)

Create a Numpy array containing the whole dataset. DON’T THINK THIS MEANS DATASETS ARE INTERCHANGEABLE WITH ARRAYS. For one thing, you have to read the whole dataset every time this method is called.

__getitem__(args, new_dtype=None)

Read a slice from the HDF5 dataset.

Takes slices and recarray-style field names (more than one is allowed!) in any order. Obeys basic NumPy rules, including broadcasting.

Also supports:

  • Boolean “mask” array indexing

__getnewargs__()

Disable pickle.

Handles for HDF5 objects can’t be reliably deserialised, because the recipient may not have access to the same files. So we do this to fail early.

If you really want to pickle h5py objects and can live with some limitations, look at the h5pickle project on PyPI.

__iter__()

Iterate over the first axis. TypeError if scalar.

BEWARE: Modifications to the yielded data are NOT written to file.

__len__()

The size of the first axis. TypeError if scalar.

Limited to 2**32 on 32-bit systems; Dataset.len() is preferred.

__setitem__(args, val)

Write to the HDF5 dataset from a Numpy array.

NumPy’s broadcasting rules are honored, for “simple” indexing (slices and integers). For advanced indexing, the shapes must match.

asstr(encoding=None, errors='strict')

Get a wrapper to read string data as Python strings:

>>> str_array = dataset.asstr()[:]

The parameters have the same meaning as in bytes.decode(). If encoding is unspecified, it will use the encoding in the HDF5 datatype (either ascii or utf-8).

astype(dtype)

Get a wrapper allowing you to perform reads to a different destination type, e.g.:

>>> double_precision = dataset.astype('f8')[0:100:2]

property attrs

Attributes attached to this object

property chunks

Dataset chunks (or None)

property compression

Compression strategy (or None)

property compression_opts

Compression setting. Int(0-9) for gzip, 2-tuple for szip.

property dims

Access dimension scales attached to this dataset.

property dtype

Numpy dtype representing the datatype

property external

External file settings. Returns a list of tuples of (name, offset, size) for each external file entry, or returns None if no external files are used.

fields(names, *, _prior_dtype=None)

Get a wrapper to read a subset of fields from a compound data type:

>>> coords_2d = dataset.fields(['x', 'y'])[:]

If names is a string, a single field is extracted, and the resulting array will have that dtype. Otherwise, it should be an iterable, and the read data will have a compound dtype.

property file

Return a File instance associated with this object

property fillvalue

Fill value for this dataset (0 by default)

property fletcher32

Fletcher32 filter is present (T/F)

flush()

Flush the dataset data and metadata to the file. If the dataset is chunked, raw data chunks are written to the file.

This is part of the SWMR features and only exists when the HDF5 library version is >= 1.9.178.

get_current_sorting()[source]

Prints the current sorting method.

get_n_dim_form(as_scalar=False, lazy=False)[source]

Reshapes the dataset to an N-dimensional array

Parameters:
  • as_scalar (bool, optional. Default = False) – If False, the data is returned in its original (complex, compound) dtype. If True, the data is flattened to a real-valued dataset.

  • lazy (bool, optional. Default = False) – If False, n_dim_data will be a numpy.ndarray. If True, the returned object is a dask.array.core.Array.

Returns:

n_dim_data – N-dimensional form of the dataset

Return type:

numpy.ndarray or dask.array.core.Array
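
A minimal usage sketch, assuming usi_main is a USIDataset (constructed as shown at the top of this page):

>>> data_nd = usi_main.get_n_dim_form()             # numpy.ndarray shaped according to n_dim_sizes
>>> data_lazy = usi_main.get_n_dim_form(lazy=True)  # dask array; call data_lazy.compute() to materialize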

get_pos_values(dim_name)[source]

Extract the reference values for the specified position dimension

Parameters:

dim_name (str) – Name of one of the dimensions in self.pos_dim_labels

Returns:

dim_values – Array containing the unit values of the dimension dim_name

Return type:

numpy.ndarray

get_spec_values(dim_name)[source]

Extract the values for the specified spectroscopic dimension

Parameters:

dim_name (str) – Name of one of the dimensions in self.spec_dim_labels

Returns:

dim_values – Array containing the unit values of the dimension dim_name

Return type:

numpy.ndarray
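
For example, assuming usi_main has a position dimension labeled 'X' and a spectroscopic dimension labeled 'Bias' (hypothetical labels; consult pos_dim_labels and spec_dim_labels for the real ones):

>>> x_vals = usi_main.get_pos_values('X')         # unit values along the 'X' position dimension
>>> bias_vals = usi_main.get_spec_values('Bias')  # unit values along the 'Bias' spectroscopic dimension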

property id

Low-level identifier appropriate for this object

property is_scale

Return True if this dataset is also a dimension scale.

Return False otherwise.

property is_virtual

Check if this is a virtual dataset

iter_chunks(sel=None)

Return chunk iterator. If set, the sel argument is a slice or tuple of slices that defines the region to be used. If not set, the entire dataspace will be used for the iterator.

For each chunk within the given region, the iterator yields a tuple of slices that gives the intersection of the given chunk with the selection area.

A TypeError will be raised if the dataset is not chunked.

A ValueError will be raised if the selection region is invalid.
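
A sketch of per-chunk processing (dset must be a chunked dataset; process() is a hypothetical function):

>>> for chunk_slices in dset.iter_chunks():
...     chunk = dset[chunk_slices]  # data for one chunk's intersection with the selection
...     process(chunk)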

len()

The size of the first axis. TypeError if scalar.

Use of this method is preferred to len(dset), as Python's built-in len() cannot handle values greater than 2**32 on 32-bit systems.

make_scale(name='')

Make this dataset an HDF5 dimension scale.

You can then attach it to dimensions of other datasets like this:

other_ds.dims[0].attach_scale(ds)

You can optionally pass a name to associate with this scale.

property maxshape

Shape up to which this dataset can be resized. Axes with value None have no resize limit.

property name

Return the full name of this object. None if anonymous.

property nbytes

Numpy-style attribute giving the raw dataset size as the number of bytes

property ndim

Numpy-style attribute giving the number of dimensions

property parent

Return the parent group of this object.

This is always equivalent to obj.file[posixpath.dirname(obj.name)]. ValueError if this object is anonymous.

read_direct(dest, source_sel=None, dest_sel=None)

Read data directly from HDF5 into an existing NumPy array.

The destination array must be C-contiguous and writable. Selections must be the output of numpy.s_[<args>].

Broadcasting is supported for simple indexing.
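
For example, a sketch assuming dset is one-dimensional with at least 100 elements:

>>> import numpy as np
>>> dest = np.empty((100,), dtype=dset.dtype)  # C-contiguous and writable
>>> dset.read_direct(dest, source_sel=np.s_[0:100], dest_sel=np.s_[0:100])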

reduce(dims, ufunc=<function mean>, to_hdf5=False, dset_name=None, verbose=False)[source]

Parameters:
  • dims (str or list of str) – Names of the position and/or spectroscopic dimensions that need to be reduced

  • ufunc (callable, optional. Default = dask.array.mean) – Reduction function, such as dask.array.mean, available in dask.array

  • to_hdf5 (bool, optional. Default = False) – Whether or not to write the reduced data back to a new dataset

  • dset_name (str (optional)) – Name of the new USID Main dataset in the HDF5 file that will contain the reduced data. Default - the reduced dataset takes the same name as this source dataset

  • verbose (bool, optional. Default = False) – Whether or not to print any debugging statements to stdout

Returns:

  • reduced_nd (dask.array object) – Dask array object containing the reduced data. Call compute() on this object to get the equivalent numpy array

  • h5_main_red (USIDataset) – USIDataset reference if to_hdf5 was set to True. Otherwise - None.
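
A minimal sketch, assuming usi_main has a spectroscopic dimension labeled 'Bias' (hypothetical) and, for to_hdf5=True, that its file is open in a writable mode:

>>> import dask.array as da
>>> reduced_nd, h5_main_red = usi_main.reduce('Bias', ufunc=da.mean, to_hdf5=True)
>>> reduced_np = reduced_nd.compute()  # materialize the dask array as a numpy array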

property ref

An (opaque) HDF5 reference to this object

refresh()

Refresh the dataset metadata by reloading from the file.

This is part of the SWMR features and only exists when the HDF5 library version is >= 1.9.178.

property regionref

Create a region reference (Datasets only).

The syntax is regionref[<slices>]. For example, dset.regionref[...] creates a region reference in which the whole dataset is selected.

Can also be used to determine the shape of the referenced dataset (via .shape property), or the shape of the selection (via the .selection property).

resize(size, axis=None)

Resize the dataset, or the specified axis.

The dataset must be stored in chunked format; it can be resized up to the “maximum shape” (keyword maxshape) specified at creation time. The rank of the dataset cannot be changed.

“Size” should be a shape tuple, or if an axis is specified, an integer.

BEWARE: This functions differently than the NumPy resize() method! The data is not “reshuffled” to fit in the new shape; each axis is grown or shrunk independently. The coordinates of existing data are fixed.
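
For example, a sketch assuming a chunked 2-D dataset created with a sufficiently large maxshape:

>>> dset.resize((200, 100))   # resize both axes at once
>>> dset.resize(500, axis=0)  # grow only the first axis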

property scaleoffset

Scale/offset filter settings. For integer data types, this is the number of bits stored, or 0 for auto-detected. For floating point data types, this is the number of decimal places retained. If the scale/offset filter is not in use, this is None.

property shape

Numpy-style shape tuple giving dataset dimensions

property shuffle

Shuffle filter present (T/F)

property size

Numpy-style attribute giving the total dataset size

slice(slice_dict, ndim_form=True, as_scalar=False, verbose=False, lazy=False)[source]

Slice the dataset based on an input dictionary of ‘str’: slice pairs. Each string should correspond to a dimension label. The slices can be array-likes or slice objects.

Parameters:
  • slice_dict (dict) – Dictionary of dimension-label : array-like (or slice object) pairs for each dimension one needs to slice

  • ndim_form (bool, optional) – Whether or not to return the slice in its N-dimensional form. Default = True

  • as_scalar (bool, optional) – Whether or not to flatten the data to real-valued (scalar) values only. Default = False

  • verbose (bool, optional) – Whether or not to print debugging statements

  • lazy (bool, optional. Default = False) – If False, data_slice will be a numpy.ndarray. If True, the returned object is a dask.array.core.Array.

Returns:

  • data_slice (numpy.ndarray or dask.array.core.Array) – Slice of the dataset. The dataset has been reshaped to N dimensions if success is True, reshaped only by position dimensions if success is 'Positions', or not reshaped at all if success is False.

  • success (str or bool) – Informs the user as to how the data_slice has been shaped.
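
A minimal sketch (the dimension labels 'X' and 'Bias' are hypothetical; keys must come from pos_dim_labels or spec_dim_labels):

>>> data_slice, success = usi_main.slice({'X': [3], 'Bias': slice(0, 16)})
>>> print(success)  # True if the slice could be returned in full N-dimensional form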

slice_to_dataset(slice_dict, dset_name=None, verbose=False, **kwargs)[source]

Slices the dataset, writes the result back to the HDF5 file, and returns a new USIDataset object.

Parameters:
  • slice_dict (dict) – Dictionary to slice one or more dimensions of the dataset by indices

  • dset_name (str (optional)) – Name of the new USID Main dataset in the HDF5 file that will contain the sliced data. Default - the sliced dataset takes the same name as this source dataset

  • verbose (bool (optional)) – Whether or not to print debugging statements to stdout. Default = False

  • kwargs (keyword arguments) – Keyword arguments that will be passed on to write_main_data()

Returns:

h5_trunc – USIDataset containing the sliced data

Return type:

USIDataset
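
For example, a sketch in which the dimension label 'Bias' and the destination name are hypothetical:

>>> h5_trunc = usi_main.slice_to_dataset({'Bias': slice(0, 16)}, dset_name='Raw_Data_Subset')
>>> print(h5_trunc.shape)  # a new USID Main dataset holding only the sliced portion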

to_csv(output_path=None, force=False)[source]

Output this USIDataset and position + spectroscopic values to a csv file. This should be limited to small datasets.

Parameters:
  • output_path (str, optional) – Path to which the output file should be written. By default, the file is written to the same directory as the HDF5 file

  • force (bool, optional) – Whether or not to force a large dataset to be written to CSV. Default = False

Returns:

  • output_file (str) – Path to the CSV file that was written

Authors: Daniel Streater, Suhas Somnath
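
For example (the output path is hypothetical):

>>> csv_path = usi_main.to_csv(output_path='Raw_Data.csv')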

toggle_sorting()[source]

Toggles between sorting from the fastest-changing dimension to the slowest and sorting based on the order of the labels.
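
For example:

>>> usi_main.get_current_sorting()  # prints how the dimensions are currently ordered
>>> usi_main.toggle_sorting()       # flips the ordering; the label and size attributes update accordingly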

virtual_sources()

Get a list of the data mappings for a virtual dataset

visualize(slice_dict=None, verbose=False, **kwargs)[source]

Interactive visualization of this dataset. Only available in Jupyter notebooks.

Parameters:
  • slice_dict (dictionary, optional) – Slicing instructions

  • verbose (bool, optional) – Whether or not to print debugging statements. Default = Off

Returns:

  • fig (matplotlib.figure.Figure) – Handle for the figure object

  • axis (matplotlib.axes.Axes) – Axis within which the data was plotted. Note - the interactive visualizer does not return this object
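
A minimal sketch, to be run inside a Jupyter notebook ('Bias' is a hypothetical dimension label):

>>> fig, axis = usi_main.visualize(slice_dict={'Bias': [0]})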

write_direct(source, source_sel=None, dest_sel=None)

Write data directly to HDF5 from a NumPy array.

The source array must be C-contiguous. Selections must be the output of numpy.s_[<args>].

Broadcasting is supported for simple indexing.