pyUSID.io.usi_data.USIDataset

class pyUSID.io.usi_data.USIDataset(h5_ref, sort_dims=False)[source]

Bases: Dataset

A class that simplifies slicing, visualization, reshaping, reduction etc. of USID datasets in HDF5 files.

This class extends h5py.Dataset.

Parameters:
  • h5_ref (h5py.Dataset) – The HDF5 dataset, which must be a USID Main dataset

  • sort_dims (bool, Optional. Default=False) – If True, dimensions will be sorted from slowest to fastest varying. If False, dimensions will be arranged as they appear in the ancillary datasets.
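
For example, a minimal usage sketch (the file name and internal HDF5 path below are hypothetical; any h5py.Dataset that satisfies the USID Main dataset requirements will do):

>>> import h5py
>>> from pyUSID.io.usi_data import USIDataset
>>> h5_f = h5py.File('example.h5', mode='r')                # hypothetical file
>>> h5_main = h5_f['Measurement_000/Channel_000/Raw_Data']  # hypothetical path to a USID Main dataset
>>> usi_main = USIDataset(h5_main)
>>> print(usi_main.pos_dim_labels, usi_main.spec_dim_labels)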

Key methods:

self.get_current_sorting()
self.toggle_sorting()
self.get_pos_values()
self.get_spec_values()
self.get_n_dim_form()
self.slice()
self.h5_spec_vals

Associated Spectroscopic Values dataset

Type:

h5py.Dataset

self.h5_spec_inds

Associated Spectroscopic Indices dataset

Type:

h5py.Dataset

self.h5_pos_vals

Associated Position Values dataset

Type:

h5py.Dataset

self.h5_pos_inds

Associated Position Indices dataset

Type:

h5py.Dataset

self.pos_dim_labels

The labels for the position dimensions.

Type:

list of str

self.spec_dim_labels

The labels for the spectroscopic dimensions.

Type:

list of str

self.n_dim_labels

The labels for the n-dimensional dataset.

Type:

list of str

self.pos_dim_sizes

A list of the sizes of each position dimension.

Type:

list of int

self.spec_dim_sizes

A list of the sizes of each spectroscopic dimension.

Type:

list of int

self.n_dim_sizes

A list of the sizes of each dimension.

Type:

list of int

Notes

The order of all labels and sizes attributes is determined by the current value of sort_dims.

Methods

asstr

Get a wrapper to read string data as Python strings.

astype

Get a wrapper allowing you to perform reads to a different destination type.

fields

Get a wrapper to read a subset of fields from a compound data type.

flush

Flush the dataset data and metadata to the file.

get_current_sorting

Prints the current sorting method.

get_n_dim_form

Reshapes the dataset to an N-dimensional array

get_pos_values

Extract the reference values for the specified position dimension

get_spec_values

Extract the values for the specified spectroscopic dimension

iter_chunks

Return chunk iterator.

len

The size of the first axis.

make_scale

Make this dataset an HDF5 dimension scale.

read_direct

Read data directly from HDF5 into an existing NumPy array.

reduce

Reduce the dataset along the named position and/or spectroscopic dimensions using a given reduction function.

refresh

Refresh the dataset metadata by reloading from the file.

resize

Resize the dataset, or the specified axis.

slice

Slice the dataset based on an input dictionary of 'str': slice pairs.

slice_to_dataset

Slices the dataset, writes the result back to the HDF5 file, and returns a new USIDataset object.

to_csv

Output this USIDataset and position + spectroscopic values to a csv file.

toggle_sorting

Toggles between sorting from the fastest-changing dimension to the slowest and sorting based on the order of the labels.

virtual_sources

Get a list of the data mappings for a virtual dataset

visualize

Interactive visualization of this dataset.

write_direct

Write data directly to HDF5 from a NumPy array.

Attributes

attrs

Attributes attached to this object

chunks

Dataset chunks (or None)

compression

Compression strategy (or None)

compression_opts

Compression setting.

dims

Access dimension scales attached to this dataset.

dtype

Numpy dtype representing the datatype

external

External file settings.

file

Return a File instance associated with this object

fillvalue

Fill value for this dataset (0 by default)

fletcher32

Fletcher32 filter is present (T/F)

id

Low-level identifier appropriate for this object

is_scale

Return True if this dataset is also a dimension scale.

is_virtual

Check if this is a virtual dataset

maxshape

Shape up to which this dataset can be resized.

name

Return the full name of this object.

nbytes

Numpy-style attribute giving the raw dataset size as the number of bytes

ndim

Numpy-style attribute giving the number of dimensions

parent

Return the parent group of this object.

ref

An (opaque) HDF5 reference to this object

regionref

Create a region reference (Datasets only).

scaleoffset

Scale/offset filter settings.

shape

Numpy-style shape tuple giving dataset dimensions

shuffle

Shuffle filter present (T/F)

size

Numpy-style attribute giving the total dataset size

__array__(dtype=None)

Create a Numpy array containing the whole dataset. DON’T THINK THIS MEANS DATASETS ARE INTERCHANGEABLE WITH ARRAYS. For one thing, you have to read the whole dataset every time this method is called.

__getitem__(args, new_dtype=None)

Read a slice from the HDF5 dataset.

Takes slices and recarray-style field names (more than one is allowed!) in any order. Obeys basic NumPy rules, including broadcasting.

Also supports:

  • Boolean “mask” array indexing

__getnewargs__()

Disable pickle.

Handles for HDF5 objects can’t be reliably deserialised, because the recipient may not have access to the same files. So we do this to fail early.

If you really want to pickle h5py objects and can live with some limitations, look at the h5pickle project on PyPI.

__iter__()

Iterate over the first axis. TypeError if scalar.

BEWARE: Modifications to the yielded data are NOT written to file.

__len__()

The size of the first axis. TypeError if scalar.

Limited to 2**32 on 32-bit systems; Dataset.len() is preferred.

__setitem__(args, val)

Write to the HDF5 dataset from a Numpy array.

NumPy’s broadcasting rules are honored, for “simple” indexing (slices and integers). For advanced indexing, the shapes must match.

asstr(encoding=None, errors='strict')

Get a wrapper to read string data as Python strings:

>>> str_array = dataset.asstr()[:]

The parameters have the same meaning as in bytes.decode(). If encoding is unspecified, it will use the encoding in the HDF5 datatype (either ascii or utf-8).

astype(dtype)

Get a wrapper allowing you to perform reads to a different destination type, e.g.:

>>> double_precision = dataset.astype('f8')[0:100:2]

property attrs

Attributes attached to this object

property chunks

Dataset chunks (or None)

property compression

Compression strategy (or None)

property compression_opts

Compression setting. Int(0-9) for gzip, 2-tuple for szip.

property dims

Access dimension scales attached to this dataset.

property dtype

Numpy dtype representing the datatype

property external

External file settings. Returns a list of tuples of (name, offset, size) for each external file entry, or returns None if no external files are used.

fields(names, *, _prior_dtype=None)

Get a wrapper to read a subset of fields from a compound data type:

>>> coords_2d = dataset.fields(['x', 'y'])[:]

If names is a string, a single field is extracted, and the resulting array will have that dtype. Otherwise, it should be an iterable, and the read data will have a compound dtype.

property file

Return a File instance associated with this object

property fillvalue

Fill value for this dataset (0 by default)

property fletcher32

Fletcher32 filter is present (T/F)

flush()

Flush the dataset data and metadata to the file. If the dataset is chunked, raw data chunks are written to the file.

This is part of the SWMR features and only exists when the HDF5 library version is >= 1.9.178.

get_current_sorting()[source]

Prints the current sorting method.

get_n_dim_form(as_scalar=False, lazy=False)[source]

Reshapes the dataset to an N-dimensional array

Parameters:
  • as_scalar (bool, optional. Default = False) – If False, the data is returned in its original (complex, compound) dtype. If True, the data is flattened to a real-valued dataset.

  • lazy (bool, optional. Default = False) – If False, n_dim_data will be a numpy.ndarray. If True, the returned object is a dask.array.core.Array.

Returns:

n_dim_data – N-dimensional form of the dataset

Return type:

numpy.ndarray or dask.array.core.Array
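
A minimal usage sketch, assuming usi_main is a USIDataset (constructed as shown at the top of this page):

>>> data_nd = usi_main.get_n_dim_form()             # numpy.ndarray shaped according to n_dim_sizes
>>> data_lazy = usi_main.get_n_dim_form(lazy=True)  # dask array; call data_lazy.compute() to materialize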

get_pos_values(dim_name)[source]

Extract the reference values for the specified position dimension

Parameters:

dim_name (str) – Name of one of the dimensions in self.pos_dim_labels

Returns:

dim_values – Array containing the unit values of the dimension dim_name

Return type:

numpy.ndarray

get_spec_values(dim_name)[source]

Extract the values for the specified spectroscopic dimension

Parameters:

dim_name (str) – Name of one of the dimensions in self.spec_dim_labels

Returns:

dim_values – Array containing the unit values of the dimension dim_name

Return type:

numpy.ndarray
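
For example, assuming usi_main has a position dimension labeled 'X' and a spectroscopic dimension labeled 'Bias' (hypothetical labels; consult pos_dim_labels and spec_dim_labels for the real ones):

>>> x_vals = usi_main.get_pos_values('X')         # unit values along the 'X' position dimension
>>> bias_vals = usi_main.get_spec_values('Bias')  # unit values along the 'Bias' spectroscopic dimension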

property id

Low-level identifier appropriate for this object

property is_scale

Return True if this dataset is also a dimension scale.

Return False otherwise.

property is_virtual

Check if this is a virtual dataset

iter_chunks(sel=None)

Return chunk iterator. If set, the sel argument is a slice or tuple of slices that defines the region to be used. If not set, the entire dataspace will be used for the iterator.

For each chunk within the given region, the iterator yields a tuple of slices that gives the intersection of the given chunk with the selection area.

A TypeError will be raised if the dataset is not chunked.

A ValueError will be raised if the selection region is invalid.
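
A sketch of per-chunk processing (dset must be a chunked dataset; process() is a hypothetical function):

>>> for chunk_slices in dset.iter_chunks():
...     chunk = dset[chunk_slices]  # data for one chunk's intersection with the selection
...     process(chunk)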

len()

The size of the first axis. TypeError if scalar.

Use of this method is preferred to len(dset), as Python's built-in len() cannot handle values greater than 2**32 on 32-bit systems.

make_scale(name='')

Make this dataset an HDF5 dimension scale.

You can then attach it to dimensions of other datasets like this:

other_ds.dims[0].attach_scale(ds)

You can optionally pass a name to associate with this scale.

property maxshape

Shape up to which this dataset can be resized. Axes with value None have no resize limit.

property name

Return the full name of this object. None if anonymous.

property nbytes

Numpy-style attribute giving the raw dataset size as the number of bytes

property ndim

Numpy-style attribute giving the number of dimensions

property parent

Return the parent group of this object.

This is always equivalent to obj.file[posixpath.dirname(obj.name)]. ValueError if this object is anonymous.

read_direct(dest, source_sel=None, dest_sel=None)

Read data directly from HDF5 into an existing NumPy array.

The destination array must be C-contiguous and writable. Selections must be the output of numpy.s_[<args>].

Broadcasting is supported for simple indexing.
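
For example, a sketch assuming dset is one-dimensional with at least 100 elements:

>>> import numpy as np
>>> dest = np.empty((100,), dtype=dset.dtype)  # C-contiguous and writable
>>> dset.read_direct(dest, source_sel=np.s_[0:100], dest_sel=np.s_[0:100])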

reduce(dims, ufunc=<function mean>, to_hdf5=False, dset_name=None, verbose=False)[source]

Parameters:
  • dims (str or list of str) – Names of the position and/or spectroscopic dimensions that need to be reduced

  • ufunc (callable, optional. Default = dask.array.mean) – Reduction function, such as dask.array.mean, available in dask.array

  • to_hdf5 (bool, optional. Default = False) – Whether or not to write the reduced data back to a new dataset

  • dset_name (str (optional)) – Name of the new USID Main dataset in the HDF5 file that will contain the reduced data. Default - the reduced dataset takes the same name as this source dataset

  • verbose (bool, optional. Default = False) – Whether or not to print any debugging statements to stdout

Returns:

  • reduced_nd (dask.array object) – Dask array object containing the reduced data. Call compute() on this object to get the equivalent numpy array

  • h5_main_red (USIDataset) – USIDataset reference if to_hdf5 was set to True. Otherwise - None.
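
A minimal sketch, assuming usi_main has a spectroscopic dimension labeled 'Bias' (hypothetical) and, for to_hdf5=True, that its file is open in a writable mode:

>>> import dask.array as da
>>> reduced_nd, h5_main_red = usi_main.reduce('Bias', ufunc=da.mean, to_hdf5=True)
>>> reduced_np = reduced_nd.compute()  # materialize the dask array as a numpy array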

property ref

An (opaque) HDF5 reference to this object

refresh()

Refresh the dataset metadata by reloading from the file.

This is part of the SWMR features and only exists when the HDF5 library version is >= 1.9.178.

property regionref

Create a region reference (Datasets only).

The syntax is regionref[<slices>]. For example, dset.regionref[...] creates a region reference in which the whole dataset is selected.

Can also be used to determine the shape of the referenced dataset (via .shape property), or the shape of the selection (via the .selection property).

resize(size, axis=None)

Resize the dataset, or the specified axis.

The dataset must be stored in chunked format; it can be resized up to the “maximum shape” (keyword maxshape) specified at creation time. The rank of the dataset cannot be changed.

“Size” should be a shape tuple, or if an axis is specified, an integer.

BEWARE: This functions differently than the NumPy resize() method! The data is not “reshuffled” to fit in the new shape; each axis is grown or shrunk independently. The coordinates of existing data are fixed.
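
For example, a sketch assuming a chunked 2-D dataset created with a sufficiently large maxshape:

>>> dset.resize((200, 100))   # resize both axes at once
>>> dset.resize(500, axis=0)  # grow only the first axis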

property scaleoffset

Scale/offset filter settings. For integer data types, this is the number of bits stored, or 0 for auto-detected. For floating point data types, this is the number of decimal places retained. If the scale/offset filter is not in use, this is None.

property shape

Numpy-style shape tuple giving dataset dimensions

property shuffle

Shuffle filter present (T/F)

property size

Numpy-style attribute giving the total dataset size

slice(slice_dict, ndim_form=True, as_scalar=False, verbose=False, lazy=False)[source]

Slice the dataset based on an input dictionary of ‘str’: slice pairs. Each string should correspond to a dimension label. The slices can be array-likes or slice objects.

Parameters:
  • slice_dict (dict) – Dictionary of dimension-label : array-like (or slice object) pairs for each dimension one needs to slice

  • ndim_form (bool, optional) – Whether or not to return the slice in its N-dimensional form. Default = True

  • as_scalar (bool, optional) – Whether or not to flatten the data to real-valued (scalar) values only. Default = False

  • verbose (bool, optional) – Whether or not to print debugging statements

  • lazy (bool, optional. Default = False) – If False, data_slice will be a numpy.ndarray. If True, the returned object is a dask.array.core.Array.

Returns:

  • data_slice (numpy.ndarray or dask.array.core.Array) – Slice of the dataset. The dataset has been reshaped to N dimensions if success is True, reshaped only by position dimensions if success is 'Positions', or not reshaped at all if success is False.

  • success (str or bool) – Informs the user as to how the data_slice has been shaped.
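
A minimal sketch (the dimension labels 'X' and 'Bias' are hypothetical; keys must come from pos_dim_labels or spec_dim_labels):

>>> data_slice, success = usi_main.slice({'X': [3], 'Bias': slice(0, 16)})
>>> print(success)  # True if the slice could be returned in full N-dimensional form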

slice_to_dataset(slice_dict, dset_name=None, verbose=False, **kwargs)[source]

Slices the dataset, writes the result back to the HDF5 file, and returns a new USIDataset object.

Parameters:
  • slice_dict (dict) – Dictionary to slice one or more dimensions of the dataset by indices

  • dset_name (str (optional)) – Name of the new USID Main dataset in the HDF5 file that will contain the sliced data. Default - the sliced dataset takes the same name as this source dataset

  • verbose (bool (optional)) – Whether or not to print debugging statements to stdout. Default = False

  • kwargs (keyword arguments) – Keyword arguments that will be passed on to write_main_data()

Returns:

h5_trunc – USIDataset containing the sliced data

Return type:

USIDataset
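
For example, a sketch in which the dimension label 'Bias' and the destination name are hypothetical:

>>> h5_trunc = usi_main.slice_to_dataset({'Bias': slice(0, 16)}, dset_name='Raw_Data_Subset')
>>> print(h5_trunc.shape)  # a new USID Main dataset holding only the sliced portion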

to_csv(output_path=None, force=False)[source]

Output this USIDataset and position + spectroscopic values to a csv file. This should be limited to small datasets.

Parameters:
  • output_path (str, optional) – Path to which the output file should be written. By default, the file is written to the same directory as the HDF5 file

  • force (bool, optional) – Whether or not to force a large dataset to be written to CSV. Default = False

Returns:

  • output_file (str) – Path to the CSV file that was written

Authors: Daniel Streater, Suhas Somnath
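
For example (the output path is hypothetical):

>>> csv_path = usi_main.to_csv(output_path='Raw_Data.csv')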

toggle_sorting()[source]

Toggles between sorting from the fastest-changing dimension to the slowest and sorting based on the order of the labels.
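
For example:

>>> usi_main.get_current_sorting()  # prints how the dimensions are currently ordered
>>> usi_main.toggle_sorting()       # flips the ordering; the label and size attributes update accordingly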

virtual_sources()

Get a list of the data mappings for a virtual dataset

visualize(slice_dict=None, verbose=False, **kwargs)[source]

Interactive visualization of this dataset. Only available in Jupyter notebooks.

Parameters:
  • slice_dict (dictionary, optional) – Slicing instructions

  • verbose (bool, optional) – Whether or not to print debugging statements. Default = Off

Returns:

  • fig (matplotlib.figure.Figure) – Handle for the figure object

  • axis (matplotlib.axes.Axes) – Axis within which the data was plotted. Note - the interactive visualizer does not return this object
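
A minimal sketch, to be run inside a Jupyter notebook ('Bias' is a hypothetical dimension label):

>>> fig, axis = usi_main.visualize(slice_dict={'Bias': [0]})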

write_direct(source, source_sel=None, dest_sel=None)

Write data directly to HDF5 from a NumPy array.

The source array must be C-contiguous. Selections must be the output of numpy.s_[<args>].

Broadcasting is supported for simple indexing.