sidpy.sid.dataset.Dataset

class sidpy.sid.dataset.Dataset(dask, name, chunks, dtype=None, meta=None, shape=None)[source]

Bases: Array

To instantiate from an existing array-like object (numpy array, list, or tuple), use Dataset.from_array().

This dask array is extended to have the following attributes:

  • data_type: DataType ('image', 'image_stack', 'spectral_image', …)

  • units: str

  • quantity: str – what kind of data ('intensity', 'height', …)

  • title: title of the data set

  • modality: character of the data, such as 'STM', 'AFM', 'TEM', 'SEM', 'DFT', 'simulation', …

  • source: origin of the data, such as the acquisition instrument ('Nion US100', 'VASP', …)

  • _axes: dictionary of Dimensions, one for each data dimension (the axes are dimension datasets with name, label, units, and dimension_type attributes)

  • metadata: dictionary of additional metadata

  • original_metadata: dictionary of the original metadata of the file

  • labels: returns labels of all dimensions

  • data_descriptor: returns a label for the colorbar in matplotlib and similar plots

Functions:

  • from_array(data, title): constructs the dataset from an array-like object (numpy array, dask array, …)

  • like_data(data, title): constructs a dataset from an array-like object and copies attributes and metadata from this dataset

  • copy()

  • plot(): plots the dataset depending on data_type and dimension_types

  • get_extent(): extent to be used with matplotlib's imshow function

  • set_dimension(ind, dimension): sets a Dimension for a specific axis

  • rename_dimension(ind, name): renames the Dimension at the given index

  • view_metadata(): pretty print of the metadata dictionary

  • view_original_metadata(): pretty print of the original_metadata dictionary
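
A minimal usage sketch (assuming a 2D numpy array; the attribute values and the Dimension keyword arguments shown are illustrative, not prescribed):

>>> import numpy as np
>>> import sidpy
>>> # construct a Dataset from a numpy array; metadata starts out generic
>>> dataset = sidpy.Dataset.from_array(np.random.random((128, 128)), title='random image')
>>> dataset.data_type = 'image'
>>> dataset.units = 'counts'
>>> dataset.quantity = 'intensity'
>>> # attach Dimension objects to both axes
>>> dataset.set_dimension(0, sidpy.Dimension(np.arange(128) * 0.5, name='x',
...                       units='nm', quantity='distance', dimension_type='spatial'))
>>> dataset.set_dimension(1, sidpy.Dimension(np.arange(128) * 0.5, name='y',
...                       units='nm', quantity='distance', dimension_type='spatial'))
>>> dataset.plot()  # picks a visualizer based on data_type and dimension_types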

Initializes a Dataset object, which is essentially a Dask array underneath.

self.quantity

Physical quantity, e.g. current

Type:

str

self.units

Physical units, e.g. amperes

Type:

str

self.data_type

Type of data such as Image, Spectrum, Spectral Image etc.

Type:

enum

self.title

Title for Dataset

Type:

str

self._structures

Dictionary of ase.Atoms objects representing structures; each can be given a name

Type:

dict

self.view

Instance of a class appropriate for visualizing this object

Type:

Visualizer

self.data_descriptor

Description of this dataset

Type:

str

self.modality

Character of the data, such as 'STM', 'TEM', 'DFT'

Type:

str

self.source

Source of this dataset, such as an instrument or analysis

Type:

str

self.h5_dataset

Reference to HDF5 Dataset object from which this Dataset was created

Type:

h5py.Dataset

self._axes

Dictionary of Dimension objects per dimension of the Dataset

Type:

dict

self.meta_data

Metadata to store relevant additional information for the dataset.

Type:

dict

self.original_metadata

Metadata from the original source of the dataset. This dictionary often contains the vendor-specific metadata or internal attributes of the analysis algorithm.

Type:

dict

Methods

abs

add_structure

adjust_axis

all

Returns True if all elements evaluate to True.

angle

any

Returns True if any of the elements evaluate to True.

argmax

Return indices of the maximum values along the given axis.

argmin

Return indices of the minimum values along the given axis.

argtopk

The indices of the top k elements of an array.

astype

Copy of the array, cast to a specified type.

choose

Use an index array to construct a new array from a set of choices.

clip

Return an array whose values are limited to [min, max].

compute

Compute this dask collection

compute_chunk_sizes

Compute the chunk sizes for a Dask array.

conj

Complex-conjugate all elements.

copy

Returns a deep copy of this dataset.

cumprod

Return the cumulative product of the elements along the given axis.

cumsum

Return the cumulative sum of the elements along the given axis.

del_dimension

Deletes the dimension attached to axis 'ind'.

dot

Dot product of self and other.

fft

Gets the FFT of a sidpy.Dataset of any size

flatten

Return a flattened array.

flatten_complex

Returns a dataset whose real and imaginary components have been flattened, as needed e.g. for fitting complex functions; the dataset must be 1D or 2D to begin with.

fold

This method collapses the dimensions of the sidpy dataset

from_array

Initializes a sidpy dataset from an array-like object (e.g. a numpy array); all metadata is set to generic values.

get_dimension_by_number

get_dimension_slope

get_dimensions_by_type

Get dimensions by dimension_type name.

get_dimensions_types

get_extent

Get image extents as needed, e.g., by matplotlib's imshow function.

get_image_dims

Get all spatial dimensions

get_spectral_dims

Get all spectral dimensions

hdf_close

like_data

Returns a sidpy.Dataset of new values but with the metadata of this dataset.

map_blocks

Map a function across all blocks of a dask array.

map_overlap

Map a function over blocks of the array with some overlap

max

Return the maximum along a given axis.

mean

Returns the average of the array elements along given axis.

min

Return the minimum along a given axis.

moment

Calculate the nth centralized moment.

nonzero

Return the indices of the elements that are non-zero.

persist

Persist this dask collection into memory

plot

Plots the dataset according to its shape, data_type, and the dimension_types of its dimensions.

prod

Return the product of the array elements over the given axis

ravel

Return a flattened array.

rechunk

Convert blocks in dask array x for new chunks.

reduce_dims

rename_dimension

Renames Dimension at the specified index

repeat

Repeat elements of an array.

reshape

Reshape array to new shape

round

Return array with each element rounded to the given number of decimals.

set_dimension

Sets the Dimension for a given axis of the dataset, including a new name, and updates the axes dictionary.

set_thumbnail

Creates a thumbnail, which is stored in the thumbnail attribute of the sidpy Dataset; thumbnail data is saved to the Thumbnail group of the associated h5_file if it exists.

squeeze

Remove axes of length one from array.

std

Returns the standard deviation of the array elements along given axis.

store

Store dask arrays in array-like objects, overwrite data in target

sum

Return the sum of the array elements over the given axis.

swapaxes

Return a view of the array with axis1 and axis2 interchanged.

to_backend

Move to a new Array backend

to_dask_dataframe

Convert dask Array to dask Dataframe

to_delayed

Convert into an array of dask.delayed.Delayed objects, one per chunk.

to_hdf5

Store array in HDF5 file

to_svg

Convert chunks from Dask Array into an SVG Image

to_tiledb

Save array to the TileDB storage manager

to_zarr

Save array to the zarr storage format

topk

The top k elements of an array.

trace

Return the sum along diagonals of the array.

transpose

Reverse or permute the axes of an array.

unfold

var

Returns the variance of the array elements, along given axis.

view

Get a view of the array as a new data type

view_metadata

Prints the metadata to stdout

view_original_metadata

Prints the original_metadata dictionary to stdout

visualize

Render the computation of this object's task graph using graphviz.

Attributes

dask

A

T

blocks

An array-like interface to the blocks of an array.

chunks

Chunks property.

chunksize

data_descriptor

data_type

dtype

h5_dataset

imag

itemsize

Length of one array element in bytes

labels

metadata

modality

name

nbytes

Number of bytes in array

ndim

npartitions

numblocks

original_metadata

partitions

Slice an array by partitions.

quantity

real

shape

size

Number of elements in array

source

structures

title

units

variance

vindex

Vectorized indexing with broadcasting.

all(axis=None, keepdims=False, split_every=None, out=None)[source]

Returns True if all elements evaluate to True.

Refer to dask.array.all() for full documentation.

See also

dask.array.all

equivalent function

any(axis=None, keepdims=False, split_every=None, out=None)[source]

Returns True if any of the elements evaluate to True.

Refer to dask.array.any() for full documentation.

See also

dask.array.any

equivalent function

argmax(axis=None, split_every=None, out=None)[source]

Return indices of the maximum values along the given axis.

Refer to dask.array.argmax() for full documentation.

See also

dask.array.argmax

equivalent function

argmin(axis=None, split_every=None, out=None)[source]

Return indices of the minimum values along the given axis.

Refer to dask.array.argmin() for full documentation.

See also

dask.array.argmin

equivalent function

argtopk(k, axis=-1, split_every=None)

The indices of the top k elements of an array.

Refer to dask.array.argtopk() for full documentation.

See also

dask.array.argtopk

equivalent function

astype(dtype, **kwargs)[source]

Copy of the array, cast to a specified type.

Parameters:
  • dtype (str or dtype) – Typecode or data-type to which the array is cast.

  • casting ({'no', 'equiv', 'safe', 'same_kind', 'unsafe'}, optional) –

    Controls what kind of data casting may occur. Defaults to ‘unsafe’ for backwards compatibility.

    • ’no’ means the data types should not be cast at all.

    • ’equiv’ means only byte-order changes are allowed.

    • ’safe’ means only casts which can preserve values are allowed.

    • ’same_kind’ means only safe casts or casts within a kind,

      like float64 to float32, are allowed.

    • ’unsafe’ means any data conversions may be done.

  • copy (bool, optional) –

    By default, astype always returns a newly allocated array. If this is set to False and the dtype requirement is satisfied, the input array is returned instead of a copy.

    Note

    Dask does not respect the contiguous memory layout of the array, and will ignore the order keyword argument. The default order is ‘C’ contiguous.

property blocks

An array-like interface to the blocks of an array.

This returns a Blockview object that provides an array-like interface to the blocks of a dask array. Numpy-style indexing of a Blockview object returns a selection of blocks as a new dask array.

You can index array.blocks like a numpy array of shape equal to the number of blocks in each dimension, (available as array.blocks.size). The dimensionality of the output array matches the dimension of this array, even if integer indices are passed. Slicing with np.newaxis or multiple lists is not supported.

Examples

>>> import dask.array as da
>>> x = da.arange(8, chunks=2)
>>> x.blocks.shape # aliases x.numblocks
(4,)
>>> x.blocks[0].compute()
array([0, 1])
>>> x.blocks[:3].compute()
array([0, 1, 2, 3, 4, 5])
>>> x.blocks[::2].compute()
array([0, 1, 4, 5])
>>> x.blocks[[-1, 0]].compute()
array([6, 7, 0, 1])
>>> x.blocks.ravel() 
[dask.array<blocks, shape=(2,), dtype=int64, chunksize=(2,), chunktype=numpy.ndarray>,
 dask.array<blocks, shape=(2,), dtype=int64, chunksize=(2,), chunktype=numpy.ndarray>,
 dask.array<blocks, shape=(2,), dtype=int64, chunksize=(2,), chunktype=numpy.ndarray>,
 dask.array<blocks, shape=(2,), dtype=int64, chunksize=(2,), chunktype=numpy.ndarray>]
Return type:

An instance of dask.array.Blockview

choose(choices)[source]

Use an index array to construct a new array from a set of choices.

Refer to dask.array.choose() for full documentation.

See also

dask.array.choose

equivalent function

property chunks

Chunks property.

clip(min=None, max=None)[source]

Return an array whose values are limited to [min, max]. One of max or min must be given.

Refer to dask.array.clip() for full documentation.

See also

dask.array.clip

equivalent function

compute(**kwargs)

Compute this dask collection

This turns a lazy Dask collection into its in-memory equivalent. For example a Dask array turns into a NumPy array and a Dask dataframe turns into a Pandas dataframe. The entire dataset must fit into memory before calling this operation.

Parameters:
  • scheduler (string, optional) – Which scheduler to use like “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.

  • optimize_graph (bool, optional) – If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.

  • kwargs – Extra keywords to forward to the scheduler function.

See also

dask.compute
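
A brief sketch (hypothetical dask-backed sidpy Dataset; computing materializes it in memory, typically as a numpy array, so the whole dataset must fit in memory):

>>> in_memory = dataset.compute()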

compute_chunk_sizes()[source]

Compute the chunk sizes for a Dask array. This is especially useful when the chunk sizes are unknown (e.g., when indexing one Dask array with another).

Notes

This function modifies the Dask array in-place.

Examples

>>> import dask.array as da
>>> import numpy as np
>>> x = da.from_array([-2, -1, 0, 1, 2], chunks=2)
>>> x.chunks
((2, 2, 1),)
>>> y = x[x <= 0]
>>> y.chunks
((nan, nan, nan),)
>>> y.compute_chunk_sizes()  # in-place computation
dask.array<getitem, shape=(3,), dtype=int64, chunksize=(2,), chunktype=numpy.ndarray>
>>> y.chunks
((2, 1, 0),)
conj()[source]

Complex-conjugate all elements.

Refer to dask.array.conj() for full documentation.

See also

dask.array.conj

equivalent function

copy()[source]

Returns a deep copy of this dataset.

Return type:

sidpy dataset

cumprod(axis, dtype=None, out=None, method='sequential')[source]

Return the cumulative product of the elements along the given axis.

Refer to dask.array.cumprod() for full documentation.

See also

dask.array.cumprod

equivalent function

cumsum(axis, dtype=None, out=None, method='sequential')[source]

Return the cumulative sum of the elements along the given axis.

Refer to dask.array.cumsum() for full documentation.

See also

dask.array.cumsum

equivalent function

del_dimension(ind=None)[source]

Deletes the dimension attached to axis ‘ind’.

dot(other)[source]

Dot product of self and other.

Refer to dask.array.tensordot() for full documentation.

See also

dask.array.dot

equivalent function

fft(dimension_type=None)[source]

Gets the FFT of a sidpy.Dataset of any size

If the dimension_type is not set explicitly, the data_type of the sidpy.Dataset determines the dimension_type over which the Fourier transform is performed.

The Fourier-transformed dataset is automatically shifted to the center of the dataset.

Parameters:

dimension_type (None, str, or sidpy.DimensionType - optional) – dimension_type over which the Fourier transform is performed; if None, an educated guess is made from the dimensions of the sidpy.Dataset

Returns:

fft_dset – 2 or 3 dimensional matrix arranged in the same way as input

Return type:

2D or 3D complex sidpy.Dataset (not tested for higher dimensions)

Example

>>> fft_dataset = sidpy_dataset.fft()
>>> fft_dataset.plot()

flatten()[source]

Return a flattened array.

Refer to dask.array.ravel() for full documentation.

See also

dask.array.ravel

equivalent function

flatten_complex()[source]

Returns a dataset whose real and imaginary components have been flattened. This is necessary for scenarios such as fitting complex functions. The dataset must be 1D or 2D to begin with.

Returns:

output_arr – sidpy.Dataset object
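
A brief call sketch (hypothetical complex-valued 1D dataset; the exact layout of the flattened real and imaginary components is determined by the method):

>>> complex_dset = sidpy.Dataset.from_array(np.exp(1j * np.linspace(0, np.pi, 64)))
>>> flat_dset = complex_dset.flatten_complex()  # real and imaginary parts in one real-valued dataset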

fold(dim_order=None, method=None)[source]

This method collapses the dimensions of the sidpy dataset

classmethod from_array(x, title='generic', chunks='auto', lock=False, datatype='UNKNOWN', units='generic', quantity='generic', modality='generic', source='generic', coordinates=None, variance=None, **kwargs)[source]

Initializes a sidpy dataset from an array-like object (e.g. a numpy array). All metadata will be set to generic values.

Parameters:
  • x (array-like object) – the values which will populate this dataset

  • chunks (optional integer or list of integers) – the shape of the chunks to be loaded

  • title (optional string) – the title of this dataset

  • lock (boolean) –

  • datatype (str or sidpy.DataType) – data type of the dataset, e.g. 'image', 'spectrum', …

  • units (str) – units of the dataset, e.g. counts, A

  • quantity (str) – quantity of the dataset, e.g. intensity

  • modality (str) – modality of the dataset, e.g. 'STM', 'TEM', 'DFT'

  • source (str) – source of the dataset, e.g. what kind of microscope or function

  • coordinates (numpy array, optional) – coordinates for point cloud

  • point_cloud (dict or None) – dict with coordinates and base_image for point_cloud data_type

  • variance (array-like object) – the variance values of the x array

Return type:

sidpy dataset
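
A short sketch of typical usage (argument values are illustrative):

>>> import numpy as np
>>> import sidpy
>>> spectrum = sidpy.Dataset.from_array(np.random.random(512),
...                                     title='test spectrum',
...                                     datatype='spectrum',
...                                     units='counts',
...                                     quantity='intensity')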

get_dimensions_by_type(dims_in, return_axis=False)[source]

Get dimensions by dimension_type name.

Parameters:

dims_in (dimension_type/str or list of dimension_types/str) –

Returns:

dims_out – the dimensions of the specified kind, in the numerical order of the dataset (not of the input)

Return type:

list of indices
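
For example (assuming the 2D image dataset sketched earlier, with two spatial dimensions; return_axis=True presumably returns the Dimension objects rather than the indices):

>>> dataset.get_dimensions_by_type('spatial')   # e.g. [0, 1]
>>> dataset.get_dimensions_by_type('spatial', return_axis=True)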

get_extent(dimensions)[source]

Get image extents as needed, e.g., by matplotlib's imshow function. This function works for equally or non-equally spaced axes and is suitable for sub-pixel accuracy of positions.

Parameters:

dimensions (list of dimensions) –

Return type:

list of floats
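
A brief sketch with matplotlib (hypothetical 2D image dataset; dimension indices 0 and 1 are assumed to be the image axes):

>>> import matplotlib.pyplot as plt
>>> extent = dataset.get_extent([0, 1])
>>> plt.imshow(np.array(dataset), extent=extent)
>>> plt.show()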

get_image_dims(return_axis=False)[source]

Get all spatial dimensions

get_spectral_dims(return_axis=False)[source]

Get all spectral dimensions
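
For a hypothetical 3D spectral image (two spatial dimensions and one spectral dimension), these helpers might be used as follows; return_axis=True presumably yields the Dimension objects rather than the indices:

>>> dataset.get_image_dims()       # e.g. [0, 1]
>>> dataset.get_spectral_dims()    # e.g. [2]
>>> dataset.get_spectral_dims(return_axis=True)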

property itemsize: int

Length of one array element in bytes

like_data(data, title=None, chunks='auto', lock=False, coordinates=None, variance=None, **kwargs)[source]

Returns a sidpy.Dataset of new values but with the metadata of this dataset. If a dimension of the new dataset differs from this dataset and the scale is linear, that scale will be applied to the new dataset (naming and units stay the same); otherwise the dimension will be generic. Also provides additional functionality to override numeric functions.

Parameters:
  • data (array-like) – values of the new sidpy dataset

  • title (str, optional) – title of the new sidpy dataset

  • chunks (list of int, optional) – size of chunks for the dask array

  • lock (bool, optional) – lock for the dask array

  • coordinates (array-like, optional) – coordinates for point cloud

  • variance (numpy array, optional) – variance of the dataset

Return type:

sidpy dataset
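
A brief sketch (hypothetical parent dataset; the derived dataset keeps the parent's dimensions, units, and metadata):

>>> values = np.log1p(np.array(dataset))
>>> log_dataset = dataset.like_data(values, title='log of dataset')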

map_blocks(*args, name=None, token=None, dtype=None, chunks=None, drop_axis=None, new_axis=None, enforce_ndim=False, meta=None, **kwargs)

Map a function across all blocks of a dask array.

Note that map_blocks will attempt to automatically determine the output array type by calling func on 0-d versions of the inputs. Please refer to the meta keyword argument below if you expect that the function will not succeed when operating on 0-d arrays.

Parameters:
  • func (callable) – Function to apply to every block in the array. If func accepts block_info= or block_id= as keyword arguments, these will be passed dictionaries containing information about input and output chunks/arrays during computation. See examples for details.

  • args (dask arrays or other objects) –

  • dtype (np.dtype, optional) – The dtype of the output array. It is recommended to provide this. If not provided, will be inferred by applying the function to a small set of fake data.

  • chunks (tuple, optional) – Chunk shape of resulting blocks if the function does not preserve shape. If not provided, the resulting array is assumed to have the same block structure as the first input array.

  • drop_axis (number or iterable, optional) – Dimensions lost by the function.

  • new_axis (number or iterable, optional) – New dimensions created by the function. Note that these are applied after drop_axis (if present).

  • enforce_ndim (bool, default False) – Whether to enforce at runtime that the dimensionality of the array produced by func actually matches that of the array returned by map_blocks. If True, this will raise an error when there is a mismatch.

  • token (string, optional) – The key prefix to use for the output array. If not provided, will be determined from the function name.

  • name (string, optional) – The key name to use for the output array. Note that this fully specifies the output key name, and must be unique. If not provided, will be determined by a hash of the arguments.

  • meta (array-like, optional) – The meta of the output array, when specified is expected to be an array of the same type and dtype of that returned when calling .compute() on the array returned by this function. When not provided, meta will be inferred by applying the function to a small set of fake data, usually a 0-d array. It’s important to ensure that func can successfully complete computation without raising exceptions when 0-d is passed to it, providing meta will be required otherwise. If the output type is known beforehand (e.g., np.ndarray, cupy.ndarray), an empty array of such type dtype can be passed, for example: meta=np.array((), dtype=np.int32).

  • **kwargs – Other keyword arguments to pass to function. Values must be constants (not dask.arrays)

See also

dask.array.map_overlap

Generalized operation with overlap between neighbors.

dask.array.blockwise

Generalized operation with control over block alignment.

Examples

>>> import dask.array as da
>>> x = da.arange(6, chunks=3)
>>> x.map_blocks(lambda x: x * 2).compute()
array([ 0,  2,  4,  6,  8, 10])

The da.map_blocks function can also accept multiple arrays.

>>> d = da.arange(5, chunks=2)
>>> e = da.arange(5, chunks=2)
>>> f = da.map_blocks(lambda a, b: a + b**2, d, e)
>>> f.compute()
array([ 0,  2,  6, 12, 20])

If the function changes shape of the blocks then you must provide chunks explicitly.

>>> y = x.map_blocks(lambda x: x[::2], chunks=((2, 2),))

You have a bit of freedom in specifying chunks. If all of the output chunk sizes are the same, you can provide just that chunk size as a single tuple.

>>> a = da.arange(18, chunks=(6,))
>>> b = a.map_blocks(lambda x: x[:3], chunks=(3,))

If the function changes the dimension of the blocks you must specify the created or destroyed dimensions.

>>> b = a.map_blocks(lambda x: x[None, :, None], chunks=(1, 6, 1),
...                  new_axis=[0, 2])

If chunks is specified but new_axis is not, then it is inferred to add the necessary number of axes on the left.

Note that map_blocks() will concatenate chunks along axes specified by the keyword parameter drop_axis prior to applying the function. This is illustrated in the figure below:

[Figure: map_blocks drop_axis illustration (images/map_blocks_drop_axis.png)]

Due to memory-size-constraints, it is often not advisable to use drop_axis on an axis that is chunked. In that case, it is better not to use map_blocks but rather dask.array.reduction(..., axis=dropped_axes, concatenate=False) which maintains a leaner memory footprint while it drops any axis.

Map_blocks aligns blocks by block positions without regard to shape. In the following example we have two arrays with the same number of blocks but with different shape and chunk sizes.

>>> x = da.arange(1000, chunks=(100,))
>>> y = da.arange(100, chunks=(10,))

The relevant attribute to match is numblocks.

>>> x.numblocks
(10,)
>>> y.numblocks
(10,)

If these match (up to broadcasting rules) then we can map arbitrary functions across blocks

>>> def func(a, b):
...     return np.array([a.max(), b.max()])
>>> da.map_blocks(func, x, y, chunks=(2,), dtype='i8')
dask.array<func, shape=(20,), dtype=int64, chunksize=(2,), chunktype=numpy.ndarray>
>>> _.compute()
array([ 99,   9, 199,  19, 299,  29, 399,  39, 499,  49, 599,  59, 699,
        69, 799,  79, 899,  89, 999,  99])

Your block function can get information about where it is in the array by accepting a special block_info or block_id keyword argument. During computation, they will contain information about each of the input and output chunks (and dask arrays) relevant to each call of func.

>>> def func(block_info=None):
...     pass

This will receive the following information:

>>> block_info  
{0: {'shape': (1000,),
     'num-chunks': (10,),
     'chunk-location': (4,),
     'array-location': [(400, 500)]},
 None: {'shape': (1000,),
        'num-chunks': (10,),
        'chunk-location': (4,),
        'array-location': [(400, 500)],
        'chunk-shape': (100,),
        'dtype': dtype('float64')}}

The keys to the block_info dictionary indicate which is the input and output Dask array:

  • Input Dask array(s): block_info[0] refers to the first input Dask array. The dictionary key is 0 because that is the argument index corresponding to the first input Dask array. In cases where multiple Dask arrays have been passed as input to the function, you can access them with the number corresponding to the input argument, eg: block_info[1], block_info[2], etc. (Note that if you pass multiple Dask arrays as input to map_blocks, the arrays must match each other by having matching numbers of chunks, along corresponding dimensions up to broadcasting rules.)

  • Output Dask array: block_info[None] refers to the output Dask array, and contains information about the output chunks. The output chunk shape and dtype may be different than the input chunks.

For each dask array, block_info describes:

  • shape: the shape of the full Dask array,

  • num-chunks: the number of chunks of the full array in each dimension,

  • chunk-location: the chunk location (for example the fourth chunk over in the first dimension), and

  • array-location: the array location within the full Dask array (for example the slice corresponding to 40:50).

In addition to these, there are two extra parameters described by block_info for the output array (in block_info[None]):

  • chunk-shape: the output chunk shape, and

  • dtype: the output dtype.

These features can be combined to synthesize an array from scratch, for example:

>>> def func(block_info=None):
...     loc = block_info[None]['array-location'][0]
...     return np.arange(loc[0], loc[1])
>>> da.map_blocks(func, chunks=((4, 4),), dtype=np.float64)
dask.array<func, shape=(8,), dtype=float64, chunksize=(4,), chunktype=numpy.ndarray>
>>> _.compute()
array([0, 1, 2, 3, 4, 5, 6, 7])

block_id is similar to block_info but contains only the chunk_location:

>>> def func(block_id=None):
...     pass

This will receive the following information:

>>> block_id  
(4, 3)

You may specify the key name prefix of the resulting task in the graph with the optional token keyword argument.

>>> x.map_blocks(lambda x: x + 1, name='increment')
dask.array<increment, shape=(1000,), dtype=int64, chunksize=(100,), chunktype=numpy.ndarray>

For functions that may not handle 0-d arrays, it’s also possible to specify meta with an empty array matching the type of the expected result. In the example below, func will result in an IndexError when computing meta:

>>> rng = da.random.default_rng()
>>> da.map_blocks(lambda x: x[2], rng.random(5), meta=np.array(()))
dask.array<lambda, shape=(5,), dtype=float64, chunksize=(5,), chunktype=numpy.ndarray>

Similarly, it’s possible to specify a non-NumPy array to meta, and provide a dtype:

>>> import cupy  
>>> rng = da.random.default_rng(cupy.random.default_rng())  
>>> dt = np.float32
>>> da.map_blocks(lambda x: x[2], rng.random(5, dtype=dt), meta=cupy.array((), dtype=dt))  
dask.array<lambda, shape=(5,), dtype=float32, chunksize=(5,), chunktype=cupy.ndarray>
map_overlap(func, depth, boundary=None, trim=True, **kwargs)

Map a function over blocks of the array with some overlap

Refer to dask.array.map_overlap() for full documentation.

See also

dask.array.map_overlap

equivalent function

max(axis=None, keepdims=False, split_every=None, out=None)[source]

Return the maximum along a given axis.

Refer to dask.array.max() for full documentation.

See also

dask.array.max

equivalent function

mean(axis=None, dtype=None, keepdims=False, split_every=None, out=None)[source]

Returns the average of the array elements along given axis.

Refer to dask.array.mean() for full documentation.

See also

dask.array.mean

equivalent function

min(axis=None, keepdims=False, split_every=None, out=None)[source]

Return the minimum along a given axis.

Refer to dask.array.min() for full documentation.

See also

dask.array.min

equivalent function

moment(order, axis=None, dtype=None, keepdims=False, ddof=0, split_every=None, out=None)[source]

Calculate the nth centralized moment.

Refer to dask.array.moment() for the full documentation.

See also

dask.array.moment

equivalent function

property nbytes: int | float

Number of bytes in array

nonzero()

Return the indices of the elements that are non-zero.

Refer to dask.array.nonzero() for full documentation.

See also

dask.array.nonzero

equivalent function

property partitions

Slice an array by partitions. Alias of dask array .blocks attribute.

This alias allows you to write agnostic code that works with both dask arrays and dask dataframes.

This returns a Blockview object that provides an array-like interface to the blocks of a dask array. Numpy-style indexing of a Blockview object returns a selection of blocks as a new dask array.

You can index array.blocks like a numpy array of shape equal to the number of blocks in each dimension, (available as array.blocks.size). The dimensionality of the output array matches the dimension of this array, even if integer indices are passed. Slicing with np.newaxis or multiple lists is not supported.

Examples

>>> import dask.array as da
>>> x = da.arange(8, chunks=2)
>>> x.partitions.shape # aliases x.numblocks
(4,)
>>> x.partitions[0].compute()
array([0, 1])
>>> x.partitions[:3].compute()
array([0, 1, 2, 3, 4, 5])
>>> x.partitions[::2].compute()
array([0, 1, 4, 5])
>>> x.partitions[[-1, 0]].compute()
array([6, 7, 0, 1])
>>> x.partitions.ravel() 
[dask.array<blocks, shape=(2,), dtype=int64, chunksize=(2,), chunktype=numpy.ndarray>,
 dask.array<blocks, shape=(2,), dtype=int64, chunksize=(2,), chunktype=numpy.ndarray>,
 dask.array<blocks, shape=(2,), dtype=int64, chunksize=(2,), chunktype=numpy.ndarray>,
 dask.array<blocks, shape=(2,), dtype=int64, chunksize=(2,), chunktype=numpy.ndarray>]
Return type:

An instance of dask.array.Blockview

persist(**kwargs)[source]

Persist this dask collection into memory

This turns a lazy Dask collection into a Dask collection with the same metadata, but now with the results fully computed or actively computing in the background.

The action of this function differs significantly depending on the active task scheduler. If the task scheduler supports asynchronous computing, such as is the case of the dask.distributed scheduler, then persist will return immediately and the return value’s task graph will contain Dask Future objects. However, if the task scheduler only supports blocking computation, then the call to persist will block and the return value’s task graph will contain concrete Python results.

This function is particularly useful when using distributed systems, because the results will be kept in distributed memory, rather than returned to the local process as with compute.

Parameters:
  • scheduler (string, optional) – Which scheduler to use like “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.

  • optimize_graph (bool, optional) – If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.

  • **kwargs – Extra keywords to forward to the scheduler function.

Return type:

New dask collections backed by in-memory data

See also

dask.persist

plot(verbose=False, figure=None, **kwargs)[source]
Plots the dataset according to the
  • shape of the sidpy Dataset,

  • data_type of the sidpy Dataset and

  • dimension_type of dimensions of sidpy Dataset

    the dimension_type ‘spatial’ or ‘spectral’ determines how a dataset is plotted.

Recognized data_types are:

  • 1D: any keyword, but ‘spectrum’ or ‘line_plot’ are encouraged

  • 2D: ‘image’ or one of [‘spectrum_family’, ‘line_family’, ‘line_plot_family’, ‘spectra’]

  • 3D: ‘image’, ‘image_map’, ‘image_stack’, ‘spectrum_image’

  • 4D: not implemented yet, but will be similar to spectrum_image

Parameters:
  • verbose (boolean) –

  • kwargs (dictionary of additional plotting parameters) – additional keywords (besides the matplotlib ones) for plotting, e.g. scale_bar: for images, replaces the axes with a scale bar inside the image

  • figure (matplotlib figure object) – figure to which this dataset will be plotted

Returns:

self.view.fig

Return type:

matplotlib figure reference
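
A brief sketch (hypothetical image dataset; scale_bar is the plotting keyword mentioned above, and the returned matplotlib figure can be saved or customized):

>>> fig = dataset.plot(scale_bar=True)
>>> fig.savefig('dataset_view.png')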

prod(axis=None, dtype=None, keepdims=False, split_every=None, out=None)[source]

Return the product of the array elements over the given axis

Refer to dask.array.prod() for full documentation.

See also

dask.array.prod

equivalent function

ravel()[source]

Return a flattened array.

Refer to dask.array.ravel() for full documentation.

See also

dask.array.ravel

equivalent function

rechunk(chunks='auto', threshold=None, block_size_limit=None, balance=False)[source]

Convert blocks in dask array x for new chunks.

Refer to dask.array.rechunk() for full documentation.

See also

dask.array.rechunk

equivalent function

rename_dimension(ind, name)[source]

Renames Dimension at the specified index

Parameters:
  • ind (int) – Index of the dimension

  • name (str) – New name for Dimension
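
For example (hypothetical dataset from the sketch above):

>>> dataset.rename_dimension(0, 'distance_x')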

repeat(repeats, axis=None)[source]

Repeat elements of an array.

Refer to dask.array.repeat() for full documentation.

See also

dask.array.repeat

equivalent function

reshape(shape, merge_chunks=True, limit=None)[source]

Reshape array to new shape

Refer to dask.array.reshape() for full documentation.

See also

dask.array.reshape

equivalent function

round(decimals=0)[source]

Return array with each element rounded to the given number of decimals.

Refer to dask.array.round() for full documentation.

See also

dask.array.round

equivalent function

set_dimension(ind, dimension)[source]

Sets the Dimension for a given axis of the dataset, including a new name, and updates the axes dictionary.

Parameters:
  • ind (int) – Index of dimension

  • dimension (sidpy.Dimension) – Dimension object describing this dimension of the Dataset
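
A brief sketch (hypothetical spectrum dataset; the Dimension keyword arguments shown are illustrative):

>>> energy_scale = sidpy.Dimension(np.linspace(0.0, 10.0, 512), name='energy',
...                                units='eV', quantity='energy',
...                                dimension_type='spectral')
>>> dataset.set_dimension(0, energy_scale)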

set_thumbnail(figure=None, thumbnail_size=128)[source]

Creates a thumbnail, which is stored in the thumbnail attribute of the sidpy Dataset. Thumbnail data is saved to the Thumbnail group of the associated h5_file if it exists.

Parameters:

thumbnail_size (int) – size of icon in pixels (length of square)

Returns:

thumbnail

Return type:

numpy.ndarray

property size: int | float

Number of elements in array

squeeze(axis=None)[source]

Remove axes of length one from array.

Refer to dask.array.squeeze() for full documentation.

See also

dask.array.squeeze

equivalent function

std(axis=None, dtype=None, keepdims=False, ddof=0, split_every=None, out=None)[source]

Returns the standard deviation of the array elements along given axis.

Refer to dask.array.std() for full documentation.

See also

dask.array.std

equivalent function

store(targets: ArrayLike | Delayed | Collection[ArrayLike | Delayed], lock: bool | Lock = True, regions: tuple[slice, ...] | Collection[tuple[slice, ...]] | None = None, compute: bool = True, return_stored: bool = False, **kwargs)

Store dask arrays in array-like objects, overwrite data in target

This stores dask arrays into object that supports numpy-style setitem indexing. It stores values chunk by chunk so that it does not have to fill up memory. For best performance you can align the block size of the storage target with the block size of your array.

If your data fits in memory then you may prefer calling np.array(myarray) instead.

Parameters:
  • sources (Array or collection of Arrays) –

  • targets (array-like or Delayed or collection of array-likes and/or Delayeds) – These should support setitem syntax target[10:20] = .... If sources is a single item, targets must be a single item; if sources is a collection of arrays, targets must be a matching collection.

  • lock (boolean or threading.Lock, optional) – Whether or not to lock the data stores while storing. Pass True (lock each file individually), False (don’t lock) or a particular threading.Lock object to be shared among all writes.

  • regions (tuple of slices or collection of tuples of slices, optional) – Each region tuple in regions should be such that target[region].shape = source.shape for the corresponding source and target in sources and targets, respectively. If this is a tuple, the contents will be assumed to be slices, so do not provide a tuple of tuples.

  • compute (boolean, optional) – If true compute immediately; return dask.delayed.Delayed otherwise.

  • return_stored (boolean, optional) – Optionally return the stored result (default False).

  • kwargs – Parameters passed to compute/persist (only used if compute=True)

Returns:

  • If return_stored=True – tuple of Arrays

  • If return_stored=False and compute=True – None

  • If return_stored=False and compute=False – Delayed

Examples

>>> import h5py  
>>> f = h5py.File('myfile.hdf5', mode='a')  
>>> dset = f.create_dataset('/data', shape=x.shape,
...                                  chunks=x.chunks,
...                                  dtype='f8')  
>>> store(x, dset)  

Alternatively store many arrays at the same time

>>> store([x, y, z], [dset1, dset2, dset3])  
sum(axis=None, dtype=None, keepdims=False, split_every=None, out=None)[source]

Return the sum of the array elements over the given axis.

Refer to dask.array.sum() for full documentation.

See also

dask.array.sum

equivalent function

swapaxes(axis1, axis2)[source]

Return a view of the array with axis1 and axis2 interchanged.

Refer to dask.array.swapaxes() for full documentation.

See also

dask.array.swapaxes

equivalent function

to_backend(backend: str | None = None, **kwargs)

Move to a new Array backend

Parameters:

backend (str, Optional) – The name of the new backend to move to. The default is the current “array.backend” configuration.

Return type:

Array

to_dask_dataframe(columns=None, index=None, meta=None)

Convert dask Array to dask Dataframe

Parameters:
  • columns (list or string) – list of column names if DataFrame, single string if Series

  • index (dask.dataframe.Index, optional) –

    An optional dask Index to use for the output Series or DataFrame.

    The default output index depends on whether the array has any unknown chunks. If there are any unknown chunks, the output has None for all the divisions (one per chunk). If all the chunks are known, a default index with known divisions is created.

    Specifying index can be useful if you’re conforming a Dask Array to an existing dask Series or DataFrame, and you would like the indices to match.

  • meta (object, optional) – An optional meta parameter can be passed for dask to specify the concrete dataframe type to use for partitions of the Dask dataframe. By default, pandas DataFrame is used.

to_delayed(optimize_graph=True)

Convert into an array of dask.delayed.Delayed objects, one per chunk.

Parameters:

optimize_graph (bool, optional) – If True [default], the graph is optimized before converting into dask.delayed.Delayed objects.

to_hdf5(filename, datapath, **kwargs)

Store array in HDF5 file

>>> x.to_hdf5('myfile.hdf5', '/x')  

Optionally provide arguments as though to h5py.File.create_dataset

>>> x.to_hdf5('myfile.hdf5', '/x', compression='lzf', shuffle=True)  

See also

dask.array.store, h5py.File.create_dataset

to_svg(size=500)

Convert chunks from Dask Array into an SVG Image

Parameters:
  • chunks (tuple) –

  • size (int) – Rough size of the image

Examples

>>> x.to_svg(size=500)  
Returns:

text

Return type:

An svg string depicting the array as a grid of chunks

to_tiledb(uri, *args, **kwargs)

Save array to the TileDB storage manager

See https://docs.tiledb.io for details about the format and engine.

See function dask.array.to_tiledb() for argument documentation.

See also

dask.array.to_tiledb

equivalent function

to_zarr(*args, **kwargs)

Save array to the zarr storage format

See https://zarr.readthedocs.io for details about the format.

Refer to dask.array.to_zarr() for full documentation.

See also

dask.array.to_zarr

equivalent function

topk(k, axis=-1, split_every=None)

The top k elements of an array.

Refer to dask.array.topk() for full documentation.

See also

dask.array.topk

equivalent function

trace(offset=0, axis1=0, axis2=1, dtype=None)[source]

Return the sum along diagonals of the array.

Refer to dask.array.trace() for full documentation.

See also

dask.array.trace

equivalent function

transpose(*axes)[source]

Reverse or permute the axes of an array. Return the modified array.

Refer to dask.array.transpose() for full documentation.

See also

dask.array.transpose

equivalent function

var(axis=None, dtype=None, keepdims=False, ddof=0, split_every=None, out=None)[source]

Returns the variance of the array elements, along given axis.

Refer to dask.array.var() for full documentation.

See also

dask.array.var

equivalent function

view(dtype=None, order='C')

Get a view of the array as a new data type

Parameters:
  • dtype – The dtype by which to view the array. The default, None, results in the view having the same data-type as the original array.

  • order (string) – ‘C’ or ‘F’ (Fortran) ordering

This reinterprets the bytes of the array under a new dtype. If that dtype does not have the same size as the original array then the shape will change.

Beware that both numpy and dask.array can behave oddly when taking shape-changing views of arrays under Fortran ordering. Under some versions of NumPy this function will fail when taking shape-changing views of Fortran ordered arrays if the first dimension has chunks of size one.

view_metadata()[source]

Prints the metadata to stdout

Return type:

None

view_original_metadata()[source]

Prints the original_metadata dictionary to stdout

Return type:

None

property vindex

Vectorized indexing with broadcasting.

This is equivalent to numpy’s advanced indexing, using arrays that are broadcast against each other. This allows for pointwise indexing:

>>> import dask.array as da
>>> x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
>>> x = da.from_array(x, chunks=2)
>>> x.vindex[[0, 1, 2], [0, 1, 2]].compute()
array([1, 5, 9])

Mixed basic/advanced indexing with slices/arrays is also supported. The order of dimensions in the result follows those proposed for ndarray.vindex: the subspace spanned by arrays is followed by all slices.

Note: vindex provides more general functionality than standard indexing, but it also has fewer optimizations and can be significantly slower.

visualize(filename='mydask', format=None, optimize_graph=False, **kwargs)

Render the computation of this object’s task graph using graphviz.

Requires graphviz to be installed.

Parameters:
  • filename (str or None, optional) – The name of the file to write to disk. If the provided filename doesn’t include an extension, ‘.png’ will be used by default. If filename is None, no file will be written, and we communicate with dot using only pipes.

  • format ({'png', 'pdf', 'dot', 'svg', 'jpeg', 'jpg'}, optional) – Format in which to write output file. Default is ‘png’.

  • optimize_graph (bool, optional) – If True, the graph is optimized before rendering. Otherwise, the graph is displayed as is. Default is False.

  • color ({None, 'order'}, optional) – Options to color nodes. Provide cmap= keyword for additional colormap

  • **kwargs – Additional keyword arguments to forward to to_graphviz.

Examples

>>> x.visualize(filename='dask.pdf')  
>>> x.visualize(filename='dask.pdf', color='order')  
Returns:

result – See dask.dot.dot_graph for more information.

Return type:

IPython.display.Image, IPython.display.SVG, or None

See also

dask.visualize, dask.dot.dot_graph

Notes

For more information on optimization see here:

https://docs.dask.org/en/latest/optimize.html