Creating and Manipulating Datasets

Gerd Duscher and Suhas Somnath

08/25/2020

This document is a simple example of how to create and manipulate Dataset objects.

UNDER CONSTRUCTION

[1]:
%matplotlib widget
import matplotlib.pyplot as plt
import numpy as np

import sys
sys.path.insert(0, '../../')

import sidpy
print('sidpy version: ', sidpy.__version__)
sidpy version:  0.12.3

Creating a sidpy.Dataset object

We can create a simple sidpy Dataset from any array-like object. Here we just use a numpy array filled with random numbers.

[2]:
dataset = sidpy.Dataset.from_array(np.random.random([4, 5, 10]), name='random')

print(dataset)
dataset
sidpy.Dataset of type UNKNOWN with:
 dask.array<array, shape=(4, 5, 10), dtype=float64, chunksize=(4, 5, 10), chunktype=numpy.ndarray>
 data contains: generic (generic)
 and Dimensions:
a:  generic (generic) of size (4,)
b:  generic (generic) of size (5,)
c:  generic (generic) of size (10,)
[2]:
            Array         Chunk
Bytes       1.56 kiB      1.56 kiB
Shape       (4, 5, 10)    (4, 5, 10)
Dask graph  1 chunks in 1 graph layer
Data type   float64 numpy.ndarray
[3]:
dataset.provenance
[3]:
{'sidpy': {'from_array_': '_0.12.3_2024-07-12-16:56:52.772361'}}
[4]:
new_dataset = dataset.like_data(np.random.random([4, 5, 10]), name='random2')
print(new_dataset)
new_dataset.provenance
sidpy.Dataset of type UNKNOWN with:
 dask.array<array, shape=(4, 5, 10), dtype=float64, chunksize=(4, 5, 10), chunktype=numpy.ndarray>
 data contains: generic (generic)
 and Dimensions:
a:  generic (generic) of size (4,)
b:  generic (generic) of size (5,)
c:  generic (generic) of size (10,)
[4]:
{'sidpy': {'like_data': '_0.12.3_2024-07-12-16:57:11.387639',
  'parent_data': {'title': 'generic',
   'provenance': {'sidpy': {'from_array_': '_0.12.3_2024-07-12-16:56:52.772361'}}}}}
[6]:
new_dataset2 = dataset.like_data(np.random.random([4, 5, 10]), name='random2')
print(new_dataset2)
new_dataset2.provenance
sidpy.Dataset of type UNKNOWN with:
 dask.array<array, shape=(4, 5, 10), dtype=float64, chunksize=(4, 5, 10), chunktype=numpy.ndarray>
 data contains: generic (generic)
 and Dimensions:
a:  generic (generic) of size (4,)
b:  generic (generic) of size (5,)
c:  generic (generic) of size (10,)
[6]:
{'sidpy': {'like_data': '_0.12.3_2024-07-12-16:59:30.149263',
  'parent_data': {'title': 'generic',
   'provenance': {'sidpy': {'from_array_': '_0.12.3_2024-07-12-16:56:52.772361'}}}}}
[5]:
new_dataset.add_provenance('Nothin', 'done', 0.0)
new_dataset.provenance
[5]:
{'sidpy': {'like_data': '_0.12.3_2024-07-12-16:57:11.387639',
  'parent_data': {'title': 'generic',
   'provenance': {'sidpy': {'from_array_': '_0.12.3_2024-07-12-16:56:52.772361'}}}},
 'Nothin': {'done': '_0.0_2024-07-12-16:57:33.804301'}}
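
The provenance printed above is a nested dictionary: each top-level key names a package, each entry below it names an operation, and parent_data carries the provenance of the dataset this one was derived from. As a minimal sketch that assumes only this layout, the hypothetical helper provenance_chain (not part of sidpy) flattens the nesting into a list of steps. The dictionary literal is copied from the output above, so the example runs without sidpy.

```python
# Provenance copied from the output above; a plain dict, no sidpy required.
provenance = {
    'sidpy': {
        'like_data': '_0.12.3_2024-07-12-16:57:11.387639',
        'parent_data': {
            'title': 'generic',
            'provenance': {
                'sidpy': {'from_array_': '_0.12.3_2024-07-12-16:56:52.772361'}
            },
        },
    },
    'Nothin': {'done': '_0.0_2024-07-12-16:57:33.804301'},
}


def provenance_chain(provenance, depth=0):
    """Hypothetical helper: return (depth, package, operation) tuples.

    Depth 0 is the dataset itself, depth 1 its parent, and so on.
    """
    steps = []
    parents = []
    for package, operations in provenance.items():
        for operation, value in operations.items():
            if operation == 'parent_data':
                # Remember the parent's provenance and descend afterwards.
                parents.append(value.get('provenance', {}))
            else:
                steps.append((depth, package, operation))
    for parent in parents:
        steps.extend(provenance_chain(parent, depth + 1))
    return steps


steps = provenance_chain(provenance)
for depth, package, operation in steps:
    print('  ' * depth + f'{package}: {operation}')
```

Listing the steps of the current dataset before those of its parent keeps the most recent operations at the top.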

Note that a sidpy.Dataset is a dask array. We will be improving the information that is displayed when printing sidpy.Dataset objects. The remaining cells refer to the dataset as data_set.

Accessing data within a Dataset: Indexing of the dataset works like in numpy. Note that we first index and then convert the result to a numpy array so that the values are printed.

[3]:
print(np.array(data_set[:,0,2]))
[0.41955319 0.59615527 0.8109613  0.34605858]
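
The index expression data_set[:, 0, 2] follows numpy rules, so a plain numpy array of the same shape can illustrate it. This is only a stand-in sketch: the array data and its seeded random values are assumptions made for the example, not the contents of the Dataset above.

```python
import numpy as np

# Plain NumPy stand-in with the same shape as the Dataset above.
rng = np.random.default_rng(seed=0)
data = rng.random((4, 5, 10))

# ':' keeps the whole first axis; 0 and 2 pick single entries on the others.
line = data[:, 0, 2]
print(line.shape)

# Each element of the result matches one entry along the first axis.
same = all(line[i] == data[i, 0, 2] for i in range(data.shape[0]))
print(same)
```

The result has one value per entry of the first dimension, which is why the printed output above contains four numbers.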

Slicing and dicing:

[14]:
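# Note: blobs and ase are not defined in this document and must be provided before running this cell.
data_dictionary = {"main_dataset": data_set,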
                   'new_dataset': data_set,
                   'metadata': {'atoms': blobs},
                   'structure': {'SrTiO3': ase.build.SrTiO3()}}

data_dictionary['new_dataset'].metadata = {"origin_dataset": 'main_dataset'}

Metadata

sidpy automatically assigns generic top-level metadata regarding the Dataset. Users are encouraged to capture the context regarding the dataset. The sidpy dataset includes the following required attributes:

  • quantity: string: Physical quantity that is contained in this dataset

  • units: string: Units for this physical quantity

  • data_type: string : What kind of data this is. Example - image, image stack, video, hyperspectral image, etc.

  • modality: string : Experimental / simulation modality - scientific meaning of data. Example - photograph, TEM micrograph, SPM Force-Distance spectroscopy.

  • source: string : Source for dataset like the kind of instrument. One could go very deep here into either the algorithmic details if this is a result from analysis or the exact configurations for the instrument that generated this dataset.

These attributes are initially set to generic, but one would want to set them for the specific dataset. The attributes data_type, quantity and units will be important for plotting the data.

Here’s how one could do that, but with an unsupported value:

[4]:
data_set.data_type = 'spectrum_image'  # not supported
---------------------------------------------------------------------------
Warning                                   Traceback (most recent call last)
<ipython-input-4-6099823a7a09> in <module>
----> 1 data_set.data_type = 'spectrum_image'  # not supported

~/Dropbox (ORNL)/Python_scripts/sidpy/sidpy/sid/dataset.py in data_type(self, value)
    598             else:
    599                 self._data_type = DataType.UNKNOWN
--> 600                 raise Warning('Supported data_types for plotting are only: ', DataType._member_names_)
    601
    602         elif isinstance(value, DataType):

Warning: ('Supported data_types for plotting are only: ', ['UNKNOWN', 'SPECTRUM', 'LINE_PLOT', 'LINE_PLOT_FAMILY', 'IMAGE', 'IMAGE_MAP', 'IMAGE_STACK', 'SPECTRAL_IMAGE', 'IMAGE_4D'])

Here’s how one could do that successfully:

[5]:
data_set.data_type = 'spectral_image'  # supported

data_set.units = 'nA'
data_set.quantity = 'Current'
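
The two attempts above differ only in whether the string matches one of the member names listed in the warning. The sketch below mimics that lookup with a stand-in enum. The member names are copied from the warning message; the numeric values, the helper parse_data_type, the case-insensitive matching and the use of ValueError are all assumptions for illustration and not sidpy's actual implementation (which, per the traceback, raises a Warning).

```python
from enum import Enum


class DataType(Enum):
    """Stand-in enum; names copied from the warning, values are arbitrary."""
    UNKNOWN = 0
    SPECTRUM = 1
    LINE_PLOT = 2
    LINE_PLOT_FAMILY = 3
    IMAGE = 4
    IMAGE_MAP = 5
    IMAGE_STACK = 6
    SPECTRAL_IMAGE = 7
    IMAGE_4D = 8


def parse_data_type(value):
    """Hypothetical helper: map a string to a DataType, ignoring case."""
    name = value.upper()
    if name not in DataType.__members__:
        raise ValueError(f'{value!r} is not one of {list(DataType.__members__)}')
    return DataType[name]


accepted = parse_data_type('spectral_image')  # matches SPECTRAL_IMAGE
print(accepted)

try:
    parse_data_type('spectrum_image')  # no member called SPECTRUM_IMAGE
    rejected = False
except ValueError as error:
    rejected = True
    print(error)
```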

Scientific metadata

These Dataset objects can also capture rich scientific metadata such as acquisition parameters.

We would want to add those parameters as attributes. These attributes could be lists, numpy arrays or simple dictionaries.

It is recommended to add any parameters to the (nested) metadata dictionary. These metadata can then be viewed with dataset.view_metadata and dataset.view_original_metadata. It is encouraged to add any parameters of data analysis to the datasets, to keep track of input parameters.

There is a size limit of 64kB for the storage of dictionaries in h5py. Therefore, large data such as reference data should be added directly as attributes. All attributes that you add to a dataset will be stored within the pyNSID file.
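
To get a feeling for when a dictionary approaches that limit, one can estimate its size before adding it. The sketch below is a rough check only: approx_size is a hypothetical helper, the JSON text length is merely a proxy (how the dictionary is actually serialized on writing is not described here), the limit is taken as 64 * 1024 bytes, and the reference_spectrum entry is made up.

```python
import json

LIMIT_BYTES = 64 * 1024  # the 64 kB limit mentioned above (assumed 64 * 1024)


def approx_size(metadata):
    """Hypothetical helper: size in bytes of the dict serialized as JSON text."""
    return len(json.dumps(metadata).encode('utf-8'))


small = {'value': 6.8,
         'instrument': {'microscope': 'Nion', 'acceleration_voltage': 60000}}
large = {'reference_spectrum': [0.0] * 20000}  # made-up reference data

for name, metadata in (('small', small), ('large', large)):
    size = approx_size(metadata)
    verdict = 'fits in metadata' if size <= LIMIT_BYTES else 'store as attribute instead'
    print(f'{name}: {size} bytes, {verdict}')
```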

Please note that the dictionary original_metadata should not be changed, so that the information provided by the acquisition device stays pristine; relevant information should instead be copied over to the metadata attribute/dictionary.
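
Copying rather than referencing matters for nested dictionaries, because a shallow copy would still share the inner dictionaries with original_metadata. The following minimal sketch uses plain dictionaries as stand-ins for the two attributes; all keys and values are made up for illustration.

```python
import copy

# Made-up acquisition record standing in for original_metadata.
original_metadata = {
    'instrument': {'microscope': 'Nion', 'acceleration_voltage': 60000},
    'vendor_blob': {'firmware': 'unknown'},
}

# Copy only what is relevant; deepcopy avoids sharing the nested dict.
metadata = {'instrument': copy.deepcopy(original_metadata['instrument'])}

# Later edits touch the copy only, so the original record stays pristine.
metadata['instrument']['acceleration_voltage_kV'] = 60.0

print(original_metadata['instrument'])
print(metadata['instrument'])
```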

Here I made up some metadata as an illustration:

[6]:
data_set.calibration = np.arange(5)
data_set.metadata = {'nothing': ' ', 'value': 6.8, 'instrument': {'microscope': 'Nion', 'acceleration_voltage':60000}}
data_set.metadata['acquired'] = 'nowhere'

print(data_set.calibration)
sidpy.dict_utils.print_nested_dict(data_set.metadata)
[0 1 2 3 4]
nothing :
value : 6.8
instrument :
        microscope : Nion
        acceleration_voltage : 60000
acquired : nowhere

Another set of metadata in these Datasets describes the dimensions:

Dimensions

The Dataset is automatically populated with generic information about each dimension of the Dataset. It is a good idea to capture context regarding each of these dimensions using sidpy.Dimension. As a minimum, we need a name and values (of the same length as the corresponding axis of the data). Beyond that, one can provide as much or as little information about each dimension as desired.

[7]:
data_set.set_dimension(0, sidpy.Dimension(np.arange(data_set.shape[0]),
                                          name='x', units='um', quantity='Length',
                                          dimension_type='spatial'))
data_set.set_dimension(1, sidpy.Dimension(np.linspace(-2, 2, num=data_set.shape[1], endpoint=True),
                                          'y', units='um', quantity='Length',
                                          dimension_type='spatial'))
data_set.set_dimension(2, sidpy.Dimension(np.sin(np.linspace(0, 2 * np.pi, num=data_set.shape[2])),
                                          'bias' ))
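
The requirement that the values of a dimension have the same length as the corresponding axis of the data can be checked with numpy alone. The sketch below rebuilds the three value arrays used above from the shape (4, 5, 10); the dictionary axes is a hypothetical container for the example and not a sidpy object.

```python
import numpy as np

shape = (4, 5, 10)  # shape of the data used in this document

# Same value arrays as in the cell above, built from the shape alone.
axes = {
    'x': np.arange(shape[0]),
    'y': np.linspace(-2, 2, num=shape[1], endpoint=True),
    'bias': np.sin(np.linspace(0, 2 * np.pi, num=shape[2])),
}

for axis, (name, values) in enumerate(axes.items()):
    # Each dimension needs exactly one value per entry along its axis.
    if len(values) != shape[axis]:
        raise ValueError(f'{name} has {len(values)} values, axis {axis} has {shape[axis]}')
    print(name, len(values), values.min(), values.max())
```

Deriving num from the shape, as the original cell does, keeps the values in step with the data if the array size changes.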

One could also manually add information regarding specific components of dimensions associated with Datasets via:

[8]:
data_set.bias.dimension_type = 'spectral'
data_set.bias.units = 'V'
data_set.bias.quantity = 'Bias'

Let’s take a look at what the dataset looks like with the additional information regarding the dimensions.

We can access a dimension by its name or by the dimension number.

Also the print function now provides a little more information about our dataset.

[9]:
print(data_set.bias)
print(data_set.dim_1)
print(data_set)
data_set
bias:  Bias (V) of size (10,)
y:  Length (um) of size (5,)
sidpy.Dataset of type SPECTRAL_IMAGE with:
 dask.array<random, shape=(4, 5, 10), dtype=float64, chunksize=(4, 5, 10), chunktype=numpy.ndarray>
 data contains: Current (nA)
 and Dimensions:
x:  Length (um) of size (4,)
y:  Length (um) of size (5,)
bias:  Bias (V) of size (10,)
 with metadata: ['nothing', 'value', 'instrument', 'acquired']
[9]:
        Array         Chunk
Bytes   1.60 kB       1.60 kB
Shape   (4, 5, 10)    (4, 5, 10)
Count   1 Tasks       1 Chunks
Type    float64       numpy.ndarray

Plotting

The Dataset object also comes with the ability to visualize its contents using the plot() function. Here we only show a simple application; a more detailed description can be found in the plotting section. Here we plot a spectral image: clicking in the image part of the plot on the left updates the spectrum on the right.

[10]:
data_set.plot()

The plotting depends on the data_type of the dataset and the dimension_type of each of its dimensions. Above, we set the dimension_type of the first two dimensions to spatial and that of the third one to spectral.

The data_type was spectral_image. So the spatial dimensions are recognized as relevant for an image and the third dimension is recognized as a spectrum, conducive to plotting as shown above. If we change the data_type to image, the default plotting behavior is to plot the first slice in the dataset (i.e. data_set[:,:,0]).

[11]:
data_set.data_type = 'image'
data_set.plot()

Saving

These Dataset objects will be deleted from memory once the python script completes or when a notebook is closed. The information collected in a Dataset can reliably be stored to files using functions in the sister packages pyUSID and pyNSID, which write the dataset according to the Universal Spectroscopy and Imaging Data (USID) or N-dimensional Spectroscopy and Imaging Data (NSID) formats. Here are links to how one could save such Datasets for each package:
