Creating and Manipulating Datasets
Gerd Duscher and Suhas Somnath
08/25/2020
This document is a simple example of how to create and manipulate Dataset objects.
UNDER CONSTRUCTION
[1]:
%matplotlib widget
import matplotlib.pyplot as plt
import numpy as np
import sys
sys.path.insert(0, '../../')
import sidpy
print('sidpy version: ', sidpy.__version__)
sidpy version: 0.12.3
Creating a sidpy.Dataset object
We can create a simple sidpy Dataset from any array-like object. Here we just use a numpy array filled with random numbers.
[2]:
dataset = sidpy.Dataset.from_array(np.random.random([4, 5, 10]), name='random')
print(dataset)
dataset
sidpy.Dataset of type UNKNOWN with:
dask.array<array, shape=(4, 5, 10), dtype=float64, chunksize=(4, 5, 10), chunktype=numpy.ndarray>
data contains: generic (generic)
and Dimensions:
a: generic (generic) of size (4,)
b: generic (generic) of size (5,)
c: generic (generic) of size (10,)
[3]:
dataset.provenance
[3]:
{'sidpy': {'from_array_': '_0.12.3_2024-07-12-16:56:52.772361'}}
[4]:
new_dataset = dataset.like_data(np.random.random([4, 5, 10]), name='random2')
print(new_dataset)
new_dataset.provenance
sidpy.Dataset of type UNKNOWN with:
dask.array<array, shape=(4, 5, 10), dtype=float64, chunksize=(4, 5, 10), chunktype=numpy.ndarray>
data contains: generic (generic)
and Dimensions:
a: generic (generic) of size (4,)
b: generic (generic) of size (5,)
c: generic (generic) of size (10,)
[4]:
{'sidpy': {'like_data': '_0.12.3_2024-07-12-16:57:11.387639',
'parent_data': {'title': 'generic',
'provenance': {'sidpy': {'from_array_': '_0.12.3_2024-07-12-16:56:52.772361'}}}}}
[6]:
new_dataset2 = dataset.like_data(np.random.random([4, 5, 10]), name='random2')
print(new_dataset2)
new_dataset2.provenance
sidpy.Dataset of type UNKNOWN with:
dask.array<array, shape=(4, 5, 10), dtype=float64, chunksize=(4, 5, 10), chunktype=numpy.ndarray>
data contains: generic (generic)
and Dimensions:
a: generic (generic) of size (4,)
b: generic (generic) of size (5,)
c: generic (generic) of size (10,)
[6]:
{'sidpy': {'like_data': '_0.12.3_2024-07-12-16:59:30.149263',
'parent_data': {'title': 'generic',
'provenance': {'sidpy': {'from_array_': '_0.12.3_2024-07-12-16:56:52.772361'}}}}}
[5]:
new_dataset.add_provenance('Nothin', 'done', 0.0)
new_dataset.provenance
[5]:
{'sidpy': {'like_data': '_0.12.3_2024-07-12-16:57:11.387639',
'parent_data': {'title': 'generic',
'provenance': {'sidpy': {'from_array_': '_0.12.3_2024-07-12-16:56:52.772361'}}}},
'Nothin': {'done': '_0.0_2024-07-12-16:57:33.804301'}}
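The provenance dictionary is nested: a derived dataset keeps the provenance of its parent under parent_data. As a minimal sketch in plain Python (no sidpy required), the helper provenance_chain below walks such a dictionary and lists the recorded operations. The helper name is hypothetical and not part of sidpy; the dictionary is copied from the output above.

```python
def provenance_chain(provenance):
    """Hypothetical helper: list (package, operation, stamp) entries,
    starting with this dataset and then following 'parent_data' links."""
    chain = []
    current = provenance
    while current is not None:
        parent = None
        for package, entries in current.items():
            for operation, stamp in entries.items():
                if operation == 'parent_data':
                    # The parent's own provenance is nested one level down.
                    # If several parents were present, the last one found wins.
                    parent = stamp.get('provenance')
                else:
                    chain.append((package, operation, stamp))
        current = parent
    return chain


# Provenance dictionary copied from the cell output above.
provenance = {
    'sidpy': {'like_data': '_0.12.3_2024-07-12-16:57:11.387639',
              'parent_data': {'title': 'generic',
                              'provenance': {'sidpy': {'from_array_': '_0.12.3_2024-07-12-16:56:52.772361'}}}},
    'Nothin': {'done': '_0.0_2024-07-12-16:57:33.804301'},
}

chain = provenance_chain(provenance)
for package, operation, stamp in chain:
    print(package, operation, stamp)
```

The entries of the dataset itself come first, followed by those of its parent.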
Note that data_set (the name used for the dataset in the cells below) is a dask array. We will be improving the information that is displayed when printing sidpy.Dataset objects.
Accessing data within a Dataset: Indexing of the dataset works like in numpy. Note that we first index and then convert the result to a numpy array for printing.
[3]:
print(np.array(data_set[:,0,2]))
[0.41955319 0.59615527 0.8109613 0.34605858]
Slicing and dicing:
[14]:
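# NOTE: `blobs` and `ase` are not defined in this notebook and must be
# provided beforehand for this cell to run.
data_dictionary = {'main_dataset': data_set,
                   'new_dataset': data_set,
                   'metadata': {'atoms': blobs},
                   'structure': {'SrTiO3': ase.build.SrTiO3()}}
data_dictionary['new_dataset'].metadata = {'origin_dataset': 'main_dataset'}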
Metadata
sidpy automatically assigns generic top-level metadata regarding the Dataset. Users are encouraged to capture the context regarding the dataset. The attributes included in the sidpy dataset are:
Required Attributes:
quantity: string: Physical quantity that is contained in this dataset
units: string: Units for this physical quantity
data_type: string: What kind of data this is. Example - image, image stack, video, hyperspectral image, etc.
modality: string: Experimental / simulation modality - scientific meaning of data. Example - photograph, TEM micrograph, SPM Force-Distance spectroscopy.
source: string: Source for dataset like the kind of instrument. One could go very deep here into either the algorithmic details if this is a result from analysis or the exact configurations for the instrument that generated this dataset.
These attributes are initially set to generic, but one would want to set them for the specific dataset. The attributes data_type, quantity and units are important for plotting the data.
Here’s how one could do that, but with an unsupported keyword:
[4]:
data_set.data_type = 'spectrum_image' # not supported
---------------------------------------------------------------------------
Warning Traceback (most recent call last)
<ipython-input-4-6099823a7a09> in <module>
----> 1 data_set.data_type = 'spectrum_image' # not supported
~/Dropbox (ORNL)/Python_scripts/sidpy/sidpy/sid/dataset.py in data_type(self, value)
598 else:
599 self._data_type = DataType.UNKNOWN
--> 600 raise Warning('Supported data_types for plotting are only: ', DataType._member_names_)
601
602 elif isinstance(value, DataType):
Warning: ('Supported data_types for plotting are only: ', ['UNKNOWN', 'SPECTRUM', 'LINE_PLOT', 'LINE_PLOT_FAMILY', 'IMAGE', 'IMAGE_MAP', 'IMAGE_STACK', 'SPECTRAL_IMAGE', 'IMAGE_4D'])
Here’s how one could do that successfully:
[5]:
data_set.data_type = 'spectral_image' # supported
data_set.units = 'nA'
data_set.quantity = 'Current'
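The two cells above differ only in whether the string matches one of the supported data types. As a rough illustration of that kind of check (not sidpy's actual implementation), the sketch below builds an enumeration from the member names listed in the warning and looks a string up case-insensitively. The helper lookup_data_type is hypothetical, and the numeric values of the members are arbitrary.

```python
from enum import Enum

# Member names copied from the warning message above; the numeric values
# assigned by the functional Enum API are arbitrary in this sketch.
DataType = Enum('DataType', ['UNKNOWN', 'SPECTRUM', 'LINE_PLOT', 'LINE_PLOT_FAMILY',
                             'IMAGE', 'IMAGE_MAP', 'IMAGE_STACK', 'SPECTRAL_IMAGE',
                             'IMAGE_4D'])


def lookup_data_type(value):
    """Hypothetical helper: return the matching DataType member, or None."""
    return DataType.__members__.get(value.upper())


for candidate in ('spectrum_image', 'spectral_image'):
    print(candidate, '->', lookup_data_type(candidate))
```

Only the second string matches a member name, which mirrors why 'spectrum_image' is rejected while 'spectral_image' is accepted.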
Scientific metadata
These Dataset objects can also capture rich scientific metadata such as acquisition parameters.
We would want to add those parameters as attributes. These attributes could be lists, numpy arrays or simple dictionaries.
It is recommended to add any parameters to the (nested) metadata dictionary. These metadata can then be viewed in dataset.view_metadata and dataset.view_original_metadata. It is also encouraged to add any parameters of data analysis to the datasets, to keep track of input parameters.
There is a size limit of 64 kB for the storage of dictionaries in h5py. Therefore, large data such as reference data should be added directly as attributes. All attributes that you add to a dataset will be stored within the pyNSID file.
Please note that the dictionary original_metadata should not be changed, so that information provided by the acquisition device stays pristine. Relevant information should instead be copied over to the metadata attribute/dictionary.
Here I made up some metadata as an illustration:
[6]:
data_set.calibration = np.arange(5)
data_set.metadata = {'nothing': ' ', 'value': 6.8, 'instrument': {'microscope': 'Nion', 'acceleration_voltage':60000}}
data_set.metadata['acquired'] = 'nowhere'
print(data_set.calibration)
sidpy.dict_utils.print_nested_dict(data_set.metadata)
[0 1 2 3 4]
nothing :
value : 6.8
instrument :
microscope : Nion
acceleration_voltage : 60000
acquired : nowhere
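The advice to leave original_metadata untouched and copy relevant entries into metadata can be sketched with plain dictionaries. The example below assumes made-up contents for original_metadata (the 'vendor_settings' entry and the 'note' key are hypothetical) and uses copy.deepcopy so that later edits to the copy cannot reach the original.

```python
import copy

# Hypothetical stand-in for what an acquisition device might provide.
original_metadata = {'instrument': {'microscope': 'Nion',
                                    'acceleration_voltage': 60000},
                     'vendor_settings': {'mode': 'unknown'}}

# Deep-copy only the relevant branch; a plain assignment would share the
# nested dictionary, so edits would leak back into original_metadata.
metadata = {'instrument': copy.deepcopy(original_metadata['instrument'])}
metadata['instrument']['note'] = 'copied from original_metadata'

print(metadata)
print(original_metadata['instrument'])
```

The second print shows that the original entry has no 'note' key, so the device-provided information stays pristine.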
Another kind of metadata in these Datasets describes their dimensions:
Dimensions
The Dataset is automatically populated with generic information about each dimension of the Dataset. It is a good idea to capture context regarding each of these dimensions using sidpy.Dimension. At a minimum we need a name and values (of the same length as the corresponding dimension of the data). Beyond that, one can provide as much or as little information about each dimension as desired.
[7]:
data_set.set_dimension(0, sidpy.Dimension(np.arange(data_set.shape[0]),
                                          name='x', units='um', quantity='Length',
                                          dimension_type='spatial'))
data_set.set_dimension(1, sidpy.Dimension(np.linspace(-2, 2, num=data_set.shape[1], endpoint=True),
                                          name='y', units='um', quantity='Length',
                                          dimension_type='spatial'))
data_set.set_dimension(2, sidpy.Dimension(np.sin(np.linspace(0, 2 * np.pi, num=data_set.shape[2])),
                                          name='bias'))
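The requirement that the values have the same length as the corresponding axis of the data can be checked before a dimension is attached. The following is a minimal sketch in plain Python; check_dimension_values is a hypothetical helper, not a sidpy function, and how sidpy itself reacts to a mismatch is not shown here.

```python
def check_dimension_values(shape, axis, values):
    """Hypothetical check: one value is needed per element along `axis`."""
    if len(values) != shape[axis]:
        raise ValueError(
            f'dimension {axis} needs {shape[axis]} values, got {len(values)}')
    return True


shape = (4, 5, 10)                    # shape of the dataset used above
x_values = list(range(shape[0]))      # same values as np.arange(4)
# Evenly spaced values from -2 to 2, like np.linspace(-2, 2, 5)
y_values = [-2 + i * 4 / (shape[1] - 1) for i in range(shape[1])]

check_dimension_values(shape, 0, x_values)
check_dimension_values(shape, 1, y_values)
try:
    # Five values cannot describe an axis of length 10.
    check_dimension_values(shape, 2, y_values)
except ValueError as err:
    print(err)
```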
One could also manually add information regarding specific components of dimensions associated with Datasets via:
[8]:
data_set.bias.dimension_type = 'spectral'
data_set.bias.units = 'V'
data_set.bias.quantity = 'Bias'
Let’s take a look at what the dataset looks like with the additional information regarding the dimensions.
We can access a dimension by its name or by the dimension number.
Also the print function now provides a little more information about our dataset.
[9]:
print(data_set.bias)
print(data_set.dim_1)
print(data_set)
data_set
bias: Bias (V) of size (10,)
y: Length (um) of size (5,)
sidpy.Dataset of type SPECTRAL_IMAGE with:
dask.array<random, shape=(4, 5, 10), dtype=float64, chunksize=(4, 5, 10), chunktype=numpy.ndarray>
data contains: Current (nA)
and Dimensions:
x: Length (um) of size (4,)
y: Length (um) of size (5,)
bias: Bias (V) of size (10,)
with metadata: ['nothing', 'value', 'instrument', 'acquired']
Plotting
The Dataset object also comes with the ability to visualize its contents using the plot() function. Here we only show a simple application; a more detailed description can be found in the plotting section. Below we plot a spectral image: click in the image part of the plot on the left and the spectrum on the right will update.
[10]:
data_set.plot()
The plotting depends on the data_type of the dataset and the dimension_type of each of its dimensions. Above, we set the dimension_type of the first two dimensions to spatial and that of the third one to spectral.
The data_type was spectral_image. So the spatial dimensions are recognized as relevant for an image and the third dimension is recognized as a spectrum, conducive to plotting as shown above. If we change the data_type to image, the default plotting behavior is to plot the first slice in the dataset (i.e. data_set[:,:,0]).
[11]:
data_set.data_type = 'image'
data_set.plot()
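The idea that the dimension types decide which axes form the image and which form the spectrum can be illustrated with a few lines of plain Python. This is only a sketch of the grouping step under that assumption, not sidpy's plotting code; group_axes_by_type is a hypothetical helper.

```python
def group_axes_by_type(dimension_types):
    """Hypothetical helper: collect the axis indices for each dimension_type."""
    groups = {}
    for axis, dim_type in enumerate(dimension_types):
        groups.setdefault(dim_type, []).append(axis)
    return groups


# Dimension types as set for x, y and bias above.
dimension_types = ['spatial', 'spatial', 'spectral']
groups = group_axes_by_type(dimension_types)

image_axes = groups.get('spatial', [])      # axes shown as the image
spectrum_axes = groups.get('spectral', [])  # axis shown as the spectrum
print('image axes:', image_axes, 'spectrum axes:', spectrum_axes)
```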
Saving
These Dataset objects will be deleted from memory once the python script completes or when a notebook is closed. The information collected in a Dataset can reliably be stored to files using functions in the sister packages pyUSID and pyNSID, which write the dataset according to the Universal Spectroscopy and Imaging Data (USID) or N-dimensional Spectroscopy and Imaging Data (NSID) formats. Here are links to how one could save such Datasets for each package: