USID in HDF5¶
Suhas Somnath
8/8/2017
Here we discuss guidelines and specifications for implementing the USID model into HDF5 files. While we could impose that the file extension be changed from .hdf5 to .h5USID, we choose to retain the .hdf5 extension so that other software are aware of the general file type and can recognize / read them easily.
Main
data:¶
Dataset structured as (positions
x time or spectroscopic
values)
dtype
: uint8, float32, complex64, compound if necessary, etc.chunking (optional) : Chunking is optional and not necessary in h5USID. Typically, HDF5 /
h5py
will automatically do a good job of deciding implicit chunking such that performance is maximized. If chunking is indeed necessary for your application, HDF group recommends that chunks be between 100 kB to 1 MB. We recommend chunking by whole number of positions if possible since data is more likely to be read by position rather than by specific spectral indices.compression
(optional): We recommend evaluating the necessity and performance impacts that compression can have on your application. In certain cases, compression can result in very long (10 - 20x) longer read and write times. In such cases, it may be best to compress only if one does not expect to read / modify the dataset frequently (archival).Required attributes:
quantity
- Single string that explains the data. The physical quantity contained in each cell of the dataset – eg – ‘Current’ or ‘Deflection’units
– Single string for units. The units for the physical quantity like ‘nA’, ‘V’, ‘pF’, etc.Position_Indices
- Reference to the position indices datasetPosition_Values
- Reference to the position values datasetSpectroscopic_Indices
- Reference to the spectroscopic indices datasetSpectroscopic_Values
- Reference to the spectroscopic values dataset
Note that we are only storing references to the ancillary datasets. This
allows multiple main
datasets to share the same ancillary datasets
without having to duplicate them.
Ancillary
data:¶
Position_Indices
structured as (positions
x spatial dimensions
)
Dimensions are arranged in ascending order of rate of change. In other words, the fastest changing dimension is in the first column and the slowest is in the last or rightmost column.
dtype
: uint32Required attributes:
labels
- list of strings for the column names like [‘X’, ‘Y’]units
– list of strings for units like [‘um’, ‘nm’]
Optional attributes: * Region references based on column names
Position_Values
structured as (positions
x spatial dimensions
)
dimensions are arranged in ascending order of rate of change. In other words, the fastest changing dimension is in the first column and the slowest is in the last or rightmost column.
dtype
: float32Required attributes:
labels
- list of strings for the column names like [‘X’, ‘Y’]units
– list of strings for units like [‘um’, ‘nm’]
Optional attributes: * Region references based on column names
Spectroscopic_Indices
structured as (spectroscopic dimensions
x
time
)
dimensions are arranged in ascending order of rate of change. In other words, the fastest changing dimension is in the first row and the slowest is in the last or lowermost row.
dtype
: uint32Required attributes:
labels
- list of strings for the column names like [‘Bias’, ‘Cycle’]units
– list of strings for units like [‘V’, ‘’]. Empty string for dimensionless quantities
Optional attributes: * Region references based on row names
Spectroscopic_Values
structured as (spectroscopic dimensions
x
time
)
dimensions are arranged in ascending order of rate of change. In other words, the fastest changing dimension is in the first row and the slowest is in the last or lowermost row.
dtype
: float32Required attributes:
labels
- list of strings for the column names like [‘Bias’, ‘Cycle’]units
– list of strings for units like [‘V’, ‘’]. Empty string for dimensionless quantities
Optional attributes:
Region references based on row names
Attributes¶
All groups and (at least Main
) datasets must be created with the following mandatory attributes for better traceability:
time_stamp
: ‘2017_08_15-22_15_45’ (date and time of creation of the group or dataset formatted as ‘YYYY_MM_DD-HH_mm_ss’ as a string)machine_id
: ‘mac1234.ornl.gov’ (a fully qualified domain name as a string)pyUSID_version
: ‘0.0.1’platform
: ‘Windows10….’ or something like ‘Darwin-17.4.0-x86_64-i386-64bit’ (for Mac OS) - a long string providing detailed information about the operating system
Groups¶
HDF5 Groups in h5USID are used to organize categories of information (raw measurements from instruments, results from data analysis, etc.) in an intuitive manner.
Measurement data¶
As mentioned earlier, instrument users may change experimental parameters during measurements. Even if these changes are minor, they can lead to misinterpretation of data if the changes are not handled robustly. To solve this problem, we recommend storing data under indexed groups named as
Measurement_00x
. Each time the parameters are changed, the dataset is truncated to the point until which data was collected and a new group is created to store the upcoming new measurement data.Each channel of information acquired during the measurement gets its own group.
The
Main
datasets would reside within these channel groups.Similar to the measurement groups, the channel groups are named as
Channel_00x
. The index for the group is incremented according to the index of the information channel.Depending on the circumstances, the ancillary datasets can be shared among channels.
Instead of the main dataset in
Channel_001
having references to the ancillary datasets inChannel_000
, we recommend placing the ancillary datasets outside the Channel groups in a area common to both channel groups. Typically, this is theMeasurement_00x
group.
This is what the tree structure in the file looks like when experimental parameters were changed twice and there are two channels of information being acquired during the measurements.
Datasets common to all measurement groups (perhaps some calibration data that is acquired only once before all measurements)
Measurement_000
(group)Channel_000
(group)Datasets here
Channel_001
(group)Datasets here
Datasets common to
Channel_000
andChannel_001
Measurement_001
(group)Channel_000
(group)Datasets here
Channel_001
(group)Datasets here
Datasets common to
Channel_000
andChannel_001
…
Tool (analysis / processing)¶
Each time an analysis or processing routine, referred generally as
tool
, is performed on a dataset of interest, the results are stored in new HDF5 datasets within a new HSF5 group.A completely new dataset(s) and group are created even if a minor operation is being performed on the dataset. In other words, we do NOT modify existing datasets.
Almost always, the tool is applied to one (or more)
main
datasets (referred to as thesource
dataset) and at least one of the results is typically also amain
dataset. These newmain
datasets will either need to be linked to the ancillary matrices of thesource
or to new ancillary datasets that will need to be created.The resultant dataset(s) are always stored in a group whose name is derived from the names of the tool and the dataset. This makes the data traceable, meaning that the names of the datasets and groups are sufficient to understand what processing or analysis steps were applied to the data to bring it to a particular point.
The group is named as
Source_Dataset-Tool_Name_00x
, where atool
namedTool_Name
is applied to amain
dataset namedSource_Dataset
.Since there is a possibility that the same tool could be applied to the very same dataset multiple times, we store the results of each run of the tool in a separate group. These groups are differentiated by the index that is appended to the name of the group.
Note that a
-
separates the dataset name from the tool name and anything after the last_
will be assumed to be the index of the groupPlease refer to the advanced topics section for tools that have more than one
source
datasets
In general, the results from tools applied to datasets should be stored as:
Source_Dataset
Source_Dataset-Tool_Name_000
(group containing results from first run of thetool
onSource_Dataset
)Attributes:
all mandatory attributes
algorithm
Other tool-relevant attributes
source_000
- reference toSource_Dataset
Dataset_Result0
Dataset_Result1
…
Source_Dataset-Tool_Name_001
(group containing results from second run of thetool
onSource_Dataset
)
This methodology is illustrated with an example of applying
K-Means Clustering
on theRaw_Data
acquired from a measurement:Raw_Data
(main
dataset)Raw_Data-Cluster_000
(group)Attributes:
all mandatory attributes
algorithm
: ‘K-Means’source_000
: reference toRaw_Data
Label_Indices
(ancillary spectroscopic dataset with 1 dimension of size 1)Label_Values
(ancillary spectroscopic dataset with 1 dimension of size 1)Labels
(Main dataset)Attributes:
quantity
: ‘Cluster labels’units
: ‘a. u.’Position_Indices
: Reference toPosition_Indices
from attribute ofRaw_Data
Position_Values
: Reference toPosition_Values
from attribute ofRaw_Data
Spectroscopic_Indices
: Reference toLabel_Indices
Spectroscopic_Values
: Reference toLabel_Values
all mandatory attributes
Cluster_Indices
(ancillary positions dataset with 1 dimension of size equal to number of clusters)Cluster_Values
(ancillary positions dataset with 1 dimension of size equal to number of clusters)Mean_Response
(main dataset) <- This dataset stores the endmembers or mean response for each clusterAttributes:
quantity
: copy from thequantity
attribute inRaw_Data
units
: copy from theunits
attribute inRaw_Data
Position_Indices
: Reference toCluster_Indices
Position_Values
: Reference toCluster_Values
Spectroscopic_Indices
: Reference toSpectroscopic_Indices
from attribute ofRaw_Data
Spectroscopic_Values
: Reference toSpectroscopic_Values
from attribute ofRaw_Data
all mandatory attributes
Note that the spectroscopic datasets that the
Labels
dataset link to are not calledSpectroscopic_Indices
orSpectroscopic_Values
themselves. They only need to follow the specifications outlined above. The same is true for the position datasets forMean_Response
.