{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Utilities for reading h5USID files\n", "\n", "**Suhas Somnath**\n", "\n", "4/18/2018\n", "\n", "**This document illustrates the many handy functions in sidpy.hdf.hdf_utils and pyUSID.hdf_utils that significantly simplify reading data\n", "and metadata in Universal Spectroscopy and Imaging Data (USID) HDF5 files (h5USID files)**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note**: Most of the functions demonstrated in this notebook have been moved out of ``pyUSID.hdf_utils`` and into ``sidpy.hdf``\n", "
\n", "\n", "## Introduction\n", "The USID model uses a data-centric approach to data analysis and processing, meaning that results from all data analysis\n", "and processing are written to the same h5 file that contains the recorded measurements. **Hierarchical Data Format\n", "(HDF5)** files allow data, whether it is raw measured data or results of analysis, to be stored in multiple datasets within\n", "the same file in a tree-like manner. Certain rules and considerations have been made in pyUSID to ensure\n", "consistent and easy access to any data.\n", "\n", "The h5py python package provides great functions to create, read, and manage data in HDF5 files. In\n", "``pyUSID.hdf_utils``, we have added functions that facilitate scientifically relevant, or USID specific\n", "functionality such as checking if a dataset is a Main dataset, reshaping to / from the original N dimensional form of\n", "the data, etc. Due to the wide breadth of the functions in ``hdf_utils``, the guide for ``hdf_utils`` will be split in two\n", "parts - one that focuses on functions that facilitate reading and one that focuses on functions that facilitate writing of data. The following\n", "guide provides examples of how, and more importantly when, to use functions in ``pyUSID.hdf_utils`` for various\n", "scenarios.\n", "\n", "## Recommended pre-requisite reading\n", "* [Universal Spectroscopic and Imaging Data (USID) model](https://pycroscopy.github.io/USID/usid_model.html)\n", "* [Crash course on HDF5 and h5py](./h5py_primer.html)\n", "\n", "\n", "## Import all necessary packages\n", "\n", "Before we begin demonstrating the numerous functions in ``pyUSID.hdf_utils``, we need to import the necessary\n", "packages. 
Here is a list of packages besides pyUSID that will be used in this example:\n", "\n", "* ``h5py`` - to open and close the file\n", "* ``wget`` - to download the example data file\n", "* ``numpy`` - for numerical operations on arrays in memory\n", "* ``matplotlib`` - basic visualization of data\n", "* ``sidpy`` - basic scientific hdf5 capabilities" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from __future__ import print_function, division, unicode_literals\n", "import os\n", "# Warning package in case something goes wrong\n", "from warnings import warn\n", "import subprocess\n", "import sys\n", "\n", "\n", "def install(package):\n", " subprocess.call([sys.executable, \"-m\", \"pip\", \"install\", package])\n", "# Package for downloading online files:\n", "\n", "try:\n", " # This package is not part of anaconda and may need to be installed.\n", " import wget\n", "except ImportError:\n", " warn('wget not found. Will install with pip.')\n", " import pip\n", " # The package name must be passed as a string\n", " install('wget')\n", " import wget\n", "import h5py\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# import sidpy - supporting package for pyUSID:\n", "try:\n", " import sidpy\n", "except ImportError:\n", " warn('sidpy not found. Will install with pip.')\n", " import pip\n", " install('sidpy')\n", " import sidpy\n", "\n", "# Finally import pyUSID.\n", "try:\n", " import pyUSID as usid\n", "except ImportError:\n", " warn('pyUSID not found. 
Will install with pip.')\n", " import pip\n", " install('pyUSID')\n", " import pyUSID as usid" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to demonstrate the many functions in ``hdf_utils``, we will be using an h5USID file containing real\n", "experimental data along with results from analyses on the measurement data.\n", "\n", "### This scientific dataset\n", "\n", "For this example, we will be working with a **Band Excitation Polarization Switching (BEPS)** dataset acquired from\n", "advanced atomic force microscopes. In the much simpler **Band Excitation (BE)** imaging datasets, a single spectrum is\n", "acquired at each location in a two dimensional grid of spatial locations. Thus, BE imaging datasets have two\n", "position dimensions (``X``, ``Y``) and one spectroscopic dimension (``Frequency`` - against which the spectrum is recorded).\n", "The BEPS dataset used in this example has a spectrum for **each combination of** three other parameters (``DC offset``,\n", "``Field``, and ``Cycle``). Thus, this dataset has three new spectroscopic dimensions in addition to ``Frequency``. Hence,\n", "this is a 2 + 4 = **6 dimensional dataset**.\n", "\n", "### Load the dataset\n", "First, let us download this file from the pyUSID GitHub project:\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Working on:\n", "temp.h5\n" ] } ], "source": [ "url = 'https://raw.githubusercontent.com/pycroscopy/pyUSID/master/data/BEPS_small.h5'\n", "h5_path = 'temp.h5'\n", "_ = wget.download(url, h5_path, bar=None)\n", "\n", "print('Working on:\\n' + h5_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's open this HDF5 file in read-only mode. Note that opening the file does not cause the contents to be\n", "automatically loaded into memory. 
Instead, we are presented with objects that refer to specific HDF5 datasets,\n", "attributes or groups in the file\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "h5_path = 'temp.h5'\n", "h5_f = h5py.File(h5_path, mode='r')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, ``h5_f`` is an active handle to the open file\n", "\n", "## Inspect HDF5 contents\n", "\n", "The file contents are stored in a tree structure, just like files on a contemporary computer. The file contains\n", "groups (similar to file folders) and datasets (similar to spreadsheets).\n", "There are several datasets in the file and these store:\n", "\n", "* The actual measurement collected from the experiment\n", "* Spatial location on the sample where each measurement was collected\n", "* Information to support and explain the spectral data collected at each location\n", "* Since the USID model stores results from processing and analyses performed on the data in the same h5USID file,\n", " these datasets and groups are present as well\n", "* Any other relevant ancillary information\n", "\n", "### print_tree()\n", "Soon after opening any file, it is often of interest to list the contents of the file. 
While one can use the open\n", "source software HDFView developed by The HDF Group, ``sidpy.hdf_utils`` also has a very handy function -\n", "``print_tree()`` to quickly visualize all the datasets and groups in the file from within python.\n", "\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Contents of the H5 file:\n", "/\n", "├ Measurement_000\n", " ---------------\n", " ├ Channel_000\n", " -----------\n", " ├ Bin_FFT\n", " ├ Bin_Frequencies\n", " ├ Bin_Indices\n", " ├ Bin_Step\n", " ├ Bin_Wfm_Type\n", " ├ Excitation_Waveform\n", " ├ Noise_Floor\n", " ├ Position_Indices\n", " ├ Position_Values\n", " ├ Raw_Data\n", " ├ Raw_Data-SHO_Fit_000\n", " --------------------\n", " ├ Fit\n", " ├ Guess\n", " ├ Spectroscopic_Indices\n", " ├ Spectroscopic_Values\n", " ├ Spatially_Averaged_Plot_Group_000\n", " ---------------------------------\n", " ├ Bin_Frequencies\n", " ├ Mean_Spectrogram\n", " ├ Spectroscopic_Parameter\n", " ├ Step_Averaged_Response\n", " ├ Spatially_Averaged_Plot_Group_001\n", " ---------------------------------\n", " ├ Bin_Frequencies\n", " ├ Mean_Spectrogram\n", " ├ Spectroscopic_Parameter\n", " ├ Step_Averaged_Response\n", " ├ Spectroscopic_Indices\n", " ├ Spectroscopic_Values\n", " ├ UDVS\n", " ├ UDVS_Indices\n" ] } ], "source": [ "print('Contents of the H5 file:')\n", "sidpy.hdf_utils.print_tree(h5_f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, ``print_tree()`` presents a clean tree view of the contents of the group. In this mode, only the group names\n", "are underlined. Alternatively, it can print the path of each dataset and group, relative to the group / file\n", "of interest, by setting the ``rel_paths``\n", "keyword argument. ``print_tree()`` can also be used to display the contents of an HDF5 group instead of the complete HDF5\n", "file as we have done above. 
Let's configure it to print the relative paths of all objects within the ``Channel_000``\n", "group:\n", "\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Measurement_000/Channel_000\n", "Bin_FFT\n", "Bin_Frequencies\n", "Bin_Indices\n", "Bin_Step\n", "Bin_Wfm_Type\n", "Excitation_Waveform\n", "Noise_Floor\n", "Position_Indices\n", "Position_Values\n", "Raw_Data\n", "Raw_Data-SHO_Fit_000\n", "Raw_Data-SHO_Fit_000/Fit\n", "Raw_Data-SHO_Fit_000/Guess\n", "Raw_Data-SHO_Fit_000/Spectroscopic_Indices\n", "Raw_Data-SHO_Fit_000/Spectroscopic_Values\n", "Spatially_Averaged_Plot_Group_000\n", "Spatially_Averaged_Plot_Group_000/Bin_Frequencies\n", "Spatially_Averaged_Plot_Group_000/Mean_Spectrogram\n", "Spatially_Averaged_Plot_Group_000/Spectroscopic_Parameter\n", "Spatially_Averaged_Plot_Group_000/Step_Averaged_Response\n", "Spatially_Averaged_Plot_Group_001\n", "Spatially_Averaged_Plot_Group_001/Bin_Frequencies\n", "Spatially_Averaged_Plot_Group_001/Mean_Spectrogram\n", "Spatially_Averaged_Plot_Group_001/Spectroscopic_Parameter\n", "Spatially_Averaged_Plot_Group_001/Step_Averaged_Response\n", "Spectroscopic_Indices\n", "Spectroscopic_Values\n", "UDVS\n", "UDVS_Indices\n" ] } ], "source": [ "sidpy.hdf_utils.print_tree(h5_f['/Measurement_000/Channel_000/'], rel_paths=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, ``print_tree()`` can also be configured to print only USID Main datasets, in addition to Group objects, using the ``main_dsets_only`` option. 
\n", "\n", "**Note**: only ``pyUSID`` has this capability, unlike ``sidpy``:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/\n", "├ Measurement_000\n", " ---------------\n", " ├ Channel_000\n", " -----------\n", " ├ Raw_Data\n", " ├ Raw_Data-SHO_Fit_000\n", " --------------------\n", " ├ Fit\n", " ├ Guess\n", " ├ Spatially_Averaged_Plot_Group_000\n", " ---------------------------------\n", " ├ Spatially_Averaged_Plot_Group_001\n", " ---------------------------------\n" ] } ], "source": [ "usid.hdf_utils.print_tree(h5_f, main_dsets_only=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Accessing Attributes\n", "\n", "HDF5 datasets and groups can also store metadata such as experimental parameters. These metadata can be text,\n", "numbers, small lists of numbers or text etc. These metadata can be very important for understanding the datasets\n", "and guiding the analysis routines.\n", "\n", "While one could use the basic ``h5py`` functionality to access attributes, one would encounter a lot of problems when\n", "attempting to decode attributes whose values were strings or lists of strings due to some issues in ``h5py``. This problem\n", "has been demonstrated in our [primer to HDF5 and h5py](./plot_h5py.html). Instead of using the basic functionality of ``h5py``, we recommend always\n", "using the functions in ``sidpy.hdf_utils`` that reliably and consistently work for any kind of attribute for any version of\n", "python:\n", "\n", "### get_attributes()\n", "\n", "``get_attributes()`` is a very handy function that returns all or a specified set of attributes in an HDF5 object. 
If no\n", "attributes are explicitly requested, all attributes in the object are returned:\n", "\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "current_position_x : 4\n", "experiment_date : 26-Feb-2015 14:49:48\n", "data_tool : be_analyzer\n", "translator : ODF\n", "project_name : Band Excitation\n", "current_position_y : 4\n", "experiment_unix_time : 1503428472.2374\n", "sample_name : PZT\n", "xcams_id : abc\n", "user_name : John Doe\n", "comments : Band Excitation data\n", "Pycroscopy version : 0.0.a51\n", "data_type : BEPSData\n", "translate_date : 2017_08_22\n", "project_id : CNMS_2015B_X0000\n", "grid_size_y : 5\n", "sample_description : Thin Film\n", "instrument : cypher_west\n", "grid_size_x : 5\n" ] } ], "source": [ "for key, val in sidpy.hdf_utils.get_attributes(h5_f).items():\n", " print('{} : {}'.format(key, val))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "``get_attributes()`` is also great for only getting selected attributes. 
For example, if we only cared about the user\n", "and project related attributes, we could explicitly request only those that we wanted:\n", "\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "user_name : John Doe\n", "project_name : Band Excitation\n", "project_id : CNMS_2015B_X0000\n" ] } ], "source": [ "proj_attrs = sidpy.hdf_utils.get_attributes(h5_f, ['project_name', 'project_id', 'user_name'])\n", "for key, val in proj_attrs.items():\n", " print('{} : {}'.format(key, val))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### get_attr()\n", "\n", "If we only want a single, specific attribute, we could instead use ``get_attr()`` as:\n", "\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "John Doe\n" ] } ], "source": [ "print(sidpy.hdf_utils.get_attr(h5_f, 'user_name'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### check_for_matching_attrs()\n", "Consider the scenario where we have several HDF5 files, Groups, or datasets and we want to check each one to\n", "see if it has certain metadata / attributes. ``check_for_matching_attrs()`` is one very handy function that\n", "simplifies the comparison operation.\n", "\n", "For example, let us check if this file was authored by ``John Doe``:\n", "\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n" ] } ], "source": [ "print(sidpy.hdf.prov_utils.check_for_matching_attrs(h5_f, \n", " new_parms={'user_name': 'John Doe'}))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Finding datasets and groups\n", "\n", "There are numerous ways to search for and access datasets and groups in H5 files using the basic functionalities\n", "of h5py. 
``pyUSID.hdf_utils`` contains several functions that simplify common searching / lookup operations as part of\n", "scientific workflows.\n", "\n", "### find_dataset()\n", "\n", "The ``find_dataset()`` function will return all datasets whose names contain the provided string. In this case, we\n", "are looking for any datasets containing the string ``UDVS`` in their names. If you look above, there are two datasets\n", "(``UDVS`` and ``UDVS_Indices``) that match this condition:\n", "\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n" ] } ], "source": [ "udvs_dsets_2 = usid.hdf_utils.find_dataset(h5_f, 'UDVS')\n", "for item in udvs_dsets_2:\n", " print(item)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you might know by now, h5USID files contain three kinds of datasets:\n", "\n", "* ``Main`` datasets that contain data recorded / computed at multiple spatial locations\n", "* ``Ancillary`` datasets that support a main dataset\n", "* Other datasets\n", "\n", "For more information, please refer to the documentation on the USID model.\n", "\n", "### check_if_main()\n", "``check_if_main()`` is a very handy function that helps distinguish between ``Main`` datasets and other objects\n", "(``Ancillary`` datasets, other datasets, Groups etc.). 
Let's apply this function to see which of the objects within the\n", "``Channel_000`` Group are ``Main`` datasets:\n", "\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Main Datasets:\n", "----------------\n", "Raw_Data\n", "\n", "Objects that were not Main datasets:\n", "--------------------------------------\n", "Bin_FFT\n", "Bin_Frequencies\n", "Bin_Indices\n", "Bin_Step\n", "Bin_Wfm_Type\n", "Excitation_Waveform\n", "Noise_Floor\n", "Position_Indices\n", "Position_Values\n", "Raw_Data-SHO_Fit_000\n", "Spatially_Averaged_Plot_Group_000\n", "Spatially_Averaged_Plot_Group_001\n", "Spectroscopic_Indices\n", "Spectroscopic_Values\n", "UDVS\n", "UDVS_Indices\n" ] } ], "source": [ "h5_chan_group = h5_f['Measurement_000/Channel_000']\n", "\n", "# We will prepare two lists - one of objects that are ``main`` and one of objects that are not\n", "\n", "non_main_objs = []\n", "main_objs = []\n", "for key, val in h5_chan_group.items():\n", "    if usid.hdf_utils.check_if_main(val):\n", "        main_objs.append(key)\n", "    else:\n", "        non_main_objs.append(key)\n", "\n", "# Now we simply print the names of the items in each list\n", "\n", "print('Main Datasets:')\n", "print('----------------')\n", "for item in main_objs:\n", "    print(item)\n", "print('\\nObjects that were not Main datasets:')\n", "print('--------------------------------------')\n", "for item in non_main_objs:\n", "    print(item)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above script allowed us to distinguish Main datasets from all other objects only within the Group named\n", "``Channel_000``.\n", "\n", "### get_all_main()\n", "What if we want to quickly find all ``Main`` datasets even within the sub-Groups of ``Channel_000``? 
To do this, we have a\n", "very handy function called - ``get_all_main()``:\n", "\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "located at: \n", "\t/Measurement_000/Channel_000/Raw_Data \n", "Data contains: \n", "\tCantilever Vertical Deflection (V) \n", "Data dimensions and original shape: \n", "Position Dimensions: \n", "\tX - size: 5 \n", "\tY - size: 5 \n", "Spectroscopic Dimensions: \n", "\tFrequency - size: 87 \n", "\tDC_Offset - size: 64 \n", "\tField - size: 2 \n", "\tCycle - size: 2\n", "Data Type:\n", "\tcomplex64\n", "--------------------------------------------------------------------\n", "\n", "located at: \n", "\t/Measurement_000/Channel_000/Raw_Data-SHO_Fit_000/Fit \n", "Data contains: \n", "\tSHO parameters (compound) \n", "Data dimensions and original shape: \n", "Position Dimensions: \n", "\tX - size: 5 \n", "\tY - size: 5 \n", "Spectroscopic Dimensions: \n", "\tDC_Offset - size: 64 \n", "\tField - size: 2 \n", "\tCycle - size: 2\n", "Data Fields:\n", "\tPhase [rad], R2 Criterion, Quality Factor, Amplitude [V], Frequency [Hz]\n", "--------------------------------------------------------------------\n", "\n", "located at: \n", "\t/Measurement_000/Channel_000/Raw_Data-SHO_Fit_000/Guess \n", "Data contains: \n", "\tSHO parameters (compound) \n", "Data dimensions and original shape: \n", "Position Dimensions: \n", "\tX - size: 5 \n", "\tY - size: 5 \n", "Spectroscopic Dimensions: \n", "\tDC_Offset - size: 64 \n", "\tField - size: 2 \n", "\tCycle - size: 2\n", "Data Fields:\n", "\tPhase [rad], R2 Criterion, Quality Factor, Amplitude [V], Frequency [Hz]\n", "--------------------------------------------------------------------\n" ] } ], "source": [ "main_dsets = usid.hdf_utils.get_all_main(h5_chan_group)\n", "for dset in main_dsets:\n", " print(dset)\n", " print('--------------------------------------------------------------------')" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "The datasets above show that the file contains three main datasets. Two of these datasets are contained in a HDF5\n", "Group called ``Raw_Data-SHO_Fit_000`` meaning that they are results of an operation called ``SHO_Fit`` performed on the\n", "``Main`` dataset - ``Raw_Data``. The first of the three main datasets is indeed the ``Raw_Data`` dataset from which the\n", "latter two datasets (``Fit`` and ``Guess``) were derived.\n", "\n", "The USID model allows the same operation, such as ``SHO_Fit``, to be performed on the same dataset (``Raw_Data``),\n", "multiple\n", "times. Each time the operation is performed, a new HDF5 Group is created to hold the new results. Often, we may\n", "want to perform a few operations such as:\n", "\n", "* Find the (source / main) dataset from which certain results were derived\n", "* Check if a particular operation was performed on a main dataset\n", "* Find all groups corresponding to a particular operation (e.g. - ``SHO_Fit``) being applied to a Main dataset\n", "\n", "``hdf_utils`` has a few handy functions for many of these use cases.\n", "\n", "### find_results_groups()\n", "First, lets show that ``find_results_groups()`` finds all Groups containing the results of a ``SHO_Fit`` operation applied\n", "to ``Raw_Data``:\n", "\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Instances of operation \"SHO_Fit\" applied to dataset named \"/Measurement_000/Channel_000/Raw_Data\":\n", "[]\n" ] } ], "source": [ "# First get the dataset corresponding to Raw_Data\n", "h5_raw = h5_chan_group['Raw_Data']\n", "\n", "operation = 'SHO_Fit'\n", "print('Instances of operation \"{}\" applied to dataset named \"{}\":'.format(operation, h5_raw.name))\n", "h5_sho_group_list = usid.hdf_utils.find_results_groups(h5_raw, operation)\n", "print(h5_sho_group_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As 
expected, the ``SHO_Fit`` operation was performed on the ``Raw_Data`` dataset only once, which is why\n", "``find_results_groups()`` returned only one HDF5 Group - ``Raw_Data-SHO_Fit_000``.\n", "\n", "### check_for_old()\n", "\n", "Often one may want to check if a certain operation was performed on a dataset with the very same parameters to\n", "avoid recomputing the results. ``hdf_utils.check_for_old()`` is a very handy function that compares parameters (a\n", "dictionary) for a new / potential operation against the metadata (attributes) stored in each existing results group\n", "(HDF5 groups whose name starts with ``Raw_Data-SHO_Fit`` in this case). Before we demonstrate ``check_for_old()``, let's\n", "take a look at the attributes stored in the existing results groups:\n", "\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Parameters already used for computing SHO_Fit on Raw_Data in the file:\n", "machine_id : mac109728.ornl.gov\n", "timestamp : 2017_08_22-15_02_08\n", "SHO_guess_method : pycroscopy BESHO\n", "SHO_fit_method : pycroscopy BESHO\n" ] } ], "source": [ "print('Parameters already used for computing SHO_Fit on Raw_Data in the file:')\n", "for key, val in sidpy.hdf_utils.get_attributes(h5_chan_group['Raw_Data-SHO_Fit_000']).items():\n", " print('{} : {}'.format(key, val))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let us check for existing results twice: first where the ``SHO_fit_method`` attribute matches the existing value, and then where it is set to a new value:\n", "\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Checking to see if SHO Fits have been computed on the raw dataset:\n", "\n", "Using \"pycroscopy BESHO\":\n", "[]\n", "\n", "Using \"alternate technique\"\n", "[]\n" ] } ], "source": [ "print('Checking to see if SHO Fits have been computed on the raw dataset:')\n", "print('\\nUsing 
\"pycroscopy BESHO\":')\n", "print(usid.hdf_utils.check_for_old(h5_raw, 'SHO_Fit',\n", " new_parms={'SHO_fit_method': 'pycroscopy BESHO'}))\n", "print('\\nUsing \"alternate technique\"')\n", "print(usid.hdf_utils.check_for_old(h5_raw, 'SHO_Fit',\n", " new_parms={'SHO_fit_method': 'alternate technique'}))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clearly, while find_results_groups() returned any and all groups corresponding to ``SHO_Fit`` being applied to\n", "``Raw_Data``, ``check_for_old()`` only returned the group(s) where the operation was performed using the same specified\n", "parameters (``sho_fit_method`` in this case).\n", "\n", "Note that ``check_for_old()`` performs two operations - search for all groups with the matching nomenclature and then\n", "compare the attributes. ``check_for_matching_attrs()`` is the handy function, that enables the latter operation of\n", "comparing a giving dictionary of parameters against attributes in a given object.\n", "\n", "### get_source_dataset()\n", "``hdf_utils.get_source_dataset()`` is a very handy function for the inverse scenario where we are interested in finding\n", "the source dataset from which the known result was derived:\n", "\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Datagroup containing the SHO fits:\n", "\n", "\n", "Dataset on which the SHO Fit was computed:\n", "\n", "located at: \n", "\t/Measurement_000/Channel_000/Raw_Data \n", "Data contains: \n", "\tCantilever Vertical Deflection (V) \n", "Data dimensions and original shape: \n", "Position Dimensions: \n", "\tX - size: 5 \n", "\tY - size: 5 \n", "Spectroscopic Dimensions: \n", "\tFrequency - size: 87 \n", "\tDC_Offset - size: 64 \n", "\tField - size: 2 \n", "\tCycle - size: 2\n", "Data Type:\n", "\tcomplex64\n" ] } ], "source": [ "h5_sho_group = h5_sho_group_list[0]\n", "print('Datagroup containing the SHO fits:')\n", 
"print(h5_sho_group)\n", "print('\\nDataset on which the SHO Fit was computed:')\n", "h5_source_dset = usid.hdf_utils.get_source_dataset(h5_sho_group)\n", "print(h5_source_dset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the source dataset is always a ``Main`` dataset, ``get_source_dataset()`` results a ``USIDataset`` object instead of\n", "a regular ``HDF5 Dataset`` object.\n", "\n", "Note that ``hdf_utils.get_source_dataset()`` and ``find_results_groups()`` rely on the USID rule that results of an\n", "operation be stored in a Group named ``Source_Dataset_Name-Operation_Name_00x``.\n", "\n", "### get_auxiliary_datasets()\n", "\n", "The association of datasets and groups with one another provides a powerful mechanism for conveying (richer) information. One way to associate objects with each other is to store the reference of an object as an attribute of another. This is precisely the capability that is leveraged to turn Central datasets into USID Main Datasets or ``USIDatasets``. USIDatasets need to have four attributes that are references to the ``Position`` and ``Spectroscopic``\n", "``ancillary`` datasets. Note, that USID does not restrict or preclude the storage of other relevant datasets as attributes of another dataset. 
For example, the ``Raw_Data`` dataset appears to contain several attributes whose keys / names match the names of datasets we see above and values all appear to be HDF5 object references:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Bin_Frequencies : \n", "Position_Indices : \n", "Excitation_Waveform : \n", "out_of_field_Plot_Group : \n", "Bin_Indices : \n", "Spectroscopic_Indices : \n", "UDVS : \n", "Bin_Wfm_Type : \n", "units : V\n", "Bin_Step : \n", "Bin_FFT : \n", "Spectroscopic_Values : \n", "UDVS_Indices : \n", "Position_Values : \n", "in_field_Plot_Group : \n", "Noise_Floor : \n", "quantity : Cantilever Vertical Deflection\n" ] } ], "source": [ "for key, val in sidpy.hdf_utils.get_attributes(h5_raw).items():\n", " print('{} : {}'.format(key, val))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the name suggests, these HDF5 object references are references or addresses to datasets located elsewhere in the\n", "file. Conventionally, one would need to apply this reference to the file handle to get the actual HDF5 Dataset / Group\n", "object.\n", "\n", "``get_auxiliary_datasets()`` simplifies this process by directly retrieving the actual Dataset / Group associated with\n", "the attribute. 
Thus, we would be able to get the ``Bin_Frequencies`` Dataset object via:\n", "\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "True\n" ] } ], "source": [ "h5_obj = sidpy.hdf_utils.get_auxiliary_datasets(h5_raw, 'Bin_Frequencies')[0]\n", "print(h5_obj)\n", "# Let's verify that this object is the same as the 'Bin_Frequencies' object that can be directly addressed:\n", "print(h5_obj == h5_f['/Measurement_000/Channel_000/Bin_Frequencies'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Accessing Ancillary Datasets\n", "One of the major benefits of h5USID is its ability to handle large multidimensional datasets with ease. ``Ancillary``\n", "datasets serve as the keys or legends for explaining the dimensionality, reshape-ability, etc. of a dataset. There are\n", "several functions in ``hdf_utils`` that simplify many common operations on ancillary datasets.\n", "\n", "Before we demonstrate the several useful functions in ``hdf_utils``, let's access the position and spectroscopic ancillary\n", "datasets using the ``get_auxiliary_datasets()`` function we used above:\n", "\n" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "dset_list = sidpy.hdf_utils.get_auxiliary_datasets(h5_raw, ['Position_Indices', 'Position_Values',\n", " 'Spectroscopic_Indices', 'Spectroscopic_Values'])\n", "h5_pos_inds, h5_pos_vals, h5_spec_inds, h5_spec_vals = dset_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As mentioned above, this is indeed a six dimensional dataset with two position dimensions and four spectroscopic\n", "dimensions. 
The ``Field`` and ``Cycle`` dimensions do not have any units since they are dimensionless, unlike the other\n", "dimensions.\n", "\n", "### get_dimensionality()\n", "Now let's find out the number of steps in each of those dimensions using another handy function called\n", "``get_dimensionality()``:\n", "\n" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Size of each Position dimension:\n", "X : 5\n", "Y : 5\n", "\n", "Size of each Spectroscopic dimension:\n", "Frequency : 87\n", "DC_Offset : 64\n", "Field : 2\n", "Cycle : 2\n" ] } ], "source": [ "pos_dim_sizes = usid.hdf_utils.get_dimensionality(h5_pos_inds)\n", "spec_dim_sizes = usid.hdf_utils.get_dimensionality(h5_spec_inds)\n", "pos_dim_names = sidpy.hdf_utils.get_attr(h5_pos_inds, 'labels')\n", "spec_dim_names = sidpy.hdf_utils.get_attr(h5_spec_inds, 'labels')\n", "\n", "print('Size of each Position dimension:')\n", "for name, length in zip(pos_dim_names, pos_dim_sizes):\n", " print('{} : {}'.format(name, length))\n", "print('\\nSize of each Spectroscopic dimension:')\n", "for name, length in zip(spec_dim_names, spec_dim_sizes):\n", " print('{} : {}'.format(name, length))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### get_sort_order()\n", "\n", "In a few (rare) cases, the spectroscopic / position dimensions are not arranged in descending order of rate of change.\n", "In other words, the dimensions in these ancillary matrices are not arranged from fastest-varying to slowest.\n", "To account for such discrepancies, ``hdf_utils`` has a very handy function, ``get_sort_order()``, that goes through each of the columns or\n", "rows in the ancillary indices matrices and finds the order in which these dimensions vary.\n", "\n", "Below we illustrate an example of sorting the names of the spectroscopic dimensions from fastest to slowest in\n", "the BEPS data file:\n", "\n" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Rate of change of spectroscopic dimensions: [0 2 1 3]\n", "\n", "Spectroscopic dimensions arranged as is:\n", "['Frequency' 'DC_Offset' 'Field' 'Cycle']\n", "\n", "Spectroscopic dimensions arranged from fastest to slowest\n", "['Frequency' 'Field' 'DC_Offset' 'Cycle']\n" ] } ], "source": [ "spec_sort_order = usid.hdf_utils.get_sort_order(h5_spec_inds)\n", "print('Rate of change of spectroscopic dimensions: {}'.format(spec_sort_order))\n", "print('\\nSpectroscopic dimensions arranged as is:')\n", "print(spec_dim_names)\n", "sorted_spec_labels = np.array(spec_dim_names)[np.array(spec_sort_order)]\n", "print('\\nSpectroscopic dimensions arranged from fastest to slowest')\n", "print(sorted_spec_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### get_unit_values()\n", "\n", "When visualizing the data, it is essential to plot the data against appropriate values on the X, Y, or Z axes.\n", "Recall that, by definition, the values over which each dimension is varied are repeated and tiled over the entire\n", "position or spectroscopic dimension of the dataset. Thus, if we had just the bias waveform repeated over two cycles,\n", "spectroscopic values would contain the bias waveform tiled twice and the cycle numbers repeated as many times as the\n", "number of points in the bias waveform. Therefore, extracting the bias waveform or the cycle numbers from the ancillary\n", "datasets is not trivial. This problem is especially challenging for multidimensional datasets such as the one under\n", "consideration. Fortunately, ``hdf_utils`` has a very handy function for this as well:\n", "\n" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Position unit values:\n", "Y : [0. 1. 2. 3. 4.]\n", "X : [0. 1. 2. 3. 
4.]\n" ] } ], "source": [ "pos_unit_values = usid.hdf_utils.get_unit_values(h5_pos_inds, h5_pos_vals)\n", "print('Position unit values:')\n", "for key, val in pos_unit_values.items():\n", " print('{} : {}'.format(key, val))\n", "spec_unit_values = usid.hdf_utils.get_unit_values(h5_spec_inds, h5_spec_vals)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the spectroscopic dimensions are quite complicated, let's visualize the results from ``get_unit_values()``:\n", "\n" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig, axes = plt.subplots(ncols=2, nrows=2, figsize=(6.5, 6))\n", "for axis, name in zip(axes.flat, spec_dim_names):\n", " axis.set_title(name)\n", " axis.plot(spec_unit_values[name], 'o-')\n", "\n", "fig.suptitle('Spectroscopic Dimensions', fontsize=16, y=1.05)\n", "fig.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reshaping Data\n", "\n", "### reshape_to_n_dims()\n", "\n", "The USID model stores N dimensional datasets in a flattened 2D form of position x spectral values. It can become\n", "challenging to retrieve the data in its original N-dimensional form, especially for multidimensional datasets such as\n", "the one we are working on. Fortunately, all the information regarding the dimensionality of the dataset is contained\n", "in the spectral and position ancillary datasets. ``reshape_to_n_dims()`` is a very useful function that can help\n", "retrieve the N-dimensional form of the data using a simple function call:\n", "\n" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Succeeded in reshaping flattened 2D dataset to N dimensions\n", "Shape of the data in its original 2D form\n", "(25, 22272)\n", "Shape of the N dimensional form of the dataset:\n", "(5, 5, 87, 64, 2, 2)\n", "And these are the dimensions\n", "['X' 'Y' 'Frequency' 'DC_Offset' 'Field' 'Cycle']\n" ] } ], "source": [ "ndim_form, success, labels = usid.hdf_utils.reshape_to_n_dims(h5_raw, get_labels=True)\n", "if success:\n", " print('Succeeded in reshaping flattened 2D dataset to N dimensions')\n", " print('Shape of the data in its original 2D form')\n", " print(h5_raw.shape)\n", " print('Shape of the N dimensional form of the dataset:')\n", " print(ndim_form.shape)\n", " print('And these are the dimensions')\n", " print(labels)\n", "else:\n", " print('Failed in reshaping the dataset')" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "### reshape_from_n_dims()\n", "The inverse problem of reshaping an N dimensional dataset back to a 2D dataset (let's say for the purposes of\n", "multivariate analysis or storing into h5USID files) is also easily solved using another handy\n", "function - ``reshape_from_n_dims()``:\n", "\n" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape of flattened two dimensional form\n", "(25, 22272)\n" ] } ], "source": [ "two_dim_form, success = usid.hdf_utils.reshape_from_n_dims(ndim_form, h5_pos=h5_pos_inds, h5_spec=h5_spec_inds)\n", "if success:\n", " print('Shape of flattened two dimensional form')\n", " print(two_dim_form.shape)\n", "else:\n", " print('Failed in flattening the N dimensional dataset')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Close and delete the h5_file\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "h5_f.close()\n", "os.remove(h5_path)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.5" } }, "nbformat": 4, "nbformat_minor": 1 }