{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "# The USIDataset\n",
    "\n",
    "**Suhas Somnath**\n",
    "\n",
    "11/11/2017\n",
    "\n",
    "**This document illustrates how the pyUSID.USIDataset class substantially simplifies accessing information about,\n",
    "slicing, and visualizing N-dimensional Universal Spectroscopy and Imaging Data (USID) Main datasets**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## USID Main Datasets\n",
    "According to the **Universal Spectroscopy and Imaging Data (USID)** model, all spatial dimensions are collapsed to a\n",
    "single dimension and, similarly, all spectroscopic dimensions\n",
    "are also collapsed to a single dimension. Thus, the data is stored as a two-dimensional (N x P) matrix with N spatial\n",
    "locations each with P spectroscopic data points.\n",
    "\n",
    "This general and intuitive format allows imaging data from any instrument, measurement scheme, size, or dimensionality\n",
    "to be represented in the same way. Such an instrument independent data format enables a single set of analysis and\n",
    "processing functions to be reused for multiple image formats or modalities.\n",
    "\n",
    "``Main datasets`` are greater than the sum of their parts. They are more capable and information-packed than\n",
    "conventional datasets since they have (or are linked to) all the necessary information to describe a measured dataset.\n",
    "The additional information contained / linked by ``Main datasets`` includes:\n",
    "\n",
    "* the recorded physical quantity\n",
    "* units of the data\n",
    "* names of the position and spectroscopic dimensions\n",
    "* dimensionality of the data in its original N dimensional form etc.\n",
    "\n",
    "## USIDatasets = USID Main Datasets\n",
    "Regardless, ``Main datasets`` are just concepts or blueprints and not concrete digital objects in a programming language\n",
    "or a file. ``USIDatasets`` are **tangible representations of Main datasets**. From an implementation perspective, the\n",
    "USIDataset class extends the ``h5py.Dataset object``. In other words, USIDatasets have all the capabilities of\n",
    "standard HDF5 / h5py Dataset objects but are supercharged from a scientific perspective since they:\n",
    "\n",
    "* are self-describing\n",
    "* allow quick interactive visualization in Jupyter notebooks\n",
    "* allow intuitive slicing of the N dimensional dataset\n",
    "* and much much more.\n",
    "\n",
    "While it is most certainly possible to access this information and enable these functionalities via the native ``h5py``\n",
    "functionality, it can become tedious very quickly.  In fact, a lot of the functionality of USIDataset comes from\n",
    "orchestration of multiple functions in ``pyUSID.hdf_utils`` outlined in other documents. The USIDataset class\n",
    "makes such necessary information and functionality easily accessible.\n",
    "\n",
    "Since Main datasets are the hubs of information in a USID HDF5 file (**h5USID**), we expect that the majority of\n",
    "the data interrogation will happen via USIDatasets\n",
    "\n",
    "## Recommended pre-requisite reading\n",
    "* [Universal Spectroscopic and Imaging Data (USID) model](https://pycroscopy.github.io/USID/usid_model.html)\n",
    "* [Crash course on HDF5 and h5py](./h5py_primer.html)\n",
    "* Utilities for [reading](./hdf_utils_read.html) h5USID files using pyUSID\n",
    "\n",
    "## Example scientific dataset\n",
    "\n",
    "Before, we dive into the functionalities of USIDatasets we need to understand the dataset that will be used in this\n",
    "example. For this example, we will be working with a Band Excitation Polarization Switching (BEPS) dataset acquired\n",
    "from advanced atomic force microscopes. In the much simpler Band Excitation (BE) imaging datasets, a single spectra\n",
    "is acquired at each location in a two dimensional grid of spatial locations. Thus, BE imaging datasets have two\n",
    "position dimensions (X, Y) and one spectroscopic dimension (frequency - against which the spectra is recorded). The\n",
    "BEPS dataset used in this example has a spectra for each combination of three other parameters (DC offset, Field, and\n",
    "Cycle). Thus, this dataset has three new spectral dimensions in addition to the spectra itself. Hence, this dataset\n",
    "becomes a 2+4 = 6 dimensional dataset\n",
    "\n",
    "## Load all necessary packages\n",
    "\n",
    "First, we need to load the necessary packages. Here are a list of packages, besides pyUSID, that will be used in\n",
    "this example:\n",
    "\n",
    "* ``h5py`` - to open and close the file\n",
    "* ``wget`` - to download the example data file\n",
    "* ``numpy`` - for numerical operations on arrays in memory\n",
    "* ``matplotlib`` - basic visualization of data\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import print_function, division, unicode_literals\n",
    "import os\n",
    "# Warning package in case something goes wrong\n",
    "from warnings import warn\n",
    "import subprocess\n",
    "import sys\n",
    "\n",
    "\n",
    "def install(package):\n",
    "    subprocess.call([sys.executable, \"-m\", \"pip\", \"install\", package])\n",
    "# Package for downloading online files:\n",
    "\n",
    "\n",
    "try:\n",
    "    # This package is not part of anaconda and may need to be installed.\n",
    "    import wget\n",
    "except ImportError:\n",
    "    warn('wget not found.  Will install with pip.')\n",
    "    import pip\n",
    "    install('wget')\n",
    "    import wget\n",
    "import h5py\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "try:\n",
    "    import sidpy\n",
    "except ImportError:\n",
    "    warn('sidpy not found.  Will install with pip.')\n",
    "    import pip\n",
    "    install('sidpy')\n",
    "    import sidpy\n",
    "\n",
    "try:\n",
    "    import pyUSID as usid\n",
    "except ImportError:\n",
    "    warn('pyUSID not found.  Will install with pip.')\n",
    "    import pip\n",
    "    install('pyUSID')\n",
    "    import pyUSID as usid"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the dataset\n",
    "First, lets download example h5USID file from the pyUSID Github project:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = 'https://raw.githubusercontent.com/pycroscopy/pyUSID/master/data/BEPS_small.h5'\n",
    "h5_path = 'temp.h5'\n",
    "_ = wget.download(url, h5_path, bar=None)\n",
    "\n",
    "print('Working on:\\n' + h5_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, lets open this HDF5 file in read-only mode. Note that opening the file does not cause the contents to be\n",
    "automatically loaded to memory. Instead, we are presented with objects that refer to specific HDF5 datasets,\n",
    "attributes or groups in the file\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "h5_path = 'temp.h5'\n",
    "h5_f = h5py.File(h5_path, mode='r')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, ``h5_f`` is an active handle to the open file.\n",
    "Lets quickly look at the contents of this HDF5 file using a handy function in ``pyUSID.hdf_utils`` - ``print_tree()``\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Contents of the H5 file:')\n",
    "sidpy.hdf_utils.print_tree(h5_f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this example, we will only focus on the ``Raw_Data`` dataset which contains the 6D raw measurement data. First lets\n",
    "access the HDF5 dataset and check if it is a ``Main`` dataset in the first place:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "h5_raw = h5_f['/Measurement_000/Channel_000/Raw_Data']\n",
    "print(h5_raw)\n",
    "print('h5_raw is a main dataset? {}'.format(usid.hdf_utils.check_if_main(h5_raw)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It turns out that this is indeed a Main dataset. Therefore, we can turn this in to a USIDataset without any\n",
    "problems.\n",
    "\n",
    "## Creating a USIDataset\n",
    "All one needs for creating a USIDataset object is a Main dataset. Here is how we can supercharge h5_raw:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd_raw = usid.USIDataset(h5_raw)\n",
    "print(pd_raw)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice how easy it was to create a USIDataset object. Also, note how the USIDataset is much more informative in\n",
    "comparison with the conventional h5py.Dataset object.\n",
    "\n",
    "### USIDataset = Supercharged(h5py.Dataset)\n",
    "Remember that USIDataset is just an extension of the h5py.Dataset object class. Therefore, both the ``h5_raw`` and\n",
    "``pd_raw`` refer to the same object as the following equality test demonstrates. Except ``pd_raw`` knows about the\n",
    "``ancillary datasets`` and other information which makes it a far more powerful object for you.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(pd_raw == h5_raw)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Easier access to information\n",
    "Since the USIDataset is aware and has handles to the supporting ancillary datasets, they can be accessed as\n",
    "properties of the object unlike HDF5 datasets. Note that these ancillary datasets can be accessed using functionality\n",
    "in pyUSID.hdf_utils as well. However, the USIDataset option is far easier.\n",
    "\n",
    "Let us compare accessing the Spectroscopic Indices via the USIDataset and hdf_utils functionality:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "h5_spec_inds_1 = pd_raw.h5_spec_inds\n",
    "h5_spec_inds_2 = sidpy.hdf_utils.get_auxiliary_datasets(h5_raw, 'Spectroscopic_Indices')[0]\n",
    "print(h5_spec_inds_1 == h5_spec_inds_2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the same vein, it is also easy to access **string descriptors** of the ancillary datasets and the Main dataset.\n",
    "The ``hdf_utils`` alternatives to these operations / properties also exist and are discussed in an alternate document,\n",
    "but will not be discussed here for brevity.:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Desctiption of physical quantity in the Main dataset:')\n",
    "print(pd_raw.data_descriptor)\n",
    "print('Position Dimension names and sizes:')\n",
    "for name, length in zip(pd_raw.pos_dim_labels, pd_raw.pos_dim_sizes):\n",
    "    print('{} : {}'.format(name, length))\n",
    "print('Spectroscopic Dimension names and sizes:')\n",
    "for name, length in zip(pd_raw.spec_dim_labels, pd_raw.spec_dim_sizes):\n",
    "    print('{} : {}'.format(name, length))\n",
    "print('Position Dimensions:')\n",
    "print(pd_raw.pos_dim_descriptors)\n",
    "print('Spectroscopic Dimensions:')\n",
    "print(pd_raw.spec_dim_descriptors)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Values for each Dimension\n",
    "When visualizing the data it is essential to plot the data against appropriate values on the X, Y, Z axes. The\n",
    "USIDataset object makes it very easy to access the values over which a dimension was varied using the\n",
    "``get_pos_values()`` and ``get_spec_values()`` functions. This functionality is enabled by the ``get_unit_values()``\n",
    "function in ``pyUSID.hdf_utils``.\n",
    "\n",
    "For example, let us say we wanted to see how the ``DC_Offset`` dimension was varied, we could:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dim_name = 'DC_Offset'\n",
    "dc_vec = pd_raw.get_spec_values(dim_name)\n",
    "fig, axis = plt.subplots(figsize=(3.5, 3.5))\n",
    "axis.plot(dc_vec)\n",
    "axis.set_xlabel('Points in dimension')\n",
    "axis.set_title(dim_name)\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reshaping to N dimensions\n",
    "\n",
    "The USID model stores N dimensional datasets in a flattened 2D form of position x spectral values. It can become\n",
    "challenging to retrieve the data in its original N-dimensional form, especially for multidimensional datasets\n",
    "such as the one we are working on. Fortunately, all the information regarding the dimensionality of the dataset\n",
    "are contained in the spectral and position ancillary datasets. PycoDataset makes it remarkably easy to obtain the N\n",
    "dimensional form of a dataset:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ndim_form = pd_raw.get_n_dim_form()\n",
    "print('Shape of the N dimensional form of the dataset:')\n",
    "print(ndim_form.shape)\n",
    "print('And these are the dimensions')\n",
    "print(pd_raw.n_dim_labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Slicing\n",
    "It is often very challenging to grapple with multidimensional datasets such as the one in this example. It may not\n",
    "even be possible to load the entire dataset in its 2D or N dimensional form to memory if the dataset is several (or\n",
    "several hundred) gigabytes large. Slicing the 2D Main dataset can easily become confusing and frustrating. To solve\n",
    "this problem, USIDataset has a ``slice()`` function that efficiently loads the only the sliced data into memory and\n",
    "reshapes the data to an N dimensional form. Best of all, the slicing arguments can be provided in the actual\n",
    "N dimensional form!\n",
    "\n",
    "For example, imagine that we cannot load the entire example dataset in its N dimensional form and then slice it. Lets\n",
    "try to get the spatial map for the following conditions without loading the entire dataset in its N dimensional form\n",
    "and then slicing it :\n",
    "\n",
    "* 14th index of DC Offset\n",
    "* 1st index of cycle\n",
    "* 0th index of Field (remember Python is 0 based)\n",
    "* 43rd index of Frequency\n",
    "\n",
    "To get this, we would slice as:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "spat_map_1, success = pd_raw.slice({'Frequency': 43, 'DC_Offset': 14, 'Field': 0, 'Cycle': 1})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a verification, lets try to plot the same spatial map by slicing the N dimensional form we got earlier and compare\n",
    "it with what we got above:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "spat_map_2 = np.squeeze(ndim_form[:, :, 43, 14, 0, 1])\n",
    "print('2D slicing == ND slicing: {}'.format(np.allclose(spat_map_1, spat_map_2)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Interactive Visualization\n",
    "USIDatasets also enable quick, interactive, and easy visualization of data up to 2 position and 2 spectroscopic\n",
    "dimensions (4D datasets). Since this particular example has 6 dimensions, we would need to slice two dimensions in\n",
    "order to visualize the remaining 4 dimensions. Note that this interactive visualization **only** works on Jupyter\n",
    "Notebooks. This html file generated by a python script does not allow for interactive visualization and you may only\n",
    "see a set of static plots. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd_raw.visualize(slice_dict={'Field': 0, 'Cycle': 1});"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Close and delete the h5_file\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "h5_f.close()\n",
    "os.remove(h5_path)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}