{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# ArrayTranslator for translating from proprietary file formats\n", "\n", "**Suhas Somnath**\n", "\n", "8/8/2017\n", "\n", "This document illustrates an example of extracting data out of proprietary raw data files and writing the information\n", "into a **Universal Spectroscopy and Imaging Data (USID)** HDF5 file (referred to as a **h5USID** file) using the\n", "``pyUSID.ArrayTranslator``\n", "\n", "**Note**: The Pycroscopy ecosystem of packages are moving away from ``Translators`` and towards [sidpy.Readers](https://pycroscopy.github.io/SciFiReaders/notebooks/00_basic_usage/plot_example_reader.html) instead. \n", "\n", "We encourage users to use ``Translators`` over ``Readers`` only when it makes more sense to. \n", "
\n", "**Note**: If your data has an N-dimensional form, consider [creating a sidpy.Dataset](https://pycroscopy.github.io/sidpy/notebooks/00_basic_usage/create_dataset.html) object and then writing the ``Dataset`` to a h5USID file using [pyUSID.hdf_utils.write_sidpy_dataset()](https://pycroscopy.github.io/pyUSID/_autosummary/pyUSID.io.hdf_utils.model.write_sidpy_dataset.html) instead. \n", "
\n", "\n", "## Introduction\n", "In most scientific disciplines, commercial instruments tend to write the data and metadata out into proprietary file\n", "formats that significantly impede access to the data and metadata, thwart sharing of data and correlation of data from\n", "multiple instruments, and complicate long-term archival, among other things. One of the data wrangling steps in science\n", "is the extraction of the data and metadata out of the proprietary file formats and writing the information into files\n", "that are easier to access, share, etc. The overwhelming part of this data wrangling effort is in investigating how to\n", "extract the data and metadata into memory. Often, the data and parameters in these files are **not** straightforward to\n", "access. In certain cases, additional / dedicated software packages are necessary to access the data while in many other\n", "cases, it is possible to extract the necessary information from built-in **numpy** or similar python packages included\n", "with **anaconda**. Once the information is accessible in the computer memory, such as in the\n", "form of numpy arrays, scientists have a wide variety of tools to write the data out into files.\n", "\n", "Simpler data such as images or single spectra can easily be written into plain text files. Simple or complex / large /\n", "multidimensional data can certainly be stored as numpy data files. However, there are significant drawbacks to writing\n", "data into non-standardized structures or file formats. First, while the structure of the data and metadata may be\n", "intuitive for the original author of the data, that may not be the case for another researcher. Furthermore, such\n", "formatting may change from a day-to-day basis. As a consequence, it becomes challenging to develop code that can accept\n", "such data whose format keeps changing.\n", "\n", "One solution to these challenges is to write the data out into standardized files such as ``h5USID`` files.\n", "The USID model aims to make data access, storage, curation, etc. simply by storing the data along with all\n", "relevant parameters in a single file (HDF5 for now).\n", "\n", "The process of copying data from the original format to **h5USID** files is called\n", "**Translation** and the classes available in pyUSID and children packages such as pycroscopy that perform these\n", "operation are called **Translators**.\n", "\n", "As we alluded to earlier, the process of translation can be broken down into two basic components:\n", "\n", "1. Extracting data and metadata out of the proprietary file format\n", "2. Writing the extracted data and metadata into standardized h5USID files\n", "\n", "This process is the same regardless of the origin, complexity, or size of the scientific data. It is not necessary that\n", "the two components be disjoint - there are many situations where both components may need to happen simultaneously\n", "especially when the data sizes are very large.\n", "\n", "The goal of this document is to demonstrate how one would extract data and parameters from a Scanning Tunnelling\n", "Spectroscopy (STS) raw data file obtained from an Omicron Scanning Tunneling Microscope (STM) into a h5USID file.\n", "In this dataset, a spectra was collected for each position in a two-dimensional grid of spatial locations, thereby\n", "resulting in a 3D dataset. 
The data and metadata in this example are small enough that the translation process can\n", "indeed be separated out into two distinct components.\n", "\n", "### Recommended prerequisite reading\n", "\n", "Before proceeding with this example, we recommend reading the previous documents to learn more about:\n", "\n", "* [Universal Spectroscopic and Imaging Data (USID) model](https://pycroscopy.github.io/USID/usid_model.html)\n", "\n", "\n", "### Import all necessary packages\n", "There are a few setup procedures that need to be followed before any code is written. In this step, we simply load a\n", "few python packages that will be necessary in the later steps.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Ensure python 3 compatibility:\n", "from __future__ import division, print_function, absolute_import, unicode_literals\n", "\n", "# The package for accessing files in directories, etc.:\n", "import os\n", "import zipfile\n", "\n", "# Warning package in case something goes wrong:\n", "from warnings import warn\n", "import subprocess\n", "import sys\n", "\n", "\n", "def install(package):\n", "    # Install the requested package into the current python environment\n", "    subprocess.call([sys.executable, \"-m\", \"pip\", \"install\", package])\n", "\n", "\n", "# Package for downloading online files:\n", "try:\n", "    # This package is not part of anaconda and may need to be installed.\n", "    import wget\n", "except ImportError:\n", "    warn('wget not found. Will install with pip.')\n", "    install('wget')\n", "    import wget\n", "\n", "# The mathematical computation package:\n", "import numpy as np\n", "\n", "# The package used for creating and manipulating HDF5 files:\n", "import h5py\n", "\n", "# Packages for plotting:\n", "import matplotlib.pyplot as plt\n", "\n", "# import sidpy - supporting package for pyUSID:\n", "try:\n", "    import sidpy\n", "except ImportError:\n", "    warn('sidpy not found. Will install with pip.')\n", "    install('sidpy')\n", "    import sidpy\n", "\n", "# Finally import pyUSID:\n", "try:\n", "    import pyUSID as usid\n", "except ImportError:\n", "    warn('pyUSID not found. Will install with pip.')\n", "    install('pyUSID')\n", "    import pyUSID as usid" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Procure the Raw Data file\n", "Here we will download a compressed data file from GitHub and unpack it:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = 'https://raw.githubusercontent.com/pycroscopy/pyUSID/master/data/STS.zip'\n", "zip_path = 'STS.zip'\n", "if os.path.exists(zip_path):\n", "    os.remove(zip_path)\n", "_ = wget.download(url, zip_path, bar=None)\n", "\n", "zip_path = os.path.abspath(zip_path)\n", "# figure out the folder to unzip the zip file to\n", "folder_path, _ = os.path.split(zip_path)\n", "zip_ref = zipfile.ZipFile(zip_path, 'r')\n", "# unzip the file\n", "zip_ref.extractall(folder_path)\n", "zip_ref.close()\n", "# delete the zip file\n", "os.remove(zip_path)\n", "\n", "data_file_path = 'STS.asc'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Extracting data and metadata from proprietary files\n", "### 1.1 Explore the raw data file\n", "\n", "\n", "Inherently, one may not know how to read these ``.asc`` files. 
One option is to try and read the file as a text file,\n", "one line at a time.\n", "\n", "If one is lucky, as in the case of these ``.asc`` files, the file can be read like a conventional text file.\n", "\n", "Here is how we tested to see if the ``.asc`` files could be interpreted as text files. Below, we read just the first 10\n", "lines in the file:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open(data_file_path, 'r') as file_handle:\n", "    for lin_ind in range(10):\n", "        print(file_handle.readline().replace('\\n', ''))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Read the contents of the file\n", "Now that we know that these files are simple text files, we can manually go through the file to find out which lines\n", "are important and at which line the data starts.\n", "Manual investigation of such ``.asc`` files revealed that these files are always formatted in the same way. Also, they\n", "contain instrument- and experiment-related parameters in the first ``403`` lines and then contain data which is\n", "arranged as one pixel per row.\n", "\n", "STS experiments result in three-dimensional datasets ``(X, Y, current)``. In other words, a 1D array of current data (as a\n", "function of excitation bias) is sampled at every location on a two-dimensional grid of points on the sample.\n", "By knowing where the parameters are located and how the data is structured, it is possible to extract the necessary\n", "information from these files.\n", "\n", "Since we know that the data sizes (<200 MB) are much smaller than the physical memory of most computers, we can start\n", "by safely loading the contents of the entire file into memory.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Reading the entire file into memory\n", "with open(data_file_path, 'r') as file_handle:\n", "    string_lines = file_handle.readlines()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3 Extract the metadata\n", "In the case of these ``.asc`` files, the parameters are present in the first few lines of the file. Below we will\n", "demonstrate how we parse the first 17 lines to extract some very important parameters. Note that there are several\n", "other important parameters in the next 350 or so lines. However, in the interest of brevity, we will focus only on the\n", "first few lines of the file. Interested readers are encouraged to look at the ``ASCTranslator`` available in\n", "``pycroscopy`` for complete details.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Preparing an empty dictionary to store the metadata / parameters as key-value pairs\n", "parm_dict = dict()\n", "\n", "# Reading parameters stored in the first few rows of the file\n", "for line in string_lines[3:17]:\n", "    # Remove the hash / pound symbol, if any\n", "    line = line.replace('# ', '')\n", "    # Remove new-line escape-character, if any\n", "    line = line.replace('\\n', '')\n", "    # Break the line into two parts - the parameter name and the corresponding value\n", "    temp = line.split('=')\n", "    # Remove spaces in the value. Remember, the value is still a string and not a number\n", "    test = temp[1].strip()\n", "    # Now, attempt to convert the value to a number (floating point):\n", "    try:\n", "        test = float(test)\n", "        # In certain cases, the number is actually an integer, check and convert if it is:\n", "        if test % 1 == 0:\n", "            test = int(test)\n", "    except ValueError:\n", "        pass\n", "    parm_dict[temp[0].strip()] = test\n", "\n", "# Print out the parameters extracted\n", "for key in parm_dict.keys():\n", "    print(key, ':\\t', parm_dict[key])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point, we recommend reformatting the parameter names to standardized nomenclature.\n", "We realize that the materials imaging community has not yet agreed upon standardized nomenclature for metadata.\n", "Therefore, we leave this as an optional yet recommended step.\n", "For example, in pycroscopy, we may categorize the number of rows and columns in an image under ``grid`` and\n", "data sampling parameters under ``IO``.\n", "As an example, we may rename ``x-pixels`` to ``positions_num_cols`` and ``y-pixels`` to ``positions_num_rows``.\n", "
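\n", "\n", "A minimal sketch of such a renaming step (the ``rename_map`` below is a hypothetical choice of standardized names, not an established convention):\n", "\n", "```python\n", "# hypothetical mapping from instrument-specific to standardized names\n", "rename_map = {'x-pixels': 'positions_num_cols',\n", "              'y-pixels': 'positions_num_rows'}\n", "std_parm_dict = {rename_map.get(key, key): val\n", "                 for key, val in parm_dict.items()}\n", "```\n", "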
\n", "\n", "### 1.4 Extract parameters that define dimensions\n", "Just having the metadata above and the main measurement data is insufficient to fully describe experimental data.\n", "We also need to know how the experimental parameters were varied to acquire the multidimensional dataset at hand.\n", "In other words, we need to answer how the grid of locations was defined and how the bias was varied to acquire the\n", "current information at each location. This is precisely what we will do below.\n", "\n", "Since we did not parse the entire list of parameters present in the file above, we will need to make some up.\n", "Please refer to the formal ``ASCTranslator`` to see how this step would have been different.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_rows = int(parm_dict['y-pixels'])\n", "num_cols = int(parm_dict['x-pixels'])\n", "num_pos = num_rows * num_cols\n", "spectra_length = int(parm_dict['z-points'])\n", "\n", "# We will assume that data was collected from -3 nm to +7 nm on the Y-axis or along the rows\n", "y_qty = 'Y'\n", "y_units = 'nm'\n", "y_vec = np.linspace(-3, 7, num_rows, endpoint=True)\n", "\n", "# We will assume that data was collected from -5 nm to +5 nm on the X-axis or along the columns\n", "x_qty = 'X'\n", "x_units = 'nm'\n", "x_vec = np.linspace(-5, 5, num_cols, endpoint=True)\n", "\n", "# The bias was sampled from -1 to +1 V in the experiment. Here is how we generate the Bias axis:\n", "bias_qty = 'Bias'\n", "bias_units = 'V'\n", "bias_vec = np.linspace(-1, 1, spectra_length)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.5 Extract the data\n", "We have observed that the data in these ``.asc`` files is consistently present after the first ``403`` lines of\n", "parameters. Using this knowledge, we need to populate a data array using data that is currently present as text lines\n", "in memory (from step 1.2).\n", "\n", "These ``.asc`` files store the 3D data (X, Y, spectra) as a 2D matrix (positions, spectra). In other words, the spectra\n", "are arranged one below another. Thus, reading the 2D matrix from top to bottom, the data is arranged column-by-column\n", "and then row-by-row.\n", "
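\n", "\n", "Concretely, once the 2D array below has been populated, the spectrum measured at grid location (``row``, ``col``) sits at flat position index ``row * num_cols + col``:\n", "\n", "```python\n", "# spectrum at grid location (row, col); columns vary fastest along positions\n", "spectrum = raw_data_2d[row * num_cols + col]\n", "```\n", "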
\n", "So, for simplicity, we will prepare an empty 2D numpy array to store the data as it exists in the\n", "raw data file.\n", "\n", "Recall that in step 1.2, we were lucky enough to read the entire data file into memory given its small size.\n", "The data is already present in memory as a list of strings that need to be parsed as a matrix of numbers.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_headers = 403\n", "\n", "raw_data_2d = np.zeros(shape=(num_pos, spectra_length), dtype=np.float32)\n", "\n", "# Iterate over every measurement position:\n", "for pos_index in range(num_pos):\n", "    # First, get the correct (string) line corresponding to the current measurement position.\n", "    # Recall that we would need to skip the many header lines to get to the data\n", "    this_line = string_lines[num_headers + pos_index]\n", "    # Each (string) line contains numbers separated by tabs ('\\t'). Let us break the line into several shorter strings,\n", "    # each containing one number. We will ignore the last entry since it is empty.\n", "    string_spectrum = this_line.split('\\t')[:-1]  # omitting the new line\n", "    # Now that we have a list of numbers represented as strings, we need to convert this list to a 1D numpy array.\n", "    # The converted array is set to the appropriate position in the main 2D array.\n", "    raw_data_2d[pos_index] = np.array(string_spectrum, dtype=np.float32)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the data is so large that it cannot fit into memory, we would need to read data one (or a few) position(s) at a\n", "time, process it (e.g. convert from string to numbers), and write it to the HDF5 file without keeping much or any data\n", "in memory.\n", "
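\n", "\n", "A rough sketch of such an incremental approach, assuming ``h5_main`` is an appropriately sized HDF5 dataset that has already been created (e.g. via the pyUSID functions discussed below):\n", "\n", "```python\n", "with open(data_file_path, 'r') as file_handle:\n", "    # skip the header lines without storing them\n", "    for _ in range(num_headers):\n", "        file_handle.readline()\n", "    # read, parse, and write out one position at a time\n", "    for pos_index in range(num_pos):\n", "        string_spectrum = file_handle.readline().split('\\t')[:-1]\n", "        h5_main[pos_index] = np.array(string_spectrum, dtype=np.float32)\n", "```\n", "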
\n", "\n", "The three-dimensional dataset (``Y``, ``X``, ``Bias``) is currently represented as a two-dimensional array:\n", "(``X`` * ``Y``, ``Bias``). To make it easier for us to understand and visualize, we can turn it into a 3D array:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "raw_data_3d = raw_data_2d.reshape(num_rows, num_cols, spectra_length)\n", "print('Shape of 2D data: {}, Shape of 3D data: {}'.format(raw_data_2d.shape, raw_data_3d.shape))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just as we did for the parameters (``X``, ``Y``, and ``Bias``) that were varied in the experiment,\n", "we need to specify the quantity that is recorded from the sensors / detectors, its units, and what the data\n", "represents:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "main_data_name = 'STS'\n", "main_qty = 'Current'\n", "main_units = 'nA'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualize the extracted data\n", "Here is a visualization of the current-voltage spectra at a few locations:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = sidpy.plot_utils.plot_curves(bias_vec, raw_data_2d, num_plots=9,\n", "                                         x_label=bias_qty + ' (' + bias_units + ')',\n", "                                         y_label=main_qty + ' (' + main_units + ')',\n", "                                         title='Current-Voltage Spectra at different locations',\n", "                                         fig_title_yoffset=1.05)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a visualization of spatial maps at different bias values:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = sidpy.plot_utils.plot_map_stack(raw_data_3d, reverse_dims=True, pad_mult=(0.15, 0.15),\n", "                                            title='Spatial maps of current at different bias', stdevs=2,\n", "                                            color_bar_mode='single', num_ticks=3, x_vec=x_vec, y_vec=y_vec,\n", "                                            evenly_spaced=True, fig_mult=(3, 3), title_yoffset=0.95)\n", "\n", "for axis, bias_ind in zip(axes, np.linspace(0, len(bias_vec), 9, endpoint=False, dtype=np.uint)):\n", "    axis.set_title('Bias = %3.2f V' % bias_vec[bias_ind])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Writing information into h5USID files\n", "So far, we have captured all the information from the ``.asc`` files. We are now ready to write the data into USID\n", "formatted HDF5 files! We will be using the ``pyUSID.ArrayTranslator`` class for this stage.\n", "\n", "The ``ArrayTranslator`` class can be used in two different ways. We will go over both methods.\n", "\n", "### 2.A ArrayTranslator as a quick file writer\n", "Though not intended to be used in this manner, the ``ArrayTranslator`` can be used in scripts to quickly write out\n", "data into an HDF5 file. The benefit over simply saving data using\n", "[numpy.save()](https://numpy.org/doc/stable/reference/generated/numpy.save.html) is that the data will be\n", "written in a way that makes it accessible via the [pyUSID.USIDataset](./plot_usi_dataset.html) class, which offers\n", "several handy capabilities. Such usage of the ``ArrayTranslator`` offers minimal benefits over using the\n", "[pyUSID.hdf_utils.write_main_dataset()](../intermediate/plot_hdf_utils_write.html#write-main-dataset) function,\n", "upon which it is based.\n", "\n", "#### 2.A.1 Preparing the name of the new HDF5 file\n", "Below, we will specify the name of the HDF5 file into which we want to write the prepared data and metadata. 
This step\n", "would need to be performed regardless of whether one is writing data out into numpy files, text files, spreadsheets,\n", "or h5USID files (as in this case).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# First, let us separate the file name from the path to the folder containing the raw data file\n", "folder_path, file_name = os.path.split(data_file_path)\n", "# Next, we will remove the ``.asc`` extension\n", "file_name = file_name[:-4] + '_Script'\n", "# The new file name will share the same base name as the original file but will end with a ``.h5`` extension.\n", "# This HDF5 or H5 file will live in the same folder as the raw data file\n", "h5_path_1 = os.path.join(folder_path, file_name + '.h5')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indeed, a simple ``replace('.asc', '.h5')`` might have done the same job. However, the above method is recommended.\n", "\n", "#### 2.A.2 Preparing `Dimension` objects\n", "Before the ``ArrayTranslator`` can be used, we need to formally define the dimensions that describe the\n", "three-dimensional measurement in the data file. In this example, we have two `Position` dimensions (``X`` and ``Y``)\n", "and one `Spectroscopic` dimension (``Bias``) against which each spectrum was collected.\n", "\n", "In pyUSID, we formally define dimensions using simple\n", "[pyUSID.Dimension](../intermediate/plot_write_utils.html#dimension) objects. These ``Dimension`` objects are simply\n", "descriptors of dimensions and take the name of the quantity, the physical units, and the values over which the dimension\n", "was varied. Both the `Position` and `Spectroscopic` dimensions need to be defined using ``Dimension`` objects, and the\n", "``Dimension`` objects should be arranged from the fastest-varying to the slowest-varying dimension.\n", "\n", "The `Spectroscopic` dimension is trivial since we only have one dimension - ``Bias``.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "spec_dims = usid.Dimension(bias_qty, bias_units, bias_vec)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given that the spectra were acquired column-by-column and then row-by-row, we would need to arrange the `Position`\n", "dimensions as ``X`` followed by ``Y``.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pos_dims = [usid.Dimension(x_qty, x_units, x_vec),\n", "            usid.Dimension(y_qty, y_units, y_vec)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.A.3 Reshape the Main data (if necessary)\n", "Recall that ``Main`` datasets in USID are two-dimensional in shape, where all position dimensions (``X`` and ``Y`` in\n", "this case) are collapsed along the first axis and the spectroscopic dimensions (``Bias`` in this case) are\n", "collapsed along the second axis. Fortunately, this is exactly how the data was already laid out in the original raw\n", "data file. So, we can use that two-dimensional array as is and skip this step.\n", "
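\n", "\n", "Had the data instead been in its 3D form, collapsing it into the USID 2D form would have been a one-line operation, sketched here for reference:\n", "\n", "```python\n", "# collapse (rows, cols) into a single positions axis; spectra stay along the second axis\n", "raw_data_2d = raw_data_3d.reshape(num_rows * num_cols, spectra_length)\n", "```\n", "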
\n", "\n", "#### 2.A.4 Writing to a h5USID file\n", "We are now ready to use the ``ArrayTranslator``!\n", "The ``ArrayTranslator`` simplifies the creation of h5USID files. It handles the HDF5 file creation,\n", "HDF5 dataset creation and writing, creation of ancillary HDF5 datasets, group creation, writing of parameters, linking of\n", "ancillary datasets to the main dataset, etc. With a single call to the ``translate()`` function of the\n", "``ArrayTranslator``, we complete the translation process:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tran = usid.ArrayTranslator()\n", "_ = tran.translate(h5_path_1, main_data_name,\n", "                   raw_data_2d, main_qty, main_units,\n", "                   pos_dims, spec_dims,\n", "                   parm_dict=parm_dict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Verifying the newly written H5 file:\n", "Let us perform some simple and quick verification to show that the data has indeed been translated correctly:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with h5py.File(h5_path_1, mode='r') as h5_file:\n", "    # See if a tree has been created within the hdf5 file:\n", "    print('Contents of the h5USID file:')\n", "    print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')\n", "    sidpy.hdf_utils.print_tree(h5_file)\n", "    print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')\n", "    print('Comprehensive information about the Main dataset:')\n", "    print('-------------------------------------------------')\n", "    h5_main = usid.hdf_utils.get_all_main(h5_file)[-1]\n", "    print(h5_main)\n", "    print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')\n", "    print('Verification plots:')\n", "    fig, axes = plt.subplots(ncols=2, figsize=(11, 5))\n", "    spat_map = np.reshape(h5_main[:, 100], (100, 100))\n", "    sidpy.plot_utils.plot_map(axes[0], spat_map, origin='lower')\n", "    axes[0].set_title('Spatial map')\n", "    axes[0].set_xlabel('X')\n", "    axes[0].set_ylabel('Y')\n", "    axes[1].plot(np.linspace(-1.0, 1.0, h5_main.shape[1]),\n", "                 h5_main[250])\n", "    axes[1].set_title('IV curve at a single pixel')\n", "    axes[1].set_xlabel('Tip bias [V]')\n", "    axes[1].set_ylabel('Current [nA]')\n", "\n", "    fig.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.B Extending the ArrayTranslator\n", "What we have done above is essentially to write real measurement data and metadata into a standardized USID HDF5 file.\n", "As is evident above, the process of writing to the HDF5 file is rather simple because of the ``ArrayTranslator``.\n", "However, the above code is part of a script that is susceptible to edits. Minor changes in the naming / formatting of\n", "certain strings or in the reshaping of the datasets can very quickly break analysis or visualization code later on.\n", "Encapsulating the data reading and writing process into a formal ``Translator`` class also makes it easier for others\n", "to use it and write data into the same consistent format. In fact, upon writing the class, proprietary data files\n", "can be translated using just two lines, as we will see below.\n", "Therefore, we recommend extending the ``ArrayTranslator`` class, when possible, instead of using it independently\n", "like a function.\n", "\n", "#### 2.B.1 Defining the class\n", "Writing a python class that extends the ``ArrayTranslator`` class is far less intimidating than it sounds. The code\n", "that goes into the class is virtually identical to what has been written above. In fact, the code that will be written\n", "
In fact the code that will be written\n", "below is very similar to real ``Translator`` classes found in our sister package - `pycroscopy`.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class ExampleTranslator(usid.ArrayTranslator):\n", " \"\"\"\n", " The above definition of the class states that our ExampleTranslator inherits all the capabilities and\n", " behaviors of the ArrayTranslator class and builds on top of it\n", " \"\"\"\n", "\n", " def translate(self, input_file_path):\n", " \"\"\"\n", " Extracts the data and metadata out of proprietary formatted files and writes it into a SID formatted HDF5 file\n", "\n", " Parameters\n", " ----------\n", " input_file_path : str\n", " Path to the input data file containing all the information\n", "\n", " Returns\n", " -------\n", " h5_path_out_2 : str\n", " Path to the USID HDF5 output file\n", " \"\"\"\n", "\n", " \"\"\"\n", " --------------------------------------------------------------------------------------------\n", " 1. Extracting data and metadata out of the proprietary file\n", " --------------------------------------------------------------------------------------------\n", " 1.2 Read the contents of the file into memory\n", " \"\"\"\n", " with open(input_file_path, 'r') as file_handle:\n", " string_lines = file_handle.readlines()\n", "\n", " \"\"\"\n", " 1.3 Extract all experiment and instrument related parameters\n", " \"\"\"\n", " parm_dict = dict()\n", "\n", " for line in string_lines[3:17]:\n", " line = line.replace('# ', '')\n", " line = line.replace('\\n', '')\n", " temp = line.split('=')\n", " test = temp[1].strip()\n", " try:\n", " test = float(test)\n", " if test % 1 == 0:\n", " test = int(test)\n", " except ValueError:\n", " pass\n", " parm_dict[temp[0].strip()] = test\n", "\n", " \"\"\"\n", " 1.4 Extract or generate parameters that define the three dimensions\n", " \"\"\"\n", "\n", " num_rows = int(parm_dict['y-pixels'])\n", " num_cols = int(parm_dict['x-pixels'])\n", " num_pos = num_rows * num_cols\n", " spectra_length = int(parm_dict['z-points'])\n", "\n", " # We will assume that data was collected from -3 nm to +7 nm on the Y-axis or along the rows\n", " y_qty = 'Y'\n", " y_units = 'nm'\n", " y_vec = np.linspace(-3, 7, num_rows, endpoint=True)\n", "\n", " # We will assume that data was collected from -5 nm to +5 nm on the X-axis or along the columns\n", " x_qty = 'X'\n", " x_units = 'nm'\n", " x_vec = np.linspace(-5, 5, num_cols, endpoint=True)\n", "\n", " # The bias was sampled from -1 to +1 V in the experiment. 
Here is how we generate the Bias axis:\n", "        bias_qty = 'Bias'\n", "        bias_units = 'V'\n", "        bias_vec = np.linspace(-1, 1, spectra_length)\n", "\n", "        \"\"\"\n", "        1.5 Extract the data\n", "        \"\"\"\n", "        num_headers = 403\n", "\n", "        raw_data_2d = np.zeros(shape=(num_pos, spectra_length), dtype=np.float32)\n", "\n", "        # Iterate over every measurement position:\n", "        for pos_index in range(num_pos):\n", "            this_line = string_lines[num_headers + pos_index]\n", "            string_spectrum = this_line.split('\\t')[:-1]  # omitting the new line\n", "            raw_data_2d[pos_index] = np.array(string_spectrum, dtype=np.float32)\n", "\n", "        \"\"\"\n", "        2.1 Prepare the output file path and describe the measured quantity\n", "        \"\"\"\n", "        folder_path, file_name = os.path.split(input_file_path)\n", "        h5_path = os.path.join(folder_path, file_name[:-4] + '_Class' + '.h5')\n", "\n", "        # Information about the quantity in the main dataset (defined globally in the script version above):\n", "        main_data_name = 'STS'\n", "        main_qty = 'Current'\n", "        main_units = 'nA'\n", "\n", "        \"\"\"\n", "        --------------------------------------------------------------------------------------------\n", "        2.B Writing to h5USID file using pyUSID\n", "        --------------------------------------------------------------------------------------------\n", "        2.B.2 Expressing the Position and Spectroscopic Dimensions using pyUSID.Dimension objects\n", "        \"\"\"\n", "        pos_dims = [usid.Dimension(x_qty, x_units, x_vec),\n", "                    usid.Dimension(y_qty, y_units, y_vec)]\n", "        spec_dims = usid.Dimension(bias_qty, bias_units, bias_vec)\n", "\n", "        \"\"\"\n", "        2.B.3 Reshape the Main data from its original N-dimensional form to the USID 2D form\n", "        We skip this step since it is unnecessary in this case\n", "\n", "        2.B.4 Call the translate() function of the base ArrayTranslator class\n", "        \"\"\"\n", "        _ = super(ExampleTranslator, self).translate(h5_path, main_data_name,\n", "                                                     raw_data_2d, main_qty, main_units,\n", "                                                     pos_dims, spec_dims,\n", "                                                     parm_dict=parm_dict)\n", "\n", "        return h5_path" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comments\n", "As you can tell from the above, the vast majority of the code in the class (and the script above it) pertains to the\n", "first phase - the extraction of data and metadata out of the proprietary file format. The parts specific to writing\n", "the data to the h5USID file are no more than 4-5 lines.\n", "\n", "Perhaps the biggest differences between the two versions lie in the definition of the class and in section ``2.B.4``:\n", "\n", "In section ``2.A.4`` above, we had instantiated the ``ArrayTranslator`` and called its ``translate()`` method:\n", "\n", "```python\n", "tran = usid.ArrayTranslator()\n", "h5_path_out_1 = tran.translate(...)\n", "```\n", "\n", "In the case of the ``ExampleTranslator`` class above, we define the class itself as an extension / child class of the\n", "``ArrayTranslator`` in this line:\n", "\n", "```python\n", "class ExampleTranslator(usid.ArrayTranslator):\n", "```\n", "\n", "This means that our ``ExampleTranslator`` class inherits all the capabilities (including our favorite, the\n", "``translate()`` function) and behaviors of the ``ArrayTranslator`` class and builds on top of it. This is why we don't\n", "need to instantiate the ``ArrayTranslator`` in section ``2.B.4``. All we are doing in our ``translate()`` function is\n", "adding the intelligence that is relevant to our specific scientific example and piggybacking on the many capabilities\n", "of the ``ArrayTranslator`` class for the actual file writing. This piggybacking is visible in the last line:\n", "\n", "```python\n", "
h5_path = super(ExampleTranslator, self).translate(....)\n", "```\n", "\n", "Essentially, we are asking ``ArrayTranslator.translate()`` to take over and do the rest.\n", "\n", "#### Using this ExampleTranslator\n", "What we did above is provide a template for what should happen when someone provides an input file. We have not really\n", "tried it out yet. The lines below illustrate how easy it becomes to perform `translations` from now on:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# instantiate the class first:\n", "my_tran = ExampleTranslator()\n", "# Then call the translate function:\n", "h5_path_2 = my_tran.translate(data_file_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the class is written, translations become a two-line operation!\n", "\n", "### Verifying the newly written H5 file:\n", "Let us perform some simple and quick verification to show that the data has indeed been translated correctly:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with h5py.File(h5_path_2, mode='r') as h5_file:\n", "    # See if a tree has been created within the hdf5 file:\n", "    print('Contents of the h5USID file:')\n", "    print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')\n", "    sidpy.hdf_utils.print_tree(h5_file)\n", "    print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')\n", "    print('Comprehensive information about the Main dataset:')\n", "    print('-------------------------------------------------')\n", "    h5_main = usid.hdf_utils.get_all_main(h5_file)[-1]\n", "    print(h5_main)\n", "    print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')\n", "    print('Verification plots:')\n", "    fig, axes = plt.subplots(ncols=2, figsize=(11, 5))\n", "    spat_map = np.reshape(h5_main[:, 100], (100, 100))\n", "    sidpy.plot_utils.plot_map(axes[0], spat_map, origin='lower')\n", "    axes[0].set_title('Spatial map')\n", "    axes[0].set_xlabel('X')\n", "    axes[0].set_ylabel('Y')\n", "    axes[1].plot(np.linspace(-1.0, 1.0, h5_main.shape[1]),\n", "                 h5_main[250])\n", "    axes[1].set_title('IV curve at a single pixel')\n", "    axes[1].set_xlabel('Tip bias [V]')\n", "    axes[1].set_ylabel('Current [nA]')\n", "\n", "    fig.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Limits of the ArrayTranslator\n", "The ``ArrayTranslator`` is perfect when one is dealing with a **single** USID `Main` dataset. However, if the\n", "proprietary file contained multiple such 3D hyperspectral images, one would need to use the lower-level functions that\n", "power the ``ArrayTranslator``. pyUSID offers\n", "[several functions](../intermediate/hdf_utils_write.html#write-main-dataset) that make it easy to handle such\n", "more involved translations.\n", "
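\n", "\n", "A rough sketch of that lower-level route for a hypothetical file holding two such datasets (``raw_data_2d_a`` and ``raw_data_2d_b`` are made-up arrays here; check the linked documentation for the exact signatures):\n", "\n", "```python\n", "with h5py.File('multi_dataset.h5', mode='w') as h5_file:\n", "    for data_2d in [raw_data_2d_a, raw_data_2d_b]:  # hypothetical arrays\n", "        # creates Measurement_000, Measurement_001, ... groups\n", "        h5_group = usid.hdf_utils.create_indexed_group(h5_file, 'Measurement')\n", "        usid.hdf_utils.write_main_dataset(h5_group, data_2d, 'STS',\n", "                                          'Current', 'nA', pos_dims, spec_dims)\n", "```\n", "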
The ``ArrayTranslator`` is itself\n", "a child class of the ``Translator`` class and is the lowest class capable of doing something by itself while still\n", "being application-agnostic.\n", "\n", "### More information\n", "Our sister class - BGLib, has several\n", "[translators](https://github.com/pycroscopy/BGlib/tree/master/BGlib/be/translators) that translate popular\n", "file formats generated by nanoscale imaging instruments. Few translators extend the ``ArrayTranslator`` like\n", "we did above, while most use the low-level functions in ``pyUSID.hdf_utils``.A single, robust Translator class can handle the finer variations / modes in the data. \n", "\n", "We have found python packages online to open a few proprietary file formats and have written translators using these\n", "packages. If you are having trouble reading the data in your files and cannot find any packages online, consider\n", "contacting the manufacturer of the instrument which generated the data in the proprietary format for help.\n", "\n", "### Cleaning up\n", "Remove both the original and translated files:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "os.remove(h5_path_1)\n", "os.remove(h5_path_2)\n", "os.remove(data_file_path)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.5" } }, "nbformat": 4, "nbformat_minor": 1 }