{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Formalizing Data Processing using the Process Class\n", "\n", "**Suhas Somnath, Oak Ridge National Lab**\n", "\n", "**Rajiv Giridharagopal, University of Washington**\n", "\n", "Updated: 6/12/2020\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**In this example, we will learn how to implement the pyUSID Process class. This method is\n", "ideal for situations where we want to operate in parallel on a large dataset.**\n", "\n", "## Introduction\n", "Most code written for scientific research takes the form of single-use / one-off scripts, for a few common\n", "reasons:\n", "\n", "* the author feels that it is the fastest way to accomplish a research task\n", "* the author feels that they are unlikely to perform the same operation again\n", "* the author does not anticipate the possibility that others may need to run their code\n", "\n", "However, more often than not, researchers find that one or more of these assumptions fails, and a lot\n", "of time is spent on fixing bugs and generalizing / formalizing code so that it can be shared or reused. Moreover, we\n", "live in an era of open science where the scientific community and an ever-increasing number of scientific journals\n", "are moving towards a paradigm where data and code must be made available alongside journal papers. Therefore, in the\n", "interest of saving time, energy, and reputation, it makes a lot more sense to formalize (parts of) one's data analysis\n", "code.\n", "\n", "For many researchers, formalizing data processing or analysis may seem like a daunting task due to the complexity\n", "and the number of sub-operations that need to be performed. 
``pyUSID.Process`` greatly simplifies the process of\n", "formalizing code by lifting or reducing the burden of implementing important, yet tedious tasks and considerations\n", "such as:\n", "\n", "* **memory management** - reading chunks of datasets that can be processed with the available memory, something very\n", " crucial for very large datasets that cannot entirely fit into the computer's memory\n", "* **Scalable parallel computing** -\n", "\n", " * On personal computers - considerate CPU usage - Process will use all but one or two CPU cores for the (parallel)\n", " computation, which allows the user to continue using the computer for other activities such as reading mail, etc.\n", " * New in pyUSID v. 0.0.5 - Ability to scale to multiple computers in a cluster. The Process class can scale the same\n", " scientific code written for personal computers to use multiple computers (or nodes) on a high performance\n", " computing (HPC) resource or a cloud-based cluster to dramatically reduce the computational time\n", "* **pausing and resuming computation** - interrupting and resuming the computation at a more convenient time,\n", " something that is especially valuable for lengthy computations.\n", "* **avoiding repeated computation and returning existing results** - pyUSID.Process will return existing results\n", " computed using the exact same parameters instead of re-computing and storing duplicate copies of the same results.\n", "* **testing before computation** - checking the processing / analysis on a single unit (typically a single pixel) of\n", " data before the entire data is processed. This is particularly useful for lengthy computations.\n", "\n", "Using ``pyUSID.Process``, the user only needs to address the following basic operations:\n", "\n", "1. Reading data from file\n", "2. Computation on a single unit of data\n", "3. 
Writing results to disk\n", "\n", "### Components of pyUSID.Process\n", "The most important functions in the Process class are:\n", "\n", "* ``__init__()`` - instantiates a 'Process' object of this class after validating the inputs.\n", "* ``_create_results_datasets()`` - creates the HDF5 datasets and Group(s) to store the results.\n", "* ``_map_function()`` - the operation that will be performed on each element in the dataset.\n", "* ``test()`` - this simple function lets the user test ``_map_function()`` on a unit of data (typically a single\n", " pixel) to see if it returns the desired results. It saves a lot of computational time by allowing the user to\n", " spot-check results before computing on the entire dataset.\n", "* ``_read_data_chunk()`` - reads the input data from one or more datasets.\n", "* ``_write_results_chunk()`` - writes the computed results back to the file.\n", "* ``_unit_computation()`` - defines how to process a chunk (multiple units) of data. This allows room for\n", " pre-processing of input data and post-processing of results if necessary. If neither is required, this function\n", " essentially applies ``_map_function()`` to the chunk in parallel.\n", "* ``compute()`` - this does the bulk of the work: (iteratively) reading a chunk of data >> processing it in parallel\n", " via ``_unit_computation()`` >> calling ``_write_results_chunk()`` to write the results. 
Most sub-classes, including the one\n", " below, do not need to extend / modify this function.\n", "\n", "See the \"Flow of Functions\" section near the bottom for a bit more detail.\n", "\n", "### Recommended pre-requisite reading\n", "* [Universal Spectroscopic and Imaging Data (USID) model](https://pycroscopy.github.io/USID/usid_model.html)\n", "* [Crash course on HDF5 and h5py](./h5py_primer.html)\n", "* Utilities for [reading](./hdf_utils_read.html) and [writing](./plot_hdf_utils_write.html)\n", " h5USID files using pyUSID\n", "* Crash course on [parallel processing](https://pycroscopy.github.io/sidpy/notebooks/01_parallel_computing/parallel_compute.html)\n", "\n", "### Example problem\n", "We will be working with a Band Excitation Piezoresponse Force Microscopy (BE-PFM) imaging dataset\n", "acquired from advanced atomic force microscopes. In this dataset, a spectrum was collected for each position in a two\n", "dimensional grid of spatial locations. Thus, this is a three dimensional dataset that has been flattened to a two\n", "dimensional matrix in accordance with the USID model.\n", "\n", "This example is based on the parallel computing primer, where we searched for the peak of each spectrum in a dataset.\n", "While that example focused on comparing serial and parallel computing, we will focus on demonstrating the simplicity\n", "with which such a data analysis algorithm can be formalized.\n", "\n", "This example is a simplification of the pycroscopy.analysis.BESHOFitter class in our sister project, Pycroscopy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import necessary packages" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "//anaconda/lib/python3.5/site-packages/pyUSID/viz/__init__.py:18: FutureWarning: Please use sidpy.viz.plot_utils instead of pyUSID.viz.plot_utils. 
pyUSID.plot_utils will be removed in a future release of pyUSID\n", " FutureWarning)\n" ] } ], "source": [ "from __future__ import division, print_function, absolute_import, unicode_literals\n", "\n", "# The package for accessing files in directories, etc.:\n", "import os\n", "\n", "# Warning package in case something goes wrong\n", "from warnings import warn\n", "import subprocess\n", "import sys\n", "\n", "def install(package):\n", " subprocess.call([sys.executable, \"-m\", \"pip\", \"install\", package])\n", "# Package for downloading online files:\n", "try:\n", " # This package is not part of anaconda and may need to be installed.\n", " import wget\n", "except ImportError:\n", " warn('wget not found. Will install with pip.')\n", " import pip\n", " install('wget')\n", " import wget\n", "\n", "# The mathematical computation package:\n", "import numpy as np\n", "\n", "# The package used for creating and manipulating HDF5 files:\n", "import h5py\n", "\n", "# Packages for plotting:\n", "import matplotlib.pyplot as plt\n", "\n", "# import sidpy - supporting package for pyUSID:\n", "try:\n", " import sidpy\n", "except ImportError:\n", " warn('sidpy not found. Will install with pip.')\n", " import pip\n", " install('sidpy')\n", " import sidpy\n", "\n", "# Finally import pyUSID:\n", "try:\n", " import pyUSID as usid\n", "except ImportError:\n", " warn('pyUSID not found. 
Will install with pip.')\n", " import pip\n", " install('pyUSID')\n", " import pyUSID as usid" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download the fitting function as a separate file" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "peak_func_file = 'peak_finding.py'\n", "url = 'https://raw.githubusercontent.com/pycroscopy/pyUSID/master/notebooks/user_guide/supporting_docs/peak_finding.py'\n", "if os.path.exists(peak_func_file):\n", " os.remove(peak_func_file)\n", "_ = wget.download(url, peak_func_file, bar=None)\n", "# the scientific function\n", "from peak_finding import find_all_peaks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal is to **find the amplitude at the peak in each spectrum**. Clearly, the operation of finding the peak in one\n", "spectrum is independent of the same operation on another spectrum. Thus, we could divide the dataset into N parts and\n", "use N CPU cores to compute the results much faster than a single core could. Such\n", "problems are ideally suited for making use of all the advanced functionalities in the Process class.\n", "\n", "## Defining the class\n", "In order to solve our problem, we need to implement a ``sub-class`` of pyUSID.Process, or in other words,\n", "**extend pyUSID.Process**. As mentioned above, the pyUSID.Process class already generalizes several important\n", "components of data processing. We only need to extend this class by implementing the science-specific functionality.\n", "The rest of the capabilities will be **inherited** from pyUSID.Process.\n", "\n", "Let's think about what operations need to be performed for each of the core Process functions listed above.\n", "\n", "### map_function()\n", "The most important component in our new Process class is the unit computation that needs to be performed on each\n", "spectrum. 
``map_function()`` needs to take as input a single spectrum and return the amplitude at the peak (a single\n", "value). The ``compute()`` and ``_unit_computation()`` functions will handle the parallelization.\n", "\n", "The scipy package has a very handy function called *find_peaks_cwt()* that facilitates the search for one or more\n", "peaks in a spectrum. We will be using a simplified function called *find_all_peaks()*.\n", "The exact methodology for finding the peaks is not of interest for this\n", "particular example. However, this function finds the indices of zero or more peaks in a spectrum. We only expect\n", "one peak, at the center of the spectrum. Therefore, we can use the ``find_all_peaks()`` function to find the peaks and\n", "address those situations when too few or too many (> 1) peaks are found in a single spectrum. Finally, we need to use\n", "the index of the peak to find the amplitude from the spectrum.\n", "\n", "
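As a rough, illustrative sketch of this logic (this is **not** the exact code we will write below; ``mock_find_all_peaks()`` is a hypothetical stand-in for the downloaded ``find_all_peaks()``, whose exact signature we use later):\n", "\n", "```python\n", "import numpy as np\n", "\n", "def mock_find_all_peaks(spectrum):\n", "    # hypothetical stand-in: indices of local maxima in the spectrum\n", "    inds = np.where((spectrum[1:-1] > spectrum[:-2]) &\n", "                    (spectrum[1:-1] > spectrum[2:]))[0] + 1\n", "    return inds\n", "\n", "def peak_amplitude(spectrum):\n", "    peak_inds = mock_find_all_peaks(spectrum)\n", "    if len(peak_inds) != 1:\n", "        # too few or too many peaks: fall back to the global maximum\n", "        return np.max(np.abs(spectrum))\n", "    return np.abs(spectrum[peak_inds[0]])\n", "```\n", "\n", "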