Scientific analysis of nanoscale materials imaging data
- pycroscopy is a Python package for storing, processing, analyzing, and visualizing multidimensional scientific data.
- pycroscopy uses a data-centric model wherein the raw data collected from the instrument, as well as the results of analysis and processing routines, are all written to standardized hierarchical data format (HDF5) files to ensure traceability, reproducibility, and provenance.
- pycroscopy relies on popular packages such as numpy, scipy, scikit-image, scikit-learn, joblib, and matplotlib for most of its computation, analysis, and visualization.
- If you prefer, you can perform your analysis outside pycroscopy and use pycroscopy only to standardize your data storage.
- See a high-level overview of pycroscopy in this presentation
- See scientific research enabled by pycroscopy.
- Jump to our GitHub project
With pycroscopy we aim to:
- significantly lower the barrier to advanced data analysis procedures by simplifying I/O, processing, visualization, etc.
- serve as a hub for collaboration across scientific domains (microscopists, material scientists, biologists…)
- provide a community-driven, open standard for data formatting
- provide a framework for developing origin-agnostic / universal data analysis routines
As we see it, there are a few opportunities in scientific imaging (that surely apply to several other scientific domains):
- 1. Growing data sizes
- Cannot use desktop computers for analysis
- Need: High performance computing, storage resources and compatible, scalable file structures
- 2. Increasing data complexity
- Sophisticated imaging and spectroscopy modes resulting in 5,6,7… dimensional data
- Need: Robust software and generalized data formatting
- 3. Multiple file formats
- Each instrument produces data in a different, typically proprietary, format
- Incompatible formats hinder correlative analysis across techniques
- Need: Open, instrument-independent data format
- 4. Disjoint communities
- Similar analysis routines are written independently by each community (SPM, STEM, ToF-SIMS, XRD…)!
- Need: Centralized repository, instrument agnostic analysis routines that bring communities together
- 5. Expensive analysis software
- Software supplied with instruments is often insufficient for, or incapable of, custom analysis routines
- Commercial software (e.g., MATLAB, Origin…) is often prohibitively expensive
- Need: Free, powerful, open source, user-friendly software
- 6. Closed science
- Analysis software and data not shared
- No guarantees of reproducibility or traceability
- Need: open source data structures, file formats, centralized code and data repositories
- pycroscopy uses an instrument-agnostic data structure that facilitates the storage of data regardless of its dimensionality (from conventional 1D spectra and 2D images to 9D hyperspectral datasets and beyond!) or instrument of origin (AFMs, STEMs, Raman spectroscopy, etc.).
- This generalized representation of data allows us to write a single, generalized version of analysis and processing functions that can be applied to any kind of data.
- The data is stored in hierarchical data format (HDF5) files, which have numerous benefits, including the flexibility to store multiple datasets of arbitrary size and dimensionality, compatibility with supercomputers, and storage of important metadata alongside the data.
- Once the relevant data and metadata are extracted from proprietary raw data files and written into pycroscopy-formatted HDF5 files via a translation process, the user gains access to the rest of the utilities in pycroscopy.
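The storage model described above can be sketched with plain h5py, the Python interface to HDF5 that pycroscopy builds on. This is a minimal illustration, not pycroscopy's actual translation API or file schema: the group name, dataset name, and attribute keys below are hypothetical, and the 4D array stands in for a multidimensional spectroscopic measurement.

```python
import h5py
import numpy as np

# Hypothetical 4D measurement: a 10 x 10 spatial grid with a
# 50-point spectrum acquired at 3 bias values per location.
raw = np.random.rand(10, 10, 3, 50)

# Write the data and its metadata side by side in one HDF5 file.
with h5py.File("example_measurement.h5", "w") as h5_file:
    grp = h5_file.create_group("Measurement_000")   # hypothetical group name
    dset = grp.create_dataset("Raw_Data", data=raw)
    # Metadata travels with the data as HDF5 attributes
    dset.attrs["units"] = "nA"
    dset.attrs["dimension_names"] = ["X", "Y", "Bias", "Frequency"]
    grp.attrs["instrument"] = "hypothetical AFM"

# Any downstream tool can now read the data without knowing
# which instrument (or proprietary format) it originally came from.
with h5py.File("example_measurement.h5", "r") as h5_file:
    dset = h5_file["Measurement_000/Raw_Data"]
    print(dset.shape)           # (10, 10, 3, 50)
    print(dset.attrs["units"])  # nA
```

Because the dimensionality and metadata are self-describing, a single analysis routine can operate on this file whether the source was an AFM, a STEM, or a spectrometer.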
- Scientific workflows are developed and disseminated through Jupyter notebooks: interactive, portable web applications containing text, images, code / scripts, and graphical results. Notebooks capturing all or part of a workflow, from raw data to publishable figures, often become supplementary material for journal publications, thereby enabling traceability and reproducibility for open science.
- This project began largely as an effort by scientists and engineers at the Institute for Functional Imaging of Materials (IFIM) to implement a Python library that could support the I/O, processing, and analysis of the gargantuan stream of images that their microscopes generate (thanks to IFIM's large user community!).
- It is now being developed and maintained by Suhas Somnath of the Advanced Data & Workflows Group (ADWG) at the Oak Ridge Leadership Computing Facility (OLCF) and Chris R. Smith of IFIM.
- By sharing our methodology and code for analyzing scientific imaging data, we hope to benefit the wider scientific community. We also hope, quite ardently, that other scientists will follow suit.
- Please visit our credits and acknowledgements page for more information.