pyUSID.processing.process.Process

class pyUSID.processing.process.Process(h5_main, process_name, parms_dict=None, cores=None, max_mem_mb=4096, mem_multiplier=1.0, lazy=False, h5_target_group=None, verbose=False)[source]

Bases: object

An abstract class for formulating scientific problems as computational problems. This class handles the tedious, science-agnostic file operations, parallel computation, and book-keeping so that child classes only need to specify the application-relevant code for processing the data.

Parameters:
  • h5_main (USIDataset) – The USID main HDF5 dataset over which the analysis will be performed.

  • process_name (str) – Name of the process

  • cores (uint, optional) – How many cores to use for the computation. Default: all available cores - 2 if operating outside MPI context

  • max_mem_mb (uint, optional) – How much memory (in megabytes) to use for the computation. Default: 4096 MB

  • mem_multiplier (float, optional. Default = 1.0) – Factor multiplied with the size (in bytes) of a single position in the source dataset to better estimate how many positions can be processed at any given time (i.e., how many pixels of the source and results datasets can be retained in memory). The default value of 1.0 accounts only for the source dataset. A value greater than 1 also accounts for the size of the results datasets. For example, if the results dataset has the same size and precision as the source dataset, the multiplier should be 2 (1 for the source, 1 for the results)

  • lazy (bool, optional. Default = False) – If True, read_data_chunk and write_results_chunk will operate on dask arrays. If False, everything will be in numpy arrays.

  • h5_target_group (h5py.Group, optional. Default = None) – Location where to look for existing results and to place newly computed results. Use this kwarg if the results need to be written to a different HDF5 file. By default, this value is set to the parent group containing h5_main

  • verbose (bool, Optional, default = False) – Whether or not to print debugging statements
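
In practice, a scientific computation is expressed by subclassing Process and overriding a few hooks. The sketch below illustrates that pattern with a pure-Python stand-in base class so it runs anywhere; in real code one would subclass pyUSID.processing.process.Process, and the overridden hooks would read from and write to HDF5 datasets. The MeanProcess class and its simplified compute() loop are illustrative assumptions, not the library's API.

```python
# Illustrative skeleton of the Process subclassing pattern. "Process"
# here is a pure-Python stand-in so the sketch is self-contained; the
# real pyUSID Process additionally handles chunking, memory budgeting,
# MPI, and HDF5 book-keeping.

class Process:  # stand-in for pyUSID.processing.process.Process
    def __init__(self, data, process_name, parms_dict=None):
        self.data = data                      # stand-in for h5_main
        self.process_name = process_name
        self.parms_dict = parms_dict or {}
        self._results = []

    def compute(self):
        # The real class iterates over memory-sized chunks of positions;
        # this toy version processes every position in a single batch.
        self._results = [self._map_function(pos) for pos in self.data]
        self._write_results_chunk()
        return self._results

    @staticmethod
    def _map_function(position):
        raise NotImplementedError  # child classes supply the science

    def _write_results_chunk(self):
        pass  # real code writes self._results to the HDF5 results group


class MeanProcess(Process):
    """Toy analysis: reduce each position (a list of values) to its mean."""

    @staticmethod
    def _map_function(position):
        return sum(position) / len(position)


proc = MeanProcess([[1, 2, 3], [4, 5, 6]], 'Mean')
print(proc.compute())  # [2.0, 5.0]
```

The key design point carries over to the real class: the child supplies only the per-position science (here _map_function), while the base class owns iteration, parallelism, and persistence.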

self.h5_results_grp

HDF5 group containing the HDF5 datasets that contain the results of the computation

Type:

h5py.Group

self.verbose

Whether or not to print debugging statements

Type:

bool

self.parms_dict

Dictionary of parameters for the computation

Type:

dict

self.duplicate_h5_groups

List of h5py.Group objects containing computational results that have been completely computed with the same set of parameters as those in self.parms_dict

Type:

list

self.partial_h5_groups

List of h5py.Group objects containing computational results that have been partially computed with the same set of parameters as those in self.parms_dict

Type:

list

self.process_name

Name of the process. This is used for checking for existing completely and partially computed results as well as for naming the HDF5 group that will contain the results of the computation

Type:

str

self._cores

Number of CPU cores to use for parallel computations. Ignored in the MPI context. Each rank gets 1 CPU core

Type:

uint

self._max_pos_per_read

Number of positions in the dataset to read per chunk

Type:

uint
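
The relationship between max_mem_mb, mem_multiplier, and the per-position size can be illustrated with back-of-the-envelope arithmetic. The formula below is an assumption for illustration only (the actual implementation may divide the budget differently, e.g., per core or per socket): the number of positions held in memory at once is bounded by the memory budget divided by the multiplier-adjusted size of one position.

```python
# Assumed chunk-sizing arithmetic, for illustration only.

def max_pos_per_read(max_mem_mb, bytes_per_pos, mem_multiplier=1.0, cores=1):
    """Rough upper bound on positions that fit in the memory budget."""
    # mem_multiplier > 1 also budgets for the results dataset(s)
    budget_bytes = max_mem_mb * 1024 * 1024 / cores
    return int(budget_bytes // (bytes_per_pos * mem_multiplier))

# 4096 MB budget, 8 KB per position, results as large as the source
# (multiplier = 2: 1 for source, 1 for results):
print(max_pos_per_read(4096, 8192, mem_multiplier=2.0))  # 262144
```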

self._status_dset_name

Name of the HDF5 dataset that keeps track of the positions in the source dataset that have already been computed

Type:

str

self._results

List of objects returned as the result of the computation performed by self._map_function for each position in the current batch of positions that were processed

Type:

list

self._h5_target_group

Location where existing / future results will be stored

Type:

h5py.Group

self.__resume_implemented

Whether or not this (child) class has implemented the self._get_existing_datasets() function

Type:

bool

self.__bytes_per_pos

Number of bytes used by one position of the source dataset

Type:

uint

self.mpi_comm

MPI communicator. None if not running in an MPI context

Type:

mpi4py.MPI.COMM_WORLD

self.mpi_rank

MPI rank. Always 0 if not running in an MPI context

Type:

uint

self.mpi_size

Number of ranks in COMM_WORLD. 1 if not running in an MPI context

Type:

uint

self.__ranks_on_socket

Number of MPI ranks on a given CPU socket

Type:

uint

self.__socket_master_rank

Master MPI rank for a given CPU chip / socket

Type:

uint

self.__compute_jobs

List of positions in the HDF5 dataset that need to be computed. This may not be a contiguous list of numbers if multiple MPI workers had previously started computing and were interrupted.

Type:

array-like

self.__start_pos

The index within self.__compute_jobs that a particular MPI rank / worker needs to start computing from.

Type:

uint

self.__rank_end_pos

The index within self.__compute_jobs up to which a particular MPI rank / worker needs to compute.

Type:

uint

self.__end_pos

The index within self.__compute_jobs up to which a particular MPI rank / worker needs to compute for the current batch of positions.

Type:

uint

self.__pixels_in_batch

The positions being computed on by the current compute worker

Type:

array-like
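
How the remaining positions might be divided among MPI ranks can be sketched with plain arithmetic. The function name and the even-split strategy below are assumptions for illustration (the real class tracks its slice through self.__start_pos and self.__rank_end_pos, and its strategy may differ); the point is that each rank receives a slice of indices into the possibly non-contiguous list of leftover jobs.

```python
# Hedged sketch of splitting leftover compute jobs among MPI ranks.

def rank_slice(compute_jobs, mpi_rank, mpi_size):
    """Return (start, end) indices into compute_jobs for one rank."""
    per_rank = len(compute_jobs) // mpi_size
    start = mpi_rank * per_rank
    # the last rank picks up the remainder
    end = len(compute_jobs) if mpi_rank == mpi_size - 1 else start + per_rank
    return start, end

# 10 leftover positions (non-contiguous after an interrupted run), 3 ranks:
jobs = [0, 1, 2, 5, 6, 7, 8, 11, 12, 13]
print([rank_slice(jobs, r, 3) for r in range(3)])  # [(0, 3), (3, 6), (6, 10)]
```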

Methods

compute

Creates placeholders for the results, applies the _unit_computation() to chunks of the dataset

test

Tests the process on a subset (for example a pixel) of the whole data.

use_partial_computation

Extracts the necessary parameters from the provided h5 group to resume computation

Attributes

parms_dict

The name of the HDF5 dataset that should be present to signify which positions have already been computed. This is NOT a fully private variable, so that multiple processes can be run within a single group - e.g., Fitter. In the case of Fitter, this name can be changed from 'completed_guesses' to 'completed_fits'. check_for_duplicates will be called by the child class, where it has the opportunity to change this variable before checking for duplicates

compute(override=False, *args, **kwargs)[source]

Creates placeholders for the results, applies the _unit_computation() to chunks of the dataset

Parameters:
  • override (bool, optional. default = False) – By default, compute will simply return duplicate results to avoid recomputing or resume computation on a group with partial results. Set to True to force fresh computation.

  • args (list) – arguments to the mapped function in the correct order

  • kwargs (dict) – keyword arguments to the mapped function

Returns:

h5_results_grp – Group containing all the results

Return type:

h5py.Group
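
The override semantics can be illustrated with a toy stand-in: completed results computed with the same parameters are returned as-is, and only override=True forces a fresh computation. (Illustration only; the real class compares self.parms_dict against existing result groups in the HDF5 file rather than using an in-memory dictionary.)

```python
# Toy stand-in for the duplicate-detection / override behavior of compute().

class CachedCompute:
    def __init__(self):
        self._completed = {}  # parameters -> results ("duplicate groups")
        self.runs = 0         # how many times a real computation happened

    def compute(self, parms, override=False):
        key = tuple(sorted(parms.items()))
        if not override and key in self._completed:
            return self._completed[key]   # duplicate results, no recompute
        self.runs += 1
        result = sum(parms.values())      # stand-in "computation"
        self._completed[key] = result
        return result


c = CachedCompute()
c.compute({'a': 1, 'b': 2})                  # computes
c.compute({'a': 1, 'b': 2})                  # returns cached duplicate
c.compute({'a': 1, 'b': 2}, override=True)   # forced fresh computation
print(c.runs)  # 2
```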

parms_dict

The name of the HDF5 dataset that should be present to signify which positions have already been computed. This is NOT a fully private variable, so that multiple processes can be run within a single group - e.g., Fitter. In the case of Fitter, this name can be changed from 'completed_guesses' to 'completed_fits'. check_for_duplicates will be called by the child class, where it has the opportunity to change this variable before checking for duplicates

test(**kwargs)[source]

Tests the process on a subset (for example a pixel) of the whole data. The class can be re-instantiated with improved parameters and tested repeatedly until the user is content, at which point the user can call compute() on the whole dataset.

Notes

This function is not expected to be called in an MPI context

Parameters:
  • kwargs (dict, optional) – keyword arguments to test the process

use_partial_computation(h5_partial_group=None)[source]

Extracts the necessary parameters from the provided h5 group to resume computation

Parameters:

h5_partial_group (h5py.Group) – Group containing partially computed results