pyUSID.processing.process.Process¶
- class pyUSID.processing.process.Process(h5_main, process_name, parms_dict=None, cores=None, max_mem_mb=4096, mem_multiplier=1.0, lazy=False, h5_target_group=None, verbose=False)[source]¶
Bases:
object
An abstract class for formulating scientific problems as computational problems. This class handles the tedious, science-agnostic file operations, parallel computation, and book-keeping so that child classes only need to supply the application-specific code for processing the data.
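The division of labor can be illustrated with a pure-NumPy sketch (no pyUSID required) that mimics the chunked read → unit computation → write cycle that Process automates. The function names and the toy "dataset" below are illustrative stand-ins, not pyUSID API:

```python
import numpy as np

# Hedged, pyUSID-free sketch of the pattern Process automates:
# read a batch of positions, apply a unit computation, store results.
# A plain NumPy array stands in for the USID main dataset
# (positions x spectral points).

def unit_computation(chunk):
    # Stand-in for a child class's _unit_computation():
    # here, simply the mean over the spectral axis per position.
    return chunk.mean(axis=1)

def chunked_process(data, max_pos_per_read):
    n_pos = data.shape[0]
    results = np.empty(n_pos)
    start = 0
    while start < n_pos:                       # loop over batches of positions
        end = min(start + max_pos_per_read, n_pos)
        results[start:end] = unit_computation(data[start:end])  # write chunk
        start = end
    return results

data = np.arange(12.0).reshape(4, 3)           # 4 positions, 3 spectral points
print(chunked_process(data, max_pos_per_read=2))  # [ 1.  4.  7. 10.]
```

Process adds to this loop the HDF5 placeholders, per-chunk parallelism, and the completion book-keeping described below.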
- Parameters:
h5_main (
USIDataset
) – The USID main HDF5 dataset over which the analysis will be performed.
process_name (str) – Name of the process
parms_dict (dict, optional) – Dictionary of parameters that identify this computation; used when checking for duplicate or partial prior results
cores (uint, optional) – How many cores to use for the computation. Default: all available cores - 2 if operating outside MPI context
max_mem_mb (uint, optional) – How much memory to use for the computation. Default: 4096 MB
mem_multiplier (float, optional. Default = 1) – mem_multiplier is the number that will be multiplied with the (byte) size of a single position in the source dataset in order to better estimate the number of positions that can be processed at any given time (how many pixels of the source and results datasets can be retained in memory). The default value of 1.0 only accounts for the source dataset. A value greater than 1 would account for the size of results datasets as well. For example, if the result dataset is the same size and precision as the source dataset, the multiplier will be 2 (1 for source, 1 for result)
lazy (bool, optional. Default = False) – If True, read_data_chunk and write_results_chunk will operate on dask arrays. If False - everything will be in numpy.
h5_target_group (h5py.Group, optional. Default = None) – Location where to look for existing results and to place newly computed results. Use this kwarg if the results need to be written to a different HDF5 file. By default, this value is set to the parent group containing h5_main
verbose (bool, Optional, default = False) – Whether or not to print debugging statements
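The interplay of max_mem_mb and mem_multiplier can be sketched with simple arithmetic (a hedged illustration of the idea; the exact accounting inside pyUSID may differ):

```python
import numpy as np

def max_positions_per_read(spectral_points, dtype, max_mem_mb,
                           mem_multiplier=1.0):
    """Rough estimate of how many positions fit in memory at once.

    mem_multiplier scales the per-position footprint to account for
    results datasets as well (e.g. 2.0 if the result matches the
    source in size and precision).
    """
    bytes_per_pos = spectral_points * np.dtype(dtype).itemsize * mem_multiplier
    return int(max_mem_mb * 1024 ** 2 // bytes_per_pos)

# 1024 spectral points of complex64 (8 bytes each) = 8 KiB per position.
# With 4096 MB and source-only accounting:
print(max_positions_per_read(1024, np.complex64, 4096))       # 524288
# Accounting for an equally sized results dataset halves that:
print(max_positions_per_read(1024, np.complex64, 4096, 2.0))  # 262144
```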
- self.h5_results_grp¶
HDF5 group containing the HDF5 datasets that contain the results of the computation
- Type:
h5py.Group
- self.duplicate_h5_groups¶
List of h5py.Group objects containing computational results that have been completely computed with the same set of parameters as those in self.parms_dict
- Type:
list
- self.partial_h5_groups¶
List of h5py.Group objects containing computational results that have been partially computed with the same set of parameters as those in self.parms_dict
- Type:
list
- self.process_name¶
Name of the process. This is used for checking for existing completely and partially computed results as well as for naming the HDF5 group that will contain the results of the computation
- Type:
str
- self._cores¶
Number of CPU cores to use for parallel computations. Ignored in the MPI context. Each rank gets 1 CPU core
- Type:
uint
- self._max_pos_per_read¶
Number of positions in the dataset to read per chunk
- Type:
uint
- self._status_dset_name¶
Name of the HDF5 dataset that keeps track of the positions in the source dataset that have already been computed
- Type:
str
- self._results¶
List of objects returned as the result of computation performed by the self._map_function for each position in the current batch of positions that were processed
- Type:
list
- self._h5_target_group¶
Location where existing / future results will be stored
- Type:
h5py.Group
- self.__resume_implemented¶
Whether or not this (child) class has implemented the self._get_existing_datasets() function
- Type:
bool
- self.__bytes_per_pos¶
Number of bytes used by one position of the source dataset
- Type:
uint
- self.mpi_comm¶
MPI communicator. None if not running in an MPI context
- Type:
mpi4py.MPI.COMM_WORLD
- self.mpi_rank¶
MPI rank. Always 0 if not running in an MPI context
- Type:
uint
- self.mpi_size¶
Number of ranks in COMM_WORLD. 1 if not running in an MPI context
- Type:
uint
- self.__ranks_on_socket¶
Number of MPI ranks on a given CPU socket
- Type:
uint
- self.__socket_master_rank¶
Master MPI rank for a given CPU chip / socket
- Type:
uint
- self.__compute_jobs¶
List of positions in the HDF5 dataset that need to be computed. This may not be a contiguous run of numbers if multiple MPI workers had previously started computing and were interrupted.
- Type:
array-like
- self.__start_pos¶
The index within self.__compute_jobs that a particular MPI rank / worker needs to start computing from.
- Type:
uint
- self.__rank_end_pos¶
The index within self.__compute_jobs up to which a particular MPI rank / worker needs to compute.
- Type:
uint
- self.__end_pos¶
The index within self.__compute_jobs up to which a particular MPI rank / worker needs to compute for the current batch of positions.
- Type:
uint
- self.__pixels_in_batch¶
The positions being computed on by the current compute worker
- Type:
array-like
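The rank-partitioning attributes above (self.__compute_jobs, self.__start_pos, self.__rank_end_pos) can be illustrated with an MPI-free sketch of how the pending positions might be divided across ranks. This is a hedged illustration; pyUSID's actual partitioning logic may differ in detail:

```python
import numpy as np

def partition_jobs(compute_jobs, mpi_size):
    """Split the (possibly non-contiguous) list of pending positions
    into one contiguous index range per MPI rank."""
    bounds = np.linspace(0, len(compute_jobs), mpi_size + 1, dtype=int)
    return [(bounds[r], bounds[r + 1]) for r in range(mpi_size)]

# 10 pending positions (note the gap: positions 0-3 were already
# computed in an interrupted earlier run), split across 3 ranks:
jobs = np.array([4, 5, 6, 7, 8, 9, 10, 11, 12, 13])
for rank, (start, end) in enumerate(partition_jobs(jobs, 3)):
    print(rank, jobs[start:end])
```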
Methods

compute
Creates placeholders for the results, applies the _unit_computation() to chunks of the dataset
test
Tests the process on a subset (for example a pixel) of the whole data.
use_partial_computation
Extracts the necessary parameters from the provided h5 group to resume computation

Attributes

parms_dict
The name of the HDF5 dataset that should be present to signify which positions have already been computed. This is NOT a fully private variable so that multiple processes can be run within a single group - e.g. Fitter, where this name can be changed from 'completed_guesses' to 'completed_fits'. check_for_duplicates will be called by the child class, which has the opportunity to change this variable before checking for duplicates.
- compute(override=False, *args, **kwargs)[source]¶
Creates placeholders for the results, applies the
_unit_computation()
to chunks of the dataset
- Parameters:
override (bool, optional. default = False) – By default, compute will simply return duplicate results to avoid recomputing or resume computation on a group with partial results. Set to True to force fresh computation.
args (list) – arguments to the mapped function in the correct order
kwargs (dict) – keyword arguments to the mapped function
- Returns:
h5_results_grp – Group containing all the results
- Return type:
h5py.Group
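The override/resume semantics of compute can be mimicked with a toy status mask. This is a hedged, self-contained sketch; pyUSID itself tracks completion in an HDF5 status dataset rather than an in-memory array:

```python
import numpy as np

def compute(data, status, override=False):
    """Square each position, skipping already-completed ones
    unless override forces a fresh computation."""
    if override:
        status[:] = 0                     # discard previous progress
    todo = np.where(status == 0)[0]       # pending positions only
    results = np.zeros_like(data)
    results[todo] = data[todo] ** 2       # the "unit computation"
    status[todo] = 1                      # mark as completed
    return results

data = np.array([1.0, 2.0, 3.0, 4.0])
status = np.array([1, 1, 0, 0])           # first two already done
print(compute(data, status))              # only positions 2 and 3 computed
print(compute(data, status, override=True))  # everything recomputed
```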
- parms_dict¶
The name of the HDF5 dataset that should be present to signify which positions have already been computed. This is NOT a fully private variable so that multiple processes can be run within a single group - e.g. Fitter, where this name can be changed from 'completed_guesses' to 'completed_fits'. check_for_duplicates will be called by the child class, which has the opportunity to change this variable before checking for duplicates.
- test(**kwargs)[source]¶
Tests the process on a subset (for example a pixel) of the whole data. The class can be re-instantiated with improved parameters and tested repeatedly until the user is content, at which point the user can call
compute()
on the whole dataset.
Notes
This is not a function that is expected to be called in MPI
- Parameters:
kwargs (dict, optional) – keyword arguments to test the process
- use_partial_computation(h5_partial_group=None)[source]¶
Extracts the necessary parameters from the provided h5 group to resume computation
- Parameters:
h5_partial_group (
h5py.Group
) – Group containing partially computed results
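The duplicate/partial detection that feeds use_partial_computation hinges on comparing parms_dict against the parameters stored as attributes on prior results groups. A hedged sketch of that matching, with group attributes represented as plain dicts and a hypothetical 'complete' flag standing in for the status dataset:

```python
def classify_groups(parms_dict, prior_groups):
    """Split prior result groups into fully- and partially-completed
    ones whose stored parameters match the current run.  Each prior
    group is a dict of stored parameters plus a 'complete' flag."""
    duplicates, partials = [], []
    for grp in prior_groups:
        stored = {k: v for k, v in grp.items() if k != 'complete'}
        if stored == parms_dict:
            (duplicates if grp['complete'] else partials).append(grp)
    return duplicates, partials

prior = [
    {'freq': 100, 'cycles': 2, 'complete': True},
    {'freq': 100, 'cycles': 2, 'complete': False},
    {'freq': 200, 'cycles': 2, 'complete': True},   # different parameters
]
dups, parts = classify_groups({'freq': 100, 'cycles': 2}, prior)
print(len(dups), len(parts))                        # 1 1
```

A complete match would be returned as-is by compute (unless override=True); a partial match is what use_partial_computation resumes from.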