pyUSID.processing.process.Process

class pyUSID.processing.process.Process(h5_main, process_name, parms_dict=None, cores=None, max_mem_mb=4096, mem_multiplier=1.0, lazy=False, h5_target_group=None, verbose=False)[source]

Bases: object

An abstract class for formulating scientific problems as computational problems. This class handles the tedious, science-agnostic file operations, parallel computation, and book-keeping so that child classes only need to specify the application-relevant code for processing the data.

Parameters:
  • h5_main (USIDataset) – The USID main HDF5 dataset over which the analysis will be performed.

  • process_name (str) – Name of the process

  • cores (uint, optional) – How many cores to use for the computation. Default: all available cores - 2 if operating outside MPI context

  • max_mem_mb (uint, optional) – How much memory (in megabytes) to use for the computation. Default: 4096 MB

  • mem_multiplier (float, optional. Default = 1.0) – Factor multiplied with the size (in bytes) of a single position in the source dataset to better estimate how many positions can be processed at any given time (i.e., how many pixels of the source and results datasets can be retained in memory). The default value of 1.0 accounts only for the source dataset. A value greater than 1 also accounts for the size of the results datasets. For example, if the results dataset has the same size and precision as the source dataset, the multiplier should be 2 (1 for the source, 1 for the results)

  • lazy (bool, optional. Default = False) – If True, read_data_chunk and write_results_chunk will operate on dask arrays. If False, everything will be in numpy arrays.

  • h5_target_group (h5py.Group, optional. Default = None) – Location where to look for existing results and to place newly computed results. Use this kwarg if the results need to be written to a different HDF5 file. By default, this value is set to the parent group containing h5_main

  • verbose (bool, Optional, default = False) – Whether or not to print debugging statements
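
In practice, a scientific computation is expressed by subclassing Process and overriding a few hooks. The sketch below illustrates that pattern with a pure-Python stand-in base class so it runs anywhere; in real code one would subclass pyUSID.processing.process.Process, and the overridden hooks would read from and write to HDF5 datasets. The MeanProcess class and its simplified compute() loop are illustrative assumptions, not the library's API.

```python
# Illustrative skeleton of the Process subclassing pattern. "Process"
# here is a pure-Python stand-in so the sketch is self-contained; the
# real pyUSID Process additionally handles chunking, memory budgeting,
# MPI, and HDF5 book-keeping.

class Process:  # stand-in for pyUSID.processing.process.Process
    def __init__(self, data, process_name, parms_dict=None):
        self.data = data                      # stand-in for h5_main
        self.process_name = process_name
        self.parms_dict = parms_dict or {}
        self._results = []

    def compute(self):
        # The real class iterates over memory-sized chunks of positions;
        # this toy version processes every position in a single batch.
        self._results = [self._map_function(pos) for pos in self.data]
        self._write_results_chunk()
        return self._results

    @staticmethod
    def _map_function(position):
        raise NotImplementedError  # child classes supply the science

    def _write_results_chunk(self):
        pass  # real code writes self._results to the HDF5 results group


class MeanProcess(Process):
    """Toy analysis: reduce each position (a list of values) to its mean."""

    @staticmethod
    def _map_function(position):
        return sum(position) / len(position)


proc = MeanProcess([[1, 2, 3], [4, 5, 6]], 'Mean')
print(proc.compute())  # [2.0, 5.0]
```

The key design point carries over to the real class: the child supplies only the per-position science (here _map_function), while the base class owns iteration, parallelism, and persistence.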

self.h5_results_grp

HDF5 group containing the HDF5 datasets that contain the results of the computation

Type:

h5py.Group

self.verbose

Whether or not to print debugging statements

Type:

bool

self.parms_dict

Dictionary of parameters for the computation

Type:

dict

self.duplicate_h5_groups

List of h5py.Group objects containing computational results that have been completely computed with the same set of parameters as those in self.parms_dict

Type:

list

self.partial_h5_groups

List of h5py.Group objects containing computational results that have been partially computed with the same set of parameters as those in self.parms_dict

Type:

list

self.process_name

Name of the process. This is used for checking for existing completely and partially computed results as well as for naming the HDF5 group that will contain the results of the computation

Type:

str

self._cores

Number of CPU cores to use for parallel computations. Ignored in the MPI context. Each rank gets 1 CPU core

Type:

uint

self._max_pos_per_read

Number of positions in the dataset to read per chunk

Type:

uint
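
The relationship between max_mem_mb, mem_multiplier, and the per-position size can be illustrated with back-of-the-envelope arithmetic. The formula below is an assumption for illustration only (the actual implementation may divide the budget differently, e.g., per core or per socket): the number of positions held in memory at once is bounded by the memory budget divided by the multiplier-adjusted size of one position.

```python
# Assumed chunk-sizing arithmetic, for illustration only.

def max_pos_per_read(max_mem_mb, bytes_per_pos, mem_multiplier=1.0, cores=1):
    """Rough upper bound on positions that fit in the memory budget."""
    # mem_multiplier > 1 also budgets for the results dataset(s)
    budget_bytes = max_mem_mb * 1024 * 1024 / cores
    return int(budget_bytes // (bytes_per_pos * mem_multiplier))

# 4096 MB budget, 8 KB per position, results as large as the source
# (multiplier = 2: 1 for source, 1 for results):
print(max_pos_per_read(4096, 8192, mem_multiplier=2.0))  # 262144
```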

self._status_dset_name

Name of the HDF5 dataset that keeps track of the positions in the source dataset that have already been computed

Type:

str

self._results

List of objects returned as the result of the computation performed by self._map_function for each position in the current batch of positions that were processed

Type:

list

self._h5_target_group

Location where existing / future results will be stored

Type:

h5py.Group

self.__resume_implemented

Whether or not this (child) class has implemented the self._get_existing_datasets() function

Type:

bool

self.__bytes_per_pos

Number of bytes used by one position of the source dataset

Type:

uint

self.mpi_comm

MPI communicator. None if not running in an MPI context

Type:

mpi4py.MPI.COMM_WORLD

self.mpi_rank

MPI rank. Always 0 if not running in an MPI context

Type:

uint

self.mpi_size

Number of ranks in COMM_WORLD. 1 if not running in an MPI context

Type:

uint

self.__ranks_on_socket

Number of MPI ranks on a given CPU socket

Type:

uint

self.__socket_master_rank

Master MPI rank for a given CPU chip / socket

Type:

uint

self.__compute_jobs

List of positions in the HDF5 dataset that need to be computed. This may not be a contiguous list of numbers if multiple MPI workers had previously started computing and were interrupted.

Type:

array-like

self.__start_pos

The index within self.__compute_jobs that a particular MPI rank / worker needs to start computing from.

Type:

uint

self.__rank_end_pos

The index within self.__compute_jobs up to which a particular MPI rank / worker needs to compute.

Type:

uint

self.__end_pos

The index within self.__compute_jobs up to which a particular MPI rank / worker needs to compute for the current batch of positions.

Type:

uint

self.__pixels_in_batch

The positions being computed on by the current compute worker

Type:

array-like
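
How the remaining positions might be divided among MPI ranks can be sketched with plain arithmetic. The function name and the even-split strategy below are assumptions for illustration (the real class tracks its slice through self.__start_pos and self.__rank_end_pos, and its strategy may differ); the point is that each rank receives a slice of indices into the possibly non-contiguous list of leftover jobs.

```python
# Hedged sketch of splitting leftover compute jobs among MPI ranks.

def rank_slice(compute_jobs, mpi_rank, mpi_size):
    """Return (start, end) indices into compute_jobs for one rank."""
    per_rank = len(compute_jobs) // mpi_size
    start = mpi_rank * per_rank
    # the last rank picks up the remainder
    end = len(compute_jobs) if mpi_rank == mpi_size - 1 else start + per_rank
    return start, end

# 10 leftover positions (non-contiguous after an interrupted run), 3 ranks:
jobs = [0, 1, 2, 5, 6, 7, 8, 11, 12, 13]
print([rank_slice(jobs, r, 3) for r in range(3)])  # [(0, 3), (3, 6), (6, 10)]
```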

Methods

compute

Creates placeholders for the results, applies the _unit_computation() to chunks of the dataset

test

Tests the process on a subset (for example a pixel) of the whole data.

use_partial_computation

Extracts the necessary parameters from the provided h5 group to resume computation

Attributes

parms_dict

The name of the HDF5 dataset that should be present to signify which positions have already been computed. This is NOT a fully private variable, so that multiple processes can be run within a single group - e.g., Fitter. In the case of Fitter, this name can be changed from 'completed_guesses' to 'completed_fits'. check_for_duplicates will be called by the child class, where it has the opportunity to change this variable before checking for duplicates

compute(override=False, *args, **kwargs)[source]

Creates placeholders for the results, applies the _unit_computation() to chunks of the dataset

Parameters:
  • override (bool, optional. default = False) – By default, compute will simply return duplicate results to avoid recomputing or resume computation on a group with partial results. Set to True to force fresh computation.

  • args (list) – arguments to the mapped function in the correct order

  • kwargs (dict) – keyword arguments to the mapped function

Returns:

h5_results_grp – Group containing all the results

Return type:

h5py.Group
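
The override semantics can be illustrated with a toy stand-in: completed results computed with the same parameters are returned as-is, and only override=True forces a fresh computation. (Illustration only; the real class compares self.parms_dict against existing result groups in the HDF5 file rather than using an in-memory dictionary.)

```python
# Toy stand-in for the duplicate-detection / override behavior of compute().

class CachedCompute:
    def __init__(self):
        self._completed = {}  # parameters -> results ("duplicate groups")
        self.runs = 0         # how many times a real computation happened

    def compute(self, parms, override=False):
        key = tuple(sorted(parms.items()))
        if not override and key in self._completed:
            return self._completed[key]   # duplicate results, no recompute
        self.runs += 1
        result = sum(parms.values())      # stand-in "computation"
        self._completed[key] = result
        return result


c = CachedCompute()
c.compute({'a': 1, 'b': 2})                  # computes
c.compute({'a': 1, 'b': 2})                  # returns cached duplicate
c.compute({'a': 1, 'b': 2}, override=True)   # forced fresh computation
print(c.runs)  # 2
```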

parms_dict

The name of the HDF5 dataset that should be present to signify which positions have already been computed. This is NOT a fully private variable, so that multiple processes can be run within a single group - e.g., Fitter. In the case of Fitter, this name can be changed from 'completed_guesses' to 'completed_fits'. check_for_duplicates will be called by the child class, where it has the opportunity to change this variable before checking for duplicates

test(**kwargs)[source]

Tests the process on a subset (for example a pixel) of the whole data. The class can be re-instantiated with improved parameters and tested repeatedly until the user is content, at which point the user can call compute() on the whole dataset.

Notes

This function is not expected to be called in an MPI context

Parameters:
  • kwargs (dict, optional) – keyword arguments to test the process

use_partial_computation(h5_partial_group=None)[source]

Extracts the necessary parameters from the provided h5 group to resume computation

Parameters:

h5_partial_group (h5py.Group) – Group containing partially computed results