Caching and recording discipline data

GEMSEO offers various features to record and cache the values of a discipline's inputs and outputs, as well as its Jacobian.

Introduction

Executing a discipline triggers a simulation which can be costly.

  • The first need for caching is to avoid duplicate simulations with the same inputs.

  • Then, the generated data contain valuable information that one may want to analyze during or after the execution, so storing these data on disk is useful.

  • Finally, in the event of a machine crash, restarting the MDO process from scratch may waste computational resources. Again, storing the input and output data on disk avoids duplicating executions after a crash.

In GEMSEO, each MDODiscipline has a cache.

>>> from gemseo.api import create_discipline
>>> discipline = create_discipline('AnalyticDiscipline', name='my_discipline', expressions={'y':'2*x'})
>>> print(discipline.cache)
my_discipline
| Type: SimpleCache
| Input names: None
| Output names: None
| Length: 0
| Tolerance: 0.0

Setting a cache policy

All disciplines use the MDODiscipline.SIMPLE_CACHE cache policy by default. The other available policies are MDODiscipline.MEMORY_FULL_CACHE and MDODiscipline.HDF5_CACHE.

The cache policy can be defined by means of the MDODiscipline.set_cache_policy() method:

MDODiscipline.set_cache_policy(cache_type='SimpleCache', cache_tolerance=0.0, cache_hdf_file=None, cache_hdf_node_name=None, is_memory_shared=True)[source]

Set the type of cache to use and the tolerance level.

This method defines when the output data have to be cached, based on the distance between the current input data and the already cached input data for which output data are also stored.

The cache can be either a SimpleCache recording the last execution or a cache storing all executions, e.g. MemoryFullCache and HDF5Cache. Caching data can be either in-memory, e.g. SimpleCache and MemoryFullCache, or on the disk, e.g. HDF5Cache.

The attribute CacheFactory.caches provides the available cache types.

Parameters
  • cache_type (str) –

    The type of cache.

    By default it is set to SimpleCache.

  • cache_tolerance (float) –

    The maximum relative norm of the difference between two input arrays for them to be considered equal.

    By default it is set to 0.0.

  • cache_hdf_file (str | Path | None) –

    The path to the HDF file to store the data; this argument is mandatory when the MDODiscipline.HDF5_CACHE policy is used.

    By default it is set to None.

  • cache_hdf_node_name (str | None) –

    The name of the HDF file node to store the discipline data. If None, MDODiscipline.name is used.

    By default it is set to None.

  • is_memory_shared (bool) –

    Whether to store the data with a shared memory dictionary, which makes the cache compatible with multiprocessing.

    By default it is set to True.

Return type

None

>>> from gemseo.api import create_discipline
>>> discipline = create_discipline('AnalyticDiscipline', name='my_discipline', expressions={'y':'2*x'})
>>> print(discipline.cache)
my_discipline
| Type: SimpleCache
| Input names: None
| Output names: None
| Length: 0
| Tolerance: 0.0
>>> discipline.set_cache_policy(discipline.MEMORY_FULL_CACHE)
>>> print(discipline.cache)
my_discipline
| Type: MemoryFullCache
| Input names: None
| Output names: None
| Length: 0
| Tolerance: 0.0
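
Once a cache policy is set, executing the discipline fills the cache, and a second execution with already cached inputs returns the stored outputs instead of re-running the simulation. A minimal sketch, assuming the cache entry exposes its data through the inputs and outputs fields of CacheEntry (the input value is arbitrary and _ only discards the dictionary returned by execute()):

>>> from numpy import array
>>> _ = discipline.execute({'x': array([1.0])})  # runs the simulation and caches the result
>>> _ = discipline.execute({'x': array([1.0])})  # same inputs: the outputs come from the cache
>>> discipline.cache.last_entry.outputs['y']
array([2.])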

The different cache policies

Simple cache: storing the last execution

The simplest cache strategy in GEMSEO stores only the last execution data (inputs, outputs and, if computed, the Jacobian matrix) in memory.

This cache strategy is implemented by means of the SimpleCache class:

class gemseo.caches.simple_cache.SimpleCache(tolerance=0.0, name=None)[source]

Dictionary-based cache storing a single entry.

When caching input data that differ from this entry, the entry is replaced by a new one initialized with the new input data.

Parameters
  • tolerance (float) –

    The tolerance below which two input arrays are considered equal: norm(new_array-cached_array)/(1+norm(cached_array)) <= tolerance. If this is the case for all the input names, then the cached output data shall be returned rather than re-evaluating the discipline. This tolerance could be useful to optimize CPU time. It could be something like 2 * numpy.finfo(float).eps.

    By default it is set to 0.0.

  • name (str | None) –

    A name for the cache. If None, use the class name.

    By default it is set to None.

Return type

None

cache_jacobian(input_data, jacobian_data)[source]

Cache the input and Jacobian data.

Parameters
  • input_data (Mapping[str, Any]) – The data containing the input data to cache.

  • jacobian_data (Mapping[str, Mapping[str, numpy.ndarray]]) – The Jacobian data to cache.

Return type

None

cache_outputs(input_data, output_data)[source]

Cache input and output data.

Parameters
  • input_data (Mapping[str, Any]) – The data containing the input data to cache.

  • output_data (Mapping[str, Any]) – The data containing the output data to cache.

Return type

None

clear()[source]

Clear the cache.

Return type

None

export_to_dataset(name=None, by_group=True, categorize=True, input_names=None, output_names=None)

Build a Dataset from the cache.

Parameters
  • name (str | None) –

    A name for the dataset. If None, use the name of the cache.

    By default it is set to None.

  • by_group (bool) –

    Whether to store the data by group in Dataset.data, in the sense of one unique NumPy array per group. If categorize is False, there is a unique group: Dataset.PARAMETER_GROUP. If categorize is True, the groups are stored in Dataset.INPUT_GROUP and Dataset.OUTPUT_GROUP. If by_group is False, store the data by variable names.

    By default it is set to True.

  • categorize (bool) –

    Whether to distinguish between the different groups of variables. Otherwise, group all the variables in Dataset.PARAMETER_GROUP.

    By default it is set to True.

  • input_names (Iterable[str] | None) –

    The names of the inputs to be exported. If None, use all the inputs.

    By default it is set to None.

  • output_names (Iterable[str] | None) –

    The names of the outputs to be exported. If None, use all the outputs.

    By default it is set to None.

Returns

A dataset version of the cache.

Return type

Dataset

get(k[, d]) → D[k] if k in D, else d. d defaults to None.
items() → a set-like object providing a view on D's items
keys() → a set-like object providing a view on D's keys
values() → an object providing a view on D's values
property input_names: list[str]

The names of the inputs of the last entry.

property last_entry: gemseo.core.cache.CacheEntry

The last cache entry.

property names_to_sizes: dict[str, int]

The sizes of the variables of the last entry.

property output_names: list[str]

The names of the outputs of the last entry.

property penultimate_entry: gemseo.core.cache.CacheEntry

The penultimate cache entry.
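
The single-entry behavior can be checked directly. A minimal sketch, assuming the cache supports len() (the input values are arbitrary):

>>> from numpy import array
>>> from gemseo.api import create_discipline
>>> discipline = create_discipline('AnalyticDiscipline', name='my_discipline', expressions={'y': '2*x'})
>>> _ = discipline.execute({'x': array([1.0])})
>>> _ = discipline.execute({'x': array([2.0])})  # different inputs: the previous entry is replaced
>>> len(discipline.cache)
1
>>> discipline.cache.last_entry.inputs['x']
array([2.])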

Memory cache: recording all executions in memory

The MemoryFullCache is the in-memory counterpart of the HDF5Cache. It stores several executions of a discipline, that is, the input, output and Jacobian values, in a dictionary.

This cache strategy is implemented by means of the MemoryFullCache class:

class gemseo.caches.memory_full_cache.MemoryFullCache(tolerance=0.0, name=None, is_memory_shared=True)[source]

Cache using memory to cache all the data.

Parameters
  • is_memory_shared (bool) –

    If True, a shared memory dictionary is used to store the data, which makes the cache compatible with multiprocessing.

    By default it is set to True.

  • tolerance (float) –

    By default it is set to 0.0.

  • name (str | None) –

    By default it is set to None.

Return type

None

Warning

If is_memory_shared is False and multiple disciplines point to the same cache or the process is multi-processed, there may be duplicate computations because the cache will not be shared among the processes. This class relies on some multiprocessing features; it is therefore necessary to protect its execution with an if __name__ == '__main__': statement when working on Windows.

cache_jacobian(input_data, jacobian_data)

Cache the input and Jacobian data.

Parameters
  • input_data (Mapping[str, Any]) – The data containing the input data to cache.

  • jacobian_data (Mapping[str, Mapping[str, numpy.ndarray]]) – The Jacobian data to cache.

Return type

None

cache_outputs(input_data, output_data)

Cache input and output data.

Parameters
  • input_data (Mapping[str, Any]) – The data containing the input data to cache.

  • output_data (Mapping[str, Any]) – The data containing the output data to cache.

Return type

None

clear()[source]

Clear the cache.

Return type

None

export_to_dataset(name=None, by_group=True, categorize=True, input_names=None, output_names=None)

Build a Dataset from the cache.

Parameters
  • name (str | None) –

    A name for the dataset. If None, use the name of the cache.

    By default it is set to None.

  • by_group (bool) –

    Whether to store the data by group in Dataset.data, in the sense of one unique NumPy array per group. If categorize is False, there is a unique group: Dataset.PARAMETER_GROUP. If categorize is True, the groups are stored in Dataset.INPUT_GROUP and Dataset.OUTPUT_GROUP. If by_group is False, store the data by variable names.

    By default it is set to True.

  • categorize (bool) –

    Whether to distinguish between the different groups of variables. Otherwise, group all the variables in Dataset.PARAMETER_GROUP.

    By default it is set to True.

  • input_names (Iterable[str] | None) –

    The names of the inputs to be exported. If None, use all the inputs.

    By default it is set to None.

  • output_names (Iterable[str] | None) –

    The names of the outputs to be exported. If None, use all the outputs.

    By default it is set to None.

Returns

A dataset version of the cache.

Return type

Dataset

export_to_ggobi(file_path, input_names=None, output_names=None)

Export the cache to an XML file for the GGobi tool.

Parameters
  • file_path (str) – The path of the file to export the cache.

  • input_names (Iterable[str] | None) –

    The names of the inputs to export. If None, export all of them.

    By default it is set to None.

  • output_names (Iterable[str] | None) –

    The names of the outputs to export. If None, export all of them.

    By default it is set to None.

Return type

None

get(k[, d]) → D[k] if k in D, else d. d defaults to None.
items() → a set-like object providing a view on D's items
keys() → a set-like object providing a view on D's keys
update(other_cache)

Update from another cache.

Parameters

other_cache (gemseo.core.cache.AbstractFullCache) – The cache from which to update the current one.

Return type

None

values() → an object providing a view on D's values
property copy: gemseo.caches.memory_full_cache.MemoryFullCache

Copy the current cache.

Returns

A copy of the current cache.

property input_names: list[str]

The names of the inputs of the last entry.

property last_entry: gemseo.core.cache.CacheEntry

The last cache entry.

property names_to_sizes: dict[str, int]

The sizes of the variables of the last entry.

property output_names: list[str]

The names of the outputs of the last entry.
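
By contrast with the SimpleCache, a MemoryFullCache keeps one entry per distinct set of inputs, and the whole execution history can be converted to a Dataset with export_to_dataset(). A minimal sketch, assuming the cache supports len() (the input values are arbitrary):

>>> from numpy import array
>>> from gemseo.api import create_discipline
>>> discipline = create_discipline('AnalyticDiscipline', name='my_discipline', expressions={'y': '2*x'})
>>> discipline.set_cache_policy(discipline.MEMORY_FULL_CACHE)
>>> for value in (1.0, 2.0, 3.0):
...     _ = discipline.execute({'x': array([value])})
>>> len(discipline.cache)
3
>>> dataset = discipline.cache.export_to_dataset(name='history')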

HDF5 cache: recording all executions on the disk

When all the execution data of the discipline must be stored on disk, the HDF5 cache policy can be used. HDF5 is a standard file format for storing simulation data. The following description comes from the HDF5 website:

“HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.”

Libraries for manipulating HDF5 files exist in many languages, including C, C++, Fortran, Java and Python.

The HDFView application can be used to explore the data of the cache. To manipulate the data, one may use the HDF5Cache class, which can import the file and read either all the data or the data of a specific execution.

[Figure: HDFView of the cache generated by an MDF DOE scenario execution on the SSBJ test case]
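
In practice, the HDF5 policy is usually enabled through MDODiscipline.set_cache_policy(). A minimal sketch, where the file name my_cache.hdf5 is an arbitrary example:

>>> from numpy import array
>>> from gemseo.api import create_discipline
>>> discipline = create_discipline('AnalyticDiscipline', name='my_discipline', expressions={'y': '2*x'})
>>> discipline.set_cache_policy(discipline.HDF5_CACHE, cache_hdf_file='my_cache.hdf5')
>>> _ = discipline.execute({'x': array([1.0])})  # the entry is persisted to my_cache.hdf5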

This cache strategy is implemented by means of the HDF5Cache class:

class gemseo.caches.hdf5_cache.HDF5Cache(hdf_file_path='cache.hdf5', hdf_node_path='node', tolerance=0.0, name=None)[source]

Cache using disk HDF5 file to store the data.

Parameters
  • hdf_file_path (str | Path) –

    The path of the HDF file. Initialize a singleton to access the HDF file. This singleton is used for multithreaded/multiprocessing access with a lock.

    By default it is set to cache.hdf5.

  • hdf_node_path (str) –

    The node of the HDF file.

    By default it is set to node.

  • name (str | None) –

    A name for the cache. If None, use hdf_node_path.

    By default it is set to None.

  • tolerance (float) –

    By default it is set to 0.0.

Return type

None

Warning

This class relies on some multiprocessing features; it is therefore necessary to protect its execution with an if __name__ == '__main__': statement when working on Windows.

cache_jacobian(input_data, jacobian_data)

Cache the input and Jacobian data.

Parameters
  • input_data (Mapping[str, Any]) – The data containing the input data to cache.

  • jacobian_data (Mapping[str, Mapping[str, numpy.ndarray]]) – The Jacobian data to cache.

Return type

None

cache_outputs(input_data, output_data)

Cache input and output data.

Parameters
  • input_data (Mapping[str, Any]) – The data containing the input data to cache.

  • output_data (Mapping[str, Any]) – The data containing the output data to cache.

Return type

None

clear()[source]

Clear the cache.

Return type

None

export_to_dataset(name=None, by_group=True, categorize=True, input_names=None, output_names=None)

Build a Dataset from the cache.

Parameters
  • name (str | None) –

    A name for the dataset. If None, use the name of the cache.

    By default it is set to None.

  • by_group (bool) –

    Whether to store the data by group in Dataset.data, in the sense of one unique NumPy array per group. If categorize is False, there is a unique group: Dataset.PARAMETER_GROUP. If categorize is True, the groups are stored in Dataset.INPUT_GROUP and Dataset.OUTPUT_GROUP. If by_group is False, store the data by variable names.

    By default it is set to True.

  • categorize (bool) –

    Whether to distinguish between the different groups of variables. Otherwise, group all the variables in Dataset.PARAMETER_GROUP.

    By default it is set to True.

  • input_names (Iterable[str] | None) –

    The names of the inputs to be exported. If None, use all the inputs.

    By default it is set to None.

  • output_names (Iterable[str] | None) –

    The names of the outputs to be exported. If None, use all the outputs.

    By default it is set to None.

Returns

A dataset version of the cache.

Return type

Dataset

export_to_ggobi(file_path, input_names=None, output_names=None)

Export the cache to an XML file for the GGobi tool.

Parameters
  • file_path (str) – The path of the file to export the cache.

  • input_names (Iterable[str] | None) –

    The names of the inputs to export. If None, export all of them.

    By default it is set to None.

  • output_names (Iterable[str] | None) –

    The names of the outputs to export. If None, export all of them.

    By default it is set to None.

Return type

None

get(k[, d]) → D[k] if k in D, else d. d defaults to None.
items() → a set-like object providing a view on D's items
keys() → a set-like object providing a view on D's keys
update(other_cache)

Update from another cache.

Parameters

other_cache (gemseo.core.cache.AbstractFullCache) – The cache from which to update the current one.

Return type

None

static update_file_format(hdf_file_path)[source]

Update the format of an HDF5 file.

Parameters

hdf_file_path (str | Path) – An HDF5 file path.

Return type

None

values() → an object providing a view on D's values
property hdf_file: gemseo.caches.hdf5_file_singleton.HDF5FileSingleton

The HDF file handler.

property input_names: list[str]

The names of the inputs of the last entry.

property last_entry: gemseo.core.cache.CacheEntry

The last cache entry.

property names_to_sizes: dict[str, int]

The sizes of the variables of the last entry.

property output_names: list[str]

The names of the outputs of the last entry.
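
Note that a cache file produced by an older version of GEMSEO can be converted to the current format in place with the static method update_file_format(); the path below is an arbitrary example:

>>> from gemseo.caches.hdf5_cache import HDF5Cache
>>> HDF5Cache.update_file_format('my_cache.hdf5')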

[DEV] The abstract caches