dataset module

Dataset

The dataset module implements the concept of a dataset, which is a key element for machine learning, post-processing, data analysis, …

A Dataset is an object defined by data stored as a dictionary of 2D numpy arrays, whose rows are samples (a.k.a. realizations) and whose columns are features (a.k.a. parameters or variables). The keys of this dictionary are either names of groups of variables or names of variables. A Dataset is also defined by a list of variable names, a dictionary of variable sizes and a dictionary of variable groups.
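To make this layout concrete, here is a minimal sketch of the structures described above. It is not GEMSEO code: plain nested lists stand in for 2D numpy arrays, and the variable names (`x`, `z`, `y`), the sample values and the helpers `n_samples` and `group_width` are all hypothetical.

```python
# Nested lists stand in for 2D numpy arrays: rows are samples, columns are features.
# Every name and value below is an illustrative assumption, not GEMSEO internals.

# Data stored by group: one 2D block per group of variables.
data_by_group = {
    "inputs": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],  # columns: x (size 1), then z (size 2)
    "outputs": [[10.0], [20.0]],                   # column: y (size 1)
}

# The metadata that completes the definition of the dataset.
names = ["x", "z", "y"]                                   # list of variable names
sizes = {"x": 1, "z": 2, "y": 1}                          # dictionary of variable sizes
groups = {"x": "inputs", "z": "inputs", "y": "outputs"}   # dictionary of variable groups


def n_samples(data):
    """Return the number of rows shared by all blocks."""
    counts = {len(block) for block in data.values()}
    if len(counts) != 1:
        raise ValueError("all blocks must have the same number of samples")
    return counts.pop()


def group_width(group):
    """Return the number of columns of a group, i.e. the sum of its variable sizes."""
    return sum(sizes[name] for name in names if groups[name] == group)
```

In this sketch, the width of each block equals the sum of the sizes of the variables assigned to that group, which is the consistency condition linking the data to its metadata.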

A Dataset can be set either from a numpy array or a file. An AbstractFullCache or an OptimizationProblem can also be exported to a Dataset using AbstractFullCache.export_to_dataset() and OptimizationProblem.export_to_dataset() respectively.

From a Dataset, we can easily access its length and get the data, either as a 2D array or as dictionaries indexed by the variable names. We can get either the whole data, the data associated with a group or the data associated with a list of variables. It is also possible to export the Dataset to an AbstractFullCache or a pandas DataFrame.

class gemseo.core.dataset.Dataset(name=None, by_group=True)[source]

Bases: object

A Dataset is an object defined by data stored as a 2D numpy array, whose rows are samples (a.k.a. realizations) and whose columns are properties (a.k.a. parameters, variables or features). A dataset is also defined by a list of variable names, a dictionary of variable sizes and a dictionary of variable types. We can easily access its length and get the data, either as a 2D array or as a list of dictionaries indexed by the variable names. It is also possible to export the dataset to an AbstractFullCache or a pandas DataFrame.

Constructor.

Parameters
  • name (str) – dataset name.

  • by_group (bool) – if True, store the data by group. Otherwise, store it by variable. Default: True.

DEFAULT_GROUP = 'parameters'
DEFAULT_NAMES = {'design_parameters': 'dp', 'functions': 'func', 'inputs': 'in', 'outputs': 'out', 'parameters': 'x'}
DESIGN_GROUP = 'design_parameters'
FUNCTION_GROUP = 'functions'
HDF5_CACHE = 'HDF5Cache'
INPUT_GROUP = 'inputs'
MEMORY_FULL_CACHE = 'MemoryFullCache'
OUTPUT_GROUP = 'outputs'
PARAMETER_GROUP = 'parameters'
add_group(group, data, variables=None, sizes=None, varname=None, cache_as_input=True)[source]

Add the data of a group of variables.

Parameters
  • group (str) – group name.

  • data (ndarray) – data.

  • variables (list(str)) – list of variable names.

  • sizes (dict) – dictionary of variable sizes.

  • varname (str) – variable name used if variables is None. If None, use the default variable name for group if it exists; otherwise, use the group name. Default: None.

  • cache_as_input (bool) – if True, cache the data as inputs when exporting to a cache. Otherwise, cache them as outputs. Default: True.
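The fallback rule for varname can be sketched as follows. This is only an illustration of the documented behaviour when variables is None, not the library's implementation: `resolve_variable_name` is a hypothetical helper, while the dictionary is copied from the DEFAULT_NAMES class attribute listed above.

```python
# Copied from the DEFAULT_NAMES class attribute documented above.
DEFAULT_NAMES = {
    "design_parameters": "dp",
    "functions": "func",
    "inputs": "in",
    "outputs": "out",
    "parameters": "x",
}


def resolve_variable_name(group, varname=None):
    """Hypothetical helper mirroring the documented fallback for ``varname``.

    1. An explicit ``varname`` wins.
    2. Otherwise, use the default variable name of the group if one exists.
    3. Otherwise, fall back to the group name itself.
    """
    if varname is not None:
        return varname
    return DEFAULT_NAMES.get(group, group)
```

For example, under these assumptions a group named "inputs" yields the variable name "in", whereas a custom group with no default name reuses the group name.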

add_variable(name, data, group='parameters', cache_as_input=True)[source]

Add a variable.

Parameters
  • name (str) – variable name.

  • data (ndarray) – data.

  • group (str) – group name. Default: DEFAULT_GROUP.

  • cache_as_input (bool) – if True, cache the data as inputs when exporting to a cache. Otherwise, cache them as outputs. Default: True.

export_to_cache(inputs=None, outputs=None, cache_type='MemoryFullCache', cache_hdf_file=None, cache_hdf_node_name=None, **options)[source]

Export the dataset to a cache.

Parameters
  • inputs (list(str)) – names of the inputs to cache. If None, use all inputs. Default: None.

  • outputs (list(str)) – names of the outputs to cache. If None, use all outputs. Default: None.

  • cache_type (str) – type of cache to use, e.g. HDF5_CACHE or MEMORY_FULL_CACHE. Default: MEMORY_FULL_CACHE.

  • cache_hdf_file (str) – the file in which to store the data; mandatory when HDF caching is used.

  • cache_hdf_node_name (str) – name of the HDF dataset in which to store the discipline data. If None, self.name is used.

export_to_dataframe(copy=True)[source]

Export the dataset to a pandas DataFrame.

Parameters

copy (bool) – If True, copy data. Otherwise, use reference. Default: True.

get_all_data(by_group=True, as_dict=False)[source]

Returns all data.

Parameters
  • by_group (bool) – if True, organize the returned data by group. Otherwise, by variable. Default: True.

  • as_dict (bool) – if True, return the values as a dictionary. Default: False.

get_data_by_group(group, as_dict=False)[source]

Returns data associated with a group.

Parameters
  • group (str) – variable group.

  • as_dict (bool) – if True, return the values as a dictionary. Default: False.

get_data_by_names(names, as_dict=True)[source]

Get data by variable names.

Parameters
  • names (list(str)) – names of the variables.

  • as_dict (bool) – if True, return the values as a dictionary. Default: True.

get_group(variable_name)[source]

Returns the group of a given variable.

Parameters

variable_name (str) – name of the variable.

get_names(group_name)[source]

Returns the names of the variables of a given group.

Parameters

group_name (str) – name of the group.

property groups

Names of the groups of variables.

is_empty()[source]

Returns True if the dataset is empty.

is_group(group_name)[source]

Returns True if group_name is a group.

Parameters

group_name (str) – name to check.

is_variable(variable_name)[source]

Returns True if variable_name is a variable.

Parameters

variable_name (str) – name to check.

property n_samples

Return the number of samples.

property n_variables

Return the number of variables.

n_variables_by_group(group)[source]

Return the number of variables for a group.

Parameters

group (str) – group name.

plot(name, show=True, save=False, **options)[source]

Finds the appropriate library and executes the post-processing on the dataset.

Parameters
  • name (str) – the post-processing name.

  • show (bool) – if True, show the figure. Default: True.

  • save (bool) – if True, save the figure. Default: False.

  • options – options for the post-processing method; see its package.

set_from_array(data, variables=None, sizes=None, groups=None, default_name=None)[source]

Set the Dataset from a numpy array or a dictionary of arrays.

Parameters
  • data (array) – dataset.

  • variables (list(str)) – list of variable names.

  • sizes (dict(int)) – dictionary of variable sizes.

  • groups (dict(str)) – dictionary of variable groups.

  • default_name (str) – default variable name.
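How variables and sizes relate to the columns of a 2D array can be illustrated with the following sketch. It is an assumption about the general idea, not GEMSEO's implementation: `split_columns` is a hypothetical helper and nested lists stand in for a numpy array.

```python
def split_columns(rows, variables, sizes):
    """Split each row of a 2D table into per-variable column blocks.

    ``rows`` is a list of samples, ``variables`` gives the column order and
    ``sizes`` maps each variable name to its number of columns.
    Hypothetical helper, not part of GEMSEO.
    """
    expected = sum(sizes[name] for name in variables)
    result = {name: [] for name in variables}
    for row in rows:
        if len(row) != expected:
            raise ValueError(
                "row has {} columns, expected {}".format(len(row), expected)
            )
        start = 0
        for name in variables:
            stop = start + sizes[name]
            # Consecutive columns [start, stop) belong to this variable.
            result[name].append(list(row[start:stop]))
            start = stop
    return result


# Two samples, one scalar variable "x" followed by a 2-component variable "y".
table = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
blocks = split_columns(table, ["x", "y"], {"x": 1, "y": 2})
```

The resulting dictionary is indexed by variable names, with one 2D block per variable, which is the same shape of result that the as_dict options of the getters describe.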

set_from_file(filename, variables=None, sizes=None, groups=None, delimiter=',', header=True)[source]

Set the Dataset from a file.

Parameters
  • filename (str) – file name.

  • variables (list(str)) – list of variable names.

  • sizes (dict(int)) – dictionary of variable sizes.

  • groups (dict(str)) – dictionary of variable groups.

  • delimiter (str) – field delimiter.

  • header (bool) – if True, read the variables names on the first line of the file. Default: True.
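The roles of delimiter and header can be illustrated with the standard csv module. This is a minimal sketch of reading such a file, not the parser GEMSEO uses: `read_delimited` is a hypothetical helper, and it reads from a string rather than a file path to stay self-contained.

```python
import csv
import io


def read_delimited(text, delimiter=",", header=True):
    """Parse delimited text into (variable names, rows of floats).

    If ``header`` is True, the first line provides the variable names;
    otherwise the names are returned as None. Hypothetical helper.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [row for row in reader if row]  # skip blank lines
    variables = None
    if header and rows:
        variables, rows = rows[0], rows[1:]
    data = [[float(value) for value in row] for row in rows]
    return variables, data


names, samples = read_delimited("x,y\n1,2\n3,4\n")
```

When header is False, every line is treated as a sample, so the variable names would have to be supplied separately, which is the purpose of the variables argument above.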

set_metadata(name, value)[source]

Set a metadata attribute.

Parameters
  • name (str) – metadata attribute name.

  • value – metadata attribute value.

property variables

Names of the variables.