dataset module¶
Dataset¶
The dataset module implements the concept of a dataset, which is a key element for machine learning, post-processing and data analysis.
A Dataset is an object defined by data stored as a dictionary of 2D numpy arrays, whose rows are samples, a.k.a. realizations, and whose columns are features, a.k.a. parameters or variables. The keys of this dictionary are either names of groups of variables or names of variables.
A Dataset is also defined by a list of variables names, a dictionary of variables sizes and a dictionary of variables groups.
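To make this layout concrete, here is a minimal sketch in plain NumPy of how a single 2D array can be described by variables names, sizes and groups, then stored as a dictionary indexed by group names. The variable names (`x`, `y`), the helper `split_by_group` and the sample values are hypothetical; the sketch illustrates the data model only and is not the library's implementation.

```python
import numpy as np

# Hypothetical sample data: 3 samples (rows), 3 feature columns.
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0],
                 [7.0, 8.0, 9.0]])

# Metadata describing the columns (illustrative names).
variables = ["x", "y"]                    # list of variables names
sizes = {"x": 2, "y": 1}                  # variable name -> number of columns
groups = {"x": "inputs", "y": "outputs"}  # variable name -> group name


def split_by_group(data, variables, sizes, groups):
    """Return a dictionary of 2D arrays indexed by group names."""
    blocks_by_group = {}
    start = 0
    for name in variables:
        stop = start + sizes[name]
        # Collect the column block of each variable under its group.
        blocks_by_group.setdefault(groups[name], []).append(data[:, start:stop])
        start = stop
    return {group: np.hstack(blocks) for group, blocks in blocks_by_group.items()}


by_group = split_by_group(data, variables, sizes, groups)
```

Every array in `by_group` keeps one row per sample, so the number of samples can be read from any entry.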
A Dataset can be set either from a numpy array or a file.
An AbstractFullCache or an OptimizationProblem can also be exported to a Dataset using AbstractFullCache.export_to_dataset() and OptimizationProblem.export_to_dataset() respectively.
From a Dataset, we can easily access its length and get the data, either as a 2D array or as dictionaries indexed by the variables names. We can get either the whole data, the data associated with a group or the data associated with a list of variables.
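The two return forms described above can be sketched with plain NumPy. The storage dictionary `store`, the helper `select` and the values are hypothetical and only illustrate the idea of returning either a dictionary indexed by variables names or a single 2D array; they do not reproduce the library's code.

```python
import numpy as np

# Hypothetical per-variable storage: each value is a 2D array (samples x size).
store = {
    "x": np.array([[1.0, 2.0], [4.0, 5.0]]),
    "y": np.array([[3.0], [6.0]]),
    "z": np.array([[0.5], [0.7]]),
}


def select(store, names, as_dict=True):
    """Return the data of the listed variables as a dict or as one 2D array."""
    if as_dict:
        return {name: store[name] for name in names}
    # Concatenate the column blocks in the requested order.
    return np.hstack([store[name] for name in names])


# The length of the data is its number of rows (samples).
n_samples = next(iter(store.values())).shape[0]
subset_dict = select(store, ["x", "y"])
subset_array = select(store, ["x", "y"], as_dict=False)
```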
It is also possible to export the Dataset to an AbstractFullCache or a pandas DataFrame.
-
class
gemseo.core.dataset.
Dataset
(name=None, by_group=True)[source]¶ Bases:
object
A Dataset is an object defined by data stored as a 2D numpy array, whose rows are samples, a.k.a. realizations, and whose columns are properties, a.k.a. parameters, variables or features. A dataset is also defined by a list of variables names, a dictionary of variables sizes and a dictionary of variables types. We can easily access its length and get the data, either as a 2D array or as a list of dictionaries indexed by the variables names. It is also possible to export the dataset to an AbstractFullCache or a pandas DataFrame.
Constructor.
- Parameters
name (str) – dataset name.
by_group (bool) – if True, store the data by group. Otherwise, store them by variables. Default: True.
-
DEFAULT_GROUP
= 'parameters'¶
-
DEFAULT_NAMES
= {'design_parameters': 'dp', 'functions': 'func', 'inputs': 'in', 'outputs': 'out', 'parameters': 'x'}¶
-
DESIGN_GROUP
= 'design_parameters'¶
-
FUNCTION_GROUP
= 'functions'¶
-
HDF5_CACHE
= 'HDF5Cache'¶
-
INPUT_GROUP
= 'inputs'¶
-
MEMORY_FULL_CACHE
= 'MemoryFullCache'¶
-
OUTPUT_GROUP
= 'outputs'¶
-
PARAMETER_GROUP
= 'parameters'¶
-
add_group
(group, data, variables=None, sizes=None, varname=None, cache_as_input=True)[source]¶ Add a group of variables.
- Parameters
group (str) – group name.
data (ndarray) – data.
variables (list(str)) – list of variables names.
sizes (dict) – dictionary of variables sizes.
varname (str) – variable name used if variables is None. If None, use the default variable name for group if it exists; otherwise, use the group name. Default: None.
cache_as_input (bool) – if True, cache as input when exporting to cache. Otherwise, cache as output. Default: True.
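The fallback order described for varname can be sketched as a small function. The helper name `resolve_varname` is hypothetical and the dictionary is copied from the DEFAULT_NAMES attribute documented above; this is an illustration of the documented rule, not the library's code, and it does not cover how names are derived when the data has several columns.

```python
# Values copied from the DEFAULT_NAMES attribute documented above.
DEFAULT_NAMES = {
    "design_parameters": "dp",
    "functions": "func",
    "inputs": "in",
    "outputs": "out",
    "parameters": "x",
}


def resolve_varname(group, varname=None):
    """Pick the variable name following the documented fallback order."""
    if varname is not None:
        return varname
    # Fall back to the group's default name, else to the group name itself.
    return DEFAULT_NAMES.get(group, group)
```

Under this sketch, a group listed in DEFAULT_NAMES such as 'inputs' resolves to its default name, while a group absent from the dictionary resolves to the group name itself.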
-
add_variable
(name, data, group='parameters', cache_as_input=True)[source]¶ Add variable.
- Parameters
name (str) – variable name.
data (ndarray) – data.
group (str) – group name. Default: DEFAULT_GROUP.
cache_as_input (bool) – if True, cache as input when exporting to cache. Otherwise, cache as output. Default: True.
-
export_to_cache
(inputs=None, outputs=None, cache_type='MemoryFullCache', cache_hdf_file=None, cache_hdf_node_name=None, **options)[source]¶ Export dataset to cache.
- Parameters
inputs (list(str)) – names of the inputs to cache. If None, use all inputs. Default: None.
outputs (list(str)) – names of the outputs to cache. If None, use all outputs. Default: None.
cache_type (str) – type of cache to use. Default: 'MemoryFullCache'.
cache_hdf_file (str) – the file to store the data, mandatory when HDF caching is used. Default: None.
cache_hdf_node_name (str) – name of the HDF dataset to store the discipline data. If None, self.name is used. Default: None.
-
export_to_dataframe
(copy=True)[source]¶ Export dataset to a pandas DataFrame.
- Parameters
copy (bool) – If True, copy data. Otherwise, use reference. Default: True.
-
get_all_data
(by_group=True, as_dict=False)[source]¶ Returns all data.
- Parameters
by_group (bool) – if True, return the data by group. Otherwise, return them by variables. Default: True.
as_dict (bool) – if True, return values as a dictionary. Default: False.
-
get_data_by_group
(group, as_dict=False)[source]¶ Returns data associated with a group.
- Parameters
group (str) – variable group.
as_dict (bool) – if True, return values as a dictionary. Default: False.
-
get_data_by_names
(names, as_dict=True)[source]¶ Get data by variables names.
- Parameters
names (list(str)) – list of variables names.
as_dict (bool) – if True, return values as a dictionary. Default: True.
-
get_group
(variable_name)[source]¶ Returns group for a given variable name.
- Parameters
variable_name (str) – variable name.
-
get_names
(group_name)[source]¶ Returns names for a given group.
- Parameters
group_name (str) – group name.
-
property
groups
¶ Names of the groups of variables.
-
is_group
(group_name)[source]¶ Returns True if group_name is a group.
- Parameters
group_name (str) – group name.
-
is_variable
(variable_name)[source]¶ Returns True if variable_name is a variable.
- Parameters
variable_name (str) – variable name.
-
property
n_samples
¶ Return the number of samples.
-
property
n_variables
¶ Return the number of variables.
-
n_variables_by_group
(group)[source]¶ Return the number of variables for a group.
- Parameters
group (str) – group name.
-
plot
(name, show=True, save=False, **options)[source]¶ Finds the appropriate library and executes the post-processing on the dataset.
- Parameters
name (str) – the post-processing name.
show (bool) – if True, show the figure. Default: True.
save (bool) – if True, save the figure. Default: False.
options – options for the post-processing method, see its package.
-
set_from_array
(data, variables=None, sizes=None, groups=None, default_name=None)[source]¶ Set Dataset from a numpy array or a dictionary of arrays
- Parameters
data (array) – dataset.
variables (list(str)) – list of variables names.
sizes (dict(int)) – dictionary of variables sizes.
groups (dict(str)) – dictionary of variables groups.
default_name (str) – default variable name.
-
set_from_file
(filename, variables=None, sizes=None, groups=None, delimiter=',', header=True)[source]¶ Set Dataset from a file.
- Parameters
filename (str) – file name.
variables (list(str)) – list of variables names.
sizes (dict(int)) – dictionary of variables sizes.
groups (dict(str)) – dictionary of variables groups.
delimiter (str) – field delimiter.
header (bool) – if True, read the variables names on the first line of the file. Default: True.
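The roles of delimiter and header can be illustrated with a short sketch that reads a delimited text table into column names and a 2D array. The helper `read_table`, the column names and the in-memory file content are hypothetical; this sketch uses the standard csv module and NumPy and is not how the library itself parses files.

```python
import csv
import io

import numpy as np

# Hypothetical file content; io.StringIO stands in for a file on disk.
text = "a,b,c\n1.0,2.0,3.0\n4.0,5.0,6.0\n"


def read_table(stream, delimiter=",", header=True):
    """Return (column names or None, 2D array) read from a delimited stream."""
    rows = list(csv.reader(stream, delimiter=delimiter))
    # When header is True, the first line holds the names, not data.
    names = rows.pop(0) if header else None
    data = np.array([[float(value) for value in row] for row in rows])
    return names, data


names, data = read_table(io.StringIO(text))
```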
-
set_metadata
(name, value)[source]¶ Set metadata attribute.
- Parameters
name (str) – metadata attribute name.
value – metadata attribute value.
-
property
variables
¶ Names of variables.