gemseo.datasets.dataset module
A generic data structure with entries and variables.
The concept of dataset is a key element for machine learning, post-processing, data analysis, ...
A Dataset is a pandas MultiIndex DataFrame storing series of data representing the values of multidimensional features belonging to different groups of features.
A Dataset can be set either from a file (from_csv() and from_txt()) or from a NumPy array (from_array()), and can be enriched from a group of variables (add_group()) or from a single variable (add_variable()).
A BaseFullCache or an OptimizationProblem can also be exported to a Dataset using the methods BaseFullCache.to_dataset() and OptimizationProblem.to_dataset().
- class Dataset(data=None, index=None, columns=None, dtype=None, copy=None, *, dataset_name='')
Bases: DataFrame
A generic data structure with entries and variables.
A variable is defined by a name and a number of components. For instance, the variable "x" can have 2 components: 0 and 1. Or the variable "y" can have 4 components: "a", "b", "c" and "d".
A variable belongs to a group of variables (default: DEFAULT_GROUP). Two variables can have the same name; only the tuple (group_name, variable_name) is unique and is therefore called a variable identifier.
A Dataset is a collection of entries corresponding to a set of variable identifiers. An entry corresponds to an index of the Dataset.
A Dataset is a special pandas DataFrame with the multi-index (group_name, variable_name, component). It must be built from the methods add_variable(), add_group(), from_array(), from_txt(), from_csv() and from_dataframe().
Miscellaneous information that is not specific to an entry of the dataset can be stored in the dictionary misc, as dataset.misc["year"] = 2023.
Warning
A Dataset behaves like any multi-index DataFrame but its instantiation using the constructor dataset = Dataset(data, ...) can lead to some inconsistencies (multi-index levels, index values, dtypes, ...). Hence, the construction from the dedicated methods is recommended, e.g. dataset = Dataset(); dataset.add_variable("x", data).
Notes
The columns of a data structure (NumPy array, DataFrame, Dataset, ...) are called features. The features of a Dataset include all the components of all the variables of all the groups.
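The three-level column structure described above can be illustrated with plain pandas. This is only a sketch of the layout, not the way a Dataset should be instantiated (the dedicated methods remain the recommended route); the group names "inputs" and "outputs" and the values are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Column levels mirror (group_name, variable_name, component).
# The group names "inputs" and "outputs" are hypothetical.
columns = pd.MultiIndex.from_tuples(
    [
        ("inputs", "x", 0),
        ("inputs", "x", 1),
        ("outputs", "y", 0),
    ],
    names=["GROUP", "VARIABLE", "COMPONENT"],
)

# Three entries (rows) and three features (columns).
frame = pd.DataFrame(np.arange(9.0).reshape(3, 3), columns=columns)

# Variable identifiers are the unique (group_name, variable_name) pairs.
variable_identifiers = sorted({(group, name) for group, name, _ in frame.columns})

# Select all the components of the variable "x" of the group "inputs".
x_values = frame.loc[:, ("inputs", "x")].to_numpy()
```

Here the variable "x" has two components and "y" has one, so the frame has three features for two variable identifiers.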
- add_group(group_name, data, variable_names=(), variable_names_to_n_components=None)
Add the data related to a new group.
- Parameters:
group_name (str) -- The name of the group.
data (DataType) -- The data.
variable_names (StrColumnType) --
The names of the variables. If empty, use DEFAULT_VARIABLE_NAME.
By default it is set to ().
variable_names_to_n_components (dict[str, int] | None) -- The number of components of the variables. If variable_names is empty, this argument is not considered. If None, assume that all the variables have a single component.
- Raises:
ValueError -- If the group already exists.
- Return type:
None
- add_variable(variable_name, data, group_name='parameters', components=())
Add the data related to a variable.
If the variable does not exist, it is added. If the variable already exists, non-existing components can be added, when specified. It is impossible to add components that have already been added.
- Parameters:
- Return type:
None
Warning
The shape of data must be consistent with the number of entries of the dataset. If the dataset is empty, the number of entries will be deduced from data.
Notes
The data can be passed as:
- an array shaped as (n_entries, n_features),
- an array shaped as (1, n_features) that will be reshaped as (n_entries, n_features) by replicating the original array n_entries times,
- an array shaped as (n_entries,) that will be reshaped as (n_entries, 1),
- a scalar that will be converted into an array shaped as (n_entries, 1) if components is empty or (n_entries, n_features) where n_features will be deduced from components.
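The four accepted forms of data can be mimicked with NumPy. The helper below is hypothetical (to_2d is not part of GEMSEO) and only reproduces the reshaping rules listed above, assuming that n_entries and n_features are already known.

```python
import numpy as np


def to_2d(data, n_entries, n_features=1):
    """Return data as an array shaped as (n_entries, n_features).

    Hypothetical helper: it mirrors the documented rules, not GEMSEO's code.
    """
    array = np.asarray(data)
    if array.ndim == 0:
        # A scalar fills the whole (n_entries, n_features) array.
        return np.full((n_entries, n_features), array)
    if array.ndim == 1:
        # (n_entries,) becomes a single-feature column (n_entries, 1).
        return array.reshape(-1, 1)
    if array.shape[0] == 1:
        # (1, n_features) is replicated n_entries times.
        return np.repeat(array, n_entries, axis=0)
    # Already shaped as (n_entries, n_features).
    return array
```

For instance, to_2d([[1.0, 2.0]], 3) returns a (3, 2) array whose three rows are all equal to [1.0, 2.0].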
- Raises:
ValueError -- If the group already has the added components of the variable named variable_name.
- classmethod from_array(data, variable_names=(), variable_names_to_n_components=mappingproxy({}), variable_names_to_group_names=mappingproxy({}))
Create a dataset from a NumPy array.
- Parameters:
data (ndarray | Iterable[Any] | Any) -- The data to be stored, with the shape (n_entries, n_components).
variable_names (str | Iterable[str]) --
The names of the variables. If empty, use default names.
By default it is set to ().
variable_names_to_n_components (dict[str, int]) --
The number of components of the variables. If empty, assume that all the variables have a single component. Ignored if variable_names is empty.
By default it is set to {}.
variable_names_to_group_names (dict[str, str]) --
The groups of the variables. If empty, use Dataset.DEFAULT_GROUP for all the variables. Ignored if variable_names is empty.
By default it is set to {}.
- Returns:
A dataset built from the NumPy array.
- Return type:
- classmethod from_csv(file_path, delimiter=',', first_column_as_index=True)
Set the dataset from a CSV file.
The first three rows contain the values of the multi-index (group_name, variable_name, component).
See also
If the file does not contain multi-index information but just an array, the method from_txt() is better suited.
- Parameters:
- Returns:
A dataset built from the CSV file.
- Return type:
- classmethod from_dataframe(dataframe)
Create a Dataset from a pandas DataFrame.
- Parameters:
dataframe (DataFrame) -- The pandas DataFrame whose columns attribute is either a 3-depth pandas.MultiIndex or a sequence of 3-length tuples and strings. The items of the 3-length tuples and 3-depth pandas.MultiIndex correspond to COLUMN_LEVEL_NAMES in this order. In the case of a string, it corresponds to the variable name and the DEFAULT_GROUP is used.
- Returns:
The dataset built from the pandas DataFrame.
- Raises:
ValueError -- If the columns attribute is neither a 3-depth pandas.MultiIndex, nor a sequence of 3-length tuples and strings.
- Return type:
- classmethod from_txt(file_path, variable_names=(), variable_names_to_n_components=mappingproxy({}), variable_names_to_group_names=mappingproxy({}), delimiter=',', header=True)
Create a dataset from a text file.
See also
If the file contains multi-index information and not just an array, the from_csv() method is better suited.
- Parameters:
file_path (Path | str) -- The path to the file containing the data.
variable_names (Iterable[str]) --
The names of the variables. If empty and header is True, read the names from the first line of the file. If empty and header is False, use default names based on the patterns of DEFAULT_NAMES associated with the different groups.
By default it is set to ().
variable_names_to_n_components (dict[str, int]) --
The number of components of the variables. If empty, assume that all the variables have a single component.
By default it is set to {}.
variable_names_to_group_names (dict[str, str]) --
The groups of the variables. If empty, use DEFAULT_GROUP for all the variables.
By default it is set to {}.
delimiter (str) --
The field delimiter.
By default it is set to ",".
header (bool) --
Whether to read the names of the variables on the first line of the file.
By default it is set to True.
Note
When the variable_names are not provided and header is True, the file is accessed twice: during the first access, only the first line is read to retrieve the variable names; during the second access, reading starts from the second line and continues to the end of the file.
- Returns:
A dataset built from the text file.
- Return type:
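The two-pass reading described in the note can be sketched with the standard library and NumPy. This is an illustration of the idea under simple assumptions (a comma-delimited file with a one-line header), not the code used by from_txt(); the file name and its content are made up.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as directory:
    # Hypothetical file with a header line followed by two entries.
    file_path = Path(directory) / "data.txt"
    file_path.write_text("a,b\n1.0,2.0\n3.0,4.0\n")

    # First access: read only the first line to get the variable names.
    with file_path.open() as stream:
        variable_names = stream.readline().strip().split(",")

    # Second access: read the values from the second line to the end.
    data = np.loadtxt(file_path, delimiter=",", skiprows=1)
```

Passing variable_names explicitly would make the first access unnecessary, which is why the note applies only when the names are read from the header.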
- get_columns(variable_names: Iterable[str] = (), as_tuple: Literal[False] = False) -> list[str]
- get_columns(variable_names: Iterable[str] = (), as_tuple: Literal[True] = True) -> list[tuple[str, str, int]]
Return the columns based on variable names.
- Parameters:
variable_names -- The names of the variables. If empty, use all the variables.
as_tuple -- Whether the variable identifiers are returned as tuples.
- Returns:
The columns, either as a variable identifier (group_name, variable_name, component) or as a variable component name "variable_name[component]" (or "variable_name" if the dimension of the variable is 1).
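The naming rule for the returned columns can be written in a few lines of pure Python. The function to_column_names is hypothetical and works on a plain list of (group_name, variable_name, component) tuples rather than on a Dataset; the group names are made up.

```python
def to_column_names(columns, as_tuple=False):
    """Format columns given as (group_name, variable_name, component) tuples.

    Hypothetical helper illustrating the documented naming rule only.
    """
    if as_tuple:
        return list(columns)

    # Count the components of each (group_name, variable_name) pair.
    sizes = {}
    for group, name, _ in columns:
        sizes[(group, name)] = sizes.get((group, name), 0) + 1

    # Use "name" for one-dimensional variables, "name[component]" otherwise.
    return [
        name if sizes[(group, name)] == 1 else f"{name}[{component}]"
        for group, name, component in columns
    ]


columns = [("inputs", "x", 0), ("inputs", "x", 1), ("outputs", "y", 0)]
names = to_column_names(columns)
```

With these columns, the two components of "x" are suffixed with their component while the one-dimensional "y" keeps its bare name.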
- get_group_names(variable_name)
Return the names of the groups that contain a variable.
- Parameters:
variable_name (str) -- The name of the variable.
- Returns:
The names of the groups that contain the variable.
- Return type:
Warning
The names are sorted with the Python function sorted.
- get_normalized(excluded_variable_names=(), excluded_group_names=(), use_min_max=True, center=False, scale=False)
Return a normalized copy of the dataset.
- Parameters:
excluded_variable_names (str | Iterable[str]) --
The names of the variables not to be normalized. If empty, normalize all the variables.
By default it is set to ().
excluded_group_names (str | Iterable[str]) --
The names of the groups not to be normalized. If empty, normalize all the groups.
By default it is set to ().
use_min_max (bool) --
Whether to use the geometric normalization \((x-\min(x))/(\max(x)-\min(x))\).
By default it is set to True.
center (bool) --
Whether to center the variables so that they have a zero mean.
By default it is set to False.
scale (bool) --
Whether to scale the variables so that they have a unit variance.
By default it is set to False.
- Returns:
A normalized dataset.
- Return type:
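The three normalization options can be illustrated feature by feature with NumPy. This sketch assumes that the statistics are computed per column and that the variance uses the population convention (ddof=0); neither the convention used by get_normalized() nor the order in which combined options are applied is stated here, so treat both as assumptions.

```python
import numpy as np

# Three entries and two features with made-up values.
values = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])

# use_min_max: (x - min(x)) / (max(x) - min(x)), column by column.
lower = values.min(axis=0)
upper = values.max(axis=0)
min_max = (values - lower) / (upper - lower)

# center: subtract the mean so that each column has a zero mean.
centered = values - values.mean(axis=0)

# scale: divide by the standard deviation to get a unit variance
# (population standard deviation, an assumption of this sketch).
standardized = centered / values.std(axis=0)
```

After the min-max step each column spans the interval [0, 1]; after centering and scaling each column has zero mean and unit variance.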
- get_variable_components(group_name, variable_name)
Return the components of a given variable.
- Parameters:
- Return type:
Notes
Ensure compatibility with pandas 1 and 2 by returning an empty list if a KeyError is raised.
- get_variable_names(group_name)
Return the names of the variables contained in a group.
- Parameters:
group_name (str)
- Returns:
The names of the variables contained in the group.
- Return type:
Notes
Ensure compatibility with pandas 1 and 2 by returning an empty list if a KeyError is raised.
Warning
The names are sorted with the Python function sorted.
- get_view(group_names=(), variable_names=(), components=(), indices=())
Return a specific group of rows and columns of the dataset.
- Parameters:
group_names (str | Iterable[str]) --
The name(s) of the group(s). If empty, consider all the groups.
By default it is set to ().
variable_names (str | Iterable[str]) --
The name(s) of the variable(s). If empty, consider all the variables of the considered groups.
By default it is set to ().
components (int | Iterable[int]) --
The component(s). If empty, consider all the components of the considered variables.
By default it is set to ().
indices (str | int | Iterable[str | int]) --
The index (indices) of the dataset corresponding to these data. If empty, consider all the indices.
By default it is set to ().
Notes
The order asked by the user is preserved for the returned Dataset. See also loc().
- Returns:
The specific group of rows and columns of the dataset.
- Return type:
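The order-preserving selection mentioned in the notes behaves like label-based selection with loc on a multi-index DataFrame. The sketch below uses plain pandas rather than get_view(), with hypothetical group names and values.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("inputs", "x", 0), ("inputs", "x", 1), ("outputs", "y", 0)],
    names=["GROUP", "VARIABLE", "COMPONENT"],
)
frame = pd.DataFrame(np.arange(9.0).reshape(3, 3), columns=columns)

# Ask for the rows 2 then 0 and for "y" before the first component of "x":
# the result keeps the requested order of both rows and columns.
view = frame.loc[[2, 0], [("outputs", "y", 0), ("inputs", "x", 0)]]
```

The resulting frame has the row 2 before the row 0 and the column of "y" before the column of "x", exactly as requested.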
- to_dict_of_arrays(by_group=True, by_entry=False)
Convert the dataset into a dictionary of NumPy arrays.
- Parameters:
by_group (bool) --
Whether the data are returned as {group_name: {variable_name: variable_values}}. Otherwise, the data are returned either as {variable_name: variable_values} if only one group contains the variable variable_name or as {f"{group_name}:{variable_name}": variable_values} if at least two groups contain the variable variable_name.
By default it is set to True.
by_entry (bool) --
Whether the data are returned as [{group_name: {variable_name: variable_value_1}}, ...], [{variable_name: variable_value_1}, ...] or [{f"{group_name}:{variable_name}": variable_value_1}, ...] according to by_group. Otherwise, the data are returned as {group_name: {variable_name: variable_value_1}}, {variable_name: variable_value_1} or {f"{group_name}:{variable_name}": variable_value_1}.
By default it is set to False.
- Returns:
The dataset expressed as a dictionary of NumPy arrays.
- Return type:
dict[str, ndarray | dict[str, ndarray]] | list[dict[str, ndarray | dict[str, ndarray]]]
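The key naming used when by_group is False can be sketched in pure Python. The function flat_keys is hypothetical and takes a list of (group_name, variable_name) identifiers instead of a Dataset; it only shows when the group name is prefixed to the variable name.

```python
def flat_keys(variable_identifiers):
    """Return the dictionary keys for (group_name, variable_name) pairs.

    Hypothetical helper illustrating the documented rule for by_group=False.
    """
    # Count in how many groups each variable name appears.
    counts = {}
    for _, name in variable_identifiers:
        counts[name] = counts.get(name, 0) + 1

    # Prefix with the group name only when the variable name is ambiguous.
    return [
        name if counts[name] == 1 else f"{group}:{name}"
        for group, name in variable_identifiers
    ]


keys = flat_keys([("inputs", "x"), ("outputs", "x"), ("outputs", "y")])
```

Here "x" belongs to two groups and is therefore disambiguated, whereas "y" belongs to a single group and keeps its bare name.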
- transform_data(transformation, group_names=(), variable_names=(), components=(), indices=())
Transform some data of the dataset.
- Parameters:
transformation (Callable[[ndarray], ndarray]) -- The function transforming the variable, e.g. "lambda x: 2*x".
group_names (str | Iterable[str]) --
The name(s) of the group(s) corresponding to these data. If empty, consider all the groups.
By default it is set to ().
variable_names (str | Iterable[str]) --
The name(s) of the variable(s) corresponding to these data. If empty, consider all the variables of the considered groups.
By default it is set to ().
components (int | Iterable[int]) --
The component(s) corresponding to these data. If empty, consider all the components of the considered variables.
By default it is set to ().
indices (str | int | Iterable[str | int]) --
The index (indices) of the dataset corresponding to these data. If empty, consider all the indices.
By default it is set to ().
- Return type:
None
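Applying a transformation to a subset of the columns can be pictured with plain pandas. The sketch below doubles the values of one hypothetical group ("inputs") in place and leaves the other columns untouched; it illustrates the idea and is not the implementation of transform_data().

```python
import numpy as np
import pandas as pd


def double(x):
    """Example transformation, equivalent to lambda x: 2*x."""
    return 2 * x


columns = pd.MultiIndex.from_tuples(
    [("inputs", "x", 0), ("inputs", "x", 1), ("outputs", "y", 0)],
    names=["GROUP", "VARIABLE", "COMPONENT"],
)
frame = pd.DataFrame(np.ones((2, 3)), columns=columns)

# Select the columns of the group "inputs" and transform them in place.
mask = frame.columns.get_level_values("GROUP") == "inputs"
frame.loc[:, mask] = double(frame.loc[:, mask].to_numpy())
```

The transformation receives and returns a NumPy array, which matches the Callable[[ndarray], ndarray] signature expected for the transformation argument.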
- update_data(data, group_names=(), variable_names=(), components=(), indices=())
Replace some existing indices and columns by a deepcopy of data.
- Parameters:
data (ndarray | Iterable[Any] | Any) -- The new data to be inserted in the dataset.
group_names (str | Iterable[str]) --
The name(s) of the group(s) corresponding to these data. If empty, consider all the groups.
By default it is set to ().
variable_names (str | Iterable[str]) --
The name(s) of the variable(s) corresponding to these data. If empty, consider all the variables of the considered groups.
By default it is set to ().
components (int | Iterable[int]) --
The component(s) corresponding to these data. If empty, consider all the components of the considered variables.
By default it is set to ().
indices (str | int | Iterable[str | int]) --
The index (indices) of the dataset into which these data are to be inserted. If empty, consider all the indices.
By default it is set to ().
- Return type:
None
Notes
Changing the type of data can turn a view into a copy.
- COLUMN_LEVEL_NAMES: Final[tuple[str, str, str]] = ('GROUP', 'VARIABLE', 'COMPONENT')
The names of the column levels of the multi-index DataFrame.
- DEFAULT_VARIABLE_NAME: ClassVar[str] = 'x'
The default name for the variable set with add_group().
- property group_names: list[str]
The names of the groups of variables in alphabetical order.
Warning
The names are sorted with the Python function sorted.
- property group_names_to_n_components: dict[str, int]
The names of the groups bound to their number of components.
- property variable_identifiers: list[tuple[str, str]]
The variable identifiers.
A variable identifier is the tuple (group_name, variable_name).
Notes
A variable name can belong to more than one group while a variable identifier is unique as a group name is unique.
Warning
The names are sorted with the Python function sorted.