Dataset

In this example, we will see how to build and manipulate a Dataset.

From a conceptual point of view, a Dataset is a tabular data structure whose rows are the entries, a.k.a. observations or indices, and whose columns are the features, a.k.a. quantities of interest. These features can be grouped by variable identifier, which is a tuple (group_name, variable_name); the number of features sharing an identifier is the number of components of the variable, a.k.a. its dimension. A feature is thus identified by a tuple (group_name, variable_name, component).
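To make this vocabulary concrete, the following plain-Python sketch enumerates the features of two variables. The names variable_sizes and features are illustrative assumptions and are not part of GEMSEO.

```python
# Hypothetical variable identifiers (group_name, variable_name) with their dimensions.
variable_sizes = {("parameters", "a"): 2, ("inputs", "b"): 3}

# A feature is a tuple (group_name, variable_name, component).
features = [
    (group_name, variable_name, component)
    for (group_name, variable_name), size in variable_sizes.items()
    for component in range(size)
]

print(features)
```

Each variable identifier contributes as many features, i.e. columns, as the variable has components.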

From a software point of view, a Dataset is a particular pandas DataFrame.

from __future__ import annotations

from numpy import array
from pandas import DataFrame

from gemseo.datasets.dataset import Dataset

Instantiation

At instantiation,

dataset = Dataset()

a dataset has the same name as its class:

dataset.name
'Dataset'

We can use a more appropriate name at instantiation:

dataset_with_custom_name = Dataset(dataset_name="Measurements")
dataset_with_custom_name.name
'Measurements'

or change it after instantiation:

dataset_with_custom_name.name = "simulations"
dataset_with_custom_name.name
'simulations'

Let us check that the class Dataset derives from pandas.DataFrame:

isinstance(dataset, DataFrame)
True
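As an illustration only, here is a minimal sketch of how a DataFrame subclass can carry a name that defaults to its class name. NamedFrame is a hypothetical class written for this example; it is not how GEMSEO implements Dataset.

```python
from pandas import DataFrame


class NamedFrame(DataFrame):
    """A hypothetical DataFrame subclass carrying a name."""

    # _metadata is the pandas hook for declaring extra attributes on a subclass.
    _metadata = ["name"]

    def __init__(self, *args, name: str = "", **kwargs) -> None:
        super().__init__(*args, **kwargs)
        # Fall back to the class name when no name is given.
        self.name = name or self.__class__.__name__


frame = NamedFrame()
print(frame.name, isinstance(frame, DataFrame))
```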

Add a variable

Then, we can add data by variable name:

dataset.add_variable("a", array([[1, 2], [3, 4]]))
dataset
GROUP      parameters
VARIABLE            a
COMPONENT           0  1
0                   1  2
1                   3  4


Note that the columns of the dataset use the multi-level index (GROUP, VARIABLE, COMPONENT).
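For readers less familiar with multi-level columns, the sketch below builds a comparable structure with plain pandas. It only mimics the layout shown above and does not use the Dataset class.

```python
from numpy import array
from pandas import DataFrame, MultiIndex

# Column labels are tuples (GROUP, VARIABLE, COMPONENT).
columns = MultiIndex.from_tuples(
    [("parameters", "a", 0), ("parameters", "a", 1)],
    names=["GROUP", "VARIABLE", "COMPONENT"],
)
frame = DataFrame(array([[1, 2], [3, 4]]), columns=columns)

# Select the two components of "a" in the group "parameters".
print(frame.loc[:, ("parameters", "a")].to_numpy())
```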

By default, the variable is placed in the group

dataset.DEFAULT_GROUP
'parameters'

The argument group_name lets us use another group:

dataset.add_variable("b", array([[-1, -2, -3], [-4, -5, -6]]), "inputs")
dataset
GROUP      parameters     inputs
VARIABLE            a          b
COMPONENT           0  1       0   1   2
0                   1  2      -1  -2  -3
1                   3  4      -4  -5  -6


By default, the components of a variable of dimension 2 are 0 and 1. We can use other values with the argument components:

dataset.add_variable("c", array([[1.5], [3.5]]), components=[3])
dataset
GROUP      parameters     inputs          parameters
VARIABLE            a          b                   c
COMPONENT           0  1       0   1   2           3
0                   1  2      -1  -2  -3         1.5
1                   3  4      -4  -5  -6         3.5


Add a group of variables

Note that the data can also be added by group:

dataset.add_group(
    "G1", array([[-1.1, -2.1, -3.1], [-4.1, -5.1, -6.1]]), ["p", "q"], {"p": 2, "q": 1}
)
dataset
GROUP      parameters     inputs          parameters    G1
VARIABLE            a          b                   c     p           q
COMPONENT           0  1       0   1   2           3     0     1     0
0                   1  2      -1  -2  -3         1.5  -1.1  -2.1  -3.1
1                   3  4      -4  -5  -6         3.5  -4.1  -5.1  -6.1
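In the call above, the sizes {"p": 2, "q": 1} indicate how the three columns of the data array are shared between "p" and "q". A possible way to picture this split with NumPy is sketched below; it illustrates the idea and is not GEMSEO's implementation.

```python
from numpy import array, cumsum, split

data = array([[-1.1, -2.1, -3.1], [-4.1, -5.1, -6.1]])
variable_names = ["p", "q"]
variable_names_to_n_components = {"p": 2, "q": 1}

# Column indices at which a new variable starts (the first one, 0, is implicit).
sizes = [variable_names_to_n_components[name] for name in variable_names]
boundaries = cumsum(sizes)[:-1]

# One block of columns per variable.
blocks = dict(zip(variable_names, split(data, boundaries, axis=1)))

print(blocks["p"].shape, blocks["q"].shape)
```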


The dimensions of the variables {"p": 2, "q": 1} are not mandatory when the number of variable names is equal to the number of columns of the data array:

dataset.add_group("G2", array([[1.1, 2.1, 3.1], [4.1, 5.1, 6.1]]), ["x", "y", "z"])
dataset
GROUP      parameters     inputs          parameters    G1               G2
VARIABLE            a          b                   c     p           q    x    y    z
COMPONENT           0  1       0   1   2           3     0     1     0    0    0    0
0                   1  2      -1  -2  -3         1.5  -1.1  -2.1  -3.1  1.1  2.1  3.1
1                   3  4      -4  -5  -6         3.5  -4.1  -5.1  -6.1  4.1  5.1  6.1


In the same way, the variable names are not mandatory; when they are missing, a single variable named "x" is used, with a dimension equal to the number of columns of the data array:

dataset.add_group("G3", array([[1.2, 2.2], [3.2, 4.2]]))
dataset
GROUP      parameters     inputs          parameters    G1               G2             G3
VARIABLE            a          b                   c     p           q    x    y    z    x
COMPONENT           0  1       0   1   2           3     0     1     0    0    0    0    0    1
0                   1  2      -1  -2  -3         1.5  -1.1  -2.1  -3.1  1.1  2.1  3.1  1.2  2.2
1                   3  4      -4  -5  -6         3.5  -4.1  -5.1  -6.1  4.1  5.1  6.1  3.2  4.2


Convert to a dictionary of arrays

Sometimes, it can be useful to have a dictionary view of the dataset with NumPy arrays as values:

dataset.to_dict_of_arrays()
{'G1': {'p': array([[-1.1, -2.1],
       [-4.1, -5.1]]), 'q': array([[-3.1],
       [-6.1]])}, 'G2': {'x': array([[1.1],
       [4.1]]), 'y': array([[2.1],
       [5.1]]), 'z': array([[3.1],
       [6.1]])}, 'G3': {'x': array([[1.2, 2.2],
       [3.2, 4.2]])}, 'inputs': {'b': array([[-1, -2, -3],
       [-4, -5, -6]])}, 'parameters': {'a': array([[1, 2],
       [3, 4]]), 'c': array([[1.5],
       [3.5]])}}
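To show what such a nested view amounts to, the sketch below rebuilds one from a plain pandas DataFrame with (GROUP, VARIABLE, COMPONENT) columns. The frame and the loop are assumptions made for the illustration; GEMSEO's own implementation may differ.

```python
from numpy import array
from pandas import DataFrame, MultiIndex

columns = MultiIndex.from_tuples(
    [("G1", "p", 0), ("G1", "p", 1), ("G1", "q", 0), ("inputs", "b", 0)],
    names=["GROUP", "VARIABLE", "COMPONENT"],
)
frame = DataFrame(
    array([[-1.1, -2.1, -3.1, -1.0], [-4.1, -5.1, -6.1, -4.0]]), columns=columns
)

groups = frame.columns.get_level_values("GROUP")
variables = frame.columns.get_level_values("VARIABLE")

nested = {}
# dict.fromkeys removes duplicated (group, variable) pairs while keeping their order.
for group, variable in dict.fromkeys(zip(groups, variables)):
    mask = (groups == group) & (variables == variable)
    nested.setdefault(group, {})[variable] = frame.loc[:, mask].to_numpy()

print(nested["G1"]["p"])
```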

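We can also flatten this dictionary; a variable name shared by several groups is then prefixed with its group name, as in 'G2:x' and 'G3:x':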

dataset.to_dict_of_arrays(False)
{'p': array([[-1.1, -2.1],
       [-4.1, -5.1]]), 'q': array([[-3.1],
       [-6.1]]), 'G2:x': array([[1.1],
       [4.1]]), 'y': array([[2.1],
       [5.1]]), 'z': array([[3.1],
       [6.1]]), 'G3:x': array([[1.2, 2.2],
       [3.2, 4.2]]), 'b': array([[-1, -2, -3],
       [-4, -5, -6]]), 'a': array([[1, 2],
       [3, 4]]), 'c': array([[1.5],
       [3.5]])}

Get information

Some properties

At any time, we can access the names of the groups of variables:

dataset.group_names
['G1', 'G2', 'G3', 'inputs', 'parameters']

and the total number of components per group:

dataset.group_names_to_n_components
{'G1': 3, 'G2': 3, 'G3': 2, 'inputs': 3, 'parameters': 3}

Concerning the variables, note that we can use the same variable name in two different groups. The (unique) variable names can be accessed with

dataset.variable_names
['a', 'b', 'c', 'p', 'q', 'x', 'y', 'z']

while the total number of components per variable name can be accessed with

dataset.variable_names_to_n_components
{'a': 2, 'b': 3, 'c': 1, 'p': 2, 'q': 1, 'x': 3, 'y': 1, 'z': 1}
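These counts are simple tallies over the column tuples. The following plain-Python sketch, based on a hypothetical subset of the columns, shows why "x" has three components in total: one in "G2" and two in "G3".

```python
from collections import Counter

# Hypothetical subset of the columns, as (group_name, variable_name, component).
columns = [("G2", "x", 0), ("G2", "y", 0), ("G3", "x", 0), ("G3", "x", 1)]

# Number of components per group and per variable name.
group_sizes = Counter(group for group, _, _ in columns)
variable_sizes = Counter(variable for _, variable, _ in columns)

print(dict(group_sizes), dict(variable_sizes))
```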

Lastly, the variable identifiers (group_name, variable_name) can be accessed with

dataset.variable_identifiers
[('G1', 'p'), ('G1', 'q'), ('G2', 'x'), ('G2', 'y'), ('G2', 'z'), ('G3', 'x'), ('inputs', 'b'), ('parameters', 'a'), ('parameters', 'c')]

Some getters

We can also easily access the groups containing a variable:

dataset.get_group_names("x")
['G2', 'G3']

and the names of the variables included in a group:

dataset.get_variable_names("G1")
['p', 'q']

The components of a variable located in a group can be accessed with

dataset.get_variable_components("G2", "y")
[0]

Lastly, the columns of the dataset have string representations:

dataset.get_columns()
['a[0]', 'a[1]', 'b[0]', 'b[1]', 'b[2]', 'c', 'p[0]', 'p[1]', 'q', 'x', 'y', 'z', 'x[0]', 'x[1]']

that can be split into tuples:

dataset.get_columns(as_tuple=True)
[('parameters', 'a', 0), ('parameters', 'a', 1), ('inputs', 'b', 0), ('inputs', 'b', 1), ('inputs', 'b', 2), ('parameters', 'c', 3), ('G1', 'p', 0), ('G1', 'p', 1), ('G1', 'q', 0), ('G2', 'x', 0), ('G2', 'y', 0), ('G2', 'z', 0), ('G3', 'x', 0), ('G3', 'x', 1)]
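Judging from the two outputs above, the string representation drops the group name and shows the component only when the variable has several components. A plain-Python sketch of this convention, inferred from the outputs and not taken from GEMSEO's code, could be:

```python
from collections import Counter

# Hypothetical subset of the columns, as (group_name, variable_name, component).
columns = [
    ("parameters", "a", 0),
    ("parameters", "a", 1),
    ("parameters", "c", 3),
    ("G3", "x", 0),
    ("G3", "x", 1),
]

# Number of components per variable identifier (group_name, variable_name).
sizes = Counter((group, variable) for group, variable, _ in columns)

# Show the component only for variables with more than one component.
labels = [
    variable if sizes[(group, variable)] == 1 else f"{variable}[{component}]"
    for group, variable, component in columns
]

print(labels)
```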

We can also consider a subset of the columns:

dataset.get_columns(["c", "y"])
['c', 'y']

Renaming

It is quite easy to rename a group:

dataset.rename_group("G1", "foo")
dataset.group_names
['G2', 'G3', 'foo', 'inputs', 'parameters']

or a variable:

dataset.rename_variable("x", "bar", "G2")
dataset.rename_variable("y", "baz")
dataset.variable_names
['a', 'b', 'bar', 'baz', 'c', 'p', 'q', 'x', 'z']

Note that the group name "G2" restricts the renaming of "x" to "G2"; without this information, the method would have renamed "x" in both "G2" and "G3".
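The effect of the optional group name can be pictured with the following plain-Python sketch acting on column tuples. The function rename_variable below is a hypothetical stand-in written for this illustration, not the GEMSEO method.

```python
def rename_variable(columns, old_name, new_name, group_name=None):
    """Rename a variable, in all groups or only in the given group."""
    return [
        (
            group,
            # Rename only when the variable matches and the group is not excluded.
            new_name if variable == old_name and group_name in (None, group) else variable,
            component,
        )
        for group, variable, component in columns
    ]


columns = [("G2", "x", 0), ("G3", "x", 0), ("G3", "x", 1)]
print(rename_variable(columns, "x", "bar", "G2"))
print(rename_variable(columns, "x", "bar"))
```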

Transform a variable

We can transform the data associated with a variable by applying a function to the corresponding NumPy array, for instance to double the values:

dataset.transform_data(lambda x: 2 * x, variable_names="bar")
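With plain pandas, a comparable transformation can be sketched by selecting the columns of the variable with a mask and writing the transformed array back. The frame below is a small stand-in for the dataset, and the approach is an assumption made for the illustration.

```python
from numpy import array
from pandas import DataFrame, MultiIndex

columns = MultiIndex.from_tuples(
    [("G2", "bar", 0), ("G3", "x", 0)],
    names=["GROUP", "VARIABLE", "COMPONENT"],
)
frame = DataFrame(array([[1.1, 1.2], [4.1, 3.2]]), columns=columns)

# Select the columns of the variable "bar" and double their values.
mask = frame.columns.get_level_values("VARIABLE") == "bar"
frame.loc[:, mask] = 2 * frame.loc[:, mask].to_numpy()

print(frame.to_numpy())
```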

Get a view of the dataset

The method get_view() returns a view of the dataset by using masks built from variable names, group names, components and row indices. For instance, we can get a view of the variables "b" and "x":

dataset.get_view(variable_names=["b", "x"])
GROUP      inputs           G3
VARIABLE        b            x
COMPONENT       0   1   2    0    1
0              -1  -2  -3  1.2  2.2
1              -4  -5  -6  3.2  4.2


or a view of the group "inputs":

dataset.get_view("inputs")
GROUP      inputs
VARIABLE        b
COMPONENT       0   1   2
0              -1  -2  -3
1              -4  -5  -6


We can also combine the keys:

dataset.get_view(variable_names=["b", "x"], components=[0])
GROUP      inputs   G3
VARIABLE        b    x
COMPONENT       0    0
0              -1  1.2
1              -4  3.2
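The notion of combined masks can be illustrated with plain pandas: one Boolean mask per column level, combined with a logical "and". This is a sketch on a stand-in frame; note that the pandas selection below returns a copy, so it does not reproduce any view semantics of get_view().

```python
from numpy import array
from pandas import DataFrame, MultiIndex

columns = MultiIndex.from_tuples(
    [("inputs", "b", 0), ("inputs", "b", 1), ("G3", "x", 0), ("G3", "x", 1), ("parameters", "a", 0)],
    names=["GROUP", "VARIABLE", "COMPONENT"],
)
frame = DataFrame(
    array([[-1.0, -2.0, 1.2, 2.2, 1.0], [-4.0, -5.0, 3.2, 4.2, 3.0]]), columns=columns
)

# One mask per column level, combined with a logical "and".
variable_mask = frame.columns.get_level_values("VARIABLE").isin(["b", "x"])
component_mask = frame.columns.get_level_values("COMPONENT").isin([0])
selection = frame.loc[:, variable_mask & component_mask]

print(selection)
```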


Update some data

To complete this example, we can update the data by using masks built from variable names, group names, components and row indices:

dataset.update_data([[10, 10, 10]], "inputs", indices=[1])
dataset
GROUP      parameters     inputs          parameters   foo               G2             G3
VARIABLE            a          b                   c     p           q  bar  baz    z    x
COMPONENT           0  1       0   1   2           3     0     1     0    0    0    0    0    1
0                   1  2      -1  -2  -3         1.5  -1.1  -2.1  -3.1  2.2  2.1  3.1  1.2  2.2
1                   3  4      10  10  10         3.5  -4.1  -5.1  -6.1  8.2  5.1  6.1  3.2  4.2
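A comparable update can be sketched with plain pandas by combining row labels with a column mask. The frame below is a small stand-in for the dataset, and the use of loc is an assumption made for the illustration, not a description of update_data().

```python
from numpy import array
from pandas import DataFrame, MultiIndex

columns = MultiIndex.from_tuples(
    [("parameters", "a", 0), ("inputs", "b", 0), ("inputs", "b", 1), ("inputs", "b", 2)],
    names=["GROUP", "VARIABLE", "COMPONENT"],
)
frame = DataFrame(array([[1, -1, -2, -3], [3, -4, -5, -6]]), columns=columns)

# Overwrite the row of index 1 for all the columns of the group "inputs".
mask = frame.columns.get_level_values("GROUP") == "inputs"
frame.loc[[1], mask] = array([[10, 10, 10]])

print(frame.to_numpy())
```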

