Dataset from a NumPy array

In this example, we will see how to build a Dataset from a NumPy array. For that, we need to import the Dataset class:

from gemseo.api import configure_logger
from gemseo.core.dataset import Dataset
from numpy import concatenate
from numpy.random import rand

configure_logger()

Out:

<RootLogger root (INFO)>

Synthetic data

Let us consider three parameters:

  • x_1 with dimension 1,

  • x_2 with dimension 2,

  • y_1 with dimension 3.

dim_x1 = 1
dim_x2 = 2
dim_y1 = 3
sizes = {"x_1": dim_x1, "x_2": dim_x2, "y_1": dim_y1}
groups = {"x_1": "inputs", "x_2": "inputs", "y_1": "outputs"}

We generate 5 random samples of the inputs where:

  • x_1 is stored in the first column,

  • x_2 is stored in the 2nd and 3rd columns

and 5 random samples of the outputs.

n_samples = 5
inputs = rand(n_samples, dim_x1 + dim_x2)
inputs_names = ["x_1", "x_2"]
outputs = rand(n_samples, dim_y1)
outputs_names = ["y_1"]
data = concatenate((inputs, outputs), 1)
data_names = inputs_names + outputs_names
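The column layout described above can be made explicit with a small NumPy-only sketch. The helper `column_slices` is hypothetical (it is not part of GEMSEO); it only assumes that the variables occupy consecutive columns, in the order given by the names.

```python
import numpy as np


def column_slices(names, sizes):
    """Map each variable name to the slice of columns it occupies.

    Hypothetical helper, not part of GEMSEO: it assumes that variables
    are stored side by side, in the order given by ``names``.
    """
    slices = {}
    start = 0
    for name in names:
        stop = start + sizes[name]
        slices[name] = slice(start, stop)
        start = stop
    return slices


sizes = {"x_1": 1, "x_2": 2, "y_1": 3}
data = np.random.rand(5, 6)
slices = column_slices(["x_1", "x_2", "y_1"], sizes)
x_2 = data[:, slices["x_2"]]  # 2nd and 3rd columns of the array
```

With these sizes, x_1 maps to column 0, x_2 to columns 1 and 2, and y_1 to columns 3 to 5.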

Create a dataset

using default names

We build a Dataset and initialize it from the whole data array:

dataset = Dataset(name="random_dataset")
dataset.set_from_array(data)
print(dataset)

Out:

random_dataset
   Number of samples: 5
   Number of variables: 6
   Variables names and sizes by group:
      parameters: x_0 (1), x_1 (1), x_2 (1), x_3 (1), x_4 (1), x_5 (1)
   Number of dimensions (total = 6) by group:
      parameters: 6

using particular names

We can also pass our own variable names, rather than the default ones set by the class:

dataset = Dataset(name="random_dataset")
dataset.set_from_array(data, data_names, sizes)
print(dataset)
print(dataset.data)

Out:

random_dataset
   Number of samples: 5
   Number of variables: 3
   Variables names and sizes by group:
      parameters: x_1 (1), x_2 (2), y_1 (3)
   Number of dimensions (total = 6) by group:
      parameters: 6
{'parameters': array([[4.41451961e-01, 6.30061490e-01, 8.43647520e-01, 4.24697706e-01,
        6.69638526e-02, 4.57962045e-01],
       [9.33783172e-02, 1.19158168e-01, 8.17784653e-01, 4.11008784e-01,
        5.49457252e-01, 6.39829154e-01],
       [4.41969474e-01, 8.02869388e-01, 1.50381081e-04, 5.89044704e-02,
        6.17529433e-01, 1.32273670e-01],
       [1.15028080e-01, 3.68343264e-01, 3.97657201e-01, 4.31813184e-01,
        8.92330173e-01, 1.33836528e-01],
       [8.33757881e-01, 5.09782498e-01, 9.17964534e-01, 2.08759507e-01,
        4.23018763e-01, 6.03938712e-02]])}

Warning

The number of variable names must be equal to the number of columns of the data array. Otherwise, the user must specify the sizes of the variables in a dictionary and make sure that their sum is equal to this number of columns.
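The consistency rule in this warning can be checked before building the dataset. The following is a minimal sketch with a hypothetical helper, `check_sizes`, which is not part of GEMSEO; raising a `ValueError` is an assumption made for illustration and says nothing about how GEMSEO itself reports a mismatch.

```python
import numpy as np


def check_sizes(data, names, sizes=None):
    """Raise ValueError if names and sizes do not match the columns of data.

    Hypothetical helper, not part of GEMSEO.
    """
    n_columns = data.shape[1]
    if sizes is None:
        # Without sizes, each name is assumed to describe exactly one column.
        total = len(names)
    else:
        total = sum(sizes[name] for name in names)
    if total != n_columns:
        raise ValueError(
            f"The variables cover {total} columns but the array has {n_columns}."
        )


data = np.zeros((5, 6))
names = ["x_1", "x_2", "y_1"]
check_sizes(data, names, {"x_1": 1, "x_2": 2, "y_1": 3})  # consistent: no error
try:
    check_sizes(data, names)  # 3 names for 6 columns, and no sizes given
    consistent = True
except ValueError:
    consistent = False
```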

using particular groups

We can also use the notion of groups of variables:

dataset = Dataset(name="random_dataset")
dataset.set_from_array(data, data_names, sizes, groups)
print(dataset)
print(dataset.data)

Out:

random_dataset
   Number of samples: 5
   Number of variables: 3
   Variables names and sizes by group:
      inputs: x_1 (1), x_2 (2)
      outputs: y_1 (3)
   Number of dimensions (total = 6) by group:
      inputs: 3
      outputs: 3
{'inputs': array([[4.41451961e-01, 6.30061490e-01, 8.43647520e-01],
       [9.33783172e-02, 1.19158168e-01, 8.17784653e-01],
       [4.41969474e-01, 8.02869388e-01, 1.50381081e-04],
       [1.15028080e-01, 3.68343264e-01, 3.97657201e-01],
       [8.33757881e-01, 5.09782498e-01, 9.17964534e-01]]), 'outputs': array([[0.42469771, 0.06696385, 0.45796204],
       [0.41100878, 0.54945725, 0.63982915],
       [0.05890447, 0.61752943, 0.13227367],
       [0.43181318, 0.89233017, 0.13383653],
       [0.20875951, 0.42301876, 0.06039387]])}

Note

The groups are specified by means of a dictionary whose keys are the variable names and whose values are the group names. If a variable is missing from this dictionary, it is assigned to the default group ‘parameters’.
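The fallback to the default group can be illustrated with plain NumPy. This is only a sketch of the idea, not GEMSEO's implementation: `split_by_group` is a hypothetical helper, and the default group name is taken from the note above.

```python
import numpy as np

DEFAULT_GROUP = "parameters"  # default group name mentioned in the note above


def split_by_group(data, names, sizes, groups):
    """Return {group: array}; hypothetical helper, not GEMSEO's implementation."""
    columns = {}
    start = 0
    for name in names:
        stop = start + sizes[name]
        # A variable absent from ``groups`` falls back to the default group.
        group = groups.get(name, DEFAULT_GROUP)
        columns.setdefault(group, []).append(data[:, start:stop])
        start = stop
    return {group: np.hstack(blocks) for group, blocks in columns.items()}


data = np.arange(30.0).reshape(5, 6)
sizes = {"x_1": 1, "x_2": 2, "y_1": 3}
# "x_2" is deliberately left out of the groups dictionary.
by_group = split_by_group(
    data, ["x_1", "x_2", "y_1"], sizes, {"x_1": "inputs", "y_1": "outputs"}
)
```

Here `by_group` has three keys: "inputs" (x_1 only), "parameters" (x_2, through the fallback) and "outputs" (y_1).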

storing by names

We can also store the data by variable names rather than by groups, by passing by_group=False to the constructor.

dataset = Dataset(name="random_dataset", by_group=False)
dataset.set_from_array(data, data_names, sizes, groups)
print(dataset)
print(dataset.data)

Out:

random_dataset
   Number of samples: 5
   Number of variables: 3
   Variables names and sizes by group:
      inputs: x_1 (1), x_2 (2)
      outputs: y_1 (3)
   Number of dimensions (total = 6) by group:
      inputs: 3
      outputs: 3
{'x_1': array([[0.44145196],
       [0.09337832],
       [0.44196947],
       [0.11502808],
       [0.83375788]]), 'x_2': array([[6.30061490e-01, 8.43647520e-01],
       [1.19158168e-01, 8.17784653e-01],
       [8.02869388e-01, 1.50381081e-04],
       [3.68343264e-01, 3.97657201e-01],
       [5.09782498e-01, 9.17964534e-01]]), 'y_1': array([[0.42469771, 0.06696385, 0.45796204],
       [0.41100878, 0.54945725, 0.63982915],
       [0.05890447, 0.61752943, 0.13227367],
       [0.43181318, 0.89233017, 0.13383653],
       [0.20875951, 0.42301876, 0.06039387]])}

Note

The choice between storage by group and storage by variable names aims to limit the number of memory copies of NumPy arrays. It mainly depends on how the dataset is used. For example, if we want to build a machine learning algorithm from both input and output data, we only have to access the data by group, and storing the data by group is recommended. Conversely, if we want to use the dataset for post-processing purposes, accessing its variables by their names, storage by variable names is preferable.
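The copies in question come from general NumPy behaviour, which the following sketch illustrates. It makes no claim about how GEMSEO stores or returns data internally: it only shows that basic slicing yields a view, whereas assembling an array from separate per-variable arrays allocates new memory.

```python
import numpy as np

inputs = np.random.rand(5, 3)  # one array holding x_1 and x_2 side by side

# Basic slicing returns a view: no data is copied.
x_1 = inputs[:, 0:1]
x_2 = inputs[:, 1:3]
slicing_is_view = np.shares_memory(inputs, x_1)

# Rebuilding the group array from per-variable arrays allocates a new array.
rebuilt = np.hstack([x_1, x_2])
rebuild_is_copy = not np.shares_memory(rebuilt, x_1)
```

If most accesses need the whole group array, keeping it assembled avoids repeating the `hstack` step; if most accesses need one variable at a time, keeping the variables separate avoids it as well.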

Access properties

Variable names

We can access the variable names:

print(dataset.variables)

Out:

['x_1', 'x_2', 'y_1']

Variable sizes

We can access the variable sizes:

print(dataset.sizes)

Out:

{'x_1': 1, 'x_2': 2, 'y_1': 3}

Variable groups

We can access the names of the groups of variables:

print(dataset.groups)

Out:

['inputs', 'outputs']

Access data

Access by group

We can get the data by group, either as an array (default option):

print(dataset.get_data_by_group("inputs"))

Out:

[[4.41451961e-01 6.30061490e-01 8.43647520e-01]
 [9.33783172e-02 1.19158168e-01 8.17784653e-01]
 [4.41969474e-01 8.02869388e-01 1.50381081e-04]
 [1.15028080e-01 3.68343264e-01 3.97657201e-01]
 [8.33757881e-01 5.09782498e-01 9.17964534e-01]]

or as a dictionary indexed by variable names:

print(dataset.get_data_by_group("inputs", True))

Out:

{'x_1': array([[0.44145196],
       [0.09337832],
       [0.44196947],
       [0.11502808],
       [0.83375788]]), 'x_2': array([[6.30061490e-01, 8.43647520e-01],
       [1.19158168e-01, 8.17784653e-01],
       [8.02869388e-01, 1.50381081e-04],
       [3.68343264e-01, 3.97657201e-01],
       [5.09782498e-01, 9.17964534e-01]])}

Access by variable name

We can get the data by variable names, either as a dictionary indexed by variable names (default option):

print(dataset.get_data_by_names(["x_1", "y_1"]))

Out:

{'x_1': array([[0.44145196],
       [0.09337832],
       [0.44196947],
       [0.11502808],
       [0.83375788]]), 'y_1': array([[0.42469771, 0.06696385, 0.45796204],
       [0.41100878, 0.54945725, 0.63982915],
       [0.05890447, 0.61752943, 0.13227367],
       [0.43181318, 0.89233017, 0.13383653],
       [0.20875951, 0.42301876, 0.06039387]])}

or as an array:

print(dataset.get_data_by_names(["x_1", "y_1"], False))

Out:

[[0.44145196 0.42469771 0.06696385 0.45796204]
 [0.09337832 0.41100878 0.54945725 0.63982915]
 [0.44196947 0.05890447 0.61752943 0.13227367]
 [0.11502808 0.43181318 0.89233017 0.13383653]
 [0.83375788 0.20875951 0.42301876 0.06039387]]

Access all data

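We can get all the data sorted by group, either as one array per group, returned in a tuple together with the variable names by group and the variable sizes: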

print(dataset.get_all_data())

Out:

({'inputs': array([[4.41451961e-01, 6.30061490e-01, 8.43647520e-01],
       [9.33783172e-02, 1.19158168e-01, 8.17784653e-01],
       [4.41969474e-01, 8.02869388e-01, 1.50381081e-04],
       [1.15028080e-01, 3.68343264e-01, 3.97657201e-01],
       [8.33757881e-01, 5.09782498e-01, 9.17964534e-01]]), 'outputs': array([[0.42469771, 0.06696385, 0.45796204],
       [0.41100878, 0.54945725, 0.63982915],
       [0.05890447, 0.61752943, 0.13227367],
       [0.43181318, 0.89233017, 0.13383653],
       [0.20875951, 0.42301876, 0.06039387]])}, {'inputs': ['x_1', 'x_2'], 'outputs': ['y_1']}, {'x_1': 1, 'x_2': 2, 'y_1': 3})

or as a nested dictionary indexed by group, then by variable name:

print(dataset.get_all_data(as_dict=True))

Out:

{'inputs': {'x_1': array([[0.44145196],
       [0.09337832],
       [0.44196947],
       [0.11502808],
       [0.83375788]]), 'x_2': array([[6.30061490e-01, 8.43647520e-01],
       [1.19158168e-01, 8.17784653e-01],
       [8.02869388e-01, 1.50381081e-04],
       [3.68343264e-01, 3.97657201e-01],
       [5.09782498e-01, 9.17964534e-01]])}, 'outputs': {'y_1': array([[0.42469771, 0.06696385, 0.45796204],
       [0.41100878, 0.54945725, 0.63982915],
       [0.05890447, 0.61752943, 0.13227367],
       [0.43181318, 0.89233017, 0.13383653],
       [0.20875951, 0.42301876, 0.06039387]])}}

We can also get these data without sorting them by group, either as a single large array, returned in a tuple together with the variable names and the variable sizes:

print(dataset.get_all_data(by_group=False))

Out:

(array([[4.41451961e-01, 6.30061490e-01, 8.43647520e-01, 4.24697706e-01,
        6.69638526e-02, 4.57962045e-01],
       [9.33783172e-02, 1.19158168e-01, 8.17784653e-01, 4.11008784e-01,
        5.49457252e-01, 6.39829154e-01],
       [4.41969474e-01, 8.02869388e-01, 1.50381081e-04, 5.89044704e-02,
        6.17529433e-01, 1.32273670e-01],
       [1.15028080e-01, 3.68343264e-01, 3.97657201e-01, 4.31813184e-01,
        8.92330173e-01, 1.33836528e-01],
       [8.33757881e-01, 5.09782498e-01, 9.17964534e-01, 2.08759507e-01,
        4.23018763e-01, 6.03938712e-02]]), ['x_1', 'x_2', 'y_1'], {'x_1': 1, 'x_2': 2, 'y_1': 3})

or as a dictionary indexed by variable names:

print(dataset.get_all_data(by_group=False, as_dict=True))

Out:

{'x_1': array([[0.44145196],
       [0.09337832],
       [0.44196947],
       [0.11502808],
       [0.83375788]]), 'x_2': array([[6.30061490e-01, 8.43647520e-01],
       [1.19158168e-01, 8.17784653e-01],
       [8.02869388e-01, 1.50381081e-04],
       [3.68343264e-01, 3.97657201e-01],
       [5.09782498e-01, 9.17964534e-01]]), 'y_1': array([[0.42469771, 0.06696385, 0.45796204],
       [0.41100878, 0.54945725, 0.63982915],
       [0.05890447, 0.61752943, 0.13227367],
       [0.43181318, 0.89233017, 0.13383653],
       [0.20875951, 0.42301876, 0.06039387]])}
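The two ungrouped forms carry the same information: the dictionary can be recovered from the array, the names and the sizes. The sketch below shows this with NumPy alone, using stand-in values rather than the random data printed above; it is an illustration, not the way GEMSEO performs the conversion.

```python
import numpy as np

# Stand-ins for an (array, names, sizes) triple such as the one printed earlier.
array = np.arange(30.0).reshape(5, 6)
names = ["x_1", "x_2", "y_1"]
sizes = {"x_1": 1, "x_2": 2, "y_1": 3}

# Column indices at which one variable ends and the next begins.
boundaries = np.cumsum([sizes[name] for name in names])[:-1]

# Split the columns at these boundaries and pair each block with its name.
as_dict = dict(zip(names, np.split(array, boundaries, axis=1)))
```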
