Dataset
Basics
A generic dataset to store data in memory.
The concept of a dataset is a key element of machine learning, post-processing, data analysis, …
A Dataset
uses its attribute Dataset.data
to store \(N\) series of data
representing the values of \(p\) multidimensional features
belonging to different groups of features.
This attribute Dataset.data
is a dictionary of 2D numpy arrays,
whose rows are the samples, a.k.a. series, realizations or entries,
and columns are the variables, a.k.a. parameters or features.
The keys of this dictionary are
either the names of the groups of variables
or the names of the variables.
Thus, a Dataset
is not only defined by the raw data stored
but also by the names, the sizes and the groups of the different variables.
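To make this layout concrete, here is a minimal sketch in plain NumPy rather than with the GEMSEO classes. The group names (`inputs`, `outputs`), the variable names (`x`, `y`, `z`) and the `sizes` and `groups` dictionaries are assumptions chosen for illustration; they mimic the information a Dataset keeps alongside the raw data but are not its actual internals.

```python
import numpy as np

n_samples = 4

# Data keyed by group name (illustrative names, not GEMSEO defaults):
# each value is a 2D array with one row per sample
# and one column per component of the variables of the group.
data = {
    "inputs": np.arange(12.0).reshape(n_samples, 3),  # x (size 2) and y (size 1)
    "outputs": np.ones((n_samples, 1)),  # z (size 1)
}

# Metadata describing the variables: their sizes and their groups.
sizes = {"x": 2, "y": 1, "z": 1}
groups = {"inputs": ["x", "y"], "outputs": ["z"]}

# Consistency check: every array has one row per sample
# and as many columns as the total size of the variables of its group.
for group, names in groups.items():
    n_rows, n_columns = data[group].shape
    assert n_rows == n_samples
    assert n_columns == sum(sizes[name] for name in names)
```

Keeping the sizes and groups next to the arrays is what makes it possible to recover a given variable from the columns of a group.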
A Dataset
can be set
either from a file (Dataset.set_from_file())
or from a NumPy array (Dataset.set_from_array()),
and can be enriched with a group of variables (Dataset.add_group())
or with a single variable (Dataset.add_variable()).
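Conceptually, setting a dataset from an array amounts to splitting the columns of a 2D array into named variables, and adding a variable amounts to storing one more array with the same number of rows. The sketch below illustrates this idea with plain NumPy; the helpers `split_by_sizes` and `add_variable` are hypothetical and do not reproduce the signatures of Dataset.set_from_array() or Dataset.add_variable().

```python
import numpy as np


def split_by_sizes(array, names, sizes):
    """Split the columns of a 2D array into named variables (hypothetical helper)."""
    array = np.atleast_2d(array)
    if array.shape[1] != sum(sizes[name] for name in names):
        raise ValueError("The sizes do not match the number of columns.")
    variables = {}
    start = 0
    for name in names:
        stop = start + sizes[name]
        variables[name] = array[:, start:stop]
        start = stop
    return variables


def add_variable(variables, name, values):
    """Add a variable after checking its number of samples (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    if values.ndim == 1:
        values = values[:, np.newaxis]  # a 1D input is a variable of size 1
    if variables and len(values) != len(next(iter(variables.values()))):
        raise ValueError("The new variable must have one row per sample.")
    variables[name] = values


# Two samples of a variable x of size 2 and a variable y of size 1.
raw = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
variables = split_by_sizes(raw, ["x", "y"], {"x": 2, "y": 1})

# Enrich the data with a new variable z of size 1.
add_variable(variables, "z", [7.0, 8.0])
```

The row check in `add_variable` reflects the fact that all the variables of a dataset share the same samples.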
An AbstractFullCache
or an OptimizationProblem
can also be exported to a Dataset
using AbstractFullCache.export_to_dataset()
and OptimizationProblem.export_to_dataset()
respectively.
From a Dataset,
we can easily access its length and data,
either as a 2D array or as dictionaries indexed by the variable names.
We can get either the whole data,
the data associated with a group
or the data associated with a list of variables.
It is also possible to export the Dataset
to an AbstractFullCache
or a pandas DataFrame.
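The access patterns described above can be pictured with the following sketch, which uses plain NumPy on a dictionary of arrays indexed by variable names. The function `get_data`, its `as_dict` argument and the sample values are hypothetical and introduced for this illustration only; they are not the GEMSEO API.

```python
import numpy as np

# Two samples of three variables (illustrative values).
variables = {
    "x": np.array([[1.0, 2.0], [4.0, 5.0]]),
    "y": np.array([[3.0], [6.0]]),
    "z": np.array([[7.0], [8.0]]),
}
groups = {"inputs": ["x", "y"], "outputs": ["z"]}


def get_data(variables, names=None, as_dict=False):
    """Return the data of some variables, as a dictionary or a single 2D array."""
    names = list(variables) if names is None else list(names)
    if as_dict:
        return {name: variables[name] for name in names}
    # Concatenate the columns of the selected variables, sample by sample.
    return np.hstack([variables[name] for name in names])


n_samples = len(variables["x"])  # length of the dataset
whole = get_data(variables)  # whole data as a 2D array
inputs = get_data(variables, groups["inputs"])  # data of a group
subset = get_data(variables, ["x", "z"], as_dict=True)  # data of a list of variables
```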
Visualization
Dataset post-processing.
A DatasetPlot
is a generic graphical representation of a Dataset
that can be displayed on screen or saved to a file.
The different visualization tools
(OptPostProcessor, SensitivityAnalysis.plot(), …)
provided by GEMSEO increasingly rely on it
to factor out common code,
separate data generation from graph generation
and standardize the visualizations.
As an example,
the DatasetPlot RadarChart
implements a radar chart
and is used both by the optimization post-processor of the same name, RadarChart,
to visualize the constraints stored in a Database
and by SensitivityAnalysis.plot_radar()
to visualize the sensitivity indices generated by a SensitivityAnalysis.
Use cases
Examples of datasets.
GEMSEO provides several datasets containing academic data to illustrate its capabilities:
- IrisDataset is a collection of iris plants, mainly used to benchmark clustering and classification algorithms,
- RosenbrockDataset is a set of evaluations of the Rosenbrock function over a regular grid, initially introduced to illustrate visualization tools dedicated to surfaces such as ZvsXY,
- BurgersDataset is a set of solutions of the Burgers’ equation at given times, initially introduced to illustrate dimension reduction methods, e.g. PCA or KLSVD.
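As an illustration of the kind of data such a dataset holds, the sketch below evaluates the Rosenbrock function \(f(x, y) = (1 - x)^2 + 100 (y - x^2)^2\) over a regular grid with NumPy. The grid bounds \([-2, 2]^2\), the 10 points per axis and the variable names `x`, `y` and `z` are assumptions made for this example and are not taken from RosenbrockDataset.

```python
import numpy as np


def rosenbrock(x, y):
    """Evaluate the two-dimensional Rosenbrock function."""
    return (1.0 - x) ** 2 + 100.0 * (y - x**2) ** 2


# Regular grid over [-2, 2] x [-2, 2] (assumed bounds and resolution).
n_points = 10
axis = np.linspace(-2.0, 2.0, n_points)
x_grid, y_grid = np.meshgrid(axis, axis)

# One row per grid point, one column per variable component,
# i.e. the samples-by-features layout described in the Basics section.
samples = {
    "x": x_grid.reshape(-1, 1),
    "y": y_grid.reshape(-1, 1),
    "z": rosenbrock(x_grid, y_grid).reshape(-1, 1),
}
```

Flattening the grid into rows keeps the data compatible with a dictionary of 2D arrays, while a surface plot can still recover the grid from the `x` and `y` columns.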