# Plotting a dataset

## Dataset plot factory

The factory module contains the DatasetPlotFactory class, a factory that instantiates a DatasetPlot from its class name. The class can be internal to GEMSEO or located in an external module whose path is provided to the constructor. The factory also provides a list of the available dataset plot types and lets you test whether a given type is available.
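The factory idea described above can be sketched in plain Python. This is a generic illustration of the pattern, not GEMSEO's implementation: `BasePlot`, `LinePlot`, `BarPlot`, `PlotFactory` and the members `create`, `is_available` and `available_classes` are hypothetical names chosen for the example.

```python
class BasePlot:
    """Hypothetical base class standing in for a plot type."""

    def __init__(self, dataset):
        self.dataset = dataset


class LinePlot(BasePlot):
    """Hypothetical concrete plot."""


class BarPlot(BasePlot):
    """Another hypothetical concrete plot."""


class PlotFactory:
    """Instantiate subclasses of a base class from their class names."""

    def __init__(self, base_class):
        # Register every direct and indirect subclass under its class name.
        self._classes = {}
        stack = list(base_class.__subclasses__())
        while stack:
            cls = stack.pop()
            self._classes[cls.__name__] = cls
            stack.extend(cls.__subclasses__())

    @property
    def available_classes(self):
        """Sorted names of the classes the factory can instantiate."""
        return sorted(self._classes)

    def is_available(self, name):
        """Tell whether a class name is known to the factory."""
        return name in self._classes

    def create(self, name, *args, **kwargs):
        """Instantiate the class registered under the given name."""
        if not self.is_available(name):
            raise ValueError(f"Unknown plot type: {name!r}")
        return self._classes[name](*args, **kwargs)


factory = PlotFactory(BasePlot)
plot = factory.create("LinePlot", dataset=[1.0, 2.0, 3.0])
```

Looking classes up by name keeps the calling code independent of where each plot class is defined, which is what makes it possible to support classes coming from external modules.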

## Abstract dataset plot

The dataset_plot module implements the abstract DatasetPlot class, whose purpose is to build a graphical representation of a Dataset and to display it on screen or save it to a file. Concrete classes must derive from this abstract class and implement at least the DatasetPlot._run() method.
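The relationship between an abstract plot class and its concrete subclasses can be illustrated with the standard `abc` module. This is only a sketch of the pattern: `AbstractPlot` and `TextHistogram` are hypothetical names, and the assumption that `execute()` simply delegates to `_run()` is made for the example, not taken from GEMSEO's code.

```python
from abc import ABC, abstractmethod


class AbstractPlot(ABC):
    """Hypothetical stand-in for an abstract plot class."""

    def __init__(self, data):
        self.data = data

    def execute(self, **options):
        """Public entry point: delegate the drawing to the subclass."""
        return self._run(**options)

    @abstractmethod
    def _run(self, **options):
        """Build the graphical representation; must be overridden."""


class TextHistogram(AbstractPlot):
    """Hypothetical concrete plot drawing one text bar per value."""

    def _run(self, symbol="#"):
        return [symbol * int(value) for value in self.data]


bars = TextHistogram([1, 3, 2]).execute(symbol="*")
```

Because `_run()` is declared abstract, Python refuses to instantiate `AbstractPlot` directly, so a missing implementation is reported when the object is created rather than when the plot is drawn.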

## Andrews curves

The AndrewsCurves class implements the Andrews plot, a.k.a. Andrews curves, which is a way to visualize $$n$$ samples of a high-dimensional vector

$x=(x_1,x_2,\ldots,x_d)\in\mathbb{R}^d$

in a 2D frame by projecting each sample

$x^{(i)}=(x_1^{(i)},x_2^{(i)},\ldots,x_d^{(i)})$

onto the vector

$\left(\frac{1}{\sqrt{2}},\sin(t),\cos(t),\sin(2t),\cos(2t), \ldots\right)$

which is composed of the first $$d$$ elements of the Fourier series, giving the sum:

$f_i(t)=\frac{x_1^{(i)}}{\sqrt{2}}+x_2^{(i)}\sin(t)+x_3^{(i)}\cos(t)+x_4^{(i)}\sin(2t)+x_5^{(i)}\cos(2t)+\ldots$

Each curve $$t\mapsto f_i(t)$$ is plotted over the interval $$[-\pi,\pi]$$ and structure in the data may be visible in these $$n$$ Andrews curves.
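The projection can be evaluated numerically by summing the products of the sample coordinates with the Fourier terms. The following is a minimal sketch in pure Python; the function names `andrews_curve` and `sample_curve` and the number of discretization points are assumptions made for the illustration, not part of GEMSEO.

```python
import math


def andrews_curve(sample, t):
    """Evaluate the Andrews curve of one sample at the angle t."""
    value = sample[0] / math.sqrt(2)
    for index, coordinate in enumerate(sample[1:], start=1):
        harmonic = (index + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        # Odd positions use a sine, even positions a cosine.
        basis = math.sin if index % 2 == 1 else math.cos
        value += coordinate * basis(harmonic * t)
    return value


def sample_curve(sample, n_points=101):
    """Discretize the curve over [-pi, pi] as a list of (t, f(t)) pairs."""
    step = 2 * math.pi / (n_points - 1)
    angles = [-math.pi + k * step for k in range(n_points)]
    return [(t, andrews_curve(sample, t)) for t in angles]


curve = sample_curve([1.0, 0.5, -0.2, 0.3])
```

Plotting one such list of pairs per sample, with any plotting library, yields the $$n$$ Andrews curves.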

A variable name can be passed to the DatasetPlot.execute() method by means of the classifier keyword in order to color the curves according to the value of that variable. This is useful when the data is labeled.

## Curve plot

A Curves plot represents samples of a functional variable $$y(x)$$ discretized over a 1D mesh. Both the evaluations of $$y$$ and the mesh are stored in a Dataset: $$y$$ as a parameter and the mesh as metadata.

## Parallel coordinates plot

The ParallelCoordinates class implements the parallel coordinates plot, a.k.a. cobweb plot, which is a way to visualize $$n$$ samples of a high-dimensional vector

$x=(x_1,x_2,\ldots,x_d)\in\mathbb{R}^d$

in a 2D frame by representing each sample

$x^{(i)}=(x_1^{(i)},x_2^{(i)},\ldots,x_d^{(i)})$

as a piecewise-linear curve whose nodes, from left to right, have as heights the values $$x_1^{(i)}$$, $$x_2^{(i)}$$, … and $$x_d^{(i)}$$.

A variable name must be passed to the DatasetPlot.execute() method by means of the classifier keyword in order to color the curves according to the value of that variable. This is useful when the data is labeled or when looking for the samples whose classifier value lies in an interval specified by the lower and upper arguments (the default values are -inf and inf respectively). In the latter case, the color scale is composed of only two values: one for the positively classified samples and one for the others.
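The construction of the lines and of the two-value color scale can be sketched as follows. This is an illustration in pure Python rather than GEMSEO's implementation: the function names and the labels `"selected"` and `"other"` are hypothetical, and treating the interval as closed at both ends is an assumption.

```python
import math


def polyline(sample):
    """Return the vertices (axis index, value) of one sample, left to right."""
    return list(enumerate(sample))


def two_color_scale(classifier_values, lower=-math.inf, upper=math.inf):
    """Label each sample by whether its classifier value lies in [lower, upper]."""
    return [
        "selected" if lower <= value <= upper else "other"
        for value in classifier_values
    ]


samples = [[0.1, 2.0, 5.0], [0.7, 1.0, 3.0], [0.4, 4.0, 1.0]]
lines = [polyline(sample) for sample in samples]
# Use the first variable as the classifier and keep values in [0.4, 1.0].
colors = two_color_scale([sample[0] for sample in samples], lower=0.4, upper=1.0)
```

With the default bounds every sample falls inside the interval, so the two-value scale only becomes informative once at least one of the bounds is set.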

## Radviz plot

The Radar class implements the Radviz plot, which is a way to visualize $$n$$ samples of a multi-dimensional vector

$x=(x_1,x_2,\ldots,x_d)\in\mathbb{R}^d$

in a 2D frame and to highlight the separability of the data.

For that, each sample

$x^{(i)}=(x_1^{(i)},x_2^{(i)},\ldots,x_d^{(i)})$

is rendered as a point inside the unit disc, with the different parameters evenly distributed on its circumference. The influence of each parameter varies from 0 to 1 and is interpreted relative to the others.

A variable name must be passed to the DatasetPlot.execute() method by means of the classifier keyword in order to color the points according to the value of that variable. This is useful when the data is labeled or when looking for the samples whose classifier value lies in an interval specified by the lower and upper arguments (the default values are -inf and inf respectively). In the latter case, the color scale is composed of only two values: one for the positively classified samples and one for the others.
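The usual Radviz construction places one anchor per parameter on the unit circle and draws each sample at the weighted average of the anchors, the weights being the parameter values rescaled to $$[0,1]$$. The sketch below follows that textbook construction; whether the Radar class uses exactly this normalization is an assumption, the function names are hypothetical, and mapping an all-zero sample to the center of the disc is a choice made for the example.

```python
import math


def min_max_normalize(samples):
    """Rescale every variable (column) to [0, 1]."""
    columns = list(zip(*samples))
    lows = [min(column) for column in columns]
    # A constant column has a zero span; use 1.0 to avoid dividing by zero.
    spans = [(max(column) - min(column)) or 1.0 for column in columns]
    return [
        [(value - low) / span for value, low, span in zip(sample, lows, spans)]
        for sample in samples
    ]


def radviz_point(weights):
    """Return the weighted average of the anchors spread on the unit circle."""
    d = len(weights)
    total = sum(weights)
    if total == 0:
        return (0.0, 0.0)  # Assumption: no influence at all maps to the center.
    x = sum(w * math.cos(2 * math.pi * j / d) for j, w in enumerate(weights))
    y = sum(w * math.sin(2 * math.pi * j / d) for j, w in enumerate(weights))
    return (x / total, y / total)


samples = [[1.0, 10.0, 100.0], [2.0, 30.0, 100.0], [3.0, 20.0, 400.0]]
points = [radviz_point(row) for row in min_max_normalize(samples)]
```

A sample dominated by one parameter lands close to that parameter's anchor, while a sample with balanced influences lands near the center.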

## Scatter matrix

The ScatterMatrix class implements the scatter plot matrix, which is a way to visualize $$n$$ samples of a multi-dimensional vector

$x=(x_1,x_2,\ldots,x_d)\in\mathbb{R}^d$

in several 2D subplots where the (i,j) subplot represents the cloud of points

$\left(x_i^{(k)},x_j^{(k)}\right)_{1\leq k \leq n}$

while the (i,i) subplot represents the empirical distribution of the samples

$x_i^{(1)},\ldots,x_i^{(n)}$

by means of a histogram or a kernel density estimator.

A variable name can be passed to the DatasetPlot.execute() method by means of the classifier keyword in order to color the points according to the value of that variable. This is useful when the data is labeled.
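The layout of the scatter matrix can be sketched by computing the content of each subplot without drawing it. This is a pure Python illustration with hypothetical function names; it uses 0-based indices, a histogram with equal-width bins on the diagonal, and a default bin count chosen arbitrarily for the example.

```python
def histogram(values, bins):
    """Count the values falling in each of the equal-width bins."""
    low, high = min(values), max(values)
    # A zero width (all values equal) is replaced by 1.0 to avoid dividing by zero.
    width = (high - low) / bins or 1.0
    counts = [0] * bins
    for value in values:
        # The maximum value is assigned to the last bin.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


def scatter_matrix_cells(samples, bins=3):
    """Map each subplot position (i, j) to the data it displays."""
    d = len(samples[0])
    columns = [[sample[i] for sample in samples] for i in range(d)]
    cells = {}
    for i in range(d):
        for j in range(d):
            if i == j:
                cells[i, j] = ("histogram", histogram(columns[i], bins))
            else:
                cells[i, j] = ("scatter", list(zip(columns[i], columns[j])))
    return cells


samples = [[0.0, 10.0], [1.0, 20.0], [2.0, 30.0], [3.0, 40.0]]
cells = scatter_matrix_cells(samples)
```

For $$d$$ variables the grid holds $$d^2$$ subplots, of which $$d$$ are distributions and the remaining $$d(d-1)$$ are clouds of points.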

## Scatter plot

A Scatter plot represents a set of points $$\{(x_i,y_i)\}_{1\leq i \leq n}$$ as markers on a classical plot, where the color can differ from one point to another.

## Surface plot

A Surfaces plot represents samples of a functional variable $$z(x,y)$$ discretized over a 2D mesh. Both the evaluations of $$z$$ and the mesh are stored in a Dataset: $$z$$ as a parameter and the mesh as metadata.

## YvsX plot

A YvsX plot represents samples of a pair $$(x,y)$$ as a set of points whose values are stored in a Dataset. The user can select the line style or the markers, as well as the color.

## ZvsXY plot

A ZvsXY plot represents the variable $$z$$ with respect to $$x$$ and $$y$$ as a surface plot, based on a set of points $$\{(x_i,y_i,z_i)\}_{1\leq i \leq n}$$. The interpolation relies on the Delaunay triangulation of $$\{(x_i,y_i)\}_{1\leq i \leq n}$$.
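Once a triangulation of the points is available, a surface is commonly obtained by interpolating $$z$$ linearly inside each triangle with barycentric coordinates. The sketch below shows this step for a single triangle; it does not compute the Delaunay triangulation itself, the function name is hypothetical, and the assumption that the plot uses linear interpolation within each triangle is not confirmed by the text above.

```python
def interpolate_in_triangle(point, a, b, c):
    """Linearly interpolate z at point = (x, y) from the vertices a, b, c.

    Each vertex is a tuple (x, y, z); the triangle must not be degenerate.
    """
    x, y = point
    (xa, ya, za), (xb, yb, zb), (xc, yc, zc) = a, b, c
    # Twice the signed area of the triangle.
    det = (yb - yc) * (xa - xc) + (xc - xb) * (ya - yc)
    # Barycentric weights of the point with respect to a, b and c.
    wa = ((yb - yc) * (x - xc) + (xc - xb) * (y - yc)) / det
    wb = ((yc - ya) * (x - xc) + (xa - xc) * (y - yc)) / det
    wc = 1.0 - wa - wb
    return wa * za + wb * zb + wc * zc


# The three vertices lie on the plane z = 1 + x + 2y.
a, b, c = (0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (0.0, 1.0, 3.0)
z = interpolate_in_triangle((0.25, 0.25), a, b, c)
```

The three weights sum to one and are all non-negative exactly when the point lies inside the triangle, which is also how the triangle containing a given point can be identified.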