In [None]:
%matplotlib inline


# Iris dataset

## Presentation

This is one of the best-known datasets
in the machine learning literature.

It was introduced by the statistician Ronald Fisher
in his 1936 paper "The use of multiple measurements in taxonomic problems",
Annals of Eugenics. 7 (2): 179–188.

It contains 150 instances of iris plants:

- 50 Iris Setosa,
- 50 Iris Versicolour,
- 50 Iris Virginica.

Each instance is characterized by:

- its sepal length in cm,
- its sepal width in cm,
- its petal length in cm,
- its petal width in cm.

This dataset can be used for either clustering
or classification.


In [None]:
from __future__ import annotations

from gemseo.api import configure_logger
from gemseo.api import load_dataset
from gemseo.post.dataset.andrews_curves import AndrewsCurves
from gemseo.post.dataset.parallel_coordinates import ParallelCoordinates
from gemseo.post.dataset.radviz import Radar
from gemseo.post.dataset.scatter_plot_matrix import ScatterMatrix
from matplotlib import pyplot as plt
from numpy.random import choice

configure_logger()

## Load Iris dataset
We can easily load this dataset by means of the
:func:`~gemseo.api.load_dataset` function of the API:



In [None]:
iris = load_dataset("IrisDataset")

and get some information about it:



In [None]:
print(iris)

## Manipulate the dataset
We randomly select 10 samples to display.
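
As a minimal sketch of this kind of selection, the snippet below draws 10 distinct row indices out of 150 and uses them to slice an array. It assumes NumPy's `default_rng` generator (instead of the legacy `choice` function imported in this example), an arbitrary seed of 0 for reproducibility, and a placeholder array standing in for the real measurements.

```python
import numpy as np

# Seeded generator so that the same rows are drawn on every run
# (the seed value 0 is an arbitrary choice for this sketch).
rng = np.random.default_rng(0)

n_samples = 150  # number of instances in the Iris dataset
indices = rng.choice(n_samples, size=10, replace=False)

# Placeholder array standing in for the real measurements (150 rows, 4 columns).
placeholder = np.arange(n_samples * 4).reshape(n_samples, 4)
subset = placeholder[indices, :]
print(subset.shape)  # (10, 4)
```

With `replace=False`, no index can be drawn twice, so the subset always contains 10 different samples.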



In [None]:
shown_samples = choice(iris.length, size=10, replace=False)

If the pandas library is installed, we can export the iris dataset to a
dataframe and print it.
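
Because pandas is an optional dependency here, a script may want to check for it before exporting. The following is only a sketch of such a guard: `to_dataframe_or_none` is a hypothetical helper that is not part of GEMSEO, and the column names and zero values are illustrative placeholders, not the names or data of the actual dataset.

```python
import importlib.util

import numpy as np


def to_dataframe_or_none(array, column_names):
    """Return a pandas DataFrame if pandas is available, otherwise None.

    Hypothetical helper for illustration; not part of GEMSEO.
    """
    if importlib.util.find_spec("pandas") is None:
        return None
    import pandas as pd

    return pd.DataFrame(array, columns=column_names)


# Placeholder values and illustrative column names, not actual Iris data.
values = np.zeros((3, 4))
names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
frame = to_dataframe_or_none(values, names)
print("pandas not installed" if frame is None else frame.shape)
```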



In [None]:
dataframe = iris.export_to_dataframe()
print(dataframe)

We can also easily access the 10 samples previously selected,
either globally:



In [None]:
data = iris.get_all_data(False)
print(data[0][shown_samples, :])

or only the parameters:



In [None]:
parameters = iris.get_data_by_group("parameters")
print(parameters[shown_samples, :])

or only the labels:



In [None]:
labels = iris.get_data_by_group("labels")
print(labels[shown_samples, :])

## Plot the dataset
Lastly, we can plot the dataset in various ways. Note that the
samples are colored according to their labels.



### Plot scatter matrix
We can use the :class:`.ScatterMatrix` plot, where each off-diagonal block
plots the samples against the two variables named on its x- and y-axes,
while the diagonal blocks approximate the probability distributions of the
variables, using either a histogram or a kernel-density estimator.
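
To make the two diagonal options concrete, here is a minimal sketch that compares a density-normalized histogram with a hand-written Gaussian kernel-density estimate. It is not the estimator used by the plotting class: the function `gaussian_kde_1d`, the bandwidth of 0.4, and the synthetic normal samples are all assumptions made for illustration.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel-density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One standardized distance per (grid point, sample) pair.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernels.sum(axis=1) / (samples.size * bandwidth)


# Synthetic data for illustration only (not Iris measurements).
rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=1.0, size=200)

grid = np.linspace(0.0, 10.0, 501)
kde = gaussian_kde_1d(samples, grid, bandwidth=0.4)

# Histogram normalized as a density, for comparison with the KDE.
hist, edges = np.histogram(samples, bins=15, density=True)
```

Both estimates integrate to approximately one. The histogram is piecewise constant and depends on the bin edges, whereas the KDE is smooth and depends on the bandwidth.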



In [None]:
ScatterMatrix(iris, classifier="specy", kde=True).execute(save=False, show=False)

### Plot parallel coordinates
We can use the
:class:`~gemseo.post.dataset.parallel_coordinates.ParallelCoordinates` plot,
a.k.a. cowebplot, where each sample
is represented by a continuous piecewise-linear curve whose nodes are
indexed by the variable names and measure the variable values.
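
The construction of one such curve can be sketched as follows, under the assumption that the variables are placed at the integer positions 0, 1, 2, ... along the horizontal axis. `polyline_vertices` is a hypothetical helper and the sample values are arbitrary.

```python
import numpy as np


def polyline_vertices(sample):
    """Return the (axis position, value) vertices of one sample's polyline."""
    sample = np.asarray(sample, dtype=float)
    positions = np.arange(sample.size, dtype=float)
    return np.column_stack((positions, sample))


# Placeholder sample with four variables (values chosen arbitrarily).
vertices = polyline_vertices([1.0, 3.0, 2.0, 0.5])
print(vertices.shape)  # (4, 2)
```

Joining consecutive vertices with straight segments gives the curve of that sample. Drawing one curve per sample, colored by label, gives the full plot.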



In [None]:
ParallelCoordinates(iris, "specy").execute(save=False, show=False)

### Plot Andrews curves
We can use the :class:`.AndrewsCurves` plot,
which can be viewed as a smooth
version of the parallel coordinates. Each sample is represented by a curve,
and if there is structure in the data, it may be visible in the plot.
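
For reference, the usual definition maps a sample x = (x1, x2, x3, x4, ...) to the finite Fourier series f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + ..., with t typically ranging over [-pi, pi]. The sketch below evaluates this series with NumPy. `andrews_curve` is a hypothetical helper, and the plotting class may differ in details such as scaling.

```python
import numpy as np


def andrews_curve(sample, t):
    """Evaluate the Andrews curve of one sample at the angles `t`."""
    sample = np.asarray(sample, dtype=float)
    t = np.asarray(t, dtype=float)
    # Constant term: first variable divided by sqrt(2).
    curve = np.full(t.shape, sample[0] / np.sqrt(2.0))
    for index, value in enumerate(sample[1:], start=1):
        harmonic = (index + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        if index % 2 == 1:
            curve += value * np.sin(harmonic * t)
        else:
            curve += value * np.cos(harmonic * t)
    return curve


t = np.linspace(-np.pi, np.pi, 5)
# A sample whose only non-zero variable is the first one gives a flat curve.
flat = andrews_curve([np.sqrt(2.0), 0.0, 0.0, 0.0], t)
```

Samples that are close to each other produce curves that stay close for all t, which is why clusters can show up as bundles of curves.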



In [None]:
AndrewsCurves(iris, "specy").execute(save=False, show=False)

### Plot Radar
We can use the :class:`.Radar` plot:



In [None]:
Radar(iris, "specy").execute(save=False, show=False)
# Workaround for HTML rendering, instead of ``show=True``
plt.show()