# supervised module¶

## Supervised machine learning algorithm¶

Supervised machine learning is the task of learning relationships between input and output variables from an input-output dataset. One usually distinguishes between two types of supervised machine learning algorithms, depending on the nature of the outputs. For a continuous output variable, a regression is performed, while for a discrete output variable, a classification is performed.

Given a set of input variables $$x \in \mathbb{R}^{n_{\text{samples}}\times n_{\text{inputs}}}$$ and a set of output variables $$y\in \mathbb{K}^{n_{\text{samples}}\times n_{\text{outputs}}}$$, where $$n_{\text{inputs}}$$ is the dimension of the input variable, $$n_{\text{outputs}}$$ is the dimension of the output variable, $$n_{\text{samples}}$$ is the number of training samples and $$\mathbb{K}$$ is either $$\mathbb{R}$$ or $$\mathbb{N}$$ for regression and classification tasks respectively, a supervised learning algorithm seeks to find a function $$f: \mathbb{R}^{n_{\text{inputs}}} \to \mathbb{K}^{n_{\text{outputs}}}$$ such that $$y=f(x)$$.
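These shapes can be made concrete with a small NumPy sketch (the data values below are hypothetical, chosen only to illustrate the dimensions):

```python
import numpy as np

# Toy training set: 5 samples, 2 inputs, 1 output (hypothetical values).
n_samples, n_inputs, n_outputs = 5, 2, 1
x = np.random.default_rng(0).random((n_samples, n_inputs))
y = x.sum(axis=1, keepdims=True)  # a continuous output: a regression task

# x lives in R^{n_samples x n_inputs}, y in K^{n_samples x n_outputs}.
assert x.shape == (n_samples, n_inputs)
assert y.shape == (n_samples, n_outputs)
```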

In addition, we often want to impose additional constraints on the function $$f$$, mainly to ensure that it generalizes beyond the training data, i.e. that it correctly predicts the outputs for new input values. This is called regularization. Assuming $$f$$ is parametrized by a set of parameters $$\theta$$, and denoting $$f_\theta$$ the parametrized function, one typically seeks to minimize a function of the form

$\mu(y, f_\theta(x)) + \Omega(\theta),$

where $$\mu$$ is a distance-like measure, typically a mean squared error or a cross entropy for a regression, or a probability to be maximized for a classification, and $$\Omega$$ is a regularization term that penalizes the parameters to prevent overfitting, typically some norm of its argument.
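As a concrete instance of this objective, ridge regression takes $$\mu$$ to be the squared error and $$\Omega(\theta) = \lambda\|\theta\|^2$$, which admits a closed-form minimizer. The data below are hypothetical, used only to illustrate the formula:

```python
import numpy as np

# Fit f_theta(x) = x @ theta by minimizing mu(y, f_theta(x)) + Omega(theta),
# with mu the squared error and Omega(theta) = lam * ||theta||^2 (ridge).
rng = np.random.default_rng(42)
x = rng.random((20, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = x @ true_theta

lam = 1e-3
# Closed-form ridge minimizer: theta = (X^T X + lam * I)^{-1} X^T y
theta = np.linalg.solve(x.T @ x + lam * np.eye(3), x.T @ y)
```

With a small $$\lambda$$ the recovered parameters stay close to the generating ones, while a larger $$\lambda$$ shrinks them toward zero.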

The supervised module implements this concept through the MLSupervisedAlgo class based on a Dataset.

class gemseo.mlearning.core.supervised.MLSupervisedAlgo(data, transformer=None, input_names=None, output_names=None, **parameters)[source]

Supervised machine learning algorithm.

Inheriting classes should overload the MLSupervisedAlgo._fit() and MLSupervisedAlgo._predict() methods.

Constructor.

Parameters
• data (Dataset) – learning dataset.

• transformer (dict(str)) – transformation strategy for data groups. If None, do not scale data. Default: None.

• input_names (list(str)) – names of the input variables.

• output_names (list(str)) – names of the output variables.

• parameters – algorithm parameters.
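The extension contract above, subclasses overloading `_fit()` and `_predict()` while the base class owns the public entry points, can be sketched with a self-contained analogue. This is not the real GEMSEO class; `ToySupervisedAlgo` and `ToyLinearRegression` are hypothetical names illustrating the pattern only:

```python
import numpy as np

class ToySupervisedAlgo:
    """Schematic analogue of MLSupervisedAlgo's extension pattern."""

    def learn(self, x, y):
        # Public entry point; delegates the actual training to _fit().
        self._fit(np.asarray(x), np.asarray(y))

    def predict(self, x):
        # Public entry point; delegates the actual prediction to _predict().
        return self._predict(np.atleast_2d(x))

    def _fit(self, x, y):
        raise NotImplementedError  # subclasses overload this

    def _predict(self, x):
        raise NotImplementedError  # subclasses overload this


class ToyLinearRegression(ToySupervisedAlgo):
    """Least-squares fit, overloading only _fit() and _predict()."""

    def _fit(self, x, y):
        self.coef_, *_ = np.linalg.lstsq(x, y, rcond=None)

    def _predict(self, x):
        return x @ self.coef_
```

A subclass therefore never reimplements `learn()` or `predict()`; it only supplies the two protected hooks.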

ABBR = 'MLSupervisedAlgo'
class DataFormatters[source]

Decorators for supervised algorithms.

classmethod format_dict(predict)[source]

If input_data is passed as a dictionary, then convert it to ndarray, and convert output_data to dictionary. Else, do nothing.

Parameters

predict – Method whose input_data and output_data are to be formatted.
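The behaviour described above can be sketched as a standalone decorator. This is a hypothetical simplification (single output variable, names taken from `self.input_names`/`self.output_names`), not GEMSEO's actual implementation:

```python
import functools
import numpy as np

def format_dict(predict):
    """If input_data is a dict, flatten it to an ndarray before calling
    predict, and wrap the result back into a dict; else pass it through."""

    @functools.wraps(predict)
    def wrapper(self, input_data):
        if isinstance(input_data, dict):
            # Concatenate the per-variable arrays into one ndarray.
            array = np.concatenate(
                [np.asarray(input_data[name]) for name in self.input_names],
                axis=-1,
            )
            output = predict(self, array)
            # Single-output sketch: wrap the whole result under one name.
            return {name: output for name in self.output_names}
        return predict(self, input_data)

    return wrapper
```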

classmethod format_input_output(predict)[source]

Apply the format_dict, format_samples and format_transform decorators successively.

Parameters

predict – Method whose input_data and output_data are to be formatted.

classmethod format_samples(predict)[source]

If input_data has shape (n_inputs,), reshape input_data to (1, n_inputs), and then reshape output data from (1, n_outputs) to (n_outputs,). If input_data has shape (n_samples, n_inputs), then do nothing.

Parameters

predict – Method whose input_data and output_data are to be formatted.
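The reshaping rule can be shown directly with NumPy; `predict_batch` below is a hypothetical model standing in for a decorated prediction method:

```python
import numpy as np

def predict_batch(x):
    # Hypothetical model expecting a 2D batch: row sums, one output.
    return x.sum(axis=1, keepdims=True)

single = np.array([1.0, 2.0, 3.0])   # shape (n_inputs,) = (3,)
batch = single.reshape(1, -1)        # promoted to (1, n_inputs) = (1, 3)
out = predict_batch(batch)           # shape (1, n_outputs) = (1, 1)
out_single = out.reshape(-1)         # squeezed back to (n_outputs,) = (1,)
```

A 2D input of shape `(n_samples, n_inputs)` would bypass both reshapes and be passed through unchanged.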

classmethod format_transform(transform_inputs=True, transform_outputs=True)[source]

Apply transform to inputs, and inverse transform to outputs.

Parameters
• transform_inputs (bool) – Indicates whether to transform the inputs.

• transform_outputs (bool) – Indicates whether to transform the outputs.
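The transform/inverse-transform round trip this decorator relies on can be illustrated with a minimal min-max scaler. `MinMaxTransformer` is a hypothetical stand-in, not GEMSEO's Transformer class:

```python
import numpy as np

class MinMaxTransformer:
    """Scale each column to [0, 1]; invertible, as format_transform requires."""

    def fit(self, data):
        self.lo, self.hi = data.min(axis=0), data.max(axis=0)
        return self

    def transform(self, data):
        # Applied to inputs before prediction.
        return (data - self.lo) / (self.hi - self.lo)

    def inverse_transform(self, data):
        # Applied to outputs after prediction, undoing the scaling.
        return data * (self.hi - self.lo) + self.lo
```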

property input_shape

Dimension of input variables before applying transformers.

learn(samples=None)[source]

Train the machine learning algorithm on the learning set, possibly filtered using the given sample indices.

Parameters

samples (list(int)) – indices of training samples.
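The filtering performed by the samples argument amounts to selecting rows of the learning set by index, as in this sketch with hypothetical data:

```python
import numpy as np

# Full learning set: 5 samples of 2 inputs, with a derived output.
x = np.arange(10).reshape(5, 2)
y = x.sum(axis=1)

# Passing samples=[0, 2, 4] restricts training to those rows only.
samples = [0, 2, 4]
x_train, y_train = x[samples], y[samples]
```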

property output_shape

Dimension of output variables before applying transformers.

predict(input_data, *args, **kwargs)[source]