
supervised module

Supervised machine learning algorithm

Supervised machine learning is the task of learning relationships between input and output variables from an input-output dataset. One usually distinguishes between two types of supervised machine learning algorithms, based on the nature of the outputs. For a continuous output variable, a regression is performed, while for a discrete output variable, a classification is performed.

Given a set of input variables \(x \in \mathbb{R}^{n_{\text{samples}}\times n_{\text{inputs}}}\) and a set of output variables \(y\in \mathbb{K}^{n_{\text{samples}}\times n_{\text{outputs}}}\), where \(n_{\text{inputs}}\) is the dimension of the input variable, \(n_{\text{outputs}}\) is the dimension of the output variable, \(n_{\text{samples}}\) is the number of training samples and \(\mathbb{K}\) is either \(\mathbb{R}\) or \(\mathbb{N}\) for regression and classification tasks respectively, a supervised learning algorithm seeks to find a function \(f: \mathbb{R}^{n_{\text{inputs}}} \to \mathbb{K}^{n_{\text{outputs}}}\) such that \(y=f(x)\).

We often want to impose additional constraints on the function \(f\), mainly to ensure that it generalizes beyond the training data, i.e. that it correctly predicts the output values of new input values. This is called regularization. Assuming \(f\) is parametrized by a set of parameters \(\theta\), and denoting by \(f_\theta\) the parametrized function, one typically seeks to minimize a function of the form

\[\mu(y, f_\theta(x)) + \Omega(\theta),\]

where \(\mu\) is a distance-like measure, typically a mean squared error in the case of a regression, or a cross entropy or a probability to be maximized in the case of a classification, and \(\Omega\) is a regularization term that penalizes the parameters to limit overfitting, typically some norm of its argument.
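As an illustration of the objective above, here is a minimal sketch for one particular case: a linear model \(f_\theta(x) = x\theta\), with \(\mu\) taken as the mean squared error and \(\Omega(\theta) = \alpha\|\theta\|_2^2\) (ridge regression). The function names, the weight `alpha` and the synthetic data are assumptions made for this example; nothing here is taken from GEMSEO.

```python
import numpy as np


def regularized_loss(theta, x, y, alpha):
    """Mean squared error over the samples plus a squared L2 penalty on theta."""
    residual = y - x @ theta
    return np.sum(residual**2) / x.shape[0] + alpha * np.sum(theta**2)


def fit_ridge(x, y, alpha):
    """Closed-form minimizer of regularized_loss for a linear model."""
    n_samples, n_inputs = x.shape
    # Setting the gradient to zero gives
    # (X^T X / n_samples + alpha * I) theta = X^T y / n_samples.
    lhs = x.T @ x / n_samples + alpha * np.eye(n_inputs)
    rhs = x.T @ y / n_samples
    return np.linalg.solve(lhs, rhs)


# Synthetic data: 50 samples, 3 inputs, 1 output (illustrative values).
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))
true_theta = np.array([[1.0], [-2.0], [0.5]])
y = x @ true_theta + 0.1 * rng.normal(size=(50, 1))

theta_plain = fit_ridge(x, y, alpha=0.0)  # no regularization
theta_ridge = fit_ridge(x, y, alpha=1.0)  # parameters shrunk towards zero
```

Increasing `alpha` shrinks the norm of the fitted parameters, which is the mechanism by which the penalty term limits overfitting.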

The supervised module implements this concept through the MLSupervisedAlgo class based on a Dataset.

class gemseo.mlearning.core.supervised.MLSupervisedAlgo(data, transformer=None, input_names=None, output_names=None, **parameters)[source]

Bases: gemseo.mlearning.core.ml_algo.MLAlgo

Supervised machine learning algorithm.

Inheriting classes should override the MLSupervisedAlgo._fit() and MLSupervisedAlgo._predict() methods.

Constructor.

Parameters
  • data (Dataset) – learning dataset.

  • transformer (dict(str)) – transformation strategy for data groups. If None, do not scale data. Default: None.

  • input_names (list(str)) – names of the input variables.

  • output_names (list(str)) – names of the output variables.

  • parameters – algorithm parameters.
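The subclassing contract described above (a base class that drives training and prediction, and subclasses that supply `_fit` and `_predict`) can be sketched without GEMSEO as follows. `ToySupervisedAlgo` and `ToyLinearRegression` are hypothetical stand-ins, and the argument lists of `_fit` and `_predict` are assumptions; this page does not document them.

```python
import numpy as np


class ToySupervisedAlgo:
    """Hypothetical stand-in for a supervised base class (not GEMSEO code)."""

    def __init__(self, inputs, outputs):
        self.inputs = np.asarray(inputs, dtype=float)
        self.outputs = np.asarray(outputs, dtype=float)

    def learn(self, samples=None):
        # Train on the rows listed in `samples`, or on every row if None.
        if samples is None:
            samples = range(len(self.inputs))
        samples = list(samples)
        self._fit(self.inputs[samples], self.outputs[samples])

    def predict(self, input_data):
        return self._predict(np.asarray(input_data, dtype=float))

    def _fit(self, input_data, output_data):
        raise NotImplementedError

    def _predict(self, input_data):
        raise NotImplementedError


class ToyLinearRegression(ToySupervisedAlgo):
    """Least-squares linear model with an intercept."""

    @staticmethod
    def _design(input_data):
        # Prepend a column of ones so the first coefficient is the intercept.
        return np.hstack([np.ones((len(input_data), 1)), input_data])

    def _fit(self, input_data, output_data):
        self.coefficients, *_ = np.linalg.lstsq(
            self._design(input_data), output_data, rcond=None
        )

    def _predict(self, input_data):
        return self._design(input_data) @ self.coefficients


inputs = np.array([[0.0], [1.0], [2.0], [3.0]])
outputs = 2.0 * inputs + 1.0
algo = ToyLinearRegression(inputs, outputs)
algo.learn(samples=[0, 1, 3])  # hold out the sample at index 2
prediction = algo.predict(np.array([[2.0]]))  # close to [[5.0]]
```

Keeping `learn` and `predict` in the base class means that sample selection and input handling are written once, while each algorithm only provides its numerical core.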

ABBR = 'MLSupervisedAlgo'

class DataFormatters[source]

Bases: gemseo.mlearning.core.ml_algo.MLAlgo.DataFormatters

Decorators for supervised algorithms.

classmethod format_dict(predict)[source]

If input_data is passed as a dictionary, then convert it to ndarray, and convert output_data to dictionary. Else, do nothing.

Parameters

predict – Method whose input_data and output_data are to be formatted.
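The behaviour described for format_dict can be sketched as a plain function decorator. This is only an illustration under stated assumptions: the module-level `INPUT_NAMES` and `OUTPUT_NAMES` lists are hypothetical, each output variable is assumed to have dimension 1, and the decorated function takes no `self` argument, unlike a real method.

```python
import functools

import numpy as np

# Hypothetical variable names; a real algorithm would take them from its dataset.
INPUT_NAMES = ["x1", "x2"]
OUTPUT_NAMES = ["y"]


def format_dict(predict):
    """Accept dict inputs and return dict outputs; pass arrays through unchanged."""

    @functools.wraps(predict)
    def wrapper(input_data):
        as_dict = isinstance(input_data, dict)
        if as_dict:
            # Concatenate the named inputs along the last axis, in a fixed order.
            input_data = np.concatenate(
                [np.asarray(input_data[name]) for name in INPUT_NAMES], axis=-1
            )
        output_data = predict(input_data)
        if as_dict:
            # Assumption: every output variable occupies exactly one column.
            output_data = {
                name: output_data[..., index : index + 1]
                for index, name in enumerate(OUTPUT_NAMES)
            }
        return output_data

    return wrapper


@format_dict
def predict(input_data):
    # Toy model: the output is the sum of the inputs.
    return np.sum(input_data, axis=-1, keepdims=True)


dict_result = predict({"x1": np.array([1.0]), "x2": np.array([2.0])})
array_result = predict(np.array([1.0, 2.0]))
```

Here `dict_result` is a dictionary with the single key `"y"`, while `array_result` is a plain array, because the input type decides the output type.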

classmethod format_input_output(predict)[source]

Apply the format_dict, format_samples and format_transform formatting successively.

Parameters

predict – Method whose input_data and output_data are to be formatted.

classmethod format_samples(predict)[source]

If input_data has shape (n_inputs,), reshape input_data to (1, n_inputs), and then reshape output data from (1, n_outputs) to (n_outputs,). If input_data has shape (n_samples, n_inputs), then do nothing.

Parameters

predict – Method whose input_data and output_data are to be formatted.
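The reshaping rule described for format_samples can be illustrated with a small stand-alone decorator. As before, this is a sketch rather than the library code: the decorated function is a plain function without `self`, and the toy model is an assumption.

```python
import functools

import numpy as np


def format_samples(predict):
    """Let a function written for 2D inputs also accept a single 1D sample."""

    @functools.wraps(predict)
    def wrapper(input_data):
        single_sample = input_data.ndim == 1
        if single_sample:
            # (n_inputs,) -> (1, n_inputs)
            input_data = input_data[np.newaxis, :]
        output_data = predict(input_data)
        if single_sample:
            # (1, n_outputs) -> (n_outputs,)
            output_data = output_data[0]
        return output_data

    return wrapper


@format_samples
def predict(input_data):
    # Toy model that only handles arrays of shape (n_samples, 2).
    weights = np.array([[1.0], [2.0]])
    return input_data @ weights


single = predict(np.array([1.0, 1.0]))               # shape (1,)
batch = predict(np.array([[1.0, 1.0], [2.0, 0.0]]))  # shape (2, 1)
```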

classmethod format_transform(transform_inputs=True, transform_outputs=True)[source]

Apply transform to inputs, and inverse transform to outputs.

Parameters
  • transform_inputs (bool) – Indicates whether to transform inputs.

  • transform_outputs (bool) – Indicates whether to transform outputs.
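Because format_transform takes arguments, it acts as a decorator factory. The sketch below shows that pattern under stated assumptions: `ToyScaler` and the two module-level scaler instances are hypothetical, whereas a real algorithm would use the transformers given to its constructor.

```python
import functools

import numpy as np


class ToyScaler:
    """Hypothetical affine transformer with an exact inverse."""

    def __init__(self, offset, scale):
        self.offset = offset
        self.scale = scale

    def transform(self, data):
        return (data - self.offset) / self.scale

    def inverse_transform(self, data):
        return data * self.scale + self.offset


INPUT_SCALER = ToyScaler(offset=1.0, scale=2.0)
OUTPUT_SCALER = ToyScaler(offset=10.0, scale=5.0)


def format_transform(transform_inputs=True, transform_outputs=True):
    """Build a decorator that maps inputs to, and outputs from, the scaled space."""

    def decorator(predict):
        @functools.wraps(predict)
        def wrapper(input_data):
            if transform_inputs:
                input_data = INPUT_SCALER.transform(input_data)
            output_data = predict(input_data)
            if transform_outputs:
                output_data = OUTPUT_SCALER.inverse_transform(output_data)
            return output_data

        return wrapper

    return decorator


@format_transform()
def predict(input_data):
    # Toy model working in the scaled space: identity.
    return input_data


@format_transform(transform_outputs=False)
def predict_scaled(input_data):
    # Same model, but the outputs are left in the scaled space.
    return input_data


original_space = predict(np.array([[3.0]]))       # (3 - 1) / 2 = 1, then 1 * 5 + 10 = 15
scaled_space = predict_scaled(np.array([[3.0]]))  # (3 - 1) / 2 = 1
```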

property input_shape

Dimension of input variables before applying transformers.

learn(samples=None)[source]

Train the machine learning algorithm on the learning set, optionally restricted to the samples whose indices are given.

Parameters

samples (list(int)) – indices of training samples.

property output_shape

Dimension of output variables before applying transformers.

predict(input_data, *args, **kwargs)[source]