gemseo.mlearning.regression.algos.gpr module#

Gaussian process regression model.

Overview#

The Gaussian process regression (GPR) model expresses the model output as a weighted sum of kernel functions centered on the learning input data:

\[y = \mu + w_1\kappa(\|x-x_1\|;\epsilon) + w_2\kappa(\|x-x_2\|;\epsilon) + \cdots + w_N\kappa(\|x-x_N\|;\epsilon)\]

Details#

The GPR model relies on the assumption that the original model \(f\) to be replaced is an instance of a Gaussian process (GP) with mean \(\mu\) and covariance \(\sigma^2\kappa(\|x-x'\|;\epsilon)\).

Then, the GP conditioned by the learning set \((x_i,y_i)_{1\leq i \leq N}\) is entirely defined by its expectation:

\[\hat{f}(x) = \hat{\mu} + \hat{w}^T k(x)\]

and its covariance:

\[\hat{c}(x,x') = \hat{\sigma}^2 - k(x)^T K^{-1} k(x')\]

where \([\hat{\mu};\hat{w}]=([1_N~K]^T[1_N~K])^{-1}[1_N~K]^TY\) with \(K_{ij}=\kappa(\|x_i-x_j\|;\hat{\epsilon})\), \(k_i(x)=\kappa(\|x-x_i\|;\hat{\epsilon})\) and \(Y_i=y_i\).

The correlation length vector \(\epsilon\) is estimated by numerical non-linear optimization.
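The fitting formulas above can be sketched in a few lines of NumPy. This is an illustrative sketch, not GEMSEO code: the Gaussian kernel \(\kappa(r;\epsilon)=\exp(-(r/\epsilon)^2)\), the data and the fixed correlation length are all hypothetical choices made for the example.

```python
import numpy as np

# Hypothetical kernel choice for illustration: a Gaussian bump.
def kappa(r, eps):
    return np.exp(-((r / eps) ** 2))

x = np.linspace(0.0, 1.0, 8)   # learning inputs x_1, ..., x_N
y = np.sin(2 * np.pi * x)      # learning outputs y_1, ..., y_N
eps = 0.2                      # correlation length, fixed here instead of optimized

# Correlation matrix K_ij = kappa(|x_i - x_j|; eps)
K = kappa(np.abs(x[:, None] - x[None, :]), eps)

# Least-squares estimate [mu_hat; w_hat] = ([1_N K]^T [1_N K])^{-1} [1_N K]^T Y
F = np.column_stack([np.ones_like(x), K])
coef, *_ = np.linalg.lstsq(F, y, rcond=None)
mu_hat, w_hat = coef[0], coef[1:]

# Surrogate expectation f_hat(x) = mu_hat + w_hat^T k(x)
def f_hat(x_new):
    k = kappa(np.abs(x_new - x), eps)  # k_i(x) = kappa(|x - x_i|; eps)
    return mu_hat + w_hat @ k

print(f_hat(0.5))
```

Because the system is solved in a least-squares sense with more coefficients than data points, the resulting surrogate reproduces the learning outputs at the learning inputs.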

Surrogate model#

The expectation \(\hat{f}\) is the surrogate model of \(f\).

Error measure#

The standard deviation \(\hat{s}\) is a local error measure of \(\hat{f}\):

\[\hat{s}(x):=\sqrt{\hat{c}(x,x)}\]
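A small sketch can show the key property of this error measure: it vanishes at the learning points and grows in between. The kernel, the inputs and the normalization \(\hat{\sigma}^2 = 1\) are illustrative assumptions; the covariance formula implemented is the one given above.

```python
import numpy as np

# Hypothetical Gaussian kernel for illustration.
def kappa(r, eps):
    return np.exp(-((r / eps) ** 2))

x = np.linspace(0.0, 1.0, 8)   # learning inputs
eps = 0.2
K = kappa(np.abs(x[:, None] - x[None, :]), eps)
K_inv = np.linalg.inv(K)
sigma2_hat = 1.0               # illustrative normalization

# s_hat(x) = sqrt(c_hat(x, x)) with c_hat(x, x) = sigma2_hat - k(x)^T K^{-1} k(x)
def s_hat(x_new):
    k = kappa(np.abs(x_new - x), eps)
    c = sigma2_hat - k @ K_inv @ k
    return np.sqrt(max(c, 0.0))  # clip tiny negative round-off

# Zero at a learning point, positive at the midpoint between two of them:
print(s_hat(x[0]), s_hat(0.5 * (x[0] + x[1])))
```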

Interpolation or regression#

The GPR model can be regressive or interpolative according to the value of the nugget effect \(\alpha\geq 0\), a regularization term added to the correlation matrix \(K\). When \(\alpha = 0\), the surrogate model interpolates the learning data.
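The effect of the nugget can be sketched by comparing the residuals at the learning points with and without it. This is an illustrative sketch with a hypothetical Gaussian kernel and noisy data, not GEMSEO code: with \(\alpha = 0\) the predictions reproduce the noisy outputs exactly, while \(\alpha > 0\) smooths them.

```python
import numpy as np

# Hypothetical Gaussian kernel for illustration.
def kappa(r, eps):
    return np.exp(-((r / eps) ** 2))

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)  # noisy outputs
eps = 0.1
K = kappa(np.abs(x[:, None] - x[None, :]), eps)

# Predictions at the learning points with nugget alpha: solve (K + alpha I) w = y
def predict_at_learning_points(alpha):
    w = np.linalg.solve(K + alpha * np.eye(len(x)), y)
    return K @ w

exact = predict_at_learning_points(0.0)    # alpha = 0: interpolates the data
smooth = predict_at_learning_points(1e-1)  # alpha > 0: regresses (smooths noise)
print(np.max(np.abs(exact - y)), np.max(np.abs(smooth - y)))
```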

Dependence#

The GPR model relies on the GaussianProcessRegressor class of the scikit-learn library.
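For reference, the wrapped scikit-learn estimator can be used directly as follows. This is a standalone sketch of the scikit-learn API, not of the GEMSEO wrapper; the RBF kernel, length scale and nugget value (`alpha`) are illustrative choices.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.linspace(0.0, 1.0, 8).reshape(-1, 1)  # learning inputs, shape (N, 1)
y = np.sin(2 * np.pi * X).ravel()            # learning outputs

# Illustrative kernel; fit() optimizes the length scale by maximum likelihood.
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-8)
gpr.fit(X, y)

# return_std=True also returns the local error measure at the query points.
mean, std = gpr.predict(np.array([[0.5]]), return_std=True)
print(mean, std)
```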

class GaussianProcessRegressor(data, settings_model=None, **settings)[source]#

Bases: BaseRandomProcessRegressor

Gaussian process regression model.

Parameters:
  • data (Dataset) -- The learning dataset.

  • settings_model (BaseMLAlgoSettings | None) -- The machine learning algorithm settings as a Pydantic model. If None, use **settings.

  • **settings (Any) -- The machine learning algorithm settings. These arguments are ignored when settings_model is not None.

Raises:

ValueError -- When both the variable and the group it belongs to have a transformer.

Settings#

alias of GaussianProcessRegressor_Settings

compute_samples(input_data, n_samples, seed=None)[source]#

Sample a random vector from the conditioned Gaussian process.

Parameters:
  • input_data (RealArray) -- The \(N\) input points of dimension \(d\) at which to observe the conditioned Gaussian process; shaped as (N, d).

  • n_samples (int) -- The number of samples M.

  • seed (int | None) -- The seed for reproducible results.

Returns:

The output samples shaped as (M, N, p) where p is the output dimension.

Return type:

RealArray
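The shape convention of the returned samples can be sketched with plain NumPy. The mean and covariance below are placeholders, not a conditioned GP: the point is only how \(M\) samples at \(N\) points for a \(p\)-dimensional output stack into an array of shape (M, N, p).

```python
import numpy as np

rng = np.random.default_rng(42)  # plays the role of the `seed` argument
N, p, M = 5, 2, 3                # query points, output dimension, samples

mean = np.zeros((N, p))          # placeholder conditioned mean at the N points
cov = np.eye(N)                  # placeholder conditioned covariance

# One multivariate-normal draw per output dimension, stacked to (M, N, p)
samples = np.stack(
    [rng.multivariate_normal(mean[:, j], cov, size=M) for j in range(p)],
    axis=-1,
)
print(samples.shape)  # (3, 5, 2)
```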

predict_std(input_data)[source]#

Predict the standard deviation from input data.

The user can specify these input data either as a NumPy array, e.g. array([1., 2., 3.]) or as a dictionary of NumPy arrays, e.g. {'a': array([1.]), 'b': array([2., 3.])}.

If the NumPy arrays are 2-dimensional, the i-th row represents the input data of the i-th sample; if they are 1-dimensional, they represent a single sample.

Parameters:

input_data (DataType) -- The input data.

Returns:

The standard deviation at the query points.

Warning

This statistic is expressed in the transformed output space. If the original output space differs from the transformed one, you can estimate the standard deviation there by sampling the predict() method.

Return type:

RealArray

LIBRARY: ClassVar[str] = 'scikit-learn'#

The name of the library of the wrapped machine learning algorithm.

SHORT_ALGO_NAME: ClassVar[str] = 'GPR'#

The short name of the machine learning algorithm, often an acronym.

Typically used for composite names, e.g. f"{algo.SHORT_ALGO_NAME}_{dataset.name}" or f"{algo.SHORT_ALGO_NAME}_{discipline.name}".

property kernel#

The kernel used for prediction.