Scaling

from gemseo.algos.design_space import DesignSpace
from gemseo.algos.doe.lib_openturns import OpenTURNS
from gemseo.algos.opt_problem import OptimizationProblem
from gemseo.core.mdofunctions.mdo_function import MDOFunction
from gemseo.mlearning.quality_measures.r2_measure import R2Measure
from gemseo.mlearning.regression.gpr import GaussianProcessRegressor
from gemseo.problems.optimization.rosenbrock import Rosenbrock

Scaling the data around zero is important to avoid numerical issues when fitting a machine learning model. This is all the more true when the variables have different ranges or when the fitting relies on numerical optimization techniques. This example illustrates the latter point.
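For instance, min-max scaling maps each variable \(x\) to \((x-\min x)/(\max x-\min x)\), so that every scaled value lies in \([0,1]\). Here is a minimal NumPy sketch of this transformation (the sample values are purely illustrative):

import numpy as np

samples = np.array([[-2.0, -200.0], [0.0, -50.0], [2.0, 200.0]])  # two variables with very different ranges
minimum = samples.min(axis=0)  # per-variable minimum
maximum = samples.max(axis=0)  # per-variable maximum
scaled = (samples - minimum) / (maximum - minimum)  # each column now lies in [0, 1]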

First, we consider the Rosenbrock function \(f(x)=(1-x_1)^2+100(x_2-x_1^2)^2\) over the domain \([-2,2]^2\):

problem = Rosenbrock()

In order to approximate this function with a regression model, we sample it 30 times with an optimized Latin hypercube sampling (LHS) technique:

openturns = OpenTURNS()
openturns.execute(problem, openturns.OT_LHSO, n_samples=30)
Optimization result:
  • Design variables: [0.86803148 0.78478463]
  • Objective function: 0.11542217328145475
  • Feasible solution: True


and save the samples in an IODataset (with opt_naming=False, the data are grouped as model inputs and outputs rather than as design variables and objective values):

dataset_train = problem.to_dataset(opt_naming=False)

We do the same with a full-factorial design of experiments (DOE) of size 900, i.e. a 30 × 30 grid, to create the test dataset:

openturns.execute(problem, openturns.OT_FULLFACT, n_samples=30 * 30)
dataset_test = problem.to_dataset(opt_naming=False)

Then, we create a first Gaussian process regressor from the training dataset:

gpr = GaussianProcessRegressor(dataset_train)
gpr.learn()

and compute its R2 quality from the test dataset:

r2 = R2Measure(gpr)
r2.compute_test_measure(dataset_test)
array([0.85996207])
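For reference, the R2 measure compares the squared prediction errors on the test dataset to the variance of the test outputs, \(R^2=1-\sum_i(y_i-\hat{y}_i)^2/\sum_i(y_i-\bar{y})^2\); a value of 1 corresponds to a perfect prediction.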

Next, we create a second Gaussian process regressor from the same training dataset, this time with the default input and output transformers, which are MinMaxScaler instances:

gpr = GaussianProcessRegressor(
    dataset_train, transformer=GaussianProcessRegressor.DEFAULT_TRANSFORMER
)
gpr.learn()

We can see that the scaling improves the R2 quality (recall: the higher, the better):

r2 = R2Measure(gpr)
r2.compute_test_measure(dataset_test)
array([0.99503456])

We note that in this case the input scaling does not contribute to this improvement, since both inputs already share the same range \([-2,2]\):

gpr = GaussianProcessRegressor(dataset_train, transformer={"outputs": "MinMaxScaler"})
gpr.learn()
r2 = R2Measure(gpr)
r2.compute_test_measure(dataset_test)
array([0.99503457])

We can also see that using a StandardScaler for the outputs is less relevant in this case:

gpr = GaussianProcessRegressor(dataset_train, transformer={"outputs": "StandardScaler"})
gpr.learn()
r2 = R2Measure(gpr)
r2.compute_test_measure(dataset_test)
array([0.97049746])
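The two scalers apply different affine transformations: a MinMaxScaler maps each variable to \([0,1]\), whereas a StandardScaler centers it and divides it by its standard deviation. A quick NumPy comparison with illustrative values:

import numpy as np

y = np.array([0.0, 1.0, 4.0, 9.0, 16.0])  # illustrative output samples
min_max = (y - y.min()) / (y.max() - y.min())  # rescaled to [0, 1]
standard = (y - y.mean()) / y.std()  # zero mean and unit variance, not confined to a fixed interval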

Finally, we rewrite the Rosenbrock function as \(f(x)=(1-x_1)^2+100(0.01x_2-x_1^2)^2\) and its domain as \([-2,2]\times[-200,200]\):

design_space = DesignSpace()
design_space.add_variable("x1", l_b=-2, u_b=2)
design_space.add_variable("x2", l_b=-200, u_b=200)

in order to have inputs with different orders of magnitude. We create the learning and test datasets in the same way:

problem = OptimizationProblem(design_space)
problem.objective = MDOFunction(
    lambda x: (1 - x[0]) ** 2 + 100 * (0.01 * x[1] - x[0] ** 2) ** 2, "f"
)
openturns.execute(problem, openturns.OT_LHSO, n_samples=30)
dataset_train = problem.to_dataset(opt_naming=False)
openturns.execute(problem, openturns.OT_FULLFACT, n_samples=30 * 30)
dataset_test = problem.to_dataset(opt_naming=False)

and build a first Gaussian process regressor with a min-max scaler for the outputs:

gpr = GaussianProcessRegressor(dataset_train, transformer={"outputs": "MinMaxScaler"})
gpr.learn()
r2 = R2Measure(gpr)
r2.compute_test_measure(dataset_test)
.../sklearn/gaussian_process/kernels.py:455: ConvergenceWarning: The optimal value found for dimension 1 of parameter length_scale is close to the specified upper bound 100.0. Increasing the bound and calling fit again may find a better value.
  warnings.warn(

array([0.78692926])

The R2 quality is degraded: the inputs now have very different orders of magnitude, which makes the estimation of the Gaussian process's correlation lengths difficult, as the ConvergenceWarning above indicates. The estimation can be facilitated by also setting a MinMaxScaler for the inputs:

gpr = GaussianProcessRegressor(
    dataset_train, transformer={"inputs": "MinMaxScaler", "outputs": "MinMaxScaler"}
)
gpr.learn()
r2 = R2Measure(gpr)
r2.compute_test_measure(dataset_test)
array([0.98758502])
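As a sanity check, the rescaled function coincides with the original Rosenbrock function under the substitution \(x_2\mapsto 100x_2\); a quick verification in pure Python, independent of GEMSEO:

def rosenbrock(x1, x2):
    # Original Rosenbrock function over [-2, 2]^2.
    return (1 - x1) ** 2 + 100 * (x2 - x1**2) ** 2

def rescaled_rosenbrock(x1, x2):
    # Rewritten function over [-2, 2] x [-200, 200].
    return (1 - x1) ** 2 + 100 * (0.01 * x2 - x1**2) ** 2

# The second input is simply stretched by a factor of 100.
assert abs(rescaled_rosenbrock(0.5, 100 * 0.25) - rosenbrock(0.5, 0.25)) < 1e-12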
