.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/mlearning/transformer/plot_scaling_data.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_examples_mlearning_transformer_plot_scaling_data.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_mlearning_transformer_plot_scaling_data.py:


Scaling
=======

.. GENERATED FROM PYTHON SOURCE LINES 20-31

.. code-block:: Python


    from __future__ import annotations

    from gemseo.algos.design_space import DesignSpace
    from gemseo.algos.doe.openturns.openturns import OpenTURNS
    from gemseo.algos.optimization_problem import OptimizationProblem
    from gemseo.core.mdo_functions.mdo_function import MDOFunction
    from gemseo.mlearning.regression.algos.gpr import GaussianProcessRegressor
    from gemseo.mlearning.regression.quality.r2_measure import R2Measure
    from gemseo.problems.optimization.rosenbrock import Rosenbrock


.. GENERATED FROM PYTHON SOURCE LINES 32-42

Scaling data around zero is important to avoid numerical issues
when fitting a machine learning model.
This is all the more true when
the variables have different ranges
or when the fitting relies on numerical optimization techniques.
This example illustrates the latter point.

First,
we consider the Rosenbrock function :math:`f(x)=(1-x_1)^2+100(x_2-x_1^2)^2`
over the domain :math:`[-2,2]^2`:

.. GENERATED FROM PYTHON SOURCE LINES 42-44

.. code-block:: Python

    problem = Rosenbrock()
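
For reference, the analytical expression above can be evaluated directly.
The following is a minimal pure-Python sketch;
``rosenbrock`` is a hypothetical helper written for illustration
and is not part of the GEMSEO API:

.. code-block:: Python

    def rosenbrock(x1: float, x2: float) -> float:
        """Evaluate f(x) = (1 - x1)^2 + 100 * (x2 - x1^2)^2."""
        return (1 - x1) ** 2 + 100 * (x2 - x1**2) ** 2


    # The global minimum is reached at (1, 1), where f vanishes.
    value_at_minimum = rosenbrock(1.0, 1.0)
    # A corner of the domain [-2, 2]^2, where f is much larger.
    value_at_corner = rosenbrock(-2.0, 2.0)
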


.. GENERATED FROM PYTHON SOURCE LINES 45-47

In order to approximate this function with a regression model,
we sample it 30 times with an optimized Latin hypercube sampling (LHS) technique

.. GENERATED FROM PYTHON SOURCE LINES 47-50

.. code-block:: Python

    opt_lhs = OpenTURNS("OT_OPT_LHS")
    opt_lhs.execute(problem, n_samples=30)


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div style='margin: 1em;'>Optimization result:<br/><ul><li>Design variables: [0.86803148 0.78478463]</li><li>Objective function: 0.11542217328145475</li><li>Feasible solution: True</li></ul></div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 51-52

and save the samples in an :class:`.IODataset`:

.. GENERATED FROM PYTHON SOURCE LINES 52-54

.. code-block:: Python

    dataset_train = problem.to_dataset(opt_naming=False)
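
The principle of Latin hypercube sampling can be sketched in plain Python:
the range of each variable is split into as many equal strata as there are samples,
and each stratum receives exactly one point.
This is a basic, non-optimized variant given for illustration only;
``latin_hypercube`` is a hypothetical helper
and does not reproduce the OpenTURNS algorithm used above,
which additionally optimizes the design:

.. code-block:: Python

    import random


    def latin_hypercube(n_samples, bounds, seed=0):
        """Return n_samples points with one point per stratum and per variable."""
        rng = random.Random(seed)
        columns = []
        for lower, upper in bounds:
            # Draw one value uniformly in each of the n_samples equal-width strata.
            strata = [(i + rng.random()) / n_samples for i in range(n_samples)]
            # Shuffle so that the strata of the variables are paired at random.
            rng.shuffle(strata)
            columns.append([lower + u * (upper - lower) for u in strata])
        return [list(point) for point in zip(*columns)]


    samples = latin_hypercube(30, [(-2.0, 2.0), (-2.0, 2.0)])
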


.. GENERATED FROM PYTHON SOURCE LINES 55-56

We do the same with a full-factorial design of experiments (DOE) of size 900
to build the test dataset:

.. GENERATED FROM PYTHON SOURCE LINES 56-60

.. code-block:: Python

    full_fact = OpenTURNS("OT_FULLFACT")
    full_fact.execute(problem, n_samples=30 * 30)
    dataset_test = problem.to_dataset(opt_naming=False)
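
A full-factorial DOE is a regular grid.
The sketch below builds such a grid in plain Python,
assuming 30 equally spaced levels per variable
as the ``30 * 30`` in the code above suggests;
``full_factorial`` is a hypothetical helper, not the OpenTURNS implementation:

.. code-block:: Python

    from itertools import product


    def full_factorial(n_levels, bounds):
        """Return the grid with n_levels equally spaced levels per variable."""
        axes = [
            [lower + (upper - lower) * i / (n_levels - 1) for i in range(n_levels)]
            for lower, upper in bounds
        ]
        # The Cartesian product of the axes gives n_levels ** len(bounds) points.
        return [list(point) for point in product(*axes)]


    grid = full_factorial(30, [(-2.0, 2.0), (-2.0, 2.0)])
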


.. GENERATED FROM PYTHON SOURCE LINES 61-63

Then,
we create a first Gaussian process regressor from the training dataset:

.. GENERATED FROM PYTHON SOURCE LINES 63-66

.. code-block:: Python

    gpr = GaussianProcessRegressor(dataset_train)
    gpr.learn()


.. GENERATED FROM PYTHON SOURCE LINES 67-68

and compute its R2 quality from the test dataset:

.. GENERATED FROM PYTHON SOURCE LINES 68-71

.. code-block:: Python

    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    array([0.97049746])
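
As a reminder, the usual definition of the R2 measure is
one minus the ratio of the residual sum of squares to the total sum of squares.
The following plain-Python sketch implements this textbook definition;
``r2_score`` is a hypothetical helper
and :class:`.R2Measure` is only assumed to follow the same formula:

.. code-block:: Python

    def r2_score(observations, predictions):
        """Return 1 - SS_res / SS_tot for two sequences of equal length."""
        mean = sum(observations) / len(observations)
        residual = sum((o - p) ** 2 for o, p in zip(observations, predictions))
        total = sum((o - mean) ** 2 for o in observations)
        return 1 - residual / total


    observations = [1.0, 2.0, 3.0, 4.0]
    # A model reproducing the observations exactly has the best possible score.
    perfect = r2_score(observations, observations)
    # A model always predicting the mean of the observations scores zero.
    constant = r2_score(observations, [2.5] * 4)
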


.. GENERATED FROM PYTHON SOURCE LINES 72-75

Then,
we create a second Gaussian process regressor from the training dataset
with the default input and output transformers, which are :class:`.MinMaxScaler` instances:

.. GENERATED FROM PYTHON SOURCE LINES 75-80

.. code-block:: Python

    gpr = GaussianProcessRegressor(
        dataset_train, transformer=GaussianProcessRegressor.DEFAULT_TRANSFORMER
    )
    gpr.learn()


.. GENERATED FROM PYTHON SOURCE LINES 81-82

We can then compare the R2 quality with that of the first model
(recall: the higher, the better);
in the run recorded here, the two values are identical:

.. GENERATED FROM PYTHON SOURCE LINES 82-85

.. code-block:: Python

    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    array([0.97049746])
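
Min-max scaling maps the smallest value of a variable to 0 and the largest to 1.
The sketch below shows this standard transformation in plain Python;
``min_max_scale`` is a hypothetical helper
and the default behavior of :class:`.MinMaxScaler` is only assumed to match it:

.. code-block:: Python

    def min_max_scale(values):
        """Map values linearly so that min(values) -> 0 and max(values) -> 1."""
        lower, upper = min(values), max(values)
        return [(v - lower) / (upper - lower) for v in values]


    scaled = min_max_scale([-200.0, 0.0, 200.0])
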


.. GENERATED FROM PYTHON SOURCE LINES 86-87

Scaling only the outputs yields the same R2 value in this run,
so the input scaling has no effect in this case:

.. GENERATED FROM PYTHON SOURCE LINES 87-92

.. code-block:: Python

    gpr = GaussianProcessRegressor(dataset_train, transformer={"outputs": "MinMaxScaler"})
    gpr.learn()
    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    array([0.97049746])


.. GENERATED FROM PYTHON SOURCE LINES 93-94

Using a :class:`.StandardScaler` for the outputs
also leaves the R2 value unchanged in this run:

.. GENERATED FROM PYTHON SOURCE LINES 94-99

.. code-block:: Python

    gpr = GaussianProcessRegressor(dataset_train, transformer={"outputs": "StandardScaler"})
    gpr.learn()
    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    array([0.97049746])
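
Standardization centers the data on its mean and divides it by its standard deviation.
The following plain-Python sketch uses the population standard deviation,
which is an assumption of this illustration;
``standardize`` is a hypothetical helper, not the :class:`.StandardScaler` code:

.. code-block:: Python

    from statistics import fmean, pstdev


    def standardize(values):
        """Return values shifted to zero mean and scaled to unit standard deviation."""
        mean = fmean(values)
        std = pstdev(values)
        return [(v - mean) / std for v in values]


    standardized = standardize([0.0, 1.0, 4.0, 409.0])
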


.. GENERATED FROM PYTHON SOURCE LINES 100-103

Finally,
we rewrite the Rosenbrock function as :math:`f(x)=(1-x_1)^2+100(0.01x_2-x_1^2)^2`
and its domain as :math:`[-2,2]\times[-200,200]`:

.. GENERATED FROM PYTHON SOURCE LINES 103-107

.. code-block:: Python

    design_space = DesignSpace()
    design_space.add_variable("x1", lower_bound=-2, upper_bound=2)
    design_space.add_variable("x2", lower_bound=-200, upper_bound=200)


.. GENERATED FROM PYTHON SOURCE LINES 108-110

in order to have inputs with different orders of magnitude.
We create the learning and test datasets in the same way:

.. GENERATED FROM PYTHON SOURCE LINES 110-119

.. code-block:: Python

    problem = OptimizationProblem(design_space)
    problem.objective = MDOFunction(
        lambda x: (1 - x[0]) ** 2 + 100 * (0.01 * x[1] - x[0] ** 2) ** 2, "f"
    )
    opt_lhs.execute(problem, n_samples=30)
    dataset_train = problem.to_dataset(opt_naming=False)
    full_fact.execute(problem, n_samples=30 * 30)
    dataset_test = problem.to_dataset(opt_naming=False)


.. GENERATED FROM PYTHON SOURCE LINES 120-121

and build a first Gaussian process regressor with a min-max scaler for the outputs:

.. GENERATED FROM PYTHON SOURCE LINES 121-126

.. code-block:: Python

    gpr = GaussianProcessRegressor(dataset_train, transformer={"outputs": "MinMaxScaler"})
    gpr.learn()
    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    array([0.90669624])


.. GENERATED FROM PYTHON SOURCE LINES 127-130

The R2 quality is degraded
because the correlation lengths of the model are harder to estimate
when the inputs have different orders of magnitude.
Setting a :class:`.MinMaxScaler` for the inputs makes this estimation easier:

.. GENERATED FROM PYTHON SOURCE LINES 130-136

.. code-block:: Python

    gpr = GaussianProcessRegressor(
        dataset_train, transformer={"inputs": "MinMaxScaler", "outputs": "MinMaxScaler"}
    )
    gpr.learn()
    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    array([0.97432803])
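
One way to picture the effect of input scaling is to compare
how much each variable contributes to the distance between two points,
a quantity on which many Gaussian process kernels depend.
The sketch below is a plain-Python illustration and not GEMSEO's internal code;
``min_max_scale_point`` is a hypothetical helper
that uses the variable bounds as a stand-in for the sample minima and maxima:

.. code-block:: Python

    from math import dist

    BOUNDS = [(-2.0, 2.0), (-200.0, 200.0)]


    def min_max_scale_point(point, bounds=BOUNDS):
        """Map each coordinate of a point to [0, 1] using the variable bounds."""
        return [(v - low) / (up - low) for v, (low, up) in zip(point, bounds)]


    origin = [-2.0, -200.0]
    across_x1 = [2.0, -200.0]  # spans the whole range of x1
    across_x2 = [-2.0, 200.0]  # spans the whole range of x2

    # Raw inputs: crossing the range of x2 is much longer than crossing that of x1.
    raw_ratio = dist(origin, across_x2) / dist(origin, across_x1)

    # Scaled inputs: both moves have the same length.
    scaled_origin = min_max_scale_point(origin)
    scaled_ratio = dist(scaled_origin, min_max_scale_point(across_x2)) / dist(
        scaled_origin, min_max_scale_point(across_x1)
    )

With the raw inputs, the two variables would need correlation lengths
of very different magnitudes; after scaling, they are on a comparable footing.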


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.914 seconds)


.. _sphx_glr_download_examples_mlearning_transformer_plot_scaling_data.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_scaling_data.ipynb <plot_scaling_data.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_scaling_data.py <plot_scaling_data.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_scaling_data.zip <plot_scaling_data.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_