.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/mlearning/transformer/plot_scaling_data.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_examples_mlearning_transformer_plot_scaling_data.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_mlearning_transformer_plot_scaling_data.py:


Scaling
=======

.. GENERATED FROM PYTHON SOURCE LINES 20-29

.. code-block:: Python

    from gemseo.algos.design_space import DesignSpace
    from gemseo.algos.doe.lib_openturns import OpenTURNS
    from gemseo.algos.opt_problem import OptimizationProblem
    from gemseo.core.mdofunctions.mdo_function import MDOFunction
    from gemseo.mlearning.quality_measures.r2_measure import R2Measure
    from gemseo.mlearning.regression.gpr import GaussianProcessRegressor
    from gemseo.problems.optimization.rosenbrock import Rosenbrock

.. GENERATED FROM PYTHON SOURCE LINES 30-40

Scaling data around zero is important to avoid numerical issues
when fitting a machine learning model.
This is all the more true as the variables have different ranges
or the fitting relies on numerical optimization techniques.
This example illustrates the latter point.
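Most of the scaling used below relies on the :class:`.MinMaxScaler` transformer,
i.e. the usual min-max transformation,
which maps each variable :math:`x` linearly onto the unit interval:

.. math::

   \tilde{x} = \frac{x - \min(x)}{\max(x) - \min(x)} \in [0, 1].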
First, we consider the Rosenbrock function
:math:`f(x)=(1-x_1)^2+100(x_2-x_1^2)^2`
over the domain :math:`[-2,2]^2`:

.. GENERATED FROM PYTHON SOURCE LINES 40-42

.. code-block:: Python

    problem = Rosenbrock()

.. GENERATED FROM PYTHON SOURCE LINES 43-45

In order to approximate this function with a regression model,
we sample it 30 times with an optimized Latin hypercube sampling (LHS) technique

.. GENERATED FROM PYTHON SOURCE LINES 45-48

.. code-block:: Python

    openturns = OpenTURNS()
    openturns.execute(problem, openturns.OT_LHSO, n_samples=30)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Optimization result:
      • Design variables: [0.86803148 0.78478463]
      • Objective function: 0.11542217328145475
      • Feasible solution: True

.. GENERATED FROM PYTHON SOURCE LINES 49-50

and save the samples in an :class:`IODataset`:

.. GENERATED FROM PYTHON SOURCE LINES 50-52

.. code-block:: Python

    dataset_train = problem.to_dataset(opt_naming=False)

.. GENERATED FROM PYTHON SOURCE LINES 53-54

We do the same with a full-factorial design of experiments (DOE) of size 900:

.. GENERATED FROM PYTHON SOURCE LINES 54-57

.. code-block:: Python

    openturns.execute(problem, openturns.OT_FULLFACT, n_samples=30 * 30)
    dataset_test = problem.to_dataset(opt_naming=False)

.. GENERATED FROM PYTHON SOURCE LINES 58-60

Then, we create a first Gaussian process regressor from the training dataset:

.. GENERATED FROM PYTHON SOURCE LINES 60-63

.. code-block:: Python

    gpr = GaussianProcessRegressor(dataset_train)
    gpr.learn()

.. GENERATED FROM PYTHON SOURCE LINES 64-65

and compute its R2 quality from the test dataset:

.. GENERATED FROM PYTHON SOURCE LINES 65-68

.. code-block:: Python

    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([0.85996207])
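This R2 quality corresponds to the usual coefficient of determination:
given the test outputs :math:`y_i`,
the model predictions :math:`\hat{y}_i`
and the mean test output :math:`\bar{y}`,

.. math::

   R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2},

so that a value close to 1 means that the regressor reproduces
almost all of the output variance of the test dataset.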
.. GENERATED FROM PYTHON SOURCE LINES 69-72

Then, we create a second Gaussian process regressor from the training dataset
with the default input and output transformers,
namely :class:`.MinMaxScaler`:

.. GENERATED FROM PYTHON SOURCE LINES 72-77

.. code-block:: Python

    gpr = GaussianProcessRegressor(
        dataset_train, transformer=GaussianProcessRegressor.DEFAULT_TRANSFORMER
    )
    gpr.learn()

.. GENERATED FROM PYTHON SOURCE LINES 78-79

We can see that the scaling improves the R2 quality (recall: the higher, the better):

.. GENERATED FROM PYTHON SOURCE LINES 79-82

.. code-block:: Python

    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([0.99503456])

.. GENERATED FROM PYTHON SOURCE LINES 83-84

We note that in this case, the input scaling does not contribute to this improvement;
scaling the outputs alone yields practically the same quality:

.. GENERATED FROM PYTHON SOURCE LINES 84-89

.. code-block:: Python

    gpr = GaussianProcessRegressor(dataset_train, transformer={"outputs": "MinMaxScaler"})
    gpr.learn()
    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([0.99503457])

.. GENERATED FROM PYTHON SOURCE LINES 90-91

We can also see that using a :class:`.StandardScaler` is less relevant in this case:

.. GENERATED FROM PYTHON SOURCE LINES 91-96

.. code-block:: Python

    gpr = GaussianProcessRegressor(dataset_train, transformer={"outputs": "StandardScaler"})
    gpr.learn()
    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([0.97049746])
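To make the difference between these two transformations concrete,
here is a small standalone NumPy sketch on a toy sample;
it mimics the two transformations rather than calling the GEMSEO transformer classes themselves:

.. code-block:: Python

    import numpy as np

    # A toy sample with a wide range of values.
    x = np.array([-200.0, -50.0, 0.0, 120.0, 200.0])

    # Min-max scaling maps the sample linearly onto [0, 1].
    min_max_scaled = (x - x.min()) / (x.max() - x.min())

    # Standardization centers the sample on its mean and divides it by its
    # standard deviation: zero mean and unit variance,
    # but no guarantee to stay within [0, 1].
    standardized = (x - x.mean()) / x.std()

Both remove the effect of the original range of the data;
the min-max version additionally keeps the data in a fixed, bounded interval.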
.. GENERATED FROM PYTHON SOURCE LINES 97-100

Finally, we rewrite the Rosenbrock function as
:math:`f(x)=(1-x_1)^2+100(0.01x_2-x_1^2)^2`
and its domain as :math:`[-2,2]\times[-200,200]`:

.. GENERATED FROM PYTHON SOURCE LINES 100-104

.. code-block:: Python

    design_space = DesignSpace()
    design_space.add_variable("x1", l_b=-2, u_b=2)
    design_space.add_variable("x2", l_b=-200, u_b=200)

.. GENERATED FROM PYTHON SOURCE LINES 105-107

in order to have inputs with different orders of magnitude.
We create the learning and test datasets in the same way:

.. GENERATED FROM PYTHON SOURCE LINES 107-116

.. code-block:: Python

    problem = OptimizationProblem(design_space)
    problem.objective = MDOFunction(
        lambda x: (1 - x[0]) ** 2 + 100 * (0.01 * x[1] - x[0] ** 2) ** 2, "f"
    )
    openturns.execute(problem, openturns.OT_LHSO, n_samples=30)
    dataset_train = problem.to_dataset(opt_naming=False)
    openturns.execute(problem, openturns.OT_FULLFACT, n_samples=30 * 30)
    dataset_test = problem.to_dataset(opt_naming=False)

.. GENERATED FROM PYTHON SOURCE LINES 117-118

and build a first Gaussian process regressor with a min-max scaler for the outputs only:

.. GENERATED FROM PYTHON SOURCE LINES 118-123

.. code-block:: Python

    gpr = GaussianProcessRegressor(dataset_train, transformer={"outputs": "MinMaxScaler"})
    gpr.learn()
    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    /home/docs/checkouts/readthedocs.org/user_builds/gemseo/envs/develop/lib/python3.9/site-packages/sklearn/gaussian_process/kernels.py:455: ConvergenceWarning: The optimal value found for dimension 1 of parameter length_scale is close to the specified upper bound 100.0. Increasing the bound and calling fit again may find a better value.
      warnings.warn(

    array([0.78692926])

.. GENERATED FROM PYTHON SOURCE LINES 124-127

The R2 quality is degraded because the very different input ranges
make the estimation of the correlation lengths of the Gaussian process harder,
as the ConvergenceWarning above suggests.
This estimation can be facilitated by also setting a :class:`.MinMaxScaler` for the inputs:

.. GENERATED FROM PYTHON SOURCE LINES 127-133

.. code-block:: Python

    gpr = GaussianProcessRegressor(
        dataset_train, transformer={"inputs": "MinMaxScaler", "outputs": "MinMaxScaler"}
    )
    gpr.learn()
    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([0.98758502])

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 1.165 seconds)


.. _sphx_glr_download_examples_mlearning_transformer_plot_scaling_data.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_scaling_data.ipynb <plot_scaling_data.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_scaling_data.py <plot_scaling_data.py>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_