.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/mlearning/transformer/plot_scaling_data.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_examples_mlearning_transformer_plot_scaling_data.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_mlearning_transformer_plot_scaling_data.py:


Scaling
=======

.. GENERATED FROM PYTHON SOURCE LINES 20-29

.. code-block:: Python

    from gemseo.algos.design_space import DesignSpace
    from gemseo.algos.doe.lib_openturns import OpenTURNS
    from gemseo.algos.opt_problem import OptimizationProblem
    from gemseo.core.mdofunctions.mdo_function import MDOFunction
    from gemseo.mlearning.quality_measures.r2_measure import R2Measure
    from gemseo.mlearning.regression.gpr import GaussianProcessRegressor
    from gemseo.problems.optimization.rosenbrock import Rosenbrock

.. GENERATED FROM PYTHON SOURCE LINES 30-40

Scaling data around zero is important to avoid numerical issues
when fitting a machine learning model.
This is all the more true as the variables have different ranges
or the fitting relies on numerical optimization techniques.
This example illustrates the latter point.
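Most of the scaling used below relies on the :class:`.MinMaxScaler` transformer,
i.e. the usual min-max transformation,
which maps each variable :math:`x` linearly onto the unit interval:

.. math::

   \tilde{x} = \frac{x - \min(x)}{\max(x) - \min(x)} \in [0, 1].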
First, we consider the Rosenbrock function
:math:`f(x)=(1-x_1)^2+100(x_2-x_1^2)^2`
over the domain :math:`[-2,2]^2`:

.. GENERATED FROM PYTHON SOURCE LINES 40-42

.. code-block:: Python

    problem = Rosenbrock()

.. GENERATED FROM PYTHON SOURCE LINES 43-45

In order to approximate this function with a regression model,
we sample it 30 times with an optimized Latin hypercube sampling (LHS) technique

.. GENERATED FROM PYTHON SOURCE LINES 45-48

.. code-block:: Python

    openturns = OpenTURNS()
    openturns.execute(problem, openturns.OT_LHSO, n_samples=30)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Optimization result:
      • Design variables: [0.86803148 0.78478463]
      • Objective function: 0.11542217328145475
      • Feasible solution: True

.. GENERATED FROM PYTHON SOURCE LINES 49-50

and save the samples in an :class:`IODataset`:

.. GENERATED FROM PYTHON SOURCE LINES 50-52

.. code-block:: Python

    dataset_train = problem.to_dataset(opt_naming=False)

.. GENERATED FROM PYTHON SOURCE LINES 53-54

We do the same with a full-factorial design of experiments (DOE) of size 900:

.. GENERATED FROM PYTHON SOURCE LINES 54-57

.. code-block:: Python

    openturns.execute(problem, openturns.OT_FULLFACT, n_samples=30 * 30)
    dataset_test = problem.to_dataset(opt_naming=False)

.. GENERATED FROM PYTHON SOURCE LINES 58-60

Then, we create a first Gaussian process regressor from the training dataset:

.. GENERATED FROM PYTHON SOURCE LINES 60-63

.. code-block:: Python

    gpr = GaussianProcessRegressor(dataset_train)
    gpr.learn()

.. GENERATED FROM PYTHON SOURCE LINES 64-65

and compute its R2 quality from the test dataset:

.. GENERATED FROM PYTHON SOURCE LINES 65-68

.. code-block:: Python

    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([0.85996207])
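This R2 quality corresponds to the usual coefficient of determination:
given the test outputs :math:`y_i`,
the model predictions :math:`\hat{y}_i`
and the mean test output :math:`\bar{y}`,

.. math::

   R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2},

so that a value close to 1 means that the regressor reproduces
almost all of the output variance of the test dataset.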
.. GENERATED FROM PYTHON SOURCE LINES 69-72

Then, we create a second Gaussian process regressor from the training dataset
with the default input and output transformers,
namely :class:`.MinMaxScaler`:

.. GENERATED FROM PYTHON SOURCE LINES 72-77

.. code-block:: Python

    gpr = GaussianProcessRegressor(
        dataset_train, transformer=GaussianProcessRegressor.DEFAULT_TRANSFORMER
    )
    gpr.learn()

.. GENERATED FROM PYTHON SOURCE LINES 78-79

We can see that the scaling improves the R2 quality (recall: the higher, the better):

.. GENERATED FROM PYTHON SOURCE LINES 79-82

.. code-block:: Python

    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([0.99503456])

.. GENERATED FROM PYTHON SOURCE LINES 83-84

We note that in this case, the input scaling does not contribute to this improvement;
scaling the outputs alone yields practically the same quality:

.. GENERATED FROM PYTHON SOURCE LINES 84-89

.. code-block:: Python

    gpr = GaussianProcessRegressor(dataset_train, transformer={"outputs": "MinMaxScaler"})
    gpr.learn()
    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([0.99503457])

.. GENERATED FROM PYTHON SOURCE LINES 90-91

We can also see that using a :class:`.StandardScaler` is less relevant in this case:

.. GENERATED FROM PYTHON SOURCE LINES 91-96

.. code-block:: Python

    gpr = GaussianProcessRegressor(dataset_train, transformer={"outputs": "StandardScaler"})
    gpr.learn()
    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([0.97049746])
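To make the difference between these two transformations concrete,
here is a small standalone NumPy sketch on a toy sample;
it mimics the two transformations rather than calling the GEMSEO transformer classes themselves:

.. code-block:: Python

    import numpy as np

    # A toy sample with a wide range of values.
    x = np.array([-200.0, -50.0, 0.0, 120.0, 200.0])

    # Min-max scaling maps the sample linearly onto [0, 1].
    min_max_scaled = (x - x.min()) / (x.max() - x.min())

    # Standardization centers the sample on its mean and divides it by its
    # standard deviation: zero mean and unit variance,
    # but no guarantee to stay within [0, 1].
    standardized = (x - x.mean()) / x.std()

Both remove the effect of the original range of the data;
the min-max version additionally keeps the data in a fixed, bounded interval.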
.. GENERATED FROM PYTHON SOURCE LINES 97-100

Finally, we rewrite the Rosenbrock function as
:math:`f(x)=(1-x_1)^2+100(0.01x_2-x_1^2)^2`
and its domain as :math:`[-2,2]\times[-200,200]`:

.. GENERATED FROM PYTHON SOURCE LINES 100-104

.. code-block:: Python

    design_space = DesignSpace()
    design_space.add_variable("x1", l_b=-2, u_b=2)
    design_space.add_variable("x2", l_b=-200, u_b=200)

.. GENERATED FROM PYTHON SOURCE LINES 105-107

in order to have inputs with different orders of magnitude.
We create the learning and test datasets in the same way:

.. GENERATED FROM PYTHON SOURCE LINES 107-116

.. code-block:: Python

    problem = OptimizationProblem(design_space)
    problem.objective = MDOFunction(
        lambda x: (1 - x[0]) ** 2 + 100 * (0.01 * x[1] - x[0] ** 2) ** 2, "f"
    )
    openturns.execute(problem, openturns.OT_LHSO, n_samples=30)
    dataset_train = problem.to_dataset(opt_naming=False)
    openturns.execute(problem, openturns.OT_FULLFACT, n_samples=30 * 30)
    dataset_test = problem.to_dataset(opt_naming=False)

.. GENERATED FROM PYTHON SOURCE LINES 117-118

and build a first Gaussian process regressor with a min-max scaler for the outputs only:

.. GENERATED FROM PYTHON SOURCE LINES 118-123

.. code-block:: Python

    gpr = GaussianProcessRegressor(dataset_train, transformer={"outputs": "MinMaxScaler"})
    gpr.learn()
    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    /home/docs/checkouts/readthedocs.org/user_builds/gemseo/envs/develop/lib/python3.9/site-packages/sklearn/gaussian_process/kernels.py:455: ConvergenceWarning: The optimal value found for dimension 1 of parameter length_scale is close to the specified upper bound 100.0. Increasing the bound and calling fit again may find a better value.
      warnings.warn(

    array([0.78692926])

.. GENERATED FROM PYTHON SOURCE LINES 124-127

The R2 quality is degraded because the very different input ranges
make the estimation of the correlation lengths of the Gaussian process harder,
as the ConvergenceWarning above suggests.
This estimation can be facilitated by also setting a :class:`.MinMaxScaler` for the inputs:

.. GENERATED FROM PYTHON SOURCE LINES 127-133

.. code-block:: Python

    gpr = GaussianProcessRegressor(
        dataset_train, transformer={"inputs": "MinMaxScaler", "outputs": "MinMaxScaler"}
    )
    gpr.learn()
    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    array([0.98758502])

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 1.165 seconds)


.. _sphx_glr_download_examples_mlearning_transformer_plot_scaling_data.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_scaling_data.ipynb <plot_scaling_data.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_scaling_data.py <plot_scaling_data.py>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_