.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/mlearning/transformer/plot_scaling_data.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_mlearning_transformer_plot_scaling_data.py:


Scaling
=======

.. GENERATED FROM PYTHON SOURCE LINES 20-31

.. code-block:: Python

    from __future__ import annotations

    from gemseo.algos.design_space import DesignSpace
    from gemseo.algos.doe.openturns.openturns import OpenTURNS
    from gemseo.algos.optimization_problem import OptimizationProblem
    from gemseo.core.mdo_functions.mdo_function import MDOFunction
    from gemseo.mlearning.regression.algos.gpr import GaussianProcessRegressor
    from gemseo.mlearning.regression.quality.r2_measure import R2Measure
    from gemseo.problems.optimization.rosenbrock import Rosenbrock


.. GENERATED FROM PYTHON SOURCE LINES 32-42

Scaling data around zero helps avoid numerical issues
when fitting a machine learning model.
This matters all the more when the variables have different ranges
or when the fitting relies on numerical optimization techniques.
This example illustrates the latter point.

First,
we consider the Rosenbrock function :math:`f(x)=(1-x_1)^2+100(x_2-x_1^2)^2`
over the domain :math:`[-2,2]^2`:

.. GENERATED FROM PYTHON SOURCE LINES 42-44

.. code-block:: Python

    problem = Rosenbrock()


.. GENERATED FROM PYTHON SOURCE LINES 45-47

To approximate this function with a regression model,
we sample it 30 times with an optimized Latin hypercube sampling (LHS) technique:

.. GENERATED FROM PYTHON SOURCE LINES 47-50

.. code-block:: Python

    opt_lhs = OpenTURNS("OT_OPT_LHS")
    opt_lhs.execute(problem, n_samples=30)


.. raw:: html

    Optimization result:
      • Design variables: [0.86803148 0.78478463]
      • Objective function: 0.11542217328145475
      • Feasible solution: True


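
As a rough illustration of the sampling step above,
the following sketch draws a plain Latin hypercube sample
with the Python standard library only
and evaluates the Rosenbrock function on it.
It is not the OpenTURNS algorithm used in this example:
``latin_hypercube`` and ``rosenbrock`` are hypothetical helper names,
the seed is arbitrary,
and no space-filling optimization is performed.

.. code-block:: Python

    import random


    def latin_hypercube(n_samples, bounds, seed=0):
        """Draw a plain (non-optimized) Latin hypercube sample.

        Hypothetical helper: each variable range is split into n_samples strata
        and exactly one point is drawn in each stratum.
        """
        rng = random.Random(seed)
        columns = []
        for lower, upper in bounds:
            # One uniform draw per stratum of the unit interval.
            unit = [(i + rng.random()) / n_samples for i in range(n_samples)]
            # Shuffle so that strata are paired at random across variables.
            rng.shuffle(unit)
            columns.append([lower + (upper - lower) * u for u in unit])
        return list(zip(*columns))


    def rosenbrock(x1, x2):
        """Evaluate f(x) = (1 - x1)^2 + 100 (x2 - x1^2)^2."""
        return (1 - x1) ** 2 + 100 * (x2 - x1**2) ** 2


    # 30 points over [-2, 2]^2, as in the example above.
    samples = latin_hypercube(30, [(-2.0, 2.0), (-2.0, 2.0)])
    values = [rosenbrock(x1, x2) for x1, x2 in samples]

An optimized LHS additionally searches among such designs
for one that fills the domain more evenly.
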
.. GENERATED FROM PYTHON SOURCE LINES 51-52

We then save the samples in an :class:`.IODataset`:

.. GENERATED FROM PYTHON SOURCE LINES 52-54

.. code-block:: Python

    dataset_train = problem.to_dataset(opt_naming=False)


.. GENERATED FROM PYTHON SOURCE LINES 55-56

We do the same with a full-factorial design of experiments (DOE) of size 900:

.. GENERATED FROM PYTHON SOURCE LINES 56-60

.. code-block:: Python

    full_fact = OpenTURNS("OT_FULLFACT")
    full_fact.execute(problem, n_samples=30 * 30)
    dataset_test = problem.to_dataset(opt_naming=False)


.. GENERATED FROM PYTHON SOURCE LINES 61-63

Then,
we create a first Gaussian process regressor from the training dataset:

.. GENERATED FROM PYTHON SOURCE LINES 63-66

.. code-block:: Python

    gpr = GaussianProcessRegressor(dataset_train)
    gpr.learn()


.. GENERATED FROM PYTHON SOURCE LINES 67-68

and compute its R2 quality from the test dataset:

.. GENERATED FROM PYTHON SOURCE LINES 68-71

.. code-block:: Python

    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    array([0.97049746])


.. GENERATED FROM PYTHON SOURCE LINES 72-75

Then,
we create a second Gaussian process regressor from the training dataset
with the default input and output transformers, which are both :class:`.MinMaxScaler`:

.. GENERATED FROM PYTHON SOURCE LINES 75-80

.. code-block:: Python

    gpr = GaussianProcessRegressor(
        dataset_train, transformer=GaussianProcessRegressor.DEFAULT_TRANSFORMER
    )
    gpr.learn()


.. GENERATED FROM PYTHON SOURCE LINES 81-82

We can then measure the effect of this scaling on the R2 quality (recall: the higher, the better):

.. GENERATED FROM PYTHON SOURCE LINES 82-85

.. code-block:: Python

    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    array([0.97049746])


.. GENERATED FROM PYTHON SOURCE LINES 86-87

To isolate the contribution of the input scaling in this case, we keep only the output scaler:

.. GENERATED FROM PYTHON SOURCE LINES 87-92

.. code-block:: Python

    gpr = GaussianProcessRegressor(dataset_train, transformer={"outputs": "MinMaxScaler"})
    gpr.learn()
    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    array([0.97049746])


.. GENERATED FROM PYTHON SOURCE LINES 93-94

We can also compare with a :class:`.StandardScaler` for the outputs:

.. GENERATED FROM PYTHON SOURCE LINES 94-99

.. code-block:: Python

    gpr = GaussianProcessRegressor(dataset_train, transformer={"outputs": "StandardScaler"})
    gpr.learn()
    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    array([0.97049746])


.. GENERATED FROM PYTHON SOURCE LINES 100-103

Finally,
to have inputs with different orders of magnitude,
we rewrite the Rosenbrock function as :math:`f(x)=(1-x_1)^2+100(0.01x_2-x_1^2)^2`
and its domain as :math:`[-2,2]\times[-200,200]`:

.. GENERATED FROM PYTHON SOURCE LINES 103-107

.. code-block:: Python

    design_space = DesignSpace()
    design_space.add_variable("x1", lower_bound=-2, upper_bound=2)
    design_space.add_variable("x2", lower_bound=-200, upper_bound=200)


.. GENERATED FROM PYTHON SOURCE LINES 108-110

We create the training and test datasets in the same way as before:

.. GENERATED FROM PYTHON SOURCE LINES 110-119

.. code-block:: Python

    problem = OptimizationProblem(design_space)
    problem.objective = MDOFunction(
        lambda x: (1 - x[0]) ** 2 + 100 * (0.01 * x[1] - x[0] ** 2) ** 2, "f"
    )
    opt_lhs.execute(problem, n_samples=30)
    dataset_train = problem.to_dataset(opt_naming=False)
    full_fact.execute(problem, n_samples=30 * 30)
    dataset_test = problem.to_dataset(opt_naming=False)


.. GENERATED FROM PYTHON SOURCE LINES 120-121

We then build a first Gaussian process regressor with a min-max scaler for the outputs:

.. GENERATED FROM PYTHON SOURCE LINES 121-126

.. code-block:: Python

    gpr = GaussianProcessRegressor(dataset_train, transformer={"outputs": "MinMaxScaler"})
    gpr.learn()
    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    array([0.90669624])


.. GENERATED FROM PYTHON SOURCE LINES 127-130

The R2 quality is degraded
because the model's correlation lengths are harder to estimate
when the inputs have different orders of magnitude.
Setting a :class:`.MinMaxScaler` for the inputs makes this estimation easier:

.. GENERATED FROM PYTHON SOURCE LINES 130-136

.. code-block:: Python

    gpr = GaussianProcessRegressor(
        dataset_train, transformer={"inputs": "MinMaxScaler", "outputs": "MinMaxScaler"}
    )
    gpr.learn()
    r2 = R2Measure(gpr)
    r2.compute_test_measure(dataset_test)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    array([0.97432803])


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.914 seconds)