.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/mlearning/quality_measure/plot_cross_validation.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_examples_mlearning_quality_measure_plot_cross_validation.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_mlearning_quality_measure_plot_cross_validation.py:


Cross-validation
================

.. GENERATED FROM PYTHON SOURCE LINES 20-31

.. code-block:: Python

    from matplotlib import pyplot as plt
    from numpy import array
    from numpy import linspace
    from numpy import newaxis
    from numpy import sin

    from gemseo.datasets.io_dataset import IODataset
    from gemseo.mlearning.quality_measures.rmse_measure import RMSEMeasure
    from gemseo.mlearning.regression.polyreg import PolynomialRegressor

.. GENERATED FROM PYTHON SOURCE LINES 32-50

Every quality measure can be computed from a learning dataset or a test dataset.
A test dataset is used to approximate the quality of the machine learning model
over the whole variable space,
so as to be less dependent on the learning dataset
and thus avoid rewarding over-fitting
(a model that is accurate near the learning points and poor elsewhere).
When data are expensive,
such a test dataset may be out of reach,
and this quality has to be estimated with techniques
that resample the learning dataset,
such as cross-validation.
The idea is simple:
we divide the learning dataset into :math:`K` folds (typically 5),
repeat :math:`K` times the two-step task
"1) learn from :math:`K-1` folds, 2) predict from the remaining fold"
and finally approximate the measure from the :math:`K` batches of predictions.
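The sketch below illustrates this resampling idea with plain NumPy:
it estimates by 5-fold cross-validation the RMSE of a cubic polynomial
fitted to a toy dataset.
The toy function, the number of samples and the sequential fold splitting
are illustrative assumptions
and do not reproduce the exact mechanics of the implementation used below.

.. code-block:: Python

    from numpy import arange
    from numpy import array_split
    from numpy import concatenate
    from numpy import mean
    from numpy import pi
    from numpy import polyfit
    from numpy import polyval
    from numpy import setdiff1d
    from numpy import sin
    from numpy import sqrt

    x = arange(0.05, 1.0, 0.1)  # 10 toy learning points in [0, 1)
    y = sin(2 * pi * x)  # toy outputs

    indices = arange(len(x))
    squared_errors = []
    for test in array_split(indices, 5):  # 5 folds
        train = setdiff1d(indices, test)  # the other 4 folds
        coefficients = polyfit(x[train], y[train], 3)  # 1) learn a cubic model
        errors = y[test] - polyval(coefficients, x[test])  # 2) predict the held-out fold
        squared_errors.append(errors**2)

    rmse_cv = sqrt(mean(concatenate(squared_errors)))  # measure from the K batches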
To illustrate this point,
let us consider the function :math:`f(x)=(6x-2)^2\sin(12x-4)` :cite:`forrester2008`:

.. GENERATED FROM PYTHON SOURCE LINES 50-56

.. code-block:: Python

    def f(x):
        return (6 * x - 2) ** 2 * sin(12 * x - 4)

.. GENERATED FROM PYTHON SOURCE LINES 57-61

and try to approximate it with a polynomial of order 3.

For this,
we can take these 7 learning input points

.. GENERATED FROM PYTHON SOURCE LINES 61-63

.. code-block:: Python

    x_train = array([0.1, 0.3, 0.5, 0.6, 0.8, 0.9, 0.95])

.. GENERATED FROM PYTHON SOURCE LINES 64-65

and evaluate the function ``f`` over this design of experiments (DOE):

.. GENERATED FROM PYTHON SOURCE LINES 65-67

.. code-block:: Python

    y_train = f(x_train)

.. GENERATED FROM PYTHON SOURCE LINES 68-70

Then,
we create an :class:`.IODataset` from these 7 learning samples:

.. GENERATED FROM PYTHON SOURCE LINES 70-74

.. code-block:: Python

    dataset_train = IODataset()
    dataset_train.add_input_group(x_train[:, newaxis], ["x"])
    dataset_train.add_output_group(y_train[:, newaxis], ["y"])

.. GENERATED FROM PYTHON SOURCE LINES 75-76

and build a :class:`.PolynomialRegressor` with ``degree=3`` from it:

.. GENERATED FROM PYTHON SOURCE LINES 76-79

.. code-block:: Python

    polynomial = PolynomialRegressor(dataset_train, 3)
    polynomial.learn()

.. GENERATED FROM PYTHON SOURCE LINES 80-82

Now,
we compute the quality of this model with the RMSE metric:

.. GENERATED FROM PYTHON SOURCE LINES 82-85

.. code-block:: Python

    rmse = RMSEMeasure(polynomial)
    rmse.compute_learning_measure()

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([2.37578236])

.. GENERATED FROM PYTHON SOURCE LINES 86-89

As this academic function is costless to evaluate,
we can approximate the generalization quality with a large test dataset,
whereas in practice the test size is usually about 20% of the training size.

.. GENERATED FROM PYTHON SOURCE LINES 89-96

.. code-block:: Python

    x_test = linspace(0.0, 1.0, 100)
    y_test = f(x_test)
    dataset_test = IODataset()
    dataset_test.add_input_group(x_test[:, newaxis], ["x"])
    dataset_test.add_output_group(y_test[:, newaxis], ["y"])
    rmse.compute_test_measure(dataset_test)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([3.31730517])

.. GENERATED FROM PYTHON SOURCE LINES 97-99

We can do the same by cross-validation with :math:`K=5` folds
(this number can be changed with the ``n_folds`` argument):

.. GENERATED FROM PYTHON SOURCE LINES 99-101

.. code-block:: Python

    rmse.compute_cross_validation_measure()

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([14.87187721])

.. GENERATED FROM PYTHON SOURCE LINES 102-105

We note that the cross-validation error is pessimistic here.
As the cross-validation method is based on a random splitting of the samples,
we can try again:

.. GENERATED FROM PYTHON SOURCE LINES 105-107

.. code-block:: Python

    rmse.compute_cross_validation_measure()

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([18.9676278])

.. GENERATED FROM PYTHON SOURCE LINES 108-110

The result is even more pessimistic.
We can take a closer look by storing the sub-models:

.. GENERATED FROM PYTHON SOURCE LINES 110-112

.. code-block:: Python

    rmse.compute_cross_validation_measure(store_resampling_result=True)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([19.25060777])

.. GENERATED FROM PYTHON SOURCE LINES 113-114

and plotting their outputs:

.. GENERATED FROM PYTHON SOURCE LINES 114-123

.. code-block:: Python

    plot = plt.plot(x_test, y_test, label="Reference")
    plt.plot(x_train, y_train, "o", color=plot[0].get_color(), label="Training dataset")
    plt.plot(x_test, polynomial.predict(x_test[:, newaxis]), label="Model")
    for i, algo in enumerate(polynomial.resampling_results["CrossValidation"][1], 1):
        plt.plot(x_test, algo.predict(x_test[:, newaxis]), label=f"Sub-model {i}")
    plt.legend()
    plt.grid()
    plt.show()

.. image-sg:: /examples/mlearning/quality_measure/images/sphx_glr_plot_cross_validation_001.png
   :alt: plot cross validation
   :srcset: /examples/mlearning/quality_measure/images/sphx_glr_plot_cross_validation_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 124-132

We can see that this pessimistic error is mainly due to the fifth sub-model,
which did not learn the first training point
and therefore has a very high extrapolation error.

Finally,
note that we can make the result deterministic by using a custom seed

.. GENERATED FROM PYTHON SOURCE LINES 132-135

.. code-block:: Python

    result = rmse.compute_cross_validation_measure(seed=1)
    assert rmse.compute_cross_validation_measure(seed=1) == result

.. GENERATED FROM PYTHON SOURCE LINES 136-139

or by splitting the samples into :math:`K` folds without randomizing them
(i.e. the first samples in the first fold, the next ones in the second fold, and so on):

.. GENERATED FROM PYTHON SOURCE LINES 139-141

.. code-block:: Python

    result = rmse.compute_cross_validation_measure(randomize=False)
    assert rmse.compute_cross_validation_measure(randomize=False) == result
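For intuition,
a sequential split of the 7 sample indices into 5 folds
could look like the sketch below
(an illustrative assumption; the exact internal splitting may differ):

.. code-block:: Python

    from numpy import arange
    from numpy import array_split

    # Split the 7 sample indices into 5 sequential folds:
    # the first samples go to the first fold, the next ones to the second, and so on.
    for k, fold in enumerate(array_split(arange(7), 5), 1):
        print(f"Fold {k}: samples {fold.tolist()}")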
.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.326 seconds)


.. _sphx_glr_download_examples_mlearning_quality_measure_plot_cross_validation.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_cross_validation.ipynb <plot_cross_validation.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_cross_validation.py <plot_cross_validation.py>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_