# R2 for regression models

from matplotlib import pyplot as plt
from numpy import array
from numpy import linspace
from numpy import newaxis
from numpy import sin

from gemseo.datasets.io_dataset import IODataset
from gemseo.mlearning.quality_measures.r2_measure import R2Measure
from gemseo.mlearning.regression.polyreg import PolynomialRegressor
from gemseo.mlearning.regression.rbf import RBFRegressor


Given a dataset $$(x_i,y_i,\hat{y}_i)_{1\leq i \leq N}$$ where $$x_i$$ is an input point, $$y_i$$ is an output observation and $$\hat{y}_i=\hat{f}(x_i)$$ is an output prediction computed by a regression model $$\hat{f}$$, the $$R^2$$ metric (also known as $$Q^2$$) is written

$R^2 = 1 - \frac{\sum_{i=1}^N(y_i-\hat{y}_i)^2}{\sum_{i=1}^N(y_i-\bar{y})^2} \leq 1$

where $$\bar{y}=\frac{1}{N}\sum_{i=1}^Ny_i$$. The higher the $$R^2$$, the better: values above 0.9 are usually considered good to very good. A negative value is very bad, since a constant model predicting $$\bar{y}$$ would do better.
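The formula can be checked by hand with a minimal NumPy sketch. The helper `r2_score` and the observations below are made up for illustration and are not part of GEMSEO:

```python
import numpy as np


def r2_score(y, y_hat):
    """Return 1 - SS_res / SS_tot for 1-D observations and predictions."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)  # squared prediction errors
    ss_tot = np.sum((y - y.mean()) ** 2)  # squared deviations from the mean
    return 1.0 - ss_res / ss_tot


y = np.array([1.0, 2.0, 3.0, 4.0])  # made-up observations

r2_perfect = r2_score(y, y)  # exact predictions -> 1.0
r2_constant = r2_score(y, np.full_like(y, y.mean()))  # constant model -> 0.0
r2_bad = r2_score(y, y[::-1])  # reversed predictions -> -3.0
```

The constant model predicting the mean scores exactly 0, which is why any negative value means the regression model does worse than that baseline.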

To illustrate this quality measure, let us consider the function $$f(x)=(6x-2)^2\sin(12x-4)$$ [FSK08]:

def f(x):
    return (6 * x - 2) ** 2 * sin(12 * x - 4)


and try to approximate it with a polynomial of order 3.

For this, we can take these 7 learning input points

x_train = array([0.1, 0.3, 0.5, 0.6, 0.8, 0.9, 0.95])


and evaluate the model f over this design of experiments (DOE):

y_train = f(x_train)


Then, we create an IODataset from these 7 learning samples:

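dataset_train = IODataset()
dataset_train.add_input_group(x_train[:, newaxis], ["x"])
dataset_train.add_output_group(y_train[:, newaxis], ["y"])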


and build a PolynomialRegressor with degree=3 from it:

polynomial = PolynomialRegressor(dataset_train, 3)
polynomial.learn()


Before using it, we are going to measure its quality with the $$R^2$$ metric:

r2 = R2Measure(polynomial)
r2.compute_learning_measure()

array([0.78649338])


This result is mediocre, and we can expect poor generalization quality. As evaluating this academic function costs nothing, we can estimate the generalization quality with a large test dataset, whereas the usual test size is about 20% of the training size.

x_test = linspace(0.0, 1.0, 100)
y_test = f(x_test)
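dataset_test = IODataset()
dataset_test.add_input_group(x_test[:, newaxis], ["x"])
dataset_test.add_output_group(y_test[:, newaxis], ["y"])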
r2.compute_test_measure(dataset_test)

array([0.47280012])


The quality is lower than 0.5, which is pretty mediocre. This can be explained by a test domain that is broader than the learning domain, which highlights the difficulties of extrapolation:

plt.plot(x_test, y_test, "-b", label="Reference")
plt.plot(x_train, y_train, "ob")
plt.plot(x_test, polynomial.predict(x_test[:, newaxis]), "-r", label="Prediction")
plt.plot(x_train, polynomial.predict(x_train[:, newaxis]), "or")
plt.legend()
plt.grid()
plt.show()


Restricting the test points to the learning domain slightly improves the quality:

x_test = linspace(x_train.min(), x_train.max(), 100)
y_test = f(x_test)
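dataset_test_in_learning_domain = IODataset()
dataset_test_in_learning_domain.add_input_group(x_test[:, newaxis], ["x"])
dataset_test_in_learning_domain.add_output_group(y_test[:, newaxis], ["y"])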
r2.compute_test_measure(dataset_test_in_learning_domain)

array([0.50185268])
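As a library-independent cross-check, the same experiment can be sketched with NumPy alone. This assumes that the degree-3 regressor behaves like an ordinary least-squares cubic fit, which is an assumption of this sketch rather than something stated here; the names `r2_score`, `cubic`, `x_full` and `x_inner` are illustrative.

```python
import numpy as np


def f(x):
    return (6 * x - 2) ** 2 * np.sin(12 * x - 4)


def r2_score(y, y_hat):
    # 1 - SS_res / SS_tot, as in the definition of the metric.
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)


x_train = np.array([0.1, 0.3, 0.5, 0.6, 0.8, 0.9, 0.95])
y_train = f(x_train)

# Ordinary least-squares cubic; assumed comparable to the degree-3 regressor.
cubic = np.polynomial.Polynomial.fit(x_train, y_train, deg=3)

x_full = np.linspace(0.0, 1.0, 100)  # includes extrapolation at both ends
x_inner = np.linspace(x_train.min(), x_train.max(), 100)  # learning domain only

r2_learning = r2_score(y_train, cubic(x_train))
r2_full = r2_score(f(x_full), cubic(x_full))
r2_inner = r2_score(f(x_inner), cubic(x_inner))
print(r2_learning, r2_full, r2_inner)
```

If the assumption holds, the three printed values should be close to the learning and test measures reported above; the sketch itself does not guarantee it.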


Lastly, to get better results without new learning points, we would have to change the regression model:

rbf = RBFRegressor(dataset_train)
rbf.learn()


The quality of this RBFRegressor is quite good, both on the learning side:

r2_rbf = R2Measure(rbf)
r2_rbf.compute_learning_measure()

array([1.])


and on the validation side:

r2_rbf.compute_test_measure(dataset_test_in_learning_domain)

array([0.99807284])


including the larger domain:

r2_rbf.compute_test_measure(dataset_test)

array([0.98593573])
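A learning measure of exactly 1 is what one would expect from a model that interpolates the learning points. The following minimal NumPy sketch illustrates this with a multiquadric radial basis function interpolant. It is not the GEMSEO implementation: the kernel, the shape parameter `eps` and the helper names are assumptions chosen for illustration.

```python
import numpy as np


def f(x):
    return (6 * x - 2) ** 2 * np.sin(12 * x - 4)


def multiquadric(r, eps):
    """Multiquadric kernel; eps is a hypothetical shape parameter."""
    return np.sqrt(1.0 + (r / eps) ** 2)


x_train = np.array([0.1, 0.3, 0.5, 0.6, 0.8, 0.9, 0.95])
y_train = f(x_train)
eps = 0.2  # arbitrary choice for this sketch

# Solve Phi w = y so that the model passes through every learning point.
phi = multiquadric(np.abs(x_train[:, None] - x_train[None, :]), eps)
weights = np.linalg.solve(phi, y_train)


def predict(x):
    # Weighted sum of kernels centred on the learning points.
    return multiquadric(np.abs(x[:, None] - x_train[None, :]), eps) @ weights


residual = y_train - predict(x_train)
r2_learning = 1.0 - np.sum(residual**2) / np.sum((y_train - y_train.mean()) ** 2)
```

Because the residuals at the learning points vanish up to round-off, `r2_learning` equals 1 to numerical precision. This also shows why the learning measure alone says nothing about generalization for such a model, and why the test measures matter.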


A final plot to convince us:

plt.plot(x_test, y_test, "-b", label="Reference")
plt.plot(x_train, y_train, "ob")
plt.plot(x_test, rbf.predict(x_test[:, newaxis]), "-r", label="Prediction")
plt.plot(x_train, rbf.predict(x_train[:, newaxis]), "or")
plt.legend()
plt.grid()
plt.show()

