Fitting a distribution from data based on OpenTURNS

from __future__ import annotations

from numpy.random import default_rng

from gemseo import configure_logger
from gemseo.uncertainty.distributions.openturns.fitting import OTDistributionFitter

configure_logger()
<RootLogger root (INFO)>

In this example, we will see how to fit a distribution from data. For purely pedagogical reasons, we consider a synthetic dataset made of 100 realizations of 'X', a random variable distributed according to the standard normal distribution. These samples are generated with NumPy.

rng = default_rng(1)
data = rng.normal(size=100)
variable_name = "X"

Create a distribution fitter

Then, we create an OTDistributionFitter from these data and this variable name:

fitter = OTDistributionFitter(variable_name, data)

Fit a distribution

From this distribution fitter, we can easily fit any distribution available in the OpenTURNS library:

fitter.available_distributions
['Arcsine', 'Beta', 'Burr', 'Chi', 'ChiSquare', 'Dirichlet', 'Exponential', 'FisherSnedecor', 'Frechet', 'Gamma', 'GeneralizedPareto', 'Gumbel', 'Histogram', 'InverseNormal', 'Laplace', 'LogNormal', 'LogUniform', 'Logistic', 'MeixnerDistribution', 'Normal', 'Pareto', 'Rayleigh', 'Rice', 'Student', 'Trapezoidal', 'Triangular', 'TruncatedNormal', 'Uniform', 'VonMises', 'WeibullMax', 'WeibullMin']

For example, we can fit a normal distribution:

norm_dist = fitter.fit("Normal")
norm_dist
Normal([-0.0736121,0.855847])
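
To make the meaning of the two fitted parameters more concrete, the sketch below estimates a mean and a standard deviation from a synthetic standard normal sample using only the Python standard library. It is an illustration under assumptions, not the GEMSEO or OpenTURNS implementation: the estimators (sample mean and unbiased standard deviation) are assumed, and the sample comes from `random.Random(1)` rather than NumPy's `default_rng(1)`, so the numbers differ from those printed above.

```python
import random
import statistics

# Synthetic stand-in for the dataset: 100 draws from a standard normal.
# The generator differs from NumPy's, so values will not match the example.
rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(100)]

# Assumed estimators: sample mean and unbiased (n - 1) standard deviation.
mu_hat = statistics.fmean(sample)
sigma_hat = statistics.stdev(sample)

print(f"Normal([{mu_hat:.6g},{sigma_hat:.6g}])")
```

With only 100 realizations, the estimates are close to, but not exactly, the true values 0 and 1, as is also the case for the fitted distribution above.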

or an exponential one:

exp_dist = fitter.fit("Exponential")
exp_dist
Exponential([0.375357,-2.73774])
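
The second parameter of the fitted exponential is negative, which suggests a shifted exponential distribution described by a rate and a location. Assuming that parameterization, and assuming the rate is estimated as the inverse of the distance between the sample mean and the location, the printed numbers can be checked by hand. The sample mean is taken to be the first parameter of the fitted normal, which is itself an assumption.

```python
# Values copied from the printed fits above.
sample_mean = -0.0736121  # first parameter of the fitted normal
rate = 0.375357           # first parameter of the fitted exponential
location = -2.73774       # second parameter of the fitted exponential

# Assumed relation for a shifted exponential: mean = location + 1 / rate.
implied_rate = 1.0 / (sample_mean - location)

print(f"implied rate: {implied_rate:.6f}, printed rate: {rate:.6f}")
```

The two rates agree to the printed precision, which is consistent with the assumed relation.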

The returned object is an OTDistribution that we can represent graphically in terms of its probability density function and cumulative distribution function:

norm_dist.plot()
Normal([-0.0736121,0.855847])
<Figure size 640x320 with 2 Axes>
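
As a numerical companion to the two plotted curves, the sketch below evaluates the density and the cumulative distribution function of a normal distribution with the fitted parameters. It uses `statistics.NormalDist` from the Python standard library as a stand-in and does not call the OTDistribution API.

```python
from statistics import NormalDist

# Parameters copied from the fitted normal printed above (mean, standard deviation).
fitted = NormalDist(mu=-0.0736121, sigma=0.855847)

# Density at the mean: the peak of the bell curve, 1 / (sigma * sqrt(2 * pi)).
peak_density = fitted.pdf(fitted.mean)

# Cumulative probability at the mean: half of the mass lies below it.
prob_below_mean = fitted.cdf(fitted.mean)

# Probability of falling within one standard deviation of the mean.
prob_within_one_sigma = fitted.cdf(fitted.mean + fitted.stdev) - fitted.cdf(
    fitted.mean - fitted.stdev
)

print(peak_density, prob_below_mean, prob_within_one_sigma)
```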

Measure the goodness-of-fit

We can also measure the goodness-of-fit of a distribution by means of a fitting criterion. Some fitting criteria are based on significance tests, each defined by a test statistic, a p-value and a significance level. We can access the names of all the available fitting criteria:

fitter.available_criteria
['BIC', 'ChiSquared', 'Kolmogorov']

or only the significance tests:

fitter.available_significance_tests
[<SignificanceTest.ChiSquared: 'ChiSquared'>, <SignificanceTest.Kolmogorov: 'Kolmogorov'>]

For example, we can measure the goodness-of-fit of the previous distributions by considering the Bayesian information criterion (BIC):

quality_measure = fitter.compute_measure(norm_dist, "BIC")
"Normal", quality_measure

quality_measure = fitter.compute_measure(exp_dist, "BIC")
"Exponential", quality_measure
('Exponential', 3.9597553873428653)
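
To make the BIC value more concrete, the sketch below recomputes it by hand for the fitted exponential. It assumes a shifted exponential with the printed rate and location, a sample of size 100 whose mean equals the first parameter of the fitted normal, and a BIC normalized by the sample size, (-2 log L + k log n) / n, where k is the number of parameters counted in the penalty. The helper `normalized_bic` is hypothetical and is not part of GEMSEO or OpenTURNS.

```python
import math


def normalized_bic(log_likelihood: float, n: int, k: int) -> float:
    """Return (-2 log L + k log n) / n (hypothetical helper, assumed convention)."""
    return (-2.0 * log_likelihood + k * math.log(n)) / n


n = 100
sample_mean = -0.0736121             # assumed: first parameter of the fitted normal
rate, location = 0.375357, -2.73774  # printed parameters of the fitted exponential

# Log-likelihood of a shifted exponential: sum of log(rate) - rate * (x_i - location),
# which depends on the data only through the sample mean.
log_likelihood = n * (math.log(rate) - rate * (sample_mean - location))

bic_without_penalty = normalized_bic(log_likelihood, n, k=0)
bic_with_penalty = normalized_bic(log_likelihood, n, k=2)

print(bic_without_penalty, bic_with_penalty)
```

Under these assumptions, the value computed with k = 0 agrees with the printed 3.9597... to about five decimal places, whereas k = 2 gives a slightly larger value. This is an observation drawn from the printed numbers, not a statement about how the library is documented to count parameters.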

Here, the fitted normal distribution is better than the fitted exponential one in terms of BIC (the lower the BIC, the better the fit). We can also use the Kolmogorov fitting criterion, which is based on the Kolmogorov significance test:

acceptable, details = fitter.compute_measure(norm_dist, "Kolmogorov")
"Normal", acceptable, details
acceptable, details = fitter.compute_measure(exp_dist, "Kolmogorov")
"Exponential", acceptable, details
('Exponential', False, {'p-value': 4.864624399187062e-11, 'statistics': 0.3434922163146683, 'level': 0.05})

In this case, the OTDistributionFitter.compute_measure() method returns a tuple with two values:

  1. a boolean indicating if the measured distribution is acceptable to model the data,

  2. a dictionary containing the test statistic, the p-value and the significance level.

Note

We can also change the significance level of the significance tests, whose default value is 0.05, by using the level argument.
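
The returned tuple and the role of the level can be sketched in pure Python. The functions below are hypothetical and only mimic the shape of the result described above. They assume an asymptotic approximation of the Kolmogorov p-value and the decision rule "acceptable if the p-value exceeds the level", so their p-values will generally differ from those computed by OpenTURNS.

```python
import math
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of the sample and a candidate CDF."""
    ordered = sorted(sample)
    n = len(ordered)
    gaps = []
    for i, x in enumerate(ordered):
        f = cdf(x)
        # The empirical CDF jumps from i / n to (i + 1) / n at x.
        gaps.append(max((i + 1) / n - f, f - i / n))
    return max(gaps)


def asymptotic_p_value(statistic, n):
    """Asymptotic Kolmogorov tail probability (an approximation, assumed here)."""
    lam = math.sqrt(n) * statistic
    if lam < 0.3:
        return 1.0  # the tail probability equals 1 to within about 1e-5 here
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2) for k in range(1, 101)
    )
    return min(1.0, max(0.0, 2.0 * series))


def kolmogorov_measure(sample, cdf, level=0.05):
    """Mimic the (acceptable, details) return shape described above."""
    statistic = ks_statistic(sample, cdf)
    p_value = asymptotic_p_value(statistic, len(sample))
    details = {"p-value": p_value, "statistics": statistic, "level": level}
    # Assumed decision rule: accept when the p-value exceeds the level.
    return p_value > level, details


# Deterministic stand-in data: 100 evenly spaced quantiles of a standard normal.
standard_normal = NormalDist()
sample = [standard_normal.inv_cdf((i + 0.5) / 100) for i in range(100)]

good_fit = kolmogorov_measure(sample, standard_normal.cdf)
bad_fit = kolmogorov_measure(sample, NormalDist(mu=3.0).cdf, level=0.01)

print(good_fit[0], bad_fit[0])
```

The candidate matching the data is accepted, while the shifted candidate is rejected even at the stricter level of 0.01.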

Select an optimal distribution

Lastly, we can also select an optimal OTDistribution based on a collection of distribution names, a fitting criterion, a significance level and a selection criterion:

  • 'best': select the distribution minimizing (or maximizing, depending on the criterion) the criterion,

  • 'first': select the first distribution for which the criterion is greater (or lower, depending on the criterion) than the level.

By default, the OTDistributionFitter.select() method uses a significance level equal to 0.05 and the 'best' selection criterion.

selected_distribution = fitter.select(["Exponential", "Normal"], "Kolmogorov")
selected_distribution
Normal([-0.0736121,0.855847])
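
The difference between the 'best' and 'first' selection criteria can be sketched as follows. The `select` function below is a hypothetical helper, not the GEMSEO method. It assumes a significance-test criterion for which a larger p-value is better, and the p-values are made up for illustration.

```python
def select(names, p_values, level=0.05, selection="best"):
    """Pick a distribution name from p-values (hypothetical helper).

    Assumes a significance-test criterion, where a larger p-value is better.
    """
    if selection == "best":
        return max(names, key=lambda name: p_values[name])
    if selection == "first":
        # Return the first candidate, in the given order, that passes the test.
        for name in names:
            if p_values[name] > level:
                return name
        return None  # choice made for this sketch; the library may behave differently
    raise ValueError(f"Unknown selection criterion: {selection!r}")


# Hypothetical p-values, chosen for illustration only.
p_values = {"Exponential": 1e-10, "Uniform": 0.20, "Normal": 0.90}
names = ["Exponential", "Uniform", "Normal"]

best = select(names, p_values)
first = select(names, p_values, selection="first")

print(best, first)
```

With these hypothetical values, 'best' returns "Normal" because it has the largest p-value, whereas 'first' returns "Uniform" because it is the first candidate in the list whose p-value exceeds the level. The order of the names therefore matters only for 'first'.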
