Fitting a distribution from data based on OpenTURNS#
from __future__ import annotations
from numpy.random import default_rng
from gemseo import configure_logger
from gemseo.uncertainty.distributions.openturns.fitting import OTDistributionFitter
configure_logger()
<RootLogger root (INFO)>
In this example, we will see how to fit a distribution from data. For purely pedagogical reasons, we consider a synthetic dataset made of 100 realizations of 'X', a random variable distributed according to the standard normal distribution. These samples are generated with the NumPy library.
rng = default_rng(1)
data = rng.normal(size=100)
variable_name = "X"
Create a distribution fitter#
Then, we create an OTDistributionFitter from these data and this variable name:
fitter = OTDistributionFitter(variable_name, data)
Fit a distribution#
From this distribution fitter, we can easily fit any distribution available in the OpenTURNS library:
fitter.available_distributions
['Arcsine', 'Beta', 'Burr', 'Chi', 'ChiSquare', 'Dirichlet', 'Exponential', 'FisherSnedecor', 'Frechet', 'Gamma', 'GeneralizedPareto', 'Gumbel', 'Histogram', 'InverseNormal', 'Laplace', 'LogNormal', 'LogUniform', 'Logistic', 'MeixnerDistribution', 'Normal', 'Pareto', 'Rayleigh', 'Rice', 'Student', 'Trapezoidal', 'Triangular', 'TruncatedNormal', 'Uniform', 'VonMises', 'WeibullMax', 'WeibullMin']
For example, we can fit a normal distribution:
norm_dist = fitter.fit("Normal")
norm_dist
Normal([-0.0736121,0.855847])
or an exponential one:
exp_dist = fitter.fit("Exponential")
exp_dist
Exponential([0.375357,-2.73774])
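To make the fitted parameters of the normal case less of a black box, the sketch below estimates a mean and a standard deviation directly from a sample with the Python standard library. It assumes that these two sample statistics are what a fitter would report for 'Normal'; the helper fit_normal and the synthetic sample are hypothetical and belong neither to GEMSEO nor to OpenTURNS, so the numbers will differ from the ones printed above.

```python
import random
import statistics


def fit_normal(sample):
    """Return (mean, standard deviation) estimated from the sample.

    Assumption: the unbiased (n - 1) standard deviation is used;
    a given library may rely on a different estimator.
    """
    return statistics.fmean(sample), statistics.stdev(sample)


# Synthetic standard normal sample (stdlib generator, not NumPy's).
rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(100)]

mu, sigma = fit_normal(sample)
print(f"Normal([{mu:.6g},{sigma:.6g}])")
```

With 100 realizations, the estimates are expected to be close to, but not exactly, 0 and 1, which is also what the fitted Normal above illustrates.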
The returned object is an OTDistribution that we can represent graphically in terms of its probability density function and cumulative distribution function:
norm_dist.plot()
<Figure size 640x320 with 2 Axes>
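For readers who want numbers rather than a figure, the two plotted curves can be tabulated. This is a minimal sketch that rebuilds the fitted normal distribution with statistics.NormalDist from the standard library, using the parameters printed earlier (Normal([-0.0736121,0.855847])); it does not use the OTDistribution object, and the evaluation points are arbitrary.

```python
from statistics import NormalDist

# Parameters copied from the fitted normal distribution of this example.
fitted = NormalDist(mu=-0.0736121, sigma=0.855847)

# Tabulate the probability density and cumulative distribution functions.
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  pdf={fitted.pdf(x):.4f}  cdf={fitted.cdf(x):.4f}")
```

The density peaks at the fitted mean, where the cumulative distribution function equals 0.5.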
Measure the goodness-of-fit#
We can also measure the goodness-of-fit of a distribution by means of a fitting criterion. Some fitting criteria are based on significance tests, each defined by a test statistic, a p-value and a significance level. We can access the names of all the available fitting criteria:
fitter.available_criteria
['BIC', 'ChiSquared', 'Kolmogorov']
or only the significance tests:
fitter.available_significance_tests
[<SignificanceTest.ChiSquared: 'ChiSquared'>, <SignificanceTest.Kolmogorov: 'Kolmogorov'>]
For example, we can measure the goodness-of-fit of the previous distributions by considering the Bayesian information criterion (BIC):
quality_measure = fitter.compute_measure(norm_dist, "BIC")
"Normal", quality_measure
quality_measure = fitter.compute_measure(exp_dist, "BIC")
"Exponential", quality_measure
('Exponential', 3.9597553873428653)
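As a reminder of what this criterion measures, the sketch below implements the textbook definition, BIC = -2 ln L + k ln n, where L is the likelihood, k the number of parameters and n the sample size. It is an illustration under assumptions: the helper names, the sample and the parameter values are hypothetical, and a library may scale the criterion differently (for instance per observation) or count the parameters differently, so absolute values need not match the ones returned by compute_measure.

```python
import math
from statistics import NormalDist


def bic(sample, log_pdf, n_parameters):
    """Textbook BIC: -2 * log-likelihood + k * log(n). Lower is better."""
    log_likelihood = sum(log_pdf(x) for x in sample)
    return -2.0 * log_likelihood + n_parameters * math.log(len(sample))


def normal_log_pdf(mu, sigma):
    dist = NormalDist(mu, sigma)
    return lambda x: math.log(dist.pdf(x))


def shifted_exponential_log_pdf(rate, location):
    # Density: rate * exp(-rate * (x - location)) for x >= location.
    def log_pdf(x):
        if x < location:
            return -math.inf
        return math.log(rate) - rate * (x - location)

    return log_pdf


# Hypothetical symmetric sample centred on 0 (not the data of this example).
sample = [NormalDist().inv_cdf((i + 0.5) / 50) for i in range(50)]

bic_normal = bic(sample, normal_log_pdf(0.0, 1.0), 2)
bic_exponential = bic(sample, shifted_exponential_log_pdf(0.4, -3.0), 2)
print(bic_normal, bic_exponential)
```

Whatever the scaling, the comparison is read the same way: the candidate with the lower BIC offers the better trade-off between likelihood and number of parameters.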
Here, the fitted normal distribution is better than the fitted exponential one in terms of BIC. We can also use the Kolmogorov fitting criterion, which is based on the Kolmogorov significance test:
acceptable, details = fitter.compute_measure(norm_dist, "Kolmogorov")
"Normal", acceptable, details
acceptable, details = fitter.compute_measure(exp_dist, "Kolmogorov")
"Exponential", acceptable, details
('Exponential', False, {'p-value': 4.864624399187062e-11, 'statistics': 0.3434922163146683, 'level': 0.05})
In this case, the OTDistributionFitter.compute_measure() method returns a tuple with two values:
a boolean indicating whether the measured distribution is acceptable to model the data,
a dictionary containing the test statistic, the p-value and the significance level.
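The structure of this returned tuple can be illustrated with a self-contained sketch of a one-sample Kolmogorov test. It is not the OpenTURNS implementation: the function kolmogorov_test is hypothetical, the p-value relies on the asymptotic Kolmogorov series (so it may differ from a library's value, especially for small samples), and the dictionary keys simply mirror the ones shown in the output above.

```python
import math
from statistics import NormalDist


def kolmogorov_test(sample, cdf, level=0.05):
    """One-sample Kolmogorov test with an asymptotic p-value (a sketch)."""
    data = sorted(sample)
    n = len(data)
    # Largest gap between the empirical CDF and the candidate CDF.
    statistic = max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(data)
    )
    lam = math.sqrt(n) * statistic
    if lam < 0.2:
        # The series below equals 1 up to about 1e-12 in this range.
        p_value = 1.0
    else:
        series = sum(
            (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
            for k in range(1, 101)
        )
        p_value = min(1.0, max(0.0, 2.0 * series))
    details = {"p-value": p_value, "statistics": statistic, "level": level}
    # The candidate is acceptable when the test does not reject it.
    return p_value > level, details


# Hypothetical sample: 100 evenly spaced standard normal quantiles.
sample = [NormalDist().inv_cdf((i + 0.5) / 100) for i in range(100)]


def exponential_cdf(x, rate=0.4, location=-3.0):
    return 1.0 - math.exp(-rate * (x - location)) if x > location else 0.0


normal_ok, normal_details = kolmogorov_test(sample, NormalDist().cdf)
exp_ok, exp_details = kolmogorov_test(sample, exponential_cdf)
print("Normal", normal_ok, normal_details)
print("Exponential", exp_ok, exp_details)
```

The boolean is simply the comparison of the p-value with the significance level, which is why changing the level can change the verdict without changing the statistic.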
Note
We can also change the significance level of a significance test, whose default value is 0.05, by using the level argument.
Select an optimal distribution#
Lastly, we can also select an optimal OTDistribution based on a collection of distribution names, a fitting criterion, a significance level and a selection criterion:
'best': select the distribution minimizing (or maximizing, depending on the criterion) the criterion,
'first': select the first distribution for which the criterion is greater (or lower, depending on the criterion) than the level.
By default, the OTDistributionFitter.select() method uses a significance level equal to 0.05 and the 'best' selection criterion.
selected_distribution = fitter.select(["Exponential", "Normal"], "Kolmogorov")
selected_distribution
Normal([-0.0736121,0.855847])
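The difference between the two selection criteria can be sketched in a few lines of plain Python. The function below is hypothetical and only mimics the described behaviour for a significance test, assuming that a higher p-value is better; the p-values are made up for illustration, and raising an error when no candidate passes is a choice of this sketch, not a documented behaviour of OTDistributionFitter.select().

```python
def select(p_values, level=0.05, selection_criterion="best"):
    """Select a distribution name from the p-values of a significance test.

    p_values maps distribution names to p-values, in the order tried.
    """
    if selection_criterion == "best":
        # Keep the candidate with the highest p-value.
        return max(p_values, key=p_values.get)
    if selection_criterion == "first":
        # Keep the first candidate that is not rejected at this level.
        for name, p_value in p_values.items():
            if p_value > level:
                return name
        raise ValueError("No distribution passes the test at this level.")
    raise ValueError(f"Unknown selection criterion: {selection_criterion!r}")


# Hypothetical p-values, for illustration only.
p_values = {"Exponential": 1e-10, "Uniform": 0.20, "Normal": 0.90}
print(select(p_values))  # Normal
print(select(p_values, selection_criterion="first"))  # Uniform
```

'first' can be cheaper when many candidates are tried, since it stops at the first acceptable one, whereas 'best' always compares all of them.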