Parametric estimation of statistics#

In this example, we want to estimate statistics from synthetic data. These data are 500 realizations of x_0, x_1, x_2 and x_3 distributed in the following way:

  • x_0: standard uniform distribution,

  • x_1: standard normal distribution,

  • x_2: standard Weibull distribution,

  • x_3: standard exponential distribution.

These samples are generated from the NumPy library.

from __future__ import annotations

from numpy import vstack
from numpy.random import default_rng

from gemseo import configure_logger
from gemseo import create_dataset
from gemseo.uncertainty import create_statistics

configure_logger()
<RootLogger root (INFO)>

Create synthetic data#

rng = default_rng(0)

n_samples = 500

uniform_rand = rng.uniform(size=n_samples)
normal_rand = rng.normal(size=n_samples)
weibull_rand = rng.weibull(1.5, size=n_samples)
exponential_rand = rng.exponential(size=n_samples)

data = vstack((uniform_rand, normal_rand, weibull_rand, exponential_rand)).T

variables = ["x_0", "x_1", "x_2", "x_3"]

data
array([[ 6.36961687e-01,  1.35543803e+00,  1.24477385e-01,
         1.91363961e-01],
       [ 2.69786714e-01,  2.21160257e-03,  8.14109465e-01,
         2.33137384e+00],
       [ 4.09735239e-02, -7.90544810e-01,  4.64297251e-01,
         5.53852517e-01],
       ...,
       [ 9.85769635e-01, -1.16187331e+00,  9.62671893e-01,
         8.66423178e-01],
       [ 4.28024519e-01,  2.72032137e-01,  5.03249648e-01,
         2.17492296e-01],
       [ 8.43014715e-01, -7.66939588e-01,  8.18740909e-01,
         8.75593057e-01]])

Create a ParametricStatistics object#

We create a ParametricStatistics object from this data encapsulated in a Dataset:

dataset = create_dataset("Dataset", data, variables)

and a list of names of candidate probability distributions: exponential, normal and uniform distributions (see ParametricStatistics.DistributionName). We do not use the default fitting criterion ('BIC') but 'Kolmogorov' (see ParametricStatistics.FittingCriterion and ParametricStatistics.SignificanceTest).

tested_distributions = ["Exponential", "Normal", "Uniform"]
analysis = create_statistics(
    dataset, tested_distributions=tested_distributions, fitting_criterion="Kolmogorov"
)
analysis
   INFO - 15:35:42: | Set goodness-of-fit criterion: Kolmogorov.
   INFO - 15:35:42: | Set significance level of hypothesis test: 0.05.
   INFO - 15:35:42: Fit different distributions (Exponential, Normal, Uniform) per variable and compute the goodness-of-fit criterion.
   INFO - 15:35:42: | Fit different distributions for x_0.
   INFO - 15:35:42: | Fit different distributions for x_1.
   INFO - 15:35:42: | Fit different distributions for x_2.
   INFO - 15:35:42: | Fit different distributions for x_3.
   INFO - 15:35:42: Select the best distribution for each variable.
WARNING - 15:35:42: All criteria values are lower than the significance level 0.05.
   INFO - 15:35:42: | The best distribution for x_0[0] is Uniform([-0.00168518,0.999196]).
   INFO - 15:35:42: | The best distribution for x_1[0] is Normal([-0.068321,0.937168]).
WARNING - 15:35:42: All criteria values are lower than the significance level 0.05.
   INFO - 15:35:42: | The best distribution for x_2[0] is Normal([0.905464,0.645926]).
   INFO - 15:35:42: | The best distribution for x_3[0] is Exponential([1.02161,0.00146538]).
ParametricStatistics(Dataset)
  • n_samples: 500
  • n_variables: 4
  • variables: x_0, x_1, x_2, x_3


Get statistics#

From this ParametricStatistics instance, we can easily get statistics for the different variables based on the selected distributions.

Get minimum#

Here is the minimum value for the different variables:

analysis.compute_minimum()
{'x_0': array([-0.00168518]), 'x_1': array([-inf]), 'x_2': array([-inf]), 'x_3': array([0.00146538])}

Get maximum#

Here is the minimum value for the different variables:

analysis.compute_maximum()
{'x_0': array([0.99919581]), 'x_1': array([inf]), 'x_2': array([inf]), 'x_3': array([inf])}

Get range#

Here is the range, i.e. the difference between the minimum and maximum values, for the different variables:

analysis.compute_range()
{'x_0': array([1.000881]), 'x_1': array([inf]), 'x_2': array([inf]), 'x_3': array([inf])}

Get mean#

Here is the mean value for the different variables:

analysis.compute_mean()
{'x_0': array([0.49875531]), 'x_1': array([-0.06832097]), 'x_2': array([0.90546442]), 'x_3': array([0.98031191])}

Get standard deviation#

Here is the standard deviation for the different variables:

analysis.compute_standard_deviation()
{'x_0': array([0.28892946]), 'x_1': array([0.93716844]), 'x_2': array([0.64592637]), 'x_3': array([0.97884653])}

Get variance#

Here is the variance for the different variables:

analysis.compute_variance()
{'x_0': array([0.08348023]), 'x_1': array([0.87828468]), 'x_2': array([0.41722087]), 'x_3': array([0.95814053])}

Get quantile#

Here is the quantile with level 80% for the different variables:

analysis.compute_quantile(0.8)
{'x_0': array([0.79901961]), 'x_1': array([0.72041989]), 'x_2': array([1.44908977]), 'x_3': array([1.5768581])}

Get quartile#

Here is the second quartile for the different variables:

analysis.compute_quartile(2)
{'x_0': array([0.49875531]), 'x_1': array([-0.06832097]), 'x_2': array([0.90546442]), 'x_3': array([0.67995009])}

Get percentile#

Here is the 50th percentile for the different variables:

analysis.compute_percentile(50)
{'x_0': array([0.49875531]), 'x_1': array([-0.06832097]), 'x_2': array([0.90546442]), 'x_3': array([0.67995009])}

Get median#

Here is the median for the different variables:

analysis.compute_median()
{'x_0': array([0.49875531]), 'x_1': array([-0.06832097]), 'x_2': array([0.90546442]), 'x_3': array([0.67995009])}

Get tolerance interval#

Here is the two-sided tolerance interval with a coverage level equal to 50% with a confidence level of 95% for the different variables:

analysis.compute_tolerance_interval(0.5)
{'x_0': [Bounds(lower=array([0.24854773]), upper=array([0.75453424]))], 'x_1': [Bounds(lower=array([-0.73596335]), upper=array([0.59932142]))], 'x_2': [Bounds(lower=array([0.44530401]), upper=array([1.36562484]))], 'x_3': [Bounds(lower=array([0.24115604]), upper=array([1.32178328]))]}

Get B-value#

Here is the B-value for the different variables, which is a left-sided tolerance interval with a coverage level equal to 90% with a confidence level of 95%:

analysis.compute_b_value()
{'x_0': array([[0.09841318]]), 'x_1': array([[-1.33841706]]), 'x_2': array([[0.0300737]]), 'x_3': array([[0.09423241]])}

Plot the distribution#

We can draw the empirical cumulative distribution function and the empirical probability density function:

analysis.plot()
  • Uniform([-0.00168518,0.999196])
  • Normal([-0.068321,0.937168])
  • Normal([0.905464,0.645926])
  • Exponential([1.02161,0.00146538])
{'x_0': <Figure size 640x320 with 2 Axes>, 'x_1': <Figure size 640x320 with 2 Axes>, 'x_2': <Figure size 640x320 with 2 Axes>, 'x_3': <Figure size 640x320 with 2 Axes>}

Total running time of the script: (0 minutes 0.728 seconds)

Gallery generated by Sphinx-Gallery