Note

Click here to download the full example code

Parametric estimation of statistics¶

In this example, we want to estimate statistics from synthetic data. These data are 500 realizations of x_0, x_1, x_2 and x_3 distributed in the following way:

x_0: standard uniform distribution,
x_1: standard normal distribution,
x_2: standard Weibull distribution,
x_3: standard exponential distribution.

These samples are generated from the NumPy library.

from gemseo.api import configure_logger
from gemseo.api import create_dataset
from gemseo.uncertainty.api import create_statistics
from numpy import vstack
from numpy.random import exponential
from numpy.random import normal
from numpy.random import rand
from numpy.random import seed
from numpy.random import weibull

configure_logger()

Out:

<RootLogger root (INFO)>

Create synthetic data¶

seed(0)

n_samples = 500

uniform_rand = rand(n_samples)
normal_rand = normal(size=n_samples)
weibull_rand = weibull(1.5, size=n_samples)
exponential_rand = exponential(size=n_samples)

data = vstack((uniform_rand, normal_rand, weibull_rand, exponential_rand)).T

variables = ["x_0", "x_1", "x_2", "x_3"]

print(data)

Out:

[[ 0.5488135  -0.98551074  1.37408242  1.11379656]
 [ 0.71518937 -1.47183501  2.13236167  0.63548465]
 [ 0.60276338  1.64813493  0.52518717  3.2112956 ]
 ...
 [ 0.40171354 -0.21252304  0.30225024  4.00986833]
 [ 0.24841347 -0.76211451  0.364483    0.55896365]
 [ 0.50586638 -0.88778014  0.82654114  2.12919171]]

Create a `ParametricStatistics` object¶

We create a ParametricStatistics object from this data encapsulated in a Dataset:

dataset = create_dataset("Dataset", data, variables)

and a list of names of candidate probability distributions: exponential, normal and uniform distributions (see ParametricStatistics.get_available_distributions()). We do not use the default fitting criterion (‘BIC’) but ‘Kolmogorov’ (see ParametricStatistics.get_available_criteria() and ParametricStatistics.get_significance_tests()).

tested_distributions = ["Exponential", "Normal", "Uniform"]
analysis = create_statistics(
    dataset, tested_distributions=tested_distributions, fitting_criterion="Kolmogorov"
)
print(analysis)

Out:

    INFO - 07:19:21: Create ParametricStatistics_Dataset, a ParametricStatistics library.
    INFO - 07:19:21: | Set goodness-of-fit criterion: Kolmogorov.
    INFO - 07:19:21: | Set significance level of hypothesis test: 0.05.
    INFO - 07:19:21: Fit different distributions (Exponential, Normal, Uniform) per variable and compute the goodness-of-fit criterion.
    INFO - 07:19:21: | Fit different distributions for x_0.
    INFO - 07:19:21: | Fit different distributions for x_1.
    INFO - 07:19:21: | Fit different distributions for x_2.
    INFO - 07:19:21: | Fit different distributions for x_3.
    INFO - 07:19:21: Select the best distribution for each variable.
    INFO - 07:19:21: | The best distribution for x_0 is Uniform(class=Point name=Unnamed dimension=2 values=[0.00271509,1.00083]).
    INFO - 07:19:21: | The best distribution for x_1 is Normal(class=Point name=Unnamed dimension=2 values=[-0.100117,0.985312]).
 WARNING - 07:19:21: All criteria values are lower than the significance level 0.05.
    INFO - 07:19:21: | The best distribution for x_2 is Normal(class=Point name=Unnamed dimension=2 values=[0.9783,0.665983]).
    INFO - 07:19:21: | The best distribution for x_3 is Exponential(class=Point name=Unnamed dimension=2 values=[1.02231,7.35553e-05]).
ParametricStatistics_Dataset
   n_samples: 500
   n_variables: 4
   variables: x_0, x_1, x_2, x_3

Print the fitting matrix¶

At this step, an optimal distribution has been selected for each variable based on the tested distributions and on the Kolmogorov fitting criterion. We can print the fitting matrix to see the goodness-of-fit measures for each pair < variable, distribution > as well as the selected distribution for each variable. Note that in the case of significance tests, the goodness-of-fit measures are the p-values.

print(analysis.get_fitting_matrix())

Out:

+----------+------------------------+------------------------+------------------------+-------------+
| Variable |      Exponential       |         Normal         |        Uniform         |  Selection  |
+----------+------------------------+------------------------+------------------------+-------------+
|   x_0    | 1.602160180879313e-10  |  0.005823020521403932  |   0.7338504331264553   |   Uniform   |
|   x_1    |  2.82659088382179e-53  |   0.8587721484840084   | 5.660300987516015e-18  |    Normal   |
|   x_2    | 1.5387797946575896e-09 | 0.0016128012413438864  | 7.748433868335025e-67  |    Normal   |
|   x_3    |   0.864074427829853    | 2.0987474708559965e-10 | 7.782983660200643e-152 | Exponential |
+----------+------------------------+------------------------+------------------------+-------------+

Get statistics¶

From this ParametricStatistics instance, we can easily get statistics for the different variables based on the selected distributions.

Get minimum¶

Here is the minimum value for the different variables:

print(analysis.compute_minimum())

Out:

{'x_0': array([0.00271509]), 'x_1': array([-inf]), 'x_2': array([-inf]), 'x_3': array([7.3555332e-05])}

Get maximum¶

Here is the minimum value for the different variables:

print(analysis.compute_maximum())

Out:

{'x_0': array([1.00082739]), 'x_1': array([inf]), 'x_2': array([inf]), 'x_3': array([inf])}

Get range¶

Here is the range, i.e. the difference between the minimum and maximum values, for the different variables:

print(analysis.compute_range())

Out:

{'x_0': array([0.99811229]), 'x_1': array([inf]), 'x_2': array([inf]), 'x_3': array([inf])}

Get mean¶

Here is the mean value for the different variables:

print(analysis.compute_mean())

Out:

{'x_0': array([0.50177124]), 'x_1': array([-0.1001173]), 'x_2': array([0.97829969]), 'x_3': array([0.97825244])}

Get standard deviation¶

Here is the standard deviation for the different variables:

print(analysis.compute_standard_deviation())

Out:

{'x_0': array([0.2881302]), 'x_1': array([0.98531188]), 'x_2': array([0.66598346]), 'x_3': array([0.97817888])}

Get variance¶

Here is the variance for the different variables:

print(analysis.compute_variance())

Out:

{'x_0': array([0.08301901]), 'x_1': array([0.9708395]), 'x_2': array([0.44353397]), 'x_3': array([0.95683393])}

Get quantile¶

Here is the quantile with level 80% for the different variables:

print(analysis.compute_quantile(0.8))

Out:

{'x_0': array([0.80120493]), 'x_1': array([0.72914209]), 'x_2': array([1.53880551]), 'x_3': array([1.57439174])}

Get quartile¶

Here is the second quartile for the different variables:

print(analysis.compute_quartile(2))

Out:

{'x_0': array([0.50177124]), 'x_1': array([-0.1001173]), 'x_2': array([0.97829969]), 'x_3': array([0.67809549])}

Get percentile¶

Here is the 50th percentile for the different variables:

print(analysis.compute_percentile(50))

Out:

{'x_0': array([0.50177124]), 'x_1': array([-0.1001173]), 'x_2': array([0.97829969]), 'x_3': array([0.67809549])}

Get median¶

Here is the median for the different variables:

print(analysis.compute_median())

Out:

{'x_0': array([0.50177124]), 'x_1': array([-0.1001173]), 'x_2': array([0.97829969]), 'x_3': array([0.67809549])}

Get tolerance interval¶

Here is the two-sided tolerance interval with a coverage level equal to 50% with a confidence level of 95% for the different variables:

print(analysis.compute_tolerance_interval(0.5, 0.95))

Out:

{'x_0': (array([0.2522558]), array([0.75684261])), 'x_1': (array([-0.80205726]), array([0.60182265])), 'x_2': (array([0.50385052]), array([1.45274885])), 'x_3': (array([0.23960073]), array([1.3194909]))}

Get B-value¶

Here is the B-value for the different variables, which is a left-sided tolerance interval with a coverage level equal to 90% with a confidence level of 95%:

print(analysis.compute_b_value())

Out:

{'x_0': array([0.10253656]), 'x_1': array([-1.43545972]), 'x_2': array([0.07572662]), 'x_3': array([0.09277731])}

Total running time of the script: ( 0 minutes 0.048 seconds)

Gallery generated by Sphinx-Gallery

Parametric estimation of statistics¶

Create synthetic data¶

Create a ParametricStatistics object¶

Print the fitting matrix¶

Get statistics¶

Get minimum¶

Get maximum¶

Get range¶

Get mean¶

Get standard deviation¶

Get variance¶

Get quantile¶

Get quartile¶

Get percentile¶

Get median¶

Get tolerance interval¶

Get B-value¶

Create a `ParametricStatistics` object¶