# parametric module

Class for the parametric estimation of statistics from a dataset.

## Overview

The ParametricStatistics class inherits from the abstract Statistics class and aims to estimate statistics from a Dataset, based on candidate parametric distributions calibrated from this Dataset.

For each variable,

1. the parameters of these distributions are calibrated from the Dataset,

2. the fitted parametric Distribution which is optimal in the sense of a goodness-of-fit criterion and a selection criterion is selected to estimate the statistics related to this variable.
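The two steps above can be sketched with the standard library alone. This is a minimal illustration, not the OpenTURNS-based implementation used by ParametricStatistics: the three candidate distributions, the maximum-likelihood fits, the helper names (`fit_normal`, `fit_uniform`, `fit_exponential`, `bic`, `select_by_bic`) and the textbook BIC formula (number of parameters times log of the sample size, minus twice the log-likelihood) are all assumptions made for the example.

```python
import math
import random
import statistics


def fit_normal(data):
    """Fit a normal distribution by maximum likelihood.

    Returns the maximized log-likelihood and the number of parameters.
    """
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data, mu)  # the MLE uses the population formula
    log_lik = sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
        for x in data
    )
    return log_lik, 2


def fit_uniform(data):
    """Fit a uniform distribution on [min(data), max(data)]."""
    width = max(data) - min(data)
    return -len(data) * math.log(width), 2


def fit_exponential(data):
    """Fit an exponential distribution; impossible if some data are negative."""
    if min(data) < 0:
        return -math.inf, 1
    rate = 1 / statistics.fmean(data)
    return sum(math.log(rate) - rate * x for x in data), 1


def bic(log_lik, n_params, n_samples):
    """Textbook Bayesian information criterion (lower is better)."""
    return n_params * math.log(n_samples) - 2 * log_lik


# Hypothetical candidate list, playing the role of the distribution names.
CANDIDATES = {
    "Normal": fit_normal,
    "Uniform": fit_uniform,
    "Exponential": fit_exponential,
}


def select_by_bic(data):
    """Step 1: calibrate every candidate; step 2: keep the one with the lowest BIC."""
    scores = {name: bic(*fit(data), len(data)) for name, fit in CANDIDATES.items()}
    return min(scores, key=scores.get), scores


rng = random.Random(0)
sample = [rng.gauss(0.5, 2.0) for _ in range(500)]
best_name, bic_scores = select_by_bic(sample)
print(best_name, bic_scores)
```

A lower BIC indicates a better trade-off between goodness of fit and number of parameters, so the candidate with the smallest value is retained for this variable.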

The ParametricStatistics relies on the OpenTURNS library through the OTDistribution and OTDistributionFitter classes.

## Construction

The ParametricStatistics is built from two mandatory arguments:

• a dataset,

• a list of distribution names,

and can take optional arguments:

• a subset of variable names (by default, statistics are computed for all variables),

• a fitting criterion name (by default, BIC is used; see AVAILABLE_CRITERIA and AVAILABLE_SIGNIFICANCE_TESTS for more information),

• a level associated with the fitting criterion,

• a selection criterion:

  • ‘best’: select the distribution minimizing (or maximizing, depending on the criterion) the criterion,

  • ‘first’: select the first distribution for which the criterion is greater (or lower, depending on the criterion) than the level,

• a name for the ParametricStatistics object (by default, the name is the concatenation of ‘ParametricStatistics’ and the name of the Dataset).
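The two selection criteria can be sketched as follows. The function `select_distribution` and its arguments are hypothetical and only mirror the description above. In particular, falling back to the ‘best’ rule when no candidate passes the level, and ignoring the level for criteria that are not significance tests, are assumptions of this sketch rather than documented behaviour.

```python
def select_distribution(criteria, is_test, selection_criterion="best", level=0.05):
    """Pick a distribution name from its criterion values.

    criteria: mapping from distribution name to criterion value,
        in the order of the candidate list.
    is_test: True if the values are p-values (higher is better),
        False for an information criterion such as BIC (lower is better).
    """
    if selection_criterion not in ("best", "first"):
        raise ValueError("selection_criterion must be 'best' or 'first'.")

    if selection_criterion == "first" and is_test:
        # Keep the first candidate that is not rejected at the given level.
        for name, p_value in criteria.items():
            if p_value >= level:
                return name

    # 'best' rule, also used as a fallback (assumption of this sketch).
    pick = max if is_test else min
    return pick(criteria, key=criteria.get)


# Made-up p-values, for illustration only.
p_values = {"Normal": 0.20, "Uniform": 0.60, "Triangular": 0.01}
print(select_distribution(p_values, is_test=True, selection_criterion="first"))  # Normal
print(select_distribution(p_values, is_test=True, selection_criterion="best"))  # Uniform
```

With ‘first’, the order of the candidate list matters: a distribution that merely passes the test can be retained even though a later candidate fits better.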

## Capabilities

By inheritance, a ParametricStatistics object has the same capabilities as Statistics. Additional ones are:

class gemseo.uncertainty.statistics.parametric.ParametricStatistics(dataset, distributions, variables_names=None, fitting_criterion='BIC', level=0.05, selection_criterion='best', name=None)

Bases: Statistics

A toolbox to compute statistics based on probability distribution-fitting.

Unless otherwise stated, the statistics are computed variable-wise and component-wise, i.e. variable by variable and component by component. So, for the sake of readability, the methods named compute_statistic() return dict[str, ndarray] objects whose keys are the names of the variables and whose values are the statistic estimated for the different components.

Examples

>>> from gemseo.api import (
...     create_discipline,
...     create_parameter_space,
...     create_scenario
... )
>>> from gemseo.uncertainty.statistics.parametric import ParametricStatistics
>>>
>>> expressions = {"y1": "x1+2*x2", "y2": "x1-3*x2"}
>>> discipline = create_discipline(
...     "AnalyticDiscipline", expressions=expressions
... )
>>>
>>> parameter_space = create_parameter_space()
>>> parameter_space.add_random_variable(
...     "x1", "OTUniformDistribution", minimum=-1, maximum=1
... )
>>> parameter_space.add_random_variable(
...     "x2", "OTNormalDistribution", mu=0.5, sigma=2
... )
>>>
>>> scenario = create_scenario(
...     [discipline],
...     "DisciplinaryOpt",
...     "y1", parameter_space, scenario_type="DOE"
... )
>>> scenario.execute({'algo': 'OT_MONTE_CARLO', 'n_samples': 100})
>>>
>>> dataset = scenario.export_to_dataset(opt_naming=False)
>>>
>>> statistics = ParametricStatistics(
...     dataset, ['Normal', 'Uniform', 'Triangular']
... )
>>> fitting_matrix = statistics.get_fitting_matrix()
>>> mean = statistics.compute_mean()

Parameters:
• dataset (Dataset) – A dataset.

• distributions (Sequence[str]) – The names of the distributions.

• variables_names (Iterable[str] | None) – The variables of interest. Default: consider all the variables available in the dataset.

• fitting_criterion (str) –

The name of the goodness-of-fit criterion, measuring how the distribution fits the data. Use ParametricStatistics.get_criteria() to get the available criteria.

By default it is set to “BIC”.

• level (float) –

A test level, i.e. the risk of committing a Type I error (incorrectly rejecting a true null hypothesis), for criteria based on hypothesis tests.

By default it is set to 0.05.

• selection_criterion (str) –

The name of the selection criterion to select a distribution from a list of candidates. Either ‘first’ or ‘best’.

By default it is set to “best”.

• name (str | None) – A name for the object. Default: use the concatenation of the class and dataset names.

compute_a_value()

Compute the A-value $$\text{Aval}[X]$$.

The A-value is the lower bound of the lower-sided tolerance interval associated with a coverage level equal to 99% and a confidence level equal to 95%.

Returns:

The component-wise A-value of the different variables.

Return type:

dict[str, ndarray]

compute_b_value()

Compute the B-value $$\text{Bval}[X]$$.

The B-value is the lower bound of the lower-sided tolerance interval associated with a coverage level equal to 90% and a confidence level equal to 95%.

Returns:

The component-wise B-value of the different variables.

Return type:

dict[str, ndarray]

classmethod compute_expression(variable_name, statistic_name, show_name=False, **options)

Return the expression of a statistical function applied to a variable.

E.g. “P[X >= 1.0]” for the probability that X exceeds 1.0.

Parameters:
• variable_name (str) – The name of the variable, e.g. "X".

• statistic_name (str) – The name of the statistic, e.g. "probability".

• show_name (bool) –

If True, show option names. Otherwise, only show option values.

By default it is set to False.

• **options (bool | float | int) – The options passed to the statistical function, e.g. {"greater": True, "thresh": 1.0}.

Returns:

The expression of the statistical function applied to the variable.

Return type:

str

compute_joint_probability(thresh, greater=True)

Compute the joint probability related to a threshold.

Either $$\mathbb{P}[X \geq x]$$ or $$\mathbb{P}[X \leq x]$$.

Parameters:
• thresh (Mapping[str, float | ndarray]) – A threshold $$x$$ per variable.

• greater (bool) –

The type of probability. If True, compute the probability of exceeding the threshold. Otherwise, compute the opposite.

By default it is set to True.

Returns:

The joint probability of the different variables (by definition of the joint probability, this statistic is not computed component-wise).

Return type:

dict[str, float]

compute_margin(std_factor)

Compute a margin $$\text{Margin}[X]=\mathbb{E}[X]+\kappa\mathbb{S}[X]$$.
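As a worked instance of this formula, the sketch below evaluates the margin of a single scalar variable. The `NormalDist` object stands in for a fitted distribution and its parameters are made up; the helper `margin` is hypothetical and is not part of the ParametricStatistics API.

```python
from statistics import NormalDist

# Stand-in for the distribution fitted to one component of a variable
# (parameters chosen arbitrarily for the example).
fitted = NormalDist(mu=10.0, sigma=2.0)


def margin(dist, std_factor):
    """Return E[X] + kappa * S[X] for a distribution exposing mean and stdev."""
    return dist.mean + std_factor * dist.stdev


print(margin(fitted, 3.0))  # 10 + 3 * 2 = 16.0
```

A negative `std_factor` gives a lower margin in the same way, e.g. `margin(fitted, -3.0)` is the mean minus three standard deviations.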

Parameters:

std_factor (float) – The weight $$\kappa$$ of the standard deviation.

Returns:

The component-wise margin for the different variables.

Return type:

dict[str, ndarray]

compute_maximum()

Compute the maximum $$\text{Max}[X]$$.

Returns:

The component-wise maximum of the different variables.

Return type:

dict[str, ndarray]

compute_mean()

Compute the mean $$\mathbb{E}[X]$$.

Returns:

The component-wise mean of the different variables.

Return type:

dict[str, ndarray]

compute_mean_std(std_factor)

Compute a margin $$\text{Margin}[X]=\mathbb{E}[X]+\kappa\mathbb{S}[X]$$.

Parameters:

std_factor (float) – The weight $$\kappa$$ of the standard deviation.

Returns:

The component-wise margin for the different variables.

Return type:

dict[str, ndarray]

compute_median()

Compute the median $$\text{Med}[X]$$.

Returns:

The component-wise median of the different variables.

Return type:

dict[str, ndarray]

compute_minimum()

Compute the minimum $$\text{Min}[X]$$.

Returns:

The component-wise minimum of the different variables.

Return type:

dict[str, ndarray]

compute_moment(order)

Compute the n-th moment $$M[X; n]$$.

Parameters:

order (int) – The order $$n$$ of the moment.

Returns:

The component-wise moment of the different variables.

Return type:

dict[str, ndarray]

compute_percentile(order)

Compute the n-th percentile $$\text{p}[X; n]$$.

Parameters:

order (int) – The order $$n\in\{0,1,2,...100\}$$ of the percentile.

Returns:

The component-wise percentile of the different variables.

Raises:

ValueError – When $$n\notin\{0,1,2,...100\}$$.

Return type:

dict[str, ndarray]

compute_probability(thresh, greater=True)

Compute the probability related to a threshold.

Either $$\mathbb{P}[X \geq x]$$ or $$\mathbb{P}[X \leq x]$$.
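For a continuous fitted distribution with cumulative distribution function F, these two probabilities are 1 - F(x) and F(x). The sketch below illustrates this for one scalar component; the `NormalDist` stand-in, its parameters and the helper `probability` are assumptions made for the example, not the library's code.

```python
from statistics import NormalDist

# Stand-in for the distribution fitted to one component of a variable.
fitted = NormalDist(mu=0.5, sigma=2.0)


def probability(dist, thresh, greater=True):
    """Return P[X >= thresh] if greater is True, otherwise P[X <= thresh]."""
    below = dist.cdf(thresh)
    return 1.0 - below if greater else below


print(probability(fitted, 1.0))  # probability of exceeding 1.0
print(probability(fitted, 1.0, greater=False))  # probability of staying below 1.0
```

Because the distribution is continuous, the two results sum to one for any threshold.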

Parameters:
• thresh (Mapping[str, float | ndarray]) – A threshold $$x$$ per variable.

• greater (bool) –

The type of probability. If True, compute the probability of exceeding the threshold. Otherwise, compute the opposite.

By default it is set to True.

Returns:

The component-wise probability of the different variables.

Return type:

dict[str, ndarray]

compute_quantile(prob)

Compute the quantile $$\mathbb{Q}[X; \alpha]$$ related to a probability.
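By the usual definitions, the n-th percentile is the quantile of level n/100 and the n-th quartile is the quantile of level n/4. The sketch below shows these relations on a scalar stand-in distribution; the helpers `quantile`, `percentile` and `quartile` are hypothetical, and returning infinite values at levels 0 and 1 is a choice made here for an unbounded distribution.

```python
import math
from statistics import NormalDist


def quantile(dist, prob):
    """Return the quantile of level prob, infinite at the ends of the support."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must lie in [0, 1].")
    if prob == 0.0:
        return -math.inf
    if prob == 1.0:
        return math.inf
    return dist.inv_cdf(prob)


def percentile(dist, order):
    """Return the n-th percentile, i.e. the quantile of level n / 100."""
    if order not in range(101):
        raise ValueError("order must be an integer between 0 and 100.")
    return quantile(dist, order / 100)


def quartile(dist, order):
    """Return the n-th quartile, i.e. the quantile of level n / 4."""
    if order not in (1, 2, 3):
        raise ValueError("order must be 1, 2 or 3.")
    return quantile(dist, order / 4)


# Stand-in for the distribution fitted to one component of a variable.
fitted = NormalDist(mu=0.5, sigma=2.0)
print(quantile(fitted, 0.95), percentile(fitted, 95), quartile(fitted, 2))
```

The second quartile and the 50th percentile both coincide with the median, which equals the mean for this symmetric stand-in.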

Parameters:

prob (float) – A probability $$\alpha$$ between 0 and 1.

Returns:

The component-wise quantile of the different variables.

Return type:

dict[str, ndarray]

compute_quartile(order)

Compute the n-th quartile $$q[X; n]$$.

Parameters:

order (int) – The order $$n\in\{1,2,3\}$$ of the quartile.

Returns:

The component-wise quartile of the different variables.

Raises:

ValueError – When $$n\notin\{1,2,3\}$$.

Return type:

dict[str, ndarray]

compute_range()

Compute the range $$R[X]$$.

Returns:

The component-wise range of the different variables.

Return type:

dict[str, ndarray]

compute_standard_deviation()

Compute the standard deviation $$\mathbb{S}[X]$$.

Returns:

The component-wise standard deviation of the different variables.

Return type:

dict[str, ndarray]

compute_tolerance_interval(coverage, confidence=0.95, side=ToleranceIntervalSide.BOTH)

Compute a $$(p,1-\alpha)$$ tolerance interval $$\text{TI}[X]$$.

The tolerance interval $$\text{TI}[X]$$ is defined to contain at least a proportion $$p$$ of the values of $$X$$ with a level of confidence $$1-\alpha$$. $$p$$ is also called the coverage level of the TI.

Typically, $$\alpha=0.05$$ or equivalently $$1-\alpha=0.95$$.

The tolerance interval can be either

• lower-sided (side="LOWER": $$[L, +\infty[$$),

• upper-sided (side="UPPER": $$]-\infty, U]$$) or

• both-sided (side="BOTH": $$[L, U]$$).
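To make the lower-sided case concrete, the sketch below computes the bound L for normally distributed data as the sample mean minus k times the sample standard deviation. The factor k uses a classical closed-form approximation for normal one-sided tolerance limits (the exact factor involves the noncentral t distribution). This is an illustration under a normality assumption, with hypothetical helper names; it is not the computation performed by ParametricStatistics, which depends on the fitted distribution.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def one_sided_factor(n_samples, coverage, confidence=0.95):
    """Approximate one-sided tolerance factor k for normal data."""
    z_p = NormalDist().inv_cdf(coverage)
    z_g = NormalDist().inv_cdf(confidence)
    a = 1.0 - z_g**2 / (2.0 * (n_samples - 1))
    b = z_p**2 - z_g**2 / n_samples
    return (z_p + math.sqrt(z_p**2 - a * b)) / a


def lower_tolerance_bound(data, coverage, confidence=0.95):
    """Return L such that [L, +inf[ covers a proportion `coverage` of the values
    with the given confidence, assuming the data are normally distributed."""
    k = one_sided_factor(len(data), coverage, confidence)
    return fmean(data) - k * stdev(data)


rng = random.Random(1)
data = [rng.gauss(100.0, 5.0) for _ in range(30)]

# Same coverage and confidence levels as the A-value and B-value.
a_value = lower_tolerance_bound(data, coverage=0.99, confidence=0.95)
b_value = lower_tolerance_bound(data, coverage=0.90, confidence=0.95)
print(a_value, b_value)
```

The A-value is always below the B-value because covering 99% of the population requires a larger factor k than covering 90%. By symmetry, the upper-sided bound for normal data is the sample mean plus k times the sample standard deviation.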

Parameters:
• coverage (float) – The minimum proportion $$p\in[0,1]$$ of values belonging to the TI.

• confidence (float) –

A level of confidence $$1-\alpha\in[0,1]$$.

By default it is set to 0.95.

• side (ToleranceIntervalSide) –

The type of the tolerance interval.

By default it is set to BOTH.

Returns:

The component-wise tolerance intervals of the different variables, expressed as {variable_name: [(lower_bound, upper_bound), ...], ... } where [(lower_bound, upper_bound), ...] are the lower and upper bounds of the tolerance interval of the different components of variable_name.

Return type:
compute_variance()

Compute the variance $$\mathbb{V}[X]$$.

Returns:

The component-wise variance of the different variables.

Return type:

dict[str, ndarray]

compute_variation_coefficient()

Compute the coefficient of variation $$CoV[X]$$.

This is the standard deviation normalized by the expectation: $$CoV[X]=\mathbb{S}[X]/\mathbb{E}[X]$$.

Returns:

The component-wise coefficient of variation of the different variables.

Return type:

dict[str, ndarray]

get_criteria(variable, index=0)

Get the value of the fitting criterion for the different distributions.

Parameters:
• variable (str) – The name of the variable.

• index (int) –

The component of the variable.

By default it is set to 0.

Returns:

The value of the fitting criterion for the given variable name and component and for the different distributions, as well as whether this fitting criterion is a statistical test, in which case the value is a p-value.

Return type:

tuple[dict[str, float], bool]

get_fitting_matrix()

Get the fitting matrix.

This matrix contains goodness-of-fit measures for each pair < variable, distribution >.

Returns:

The printable fitting matrix.

Return type:

str

plot_criteria(variable, title=None, save=False, show=True, n_legend_cols=4, directory='.', index=0, fig_size=(6.4, 3.2))

Plot criteria for a given variable name.

Parameters:
• variable (str) – The name of the variable.

• title (str | None) – A plot title.

• save (bool) –

If True, save the plot on the disk.

By default it is set to False.

• show (bool) –

If True, show the plot.

By default it is set to True.

• n_legend_cols (int) –

The number of text columns in the upper legend.

By default it is set to 4.

• directory (str | Path) –

The directory path, either absolute or relative.

By default it is set to “.”.

• index (int) –

The index of the component of the variable.

By default it is set to 0.

• fig_size (FigSizeType) –

The width and height of the figure in inches, e.g. (w, h).

By default it is set to (6.4, 3.2).

Raises:

ValueError – If the variable is missing from the dataset.

Return type:

None

AVAILABLE_CRITERIA: ClassVar[list[str]] = ['BIC', 'ChiSquared', 'Kolmogorov']

The names of the available fitting criteria.

AVAILABLE_DISTRIBUTIONS: ClassVar[list[str]] = ['Arcsine', 'Beta', 'Burr', 'Chi', 'ChiSquare', 'Dirichlet', 'Exponential', 'FisherSnedecor', 'Frechet', 'Gamma', 'GeneralizedPareto', 'Gumbel', 'Histogram', 'InverseNormal', 'Laplace', 'LogNormal', 'LogUniform', 'Logistic', 'MeixnerDistribution', 'Normal', 'Pareto', 'Rayleigh', 'Rice', 'Student', 'Trapezoidal', 'Triangular', 'TruncatedNormal', 'Uniform', 'VonMises', 'WeibullMax', 'WeibullMin']

The names of the available probability distributions.

AVAILABLE_SIGNIFICANCE_TESTS: ClassVar[list[str]] = ['ChiSquared', 'Kolmogorov']

The names of the available significance tests.

SYMBOLS = {'a_value': 'Aval', 'b_value': 'Bval', 'margin': 'Margin', 'maximum': 'Max', 'mean': 'E', 'mean_std': 'E_StD', 'median': 'Med', 'minimum': 'Min', 'moment': 'M', 'percentile': 'p', 'probability': 'P', 'quantile': 'Q', 'quartile': 'q', 'range': 'R', 'standard_deviation': 'StD', 'tolerance_interval': 'TI', 'variance': 'V', 'variation_coefficient': 'CoV'}
dataset: Dataset

The dataset.

distributions: dict[str, DistributionType | list[DistributionType]]

The probability distributions of the random variables.

When a random variable is a random vector, its probability distribution is expressed as a list of marginal distributions. Otherwise, its probability distribution is expressed as the unique marginal distribution.

fitting_criterion: str

The name of the goodness-of-fit criterion, measuring how the distribution fits the data.

level: float

The test level, i.e. the risk of committing a Type I error (incorrectly rejecting a true null hypothesis), for criteria based on hypothesis tests.

n_samples: int

The number of samples.

n_variables: int

The number of variables.

name: str

The name of the object.

selection_criterion: str

The name of the selection criterion to select a distribution from a list of candidates.

## Examples using ParametricStatistics

Parametric estimation of statistics
