parametric module

Class for the parametric estimation of statistics from a dataset.

Overview

The ParametricStatistics class inherits from the abstract Statistics class and aims to estimate statistics from a Dataset, based on candidate parametric distributions calibrated from this Dataset.

For each variable,

  1. the parameters of these distributions are calibrated from the Dataset,

  2. the fitted parametric Distribution which is optimal in the sense of a goodness-of-fit criterion and a selection criterion is selected to estimate the statistics related to this variable.
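
The two steps above can be sketched in plain Python, without GEMSEO or OpenTURNS. This is only an illustration under stated assumptions: the helpers `fit_normal`, `fit_uniform` and `bic` are hypothetical names, the parameters are maximum-likelihood estimates, and the textbook definition BIC = k ln(n) - 2 ln(L) is used, which may differ from the normalization applied by OpenTURNS.

```python
import math
import random


def fit_normal(data):
    """Return (maximized log-likelihood, number of parameters) for a normal law."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n  # ML estimate of the variance
    return -0.5 * n * (math.log(2 * math.pi * var) + 1), 2


def fit_uniform(data):
    """Return (maximized log-likelihood, number of parameters) for a uniform law."""
    lower, upper = min(data), max(data)  # ML estimates of the bounds
    return -len(data) * math.log(upper - lower), 2


def bic(log_likelihood, n_parameters, n_samples):
    """Textbook BIC (assumption: not necessarily the OpenTURNS normalization)."""
    return n_parameters * math.log(n_samples) - 2 * log_likelihood


random.seed(0)
sample = [random.gauss(0.5, 2.0) for _ in range(1000)]

# Step 1: calibrate each candidate distribution from the data.
candidates = {"Normal": fit_normal, "Uniform": fit_uniform}
criteria = {
    name: bic(*fit(sample), len(sample)) for name, fit in candidates.items()
}

# Step 2: keep the candidate with the lowest BIC.
selected = min(criteria, key=criteria.get)
print(selected, criteria)
```

With a sample drawn from a normal law, the normal candidate is expected to have the lowest BIC and therefore to be selected.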

The ParametricStatistics relies on the OpenTURNS library through the OTDistribution and OTDistributionFitter classes.

Construction

The ParametricStatistics is built from two mandatory arguments:

  • a dataset,

  • a list of distribution names,

and can consider optional arguments:

  • a subset of variable names (by default, statistics are computed for all variables),

  • a fitting criterion name (by default, BIC is used; see AVAILABLE_CRITERIA and AVAILABLE_SIGNIFICANCE_TESTS for more information),

  • a level associated with the fitting criterion,

  • a selection criterion:

    • ‘best’: select the distribution minimizing (or maximizing, depending on the criterion) the criterion,

    • ‘first’: select the first distribution for which the criterion is greater (or lower, depending on the criterion) than the level,

  • a name for the ParametricStatistics object (by default, the name is the concatenation of ‘ParametricStatistics’ and the name of the Dataset).
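
The difference between the ‘best’ and ‘first’ selection criteria can be illustrated with a small stand-alone function. This is a sketch, not the GEMSEO implementation: `select_distribution` and its `is_p_value` flag are hypothetical, and the fallback to the best candidate when no distribution passes the level is an assumption.

```python
def select_distribution(
    criteria, selection_criterion="best", level=0.05, is_p_value=False
):
    """Select a distribution name from a {name: criterion value} mapping.

    Assumption: a p-value is better when higher,
    any other criterion (e.g. BIC) is better when lower.
    """
    pick_best = max if is_p_value else min
    if selection_criterion == "best":
        return pick_best(criteria, key=criteria.get)
    if selection_criterion == "first":
        # Candidates are scanned in the order in which they were given.
        for name, value in criteria.items():
            if (value >= level) if is_p_value else (value <= level):
                return name
        # Assumption: fall back to the best candidate if none passes.
        return pick_best(criteria, key=criteria.get)
    raise ValueError(f"Unknown selection criterion: {selection_criterion!r}")


p_values = {"Uniform": 0.01, "Normal": 0.30, "Triangular": 0.45}
first = select_distribution(p_values, "first", level=0.05, is_p_value=True)
best = select_distribution(p_values, "best", is_p_value=True)

bic_values = {"Normal": 4.2, "Uniform": 5.1}
best_bic = select_distribution(bic_values, "best")
print(first, best, best_bic)
```

Here ‘first’ stops at the first candidate whose p-value reaches the level, whereas ‘best’ scans all candidates, so the two rules can return different distributions for the same data.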

Capabilities

By inheritance, a ParametricStatistics object has the same capabilities as Statistics. Additional ones are:

  • get_fitting_matrix(): this method returns, in printable form, the values of the fitting criterion for the different variables and candidate probability distributions, as well as the selected probability distribution,

  • plot_criteria(): this method plots the criterion values for a given variable.

class gemseo.uncertainty.statistics.parametric.ParametricStatistics(dataset, distributions, variables_names=None, fitting_criterion='BIC', level=0.05, selection_criterion='best', name=None)

Bases: gemseo.uncertainty.statistics.statistics.Statistics

Parametric estimation of statistics.

Examples

>>> from gemseo.api import (
...     create_discipline,
...     create_parameter_space,
...     create_scenario
... )
>>> from gemseo.uncertainty.statistics.parametric import ParametricStatistics
>>>
>>> expressions = {"y1": "x1+2*x2", "y2": "x1-3*x2"}
>>> discipline = create_discipline(
...     "AnalyticDiscipline", expressions=expressions
... )
>>>
>>> parameter_space = create_parameter_space()
>>> parameter_space.add_random_variable(
...     "x1", "OTUniformDistribution", minimum=-1, maximum=1
... )
>>> parameter_space.add_random_variable(
...     "x2", "OTNormalDistribution", mu=0.5, sigma=2
... )
>>>
>>> scenario = create_scenario(
...     [discipline],
...     "DisciplinaryOpt",
...     "y1", parameter_space, scenario_type="DOE"
... )
>>> scenario.execute({'algo': 'OT_MONTE_CARLO', 'n_samples': 100})
>>>
>>> dataset = scenario.export_to_dataset(opt_naming=False)
>>>
>>> statistics = ParametricStatistics(
...     dataset, ['Normal', 'Uniform', 'Triangular']
... )
>>> fitting_matrix = statistics.get_fitting_matrix()
>>> mean = statistics.compute_mean()
Parameters
  • dataset (Dataset) – A dataset.

  • distributions (Sequence[str]) – The names of the distributions.

  • variables_names (Iterable[str] | None) –

    The variables of interest. Default: consider all the variables available in the dataset.

    By default it is set to None.

  • fitting_criterion (str) –

    The name of the goodness-of-fit criterion, measuring how the distribution fits the data. See ParametricStatistics.AVAILABLE_CRITERIA for the available criteria.

    By default it is set to BIC.

  • level (float) –

    A test level, i.e. the risk of committing a Type 1 error, that is an incorrect rejection of a true null hypothesis, for criteria based on hypothesis tests.

    By default it is set to 0.05.

  • selection_criterion (str) –

    The name of the selection criterion to select a distribution from a list of candidates. Either ‘first’ or ‘best’.

    By default it is set to best.

  • name (str | None) –

    A name for the object. Default: use the concatenation of the class and dataset names.

    By default it is set to None.

Return type

None

compute_a_value()

Compute the A-value \(\text{Aval}[X]\).

Returns

The A-value of the different variables.

Return type

dict[str, numpy.ndarray]

compute_b_value()

Compute the B-value \(\text{Bval}[X]\).

Returns

The B-value of the different variables.

Return type

dict[str, numpy.ndarray]

classmethod compute_expression(variable_name, statistic_name, show_name=False, **options)

Return the expression of a statistical function applied to a variable.

E.g. “P[X >= 1.0]” for the probability that X exceeds 1.0.

Parameters
  • variable_name (str) – The name of the variable, e.g. "X".

  • statistic_name (str) – The name of the statistic, e.g. "probability".

  • show_name (bool) –

    If True, show option names. Otherwise, only show option values.

    By default it is set to False.

  • **options (bool | float | int) – The options passed to the statistical function, e.g. {"greater": True, "thresh": 1.0}.

Returns

The expression of the statistical function applied to the variable.

Return type

str
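
The construction of such an expression can be sketched as follows. Only the result “P[X >= 1.0]” is taken from the description above; the function `expression`, the reduced `SYMBOLS` mapping and the formatting of the other cases (the `"; "` separator, the `name=value` form) are assumptions made for illustration.

```python
# Subset of the class attribute SYMBOLS, copied here to keep the sketch standalone.
SYMBOLS = {"mean": "E", "probability": "P", "quantile": "Q"}


def expression(variable_name, statistic_name, show_name=False, **options):
    """Build a string such as "P[X >= 1.0]" (hypothetical helper)."""
    symbol = SYMBOLS.get(statistic_name, statistic_name)
    options = dict(options)
    greater = options.pop("greater", None)
    values = ", ".join(
        f"{name}={value}" if show_name else str(value)
        for name, value in sorted(options.items())
    )
    if greater is not None:
        # Probability-like statistics: show the comparison with the threshold.
        operator = ">=" if greater else "<="
        return f"{symbol}[{variable_name} {operator} {values}]"
    if values:
        return f"{symbol}[{variable_name}; {values}]"
    return f"{symbol}[{variable_name}]"


print(expression("X", "probability", greater=True, thresh=1.0))
print(expression("X", "quantile", show_name=True, prob=0.1))
print(expression("X", "mean"))
```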

compute_margin(std_factor)

Compute a margin \(\text{Margin}[X]=\mathbb{E}[X]+\kappa\mathbb{S}[X]\).

Parameters

std_factor (float) – The weight \(\kappa\) of the standard deviation.

Returns

The margin for the different variables.

Return type

dict[str, numpy.ndarray]
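
As a worked instance of the margin formula, assume that the distribution selected for a variable is normal with mean 0.5 and standard deviation 2 (illustrative values). The sketch below uses `statistics.NormalDist` from the Python standard library as a stand-in for the fitted distribution; the helper `margin` is hypothetical.

```python
from statistics import NormalDist

# Stand-in for the distribution selected for one variable (assumed values).
fitted = NormalDist(mu=0.5, sigma=2.0)


def margin(distribution, std_factor):
    """Return E[X] + kappa * S[X] for the given distribution."""
    return distribution.mean + std_factor * distribution.stdev


margin_3 = margin(fitted, 3.0)  # 0.5 + 3 * 2
print(margin_3)
```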

compute_maximum()

Compute the maximum \(\text{Max}[X]\).

Returns

The maximum of the different variables.

Return type

dict[str, numpy.ndarray]

compute_mean()

Compute the mean \(\mathbb{E}[X]\).

Returns

The mean of the different variables.

Return type

dict[str, numpy.ndarray]

compute_mean_std(std_factor)

Compute a margin \(\text{Margin}[X]=\mathbb{E}[X]+\kappa\mathbb{S}[X]\).

Parameters

std_factor (float) – The weight \(\kappa\) of the standard deviation.

Returns

The margin for the different variables.

Return type

dict[str, numpy.ndarray]

compute_median()

Compute the median \(\text{Med}[X]\).

Returns

The median of the different variables.

Return type

dict[str, numpy.ndarray]

compute_minimum()

Compute the minimum \(\text{Min}[X]\).

Returns

The minimum of the different variables.

Return type

dict[str, numpy.ndarray]

compute_moment(order)

Compute the n-th moment \(M[X; n]\).

Parameters

order (int) – The order \(n\) of the moment.

Returns

The moment of the different variables.

Return type

dict[str, numpy.ndarray]

compute_percentile(order)

Compute the n-th percentile \(\text{p}[X; n]\).

Parameters

order (int) – The order \(n\) of the percentile. Either 0, 1, 2, … or 100.

Returns

The percentile of the different variables.

Return type

dict[str, numpy.ndarray]

compute_probability(thresh, greater=True)

Compute the probability related to a threshold.

Either \(\mathbb{P}[X \geq x]\) or \(\mathbb{P}[X \leq x]\).

Parameters
  • thresh (float) – A threshold \(x\).

  • greater (bool) –

    The type of probability. If True, compute the probability of exceeding the threshold. Otherwise, compute the opposite.

    By default it is set to True.

Returns

The probability of the different variables.

Return type

dict[str, numpy.ndarray]
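
For a continuous distribution with cumulative distribution function F, the two probabilities are 1 - F(x) and F(x). The sketch below illustrates this with a standard normal law taken from the Python standard library; the helper `probability` is hypothetical and only mirrors the `thresh` and `greater` arguments described above.

```python
from statistics import NormalDist

# Stand-in for the distribution selected for one variable (assumed values).
fitted = NormalDist(mu=0.0, sigma=1.0)


def probability(distribution, thresh, greater=True):
    """Return P[X >= thresh] if greater is True, P[X <= thresh] otherwise."""
    cdf_value = distribution.cdf(thresh)
    return 1.0 - cdf_value if greater else cdf_value


p_exceed = probability(fitted, 1.0)
p_below = probability(fitted, 1.0, greater=False)
print(p_exceed, p_below)
```

For a continuous distribution the two values sum to one, since the threshold itself carries no probability mass.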

compute_quantile(prob)

Compute the quantile \(\mathbb{Q}[X; \alpha]\) related to a probability.

Parameters

prob (float) – A probability \(\alpha\) between 0 and 1.

Returns

The quantile of the different variables.

Return type

dict[str, numpy.ndarray]

compute_quartile(order)

Compute the n-th quartile \(q[X; n]\).

Parameters

order (int) – The order \(n\) of the quartile. Either 1, 2 or 3.

Returns

The quartile of the different variables.

Return type

dict[str, numpy.ndarray]
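
Percentiles and quartiles can be read as quantiles at particular probabilities: the n-th percentile corresponds to the probability n/100 and the n-th quartile to the probability n/4. The sketch below shows this relation with a normal law from the Python standard library; the three helpers are hypothetical and whether GEMSEO implements the methods exactly this way is an assumption. Note that `NormalDist.inv_cdf` only accepts probabilities strictly between 0 and 1, so the orders 0 and 100 are left out here.

```python
from statistics import NormalDist

# Stand-in for the distribution selected for one variable (assumed values).
fitted = NormalDist(mu=0.5, sigma=2.0)


def quantile(distribution, prob):
    """Return the quantile associated with a probability in ]0, 1[."""
    return distribution.inv_cdf(prob)


def percentile(distribution, order):
    """Return the percentile of a given order (1, 2, ... or 99)."""
    return quantile(distribution, order / 100)


def quartile(distribution, order):
    """Return the quartile of a given order (1, 2 or 3)."""
    return quantile(distribution, 0.25 * order)


median = quartile(fitted, 2)
print(median, percentile(fitted, 25), quartile(fitted, 1))
```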

compute_range()

Compute the range \(R[X]\).

Returns

The range of the different variables.

Return type

dict[str, numpy.ndarray]

compute_standard_deviation()

Compute the standard deviation \(\mathbb{S}[X]\).

Returns

The standard deviation of the different variables.

Return type

dict[str, numpy.ndarray]

compute_tolerance_interval(coverage, confidence=0.95, side=ToleranceIntervalSide.BOTH)

Compute a tolerance interval \(\text{TI}[X]\).

The coverage level is the minimum percentage of the population belonging to the TI. The tolerance interval is computed with a confidence level and can be lower-sided, upper-sided or two-sided.

Parameters
  • coverage (float) – The minimum percentage of the population belonging to the TI.

  • confidence (float) –

    A level of confidence in [0,1].

    By default it is set to 0.95.

  • side (gemseo.uncertainty.statistics.tolerance_interval.distribution.ToleranceIntervalSide) –

    The type of the tolerance interval characterized by its sides of interest, either a lower-sided tolerance interval \([a, +\infty[\), an upper-sided tolerance interval \(]-\infty, b]\), or a two-sided tolerance interval \([c, d]\).

    By default it is set to BOTH.

Returns

The tolerance limits of the different variables.

Return type

dict[str, tuple[numpy.ndarray, numpy.ndarray]]

compute_variance()

Compute the variance \(\mathbb{V}[X]\).

Returns

The variance of the different variables.

Return type

dict[str, numpy.ndarray]

compute_variation_coefficient()

Compute the coefficient of variation \(CoV[X]\).

This is the standard deviation normalized by the expectation: \(CoV[X]=\mathbb{S}[X]/\mathbb{E}[X]\).

Returns

The coefficient of variation of the different variables.

Return type

dict[str, numpy.ndarray]
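
As a small numerical illustration of this ratio, assume that the selected distribution is normal with mean 0.5 and standard deviation 2 (illustrative values). The helper `variation_coefficient` is hypothetical, and rejecting a zero expectation is a choice of this sketch rather than a documented behavior.

```python
from statistics import NormalDist

# Stand-in for the distribution selected for one variable (assumed values).
fitted = NormalDist(mu=0.5, sigma=2.0)


def variation_coefficient(distribution):
    """Return the standard deviation divided by the expectation."""
    if distribution.mean == 0:
        # The ratio is undefined for a zero expectation.
        raise ZeroDivisionError("The expectation is zero.")
    return distribution.stdev / distribution.mean


cov = variation_coefficient(fitted)  # 2 / 0.5
print(cov)
```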

get_criteria(variable)

Get criteria for a given variable name and the different distributions.

Parameters

variable (str) – The name of the variable.

Returns

The criterion values for the different distributions, and an indicator equal to True if the criterion is a p-value.

Return type

tuple[dict[str, float], bool]

get_fitting_matrix()

Get the fitting matrix.

This matrix contains goodness-of-fit measures for each pair < variable, distribution >.

Returns

The printable fitting matrix.

Return type

str

plot_criteria(variable, title=None, save=False, show=True, n_legend_cols=4, directory='.')

Plot criteria for a given variable name.

Parameters
  • variable (str) – The name of the variable.

  • title (str | None) –

    A plot title.

    By default it is set to None.

  • save (bool) –

    If True, save the plot on the disk.

    By default it is set to False.

  • show (bool) –

    If True, show the plot.

    By default it is set to True.

  • n_legend_cols (int) –

    The number of text columns in the upper legend.

    By default it is set to 4.

  • directory (str) –

    The directory path, either absolute or relative.

    By default it is set to '.'.

Raises

ValueError – If the variable is missing from the dataset.

Return type

None

AVAILABLE_CRITERIA = ['BIC', 'ChiSquared', 'Kolmogorov']
AVAILABLE_DISTRIBUTIONS = ['Arcsine', 'Beta', 'Burr', 'Chi', 'ChiSquare', 'Dirichlet', 'Exponential', 'FisherSnedecor', 'Frechet', 'Gamma', 'GeneralizedPareto', 'Gumbel', 'Histogram', 'InverseNormal', 'Laplace', 'LogNormal', 'LogUniform', 'Logistic', 'MeixnerDistribution', 'Normal', 'Pareto', 'Rayleigh', 'Rice', 'Student', 'Trapezoidal', 'Triangular', 'TruncatedNormal', 'Uniform', 'VonMises', 'WeibullMax', 'WeibullMin']
AVAILABLE_SIGNIFICANCE_TESTS = ['ChiSquared', 'Kolmogorov']
SYMBOLS = {'a_value': 'Aval', 'b_value': 'Bval', 'margin': 'Margin', 'maximum': 'Max', 'mean': 'E', 'mean_std': 'E_StD', 'median': 'Med', 'minimum': 'Min', 'moment': 'M', 'percentile': 'p', 'probability': 'P', 'quantile': 'Q', 'quartile': 'q', 'range': 'R', 'standard_deviation': 'StD', 'tolerance_interval': 'TI', 'variance': 'V', 'variation_coefficient': 'CoV'}
dataset: gemseo.core.dataset.Dataset

The dataset.

distributions: dict[str, dict[str, gemseo.uncertainty.distributions.openturns.distribution.OTDistribution]]

The probability distributions of the random variables.

fitting_criterion: str

The name of the goodness-of-fit criterion, measuring how the distribution fits the data.

level: float

The test level, i.e. the risk of committing a Type 1 error, that is an incorrect rejection of a true null hypothesis, for criteria based on hypothesis tests.

n_samples: int

The number of samples.

n_variables: int

The number of variables.

name: str

The name of the object.

selection_criterion: str

The name of the selection criterion to select a distribution from a list of candidates.