parametric module

Class for the parametric estimation of statistics from a dataset.

Overview

The ParametricStatistics class inherits from the abstract Statistics class and aims to estimate statistics from a Dataset, based on candidate parametric distributions calibrated from this Dataset.

For each variable,

the parameters of these distributions are calibrated from the Dataset,
the fitted parametric Distribution that is optimal with respect to the goodness-of-fit criterion and the selection criterion is then used to estimate the statistics of this variable.

The ParametricStatistics relies on the OpenTURNS library through the OTDistribution and OTDistributionFitter classes.
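As an illustration of the calibrate-then-compare idea described above, here is a minimal sketch that fits two candidate distributions by maximum likelihood and ranks them with the textbook BIC, k ln(n) - 2 ln(L). It uses only the Python standard library and does not call GEMSEO or OpenTURNS; the helper names (`fit_normal`, `fit_uniform`, `bic`) are hypothetical, and the criterion values reported by the library may be scaled differently.

```python
import math
import random


def fit_normal(sample):
    """Fit a normal distribution by maximum likelihood.

    Returns the maximized log-likelihood and the number of parameters.
    """
    n = len(sample)
    mu = sum(sample) / n
    var = sum((x - mu) ** 2 for x in sample) / n
    log_lik = -0.5 * n * (math.log(2 * math.pi * var) + 1)
    return log_lik, 2


def fit_uniform(sample):
    """Fit a uniform distribution on [min(sample), max(sample)]."""
    n = len(sample)
    width = max(sample) - min(sample)
    log_lik = -n * math.log(width)
    return log_lik, 2


def bic(log_lik, n_params, n_samples):
    """Textbook Bayesian information criterion (lower is better)."""
    return n_params * math.log(n_samples) - 2 * log_lik


random.seed(0)
sample = [random.gauss(0.5, 2.0) for _ in range(200)]

# Hypothetical candidate list, keyed by distribution name.
fitters = {"Normal": fit_normal, "Uniform": fit_uniform}
criteria = {
    name: bic(*fit(sample), len(sample)) for name, fit in fitters.items()
}
best = min(criteria, key=criteria.get)
```

For this Gaussian sample, the normal candidate obtains the lower BIC and would therefore be retained.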

Construction

The ParametricStatistics is built from two mandatory arguments:

a dataset,
a list of distribution names,

and accepts optional arguments:

a subset of variable names (by default, statistics are computed for all variables),
a fitting criterion name (by default, BIC is used; see AVAILABLE_CRITERIA and AVAILABLE_SIGNIFICANCE_TESTS for more information),
a level associated with the fitting criterion,
a selection criterion:
- ‘best’: select the distribution minimizing (or maximizing, depending on the criterion) the criterion,
- ‘first’: select the first distribution for which the criterion is greater (or lower, depending on the criterion) than the level,
a name for the ParametricStatistics object (by default, the name is the concatenation of ‘ParametricStatistics’ and the name of the Dataset).
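The two selection criteria can be sketched as follows. This is not GEMSEO's implementation: `select_distribution` is a hypothetical helper, the criterion values are made up for illustration, and the fallback to 'best' when no candidate passes the level is an assumption of this sketch.

```python
def select_distribution(
    criteria, is_p_value, selection_criterion="best", level=0.05
):
    """Select a distribution name from criterion values.

    criteria maps distribution names to criterion values, in candidate order.
    is_p_value tells whether higher values are better (significance tests)
    or lower values are better (information criteria such as BIC).
    """
    pick_best = max if is_p_value else min
    if selection_criterion == "first":
        for name, value in criteria.items():
            passes = value >= level if is_p_value else value <= level
            if passes:
                return name
        # Assumption of this sketch: fall back to "best" if nothing passes.
    elif selection_criterion != "best":
        raise ValueError(f"Unknown selection criterion: {selection_criterion}")
    return pick_best(criteria, key=criteria.get)


# Made-up p-values from a significance test (higher is better).
p_values = {"Normal": 0.02, "Uniform": 0.30, "Triangular": 0.45}
best_by_test = select_distribution(p_values, is_p_value=True)
first_by_test = select_distribution(
    p_values, is_p_value=True, selection_criterion="first", level=0.05
)

# Made-up BIC values (lower is better).
bic_values = {"Normal": 410.2, "Uniform": 455.9, "Triangular": 431.7}
best_by_bic = select_distribution(bic_values, is_p_value=False)
```

With 'first', the order of the candidate list matters: here 'Uniform' is returned because it is the first candidate whose p-value reaches the level, even though 'Triangular' has a higher p-value.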

Capabilities

By inheritance, a ParametricStatistics object has the same capabilities as Statistics. Additional ones are:

get_fitting_matrix(): this method returns a printable matrix of the values of the fitting criterion for the different variables and candidate probability distributions, as well as the selected probability distribution,
plot_criteria(): this method plots the criterion values for a given variable.

Classes:

ParametricStatistics(dataset, distributions)

Parametric estimation of statistics.

class gemseo.uncertainty.statistics.parametric.ParametricStatistics(dataset, distributions, variables_names=None, fitting_criterion='BIC', level=0.05, selection_criterion='best', name=None)

Bases: gemseo.uncertainty.statistics.statistics.Statistics

Parametric estimation of statistics.

Attributes

dataset (Dataset) – The dataset.
n_samples (int) – The number of samples.
n_variables (int) – The number of variables.
name (str) – The name of the object.
fitting_criterion (str) – The name of the goodness-of-fit criterion, measuring how the distribution fits the data.
level (float) – The test level, i.e. the risk of committing a Type 1 error, that is an incorrect rejection of a true null hypothesis, for criteria based on hypothesis tests.
selection_criterion (str) – The name of the selection criterion to select a distribution from a list of candidates.
distributions (dict(str, dict(str, OTDistribution))) – The probability distributions of the random variables.

Examples

>>> from gemseo.api import (
...     create_discipline,
...     create_parameter_space,
...     create_scenario
... )
>>> from gemseo.uncertainty.statistics.parametric import ParametricStatistics
>>>
>>> expressions = {"y1": "x1+2*x2", "y2": "x1-3*x2"}
>>> discipline = create_discipline(
...     "AnalyticDiscipline", expressions_dict=expressions
... )
>>> discipline.set_cache_policy(discipline.MEMORY_FULL_CACHE)
>>>
>>> parameter_space = create_parameter_space()
>>> parameter_space.add_random_variable(
...     "x1", "OTUniformDistribution", minimum=-1, maximum=1
... )
>>> parameter_space.add_random_variable(
...     "x2", "OTNormalDistribution", mu=0.5, sigma=2
... )
>>>
>>> scenario = create_scenario(
...     [discipline],
...     "DisciplinaryOpt",
...     "y1", parameter_space, scenario_type="DOE"
... )
>>> scenario.execute({'algo': 'OT_MONTE_CARLO', 'n_samples': 100})
>>>
>>> dataset = discipline.cache.export_to_dataset()
>>>
>>> statistics = ParametricStatistics(
...     dataset, ['Normal', 'Uniform', 'Triangular']
... )
>>> fitting_matrix = statistics.get_fitting_matrix()
>>> mean = statistics.compute_mean()

Parameters

dataset (Dataset) – A dataset.
variables_names (Optional[Iterable[str]]) – The variables of interest. Default: consider all the variables available in the dataset.
name (Optional[str]) – A name for the object. Default: use the concatenation of the class and dataset names.
distributions (Sequence[str]) – The names of the distributions.
fitting_criterion (str) – The name of the goodness-of-fit criterion, measuring how the distribution fits the data. Use ParametricStatistics.get_criteria() to get the available criteria.
level (float) – A test level, i.e. the risk of committing a Type 1 error, that is an incorrect rejection of a true null hypothesis, for criteria based on hypothesis tests.
selection_criterion (str) – The name of the selection criterion to select a distribution from a list of candidates. Either ‘first’ or ‘best’.

Return type

None

Attributes:

`AVAILABLE_CRITERIA`
`AVAILABLE_DISTRIBUTIONS`
`AVAILABLE_SIGNIFICANCE_TESTS`
`SYMBOLS`

Methods:

`compute_a_value`()	Compute the A-value.
`compute_b_value`()	Compute the B-value.
`compute_expression`(variable, function[, …])	Return the expression of a statistical function applied to a variable.
`compute_maximum`()	Compute the maximum.
`compute_mean`()	Compute the mean.
`compute_mean_std`(std_factor)	Compute mean + std_factor * std.
`compute_median`()	Compute the median.
`compute_minimum`()	Compute the minimum.
`compute_moment`(order)	Compute the n-th moment.
`compute_percentile`(order)	Compute the n-th percentile.
`compute_probability`(thresh[, greater])	Compute the probability related to a threshold.
`compute_quantile`(prob)	Compute the quantile related to a probability.
`compute_quartile`(order)	Compute the n-th quartile.
`compute_range`()	Compute the range.
`compute_standard_deviation`()	Compute the standard deviation.
`compute_tolerance_interval`(coverage[, …])	Compute a tolerance interval (TI) for a given coverage level.
`compute_variance`()	Compute the variance.
`get_criteria`(variable)	Get criteria for a given variable name and the different distributions.
`get_fitting_matrix`()	Get the fitting matrix.
`plot_criteria`(variable[, title, save, show, …])	Plot criteria for a given variable name.

AVAILABLE_CRITERIA = ['BIC', 'ChiSquared', 'Kolmogorov']

AVAILABLE_DISTRIBUTIONS = ['Arcsine', 'Beta', 'Burr', 'Chi', 'ChiSquare', 'Dirichlet', 'Exponential', 'FisherSnedecor', 'Frechet', 'Gamma', 'GeneralizedPareto', 'Gumbel', 'Histogram', 'InverseNormal', 'Laplace', 'LogNormal', 'LogUniform', 'Logistic', 'MeixnerDistribution', 'Normal', 'Pareto', 'Rayleigh', 'Rice', 'Student', 'Trapezoidal', 'Triangular', 'TruncatedNormal', 'Uniform', 'WeibullMax', 'WeibullMin']

AVAILABLE_SIGNIFICANCE_TESTS = ['ChiSquared', 'Kolmogorov']

SYMBOLS = {'a_value': 'Aval', 'b_value': 'Bval', 'maximum': 'Max', 'mean': 'E', 'mean_std': 'E_StD', 'median': 'Med', 'minimum': 'Min', 'moment': 'M', 'percentile': 'p', 'probability': 'P', 'quantile': 'Q', 'quartile': 'q', 'range': 'R', 'standard_deviation': 'StD', 'tolerance_interval': 'TI', 'variance': 'V'}

compute_a_value()

Compute the A-value.

Returns: The A-value of the different variables.
Return type: Dict[str, numpy.ndarray]

compute_b_value()

Compute the B-value.

Returns: The B-value of the different variables.
Return type: Dict[str, numpy.ndarray]

classmethod compute_expression(variable, function, show_name=False, **options)

Return the expression of a statistical function applied to a variable.

Parameters

variable (str) – The name of the variable.
function (str) – The name of the function.
show_name (bool) – If True, show name. Otherwise, only show value.
**options – The options passed to the statistical function.

Returns

The expression of the statistical function applied to the variable.

Return type

str

compute_maximum()

Compute the maximum.

Returns: The maximum of the different variables.
Return type: Dict[str, numpy.ndarray]

compute_mean()

Compute the mean.

Returns: The mean of the different variables.
Return type: Dict[str, numpy.ndarray]

compute_mean_std(std_factor)

Compute mean + std_factor * std.

Parameters: std_factor (float) – The factor multiplying the standard deviation.
Returns: mean + std_factor * std for the different variables.
Return type: Dict[str, numpy.ndarray]
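The quantity mean + std_factor * std can be illustrated on a single fitted distribution. The sketch below uses the standard library's `NormalDist` as a stand-in for a fitted distribution; the `mean_std` helper and the parameter values are hypothetical and are not part of GEMSEO.

```python
from statistics import NormalDist

# Stand-in for a distribution fitted to one variable (illustrative values).
fitted = NormalDist(mu=0.5, sigma=2.0)


def mean_std(dist, std_factor):
    """Return mean + std_factor * standard deviation of a distribution."""
    return dist.mean + std_factor * dist.stdev


upper_margin = mean_std(fitted, 2.0)
lower_margin = mean_std(fitted, -2.0)
```

A negative std_factor gives a margin below the mean, a positive one a margin above it.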

compute_median()

Compute the median.

Returns: The median of the different variables.
Return type: Dict[str, numpy.ndarray]

compute_minimum()

Compute the minimum.

Returns: The minimum of the different variables.
Return type: Dict[str, numpy.ndarray]

compute_moment(order)

Compute the n-th moment.

Parameters: order (int) – The order of a moment.
Returns: The moment of the different variables.
Return type: Dict[str, numpy.ndarray]

compute_percentile(order)

Compute the n-th percentile.

Parameters: order (int) – The order of the percentile. Either 0, 1, 2, … or 100.
Returns: The percentile of the different variables.
Return type: Dict[str, numpy.ndarray]

compute_probability(thresh, greater=True)

Compute the probability related to a threshold.

Parameters

thresh (float) – A threshold.
greater (bool) – The type of probability. If True, compute the probability of exceeding the threshold. Otherwise, compute the probability of not exceeding it.

Returns

The probability of the different variables.

Return type

Dict[str, numpy.ndarray]
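For a fitted parametric distribution, the two types of probability follow from the cumulative distribution function. The sketch below shows this relationship with the standard library's `NormalDist` as a stand-in for a fitted distribution; the `probability` helper and the parameter values are hypothetical and are not GEMSEO code.

```python
from statistics import NormalDist

# Stand-in for a distribution fitted to one variable (illustrative values).
fitted = NormalDist(mu=0.5, sigma=2.0)


def probability(dist, thresh, greater=True):
    """Return P[X > thresh] if greater is True, else P[X <= thresh]."""
    below = dist.cdf(thresh)
    return 1.0 - below if greater else below


p_exceed = probability(fitted, 0.5)
p_not_exceed = probability(fitted, 0.5, greater=False)
```

The two values always sum to one; at the mean of a symmetric distribution, both are equal to 0.5.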

compute_quantile(prob)

Compute the quantile related to a probability.

Parameters: prob (float) – A probability between 0 and 1.
Returns: The quantile of the different variables.
Return type: Dict[str, numpy.ndarray]

compute_quartile(order)

Compute the n-th quartile.

Parameters: order (int) – The order of the quartile. Either 1, 2 or 3.
Returns: The quartile of the different variables.
Return type: Dict[str, numpy.ndarray]
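Percentiles and quartiles are quantiles at particular probabilities: the percentile of order n is the quantile at n/100 and the quartile of order n is the quantile at n/4. The sketch below shows this relationship on a single distribution, using the standard library's `NormalDist` as a stand-in for a fitted distribution; the three helpers are hypothetical, and whether GEMSEO implements the methods this way is an assumption.

```python
from statistics import NormalDist

# Stand-in for a distribution fitted to one variable (illustrative values).
fitted = NormalDist(mu=0.5, sigma=2.0)


def quantile(prob):
    """Return the value x such that P[X <= x] = prob, with 0 < prob < 1."""
    return fitted.inv_cdf(prob)


def percentile(order):
    """Return the percentile of the given order, as a quantile at order/100."""
    return quantile(order / 100)


def quartile(order):
    """Return the quartile of order 1, 2 or 3, as a quantile at order/4."""
    return quantile(0.25 * order)


median = quartile(2)
first_quartile = quartile(1)
twenty_fifth_percentile = percentile(25)
```

Note that `NormalDist.inv_cdf` only accepts probabilities strictly between 0 and 1, so this stand-in cannot evaluate the percentiles of order 0 and 100.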

compute_range()

Compute the range.

Returns: The range of the different variables.
Return type: Dict[str, numpy.ndarray]

compute_standard_deviation()

Compute the standard deviation.

Returns: The standard deviation of the different variables.
Return type: Dict[str, numpy.ndarray]

compute_tolerance_interval(coverage, confidence=0.95, side=<ToleranceIntervalSide.BOTH: 3>)

Compute a tolerance interval (TI) for a given coverage level.

The coverage level is the minimum percentage of the population that belongs to the TI. The tolerance interval is computed with a given confidence level and can be lower-sided, upper-sided or two-sided.

Parameters

coverage (float) – The minimum percentage of the population belonging to the TI.
confidence (float) – A level of confidence in [0,1].
side (gemseo.uncertainty.statistics.tolerance_interval.distribution.ToleranceIntervalSide) – The type of the tolerance interval characterized by its sides of interest, either a lower-sided tolerance interval \([a, +\infty[\), an upper-sided tolerance interval \(]-\infty, b]\), or a two-sided tolerance interval \([c, d]\).

Returns

The tolerance limits of the different variables.

Return type

Dict[str, Tuple[numpy.ndarray, numpy.ndarray]]

compute_variance()

Compute the variance.

Returns: The variance of the different variables.
Return type: Dict[str, numpy.ndarray]

get_criteria(variable)

Get criteria for a given variable name and the different distributions.

Parameters: variable (str) – The name of the variable.
Returns: The criterion values for the different distributions, and an indicator equal to True if the criterion is a p-value.
Return type: Tuple[Dict[str, float], bool]

get_fitting_matrix()

Get the fitting matrix.

This matrix contains goodness-of-fit measures for each pair (variable, distribution).

Returns: The printable fitting matrix.
Return type: str

plot_criteria(variable, title=None, save=False, show=True, n_legend_cols=4, directory='.')

Plot criteria for a given variable name.

Parameters

variable (str) – The name of the variable.
title (Optional[str]) – A plot title.
save (bool) – If True, save the plot on the disk.
show (bool) – If True, show the plot.
n_legend_cols (int) – The number of text columns in the upper legend.
directory (str) – The directory path, either absolute or relative.

Raises

ValueError – If the variable is missing from the dataset.

Return type

None