statistics module¶

Abstract class for the estimation of statistics from a dataset.

Overview¶

The abstract Statistics class implements the concept of a statistics library. It is extended by the EmpiricalStatistics and ParametricStatistics classes.

Construction¶

A Statistics object is built from a Dataset and, optionally, variable names. When variable names are given, statistics are computed only for these variables. Otherwise, statistics are computed for all the variables available in the dataset. Lastly, the user can give a name to the Statistics object. By default, this name is the concatenation of the name of the class overloading Statistics and the name of the Dataset.

Capabilities¶

A Statistics object returns standard descriptive and statistical measures for the different variables:

Statistics.compute_minimum(): the minimum value,
Statistics.compute_maximum(): the maximum value,
Statistics.compute_range(): the difference between the maximum and minimum values,
Statistics.compute_mean(): the expectation (a.k.a. mean value),
Statistics.compute_moment(): a central moment, which is the expected value of a specified integer power of the deviation from the mean,
Statistics.compute_variance(): the variance, which is the mean squared variation around the mean value,
Statistics.compute_standard_deviation(): the standard deviation, which is the square root of the variance,
Statistics.compute_variation_coefficient(): the coefficient of variation, which is the standard deviation normalized by the mean,
Statistics.compute_quantile(): the quantile associated with a probability, which is the cut point dividing the range into a first continuous interval with this given probability and a second continuous interval with the complementary probability; common q-quantiles dividing the range into q continuous intervals with equal probabilities are also implemented:
- Statistics.compute_median() which implements the 2-quantile (50%),
- Statistics.compute_quartile() whose order (1, 2 or 3) implements the 4-quantiles (25%, 50% and 75%),
- Statistics.compute_percentile() whose order (1, 2, …, 99) implements the 100-quantiles (1%, 2%, …, 99%),
Statistics.compute_probability(): the probability that the random variable is larger or smaller than a certain threshold,
Statistics.compute_tolerance_interval(): the left-sided, right-sided or both-sided tolerance interval associated with a given coverage level and a given confidence level, which is a statistical interval within which, with some confidence level, a specified proportion of the random variable realizations falls (this proportion is the coverage level); two common lower bounds are also implemented:
- Statistics.compute_a_value(): the A-value, which is the lower bound of the left-sided tolerance interval associated with a coverage level equal to 99% and a confidence level equal to 95%,
- Statistics.compute_b_value(): the B-value, which is the lower bound of the left-sided tolerance interval associated with a coverage level equal to 90% and a confidence level equal to 95%.
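
To make the definitions above concrete, here is a minimal sketch in plain Python that does not use the Statistics class. The describe function and the sample data are hypothetical, and the variance uses the population convention (division by the number of samples), which may differ from the estimator used by a concrete subclass. The result mimics the documented return layout, with one entry per variable name.

```python
import math


def describe(samples):
    """Return basic descriptive statistics of a list of floats."""
    n = len(samples)
    minimum = min(samples)
    maximum = max(samples)
    mean = sum(samples) / n
    # Population convention (division by n); an estimator may divide by n - 1.
    variance = sum((x - mean) ** 2 for x in samples) / n
    std = math.sqrt(variance)
    return {
        "minimum": minimum,
        "maximum": maximum,
        "range": maximum - minimum,
        "mean": mean,
        "variance": variance,
        "standard_deviation": std,
        # Standard deviation normalized by the mean (undefined if mean == 0).
        "variation_coefficient": std / mean,
    }


# Hypothetical data: one list of samples per variable name.
data = {"x": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]}
stats = {name: describe(values) for name, values in data.items()}
```

For this data, the mean is 5, the variance is 4 and the standard deviation is 2, so the coefficient of variation is 0.4.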

class gemseo.uncertainty.statistics.statistics.Statistics(dataset, variables_names=None, name=None)[source]¶

Bases: object

Abstract class to interface a statistics library.

Parameters

dataset (Dataset) – A dataset.
variables_names (Iterable[str] | None) –
The variables of interest. Default: consider all the variables available in the dataset.

By default it is set to None.
name (str | None) –
A name for the object. Default: use the concatenation of the class and dataset names.

By default it is set to None.

Return type

None

compute_a_value()[source]¶

Compute the A-value \(\text{Aval}[X]\).

Returns: The A-value of the different variables.
Return type: dict[str, numpy.ndarray]

compute_b_value()[source]¶

Compute the B-value \(\text{Bval}[X]\).

Returns: The B-value of the different variables.
Return type: dict[str, numpy.ndarray]

classmethod compute_expression(variable_name, statistic_name, show_name=False, **options)[source]¶

Return the expression of a statistical function applied to a variable.

E.g. “P[X >= 1.0]” for the probability that X exceeds 1.0.

Parameters

variable_name (str) – The name of the variable, e.g. "X".
statistic_name (str) – The name of the statistic, e.g. "probability".
show_name (bool) –
If True, show option names. Otherwise, only show option values.

By default it is set to False.
**options (bool | float | int) – The options passed to the statistical function, e.g. {"greater": True, "thresh": 1.0}.

Returns

The expression of the statistical function applied to the variable.

Return type

str
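
The exact formatting rules are not spelled out here, so the following is only a sketch of how such an expression could be assembled from the SYMBOLS mapping. The build_expression helper is hypothetical. The probability branch reproduces the documented example "P[X >= 1.0]", while the "SYMBOL[variable; options]" layout of the other branch is an assumption modeled on notations such as M[X; n] used on this page.

```python
# Subset of the SYMBOLS mapping listed on this page.
SYMBOLS = {"mean": "E", "moment": "M", "probability": "P", "quantile": "Q"}


def build_expression(variable_name, statistic_name, show_name=False, **options):
    """Hypothetical helper returning a string such as 'Q[X; 0.1]'."""
    symbol = SYMBOLS[statistic_name]
    if statistic_name == "probability":
        # Matches the documented example 'P[X >= 1.0]'.
        sign = ">=" if options["greater"] else "<="
        return f"{symbol}[{variable_name} {sign} {options['thresh']}]"
    if show_name:
        # Show the option names together with their values.
        args = [f"{key}={value}" for key, value in options.items()]
    else:
        # Show the option values only.
        args = [str(value) for value in options.values()]
    # Assumed layout: the variable first, then the options, separated by '; '.
    return f"{symbol}[{'; '.join([variable_name] + args)}]"


expression = build_expression("X", "probability", greater=True, thresh=1.0)
```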

compute_margin(std_factor)¶

Compute a margin \(\text{Margin}[X]=\mathbb{E}[X]+\kappa\mathbb{S}[X]\).

Parameters: std_factor (float) – The weight \(\kappa\) of the standard deviation.
Returns: The margin for the different variables.
Return type: dict[str, numpy.ndarray]

compute_maximum()[source]¶

Compute the maximum \(\text{Max}[X]\).

Returns: The maximum of the different variables.
Return type: dict[str, numpy.ndarray]

compute_mean()[source]¶

Compute the mean \(\mathbb{E}[X]\).

Returns: The mean of the different variables.
Return type: dict[str, numpy.ndarray]

compute_mean_std(std_factor)[source]¶

Compute a margin \(\text{Margin}[X]=\mathbb{E}[X]+\kappa\mathbb{S}[X]\).

Parameters: std_factor (float) – The weight \(\kappa\) of the standard deviation.
Returns: The margin for the different variables.
Return type: dict[str, numpy.ndarray]
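
As an illustration of the margin formula, the sketch below evaluates \(\mathbb{E}[X]+\kappa\mathbb{S}[X]\) with the Python standard library. The mean_std function and the data are hypothetical, and the population standard deviation is an assumption; a concrete Statistics subclass may use another estimator.

```python
import statistics


def mean_std(samples, std_factor):
    """Return the mean plus std_factor times the population standard deviation."""
    return statistics.fmean(samples) + std_factor * statistics.pstdev(samples)


samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, standard deviation 2
upper = mean_std(samples, 3.0)   # 5 + 3 * 2
lower = mean_std(samples, -3.0)  # 5 - 3 * 2
```

A positive factor gives an upper margin and a negative factor gives a lower one.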

compute_median()[source]¶

Compute the median \(\text{Med}[X]\).

Returns: The median of the different variables.
Return type: dict[str, numpy.ndarray]

compute_minimum()[source]¶

Compute the minimum \(\text{Min}[X]\).

Returns: The minimum of the different variables.
Return type: dict[str, numpy.ndarray]

compute_moment(order)[source]¶

Compute the n-th central moment \(M[X; n]\).

Parameters: order (int) – The order \(n\) of the moment.
Returns: The moment of the different variables.
Return type: dict[str, numpy.ndarray]
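
The overview defines this statistic as a central moment, that is, the expected value of an integer power of the deviation from the mean. A minimal sample-based sketch follows. The central_moment function is hypothetical and uses the population convention (division by the number of samples).

```python
def central_moment(samples, order):
    """Return the sample central moment of the given integer order."""
    n = len(samples)
    mean = sum(samples) / n
    # Average of the deviations from the mean raised to the power `order`.
    return sum((x - mean) ** order for x in samples) / n


samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
first = central_moment(samples, 1)   # always zero, up to rounding
second = central_moment(samples, 2)  # the population variance
third = central_moment(samples, 3)   # positive for a right-skewed sample
```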

compute_percentile(order)[source]¶

Compute the n-th percentile \(\text{p}[X; n]\).

Parameters: order (int) – The order \(n\) of the percentile. Either 0, 1, 2, … or 100.
Returns: The percentile of the different variables.
Return type: dict[str, numpy.ndarray]

compute_probability(thresh, greater=True)[source]¶

Compute the probability related to a threshold.

Either \(\mathbb{P}[X \geq x]\) or \(\mathbb{P}[X \leq x]\).

Parameters

thresh (float) – A threshold \(x\).
greater (bool) –
The type of probability. If True, compute the probability \(\mathbb{P}[X \geq x]\) of exceeding the threshold. Otherwise, compute the probability \(\mathbb{P}[X \leq x]\) of not exceeding it.

By default it is set to True.

Returns

The probability of the different variables.

Return type

dict[str, numpy.ndarray]
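
For an empirical estimate, both probabilities reduce to counting samples. The sketch below is a hypothetical stand-alone function, not the library implementation. It only mirrors the thresh and greater parameters documented above.

```python
def empirical_probability(samples, thresh, greater=True):
    """Estimate P[X >= thresh] or P[X <= thresh] as a fraction of the samples."""
    if greater:
        count = sum(1 for x in samples if x >= thresh)
    else:
        count = sum(1 for x in samples if x <= thresh)
    return count / len(samples)


samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
p_greater = empirical_probability(samples, 5.0)               # 4 of 8 samples
p_lower = empirical_probability(samples, 5.0, greater=False)  # 6 of 8 samples
```

Because both inequalities are non-strict, samples equal to the threshold are counted in both cases, so the two estimates need not sum to one.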

compute_quantile(prob)[source]¶

Compute the quantile \(\mathbb{Q}[X; \alpha]\) related to a probability.

Parameters: prob (float) – A probability \(\alpha\) between 0 and 1.
Returns: The quantile of the different variables.
Return type: dict[str, numpy.ndarray]

compute_quartile(order)[source]¶

Compute the n-th quartile \(q[X; n]\).

Parameters: order (int) – The order \(n\) of the quartile. Either 1, 2 or 3.
Returns: The quartile of the different variables.
Return type: dict[str, numpy.ndarray]
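
The median, quartiles and percentiles are all quantiles at particular probabilities. The sketch below shows this relationship with a hypothetical empirical quantile based on linear interpolation between order statistics. This interpolation rule is one of several common conventions and is an assumption, not necessarily the one used by a concrete Statistics subclass.

```python
def quantile(samples, prob):
    """Return the empirical quantile for a probability in [0, 1]."""
    ordered = sorted(samples)
    # Fractional position of the quantile among the order statistics.
    position = prob * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def median(samples):
    return quantile(samples, 0.5)


def quartile(samples, order):
    # order is 1, 2 or 3.
    return quantile(samples, order / 4)


def percentile(samples, order):
    return quantile(samples, order / 100)


samples = [5.0, 1.0, 4.0, 2.0, 3.0]
second_quartile = quartile(samples, 2)
fiftieth_percentile = percentile(samples, 50)
```

The second quartile and the 50th percentile both coincide with the median.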

compute_range()[source]¶

Compute the range \(R[X]\).

Returns: The range of the different variables.
Return type: dict[str, numpy.ndarray]

compute_standard_deviation()[source]¶

Compute the standard deviation \(\mathbb{S}[X]\).

Returns: The standard deviation of the different variables.
Return type: dict[str, numpy.ndarray]

compute_tolerance_interval(coverage, confidence=0.95, side=ToleranceIntervalSide.BOTH)[source]¶

Compute a tolerance interval \(\text{TI}[X]\).

The coverage level is the minimum proportion of the realizations of the random variable that belong to the TI. The tolerance interval is computed with a confidence level and can be either lower-sided, upper-sided or both-sided.

Parameters

coverage (float) – The minimum proportion of realizations belonging to the TI.
confidence (float) –
A level of confidence in [0,1].

By default it is set to 0.95.
side (gemseo.uncertainty.statistics.tolerance_interval.distribution.ToleranceIntervalSide) –
The type of the tolerance interval characterized by its sides of interest, either a lower-sided tolerance interval \([a, +\infty[\), an upper-sided tolerance interval \(]-\infty, b]\), or a two-sided tolerance interval \([c, d]\).

By default it is set to BOTH.

Returns

The tolerance limits of the different variables.

Return type

dict[str, tuple[numpy.ndarray, numpy.ndarray]]
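
To illustrate how coverage and confidence interact, the sketch below uses a classical distribution-free argument: with n independent samples, the sample minimum is a lower tolerance bound with confidence 1 - coverage**n. This is only one possible approach, shown under that assumption. The functions are hypothetical and say nothing about how a concrete Statistics subclass computes its tolerance limits.

```python
import math


def min_sample_size(coverage, confidence=0.95):
    """Smallest n for which the sample minimum is a lower tolerance bound."""
    # With n independent samples, the probability that less than `coverage`
    # of the population lies above the sample minimum is coverage ** n.
    return math.ceil(math.log(1.0 - confidence) / math.log(coverage))


def lower_tolerance_bound(samples, coverage, confidence=0.95):
    """Return the sample minimum if it qualifies as a lower tolerance bound."""
    if len(samples) < min_sample_size(coverage, confidence):
        raise ValueError("Not enough samples for this coverage and confidence.")
    return min(samples)


n_b = min_sample_size(0.90)  # coverage of the B-value, confidence 0.95
n_a = min_sample_size(0.99)  # coverage of the A-value, confidence 0.95
bound = lower_tolerance_bound([float(i) for i in range(1, 30)], 0.90)
```

Under this rule, a bound with the coverage and confidence of the B-value needs at least 29 samples, and one matching the A-value needs at least 299.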

compute_variance()[source]¶

Compute the variance \(\mathbb{V}[X]\).

Returns: The variance of the different variables.
Return type: dict[str, numpy.ndarray]

compute_variation_coefficient()[source]¶

Compute the coefficient of variation \(CoV[X]\).

This is the standard deviation normalized by the expectation: \(CoV[X]=\mathbb{S}[X]/\mathbb{E}[X]\).

Returns: The coefficient of variation of the different variables.
Return type: dict[str, numpy.ndarray]

SYMBOLS = {'a_value': 'Aval', 'b_value': 'Bval', 'margin': 'Margin', 'maximum': 'Max', 'mean': 'E', 'mean_std': 'E_StD', 'median': 'Med', 'minimum': 'Min', 'moment': 'M', 'percentile': 'p', 'probability': 'P', 'quantile': 'Q', 'quartile': 'q', 'range': 'R', 'standard_deviation': 'StD', 'tolerance_interval': 'TI', 'variance': 'V', 'variation_coefficient': 'CoV'}¶

dataset: gemseo.core.dataset.Dataset¶: The dataset.

n_samples: int¶: The number of samples.

n_variables: int¶: The number of variables.

name: str¶: The name of the object.