empirical module
Class for the empirical estimation of statistics from a dataset.
Overview
The EmpiricalStatistics class inherits from the abstract Statistics class and aims to estimate statistics from a Dataset, based on empirical estimators.
Construction
An EmpiricalStatistics is built from a Dataset and, optionally, variable names. When variable names are given, statistics are computed only for these variables. Otherwise, statistics are computed for all the variables available in the dataset.
Lastly, the user can give a name to the EmpiricalStatistics object. By default, this name is the concatenation of 'EmpiricalStatistics' and the name of the Dataset.
- class gemseo.uncertainty.statistics.empirical.EmpiricalStatistics(dataset, variable_names=(), name='')[source]¶
Bases:
Statistics
A toolbox to compute statistics empirically.
Unless otherwise stated, the statistics are computed variable-wise and component-wise, i.e. variable-by-variable and component-by-component. So, for the sake of readability, the methods named as compute_statistic() return dict[str, ndarray] objects whose keys are the names of the variables and whose values are the statistics estimated for the different components.
Examples
>>> from gemseo import (
...     create_discipline,
...     create_parameter_space,
...     create_scenario,
... )
>>> from gemseo.uncertainty.statistics.empirical import EmpiricalStatistics
>>>
>>> expressions = {"y1": "x1+2*x2", "y2": "x1-3*x2"}
>>> discipline = create_discipline(
...     "AnalyticDiscipline", expressions=expressions
... )
>>>
>>> parameter_space = create_parameter_space()
>>> parameter_space.add_random_variable(
...     "x1", "OTUniformDistribution", minimum=-1, maximum=1
... )
>>> parameter_space.add_random_variable(
...     "x2", "OTUniformDistribution", minimum=-1, maximum=1
... )
>>>
>>> scenario = create_scenario(
...     [discipline],
...     "DisciplinaryOpt",
...     "y1",
...     parameter_space,
...     scenario_type="DOE",
... )
>>> scenario.execute({"algo": "OT_MONTE_CARLO", "n_samples": 100})
>>>
>>> dataset = scenario.to_dataset(opt_naming=False)
>>>
>>> statistics = EmpiricalStatistics(dataset)
>>> mean = statistics.compute_mean()
- Parameters:
dataset (Dataset) – A dataset.
variable_names (Iterable[str]) –
The names of the variables for which to compute statistics. If empty, consider all the variables of the dataset.
By default it is set to ().
name (str) –
A name for the toolbox computing statistics. If empty, concatenate the name of the class and the name of the dataset.
By default it is set to “”.
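The variable-wise, component-wise return structure described above can be illustrated with a minimal sketch. It is not the GEMSEO implementation: the toy data, the `component_wise_mean` helper and the use of plain lists instead of NumPy arrays are assumptions made to keep the example dependency-free.

```python
from statistics import fmean

# Hypothetical toy data: each variable maps to a list of samples,
# and each sample lists the components of that variable.
samples = {
    "y1": [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]],  # 3 samples, 2 components
    "y2": [[4.0], [6.0]],  # 2 samples, 1 component
}


def component_wise_mean(data):
    """Return {variable name: [mean of each component]}."""
    return {
        name: [fmean(column) for column in zip(*rows)]
        for name, rows in data.items()
    }


mean = component_wise_mean(samples)
```

Here `mean["y1"]` holds one value per component of `y1`, mirroring the shape of the dictionaries returned by the `compute_*` methods.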
- compute_a_value()¶
Compute the A-value \(\text{Aval}[X]\).
The A-value is the lower bound of the lower-sided tolerance interval associated with a coverage level equal to 99% and a confidence level equal to 95%.
- compute_b_value()¶
Compute the B-value \(\text{Bval}[X]\).
The B-value is the lower bound of the lower-sided tolerance interval associated with a coverage level equal to 90% and a confidence level equal to 95%.
- classmethod compute_expression(variable_name, statistic_name, show_name=False, **options)¶
Return the expression of a statistical function applied to a variable.
E.g. “P[X >= 1.0]” for the probability that X exceeds 1.0.
- Parameters:
variable_name (str) – The name of the variable, e.g. "X".
statistic_name (str) – The name of the statistic, e.g. "probability".
show_name (bool) –
If True, show option names. Otherwise, only show option values.
By default it is set to False.
**options (bool | float | int) – The options passed to the statistical function, e.g. {"greater": True, "thresh": 1.0}.
- Returns:
The expression of the statistical function applied to the variable.
- Return type:
- compute_joint_probability(thresh, greater=True)[source]¶
Compute the joint probability related to a threshold.
Either \(\mathbb{P}[X \geq x]\) or \(\mathbb{P}[X \leq x]\).
- Parameters:
- Returns:
The joint probability of the different variables (by definition of the joint probability, this statistic is not computed component-wise).
- Return type:
- compute_margin(std_factor)¶
Compute a margin \(\text{Margin}[X]=\mathbb{E}[X]+\kappa\mathbb{S}[X]\).
- compute_mean_std(std_factor)¶
Compute a margin \(\text{Margin}[X]=\mathbb{E}[X]+\kappa\mathbb{S}[X]\).
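The margin formula \(\mathbb{E}[X]+\kappa\mathbb{S}[X]\) can be checked by hand on a small sample. The sketch below is only an illustration: the `margin` helper is hypothetical, and it assumes the population standard deviation (division by the sample size), which may differ from the estimator actually used by the library.

```python
import math
from statistics import fmean, pstdev


def margin(values, std_factor):
    """Return mean + std_factor * standard deviation of the sample.

    Assumption: the population standard deviation is used.
    """
    return fmean(values) + std_factor * pstdev(values)


values = [1.0, 2.0, 3.0, 4.0, 5.0]  # mean 3, population variance 2
margin_2 = margin(values, 2.0)  # 3 + 2 * sqrt(2)
```

With `std_factor=0` the margin reduces to the mean, and a negative factor gives a margin below the mean.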
- compute_median()¶
Compute the median \(\text{Med}[X]\).
- compute_percentile(order)¶
Compute the n-th percentile \(\text{p}[X; n]\).
- Parameters:
order (int) – The order \(n\in\{0,1,2,\ldots,100\}\) of the percentile.
- Returns:
The component-wise percentile of the different variables.
- Raises:
ValueError – When \(n\notin\{0,1,2,\ldots,100\}\).
- Return type:
- compute_probability(thresh, greater=True)[source]¶
Compute the probability related to a threshold.
Either \(\mathbb{P}[X \geq x]\) or \(\mathbb{P}[X \leq x]\).
- Parameters:
- Returns:
The component-wise probability of the different variables.
- Return type:
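The difference between the component-wise probability and the joint probability can be sketched on a toy sample. The two helpers below are hypothetical and only assume that an empirical probability is the fraction of samples satisfying the comparison; they are not the library's code.

```python
def empirical_probability(rows, thresh, greater=True):
    """Return, for each component, the fraction of samples beyond thresh."""
    n_samples = len(rows)
    columns = list(zip(*rows))
    if greater:
        return [sum(v >= thresh for v in col) / n_samples for col in columns]
    return [sum(v <= thresh for v in col) / n_samples for col in columns]


def empirical_joint_probability(rows, thresh, greater=True):
    """Return the fraction of samples whose components all satisfy the test."""
    if greater:
        hits = sum(all(v >= thresh for v in row) for row in rows)
    else:
        hits = sum(all(v <= thresh for v in row) for row in rows)
    return hits / len(rows)


# Four samples of a hypothetical two-component variable.
rows = [[0.5, 2.0], [1.5, 0.2], [2.5, 3.0], [3.5, 0.1]]
component_wise = empirical_probability(rows, 1.0)
joint = empirical_joint_probability(rows, 1.0)
```

The joint probability is never larger than the smallest component-wise probability, since a sample must satisfy the comparison on every component at once.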
- compute_quantile(prob)[source]¶
Compute the quantile \(\mathbb{Q}[X; \alpha]\) related to a probability.
- compute_quartile(order)¶
Compute the n-th quartile \(q[X; n]\).
- Parameters:
order (int) – The order \(n\in\{1,2,3\}\) of the quartile.
- Returns:
The component-wise quartile of the different variables.
- Raises:
ValueError – When \(n\notin\{1,2,3\}\).
- Return type:
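Percentiles and quartiles are quantiles at particular probability levels: the n-th percentile corresponds to the probability n/100 and the n-th quartile to n/4. The following sketch illustrates this relation together with the argument validation described above. The helper names and the linear interpolation between order statistics are assumptions, not the library's implementation.

```python
def quantile(values, prob):
    """Return the linearly interpolated quantile for prob in [0, 1]."""
    ordered = sorted(values)
    position = prob * (len(ordered) - 1)
    lower = int(position)  # index of the order statistic just below
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def percentile(values, order):
    """Return the percentile of integer order in {0, 1, ..., 100}."""
    if order not in range(101):
        raise ValueError("The order must be an integer between 0 and 100.")
    return quantile(values, order / 100)


def quartile(values, order):
    """Return the quartile of order in {1, 2, 3}."""
    if order not in (1, 2, 3):
        raise ValueError("The order must be 1, 2 or 3.")
    return quantile(values, order * 0.25)


values = [5.0, 1.0, 4.0, 2.0, 3.0]
median = quartile(values, 2)
```

Under these assumptions, the second quartile and the 50th percentile both coincide with the median.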
- compute_tolerance_interval(coverage, confidence=0.95, side=ToleranceIntervalSide.BOTH)¶
Compute a \((p,1-\alpha)\) tolerance interval \(\text{TI}[X]\).
The tolerance interval \(\text{TI}[X]\) is defined to contain at least a proportion \(p\) of the values of \(X\) with a level of confidence \(1-\alpha\). \(p\) is also called the coverage level of the TI.
Typically, \(\alpha=0.05\) or equivalently \(1-\alpha=0.95\).
The tolerance interval can be either
lower-sided (side="LOWER": \([L, +\infty[\)),
upper-sided (side="UPPER": \(]-\infty, U]\)) or
both-sided (side="BOTH": \([L, U]\)).
- Parameters:
coverage (float) – A minimum proportion \(p\in[0,1]\) of belonging to the TI.
confidence (float) –
A level of confidence \(1-\alpha\in[0,1]\).
By default it is set to 0.95.
side (ToleranceIntervalSide) –
The type of the tolerance interval.
By default it is set to “both”.
- Returns:
The component-wise tolerance intervals of the different variables, expressed as {variable_name: [(lower_bound, upper_bound), ...], ... } where [(lower_bound, upper_bound), ...] are the lower and upper bounds of the tolerance interval of the different components of variable_name.
- Return type:
See also
- compute_variation_coefficient()¶
Compute the coefficient of variation \(CoV[X]\).
This is the standard deviation normalized by the expectation: \(CoV[X]=\mathbb{S}[X]/\mathbb{E}[X]\).
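As a small worked example, the coefficient of variation is the standard deviation divided by the mean. The helper below is hypothetical and assumes the population standard deviation; the library may use a different estimator.

```python
import math
from statistics import fmean, pstdev


def variation_coefficient(values):
    """Return the standard deviation divided by the mean.

    Assumption: the population standard deviation is used.
    """
    return pstdev(values) / fmean(values)


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, std 2
cov = variation_coefficient(values)
```

The result is dimensionless, which makes it convenient for comparing the dispersion of variables with different scales. It is undefined when the mean is zero.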
- plot_boxplot(save=False, show=True, directory_path='', file_format='png', **options)[source]¶
Visualize the data with a boxplot.
- Parameters:
save (bool) –
Whether to save the figures.
By default it is set to False.
show (bool) –
Whether to show the figures.
By default it is set to True.
directory_path (str | Path) –
The path to save the figures.
By default it is set to “”.
file_format (str) –
The file extension.
By default it is set to “png”.
**options (Any) – The options of the
Boxplot
graphs.
- Returns:
The boxplot of each variable.
- Return type:
- plot_cdf(save=False, show=True, directory_path='', file_format='png', **options)[source]¶
Visualize the empirical cumulative probability function.
- Parameters:
save (bool) –
Whether to save the figures.
By default it is set to False.
show (bool) –
Whether to show the figures.
By default it is set to True.
directory_path (str | Path) –
The path to save the figures.
By default it is set to “”.
file_format (str) –
The file extension.
By default it is set to “png”.
**options (Any) – The options of the
Lines
graphs.
- Returns:
The graph of the cumulative probability function for each variable.
- Return type:
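The quantity drawn by such a plot can be sketched without any plotting library: the empirical cumulative distribution function at x is the fraction of samples lower than or equal to x. The `empirical_cdf` helper below is hypothetical and is not the library's implementation.

```python
from bisect import bisect_right


def empirical_cdf(values):
    """Return a function x -> fraction of samples lower than or equal to x."""
    ordered = sorted(values)
    n_samples = len(ordered)

    def cdf(x):
        # bisect_right counts the sorted samples that are <= x.
        return bisect_right(ordered, x) / n_samples

    return cdf


cdf = empirical_cdf([3.0, 1.0, 2.0, 2.0])
levels = [cdf(x) for x in (0.5, 1.0, 2.0, 3.0)]
```

The resulting function is a non-decreasing step function that jumps at each observed value, by an amount proportional to the number of repetitions of that value.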
- plot_pdf(save=False, show=True, directory_path='', file_format='png', **options)[source]¶
Visualize the empirical probability density function.
- Parameters:
save (bool) –
Whether to save the figures.
By default it is set to False.
show (bool) –
Whether to show the figures.
By default it is set to True.
directory_path (str | Path) –
The path to save the figures.
By default it is set to “”.
file_format (str) –
The file extension.
By default it is set to “png”.
**options (Any) – The options of the
Lines
graphs.
- Returns:
The graph of the probability density function for each variable.
- Return type:
- SYMBOLS: ClassVar[dict[str, str]] = {'a_value': 'Aval', 'b_value': 'Bval', 'margin': 'Margin', 'maximum': 'Max', 'mean': 'E', 'mean_std': 'E_StD', 'median': 'Med', 'minimum': 'Min', 'moment': 'M', 'percentile': 'p', 'probability': 'P', 'quantile': 'Q', 'quartile': 'q', 'range': 'R', 'standard_deviation': 'StD', 'tolerance_interval': 'TI', 'variance': 'V', 'variation_coefficient': 'CoV'}¶
Examples using EmpiricalStatistics
Empirical estimation of statistics