Statistics#
The base class BaseStatistics#
Abstract class for the estimation of statistics from a dataset.
Overview#
The abstract BaseStatistics class implements the concept of a statistics library.
It is extended by the EmpiricalStatistics and ParametricStatistics classes.
Construction#
A BaseStatistics object is built from a Dataset and, optionally, variable names.
If variable names are given, statistics are computed only for these variables.
Otherwise, statistics are computed for all the variables available in the dataset.
Lastly, the user can give a name to the BaseStatistics object.
By default, this name is the concatenation of the name of the class deriving from BaseStatistics and the name of the Dataset.
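The construction rules above can be sketched with a small stand-in class. This is not GEMSEO code: `ToyStatistics`, its `dataset_name` argument and the plain dictionary used as a dataset are hypothetical, chosen only to illustrate variable selection and default naming. The separator used in the default name is an assumption as well.

```python
from collections.abc import Iterable, Mapping, Sequence


class ToyStatistics:
    """A hypothetical stand-in illustrating the construction rules."""

    def __init__(
        self,
        dataset: Mapping[str, Sequence[float]],
        dataset_name: str,
        variable_names: Iterable[str] = (),
        name: str = "",
    ) -> None:
        # An empty selection means "use every variable of the dataset".
        self.variable_names = list(variable_names) or list(dataset)
        self.dataset = {k: list(dataset[k]) for k in self.variable_names}
        # Assumed default: class name, then dataset name, joined by "_".
        self.name = name or f"{type(self).__name__}_{dataset_name}"


data = {"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]}
all_stats = ToyStatistics(data, "my_dataset")
x_stats = ToyStatistics(data, "my_dataset", variable_names=["x"], name="x_only")
```

Here `all_stats` covers both variables and receives a default name, while `x_stats` is restricted to `"x"` and keeps the name given by the user.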
Capabilities#
A BaseStatistics object returns standard descriptive and statistical measures for the different variables:
- BaseStatistics.compute_minimum(): the minimum value,
- BaseStatistics.compute_maximum(): the maximum value,
- BaseStatistics.compute_range(): the difference between the maximum and minimum values,
- BaseStatistics.compute_mean(): the expectation (a.k.a. mean value),
- BaseStatistics.compute_moment(): a central moment, which is the expected value of a specified integer power of the deviation from the mean,
- BaseStatistics.compute_variance(): the variance, which is the mean squared deviation from the mean value,
- BaseStatistics.compute_standard_deviation(): the standard deviation, which is the square root of the variance,
- BaseStatistics.compute_variation_coefficient(): the coefficient of variation, which is the standard deviation normalized by the mean,
- BaseStatistics.compute_quantile(): the quantile associated with a probability, which is the cut point dividing the range into a first continuous interval with this given probability and a second continuous interval with the complementary probability; common q-quantiles dividing the range into q continuous intervals with equal probabilities are also implemented:
  - BaseStatistics.compute_median(), which implements the 2-quantile (50%),
  - BaseStatistics.compute_quartile(), whose order (1, 2 or 3) implements the 4-quantiles (25%, 50% and 75%),
  - BaseStatistics.compute_percentile(), whose order (1, 2, ..., 99) implements the 100-quantiles (1%, 2%, ..., 99%),
- BaseStatistics.compute_probability(): the probability that the random variable is larger or smaller than a certain threshold,
- BaseStatistics.compute_tolerance_interval(): the left-sided, right-sided or both-sided tolerance interval associated with a given coverage level and a given confidence level, which is a statistical interval within which, with some confidence level, a specified proportion of the random variable realizations falls (this proportion is the coverage level),
- BaseStatistics.compute_a_value(): the A-value, which is the lower bound of the left-sided tolerance interval associated with a coverage level equal to 99% and a confidence level equal to 95%,
- BaseStatistics.compute_b_value(): the B-value, which is the lower bound of the left-sided tolerance interval associated with a coverage level equal to 90% and a confidence level equal to 95%.
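As an illustration of the simplest measures in this list, the sketch below computes them for a single sample with Python's standard `statistics` module rather than with a `BaseStatistics` object. The sample values are made up, and the population (biased) variance is an arbitrary choice; the estimator used by a concrete subclass may differ.

```python
import math
import statistics

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data

minimum = min(sample)
maximum = max(sample)
value_range = maximum - minimum
mean = statistics.fmean(sample)
# Population variance: mean squared deviation from the mean.
variance = statistics.pvariance(sample)
standard_deviation = math.sqrt(variance)
# Coefficient of variation: standard deviation normalized by the mean.
variation_coefficient = standard_deviation / mean
median = statistics.median(sample)
```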
- class BaseStatistics(dataset, variable_names=(), name='')
A toolbox to compute statistics.
Unless otherwise stated, the statistics are computed variable-wise and component-wise, i.e. variable-by-variable and component-by-component. So, for the sake of readability, the methods named as compute_statistic() return dict[str, RealArray] objects whose keys are the names of the variables and whose values are the statistic estimated for the different components.
- Parameters:
dataset (Dataset) -- A dataset.
variable_names (Iterable[str]) --
The names of the variables for which to compute statistics. If empty, consider all the variables of the dataset.
By default it is set to ().
name (str) --
A name for the toolbox computing statistics. If empty, concatenate the name of the dataset and the name of the class.
By default it is set to "".
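To make the return convention of the compute_statistic() methods concrete, here is a minimal sketch of a variable-wise and component-wise statistic. It uses plain Python lists instead of `RealArray` objects and a hypothetical helper `compute_mean`; it only illustrates the shape of the result, with variable names as keys and one value per component.

```python
import statistics

# Hypothetical data: each variable maps to a list of samples,
# and each sample holds one value per component.
samples = {
    "x": [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]],  # two components
    "y": [[0.5], [1.5], [2.5]],  # one component
}


def compute_mean(data):
    """Return the component-wise mean of each variable."""
    return {
        name: [statistics.fmean(component) for component in zip(*rows)]
        for name, rows in data.items()
    }


means = compute_mean(samples)
```

The resulting dictionary has one entry per variable, and each entry has as many values as the variable has components.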
- compute_a_value()
Compute the A-value \(\text{Aval}[X]\).
The A-value is the lower bound of the left-sided tolerance interval associated with a coverage level equal to 99% and a confidence level equal to 95%.
- compute_b_value()
Compute the B-value \(\text{Bval}[X]\).
The B-value is the lower bound of the left-sided tolerance interval associated with a coverage level equal to 90% and a confidence level equal to 95%.
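One distribution-free way to relate these coverage and confidence levels to a sample size is the classical order-statistic argument: for a continuous random variable, the minimum of n independent samples is a valid lower tolerance bound with coverage \(p\) and confidence \(1-\alpha\) as soon as \(1-p^n \geq 1-\alpha\). The sketch below illustrates that argument only. It is not necessarily how a BaseStatistics subclass computes A- and B-values (a parametric subclass would typically rely on a fitted distribution), and the helper name `minimum_sample_size` is hypothetical.

```python
def minimum_sample_size(coverage: float, confidence: float) -> int:
    """Smallest n for which the sample minimum is a lower tolerance bound.

    The probability that at least a proportion ``coverage`` of the
    population exceeds the minimum of n i.i.d. samples is 1 - coverage**n.
    """
    n = 1
    while 1.0 - coverage**n < confidence:
        n += 1
    return n


# A-value levels: 99% coverage, 95% confidence.
n_a = minimum_sample_size(0.99, 0.95)
# B-value levels: 90% coverage, 95% confidence.
n_b = minimum_sample_size(0.90, 0.95)
```

With fewer samples than these sizes, a distribution-free bound at the requested levels cannot be read from the sample minimum, which is one reason to assume a distribution instead.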
- classmethod compute_expression(variable_name, statistic_name, show_name=False, **options)
Return the expression of a statistical function applied to a variable.
E.g. "P[X >= 1.0]" for the probability that X exceeds 1.0.
- Parameters:
variable_name (str) -- The name of the variable, e.g. "X".
statistic_name (str) -- The name of the statistic, e.g. "probability".
show_name (bool) --
If True, show option names. Otherwise, only show option values.
By default it is set to False.
**options (bool | float) -- The options passed to the statistical function, e.g. {"greater": True, "thresh": 1.0}.
- Returns:
The expression of the statistical function applied to the variable.
- Return type:
- abstract compute_joint_probability(thresh, greater=True)
Compute the joint probability related to a threshold.
Either \(\mathbb{P}[X \geq x]\) or \(\mathbb{P}[X \leq x]\).
- Parameters:
- Returns:
The joint probability of the different variables (by definition of the joint probability, this statistic is not computed component-wise).
- Return type:
- compute_margin(std_factor)
Compute a margin \(\text{Margin}[X]=\mathbb{E}[X]+\kappa\mathbb{S}[X]\).
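The margin formula can be checked by hand on a small sample. The sketch below is a standalone illustration with a hypothetical helper `margin`, where `std_factor` plays the role of \(\kappa\); it assumes the population standard deviation, which may not be the estimator used by a concrete subclass.

```python
import statistics


def margin(sample, std_factor):
    """Return mean + std_factor * standard deviation of a sample."""
    mean = statistics.fmean(sample)
    # Assumption: population standard deviation (divides by n).
    std = statistics.pstdev(sample)
    return mean + std_factor * std


sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data
upper_margin = margin(sample, 3.0)  # mean + 3 standard deviations
lower_margin = margin(sample, -3.0)  # mean - 3 standard deviations
```

A negative factor gives a margin below the mean, a positive one a margin above it.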
- abstract compute_maximum()
Compute the maximum \(\text{Max}[X]\).
- abstract compute_mean()
Compute the mean \(\mathbb{E}[X]\).
- compute_mean_std(std_factor)
Compute a margin \(\text{Margin}[X]=\mathbb{E}[X]+\kappa\mathbb{S}[X]\).
- compute_median()
Compute the median \(\text{Med}[X]\).
- abstract compute_minimum()
Compute the minimum \(\text{Min}[X]\).
- abstract compute_moment(order)
Compute the n-th moment \(M[X; n]\).
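The overview describes this statistic as a central moment, i.e. the expected value of an integer power of the deviation from the mean. Under that reading, an empirical estimate could look like the sketch below; the helper `central_moment` is hypothetical and a concrete subclass may use a different estimator or convention.

```python
import statistics


def central_moment(sample, order):
    """Return the empirical central moment of a given integer order."""
    mean = statistics.fmean(sample)
    return statistics.fmean([(value - mean) ** order for value in sample])


sample = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up, symmetric data
second_moment = central_moment(sample, 2)  # population variance
third_moment = central_moment(sample, 3)  # zero for a symmetric sample
```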
- compute_percentile(order)
Compute the n-th percentile \(\text{p}[X; n]\).
- Parameters:
order (int) -- The order \(n\in\{0,1,2,\ldots,100\}\) of the percentile.
- Returns:
The component-wise percentile of the different variables.
- Raises:
ValueError -- When \(n\notin\{0,1,2,\ldots,100\}\).
- Return type:
- abstract compute_probability(thresh, greater=True)
Compute the probability related to a threshold.
Either \(\mathbb{P}[X \geq x]\) or \(\mathbb{P}[X \leq x]\).
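For an empirical estimate, such a probability is simply the fraction of samples on the requested side of the threshold. The sketch below shows this with a hypothetical standalone function `probability` that mirrors the `thresh` and `greater` arguments; it handles a single scalar sample, whereas the method works variable-wise and component-wise.

```python
def probability(sample, thresh, greater=True):
    """Return the empirical estimate of P[X >= thresh] or P[X <= thresh]."""
    if greater:
        count = sum(1 for value in sample if value >= thresh)
    else:
        count = sum(1 for value in sample if value <= thresh)
    return count / len(sample)


sample = [0.2, 0.5, 1.0, 1.3, 2.1]  # made-up data
p_exceed = probability(sample, 1.0)  # P[X >= 1.0]
p_below = probability(sample, 1.0, greater=False)  # P[X <= 1.0]
```

Because both inequalities are non-strict, a sample equal to the threshold is counted on both sides, so the two estimates need not sum to one.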
- Parameters:
- Returns:
The component-wise probability of the different variables.
- Return type:
- abstract compute_quantile(prob)
Compute the quantile \(\mathbb{Q}[X; \alpha]\) related to a probability.
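Quartiles and percentiles are quantiles at fixed probabilities, which the following sketch illustrates with `statistics.quantiles` from the Python standard library. The helpers `quartile` and `percentile` are hypothetical, the `"inclusive"` interpolation rule is an arbitrary choice, and the percentile orders are restricted to 1, ..., 99 as in the overview; the estimator and the accepted orders of a concrete subclass may differ.

```python
import statistics


def quartile(sample, order):
    """Return the quartile of a given order (1, 2 or 3)."""
    if order not in (1, 2, 3):
        raise ValueError("Quartile order must be in {1, 2, 3}.")
    # statistics.quantiles returns the n - 1 cut points of the n-quantiles.
    return statistics.quantiles(sample, n=4, method="inclusive")[order - 1]


def percentile(sample, order):
    """Return the percentile of a given order (1, 2, ..., 99)."""
    if order not in range(1, 100):
        raise ValueError("Percentile order must be in {1, ..., 99}.")
    return statistics.quantiles(sample, n=100, method="inclusive")[order - 1]


sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]  # made-up data
first_quartile = quartile(sample, 1)
second_quartile = quartile(sample, 2)  # the median
fiftieth_percentile = percentile(sample, 50)  # also the median
```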
- compute_quartile(order)
Compute the n-th quartile \(q[X; n]\).
- Parameters:
order (int) -- The order \(n\in\{1,2,3\}\) of the quartile.
- Returns:
The component-wise quartile of the different variables.
- Raises:
ValueError -- When \(n\notin\{1,2,3\}\).
- Return type:
- abstract compute_range()
Compute the range \(R[X]\).
- abstract compute_standard_deviation()
Compute the standard deviation \(\mathbb{S}[X]\).
- compute_tolerance_interval(coverage, confidence=0.95, side=ToleranceIntervalSide.BOTH)
Compute a \((p,1-\alpha)\) tolerance interval \(\text{TI}[X]\).
The tolerance interval \(\text{TI}[X]\) is defined to contain at least a proportion \(p\) of the values of \(X\) with a level of confidence \(1-\alpha\). \(p\) is also called the coverage level of the TI.
Typically, \(\alpha=0.05\) or equivalently \(1-\alpha=0.95\).
The tolerance interval can be either
- lower-sided (side="LOWER": \([L, +\infty[\)),
- upper-sided (side="UPPER": \(]-\infty, U]\)) or
- both-sided (side="BOTH": \([L, U]\)).
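To give a feel for how coverage and confidence interact in the both-sided case, the sketch below uses the classical distribution-free result for a continuous random variable: the interval between the minimum and the maximum of n independent samples covers at least a proportion \(p\) with confidence \(1 - n p^{n-1} + (n-1) p^n\). This is an illustration under that assumption, with a hypothetical helper `two_sided_confidence`; it is not the formula used by a parametric implementation.

```python
def two_sided_confidence(n_samples: int, coverage: float) -> float:
    """Confidence that [min, max] of n samples covers a given proportion."""
    return (
        1.0
        - n_samples * coverage ** (n_samples - 1)
        + (n_samples - 1) * coverage**n_samples
    )


# Smallest sample size giving a (0.90, 0.95) both-sided interval [min, max].
n = 2
while two_sided_confidence(n, 0.90) < 0.95:
    n += 1
```

A both-sided interval needs more samples than a one-sided one at the same levels, because both tails must be controlled at once.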
- Parameters:
coverage (float) -- A minimum proportion \(p\in[0,1]\) of the values of \(X\) belonging to the TI.
confidence (float) --
A level of confidence \(1-\alpha\in[0,1]\).
By default it is set to 0.95.
side (ToleranceIntervalSide) --
The type of the tolerance interval.
By default it is set to "both".
- Returns:
The component-wise tolerance intervals of the different variables, expressed as {variable_name: [(lower_bound, upper_bound), ...], ... } where [(lower_bound, upper_bound), ...] are the lower and upper bounds of the tolerance interval of the different components of variable_name.
- Return type:
See also
- abstract compute_variance()
Compute the variance \(\mathbb{V}[X]\).
- compute_variation_coefficient()
Compute the coefficient of variation \(CoV[X]\).
This is the standard deviation normalized by the expectation: \(CoV[X]=\mathbb{S}[X]/\mathbb{E}[X]\).
- dataset: Dataset
The dataset.
- n_samples: int
The number of samples.
- n_variables: int
The number of variables.
- name: str
The name of the object.