gemseo.mlearning.clustering.algos.gaussian_mixture module

The Gaussian mixture algorithm for clustering.

The Gaussian mixture algorithm groups the data into clusters. The number of clusters \(k\) is fixed in advance. Each cluster \(i=1, \cdots, k\) is defined by a mean \(\mu_i\) and a covariance matrix \(\Sigma_i\).

The cluster predicted for a point \(x\) is the one whose Gaussian distribution, defined by its mean and covariance matrix, has the highest probability density at \(x\):

\[\operatorname{cluster}(x) = \underset{i=1,\cdots,k}{\operatorname{argmax}} \mathcal{N}(x; \mu_i, \Sigma_i)\]

where \(\mathcal{N}(x; \mu_i, \Sigma_i)\) is the value of the probability density function of a Gaussian random variable \(X \sim \mathcal{N}(\mu_i, \Sigma_i)\) at the point \(x\). For a point \(x\) of dimension \(d\), this density equals \((2\pi)^{-d/2} \det(\Sigma_i)^{-1/2} \exp\left(-\frac{1}{2}\|x-\mu_i\|_{\Sigma_i^{-1}}^2\right)\), where \(\|x-\mu_i\|_{\Sigma_i^{-1}} = \sqrt{(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)}\) is the Mahalanobis distance between \(x\) and \(\mu_i\) weighted by \(\Sigma_i\). Likewise, the probability of belonging to a cluster \(i=1, \cdots, k\) may be determined through

\[\mathbb{P}(x \in C_i) = \frac{\mathcal{N}(x; \mu_i, \Sigma_i)} {\sum_{j=1}^k \mathcal{N}(x; \mu_j, \Sigma_j)},\]

where \(C_i = \{x\, | \, \operatorname{cluster}(x) = i \}\).
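The two formulas above can be illustrated with a minimal NumPy sketch. It is not GEMSEO's implementation: the helper names (`gaussian_pdf`, `cluster_probabilities`, `cluster`) and the numerical values are made up for illustration, and the cluster indices are 0-based. The sketch follows the unweighted formulas of the text, whereas scikit-learn's mixture model additionally multiplies each density by a mixing weight.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Density of N(mean, cov) at x, written with the Mahalanobis distance."""
    d = mean.size
    diff = x - mean
    # Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu).
    maha_sq = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha_sq) / norm


def cluster_probabilities(x, means, covs):
    """Densities normalized over the clusters (the probability formula)."""
    densities = np.array([gaussian_pdf(x, m, c) for m, c in zip(means, covs)])
    return densities / densities.sum()


def cluster(x, means, covs):
    """0-based index of the cluster with the highest density at x."""
    return int(np.argmax(cluster_probabilities(x, means, covs)))


# Two illustrative clusters in dimension 2 (made-up values).
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
covs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]

x = np.array([0.5, -0.2])
probabilities = cluster_probabilities(x, means, covs)
label = cluster(x, means, covs)
```

Since the normalization does not change the ordering of the densities, taking the argmax of the probabilities gives the same cluster as taking the argmax of the densities.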

When fitting the algorithm, the cluster centers \(\mu_i\) and the covariance matrices \(\Sigma_i\) are computed using the expectation-maximization algorithm.
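To give an idea of what the expectation-maximization algorithm does, here is a minimal sketch for one-dimensional data. It is an illustration under simplifying assumptions, not the scikit-learn implementation: the function name `fit_gmm_1d`, the quantile-based initialization and the fixed number of iterations are choices made for this example. As in the usual formulation of EM for mixtures, the sketch also estimates mixing weights, which the formulas above do not mention.

```python
import numpy as np


def fit_gmm_1d(x, k, n_iter=100):
    """Fit a 1-D Gaussian mixture with a basic EM loop (no convergence test)."""
    # Simple deterministic initialization: spread the means over the quantiles
    # of the data, start from the global variance and from equal weights.
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each cluster for each sample, shape (n, k).
        diff = x[:, None] - means[None, :]
        dens = np.exp(-0.5 * diff**2 / variances) / np.sqrt(2.0 * np.pi * variances)
        resp = weights * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the weighted samples.
        n_k = resp.sum(axis=0)
        weights = n_k / x.size
        means = (resp * x[:, None]).sum(axis=0) / n_k
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / n_k
    return weights, means, variances


# Synthetic data drawn from two well-separated Gaussian distributions.
rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(-3.0, 0.5, 200), rng.normal(2.0, 1.0, 300)])
weights, means, variances = fit_gmm_1d(x, k=2)
```

On such well-separated data, the estimated means should land close to the values used to generate the samples; on harder problems, EM only reaches a local optimum that depends on the initialization.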

This concept is implemented through the GaussianMixture class, which inherits from the BasePredictiveClusterer class.

Dependence

This clustering algorithm relies on the GaussianMixture class of the scikit-learn library.
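For reference, the underlying scikit-learn class can be used directly as sketched below. This shows the scikit-learn API only; how the GEMSEO wrapper maps its own settings onto it is not covered here. The alias `SKLGaussianMixture` and the synthetic data are choices made for this example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture as SKLGaussianMixture

# Synthetic data: two well-separated groups of points in dimension 2.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2)),
])

# The number of clusters is fixed before fitting.
model = SKLGaussianMixture(n_components=2, random_state=0)
model.fit(data)  # The parameters are estimated by expectation-maximization.

labels = model.predict(data)  # One cluster index per sample.
probabilities = model.predict_proba(data[:3])  # One row per sample, one column per cluster.
means = model.means_  # Cluster means, shape (2, 2).
covariances = model.covariances_  # Shape (2, 2, 2) with the default "full" covariance type.
```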

class GaussianMixture(data, settings_model=None, **settings)

Bases: BasePredictiveClusterer

The Gaussian mixture clustering algorithm.

Parameters:
  • data (Dataset) -- The training dataset.

  • settings_model (BaseMLAlgoSettings | None) -- The machine learning algorithm settings as a Pydantic model. If None, use **settings.

  • **settings (Any) -- The machine learning algorithm settings. These arguments are ignored when settings_model is not None.

Raises:

ValueError -- When both the variable and the group it belongs to have a transformer.

Settings

alias of GaussianMixture_Settings

LIBRARY: ClassVar[str] = 'scikit-learn'

The name of the library of the wrapped machine learning algorithm.

SHORT_ALGO_NAME: ClassVar[str] = 'GMM'

The short name of the machine learning algorithm, often an acronym.

Typically used for composite names, e.g. f"{algo.SHORT_ALGO_NAME}_{dataset.name}" or f"{algo.SHORT_ALGO_NAME}_{discipline.name}".
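As a small illustration of such a composite name, the sketch below uses hypothetical stand-in objects instead of a real algorithm and dataset; only the attributes appearing in the f-string are mimicked, and the dataset name is made up.

```python
from types import SimpleNamespace

# Hypothetical stand-ins: a real algorithm exposes SHORT_ALGO_NAME as a class
# attribute, and the dataset is only assumed to have a name attribute.
algo = SimpleNamespace(SHORT_ALGO_NAME="GMM")
dataset = SimpleNamespace(name="iris")

composite_name = f"{algo.SHORT_ALGO_NAME}_{dataset.name}"
```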