gaussian_mixture module

Gaussian mixture clustering algorithm

The Gaussian mixture algorithm groups the data into clusters. The number of clusters is fixed. Each cluster \(i=1, \cdots, k\) is defined by a mean \(\mu_i\) and a covariance matrix \(\Sigma_i\).

A point is assigned to the cluster whose Gaussian distribution, defined by that cluster's mean and covariance matrix, has the highest probability density at the point:

\[\operatorname{cluster}(x) = \underset{i=1,\cdots,k}{\operatorname{argmax}} \mathcal{N}(x; \mu_i, \Sigma_i) = \underset{i=1,\cdots,k}{\operatorname{argmin}} \left( \|x-\mu_i\|_{\Sigma_i^{-1}}^2 + \log\det\Sigma_i \right),\]

where \(\mathcal{N}(x; \mu_i, \Sigma_i)\) is the value of the probability density function of a Gaussian random variable \(X \sim \mathcal{N}(\mu_i, \Sigma_i)\) at the point \(x\) and \(\|x-\mu_i\|_{\Sigma_i^{-1}} = \sqrt{(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)}\) is the Mahalanobis distance between \(x\) and \(\mu_i\) induced by \(\Sigma_i\). The second equality follows from taking the logarithm of the density; when all covariance matrices have the same determinant, the rule reduces to choosing the cluster with the smallest Mahalanobis distance. Likewise, the probability of belonging to a cluster \(i=1, \cdots, k\) may be determined through

\[\mathbb{P}(x \in C_i) = \frac{\mathcal{N}(x; \mu_i, \Sigma_i)} {\sum_{j=1}^k \mathcal{N}(x; \mu_j, \Sigma_j)},\]

where \(C_i = \{x\, | \, \operatorname{cluster}(x) = i \}\).
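The two formulas above can be illustrated with a minimal, standard-library sketch for two-dimensional data. It is not the GEMSEO or scikit-learn implementation: the function names `gaussian_pdf`, `predict_proba` and `predict` and the example means and covariances are assumptions made for illustration, and cluster indices are 0-based here whereas the text counts from 1.

```python
import math


def gaussian_pdf(x, mean, cov):
    """Density of a 2-D Gaussian N(mean, cov) at x; cov is a 2x2 nested list."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Explicit inverse of the 2x2 covariance matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mean[0], x[1] - mean[1]]
    # Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu).
    maha2 = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * maha2) / (2.0 * math.pi * math.sqrt(det))


def predict_proba(x, means, covs):
    """Membership probabilities: each density divided by the sum of densities."""
    densities = [gaussian_pdf(x, m, s) for m, s in zip(means, covs)]
    total = sum(densities)
    return [p / total for p in densities]


def predict(x, means, covs):
    """Index (0-based) of the cluster with the highest density at x."""
    densities = [gaussian_pdf(x, m, s) for m, s in zip(means, covs)]
    return max(range(len(densities)), key=densities.__getitem__)


# Hypothetical cluster parameters, chosen only for this example.
means = [(0.0, 0.0), (4.0, 4.0)]
covs = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.5], [0.5, 1.0]]]

point = (0.5, -0.2)
label = predict(point, means, covs)
proba = predict_proba(point, means, covs)
```

Because the membership probabilities are the densities normalised by their sum, the predicted cluster is always the one with the largest membership probability.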

During fitting, the cluster centers \(\mu_i\) and the covariance matrices \(\Sigma_i\) are estimated with the expectation-maximization (EM) algorithm.
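As a rough illustration of expectation-maximization, the following standard-library sketch fits a one-dimensional Gaussian mixture. It is not the scikit-learn implementation: the function `em_gaussian_mixture_1d`, its parameters (`init_means`, `init_var`, `n_iter`, `min_var`) and the toy data are assumptions. The sketch also estimates mixing weights, as the textbook EM algorithm for Gaussian mixtures does, although the description above does not mention them.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gaussian_mixture_1d(data, init_means, init_var=1.0, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture by EM; returns (weights, means, variances)."""
    k = len(init_means)
    n = len(data)
    means = list(init_means)
    variances = [init_var] * k  # assumed common starting variance
    weights = [1.0 / k] * k     # assumed uniform starting weights
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate parameters from responsibility-weighted data.
        for i in range(k):
            r_i = sum(r[i] for r in resp)
            means[i] = sum(r[i] * x for r, x in zip(resp, data)) / r_i
            var = sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, data)) / r_i
            variances[i] = max(var, min_var)  # floor guards against collapse
            weights[i] = r_i / n
    return weights, means, variances


# Toy data: two well-separated groups, chosen only for this example.
data = [-2.1, -1.9, -2.0, -2.3, -1.7, 3.0, 3.2, 2.8, 3.1, 2.9]
weights, means, variances = em_gaussian_mixture_1d(data, init_means=[-1.0, 1.0])
```

The result of EM depends on the starting values, and this sketch uses a fixed, hand-picked initialization and a fixed number of iterations rather than a convergence test. A production implementation would typically also work with log-densities for numerical stability.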

This concept is implemented by the GaussianMixture class, which inherits from the MLClusteringAlgo class.

Dependence

This clustering algorithm relies on the GaussianMixture class of the scikit-learn library.

class gemseo.mlearning.cluster.gaussian_mixture.GaussianMixture(data, transformer=None, var_names=None, n_components=5, **parameters)

Bases: gemseo.mlearning.cluster.cluster.MLClusteringAlgo

Gaussian mixture clustering algorithm.

Constructor.

Parameters
  • data (Dataset) – learning dataset.

  • transformer (dict(str)) – transformation strategy for data groups. If None, do not transform data. Default: None.

  • var_names (list(str)) – names of the variables to consider.

  • n_components (int) – number of Gaussian mixture components. Default: 5.

  • parameters – Scikit-learn algorithm parameters.

ABBR = 'GaussMix'