
# gaussian_mixture module

## Gaussian mixture clustering algorithm

The Gaussian mixture algorithm groups the data into a fixed number $$k$$ of clusters. Each cluster $$i=1, \cdots, k$$ is defined by a mean $$\mu_i$$ and a covariance matrix $$\Sigma_i$$.

A point is assigned to the cluster whose Gaussian probability density, defined by that cluster's mean and covariance matrix, is the highest at this point:

$\operatorname{cluster}(x) = \underset{i=1,\cdots,k}{\operatorname{argmax}} \mathcal{N}(x; \mu_i, \Sigma_i) = \underset{i=1,\cdots,k}{\operatorname{argmin}} \|x-\mu_i\|_{\Sigma_i^{-1}},$

where $$\mathcal{N}(x; \mu_i, \Sigma_i)$$ is the value of the probability density function of a Gaussian random variable $$X \sim \mathcal{N}(\mu_i, \Sigma_i)$$ at the point $$x$$ and $$\|x-\mu_i\|_{\Sigma_i^{-1}} = \sqrt{(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)}$$ is the Mahalanobis distance between $$x$$ and $$\mu_i$$ weighted by $$\Sigma_i$$. Strictly, the second equality holds when all the covariance matrices $$\Sigma_i$$ have the same determinant; in general, the density also depends on the normalization factor $$\det(\Sigma_i)^{-1/2}$$. Likewise, the probability of belonging to a cluster $$i=1, \cdots, k$$ may be determined through

$\mathbb{P}(x \in C_i) = \frac{\mathcal{N}(x; \mu_i, \Sigma_i)} {\sum_{j=1}^k \mathcal{N}(x; \mu_j, \Sigma_j)},$

where $$C_i = \{x\, | \, \operatorname{cluster}(x) = i \}$$.
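The two formulas above can be illustrated with a minimal NumPy sketch. The helper names `log_gaussian_density`, `predict_cluster` and `membership_probabilities`, as well as the two-cluster data, are hypothetical and chosen for illustration only; they are not part of GEMSEO. The sketch works with log-densities for numerical stability and returns 0-based cluster indices, whereas the text numbers clusters from 1.

```python
import numpy as np


def log_gaussian_density(x, mean, cov):
    """Return log N(x; mean, cov) for a single point x."""
    diff = x - mean
    # Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu).
    maha_sq = diff @ np.linalg.solve(cov, diff)
    _, log_det = np.linalg.slogdet(cov)
    return -0.5 * (len(mean) * np.log(2.0 * np.pi) + log_det + maha_sq)


def predict_cluster(x, means, covs):
    """Return the 0-based index of the cluster with the highest density at x."""
    log_dens = [log_gaussian_density(x, m, c) for m, c in zip(means, covs)]
    return int(np.argmax(log_dens))


def membership_probabilities(x, means, covs):
    """Return the densities at x normalized over the clusters."""
    log_dens = np.array(
        [log_gaussian_density(x, m, c) for m, c in zip(means, covs)]
    )
    # Subtract the maximum before exponentiating to avoid underflow.
    dens = np.exp(log_dens - log_dens.max())
    return dens / dens.sum()


# Hypothetical example: two clusters with identity covariance matrices.
means = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
covs = [np.eye(2), np.eye(2)]
x = np.array([1.0, 0.5])

label = predict_cluster(x, means, covs)
probabilities = membership_probabilities(x, means, covs)
```

Because both covariance matrices share the same determinant in this example, the density-based assignment coincides with the Mahalanobis-distance rule.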

When fitting the algorithm, the cluster centers $$\mu_i$$ and the covariance matrices $$\Sigma_i$$ are computed using the expectation-maximization algorithm.
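As a rough illustration of the expectation-maximization idea, the sketch below alternates between computing cluster responsibilities (E-step) and re-estimating the parameters (M-step). It follows the standard textbook formulation, which also estimates mixing weights that do not appear in the formulas above. The function names, the fixed iteration count, the regularization term `reg` and the synthetic data are assumptions made for this example; the scikit-learn implementation differs in initialization, stopping criterion and numerical details.

```python
import numpy as np


def gaussian_density(points, mean, cov):
    """Evaluate N(x; mean, cov) at every row x of points."""
    dim = mean.shape[0]
    diff = points - mean
    maha_sq = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
    norm = np.sqrt((2.0 * np.pi) ** dim * np.linalg.det(cov))
    return np.exp(-0.5 * maha_sq) / norm


def fit_gaussian_mixture(points, initial_means, n_iter=50, reg=1e-6):
    """Run a fixed number of EM iterations; return (weights, means, covs)."""
    n_points, dim = points.shape
    means = np.array(initial_means, dtype=float)
    k = means.shape[0]
    # Start every cluster from the covariance of the whole sample.
    covs = np.array([np.cov(points.T) + reg * np.eye(dim) for _ in range(k)])
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each cluster for each point.
        dens = np.stack(
            [
                weights[i] * gaussian_density(points, means[i], covs[i])
                for i in range(k)
            ],
            axis=1,
        )
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and covariance matrices.
        totals = resp.sum(axis=0)
        weights = totals / n_points
        means = (resp.T @ points) / totals[:, None]
        for i in range(k):
            diff = points - means[i]
            weighted = resp[:, i, None] * diff
            covs[i] = weighted.T @ diff / totals[i] + reg * np.eye(dim)
    return weights, means, covs


# Synthetic data: two well-separated blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(0)
points = np.vstack(
    [
        rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
        rng.normal(loc=5.0, scale=0.5, size=(100, 2)),
    ]
)
weights, means, covs = fit_gaussian_mixture(
    points, initial_means=[[1.0, 1.0], [4.0, 4.0]]
)
```

On data this well separated, the estimated means should settle near the two blob centers; harder problems generally need a convergence test and several initializations.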

This concept is implemented through the GaussianMixture class, which inherits from the MLClusteringAlgo class.

### Dependence

This clustering algorithm relies on the GaussianMixture class of the scikit-learn library.
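For orientation, here is a short sketch of the underlying scikit-learn estimator used directly, without the GEMSEO wrapper. The synthetic two-blob data and the choice `n_components=2` are assumptions made for this example. Note that scikit-learn also estimates mixing weights and takes them into account in `predict` and `predict_proba`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: two well-separated blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(0)
samples = np.vstack(
    [
        rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
        rng.normal(loc=5.0, scale=0.5, size=(100, 2)),
    ]
)

# Fit the mixture with the expectation-maximization algorithm.
model = GaussianMixture(n_components=2, random_state=0)
model.fit(samples)

labels = model.predict(samples)  # one cluster index per sample
probabilities = model.predict_proba(samples)  # one row of probabilities per sample
centers = model.means_  # estimated cluster means
covariances = model.covariances_  # estimated covariance matrices
```

With the default `covariance_type="full"`, `covariances_` holds one full covariance matrix per component.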

class gemseo.mlearning.cluster.gaussian_mixture.GaussianMixture(data, transformer=None, var_names=None, n_components=5, **parameters)

Gaussian mixture clustering algorithm.

Constructor.

Parameters
• data (Dataset) – learning dataset.

• transformer (dict(str)) – transformation strategy for data groups. If None, do not transform data. Default: None.

• var_names (list(str)) – names of the variables to consider. Default: None.

• n_components (int) – number of Gaussian mixture components. Default: 5.

• parameters – Scikit-learn algorithm parameters.

ABBR = 'GaussMix'