clustering module
This module contains the base classes for clustering algorithms.
The clustering module implements the concept of clustering models, a kind of unsupervised machine learning algorithm where the goal is to group data into clusters. Wherever possible, these methods should be able to predict the class of new data, as well as the probability of belonging to each class.
This concept is implemented through the MLClusteringAlgo class, which inherits from the MLUnsupervisedAlgo class, and through the MLPredictiveClusteringAlgo class, which inherits from MLClusteringAlgo.
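The two-level hierarchy described above can be illustrated with a minimal sketch. The classes below are hypothetical stand-ins written with NumPy only; they are not the GEMSEO classes, and the attribute `centers`, the method `learn()` and the median-split rule are assumptions made for illustration. The base class only groups the learning samples through `_fit()`, while the predictive subclass can also label new samples.

```python
from abc import ABC, abstractmethod

import numpy as np


class ToyClusteringAlgo(ABC):
    """Hypothetical stand-in for a clustering base class (not the GEMSEO class)."""

    def __init__(self, data):
        self.learning_set = np.asarray(data, dtype=float)
        self.n_clusters = None
        self.labels = None

    def learn(self):
        # Public entry point; the actual work is delegated to _fit().
        self._fit(self.learning_set)

    @abstractmethod
    def _fit(self, data):
        """Group the learning samples into clusters."""


class ToyPredictiveClusteringAlgo(ToyClusteringAlgo):
    """Hypothetical clustering algorithm that can also label new samples."""

    def _fit(self, data):
        # Deliberately naive rule: split the samples at the median of the
        # first variable, then store the mean of each group as its center.
        threshold = np.median(data[:, 0])
        self.labels = (data[:, 0] > threshold).astype(int)
        self.n_clusters = 2
        self.centers = np.array(
            [data[self.labels == k].mean(axis=0) for k in range(self.n_clusters)]
        )

    def predict(self, data):
        # Assign each new sample to the nearest center.
        samples = np.atleast_2d(np.asarray(data, dtype=float))
        distances = np.linalg.norm(
            samples[:, None, :] - self.centers[None, :, :], axis=2
        )
        return distances.argmin(axis=1)


algo = ToyPredictiveClusteringAlgo([[0.0], [0.2], [5.0], [5.2]])
algo.learn()
predicted = algo.predict([[0.3], [4.9]])
```

A real algorithm would replace the median split with an actual clustering method; the point here is only the division of responsibilities between the two classes.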
- class gemseo.mlearning.clustering.clustering.MLClusteringAlgo(data, transformer=mappingproxy({}), var_names=None, **parameters)
Bases: MLUnsupervisedAlgo
Clustering algorithm.
The inheriting classes shall overload the MLUnsupervisedAlgo._fit() method.
- Parameters:
data (IODataset) – The learning dataset.
transformer (TransformerType) – The strategies to transform the variables. The values are instances of Transformer while the keys are the names of either the variables or the groups of variables, e.g. "inputs" or "outputs" in the case of the regression algorithms. If a group is specified, the Transformer will be applied to all the variables of this group. If IDENTITY, do not transform the variables. By default it is set to {}.
var_names (Iterable[str] | None) – The names of the variables. If None, consider all variables mentioned in the learning dataset.
**parameters (MLAlgoParameterType) – The parameters of the machine learning algorithm.
- Raises:
ValueError – When both the variable and the group it belongs to have a transformer.
- algo: Any
The interfaced machine learning algorithm.
- learning_set: IODataset
The learning dataset.
- n_clusters: int
The number of clusters.
- transformer: dict[str, Transformer]
The strategies to transform the variables, if any.
The values are instances of Transformer while the keys are the names of either the variables or the groups of variables, e.g. “inputs” or “outputs” in the case of the regression algorithms. If a group is specified, the Transformer will be applied to all the variables of this group.
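The rule that a variable and the group it belongs to cannot both have a transformer can be sketched as a small check. The helper `check_transformer_keys` and the `groups` mapping below are hypothetical and only illustrate the rule; they are not part of the library, and plain strings stand in for Transformer instances.

```python
def check_transformer_keys(transformer, groups):
    """Raise ValueError if a variable and its group both have a transformer.

    Hypothetical helper: ``groups`` maps a group name to its variable names.
    """
    for group_name, variable_names in groups.items():
        if group_name not in transformer:
            continue
        clashing = sorted(set(variable_names) & set(transformer))
        if clashing:
            raise ValueError(
                f"Both the group {group_name!r} and its variables "
                f"{clashing} have a transformer."
            )


groups = {"inputs": ["x", "y"], "outputs": ["z"]}

# Accepted: a transformer for a whole group only.
check_transformer_keys({"inputs": "placeholder"}, groups)

# Accepted: a variable of one group and another group as a whole.
check_transformer_keys({"x": "placeholder", "outputs": "placeholder"}, groups)

# Rejected: "x" belongs to "inputs", and both are keys.
try:
    check_transformer_keys({"inputs": "placeholder", "x": "placeholder"}, groups)
    message = None
except ValueError as error:
    message = str(error)
```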
- class gemseo.mlearning.clustering.clustering.MLPredictiveClusteringAlgo(data, transformer=mappingproxy({}), var_names=None, **parameters)
Bases: MLClusteringAlgo
Predictive clustering algorithm.
The inheriting classes shall overload the MLUnsupervisedAlgo._fit() method, and the MLClusteringAlgo._predict() and MLClusteringAlgo._predict_proba() methods if possible.
- Parameters:
data (IODataset) – The learning dataset.
transformer (TransformerType) – The strategies to transform the variables. The values are instances of Transformer while the keys are the names of either the variables or the groups of variables, e.g. "inputs" or "outputs" in the case of the regression algorithms. If a group is specified, the Transformer will be applied to all the variables of this group. If IDENTITY, do not transform the variables. By default it is set to {}.
var_names (Iterable[str] | None) – The names of the variables. If None, consider all variables mentioned in the learning dataset.
**parameters (MLAlgoParameterType) – The parameters of the machine learning algorithm.
- Raises:
ValueError – When both the variable and the group it belongs to have a transformer.
- predict(data)
Predict the clusters from the input data.
The user can specify these input data either as a NumPy array, e.g. array([1., 2., 3.]), or as a dictionary, e.g. {'a': array([1.]), 'b': array([2., 3.])}.
If the NumPy arrays are of dimension 2, their i-th rows represent the input data of the i-th sample; while if the NumPy arrays are of dimension 1, there is a single sample.
The type of the output data and the dimension of the output arrays will be consistent with the type of the input data and the dimension of the input arrays.
- Parameters:
data (DataType) – The input data.
- Returns:
The predicted cluster for each input data sample.
- Return type:
int | ndarray
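The input conventions above (array or dictionary, one sample or several) can be sketched with a nearest-center rule. Everything here is an assumption made for illustration: the variable order `INPUT_NAMES`, the centers `CENTERS` and the helper `to_array` are hypothetical, and this standalone `predict` function is not the library method. The sketch returns an `int` for a single sample and an array for several samples, in line with the documented return type.

```python
import numpy as np

INPUT_NAMES = ["a", "b"]  # hypothetical order of the variables
CENTERS = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])  # hypothetical centers


def to_array(data):
    """Concatenate a dictionary of arrays into one array, variable by variable."""
    if isinstance(data, dict):
        return np.concatenate(
            [np.atleast_1d(data[name]) for name in INPUT_NAMES], axis=-1
        )
    return np.asarray(data, dtype=float)


def predict(data):
    """Return the index of the nearest center for each sample."""
    array = to_array(data)
    single_sample = array.ndim == 1  # dimension 1 means a single sample
    samples = np.atleast_2d(array)
    distances = np.linalg.norm(samples[:, None, :] - CENTERS[None, :, :], axis=2)
    clusters = distances.argmin(axis=1)
    return int(clusters[0]) if single_sample else clusters


from_array = predict(np.array([1.0, 2.0, 3.0]))
from_dict = predict({"a": np.array([1.0]), "b": np.array([2.0, 3.0])})
from_batch = predict(np.array([[0.1, 0.0, 0.0], [1.0, 2.0, 3.0]]))
```

In this sketch the array and dictionary forms of the same sample give the same cluster, and a 2-dimensional input gives one cluster index per row.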
- predict_proba(data, hard=True)
Predict the probability of belonging to each cluster from input data.
The user can specify these input data either as a NumPy array, e.g. array([1., 2., 3.]), or as a dictionary, e.g. {'a': array([1.]), 'b': array([2., 3.])}.
If the NumPy arrays are of dimension 2, their i-th rows represent the input data of the i-th sample; while if the NumPy arrays are of dimension 1, there is a single sample.
The dimension of the output array will be consistent with the dimension of the input arrays.
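The `hard` argument is not described in this section. The sketch below assumes that it switches between hard assignments (probability 1 for the predicted cluster and 0 for the others) and soft assignments (a probability for every cluster). The centers `CENTERS` and the `exp(-distance)` weighting are arbitrary choices made for illustration, and this standalone function is not the library method.

```python
import numpy as np

CENTERS = np.array([[0.0], [4.0]])  # hypothetical cluster centers


def predict_proba(data, hard=True):
    """Return the probability of belonging to each cluster for each sample."""
    array = np.asarray(data, dtype=float)
    single_sample = array.ndim == 1  # dimension 1 means a single sample
    samples = np.atleast_2d(array)
    distances = np.linalg.norm(samples[:, None, :] - CENTERS[None, :, :], axis=2)
    if hard:
        # One-hot rows: 1 for the nearest center, 0 elsewhere.
        proba = np.zeros_like(distances)
        proba[np.arange(len(samples)), distances.argmin(axis=1)] = 1.0
    else:
        # Soft rows: weights decreasing with the distance, normalized to sum to 1.
        weights = np.exp(-distances)
        proba = weights / weights.sum(axis=1, keepdims=True)
    return proba[0] if single_sample else proba


hard_proba = predict_proba(np.array([1.0]))
soft_proba = predict_proba(np.array([1.0]), hard=False)
batch_proba = predict_proba(np.array([[1.0], [3.5]]), hard=False)
```

A 1-dimensional input gives one row of probabilities, while a 2-dimensional input gives one row per sample, which matches the statement that the output dimension follows the input dimension.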
- algo: Any
The interfaced machine learning algorithm.
- learning_set: IODataset
The learning dataset.
- n_clusters: int
The number of clusters.
- transformer: dict[str, Transformer]
The strategies to transform the variables, if any.
The values are instances of Transformer while the keys are the names of either the variables or the groups of variables, e.g. “inputs” or “outputs” in the case of the regression algorithms. If a group is specified, the Transformer will be applied to all the variables of this group.