gemseo.mlearning.clustering.algos.base_predictive_clusterer module

The base class for clustering algorithms with a prediction method.

class BasePredictiveClusterer(data, settings_model=None, **settings)

Bases: BaseClusterer

The base class for clustering algorithms with a prediction method.

Parameters:
  • data (Dataset) -- The learning dataset.

  • settings_model (BaseMLAlgoSettings | None) -- The machine learning algorithm settings as a Pydantic model. If None, use **settings.

  • **settings (Any) -- The machine learning algorithm settings. These arguments are ignored when settings_model is not None.

Raises:

ValueError -- When both the variable and the group it belongs to have a transformer.

predict(data)

Predict the clusters from the input data.

The user can specify these input data either as a NumPy array, e.g. array([1., 2., 3.]) or as a dictionary, e.g. {'a': array([1.]), 'b': array([2., 3.])}.

If the NumPy arrays are of dimension 2, the i-th row represents the input data of the i-th sample; if they are of dimension 1, there is a single sample.

The type of the output data and the dimension of the output arrays will be consistent with the type of the input data and the dimension of the input arrays.

Parameters:

data (DataType) -- The input data.

Returns:

The predicted cluster for each input data sample.

Return type:

int | ndarray
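The dimension convention described above can be illustrated with a minimal NumPy-only sketch. It does not use GEMSEO and is not the library's implementation: the nearest-center rule, the CENTERS array and the function predict_nearest_center are hypothetical names chosen for illustration, and dictionary inputs are not covered.

```python
import numpy as np

# Hypothetical cluster centers: 2 clusters in a 3-dimensional input space.
CENTERS = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])


def predict_nearest_center(data):
    """Return the index of the nearest center (illustrative rule only)."""
    data = np.asarray(data, dtype=float)
    # A 1D array is interpreted as a single sample.
    single_sample = data.ndim == 1
    samples = np.atleast_2d(data)
    # Squared Euclidean distance from each sample to each center.
    distances = ((samples[:, None, :] - CENTERS[None, :, :]) ** 2).sum(axis=2)
    clusters = distances.argmin(axis=1)
    # Match the output dimension to the input dimension.
    return int(clusters[0]) if single_sample else clusters


# 1D input: one sample, one integer label.
one = predict_nearest_center(np.array([1.0, 2.0, 3.0]))
# 2D input: one row per sample, one label per row.
many = predict_nearest_center(np.array([[1.0, 2.0, 3.0], [0.1, 0.0, -0.1]]))
```

Here a 1D input yields a plain int and a 2D input yields a 1D array with one label per row, which matches the int | ndarray return type.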

predict_proba(data, hard=True)

Predict the probability of belonging to each cluster from input data.

The user can specify these input data either as a NumPy array, e.g. array([1., 2., 3.]) or as a dictionary, e.g. {'a': array([1.]), 'b': array([2., 3.])}.

If the NumPy arrays are of dimension 2, the i-th row represents the input data of the i-th sample; if they are of dimension 1, there is a single sample.

The dimension of the output array will be consistent with the dimension of the input arrays.

Parameters:
  • data (DataType) -- The input data.

  • hard (bool) -- Whether clustering should be hard (True) or soft (False). By default it is set to True.

Returns:

The probability of belonging to each cluster, with shape (n_samples, n_clusters) or (n_clusters,).

Return type:

RealArray
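The difference between hard and soft clustering, and the two possible output shapes, can be sketched with NumPy alone. This is an illustration under stated assumptions, not GEMSEO's implementation: CENTERS and membership are hypothetical names, hard clustering is taken to mean a one-hot indicator of the nearest center, and the soft rule (a softmax over negative squared distances) is an arbitrary choice, since a concrete algorithm defines its own soft probabilities.

```python
import numpy as np

# Hypothetical cluster centers: 2 clusters in a 2-dimensional input space.
CENTERS = np.array([[0.0, 0.0], [4.0, 4.0]])


def membership(data, hard=True):
    """Return cluster membership probabilities (illustrative rule only)."""
    data = np.asarray(data, dtype=float)
    # A 1D array is interpreted as a single sample.
    single_sample = data.ndim == 1
    samples = np.atleast_2d(data)
    # Squared Euclidean distance from each sample to each center.
    sq_dist = ((samples[:, None, :] - CENTERS[None, :, :]) ** 2).sum(axis=2)
    if hard:
        # Hard clustering: probability 1 for the nearest center, 0 elsewhere.
        proba = np.zeros_like(sq_dist)
        proba[np.arange(len(samples)), sq_dist.argmin(axis=1)] = 1.0
    else:
        # Soft clustering (assumed rule): softmax of negative squared distances,
        # shifted by the row minimum for numerical stability.
        weights = np.exp(-(sq_dist - sq_dist.min(axis=1, keepdims=True)))
        proba = weights / weights.sum(axis=1, keepdims=True)
    # Shape (n_clusters,) for a single sample, else (n_samples, n_clusters).
    return proba[0] if single_sample else proba


hard_one = membership(np.array([1.0, 1.0]))
soft_many = membership(np.array([[1.0, 1.0], [2.0, 2.0]]), hard=False)
```

In both modes each row sums to 1; a sample equidistant from the two centers, such as [2., 2.], receives equal soft probabilities.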