gemseo.mlearning.clustering.quality.silhouette_measure module
The silhouette score to assess the quality of a clusterer.
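The silhouette coefficient \(s_i\) is a measure of how similar a point \(x_i\) is to its own cluster \(C_{k_i}\) (cohesion) compared to other clusters (separation):
\[s_i = \frac{b_i-a_i}{\max(a_i,b_i)}\]
with \(a_i=\frac{1}{|C_{k_i}|-1} \sum_{j\in C_{k_i}\setminus\{i\} } \|x_i-x_j\|\) and \(b_i = \underset{\ell=1,\cdots,K\atop{\ell\neq k_i}}{\min} \frac{1}{|C_\ell|} \sum_{j\in C_\ell} \|x_i-x_j\|\)
The coefficient \(s_i\) lies in \([-1,1]\); a value close to 1 means that \(x_i\) is much closer to its own cluster than to the nearest other cluster.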
where
\(K\) is the number of clusters,
\(C_k\) are the indices of the points belonging to the cluster \(k\),
\(|C_k|\) is the size of \(C_k\).
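The definitions of \(a_i\), \(b_i\) and \(s_i\) can be sketched in plain Python. This is only an illustration of the formula, not the code used by SilhouetteMeasure; the function name `silhouette_coefficients` is hypothetical, the norm is assumed to be Euclidean, and returning 0 for a point alone in its cluster is a convention borrowed from scikit-learn rather than something stated here.

```python
import math


def silhouette_coefficients(points, labels):
    """Return the silhouette coefficient s_i of each point (hypothetical helper)."""
    # C_k: the indices of the points belonging to the cluster k.
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("At least two clusters are required.")

    coefficients = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            # a_i is undefined for a singleton cluster; assumed convention: s_i = 0.
            coefficients.append(0.0)
            continue
        # a_i: mean distance to the other points of the same cluster (cohesion).
        a_i = sum(math.dist(points[i], points[j]) for j in own if j != i) / (
            len(own) - 1
        )
        # b_i: smallest mean distance to the points of another cluster (separation).
        b_i = min(
            sum(math.dist(points[i], points[j]) for j in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denominator = max(a_i, b_i)
        coefficients.append((b_i - a_i) / denominator if denominator > 0 else 0.0)
    return coefficients


# Two well-separated clusters of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
scores = silhouette_coefficients(points, labels)
silhouette_score = sum(scores) / len(scores)
```

Averaging the coefficients over all the points, as in the last line, gives a single score for the whole clustering.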
- class SilhouetteMeasure(algo, fit_transformers=True)
Bases: BasePredictiveClustererQuality
The silhouette score to assess the quality of a clusterer.
- Parameters:
algo (BasePredictiveClusterer) -- A clustering algorithm.
fit_transformers (bool) --
Whether to re-fit the transformers when using resampling techniques. If False, use the transformers of the algorithm fitted from the whole training dataset.
By default it is set to True.
- compute_bootstrap_measure(n_replicates=100, samples=(), multioutput=True, seed=None)
Evaluate the quality of the ML model using the bootstrap technique.
- Parameters:
n_replicates (int) --
The number of bootstrap replicates.
By default it is set to 100.
samples (Sequence[int]) --
The indices of the learning samples. If empty, use the whole training dataset.
By default it is set to ().
multioutput (bool) --
Whether the quality measure is returned for each component of the outputs. Otherwise, the average quality measure.
By default it is set to True.
seed (int | None) -- The seed of the pseudo-random number generator. If None, an unpredictable generator is used.
- Returns:
The quality of the ML model.
- Return type:
MeasureType
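As a generic illustration of the bootstrap technique, the sketch below draws n_replicates resamples with replacement from the learning indices. The helper `bootstrap_splits` is hypothetical, and evaluating each replicate on the samples left out of the draw (the out-of-bag samples) is an assumption about the usual procedure, not a description of this method's internals.

```python
import random


def bootstrap_splits(n_samples, n_replicates=100, seed=None):
    """Yield (train, test) index lists for each bootstrap replicate (hypothetical helper)."""
    # With seed=None, the generator is seeded unpredictably.
    rng = random.Random(seed)
    all_indices = range(n_samples)
    for _ in range(n_replicates):
        # Draw n_samples indices with replacement: duplicates are expected.
        train = [rng.randrange(n_samples) for _ in all_indices]
        drawn = set(train)
        # Assumed convention: the samples never drawn form the test set.
        test = [i for i in all_indices if i not in drawn]
        yield train, test


replicates = list(bootstrap_splits(10, n_replicates=3, seed=0))
```

A quality measure would then be computed once per replicate and averaged over the replicates.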
- compute_cross_validation_measure(n_folds=5, samples=(), multioutput=True, randomize=True, seed=None)
Evaluate the quality of the ML model using the k-folds technique.
- Parameters:
n_folds (int) --
The number of folds.
By default it is set to 5.
samples (Sequence[int]) --
The indices of the learning samples. If empty, use the whole training dataset.
By default it is set to ().
multioutput (bool) --
Whether the quality measure is returned for each component of the outputs. Otherwise, the average quality measure.
By default it is set to True.
randomize (bool) --
Whether to shuffle the samples before dividing them into folds.
By default it is set to True.
seed (int | None) -- The seed of the pseudo-random number generator. If None, an unpredictable generator is used.
- Returns:
The quality of the ML model.
- Return type:
MeasureType
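The roles of n_folds, randomize and seed can be illustrated with a generic k-fold split. The helper `k_fold_splits` is hypothetical and only sketches the technique; how the library actually assigns the samples to the folds is not stated here.

```python
import random


def k_fold_splits(n_samples, n_folds=5, randomize=True, seed=None):
    """Yield (train, test) index lists for each fold (hypothetical helper)."""
    indices = list(range(n_samples))
    if randomize:
        # Shuffle the samples before dividing them into folds.
        random.Random(seed).shuffle(indices)
    # Fold f takes every n_folds-th index, so the fold sizes differ by at most one.
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for f, test in enumerate(folds):
        # Train on all the folds except the f-th one, which is kept for testing.
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, test


splits = list(k_fold_splits(10, n_folds=5, seed=1))
```

Each sample appears in exactly one test fold, so every sample is used once for evaluation and n_folds - 1 times for training.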
- compute_test_measure(test_data, samples=(), multioutput=True)
Evaluate the quality of the ML model from a test dataset.
- Parameters:
test_data (Dataset) -- The test dataset.
samples (Sequence[int]) --
The indices of the learning samples. If empty, use the whole training dataset.
By default it is set to ().
multioutput (bool) --
Whether the quality measure is returned for each component of the outputs. Otherwise, the average quality measure.
By default it is set to True.
- Returns:
The quality of the ML model.
- Return type:
MeasureType
- algo: BasePredictiveClusterer
The machine learning algorithm whose quality we want to measure.