Clustering schemes
- Distance-based: assumes metric space on attributes
  - Easier on numerical data since basic notions are natural
  - Distance metric easier to formulate (e.g. weighted Euclidean)
  - “Distance” from a reference point
  - Quantities such as averages, correlations, centroids, and so forth are well-defined.
  - Clustering on categorical data is generally harder since metric notions have to be defined.
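
A minimal sketch of the distance-based notions above, using only the Python standard library. The function names, the toy points, and the attribute weights are illustrative assumptions, and the simple mismatch distance is just one of several ways a metric could be defined on categorical data.

```python
import math

def weighted_euclidean(x, y, weights):
    """Weighted Euclidean distance between two numeric vectors."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))

def centroid(points):
    """Component-wise mean of a non-empty list of numeric vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]

def mismatch_distance(x, y):
    """Fraction of categorical attributes on which two records disagree."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

# Toy numeric data (assumed for illustration).
points = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
c = centroid(points)

# "Distance" from a reference point (here the centroid), with the
# second attribute down-weighted by a hypothetical weight of 0.25.
d = weighted_euclidean(points[0], c, [1.0, 0.25])

# Categorical records need an explicitly chosen notion of distance.
m = mismatch_distance(("red", "small", "round"), ("red", "large", "round"))
```

On numeric data the centroid and the distance fall out of ordinary arithmetic, while for the categorical records the distance had to be chosen by hand, which is the asymmetry the bullets describe.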
- Model-based:
  - estimate a density (e.g. KDE, mixture of Gaussians, …)
  - go bump-hunting (look for modes of the estimated density)
  - compute P(Xi | Cluster j)
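
The model-based steps above can be sketched in one dimension as follows. This is a toy illustration rather than a fitted model: the data, the bandwidth, the grid, and the mixture parameters in `components` are all assumed values. The class-conditional density for cluster j is `gaussian_pdf(x, mean_j, std_j)`, and the posterior returned by `cluster_posteriors` is an added Bayes-rule step that the slide does not spell out.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def kde(x, data, bandwidth):
    """Gaussian kernel density estimate at x."""
    return sum(gaussian_pdf(x, xi, bandwidth) for xi in data) / len(data)

def find_bumps(data, bandwidth, grid):
    """Grid points where the estimated density is a strict local maximum."""
    dens = [kde(g, data, bandwidth) for g in grid]
    return [grid[i] for i in range(1, len(grid) - 1)
            if dens[i - 1] < dens[i] > dens[i + 1]]

def cluster_posteriors(x, components):
    """components: list of (weight, mean, std); returns P(cluster j | x)."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(joint)
    return [j / total for j in joint]

# Toy data with two visible groups (assumed for illustration).
data = [-0.3, 0.0, 0.1, 0.2, 4.7, 5.0, 5.1, 5.2]
grid = [i / 10 for i in range(-20, 81)]        # -2.0 .. 8.0 in steps of 0.1

# Step 1 and 2: estimate a density, then hunt for its bumps (modes).
bumps = find_bumps(data, bandwidth=0.5, grid=grid)

# Step 3: with hypothetical mixture parameters, evaluate cluster membership.
components = [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)]
likelihoods = [gaussian_pdf(2.5, m, s) for _, m, s in components]
posteriors = cluster_posteriors(2.5, components)
```

In practice the mixture parameters would be estimated from the data (for example with EM) instead of being set by hand as they are here.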
- Partition-based:
  - enumerate partitions and score each, usually using some ad hoc scoring scheme (e.g. conceptual clustering in AI)
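
A minimal sketch of the enumerate-and-score idea, assuming a tiny one-dimensional data set. The `score` function is a made-up ad hoc criterion (within-block squared deviation plus a per-block penalty) and is not the scoring used by any particular conceptual-clustering system.

```python
def partitions(items):
    """Yield every partition of a list as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or let it start a new block of its own.
        yield [[first]] + part

def score(partition, penalty=1.0):
    """Hypothetical ad hoc score; lower is better."""
    total = 0.0
    for block in partition:
        mean = sum(block) / len(block)
        total += sum((x - mean) ** 2 for x in block)
    # The penalty discourages the trivial all-singletons partition.
    return total + penalty * len(partition)

data = [1.0, 1.2, 5.0, 5.1]           # toy data (assumed)
all_parts = list(partitions(data))    # every partition of the four items
best = min(all_parts, key=score)
```

The number of partitions of n items is the Bell number of n, which grows very quickly, so exhaustive enumeration like this is only feasible for very small sets; practical schemes search the space heuristically.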
