KDD - what's new

Clustering schemes

Distance-based: assumes a metric space over the attributes

Easier on numerical data, since the basic notions (distance, average) are natural

Distance metric easier to formulate (e.g. weighted Euclidean)
“Distance” from a reference point
Quantities such as averages, correlations, centroids, and so forth are well-defined.
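
As a minimal sketch of why numerical data is convenient, the snippet below computes a centroid and a weighted Euclidean distance in plain Python. The function names, the sample points, and the attribute weights are all hypothetical choices made for illustration; they do not come from the notes.

```python
import math

def weighted_euclidean(x, y, weights):
    """Weighted Euclidean distance: sqrt(sum_k w_k * (x_k - y_k)^2)."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]

# Hypothetical data: three 2-D points and per-attribute weights.
points = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
weights = [1.0, 4.0]

c = centroid(points)                            # reference point
d = weighted_euclidean(points[0], c, weights)   # distance from that reference point
print(c, d)
```

Here the centroid plays the role of the reference point, and the weights let one attribute count more than another.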

Clustering on categorical data is generally harder since metric notions have to be defined.
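
One possible way to define such a notion, shown only as an illustrative assumption and not as the method the notes have in mind, is the simple matching distance: the fraction of attributes on which two records disagree. The records and the function name below are hypothetical.

```python
def simple_matching_distance(x, y):
    """Fraction of attributes on which two categorical records disagree."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("records must be non-empty and of equal length")
    return sum(a != b for a, b in zip(x, y)) / len(x)

# Hypothetical records with three categorical attributes each.
r1 = ("red", "small", "round")
r2 = ("red", "large", "round")
r3 = ("blue", "large", "square")

print(simple_matching_distance(r1, r2))  # differ on one attribute of three
print(simple_matching_distance(r1, r3))  # differ on all three attributes
```

Note that averages and centroids still have no natural meaning for such records, which is part of what makes the categorical case harder.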

Model-based:

estimate a density (e.g. KDE, mixture of Gaussians, …)
go bump-hunting
compute P(X_i|Cluster j)
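
The last step can be sketched for a one-dimensional mixture of Gaussians. The mixture parameters below are hypothetical and assumed to be already estimated; the density-estimation step itself (for example by EM) is not shown. The sketch computes P(X_i | cluster j) and, via Bayes' rule, the cluster membership probabilities P(cluster j | X_i) that an assignment would typically use.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical, already-estimated mixture: (weight, mean, std) per cluster.
components = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]

def likelihoods(x):
    """P(x | cluster j) for every cluster j."""
    return [gaussian_pdf(x, m, s) for _, m, s in components]

def posteriors(x):
    """P(cluster j | x), obtained by weighting and normalising the likelihoods."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(joint)
    return [j / total for j in joint]

for x in (0.0, 2.0, 4.0):
    print(x, likelihoods(x), posteriors(x))
```

A point midway between the two bumps gets equal membership in both clusters, while points near a bump are assigned almost entirely to it.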

Partition-based:

enumerate partitions and score each, usually with some ad hoc scoring scheme (e.g. conceptual clustering in AI)
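
A brute-force sketch of this idea on a tiny data set follows. The scoring function (within-block squared error plus a fixed penalty per block) is an ad hoc choice made up for this illustration, not the scheme used in conceptual clustering, and exhaustive enumeration is only feasible for a handful of items because the number of partitions grows very quickly.

```python
def partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing block in turn...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ...or give it a block of its own.
        yield [[first]] + part

def score(part, penalty=1.0):
    """Hypothetical ad hoc score (lower is better): squared error + penalty per block."""
    total = 0.0
    for block in part:
        mean = sum(block) / len(block)
        total += sum((x - mean) ** 2 for x in block)
    return total + penalty * len(part)

data = [1.0, 1.2, 5.0, 5.1]            # hypothetical 1-D observations
all_parts = list(partitions(data))
best = min(all_parts, key=score)
print(len(all_parts), best, score(best))
```

Without the per-block penalty the score would always prefer putting every item in its own block, which is one reason such scoring schemes tend to need ad hoc terms.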