So why “Data Mining”? (vs. Pattern Recognition or Statistics?)
- Scalability to large data sets:
  - What if the data is disk-resident and we can afford to scan it only a few times?
  - What if “random” sampling is not sufficient, or is no more efficient than a full scan? What if the data lives on a distributed system?
- Exploratory data analysis (EDA) with large data sets and high dimensionality
- New algorithms are beginning to appear (some without a proper statistical foundation)
- What is possible/impossible with massive data sets?
- Automation: effective use by non-statisticians
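
To make the scalability point concrete (disk-resident data that can be scanned only a few times), here is a minimal sketch of reservoir sampling, one well-known way to draw a uniform random sample in a single sequential pass. It is an illustration only, not a method named on this slide. The names `reservoir_sample` and `records` are hypothetical, and `records` merely stands in for rows read sequentially from disk.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return a uniform random sample of up to k items using one pass.

    Memory use is O(k), regardless of how long the stream is.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly chosen current member.
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def records() -> Iterator[int]:
    """Hypothetical stand-in for rows streamed sequentially from disk."""
    yield from range(100_000)


sample = reservoir_sample(records(), k=5)
print(sample)
```

The data is touched exactly once and never held in memory in full, which is the constraint the scalability bullets describe. Whether such a sample is sufficient for a given analysis is the separate question the slide raises.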
