Abstract: Many important problems involve clustering large datasets. Although naive implementations of clustering are computationally expensive, there are established efficient techniques for clustering when the dataset has either (1) a limited number of clusters, (2) a low feature dimensionality, or (3) a small number of data points. However, there has been much less work on methods of efficiently clustering datasets that are large in all three ways at once--for example, having millions of data points that exist in many thousands of dimensions representing many thousands of clusters. We present a new technique for clustering these large, high-dimensional datasets. The key idea involves using a cheap, approximate distance measure to efficiently...
BibTeX entry:
@misc{ whizbang-efficient,
author = "Andrew Mccallum Whizbang",
title = "Efficient Clustering of High-Dimensional Data Sets with Application to
Reference Matching" }