S. Guha, R. Rastogi and K. Shim. ROCK: a robust clustering algorithm for categorical attributes . In Proceedings of International Conference on Data Engineering, 1999.

Problem: Given N data points with boolean/categorical attributes, divide the data points into clusters(groups) such that points in a cluster are similar but dissimilar to those in other clusters.

Traditional clustering algorithms can be classified into partitional clustering and hierarchical clustering. Partitional clustering algorithms divide the point space into k clusters that optimize a certain criterion function. Hierarchical clustering algorithms treat all points as separate clusters initially and pairs of clusters whose centroids/means are the closest are then sucessively merged until the desired number of clusters remain. However, distances between centroids of clusters or points are poor estimate of the somilarity between them.

Consider the market basket database, we want to cluster the customers according to the items they bought. Each customer is treated as a data point and the attributes are the items which values are true/false. Customers with similar buying patterns and belonging to a single cluster may buy a small subset of items from a much larger set that defines the cluster. Thus, using distance to measure the similarity may not be accurate. Especially if the clusters are of different sizes, the distance between a data point and its centroid may be so large that the cluster may be forced to split into smaller clusters. (e.g. a cluster involving common items such as diapers, baby food and toys / oil, salt, rice) Also, using distance may cause 2 cluster with no common items to merge into a single cluster. Besides, when the attribute values of the mean/centroid of a cluster decreases, it's difficult to know if 2 points differ on few attributes or on every attribute with small amounts. In additional, the problem of 0, 0.5 and 1 exits (i think).

One solution is to use Jaccard coefficient to measure similarity. Centroid-based hierarchical clustering schemes cannot be used since the similarity measure is non-meatric and defined for only points in the cluster and not for its centroid. Thus, use either MST or hierarchical clustering with group average. The MST algorithm merges, at each step, the pair of clusters containing the most similar pair of points while the group average algorithm merges the ones for which the average similarity between pairs of points in the clusters is the highest. The MST algorithm is very sensitive to outlier in clusters while the group average algorithm has a tendency to split large clusters.

A pair of points are neighbors if their similarity exceeds a certain threshold (distance or Jaccard coefficient). The number of links between a pair of points is the number of common neighbors. Points belonging to a single cluster will in general have a larger number of common neighbors.

Categorical data: treat each value of each attribute as an item and ignore missing value.

Criterion Function: summation of the product of the number of data points in a cluster and the sum of the ratio of actual links between any 2 points and the expected links between pairs of points in the cluster.

Goodness of merging 2 cluster is ratio of actual links between the 2 clusters and the estimated links.

ROCK algorithm: 1. draw sample data points, 2. cluster with links, 3. label data in disk
cluster with links: treat each data points as cluster and keep a local heap for each order which contains goodness of other clusters in decreasing order. Also keep a global heap which contains the best goodness of each cluster in decreasing order. Extract the max one in the global heap, merge the pair of clusters and update the global and local heap until no more links between any 2 clusters or the number of clusters reaches a user-specifed number.
label data in disk: select a fraction of points in each cluster; calculate the normalized number of neighbors of the point in that cluster and put the point in the cluster with the max normalized number of neighbors.