Tian Zhang, Gaghu Ramakrishnan, and Miron Linvy, BIRCH: An Efficient Data Clustering Method for Very Large Databases, In Proc. of the ACM SIGMOD Conference on Management of Data, Montreal, Cananda, June, 1996.

Problem definition: Given N data points, find a partition to divide the data points into clusters such that data points are close to each other in each cluster. All data points have the same number of attributes where attributes are positive number in the paper. Vectors are used to represent the data points in the paper. Closeness in the paper is distance-based e.g. Euclidean distance. Clusters can be represented by 3 pieces of information namely, number of data points in the cluster, linear sum of vectors of all its data points and squared sum. Using these 3 pieces of information, we can compute all the usual metrics used to measure the tightness of a cluster and the relationship between cluster, e.g. centroid, radius and average inter-cluster distance etc.

Algorithm: have 4 phase
-Phase 1: build a clustering feature tree to have a rough partition

-Definition of a clustering feature(CF) tree: it has 2 parameters branching factor B and threshold T. Every non-leaf node can has at most B entries. Each entry has 2 parts. 1 is a pointer to its child node. The other is the CF of the cluster of all the subclusters in the child node. Every leaf node entry has only the CF of the subcluster it represents and the radius of the subcluster cannot exceed T. All leaf nodes are chained together using pointers horizontally. Each node is required to fit in a page.

-Insertion into the CF tree: start from the part, follow the pointer of the closest cluster in each node until reaching a leaf node entry. If the entry can accomodate the new data point without violating T, update nodes along the path from the root to the leaf node and done. If violate T, create a new entry in the leaf node and put the new data point in it. If the resulting leaf node can fit in a page, update the ancestors and done. If not, split the leaf node using the farthest entries as the seeds. A new entry must be created in the parent node. If again violating the page limit, repeat the previous procedure. At the node where the split ends, merge the 2 closest entries. If the children of the 2 entried can fit in a page, we free a child node and an entry. If not, redistribute the entries in the 2 child nodes.

-If the size of the CF tree exceeds the main memory M, estimate another higher threshold and rebuild the CF tree for all leaf entries already in the CF tree. We allocate R bytes in disk for ouliers. If the leaf entry contains too few data points, we treat them as outliers and write them to disk. If run out of disk space, insert the outliers into the CF tree again. Or outliers are inserted again at last. Also, we can delay inserting data points which cause splits.

-Problems of Phase 1: splits due to page limit can split natural clusters. also, such insertion only considers the existing subclusters in the CF tree, so a data point may be put in a place it should not be and the data order has a big effect.


-Phase 2(optional): to build a smaller CF tree but scanning the CF tree of phase 1 to ensure the number of subclusters input to the phase 3 within a range phase 3 can handle.


-Phase 3: use traditional clustering algorithm (e.g. hierarchical clustering) to do clustering on the subclustering.


-Phase 4: use the centroids of the subcluster as seeds and scan the data points again to redistribute them in the correct subclusters.


Comment: Since computing the best partition needs to look at all combinations of clustering, it requires many scans of data points which is costly. The algorithm of this paper makes use of the property that data points in a cluster are close to each other and only considers the existing clusters. The algorithm may not find the best partition but it only needs 2 scans of data points with the use of the CF tree. The threshold T is determined dynamically to cope with the main memory M.