Parallel Hierarchical Clustering in Linearithmic Time for Ultra-Large-Scale Sequence Analysis

The rapid development of sequencing technology has led to an explosive accumulation of genomics data. Clustering is often the first step to perform in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, the standard hierarchical clustering method scales poorly due to its quadratic time and space complexities stemming mainly from the computation and storage of a pairwise distance matrix. In order to recover the hierarchy of massive sequence data, it is necessary to minimize the number of pairwise distances computed without degrading clustering performance. On the other hand, as high-performance computing clusters are becoming widely accessible, it is highly desirable that a clustering method can be easily adapted to parallel computing environments for further speedup, which is not a trivial task for hierarchical clustering. ] We proposed a new hierarchical clustering method that achieves good clustering performance and high scalability on large sequence datasets. It consists of two stages. In the first stage, a new landmark-based active hierarchical divisive clustering method was proposed that partitions a large-scale sequence dataset into groups, and in the second stage, a fast hierarchical agglomerative clustering method is applied to each group. By assembling hierarchies from both stages, the hierarchy of the data can be easily recovered. Theoretical results showed that our method can recover the true hierarchy with a high probability under some mild conditions and has a linearithmic time complexity with respect to the number of input sequences. The proposed method also facilitates an efficient parallel implementation. Empirical results on various datasets showed that our method achieved clustering accuracy comparable to ESPRIT-Tree and ran faster than CD-HIT and UCLUST on large-scale data.


Q. Mao, W. Zheng, L. Wang, Y. Cai, V. Mai, Y. Sun, Parallel Hierarchical Clustering in Linearithmic Time for Large-scale Sequence Analysis (regular paper), in Proc. IEEE International Conference on Data Mining (ICDM'15), pp. 310-319, 2015. (acceptance rate: 19%. Regular paper: 8%)