Bioinformatics Laboratory

SLAD: A parallel computational framework for ultra-large-scale sequence clustering

The rapid development of sequencing technology has led to an explosive accumulation of genomic data. Clustering is often the first step in sequence analysis. However, existing sequence clustering methods scale poorly as input data sizes grow at an unprecedented rate. As high-performance computing systems become widely accessible, it is highly desirable for a clustering method to scale easily to large sequence datasets by leveraging parallel computing. In this paper, we introduce SLAD (Separation via Landmark-based Active Divisive clustering), a generic computational framework that can be used to parallelize various de novo operational taxonomic unit (OTU) picking methods and that comes with theoretical guarantees on both accuracy and efficiency. The proposed framework was implemented on Apache Spark, which allows us to utilize parallel computing resources easily and efficiently. Experiments on various datasets showed that SLAD can significantly speed up a number of popular de novo OTU picking methods while maintaining the same level of accuracy. In particular, the experiment on the Earth Microbiome Project dataset (∼2.2B reads, 437GB) demonstrated the excellent scalability of the proposed method.
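The general idea of dividing a dataset around landmark sequences and then clustering each partition independently can be illustrated with a small sketch. This is not the SLAD implementation and does not reproduce its theoretical guarantees: the distance function, the greedy clustering step (a stand-in for an OTU picking method), the thread pool (a stand-in for Spark), and every function and parameter name below are assumptions made purely for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def mismatch_fraction(a, b):
    """Toy distance (assumption): fraction of mismatched positions,
    with any length difference counted as mismatches."""
    n = max(len(a), len(b))
    if n == 0:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / n


def partition_by_landmarks(seqs, k, rng):
    """Pick up to k random landmarks and assign each sequence to its nearest one."""
    landmarks = rng.sample(seqs, min(k, len(seqs)))
    parts = [[] for _ in landmarks]
    for s in seqs:
        nearest = min(range(len(landmarks)),
                      key=lambda j: mismatch_fraction(s, landmarks[j]))
        parts[nearest].append(s)
    return [p for p in parts if p]


def divide(seqs, max_size, k, rng):
    """Recursively split until partitions are small or cannot be split further."""
    if len(seqs) <= max_size:
        return [seqs]
    parts = partition_by_landmarks(seqs, k, rng)
    if len(parts) == 1:  # e.g. all sequences are identical
        return parts
    result = []
    for p in parts:
        result.extend(divide(p, max_size, k, rng))
    return result


def greedy_cluster(seqs, threshold):
    """Hypothetical stand-in for an OTU picker: join the first cluster whose
    seed sequence is within the threshold, else start a new cluster."""
    clusters = []  # list of (seed, members)
    for s in seqs:
        for seed, members in clusters:
            if mismatch_fraction(s, seed) <= threshold:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return [members for _, members in clusters]


def cluster_in_parallel(seqs, threshold=0.03, max_size=1000, k=2, seed=0):
    """Divide with landmarks, then cluster each partition independently."""
    rng = random.Random(seed)
    parts = divide(list(seqs), max_size, k, rng)
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda p: greedy_cluster(p, threshold), parts)
    return [cluster for part_clusters in results for cluster in part_clusters]
```

For example, `cluster_in_parallel(reads, threshold=0.03, max_size=1000)` returns a list of clusters, each a list of sequences. Because partitions are clustered independently, a naive split like this one can separate sequences that belong to the same cluster; controlling that effect is the kind of accuracy concern the guarantees mentioned above are about.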

Software

SLAD: A parallel computational framework for ultra-large-scale sequence clustering

Source code and documentation

Manuscript