Bioinformatics Laboratory

ESPRIT: Estimating Species Richness Using Large Collections of 16S rRNA Pyrosequences

Recent metagenomics studies of environmental samples suggest that microbial communities are much more diverse than previously reported, and that deep sequencing will significantly increase the estimate of total species diversity. Massively parallel pyrosequencing technology enables ultra-deep sequencing of complex microbial populations rapidly and inexpensively. However, classifying large collections of 16S ribosomal sequences poses a serious computational challenge for existing algorithms. We proposed a new algorithm, referred to as ESPRIT, which addresses several computational limitations of prior methods. We developed two versions of ESPRIT, one for personal computers and one for computer clusters. The personal-computer version is intended for small and medium-scale datasets and can process several tens of thousands of sequences within a few minutes, while the computer-cluster version is intended for large-scale problems and can analyze several hundred thousand sequences within one day.

Publication

Y. Sun*, Y. Cai*, L. Liu, F. Yu, M. L. Farrell, W. McKendree, and W. Farmerie, (*equal contribution) ESPRIT: Estimating Species Richness Using Large Collections of 16S rRNA Pyrosequences, Nucleic Acids Research, vol. 37, no. 10, e76, 2009.

Documents

Code

Manuscript and supplement accepted by Nucleic Acids Research
User guide

Related Publications

ESPRIT is a standard implementation of the complete-linkage hierarchical clustering method. It can comfortably process several tens of thousands of sequences on a desktop computer. We have also used the algorithm to process 1.1M human gut sequences on a small computer cluster consisting of 100 nodes. Here is the paper.
- Y. Sun, Y. Cai, V. Mai, W. Farmerie, F. Yu, J. Li, and S. Goodison, Advanced Computational Algorithms for Microbial Community Analysis Using Massive 16S rRNA Sequence Data, Nucleic Acids Research, vol. 38, no. 22, e205, 2010.
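As a rough illustration of what complete-linkage clustering does, the sketch below groups sequences into OTUs from a precomputed distance matrix. It is a naive textbook version written for readability, not ESPRIT's implementation; the function name `complete_linkage_otus`, the toy distances, and the thresholds are all assumptions made for this example.

```python
def complete_linkage_otus(dist, threshold):
    """Naive complete-linkage clustering (illustrative, not ESPRIT's code).

    dist      -- symmetric matrix (list of lists) of pairwise distances
    threshold -- stop merging once the closest pair of clusters is farther
                 apart than this value
    Returns a list of clusters, each a sorted list of sequence indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Complete linkage: the distance between two clusters is the
        # LARGEST distance between any member of one and any member of the other.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters under complete linkage.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical distances between four sequences.
toy_dist = [
    [0.00, 0.01, 0.02, 0.10],
    [0.01, 0.00, 0.04, 0.12],
    [0.02, 0.04, 0.00, 0.11],
    [0.10, 0.12, 0.11, 0.00],
]
print(complete_linkage_otus(toy_dist, 0.03))  # [[0, 1], [2], [3]]
print(complete_linkage_otus(toy_dist, 0.05))  # [[0, 1, 2], [3]]
```

At the 0.03 threshold, sequence 2 stays on its own even though it is only 0.02 away from sequence 0, because its distance to sequence 1 (0.04) exceeds the threshold. Under complete linkage, every pair inside an OTU must fall within the cutoff.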

Many existing algorithms, though widely used by the biology community, have not yet been fully or properly benchmarked. They vary widely in their outputs, which makes it difficult to interpret and compare research findings from different research groups. We conducted a large-scale benchmark study to evaluate the performance of each algorithm. One of the reviewers commented that every graduate student and PI using high-throughput sequencing technology for microbial community analysis should read this paper. We hope you find the paper useful.
- Y. Sun, Y. Cai, S. Huse, R. Knight, W. Farmerie, X. Wang and V. Mai, A Large-scale Benchmark Study of Existing Algorithms for Taxonomy-Independent Microbial Community Analysis, Briefings in Bioinformatics, vol. 13, no. 1, pp. 107-121, 2012.

ESPRIT has been implemented on the Novo-G system, the world's most powerful reconfigurable computer for research. Here is the paper.
- C. Pascoe, A. Lawande, H. Lam, A. George, Y. Sun, W. Farmerie, and H. Martin, Reconfigurable Supercomputing with Scalable Systolic Arrays and In-Stream Control for Wavefront Genomics Processing, in Proc. 2010 Symposium on Application Accelerators in High-Performance Computing (SAAHPC10), pp. 1-6, July 2010.

ESPRIT has quadratic, O(N²), computational and space complexity. We are developing a more powerful algorithm capable of handling several tens of millions of 16S rRNA pyrosequences. A preliminary study showed that the new algorithm has close-to-linear computational and space complexity and runs about 500 times faster than ESPRIT. The approach is also useful for other types of biological sequence clustering (e.g., identification of orthologs).
- Y. Cai and Y. Sun, ESPRIT-Tree: Hierarchical Clustering Analysis of Millions of 16S rRNA Pyrosequences in Quasilinear Time, Nucleic Acids Research, vol. 39, no. 14, e95, 2011.
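To give a sense of why quadratic scaling becomes a problem at this size, here is a back-of-the-envelope sketch that counts the distinct sequence pairs among N sequences. The 4-byte storage cost per distance is an assumption made for illustration. The figures are an upper bound for a dense all-pairs matrix, not a description of what ESPRIT actually computes or stores.

```python
def pair_count(n):
    """Number of distinct unordered pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def dense_matrix_bytes(n, bytes_per_distance=4):
    """Bytes needed to hold one distance per pair.

    bytes_per_distance=4 (a single-precision float) is an assumed value.
    """
    return pair_count(n) * bytes_per_distance


for n in (50_000, 1_100_000, 10_000_000):
    pairs = pair_count(n)
    terabytes = dense_matrix_bytes(n) / 1e12
    print(f"N = {n:>10,}: {pairs:.3e} pairs, about {terabytes:,.2f} TB if stored densely")
```

For the 1.1M-sequence dataset mentioned above, this gives roughly 6 × 10^11 pairs, or about 2.4 TB at 4 bytes each. A tenfold increase in N multiplies both figures by about one hundred.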

ESPRIT and ESPRIT-Tree use pairwise sequence alignment, instead of multiple sequence alignment, to compute pairwise distances. It has been suggested that pairwise sequence alignment ignores the secondary structure information of the 16S rRNA gene. We performed a simulation study showing that including secondary structure information does not improve OTU picking performance, but does significantly increase computational complexity.
- X. Wang, Y. Cai, Y. Sun, R. Knight, and V. Mai, Secondary Structure Information Does not Improve OTU Picking for 16S rRNA Sequences, The ISME Journal, vol. 6, no. 7, pp. 1277-1280, 2012.
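For readers unfamiliar with alignment-based distances, the sketch below shows one simple way to turn a pairwise global alignment into a distance between two sequences. The scoring values (`match=1`, `mismatch=-1`, `gap=-1`), the function names, and the definition of distance as the fraction of non-identical alignment columns are assumptions made for illustration. The exact scoring scheme and distance definition used by ESPRIT may differ.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Scoring parameters are illustrative defaults, not ESPRIT's settings.
    Returns the two aligned strings, padded with '-' for gaps.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def alignment_distance(a, b):
    """Fraction of alignment columns that are not identical (gaps count)."""
    x, y = global_align(a, b)
    if not x:
        return 0.0
    return sum(1 for p, q in zip(x, y) if p != q) / len(x)


print(global_align("ACGT", "ACT"))         # ('ACGT', 'AC-T')
print(alignment_distance("ACGT", "AGGT"))  # 0.25
```

Because each distance depends only on the two sequences being compared, no multiple sequence alignment is required, and the pairwise computations are independent of one another.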

While parallel computing is generally not a viable way to scale up an O(N²) algorithm, the quasilinear space and computational complexity of the ESPRIT-Tree algorithm makes it computationally tractable to process tens of millions of sequences on a small computer cluster.
- Y. Cai and Y. Sun, ESPRIT-Forest: Taxonomy Independent Analysis of Tens of Millions of 16S rRNA Pyrosequences Using Parallel Computing, under review.
