Multiple sequence alignment of datasets containing many thousands of sequences is a challenging problem with applications in phylogeny estimation, protein structure and function prediction, and taxon identification of metagenomic data, among others. However, few methods can analyze large datasets, and none has been shown to have good accuracy on datasets with more than about 10,000 sequences, especially when the sequences have evolved under high rates of evolution.
In this talk, I will present a new method for obtaining highly accurate estimates of large-scale multiple sequence alignments and phylogenies. The basic idea is to use an ensemble of Hidden Markov Models (HMMs) to represent a "seed alignment", and then align all the remaining sequences to the seed alignment. Our method, UPP, returns very accurate alignments, and trees estimated on these alignments are also very accurate - even on datasets with as many as 1,000,000 sequences, or datasets that contain many fragmentary sequences. Furthermore, UPP is both fast and very scalable: the analysis of the 1-million-taxon dataset took only 24 hours using 12 cores and a small amount of memory. Finally, this Ensemble of HMMs technique improves the accuracy of methods for other bioinformatics problems, including phylogenetic placement and taxon identification of metagenomic data. This is joint work with Siavash Mirarab and Tandy Warnow.
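
To make the ensemble idea concrete, here is a minimal toy sketch, not UPP itself. It assumes simple per-column frequency profiles as stand-ins for HMMs, an arbitrary split of the seed alignment into two subsets, and ungapped queries of the same length as the seed alignment. All names (`build_profile`, `score`, `best_profile`) and the sequences are hypothetical and chosen only for illustration. It shows only the selection step: each remaining sequence is matched to the ensemble member that scores it best, and that member would then be used to align it.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def build_profile(aligned_rows, pseudocount=1.0):
    """Per-column log-probabilities for one subset of the seed alignment.

    Assumption: a frequency profile replaces a real profile HMM, so
    insertions and deletions are not modelled.
    """
    profile = []
    for column in zip(*aligned_rows):
        counts = Counter(c for c in column if c in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append(
            {a: math.log((counts[a] + pseudocount) / total) for a in ALPHABET}
        )
    return profile

def score(profile, query):
    """Log-likelihood of an ungapped, equal-length query under a profile."""
    return sum(col[ch] for col, ch in zip(profile, query))

def best_profile(profiles, query):
    """Index of the ensemble member that scores the query highest."""
    return max(range(len(profiles)), key=lambda i: score(profiles[i], query))

# Hypothetical seed alignment, split into two subsets (assumed decomposition).
seed = ["ACGTAC", "ACGTAC", "ACGAAC", "TTGTCC", "TTGTCC", "TTGACC"]
subsets = [seed[:3], seed[3:]]
profiles = [build_profile(rows) for rows in subsets]

# Remaining sequences are each assigned to their best-scoring model.
queries = ["ACGTAC", "TTGTCC"]
assignments = [best_profile(profiles, q) for q in queries]
```

Because each model summarizes only part of the seed alignment, a query can be scored against a profile built from sequences close to it rather than against a single model averaged over the whole, more divergent dataset.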