Batzoglou's research has focused on the development of algorithms and systems for genomics. Some of the topics he is working on include: sequence alignment algorithms, hidden Markov models, whole-genome comparison, annotation of biological features in genomes, microarray analysis, gene regulation, and DNA sequencing.

Honors & Awards

  • Sloan Research Fellowship, Alfred P. Sloan Foundation
  • Career Award in Computer Science, Alfred P. Sloan Foundation
  • Top 100 Young Technology Innovators, National Science Foundation
  • Best Paper Award, MIT Technology Review Magazine (2003)

Professional Education

  • PhD, MIT (2000)

Research & Scholarship

Current Research and Scholarly Interests

Computational Genomics



All Publications

  • Mutations in early follicular lymphoma progenitors are associated with suppressed antigen presentation. Proceedings of the National Academy of Sciences of the United States of America Green, M. R., Kihira, S., Liu, C. L., Nair, R. V., Salari, R., Gentles, A. J., Irish, J., Stehr, H., Vicente-Dueñas, C., Romero-Camarero, I., Sanchez-Garcia, I., Plevritis, S. K., Arber, D. A., Batzoglou, S., Levy, R., Alizadeh, A. A. 2015; 112 (10): E1116-25


    Follicular lymphoma (FL) is incurable with conventional therapies and has a clinical course typified by multiple relapses after therapy. These tumors are genetically characterized by B-cell leukemia/lymphoma 2 (BCL2) translocation and mutation of genes involved in chromatin modification. By analyzing purified tumor cells, we identified additional novel recurrently mutated genes and confirmed mutations of one or more chromatin modifier genes within 96% of FL tumors and two or more in 76% of tumors. We defined the hierarchy of somatic mutations arising during tumor evolution by analyzing the phylogenetic relationship of somatic mutations across the coding genomes of 59 sequentially acquired biopsies from 22 patients. Among all somatically mutated genes, CREBBP mutations were most significantly enriched within the earliest inferable progenitor. These mutations were associated with a signature of decreased antigen presentation characterized by reduced transcript and protein abundance of MHC class II on tumor B cells, in line with the role of CREBBP in promoting class II transactivator (CIITA)-dependent transcriptional activation of these genes. CREBBP mutant B cells stimulated less proliferation of T cells in vitro compared with wild-type B cells from the same tumor. Transcriptional signatures of tumor-infiltrating T cells were indicative of reduced proliferation, and this corresponded to decreased frequencies of tumor-infiltrating CD4 helper T cells and CD8 memory cytotoxic T cells. These observations therefore implicate CREBBP mutation as an early event in FL evolution that contributes to immune evasion via decreased antigen presentation.

    View details for DOI 10.1073/pnas.1501199112

    View details for PubMedID 25713363

  • Parente2: a fast and accurate method for detecting identity by descent GENOME RESEARCH Rodriguez, J. M., Bercovici, S., Huang, L., Frostig, R., Batzoglou, S. 2015; 25 (2): 280-289


    Identity-by-descent (IBD) inference is the problem of establishing a genetic connection between two individuals through a genomic segment that is inherited by both individuals from a recent common ancestor. IBD inference is an important preceding step in a variety of population genomic studies, ranging from demographic studies to linking genomic variation with phenotype and disease. The problem of accurate IBD detection has become increasingly challenging with the availability of large collections of human genotypes and genomes: Given a cohort's size, a quadratic number of pairwise genome comparisons must be performed. Therefore, computation time and the false discovery rate can also scale quadratically. To enable accurate and efficient large-scale IBD detection, we present Parente2, a novel method for detecting IBD segments. Parente2 is based on an embedded log-likelihood ratio and uses a model that accounts for linkage disequilibrium by explicitly modeling haplotype frequencies. Parente2 operates directly on genotype data without the need to phase data prior to IBD inference. We evaluate Parente2's performance through extensive simulations using real data, and we show that it provides substantially higher accuracy compared to previous state-of-the-art methods while maintaining high computational efficiency.

    View details for DOI 10.1101/gr.173641.114

    View details for Web of Science ID 000348974500012
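
    The quadratic growth mentioned in the abstract can be made concrete with a minimal sketch. It only counts unordered pairs in a cohort; the function name `pairwise_comparisons` is hypothetical and is not part of Parente2.

```python
from itertools import combinations


def pairwise_comparisons(cohort_size: int) -> int:
    """Number of unordered pairs of individuals: n * (n - 1) / 2."""
    return cohort_size * (cohort_size - 1) // 2


# Cross-check the closed form against explicit enumeration for a small cohort.
small_cohort = range(10)
assert len(list(combinations(small_cohort, 2))) == pairwise_comparisons(10)

# Growing the cohort tenfold grows the number of comparisons roughly a hundredfold.
for n in (100, 1_000, 10_000):
    print(n, pairwise_comparisons(n))
```

    At a fixed per-pair error rate, the expected number of spurious pairs grows with the same count, which is why both running time and false discoveries become a concern for large cohorts.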

  • Constraint and divergence of global gene expression in the mammalian embryo. eLife Spies, N., Smith, C. L., Rodriguez, J. M., Baker, J. C., Batzoglou, S., Sidow, A. 2015; 4


    The effects of genetic variation on gene regulation in the developing mammalian embryo remain largely unexplored. To globally quantify these effects, we crossed two divergent mouse strains and asked how genotype of the mother or of the embryo drives gene expression phenotype genomewide. Embryonic expression of 331 genes depends on the genotype of the mother. Embryonic genotype controls allele-specific expression of 1594 genes and a highly overlapping set of cis-expression quantitative trait loci (eQTL). A marked paucity of trans-eQTL suggests that the widespread expression differences do not propagate through the embryonic gene regulatory network. The cis-eQTL genes exhibit lower-than-average evolutionary conservation and are depleted for developmental regulators, consistent with purifying selection acting on expression phenotype of pattern formation genes. The widespread effect of maternal and embryonic genotype in conjunction with the purifying selection we uncovered suggests that embryogenesis is an important and understudied reservoir of phenotypic variation.

    View details for DOI 10.7554/eLife.05538

    View details for PubMedID 25871848

  • Cell-lineage heterogeneity and driver mutation recurrence in pre-invasive breast neoplasia. Genome medicine Weng, Z., Spies, N., Zhu, S. X., Newburger, D. E., Kashef-Haghighi, D., Batzoglou, S., Sidow, A., West, R. B. 2015; 7 (1): 28


    All cells in an individual are related to one another by a bifurcating lineage tree, in which each node is an ancestral cell that divided into two, each branch connects two nodes, and the root is the zygote. When a somatic mutation occurs in an ancestral cell, all its descendants carry the mutation, which can then serve as a lineage marker for the phylogenetic reconstruction of tumor progression. Using this concept, we investigate cell lineage relationships and genetic heterogeneity of pre-invasive neoplasias compared to invasive carcinomas. We deeply sequenced over a thousand phylogenetically informative somatic variants in 66 morphologically independent samples from six patients that represent a spectrum of normal, early neoplasia, carcinoma in situ, and invasive carcinoma. For each patient, we obtained a highly resolved lineage tree that establishes the phylogenetic relationships among the pre-invasive lesions and with the invasive carcinoma. The trees reveal lineage heterogeneity of pre-invasive lesions, both within the same lesion, and between histologically similar ones. On the basis of the lineage trees, we identified a large number of independent recurrences of PIK3CA H1047 mutations in separate lesions in four of the six patients, often separate from the diagnostic carcinoma. Our analyses demonstrate that multi-sample phylogenetic inference provides insights on the origin of driver mutations, lineage heterogeneity of neoplastic proliferations, and the relationship of genomically aberrant neoplasias with the primary tumors. PIK3CA driver mutations may be comparatively benign inducers of cellular proliferation.

    View details for DOI 10.1186/s13073-015-0146-2

    View details for PubMedID 25918554

  • Fast and scalable inference of multi-sample cancer lineages. Genome biology Popic, V., Salari, R., Hajirasouliha, I., Kashef-Haghighi, D., West, R. B., Batzoglou, S. 2015; 16 (1): 91


    Somatic variants can be used as lineage markers for the phylogenetic reconstruction of cancer evolution. Since somatic phylogenetics is complicated by sample heterogeneity, novel specialized tree-building methods are required for cancer phylogeny reconstruction. We present LICHeE (Lineage Inference for Cancer Heterogeneity and Evolution), a novel method that automates the phylogenetic inference of cancer progression from multiple somatic samples. LICHeE uses variant allele frequencies of somatic single nucleotide variants obtained by deep sequencing to reconstruct multi-sample cell lineage trees and infer the subclonal composition of the samples. LICHeE is open source and available at .

    View details for DOI 10.1186/s13059-015-0647-8

    View details for PubMedID 25944252
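
    As a rough illustration of how variant allele frequencies (VAFs) across samples can act as lineage markers, the sketch below computes a VAF from read counts and groups SNVs by the set of samples in which they appear. This is not the LICHeE implementation; the function names and the `presence_threshold` value are hypothetical choices made for the example.

```python
from collections import defaultdict


def variant_allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads at a site that carry the variant allele."""
    if total_reads == 0:
        return 0.0
    return alt_reads / total_reads


def group_by_presence(vafs_by_snv, presence_threshold=0.05):
    """Group SNVs by their presence/absence pattern across samples.

    vafs_by_snv maps an SNV id to a list of VAFs, one per sample.
    presence_threshold is an assumed cutoff below which a VAF counts as absent.
    """
    groups = defaultdict(list)
    for snv_id, vafs in vafs_by_snv.items():
        profile = tuple(int(v >= presence_threshold) for v in vafs)
        groups[profile].append(snv_id)
    return dict(groups)


# Toy data: three SNVs measured in three samples.
example = {
    "snv1": [0.40, 0.30, 0.00],
    "snv2": [0.50, 0.20, 0.01],
    "snv3": [0.00, 0.00, 0.30],
}
print(group_by_presence(example))
```

    SNVs that share a presence pattern are candidates for having arisen on the same branch of the lineage tree, since a mutation in an ancestral cell is inherited by all of its descendants.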

  • An effective filter for IBD detection in large data sets. PloS one Huang, L., Bercovici, S., Rodriguez, J. M., Batzoglou, S. 2014; 9 (3)


    Identity by descent (IBD) inference is the task of computationally detecting genomic segments that are shared between individuals by means of common familial descent. Accurate IBD detection plays an important role in various genomic studies, ranging from mapping disease genes to exploring ancient population histories. The majority of recent work in the field has focused on improving the accuracy of inference, targeting shorter genomic segments that originate from a more ancient common ancestor. The accuracy of these methods, however, is achieved at the expense of high computational cost, resulting in a prohibitively long running time when applied to large cohorts. To enable the study of large cohorts, we introduce SpeeDB, a method that facilitates fast IBD detection in large unphased genotype data sets. Given a target individual and a database of individuals that potentially share IBD segments with the target, SpeeDB applies an efficient opposite-homozygous filter, which excludes chromosomal segments from the database that are highly unlikely to be IBD with the corresponding segments from the target individual. The remaining segments can then be evaluated by any IBD detection method of choice. When examining simulated individuals sharing 4 cM IBD regions, SpeeDB filtered out 99.5% of genomic regions from consideration while retaining 99% of the true IBD segments. Applying the SpeeDB filter prior to detecting IBD in simulated fourth cousins resulted in an overall running time that was 10,000x faster than inferring IBD without the filter and retained 99% of the true IBD segments in the output.

    View details for DOI 10.1371/journal.pone.0092713

    View details for PubMedID 24667521
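
    The idea behind an opposite-homozygous filter can be sketched as follows. Two individuals who share a segment IBD carry at least one haplotype in common there, so, barring genotyping error, they cannot be homozygous for different alleles at any site in the segment. The code below is a simplified illustration and not the SpeeDB algorithm; genotypes are assumed to be coded as 0, 1, or 2 copies of the alternate allele, and `window_size` and `max_opposite` are hypothetical parameters.

```python
def opposite_homozygous(a: int, b: int) -> bool:
    """True when one genotype is 0 (hom-ref) and the other is 2 (hom-alt)."""
    return {a, b} == {0, 2}


def candidate_windows(target, other, window_size=4, max_opposite=0):
    """Return (start, end) marker windows that survive the filter.

    A window is kept when it contains at most max_opposite
    opposite-homozygous sites; all other windows are excluded
    from further IBD analysis.
    """
    if len(target) != len(other):
        raise ValueError("genotype vectors must have the same length")
    kept = []
    for start in range(0, len(target), window_size):
        end = min(start + window_size, len(target))
        n_opposite = sum(
            opposite_homozygous(a, b)
            for a, b in zip(target[start:end], other[start:end])
        )
        if n_opposite <= max_opposite:
            kept.append((start, end))
    return kept


target = [0, 1, 2, 2, 0, 0, 1, 2]
other = [0, 2, 2, 1, 2, 0, 1, 0]
print(candidate_windows(target, other))
```

    In this toy input the second window contains two opposite-homozygous sites and is dropped, so only the first window would be passed on to a full IBD detection method.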

  • Extensive Variation in Chromatin States Across Humans SCIENCE Kasowski, M., Kyriazopoulou-Panagiotopoulou, S., Grubert, F., Zaugg, J. B., Kundaje, A., Liu, Y., Boyle, A. P., Zhang, Q. C., Zakharia, F., Spacek, D. V., Li, J., Xie, D., Olarerin-George, A., Steinmetz, L. M., Hogenesch, J. B., Kellis, M., Batzoglou, S., Snyder, M. 2013; 342 (6159): 750-752


    The majority of disease-associated variants lie outside protein-coding regions, suggesting a link between variation in regulatory regions and disease predisposition. We studied differences in chromatin states using five histone modifications, cohesin, and CTCF in lymphoblastoid lines from 19 individuals of diverse ancestry. We found extensive signal variation in regulatory regions, which often switch between active and repressed states across individuals. Enhancer activity is particularly diverse among individuals, whereas gene expression remains relatively stable. Chromatin variability shows genetic inheritance in trios, correlates with genetic variation and population divergence, and is associated with disruptions of transcription factor binding motifs. Overall, our results provide insights into chromatin variation among humans.

    View details for DOI 10.1126/science.1242510

    View details for Web of Science ID 000326647600047

  • Inference of Tumor Phylogenies with Improved Somatic Mutation Discovery JOURNAL OF COMPUTATIONAL BIOLOGY Salari, R., Saleh, S. S., Kashef-Haghighi, D., Khavari, D., Newburger, D. E., West, R. B., Sidow, A., Batzoglou, S. 2013; 20 (11): 933-944


    Next-generation sequencing technologies provide a powerful tool for studying genome evolution during progression of advanced diseases such as cancer. Although many recent studies have employed new sequencing technologies to detect mutations across multiple, genetically related tumors, current methods do not exploit available phylogenetic information to improve the accuracy of their variant calls. Here, we present a novel algorithm that uses somatic single-nucleotide variations (SNVs) in multiple, related tissue samples as lineage markers for phylogenetic tree reconstruction. Our method then leverages the inferred phylogeny to improve the accuracy of SNV discovery. Experimental analyses demonstrate that our method achieves up to 32% improvement for somatic SNV calling of multiple, related samples over the accuracy of GATK's Unified Genotyper, the state-of-the-art multisample SNV caller.

    View details for DOI 10.1089/cmb.2013.0106

    View details for Web of Science ID 000326577600008

    View details for PubMedID 24195709

  • Short read alignment with populations of genomes. Bioinformatics Huang, L., Popic, V., Batzoglou, S. 2013; 29 (13): i361-i370


    The increasing availability of high-throughput sequencing technologies has led to thousands of human genomes having been sequenced in the past years. Efforts such as the 1000 Genomes Project further add to the availability of human genome variation data. However, to date, there is no method that can map reads of a newly sequenced human genome to a large collection of genomes. Instead, methods rely on aligning reads to a single reference genome. This leads to inherent biases and lower accuracy. To tackle this problem, a new alignment tool BWBBLE is introduced in this article. We (i) introduce a new compressed representation of a collection of genomes, which explicitly tackles the genomic variation observed at every position, and (ii) design a new alignment algorithm based on the Burrows-Wheeler transform that maps short reads from a newly sequenced genome to an arbitrary collection of two or more (up to millions of) genomes with high accuracy and no inherent bias to one specific genome.

    View details for DOI 10.1093/bioinformatics/btt215

    View details for PubMedID 23813006
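
    For readers unfamiliar with the Burrows-Wheeler transform that the alignment algorithm builds on, here is a naive sketch of the transform and its inverse on a single string. It does not represent BWBBLE's compressed multi-genome structure or its search procedure; the `$` sentinel is an assumed end-of-text marker that must not occur in the input.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (quadratic, for illustration)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    # The original text is the rotation that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


print(bwt("banana"))               # annb$aa
print(inverse_bwt(bwt("GATTACA")))  # GATTACA
```

    Practical aligners build the transform from a suffix array and search it with an FM-index rather than materializing rotations; the version above is only meant to show what the transform computes.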

  • Genome evolution during progression to breast cancer GENOME RESEARCH Newburger, D. E., Kashef-Haghighi, D., Weng, Z., Salari, R., Sweeney, R. T., Brunner, A. L., Zhu, S. X., Guo, X., Varma, S., Troxell, M. L., West, R. B., Batzoglou, S., Sidow, A. 2013; 23 (7): 1097-1108


    Cancer evolution involves cycles of genomic damage, epigenetic deregulation, and increased cellular proliferation that eventually culminate in the carcinoma phenotype. Early neoplasias, which are often found concurrently with carcinomas and are histologically distinguishable from normal breast tissue, are less advanced in phenotype than carcinomas and are thought to represent precursor stages. To elucidate their role in cancer evolution we performed comparative whole-genome sequencing of early neoplasias, matched normal tissue, and carcinomas from six patients, for a total of 31 samples. By using somatic mutations as lineage markers we built trees that relate the tissue samples within each patient. On the basis of these lineage trees we inferred the order, timing, and rates of genomic events. In four out of six cases, an early neoplasia and the carcinoma share a mutated common ancestor with recurring aneuploidies, and in all six cases evolution accelerated in the carcinoma lineage. Transition spectra of somatic mutations are stable and consistent across cases, suggesting that accumulation of somatic mutations is a result of increased ancestral cell division rather than specific mutational mechanisms. In contrast to highly advanced tumors that are the focus of much of the current cancer genome sequencing, neither the early neoplasia genomes nor the carcinomas are enriched with potentially functional somatic point mutations. Aneuploidies that occur in common ancestors of neoplastic and tumor cells are the earliest events that affect a large number of genes and may predispose breast tissue to eventual development of invasive carcinoma.

    View details for DOI 10.1101/gr.151670.112

    View details for Web of Science ID 000321119900007

  • Automated cellular annotation for high-resolution images of adult Caenorhabditis elegans. Bioinformatics Aerni, S. J., Liu, X., Do, C. B., Gross, S. S., Nguyen, A., Guo, S. D., Long, F., Peng, H., Kim, S. S., Batzoglou, S. 2013; 29 (13): i18-26


    Advances in high-resolution microscopy have recently made possible the analysis of gene expression at the level of individual cells. The fixed lineage of cells in the adult worm Caenorhabditis elegans makes this organism an ideal model for studying complex biological processes like development and aging. However, annotating individual cells in images of adult C. elegans typically requires expertise and significant manual effort. Automation of this task is therefore critical to enabling high-resolution studies of a large number of genes. In this article, we describe an automated method for annotating a subset of 154 cells (including various muscle, intestinal and hypodermal cells) in high-resolution images of adult C. elegans. We formulate the task of labeling cells within an image as a combinatorial optimization problem, where the goal is to minimize a scoring function that compares cells in a test input image with cells from a training atlas of manually annotated worms according to various spatial and morphological characteristics. We propose an approach for solving this problem based on reduction to minimum-cost maximum-flow and apply a cross-entropy-based learning algorithm to tune the weights of our scoring function. We achieve 84% median accuracy across a set of 154 cell labels in this highly variable system. These results demonstrate the feasibility of the automatic annotation of microscopy-based images in adult C. elegans.

    View details for DOI 10.1093/bioinformatics/btt223

    View details for PubMedID 23812982

  • Ancestry Inference in Complex Admixtures via Variable-length Markov Chain Linkage Models JOURNAL OF COMPUTATIONAL BIOLOGY Rodriguez, J. M., Bercovici, S., Elmore, M., Batzoglou, S. 2013; 20 (3): 199-211


    Inferring the ancestral origin of chromosomal segments in admixed individuals is key for genetic applications, ranging from analyzing population demographics and history, to mapping disease genes. Previous methods addressed ancestry inference by using either weak models of linkage disequilibrium, or large models that make explicit use of ancestral haplotypes. In this paper we introduce ALLOY, an efficient method that incorporates generalized, but highly expressive, linkage disequilibrium models. ALLOY applies a factorial hidden Markov model to capture the parallel process producing the maternal and paternal admixed haplotypes, and models the background linkage disequilibrium in the ancestral populations via an inhomogeneous variable-length Markov chain. We test ALLOY in a broad range of scenarios ranging from recent to ancient admixtures with up to four ancestral populations. We show that ALLOY outperforms the previous state of the art, and is robust to uncertainties in model parameters.

    View details for DOI 10.1089/cmb.2012.0088

    View details for Web of Science ID 000315888500003

    View details for PubMedID 23421795

  • An accurate method for inferring relatedness in large datasets of unphased genotypes via an embedded likelihood-ratio test. Rodriguez, J., Batzoglou, S., Bercovici, S. 2013
  • An integrated encyclopedia of DNA elements in the human genome NATURE Dunham, I., Kundaje, A., Aldred, S. F., Collins, P. J., Davis, C., Doyle, F., Epstein, C. B., Frietze, S., Harrow, J., Kaul, R., Khatun, J., Lajoie, B. R., Landt, S. G., Lee, B., Pauli, F., Rosenbloom, K. R., Sabo, P., Safi, A., Sanyal, A., Shoresh, N., Simon, J. M., Song, L., Trinklein, N. D., Altshuler, R. C., Birney, E., Brown, J. B., Cheng, C., Djebali, S., Dong, X., Dunham, I., Ernst, J., Furey, T. S., Gerstein, M., Giardine, B., Greven, M., Hardison, R. C., Harris, R. S., Herrero, J., Hoffman, M. M., Iyer, S., Kellis, M., Khatun, J., Kheradpour, P., Kundaje, A., Lassmann, T., Li, Q., Lin, X., Marinov, G. K., Merkel, A., Mortazavi, A., Parker, S. C., Reddy, T. E., Rozowsky, J., Schlesinger, F., Thurman, R. E., Wang, J., Ward, L. D., Whitfield, T. W., Wilder, S. P., Wu, W., Xi, H. S., Yip, K. Y., Zhuang, J., Bernstein, B. E., Birney, E., Dunham, I., Green, E. D., Gunter, C., Snyder, M., Pazin, M. J., Lowdon, R. F., Dillon, L. A., Adams, L. B., Kelly, C. J., Zhang, J., Wexler, J. R., Green, E. D., Good, P. J., Feingold, E. A., Bernstein, B. E., Birney, E., Crawford, G. E., Dekker, J., Elnitski, L., Farnham, P. J., Gerstein, M., Giddings, M. C., Gingeras, T. R., Green, E. D., Guigo, R., Hardison, R. C., Hubbard, T. J., Kellis, M., Kent, W. J., Lieb, J. D., Margulies, E. H., Myers, R. M., Snyder, M., Stamatoyannopoulos, J. A., Tenenbaum, S. A., Weng, Z., White, K. P., Wold, B., Khatun, J., Yu, Y., Wrobel, J., Risk, B. A., Gunawardena, H. P., Kuiper, H. C., Maier, C. W., Xie, L., Chen, X., Giddings, M. C., Bernstein, B. E., Epstein, C. B., Shoresh, N., Ernst, J., Kheradpour, P., Mikkelsen, T. S., Gillespie, S., Goren, A., Ram, O., Zhang, X., Wang, L., Issner, R., Coyne, M. J., Durham, T., Ku, M., Truong, T., Ward, L. D., Altshuler, R. C., Eaton, M. L., Kellis, M., Djebali, S., Davis, C. 
A., Merkel, A., Dobin, A., Lassmann, T., Mortazavi, A., Tanzer, A., Lagarde, J., Lin, W., Schlesinger, F., Xue, C., Marinov, G. K., Khatun, J., Williams, B. A., Zaleski, C., Rozowsky, J., Roeder, M., Kokocinski, F., Abdelhamid, R. F., Alioto, T., Antoshechkin, I., Baer, M. T., Batut, P., Bell, I., Bell, K., Chakrabortty, S., Chen, X., Chrast, J., Curado, J., Derrien, T., Drenkow, J., Dumais, E., Dumais, J., Duttagupta, R., Fastuca, M., Fejes-Toth, K., Ferreira, P., Foissac, S., Fullwood, M. J., Gao, H., Gonzalez, D., Gordon, A., Gunawardena, H. P., Howald, C., Jha, S., Johnson, R., Kapranov, P., King, B., Kingswood, C., Li, G., Luo, O. J., Park, E., Preall, J. B., Presaud, K., Ribeca, P., Risk, B. A., Robyr, D., Ruan, X., Sammeth, M., Sandhu, K. S., Schaeffer, L., See, L., Shahab, A., Skancke, J., Suzuki, A. M., Takahashi, H., Tilgner, H., Trout, D., Walters, N., Wang, H., Wrobel, J., Yu, Y., Hayashizaki, Y., Harrow, J., Gerstein, M., Hubbard, T. J., Reymond, A., Antonarakis, S. E., Hannon, G. J., Giddings, M. C., Ruan, Y., Wold, B., Carninci, P., Guigo, R., Gingeras, T. R., Rosenbloom, K. R., Sloan, C. A., Learned, K., Malladi, V. S., Wong, M. C., Barber, G., Cline, M. S., Dreszer, T. R., Heitner, S. G., Karolchik, D., Kent, W. J., Kirkup, V. M., Meyer, L. R., Long, J. C., Maddren, M., Raney, B. J., Furey, T. S., Song, L., Grasfeder, L. L., Giresi, P. G., Lee, B., Battenhouse, A., Sheffield, N. C., Simon, J. M., Showers, K. A., Safi, A., London, D., Bhinge, A. A., Shestak, C., Schaner, M. R., Kim, S. K., Zhang, Z. Z., Mieczkowski, P. A., Mieczkowska, J. O., Liu, Z., McDaniell, R. M., Ni, Y., Rashid, N. U., Kim, M. J., Adar, S., Zhang, Z., Wang, T., Winter, D., Keefe, D., Birney, E., Iyer, V. R., Lieb, J. D., Crawford, G. E., Li, G., Sandhu, K. S., Zheng, M., Wang, P., Luo, O. J., Shahab, A., Fullwood, M. J., Ruan, X., Ruan, Y., Myers, R. M., Pauli, F., Williams, B. A., Gertz, J., Marinov, G. K., Reddy, T. E., Vielmetter, J., Partridge, E. C., Trout, D., Varley, K. 
E., Gasper, C., Bansal, A., Pepke, S., Jain, P., Amrhein, H., Bowling, K. M., Anaya, M., Cross, M. K., King, B., Muratet, M. A., Antoshechkin, I., Newberry, K. M., McCue, K., Nesmith, A. S., Fisher-Aylor, K. I., Pusey, B., DeSalvo, G., Parker, S. L., Balasubramanian, S., Davis, N. S., Meadows, S. K., Eggleston, T., Gunter, C., Newberry, J. S., Levy, S. E., Absher, D. M., Mortazavi, A., Wong, W. H., Wold, B., Blow, M. J., Visel, A., Pennachio, L. A., Elnitski, L., Margulies, E. H., Parker, S. C., Petrykowska, H. M., Abyzov, A., Aken, B., Barrell, D., Barson, G., Berry, A., Bignell, A., Boychenko, V., Bussotti, G., Chrast, J., Davidson, C., Derrien, T., Despacio-Reyes, G., Diekhans, M., Ezkurdia, I., Frankish, A., Gilbert, J., Gonzalez, J. M., Griffiths, E., Harte, R., Hendrix, D. A., Howald, C., Hunt, T., Jungreis, I., Kay, M., Khurana, E., Kokocinski, F., Leng, J., Lin, M. F., Loveland, J., Lu, Z., Manthravadi, D., Mariotti, M., Mudge, J., Mukherjee, G., Notredame, C., Pei, B., Rodriguez, J. M., Saunders, G., Sboner, A., Searle, S., Sisu, C., Snow, C., Steward, C., Tanzer, A., Tapanari, E., Tress, M. L., van Baren, M. J., Walters, N., Washietl, S., Wilming, L., Zadissa, A., Zhang, Z., Brent, M., Haussler, D., Kellis, M., Valencia, A., Gerstein, M., Reymond, A., Guigo, R., Harrow, J., Hubbard, T. J., Landt, S. G., Frietze, S., Abyzov, A., Addleman, N., Alexander, R. P., Auerbach, R. K., Balasubramanian, S., Bettinger, K., Bhardwaj, N., Boyle, A. P., Cao, A. R., Cayting, P., Charos, A., Cheng, Y., Cheng, C., Eastman, C., Euskirchen, G., Fleming, J. D., Grubert, F., Habegger, L., Hariharan, M., Harmanci, A., Iyengar, S., Jin, V. X., Karczewski, K. J., Kasowski, M., Lacroute, P., Lam, H., Lamarre-Vincent, N., Leng, J., Lian, J., Lindahl-Allen, M., Min, R., Miotto, B., Monahan, H., Moqtaderi, Z., Mu, X. 
J., O'Geen, H., Ouyang, Z., Patacsil, D., Pei, B., Raha, D., Ramirez, L., Reed, B., Rozowsky, J., Sboner, A., Shi, M., Sisu, C., Slifer, T., Witt, H., Wu, L., Xu, X., Yan, K., Yang, X., Yip, K. Y., Zhang, Z., Struhl, K., Weissman, S. M., Gerstein, M., Farnham, P. J., Snyder, M., Tenenbaum, S. A., Penalva, L. O., Doyle, F., Karmakar, S., Landt, S. G., Bhanvadia, R. R., Choudhury, A., Domanus, M., Ma, L., Moran, J., Patacsil, D., Slifer, T., Victorsen, A., Yang, X., Snyder, M., White, K. P., Auer, T., Centanin, L., Eichenlaub, M., Gruhl, F., Heermann, S., Hoeckendorf, B., Inoue, D., Kellner, T., Kirchmaier, S., Mueller, C., Reinhardt, R., Schertel, L., Schneider, S., Sinn, R., Wittbrodt, B., Wittbrodt, J., Weng, Z., Whitfield, T. W., Wang, J., Collins, P. J., Aldred, S. F., Trinklein, N. D., Partridge, E. C., Myers, R. M., Dekker, J., Jain, G., Lajoie, B. R., Sanyal, A., Balasundaram, G., Bates, D. L., Byron, R., Canfield, T. K., Diegel, M. J., Dunn, D., Ebersol, A. K., Frum, T., Garg, K., Gist, E., Hansen, R. S., Boatman, L., Haugen, E., Humbert, R., Jain, G., Johnson, A. K., Johnson, E. M., Kutyavin, T. V., Lajoie, B. R., Lee, K., Lotakis, D., Maurano, M. T., Neph, S. J., Neri, F. V., Nguyen, E. D., Qu, H., Reynolds, A. P., Roach, V., Rynes, E., Sabo, P., Sanchez, M. E., Sandstrom, R. S., Sanyal, A., Shafer, A. O., Stergachis, A. B., Thomas, S., Thurman, R. E., Vernot, B., Vierstra, J., Vong, S., Wang, H., Weaver, M. A., Yan, Y., Zhang, M., Akey, J. M., Bender, M., Dorschner, M. O., Groudine, M., MacCoss, M. J., Navas, P., Stamatoyannopoulos, G., Kaul, R., Dekker, J., Stamatoyannopoulos, J. A., Dunham, I., Beal, K., Brazma, A., Flicek, P., Herrero, J., Johnson, N., Keefe, D., Lukk, M., Luscombe, N. M., Sobral, D., Vaquerizas, J. M., Wilder, S. P., Batzoglou, S., Sidow, A., Hussami, N., Kyriazopoulou-Panagiotopoulou, S., Libbrecht, M. W., Schaub, M. A., Kundaje, A., Hardison, R. C., Miller, W., Giardine, B., Harris, R. S., Wu, W., Bickel, P. 
J., Banfai, B., Boley, N. P., Brown, J. B., Huang, H., Li, Q., Li, J. J., Noble, W. S., Bilmes, J. A., Buske, O. J., Hoffman, M. M., Sahu, A. D., Kharchenko, P. V., Park, P. J., Baker, D., Taylor, J., Weng, Z., Iyer, S., Dong, X., Greven, M., Lin, X., Wang, J., Xi, H. S., Zhuang, J., Gerstein, M., Alexander, R. P., Balasubramanian, S., Cheng, C., Harmanci, A., Lochovsky, L., Min, R., Mu, X. J., Rozowsky, J., Yan, K., Yip, K. Y., Birney, E. 2012; 489 (7414): 57-74


    The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

    View details for DOI 10.1038/nature11247

    View details for Web of Science ID 000308347000039

    View details for PubMedID 22955616

  • Architecture of the human regulatory network derived from ENCODE data NATURE Gerstein, M. B., Kundaje, A., Hariharan, M., Landt, S. G., Yan, K., Cheng, C., Mu, X. J., Khurana, E., Rozowsky, J., Alexander, R., Min, R., Alves, P., Abyzov, A., Addleman, N., Bhardwaj, N., Boyle, A. P., Cayting, P., Charos, A., Chen, D. Z., Cheng, Y., Clarke, D., Eastman, C., Euskirchen, G., Frietze, S., Fu, Y., Gertz, J., Grubert, F., Harmanci, A., Jain, P., Kasowski, M., Lacroute, P., Leng, J., Lian, J., Monahan, H., O'Geen, H., Ouyang, Z., Partridge, E. C., Patacsil, D., Pauli, F., Raha, D., Ramirez, L., Reddy, T. E., Reed, B., Shi, M., Slifer, T., Wang, J., Wu, L., Yang, X., Yip, K. Y., Zilberman-Schapira, G., Batzoglou, S., Sidow, A., Farnham, P. J., Myers, R. M., Weissman, S. M., Snyder, M. 2012; 489 (7414): 91-100


    Transcription factors bind in a combinatorial fashion to specify the on-and-off states of genes; the ensemble of these binding events forms a regulatory network, constituting the wiring diagram for a cell. To examine the principles of the human transcriptional regulatory network, we determined the genomic binding information of 119 transcription-related factors in over 450 distinct experiments. We found the combinatorial, co-association of transcription factors to be highly context specific: distinct combinations of factors bind at specific genomic locations. In particular, there are significant differences in the binding proximal and distal to genes. We organized all the transcription factor binding into a hierarchy and integrated it with other genomic information (for example, microRNA regulation), forming a dense meta-network. Factors at different levels have different properties; for instance, top-level transcription factors more strongly influence expression and middle-level ones co-regulate targets to mitigate information-flow bottlenecks. Moreover, these co-regulations give rise to many enriched network motifs (for example, noise-buffering feed-forward loops). Finally, more connected network components are under stronger selection and exhibit a greater degree of allele-specific activity (that is, differential binding to the two parental alleles). The regulatory information obtained in this study will be crucial for interpreting personal genome sequences and understanding basic principles of human biology and disease.

    View details for DOI 10.1038/nature11245

    View details for Web of Science ID 000308347000042

    View details for PubMedID 22955619

  • Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements GENOME RESEARCH Kundaje, A., Kyriazopoulou-Panagiotopoulou, S., Libbrecht, M., Smith, C. L., Raha, D., Winters, E. E., Johnson, S. M., Snyder, M., Batzoglou, S., Sidow, A. 2012; 22 (9): 1735-1747


    Gene regulation at functional elements (e.g., enhancers, promoters, insulators) is governed by an interplay of nucleosome remodeling, histone modifications, and transcription factor binding. To enhance our understanding of gene regulation, the ENCODE Consortium has generated a wealth of ChIP-seq data on DNA-binding proteins and histone modifications. We additionally generated nucleosome positioning data on two cell lines, K562 and GM12878, by MNase digestion and high-depth sequencing. Here we relate 14 chromatin signals (12 histone marks, DNase, and nucleosome positioning) to the binding sites of 119 DNA-binding proteins across a large number of cell lines. We developed a new method for unsupervised pattern discovery, the Clustered AGgregation Tool (CAGT), which accounts for the inherent heterogeneity in signal magnitude, shape, and implicit strand orientation of chromatin marks. We applied CAGT on a total of 5084 data set pairs to obtain an exhaustive catalog of high-resolution patterns of histone modifications and nucleosome positioning signals around bound transcription factors. Our analyses reveal extensive heterogeneity in how histone modifications are deposited, and how nucleosomes are positioned around binding sites. With the exception of the CTCF/cohesin complex, asymmetry of nucleosome positioning is predominant. Asymmetry of histone modifications is also widespread, for all types of chromatin marks examined, including promoter, enhancer, elongation, and repressive marks. The fine-resolution signal shapes discovered by CAGT unveiled novel correlation patterns between chromatin marks, nucleosome positioning, and sequence content. Meta-analyses of the signal profiles revealed a common vocabulary of chromatin signals shared across multiple cell lines and binding proteins.

    View details for DOI 10.1101/gr.136366.111

    View details for Web of Science ID 000308272800015

    View details for PubMedID 22955985

  • Linking disease associations with regulatory information in the human genome GENOME RESEARCH Schaub, M. A., Boyle, A. P., Kundaje, A., Batzoglou, S., Snyder, M. 2012; 22 (9): 1748-1759


    Genome-wide association studies have been successful in identifying single nucleotide polymorphisms (SNPs) associated with a large number of phenotypes. However, an associated SNP is likely part of a larger region of linkage disequilibrium. This makes it difficult to precisely identify the SNPs that have a biological link with the phenotype. We have systematically investigated the association of multiple types of ENCODE data with disease-associated SNPs and show that there is significant enrichment for functional SNPs among the currently identified associations. This enrichment is strongest when integrating multiple sources of functional information and when highest confidence disease-associated SNPs are used. We propose an approach that integrates multiple types of functional data generated by the ENCODE Consortium to help identify "functional SNPs" that may be associated with the disease phenotype. Our approach generates putative functional annotations for up to 80% of all previously reported associations. We show that for most associations, the functional SNP most strongly supported by experimental evidence is a SNP in linkage disequilibrium with the reported association rather than the reported SNP itself. Our results show that the experimental data sets generated by the ENCODE Consortium can be successfully used to suggest functional hypotheses for variants associated with diseases and other phenotypes.

    View details for DOI 10.1101/gr.136127.111

    View details for Web of Science ID 000308272800016

    View details for PubMedID 22955986

  • ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia GENOME RESEARCH Landt, S. G., Marinov, G. K., Kundaje, A., Kheradpour, P., Pauli, F., Batzoglou, S., Bernstein, B. E., Bickel, P., Brown, J. B., Cayting, P., Chen, Y., DeSalvo, G., Epstein, C., Fisher-Aylor, K. I., Euskirchen, G., Gerstein, M., Gertz, J., Hartemink, A. J., Hoffman, M. M., Iyer, V. R., Jung, Y. L., Karmakar, S., Kellis, M., Kharchenko, P. V., Li, Q., Liu, T., Liu, X. S., Ma, L., Milosavljevic, A., Myers, R. M., Park, P. J., Pazin, M. J., Perry, M. D., Raha, D., Reddy, T. E., Rozowsky, J., Shoresh, N., Sidow, A., Slattery, M., Stamatoyannopoulos, J. A., Tolstorukov, M. Y., White, K. P., Xi, S., Farnham, P. J., Lieb, J. D., Wold, B. J., Snyder, M. 2012; 22 (9): 1813-1831


    Chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq) has become a valuable and widely used approach for mapping the genomic location of transcription-factor binding and histone modifications in living cells. Despite its widespread use, there are considerable differences in how these experiments are conducted, how the results are scored and evaluated for quality, and how the data and metadata are archived for public use. These practices affect the quality and utility of any global ChIP experiment. Through our experience in performing ChIP-seq experiments, the ENCODE and modENCODE consortia have developed a set of working standards and guidelines for ChIP experiments that are updated routinely. The current guidelines address antibody validation, experimental replication, sequencing depth, data and metadata reporting, and data quality assessment. We discuss how ChIP quality, assessed in these ways, affects different uses of ChIP-seq data. All data sets used in the analysis have been deposited for public viewing and downloading at the ENCODE and modENCODE portals.

    View details for DOI 10.1101/gr.136184.111

    View details for Web of Science ID 000308272800021

    View details for PubMedID 22955991

  • The Human OligoGenome Resource: a database of oligonucleotide capture probes for resequencing target regions across the human genome. Nucleic acids research Newburger, D. E., Natsoulis, G., Grimes, S., Bell, J. M., Davis, R. W., Batzoglou, S., Ji, H. P. 2012; 40 (Database issue): D1137-43


    Recent exponential growth in the throughput of next-generation DNA sequencing platforms has dramatically spurred the use of accessible and scalable targeted resequencing approaches. This includes candidate region diagnostic resequencing and novel variant validation from whole genome or exome sequencing analysis. We have previously demonstrated that selective genomic circularization is a robust in-solution approach for capturing and resequencing thousands of target human genome loci such as exons and regulatory sequences. To facilitate the design and production of customized capture assays for any given region in the human genome, we developed the Human OligoGenome Resource. This online database contains over 21 million capture oligonucleotide sequences. It enables one to create customized and highly multiplexed resequencing assays of target regions across the human genome and is not restricted to coding regions. In total, this resource provides 92.1% in silico coverage of the human genome. The online server allows researchers to download a complete repository of oligonucleotide probes and design customized capture assays to target multiple regions throughout the human genome. The website has query tools for selecting and evaluating capture oligonucleotides from specified genomic regions.

    View details for DOI 10.1093/nar/gkr973

    View details for PubMedID 22102592

  • Ancestry inference in complex admixtures via variable-length Markov chain linkage models. Bercovici, S., Rodriguez, J., Elmore, M., Batzoglou, S. 2012
  • The Human OligoGenome Resource: a database of oligonucleotide capture probes for resequencing target regions across the human genome NUCLEIC ACIDS RESEARCH Newburger, D. E., Natsoulis, G., Grimes, S., Bell, J. M., Davis, R. W., Batzoglou, S., Ji, H. P. 2012; 40 (D1): D1137-D1143

    View details for DOI 10.1093/nar/gkr973

    View details for Web of Science ID 000298601300170

  • Reconstruction of genealogical relationships with applications to Phase III of HapMap Kyriazopoulou-Panagiotopoulou, S., Haghighi, D. K., Aerni, S. J., Sundquist, A., Bercovici, S., Batzoglou, S. OXFORD UNIV PRESS. 2011: I333-I341


    Accurate inference of genealogical relationships between pairs of individuals is paramount in association studies, forensics and evolutionary analyses of wildlife populations. Current methods for relationship inference consider only a small set of close relationships and have limited to no power to distinguish between relationships with the same number of meioses separating the individuals under consideration (e.g. aunt-niece versus niece-aunt or first cousins versus great aunt-niece). We present CARROT (ClAssification of Relationships with ROTations), a novel framework for relationship inference that leverages linkage information to differentiate between rotated relationships, that is, between relationships with the same number of common ancestors and the same number of meioses separating the individuals under consideration. We demonstrate that CARROT clearly outperforms existing methods on simulated data. We also applied CARROT on four populations from Phase III of the HapMap Project and detected previously unreported pairs of third- and fourth-degree relatives. Source code for CARROT is freely available.

    View details for DOI 10.1093/bioinformatics/btr243

    View details for Web of Science ID 000291752600041

    View details for PubMedID 21685089

  • A User's Guide to the Encyclopedia of DNA Elements (ENCODE) PLOS BIOLOGY Myers, R. M., Stamatoyannopoulos, J., Snyder, M., Dunham, I., Hardison, R. C., Bernstein, B. E., Gingeras, T. R., Kent, W. J., Birney, E., Wold, B., Crawford, G. E., Bernstein, B. E., Epstein, C. B., Shoresh, N., Ernst, J., Mikkelsen, T. S., Kheradpour, P., Zhang, X., Wang, L., Issner, R., Coyne, M. J., Durham, T., Ku, M., Thanh Truong, T., Ward, L. D., Altshuler, R. C., Lin, M. F., Kellis, M., Gingeras, T. R., Davis, C. A., Kapranov, P., Dobin, A., Zaleski, C., Schlesinger, F., Batut, P., Chakrabortty, S., Jha, S., Lin, W., Drenkow, J., Wang, H., Bell, K., Gao, H., Bell, I., Dumais, E., Dumais, J., Antonarakis, S. E., Ucla, C., Borel, C., Guigo, R., Djebali, S., Lagarde, J., Kingswood, C., Ribeca, P., Sammeth, M., Alioto, T., Merkel, A., Tilgner, H., Carninci, P., Hayashizaki, Y., Lassmann, T., Takahashi, H., Abdelhamid, R. F., Hannon, G., Fejes-Toth, K., Preall, J., Gordon, A., Sotirova, V., Reymond, A., Howald, C., Graison, E. A., Chrast, J., Ruan, Y., Ruan, X., Shahab, A., Poh, W. T., Wei, C., Crawford, G. E., Furey, T. S., Boyle, A. P., Sheffield, N. C., Song, L., Shibata, Y., Vales, T., Winter, D., Zhang, Z., London, D., Wang, T., Birney, E., Keefe, D., Iyer, V. R., Lee, B., McDaniell, R. M., Liu, Z., Battenhouse, A., Bhinge, A. A., Lieb, J. D., Grasfeder, L. L., Showers, K. A., Giresi, P. G., Kim, S. K., Shestak, C., Myers, R. M., Pauli, F., Reddy, T. E., Gertz, J., Partridge, E. C., Jain, P., Sprouse, R. O., Bansal, A., Pusey, B., Muratet, M. A., Varley, K. E., Bowling, K. M., Newberry, K. M., Nesmith, A. S., Dilocker, J. A., Parker, S. L., Waite, L. L., Thibeault, K., Roberts, K., Absher, D. M., Wold, B., Mortazavi, A., Williams, B., Marinov, G., Trout, D., Pepke, S., King, B., McCue, K., Kirilusha, A., DeSalvo, G., Fisher-Aylor, K., Amrhein, H., Vielmetter, J., Sherlock, G., Sidow, A., Batzoglou, S., Rauch, R., Kundaje, A., Libbrecht, M., Margulies, E. H., Parker, S. 
C., Elnitski, L., Green, E. D., Hubbard, T., Harrow, J., Searle, S., Kokocinski, F., Aken, B., Frankish, A., Hunt, T., Despacio-Reyes, G., Kay, M., Mukherjee, G., Bignell, A., Saunders, G., Boychenko, V., Brent, M., van Baren, M. J., Brown, R. H., Gerstein, M., Khurana, E., Balasubramanian, S., Zhang, Z., Lam, H., Cayting, P., Robilotto, R., Lu, Z., Guigo, R., Derrien, T., Tanzer, A., Knowles, D. G., Mariotti, M., Kent, W. J., Haussler, D., Harte, R., Diekhans, M., Kellis, M., Lin, M., Kheradpour, P., Ernst, J., Reymond, A., Howald, C., Graison, E. A., Chrast, J., Valencia, A., Tress, M., Manuel Rodriguez, J., Snyder, M., Landt, S. G., Raha, D., Shi, M., Euskirchen, G., Grubert, F., Kasowski, M., Lian, J., Cayting, P., Lacroute, P., Xu, Y., Monahan, H., Patacsil, D., Slifer, T., Yang, X., Charos, A., Reed, B., Wu, L., Auerbach, R. K., Habegger, L., Hariharan, M., Rozowsky, J., Abyzov, A., Weissman, S. M., Gerstein, M., Struhl, K., Lamarre-Vincent, N., Lindahl-Allen, M., Miotto, B., Moqtaderi, Z., Fleming, J. D., Newburger, P., Farnham, P. J., Frietze, S., O'Geen, H., Xu, X., Blahnik, K. R., Cao, A. R., Iyengar, S., Stamatoyannopoulos, J. A., Kaul, R., Thurman, R. E., Wang, H., Navas, P. A., Sandstrom, R., Sabo, P. J., Weaver, M., Canfield, T., Lee, K., Neph, S., Roach, V., Reynolds, A., Johnson, A., Rynes, E., Giste, E., Vong, S., Neri, J., Frum, T., Johnson, E. M., Nguyen, E. D., Ebersol, A. K., Sanchez, M. E., Sheffer, H. H., Lotakis, D., Haugen, E., Humbert, R., Kutyavin, T., Shafer, T., Dekker, J., Lajoie, B. R., Sanyal, A., Kent, W. J., Rosenbloom, K. R., Dreszer, T. R., Raney, B. J., Barber, G. P., Meyer, L. R., Sloan, C. A., Malladi, V. S., Cline, M. S., Learned, K., Swing, V. K., Zweig, A. S., Rhead, B., Fujita, P. A., Roskin, K., Karolchik, D., Kuhn, R. M., Haussler, D., Birney, E., Dunham, I., Wilder, S. P., Keefe, D., Sobral, D., Herrero, J., Beal, K., Lukk, M., Brazma, A., Vaquerizas, J. M., Luscombe, N. M., Bickel, P. J., Boley, N., Brown, J. 
B., Li, Q., Huang, H., Gerstein, M., Habegger, L., Sboner, A., Rozowsky, J., Auerbach, R. K., Yip, K. Y., Cheng, C., Yan, K., Bhardwaj, N., Wang, J., Lochovsky, L., Jee, J., Gibson, T., Leng, J., Du, J., Hardison, R. C., Harris, R. S., Song, G., Miller, W., Haussler, D., Roskin, K., Suh, B., Wang, T., Paten, B., Noble, W. S., Hoffman, M. M., Buske, O. J., Weng, Z., Dong, X., Wang, J., Xi, H., Tenenbaum, S. A., Doyle, F., Penalva, L. O., Chittur, S., Tullius, T. D., Parker, S. C., White, K. P., Karmakar, S., Victorsen, A., Jameel, N., Bild, N., Grossman, R. L., Snyder, M., Landt, S. G., Yang, X., Patacsil, D., Slifer, T., Dekker, J., Lajoie, B. R., Sanyal, A., Weng, Z., Whitfield, T. W., Wang, J., Collins, P. J., Trinklein, N. D., Partridge, E. C., Myers, R. M., Giddings, M. C., Chen, X., Khatun, J., Maier, C., Yu, Y., Gunawardena, H., Risk, B., Feingold, E. A., Lowdon, R. F., Dillon, L. A., Good, P. J. 2011; 9 (4)
  • Reconstruction of genealogical relationships with application to Phase III of HapMap. Bioinformatics Kyriazopoulou-Panagiotopoulou, S., KashefHaghighi, D., Aerni, S. J., Sundquist, A., Bercovici, S., Batzoglou, S. 2011
  • Identifying a High Fraction of the Human Genome to be under Selective Constraint Using GERP++ PLOS COMPUTATIONAL BIOLOGY Davydov, E. V., Goode, D. L., Sirota, M., Cooper, G. M., Sidow, A., Batzoglou, S. 2010; 6 (12)


    Computational efforts to identify functional elements within genomes leverage comparative sequence information by looking for regions that exhibit evidence of selective constraint. One way of detecting constrained elements is to follow a bottom-up approach by computing constraint scores for individual positions of a multiple alignment and then defining constrained elements as segments of contiguous, highly scoring nucleotide positions. Here we present GERP++, a new tool that uses maximum likelihood evolutionary rate estimation for position-specific scoring and, in contrast to previous bottom-up methods, a novel dynamic programming approach to subsequently define constrained elements. GERP++ evaluates a richer set of candidate element breakpoints and ranks them based on statistical significance, eliminating the need for biased heuristic extension techniques. Using GERP++ we identify over 1.3 million constrained elements spanning over 7% of the human genome. We predict a higher fraction than earlier estimates largely due to the annotation of longer constrained elements, which improves one to one correspondence between predicted elements with known functional sequences. GERP++ is an efficient and effective tool to provide both nucleotide- and element-level constraint scores within deep multiple sequence alignments.

    View details for DOI 10.1371/journal.pcbi.1001025

    View details for Web of Science ID 000285574600013

    View details for PubMedID 21152010
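
The bottom-up idea in the GERP++ abstract above (score each alignment column, then define elements as contiguous runs of high-scoring positions) can be illustrated with a much simpler stand-in. The sketch below is not the GERP++ dynamic program, and it does no significance ranking. It applies the classic maximum-scoring-segment scan to a list of hypothetical per-position constraint scores, where positive values are assumed to mean "more constrained than expected". The function name and the score values are illustrative assumptions.

```python
from typing import List, Tuple


def best_segment(scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end_exclusive, total) of the highest-scoring contiguous run.

    A single left-to-right pass: extend the current run while its running
    total is positive, otherwise restart the run at the current position.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    best_start, best_end, best_total = 0, 1, scores[0]
    cur_start, cur_total = 0, 0.0
    for i, s in enumerate(scores):
        if cur_total <= 0:
            # A non-positive prefix can only hurt, so start a new run here.
            cur_start, cur_total = i, s
        else:
            cur_total += s
        if cur_total > best_total:
            best_start, best_end, best_total = cur_start, i + 1, cur_total
    return best_start, best_end, best_total


# Hypothetical per-position scores for a short alignment window.
window = [-1.0, 2.0, 3.0, -0.5, 4.0, -6.0, 1.0]
print(best_segment(window))
```

A single weakly scoring column (the -0.5 above) does not break the element, which is the behaviour one wants when elements are defined as segments rather than as isolated positions.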

  • RECOMB Main Conference 2009 Preface JOURNAL OF COMPUTATIONAL BIOLOGY Batzoglou, S. 2010; 17 (3): 201-201

    View details for DOI 10.1089/cmb.2010.Pr01

    View details for Web of Science ID 000279271600001

    View details for PubMedID 20377440

  • Evolutionary constraint facilitates interpretation of genetic variation in resequenced human genomes GENOME RESEARCH Goode, D. L., Cooper, G. M., Schmutz, J., Dickson, M., Gonzales, E., Tsai, M., Karra, K., Davydov, E., Batzoglou, S., Myers, R. M., Sidow, A. 2010; 20 (3): 301-310


    Here, we demonstrate how comparative sequence analysis facilitates genome-wide base-pair-level interpretation of individual genetic variation and address two questions of importance for human personal genomics: first, whether an individual's functional variation comes mostly from noncoding or coding polymorphisms; and, second, whether population-specific or globally-present polymorphisms contribute more to functional variation in any given individual. Neither has been definitively answered by analyses of existing variation data because of a focus on coding polymorphisms, ascertainment biases in favor of common variation, and a lack of base-pair-level resolution for identifying functional variants. We resequenced 575 amplicons within 432 individuals at genomic sites enriched for evolutionary constraint and also analyzed variation within three published human genomes. We find that single-site measures of evolutionary constraint derived from mammalian multiple sequence alignments are strongly predictive of reductions in modern-day genetic diversity across a range of annotation categories and across the allele frequency spectrum from rare (<1%) to high frequency (>10% minor allele frequency). Furthermore, we show that putatively functional variation in an individual genome is dominated by polymorphisms that do not change protein sequence and that originate from our shared ancestral population and commonly segregate in human populations. These observations show that common, noncoding alleles contribute substantially to human phenotypes and that constraint-based analyses will be of value to identify phenotypically relevant variants in individual genomes.

    View details for DOI 10.1101/gr.102210.109

    View details for Web of Science ID 000275124600002

    View details for PubMedID 20067941

  • Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Computational Biology Davydov, E. V., Goode, D. L., Sirota, M., Cooper, G. M., Sidow, A., Batzoglou, S. 2010; 6: e1001025
  • Current progress in static and dynamic modeling of biological networks. Systems Biology for Signaling Networks Daigle, B. J., Srinivasan, B. S., Flannick, J. A., Novak, A. F., Batzoglou, S. edited by Sangdun, C. Springer. 2010: 13-73
  • RECOMB Conference 2009 Journal of Computational Biology Batzoglou, S. 2010; 17 (3)
  • Autoimmune Disease Classification by Inverse Association with SNP Alleles PLOS GENETICS Sirota, M., Schaub, M. A., Batzoglou, S., Robinson, W. H., Butte, A. J. 2009; 5 (12)


    With multiple genome-wide association studies (GWAS) performed across autoimmune diseases, there is a great opportunity to study the homogeneity of genetic architectures across autoimmune disease. Previous approaches have been limited in the scope of their analysis and have failed to properly incorporate the direction of allele-specific disease associations for SNPs. In this work, we refine the notion of a genetic variation profile for a given disease to capture strength of association with multiple SNPs in an allele-specific fashion. We apply this method to compare genetic variation profiles of six autoimmune diseases: multiple sclerosis (MS), ankylosing spondylitis (AS), autoimmune thyroid disease (ATD), rheumatoid arthritis (RA), Crohn's disease (CD), and type 1 diabetes (T1D), as well as five non-autoimmune diseases. We quantify pair-wise relationships between these diseases and find two broad clusters of autoimmune disease where SNPs that make an individual susceptible to one class of autoimmune disease also protect from diseases in the other autoimmune class. We find that RA and AS form one such class, and MS and ATD another. We identify specific SNPs and genes with opposite risk profiles for these two classes. We furthermore explore individual SNPs that play an important role in defining similarities and differences between disease pairs. We present a novel, systematic, cross-platform approach to identify allele-specific relationships between disease pairs based on genetic variation as well as the individual SNPs which drive the relationships. While recognizing similarities between diseases might lead to identifying novel treatment options, detecting differences between diseases previously thought to be similar may point to key novel disease-specific genes and pathways.

    View details for DOI 10.1371/journal.pgen.1000792

    View details for Web of Science ID 000273469700042

    View details for PubMedID 20041220

  • Analysis of Cell Fate from Single-Cell Gene Expression Profiles in C. elegans CELL Liu, X., Long, F., Peng, H., Aerni, S. J., Jiang, M., Sanchez-Blanco, A., Murray, J. I., Preston, E., Mericle, B., Batzoglou, S., Myers, E. W., Kim, S. K. 2009; 139 (3): 623-633


    The C. elegans cell lineage provides a unique opportunity to look at how cell lineage affects patterns of gene expression. We developed an automatic cell lineage analyzer that converts high-resolution images of worms into a data table showing fluorescence expression with single-cell resolution. We generated expression profiles of 93 genes in 363 specific cells from L1 stage larvae and found that cells with identical fates can be formed by different gene regulatory pathways. Molecular signatures identified repeating cell fate modules within the cell lineage and enabled the generation of a molecular differentiation map that reveals points in the cell lineage when developmental fates of daughter cells begin to diverge. These results demonstrate insights that become possible using computational approaches to analyze quantitative expression from many genes in parallel using a digital gene expression atlas.

    View details for DOI 10.1016/j.cell.2009.08.044

    View details for Web of Science ID 000271259600025

    View details for PubMedID 19879847

  • Automatic Parameter Learning for Multiple Local Network Alignment JOURNAL OF COMPUTATIONAL BIOLOGY Flannick, J., Novak, A., Do, C. B., Srinivasan, B. S., Batzoglou, S. 2009; 16 (8): 1001-1022


    We developed Graemlin 2.0, a new multiple network aligner with (1) a new multi-stage approach to local network alignment; (2) a novel scoring function that can use arbitrary features of a multiple network alignment, such as protein deletions, protein duplications, protein mutations, and interaction losses; (3) a parameter learning algorithm that uses a training set of known network alignments to learn parameters for our scoring function and thereby adapt it to any set of networks; and (4) an algorithm that uses our scoring function to find approximate multiple network alignments in linear time. We tested Graemlin 2.0's accuracy on protein interaction networks from IntAct, DIP, and the Stanford Network Database. We show that, on each of these datasets, Graemlin 2.0 has higher sensitivity and specificity than existing network aligners. Graemlin 2.0 is available under the GNU public license.

    View details for DOI 10.1089/cmb.2009.0099

    View details for Web of Science ID 000269639100004

    View details for PubMedID 19645599

  • A Classifier-based approach to identify genetic similarities between diseases BIOINFORMATICS Schaub, M. A., Kaplow, I. M., Sirota, M., Do, C. B., Butte, A. J., Batzoglou, S. 2009; 25 (12): I21-I29


    Genome-wide association studies are commonly used to identify possible associations between genetic variations and diseases. These studies mainly focus on identifying individual single nucleotide polymorphisms (SNPs) potentially linked with one disease of interest. In this work, we introduce a novel methodology that identifies similarities between diseases using information from a large number of SNPs. We separate the diseases for which we have individual genotype data into one reference disease and several query diseases. We train a classifier that distinguishes between individuals that have the reference disease and a set of control individuals. This classifier is then used to classify the individuals that have the query diseases. We can then rank query diseases according to the average classification of the individuals in each disease set, and identify which of the query diseases are more similar to the reference disease. We repeat these classification and comparison steps so that each disease is used once as reference disease. We apply this approach using a decision tree classifier to the genotype data of seven common diseases and two shared control sets provided by the Wellcome Trust Case Control Consortium. We show that this approach identifies the known genetic similarity between type 1 diabetes and rheumatoid arthritis, and identifies a new putative similarity between bipolar disease and hypertension.

    View details for DOI 10.1093/bioinformatics/btp226

    View details for Web of Science ID 000266498300004

    View details for PubMedID 19477990
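
The reference/query procedure described in the abstract above can be sketched in a few lines. This is a toy illustration under loud assumptions: the paper uses a decision tree classifier, whereas the sketch swaps in a nearest-centroid rule to stay dependency-free. The genotype vectors (minor-allele counts 0, 1 or 2 per SNP), the disease names and the function names are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Genotype = Sequence[float]  # one minor-allele count (0, 1 or 2) per SNP


def centroid(rows: List[Genotype]) -> List[float]:
    """Mean genotype vector of a group of individuals."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a: Genotype, b: Genotype) -> float:
    """Squared Euclidean distance between two genotype vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank_queries(
    reference_cases: List[Genotype],
    controls: List[Genotype],
    queries: Dict[str, List[Genotype]],
) -> List[Tuple[str, float]]:
    """Rank query diseases by the fraction of their individuals that a
    reference-vs-control classifier labels as 'reference-like'."""
    case_c = centroid(reference_cases)  # "training" step of the stand-in classifier
    ctrl_c = centroid(controls)
    scores = {}
    for name, people in queries.items():
        hits = sum(1 for g in people if sq_dist(g, case_c) < sq_dist(g, ctrl_c))
        scores[name] = hits / len(people)  # average classification per disease
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical three-SNP genotypes.
reference_cases = [[2, 2, 0], [2, 1, 0], [1, 2, 0]]
controls = [[0, 0, 2], [0, 1, 2], [1, 0, 2]]
queries = {
    "disease_A": [[2, 2, 1], [1, 2, 0], [2, 1, 1]],
    "disease_B": [[0, 0, 2], [0, 1, 1], [2, 2, 0]],
}
print(rank_queries(reference_cases, controls, queries))
```

Repeating the call with each disease taking its turn as the reference would give the full set of pairwise comparisons that the abstract describes.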

  • A serial founder effect model for human settlement out of Africa PROCEEDINGS OF THE ROYAL SOCIETY B-BIOLOGICAL SCIENCES Deshpande, O., Batzoglou, S., Feldman, M. W., Cavalli-Sforza, L. L. 2009; 276 (1655): 291-300


    The increasing abundance of human genetic data has shown that the geographical patterns of worldwide genetic diversity are best explained by human expansion out of Africa. This expansion is modelled well by prolonged migration from a single origin in Africa with multiple subsequent serial founding events. We discuss a new simulation model for the serial founder effect out of Africa and compare it with results from previous studies. Unlike previous models, we distinguish colonization events from the continued exchange of people between occupied territories as a result of mating. We conduct a search through parameter space to estimate the range of parameter values that best explain key statistics from published data on worldwide variation in microsatellites. The range of parameters we use is chosen to be compatible with an out-of-Africa migration at 50-60Kyr ago and archaeo-ethno-demographic information. In addition to a colonization rate of 0.09-0.18, for an acceptable fit to the published microsatellite data, incorporation into existing models of exchange between neighbouring populations is essential, but at a very low rate. A linear decay of genetic diversity with geographical distance from the origin of expansion could apply to any species, especially if it moved recently into new geographical niches.

    View details for DOI 10.1098/rspb.2008.0750

    View details for Web of Science ID 000262005200013

    View details for PubMedID 18796400
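
The decay of diversity under serial founding events, as discussed in the abstract above, can be illustrated with a deliberately minimal calculation. This is not the simulation model of the paper: it ignores population growth, mutation and the exchange between neighbouring populations that the authors find essential. It only applies the standard drift result that one generation in a population of N diploid founders multiplies expected heterozygosity by (1 - 1/(2N)). The function name and parameter values are assumptions chosen for illustration.

```python
from typing import List


def heterozygosity_along_route(h0: float, founders: int, n_colonies: int) -> List[float]:
    """Expected heterozygosity at the origin and after each founding event.

    Each colony is assumed to be founded by `founders` diploid individuals
    drawn from the previous colony, with no later gene flow.
    """
    if founders <= 0:
        raise ValueError("founders must be positive")
    keep = 1.0 - 1.0 / (2.0 * founders)  # fraction of diversity kept per event
    values = [h0]
    for _ in range(n_colonies):
        values.append(values[-1] * keep)
    return values


# Hypothetical numbers: origin heterozygosity 0.8, 50 founders, 5 colonies.
for step, h in enumerate(heterozygosity_along_route(0.8, 50, 5)):
    print(step, round(h, 4))
```

When 1/(2N) is small, the geometric decline is close to linear over a moderate number of steps, which is consistent with the approximately linear decay with distance that the abstract mentions.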

  • Proceedings of the 13th Annual International Conference on Research in Computational Molecular Biology. edited by Batzoglou, S. Springer-Verlag. 2009
  • A serial founder effect model for human settlements out of Africa. Deshpande, O., Batzoglou, S., Feldman, M., Cavalli-Sforza, L. 2009
  • Genetic and Computational Identification of a Conserved Bacterial Metabolic Module PLOS GENETICS Boutte, C. C., Srinivasan, B. S., Flannick, J. A., Novak, A. F., Martens, A. T., Batzoglou, S., Viollier, P. H., Crosson, S. 2008; 4 (12)


    We have experimentally and computationally defined a set of genes that form a conserved metabolic module in the alpha-proteobacterium Caulobacter crescentus and used this module to illustrate a schema for the propagation of pathway-level annotation across bacterial genera. Applying comprehensive forward and reverse genetic methods and genome-wide transcriptional analysis, we (1) confirmed the presence of genes involved in catabolism of the abundant environmental sugar myo-inositol, (2) defined an operon encoding an ABC-family myo-inositol transmembrane transporter, and (3) identified a novel myo-inositol regulator protein and cis-acting regulatory motif that control expression of genes in this metabolic module. Despite being encoded from non-contiguous loci on the C. crescentus chromosome, these myo-inositol catabolic enzymes and transporter proteins form a tightly linked functional group in a computationally inferred network of protein associations. Primary sequence comparison was not sufficient to confidently extend annotation of all components of this novel metabolic module to related bacterial genera. Consequently, we implemented the Graemlin multiple-network alignment algorithm to generate cross-species predictions of genes involved in myo-inositol transport and catabolism in other alpha-proteobacteria. Although the chromosomal organization of genes in this functional module varied between species, the upstream regions of genes in this aligned network were enriched for the same palindromic cis-regulatory motif identified experimentally in C. crescentus. Transposon disruption of the operon encoding the computationally predicted ABC myo-inositol transporter of Sinorhizobium meliloti abolished growth on myo-inositol as the sole carbon source, confirming our cross-genera functional prediction. Thus, we have defined regulatory, transport, and catabolic genes and a cis-acting regulatory sequence that form a conserved module required for myo-inositol metabolism in select alpha-proteobacteria. Moreover, this study describes a forward validation of gene-network alignment, and illustrates a strategy for reliably transferring pathway-level annotation across bacterial species.

    View details for DOI 10.1371/journal.pgen.1000310

    View details for Web of Science ID 000263667900025

    View details for PubMedID 19096521

  • Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data NATURE METHODS Valouev, A., Johnson, D. S., Sundquist, A., Medina, C., Anton, E., Batzoglou, S., Myers, R. M., Sidow, A. 2008; 5 (9): 829-834


    Molecular interactions between protein complexes and DNA mediate essential gene-regulatory functions. Uncovering such interactions by chromatin immunoprecipitation coupled with massively parallel sequencing (ChIP-Seq) has recently become the focus of intense interest. We here introduce quantitative enrichment of sequence tags (QuEST), a powerful statistical framework based on the kernel density estimation approach, which uses ChIP-Seq data to determine positions where protein complexes contact DNA. Using QuEST, we discovered several thousand binding sites for the human transcription factors SRF, GABP and NRSF at an average resolution of about 20 base pairs. MEME motif-discovery tool-based analyses of the QuEST-identified sequences revealed DNA binding by cofactors of SRF, providing evidence that cofactor binding specificity can be obtained from ChIP-Seq data. By combining QuEST analyses with Gene Ontology (GO) annotations and expression data, we illustrate how general functions of transcription factors can be inferred.

    View details for DOI 10.1038/NMETH.1246

    View details for Web of Science ID 000258912700017

    View details for PubMedID 19160518

  • What is the expectation maximization algorithm? NATURE BIOTECHNOLOGY Do, C. B., Batzoglou, S. 2008; 26 (8): 897-899

    View details for DOI 10.1038/nbt1406

    View details for Web of Science ID 000258325500023

    View details for PubMedID 18688245

  • A max-margin model for efficient simultaneous alignment and folding of RNA sequences BIOINFORMATICS Do, C. B., Foo, C., Batzoglou, S. 2008; 24 (13): I68-I76


    The need for accurate and efficient tools for computational RNA structure analysis has become increasingly apparent over the last several years: RNA folding algorithms underlie numerous applications in bioinformatics, ranging from microarray probe selection to de novo non-coding RNA gene prediction. In this work, we present RAF (RNA Alignment and Folding), an efficient algorithm for simultaneous alignment and consensus folding of unaligned RNA sequences. Algorithmically, RAF exploits sparsity in the set of likely pairing and alignment candidates for each nucleotide (as identified by the CONTRAfold or CONTRAlign programs) to achieve an effectively quadratic running time for simultaneous pairwise alignment and folding. RAF's fast sparse dynamic programming, in turn, serves as the inference engine within a discriminative machine learning algorithm for parameter estimation. In cross-validated benchmark tests, RAF achieves accuracies equaling or surpassing the current best approaches for RNA multiple sequence secondary structure prediction. However, RAF requires nearly an order of magnitude less time than other simultaneous folding and alignment methods, thus making it especially appropriate for high-throughput studies. Source code for RAF is available at:

    View details for DOI 10.1093/bioinformatics/btn177

    View details for Web of Science ID 000257169700030

    View details for PubMedID 18586747

  • Effect of genetic divergence in identifying ancestral origin using HAPAA GENOME RESEARCH Sundquist, A., Fratkin, E., Do, C. B., Batzoglou, S. 2008; 18 (4): 676-682


    The genome of an admixed individual with ancestors from isolated populations is a mosaic of chromosomal blocks, each following the statistical properties of variation seen in those populations. By analyzing polymorphisms in the admixed individual against those seen in representatives from the populations, we can infer the ancestral source of the individual's haploblocks. In this paper we describe a novel approach for ancestry inference, HAPAA (HMM-based analysis of polymorphisms in admixed ancestries), that models the allelic and haplotypic variation in the populations and captures the signal of correlation due to linkage disequilibrium, resulting in greatly improved accuracy. We also introduce a methodology for evaluating the effect of genetic divergence between ancestral populations and time-to-admixture on inference accuracy. Using HAPAA, we explore the limits of ancestry inference in closely related populations.

    View details for DOI 10.1101/gr.072850.107

    View details for Web of Science ID 000254562400018

    View details for PubMedID 18353807

  • Automatic parameter learning for multiple network alignment Flannick, J., Novak, A., Do, C. B., Srinivasan, B. S., Batzoglou, S. SPRINGER-VERLAG BERLIN. 2008: 214-231
  • Automatic parameter learning for multiple network alignment. Flannick, J., Novak, A., Srinivasan, B. S., Batzoglou, S. 2008
  • Effect of genetic divergence in identifying ancestral origin using HAPAA. Sundquist, A., Fratkin, E., Do, C. B., Batzoglou, S. 2008
  • Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nature Methods Valouev, A., Johnson, D. S., Sundquist, A., Medina, C., Anton, E., Batzoglou, S. 2008; 5 (9): 829-834
  • Effects of genetic divergence in identifying ancestral origin using HAPAA Sundquist, A., Fratkin, E., Do, C. B., Batzoglou, S. SPRINGER-VERLAG BERLIN. 2008: 423-423
  • Bacterial flora-typing with targeted, chip-based Pyrosequencing BMC MICROBIOLOGY Sundquist, A., Bigdeli, S., Jalili, R., Druzin, M. L., Waller, S., Pullen, K. M., El-Sayed, Y. Y., Taslimi, M. M., Batzoglou, S., Ronaghi, M. 2007; 7


    The metagenomic analysis of microbial communities holds the potential to improve our understanding of the role of microbes in clinical conditions. Recent, dramatic improvements in DNA sequencing throughput and cost will enable such analyses on individuals. However, such advances in throughput generally come at the cost of shorter read-lengths, limiting the discriminatory power of each read. In particular, classifying the microbial content of samples by sequencing the < 1,600 bp 16S rRNA gene will be affected by such limitations. We describe a method for identifying the phylogenetic content of bacterial samples using high-throughput Pyrosequencing targeted at the 16S rRNA gene. Our analysis is adapted to the shorter read-lengths of such technology and uses a database of 16S rDNA to determine the most specific phylogenetic classification for reads, resulting in a weighted phylogenetic tree characterizing the content of the sample. We present results for six samples obtained from the human vagina during pregnancy that corroborates previous studies using conventional techniques. Next, we analyze the power of our method to classify reads at each level of the phylogeny using simulation experiments. We assess the impacts of read-length and database completeness on our method, and predict how we do as technology improves and more bacteria are sequenced. Finally, we study the utility of targeting specific 16S variable regions and show that such an approach considerably improves results for certain types of microbial samples. Using simulation, our method can be used to determine the most informative variable region. This study provides positive validation of the effectiveness of targeting 16S metagenomes using short-read sequencing technology. Our methodology allows us to infer the most specific assignment of the sequence reads within the phylogeny, and to identify the most discriminative variable region to target. The analysis of high-throughput Pyrosequencing on human flora samples will accelerate the study of the relationship between the microbial world and ourselves.

    View details for DOI 10.1186/1471-2180-7-108

    View details for Web of Science ID 000253968300001

    View details for PubMedID 18047683

  • Evolution of genes and genomes on the Drosophila phylogeny NATURE Clark, A. G., Eisen, M. B., Smith, D. R., Bergman, C. M., Oliver, B., Markow, T. A., Kaufman, T. C., Kellis, M., Gelbart, W., Iyer, V. N., Pollard, D. A., Sackton, T. B., Larracuente, A. M., Singh, N. D., Abad, J. P., Abt, D. N., Adryan, B., Aguade, M., Akashi, H., Anderson, W. W., Aquadro, C. F., Ardell, D. H., Arguello, R., Artieri, C. G., Barbash, D. A., Barker, D., Barsanti, P., Batterham, P., Batzoglou, S., Begun, D., Bhutkar, A., Blanco, E., Bosak, S. A., Bradley, R. K., Brand, A. D., Brent, M. R., Brooks, A. N., Brown, R. H., Butlin, R. K., Caggese, C., Calvi, B. R., de Carvalho, A. B., Caspi, A., Castrezana, S., Celniker, S. E., Chang, J. L., Chapple, C., Chatterji, S., Chinwalla, A., Civetta, A., Clifton, S. W., Comeron, J. M., Costello, J. C., Coyne, J. A., Daub, J., David, R. G., Delcher, A. L., Delehaunty, K., Do, C. B., Ebling, H., Edwards, K., Eickbush, T., Evans, J. D., Filipski, A., Findeiss, S., Freyhult, E., Fulton, L., Fulton, R., Garcia, A. C., Gardiner, A., Garfield, D. A., Garvin, B. E., Gibson, G., Gilbert, D., Gnerre, S., Godfrey, J., Good, R., Gotea, V., Gravely, B., Greenberg, A. J., Griffiths-Jones, S., Gross, S., Guigo, R., Gustafson, E. A., Haerty, W., Hahn, M. W., Halligan, D. L., Halpern, A. L., Halter, G. M., Han, M. V., Heger, A., Hillier, L., Hinrichs, A. S., Holmes, I., Hoskins, R. A., Hubisz, M. J., Hultmark, D., Huntley, M. A., Jaffe, D. B., Jagadeeshan, S., Jeck, W. R., Johnson, J., Jones, C. D., Jordan, W. C., Karpen, G. H., Kataoka, E., Keightley, P. D., Kheradpour, P., Kirkness, E. F., Koerich, L. B., Kristiansen, K., Kudrna, D., Kulathinal, R. J., Kumar, S., Kwok, R., Lander, E., Langley, C. H., Lapoint, R., Lazzaro, B. P., Lee, S., Levesque, L., Li, R., Lin, C., Lin, M. F., Lindblad-Toh, K., Llopart, A., Long, M., Low, L., Lozovsky, E., Lu, J., Luo, M., Machado, C. A., Makalowski, W., Marzo, M., Matsuda, M., Matzkin, L., McAllister, B., McBride, C. S., McKernan, B., McKernan, K., Mendez-Lago, M., Minx, P., Mollenhauer, M. U., Montooth, K., Mount, S. M., Mu, X., Myers, E., Negre, B., Newfeld, S., Nielsen, R., Noor, M. A., O'Grady, P., Pachter, L., Papaceit, M., Parisi, M. J., Parisi, M., Parts, L., Pedersen, J. S., Pesole, G., Phillippy, A. M., Ponting, C. P., Pop, M., Porcelli, D., Powell, J. R., Prohaska, S., Pruitt, K., Puig, M., Quesneville, H., Ram, K. R., Rand, D., Rasmussen, M. D., Reed, L. K., Reenan, R., Reily, A., Remington, K. A., Rieger, T. T., Ritchie, M. G., Robin, C., Rogers, Y., Rohde, C., Rozas, J., Rubenfield, M. J., Ruiz, A., Russo, S., Salzberg, S. L., Sanchez-Gracia, A., Saranga, D. J., Sato, H., Schaeffer, S. W., Schatz, M. C., Schlenke, T., Schwartz, R., Segarra, C., Singh, R. S., Sirot, L., Sirota, M., Sisneros, N. B., Smith, C. D., Smith, T. F., Spieth, J., Stage, D. E., Stark, A., Stephan, W., Strausberg, R. L., Strempel, S., Sturgill, D., Sutton, G., Sutton, G. G., Tao, W., Teichmann, S., Tobari, Y. N., Tomimura, Y., Tsolas, J. M., Valente, V. L., Venter, E., Venter, J. C., Vicario, S., Vieira, F. G., Vilella, A. J., Villasante, A., Walenz, B., Wang, J., Wasserman, M., Watts, T., Wilson, D., Wilson, R. K., Wing, R. A., Wolfner, M. F., Wong, A., Wong, G. K., Wu, C., Wu, G., Yamamoto, D., Yang, H., Yang, S., Yorke, J. A., Yoshida, K., Zdobnov, E., Zhang, P., Zhang, Y., Zimin, A. V., Baldwin, J., Abdouelleil, A., Abdulkadir, J., Abebe, A., Abera, B., Abreu, J., Acer, S. C., Aftuck, L., Alexander, A., An, P., Anderson, E., Anderson, S., Arachi, H., Azer, M., Bachantsang, P., Barry, A., Bayul, T., Berlin, A., Bessette, D., Bloom, T., Blye, J., Boguslavskiy, L., Bonnet, C., Boukhgalter, B., Bourzgui, I., Brown, A., Cahill, P., Channer, S., Cheshatsang, Y., Chuda, L., Citroen, M., Collymore, A., Cooke, P., Costello, M., D'Aco, K., Daza, R., De Haan, G., DeGray, S., DeMaso, C., Dhargay, N., Dooley, K., Dooley, E., Doricent, M., Dorje, P., Dorjee, K., Dupes, A., Elong, R., Falk, J., Farina, A., Faro, S., Ferguson, D., Fisher, S., Foley, C. D., Franke, A., Friedrich, D., Gadbois, L., Gearin, G., Gearin, C. R., Giannoukos, G., Goode, T., Graham, J., Grandbois, E., Grewal, S., Gyaltsen, K., Hafez, N., Hagos, B., Hall, J., Henson, C., Hollinger, A., Honan, T., Huard, M. D., Hughes, L., Hurhula, B., Husby, M. E., Kamat, A., Kanga, B., Kashin, S., Khazanovich, D., Kisner, P., Lance, K., Lara, M., Lee, W., Lennon, N., Letendre, F., LeVine, R., Lipovsky, A., Liu, X., Liu, J., Liu, S., Lokyitsang, T., Lokyitsang, Y., Lubonja, R., Lui, A., MacDonald, P., Magnisalis, V., Maru, K., Matthews, C., McCusker, W., McDonough, S., Mehta, T., Meldrim, J., Meneus, L., Mihai, O., Mihalev, A., Mihova, T., Mittelman, R., Mlenga, V., Montmayeur, A., Mulrain, L., Navidi, A., Naylor, J., Negash, T., Nguyen, T., Nguyen, N., Nicol, R., Norbu, C., Norbu, N., Novod, N., O'Neill, B., Osman, S., Markiewicz, E., Oyono, O. L., Patti, C., Phunkhang, P., Pierre, F., Priest, M., Raghuraman, S., Rege, F., Reyes, R., Rise, C., Rogov, P., Ross, K., Ryan, E., Settipalli, S., Shea, T., Sherpa, N., Shi, L., Shih, D., Sparrow, T., Spaulding, J., Stalker, J., Stange-Thomann, N., Stavropoulos, S., Stone, C., Strader, C., Tesfaye, S., Thomson, T., Thoulutsang, Y., Thoulutsang, D., Topham, K., Topping, I., Tsamla, T., Vassiliev, H., Vo, A., Wangchuk, T., Wangdi, T., Weiand, M., Wilkinson, J., Wilson, A., Yadav, S., Young, G., Yu, Q., Zembek, L., Zhong, D., Zimmer, A., Zwirko, Z., Jaffe, D. B., Alvarez, P., Brockman, W., Butler, J., Chin, C., Gnerre, S., Grabherr, M., Kleber, M., Mauceli, E., MacCallum, I. 2007; 450 (7167): 203-218


    Comparative analysis of multiple genomes in a phylogenetic framework dramatically improves the precision and sensitivity of evolutionary inference, producing more robust results than single-genome analyses can provide. The genomes of 12 Drosophila species, ten of which are presented here for the first time (sechellia, simulans, yakuba, erecta, ananassae, persimilis, willistoni, mojavensis, virilis and grimshawi), illustrate how rates and patterns of sequence divergence across taxa can illuminate evolutionary processes on a genomic scale. These genome sequences augment the formidable genetic tools that have made Drosophila melanogaster a pre-eminent model for animal genetics, and will further catalyse fundamental research on mechanisms of development, cell biology, genetics, disease, neurobiology, behaviour, physiology and evolution. Despite remarkable similarities among these Drosophila species, we identified many putatively non-neutral changes in protein-coding genes, non-coding RNA genes, and cis-regulatory regions. These may prove to underlie differences in the ecology and behaviour of these diverse species.

    View details for DOI 10.1038/nature06341

    View details for Web of Science ID 000250746200042

    View details for PubMedID 17994087

  • Current progress in network research: toward reference networks for key model organisms BRIEFINGS IN BIOINFORMATICS Srinivasan, B. S., Shah, N. H., Flannick, J. A., Abeliuk, E., Novak, A. F., Batzoglou, S. 2007; 8 (5): 318-332


    The collection of multiple genome-scale datasets is now routine, and the frontier of research in systems biology has shifted accordingly. Rather than clustering a single dataset to produce a static map of functional modules, the focus today is on data integration, network alignment, interactive visualization and ontological markup. Because of the intrinsic noisiness of high-throughput measurements, statistical methods have been central to this effort. In this review, we briefly survey available datasets in functional genomics, review methods for data integration and network alignment, and describe recent work on using network models to guide experimental validation. We explain how the integration and validation steps spring from a Bayesian description of network uncertainty, and conclude by describing an important near-term milestone for systems biology: the construction of a set of rich reference networks for key model organisms.

    View details for DOI 10.1093/bib/bbm038

    View details for Web of Science ID 000251034700005

    View details for PubMedID 17728341

  • Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project NATURE Birney, E., Stamatoyannopoulos, J. A., Dutta, A., Guigo, R., Gingeras, T. R., Margulies, E. H., Weng, Z., Snyder, M., Dermitzakis, E. T., Stamatoyannopoulos, J. A., Thurman, R. E., Kuehn, M. S., Taylor, C. M., Neph, S., Koch, C. M., Asthana, S., Malhotra, A., Adzhubei, I., Greenbaum, J. A., Andrews, R. M., Flicek, P., Boyle, P. J., Cao, H., Carter, N. P., Clelland, G. K., Davis, S., Day, N., Dhami, P., Dillon, S. C., Dorschner, M. O., Fiegler, H., Giresi, P. G., Goldy, J., Hawrylycz, M., Haydock, A., Humbert, R., James, K. D., Johnson, B. E., Johnson, E. M., Frum, T. T., Rosenzweig, E. R., Karnani, N., Lee, K., Lefebvre, G. C., Navas, P. A., Neri, F., Parker, S. C., Sabo, P. J., Sandstrom, R., Shafer, A., Vetrie, D., Weaver, M., Wilcox, S., Yu, M., Collins, F. S., Dekker, J., Lieb, J. D., Tullius, T. D., Crawford, G. E., Sunyaev, S., Noble, W. S., Dunham, I., Dutta, A., Guigo, R., Denoeud, F., Reymond, A., Kapranov, P., Rozowsky, J., Zheng, D., Castelo, R., Frankish, A., Harrow, J., Ghosh, S., Sandelin, A., Hofacker, I. L., Baertsch, R., Keefe, D., Flicek, P., Dike, S., Cheng, J., Hirsch, H. A., Sekinger, E. A., Lagarde, J., Abril, J. F., Shahab, A., Flamm, C., Fried, C., Hackermueller, J., Hertel, J., Lindemeyer, M., Missal, K., Tanzer, A., Washietl, S., Korbel, J., Emanuelsson, O., Pedersen, J. S., Holroyd, N., Taylor, R., Swarbreck, D., Matthews, N., Dickson, M. C., Thomas, D. J., Weirauch, M. T., Gilbert, J., Drenkow, J., Bell, I., Zhao, X., Srinivasan, K. G., Sung, W., Ooi, H. S., Chiu, K. P., Foissac, S., Alioto, T., Brent, M., Pachter, L., Tress, M. L., Valencia, A., Choo, S. W., Choo, C. Y., Ucla, C., Manzano, C., Wyss, C., Cheung, E., Clark, T. G., Brown, J. B., Ganesh, M., Patel, S., Tammana, H., Chrast, J., Henrichsen, C. N., Kai, C., Kawai, J., Nagalakshmi, U., Wu, J., Lian, Z., Lian, J., Newburger, P., Zhang, X., Bickel, P., Mattick, J. S., Carninci, P., Hayashizaki, Y., Weissman, S., Dermitzakis, E. T., Margulies, E. H., Hubbard, T., Myers, R. M., Rogers, J., Stadler, P. F., Lowe, T. M., Wei, C., Ruan, Y., Snyder, M., Birney, E., Struhl, K., Gerstein, M., Antonarakis, S. E., Gingeras, T. R., Brown, J. B., Flicek, P., Fu, Y., Keefe, D., Birney, E., Denoeud, F., Gerstein, M., Green, E. D., Kapranov, P., Karaoez, U., Myers, R. M., Noble, W. S., Reymond, A., Rozowsky, J., Struhl, K., Siepel, A., Stamatoyannopoulos, J. A., Taylor, C. M., Taylor, J., Thurman, R. E., Tullius, T. D., Washietl, S., Zheng, D., Liefer, L. A., Wetterstrand, K. A., Good, P. J., Feingold, E. A., Guyer, M. S., Collins, F. S., Margulies, E. H., Cooper, G. M., Asimenos, G., Thomas, D. J., Dewey, C. N., Siepel, A., Birney, E., Keefe, D., Hou, M., Taylor, J., Nikolaev, S., Montoya-Burgos, J. I., Loeytynoja, A., Whelan, S., Pardi, F., Massingham, T., Brown, J. B., Huang, H., Zhang, N. R., Bickel, P., Holmes, I., Mullikin, J. C., Ureta-Vidal, A., Paten, B., Seringhaus, M., Church, D., Rosenbloom, K., Kent, W. J., Stone, E. A., Gerstein, M., Antonarakis, S. E., Batzoglou, S., Goldman, N., Hardison, R. C., Haussler, D., Miller, W., Pachter, L., Green, E. D., Sidow, A., Weng, Z., Trinklein, N. D., Fu, Y., Zhang, Z. D., Karaoez, U., Barrera, L., Stuart, R., Zheng, D., Ghosh, S., Flicek, P., King, D. C., Taylor, J., Ameur, A., Enroth, S., Bieda, M. C., Koch, C. M., Hirsch, H. A., Wei, C., Cheng, J., Kim, J., Bhinge, A. A., Giresi, P. G., Jiang, N., Liu, J., Yao, F., Sung, W., Chiu, K. P., Vega, V. B., Lee, C. W., Ng, P., Shahab, A., Sekinger, E. A., Yang, A., Moqtaderi, Z., Zhu, Z., Xu, X., Squazzo, S., Oberley, M. J., Inman, D., Singer, M. A., Richmond, T. A., Munn, K. J., Rada-Iglesias, A., Wallerman, O., Komorowski, J., Clelland, G. K., Wilcox, S., Dillon, S. C., Andrews, R. M., Fowler, J. C., Couttet, P., James, K. D., Lefebvre, G. C., Bruce, A. W., Dovey, O. M., Ellis, P. D., Dhami, P., Langford, C. F., Carter, N. P., Vetrie, D., Kapranov, P., Nix, D. A., Bell, I., Patel, S., Rozowsky, J., Euskirchen, G., Hartman, S., Lian, J., Wu, J., Urban, A. E., Kraus, P., Van Calcar, S., Heintzman, N., Kim, T. H., Wang, K., Qu, C., Hon, G., Luna, R., Glass, C. K., Rosenfeld, M. G., Force Aldred, S., Cooper, S. J., Halees, A., Lin, J. M., Shulha, H. P., Zhang, X., Xu, M., Haidar, J. N., Yu, Y., Birney, E., Weissman, S., Ruan, Y., Lieb, J. D., Iyer, V. R., Green, R. D., Gingeras, T. R., Wadelius, C., Dunham, I., Struhl, K., Hardison, R. C., Gerstein, M., Farnham, P. J., Myers, R. M., Ren, B., Snyder, M., Thomas, D. J., Rosenbloom, K., Harte, R. A., Hinrichs, A. S., Trumbower, H., Clawson, H., Hillman-Jackson, J., Zweig, A. S., Smith, K., Thakkapallayil, A., Barber, G., Kuhn, R. M., Karolchik, D., Haussler, D., Kent, W. J., Dermitzakis, E. T., Armengol, L., Bird, C. P., Clark, T. G., Cooper, G. M., de Bakker, P. I., Kern, A. D., Lopez-Bigas, N., Martin, J. D., Stranger, B. E., Thomas, D. J., Woodroffe, A., Batzoglou, S., Davydov, E., Dimas, A., Eyras, E., Hallgrimsdottir, I. B., Hardison, R. C., Huppert, J., Sidow, A., Taylor, J., Trumbower, H., Zody, M. C., Guigo, R., Mullikin, J. C., Abecasis, G. R., Estivill, X., Birney, E., Bouffard, G. G., Guan, X., Hansen, N. F., Idol, J. R., Maduro, V. V., Maskeri, B., McDowell, J. C., Park, M., Thomas, P. J., Young, A. C., Blakesley, R. W., Muzny, D. M., Sodergren, E., Wheeler, D. A., Worley, K. C., Jiang, H., Weinstock, G. M., Gibbs, R. A., Graves, T., Fulton, R., Mardis, E. R., Wilson, R. K., Clamp, M., Cuff, J., Gnerre, S., Jaffe, D. B., Chang, J. L., Lindblad-Toh, K., Lander, E. S., Koriabine, M., Nefedov, M., Osoegawa, K., Yoshinaga, Y., Zhu, B., de Jong, P. J. 2007; 447 (7146): 799-816


    We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.

    View details for DOI 10.1038/nature05874

    View details for Web of Science ID 000247207500034

    View details for PubMedID 17571346

  • Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome GENOME RESEARCH Margulies, E. H., Cooper, G. M., Asimenos, G., Thomas, D. J., Dewey, C. N., Siepel, A., Birney, E., Keefe, D., Schwartz, A. S., Hou, M., Taylor, J., Nikolaev, S., Montoya-Burgos, J. I., Loytynoja, A., Whelan, S., Pardi, F., Massingham, T., Brown, J. B., Bickel, P., Holmes, I., Mullikin, J. C., Ureta-Vidal, A., Paten, B., Stone, E. A., Rosenbloom, K. R., Kent, W. J., Antonarakis, S. E., Batzoglou, S., Goldman, N., Hardison, R., Haussler, D., Miller, W., Pachter, L., Green, E. D., Sidow, A. 2007; 17 (6): 760-774


    A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization.

    View details for DOI 10.1101/gr.6034307

    View details for Web of Science ID 000247226900009

    View details for PubMedID 17567995

  • Whole-Genome Sequencing and Assembly with High-Throughput, Short-Read Technologies PLOS ONE Sundquist, A., Ronaghi, M., Tang, H., Pevzner, P., Batzoglou, S. 2007; 2 (5)


    While recently developed short-read sequencing technologies may dramatically reduce the sequencing cost and eventually achieve the $1000 goal for re-sequencing, their limitations prevent the de novo sequencing of eukaryotic genomes with the standard shotgun sequencing protocol. We present SHRAP (SHort Read Assembly Protocol), a sequencing protocol and assembly methodology that utilizes high-throughput short-read technologies. We describe a variation on hierarchical sequencing with two crucial differences: (1) we select a clone library from the genome randomly rather than as a tiling path and (2) we sample clones from the genome at high coverage and reads from the clones at low coverage. We assume that 200 bp read lengths with a 1% error rate and inexpensive random fragment cloning on whole mammalian genomes is feasible. Our assembly methodology is based on first ordering the clones and subsequently performing read assembly in three stages: (1) local assemblies of regions significantly smaller than a clone size, (2) clone-sized assemblies of the results of stage 1, and (3) chromosome-sized assemblies. By aggressively localizing the assembly problem during the first stage, our method succeeds in assembling short, unpaired reads sampled from repetitive genomes. We tested our assembler using simulated reads from D. melanogaster and human chromosomes 1, 11, and 21, and produced assemblies with large sets of contiguous sequence and a misassembly rate comparable to other draft assemblies. Tested on D. melanogaster and the entire human genome, our clone-ordering method produces accurate maps, thereby localizing fragment assembly and enabling the parallelization of the subsequent steps of our pipeline. Thus, we have demonstrated that truly inexpensive de novo sequencing of mammalian genomes will soon be possible with high-throughput, short-read technologies using our methodology.

    View details for DOI 10.1371/journal.pone.0000484

    View details for Web of Science ID 000207448800014

    View details for PubMedID 17534434

  • CONTRAST: a discriminative, phylogeny-free approach to multiple informant de novo gene prediction GENOME BIOLOGY Gross, S. S., Do, C. B., Sirota, M., Batzoglou, S. 2007; 8 (12)


    We describe CONTRAST, a gene predictor which directly incorporates information from multiple alignments rather than employing phylogenetic models. This is accomplished through the use of discriminative machine learning techniques, including a novel training algorithm. We use a two-stage approach, in which a set of binary classifiers designed to recognize coding region boundaries is combined with a global model of gene structure. CONTRAST predicts exact coding region structures for 65% more human genes than the previous state-of-the-art method, misses 46% fewer exons and displays comparable gains in specificity.

    View details for DOI 10.1186/gb-2007-8-12-r269

    View details for Web of Science ID 000253451800020

    View details for PubMedID 18096039

  • Bacterial flora typing with deep, targeted, chip-based Pyrosequencing. BMC Microbiology Sundquist, A., Bigdeli, S., Jalili, R., El-Sayed, Y. Y., Taslimi, M. M., Druzin, M. L., Batzoglou, S. 2007; 7: 108
  • CONTRAST: A discriminative, phylogeny-free approach to multiple informant de novo gene prediction. Genome Biology Gross, S. S., Do, C. B., Sirota, M., Batzoglou, S. 2007; 8: R269
  • Current progress in network research: towards reference networks for key model organisms. Briefings in Bioinformatics Srinivasan, B. S., Shah, N. H., Flannick, J. A., Abeliuk, E., Novak, A. F., Batzoglou, S. 2007; 8 (5): 318-332
  • Whole-genome sequencing and assembly with high-throughput short-read technologies. PLOS One Sundquist, A., Ronaghi, M., Tang, H., Pevzner, P., Batzoglou, S. 2007; 2 (5): e484
  • Drosophila Comparative Genome Sequencing and Analysis Consortium. Evolution of genes and genomes in the context of the Drosophila phylogeny. Nature Batzoglou, S. 2007; 450: 203-218
  • A computational model for RNA multiple structural alignment Davydov, E., Batzoglou, S. ELSEVIER SCIENCE BV. 2006: 205-216
  • A graph-based motif detection algorithm models complex nucleotide dependencies in transcription factor binding sites NUCLEIC ACIDS RESEARCH Naughton, B. T., Fratkin, E., Batzoglou, S., Brutlag, D. L. 2006; 34 (20): 5730-5739


    Given a set of known binding sites for a specific transcription factor, it is possible to build a model of the transcription factor binding site, usually called a motif model, and use this model to search for other sites that bind the same transcription factor. Typically, this search is performed using a position-specific scoring matrix (PSSM), also known as a position weight matrix. In this paper we analyze a set of eukaryotic transcription factor binding sites and show that there is extensive clustering of similar k-mers in eukaryotic motifs, owing to both functional and evolutionary constraints. The apparent limitations of probabilistic models in representing complex nucleotide dependencies lead us to a graph-based representation of motifs. When deciding whether a candidate k-mer is part of a motif or not, we base our decision not on how well the k-mer conforms to a model of the motif as a whole, but how similar it is to specific, known k-mers in the motif. We elucidate the reasons why we expect graph-based methods to perform well on motif data. Our MotifScan algorithm shows greatly improved performance over the prevalent PSSM-based method for the detection of eukaryotic motifs.

    View details for DOI 10.1093/nar/gkl585

    View details for Web of Science ID 000242474800009

    View details for PubMedID 17041233
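
The PSSM search that the abstract above uses as its baseline can be illustrated with a minimal Python sketch. This is not the paper's MotifScan method; the function names, the pseudocount, and the uniform background frequencies are assumptions made only for illustration.

```python
import math

ALPHABET = "ACGT"

def build_pssm(sites, pseudocount=1.0, background=None):
    """Build a log-odds PSSM from equal-length known binding sites.

    The pseudocount and the uniform default background are assumptions.
    """
    background = background or {b: 0.25 for b in ALPHABET}
    total = len(sites) + pseudocount * len(ALPHABET)
    pssm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pssm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background[b])
            for b in ALPHABET
        })
    return pssm

def score_kmer(pssm, kmer):
    """Sum per-position log-odds scores; positions are treated as independent."""
    return sum(column[base] for column, base in zip(pssm, kmer))

def scan(pssm, sequence, threshold):
    """Return (position, score) for every window scoring at or above threshold."""
    k = len(pssm)
    hits = []
    for i in range(len(sequence) - k + 1):
        score = score_kmer(pssm, sequence[i:i + k])
        if score >= threshold:
            hits.append((i, score))
    return hits
```

Because each position contributes to the score independently, a model of this kind cannot express dependencies between positions, which is the limitation the abstract points to.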

  • Multiple alignment of protein sequences with repeats and rearrangements NUCLEIC ACIDS RESEARCH Phuong, T. M., Do, C. B., Edgar, R. C., Batzoglou, S. 2006; 34 (20): 5932-5942


    Multiple sequence alignments are the usual starting point for analyses of protein structure and evolution. For proteins with repeated, shuffled and missing domains, however, traditional multiple sequence alignment algorithms fail to provide an accurate view of homology between related proteins, because they either assume that the input sequences are globally alignable or require locally alignable regions to appear in the same order in all sequences. In this paper, we present ProDA, a novel system for automated detection and alignment of homologous regions in collections of proteins with arbitrary domain architectures. Given an input set of unaligned sequences, ProDA identifies all homologous regions appearing in one or more sequences, and returns a collection of local multiple alignments for these regions. On a subset of the BAliBASE benchmarking suite containing curated alignments of proteins with complicated domain architectures, ProDA performs well in detecting conserved domain boundaries and clustering domain segments, achieving the highest accuracy to date for this task. We conclude that ProDA is a practical tool for automated alignment of protein sequences with repeats and rearrangements in their domain architecture.

    View details for DOI 10.1093/nar/gkl511

    View details for Web of Science ID 000242474800027

    View details for PubMedID 17068081

  • Graemlin: General and robust alignment of multiple large interaction networks GENOME RESEARCH Flannick, J., Novak, A., Srinivasan, B. S., McAdams, H. H., Batzoglou, S. 2006; 16 (9): 1169-1181


    The recent proliferation of protein interaction networks has motivated research into network alignment: the cross-species comparison of conserved functional modules. Previous studies have laid the foundations for such comparisons and demonstrated their power on a select set of sparse interaction networks. Recently, however, new computational techniques have produced hundreds of predicted interaction networks with interconnection densities that push existing alignment algorithms to their limits. To find conserved functional modules in these new networks, we have developed Graemlin, the first algorithm capable of scalable multiple network alignment. Graemlin's explicit model of functional evolution allows both the generalization of existing alignment scoring schemes and the location of conserved network topologies other than protein complexes and metabolic pathways. To assess Graemlin's performance, we have developed the first quantitative benchmarks for network alignment, which allow comparisons of algorithms in terms of their ability to recapitulate the KEGG database of conserved functional modules. We find that Graemlin achieves substantial scalability gains over previous methods while improving sensitivity.

    View details for DOI 10.1101/gr.5235706

    View details for Web of Science ID 000240238600011

    View details for PubMedID 16899655

  • CONTRAfold: RNA secondary structure prediction without physics-based models BIOINFORMATICS Do, C. B., Woods, D. A., Batzoglou, S. 2006; 22 (14): E90-E98


    For several decades, free energy minimization methods have been the dominant strategy for single sequence RNA secondary structure prediction. More recently, stochastic context-free grammars (SCFGs) have emerged as an alternative probabilistic methodology for modeling RNA structure. Unlike physics-based methods, which rely on thousands of experimentally-measured thermodynamic parameters, SCFGs use fully-automated statistical learning algorithms to derive model parameters. Despite this advantage, however, probabilistic methods have not replaced free energy minimization methods as the tool of choice for secondary structure prediction, as the accuracies of the best current SCFGs have yet to match those of the best physics-based models. In this paper, we present CONTRAfold, a novel secondary structure prediction method based on conditional log-linear models (CLLMs), a flexible class of probabilistic models which generalize upon SCFGs by using discriminative training and feature-rich scoring. In a series of cross-validation experiments, we show that grammar-based secondary structure prediction methods formulated as CLLMs consistently outperform their SCFG analogs. Furthermore, CONTRAfold, a CLLM incorporating most of the features found in typical thermodynamic models, achieves the highest single sequence prediction accuracies to date, outperforming currently available probabilistic and physics-based techniques. Our result thus closes the gap between probabilistic and thermodynamic models, demonstrating that statistical learning procedures provide an effective alternative to empirical measurement of thermodynamic parameters for RNA secondary structure prediction. Source code for CONTRAfold is available online.

    View details for DOI 10.1093/bioinformatics/btl246

    View details for Web of Science ID 000250005000012

    View details for PubMedID 16873527

  • MotifCut: regulatory motifs finding with maximum density subgraphs BIOINFORMATICS Fratkin, E., Naughton, B. T., Brutlag, D. L., Batzoglou, S. 2006; 22 (14): E150-E157


    DNA motif finding is one of the core problems in computational biology, for which several probabilistic and discrete approaches have been developed. Most existing methods formulate motif finding as an intractable optimization problem and rely either on expectation maximization (EM) or on local heuristic searches. Another challenge is the choice of motif model: simpler models such as the position-specific scoring matrix (PSSM) impose biologically unrealistic assumptions such as independence of the motif positions, while more involved models are harder to parametrize and learn. We present MotifCut, a graph-theoretic approach to motif finding leading to a convex optimization problem with a polynomial time solution. We build a graph where the vertices represent all k-mers in the input sequences, and edges represent pairwise k-mer similarity. In this graph, we search for a motif as the maximum density subgraph, which is a set of k-mers that exhibit a large number of pairwise similarities. Our formulation does not make strong assumptions regarding the structure of the motif and in practice both motifs that fit well the PSSM model, and those that exhibit strong dependencies between position pairs are found as dense subgraphs. We benchmark MotifCut on both synthetic and real yeast motifs, and find that it compares favorably to existing popular methods. The ability of MotifCut to detect motifs appears to scale well with increasing input size. Moreover, the motifs we discover are different from those discovered by the other methods. The MotifCut server and other materials are available online.

    View details for DOI 10.1093/bioinformatics/btl243

    View details for Web of Science ID 000250005000019

    View details for PubMedID 16873465
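
The graph formulation in the abstract above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: edges are unweighted and join distinct k-mers within a hypothetical Hamming-distance cutoff, and the dense subgraph is found by greedy peeling (repeatedly removing a minimum-degree vertex), a well-known approximation that is swapped in here in place of the exact polynomial-time optimization the paper describes.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length k-mers."""
    return sum(x != y for x, y in zip(a, b))

def kmer_graph(sequences, k, max_mismatches):
    """Adjacency sets over distinct k-mers; the mismatch cutoff is an assumption."""
    kmers = sorted({s[i:i + k] for s in sequences for i in range(len(s) - k + 1)})
    adj = {u: set() for u in kmers}
    for u, v in combinations(kmers, 2):
        if hamming(u, v) <= max_mismatches:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def densest_subgraph_greedy(adj):
    """Greedy peeling: track the densest (edges / vertices) intermediate subgraph."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    edges = sum(len(vs) for vs in adj.values()) // 2
    best_density, best_nodes = -1.0, set()
    while adj:
        density = edges / len(adj)
        if density > best_density:
            best_density, best_nodes = density, set(adj)
        # Remove a vertex of minimum degree (ties broken by name).
        u = min(adj, key=lambda n: (len(adj[n]), n))
        for v in adj[u]:
            adj[v].discard(u)
        edges -= len(adj[u])
        del adj[u]
    return best_nodes, best_density
```

Density here is the number of edges divided by the number of vertices, so a tight cluster of mutually similar k-mers scores higher than the same cluster with loosely attached neighbours.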

  • Multiple sequence alignment CURRENT OPINION IN STRUCTURAL BIOLOGY Edgar, R. C., Batzoglou, S. 2006; 16 (3): 368-373


    Multiple sequence alignments are an essential tool for protein structure and function prediction, phylogeny inference and other common tasks in sequence analysis. Recently developed systems have advanced the state of the art with respect to accuracy, ability to scale to thousands of proteins and flexibility in comparing proteins that do not share the same domain architecture. New multiple alignment benchmark databases include PREFAB, SABMARK, OXBENCH and IRMBASE. Although CLUSTALW is still the most popular alignment tool to date, recent methods offer significantly better alignment quality and, in some cases, reduced computational cost.

    View details for DOI 10.1016/

    View details for Web of Science ID 000239082100014

    View details for PubMedID 16679011

  • Evidence for intelligent (algorithm) design GENOME BIOLOGY Srinivasan, B. S., Do, C. B., Batzoglou, S. 2006; 7 (7)


    A report on the 10th annual Research in Computational Molecular Biology (RECOMB) Conference, Venice, Italy, 2-5 April 2006.

    View details for DOI 10.1186/gb-2006-7-7-322

    View details for Web of Science ID 000241322700008

    View details for PubMedID 16879725

  • MotifCut: Finding Regulatory Motifs with Maximum Density Subgraphs. Fratkin, E., Naughton, B., Brutlag, D. L., Batzoglou, S. 2006
  • Training conditional random fields for maximum parse accuracy. NIPS Gross, S. S., Russakovsky, O., Do, C. B., Batzoglou, S. 2006
  • Integrated protein interaction networks for 11 microbes Srinivasan, B. S., Novak, A. F., Flannick, J. A., Batzoglou, S., McAdams, H. H. SPRINGER-VERLAG BERLIN. 2006: 1-14
  • CONTRAlign: Discriminative training for protein sequence alignment Do, C. B., Gross, S. S., Batzoglou, S. SPRINGER-VERLAG BERLIN. 2006: 160-174
  • Sequencing of Aspergillus nidulans and comparative analysis with A. fumigatus and A. oryzae NATURE Galagan, J. E., Calvo, S. E., Cuomo, C., Ma, L. J., Wortman, J. R., Batzoglou, S., Lee, S. I., Basturkmen, M., Spevak, C. C., Clutterbuck, J., Kapitonov, V., Jurka, J., Scazzocchio, C., Farman, M., Butler, J., Purcell, S., Harris, S., Braus, G. H., Draht, O., Busch, S., d'Enfert, C., Bouchier, C., Goldman, G. H., Bell-Pedersen, D., Griffiths-Jones, S., Doonan, J. H., Yu, J., Vienken, K., Pain, A., Freitag, M., Selker, E. U., Archer, D. B., Penalva, M. A., Oakley, B. R., Momany, M., Tanaka, T., Kumagai, T., Asai, K., Machida, M., Nierman, W. C., Denning, D. W., Caddick, M., Hynes, M., Paoletti, M., Fischer, R., Miller, B., Dyer, P., Sachs, M. S., Osmani, S. A., Birren, B. W. 2005; 438 (7071): 1105-1115


    The aspergilli comprise a diverse group of filamentous fungi spanning over 200 million years of evolution. Here we report the genome sequence of the model organism Aspergillus nidulans, and a comparative study with Aspergillus fumigatus, a serious human pathogen, and Aspergillus oryzae, used in the production of sake, miso and soy sauce. Our analysis of genome structure provided a quantitative evaluation of forces driving long-term eukaryotic genome evolution. It also led to an experimentally validated model of mating-type locus evolution, suggesting the potential for sexual reproduction in A. fumigatus and A. oryzae. Our analysis of sequence conservation revealed over 5,000 non-coding regions actively conserved across all three species. Within these regions, we identified potential functional elements including a previously uncharacterized TPP riboswitch and motifs suggesting regulation in filamentous fungi by Puf family genes. We further obtained comparative and experimental evidence indicating widespread translational regulation by upstream open reading frames. These results enhance our understanding of these widely studied fungi as well as provide new insight into eukaryotic genome evolution and gene regulation.

    View details for DOI 10.1038/nature04341

    View details for Web of Science ID 000234111500040

    View details for PubMedID 16372000

  • Distribution and intensity of constraint in mammalian genomic sequence GENOME RESEARCH Cooper, G. M., Stone, E. A., Asimenos, G., Green, E. D., Batzoglou, S., Sidow, A. 2005; 15 (7): 901-913


    Comparisons of orthologous genomic DNA sequences can be used to characterize regions that have been subject to purifying selection and are enriched for functional elements. We here present the results of such an analysis on an alignment of sequences from 29 mammalian species. The alignment captures approximately 3.9 neutral substitutions per site and spans approximately 1.9 Mbp of the human genome. We identify constrained elements from 3 bp to over 1 kbp in length, covering approximately 5.5% of the human locus. Our estimate for the total amount of nonexonic constraint experienced by this locus is roughly twice that for exonic constraint. Constrained elements tend to cluster, and we identify large constrained regions that correspond well with known functional elements. While constraint density inversely correlates with mobile element density, we also show the presence of unambiguously constrained elements overlapping mammalian ancestral repeats. In addition, we describe a number of elements in this region that have undergone intense purifying selection throughout mammalian evolution, and we show that these important elements are more numerous than previously thought. These results were obtained with Genomic Evolutionary Rate Profiling (GERP), a statistically rigorous and biologically transparent framework for constrained element identification. GERP identifies regions at high resolution that exhibit nucleotide substitution deficits, and measures these deficits as "rejected substitutions". Rejected substitutions reflect the intensity of past purifying selection and are used to rank and characterize constrained elements. We anticipate that GERP and the types of analyses it facilitates will provide further insights and improved annotation for the human genome as mammalian genome sequence data become richer.

    View details for Web of Science ID 000230424000001

    View details for PubMedID 15965027
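
The "rejected substitutions" measure mentioned in the abstract above is a deficit: the substitutions expected under neutrality minus those observed. A minimal Python sketch of that arithmetic follows; the per-column expected and observed values are taken as given inputs, and the function names and the element representation as (start, end) column ranges are assumptions, not GERP's actual interface.

```python
def rejected_substitutions(expected, observed):
    """Per-column deficit: expected neutral substitutions minus observed ones."""
    if len(expected) != len(observed):
        raise ValueError("expected and observed must cover the same columns")
    return [e - o for e, o in zip(expected, observed)]

def element_score(expected, observed, start, end):
    """Total rejected substitutions over the half-open column range [start, end)."""
    return sum(rejected_substitutions(expected, observed)[start:end])

def rank_elements(expected, observed, elements):
    """Sort candidate (start, end) elements by decreasing rejected substitutions."""
    return sorted(
        elements,
        key=lambda se: element_score(expected, observed, se[0], se[1]),
        reverse=True,
    )
```

A column that evolves at the neutral rate contributes roughly zero, while a strongly constrained column contributes close to its full expected count, so larger totals indicate more intense past purifying selection.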

  • The many faces of sequence alignment BRIEFINGS IN BIOINFORMATICS Batzoglou, S. 2005; 6 (1): 6-22


    Starting with the sequencing of the mouse genome in 2002, we have entered a period where the main focus of genomics will be to compare multiple genomes in order to learn about human biology and evolution at the DNA level. Alignment methods are the main computational component of this endeavour. This short review aims to summarise the current status of research in alignments, emphasising large-scale genomic comparisons and suggesting possible directions that will be explored in the near future.

    View details for Web of Science ID 000228587400002

    View details for PubMedID 15826353

  • ProbCons: Probabilistic consistency-based multiple sequence alignment GENOME RESEARCH Do, C. B., Mahabhashyam, M. S., Brudno, M., Batzoglou, S. 2005; 15 (2): 330-340


    To study gene evolution across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein families. Obtaining accurate alignments, however, is a difficult computational problem because of not only the high computational cost but also the lack of proper objective functions for measuring alignment quality. In this paper, we introduce probabilistic consistency, a novel scoring function for multiple sequence comparisons. We present ProbCons, a practical tool for progressive protein multiple sequence alignment based on probabilistic consistency, and evaluate its performance on several standard alignment benchmark data sets. On the BAliBASE, SABmark, and PREFAB benchmark alignment databases, ProbCons achieves statistically significant improvement over other leading methods while maintaining practical speed. ProbCons is publicly available as a Web resource.

    View details for DOI 10.1101/gr.2821705

    View details for Web of Science ID 000226762500016

    View details for PubMedID 15687296

  • Using multiple alignments to improve seeded local alignment algorithms NUCLEIC ACIDS RESEARCH Flannick, J., Batzoglou, S. 2005; 33 (14): 4563-4577


    Multiple alignments among genomes are becoming increasingly prevalent. This trend motivates the development of tools for efficient homology search between a query sequence and a database of multiple alignments. In this paper, we present an algorithm that uses the information implicit in a multiple alignment to dynamically build an index that is weighted most heavily towards the promising regions of the multiple alignment. We have implemented Typhon, a local alignment tool that incorporates our indexing algorithm, which our test results show to be more sensitive than algorithms that index only a sequence. This suggests that when applied on a whole-genome scale, Typhon should provide improved homology searches in time comparable to existing algorithms.

    View details for DOI 10.1093/nar/gki767

    View details for Web of Science ID 000231362600024

    View details for PubMedID 16100379

  • Sequencing of Aspergillus nidulans and comparative analysis with A. fumigatus and A. oryzae. Nature Galagan, J. E., Calvo, S. E., Cuomo, C., Ma, L. J., Wortman, J., Batzoglou, S. 2005; 438: 1105–1115
  • Algorithmic Challenges in Mammalian Genome Sequence Assembly. Special Review. Encyclopedia of Genomics, Proteomics and Bioinformatics Batzoglou, S. edited by Dunn, M., Jorde, L., Little, P. Hoboken (New Jersey): John Wiley and Sons. 2005: 1
  • TreeRefiner: A tool for refining a multiple alignment on a phylogenetic tree 2005 IEEE COMPUTATIONAL SYSTEMS BIOINFORMATICS CONFERENCE, PROCEEDINGS Manohar, A., Batzoglou, S. 2005: 111-119


    We present TreeRefiner, a tool for refining multiple alignments of biological sequences. Given a multiple alignment, a phylogenetic tree, and scoring parameters as input, TreeRefiner optimizes the sum-of-pairs function in a restricted three-dimensional space around the alignment. At each internal node of the unrooted tree, the multiple alignment is projected to the sub-alignments corresponding to the three neighboring nodes, and three-dimensional dynamic programming is performed within a user-specified radius r around the original alignment. We test TreeRefiner on simulated sequences aligned by several popular tools, and demonstrate substantial improvements in the percentage of correctly aligned positions.

    View details for Web of Science ID 000231800100016

    View details for PubMedID 16447969
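
The sum-of-pairs function that the abstract above names as the optimization objective can be computed as in the following Python sketch. The match, mismatch, and gap values are hypothetical scoring parameters, gap-gap pairs are scored as zero by assumption, and gaps are penalized per column rather than with an affine model; the sketch only evaluates an alignment and does not perform the three-dimensional dynamic programming the tool uses.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned characters; '-' denotes a gap."""
    if a == "-" and b == "-":
        return 0  # assumption: a gap aligned to a gap is neutral
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sum_of_pairs(alignment, **scores):
    """Sum pair_score over every pair of rows in every alignment column."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment rows must be non-empty and of equal length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b, **scores)
    return total
```

For example, `sum_of_pairs(["AC-", "ACG", "A-G"])` scores the first column as three matches and each of the other two columns as one match and two gap pairs.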

  • A suite of web-based programs to search for transcriptional regulatory motifs NUCLEIC ACIDS RESEARCH Liu, Y. Y., Wei, L. P., Batzoglou, S., Brutlag, D. L., Liu, J. S., Liu, X. S. 2004; 32: W204-W207


    The identification of regulatory motifs is important for the study of gene expression. Here we present a suite of programs that we have developed to search for regulatory sequence motifs: (i) BioProspector, a Gibbs-sampling-based program for predicting regulatory motifs from co-regulated genes in prokaryotes or lower eukaryotes; (ii) CompareProspector, an extension to BioProspector which incorporates comparative genomics features to be used for higher eukaryotes; (iii) MDscan, a program for finding protein-DNA interaction sites from ChIP-on-chip targets. All three programs examine a group of sequences that may share common regulatory motifs and output a list of putative motifs as position-specific probability matrices, the individual sites used to construct the motifs and the location of each site on the input sequences. The web servers and executables are accessible online.

    View details for DOI 10.1093/nar/gkh461

    View details for Web of Science ID 000222273100043

    View details for PubMedID 15215381

  • Genome sequence of the Brown Norway rat yields insights into mammalian evolution NATURE Gibbs, R. A., Weinstock, G. M., Metzker, M. L., Muzny, D. M., Sodergren, E. J., Scherer, S., Scott, G., Steffen, D., Worley, K. C., Burch, P. E., Okwuonu, G., Hines, S., Lewis, L., DeRamo, C., Delgado, O., Dugan-Rocha, S., Miner, G., Morgan, M., Hawes, A., Gill, R., Holt, R. A., Adams, M. D., Amanatides, P. G., Baden-Tillson, H., Barnstead, M., Chin, S., Evans, C. A., Ferriera, S., Fosler, C., Glodek, A., Gu, Z. P., Jennings, D., Kraft, C. L., Nguyen, T., Pfannkoch, C. M., Sitter, C., Sutton, G. G., Venter, J. C., Woodage, T., Smith, D., Lee, H. M., Gustafson, E., Cahill, P., Kana, A., Doucette-Stamm, L., Weinstock, K., Fechtel, K., Weiss, R. B., Dunn, D. M., Green, E. D., Blakesley, R. W., Bouffard, G. G., de Jong, J., Osoegawa, K., Zhu, B. L., Marra, M., Schein, J., Bosdet, I., Fjell, C., Jones, S., Krzywinski, M., Mathewson, C., Siddiqui, A., Wye, N., McPherson, J., Zhao, S. Y., Fraser, C. M., Shetty, J., Shatsman, S., Geer, K., Chen, Y. X., Abramzon, S., Nierman, W. C., Gibbs, R. A., Weinstock, G. M., Havlak, P. H., Chen, R., Durbin, K. J., Egan, A., Ren, Y. R., Song, X. Z., Li, B. S., Liu, Y., Qin, X., Cawley, S., Weinstock, G. M., Worley, K. C., Cooney, A. J., Gibbs, R. A., D'Souza, L. M., Martin, K., Wu, J. Q., Gonzalez-Garay, M. L., Jackson, A. R., Kalafus, K. J., McLeod, M. P., Milosavljevic, A., Virk, D., Volkov, A., Wheeler, D. A., Zhang, Z. D., Bailey, J. A., Eichler, E. E., Tuzun, E., Birney, E., Mongin, E., Ureta-Vidal, A., Woodwark, C., Zdobnov, E., Bork, P., Suyama, M., Torrents, D., Alexandersson, M., Trask, B. J., Young, J. M., Smith, D., Huang, H., Fechtel, K., Wang, H. J., Xing, H. M., Weinstock, K., Daniels, S., Gietzen, D., Schmidt, J., Stevens, K., Vitt, U., Wingrove, J., Camara, F., Schmidt, J., Stevens, K., Vitt, U., Wingrove, J., Camara, F., Alba, M. M., Abril, J. F., Guigo, R., Smit, A., Dubchak, I., Rubin, E. M., Couronne, O., Poliakov, A., Hubner, N., Ganten, D., Goesele, C., Hummel, O., Kreitler, T., Lee, Y. A., Monti, J., Schulz, H., Zimdahl, H., Himmelbauer, H., Lehrach, H., Jacob, H. J., Bromberg, S., Gullings-Handley, J., Jensen-Seaman, M. I., Kwitek, A. E., Lazar, J., Pasko, D., Tonellato, P. J., Twigger, S., Ponting, P., Duarte, J. M., Rice, S., Goodstadt, L., Beatson, S. A., Emes, R. D., Winter, E. E., Webber, C., Brandt, P., Nyakatura, G., Adetobi, M., Chiaromonte, F., Elnitski, L., Eswara, P., Hardison, R. C., Hou, M. M., Kolbe, D., Makova, K., Miller, W., Nekrutenko, A., Riemer, C., Schwartz, S., Taylor, J., Yang, S., Zhang, Y., Lindpaintner, K., Andrews, T. D., Caccamo, M., Clamp, M., Clarke, L., Curwen, V., Durbin, R., Eyras, E., Searle, S. M., Cooper, G. M., Batzoglou, S., Brudno, M., Sidow, A., Stone, E. A., Venter, J. C., Payseur, B. A., Bourque, G., Lopez-Otin, C., Puente, X. S., Chakrabarti, K., Chatterji, S., Dewey, C., Pachter, L., Bray, N., Yap, V. B., Caspi, A., Tesler, G., Pevzner, P. A., Haussler, D., Roskin, K. M., Baertsch, R., Clawson, H., Furey, T. S., Hinrichs, A. S., Karolchik, D., Kent, W. J., Rosenbloom, K. R., Trumbower, H., Weirauch, M., Cooper, D. N., Stenson, P. D., Ma, B., Brent, M., Arumugam, M., Shteynberg, D., Copley, R. R., Taylor, M. S., Riethman, H., Mudunuri, U., Peterson, J., Guyer, M., Felsenfeld, A., Old, S., Mockrin, S., Collins, F. 2004; 428 (6982): 493-521


    The laboratory rat (Rattus norvegicus) is an indispensable tool in experimental medicine and drug development, having made inestimable contributions to human health. We report here the genome sequence of the Brown Norway (BN) rat strain. The sequence represents a high-quality 'draft' covering over 90% of the genome. The BN rat sequence is the third complete mammalian genome to be deciphered, and three-way comparisons with the human and mouse genomes resolve details of mammalian evolution. This first comprehensive analysis includes genes and proteins and their relation to human disease, repeated sequences, comparative genome-wide studies of mammalian orthologous chromosomal regions and rearrangement breakpoints, reconstruction of ancestral karyotypes and the events leading to existing species, rates of variation, and lineage-specific and lineage-independent evolutionary events such as expansion of gene families, orthology relations and protein evolution.

    View details for DOI 10.1038/nature02426

    View details for Web of Science ID 000220540100032

    View details for PubMedID 15057822

  • Characterization of evolutionary rates and constraints in three mammalian genomes GENOME RESEARCH Cooper, G. M., Brudno, M., Stone, E. A., Dubchak, I., Batzoglou, S., Sidow, A. 2004; 14 (4): 539-548


    We present an analysis of rates and patterns of microevolutionary phenomena that have shaped the human, mouse, and rat genomes since their last common ancestor. We find evidence for a shift in the mutational spectrum between the mouse and rat lineages, with the net effect being a relative increase in GC content in the rat genome. Our estimate for the neutral point substitution rate separating the two rodents is 0.196 substitutions per site, and 0.65 substitutions per site for the tree relating all three mammals. Small insertions and deletions of 1-10 bp in length ("microindels") occur at approximately 5% of the point substitution rate. Inferred regional correlations in evolutionary rates between lineages and between types of sites support the idea that rates of evolution are influenced by local genomic or cell biological context. No substantial correlations between rates of point substitutions and rates of microindels are found, however, implying that the influences that affect these processes are distinct. Finally, we have identified those regions in the human genome that are evolving slowly, which are likely to include functional elements important to human biology. At least 5% of the human genome is under substantial constraint, most of which is noncoding.

    View details for DOI 10.1101/gr.2034704

    View details for Web of Science ID 000220629900005

    View details for PubMedID 15059994

  • Automated whole-genome multiple alignment of rat, mouse, and human GENOME RESEARCH Brudno, M., Poliakov, A., Salamov, A., Cooper, G. M., Sidow, A., Rubin, E. M., Solovyev, V., Batzoglou, S., Dubchak, I. 2004; 14 (4): 685-692


    We have built a whole-genome multiple alignment of the three currently available mammalian genomes using a fully automated pipeline that combines the local/global approach of the Berkeley Genome Pipeline and the LAGAN program. The strategy is based on progressive alignment and consists of two main steps: (1) alignment of the mouse and rat genomes, and (2) alignment of human to either the mouse-rat alignments from step 1, or the remaining unaligned mouse and rat sequences. The resulting alignments demonstrate high sensitivity, with 87% of all human gene-coding areas aligned in both mouse and rat. The specificity is also high: <7% of the rat contigs are aligned to multiple places in human, and 97% of all alignments with human sequence >100 kb agree with a three-way synteny map built independently, using predicted exons in the three genomes. At the nucleotide level <1% of the rat nucleotides are mapped to multiple places in the human sequence in the alignment, and 96.5% of human nucleotides within all alignments agree with the synteny map. The alignments are publicly available online, with visualization through the novel Multi-VISTA browser that we also present.

    View details for DOI 10.1101/gr.2067704

    View details for Web of Science ID 000220629900022

    View details for PubMedID 15060011

  • Phylo-VISTA: interactive visualization of multiple DNA sequence alignments BIOINFORMATICS Shah, N., Couronne, O., Pennacchio, L. A., Brudno, M., Batzoglou, S., Bethel, E. W., Rubin, E. M., Hamann, B., Dubchak, I. 2004; 20 (5): 636-U122


    The power of multi-sequence comparison for biological discovery is well established. The need for new capabilities to visualize and compare cross-species alignment data is intensified by the growing number of genomic sequence datasets being generated for an ever-increasing number of organisms. To be efficient these visualization algorithms must support the ability to accommodate consistently a wide range of evolutionary distances in a comparison framework based upon phylogenetic relationships. We have developed Phylo-VISTA, an interactive tool for analyzing multiple alignments by visualizing a similarity measure for multiple DNA sequences. The complexity of visual presentation is effectively organized using a framework based upon interspecies phylogenetic relationships. The phylogenetic organization supports rapid, user-guided interspecies comparison. To aid in navigation through large sequence datasets, Phylo-VISTA leverages concepts from VISTA that provide a user with the ability to select and view data at varying resolutions. The combination of multiresolution data visualization and analysis, combined with the phylogenetic framework for interspecies comparison, produces a highly flexible and powerful tool for visual data analysis of multiple sequence alignments. Phylo-VISTA is available online; it requires an Internet browser with Java Plug-in 1.4.2 and is integrated into the global alignment program LAGAN.

    View details for DOI 10.1093/bioinformatics/btg459

    View details for Web of Science ID 000220485300006

    View details for PubMedID 15033870

  • Eukaryotic regulatory element conservation analysis and identification using comparative genomics GENOME RESEARCH Liu, Y. Y., Liu, X. S., Wei, L. P., Altman, R. B., Batzoglou, S. 2004; 14 (3): 451-458


    Comparative genomics is a promising approach to the challenging problem of eukaryotic regulatory element identification, because functional noncoding sequences may be conserved across species from evolutionary constraints. We systematically analyzed known human and Saccharomyces cerevisiae regulatory elements and discovered that human regulatory elements are more conserved between human and mouse than are background sequences. Although S. cerevisiae regulatory elements do not appear to be more conserved by comparison of S. cerevisiae to Schizosaccharomyces pombe, they are more conserved when compared with multiple other yeast genomes (Saccharomyces paradoxus, Saccharomyces mikatae, and Saccharomyces bayanus). Based on these analyses, we developed a sequence-motif-finding algorithm called CompareProspector, which extends Gibbs sampling by biasing the search in regions conserved across species. Using human-mouse comparison, CompareProspector identified known motifs for transcription factors Mef2, Myf, Srf, and Sp1 from a set of human-muscle-specific genes. It also discovered the NFAT motif from genes up-regulated by CD28 stimulation in T-cells, which implies the direct involvement of NFAT in mediating the CD28 stimulatory signal. Using Caenorhabditis elegans-Caenorhabditis briggsae comparison, CompareProspector found the PHA-4 motif and the UNC-86 motif. CompareProspector outperformed many other computational motif-finding programs, demonstrating the power of comparative genomics-based biased sampling in eukaryotic regulatory element identification.

    View details for Web of Science ID 000189389100013

    View details for PubMedID 14993210

  • ICA-based clustering of genes from microarray expression data Lee, S. I., Batzoglou, S. M I T PRESS. 2004: 675-682
  • Eukaryotic regulatory element conservation and their identification using comparative genomics. Genome Research Liu, Y., Liu, X. S., Wei, L., Altman, R. B., Batzoglou, S. 2004; 14: 451-458
  • The ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science Batzoglou, S. 2004; 306: 636–640
  • Chaining algorithms for alignment of draft sequence Sundararajan, M., Brudno, M., Small, K., Sidow, A., Batzoglou, S. SPRINGER-VERLAG BERLIN. 2004: 326-337
  • PROBCONS: Probabilistic consistency-based multiple alignment of amino acid sequences Do, C. B., Brudno, M., Batzoglou, S. ASSOC ADVANCEMENT ARTIFICIAL INTELLIGENCE. 2004: 703-708
  • A computational model for RNA multiple structural alignment Davydov, E., Batzoglou, S. SPRINGER-VERLAG BERLIN. 2004: 254-269
  • Fast and sensitive multiple alignment of large genomic sequences BMC BIOINFORMATICS Brudno, M., Chapman, M., Gottgens, B., Batzoglou, S., Morgenstern, B. 2003; 4


    Genomic sequence alignment is a powerful method for genome analysis and annotation, as alignments are routinely used to identify functional sites such as genes or regulatory elements. With a growing number of partially or completely sequenced genomes, multiple alignment is playing an increasingly important role in these studies. In recent years, various tools for pair-wise and multiple genomic alignment have been proposed. Some of them are extremely fast, but often efficiency is achieved at the expense of sensitivity. One way of combining speed and sensitivity is to use an anchored-alignment approach. In a first step, a fast search program identifies a chain of strong local sequence similarities. In a second step, regions between these anchor points are aligned using a slower but more accurate method. Herein, we present CHAOS, a novel algorithm for rapid identification of chains of local pair-wise sequence similarities. Local alignments calculated by CHAOS are used as anchor points to improve the running time of DIALIGN, a slow but sensitive multiple-alignment tool. We show that this way, the running time of DIALIGN can be reduced by more than 95% for BAC-sized and longer sequences, without affecting the quality of the resulting alignments. We apply our approach to a set of five genomic sequences around the stem-cell-leukemia (SCL) gene and demonstrate that exons and small regulatory elements can be identified by our multiple-alignment procedure. We conclude that the novel CHAOS local alignment tool is an effective way to significantly speed up global alignment tools such as DIALIGN without reducing the alignment quality. We likewise demonstrate that the DIALIGN/CHAOS combination is able to accurately align short regulatory sequences in distant orthologues.

    View details for Web of Science ID 000189033700001

    View details for PubMedID 14693042
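
    The two-step anchored-alignment idea described in the abstract above can be illustrated with a minimal sketch. This is not the CHAOS or DIALIGN implementation: the function names `best_chain` and `regions_between`, the point-like anchor representation, and the simple quadratic-time dynamic program are all assumptions made for illustration. Step one picks the highest-scoring chain of anchors that is increasing in both sequences; step two lists the regions between consecutive anchors, which would be handed to a slower, more sensitive aligner.

```python
from typing import List, Tuple

# Hypothetical anchor type: (position in sequence A, position in sequence B, score).
Anchor = Tuple[int, int, float]
Region = Tuple[Tuple[int, int], Tuple[int, int]]


def best_chain(anchors: List[Anchor]) -> List[Anchor]:
    """Return the highest-scoring chain that is strictly increasing in both coordinates."""
    if not anchors:
        return []
    ordered = sorted(anchors)
    best = [score for _, _, score in ordered]  # best chain score ending at each anchor
    prev = [-1] * len(ordered)                 # back-pointers for chain recovery
    for i, (xi, yi, si) in enumerate(ordered):
        for j in range(i):
            xj, yj, _ = ordered[j]
            if xj < xi and yj < yi and best[j] + si > best[i]:
                best[i] = best[j] + si
                prev[i] = j
    end = max(range(len(ordered)), key=best.__getitem__)
    chain: List[Anchor] = []
    while end != -1:
        chain.append(ordered[end])
        end = prev[end]
    chain.reverse()
    return chain


def regions_between(chain: List[Anchor], len_a: int, len_b: int) -> List[Region]:
    """List the (A-interval, B-interval) pairs lying between consecutive anchors."""
    bounds = [(0, 0)] + [(x, y) for x, y, _ in chain] + [(len_a, len_b)]
    return [((x0, x1), (y0, y1)) for (x0, y0), (x1, y1) in zip(bounds, bounds[1:])]


anchors = [(10, 12, 5.0), (30, 8, 9.0), (40, 45, 4.0), (20, 25, 3.0), (60, 70, 6.0)]
chain = best_chain(anchors)
print(chain)
print(regions_between(chain, 100, 100))
```

    Restricting the slower aligner to the regions between anchors is what reduces the search space; a practical tool would also account for anchor lengths and use a faster chaining procedure than this quadratic loop.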

  • AGenDA: homology-based gene prediction BIOINFORMATICS Taher, L., Rinner, O., Garg, S., Sczyrba, A., Brudno, M., Batzoglou, S., Morgenstern, B. 2003; 19 (12): 1575-1577


    We present a web server for homology-based gene prediction. The user enters a pair of evolutionarily related genomic sequences, for example from human and mouse. Our software system uses CHAOS and DIALIGN to calculate an alignment of the input sequences and then searches for conserved splicing signals and start/stop codons around regions of local sequence similarity. This way, candidate exons are identified that are used, in turn, to calculate optimal gene models. The server returns the constructed gene model by email, together with a graphical representation of the underlying genomic alignment.

    View details for DOI 10.1093/bioinformatics/btg181

    View details for Web of Science ID 000184878700017

    View details for PubMedID 12912840

  • Identification of promoter regions in the human genome by using a retroviral plasmid library-based functional reporter gene assay GENOME RESEARCH Khambata-Ford, S., Liu, Y. Y., Gleason, C., Dickson, M., Altman, R. B., Batzoglou, S., Myers, R. M. 2003; 13 (7): 1765-1774


    Attempts to identify regulatory sequences in the human genome have involved experimental and computational methods such as cross-species sequence comparisons and the detection of transcription factor binding-site motifs in coexpressed genes. Although these strategies provide information on which genomic regions are likely to be involved in gene regulation, they do not give information on their functions. We have developed a functional selection for promoter regions in the human genome that uses a retroviral plasmid library-based system. This approach enriches for and detects promoter function of isolated DNA fragments in an in vitro cell culture assay. By using this method, we have discovered likely promoters of known and predicted genes, as well as many other putative promoter regions based on the presence of features such as CpG islands. Comparison of sequences of 858 plasmid clones selected by this assay with the human genome draft sequence indicates that a significantly higher percentage of sequences align to the 500-bp segment upstream of the transcription start sites of known genes than would be expected from random genomic sequences. We also observed enrichment for putative promoter regions of genes predicted in at least two annotation databases and for clones overlapping with CpG islands. Functional validation of randomly selected clones enriched by this method showed that a large fraction of these putative promoters can drive the expression of a reporter gene in transient transfection experiments. This method promises to be a useful genome-wide function-based approach that can complement existing methods to look for promoters.

    View details for DOI 10.1101/gr.529803

    View details for Web of Science ID 000183970000023

    View details for PubMedID 12805274

  • Glocal alignment: finding rearrangements during alignment BIOINFORMATICS Brudno, M., Malde, S., Poliakov, A., Do, C. B., Couronne, O., Dubchak, I., Batzoglou, S. 2003; 19: i54-i62


    To compare entire genomes from different species, biologists increasingly need alignment methods that are efficient enough to handle long sequences, and accurate enough to correctly align the conserved biological features between distant species. The two main classes of pairwise alignments are global alignment, where one string is transformed into the other, and local alignment, where all locations of similarity between the two strings are returned. Global alignments are less prone to demonstrating false homology as each letter of one sequence is constrained to being aligned to only one letter of the other. Local alignments, on the other hand, can cope with rearrangements between non-syntenic, orthologous sequences by identifying similar regions in sequences; this, however, comes at the expense of a higher false positive rate due to the inability of local aligners to take into account overall conservation maps. In this paper we introduce the notion of glocal alignment, a combination of global and local methods, where one creates a map that transforms one sequence into the other while allowing for rearrangement events. We present Shuffle-LAGAN, a glocal alignment algorithm that is based on the CHAOS local alignment algorithm and the LAGAN global aligner, and is able to align long genomic sequences. To test Shuffle-LAGAN we split the mouse genome into BAC-sized pieces, and aligned these pieces to the human genome. We demonstrate that Shuffle-LAGAN compares favorably in terms of sensitivity and specificity with standard local and global aligners. From the alignments we conclude that about 9% of human/mouse homology may be attributed to small rearrangements, 63% of which are duplications.

    View details for DOI 10.1093/bioinformatics/btg1005

    View details for Web of Science ID 000207434200007

    View details for PubMedID 12855437

  • Quantitative estimates of sequence divergence for comparative analyses of mammalian genomes GENOME RESEARCH Cooper, G. M., Brudno, M., Green, E. D., Batzoglou, S., Sidow, A. 2003; 13 (5): 813-820


    Comparative sequence analyses on a collection of carefully chosen mammalian genomes could facilitate identification of functional elements within the human genome and allow quantification of evolutionary constraint at the single nucleotide level. High-resolution quantification would be informative for determining the distribution of important positions within functional elements and for evaluating the relative importance of nucleotide sites that carry single nucleotide polymorphisms (SNPs). Because the level of resolution in comparative sequence analyses is a direct function of sequence diversity, we propose that the information content of a candidate mammalian genome be defined as the sequence divergence it would add relative to already-sequenced genomes. We show that reliable estimates of genomic sequence divergence can be obtained from small genomic regions. On the basis of a multiple sequence alignment of approximately 1.4 megabases each from eight mammals, we generate such estimates for five unsequenced mammals. Estimates of the neutral divergence in these data suggest that a small number of diverse mammalian genomes in addition to human, mouse, and rat would allow single nucleotide resolution in comparative sequence analyses.

    View details for DOI 10.1101/gr.1064503

    View details for Web of Science ID 000182645500007

    View details for PubMedID 12727901

  • LAGAN and Multi-LAGAN: Efficient tools for large-scale multiple alignment of genomic DNA GENOME RESEARCH Brudno, M., Do, C. B., Cooper, G. M., Kim, M. F., Davydov, E., Green, E. D., Sidow, A., Batzoglou, S. 2003; 13 (4): 721-731


    To compare entire genomes from different species, biologists increasingly need alignment methods that are efficient enough to handle long sequences, and accurate enough to correctly align the conserved biological features between distant species. We present LAGAN, a system for rapid global alignment of two homologous genomic sequences, and Multi-LAGAN, a system for multiple global alignment of genomic sequences. We tested our systems on a data set consisting of greater than 12 Mb of high-quality sequence from 12 vertebrate species. All the sequence was derived from the genomic region orthologous to an approximately 1.5-Mb region on human chromosome 7q31.3. We found that both LAGAN and Multi-LAGAN compare favorably with other leading alignment methods in correctly aligning protein-coding exons, especially between distant homologs such as human and chicken, or human and fugu. Multi-LAGAN produced the most accurate alignments, while requiring just 75 minutes on a personal computer to obtain the multiple alignment of all 12 sequences. Multi-LAGAN is a practical method for generating multiple alignments of long genomic sequences at any evolutionary distance. Our systems are publicly available at

    View details for DOI 10.1101/gr.926603

    View details for Web of Science ID 000182046300018

    View details for PubMedID 12654723

  • Application of independent component analysis to microarrays GENOME BIOLOGY Lee, S. I., Batzoglou, S. 2003; 4 (11)


    We apply linear and nonlinear independent component analysis (ICA) to project microarray data into statistically independent components that correspond to putative biological processes, and to cluster genes according to over- or under-expression in each component. We test the statistical significance of enrichment of gene annotations within clusters. ICA outperforms other leading methods, such as principal component analysis, k-means clustering and the Plaid model, in constructing functionally coherent clusters on microarray datasets from Saccharomyces cerevisiae, Caenorhabditis elegans and human.

    View details for Web of Science ID 000186342700012

    View details for PubMedID 14611662

  • Phylo-VISTA: an interactive visualization tool for multiple DNA sequence alignments. Bioinformatics Shan, N., Couronne, O., Pennacchio, L. A., Brudno, M., Batzoglou, S., Joy, S. 2003; 19: 1575-1577
  • Gene Regulation, Session Introduction. Batzoglou, S., Pachter, L. 2003
  • Glocal alignment: finding rearrangements during alignment. Brudno, M., Malde, S., Poliakov, A., Do, C., Couronne, O., Dubchak, I., Batzoglou, S. 2003
  • ARACHNE: A whole genome shotgun assembler. Genome Research Batzoglou, S., Jaffe, D., Stanley, K., Butler, J., Gnerre, S., Mauceli, E. 2002; 12: 177-189
  • Initial sequencing and analysis of the human genome NATURE Lander, E. S., Int Human Genome Sequencing Consortium, Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., Funke, R., Gage, D., Harris, K., Heaford, A., Howland, J., Kann, L., Lehoczky, J., Levine, R., McEwan, P., McKernan, K., Meldrim, J., Mesirov, J. P., Miranda, C., Morris, W., Naylor, J., Raymond, C., Rosetti, M., Santos, R., Sheridan, A., Sougnez, C., Stange-Thomann, N., Stojanovic, N., Subramanian, A., Wyman, D., Rogers, J., Sulston, J., Ainscough, R., Beck, S., Bentley, D., Burton, J., Clee, C., Carter, N., Coulson, A., Deadman, R., Deloukas, P., Dunham, A., Dunham, I., Durbin, R., French, L., Grafham, D., Gregory, S., Hubbard, T., Humphray, S., Hunt, A., Jones, M., Lloyd, C., McMurray, A., Matthews, L., Mercer, S., Milne, S., Mullikin, J. C., Mungall, A., Plumb, R., Ross, M., Shownkeen, R., Sims, S., Waterston, R. H., Wilson, R. K., Hillier, L. W., McPherson, J. D., Marra, M. A., Mardis, E. R., Fulton, L. A., Chinwalla, A. T., Pepin, K. H., Gish, W. R., Chissoe, S. L., Wendl, M. C., Delehaunty, K. D., Miner, T. L., Delehaunty, A., Kramer, J. B., Cook, L. L., Fulton, R. S., Johnson, D. L., Minx, P. J., Clifton, S. W., Hawkins, T., Branscomb, E., Predki, P., Richardson, P., Wenning, S., Slezak, T., Doggett, N., Cheng, J. F., Olsen, A., Lucas, S., Elkin, C., Uberbacher, E., Frazier, M., Gibbs, R. A., Muzny, D. M., Scherer, S. E., Bouck, J. B., Sodergren, E. J., Worley, K. C., Rives, C. M., Gorrell, J. H., Metzker, M. L., Naylor, S. L., Kucherlapati, R. S., Nelson, D. L., Weinstock, G. M., Sakaki, Y., Fujiyama, A., Hattori, M., Yada, T., Toyoda, A., Itoh, T., Kawagoe, C., Watanabe, H., Totoki, Y., Taylor, T., Weissenbach, J., Heilig, R., Saurin, W., Artiguenave, F., Brottier, P., Bruls, T., Pelletier, E., Robert, C., Wincker, P., Rosenthal, A., Platzer, M., Nyakatura, G., Taudien, S., Rump, A., Yang, H. M., Yu, J., Wang, J., Huang, G. Y., Gu, J., Hood, L., Rowen, L., Madan, A., Qin, S. Z., Davis, R. W., Federspiel, N. A., Abola, A. P., Proctor, M. J., Myers, R. M., Schmutz, J., Dickson, M., Grimwood, J., Cox, D. R., Olson, M. V., Kaul, R., Raymond, C., Shimizu, N., Kawasaki, K., Minoshima, S., Evans, G. A., Athanasiou, M., Schultz, R., Roe, B. A., Chen, F., Pan, H. Q., Ramser, J., Lehrach, H., Reinhardt, R., McCombie, W. R., de la Bastide, M., Dedhia, N., Blocker, H., Hornischer, K., Nordsiek, G., Agarwala, R., Aravind, L., Bailey, J. A., Bateman, A., Batzoglou, S., Birney, E., Bork, P., Brown, D. G., Burge, C. B., Cerutti, L., Chen, H. C., Church, D., Clamp, M., Copley, R. R., Doerks, T., Eddy, S. R., Eichler, E. E., Furey, T. S., Galagan, J., Gilbert, J. G., Harmon, C., Hayashizaki, Y., Haussler, D., Hermjakob, H., Hokamp, K., Jang, W. H., Johnson, L. S., Jones, T. A., Kasif, S., Kaspryzk, A., Kennedy, S., Kent, W. J., Kitts, P., Koonin, E. V., Korf, I., Kulp, D., Lancet, D., Lowe, T. M., McLysaght, A., Mikkelsen, T., Moran, J. V., Mulder, N., Pollara, V. J., Ponting, C. P., Schuler, G., Schultz, J. R., Slater, G., Smit, A. F., Stupka, E., Szustakowki, J., Thierry-Mieg, D., Thierry-Mieg, J., Wagner, L., Wallis, J., Wheeler, R., Williams, A., Wolf, Y. I., Wolfe, K. H., Yang, S. P., Yeh, R. F., Collins, F., Guyer, M. S., Peterson, J., Felsenfeld, A., Wetterstrand, K. A., Patrinos, A., Morgan, M. J. 2001; 409 (6822): 860-921


    The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

    View details for DOI 10.1038/35057062

    View details for Web of Science ID 000166938800058

    View details for PubMedID 11237011

  • Distributed Algorithms: Instructor's Manual. Lynch, N., Batzoglou, S., Boyko, V. Morgan Kaufmann. 2001
  • Prediction of Self-Assembly of Energetic Tiles and Dominos: Experiments, Mathematics and Software. Sandia Labs Technical Report Istrail, S., Hurd, A., Lippert, R. A., Walenz, B., Batzoglou, S., Conway, J. H. 2000
  • Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Research Batzoglou, S., Pachter, L., Mesirov, J. P., Berger, B., Lander, E. S. 2000; 10: 950-958
  • Sequencing a genome by walking with clone-ends: A mathematical analysis. Batzoglou, S., Mesirov, J. P., Berger, B., Lander, E. S. 2000
  • Human and mouse gene structure: comparative analysis and application to exon prediction. Batzoglou, S., Pachter, L., Mesirov, J. P., Berger, B., Lander, E. S. 2000
  • Sequencing a genome by walking with clone-ends: A mathematical analysis. Genome Research Batzoglou, S., Mesirov, J. P., Berger, B., Lander, E. S. 1999; 9: 1163-1174
  • Physical mapping with repeated probes: The hypergraph superstring problem. Lecture Notes in Computer Science Batzoglou, S., Istrail, S. 1999; 1645: 66
  • A dictionary based approach to gene annotation. Pachter, L., Batzoglou, S., Spitkovsky, V. I., Beebee, W., Lander, E. S., Berger, B. 1999
  • A dictionary based approach to gene annotation. Journal of Computational Biology Pachter, L., Batzoglou, S., Spitkovsky, V. I., Banks, E., Lander, E. S., Kleitman, D. J. 1999; 6: 419–430
  • Recent developments in computational gene recognition. Documenta Mathematica Extra Volume ICM I Batzoglou, S., Berger, B., Kleitman, D. J., Lander, E. S., Pachter, L. 1998: 649-658
  • Local rules for protein folding on a triangular lattice and generalized hydrophobicity in the HP model. Journal of Computational Biology Agarwala, R., Batzoglou, S., Dancik, V., Decatur, S. E., Hannenhalli, S., Farach, M. 1997; 4: 275–296
  • Local rules for protein folding on a triangular lattice and generalized hydrophobicity in the HP model. Agarwala, R., Batzoglou, S., Dancik, V., Decatur, S. E., Hannenhalli, S., Farach, M. 1997
  • Protein folding in the hydrophobic-polar model on the 3D triangular lattice. Decatur, S., Batzoglou, S. 1997

Stanford Medicine Resources: