Research & Scholarship

Current Research and Scholarly Interests

Our current interest centers on the application of statistics to problems arising in biology. We are particularly interested in questions concerning gene regulation and signal transduction.


Graduate and Fellowship Programs

  • Biology (School of Humanities and Sciences) (PhD Program)


All Publications

  • Learning regulatory programs by threshold SVD regression PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Ma, X., Xiao, L., Wong, W. H. 2014; 111 (44): 15675-15680


    We formulate a statistical model for the regulation of global gene expression by multiple regulatory programs and propose a thresholding singular value decomposition (T-SVD) regression method for learning such a model from data. Extensive simulations demonstrate that this method offers improved computational speed and higher sensitivity and specificity over competing approaches. The method is used to analyze microRNA (miRNA) and long noncoding RNA (lncRNA) data from The Cancer Genome Atlas (TCGA) consortium. The analysis yields previously unidentified insights into the combinatorial regulation of gene expression by noncoding RNAs, as well as findings that are supported by evidence from the literature.

    View details for DOI 10.1073/pnas.1417808111

    View details for Web of Science ID 000344088100029

    View details for PubMedID 25331876

  • Density estimation on multivariate censored data with optional Polya tree BIOSTATISTICS Seok, J., Tian, L., Wong, W. H. 2014; 15 (1): 182-195


    Analyzing the failure times of multiple events is of interest in many fields. Estimating the joint distribution of the failure times in a non-parametric way is not straightforward because some failure times are often right-censored and only known to be greater than observed follow-up times. Although it has been studied, there is no universally optimal solution for this problem. It is still challenging and important to provide alternatives that may be more suitable than existing ones in specific settings. Related problems of the existing methods are not only limited to infeasible computations, but also include the lack of optimality and possible non-monotonicity of the estimated survival function. In this paper, we proposed a non-parametric Bayesian approach for directly estimating the density function of multivariate survival times, where the prior is constructed based on the optional Pólya tree. We investigated several theoretical aspects of the procedure and derived an efficient iterative algorithm for implementing the Bayesian procedure. The empirical performance of the method was examined via extensive simulation studies. Finally, we presented a detailed analysis using the proposed method on the relationship among organ recovery times in severely injured patients. From the analysis, we suggested interesting medical information that can be further pursued in clinics.

    View details for DOI 10.1093/biostatistics/kxt025

    View details for Web of Science ID 000328286700019

    View details for PubMedID 23902636

  • Characterization of the human ESC transcriptome by hybrid sequencing PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Au, K. F., Sebastiano, V., Afshar, P. T., Durruthy, J. D., Lee, L., Williams, B. A., van Bakel, H., Schadt, E. E., Reijo-Pera, R. A., Underwood, J. G., Wong, W. H. 2013; 110 (50): E4821-E4830


    Although transcriptional and posttranscriptional events are detected in RNA-Seq data from second-generation sequencing, full-length mRNA isoforms are not captured. On the other hand, third-generation sequencing, which yields much longer reads, has current limitations of lower raw accuracy and throughput. Here, we combine second-generation sequencing and third-generation sequencing with a custom-designed method for isoform identification and quantification to generate a high-confidence isoform dataset for human embryonic stem cells (hESCs). We report 8,084 RefSeq-annotated isoforms detected as full-length and an additional 5,459 isoforms predicted through statistical inference. Over one-third of these are novel isoforms, including 273 RNAs from gene loci that have not previously been identified. Further characterization of the novel loci indicates that a subset is expressed in pluripotent cells but not in diverse fetal and adult tissues; moreover, their reduced expression perturbs the network of pluripotency-associated genes. Results suggest that gene identification, even in well-characterized human cell lines and tissues, is likely far from complete.

    View details for DOI 10.1073/pnas.1320101110

    View details for Web of Science ID 000328061700004

    View details for PubMedID 24282307

  • Multivariate Density Estimation by Bayesian Sequential Partitioning JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Lu, L., Jiang, H., Wong, W. H. 2013; 108 (504): 1402-1410
  • Early role for IL-6 signalling during generation of induced pluripotent stem cells revealed by heterokaryon RNA-Seq NATURE CELL BIOLOGY Brady, J. J., Li, M., Suthram, S., Jiang, H., Wong, W. H., Blau, H. M. 2013; 15 (10): 1244-1252


    Molecular insights into somatic cell reprogramming to induced pluripotent stem cells (iPS) would aid regenerative medicine, but are difficult to elucidate in iPS because of their heterogeneity, as relatively few cells undergo reprogramming (0.1-1%). To identify early acting regulators, we capitalized on non-dividing heterokaryons (mouse embryonic stem cells fused to human fibroblasts), in which reprogramming towards pluripotency is efficient and rapid, enabling the identification of transient regulators required at the onset. We used bi-species transcriptome-wide RNA-seq to quantify transcriptional changes in the human somatic nucleus during reprogramming towards pluripotency in heterokaryons. During heterokaryon reprogramming, the cytokine interleukin 6 (IL6), which is not detectable at significant levels in embryonic stem cells, was induced 50-fold. A 4-day culture with IL6 at the onset of iPS reprogramming replaced stably transduced oncogenic c-Myc such that transduction of only Oct4, Klf4 and Sox2 was required. IL6 also activated another Jak/Stat target, the serine/threonine kinase gene Pim1, which accounted for the IL6-mediated twofold increase in iPS frequency. In contrast, LIF, another induced GP130 ligand, failed to increase iPS frequency or activate c-Myc or Pim1, thereby revealing a differential role for the two Jak/Stat inducers in iPS generation. These findings demonstrate the power of heterokaryon bi-species global RNA-seq to identify early acting regulators of reprogramming, for example, extrinsic replacements for stably transduced transcription factors such as the potent oncogene c-Myc.

    View details for DOI 10.1038/ncb2835

    View details for Web of Science ID 000325200300015

    View details for PubMedID 23995732


    View details for DOI 10.1214/13-AOAS645

    View details for Web of Science ID 000328198700003

  • Personalized prediction of first-cycle in vitro fertilization success FERTILITY AND STERILITY Choi, B., Bosch, E., Lannon, B. M., Leveille, M., Wong, W. H., Leader, A., Pellicer, A., Penzias, A. S., Yao, M. W. 2013; 99 (7): 1905-1911


    To test whether the probability of having a live birth (LB) with the first IVF cycle (C1) can be predicted and personalized for patients in diverse environments. Retrospective validation of multicenter prediction model. Three university-affiliated outpatient IVF clinics located in different countries. Using primary models aggregated from >13,000 C1s, we applied the boosted tree method to train a preIVF-diversity model (PreIVF-D) with 1,061 C1s from 2008 to 2009, and validated predicted LB probabilities with an independent dataset comprising 1,058 C1s from 2008 to 2009. None. Predictive power, reclassification, receiver operator characteristic analysis, calibration, dynamic range. Overall, with PreIVF-D, 86% of cases had significantly different LB probabilities compared with age control, and more than one-half had higher LB probabilities. Specifically, 42% of patients could have been identified by PreIVF-D to have a personalized predicted success rate >45%, whereas an age-control model could not differentiate them from others. Furthermore, PreIVF-D showed improved predictive power, with 36% improved log-likelihood (or 9.0-fold by log-scale; >1,000-fold linear scale), and prediction errors for subgroups ranged from 0.9% to 3.7%. Validated prediction of personalized LB probabilities from diverse multiple sources identifies excellent prognoses in more than one-half of patients.

    View details for DOI 10.1016/j.fertnstert.2013.02.016

    View details for Web of Science ID 000320505900028

    View details for PubMedID 23522806

  • Detecting DNA modifications from SMRT sequencing data by modeling sequence context dependence of polymerase kinetic PLOS COMPUTATIONAL BIOLOGY Feng, Z., Fang, G., Korlach, J., Clark, T., Luong, K., Zhang, X., Wong, W., Schadt, E. 2013; 9 (3)


    DNA modifications such as methylation and DNA damage can play critical regulatory roles in biological systems. Single molecule, real time (SMRT) sequencing technology generates DNA sequences as well as DNA polymerase kinetic information that can be used for the direct detection of DNA modifications. We demonstrate that local sequence context has a strong impact on DNA polymerase kinetics in the neighborhood of the incorporation site during the DNA synthesis reaction, allowing for the possibility of estimating the expected kinetic rate of the enzyme at the incorporation site using kinetic rate information collected from existing SMRT sequencing data (historical data) covering the same local sequence contexts of interest. We develop an Empirical Bayesian hierarchical model for incorporating historical data. Our results show that the model could greatly increase DNA modification detection accuracy and reduce the requirement for control data coverage. For some DNA modifications that have a strong signal, a control sample is not even needed when historical data are used as an alternative to control. Thus, sequencing costs can be greatly reduced by using the model. We implemented the model in an R package named seqPatch, which is available at

    View details for DOI 10.1371/journal.pcbi.1002935

    View details for PubMedID 23516341

  • RNA sequencing reveals a diverse and dynamic repertoire of the Xenopus tropicalis transcriptome over development GENOME RESEARCH Tan, M. H., Au, K. F., Yablonovitch, A. L., Wills, A. E., Chuang, J., Baker, J. C., Wong, W. H., Li, J. B. 2013; 23 (1): 201-216


    The Xenopus embryo has provided key insights into fate specification, the cell cycle, and other fundamental developmental and cellular processes, yet a comprehensive understanding of its transcriptome is lacking. Here, we used paired end RNA sequencing (RNA-seq) to explore the transcriptome of Xenopus tropicalis in 23 distinct developmental stages. We determined expression levels of all genes annotated in RefSeq and Ensembl and showed for the first time on a genome-wide scale that, despite a general state of transcriptional silence in the earliest stages of development, approximately 150 genes are transcribed prior to the midblastula transition. In addition, our splicing analysis uncovered more than 10,000 novel splice junctions at each stage and revealed that many known genes have additional unannotated isoforms. Furthermore, we used Cufflinks to reconstruct transcripts from our RNA-seq data and found that ∼13.5% of the final contigs are derived from novel transcribed regions, both within introns and in intergenic regions. We then developed a filtering pipeline to separate protein-coding transcripts from noncoding RNAs and identified a confident set of 6686 noncoding transcripts in 3859 genomic loci. Since the current reference genome, XenTro3, consists of hundreds of scaffolds instead of full chromosomes, we also performed de novo reconstruction of the transcriptome using Trinity and uncovered hundreds of transcripts that are missing from the genome. Collectively, our data will not only aid in completing the assembly of the Xenopus tropicalis genome but will also serve as a valuable resource for gene discovery and for unraveling the fundamental mechanisms of vertebrate embryogenesis.

    View details for DOI 10.1101/gr.141424.112

    View details for Web of Science ID 000312963400019

    View details for PubMedID 22960373

  • Modeling kinetic rate variation in third generation DNA sequencing data to detect putative modifications to DNA bases GENOME RESEARCH Schadt, E. E., Banerjee, O., Fang, G., Feng, Z., Wong, W. H., Zhang, X., Kislyuk, A., Clark, T. A., Khai Luong, K., Keren-Paz, A., Chess, A., Kumar, V., Chen-Plotkin, A., Sondheimer, N., Korlach, J., Kasarskis, A. 2013; 23 (1): 129-141


    Current generation DNA sequencing instruments are moving closer to seamlessly sequencing genomes of entire populations as a routine part of scientific investigation. However, while significant inroads have been made identifying small nucleotide variation and structural variations in DNA that impact phenotypes of interest, progress has not been as dramatic regarding epigenetic changes and base-level damage to DNA, largely due to technological limitations in assaying all known and unknown types of modifications at genome scale. Recently, single-molecule real time (SMRT) sequencing has been reported to identify kinetic variation (KV) events that have been demonstrated to reflect epigenetic changes of every known type, providing a path forward for detecting base modifications as a routine part of sequencing. However, to date no statistical framework has been proposed to enhance the power to detect these events while also controlling for false-positive events. By modeling enzyme kinetics in the neighborhood of an arbitrary location in a genomic region of interest as a conditional random field, we provide a statistical framework for incorporating kinetic information at a test position of interest as well as at neighboring sites that help enhance the power to detect KV events. The performance of this and related models is explored, with the best-performing model applied to plasmid DNA isolated from Escherichia coli and mitochondrial DNA isolated from human brain tissue. We highlight widespread kinetic variation events, some of which strongly associate with known modification events, while others represent putative chemically modified sites of unknown types.

    View details for DOI 10.1101/gr.136739.111

    View details for Web of Science ID 000312963400012

    View details for PubMedID 23093720

  • An Oct4-Sall4-Nanog network controls developmental progression in the pre-implantation mouse embryo MOLECULAR SYSTEMS BIOLOGY Tan, M. H., Au, K. F., Leong, D. E., Foygel, K., Wong, W. H., Yao, M. W. 2013; 9


    Landmark events occur in a coordinated manner during pre-implantation development of the mammalian embryo, yet the regulatory network that orchestrates these events remains largely unknown. Here, we present the first systematic investigation of the network in pre-implantation mouse embryos using morpholino-mediated gene knockdowns of key embryonic stem cell (ESC) factors followed by detailed transcriptome analysis of pooled embryos, single embryos, and individual blastomeres. We delineated the regulons of Oct4, Sall4, and Nanog and identified a set of metabolism- and transport-related genes that were controlled by these transcription factors in embryos but not in ESCs. Strikingly, the knockdown embryos arrested at a range of developmental stages. We provided evidence that the DNA methyltransferase Dnmt3b has a role in determining the extent to which a knockdown embryo can develop. We further showed that the feed-forward loop comprising Dnmt3b, the pluripotency factors, and the miR-290-295 cluster exemplifies a network motif that buffers embryos against gene expression noise. Our findings indicate that Oct4, Sall4, and Nanog form a robust and integrated network to govern mammalian pre-implantation development.

    View details for DOI 10.1038/msb.2012.65

    View details for Web of Science ID 000314415800002

    View details for PubMedID 23295861

  • Neural-specific Sox2 input and differential Gli-binding affinity provide context and positional information in Shh-directed neural patterning GENES & DEVELOPMENT Peterson, K. A., Nishi, Y., Ma, W., Vedenko, A., Shokri, L., Zhang, X., McFarlane, M., Baizabal, J., Junker, J. P., van Oudenaarden, A., Mikkelsen, T., Bernstein, B. E., Bailey, T. L., Bulyk, M. L., Wong, W. H., McMahon, A. P. 2012; 26 (24): 2802-2816


    In the vertebrate neural tube, regional Sonic hedgehog (Shh) signaling invokes a time- and concentration-dependent induction of six different cell populations mediated through Gli transcriptional regulators. Elsewhere in the embryo, Shh/Gli responses invoke different tissue-appropriate regulatory programs. A genome-scale analysis of DNA binding by Gli1 and Sox2, a pan-neural determinant, identified a set of shared regulatory regions associated with key factors central to cell fate determination and neural tube patterning. Functional analysis in transgenic mice validates core enhancers for each of these factors and demonstrates the dual requirement for Gli1 and Sox2 inputs for neural enhancer activity. Furthermore, through an unbiased determination of Gli-binding site preferences and analysis of binding site variants in the developing mammalian CNS, we demonstrate that differential Gli-binding affinity underlies threshold-level activator responses to Shh input. In summary, our results highlight Sox2 input as a context-specific determinant of the neural-specific Shh response and differential Gli-binding site affinity as an important cis-regulatory property critical for interpreting Shh morphogen action in the mammalian neural tube.

    View details for DOI 10.1101/gad.207142.112

    View details for Web of Science ID 000312775700012

    View details for PubMedID 23249739

  • Activation of Innate Immunity Is Required for Efficient Nuclear Reprogramming CELL Lee, J., Sayed, N., Hunter, A., Au, K. F., Wong, W. H., Mocarski, E. S., Pera, R. R., Yakubov, E., Cooke, J. P. 2012; 151 (3): 547-558


    Retroviral overexpression of reprogramming factors (Oct4, Sox2, Klf4, c-Myc) generates induced pluripotent stem cells (iPSCs). However, the integration of foreign DNA could induce genomic dysregulation. Cell-permeant proteins (CPPs) could overcome this limitation. To date, this approach has proved exceedingly inefficient. We discovered a striking difference in the pattern of gene expression induced by viral versus CPP-based delivery of the reprogramming factors, suggesting that a signaling pathway required for efficient nuclear reprogramming was activated by the retroviral, but not CPP approach. In gain- and loss-of-function studies, we find that the toll-like receptor 3 (TLR3) pathway enables efficient induction of pluripotency by viral or mmRNA approaches. Stimulation of TLR3 causes rapid and global changes in the expression of epigenetic modifiers to enhance chromatin remodeling and nuclear reprogramming. Activation of inflammatory pathways is required for efficient nuclear reprogramming in the induction of pluripotency.

    View details for DOI 10.1016/j.cell.2012.09.034

    View details for Web of Science ID 000310529300012

    View details for PubMedID 23101625

  • Improving PacBio Long Read Accuracy by Short Read Alignment PLOS ONE Au, K. F., Underwood, J. G., Lee, L., Wong, W. H. 2012; 7 (10)


    The recent development of third generation sequencing (TGS) generates much longer reads than second generation sequencing (SGS) and thus provides a chance to solve problems that are difficult to study through SGS alone. However, higher raw read error rates are an intrinsic drawback in most TGS technologies. Here we present a computational method, LSC, to perform error correction of TGS long reads (LR) by SGS short reads (SR). Aiming to reduce the error rate in homopolymer runs in the main TGS platform, the PacBio® RS, LSC applies a homopolymer compression (HC) transformation strategy to increase the sensitivity of SR-LR alignment without sacrificing alignment accuracy. We applied LSC to 100,000 PacBio long reads from human brain cerebellum RNA-seq data and 64 million single-end 75 bp reads from human brain RNA-seq data. The results show LSC can correct PacBio long reads to reduce the error rate by more than threefold. The improved accuracy greatly benefits many downstream analyses, such as directional gene isoform detection in RNA-seq study. Compared with another hybrid correction tool, LSC can achieve over double the sensitivity and similar specificity.

    View details for DOI 10.1371/journal.pone.0046679

    View details for Web of Science ID 000309580800039

    View details for PubMedID 23056399

  • Fast and accurate read alignment for resequencing BIOINFORMATICS Mu, J. C., Jiang, H., Kiani, A., Mohiyuddin, M., Asadi, N. B., Wong, W. H. 2012; 28 (18): 2366-2373


    Next-generation sequence analysis has become an important task both in laboratory and clinical settings. A key stage in the majority of sequence analysis workflows, such as resequencing, is the alignment of genomic reads to a reference genome. The accurate alignment of reads with large indels is a computationally challenging task for researchers. We introduce SeqAlto as a new algorithm for read alignment. For reads longer than or equal to 100 bp, SeqAlto is up to 10× faster than existing algorithms, while retaining high accuracy and the ability to align reads with large (up to 50 bp) indels. This improvement in efficiency is particularly important in the analysis of future sequencing data where the number of reads approaches many billions. Furthermore, SeqAlto uses less than 8 GB of memory to align against the human genome. SeqAlto is benchmarked against several existing tools with both real and simulated data. Linux and Mac OS X binaries free for academic use are available at

    View details for DOI 10.1093/bioinformatics/bts450

    View details for Web of Science ID 000308532300059

    View details for PubMedID 22811546

  • Six2 and Wnt Regulate Self-Renewal and Commitment of Nephron Progenitors through Shared Gene Regulatory Networks DEVELOPMENTAL CELL Park, J., Ma, W., O'Brien, L. L., Chung, E., Guo, J., Cheng, J., Valerius, M. T., McMahon, J. A., Wong, W. H., McMahon, A. P. 2012; 23 (3): 637-651


    A balance between Six2-dependent self-renewal and canonical Wnt signaling-directed commitment regulates mammalian nephrogenesis. Intersectional studies using chromatin immunoprecipitation and transcriptional profiling identified direct target genes shared by each pathway within nephron progenitors. Wnt4 and Fgf8 are essential for progenitor commitment; cis-regulatory modules flanking each gene are cobound by Six2 and β-catenin and are dependent on conserved Lef/Tcf binding sites for activity. In vitro and in vivo analyses suggest that Six2 and Lef/Tcf factors form a regulatory complex that promotes progenitor maintenance while entry of β-catenin into this complex promotes nephrogenesis. Alternative transcriptional responses associated with Six2 and β-catenin cobinding events occur through non-Lef/Tcf DNA binding mechanisms, highlighting the regulatory complexity downstream of Wnt signaling in the developing mammalian kidney.

    View details for DOI 10.1016/j.devcel.2012.07.008

    View details for Web of Science ID 000308776400019

    View details for PubMedID 22902740

  • Predicting personalized multiple birth risks after in vitro fertilization-double embryo transfer FERTILITY AND STERILITY Lannon, B. M., Choi, B., Hacker, M. R., Dodge, L. E., Malizia, B. A., Barrett, C. B., Wong, W. H., Yao, M. W., Penzias, A. S. 2012; 98 (1)


    To report and evaluate the performance and utility of an approach to predicting IVF-double embryo transfer (DET) multiple birth risks that is evidence-based, clinic-specific, and considers each patient's clinical profile. Retrospective prediction modeling. An outpatient university-affiliated IVF clinic. We used boosted tree methods to analyze 2,413 independent IVF-DET treatment cycles that resulted in live births. The IVF cycles were retrieved from a database that comprised more than 33,000 IVF cycles. None. The performance of this prediction model, MBP-BIVF, was validated by an independent data set, to evaluate predictive power, discrimination, dynamic range, and reclassification. Multiple birth probabilities ranging from 11.8% to 54.8% were predicted by the model and were significantly different from control predictions in more than half of the patients. The prediction model showed an improvement of 146% in predictive power and 16.0% in discrimination over control. The population standard error was 1.8%. We showed that IVF patients have inherently different risks of multiple birth, even when DET is specified, and this risk can be predicted before ET. The use of clinic-specific prediction models provides an evidence-based and personalized method to counsel patients.

    View details for DOI 10.1016/j.fertnstert.2012.04.011

    View details for Web of Science ID 000305950200020

    View details for PubMedID 22673597

  • A Sparse Transmission Disequilibrium Test for Haplotypes Based on Bradley-Terry Graphs HUMAN HEREDITY Ma, L., Wong, W. H., Owen, A. B. 2012; 73 (1): 52-61


    Linkage and association analysis based on haplotype transmission disequilibrium can be more informative than single marker analysis. Several works have been proposed in recent years to extend the transmission disequilibrium test (TDT) to haplotypes. Among them, a powerful approach called the evolutionary tree TDT (ET-TDT) incorporates information about the evolutionary relationship among haplotypes using the cladogram of the locus. In this work we extend this approach by taking into consideration the sparsity of causal mutations in the evolutionary history. We first introduce the notion of a Bradley-Terry (BT) graph representation of a haplotype locus. The most important property of the BT graph is that sparsity of the edge set of the graph corresponds to small number of causal mutations in the evolution of the haplotypes. We then propose a method to test the null hypothesis of no linkage and association against sparse alternatives under which a small number of edges on the BT graph have non-nil effects. We compare the performance of our approach to that of the ET-TDT through a power study, and show that incorporating sparsity of causal mutations can significantly improve the power of a haplotype-based TDT.

    View details for DOI 10.1159/000335937

    View details for Web of Science ID 000302111100008

    View details for PubMedID 22398955

  • Coupling Optional Polya Trees and the Two Sample Problem JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Ma, L., Wong, W. H. 2011; 106 (496): 1553-1565
  • A New FACS Approach Isolates hESC Derived Endoderm Using Transcription Factors PLOS ONE Pan, Y., Ouyang, Z., Wong, W. H., Baker, J. C. 2011; 6 (3)


    We show that high quality microarray gene expression profiles can be obtained following FACS sorting of cells using combinations of transcription factors. We use this transcription factor FACS (tfFACS) methodology to perform a genomic analysis of hESC-derived endodermal lineages marked by combinations of SOX17, GATA4, and CXCR4, and find that triple positive cells have a much stronger definitive endoderm signature than other combinations of these markers. Additionally, SOX17(+) GATA4(+) cells can be obtained at a much earlier stage of differentiation, prior to expression of CXCR4(+) cells, providing an important new tool to isolate this earlier definitive endoderm subtype. Overall, tfFACS represents an advancement in FACS technology which broadly crosses multiple disciplines, most notably in regenerative medicine to redefine cellular populations.

    View details for DOI 10.1371/journal.pone.0017536

    View details for Web of Science ID 000288170900026

    View details for PubMedID 21408072

  • Human transcriptome array for high-throughput clinical studies PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Xu, W., Seok, J., Mindrinos, M. N., Schweitzer, A. C., Jiang, H., Wilhelmy, J., Clark, T. A., Kapur, K., Xing, Y., Faham, M., Storey, J. D., Moldawer, L. L., Maier, R. V., Tompkins, R. G., Wong, W. H., Davis, R. W., Xiao, W. 2011; 108 (9): 3707-3712


    A 6.9 million-feature oligonucleotide array of the human transcriptome [Glue Grant human transcriptome (GG-H array)] has been developed for high-throughput and cost-effective analyses in clinical studies. This array allows comprehensive examination of gene expression and genome-wide identification of alternative splicing as well as detection of coding SNPs and noncoding transcripts. The performance of the array was examined and compared with mRNA sequencing (RNA-Seq) results over multiple independent replicates of liver and muscle samples. Compared with RNA-Seq of 46 million uniquely mappable reads per replicate, the GG-H array is highly reproducible in estimating gene and exon abundance. Although both platforms detect similar expression changes at the gene level, the GG-H array is more sensitive at the exon level. Deeper sequencing is required to adequately cover low-abundance transcripts. The array has been implemented in a multicenter clinical program and has generated high-quality, reproducible data. Considering the clinical trial requirements of cost, sample availability, and throughput, the GG-H array has a wide range of applications. An emerging approach for large-scale clinical genomic studies is to first use RNA-Seq to the sufficient depth for the discovery of transcriptome elements relevant to the disease process followed by high-throughput and reliable screening of these elements on thousands of patient samples using custom-designed arrays.

    View details for DOI 10.1073/pnas.1019753108

    View details for Web of Science ID 000287844400051

    View details for PubMedID 21317363

  • Statistical Modeling of RNA-Seq Data STATISTICAL SCIENCE Salzman, J., Jiang, H., Wong, W. H. 2011; 26 (1): 62-83

    View details for DOI 10.1214/10-STS343

    View details for Web of Science ID 000292424900013

  • Completely phased genome sequencing through chromosome sorting PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Yang, H., Chen, X., Wong, W. H. 2011; 108 (1): 12-17


    The two haploid genome sequences that a person inherits from the two parents represent the most fundamentally useful type of genetic information for the study of heritable diseases and the development of personalized medicine. Because of the difficulty in obtaining long-range phase information, current sequencing methods are unable to provide this information. Here, we introduce and show feasibility of a scalable approach capable of generating genomic sequences completely phased across the entire chromosome.

    View details for DOI 10.1073/pnas.1016725108

    View details for Web of Science ID 000285915000007

    View details for PubMedID 21169219



    Chromatin immunoprecipitation coupled with ultra-high-throughput parallel DNA sequencing (ChIP-seq) is an effective technology for the investigation of genome-wide protein-DNA interactions. Examples of applications include the studies of RNA polymerases transcription, transcriptional regulation, and histone modifications. The technology provides accurate and high-resolution mapping of the protein-DNA binding loci that are important in the understanding of many processes in development and diseases. Since the introduction of ChIP-seq experiments in 2007, many statistical and computational methods have been developed to support the analysis of the massive datasets from these experiments. However, because of the complex, multistaged analysis workflow, it is still difficult for an experimental investigator to conduct the analysis of his or her own ChIP-seq data. In this chapter, we review the basic design of ChIP-seq experiments and provide an in-depth tutorial on how to prepare, to preprocess, and to analyze ChIP-seq datasets. The tutorial is based on a revised version of our software package CisGenome, which was designed to encompass most standard tasks in ChIP-seq data analysis. Relevant statistical and computational issues will be highlighted, discussed, and illustrated by means of real data examples.

    View details for DOI 10.1016/B978-0-12-385075-1.00003-2

    View details for Web of Science ID 000291321200003

    View details for PubMedID 21601082

  • Integration of Brassinosteroid Signal Transduction with the Transcription Network for Plant Growth Regulation in Arabidopsis DEVELOPMENTAL CELL Sun, Y., Fan, X., Cao, D., Tang, W., He, K., Zhu, J., He, J., Bai, M., Zhu, S., Oh, E., Patil, S., Kim, T., Ji, H., Wong, W. H., Rhee, S. Y., Wang, Z. 2010; 19 (5): 765-777


    Brassinosteroids (BRs) regulate a wide range of developmental and physiological processes in plants through a receptor-kinase signaling pathway that controls the BZR transcription factors. Here, we use transcript profiling and chromatin-immunoprecipitation microarray (ChIP-chip) experiments to identify 953 BR-regulated BZR1 target (BRBT) genes. Functional studies of selected BRBTs further demonstrate roles in BR promotion of cell elongation. The BRBT genes reveal numerous molecular links between the BR-signaling pathway and downstream components involved in developmental and physiological processes. Furthermore, the results reveal extensive crosstalk between BR and other hormonal and light-signaling pathways at multiple levels. For example, BZR1 not only controls the expression of many signaling components of other hormonal and light pathways but also coregulates common target genes with light-signaling transcription factors. Our results provide a genomic map of steroid hormone actions in plants that reveals a regulatory network that integrates hormonal and light-signaling pathways for plant growth regulation.

    View details for DOI 10.1016/j.devcel.2010.10.010

    View details for Web of Science ID 000284516300016

    View details for PubMedID 21074725

  • From EM to Data Augmentation: The Emergence of MCMC Bayesian Computation in the 1980s STATISTICAL SCIENCE Tanner, M. A., Wong, W. H. 2010; 25 (4): 506-516

    View details for DOI 10.1214/10-STS341

    View details for Web of Science ID 000288497200006

  • Deep phenotyping to predict live birth outcomes in in vitro fertilization PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Banerjee, P., Choi, B., Shahine, L. K., Jun, S. H., O'leary, K., Lathi, R. B., Westphal, L. M., Wong, W. H., Yao, M. W. 2010; 107 (31): 13570-13575


    Nearly 75% of in vitro fertilization (IVF) treatments do not result in live births and patients are largely guided by a generalized age-based prognostic stratification. We sought to provide personalized and validated prognosis by using available clinical and embryo data from prior, failed treatments to predict live birth probabilities in the subsequent treatment. We generated a boosted tree model, IVFBT, by training it with IVF outcomes data from 1,676 first cycles (C1s) from 2003-2006, followed by external validation with 634 cycles from 2007-2008. We tested whether this model could predict the probability of having a live birth in the subsequent treatment (C2). By using nondeterministic methods to identify prognostic factors and their relative nonredundant contribution, we generated a prediction model, IVFBT, that was superior to the age-based control by providing over 1,000-fold improvement to fit new data (p<0.05), and increased discrimination by receiver operating characteristic analysis (area-under-the-curve, 0.80 vs. 0.68 for C1, 0.68 vs. 0.58 for C2). IVFBT provided predictions that were more accurate for approximately 83% of C1 and approximately 60% of C2 cycles that were out of the range predicted by age. Over half of those patients were reclassified to have higher live birth probabilities. We showed that data from a prior cycle could be used effectively to provide personalized and validated live birth probabilities in a subsequent cycle. Our approach may be replicated and further validated in other IVF clinics.

    View details for DOI 10.1073/pnas.1002296107

    View details for Web of Science ID 000280605900006

    View details for PubMedID 20643955

  • Detection of splice junctions from paired-end RNA-seq data by SpliceMap NUCLEIC ACIDS RESEARCH Au, K. F., Jiang, H., Lin, L., Xing, Y., Wong, W. H. 2010; 38 (14): 4570-4578


    Alternative splicing is a prevalent post-transcriptional process, which is not only important to normal cellular function but is also involved in human diseases. The newly developed second generation sequencing technique provides high-throughput data (RNA-seq data) to study alternative splicing events in different types of cells. Here, we present a computational method, SpliceMap, to detect splice junctions from RNA-seq data. This method does not depend on any existing annotation of gene structures and is capable of finding novel splice junctions with high sensitivity and specificity. It can handle long reads (50-100 nt) and can exploit paired-read information to improve mapping accuracy. Several parameters are included in the output to indicate the reliability of the predicted junction and help filter out false predictions. We applied SpliceMap to analyze 23 million paired 50-nt reads from human brain tissue. The results show that, at this depth of sequencing, RNA-seq can support reliable detection of splice junctions except for those that are present at very low levels. Compared to current methods, SpliceMap can achieve 12% higher sensitivity without sacrificing specificity.

    View details for DOI 10.1093/nar/gkq211

    View details for Web of Science ID 000280922400010

    View details for PubMedID 20371516

  • CisGenome Browser: a flexible tool for genomic data visualization BIOINFORMATICS Jiang, H., Wang, F., Dyer, N. P., Wong, W. H. 2010; 26 (14): 1781-1782


    We present an open source, platform independent tool, called CisGenome Browser, which can work together with any other data analysis program to serve as a flexible component for genomic data visualization. It can also work by itself as a standalone genome browser. By working as a light-weight web server, CisGenome Browser is a convenient tool for data sharing between labs. It has features that are specifically designed for ultra high-throughput sequencing data visualization. ~jiangh/browser/

    View details for DOI 10.1093/bioinformatics/btq286

    View details for Web of Science ID 000279474400017

    View details for PubMedID 20513664

  • An "Almost Exhaustive" Search-Based Sequential Permutation Method for Detecting Epistasis in Disease Association Studies GENETIC EPIDEMIOLOGY Ma, L., Assimes, T. L., Asadi, N. B., Iribarren, C., Quertermous, T., Wong, W. H. 2010; 34 (5): 434-443


    Due to the complex nature of common diseases, their etiology is likely to involve "uncommon but strong" (UBS) interactive effects--i.e. allelic combinations that are each present in only a small fraction of the patients but associated with high disease risk. However, the identification of such effects using standard methods for testing association can be difficult. In this work, we introduce a method for testing interactions that is particularly powerful in detecting UBS effects. The method consists of two modules--one is a pattern counting algorithm designed for efficiently evaluating the risk significance of each marker combination, and the other is a sequential permutation scheme for multiple testing correction. We demonstrate the work of our method using a candidate gene data set for cardiovascular and coronary diseases with an injected UBS three-locus interaction. In addition, we investigate the power and false rejection properties of our method using data sets simulated from a joint dominance three-locus model that gives rise to UBS interactive effects. The results show that our method can be much more powerful than standard approaches such as trend test and multifactor dimensionality reduction for detecting UBS interactions.

    View details for DOI 10.1002/gepi.20496

    View details for Web of Science ID 000280349600007

    View details for PubMedID 20583286

  • Analysis of factorial time-course microarrays with application to a clinical study of burn injury PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Zhou, B., Xu, W., Herndon, D., Tompkins, R., Davis, R., Xiao, W., Wong, W. H. 2010; 107 (22): 9923-9928


    Time-course microarray experiments are capable of capturing dynamic gene expression profiles. It is important to study how these dynamic profiles depend on the multiple factors that characterize the experimental condition under which the time course is observed. Analytic methods are needed to simultaneously handle the time course and factorial structure in the data. We developed a method to evaluate factor effects by pooling information across the time course while accounting for multiple testing and nonnormality of the microarray data. The method effectively extracts gene-specific response features and models their dependency on the experimental factors. Both longitudinal and cross-sectional time-course data can be handled by our approach. The method was used to analyze the impact of age on the temporal gene response to burn injury in a large-scale clinical study. Our analysis reveals that 21% of the genes responsive to burn are age-specific, among which expressions of mitochondria and immunoglobulin genes are differentially perturbed in pediatric and adult patients by burn injury. These new findings in the body's response to burn injury between children and adults support further investigations of therapeutic options targeting specific age groups. The methodology proposed here has been implemented in the R package "TANOVA" and submitted to the Comprehensive R Archive Network; it is also available for download.

    View details for DOI 10.1073/pnas.1002757107

    View details for Web of Science ID 000278246000005


    View details for DOI 10.1214/09-AOS755

    View details for Web of Science ID 000277471000006

  • Hedgehog pathway-regulated gene networks in cerebellum development and tumorigenesis PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Lee, E. Y., Ji, H., Ouyang, Z., Zhou, B., Ma, W., Vokes, S. A., McMahon, A. P., Wong, W. H., Scott, M. P. 2010; 107 (21): 9736-9741


    Many genes initially identified for their roles in cell fate determination or signaling during development can have a significant impact on tumorigenesis. In the developing cerebellum, Sonic hedgehog (Shh) stimulates the proliferation of granule neuron precursor cells (GNPs) by activating the Gli transcription factors. Inappropriate activation of Shh target genes results in unrestrained cell division and eventually medulloblastoma, the most common pediatric brain malignancy. We find dramatic differences in the gene networks that are directly driven by the Gli1 transcription factor in GNPs and medulloblastoma. Gli1 binding location analysis revealed hundreds of genomic loci bound by Gli1 in normal and cancer cells. Only one third of the genes bound by Gli1 in GNPs were also bound in tumor cells. Correlation with gene expression levels indicated that 116 genes were preferentially transcribed in tumors, whereas 132 genes were target genes in both GNPs and medulloblastoma. Quantitative PCR and in situ hybridization for some putative target genes support their direct regulation by Gli. The results indicate that transformation of normal GNPs into deadly tumor cells is accompanied by a distinct set of Gli-regulated genes and may provide candidates for targeted therapies.

    View details for DOI 10.1073/pnas.1004602107

    View details for Web of Science ID 000278054700048

    View details for PubMedID 20460306

  • Modeling Co-Expression across Species for Complex Traits: Insights to the Difference of Human and Mouse Embryonic Stem Cells PLOS COMPUTATIONAL BIOLOGY Cai, J., Xie, D., Fan, Z., Chipperfield, H., Marden, J., Wong, W. H., Zhong, S. 2010; 6 (3)


    Complex interactions between genes or proteins contribute substantially to phenotypic evolution. We present a probabilistic model and a maximum likelihood approach for cross-species clustering analysis and for identification of conserved as well as species-specific co-expression modules. This model enables a "soft" cross-species clustering (SCSC) approach by encouraging but not enforcing orthologous genes to be grouped into the same cluster. SCSC is therefore robust to obscure orthologous relationships and can reflect different functional roles of orthologous genes in different species. We generated a time-course gene expression dataset for differentiating mouse embryonic stem (ES) cells, and compiled a dataset of published gene expression data on differentiating human ES cells. Applying SCSC to analyze these datasets, we identified conserved and species-specific gene regulatory modules. Together with protein-DNA binding data, an SCSC cluster specifically induced in murine ES cells indicated that the KLF2/4/5 transcription factors, although critical to maintaining the pluripotent phenotype in mouse ES cells, were decoupled from the OCT4/SOX2/NANOG regulatory module in human ES cells. Two of the target genes of murine KLF2/4/5, LIN28 and NODAL, were rewired to be targets of OCT4/SOX2/NANOG in human ES cells. Moreover, by mapping SCSC clusters onto KEGG signaling pathways, we identified the signal transduction components that were induced in pluripotent ES cells in either a conserved or a species-specific manner. These results suggest that the pluripotent cell identity can be established and maintained through more than one gene regulatory network.

    View details for DOI 10.1371/journal.pcbi.1000707

    View details for Web of Science ID 000278125200015

    View details for PubMedID 20300647

  • Modeling non-uniformity in short-read rates in RNA-Seq data GENOME BIOLOGY Li, J., Jiang, H., Wong, W. H. 2010; 11 (5)


    After mapping, RNA-Seq data can be summarized by a sequence of read counts commonly modeled as Poisson variables with constant rates along each transcript, which actually fit data poorly. We suggest using variable rates for different positions, and propose two models to predict these rates based on local sequences. These models explain more than 50% of the variations and can lead to improved estimates of gene and isoform expressions for both Illumina and Applied Biosystems data.

    View details for DOI 10.1186/gb-2010-11-5-r50

    View details for Web of Science ID 000279631000015

    View details for PubMedID 20459815
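
    The contrast drawn in the abstract above, between a constant Poisson rate along a transcript and position-specific rates, can be illustrated with a small sketch. This is not the authors' implementation: the counts are made up, the per-position `weights` are hypothetical (in the paper such rates are predicted from local sequence, which is not attempted here), and the function names are introduced only for illustration. The sketch assumes count_j ~ Poisson(theta * weight_j), fits the expression level theta by maximum likelihood, and compares log-likelihoods.

```python
import math


def poisson_loglik(counts, rates):
    """Log-likelihood of independent Poisson counts with the given rates."""
    return sum(
        n * math.log(r) - r - math.lgamma(n + 1)
        for n, r in zip(counts, rates)
    )


def fit_expression(counts, weights):
    """MLE of theta when count_j ~ Poisson(theta * weight_j)."""
    return sum(counts) / sum(weights)


# Made-up read counts at six positions of one transcript.
counts = [2, 9, 1, 8, 2, 10]

# Constant-rate model: every position has the same weight.
uniform = [1.0] * len(counts)

# Hypothetical position-specific weights (assumed, not estimated here).
weights = [0.4, 1.6, 0.3, 1.5, 0.4, 1.8]

theta_uniform = fit_expression(counts, uniform)
theta_variable = fit_expression(counts, weights)

ll_uniform = poisson_loglik(counts, [theta_uniform * w for w in uniform])
ll_variable = poisson_loglik(counts, [theta_variable * w for w in weights])

print("constant-rate log-likelihood:", ll_uniform)
print("variable-rate log-likelihood:", ll_variable)
```

    With these illustrative numbers the variable-rate model attains the higher log-likelihood, because the assumed weights track the uneven counts; real data would require estimating the weights rather than assuming them.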

  • ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Ouyang, Z., Zhou, Q., Wong, W. H. 2009; 106 (51): 21521-21526


    Next-generation sequencing has greatly increased the scope and the resolution of transcriptional regulation study. RNA sequencing (RNA-Seq) and ChIP-Seq experiments are now generating comprehensive data on transcript abundance and on regulator-DNA interactions. We propose an approach for an integrated analysis of these data based on feature extraction of ChIP-Seq signals, principal component analysis, and regression-based component selection. Compared with traditional methods, our approach not only offers higher power in predicting gene expression from ChIP-Seq data but also provides a way to capture cooperation among regulators. In mouse embryonic stem cells (ESCs), we find that a remarkably high proportion of variation in gene expression (65%) can be explained by the binding signals of 12 transcription factors (TFs). Two groups of TFs are identified. Whereas the first group (E2f1, Myc, Mycn, and Zfx) act as activators in general, the second group (Oct4, Nanog, Sox2, Smad1, Stat3, Tcfcp2l1, and Esrrb) may serve as either activator or repressor depending on the target. The two groups of TFs cooperate tightly to activate genes that are differentially up-regulated in ESCs. In the absence of binding by the first group, the binding of the second group is associated with genes that are repressed in ESCs and derepressed upon early differentiation.

    View details for DOI 10.1073/pnas.0904863106

    View details for Web of Science ID 000272994200013

    View details for PubMedID 19995984

  • Identifiability of isoform deconvolution from junction arrays and RNA-Seq BIOINFORMATICS Hiller, D., Jiang, H., Xu, W., Wong, W. H. 2009; 25 (23): 3056-3059


    Splice junction microarrays and RNA-seq are two popular ways of quantifying splice variants within a cell. Unfortunately, isoform expressions cannot always be determined from the expressions of individual exons and splice junctions. While this issue has been noted before, the extent of the problem on various platforms has not yet been explored, nor have potential remedies been presented. We propose criteria that will guarantee identifiability of an isoform deconvolution model on exon and splice junction arrays and in RNA-Seq. We show that up to 97% of 2256 alternatively spliced human genes selected from the RefSeq database lead to identifiable gene models in RNA-seq, with similar results in mouse. However, in the Human Exon array only 26% of these genes lead to identifiable models, and even in the most comprehensive splice junction array only 69% lead to identifiable models. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btp544

    View details for Web of Science ID 000272080800002

    View details for PubMedID 19762346
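
    The identifiability question in the abstract above can be made concrete with a small sketch. It assumes a simple linear model that is not taken from the paper: each measured feature (exon or splice junction) has expected signal equal to the sum of the abundances of the isoforms containing it, so isoform abundances are identifiable exactly when the feature-by-isoform indicator matrix has full column rank. The toy gene, its four isoforms, and the function names are hypothetical.

```python
from fractions import Fraction


def matrix_rank(matrix):
    """Rank of a matrix (list of rows) via exact Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry.
        pivot = next(
            (r for r in range(rank, len(rows)) if rows[r][col] != 0), None
        )
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            if factor != 0:
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def is_identifiable(design):
    """True if the indicator matrix has full column rank."""
    return matrix_rank(design) == len(design[0])


# Toy gene with exons e1..e4; e2 and e3 are independently skipped.
# Columns: isoforms A={e1,e2,e3,e4}, B={e1,e2,e4}, C={e1,e3,e4}, D={e1,e4}.
exon_features = [
    [1, 1, 1, 1],  # e1
    [1, 1, 0, 0],  # e2
    [1, 0, 1, 0],  # e3
    [1, 1, 1, 1],  # e4
]
junction_features = [
    [1, 1, 0, 0],  # e1-e2
    [1, 0, 0, 0],  # e2-e3
    [0, 1, 0, 0],  # e2-e4
    [0, 0, 1, 0],  # e1-e3
    [1, 0, 1, 0],  # e3-e4
    [0, 0, 0, 1],  # e1-e4
]

exons_only = is_identifiable(exon_features)
exons_and_junctions = is_identifiable(exon_features + junction_features)
print("exons only:", exons_only)
print("exons + junctions:", exons_and_junctions)
```

    In this toy case exon-level features alone leave the four isoforms unidentifiable, while adding junction features restores full column rank, which is the kind of platform dependence the abstract describes.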

  • Dissecting Early Differentially Expressed Genes in a Mixture of Differentiating Embryonic Stem Cells PLOS COMPUTATIONAL BIOLOGY Hong, F., Fang, F., He, X., Cao, X., Chipperfield, H., Xie, D., Wong, W. H., Ng, H. H., Zhong, S. 2009; 5 (12)


    The differentiation of embryonic stem cells is initiated by a gradual loss of pluripotency-associated transcripts and induction of differentiation genes. Accordingly, the detection of differentially expressed genes at the early stages of differentiation could assist the identification of the causal genes that either promote or inhibit differentiation. The previous methods of identifying differentially expressed genes by comparing different cell types would inevitably include a large portion of genes that respond to, rather than regulate, the differentiation process. We demonstrate through the use of biological replicates and a novel statistical approach that the gene expression data obtained without prior separation of cell types are informative for detecting differentially expressed genes at the early stages of differentiation. Applying the proposed method to analyze the differentiation of murine embryonic stem cells, we identified and then experimentally verified Smarcad1 as a novel regulator of pluripotency and self-renewal. We formalized this statistical approach as a statistical test that is generally applicable to analyze other differentiation processes.

    View details for DOI 10.1371/journal.pcbi.1000607

    View details for Web of Science ID 000274229000025

    View details for PubMedID 20019792

  • FoxOs Cooperatively Regulate Diverse Pathways Governing Neural Stem Cell Homeostasis CELL STEM CELL Paik, J., Ding, Z., Narurkar, R., Ramkissoon, S., Muller, F., Kamoun, W. S., Chae, S., Zheng, H., Ying, H., Mahoney, J., Hiller, D., Jiang, S., Protopopov, A., Wong, W. H., Chin, L., Ligon, K. L., DePinho, R. A. 2009; 5 (5): 540-553


    The PI3K-AKT-FoxO pathway is integral to lifespan regulation in lower organisms and essential for the stability of long-lived cells in mammals. Here, we report the impact of combined FoxO1, 3, and 4 deficiencies on mammalian brain physiology with a particular emphasis on the study of the neural stem/progenitor cell (NSC) pool. We show that the FoxO family plays a prominent role in NSC proliferation and renewal. FoxO-deficient mice show initial increased brain size and proliferation of neural progenitor cells during early postnatal life, followed by precocious significant decline in the NSC pool and accompanying neurogenesis in adult brains. Mechanistically, integrated transcriptomic, promoter, and functional analyses of FoxO-deficient NSC cultures identified direct gene targets with known links to the regulation of human brain size and the control of cellular proliferation, differentiation, and oxidative defense. Thus, the FoxO family coordinately regulates diverse genes and pathways to govern key aspects of NSC homeostasis in the mammalian brain.

    View details for DOI 10.1016/j.stem.2009.09.013

    View details for Web of Science ID 000272019500015

    View details for PubMedID 19896444

  • Energy landscape of a spin-glass model: Exploration and characterization PHYSICAL REVIEW E Zhou, Q., Wong, W. H. 2009; 79 (5)


    The disconnectivity graph (DG) is widely used to represent energy landscapes. Although powerful numerical methods have been developed to construct DGs for continuous potential-energy surfaces, they have difficulties in applications to discrete Hamiltonians as the case of spin-glass models. When the configuration space is large, brute force enumeration of all configurations to build a DG is not practical. We propose an alternative approach to construct DGs based on recursive partition of Monte Carlo samples from microcanonical ensembles. To characterize energy landscapes, we define the local density of states (LDOS) on a DG, with which one can compute many thermodynamic properties over local energy basins for any temperature. Estimation of LDOS is developed with DG construction. We further propose the concepts of tree entropy and local escape probability, both of which are functions of local density of states, to capture the symmetry and the roughness of a Boltzmann distribution, respectively. Our approach is applied to a study of the Sherrington-Kirkpatrick spin-glass model with N varying between 20 and 100 spins. We observe that the energy landscape is extremely asymmetric and there exists a sharp increase in local escape probability preceding the transition from spin glass to paramagnetic phase.

    View details for DOI 10.1103/PhysRevE.79.051117

    View details for Web of Science ID 000266500700031

    View details for PubMedID 19518426

  • Modeling the spatio-temporal network that drives patterning in the vertebrate central nervous system BIOCHIMICA ET BIOPHYSICA ACTA-GENE REGULATORY MECHANISMS Nishi, Y., Ji, H., Wong, W. H., McMahon, A. P., Vokes, S. A. 2009; 1789 (4): 299-305


    In this review, we discuss the gene regulatory network underlying the patterning of the ventral neural tube during vertebrate embryogenesis. The neural tube is partitioned into domains of distinct cell fates by inductive signals along both anterior-posterior and dorsal-ventral axes. A defining feature of the dorsal-ventral patterning is the graded distribution of Sonic hedgehog (Shh), which acts as a morphogen to specify several classes of ventral neurons in a concentration-dependent fashion. These inductive signals translate into patterned expressions of transcription factors that define different neural progenitor subtypes. Progenitor boundaries are sharpened by repressive interactions between these transcription factors. The progenitor-expressed transcription factors induce another set of transcription factors that are thought to contribute to neural identities in post-mitotic neural precursors. Thus, the gene regulatory network of the ventral neural tube patterning is characterized by hierarchical expression [inductive signal-->progenitor specifying factors (mitotic)--> precursor specifying factors (post mitotic)--> differentiated neural markers] and cross-repression between progenitor-expressed regulatory factors. Although a number of transcriptional regulators have been identified at each hierarchical level, their precise regulatory relationships are not clear. Here we discuss approaches aimed at clarifying and extending our understanding of the formation and propagation of this network.

    View details for DOI 10.1016/j.bbagrm.2009.01.002

    View details for Web of Science ID 000265729800008

    View details for PubMedID 19445894

  • Cross-hybridization modeling on Affymetrix exon arrays BIOINFORMATICS Kapur, K., Jiang, H., Xing, Y., Wong, W. H. 2008; 24 (24): 2887-2893


    Microarray designs have become increasingly probe-rich, enabling targeting of specific features, such as individual exons or single nucleotide polymorphisms. These arrays have the potential to achieve quantitative high-throughput estimates of transcript abundances, but currently these estimates are affected by biases due to cross-hybridization, in which probes hybridize to off-target transcripts. To study cross-hybridization, we map Affymetrix exon array probes to a set of annotated mRNA transcripts, allowing a small number of mismatches or insertion/deletions between the two sequences. Based on a systematic study of the degree to which probes with a given match type to a transcript are affected by cross-hybridization, we developed a strategy to correct for cross-hybridization biases of gene-level expression estimates. Comparison with Solexa ultra high-throughput sequencing data demonstrates that correction for cross-hybridization leads to a significant improvement of gene expression estimates. We provide mappings between human and mouse exon array probes and off-target transcripts and provide software extending the GeneBASE program for generating gene-level expression estimates including the cross-hybridization correction.

    View details for DOI 10.1093/bioinformatics/btn571

    View details for Web of Science ID 000261456700012

    View details for PubMedID 18984598


    View details for DOI 10.1214/08-AOAS196

    View details for Web of Science ID 000262731100010

  • An integrated software system for analyzing ChIP-chip and ChIP-seq data NATURE BIOTECHNOLOGY Ji, H., Jiang, H., Ma, W., Johnson, D. S., Myers, R. M., Wong, W. H. 2008; 26 (11): 1293-1300


    We present CisGenome, a software system for analyzing genome-wide chromatin immunoprecipitation (ChIP) data. CisGenome is designed to meet all basic needs of ChIP data analyses, including visualization, data normalization, peak detection, false discovery rate computation, gene-peak association, and sequence and motif analysis. In addition to implementing previously published ChIP-microarray (ChIP-chip) analysis methods, the software contains statistical methods designed specifically for ChIP sequencing (ChIP-seq) data obtained by coupling ChIP with massively parallel sequencing. The modular design of CisGenome enables it to support interactive analyses through a graphic user interface as well as customized batch-mode computation for advanced data mining. A built-in browser allows visualization of array images, signals, gene structure, conservation, and DNA sequence and motif information. We demonstrate the use of these tools by a comparative analysis of ChIP-chip and ChIP-seq data for the transcription factor NRSF/REST, a study of ChIP-seq analysis with or without a negative control sample, and an analysis of a new motif in Nanog- and Sox2-binding regions.

    View details for DOI 10.1038/nbt.1505

    View details for Web of Science ID 000260832200025

    View details for PubMedID 18978777

  • SeqMap: mapping massive amount of oligonucleotides to the genome BIOINFORMATICS Jiang, H., Wong, W. H. 2008; 24 (20): 2395-2396


    SeqMap is a tool for mapping large numbers of short sequences to the genome. It is designed for finding all the places in a reference genome where each sequence may come from. This task is essential to the analysis of data from ultra high-throughput sequencing machines. With a carefully designed index-filtering algorithm and an efficient implementation, SeqMap can map tens of millions of short sequences to a genome of several billion nucleotides. Multiple substitutions and insertions/deletions of the nucleotide bases in the sequences can be tolerated and therefore detected. SeqMap supports FASTA input format and various output formats, and provides command line options for tuning almost every aspect of the mapping process. A typical mapping can be done in a few hours on a desktop PC. Parallel use of SeqMap on a cluster is also very straightforward.

    View details for DOI 10.1093/bioinformatics/btn429

    View details for Web of Science ID 000259973500020

    View details for PubMedID 18697769

  • A genome-scale analysis of the cis-regulatory circuitry underlying sonic hedgehog-mediated patterning of the mammalian limb GENES & DEVELOPMENT Vokes, S. A., Ji, H., Wong, W. H., McMahon, A. P. 2008; 22 (19): 2651-2663


    Sonic hedgehog (Shh) signals via Gli transcription factors to direct digit number and identity in the vertebrate limb. We characterized the Gli-dependent cis-regulatory network through a combination of whole-genome chromatin immunoprecipitation (ChIP)-on-chip and transcriptional profiling of the developing mouse limb. These analyses identified approximately 5000 high-quality Gli3-binding sites, including all known Gli-dependent enhancers. Discrete binding regions exhibit a higher-order clustering, highlighting the complexity of cis-regulatory interactions. Further, Gli3 binds inertly to previously identified neural-specific Gli enhancers, demonstrating the accessibility of their cis-regulatory elements. Intersection of DNA binding data with gene expression profiles predicted 205 putative limb target genes. A subset of putative cis-regulatory regions were analyzed in transgenic embryos, establishing Blimp1 as a direct Gli target and identifying Gli activator signaling in a direct, long-range regulation of the BMP antagonist Gremlin. In contrast, a long-range silencer cassette downstream from Hand2 likely mediates Gli3 repression in the anterior limb. These studies provide the first comprehensive characterization of the transcriptional output of a Shh-patterning process in the mammalian embryo and a framework for elaborating regulatory networks in the developing limb.

    View details for DOI 10.1101/gad.1693008

    View details for Web of Science ID 000259700900010

    View details for PubMedID 18832070

  • Isolation and transcriptional profiling of purified hepatic cells derived from human embryonic stem cells STEM CELLS Chiao, E., Elazar, M., Xing, Y., Xiong, A., Kmet, M., Millan, M. T., Glenn, J. S., Wong, W. H., Baker, J. 2008; 26 (8): 2032-2041


    The differentiation of human embryonic stem cells (hESCs) into functional hepatocytes provides a powerful in vitro model system for studying the molecular mechanisms governing liver development. Furthermore, a well-characterized renewable supply of hepatocytes differentiated from hESCs could be used for in vitro assays of drug metabolism and toxicology, screening of potential antiviral agents, and cell-based therapies to treat liver disease. In this study, we describe a protocol for the differentiation of hESCs toward hepatic cells with complex cellular morphologies. Putative hepatic cells were identified and isolated using a lentiviral vector, containing the alpha-fetoprotein promoter driving enhanced green fluorescent protein expression (AFP:eGFP). Whole-genome transcriptional profiling was performed on triplicate samples of AFP:eGFP+ and AFP:eGFP- cell populations using the recently released Affymetrix Exon Array ST 1.0 (Santa Clara, CA). Statistical analysis of the transcriptional profiles demonstrated that the AFP:eGFP+ population is highly enriched for genes characteristic of hepatic cells. These data provide a unique insight into the complex process of hepatocyte differentiation, point to signaling pathways that may be manipulated to more efficiently direct the differentiation of hESCs toward mature hepatocytes, and identify molecular markers that may be used for further dissection of hepatic cell differentiation from hESCs. Disclosure of potential conflicts of interest is found at the end of this article.

    View details for DOI 10.1634/stemcells.2007-0964

    View details for Web of Science ID 000258297500011

    View details for PubMedID 18535157

  • Defining Human Embryo Phenotypes by Cohort-Specific Prognostic Factors PLOS ONE Jun, S. H., Choi, B., Shahine, L., Westphal, L. M., Behr, B., Pera, R. A., Wong, W. H., Yao, M. W. 2008; 3 (7)


    Hundreds of thousands of human embryos are cultured yearly at in vitro fertilization (IVF) centers worldwide, yet the vast majority fail to develop in culture or following transfer to the uterus. However, human embryo phenotypes have not been formally defined, and current criteria for embryo transfer largely focus on characteristics of individual embryos. We hypothesized that embryo cohort-specific variables describing sibling embryos as a group may predict developmental competence as measured by IVF cycle outcomes and serve to define human embryo phenotypes. We retrieved data for all 1117 IVF cycles performed in 2005 at Stanford University Medical Center, and further analyzed clinical data from the 665 fresh IVF, non-donor cycles and their associated 4144 embryos. Thirty variables representing patient characteristics, clinical diagnoses, treatment protocol, and embryo parameters were analyzed in an unbiased manner by regression tree models, based on dichotomous pregnancy outcomes defined by positive serum beta-human chorionic gonadotropin (beta-hCG). IVF cycle outcomes were most accurately predicted at approximately 70% by four non-redundant, embryo cohort-specific variables that, remarkably, were more informative than any measures of individual, transferred embryos: total number of embryos, number of 8-cell embryos, rate (percentage) of cleavage arrest in the cohort, and day 3 follicle stimulating hormone (FSH) level. While three of these variables captured the effects of other significant variables, only the rate of cleavage arrest was independent of any known variables. Our findings support defining human embryo phenotypes by non-redundant, prognostic variables that are specific to sibling embryos in a cohort.

    View details for DOI 10.1371/journal.pone.0002562

    View details for Web of Science ID 000263288200029

    View details for PubMedID 18596962

  • Learning causal Bayesian network structures from experimental data JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Ellis, B., Wong, W. H. 2008; 103 (482): 778-789
  • Reconfigurable Computing for Learning Bayesian Networks FPGA 2008: SIXTEENTH ACM/SIGDA INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE GATE ARRAYS Asadi, N. B., Meng, T. H., Wong, W. H. 2008: 203-211
  • Optimal discovery of a stochastic genetic network 2008 AMERICAN CONTROL CONFERENCE, VOLS 1-12 Raffard, R. L., Lipan, O., Wong, W. H., Tomlin, C. J. 2008: 2773-2779
  • Evolutionary Monte Carlo methods for clustering JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Goswami, G., Liu, J. S., Wong, W. H. 2007; 16 (4): 855-876
  • A gene regulatory network in mouse embryonic stem cells PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Zhou, Q., Chipperfield, H., Melton, D. A., Wong, W. H. 2007; 104 (42): 16438-16443


    We analyze new and existing expression and transcription factor-binding data to characterize gene regulatory relations in mouse ES cells (ESC). In addition to confirming the key roles of Oct4, Sox2, and Nanog, our analysis identifies several genes, such as Esrrb, Stat3, Tcf7, Sall4, and LRH-1, as statistically significant coregulators. The regulatory interactions among 15 core regulators are used to construct a gene regulatory network in ESC. The network encapsulates extensive cross-regulations among the core regulators, highlights how they may control epigenetic processes, and reveals the surprising roles of nuclear receptors. Our analysis also provides information on the regulation of a large number of putative target genes of the network.

    View details for DOI 10.1073/pnas.0701014104

    View details for Web of Science ID 000250373400012

    View details for PubMedID 17940043

  • Assessing the conservation of mammalian gene expression using high-density exon arrays MOLECULAR BIOLOGY AND EVOLUTION Xing, Y., Ouyang, Z., Kapur, K., Scott, M. P., Wong, W. H. 2007; 24 (6): 1283-1285


    Microarray data from multiple species have been used to study evolutionary constraints on gene expression. Expression measurements from conventional microarray platforms such as the 3' expression arrays are strongly affected by platform-dependent probe effects that may introduce apparent but misleading discrepancies between species. In this manuscript, we assess the conservation of mammalian gene expression in adult tissues using data from a high-density exon array platform. The exon arrays have more than 6 million probes on a single array targeting all exons in a genome. We find that, unlike 3' array data, gene expression measurements from exon arrays reveal patterns of gene expression that are highly conserved between humans and mice in multiple tissues. Our analysis provides strong evidence for widespread stabilizing selection pressure on transcript abundance during mammalian evolution.

    View details for DOI 10.1093/molbev/msm061

    View details for Web of Science ID 000247207700001

    View details for PubMedID 17387099


    View details for DOI 10.1214/07-AOAS103

    View details for Web of Science ID 000261050400003

  • Genomic characterization of Gli-activator targets in sonic hedgehog-mediated neural patterning DEVELOPMENT Vokes, S. A., Ji, H., McCuine, S., Tenzen, T., Giles, S., Zhong, S., Longabaugh, W. J., Davidson, E. H., Wong, W. H., McMahon, A. P. 2007; 134 (10): 1977-1989


    Sonic hedgehog (Shh) acts as a morphogen to mediate the specification of distinct cell identities in the ventral neural tube through a Gli-mediated (Gli1-3) transcriptional network. Identifying Gli targets in a systematic fashion is central to the understanding of the action of Shh. We examined this issue in differentiating neural progenitors in mouse. An epitope-tagged Gli-activator protein was used to directly isolate cis-regulatory sequences by chromatin immunoprecipitation (ChIP). ChIP products were then used to screen custom genomic tiling arrays of putative Hedgehog (Hh) targets predicted from transcriptional profiling studies, surveying 50-150 kb of non-transcribed sequence for each candidate. In addition to identifying expected Gli-target sites, the data predicted a number of unreported direct targets of Shh action. Transgenic analysis of binding regions in Nkx2.2, Nkx2.1 (Titf1) and Rab34 established these as direct Hh targets. These data also facilitated the generation of an algorithm that improved in silico predictions of Hh target genes. Together, these approaches provide significant new insights into both tissue-specific and general transcriptional targets in a crucial Shh-mediated patterning process.

    View details for DOI 10.1242/dev.001966

    View details for Web of Science ID 000246138700016

    View details for PubMedID 17442700

  • FoxOs are lineage-restricted redundant tumor suppressors and regulate endothelial cell homeostasis CELL Paik, J., Kollipara, R., Chu, G., Ji, H., Xiao, Y., Ding, Z., Miao, L., Tothova, Z., Horner, J. W., Carrasco, D. R., Jiang, S., Gilliland, D. G., Chin, L., Wong, W. H., Castrillon, D. H., DePinho, R. A. 2007; 128 (2): 309-323


    Activated phosphoinositide 3-kinase (PI3K)-AKT signaling appears to be an obligate event in the development of cancer. The highly related members of the mammalian FoxO transcription factor family, FoxO1, FoxO3, and FoxO4, represent one of several effector arms of PI3K-AKT signaling, prompting genetic analysis of the role of FoxOs in the neoplastic phenotypes linked to PI3K-AKT activation. While germline or somatic deletion of up to five FoxO alleles produced remarkably modest neoplastic phenotypes, broad somatic deletion of all FoxOs engendered a progressive cancer-prone condition characterized by thymic lymphomas and hemangiomas, demonstrating that the mammalian FoxOs are indeed bona fide tumor suppressors. Transcriptome and promoter analyses of differentially affected endothelium identified direct FoxO targets and revealed that FoxO regulation of these targets in vivo is highly context-specific, even in the same cell type. Functional studies validated Sprouty2 and PBX1, among others, as FoxO-regulated mediators of endothelial cell morphogenesis and vascular homeostasis.

    View details for DOI 10.1016/j.cell.2006.12.029

    View details for Web of Science ID 000244420500016

    View details for PubMedID 17254969

  • Exon arrays provide accurate assessments of gene expression GENOME BIOLOGY Kapur, K., Xing, Y., Ouyang, Z., Wong, W. H. 2007; 8 (5)


    We have developed a strategy for estimating gene expression on Affymetrix Exon arrays. The method includes a probe-specific background correction and a probe selection strategy in which a subset of probes with highly correlated intensities across multiple samples are chosen to summarize gene expression. Our results demonstrate that the proposed background model offers improvements over the default Affymetrix background correction and that Exon arrays may provide more accurate measurements of gene expression than traditional 3' arrays.

    View details for DOI 10.1186/gb-2007-8-5-r82

    View details for Web of Science ID 000246983100029

    View details for PubMedID 17504534

  • Probe Selection and Expression Index Computation of Affymetrix Exon Arrays PLOS ONE Xing, Y., Kapur, K., Wong, W. H. 2006; 1 (1)


    There is great current interest in developing microarray platforms for measuring mRNA abundance at both gene level and exon level. The Affymetrix Exon Array is a new high-density gene expression microarray platform, with over six million probes targeting all annotated and predicted exons in a genome. An important question for the analysis of exon array data is how to compute overall gene expression indexes. Because of the complexity of the design of exon array probes, this problem is different in nature from summarizing gene-level expression from traditional 3' expression arrays. In this manuscript, we use exon array data from 11 human tissues to study methods for computing gene-level expression. We showed that for most genes there is a subset of exon array probes having highly correlated intensities across multiple samples. We suggest that these probes could be used as reliable indicators of overall gene expression levels. We developed a probe selection algorithm to select such a subset of highly correlated probes for each gene, and computed gene expression indexes using the selected probes. Our results demonstrate that probe selection improves gene expression estimates from exon arrays. The selected probes can be used in future analyses of other exon array datasets to compute gene expression indexes.

    View details for DOI 10.1371/journal.pone.0000088

    View details for Web of Science ID 000207443600087

    View details for PubMedID 17183719

  • A comparative analysis of genome-wide chromatin immunoprecipitation data for mammalian transcription factors NUCLEIC ACIDS RESEARCH Ji, H., Vokes, S. A., Wong, W. H. 2006; 34 (21)


    Genome-wide location analysis (ChIP-chip, ChIP-PET) is a powerful technique to study mammalian transcriptional regulation. In order to obtain a basic understanding of the location data generated for mammalian transcription factors and potential issues in their analysis, we conducted a comparative study of eight independent ChIP experiments involving six different transcription factors in human and mouse. Our cross-study comparisons, to the best of our knowledge the first to analyze multiple datasets, revealed the importance of carefully chosen genomic controls in the de novo identification of key transcription factor binding motifs, raised issues about the interpretation of ubiquitously occurring sequence motifs, and demonstrated the clustering tendency of protein-binding regions for certain transcription factors.

    View details for DOI 10.1093/nar/gkl803

    View details for Web of Science ID 000242716800004

    View details for PubMedID 17090591

  • Computational biology: Toward deciphering gene regulatory information in mammalian genomes BIOMETRICS Ji, H., Wong, W. H. 2006; 62 (3): 645-663


    Computational biology is a rapidly evolving area where methodologies from computer science, mathematics, and statistics are applied to address fundamental problems in biology. The study of gene regulatory information is a central problem in current computational biology. This article reviews recent development of statistical methods related to this field. Starting from microarray gene selection, we examine methods for finding transcription factor binding motifs and cis-regulatory modules in coregulated genes, and methods for utilizing information from cross-species comparisons and ChIP-chip experiments. The ultimate understanding of cis-regulatory logic in mammalian genomes may require the integration of information collected from all these steps.

    View details for DOI 10.1111/j.1541-0420.2006.00625.x

    View details for Web of Science ID 000240708300001

    View details for PubMedID 16984301

  • Is the future biology Shakespearean or Newtonian? MOLECULAR BIOSYSTEMS Lipan, O., Wong, W. H. 2006; 2 (9): 411-416


    "Cells do not care about mathematics" thus concluded a biologist friend after a discussion on the future of biology. And indeed, why should they care? But if we exchange the word "cell" with "rock", "Moon" or "electrons", do we have to change the sentence also? Starting from this line of thought, we review some recent developments in understanding the stochastic behavior of biological systems. We emphasize the importance of a molecular Signal Generator in the study of genetic networks.

    View details for DOI 10.1039/b607243g

    View details for Web of Science ID 000240284300007

    View details for PubMedID 17153137

  • A tale of two morphogen gradients: Identifying Gli targets of Hedgehog Signaling Vokes, S. A., Ji, H., Wong, W. H., McMahon, A. P. ACADEMIC PRESS INC ELSEVIER SCIENCE. 2006: 423-423
  • A study of density of states and ground states in hydrophobic-hydrophilic protein folding models by equi-energy sampling JOURNAL OF CHEMICAL PHYSICS Kou, S. C., Oh, J., Wong, W. H. 2006; 124 (24)


    We propose an equi-energy (EE) sampling approach to study protein folding in the two-dimensional hydrophobic-hydrophilic (HP) lattice model. This approach enables efficient exploration of the global energy landscape and provides accurate estimates of the density of states, which then allows us to conduct a detailed study of the thermodynamics of HP protein folding, in particular, on the temperature dependence of the transition from folding to unfolding and on how sequence composition affects this phenomenon. With no extra cost, this approach also provides estimates on global energy minima and ground states. Without using any prior structural information of the protein, the EE sampler is able to find the ground states that match the best known results in most benchmark cases. The numerical results demonstrate it as a powerful method to study lattice protein folding models.

    View details for DOI 10.1063/1.2208607

    View details for Web of Science ID 000238730600039

    View details for PubMedID 16821999

  • Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data BMC BIOINFORMATICS Zhang, X. G., Lu, X., Shi, Q., Xu, X. Q., Leung, H. C., Harris, L. N., Iglehart, J. D., Miron, A., Liu, J. S., Wong, W. H. 2006; 7


    Like microarray-based investigations, high-throughput proteomics techniques require machine learning algorithms to identify biomarkers that are informative for biological classification problems. Feature selection and classification algorithms need to be robust to noise and outliers in the data. We developed a recursive support vector machine (R-SVM) algorithm to select important genes/biomarkers for the classification of noisy data. We compared its performance to a similar, state-of-the-art method (SVM recursive feature elimination or SVM-RFE), paying special attention to the ability of recovering the true informative genes/biomarkers and the robustness to outliers in the data. Simulation experiments show that a 5% to 20% improvement over SVM-RFE can be achieved with regard to these properties. The SVM-based methods are also compared with a conventional univariate method and their respective strengths and weaknesses are discussed. R-SVM was applied to two sets of SELDI-TOF-MS proteomics data, one from a human breast cancer study and the other from a study on rat liver cirrhosis. Important biomarkers found by the algorithm were validated by follow-up biological experiments. The proposed R-SVM method is suitable for analyzing noisy high-throughput proteomics and microarray data and it outperforms SVM-RFE in the robustness to noise and in the ability to recover informative features. The multivariate SVM-based method outperforms the univariate method in the classification performance, but univariate methods can reveal more of the differentially expressed features, especially when there are correlations between the features.

    View details for DOI 10.1186/1471-2105-7-197

    View details for Web of Science ID 000237263600001

    View details for PubMedID 16606446

  • An improved distance measure between the expression profiles linking co-expression and co-regulation in mouse BMC BIOINFORMATICS Kim, R. S., Ji, H. K., Wong, W. H. 2006; 7


    Many statistical algorithms combine microarray expression data and genome sequence data to identify transcription factor binding motifs in the low eukaryotic genomes. Finding cis-regulatory elements in higher eukaryote genomes, however, remains a challenge, as searching in the promoter regions of genes with similar expression patterns often fails. The difficulty is partially attributable to the poor performance of the similarity measures for comparing expression profiles. The widely accepted measures are inadequate for distinguishing genes transcribed from distinct regulatory mechanisms in the complicated genomes of higher eukaryotes. By defining the regulatory similarity between a gene pair as the number of common known transcription factor binding motifs in the promoter regions, we compared the performance of several expression distance measures on seven mouse expression data sets. We propose a new distance measure that accounts for both the linear trends and fold-changes of expression across the samples. The study reveals that the proposed distance measure for comparing expression profiles enables us to identify genes with a large number of common regulatory elements because it reflects the inherent regulatory information better than widely accepted distance measures such as Pearson's correlation or cosine correlation with or without log transformation.

    View details for DOI 10.1186/1471-2105-7-44

    View details for Web of Science ID 000236062200001

    View details for PubMedID 16438730

  • Reliable prediction of transcription factor binding sites by phylogenetic verification PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Li, X. M., Zhong, S., Wong, W. H. 2005; 102 (47): 16945-16950


    We present a statistical methodology that largely improves the accuracy in computational predictions of transcription factor (TF) binding sites in eukaryote genomes. This method models the cross-species conservation of binding sites without relying on accurate sequence alignment. It can be coupled with any motif-finding algorithm that searches for overrepresented sequence motifs in individual species and can increase the accuracy of the coupled motif-finding algorithm. Because this method is capable of accurately detecting TF binding sites, it also enhances our ability to predict the cis-regulatory modules. We applied this method on the published chromatin immunoprecipitation (ChIP)-chip data in Saccharomyces cerevisiae and found that its sensitivity and specificity are 9% and 14% higher than those of two recent methods. We also recovered almost all of the previously verified TF binding sites and made predictions on the cis-regulatory elements that govern the tight regulation of ribosomal protein genes in 13 eukaryote species (2 plants, 4 yeasts, 2 worms, 2 insects, and 3 mammals). These results give insights to the transcriptional regulation in eukaryotic organisms.

    View details for DOI 10.1073/pnas.0504201102

    View details for Web of Science ID 000233463200009

    View details for PubMedID 16286651

  • De novo discovery of a tissue-specific gene regulatory module in a chordate GENOME RESEARCH Johnson, D. S., Zhou, Q., Yagi, K., Satoh, N., Wong, W., Sidow, A. 2005; 15 (10): 1315-1324


    We engage the experimental and computational challenges of de novo regulatory module discovery in a complex and largely unstudied metazoan genome. Our analysis is based on the comprehensive characterization of regulatory elements of 20 muscle genes in the chordate, Ciona savignyi. Three independent types of data we generate contribute to the characterization of a muscle-specific regulatory module: (1) Positive elements (PEs), short sequences sufficient for strong muscle expression that are identified in a high-resolution in vivo analysis; (2) CisModules (CMs), candidate regulatory modules defined by clusters of overrepresented motifs predicted de novo; and (3) Conserved elements (CEs), short noncoding sequences of strong conservation between C. savignyi and C. intestinalis. We estimate the accuracy of the computational predictions by an analysis of the intersection of these data. As final biological validation of the discovered muscle regulatory module, we implement a novel algorithm to search the genome for instances of the module and identify seven novel enhancers.

    View details for DOI 10.1101/gr.4062605

    View details for Web of Science ID 000232436800001

    View details for PubMedID 16169925

  • TileMap: create chromosomal map of tiling array hybridizations BIOINFORMATICS Ji, H. K., Wong, W. H. 2005; 21 (18): 3629-3636


    The tiling array is a new type of microarray that can be used to survey genomic transcriptional activities and transcription factor binding sites at high resolution. The goal of this paper is to develop effective statistical tools to identify genomic loci that show transcriptional or protein binding patterns of interest. A two-step approach is proposed and is implemented in TileMap. In the first step, a test-statistic is computed for each probe based on a hierarchical empirical Bayes model. In the second step, the test-statistics of probes within a genomic region are used to infer whether the region is of interest or not. The hierarchical empirical Bayes model shrinks variance estimates and increases sensitivity of the analysis. It allows complex multiple sample comparisons that are essential for the study of temporal and spatial patterns of hybridization across different experimental conditions. Neighboring probes are combined through a moving average method (MA) or a hidden Markov model (HMM). Unbalanced mixture subtraction is proposed to provide approximate estimates of false discovery rate for MA and model parameters for HMM. TileMap is freely available. Supplementary information includes coloured versions of all figures.

    View details for DOI 10.1093/bioinformatics/bti593

    View details for Web of Science ID 000231694600007

    View details for PubMedID 16046496

  • Identification of Gli target genes using chromatin immuno-precipitation with a genetically inducible system on genomic arrays. Vokes, S. A., Ji, H. K., Wong, W. H., McMahon, A. P. ACADEMIC PRESS INC ELSEVIER SCIENCE. 2005: 666-666
  • Sampling motifs on phylogenetic trees PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Li, X. M., Wong, W. H. 2005; 102 (27): 9481-9486


    We present a method to find motifs by simultaneously using the overrepresentation property and the evolutionary conservation property of motifs. This method is applicable to divergent species where alignment is unreliable, which overcomes a major limitation of the current methods. The method has been applied to search regulatory motifs in four yeast species based on ChIP-chip data in Saccharomyces cerevisiae and obtained 20% higher accuracy than the best current methods. We also discovered cis-regulatory elements that govern the tight regulation of ribosomal protein genes in two distantly related insects by using this method. These results demonstrate that our method will be useful for the extraction of regulatory signals in multiple genomes.

    View details for DOI 10.1073/pnas.0501620102

    View details for Web of Science ID 000230406000010

    View details for PubMedID 15983378

  • mSin3A corepressor regulates diverse transcriptional networks governing normal and neoplastic growth and survival GENES & DEVELOPMENT Dannenberg, J. H., David, G., Zhong, S., van der Torre, J., Wong, W. H., DePinho, R. A. 2005; 19 (13): 1581-1595


    mSin3A is a core component of a large multiprotein corepressor complex with associated histone deacetylase (HDAC) enzymatic activity. Physical interactions of mSin3A with many sequence-specific transcription factors have linked the mSin3A corepressor complex to the regulation of diverse signaling pathways and associated biological processes. To dissect the complex nature of mSin3A's actions, we monitored the impact of conditional mSin3A deletion on the developmental, cell biological, and transcriptional levels. mSin3A was shown to play an essential role in early embryonic development and in the proliferation and survival of primary, immortalized, and transformed cells. Genetic and biochemical analyses established a role for mSin3A/HDAC in p53 deacetylation and activation, although genetic deletion of p53 was not sufficient to attenuate the mSin3A null cell lethal phenotype. Consistent with mSin3A's broad biological activities beyond regulation of the p53 pathway, time-course gene expression profiling following mSin3A deletion revealed deregulation of genes involved in cell cycle regulation, DNA replication, DNA repair, apoptosis, chromatin modifications, and mitochondrial metabolism. Computational analysis of the mSin3A transcriptome using a knowledge-based database revealed several nodal points through which mSin3A influences gene expression, including the Myc-Mad, E2F, and p53 transcriptional networks. Further validation of these nodes derived from in silico promoter analysis showing enrichment for Myc-Mad, E2F, and p53 cis-regulatory elements in regulatory regions of up-regulated genes following mSin3A depletion. Significantly, in silico promoter analyses also revealed specific cis-regulatory elements binding the transcriptional activator Stat and the ISWI ATP-dependent nucleosome remodeling factor Falz, thereby expanding further the mSin3A network of regulatory factors. Together, these integrated genetic, biochemical, and computational studies demonstrate the involvement of mSin3A in the regulation of diverse pathways governing many aspects of normal and neoplastic growth and survival and provide an experimental framework for the analysis of essential genes with diverse biological functions.

    View details for DOI 10.1101/gad.1286905

    View details for Web of Science ID 000230334600008

    View details for PubMedID 15998811

  • A small-molecule inhibitor of Mps1 blocks the spindle-checkpoint response to a lack of tension on mitotic chromosomes CURRENT BIOLOGY Dorer, R. K., Zhong, S., Tallarico, J. A., Wong, W. H., Mitchison, T. J., Murray, A. W. 2005; 15 (11): 1070-1076


    The spindle checkpoint prevents chromosome loss by preventing chromosome segregation in cells with improperly attached chromosomes [1, 2 and 3]. The checkpoint senses defects in the attachment of chromosomes to the mitotic spindle [4] and the tension exerted on chromosomes by spindle forces in mitosis [5, 6 and 7]. Because many cancers have defects in chromosome segregation, this checkpoint may be required for survival of tumor cells and may be a target for chemotherapy. We performed a phenotype-based chemical-genetic screen in budding yeast and identified an inhibitor of the spindle checkpoint, called cincreasin. We used a genome-wide collection of yeast gene-deletion strains and traditional genetic and biochemical analysis to show that the target of cincreasin is Mps1, a protein kinase required for checkpoint function [8]. Despite the requirement for Mps1 for sensing both the lack of microtubule attachment and tension at kinetochores, we find concentrations of cincreasin that selectively inhibit the tension-sensitive branch of the spindle checkpoint. At these concentrations, cincreasin causes lethal chromosome missegregation in mutants that display chromosomal instability. Our results demonstrate that Mps1 can be exploited as a target and that inhibiting the tension-sensitive branch of the spindle checkpoint may be a way of selectively killing cancer cells that display chromosomal instability.

    View details for DOI 10.1016/j.cub.2005.05.020

    View details for Web of Science ID 000229984100031

    View details for PubMedID 15936280

  • UbIC(2) - Towards ubiquitous bio-information computing: Data protocols, middleware, and web services for heterogeneous biological information integration and retrieval INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING Hong, P. Y., Zhong, S., Wong, W. H. 2005; 15 (3): 475-485
  • A boosting approach for motif modeling using ChIP-chip data BIOINFORMATICS Hong, P. Y., Liu, X. S., Zhou, Q., Lu, X., Liu, J. S., Wong, W. H. 2005; 21 (11): 2636-2643


    Building an accurate binding model for a transcription factor (TF) is essential to differentiate its true binding targets from spurious ones. This is an important step toward understanding gene regulation. This paper describes a boosting approach to modeling TF-DNA binding. Different from the widely used weight matrix model, which predicts TF-DNA binding based on a linear combination of position-specific contributions, our approach builds a TF binding classifier by combining a set of weight matrix based classifiers, thus yielding a non-linear binding decision rule. The proposed approach was applied to the ChIP-chip data of Saccharomyces cerevisiae. When compared with the weight matrix method, our new approach showed significant improvements on the specificity in a majority of cases.

    View details for DOI 10.1093/bioinformatics/bti402

    View details for Web of Science ID 000229441500010

    View details for PubMedID 15817698

  • Tight clustering: A resampling-based approach for identifying stable and tight patterns in data BIOMETRICS Tseng, G. C., Wong, W. H. 2005; 61 (1): 10-16


    In this article, we propose a method for clustering that produces tight and stable clusters without forcing all points into clusters. The methodology is general but was initially motivated from cluster analysis of microarray experiments. Most current algorithms aim to assign all genes into clusters. For many biological studies, however, we are mainly interested in identifying the most informative, tight, and stable clusters of sizes, say, 20-60 genes for further investigation. We want to avoid the contamination of tightly regulated expression patterns of biologically relevant genes due to other genes whose expressions are only loosely compatible with these patterns. "Tight clustering" has been developed specifically to address this problem. It applies K-means clustering as an intermediate clustering engine. Early truncation of a hierarchical clustering tree is used to overcome the local minimum problem in K-means clustering. The tightest and most stable clusters are identified in a sequential manner through an analysis of the tendency of genes to be grouped together under repeated resampling. We validated this method in a simulated example and applied it to analyze a set of expression profiles in the study of embryonic stem cells.

    View details for Web of Science ID 000227576600002

    View details for PubMedID 15737073

  • Comparative linkage analysis and visualization of high-density oligonucleotide SNP array data BMC GENETICS Leykin, I., Hao, K., Cheng, J. S., Meyer, N., Pollak, M. R., Smith, R. J., Wong, W. H., Rosenow, C., Li, C. 2005; 6


    The identification of disease-associated genes using single nucleotide polymorphisms (SNPs) has been increasingly reported. In particular, the Affymetrix Mapping 10 K SNP microarray platform uses one PCR primer to amplify the DNA samples and determine the genotype of more than 10,000 SNPs in the human genome. This provides the opportunity for large scale, rapid and cost-effective genotyping assays for linkage analysis. However, the analysis of such datasets is nontrivial because of the large number of markers, and visualizing the linkage scores in the context of genome maps remains less automated using the current linkage analysis software packages. For example, the haplotyping results are commonly represented in the text format. Here we report the development of a novel software tool called CompareLinkage for automated formatting of the Affymetrix Mapping 10 K genotype data into the "Linkage" format and the subsequent analysis with multi-point linkage software programs such as Merlin and Allegro. The new software has the ability to visualize the results for all these programs in dChip in the context of genome annotations and cytoband information. In addition we implemented a variant of the Lander-Green algorithm in the dChipLinkage module of dChip software (V1.3) to perform parametric linkage analysis and haplotyping of SNP array data. These functions are integrated with the existing modules of dChip to visualize SNP genotype data together with LOD score curves. We have analyzed three families with recessive and dominant diseases using the new software programs and the comparison results are presented and discussed. The CompareLinkage and dChipLinkage software packages are freely available. They provide the visualization tools for high-density oligonucleotide SNP array data, as well as the automated functions for formatting SNP array data for the linkage analysis programs Merlin and Allegro and calling these programs for linkage analysis. The results can be visualized in dChip in the context of genes and cytobands. In addition, a variant of the Lander-Green algorithm is provided that allows parametric linkage analysis and haplotyping.

    View details for DOI 10.1186/1471-2156-6-7

    View details for Web of Science ID 000227316700001

    View details for PubMedID 15713228

  • GeneNotes - A novel information management software for biologists BMC BIOINFORMATICS Hong, P. Y., Wong, W. H. 2005; 6


    Collecting and managing information is a challenging task in a genome-wide profiling research project. Most databases and online computational tools require direct human involvement. Information and computational results are presented in various multimedia formats (e.g., text, image, PDF, Word files, etc.), many of which cannot be automatically processed by computers in biologically meaningful ways. In addition, the quality of computational results is far from perfect and requires nontrivial manual examination. The timely selection, integration and interpretation of heterogeneous biological information still heavily rely on the sensibility of biologists. Biologists often feel overwhelmed by the huge amount and great diversity of distributed heterogeneous biological information. We developed an information management application called GeneNotes. GeneNotes is the first application that allows users to collect and manage multimedia biological information about genes/ESTs. GeneNotes provides an integrated environment for users to surf the Internet, collect notes for genes/ESTs, and retrieve notes. GeneNotes is supported by a server that integrates gene annotations from many major databases (e.g., HGNC, MGI, etc.). GeneNotes uses the integrated gene annotations to (a) identify genes given various types of gene IDs (e.g., RefSeq ID, GenBank ID, etc.), and (b) provide quick views of genes. GeneNotes is free for academic usage. The program and the tutorials are available online. GeneNotes provides a novel human-computer interface to assist researchers to collect and manage biological information. It also provides a platform for studying how users behave when they manipulate biological information. The results of such study can lead to innovation of more intelligent human-computer interfaces that greatly shorten the cycle of biology research.

    View details for DOI 10.1186/1471-2105-6-20

    View details for Web of Science ID 000227451700001

    View details for PubMedID 15686593

  • Functional annotation and network reconstruction through cross-platform integration of microarray data NATURE BIOTECHNOLOGY Zhou, X. H., Kao, M. C., Huang, H. Y., Wong, A., Nunez-Iglesias, J., Primig, M., Aparicio, O. M., Finch, C. E., Morgan, T. E., Wong, W. H. 2005; 23 (2): 238-243


    The rapid accumulation of microarray data translates into a need for methods to effectively integrate data generated with different platforms. Here we introduce an approach, 2nd-order expression analysis, that addresses this challenge by first extracting expression patterns as meta-information from each data set (1st-order expression analysis) and then analyzing them across multiple data sets. Using yeast as a model system, we demonstrate two distinct advantages of our approach: we can identify genes of the same function yet without coexpression patterns and we can elucidate the cooperativities between transcription factors for regulatory network reconstruction by overcoming a key obstacle, namely the quantification of activities of transcription factors. Experiments reported in the literature and performed in our lab support a significant number of our predictions.

    View details for DOI 10.1038/nbt1058

    View details for Web of Science ID 000226797600032

    View details for PubMedID 15654329

  • Detect and adjust for population stratification in population-based association study using genomic control markers: an application of Affymetrix Genechip (R) Human Mapping 10K array EUROPEAN JOURNAL OF HUMAN GENETICS Hao, K., Li, C., Rosenow, C., Wong, W. H. 2004; 12 (12): 1001-1006


    Population-based association design is often compromised by false or nonreplicable findings, partially due to population stratification. Genomic control (GC) approaches were proposed to detect and adjust for this confounder. To date, the performance of this strategy has not been extensively evaluated on real data. More than 10 000 single-nucleotide polymorphisms (SNPs) were genotyped on subjects from four populations (including an Asian, an African-American and two Caucasian populations) using the GeneChip Mapping 10 K array. On these data, we tested the performance of two GC approaches in different scenarios including various numbers of GC markers and different degrees of population stratification. In the scenario of substantial population stratification, both GC approaches are sensitive using only 20-50 random SNPs, and the mixed subjects can be separated into homogeneous subgroups. In the scenario of moderate stratification, both GC approaches have poor sensitivities. However, the bias in the association test can still be corrected even when no statistically significant population stratification is detected. We conducted extensive benchmark analyses on GC approaches using SNPs over the whole human genome. We found that the GC method can cluster subjects into homogeneous subgroups if there is a substantial difference in genetic background. The inflation factor, estimated by GC markers, can effectively adjust for the confounding effect of population stratification regardless of its extent. We also suggest that as few as 50 random SNPs with heterozygosity >40% should be sufficient as genomic controls.

    View details for DOI 10.1038/sj.ejhg.5201273

    View details for Web of Science ID 000225165200004

    View details for PubMedID 15367915

  • Estimation of genotype error rate using samples with pedigree information - an application on the GeneChip Mapping 10K array GENOMICS Hao, K., Li, C., Rosenow, C., Wong, W. H. 2004; 84 (4): 623-630


    Currently, most analytical methods assume all observed genotypes are correct; however, it is clear that errors may reduce statistical power or bias inference in genetic studies. We propose procedures for estimating error rate in genetic analysis and apply them to study the GeneChip Mapping 10K array, which is a technology that has recently become available and allows researchers to survey over 10,000 SNPs in a single assay. We employed a strategy to estimate the genotype error rate in pedigree data. First, the "dose-response" reference curve between the error rate and the observable error number was derived by simulation, conditional on given pedigree structures and genotypes. Second, the error rate was estimated by calibrating the number of observed errors in real data to the reference curve. We evaluated the performance of this method by simulation study and applied it to a data set of 30 pedigrees genotyped using the GeneChip Mapping 10K array. This method performed favorably in all scenarios we surveyed. The dose-response reference curve was monotone and almost linear with a large slope. The method was able to estimate accurately the error rate under various pedigree structures and error models and under heterogeneous error rates. Using this method, we found that the average genotyping error rate of the GeneChip Mapping 10K array was about 0.1%. Our method provides a quick and unbiased solution to address the genotype error rate in pedigree data. It behaves well in a wide range of settings and can be easily applied in other genetic projects. The robust estimation of genotyping error rate allows us to estimate power and sample size and conduct unbiased genetic tests. The GeneChip Mapping 10K array has a low overall error rate, which is consistent with the results obtained from alternative genotyping assays.

    View details for DOI 10.1016/j.ygeno.2004.05.003

    View details for Web of Science ID 000224091200001

    View details for PubMedID 15475239

  • CisModule: De novo discovery of cis-regulatory modules by hierarchical mixture modeling PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Zhou, Q., Wong, W. H. 2004; 101 (33): 12114-12119


    The regulatory information for a eukaryotic gene is encoded in cis-regulatory modules. The binding sites for a set of interacting transcription factors have the tendency to colocalize to the same modules. Current de novo motif discovery methods do not take advantage of this knowledge. We propose a hierarchical mixture approach to model the cis-regulatory module structure. Based on the model, a new de novo motif-module discovery algorithm, CisModule, is developed for the Bayesian inference of module locations and within-module motif sites. Dynamic programming-like recursions are developed to reduce the computational complexity from exponential to linear in sequence length. By using both simulated and real data sets, we demonstrate that CisModule is not only accurate in predicting modules but also more sensitive in detecting motif patterns and binding sites than standard motif discovery methods are.

    View details for DOI 10.1073/pnas.0402858101

    View details for Web of Science ID 000223410100038

    View details for PubMedID 15297614

  • Integrated analysis of microarray data and gene function information OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY Cui, Y., Zhou, M., Wong, W. H. 2004; 8 (2): 106-117


    Microarray data should be interpreted in the context of existing biological knowledge. Here we present integrated analysis of microarray data and gene function classification data using homogeneity analysis. Homogeneity analysis is a graphical multivariate statistical method for analyzing categorical data. It converts categorical data into graphical display. By simultaneously quantifying the microarray-derived gene groups and gene function categories, it captures the complex relations between biological information derived from microarray data and the existing knowledge about the gene function. Thus, homogeneity analysis provides a mathematical framework for integrating the analysis of microarray data and the existing biological knowledge.

    View details for Web of Science ID 000223063500003

    View details for PubMedID 15268770

  • Molecular diversity of astrocytes with implications for neurological disorders PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Bachoo, R. M., Kim, R. S., Ligon, K. L., Maher, E. A., Brennan, C., Billings, N., Chan, S., Li, C., Rowitch, D. H., Wong, W. H., DePinho, R. A. 2004; 101 (22): 8384-8389


    The astrocyte represents the most abundant yet least understood cell type of the CNS. Here, we use a stringent experimental strategy to molecularly define the astrocyte lineage by integrating microarray datasets across several in vitro model systems of astrocyte differentiation, primary astrocyte cultures, and various astrocyte-rich CNS structures. The intersection of astrocyte data sets, coupled with the application of nonastrocytic exclusion filters, yielded many astrocyte-specific genes possessing strikingly varied patterns of regional CNS expression. Annotation of these astrocyte-specific genes provides direct molecular documentation of the diverse physiological roles of the astrocyte lineage. This global perspective in the normal brain also provides a framework for how astrocytes may participate in the pathogenesis of common neurological disorders like Alzheimer's disease, Parkinson's disease, stroke, epilepsy, and primary brain tumors.

    View details for DOI 10.1073/pnas.0402140101

    View details for Web of Science ID 000221831800025

    View details for PubMedID 15155908

  • GoSurfer: a graphical interactive tool for comparative analysis of large gene sets in Gene Ontology space. Applied bioinformatics Zhong, S., Storch, K., Lipan, O., Kao, M. J., Weitz, C. J., Wong, W. H. 2004; 3 (4): 261-264


    The analysis of complex patterns of gene regulation is central to understanding the biology of cells, tissues and organisms. Patterns of gene regulation pertaining to specific biological processes can be revealed by a variety of experimental strategies, particularly microarrays and other highly parallel methods, which generate large datasets linking many genes. Although methods for detecting gene expression have improved substantially in recent years, understanding the physiological implications of complex patterns in gene expression data is a major challenge. This article presents GoSurfer, an easy-to-use graphical exploration tool with built-in statistical features that allow a rapid assessment of the biological functions represented in large gene sets. GoSurfer takes one or two list(s) of gene identifiers (Affymetrix probe set ID) as input and retrieves all the Gene Ontology (GO) terms associated with the input genes. GoSurfer visualises these GO terms in a hierarchical tree format. With GoSurfer, users can perform statistical tests to search for the GO terms that are enriched in the annotations of the input genes. These GO terms can be highlighted on the GO tree. Users can manipulate the GO tree in various ways and interactively query the genes associated with any GO term. The user-generated graphics can be saved as graphics files, and all the GO information related to the input genes can be exported as text files. GoSurfer is a Windows-based program freely available for noncommercial use and can be downloaded online. Datasets used to construct the trees shown in the figures in this article are also available online.

    View details for PubMedID 15702958

  • Towards Ubiquitous Bio-Information Computing: Data protocols, middleware, and Web services for heterogeneous biological information integration and retrieval BIBE 2004: FOURTH IEEE SYMPOSIUM ON BIOINFORMATICS AND BIOENGINEERING, PROCEEDINGS Hong, P. Y., Zhong, S., Wong, W. H. 2004: 57-64
  • Relaxed simulated tempering for VLSI floorplan designs PROCEEDINGS OF ASP-DAC '99 Cong, J., Kong, T. M., Xu, D. M., Liang, F. M., Liu, J. S., Wong, W. H. 1999: 13-16