Instructor, Neurology & Neurological Sciences
The APOE4 allele is the strongest genetic risk factor for sporadic Alzheimer disease (AD). Case-control studies suggest the APOE4 link to AD is stronger in women. We examined the APOE4-by-sex interaction in conversion risk (from healthy aging to mild cognitive impairment (MCI)/AD or from MCI to AD) and cerebrospinal fluid (CSF) biomarker levels. Cox proportional hazards analysis was used to compute hazard ratios (HRs) for an APOE-by-sex interaction on conversion in controls (n = 5,496) and MCI patients (n = 2,588). The interaction was also tested in CSF biomarker levels of 980 subjects from the Alzheimer's Disease Neuroimaging Initiative. Among controls, male and female carriers were more likely to convert to MCI/AD, but the effect was stronger in women (HR = 1.81 for women; HR = 1.27 for men; interaction: p = 0.011). The interaction remained significant in a predefined subanalysis restricted to APOE3/3 and APOE3/4 genotypes. Among MCI patients, both male and female APOE4 carriers were more likely to convert to AD (HR = 2.16 for women; HR = 1.64 for men); the interaction was not significant (p = 0.14). In the subanalysis restricted to APOE3/3 and APOE3/4 genotypes, the interaction was significant (p = 0.02; HR = 2.17 for women; HR = 1.51 for men). The APOE4-by-sex interaction on biomarker levels was significant for MCI patients for total tau and the tau-to-Aβ ratio (p = 0.009 and p = 0.02, respectively; more AD-like in women). APOE4 confers greater AD risk in women. Biomarker results suggest that increased APOE-related risk in women may be associated with tau pathology. These findings have important clinical implications and suggest novel research approaches into AD pathogenesis.
View details for DOI 10.1002/ana.24135
View details for Web of Science ID 000335234200011
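As a rough illustration of how the sex-specific hazard ratios in the abstract above relate to an interaction term: in a Cox model with an APOE4-by-sex product term and men as the reference group, the carrier HR in women equals the carrier HR in men multiplied by the exponentiated interaction coefficient. The sketch below assumes the reported HRs can be read that way; the abstract does not state the exact parametrization, and the function names are illustrative only.

```python
import math

def interaction_ratio(hr_women: float, hr_men: float) -> float:
    """Ratio of the sex-specific hazard ratios.

    Under the ASSUMED parametrization (men as reference, one
    APOE4-by-sex product term) this equals exp(beta_interaction).
    """
    return hr_women / hr_men

def interaction_coefficient(hr_women: float, hr_men: float) -> float:
    """Log-scale interaction coefficient implied by the two HRs."""
    return math.log(hr_women) - math.log(hr_men)

# Hazard ratios quoted in the abstract.
controls = interaction_ratio(1.81, 1.27)   # healthy controls -> MCI/AD
mci = interaction_ratio(2.16, 1.64)        # MCI -> AD

print(f"controls: {controls:.3f}, MCI: {mci:.3f}")
```

The ratio only summarizes the point estimates; whether the interaction is significant depends on the standard errors, which the abstract reports through the p-values rather than the HRs themselves.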
Childhood maltreatment is likely to influence fundamental biological processes and engrave long-lasting epigenetic marks, leading to adverse health outcomes in adulthood. We aimed to elucidate the impact of different early environments on disease-related genome-wide gene expression and DNA methylation in peripheral blood cells in patients with posttraumatic stress disorder (PTSD). Compared with the same trauma-exposed controls (n = 108), gene-expression profiles of PTSD patients with similar clinical symptoms and matched adult trauma exposure but different childhood adverse events (n = 32 and 29) were almost completely nonoverlapping (98%). These differences on the level of individual transcripts were paralleled by the enrichment of several distinct biological networks between the groups. Moreover, these gene-expression changes were accompanied and likely mediated by changes in DNA methylation in the same loci to a much larger proportion in the childhood abuse (69%) vs. the non-child abuse-only group (34%). This study is unique in providing genome-wide evidence of distinct biological modifications in PTSD in the presence or absence of exposure to childhood abuse. The findings that nonoverlapping biological pathways seem to be affected in the two PTSD groups and that changes in DNA methylation appear to have a much greater impact in the childhood-abuse group might reflect differences in the pathophysiology of PTSD, depending on exposure to childhood maltreatment. These results contribute to a better understanding of the extent of influence of differences in trauma exposure on pathophysiological processes in stress-related psychiatric disorders and may have implications for personalized medicine.
View details for DOI 10.1073/pnas.1217750110
View details for Web of Science ID 000319803500068
View details for PubMedID 23630272
High-throughput DNA sequencing (HTS) is of increasing importance in the life sciences. One of its most prominent applications is the sequencing of whole genomes or targeted regions of the genome such as all exonic regions (i.e., the exome). Here, the objective is the identification of genetic variants such as single nucleotide polymorphisms (SNPs). The extraction of SNPs from the raw genetic sequences involves many processing steps and the application of a diverse set of tools. We review the essential building blocks for a pipeline that calls SNPs from raw HTS data. The pipeline includes quality control, mapping of short reads to the reference genome, visualization and post-processing of the alignment including base quality recalibration. The final steps of the pipeline include the SNP calling procedure along with filtering of SNP candidates. The steps of this pipeline are accompanied by an analysis of a publicly available whole-exome sequencing dataset. To this end, we employ several alignment programs and SNP calling routines for highlighting the fact that the choice of the tools significantly affects the final results.
View details for DOI 10.1007/s00439-012-1213-z
View details for Web of Science ID 000308249300003
View details for PubMedID 22886560
In life sciences, interpretability of machine learning models is as important as their prediction accuracy. Linear models are probably the most frequently used methods for assessing feature relevance, despite their relative inflexibility. However, in recent years effective estimators of feature relevance have been derived for highly complex or non-parametric models such as support vector machines and RandomForest (RF) models. Recently, it has been observed that RF models are biased in such a way that categorical variables with a large number of categories are preferred. In this work, we introduce a heuristic for normalizing feature importance measures that can correct the feature importance bias. The method is based on repeated permutations of the outcome vector for estimating the distribution of measured importance for each variable in a non-informative setting. The P-value of the observed importance provides a corrected measure of feature importance. We apply our method to simulated data and demonstrate that (i) non-informative predictors do not receive significant P-values, (ii) informative variables can successfully be recovered among non-informative variables and (iii) P-values computed with permutation importance (PIMP) are very helpful for deciding the significance of variables, and therefore improve model interpretability. Furthermore, PIMP was used to correct RF-based importance measures for two real-world case studies. We propose an improved RF model that uses the significant variables with respect to the PIMP measure and show that its prediction accuracy is superior to that of other existing models. R code for the method presented in this article is available at http://www.mpi-inf.mpg.de/~altmann/download/PIMP.R. Contact: email@example.com, firstname.lastname@example.org. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btq134
View details for Web of Science ID 000277447500009
View details for PubMedID 20385727
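The permutation idea in the abstract above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' R implementation: it uses absolute Pearson correlation as a stand-in importance measure (the paper works with RF-based importances) and an empirical P-value from the permutation counts rather than a fitted null distribution. All function and variable names are hypothetical.

```python
import random

def abs_correlation(x, y):
    """Stand-in importance measure (ASSUMPTION): absolute Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5

def permutation_p_values(features, outcome, importance=abs_correlation,
                         n_permutations=200, seed=0):
    """Empirical P-value of each feature's importance under outcome permutation."""
    rng = random.Random(seed)
    observed = [importance(col, outcome) for col in features]
    exceed = [0] * len(features)
    for _ in range(n_permutations):
        shuffled = outcome[:]
        rng.shuffle(shuffled)  # breaks any feature-outcome association
        for j, col in enumerate(features):
            if importance(col, shuffled) >= observed[j]:
                exceed[j] += 1
    # Add-one correction keeps the P-value strictly positive.
    return [(1 + e) / (1 + n_permutations) for e in exceed]

# Toy data: one feature tracks the outcome, the other is pure noise.
rng = random.Random(1)
y = [rng.gauss(0, 1) for _ in range(100)]
informative = [v + rng.gauss(0, 0.1) for v in y]
noise = [rng.gauss(0, 1) for _ in range(100)]
p_informative, p_noise = permutation_p_values([informative, noise], y)
print(p_informative, p_noise)
```

Permuting the outcome rather than the features keeps the dependence structure among predictors intact, which is why a single shuffled outcome vector can be reused for every feature in each round.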
The outcome of antiretroviral combination therapy depends on many factors involving host, virus, and drugs. We investigate prediction of treatment response from the applied drug combination and the genetic constellation of the virus population at baseline. The virus's evolutionary potential for escaping from drug pressure is explored as an additional predictor. We compare different encodings of the viral genotype and antiretroviral regimen including phenotypic and evolutionary information, namely predicted phenotypic drug resistance, activity of the regimen estimated from sequence space search, the genetic barrier to drug resistance, and the genetic progression score. These features were evaluated in the context of different statistical learning procedures applied to the binary classification task of predicting virological response. Classifier performance was evaluated using cross-validation and receiver operating characteristic curves on 6,337 observed treatment change episodes from the Stanford HIV Drug Resistance Database and a large US clinic-based patient population. We find that the choice of appropriate features affects predictive performance more profoundly than the choice of the statistical learning method. Application of the genetic barrier to drug resistance, which combines phenotypic and evolutionary information, outperformed the genetic progression score, which uses exclusively evolutionary knowledge. The benefit of phenotypic information in predicting virological response was confirmed by using predicted fold changes in drug susceptibility. Moreover, genetic barrier and predicted phenotypic drug resistance were found to be the best encodings across all datasets and statistical learning methods examined. THEO (THErapy Optimizer), a prototypical implementation of the best performing approach, is freely available for research purposes at http://www.geno2pheno.org.
View details for Web of Science ID 000247111500004
View details for PubMedID 17503659
Alzheimer's disease (AD) is an increasingly prevalent, fatal neurodegenerative disease that has proven resistant, thus far, to all attempts to prevent it, forestall it, or slow its progression. The ε4 allele of the Apolipoprotein E gene (APOE4) is a potent genetic risk factor for sporadic and late-onset familial AD. While the link between APOE4 and AD is strong, many expected effects, like increasing the risk of conversion from MCI to AD, have not been widely replicable. One critical, and commonly overlooked, feature of the APOE4 link to AD is that several lines of evidence suggest it is far more pronounced in women than in men. Here we review previous literature on the APOE4 by gender interaction with a particular focus on imaging-related studies.
View details for DOI 10.1007/s11682-013-9272-x
View details for Web of Science ID 000335765700009
SLC6A15 is a neuron-specific neutral amino acid transporter that belongs to the solute carrier 6 gene family. This gene family is responsible for presynaptic re-uptake of the majority of neurotransmitters. Convergent data from human studies, animal models and pharmacological investigations suggest a possible role of SLC6A15 in major depressive disorder. In this work, we explored potential functional variants in this gene that could influence the activity of the amino acid transporter and thus downstream neuronal function and possibly the risk for stress-related psychiatric disorders. DNA from 400 depressed patients and 400 controls was screened for genetic variants using a pooled targeted re-sequencing approach. Results were verified by individual re-genotyping and validated non-synonymous coding variants were tested in an independent sample (N = 1934). Nine variants altering the amino acid sequence were then assessed for their functional effects by measuring SLC6A15 transporter activity in a cellular uptake assay. In total, we identified 405 genetic variants, including twelve non-synonymous variants. While none of the non-synonymous coding variants showed significant differences in case-control associations, two rare non-synonymous variants were associated with a significantly increased maximal ³H-proline uptake as compared to the wildtype sequence. Our data suggest that genetic variants in the SLC6A15 locus change the activity of the amino acid transporter and might thus influence its neuronal function and the risk for stress-related psychiatric disorders. As statistically significant association for rare variants might only be achieved in extremely large samples (N > 70,000), functional exploration may shed light on putatively disease-relevant variants.
View details for DOI 10.1371/journal.pone.0068645
View details for Web of Science ID 000322064300058
View details for PubMedID 23874702
Recent advances in massively parallel sequencing (MPS) have had an extensive impact on research in medical genomics. In particular, the analysis of rare variants using MPS promises to lead to a better understanding of complex disorders. Nevertheless, for meaningful studies that address the genetic basis for neuropsychiatric disorders, at least hundreds of patient samples have to be analyzed. This undertaking is still not feasible for single research groups on a whole-genome scale and in individual samples. Thus, researchers increasingly employ strategies for reducing the amount of sequencing efforts, such as target enrichment and non-barcoded sample pooling. This review provides an overview of current technologies, discusses options for reduced experimental designs, and illustrates the successful application of the presented methodologies in a recent study of panic disorder patients. Thereby, it aims to introduce the emerging field of MPS into neuropsychiatric research and might serve as a guide for further studies.
View details for DOI 10.1007/s11920-012-0333-4
View details for Web of Science ID 000318759800003
View details for PubMedID 23250814
Until recently, genotype studies were limited to the investigation of single SNP effects due to the computational burden incurred when studying pairwise interactions of SNPs. However, some genetic effects as simple as coloring (in plants and animals) cannot be ascribed to a single locus but only understood when epistasis is taken into account. It is expected that such effects are also found in complex diseases where many genes contribute to the clinical outcome of affected individuals. Only recently have such problems become feasible computationally. The inherently parallel structure of the problem makes it a perfect candidate for massive parallelization on either grid or cloud architectures. Since we are also dealing with confidential patient data, we were not able to consider a cloud-based solution but had to find a way to process the data in-house and aimed to build a local GPU-based grid structure. Sequential epistasis calculations were ported to GPU using CUDA at various levels. Parallelization on the CPU was compared to corresponding GPU counterparts with regard to performance and cost. A cost-effective solution was created by combining custom-built nodes equipped with relatively inexpensive consumer-level graphics cards with highly parallel GPUs in a local grid. The GPU method outperforms current cluster-based systems on a price/performance criterion, as a single GPU shows speed performance comparable to up to 200 CPU cores. The outlined approach will work for problems that easily lend themselves to massive parallelization. Code for various tasks has been made available and ongoing development of tools will further ease the transition from sequential to parallel algorithms.
View details for DOI 10.3414/ME11-02-0049
View details for Web of Science ID 000313847800011
View details for PubMedID 23223640
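To make the computational burden and the "inherently parallel structure" mentioned in the abstract above concrete, the following sketch counts SNP pairs and splits them into independent blocks. It is plain CPU-side Python, not the authors' CUDA code; the SNP count of 500,000 is an assumed, illustrative figure and the function names are hypothetical.

```python
from itertools import combinations

def n_pairs(n_snps: int) -> int:
    """Number of unordered SNP pairs, n * (n - 1) / 2."""
    return n_snps * (n_snps - 1) // 2

def pair_blocks(n_snps: int, block_size: int):
    """Yield lists of SNP index pairs.

    No pair depends on any other pair, so each block could be handed
    to a separate worker (CPU core, GPU kernel launch, grid node).
    """
    block = []
    for pair in combinations(range(n_snps), 2):
        block.append(pair)
        if len(block) == block_size:
            yield block
            block = []
    if block:  # final, possibly shorter block
        yield block

# ASSUMPTION: 500,000 SNPs is an illustrative array size, not from the abstract.
print(n_pairs(500_000))

# Tiny example: 5 SNPs give 10 pairs, split into blocks of at most 4.
blocks = list(pair_blocks(5, 4))
print([len(b) for b in blocks])
```

The quadratic growth in the number of pairs is what makes the single-locus case cheap and the pairwise case expensive, while the independence of the blocks is what makes it parallelizable.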
Genome-wide association studies have identified common variants associated with common diseases. Most variants, however, explain only a small proportion of the estimated heritability, suggesting that rare variants might contribute to a larger extent to common diseases than assumed to date. Here, we use next-generation sequencing to test whether such variants contribute to the risk for anxiety disorders by re-sequencing 40 kb including all exons of the TMEM132D locus, which we have previously shown to be associated with panic disorder and anxiety severity measures. DNA from 300 patients suffering from anxiety disorders, mostly panic disorder (84.7%), and 300 healthy controls was screened for the presence of genetic variants using next-generation re-sequencing in a pooled approach. Results were verified by individual re-genotyping. We identified 371 variants of which 247 had not been reported before, including 15 novel non-synonymous variants. The majority (76%) of these variants had a minor allele frequency less than 5%. While we did not identify additional common variants in TMEM132D associated with panic disorders, we observed an overrepresentation of presumably functional coding variants in healthy controls as compared to cases as well as a higher rate of private coding variants in cases, with one non-synonymous coding variant present in four patients but not in any of the matched controls nor in over 5,500 individuals of different ethnic origins from publicly available re-sequencing datasets. Our data suggest that not only common but also putatively functional and/or rare variants within TMEM132D might contribute to the risk of developing anxiety disorders.
View details for DOI 10.1002/ajmg.b.32096
View details for Web of Science ID 000312725200002
View details for PubMedID 22911938
Due to recent advances in genotyping technologies, mapping phenotypes to single loci in the genome has become a standard technique in statistical genetics. However, one-locus mapping fails to explain much of the phenotypic variance in complex traits. Here, we present GLIDE, which maps phenotypes to pairs of genetic loci and systematically searches for the epistatic interactions expected to reveal part of this missing heritability. GLIDE makes use of the computational power of consumer-grade graphics cards to detect such interactions via linear regression. This enabled us to conduct a systematic two-locus mapping study on seven disease data sets from the Wellcome Trust Case Control Consortium and on in-house hippocampal volume data in 6 h per data set, while current single CPU-based approaches require more than a year's time to complete the same task.
View details for DOI 10.1159/000341885
View details for Web of Science ID 000309381500004
View details for PubMedID 22965145
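The abstract above states that GLIDE detects interactions via linear regression on pairs of loci. A minimal sketch of such a per-pair model, assuming additive genotype coding (0, 1, 2) and a product term for the interaction, is shown below. It is a pure-Python ordinary least squares fit for a single pair, not GLIDE's GPU implementation, and the names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_two_locus(g1, g2, y):
    """Least-squares fit of y = b0 + b1*g1 + b2*g2 + b3*g1*g2.

    Returns [b0, b1, b2, b3]; b3 is the interaction (epistasis) term.
    """
    rows = [[1.0, a, b, a * b] for a, b in zip(g1, g2)]
    k = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Noise-free toy phenotype over all nine genotype combinations.
g1 = [a for a in (0, 1, 2) for _ in (0, 1, 2)]
g2 = [b for _ in (0, 1, 2) for b in (0, 1, 2)]
y = [1.0 + 2.0 * a - 1.0 * b + 0.5 * a * b for a, b in zip(g1, g2)]
coefficients = fit_two_locus(g1, g2, y)
print(coefficients)
```

In a genome-wide scan this fit would be repeated for every pair of loci, and a test statistic on the interaction coefficient would be used for ranking; those parts are omitted here.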
For a long time, the clinical management of antiretroviral drug resistance was based on sequence analysis of the HIV genome followed by estimating drug susceptibility from the mutational pattern that was detected. The large number of anti-HIV drugs and HIV drug resistance mutations has prompted the development of computer-aided genotype interpretation systems, typically comprising rules handcrafted by experts via careful examination of in vitro and in vivo resistance data. More recently, machine learning approaches have been applied to establish data-driven engines able to indicate the most effective treatments for any patient and virus combination. Systems of this kind, currently including the Resistance Response Database Initiative and the EuResist engine, must learn from the large data sets of patient histories and can provide an objective and accurate estimate of the virological response to different antiretroviral regimens. The EuResist engine was developed by a European consortium of HIV and bioinformatics experts and compares favorably with the most commonly used genotype interpretation systems and HIV drug resistance experts. Next-generation treatment response prediction engines may valuably assist the HIV specialist in the challenging task of establishing effective regimens for patients harboring drug-resistant virus strains. The extensive collection and accurate processing of increasingly large patient data sets are eagerly awaited to further train and translate these systems from prototype engines into real-life treatment decision support tools.
View details for DOI 10.1159/000332008
View details for Web of Science ID 000299599100009
View details for PubMedID 22286881
The EuResist expert system is a novel data-driven online system for computing the probability of 8-week success for any given pair of HIV-1 genotype and combination antiretroviral therapy regimen plus optional patient information. The objective of this study was to compare the EuResist system vs. human experts (EVE) for the ability to predict response to treatment. The EuResist system was compared with 10 HIV-1 drug resistance experts for the ability to predict 8-week response to 25 treatment cases derived from the EuResist database validation data set. All current and past patient data were made available to simulate clinical practice. The experts were asked to provide a qualitative and quantitative estimate of the probability of treatment success. There were 15 treatment successes and 10 treatment failures. In the classification task, the number of mislabelled cases was six for EuResist and 6-13 for the human experts [mean±standard deviation (SD) 9.1±1.9]. The accuracy of EuResist was higher than the average for the experts (0.76 vs. 0.64, respectively). The quantitative estimates computed by EuResist were significantly correlated (Pearson r=0.695, P<0.0001) with the mean quantitative estimates provided by the experts. However, the agreement among experts was only moderate (for the classification task, inter-rater κ=0.355; for the quantitative estimation, mean±SD coefficient of variation=55.9±22.4%). With this limited data set, the EuResist engine performed comparably to or better than human experts. The system warrants further investigation as a treatment-decision support tool in clinical practice.
View details for DOI 10.1111/j.1468-1293.2010.00871.x
View details for Web of Science ID 000288020700004
View details for PubMedID 20731728
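The accuracies quoted in the abstract above follow directly from the numbers of mislabelled cases among the 25 treatment cases. A short check, with a hypothetical helper name:

```python
def accuracy(n_cases: int, n_mislabelled: float) -> float:
    """Fraction of correctly labelled cases."""
    return (n_cases - n_mislabelled) / n_cases

euresist_accuracy = accuracy(25, 6)        # six mislabelled cases
mean_expert_accuracy = accuracy(25, 9.1)   # mean of 9.1 mislabelled cases
expert_range = (accuracy(25, 13), accuracy(25, 6))  # worst and best expert

print(round(euresist_accuracy, 2), round(mean_expert_accuracy, 2))
```

The best expert (six errors) matches the system, which is consistent with the abstract's wording that EuResist performed "comparably to or better than" the experts.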
Genotype-derived drug resistance profiles are a valuable asset in HIV-1 therapy decisions. Therapy decisions could be further improved, both in terms of predicting length of current therapy success and in preserving follow-up therapy options, through better knowledge of mutational pathways, here defined as specific locations on the viral genome which, when mutant, alter the risk that additional specific mutations arise. We limit the search to locations in the reverse transcriptase region of the HIV-1 genome which host resistance mutations to nucleoside (NRTI) and non-nucleoside (NNRTI) reverse transcriptase inhibitors (as listed in the 2008 International AIDS Society report), or which were mutant at therapy start in 5% or more of the therapies studied. A Cox proportional hazards model was fit to each location with the hazard of a mutation at that location during therapy proportional to the presence/absence of mutations at the remaining locations at therapy start. A pathway from preexisting to occurring mutation was indicated if the covariate was both selected as important via smoothly clipped absolute deviation (a form of regularized regression) and had a small p-value. The Cox model also allowed controlling for non-genetic parameters and potential nuisance factors such as viral resistance and number of previous therapies. Results were based on 1981 therapies given to 1495 distinct patients drawn from the EuResist database. The strongest influence on the hazard of developing NRTI resistance was having more than four previous therapies, not any one existing resistance mutation. Known NRTI resistance pathways were shown, and previously speculated inhibition between the thymidine analog pathways was evidenced. Evidence was found for a number of specific pathways between NRTI and NNRTI resistance sites. A number of common mutations were shown to increase the hazard of developing both NRTI and NNRTI resistance. Viral resistance to the therapy compounds did not materially affect the hazard of mutation in our model. The accuracy of therapy outcome prediction tools may be increased by including the number of previous treatments, and by considering locations in the HIV genome which increase the hazard of developing resistance mutations.
View details for DOI 10.1186/1742-6405-8-26
View details for PubMedID 21794106
Infections with the human immunodeficiency virus type 1 (HIV-1) are treated with combinations of drugs. Unfortunately, HIV responds to the treatment by developing resistance mutations. Consequently, the genome of the viral target proteins is sequenced and inspected for resistance mutations as part of routine diagnostic procedures for ensuring an effective treatment. For predicting response to a combination therapy, currently available computer-based methods rely on the genotype of the virus and the composition of the regimen as input. However, no available tool takes full advantage of the knowledge about the order of and the response to previously prescribed regimens. The resulting high-dimensional feature space makes existing methods difficult to apply in a straightforward fashion. The machine learning system proposed in this work, sequence boosting, is tailored to exploiting such high-dimensional information, i.e. the extraction of longitudinal features, by utilizing the recent advancements in data mining and boosting. When applied to predicting the latest treatment outcome for 3,759 treatment-experienced patients from the EuResist integrated database, sequence boosting achieved superior performance compared to SVMs with RBF kernels. Moreover, sequence boosting allows an easy access to the discriminative treatment information. Analysis of feature importance values provided by our model confirmed known facts regarding HIV treatment. For instance, application of potent and recently licensed drugs was beneficial for patients, and, conversely, the patient group that was subject to NRTI mono-therapies in the past had poor treatment perspectives today. Furthermore, our model revealed novel biological insights. More precisely, the combination of previously used drugs with their in vivo response is more informative than the information of previously used drugs alone. Using this information improves the performance of systems for predicting therapy outcome.
View details for DOI 10.2202/1544-6115.1604
View details for PubMedID 21291416
As there exists no cure or vaccine for the infection with human immunodeficiency virus (HIV), the standard approach to treating HIV patients is to repeatedly administer different combinations of several antiretroviral drugs. Because of the large number of possible drug combinations, manually finding a successful regimen becomes practically impossible. This presents a major challenge for HIV treatment. The application of machine learning methods for predicting virological responses to potential therapies is a possible approach to solving this problem. However, due to evolving trends in treating HIV patients the available clinical datasets have a highly unbalanced representation, which might negatively affect the usefulness of derived statistical models. This article presents an approach that tackles the problem of predicting virological response to combination therapies by learning a separate logistic regression model for each therapy. The models are fitted by using not only the data from the target therapy but also the information from similar therapies. For this purpose, we introduce and evaluate two different measures of therapy similarity. The models are also able to incorporate phenotypic knowledge on the therapy outcomes through a Gaussian prior. With our approach we balance the uneven therapy representation in the datasets and produce higher quality models for therapies with very few training samples. According to the results from the computational experiments our therapy similarity model performs significantly better than training separate models for each therapy by using solely their examples. Furthermore, the model's performance is as good as an approach that encodes therapy information in the input feature space with the advantage of delivering better results for therapies with very few training samples. Code of the efficient logistic regression is available from http://www.mpi-inf.mpg.de/%7Ejasmina/fastLogistic.zip.
View details for DOI 10.1093/bioinformatics/btq361
View details for Web of Science ID 000281738900003
View details for PubMedID 20624779
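The combination of per-therapy logistic regression, similarity-weighted samples, and a Gaussian prior described in the abstract above can be sketched as a weighted, penalized maximum a posteriori fit. This is a minimal sketch under stated assumptions: plain gradient descent, an isotropic Gaussian prior with a given mean, and caller-supplied sample weights standing in for therapy similarity. It is not the authors' fastLogistic code, and all names and parameters are hypothetical.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_map(xs, ys, prior_mean, prior_precision,
                     sample_weights=None, lr=0.1, n_iter=2000):
    """MAP logistic regression with a Gaussian prior N(prior_mean, I / prior_precision).

    sample_weights (ASSUMPTION) would carry the similarity between each
    sample's therapy and the target therapy; default is equal weights.
    """
    if sample_weights is None:
        sample_weights = [1.0] * len(xs)
    total = sum(sample_weights)
    w = list(prior_mean)  # start at the prior mean
    for _ in range(n_iter):
        # Gradient of the prior term pulls weights toward prior_mean.
        grad = [prior_precision * (wj - mj) for wj, mj in zip(w, prior_mean)]
        # Gradient of the weighted, averaged negative log-likelihood.
        for x, y, s in zip(xs, ys, sample_weights):
            err = s * (sigmoid(sum(wj * xj for wj, xj in zip(w, x))) - y) / total
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Toy data: first column is a constant bias feature.
xs = [[1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0]]
ys = [0, 0, 1, 1]

weak_prior = fit_logistic_map(xs, ys, [0.0, 0.0], prior_precision=0.01)
strong_prior = fit_logistic_map(xs, ys, [0.0, -1.0], prior_precision=10.0, lr=0.05)
print(weak_prior, strong_prior)
```

With a weak prior the data dominate and the fitted slope becomes positive; with a strong prior the weights stay close to the prior mean even though it contradicts the toy data, which is the behaviour one wants for therapies with very few training samples.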
Replication capacity (RC) of specific HIV isolates is occasionally blamed for unexpected treatment responses. However, the role of viral RC in response to antiretroviral therapy is not yet fully understood. We developed a method for predicting RC from genotype using support vector machines (SVMs) trained on about 300 genotype-RC pairs. Next, we studied the impact of predicted viral RC (pRC) on the change of viral load (VL) and CD4+ T-cell count (CD4) during the course of therapy on about 3,000 treatment change episodes (TCEs) extracted from the EuResist integrated database. Specifically, linear regression models using either treatment activity scores (TAS), the drug combination, or pRC or any combination of these covariates were trained to predict change in VL and CD4, respectively. The SVM models achieved a Spearman correlation (rho) of 0.54 between measured RC and pRC. The prediction of change in VL (CD4) was best at 180 (360) days, reaching a correlation of rho = 0.45 (rho = 0.27). In general, pRC was inversely correlated to drug resistance at treatment start (on average rho = -0.38). Inclusion of pRC in the linear regression models significantly improved prediction of virological response to treatment based either on the drug combination or on the TAS (t-test; p-values range from 0.0247 to 4 × 10⁻⁶) but not for the model using both TAS and drug combination. For predicting the change in CD4 the improvement derived from inclusion of pRC was not significant. Viral RC could be predicted from genotype with moderate accuracy and could slightly improve prediction of virological treatment response. However, the observed improvement could simply be a consequence of the significant correlation between pRC and drug resistance.
View details for DOI 10.1371/journal.pone.0009044
View details for Web of Science ID 000274474200024
View details for PubMedID 20140263
Expert-based genotypic interpretation systems are standard methods for guiding treatment selection for patients infected with human immunodeficiency virus type 1. We previously introduced the software pipeline geno2pheno-THEO (g2p-THEO), which on the basis of viral sequence predicts the response to treatment with a combination of antiretroviral compounds by applying methods from statistical learning and the estimated potential of the virus to escape from drug pressure. We retrospectively validated the statistical model used by g2p-THEO in approximately 7600 independent treatment-sequence pairs extracted from the EuResist integrated database, ranging from 1990 to 2007. Results were compared with the 3 most widely used expert-based interpretation systems: Stanford HIVdb, ANRS, and Rega. The difference in receiver operating characteristic curves between g2p-THEO and expert-based approaches was significant (P < .001; paired Wilcoxon test). Indeed, at 80% specificity, g2p-THEO found 16.2%-19.8% more successful regimens than did the expert-based approaches. The increased performance of g2p-THEO was confirmed in a 2001-2007 data set from which most obsolete therapies had been removed. Finding drug combinations that increase the chances of therapeutic success is the main reason for using decision support systems. The present analysis of a large data set derived from clinical practice demonstrates that g2p-THEO solves this task significantly better than state-of-the-art expert-based systems. The tool is available at http://www.geno2pheno.org.
View details for DOI 10.1086/597305
View details for Web of Science ID 000264056600011
View details for PubMedID 19239365
Inferring response to antiretroviral therapy from the viral genotype alone is challenging. The utility of an intermediate step of predicting in vitro drug susceptibility is currently controversial. Here, we provide a retrospective comparison of approaches using either genotype or predicted phenotypes alone, or in combination. Treatment change episodes were extracted from two large databases from the USA (Stanford-California) and Europe (EuResistDB) comprising data from 6,706 and 13,811 patients, respectively. Response to antiretroviral treatment was dichotomized according to two definitions. Using the viral sequence and the treatment regimen as input, three expert algorithms (ANRS, Rega and HIVdb) were used to generate genotype-based encodings and VircoTYPE 4.0 (Virco BVBA, Mechelen, Belgium) was used to generate a predicted-phenotype-based encoding. Single drug classifications were combined into a treatment score via simple summation and statistical learning using random forests. Classification performance was studied on Stanford-California data using cross-validation and, in addition, on the independent EuResistDB data. In all experiments, predicted phenotype was among the most sensitive approaches. Combining single drug classifications by statistical learning was significantly superior to unweighted summation (P < 2.2 × 10⁻¹⁶). Classification performance could be increased further by combining predicted phenotypes and expert encodings but not by combinations of expert encodings alone. These results were confirmed on an independent test set comprising data solely from EuResistDB. This study demonstrates consistent performance advantages in utilizing predicted phenotype in most scenarios over methods based on genotype alone in inferring virological response. Moreover, all approaches under study benefit significantly from statistical learning for merging single drug classifications into treatment scores.
View details for Web of Science ID 000265624700015
View details for PubMedID 19430102
The extreme flexibility of the HIV type-1 (HIV-1) genome makes it challenging to build the ideal antiretroviral treatment regimen. Interpretation of HIV-1 genotypic drug resistance is evolving from rule-based systems guided by expert opinion to data-driven engines developed through machine learning methods. The aim of the study was to investigate linear and non-linear statistical learning models for classifying short-term virological outcome of antiretroviral treatment. To optimize the model, different feature selection methods were considered. Robust extra-sample error estimation and different loss functions were used to assess model performance. The results were compared with widely used rule-based genotypic interpretation systems (Stanford HIVdb, Rega and ANRS). A set of 3,143 treatment change episodes were extracted from the EuResist database. The dataset included patient demographics, treatment history and viral genotypes. A logistic regression model using high order interaction variables performed better than rule-based genotypic interpretation systems (accuracy 75.63% versus 71.74-73.89%, area under the receiver operating characteristic curve [AUC] 0.76 versus 0.68-0.70) and was equivalent to a random forest model (accuracy 76.16%, AUC 0.77). However, when rule-based genotypic interpretation systems were coupled with additional patient attributes, and the combination was provided as input to the logistic regression model, the performance increased significantly, becoming comparable to the fully data-driven methods. Patient-derived supplementary features significantly improved the accuracy of the prediction of response to treatment, both with rule-based and data-driven interpretation systems. Fully data-driven models derived from large-scale data sources show promise as antiretroviral treatment decision support tools.
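As a rough illustration of what "logistic regression with interaction variables" can mean, the sketch below expands a set of binary features with their pairwise products and passes them through a logistic function. This is not the model from the study: the function names (`add_interactions`, `logistic_score`), the feature names, and the weights are all hypothetical, and no fitting or feature selection is shown.

```python
import math
from itertools import combinations


def add_interactions(features, max_order=2):
    """Extend a dict of binary features with products of up to `max_order` of them."""
    expanded = dict(features)
    names = sorted(features)
    for order in range(2, max_order + 1):
        for combo in combinations(names, order):
            value = 1
            for name in combo:
                value *= features[name]
            # An interaction term is 1 only if every member feature is 1.
            expanded["*".join(combo)] = value
    return expanded


def logistic_score(features, weights, bias=0.0):
    """Probability of treatment success under a logistic model (weights are assumed)."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical episode: two mutations present, one treatment-history flag absent.
episode = {"M184V": 1, "K103N": 1, "prior_nrti_use": 0}
expanded = add_interactions(episode)

# Hypothetical weights, chosen only to make the example run.
weights = {"M184V": -0.4, "K103N": -0.6, "K103N*M184V": -0.8}
print(round(logistic_score(expanded, weights, bias=1.0), 3))
```

The interaction term lets the model assign an extra penalty when both mutations occur together, which a purely additive model cannot express.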
View details for Web of Science ID 000266738300014
View details for PubMedID 19474477
Analysis of the viral genome for drug resistance mutations is state-of-the-art for guiding treatment selection for human immunodeficiency virus type 1 (HIV-1)-infected patients. These mutations alter the structure of viral target proteins and reduce or in the worst case completely inhibit the effect of antiretroviral compounds while maintaining the ability for effective replication. Modern anti-HIV-1 regimens comprise multiple drugs in order to prevent or at least delay the development of resistance mutations. However, commonly used HIV-1 genotype interpretation systems provide only classifications for single drugs. The EuResist initiative has collected data from about 18,500 patients to train three classifiers for predicting response to combination antiretroviral therapy, given the viral genotype and further information. In this work we compare different classifier fusion methods for combining the individual classifiers. The individual classifiers yielded similar performance, and all the combination approaches considered performed equally well. The gain in performance due to combining methods did not reach statistical significance compared to the single best individual classifier on the complete training set. However, on smaller training set sizes (200 to 1,600 instances compared to 2,700) the combination significantly outperformed the individual classifiers (p<0.01; paired one-sided Wilcoxon test). Together with a consistent reduction of the standard deviation compared to the individual prediction engines this shows a more robust behavior of the combined system. Moreover, using the combined system we were able to identify a class of therapy courses that led to a consistent underestimation (about 0.05 AUC) of the system performance. Discovery of these therapy courses is a further hint for the robustness of the combined system. The combined EuResist prediction engine is freely available at http://engine.euresist.org.
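The abstract does not name the fusion methods that were compared, so the following is only a minimal sketch of three common baselines (mean, median, and majority vote) for combining the success probabilities returned by three classifiers. The function name `fuse` and the example probabilities are assumptions made for illustration.

```python
from statistics import mean, median


def fuse(probabilities, method="mean"):
    """Combine per-classifier success probabilities into one score."""
    if method == "mean":
        return mean(probabilities)
    if method == "median":
        return median(probabilities)
    if method == "vote":
        # Each classifier votes "success" when its probability is at least 0.5.
        votes = sum(1 for p in probabilities if p >= 0.5)
        return 1.0 if votes > len(probabilities) / 2 else 0.0
    raise ValueError(f"unknown fusion method: {method}")


# Hypothetical outputs of three individual classifiers for one therapy.
outputs = [0.2, 0.4, 0.9]
for method in ("mean", "median", "vote"):
    print(method, fuse(outputs, method))
```

Averaging tends to reduce the variance of the combined prediction, which is consistent with the lower standard deviation the abstract reports for the combined system.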
View details for DOI 10.1371/journal.pone.0003470
View details for Web of Science ID 000265125900013
View details for PubMedID 18941628
In genetics, many evolutionary pathways can be modeled by the ordered accumulation of permanent changes. Mixture models of mutagenetic trees have been used to describe disease progression in cancer and in HIV. In cancer, progression is modeled by the accumulation of chromosomal gains and losses in tumor cells; in HIV, the accumulation of drug resistance-associated mutations in the viral genome is known to be associated with disease progression. From such evolutionary models, genetic progression scores can be derived that assign measures for the disease state to single patients. Rtreemix is an R package for estimating mixture models of evolutionary pathways from observed cross-sectional data and for estimating associated genetic progression scores. The package also provides extended functionality for estimating confidence intervals for estimated model parameters and for evaluating the stability of the estimated evolutionary mixture models.
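To make the idea of a mutagenetic tree mixture concrete, here is a small Python sketch (not Rtreemix code, and not its API) of how the probability of an observed mutation pattern can be computed. It assumes the usual formulation in which each edge carries the conditional probability of an event given its parent, and a pattern has nonzero probability only if every observed event has its parent observed as well. Event names, edge probabilities, and mixture weights are hypothetical; model fitting and progression scores are not covered.

```python
ROOT = "wild type"  # the root event is assumed to be present in every sample


def pattern_probability(parent, edge_prob, pattern):
    """Probability of observing exactly the events in `pattern` under one tree.

    parent:    dict mapping each event to its parent event (or ROOT)
    edge_prob: dict mapping each event to P(event | parent occurred)
    pattern:   set of observed events
    """
    prob = 1.0
    for event, par in parent.items():
        parent_present = par == ROOT or par in pattern
        if event in pattern:
            if not parent_present:
                return 0.0  # pattern is incompatible with the tree order
            prob *= edge_prob[event]
        elif parent_present:
            prob *= 1.0 - edge_prob[event]  # event could have occurred but did not
    return prob


def mixture_probability(weights, trees, pattern):
    """Weighted sum of the pattern probability over all tree components."""
    return sum(w * pattern_probability(parent, probs, pattern)
               for w, (parent, probs) in zip(weights, trees))


# Hypothetical component 1: A must precede B; C is independent of both.
tree1 = ({"A": ROOT, "B": "A", "C": ROOT}, {"A": 0.8, "B": 0.5, "C": 0.2})
# Hypothetical component 2: a star tree in which all events are independent.
tree2 = ({"A": ROOT, "B": ROOT, "C": ROOT}, {"A": 0.1, "B": 0.1, "C": 0.1})

print(pattern_probability(*tree1, {"A", "B"}))
print(mixture_probability([0.7, 0.3], [tree1, tree2], {"B"}))
```

Under the first tree alone the pattern {B} is impossible, because B requires A; the star component gives such patterns a small nonzero probability in the mixture.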
View details for DOI 10.1093/bioinformatics/btn410
View details for Web of Science ID 000259973500018
View details for PubMedID 18718947
High-throughput sequencing (HTS) technologies are the method of choice for screening the human genome for rare sequence variants causing susceptibility to complex diseases. Unfortunately, preparation of samples for a large number of individuals is still very cost- and labor-intensive. Thus, recently, screens for rare sequence variants were carried out in samples of pooled DNA, in which equimolar amounts of DNA from multiple individuals are mixed prior to sequencing with HTS. The resulting sequence data, however, poses a bioinformatics challenge: the discrimination of sequencing errors from real sequence variants present at a low frequency in the DNA pool. Our method vipR uses data from multiple DNA pools in order to compensate for differences in sequencing error rates along the sequenced region. More precisely, instead of aiming at discriminating sequence variants from sequencing errors, vipR identifies sequence positions that exhibit significantly different minor allele frequencies in at least two DNA pools using the Skellam distribution. The performance of vipR was compared with three other models on data from a targeted resequencing study of the TMEM132D locus in 600 individuals distributed over four DNA pools. Performance of the methods was computed on SNPs that were also genotyped individually using a MALDI-TOF technique. On a set of 82 sequence variants, vipR achieved an average sensitivity of 0.80 at an average specificity of 0.92, thus outperforming the reference methods by at least 0.17 in specificity at comparable sensitivity. The code of vipR is freely available via: http://email@example.com.
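The Skellam distribution describes the difference of two independent Poisson counts, which is why it suits comparing minor allele counts between two pools. The sketch below is a toy version of such a test, not the vipR implementation: it assumes equal coverage in both pools, models each minor allele count as Poisson, and uses the mean of the two counts as the shared rate under the null hypothesis. The function names and the example counts are hypothetical.

```python
import math


def poisson_pmf(n, mu):
    """Poisson probability of n events with mean mu (computed in log space)."""
    if mu == 0:
        return 1.0 if n == 0 else 0.0
    return math.exp(n * math.log(mu) - mu - math.lgamma(n + 1))


def skellam_pmf(k, mu1, mu2, terms=500):
    """P(N1 - N2 = k) for independent N1 ~ Poisson(mu1) and N2 ~ Poisson(mu2).

    Evaluated as a truncated convolution of the two Poisson distributions;
    `terms` must be large relative to mu1 and mu2.
    """
    start = max(0, -k)
    return sum(poisson_pmf(n + k, mu1) * poisson_pmf(n, mu2)
               for n in range(start, start + terms))


def pool_difference_p(count_a, count_b):
    """Two-sided p-value for a difference in minor allele counts between two pools.

    Null hypothesis (an assumption of this sketch): both counts share the same
    Poisson rate, estimated as the mean of the two observed counts.
    """
    mu = (count_a + count_b) / 2
    d = abs(count_a - count_b)
    # Probability mass of all differences strictly less extreme than observed.
    inner = sum(skellam_pmf(k, mu, mu) for k in range(-d + 1, d))
    return max(0.0, 1.0 - inner)


# Hypothetical minor allele read counts at one position in two pools.
print(pool_difference_p(12, 10))  # similar counts: large p-value
print(pool_difference_p(40, 5))   # very different counts: small p-value
```

With unequal coverage the two counts would need separate rates scaled by depth, and testing many positions would call for a multiple-testing correction. Where SciPy is available, `scipy.stats.skellam` provides the same distribution directly.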
View details for DOI 10.1093/bioinformatics/btr205
View details for Web of Science ID 000291752600010
View details for PubMedID 21685105