Workpackage 3: Genomics




To acquire cost-effective genome-wide variation data in the prospective cohorts in order to causally anchor -omics data from other WPs as input for WP9


To acquire highly powered retrospective genome-wide variation data in ALS to perform state-of-the-art genomics analyses


To allow for focussed, high-throughput modern resequencing based on identified targets from the WPs in pillar I and II in the retrospective validation cohort


To identify new genes that cause familial ALS through whole exome sequencing


Task 1. Combine all existing and to-be-generated genome-wide data on ALS (partners 1, 12). Combined datasets. We will integrate existing genome-wide association study (GWAS) data from all available studies of ALS and additionally genotype DNA samples that have already been collected. In addition, we will prospectively collect samples from 2x 400 ALS patients and 400 controls, with phenotype and exposome data collected simultaneously as described in the Workpackages "Clinical coordination" and "Exposome". The prospective cohorts provide data to causally anchor multilevel -omics data in WP9. The prospective and retrospective cohorts, combined with genome-wide data already available through the members of Euro-MOTOR and their international collaborations, will yield a sample of ~10,000 ALS patients and ~20,000 controls, giving a maximally powered single study population for the in silico analyses described below.

Task 2. Ensure genotyping of prospective cohorts (partners 1, 12). The samples will be genotyped using HumanOmniExpress DNA Analysis BeadChips by the UMCU and KCL partners, who have extensive experience with these platforms. Data will be stored, processed and analysed on a dedicated computing cluster that gives both centres access to the data.

Task 3. Study epistasis in all available data (partners 1, 11, 12). Epistasis. If association studies are underpowered to detect main effects, they have almost no chance of picking up interaction effects when all possibilities are explored. The combinatorial explosion of possible interactions creates enormous difficulties in estimation, multiple testing, and overfitting. An obvious way to overcome these limits is to analyze only ‘interesting’ combinations of SNPs (pairs or triples), selected for an increased prior probability of involvement in the disease. Such priorities can be defined by statistical evidence (single-marker P-value in own data), genetic impact (genomic location) and potential biological relevance (SNP function class or pathway information). Because of this computational burden, several data-mining tools have been developed (e.g. random forests, CART). Although these tools are helpful in exploratory data analysis and excel in discriminating cases from controls, they suffer from several limitations, mainly the computational burden of a truly unbiased survey. Even more sophisticated is the Multifactor Dimensionality Reduction (MDR) method, which effectively reduces the number of combinations by pooling combined genotypes into low- and high-risk genotype groups. The general process of defining these groups as a function of two or more other attributes (SNP genotypes) is referred to as constructive induction, or attribute construction. Constructive induction using the MDR kernel is accomplished in the following way. Given a threshold T, a multilocus genotype combination is considered high-risk if the ratio of cases to controls exceeds T; otherwise it is considered low-risk. Genotype combinations considered to be high-risk are labeled G1, while those considered low-risk are labeled G0. This process constructs a new one-dimensional attribute with values of G0 and G1. It is this new single variable that is assessed, using any classification method.
Cross-validation is used to prevent overfitting, while permutation testing is used to assess statistical significance and to control for false positives due to multiple testing. Permutation testing makes analysis on standard parallel CPUs cumbersome, but it is now computationally feasible using the parallel architecture of modern graphics processing units (GPUs). Recently, a pilot analysis identified a SNP pair associated with ALS, which replicated in an independent dataset. Interestingly, one of the SNPs did not show a significant main effect in either dataset, but in combination with the other SNP a clear signal emerged.
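The constructive induction step and the permutation test described above can be sketched in a few lines of Python. This is an illustration only, not the MDR software itself: the function names, the 0/1/2 genotype coding and the default threshold T = 1 (a common choice when cases and controls are balanced) are assumptions, and accuracy is computed on the training data, without the cross-validation that a real analysis would add.

```python
from collections import Counter
import random

def mdr_high_risk_cells(genotypes, is_case, threshold=1.0):
    """Return the multilocus genotype combinations labelled high-risk (G1).

    genotypes: list of tuples, one tuple of SNP genotype codes per subject
    is_case:   list of booleans, True for cases and False for controls
    """
    cases = Counter(g for g, c in zip(genotypes, is_case) if c)
    controls = Counter(g for g, c in zip(genotypes, is_case) if not c)
    high_risk = set()
    for cell in set(cases) | set(controls):
        # "ratio of cases to controls exceeds T", written without a
        # division so that cells with zero controls are handled
        if cases[cell] > threshold * controls[cell]:
            high_risk.add(cell)
    return high_risk

def mdr_attribute(genotypes, high_risk):
    """Collapse each genotype combination to one attribute: 1 = G1, 0 = G0."""
    return [1 if g in high_risk else 0 for g in genotypes]

def accuracy(genotypes, is_case, threshold=1.0):
    """Training accuracy of the rule 'G1 predicts case' (no cross-validation)."""
    high_risk = mdr_high_risk_cells(genotypes, is_case, threshold)
    predicted = mdr_attribute(genotypes, high_risk)
    return sum(p == int(c) for p, c in zip(predicted, is_case)) / len(is_case)

def permutation_p_value(genotypes, is_case, n_perm=1000, threshold=1.0, seed=0):
    """Empirical p-value: refit the G0/G1 rule on label-shuffled data."""
    rng = random.Random(seed)
    observed = accuracy(genotypes, is_case, threshold)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if accuracy(genotypes, labels, threshold) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Because every permutation refits the rule independently of the others, the loop in permutation_p_value is the part that parallelises naturally, which is why a GPU implementation is attractive for this step.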

An alternative and powerful approach that maintains power in epistasis analysis is lasso-penalized ordinary linear regression. The lasso penalty is an effective device for continuous model selection, especially in problems where the number of predictors p far exceeds the number of observations n, as in a typical GWAS (Chen et al, 1998). Penalized regression is an ideal vehicle for finding a small subset of potent but weakly correlated predictors. The lasso penalty not only shrinks parameter estimates, it also zeros out the majority of them, thus achieving model selection. The strength of the penalty determines the number of predictors that enter a model. This approach is programmed and readily available in the Mendel software. In a previous study, analysis of the 50 most significant associations revealed epistatic interactions between two human leukocyte antigen (HLA) SNPs and three SNPs on chromosomes 2, 3, and 8. It is particularly noteworthy that the univariate GWAS p values for these three non-HLA SNPs, considered as marginal effects, are far less impressive than their univariate p values as epistatic effects (Cantor et al, 2009).
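To make the shrink-and-select behaviour of the lasso concrete, the following is a minimal sketch of cyclic coordinate descent on a design matrix extended with pairwise SNP products. It is not the Mendel implementation; the function names, the use of raw genotype-code products as interaction terms and the objective scaling (1/2n) * ||y - Xb||^2 + lam * ||b||_1 are assumptions made for illustration.

```python
from itertools import combinations

def with_pairwise_interactions(rows):
    """Append the product of every SNP pair to each row of genotype codes."""
    return [list(r) + [r[a] * r[b] for a, b in combinations(range(len(r)), 2)]
            for r in rows]

def centre_columns(rows):
    """Subtract each column mean so that no intercept term is needed."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(r, means)] for r in rows]

def soft_threshold(z, lam):
    """Shrink z towards zero by lam and set it to exactly zero inside [-lam, lam]."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent.

    X: list of rows with centred columns; y: list of centred responses.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant column carries no information
            # covariance of column j with the residual that excludes j's own fit
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new
    return beta
```

Increasing lam drives more coefficients to exactly zero, so the penalty strength directly controls how many main effects and interaction terms enter the model.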

Results that are shared by both methods will be prioritized for follow-up in the DNA samples available to Euro-MOTOR and in WP9 (UMCU, KCL).

Task 4. Study biological pathways in all available data (partners 1, 11, 12). Data-driven pathway analyses. This approach provides a means of integrating the results of a GWAS with the genes in a known molecular pathway to test whether the pathway is associated with the disorder. Members of Euro-MOTOR recently showed that the results of pathway analysis using existing annotation databases (e.g. KEGG, GO, PANTHER, DAVID) are contingent upon the pathway resource used (Elbers et al). Several currently available databases were used to conduct a pathway analysis of the same Wellcome Trust Case Control Consortium (WTCCC) dataset, a publicly available resource. They concluded that, although the analyses can highlight the relevant gene associations, the results are likely to be biased by the choice of database. A more data-driven approach reconstructs a gene network based on known protein-protein interactions, external microarray co-expression data and annotation, and is implemented in the software Prioritizer, developed by members of Euro-MOTOR (Franke et al, 2006). Prioritizer is based on the biologically plausible assumption that true disease genes are mostly functionally related and will therefore be closer to each other in the gene network than to false-positive genes in a GWAS (UMCU, KCL).
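The closeness assumption can be illustrated with a small sketch that measures shortest-path distances between candidate genes in an unweighted network. This is not the Prioritizer algorithm; the adjacency-dictionary representation, the function names and the use of mean pairwise distance as a summary are assumptions chosen to show the idea.

```python
from collections import deque
from itertools import combinations

def shortest_path_length(network, source, target):
    """Breadth-first search distance between two genes, or None if unconnected.

    network: dict mapping each gene to the set of genes it is linked to
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        gene, dist = queue.popleft()
        for neighbour in network.get(gene, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

def mean_pairwise_distance(network, genes):
    """Average network distance over all connected pairs of candidate genes."""
    distances = [shortest_path_length(network, a, b)
                 for a, b in combinations(genes, 2)]
    reachable = [d for d in distances if d is not None]
    return sum(reachable) / len(reachable) if reachable else None
```

Under the stated assumption, a set of true disease genes would be expected to show a smaller mean pairwise distance than a set of equal size that contains false-positive GWAS hits.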

Task 5. Study polygenic contributions in all available data (partners 1, 11, 12). Polygenic score tests. Score tests will be performed using a panel of independent genome-wide SNPs to further identify likely causative pathways in ALS, using a cross-validation approach as described recently (Sham et al, 2009) (UMCU, KCL). In addition, through international collaborations by members of Euro-MOTOR, we will have access to genome datasets of patients with multiple sclerosis (MS), Parkinson’s disease (PD), Alzheimer’s disease (AD) and pathology-proven frontotemporal dementia (FTD). Modelling across these diseases is therefore possible, to search for common genetic variation underlying neurodegeneration.
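As an illustration of a polygenic score, the sketch below derives per-SNP weights in a training subset and applies them to held-out subjects, which is the basic step that a cross-validation scheme repeats across folds. The weighting by allele log odds ratio, the 0.5 continuity correction and the function names are assumptions for this example and are not taken from the cited method.

```python
import math

def allele_log_odds(genotypes, is_case):
    """Per-SNP log odds ratio of the counted allele in a training subset.

    genotypes: list of tuples of allele counts (0, 1 or 2) per subject
    is_case:   list of booleans, True for cases and False for controls
    """
    weights = []
    for j in range(len(genotypes[0])):
        # 0.5 is added to every cell so that empty cells do not break the log
        case_alt = sum(g[j] for g, c in zip(genotypes, is_case) if c) + 0.5
        case_ref = sum(2 - g[j] for g, c in zip(genotypes, is_case) if c) + 0.5
        ctrl_alt = sum(g[j] for g, c in zip(genotypes, is_case) if not c) + 0.5
        ctrl_ref = sum(2 - g[j] for g, c in zip(genotypes, is_case) if not c) + 0.5
        weights.append(math.log((case_alt * ctrl_ref) / (case_ref * ctrl_alt)))
    return weights

def polygenic_score(genotype, weights):
    """Weighted sum of allele counts for one held-out subject."""
    return sum(w * g for w, g in zip(weights, genotype))
```

In use, the weights would be estimated on the training folds only and the resulting scores compared between held-out cases and controls, so that the evaluation is not contaminated by the subjects used to derive the weights.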

Task 6. Perform exome sequencing (partners 1, 12). Exome sequencing. Using an initial cohort of cultured lymphoblast lines from 100 familial ALS (FALS) cases, we will extract high-quality DNA, RNA and protein. Using the NimbleGen 2.1M array we will capture ~180,000 exons for sequencing on the Solexa GAII (KCL). In addition, the Utrecht Next Generation Sequencing Platform (UNGSP), a collaboration between the departments of Medical Genetics and Medical Oncology and the Hubrecht Institute in Utrecht (The Netherlands), uses both the SOLiD v3 (Applied Biosystems) and 454 / FLX Titanium (Roche) platforms. The UNGSP and KCL provide extensive bioinformatic support for the analysis of these types of data.

Task 7. Deep resequencing for validation purposes in retrospective samples (partners 1, 12). Genes identified in the respective WPs, and especially WP9 (integrative analysis), as relevant for ALS pathogenesis will be analysed by deep resequencing in all DNA samples available to Euro-MOTOR, using the sequencing platforms available to partners UMCU and KCL described above.