Download a PDF version of the blog post from here:
After performing a genome-wide association study (GWAS), we’d then ideally want to link the identified associations/SNPs to (druggable) genes and biological pathways. Unearthing novel biology can inform drug target (in)validation but also lead to higher-impact publications (see ‘selected publications’ below). The latter point is especially important for early-career researchers who will be applying for fellowships and/or lectureships soon 🙂
Happy to help out with any of the below.

Methods and Software
The below are some of the Post-GWAS ‘SNP follow-up’ steps/software that I have been taking/using for the last 2-3 years:
1- Finemapping the identified signals:
This step refines each signal to a set of variants that is 99% likely to contain the underlying causal variant – assuming the causal variant has been analysed
• Wakefield method [1] – Output: 99% credible set (Tutorial and R code available here: Wakefield_method_finemapping)
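To make the credible set idea concrete, here is a minimal Python sketch of the approximate Bayes factor calculation. It is not the R code from the tutorial linked above: the function name, the input layout and the prior variance of 0.04 are my assumptions for illustration, and it assumes a single causal variant per signal that is present in the data.

```python
import math

def wakefield_credible_set(variants, prior_variance=0.04, coverage=0.99):
    """Return the smallest set of variants whose posterior probabilities sum to `coverage`.

    variants: list of (snp_id, beta, se) for every variant in the region.
    prior_variance: assumed prior variance (W) of the true effect size.
    """
    log_abfs = []
    for snp_id, beta, se in variants:
        v = se ** 2                               # variance of the effect estimate
        r = prior_variance / (v + prior_variance)
        z = beta / se
        # log approximate Bayes factor in favour of association
        log_abfs.append((snp_id, 0.5 * math.log(1.0 - r) + 0.5 * r * z ** 2))

    # Subtract the maximum before exponentiating to avoid overflow
    max_log = max(value for _, value in log_abfs)
    weights = [(snp_id, math.exp(value - max_log)) for snp_id, value in log_abfs]
    total = sum(weight for _, weight in weights)
    posteriors = sorted(((snp_id, weight / total) for snp_id, weight in weights),
                        key=lambda item: item[1], reverse=True)

    # Add variants in decreasing order of posterior probability
    credible, cumulative = [], 0.0
    for snp_id, posterior in posteriors:
        credible.append((snp_id, posterior))
        cumulative += posterior
        if cumulative >= coverage:
            break
    return credible
```

Calling `wakefield_credible_set([("rs1", 0.5, 0.05), ("rs2", 0.05, 0.05)])` returns the variants in the 99% credible set together with their posterior probabilities. The SNP IDs and numbers here are made up.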
2- Query eQTL databases:
Rather than just assume that the gene nearest to the sentinel SNP is the causal gene, we can bring in other lines of evidence such as eQTL and pQTL analyses to check whether the SNP(s) is associated with the expression of a gene.
• GTEx v7 dataset (n up to 492; RNASeq) [2] – publicly available at [3] (see My Research page to download my Journal club slides on GTEx v6 paper)
• NESDA-NTR Blood eQTL dataset (n=4,896; microarray) [4] – publicly available at [5]
• Lung eQTL dataset (n=1,111; microarray) [6] – need to request lookups from Dr. Ma’en Obeidat
• BIOS (Biobank-Based Integrative Omics Study) Blood eQTL dataset (n=2,116; RNAseq) [7] – publicly available at [8]
• Westra et al Blood eQTL dataset (n=5,311 with replication in 2,775; microarray) [9] – publicly available at [10]
• There are other tissue/organ specific databases such as BRAINEAC (n=134) and Brain xQTL (n=up to 494)
3- eQTL-GWAS signal colocalisation:
• eCAVIAR [11] by Hormozdiari et al, 2016 [12] – Click for Powerpoint presentation (ecaviar_colocalisation_mesut_04_07_18) and methods (ecaviar methods_v3)
• It also helps to plot the Z-scores of the eQTL (separate plots for each gene near the signal) and GWAS SNPs on the same plot – maybe with the SNPs in the 99% credible set marked differently to other SNPs near the sentinel SNP. Of course, choosing the relevant tissue(s) is crucial!
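Before plotting Z-scores from two studies against each other, both need to refer to the same effect allele, otherwise the direction of effect is meaningless. Below is a rough Python sketch of that harmonisation step. The function name and the dictionary layout are hypothetical, and it simply drops SNPs whose alleles do not match (strand flips are not handled).

```python
def harmonised_z_scores(gwas, eqtl):
    """Pair GWAS and eQTL Z-scores, both expressed for the GWAS effect allele.

    gwas, eqtl: dicts mapping snp_id -> (effect_allele, other_allele, beta, se).
    Returns a dict mapping snp_id -> (gwas_z, eqtl_z).
    """
    pairs = {}
    for snp_id, (g_ea, g_oa, g_beta, g_se) in gwas.items():
        if snp_id not in eqtl:
            continue  # SNP not tested in the eQTL study
        e_ea, e_oa, e_beta, e_se = eqtl[snp_id]
        gwas_z = g_beta / g_se
        eqtl_z = e_beta / e_se
        if (e_ea, e_oa) == (g_oa, g_ea):
            eqtl_z = -eqtl_z  # alleles are swapped, so flip the sign
        elif (e_ea, e_oa) != (g_ea, g_oa):
            continue  # alleles do not match; skip rather than guess
        pairs[snp_id] = (gwas_z, eqtl_z)
    return pairs
```

The resulting pairs can be passed to any plotting library, with one plot per gene and tissue.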
4- Query pQTL databases:
• Sun et al, 2018 dataset [13] – need to request lookups from the authors (maybe Dr. Adam Butterworth)
5- Variant effect prediction:
Checking whether our sentinel SNP is in LD with a coding variant that is predicted to be functional provides another line of evidence for a putatively causal gene.
• DeepSEA – for noncoding SNPs [14] (see My Research page to download my Journal club slides on DeepSEA)
• SIFT, PolyPhen-2, and FATHMM via Ensembl VEP – for coding SNPs [15]
6- Enrichment of associations at DNase hypersensitivity sites:
Using your GWAS results to identify chromatin features relevant to your trait of interest can yield important information on the genetic aetiology of that trait (e.g. DNase hypersensitivity site enrichment in fetal lung would mean that developmental pathways in the lung are playing an important role)
• GARFIELD [16]
• FORGE [17] – very easy to use but superseded by GARFIELD
7- Pathway enrichment analysis:
• ConsensusPathDB [18] – as it queries more biological pathway and gene ontology databases than the alternatives. You can input all the genes that are implicated by eQTL/pQTL databases and variant effect prediction (e.g. genes that harbour a coding variant in the 99% credible set). Good idea to remove genes in the MHC region (e.g. HLA genes) to identify pathways other than the immune system-related ones. Methods can be found here: ConsensusPathDB_methods
• You can also do an additional check to see if the ‘significant’ pathways (e.g. FDR<5%) are mainly due to the implicated genes – as identified by eQTL/pQTL and variant effect prediction (list 1) – or the regions identified by GWAS itself: extract all the genes within 500kb of the sentinel SNPs (list 2) and then make 100 lists (same size as list 1) with genes randomly selected from this set. Then input these to ConsensusPathDB and see how many times the pathways identified by list 1 appear in the output as ‘significant’.
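The random-list check described above can be sketched in Python as follows. ConsensusPathDB itself is not called here: `significant_pathways` is a hypothetical function you would supply that takes a gene list and returns the names of the pathways passing your threshold (e.g. by parsing saved ConsensusPathDB output). All names are my own for illustration.

```python
import random

def pathway_recurrence(list1, list2, significant_pathways, n_lists=100, seed=1):
    """Count how often each pathway found with list 1 is also found with random lists.

    list1: implicated genes (eQTL/pQTL/variant effect prediction).
    list2: all genes within the chosen distance of the sentinel SNPs.
    significant_pathways: hypothetical callable, gene list -> iterable of pathway names.
    """
    rng = random.Random(seed)              # fixed seed so the lists are reproducible
    gene_pool = sorted(set(list2))         # unique genes, stable order
    observed = set(significant_pathways(list1))
    counts = {pathway: 0 for pathway in observed}
    for _ in range(n_lists):
        # Random list of the same size as list 1, drawn without replacement
        random_genes = rng.sample(gene_pool, len(list1))
        hits = set(significant_pathways(random_genes))
        for pathway in observed & hits:
            counts[pathway] += 1
    return counts
```

A pathway that turns up in most of the 100 random lists is probably driven by the GWAS regions in general rather than by the specific genes in list 1.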
8- LD score regression:
Bivariate LD score regression estimates the genetic correlation between two traits; a significant correlation suggests shared biology.
• LD Hub [19] – check the genetic correlation between your trait of interest and more than 600 other traits (see My Research page to download my Journal club slides on LD Hub)
• Stratified LD score regression [20] – check if there’s significant enrichment of heritability at variants overlapping histone marks (e.g. H3K4me1, H3K4me3) that are specific to cell lines of interest (e.g. lung-related cell lines for a GWAS of a respiratory disease)
9- Single-variant and genetic risk-score PheWAS (phenome-wide association study):
• GeneAtlas [21] or the UK Biobank Engine [22] for single-variant PheWAS
• PRS Atlas [23] – for polygenic risk score PheWAS (see My Research page to download my Journal club slides on the PRS Atlas)
• Other automated and reliable software includes PHESANT
10- Druggability analysis:
Once a list of potentially causal genes is created, one can then query drug/target databases to see whether the respective genes’ products (i.e. protein) are already targeted by certain compounds – or even better, in clinical trials (see ‘Approved Drugs and Clinical Candidates’ section for each protein in ChEMBL – if there is one).
• DGIdb – publicly available at [24]
• ChEMBL – publicly available at [25]
11- Protein-protein interactions:
If several proteins within your gene list are predicted/known to interact, this will provide a separate line of evidence for those genes – that is if they’re implicated by different signals/SNPs.
• STRING [26] – a score of >0.9 implies a ‘high-quality’ prediction
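If you export the interactions from STRING, the filtering described above (high-confidence interactions between genes implicated by different signals) could look something like the Python sketch below. The input layout is an assumption: I take scores to be on a 0 to 1 scale, so rescale them first if your export uses another scale, and the gene-to-signal mapping is something you would build from your own results.

```python
def cross_signal_interactions(interactions, gene_to_signal, min_score=0.9):
    """Keep high-confidence interactions between genes from different GWAS signals.

    interactions: iterable of (gene_a, gene_b, score), score assumed to be in 0-1.
    gene_to_signal: dict mapping gene -> sentinel SNP (or signal ID) implicating it.
    """
    kept = []
    for gene_a, gene_b, score in interactions:
        signal_a = gene_to_signal.get(gene_a)
        signal_b = gene_to_signal.get(gene_b)
        if signal_a is None or signal_b is None:
            continue  # one of the genes is not in the implicated gene list
        if score > min_score and signal_a != signal_b:
            kept.append((gene_a, gene_b, score))
    return kept
```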
12- Literature review:
• A thorough literature review of the identified genes is always a good way to start a story. Download RefSeq_all_gene_summaries for extracted gene function summaries from RefSeq [27]
13- GWAS catalog lookup:
Checking whether your associated SNPs are also associated with other traits can tell you about (i) shared biology and (ii) specificity, which can be important for drug target discovery.
• PhenoScanner [28]
• GWAS catalog – publicly available at [29]
14- Mouse Knockout studies:
• International Mouse Phenotyping Consortium (IMPC) [30] – see (i) if the genes of interest have been knocked out and (ii) what phenotypes were observed in the knockout mice
15- Mendelian randomization analysis:
Although over-hyped in my opinion, when carried out correctly it becomes a very useful tool to assess the causal relationship between an exposure and an outcome. You can use your associated SNPs as a proxy for your trait (e.g. LDL cholesterol-associated SNPs) and then check to see if your trait causes a disease (e.g. obesity)
• MR-Base [31] – carry out Mendelian randomization studies using your trait of interest as exposure or outcome
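To give a feel for what happens under the hood, here is a minimal Python sketch of two common two-sample estimators: the per-SNP Wald ratio and the fixed-effect inverse-variance weighted (IVW) estimate. This is not MR-Base's own code; the function names and input layout are mine, and it assumes independent SNPs, effects harmonised to the same allele, and no horizontal pleiotropy.

```python
import math

def wald_ratios(snps):
    """Per-SNP causal estimates and their approximate standard errors.

    snps: list of (beta_exposure, beta_outcome, se_outcome).
    """
    return [(by / bx, abs(se_y / bx)) for bx, by, se_y in snps]

def ivw_estimate(snps):
    """Fixed-effect inverse-variance weighted estimate across SNPs."""
    # Each SNP is weighted by beta_exposure^2 / se_outcome^2
    weight_sum = sum(bx ** 2 / se_y ** 2 for bx, _, se_y in snps)
    estimate = sum(bx * by / se_y ** 2 for bx, by, se_y in snps) / weight_sum
    standard_error = math.sqrt(1.0 / weight_sum)
    return estimate, standard_error
```

If the Wald ratios of individual SNPs disagree wildly with the IVW estimate, that is a hint to look into pleiotropy before believing the result.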
Selected Publications:
The methods above were used in the papers below:
1- Shrine, Guyatt, and Erzurumluoglu et al, 2018. New genetic signals for lung function highlight pathways and pleiotropy, and chronic obstructive pulmonary disease associations across multiple ancestries. Nature Genetics [32]
2- Wain et al, 2017. Genome-wide association analyses for lung function and chronic obstructive pulmonary disease identify new loci and potential druggable targets. Nature Genetics [33]
3- Allen et al, 2017. Genetic variants associated with susceptibility to idiopathic pulmonary fibrosis in people of European ancestry: a genome-wide association study. The Lancet Respiratory Medicine [34] – I like Figure 3 in this paper where they align and plot both the Lung eQTL and IPF GWAS results to visualise whether the causal variant in the eQTL study and GWAS are likely to be the same. However, as mentioned above at point 3 (i.e. eQTL-GWAS signal colocalisation), I would suggest using Z-scores rather than P-values to observe the direction of effects
4- Erzurumluoglu, Liu, and Jackson et al, 2018. Meta-analysis of up to 622,409 individuals identifies 40 novel smoking behaviour associated genetic loci. Molecular Psychiatry [35] – the Circos plot in this paper is brilliant! No competing interests declared 😉
Further reading
• Visscher et al, 2017. 10 Years of GWAS Discovery: Biology, Function, and Translation. AJHG
• Okada et al, 2014. Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature – one of those inspirational papers; I really liked Figure 2 the first time I saw it
• Erzurumluoglu et al, 2015. Identifying Highly Penetrant Disease Causal Mutations Using Next Generation Sequencing: Guide to Whole Process. BioMed Research International – I recommend this paper for PhD students who are looking for a comprehensive review comparing the ways Mendelian diseases and complex diseases are analysed. It is a little out of date in terms of the software/databases that are in the tables (e.g. the gnomAD database is not in there) but the main messages hold
Social Media
There’s a little thread under the below tweet, where Dr. Eric Fauman (Pfizer) states “The gene pointed at by an eQTL is actually less likely to be the causal gene”.
Post-GWAS analyses (Dec 2018) https://t.co/tXqNbpyJwv via @wordpressdotcom
— A Mesut Erzurumluoğlu (@mesuturkiye) December 15, 2018



