Posts Tagged ‘drug’

Download a PDF version of the blog post from here:


After performing a genome-wide association study (GWAS), we’d then ideally want to link the identified associations/SNPs to (druggable) genes and biological pathways. Unearthing novel biology can inform drug target (in)validation but also lead to higher-impact publications (see ‘selected publications’ below). The latter point is especially important for early-career researchers who will be applying for fellowships and/or lectureships soon 🙂

Happy to help out with any of the below.

A slide from my Journal club on the October 2017 GTEx paper: Identifying the causal variants and genes, and the relevant tissues and pathways is the ultimate aim of GWASs. If the causal gene(s) turn out to be ‘druggable’, this can lead pharmaceutical companies to develop treatments for the disease of interest. See My Research page to download the full slides.

Methods and Software

The below are some of the Post-GWAS ‘SNP follow-up’ steps/software that I have been taking/using for the last 2-3 years:

1- Finemapping the identified signals:

This step refines each signal to a set of variants that is 99% likely to contain the underlying causal variant – assuming the causal variant has been analysed

• Wakefield method [1] – Output: 99% credible set (Tutorial and R code available here: Wakefield_method_finemapping)
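To give a feel for what the credible set calculation involves, here is a minimal sketch in Python. It is not the R code from the tutorial above; it assumes the commonly used form of Wakefield's approximate Bayes factor (written here in favour of association), a prior standard deviation of 0.2 for the true effect (a common but arbitrary choice), and a single causal variant per signal. The SNP names and effect sizes are made up.

```python
import math

def wakefield_log_abf(beta, se, prior_sd=0.2):
    """Log approximate Bayes factor in favour of association for one SNP.

    prior_sd is an assumed prior standard deviation of the true effect.
    """
    v = se ** 2                 # variance of the effect estimate
    w = prior_sd ** 2           # prior variance of the true effect
    z = beta / se
    shrink = w / (v + w)
    return 0.5 * math.log(1.0 - shrink) + 0.5 * z * z * shrink

def credible_set(snps, coverage=0.99, prior_sd=0.2):
    """snps: list of (snp_id, beta, se) for one signal.

    Returns the smallest list of (snp_id, posterior probability), ranked by
    posterior, whose cumulative posterior reaches the requested coverage.
    """
    log_abfs = [(snp_id, wakefield_log_abf(beta, se, prior_sd))
                for snp_id, beta, se in snps]
    biggest = max(value for _, value in log_abfs)
    # Subtract the maximum before exponentiating to avoid overflow.
    weights = [(snp_id, math.exp(value - biggest)) for snp_id, value in log_abfs]
    total = sum(weight for _, weight in weights)
    ranked = sorted(((snp_id, weight / total) for snp_id, weight in weights),
                    key=lambda pair: pair[1], reverse=True)
    chosen, cumulative = [], 0.0
    for snp_id, posterior in ranked:
        chosen.append((snp_id, posterior))
        cumulative += posterior
        if cumulative >= coverage:
            break
    return chosen

# Hypothetical summary statistics for one signal: (SNP, beta, standard error)
signal = [("rs_a", 0.100, 0.010), ("rs_b", 0.098, 0.010),
          ("rs_c", 0.050, 0.010), ("rs_d", 0.010, 0.010)]
cs = credible_set(signal)
for snp_id, posterior in cs:
    print(snp_id, round(posterior, 3))
```

The posterior probabilities are only meaningful if the causal variant is among the analysed SNPs, which is the same caveat as above.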

2- Query eQTL databases:

Rather than just assume that the gene nearest to the sentinel SNP is the causal gene, we can bring in other lines of evidence such as eQTL and pQTL analyses to check whether the SNP(s) is associated with the expression of a gene.

• GTEx v7 dataset (n up to 492; RNASeq) [2] – publicly available at [3] (see My Research page to download my Journal club slides on GTEx v6 paper)

• NESDA-NTR Blood eQTL dataset (n=4,896; microarray) [4] – publicly available at [5]

• Lung eQTL dataset (n=1,111; microarray) [6] – need to request lookups from Dr. Ma’en Obeidat

• BIOS (Biobank-Based Integrative Omics Study) Blood eQTL dataset (n=2,116; RNAseq) [7] – publicly available at [8]

• Westra et al Blood eQTL dataset (n=5,311 with replication in 2,775; microarray) [9] – publicly available at [10]

• There are other tissue/organ specific databases such as BRAINEAC (n=134) and Brain xQTL (n=up to 494)

3- eQTL-GWAS signal colocalisation:

• eCAVIAR [11] by Hormozdiari et al, 2016 [12] – Click for Powerpoint presentation (ecaviar_colocalisation_mesut_04_07_18) and methods (ecaviar methods_v3)

• It also helps to plot the Z-scores of the eQTL (separate plots for each gene near the signal) and GWAS SNPs on the same plot – maybe with the SNPs in the 99% credible set marked differently to other SNPs near the sentinel SNP. Of course, choosing the relevant tissue(s) is crucial!
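Before plotting, the Z-scores from the two studies need to refer to the same allele. The sketch below (Python, made-up data, hypothetical function names) computes Z = beta/SE for each study, flips the sign of the eQTL Z-score when the effect allele is swapped relative to the GWAS, and flags the SNPs in the credible set. It does not handle strand-ambiguous (A/T, C/G) SNPs, which would need extra care in a real analysis.

```python
def z_score(beta, se):
    return beta / se

def aligned_z_pairs(gwas, eqtl, credible_ids):
    """gwas, eqtl: dict of snp_id -> (effect_allele, other_allele, beta, se).

    Returns a list of (snp_id, gwas_z, eqtl_z, in_credible_set) with the eQTL
    Z-score expressed relative to the GWAS effect allele.
    """
    rows = []
    for snp_id, (g_ea, g_oa, g_beta, g_se) in gwas.items():
        if snp_id not in eqtl:
            continue
        e_ea, e_oa, e_beta, e_se = eqtl[snp_id]
        if (e_ea, e_oa) == (g_ea, g_oa):
            sign = 1.0
        elif (e_ea, e_oa) == (g_oa, g_ea):
            sign = -1.0      # same SNP, opposite coding: flip the eQTL effect
        else:
            continue         # alleles do not match, so skip rather than guess
        rows.append((snp_id,
                     z_score(g_beta, g_se),
                     sign * z_score(e_beta, e_se),
                     snp_id in credible_ids))
    return rows

# Hypothetical summary statistics
gwas = {"rs1": ("A", "G", 0.10, 0.01),
        "rs2": ("C", "T", 0.05, 0.01),
        "rs3": ("A", "C", 0.02, 0.01)}
eqtl = {"rs1": ("A", "G", 0.30, 0.05),
        "rs2": ("T", "C", -0.20, 0.05),
        "rs3": ("A", "G", 0.10, 0.05)}
rows = aligned_z_pairs(gwas, eqtl, credible_ids={"rs1"})
for row in rows:
    print(row)
```

The resulting pairs can then be passed to any plotting library, with the last column used to pick the marker style.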

4- Query pQTL databases:

• Sun et al, 2018 dataset [13] – need to request lookups from the authors (maybe Dr. Adam Butterworth)

5- Variant effect prediction:

Checking whether our sentinel SNP is in LD with a coding variant that is predicted to be functional provides another line of evidence for a putatively causal gene.

• DeepSEA – for noncoding SNPs [14] (see My Research page to download my Journal club slides on DeepSEA)

• SIFT, PolyPhen-2, and FATHMM via Ensembl VEP – for coding SNPs [15]

6- Enrichment of associations at DNase hypersensitivity sites:

Using your GWAS results to identify chromatin features relevant to your trait of interest can yield important information on the genetic aetiology of that trait (e.g. DNase hypersensitivity site enrichment in fetal lung would mean that developmental pathways in the lung are playing an important role)

• GARFIELD [16]

• FORGE [17] – very easy to use but superseded by GARFIELD

7- Pathway enrichment analysis:

• ConsensusPathDB [18] – as it queries more biological pathway and gene ontology databases than the alternatives. You can input all the genes that are implicated by eQTL/pQTL databases and variant effect prediction (e.g. genes that harbour a coding variant in the 99% credible set). Good idea to remove genes in the MHC region (e.g. HLA genes) to identify pathways other than the immune system-related ones. Methods can be found here: ConsensusPathDB_methods

• You can also do an additional check to see if the ‘significant’ pathways (e.g. FDR<5%) are mainly due to the implicated genes – as identified by eQTL/pQTL and variant effect prediction (list 1) – or the regions identified by GWAS itself: extract all the genes within 500kb of the sentinel SNPs (list 2) and then make 100 lists (same size as list 1) with genes randomly selected from this set. Then input these to ConsensusPathDB and see how many times the pathways identified by list 1 appear in the output as ‘significant’.
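The random-list check above can be organised roughly as follows. This is only a sketch: `significant_pathways` is a hypothetical stand-in for however you run ConsensusPathDB on a gene list and collect the ‘significant’ pathway names (I am not assuming any particular programmatic interface), and the gene names and toy function are made up.

```python
import random

def random_gene_lists(region_genes, list_size, n_lists=100, seed=1):
    """Draw n_lists random gene lists (without replacement) from list 2."""
    rng = random.Random(seed)
    pool = sorted(set(region_genes))
    return [rng.sample(pool, list_size) for _ in range(n_lists)]

def pathway_recurrence(observed_pathways, random_lists, significant_pathways):
    """Count in how many random lists each list-1 pathway is 'significant'.

    significant_pathways: hypothetical callable taking a gene list and
    returning the names of pathways passing the chosen threshold.
    """
    counts = {pathway: 0 for pathway in observed_pathways}
    for genes in random_lists:
        hits = set(significant_pathways(genes))
        for pathway in counts:
            if pathway in hits:
                counts[pathway] += 1
    return counts

# Toy stand-in for the enrichment step: one pathway that is called
# 'significant' whenever GENE1 happens to be in the list.
def toy_significant_pathways(genes):
    return ["toy pathway"] if "GENE1" in genes else []

list2 = ["GENE%d" % i for i in range(1, 11)]   # genes within 500kb of sentinels
lists = random_gene_lists(list2, list_size=3)  # list 1 assumed to hold 3 genes
counts = pathway_recurrence(["toy pathway"], lists, toy_significant_pathways)
print(counts)
```

If a list-1 pathway also turns up as ‘significant’ in a large share of the random lists, it is probably driven by the regions rather than by the specifically implicated genes.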

8- LD score regression:

Bivariate LD score regression allows one to estimate the genetic correlation between two traits; a substantial genetic correlation suggests (but does not prove) shared biology.

• LD Hub [19] – check the genetic correlation between your trait of interest and up to >600 traits (see My Research page to download my Journal club slides on LD Hub)

• Stratified LD score regression [20] – check if there’s significant enrichment of heritability at variants overlapping histone marks (e.g. H3K4me1, H3K4me3) that are specific to cell lines of interest (e.g. lung-related cell lines for a GWAS of a respiratory disease)

9- Single-variant and genetic risk-score PheWAS (phenome-wide association study):

• GeneAtlas [21] or the UK Biobank Engine [22] for single-variant PheWAS

• PRS Atlas [23] – for polygenic risk score PheWAS (see My Research page to download my Journal club slides on the PRS Atlas)

• Other automated and reliable software includes PHESANT

10- Druggability analysis:

Once a list of potentially causal genes is created, one can then query drug/target databases to see whether the respective genes’ products (i.e. protein) are already targeted by certain compounds – or even better, in clinical trials (see ‘Approved Drugs and Clinical Candidates’ section for each protein in ChEMBL – if there is one).

• DGIdb – publicly available at [24]

• ChEMBL – publicly available at [25]

11- Protein-protein interactions:

If several proteins within your gene list are predicted/known to interact, this will provide a separate line of evidence for those genes – that is if they’re implicated by different signals/SNPs.

• STRING [26] – a score of >0.9 implies a ‘high-quality’ prediction

12- Literature review:

• A thorough literature review of the identified genes is always a good way to start a story. Download RefSeq_all_gene_summaries for extracted gene function summaries from RefSeq [27]

13- GWAS catalog lookup:

Checking to see if your associated SNPs are also associated with other traits can be informative about (i) shared biology and (ii) specificity – which can be important for drug target discovery.

• PhenoScanner [28]

• GWAS catalog – publicly available at [29]

14- Mouse Knockout studies:

• International Mouse Phenotyping Consortium (IMPC) [30] – see (i) if the genes of interest have been knocked out and (ii) what phenotypes were observed in the knockout mice

15- Mendelian randomization analysis:

Although over-hyped in my opinion, when carried out correctly it becomes a very useful tool to assess the causal relationship between an exposure and outcome. You can use your associated SNPs as a proxy for your trait (e.g. LDL cholesterol associated SNPs) and then check to see if your trait causes a disease (e.g. obesity)

• MR-Base [31] – carry out Mendelian randomization studies using your trait of interest as exposure or outcome
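For the curious, the simplest two-sample estimators behind this kind of analysis fit in a few lines. The sketch below (Python, made-up numbers) shows the per-SNP Wald ratio and a fixed-effect inverse-variance weighted (IVW) estimate. It assumes independent SNPs that are valid instruments, ignores the uncertainty in the SNP-exposure effects, and is not meant to reproduce exactly what MR-Base does.

```python
import math

def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Causal estimate from a single SNP.

    The standard error is a first-order approximation that ignores the
    uncertainty in beta_exposure.
    """
    estimate = beta_outcome / beta_exposure
    se = se_outcome / abs(beta_exposure)
    return estimate, se

def ivw_estimate(snps):
    """snps: list of (beta_exposure, beta_outcome, se_outcome).

    Fixed-effect inverse-variance weighted estimate across independent SNPs.
    """
    numerator = sum(bx * by / se_y ** 2 for bx, by, se_y in snps)
    denominator = sum(bx ** 2 / se_y ** 2 for bx, by, se_y in snps)
    return numerator / denominator, math.sqrt(1.0 / denominator)

# Hypothetical SNP effects on the exposure and on the outcome
instruments = [(0.10, 0.050, 0.01),
               (0.20, 0.100, 0.02),
               (0.05, 0.025, 0.01)]
ivw_beta, ivw_se = ivw_estimate(instruments)
print(ivw_beta, ivw_se)
```

In practice the sensitivity analyses (e.g. checking for pleiotropy) matter at least as much as the point estimate, which is where the ‘when carried out correctly’ caveat comes in.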

Selected Publications:

The methods above were used in the papers below:

1- Shrine, Guyatt, and Erzurumluoglu et al, 2018. New genetic signals for lung function highlight pathways and pleiotropy, and chronic obstructive pulmonary disease associations across multiple ancestries. Nature Genetics [32]

2- Wain et al, 2017. Genome-wide association analyses for lung function and chronic obstructive pulmonary disease identify new loci and potential druggable targets. Nature Genetics [33]

3- Allen et al, 2017. Genetic variants associated with susceptibility to idiopathic pulmonary fibrosis in people of European ancestry: a genome-wide association study. The Lancet Respiratory Medicine [34] – I like Figure 3 in this paper where they align and plot both the Lung eQTL and IPF GWAS results to visualise whether the causal variant in the eQTL study and GWAS are likely to be the same. However, as mentioned above at point 3 (i.e. eQTL-GWAS signal colocalisation), I would suggest using Z-scores rather than P-values to observe the direction of effects

4- Erzurumluoglu, Liu, and Jackson et al, 2018. Meta-analysis of up to 622,409 individuals identifies 40 novel smoking behaviour associated genetic loci. Molecular Psychiatry [35] – the Circos plot in this paper is brilliant! No competing interests declared 😉

Further reading

• Visscher et al, 2017. 10 Years of GWAS Discovery: Biology, Function, and Translation. AJHG

• Okada et al, 2014. Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature – one of those inspirational papers; I really liked Figure 2 the first time I saw it

• Erzurumluoglu et al, 2015. Identifying Highly Penetrant Disease Causal Mutations Using Next Generation Sequencing: Guide to Whole Process. BioMed Research International – I recommend this paper for PhD students who are looking for a comprehensive review comparing the ways Mendelian diseases and complex diseases are analysed. It is a little out of date in terms of the software/databases (e.g. The gnomAD database is not in there) that are in the tables but the main messages hold

Social Media
There’s a little thread under the below tweet, where Dr. Eric Fauman (Pfizer) states “The gene pointed at by an eQTL is actually less likely to be the causal gene”.


copd_smoking_nat_genet_lung_function_gwas_wain

We – as a group – carried out the largest genome-wide association study to identify genetic variants that are associated with decreased lung function and increased risk of chronic obstructive pulmonary disease. We hope that our findings will ultimately lead to the identification of effective drug targets for COPD. Image source: University of Leicester

I remember reading somewhere that ‘if you get asked the same question three times, then write a blog post about it’. That’s what I’ve been doing so far, and the purpose of this blog post is the same: to try and provide an answer to a commonly asked question. (Important note: my answers are in no way authoritative and only meant for interested non-scientists)

As a ‘Genetic Epidemiologist’, I constantly get asked what I do and what my (replace ‘my’ with ‘our’, as I do everything within a team) research can lead to. Please see my previous post ‘Searching for “Breathtaking” genes. Literally!’ and My Research page for short answers to these questions. In tandem with these, I am constantly asked why we can’t find a ‘cure’ for (noncommunicable) diseases that affect/will affect most of us such as obesity, diabetes, cancer, COPD, despite the many scientific advancements. I looked around for a straightforward example, but couldn’t find one (probably didn’t look hard enough!). So I decided to write my own.

I will first try and put the question into context: We do have ‘therapies’ and ‘preventive measures’ for most diseases and sometimes making that distinction from ‘a cure’ answers their question. For example, coronary heart disease (CHD) is a major cause of death both in the UK and worldwide (see NHS page for details) but we know how we can prevent many CHD cases (e.g. lowering cholesterol, stopping smoking, regular exercise) and treat CHD patients (e.g. statins, aspirin, ACE inhibitors). However, there are currently no ‘cures’ for CHD. So once a person is diagnosed with CHD, it is currently impossible to cure them of it, but doctors can offer quite a few options to make their life better.

I then gave it some thought about why finding a ‘cure’ was so hard for most diseases, and came up with the below analogy of a river/sea, water dam, and a nicely functioning village/city (excuse my awful drawing!).

The first figure below sets the scene: there’s a water dam that’s keeping the river from flooding and damaging the nice village/city next to it. Now please read the caption of the below figure to make sense of how they’re related to a disease.

Prevention

The river/sea is the combination of your genetic risk (e.g. you could have inherited genetic variants from your parents that increased your chances of type-2 diabetes) and environmental exposures (e.g. for type-2 diabetes, that would be being obese, eating a high-sugar diet, smoking). The water dam is your immune system and/or mechanisms in your body which tame the sea of risk factors to ensure that everything in your body works fine (e.g. pancreatic islets contain beta cells which produce insulin to bring your glucose level back to normal – a high glucose level would damage the body’s organs if it persisted).

So to ‘prevent’ a disease (well, flooding in this case), we could (i) make the water dam taller, (ii) make the dam stronger, and (iii) do regular checks to patch any damage done to the dam. To provide an example, for type-2 diabetes, point (i) could correspond to being ‘fit’ (or playing with your genes, which currently isn’t possible), point (ii) could correspond to staying ‘fit’, and point (iii) could correspond to having regular check-ups to see whether any preventive measures are necessary. Hope that made sense. If not, please stop reading immediately and look for other blog posts on the subject matter 🙂

Using the figure below, I wanted to then move to ‘therapy’. So as you can see, the river has flooded i.e. this individual has the disease (e.g. type-2 diabetes as above). The water dam is now not doing a good job of stopping the river and the city is in danger of being destroyed. But we have treatments: (i) The (badly drawn) water pumping trucks suck up excess water, and (ii) we have now built a second (smaller) dam to protect the houses and/or slow the flow of the water. Again, to provide an example using type-2 diabetes, water pumping trucks could be analogous to insulin injections or metformin tablets, and the smaller dams could be changing the current diet to a ‘low sugar’ version. This way we can alleviate the effects of the current and future ‘floods’.

Therapy

Analogy for therapy/treatment – after being diagnosed with the disease

Finally, we move on to our main question: ‘the cure’. Using the same analogy as above, as the water dam is now dysfunctional, the only way to stop future ‘floods’ would be to design a sewage system that can mop up all water that could come towards the city. Of course the water dam and ‘old city’ were destroyed/damaged due to past floods, so we’d need to build a new functioning city to take over the job of the old one. A related real example (off the top of my head) could be to remove the damaged tissues and replace them with new ones. Genetic engineering (using CRISPR/Cas9) and/or stem cell techniques are likely to offer useful options in the future.

Cure

Analogy for cure – after being diagnosed with the disease

Hopefully it is now clear that the measures taken to prevent or treat the disease cannot be used to cure the disease. E.g. you can build another dam in place of the old one, but the city is already destroyed so that’s not going to be of any use in curing the disease.

So to sum up, diseases like obesity, cancer, COPD are very complex diseases – in fact they’re called ‘complex diseases’ in the literature – and understanding their underlying biology is very hard (e.g. hundreds of genes and environmental exposures could combine to cause them). We’re currently identifying many causal variants but turning these findings into ‘cures’ is a challenge that we have not been able to crack yet. However, it is clear that the methods that we currently use to identify preventive measures and therapies cannot be used to identify cures.

I hope that was helpful. I’d be very happy to read your comments/suggestions and share credit with contributing scientists. Thanks for reading!


BBC_news_sperm_count

BBC news article published on the 18th March 2018. According to the article, men with low sperm counts are at a higher risk of disease/health problems. However, this is unlikely to be a causal relationship and more likely to be a spurious correlation. May even turn out to be the other way round due to “reverse causality”, a bias we encounter a lot in epidemiological studies. The following sounds more plausible (to me at least!): “Men with disease/health problems are likely to have low sperm counts” (likely cause: men with health problems tended to smoke more in general and this caused low sperm counts in those individuals).

As an enthusiastic genetic epidemiologist (keyword here: epidemiologist), I try to keep in touch with the latest developments in medicine and epidemiology. However, it is impossible to read all articles that come out as there are a lot of epidemiology and/or medicine papers published daily (in fact, too many!). For this reason, instead of reading the original academic papers (excluding papers in my specific field), I try to skim read from reputable news outlets such as the BBC, The Guardian and Medscape (mostly via Twitter). However, health news stories even in these respectable media outlets are full of wrong and/or oversensationalised titles: they either oversensationalise what the scientist has said or take the word of the scientists they contact – who are not infallible and can sometimes believe in their own hypotheses too much.

It wouldn’t harm us too much if the message of an astrophysics related publication is misinterpreted but we couldn’t say the same about health related news. Many people take these news articles as gospel truth and make lifestyle changes accordingly. Probably the best example for this is the Andrew Wakefield scandal in 1998 – where he claimed that the MMR vaccine caused autism and gastro-intestinal disease but later investigations showed that he had undeclared conflicts of interest and had faked most of the results (click here for a detailed article on the scandal). Many “anti-vaccination” (aka anti-vax) groups used his paper to strengthen their arguments and – although now retracted – the paper’s influence can still be felt today as many people, including my friends, do not allow their children to be vaccinated as they falsely think they might develop conditions like autism because of it.

The first thing we’re taught in our epidemiology course is “correlation does not mean causation.” However, a great many epidemiology papers published today report correlations (aka associations) without bringing in other lines of evidence to support a causal relationship. Some of the “interesting ones” amongst these findings are then picked up by the media and we see a great many news articles with titles such as “coffee causes cancer” or “chocolate eaters are more successful in life”. There have been instances when I read the opposite in the same newspaper a couple of months later (example: wine drinking is protective/harmful for pregnant women). The problem is caused not only by a lack of scientific method training on the media side, but also by health scientists who are eager to make a name for themselves in the lay media without making sure that they have done everything they could to ensure that the message they’re giving is correct (e.g. triangulating using different methods). As a scientist who analyses a lot of genetic and phenotypic data, it is relatively easy for me to observe that the size of the data that we’re analysing has grown massively in the last 5-10 years. However, in general, we scientists haven’t been able to receive the computational and statistical training required to handle these ‘big data’. Today’s datasets are so massive that if we take the approach of “let’s analyse everything we’ve got!”, we will find a tonne of correlations in our data whether they make sense or not.

To provide a simple example for illustrative purposes: let’s say that amongst the data we have in our hands, we also have each person’s coffee consumption and lung cancer diagnosis data. If we were to do a simple linear regression analysis between the two, we’d most probably find a positive correlation (i.e. increased coffee consumption means increased risk of lung cancer). 10 more scientists will identify the same correlation if they also get their hands on the same dataset; 3 of them will believe that the correlation is worthy of publication and submit a manuscript to a scientific journal; and one (the other two are rejected) will make it past the “peer review” stage of the journal – and this will probably be picked up by a newspaper. Result: “coffee drinking causes lung cancer!”

However, there’s no causal relationship between coffee consumption and lung cancer (not that I know of anyway :D). The reason we find a positive correlation is because there is a third (confounding) factor that is associated with both of them: smoking. Since coffee drinkers smoke more in general and smoking causes lung cancer, if we do not control for smoking in our statistical model, we will find a correlation between coffee drinking and lung cancer. Unfortunately, it is not very easy to eliminate such spurious correlations, therefore health scientists must make sure they use several different methods to support their claims – and not try to publish everything they find (see “publish or perish” for an unfortunate pressure to publish more in scientific circles).
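The coffee example is easy to reproduce with a small simulation. In the Python sketch below, all the probabilities are made up for illustration: smokers are more likely to drink coffee, and lung cancer risk depends on smoking only. I compare risk ratios within smokers and non-smokers (stratification) rather than fitting a regression model with smoking as a covariate, but the idea of ‘controlling for smoking’ is the same.

```python
import random

def simulate(n=200_000, seed=42):
    """Simulate (smoker, coffee_drinker, lung_cancer) for n people.

    All probabilities are invented for illustration only.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        smoker = rng.random() < 0.30
        coffee = rng.random() < (0.80 if smoker else 0.40)   # smokers drink more coffee
        cancer = rng.random() < (0.15 if smoker else 0.01)   # depends on smoking only
        people.append((smoker, coffee, cancer))
    return people

def risk_ratio(people):
    """Risk of cancer in coffee drinkers divided by risk in non-drinkers."""
    def risk(group):
        return sum(1 for _, _, cancer in group if cancer) / len(group)
    drinkers = [p for p in people if p[1]]
    others = [p for p in people if not p[1]]
    return risk(drinkers) / risk(others)

people = simulate()
crude = risk_ratio(people)
among_smokers = risk_ratio([p for p in people if p[0]])
among_non_smokers = risk_ratio([p for p in people if not p[0]])
print("crude:", round(crude, 2))
print("smokers only:", round(among_smokers, 2))
print("non-smokers only:", round(among_non_smokers, 2))
```

The crude risk ratio comes out well above 1 even though coffee has no effect in the simulation, while the ratios within smokers and within non-smokers stay close to 1.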

cikolata_ve_nobel_odulu

A figure showing the incredible correlation between countries’ annual per capita chocolate consumption and the number of Nobel laureates per 10 million population. Should we then give out chocolate in schools to ensure that the UK wins more Nobel prizes? However, this is likely not a causal relationship as it makes more sense that there is a (confounding) factor that is related to both of them: (most likely) GDP per capita at purchasing power parity. To view even quirkier correlations, I’d recommend this website (by Tyler Vigen). Image source: http://www.nejm.org/doi/full/10.1056/NEJMon1211064.

As a general rule, I keep repeating to friends: the more ‘interesting’ a ‘discovery’ sounds, the more likely it is to be false.

Hard to explain why I think like this but I’ll try: for a result to sound ‘interesting’ to me, it should be an unexpected finding as a result of a radical idea. There are just so many brilliant scientists today that finding unexpected things is becoming less and less likely – as almost every conceivable idea arises and is being tested in several groups around the world, especially in well researched areas such as cancer research. For this reason, the idea of a ‘discovery’ has changed from the days of Newtons and Einsteins. Today, ‘big discoveries’ (e.g. Mendel’s pea experiments, Einstein’s general relativity, Newton’s laws of motion) have given way to incremental discoveries, which can be as valuable. So with each (well-designed) study, we’re getting closer and closer to cures/therapies or to a full understanding of the underlying biology of diseases. There are still big discoveries made (e.g. CRISPR-Cas9 gene editing technique), but if they weren’t discovered by that respective group, they probably would have been discovered within a short space of time by another group as the discoverers built their research on a lot of other previously published papers. Before, elite scientists such as Newton and Einstein were generations ahead of their time and did most things on their own, but today, even the top scientists are probably not too far ahead of a good postdoc as most science literature is out there for all to read in a timely manner (and more democratically compared to the not-so-distant past) and is advancing so fast that everyone is left behind – and we’re all dependent on each other to make discoveries. The days of lone wolves are virtually over as they will get left behind by those who work in groups.

To conclude, without carefully reading the scientific paper that the newspaper article is referring to – hopefully they’ve included a link/citation at the bottom of the page! – or seeking what an impartial epidemiologist is saying about it, it’d be wise to take any health-related finding we read in newspapers with a pinch of salt as there are many things that can go wrong when looking for causal relationships – even scientists struggle to make the distinction between correlations and causal relationships.

power_posing

Amy Cuddy’s very famous ‘Power posing’ talk, which was the most watched video on the TED website for some time. In short, she states that if you adopt powerful/dominant looking poses, this will induce hormonal changes which will make you confident and relieve stress. However, subsequent studies showed that her ‘finding’ could not be replicated and that she did not analyse her data in the manner expected of a scientist. If a respectable scientist had found such a result, they would have tried to replicate their results; at least would have followed it up with studies which bring other lines of concrete evidence. What does she do? Write a book about it by bringing in anecdotal evidence at best and give a TED talk as if it’s all proven – as becoming famous (by any means necessary) is the ultimate aim for many people; and many academics are no different. Details can be found here. TED talk URL: https://www.ted.com/talks/amy_cuddy_your_body_language_shapes_who_you_are

PS: For readers interested in reading a bit more, I’d like to add a few more sentences. We should apply the below four criteria – as much as we can – to any health news that we read:

(i) Is it evidence based? (e.g. supported by a clinical trial, different experiments) – homeopathy is a bad example in this regard as its remedies are not supported by clinical trials, hence the name “alternative medicine” (not saying they’re all ineffective and further research is always required but most are very likely to be);

(ii) Does it make sense epidemiologically? (e.g. the example mentioned above i.e. the correlation observed between coffee consumption and lung cancer due to smoking);

(iii) Does it make sense biologically? (e.g. if gene “X” causes eye cancer but the gene is only expressed in the pancreatic cells, then we’ve most probably found the wrong gene)

(iv) Does it make sense statistically? (e.g. was the correct data quality control protocol and statistical method used? See figure below for a data quality problem and how it can cause a spurious correlation in a simple linear regression analysis)

graph-3

Wrong use of a statistical (linear regression) model. If we were to ignore the outlier data point at the top right of the plot, it becomes easy to see that there is no correlation between the two variables on the X and Y axes. However, since this outlier data point has been left in and a linear regression model has been used, the model identifies a positive correlation between the two variables – we would not have seen that this was a spurious correlation had we not visualised the data.
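The effect of a single outlier is easy to check numerically. The Python sketch below uses made-up data (not the data in the figure) and computes the Pearson correlation, which for a simple linear regression has the same sign as the fitted slope, with and without one extreme point.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data with no linear relationship between x and y
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3, 1, 2, 2, 2, 2, 1, 3]
r_clean = pearson_r(x, y)

# The same data plus one outlier at the top right of the plot
r_with_outlier = pearson_r(x + [30], y + [30])

print("without outlier:", round(r_clean, 3))
print("with outlier:", round(r_with_outlier, 3))
```

One data point is enough to turn no correlation into a very strong one, which is why plotting the data before modelling is worth the extra minute.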

PPS: I’d recommend reading “Bad Science” by Ben Goldacre and/or “How to Read a Paper – The basics of evidence based medicine” by Trisha Greenhalgh – or if you’d like to read a much better article on this subject with a bit more technical jargon, have a look at this highly influential paper by Prof. John Ioannidis: Why Most Published Research Findings Are False.

References:

Wakefield et al, 1998. Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children. The Lancet. URL: http://www.thelancet.com/journals/lancet/article/PIIS0140-6736%2897%2911096-0/abstract

Editorial, 2011. Wakefield’s article linking MMR vaccine and autism was fraudulent. BMJ. URL: http://www.bmj.com/content/342/bmj.c7452
