class: center, middle, inverse, title-slide

# L12: Causal Inference in Genetics and Some Discussion Questions
### Jean Morrison
### University of Michigan
### 2022-03-14 (updated: 2022-03-30)

---

# Plan

1. Some theory/philosophy discussion questions.

2. A bit about my research (if time).

---

# What Kinds of Questions Can Causal Inference Answer?

- We have often talked about the idea of the target trial to guide the design of causal inference studies.

- This restricts the realm of causal inference to a fairly narrow range of questions:
  + E.g. Should we treat this disease with drug A or drug B?

- In particular, it leaves out states that are hard to intervene on, like race and gender, and complex multifactorial conditions like obesity.

- Today we will talk about some different perspectives on this issue.

---

# References

Glymour and Glymour (2014). Commentary: Race and Sex Are Causes. *Epidemiology*

Hernán and Taubman (2008). Does obesity shorten life? The importance of well-defined interventions to answer causal questions. *International Journal of Obesity*

Kaufman (2019). Commentary: Causal Inference for Social Exposures. *Annual Review of Public Health*

---

# Questions

Consider the following causal questions:

+ Will administering drug B instead of drug A reduce mortality?
+ Does obesity cause heart disease?
+ Was she denied the job because she is female?

For each question:

+ Why are we (or the asker) asking the question?
+ Is there a well-defined intervention related to the question?
+ How would you try to answer the question?

---

# Do We Need a Hypothetical Intervention?

- Glymour and Glymour argue that a hypothetical intervention is unnecessary:

<center> <img src="img/12_gandg2.png" width="75%" /> </center>

- They argue that it is reasonable to ask questions without a concrete intervention attached.

<center> <img src="img/12_gandg1.png" width="75%" /> </center>

---

# Susan

- Glymour and Glymour present a hypothetical scenario.
<center> <img src="img/12_gandg3.png" width="75%" /> </center>

---

# Hernán and Taubman

- Hernán and Taubman argue that we do need a well-defined intervention to make sensible causal statements.

- They consider trying to answer the question "How many deaths are attributable to obesity?"

- They argue that this question is not really answerable.

---

# Hypothetical Obesity Studies

- Hernán and Taubman describe three hypothetical randomized trials, each using a different mechanism to reduce BMI in treated participants.
  + In one trial, participants are forced to exercise every day for 30 years.
  + In another, participants' diets are restricted.
  + In a third, both interventions are combined but in less extreme forms.

- All three treatments have the same effect on BMI.

- But the trials observe different effects on mortality.

<center> <img src="img/12_handt1.png" width="75%" /> </center>

---

# Consequences of Vague Interventions

- Hernán and Taubman argue that vague counterfactuals violate consistency.

<center> <img src="img/12_handt2.png" width="75%" /> </center>

---

# Consequences of Vague Interventions

<center> <img src="img/12_handt3.png" width="75%" /> </center>

---

# Consequences of Vague Interventions

- They go on to argue that vague interventions also lead to violations of exchangeability, since complex traits like obesity have many component causes.

- And possibly also violations of positivity: there may be some levels of confounders at which individuals have zero probability of some range of the exposure.
  + This is especially problematic when the proposed contrast is large (e.g. BMI of 20 vs BMI of 30).

---

# Consequences of Vague Interventions

<center> <img src="img/12_handt4.png" width="75%" /> </center>

---

# Race and Social Variables

- Jay Kaufman makes these comments about attempts to estimate causal effects of race.
<center> <img src="img/12_k1.png" width="95%" /> </center>

---

# Race and Social Variables

<center> <img src="img/12_k2.png" width="95%" /> </center>

</br> </br>

<center> <img src="img/12_k4.png" width="85%" /> </center>

---

# Discussion Questions

Consider two studies.

- The first study is attempting to measure a causal effect of gender on Covid-19 mortality. It has been observed that men die at higher rates from Covid-19. The researchers want to know if this relationship is causal.

- The second study is attempting to measure the causal effect of gender on income. Does being female cause a person to earn less money?

--

Questions:

- What are the possible mechanisms by which gender could be causal in each scenario?

- What are the merits or downsides of trying to answer each question?

- Are there alternative ways of asking these questions that may avoid some of the vagueness that Hernán and Taubman warn us away from?

- Compare the position of Glymour and Glymour to the position of Hernán and Taubman. Who do you think is right? What are the merits of each argument?

---

# Topic Change

- In the rest of class I will give a short overview of the role of causal inference in genetics and the kinds of questions I work on.

---

# Genetics 101

<center> <img src="img/12_chromosomes.jpg" width="85%" /> </center>

---

# Genetic Inheritance

- Each person has two copies of each chromosome: one from each parent.

- Gregor Mendel's famous pea plant experiments showed that which of a parent's two copies an offspring inherits is random, like a coin flip.

<center> <img src="img/12_inheritance.png" width="85%" /> </center>

<small>(By User:Gklambauer, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=68551285)</small>

---

# Hardy-Weinberg Equilibrium

- Suppose there is a genetic variant present in a population.

- The population is isolated and mates randomly.

- After only one generation of random mating, the variant will be in Hardy-Weinberg equilibrium.
  + Suppose there are two possible alleles, A and B.
  + Allele B has frequency `\(p\)` in the population, meaning that if I choose a chromosome at random from a random person in the population, I will observe `\(B\)` with probability `\(p\)`.
  + HWE says that the proportions of individuals with genotypes AA, AB, and BB will be `\((1-p)^2\)`, `\(2p(1-p)\)` and `\(p^2\)` respectively.

---

# Consequences of Hardy-Weinberg Equilibrium

- If a variant is in HWE, then it is truly randomly assigned.

- Genetic variants are also immutable, meaning they cannot be changed by outside forces.

- This means a variant in HWE will be independent of every trait except for
  + Other genetic variants nearby in the genome (correlation between variants = linkage disequilibrium).
  + Traits that are causally affected by the variant.
  + Traits that are causally affected by the variant's close neighbors.

- HWE `\(\Rightarrow\)` association is causation.

---

# Genome-Wide Association Studies

<center> <img src="img/12_gwas.png" width="85%" /> </center>

---

# Genome-Wide Association Studies

</br>

<center> <img src="img/12_mht.png" width="85%" /> </center>

---

# Sources of Confounding in GWAS

- GWAS are not actually free from confounding.

- The biggest sources of confounding are population structure and relatedness.
  + These are essentially the same problem on different scales.

- Other problems that have only recently received attention are selection bias and cross-trait assortative mating.

- Effect modification by environmental factors (G x E) or other variants (G x G) can also affect results.

---

# Genetic Drift

- Suppose that our isolated, randomly mating population breaks into two populations which are isolated from each other.

- In every generation, the variant will be in HWE in both populations.

- However, the frequency of the variant will wander around randomly in each population.
  + This is a consequence of having a finite population.
  + After several generations, the variant will have different allele frequencies in the two populations.

---

# Genetic Drift

<center> <img src="img/12_drift.png" width="85%" /> </center>

---

# Genetic Drift and Population Structure

- A consequence of genetic drift is that there are many differences in allele frequency between groups of people that have been historically geographically separated.
  + The majority of these differences are probably not functional.

- There are *some* differences between populations in functional variants due to differences in selective pressures.
  + Mostly related to environmental pathogens (e.g. malaria), extreme environments (e.g. high altitude), and diet differences (e.g. milk availability).

- But most differences are due simply to drift.

- Differences in allele frequency and the local correlation structure of the genome are called **population structure**.

---

# Population Structure and Confounding

- If our study includes individuals with multiple ancestries, and ancestry is associated with the disease, we will have confounding.

<center> <img src="img/12_pop_structure.png" width="75%" /> </center>

---

# Relatedness

- Relatedness in a study causes confounding in a similar way to population structure.

- Individuals in a family will be more similar to each other genetically than they are to a random other individual.

- Individuals in a family also share some of their environment.

- So if our study contains related individuals, we could have confounding.

---

# Accounting for Population Structure and Relatedness in GWAS

- Typically, GWAS account for population structure and relatedness separately.

- For each variant, we fit a mixed model

`$$Y_{i} = \beta_0 + \beta_1 G_{i,j} + \eta Z_i + \gamma W_i + e_{i} + \epsilon_{i,j}\\\ \mathbf{e} \sim N(0, \Sigma)$$`

Where `\(\Sigma\)` is correlation due to relatedness and covariates `\(W_i\)` are variables representing population structure (e.g. principal components).
`\(Z_i\)` are possible trait-related covariates like age and sex.

---

# What to Do with GWAS Results

1. Try to identify causal genes/causal mechanisms.

2. Mendelian Randomization

3. Disease risk prediction

---

# My Interests

- Issues and opportunities in Mendelian randomization
  + How to account for violations of the exclusion restriction?
  + Behavior of multivariable MR in realistic settings.
  + High-dimensional multivariable MR.

- Once we have GWAS results, how do we identify the important biology?

- Disease risk prediction turns out to be very sensitive to even small amounts of residual bias in GWAS estimates.
  + Can we do a better job adjusting for population structure?

---

# Mendelian Randomization

- MR is extremely powerful but also quite error-prone (as we have discussed).

- Is it possible to build MR methods that are more robust to violations of the exclusion restriction?

Option A: Model the distribution of direct effects of variants on `\(Y\)`.

Option B: Adjust for heritable confounders using multivariable MR.

---

# Robust MR

- Mixture models are one approach to modeling the direct effects of variants on `\(Y\)`.

<center> <img src="img/12_cause.png" width="75%" /> </center>

- This illustration is for a method called CAUSE (Causal Estimation Using Summary Effects; *Nat Genet* (2020)).

---

# CAUSE

- In CAUSE, we use an empirical Bayes strategy.

- From the mixture model, we can write a likelihood for the summary statistics for the exposure and the outcome.

<center> <img src="img/12_causeeq.png" width="85%" style="display: block; margin: auto;" /> </center>

- However, we have lots of nuisance parameters: the `\(\beta_{M,j}\)`'s and the `\(\theta_j\)`'s.

- The idea is to use a flexible prior for these effects and integrate them out.

---

# CAUSE

- Once we've integrated out the nuisance parameters, we can estimate posterior distributions for `\(\gamma\)`, `\(\eta\)`, and `\(q\)`.
  + Compare the fit of posterior distributions estimated with `\(\gamma = 0\)` vs `\(\gamma \neq 0\)` to determine if there is compelling evidence of a non-zero causal effect.

- This works fairly well: we are able to reduce false discoveries when some invalid variants act through a shared confounder.
  + Power is a bit lower than other methods.
  + Tends to have conservative effect estimates.

---

# Multivariable MR

- CAUSE and related methods are forced to make assumptions about the distribution of direct effects, e.g. most instruments are valid.

- If we have GWAS results for heritable confounders, we can adjust for them in multivariable MR.

- However, as we saw last class, MVMR has some issues.

- Weak instrument bias using the regression method can be dramatic.
  + Increases as the number of traits increases.

- Real data often don't harmonize perfectly:
  + Some variants are measured in only a subset of traits.

---

# Multivariable MR

- Improving MR estimates via multivariate adjustment is now feasible.
  + There are large numbers of GWAS with public summary statistics.

- With many candidate confounders, we may need to do some model selection.

- Our goal is to adapt MVMR to:
  + Perform well with many traits
  + Accommodate incomplete data
  + Perform model selection

---

# Identifying Causal Mechanisms

- GWAS, correctly performed, promises us that our association signal is close to the causal variant.

<center> <img src="img/12_ld.png" width="45%" /> </center>

- However, it is still hard to find the causal variant.

--

- Worse, we don't know what most variants do.

<center> <img src="img/12_42.png" width="55%" /> </center>

---

# Incorporating External Evidence

- In order to understand which variant in a region might be causal and understand the function of that variant, we can look at other information.

- Associations with molecular traits like gene expression, protein levels, and chromatin accessibility.

- Associations with other clinical traits.
---

# Factor Analysis to Identify Shared Biology

<center> <img src="img/12_gfa.png" width="95%" /> </center>

---

# Intermediate Phenotypes

- Molecular traits are sometimes considered "closer to the biology" than clinical traits.
  + i.e. molecular traits may be mediating the effects of variants on the clinical trait of interest.

- We can use MR and MR-like procedures to identify mediating molecular traits.
  + Ideally this implicates a meaningful biological function that could become a drug target.
  + Or which is at least illuminating.

- Many molecular traits are measured on a shared unit, the gene.
  + This means that multiple data types can provide independent evidence that a particular gene is causal for the trait of interest.

---

# Intermediate Phenotypes

- Problems: Procedures for linking molecular traits to clinical traits via genetics are prone to both false positives and false negatives.
  + False positives occur when variants with different effects are in close proximity.
  + False negatives occur because molecular traits are measured with error and/or sample size is low.

- Can we combine information from many molecular traits to implicate causal genes?

</br>

<center> <img src="img/12_fig1.png" width="25%" /> </center>

---

# Last Thoughts

- This class has been a lot of fun for me.

- I am excited about your presentations for the rest of the semester.
  + I've been impressed so far.

- Final presentations are 4/18 and 4/25.
  + Maybe we can make them a bit celebratory (e.g. snacks).

- Feel free to get in touch (during the rest of the semester or after) if you have questions, want to discuss your project, or want to discuss out-of-class research.

- Please leave feedback on the course reviews when they come (or directly if you feel comfortable) so I can improve the class for next year.
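
---

# Appendix: Sketch of HWE and Genetic Drift

The Hardy-Weinberg and genetic drift slides can be illustrated with a small simulation. This is a minimal sketch, not code from the lecture. It assumes a Wright-Fisher-style model, which the slides do not name: each generation draws `2N` alleles at random given the previous generation's allele frequency. The function names `hwe_genotype_freqs` and `simulate_drift` and all parameter values are hypothetical choices for illustration.

```python
import random


def hwe_genotype_freqs(p):
    """Expected proportions of genotypes AA, AB, BB when allele B has frequency p."""
    return (1 - p) ** 2, 2 * p * (1 - p), p ** 2


def simulate_drift(p0, n_individuals, n_generations, rng):
    """Track the frequency of allele B across generations.

    Assumed model (Wright-Fisher style): the next generation's 2N
    chromosomes are each B independently with probability equal to the
    current allele frequency.
    """
    p = p0
    trajectory = [p]
    for _ in range(n_generations):
        n_b = sum(rng.random() < p for _ in range(2 * n_individuals))
        p = n_b / (2 * n_individuals)
        trajectory.append(p)
    return trajectory


# Two isolated populations that start from the same allele frequency.
rng = random.Random(1)
pop1 = simulate_drift(p0=0.3, n_individuals=500, n_generations=50, rng=rng)
pop2 = simulate_drift(p0=0.3, n_individuals=500, n_generations=50, rng=rng)

print("Final allele frequencies:", pop1[-1], pop2[-1])
print("HWE genotype proportions, population 1:", hwe_genotype_freqs(pop1[-1]))
print("HWE genotype proportions, population 2:", hwe_genotype_freqs(pop2[-1]))
```

Both populations start at the same frequency, so any difference at the end comes only from random sampling in a finite population. Shrinking `n_individuals` should make the two trajectories wander apart faster.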