class: center, middle, inverse, title-slide .title[ # L2: DAGs and Confounding ] .author[ ### Jean Morrison ] .institute[ ### University of Michigan ] .date[ ### Lecture on 2023-01-09 (updated: 2023-01-17) ] --- `\(\newcommand{\ci}{\perp\!\!\!\perp}\)` ## Lecture Outline 1. Representing Causal Relationships in Directed, Acyclic Graphs (DAGs) 1. Connecting DAGs to Counterfactuals 1. Using the Backdoor Criterion to Identify Conditional Exchangeability 1. Single World Intervention Graphs --- # 1. Representing Causal Relationships in Directed, Acyclic Graphs (DAGs) --- ## Graphical Representations of Causal Effects + We can represent causal effects in a graph with arrows. + Nodes in the graph are random variables. + Directed edges represent direct causal effects (not mediated by any other variables in the graph). + The absence of an edge indicates the absence of a direct causal effect. -- 
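
As a rough illustration (not part of the original lecture), a causal graph can be stored as a set of ordered (cause, effect) pairs. The three-node chain and the helper names `has_direct_edge` and `has_directed_path` below are hypothetical, chosen only to show that a missing edge rules out a *direct* effect while still allowing an effect mediated by another node in the graph.

```python
# Hypothetical chain graph: Z -> M -> Y (there is no edge Z -> Y).
edges = {("Z", "M"), ("M", "Y")}


def has_direct_edge(cause, effect):
    """True if the graph contains the directed edge cause -> effect."""
    return (cause, effect) in edges


def has_directed_path(start, end):
    """True if `end` can be reached from `start` by following arrows."""
    frontier, seen = [start], set()
    while frontier:
        node = frontier.pop()
        for parent, child in edges:
            if parent == node and child not in seen:
                if child == end:
                    return True
                seen.add(child)
                frontier.append(child)
    return False


# No direct effect of Z on Y, but an effect mediated by M is still possible.
print(has_direct_edge("Z", "Y"), has_directed_path("Z", "Y"))
```

Storing edges as ordered pairs keeps the direction of each arrow explicit, which is the only information the sketch needs.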
--- ## Why Use Graphs - Graphs are a natural way of encoding scientific understanding of the world. - For many people, the graph encoding is fairly intuitive. - This makes them a useful tool for communicating structural assumptions across domains. - Under some assumptions, graphical properties can be used to easily solve problems that are hard to solve otherwise. - In particular, graphs are very useful for answering the question "what variables should I condition on?". --- ## Graph Definitions - A graph, `\(\mathcal{G} = \lbrace V, E\rbrace\)`, consists of - A set of nodes (vertices) `\(V = \lbrace V_1, \dots, V_J\rbrace\)` - A set of edges `\(E = \lbrace (V_{1_1}, V_{1_2}), \dots, (V_{K_1}, V_{K_2}) \rbrace\)`, which can be represented as pairs of nodes. - A graph can be either *directed*, in which case elements of `\(E\)` are ordered pairs, or *undirected*, in which case elements of `\(E\)` are unordered. - Our graphs will almost always be directed. - Two nodes are *adjacent* if they are connected by an edge. + If the edge is directed, the node at the beginning of the edge is the *parent* and the node at the end is the *child*. --- ## Graph Definitions - A *path* is a sequence of edges in which each edge contains one node from the previous edge. - In a directed path, all of the edges are oriented in the same direction, i.e. each edge starts at the last node of the previous edge. - In this graph: <center>
</center> there are two paths from `\(A\)` to `\(Y\)` but only one directed path. --- ## Graph Definitions - If there are no paths between two nodes, they are *disconnected* (or *connected* otherwise). - Node `\(V_j\)` is a descendant of node `\(V_k\)` if there is a directed path from `\(V_k\)` to `\(V_j\)`. - If a graph contains no directed cycles, it is *acyclic*. + We will require all of our graphs to be acyclic. - DAG = Directed Acyclic Graph --- ## Example: + Consider this story: - There are two possible treatments for a disease. Treatment `\(A = 1\)` is more effective than `\(A = 0\)`, but has more side effects. - Doctors prefer treatment `\(A = 0\)` for patients who are older or who have milder disease. - Patient outcome (remission or not) is affected by initial severity, treatment, and treatment adherence. - Conditional on everything else, age has no effect on patient outcome or disease severity. -- + Work with your neighbor to arrange the following variables in a DAG (there is more than one right answer): - Patient outcome - Initial disease severity - Treatment - Treatment adherence - Age --- ## Disease Treatment Example <center>
</center> --- ## Temporality - Our definition of causality requires that the exposure occur before the outcome in time. - Under this restriction, a causal DAG must be consistent with at least one strict ordering of nodes. -- - How do we represent a feedback loop? -- + We can create multiple nodes representing unique time points (e.g. `\(A_1, A_2, \dots\)`), with each node only permitted to have causal effects on future nodes. + More on this later. <center>
</center> <!-- - What if measurements are taken at the same time? --> <!-- -- --> <!-- + We are generally forced to assume a temporal ordering even if our observations are taken at the same time. --> <!-- + For example, if we measure height and blood pressure, we can assume that height is stable over a long period of time while blood pressure is more variable. --> --- ## Causal Markov Property - The causal Markov property translates graph structure into probability statements. + It states that, conditional on its parents, each node is independent of all nodes that are not its descendants. + This implies that the joint probability distribution of all nodes can be factored as $$ P(V) = \prod_{j = 1}^{J} P(V_j \vert pa_j), $$ where `\(pa_j\)` denotes the parents of `\(V_j\)`. --- ## Example Conditional independence statements in the disease treatment graph: `$$S\ci A \qquad S\ci Ad \qquad A \ci Ad$$` `$$T \ci Ad\ \vert A, S \qquad O \ci A \ \vert S, T, Ad$$` <center>
</center> --- ## Example In our example, we can factor the joint probability as `$$P(A, S, T, Ad, O) = P(A)P(S)P(Ad)P(T \vert A, S)P(O \vert S, T, Ad)$$` <center>
</center> --- # 2. Connecting DAGs and Counterfactuals through the 'do' Operator --- ## The 'do' Operator - In order for a graph to be a **causal graph**, it must say something about counterfactual worlds in which a variable is **set** to a particular value. - Judea Pearl makes this connection by introducing the 'do' operator and 'graph surgery'. We will see two other ways to make this connection in this lecture. - The operation `\(do(X = x)\)` indicates the intervention of setting the variable `\(X\)` equal to `\(x\)`. --- ## 'do' Operator in DAGs - Pearl suggests that the action of setting `\(X\)` to `\(x\)` can be represented graphically by removing all of the arrows into `\(X\)`. - This is because, in the interventional world, `\(X\)` is no longer affected by its parents; its value is determined by the intervention. --- ## Pearl's Graph Surgery .pull-left[ Original DAG <img src="img/2_sprinkler_dag_1.png" width="90%" /> $$ `\begin{split} P(\mathbf{x}) = P(x_1) P(x_2 \vert x_1)P(x_3 \vert x_1)\\P(x_4 \vert x_2, x_3)P(x_5 \vert x_4) \end{split}` $$ ] .pull-right[ DAG with intervention "turning sprinkler on". <img src="img/2_sprinkler_dag_2.png" width="90%" /> $$ `\begin{split} P_{X_3 = ON}(\mathbf{x}) = & P(x_1) P(x_2 \vert x_1)\\& P(x_4 \vert x_2, X_3 = ON)\\ & P(x_5 \vert x_4) \end{split}` $$ ] --- # 3. Connecting DAGs and Counterfactuals through Structural Equation Models --- ## Structural Equation Models - In a **structural equation** the bit on the left is qualitatively different from the bit on the right. - The bit on the left is a counterfactual. --- ## Example - Suppose we have a machine for measuring blood pressure. - `\(X\)` represents the true systolic blood pressure and `\(Y\)` represents the measured systolic blood pressure. - Suppose that the machine has some small amount of error. - We can represent this system with a graph: <center>
</center> --- ## Example - We could also represent the system with an equation `$$Y = X + \epsilon_Y$$` where `\(\epsilon_Y\)` is a random variable that is the error of the machine. - Mathematically, it would be equivalent to say `$$X = Y - \epsilon_Y$$` - But the first equation is structural while the second is not. - We can find the counterfactual value `\(Y(X = x)\)` by plugging `\(x\)` into the first equation. - The second equation is descriptive but not counterfactual. --- ## Non-Parametric Structural Equation Models - We can link a graph to a causal model with a system of non-parametric structural equations. - Let `\(\mathcal{G} = \lbrace V, E \rbrace\)` be a directed graph with nodes `\(V_1, \dots, V_n\)`. - Let `\(\epsilon_1, \dots, \epsilon_n\)` be a set of random variables, one for each node. - We assume that for a given `\(V_i\)` with parents `\(\mathbf{pa}_i \subset V\)`, there is a counterfactual `\(V_i(\mathbf{pa}_i)\)` given by the structural equation `$$V_i(\mathbf{pa}_i) = f_{V_i}(\mathbf{pa}_i, \epsilon_{i})$$` --- ## Example This graph <center> <img src="img/2_swig6_fig9.png" width="40%" /> </center> corresponds to equations `$$Z = f_Z(\epsilon_Z)$$` `$$M(z) = f_M(z, \epsilon_M)$$` `$$Y(z, m) = f_Y(z, m, \epsilon_Y)$$` --- ## Special Case: Linear Structural Equation Models - A linear SEM is the special case in which `\(f_{V_1}, \dots, f_{V_n}\)` are linear and `\(\epsilon_{1}, \dots, \epsilon_n\)` are mutually independent. - In the previous example, a linear SEM would be $$ `\begin{split} &Z = \epsilon_Z\\ &M = \beta_{ZM} Z + \epsilon_M\\ &Y = \beta_{ZY} Z + \beta_{MY} M + \epsilon_Y \end{split}` $$ - In the linear case, the SEM can be written in matrix notation $$ \mathbf{V}(\mathbf{v}) = \mathbf{B}^{\top}\mathbf{v} + \boldsymbol{\epsilon} $$ --- ## Linear SEMs - Lots of early work on causal inference deals specifically with linear SEMs. 
- Linear SEMs are easy to work with, so it is sometimes convenient to demonstrate a property using linear SEMs. - However, this is a very restrictive model. - For now, we will try to make as few assumptions as possible. - When we start the modeling section, we will add some assumptions so that we can estimate parameters, but it is nice to be clear about which assumptions are necessary and which are just a convenience. --- ## Completing the Causal Model Definition - Our definition of NPSEMs is not quite sufficient to provide a causal model because it doesn't guarantee the causal Markov property. - For this we need an assumption about `\(\epsilon_1, \dots, \epsilon_n\)`. - One sufficient assumption is that `\(\epsilon_1, \dots, \epsilon_n\)` are mutually independent of each other and the other variables in the model. - Richardson and Robins call this the NPSEM-IE (IE = Independent Errors) model. They also propose a weaker set of assumptions that is also sufficient. - We will come back to this later. --- # 4. The Backdoor Criterion - Using DAGs to identify conditioning sets. - Confounding - Colliding --- ## Exchangeability - We want to know whether the outcome `\(O\)` is exchangeable with respect to the treatment: does `\(O(t) \ci T\)` hold? - If not, what set of variables `\(L\)` can we condition on such that `\(O(t) \ci T \vert L\)`? <center>
</center> --- ## Recognizing Lack of Exchangeability in a DAG - Informally, there are two sources of lack of exchangeability: + The presence of common causes (confounders) that have not been conditioned on. + Common effects (colliders) that have been conditioned on. - We will see how to formalize these statements and how to use a DAG to identify a sufficient conditioning set to remove confounding. --- ## Common Causes (Confounders) - The presence of a common cause introduces association between two variables that is not due to a causal effect. <center>
</center> -- - In the disease treatment example, disease severity is a confounder: - Sicker patients are more likely to receive `\(A = 1\)` (doctors prefer `\(A = 0\)` for milder disease) and sicker patients are also more likely to have a poor outcome. - So there would be an association between treatment and outcome, even if the two drugs worked equally well. --- ## Confounding - Confounding as a concept is quite old and therefore has been given many definitions. - We will define confounding as the lack of exchangeability that results from common causes. - *Confounders* are variables which can be used to adjust for confounding. + In the graph below, `\(L_1\)` and `\(L_2\)` are both confounders, even though only `\(L_1\)` is a common cause of `\(A\)` and `\(Y\)`. <center>
</center> --- ## Common Effects (Colliders) + A variable `\(L\)` is a collider relative to `\(A\)` and `\(Y\)` if `\(L\)` is a descendant of both `\(A\)` and `\(Y\)`. + Conditioning on a collider introduces an association between `\(A\)` and `\(Y\)`. - This is not necessarily as intuitive as the bias introduced by a common cause. <center>
</center> --- ## Collider Example: Routes to Stardom + Suppose that in order to become a movie star, one must either be talented or beautiful. + Suppose that in the population, talent and beauty are uncorrelated. - For simplicity, suppose `\(P(\text{talent}) = P(\text{beauty}) = 0.1\)`. + For simplicity, assume that anyone with talent or beauty has the same chance of becoming a star. <center>
</center> --- ## Collider Example: Routes to Stardom + Tables below show the proportions of talent and beauty in the general population and among stars: <center> <table class="table" style="width: auto !important; float: left; margin-right: 10px;"> <caption>Population</caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> \(B = 0\) </th> <th style="text-align:right;"> \(B = 1\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> \(T = 0\) </td> <td style="text-align:right;"> 0.81 </td> <td style="text-align:right;"> 0.09 </td> </tr> <tr> <td style="text-align:left;"> \(T = 1\) </td> <td style="text-align:right;"> 0.09 </td> <td style="text-align:right;"> 0.01 </td> </tr> </tbody> </table> <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption>Stars</caption> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> \(B = 0\) </th> <th style="text-align:right;"> \(B = 1\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> \(T = 0\) </td> <td style="text-align:right;"> 0.00 </td> <td style="text-align:right;"> 0.47 </td> </tr> <tr> <td style="text-align:left;"> \(T = 1\) </td> <td style="text-align:right;"> 0.47 </td> <td style="text-align:right;"> 0.05 </td> </tr> </tbody> </table> </center> -- - In the population `\(P[T=1] = P[ T = 1 \vert B = 1] = 0.1\)` - Among stars `\(P[T=1] = 0.1/0.19 = 0.53\)`, and `\(P[T = 1 \vert B = 1] = \frac{0.01/0.19}{0.1/0.19} = 0.1\)` -- - Among stars, talent and beauty are negatively correlated. Talent is rarer among beautiful stars than among stars as a whole. - This is because we have conditioned on the collider "Becoming a star". --- ## Colliders Can Block Confounding - In the graph below, there is no confounding because there is no common cause of `\(A\)` and `\(Y\)`. - The collider `\(L_2\)` is "blocking" the path from `\(A\)` to `\(Y\)`. 
- Causal Markov properties show us that `\(A\)` and `\(Y\)` are independent in this graph. - d-Separation formalizes the rules for identifying pairs of independent variables based on graphical rules. <center>
</center> --- ## No Statistical Definition of Confounding - One commonly given characterization of a confounder is a variable which + Is associated with the exposure. + Is associated with the outcome. + Is not on the pathway of interest between exposure and outcome. -- - Note that `\(L_2\)` satisfies all of these criteria but is not a confounder. <center>
</center> - Determining confounding requires a causal model. + The data cannot tell you if confounding is present. --- ## d-Separation + A path is *blocked* if: 1. Two arrowheads on the path collide ( `\(\rightarrow W \leftarrow\)` ) at a variable that is not being conditioned on *and* which has no descendants in the conditioning set. OR 1. It contains a non-collider that is being conditioned on. + A path is *open* if it is not blocked: - It does not contain a collider and no variables on the path are being conditioned on. OR - Every collider on the path is conditioned on (directly or through one of its descendants) and no non-colliders are conditioned on. + Two variables are *d-separated* if all paths between them are blocked. --- ## Examples: <center> .pull-left[
] .pull-right[
]
</center> --- ## d-Separation and the Causal Markov Property Let `\(A\)`, `\(B\)`, and `\(C\)` be sets of variables. Verma and Pearl (1988) proved that: If `\(A\)` and `\(B\)` are d-separated given `\(C\)` then `\(A \ci B \vert\ C\)`. - We will not prove this in class. --- ## Faithfulness Faithfulness is the property that, for three sets of variables `\(A\)`, `\(B\)`, and `\(C\)` `$$A \ci B \vert C \Rightarrow\ A \text{ is } d\text{-separated from }B\text{ given }C$$` - Violations of faithfulness occur when confounding effects perfectly cancel each other. + Pearl calls these "incidental cancellations". + And then defines "stable" vs "unstable" unbiasedness. Unstable unbiasedness occurs when faithfulness is violated. - Practically, we can assume that faithfulness is never violated except when it is violated by design. - Matching studies intentionally violate faithfulness. --- ## Example Which pairs of variables are d-Separated? <center>
</center> -- - The causal Markov property allowed us to conclude that `\(T \ci Ad \vert S, A\)`. - Because `\(T\)` and `\(Ad\)` are d-separated, we can also conclude that `\(T \ci Ad\)` unconditionally. --- ## Backdoor Path - A backdoor path from `\(A\)` to `\(Y\)` is a path from `\(A\)` to `\(Y\)` that begins with an edge going *into* `\(A\)`. - Find the backdoor paths from `\(A\)` to `\(Y\)`. <center>
</center> --- ## Backdoor Criterion and Exchangeability Theorem: If a set of variables, `\(L\)`, + blocks every backdoor path between `\(A\)` and `\(Y\)` + contains no descendants of `\(A\)`, then `\(Y(a) \ci A \vert L\)`. -- - The two conditions in the theorem are referred to as the *backdoor criterion*. --- ## Examples What set of variables makes `\(A\)` and `\(Y\)` conditionally exchangeable? <center>
</center> -- - `\(A\)` and `\(Y\)` are exchangeable unconditionally. - Conditioning on `\(L\)` induces bias. (M-Bias). --- ## Examples What set of variables makes `\(A\)` and `\(Y\)` conditionally exchangeable? <center>
</center> -- - If `\(U_1\)` and `\(U_2\)` are unobserved, there is no available set of variables we can condition on to remove bias. - If we can measure either, `\(\lbrace U_1, L\rbrace\)` and `\(\lbrace U_2, L \rbrace\)` are both sufficient adjustment sets. --- ## Examples What set of variables makes `\(A\)` and `\(Y\)` conditionally exchangeable? <center>
</center> -- - Conditioning on `\(U\)` blocks all backdoor paths. - Could we condition on `\(L\)` instead? -- - No. `\(L\)` is a descendant of `\(A\)`, so it does not satisfy the backdoor criterion. --- # 5. Single World Intervention Graphs --- ## Single World Intervention Graphs (SWIGs) - SWIGs are a method of including counterfactuals in DAGs, proposed by Richardson and Robins. - This is an alternative to Pearl's graph surgery strategy. - SWIGs make the intervention more explicit. - Using SWIGs, we can evaluate conditional exchangeability using only `\(d\)`-separation without having to manage the other aspects of the backdoor criterion. --- ## SWIGs - To create a SWIG, the node representing the intervened-on variable is split into two nodes. - One node represents the natural value of the variable. - The other represents the fixed value due to the intervention. <center> <img src="img/2_dag1.png" width="30%" /> <img src="img/2_swig1.png" width="80%" /> </center> <!-- --- --> <!-- # SWIG Procedure --> <!-- - To represent the counterfactual `\(Y(a)\)`, we split the node `\(A\)` in the original graph into two nodes: --> <!-- - `\(A\)` (in the SWIG) represents the naturally occurring treatment -- what would have occurred with no intervention. --> <!-- + All of the arrows which entered into `\(A\)` in the original DAG will enter into `\(A\)` in the SWIG. --> <!-- - `\(a\)` represents the intervened on value of the treatment. --> <!-- + This variable is fixed at the value `\(a\)` (i.e. it is deterministic rather than random). --> <!-- + All arrows which exited `\(A\)` in the original DAG now exit the `\(a\)` intervention node. --> <!-- + All variables which were downstream of `\(A\)` in the original DAG are replaced by their counterfactual values. --> --- ## Templates (SWITs) - Each SWIG can represent only a single intervention, i.e. the world in which everyone receives treatment `\(A = a\)`. - A template is a graph-valued function `\(x \rightarrow \mathcal{G}(x)\)`. 
The input is the value the intervened-on variable is set to; the output is a SWIG. <center> <img src="img/2_swig1.png" width="80%" /> <img src="img/2_swig4.png" width="40%" /> </center> --- ## Single World + `\(Y(0)\)` and `\(Y(1)\)` never appear on the same graph. + SWIGs cannot represent relationships between counterfactuals "across worlds" (i.e. `\(Y(0)\)` and `\(Y(1)\)` ). + From the SWIGs below, we can conclude `\(Y(0) \ci X\)` and `\(Y(1) \ci X\)` but not `\(Y(1), Y(0) \ci X\)` or `\(Y(1) \ci Y(0)\)`. <center> <img src="img/2_swig1.png" width="80%" /> </center> --- ## SWIG Procedure - Step 1: Split nodes <center> <img src="img/2_split_nodes.png" width="40%" /> </center> - Split every intervention node into - `\(A\)` the random component; what `\(A\)` would have been without intervention. - `\(a\)` a fixed component representing the intervention. - Incoming arrows go into `\(A\)` and outgoing arrows go out of `\(a\)`. --- ## SWIG Procedure - Step 2: Re-label downstream nodes as counterfactuals. <center> <img src="img/2_label_nodes.png" width="80%" /> </center> --- ## d-Separation in SWIGs - In a SWIG, any path containing an intervention node is blocked. - This sounds like a new rule but isn't really. - Intervention nodes are fixed at a value, so no information can propagate through. - Fixed nodes in any graph block a path. - However, it is atypical to include fixed nodes in a graph, so this is an important rule to remember for SWIGs. --- ## d-Separation in SWIGs - Under the NPSEM-IE assumptions, `\(d\)`-separation in the SWIG implies exchangeability. - That is, if `\(\mathcal{G}(a)\)` is the SWIG of the intervention `\(A = a\)` and `\(Y(a)\)` and `\(A\)` are `\(d\)`-separated in `\(\mathcal{G}(a)\)`, then `$$Y(a) \ci A$$` - We have a similar result for conditional exchangeability. 
- If `\(Y(a)\)` and `\(A\)` are `\(d\)`-separated in `\(\mathcal{G}(a)\)` conditional on a set of nodes `\(L\)`, then `$$Y(a) \ci A \vert \ L$$` - NPSEM-IE is stronger than necessary to achieve this result. <!-- - Alternatively, `\(A\)` and `\(Y\)` are d-separated in `\(\mathcal{G}(a)\)` if they are d-separated in the subgraph of `\(\mathcal{G}(a)\)` achieved by removing all of the fixed nodes. --> <!-- - Remember, intervention nodes are fixed at a single value, so no information can propagate through them. --> <!-- - We are heading toward the result that (conditional) d-separation of `\(Y(a)\)` and `\(A\)` in a SWIG `\(\mathcal{G}(a)\)` implies (conditional) exchangeability of `\(Y(a)\)` and `\(A\)`. --> --- ## Applying d-separation in SWIGs <center> <img src="img/2_fig16.png" width="80%" /> </center> - This is the M-Bias example. - We can see that `\(Y(a)\)` is d-separated from `\(A\)` unconditionally. - Conditional on `\(Z\)`, `\(Y(a)\)` is not d-separated from `\(A\)`. --- ## Independence Assumptions - As we said before, the NPSEM model is not sufficient to conclude the causal Markov property. We need an additional assumption. - FFRCISTG Assumption: Let `\(\mathbf{v}^\dagger\)` be an intervention on every variable in `\(V\)`. The FFRCISTG independence assumption says that all of the counterfactual variables `\(\lbrace V(\mathbf{v}^\dagger) \rbrace\)` are mutually independent after this intervention. - NPSEM-IE assumption: The NPSEM-IE assumption says that all of the errors `\(\epsilon_V\)` are mutually independent. - NPSEM-IE is strictly stronger than FFRCISTG. - NPSEM-IE implies cross-world independences while FFRCISTG does not. - We will generally assume FFRCISTG. --- ## FFRCISTG vs NPSEM-IE <center> <img src="img/2_swig6_fig9.png" width="40%" /> </center> - FFRCISTG says $$ Z\ci M(z) \ci Y(z, m)$$ for any `\(z\)` and `\(m\)`. 
- NPSEM-IE says `$$Z \ci \lbrace M(z) \text{ for all } z \rbrace \ci \lbrace Y(z, m)\ \text{for all } z, m \rbrace$$` - For example, `\(M(0) \ci Y(Z = 1, M = 0)\)`. --- ## Factorization Proof Sketch - An important result is that the FFRCISTG assumption is sufficient to conclude that if `\(P(\mathbf{V})\)` factorizes according to `\(\mathcal{G}\)` then `\(P(\mathbf{V}(\mathbf{a}))\)` factorizes according to the SWIG `\(\mathcal{G}(a)\)`. Proof idea: Work by reverse induction. - FFRCISTG tells us that the property holds for interventions on all variables. - This follows trivially (in a mathematical sense). - The CI properties of the SWIG for an intervention on all variables say that `\(P(\mathbf{V}(\mathbf{a})) = \prod_j P(V_j(\mathbf{a}))\)`. - This is exactly the FFRCISTG independence assumption. --- ## Factorization Proof Sketch - Suppose we have an intervention on a set of variables `\(A \subset \mathbf{V}\)` and that the property holds for all (strict) supersets of `\(A\)`. - We can find a single variable `\(C\)` to add to `\(A\)` and prove the factorization result only for terms involving `\(C\)`. <center> <img src="img/2_factorization.png" width="75%" /> </center> - See appendix B1 of Richardson and Robins (2013). --- ## SWIGs, Exchangeability, and d-Separation - The factorization result allows us to conclude that d-separation in the SWIG implies conditional exchangeability. - Recall: Conditional exchangeability says that `\(Y(a) \ci A \vert L\)`, which is exactly what is implied by `\(d\)`-separation in the SWIG. - `\(d\)`-separation in the SWIG is equivalent to satisfying the backdoor criterion. - So this proof also shows that the FFRCISTG assumptions imply the backdoor criterion. - Since NPSEM-IE is strictly stronger than FFRCISTG, NPSEM-IE assumptions also imply the backdoor criterion. --- ## Descendants of `\(A\)` in the Backdoor Criterion + Using SWIGs helps show why the backdoor criterion excludes descendants of `\(A\)`. 
<center> <img src="img/2_swig3.png" width="50%" /> </center> + The SWIG on the right shows that `\(Y(x) \ci X \vert L_1, L_2(x)\)`. + However, we cannot conclude that `\(Y(x) \ci X \vert L_1, L_2\)` because `\(L_2\)` is not on the graph. + If there is a causal effect of `\(X\)` on `\(Y\)`, then conditioning on `\(L_2\)` introduces a type of selection bias (more on this in the future). --- ## More Examples: Confounder <center> <img src="img/2_fig9iii.png" width="40%" /> </center> - `\(Y(m)\)` is not unconditionally independent of `\(M\)`, but `\(Y(m) \ci M \vert Z\)`. - The factorization result gives us `$$P(Z, M, Y(m)) = P(Z)P(M \vert Z)P(Y(m) \vert Z)$$` - Note that we left the fixed node out of the probability calculation. --- ## Mediator <center> <img src="img/2_fig9ii.png" width="40%" /> </center> - Here we do have unconditional exchangeability because `\(Y(z) \ci Z\)`. - From factorization, `\(P(Z, M(z), Y(z)) = P(Z)P(M(z))P(Y(z) \vert M(z))\)` --- ## Mediator with Two Interventions <center> <img src="img/2_fig9iv.png" width="40%" /> </center> - From this graph we can get `\(Y(z, m) \ci M(z)\)`. - We have to use the new rule that fixed nodes block paths. - This is saying that intervening on `\(M\)` blocks the effect of `\(Z\)` that is propagated through `\(M(z)\)`. - From factorization, `\(P(Z, M(z), Y(z, m)) = P(Z)P(M(z))P(Y(z, m))\)` --- ## Mediation Effects <center>
</center> - In the graph above, `\(L\)` is *mediating* part of the effect of `\(A\)` on `\(Y\)`. - Using our machinery so far, we can define the total effect (TE) of `\(A\)` on `\(Y\)` as `$$E[Y(A = 1)] - E[Y(A = 0)]$$` - We might be interested in the effect of `\(A\)` that is not mediated through `\(L\)`. - This would be the effect of `\(A\)` on `\(Y\)` if we intervened on `\(L\)` and prevented `\(L\)` from changing. - For example, we intervene on `\(A\)` and set it to 1. - But we intervene on `\(L\)` and set it to `\(L(0)\)`. --- ## Natural Direct and Indirect Effect <center>
</center> - The *natural direct effect* (NDE) is `$$E[Y(A = 1, L = L(0))] - E[Y(A = 0, L = L(0))]$$` - The *natural indirect effect* (NIE) is $$ E[Y(A = 1, L= L(1))] - E[Y(A = 1, L = L(0))] $$ - So TE = NDE + NIE: the `\(E[Y(A = 1, L = L(0))]\)` terms cancel and `\(Y(A = a, L = L(a)) = Y(A = a)\)`. - Note that both of these involve "cross-world" counterfactuals. --- ## Identifying NIE and NDE - The FFRCISTG independence assumption does not allow identification of the NIE and NDE. - See Robins and Greenland (1992) for a good example. - If we further assume the NPSEM-IE model, NIE and NDE are identifiable. - We will come back to this in the future to talk more about mediation. <!-- --- --> <!-- ## Identifying the NDE under NPSEM-IE Assumptions --> <!-- - Under NPSEM-IE assumptions, cross-world independencies hold. --> <!-- - Specifically, we have --> <!-- $$ Y(A = 1, l) \ci L(A = 0) \qquad \forall\ l $$ --> <!-- - We want to estimate `\(E[Y(A = 1, L(A = 0))]\)`. --> <!-- $$ --> <!-- E[Y(A = 1, L(A = 0))] = \sum_l E[Y(A = 1, L = l)\vert L(0) = l]P[L(0) = l] --> <!-- $$ --> <!-- $$ --> <!-- =\sum_l E[Y(A = 1, L=l)]P[L(0) = l] --> <!-- $$ --> <!-- - The second equality comes from the independence assumption above. --> <!-- --- --> <!-- ## Identifying the NDE under NPSEM-IE Assumptions --> <!-- - We now only need to estimate `\(E[Y(A = 1, L = l)]\)` and `\(P[L(0) = l]\)` which can be estimated as --> <!-- $$ --> <!-- E[Y(A = 1, L = l)] = E[Y \vert A = 1, L = l] --> <!-- $$ --> <!-- $$ --> <!-- P[L(0) = l ] = P[L=l \vert A = 0] --> <!-- $$ --> <!-- - Both equalities come from exchangeability --> <!-- - So --> <!-- $$ --> <!-- E[Y(A = 1, L(A = 0))] = \sum_l E[Y \vert A = 1, L = l]P[L =l \vert A = 0] --> <!-- $$ --> <!-- --- --> <!-- ## Non-Identifiability of NDE under FFRCISTG Intuition --> <!-- - Consider two types of units: --> <!-- + Both have `\(L(0) = 0\)` and `\(L(1) = 1\)` --> <!-- - In both cases `\(Y(1, 1) = 1\)` and `\(Y(0, 0) = 0\)`. --> <!-- - For type 1, `\(Y(1, 0) = 0\)` and for type 2 `\(Y(1, 0) = 1\)`. 
--> <!-- - For type 1, if `\(A\)` had not caused `\(L\)`, they would not have experienced `\(Y\)`. --> <!-- - For type 2, `\(L\)` did not matter. --> <!-- - Even if we could observe both types under both treatments of `\(L\)`, we could not tell them apart. --> <!-- - But type 1 individuals represent indirect effects while type 2 individuals represent direct effects. --> <!-- --- --> <!-- ## Controlled Direct and Indirect Effects --> <!-- - Controlled effects are an alternative to natural effects. --> <!-- - These are the effect of `\(Y\)` when we control `\(L\)` at some pre-determined value. --> <!-- `$$E[Y(A = 1, L = l)] - E[Y(A = 0, L = l)]$$` --> <!-- - This definition only requires a joint intervention and doesn't require cross-world relationships between counterfactuals. --> <!-- - Therefore CDE and CIE are identifiable under the FFRCISTG model. --> <!-- --- --> <!-- ## Extras: Collapsibility and Confounding --> <!-- --- --> <!-- ## Causal Identification --> <!-- - Suppose I know the joint distribution `\(P(V)\)` over a set of variables. --> <!-- - Suppose I also assume: --> <!-- - The distribution factorizes according to a DAG. --> <!-- - We have observed all of the nodes on the DAG. --> <!-- - Faithfulness --> <!-- - When is it possible to infer the structure of the DAG? --> <!-- --- --> <!-- ## Markov Equivalence --> <!-- - Multiple graphs can imply the same set of conditional independence relations. --> <!-- - Two graphs `\(\mathcal{G}_1\)` and `\(\mathcal{G}_2\)` are Markov equivalent if they imply the same set of d-separation relations. -->
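
---

## Collider Example: Checking the Numbers

The stardom tables can be reproduced by enumerating the four talent/beauty cells. The sketch below is an illustration rather than part of the original lecture code. It assumes independence with `\(P(\text{talent}) = P(\text{beauty}) = 0.1\)`, as in the example, and it assumes that stars are drawn only from people with talent or beauty, each with the same chance, so that chance cancels when we condition. The names `population`, `stars`, and `prob` are made up for this sketch.

```python
from fractions import Fraction
from itertools import product

p = Fraction(1, 10)  # P(talent) = P(beauty) = 0.1, as in the example

# Joint distribution of (T, B) in the population under independence.
population = {
    (t, b): (p if t else 1 - p) * (p if b else 1 - p)
    for t, b in product((0, 1), repeat=2)
}

# Assumption: stars come only from the cells with T = 1 or B = 1.
# Conditioning on the collider renormalizes those three cells.
p_star = sum(pr for (t, b), pr in population.items() if t or b)
stars = {
    (t, b): (pr / p_star if (t or b) else Fraction(0))
    for (t, b), pr in population.items()
}


def prob(dist, event, given=lambda t, b: True):
    """P(event | given) under a distribution `dist` over (t, b) cells."""
    denom = sum(pr for (t, b), pr in dist.items() if given(t, b))
    numer = sum(pr for (t, b), pr in dist.items() if given(t, b) and event(t, b))
    return numer / denom


def talent(t, b):
    return t == 1


def beauty(t, b):
    return b == 1


# Population: P[T = 1] and P[T = 1 | B = 1] agree (independence).
print(prob(population, talent), prob(population, talent, beauty))
# Stars: P[T = 1] and P[T = 1 | B = 1] differ (collider conditioning).
print(prob(stars, talent), prob(stars, talent, beauty))
```

Among stars the sketch gives `\(P[T = 1] = 10/19 \approx 0.53\)` and `\(P[T = 1 \vert B = 1] = 1/10\)`, matching the values on the stardom slide. Exact fractions are used so that the comparison with the tables does not depend on rounding.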