class: center, middle, inverse, title-slide

# L9: Time-Varying Treatment
### Jean Morrison
### University of Michigan
### 2022-02-16 (updated: 2022-02-23)

---
# Example

`\(\newcommand{\ci}{\perp\!\!\!\perp}\)`

- Patients are treated for a disease over time.

- At each appointment, the treatment decision for the next period is made, possibly based on current or past symptoms.

- We observe an outcome `\(Y\)`.

<center>
</center>

---
# Notation and Conventions

- Bar notation indicates the history of a variable: `\(\bar{A}_k = (A_0, \dots, A_k)\)`.

- In statistical time, `\(A_k\)` is the last variable at time `\(k\)`.
  + Covariates `\(L_k\)` are measurements that are taken after treatment `\(A_{k-1}\)` is given but before treatment `\(A_k\)` is given.

- Timing aligns for all units.
  + We will often talk about time points as though they are evenly spaced (e.g. every month), but this is not required.

- Time starts at 0.

---
# Treatment Programs

- In this setting, we might be interested in the effect of the entire course of treatment `\(\bar{A} = (A_0, A_1)\)`.

- We are modeling a joint intervention.

- With 2 time points, there are only four static treatment strategies.

- With `\(k\)` time points there are `\(2^k\)` static treatment strategies.

- How do we know which to compare?

---
# Treatment Strategies

- A treatment strategy is a rule for determining `\(A_k\)` from a unit's past treatment and covariate values:
`$$g = (g_0(\bar{a}_{-1}, l_0), \dots, g_K(\bar{a}_{K-1}, \bar{l}_K))$$`

- A treatment strategy is static if it does not depend on any covariates, only past treatments.
  + In a static strategy, we could write out the entire program at the beginning of the study.
  + Ex: Treat every other month.
  + Ex: Treat for only the first two time points.

- A treatment strategy is dynamic if it does depend on covariates.
  + Ex: Treat if `\(L_{k-1}\)` was high.
  + Ex: If `\(L_{k-1}\)` is high, switch treatment, so `\(A_{k} = 1-A_{k-1}\)`. Otherwise set `\(A_k = A_{k-1}\)`.

---
# Sequentially Randomized Trials

- In a sequentially randomized trial, treatment `\(A_{k,i}\)` is assigned randomly with `\(P[A_{k,i} = a]\)` possibly depending on `\(\bar{A}_{k-1}\)` and `\(\bar{L}_{k}\)`.

--

- Example: Every patient starts on treatment 0. Every month a random set of patients are assigned to switch to treatment 1 and stay on that treatment for the rest of the study.
- Patients with high values of `\(L_{k}\)` may have a higher probability of starting treatment.

- Example: Treatment is assigned randomly at every time point.

- Patients with high values of `\(L_{k}\)` have a higher probability of switching treatments.

--

- A random strategy can never be better than the optimal deterministic strategy.
  + We would never recommend a random strategy for general treatment of patients.
  + But random strategies are necessary when the optimal strategy is unknown.

---
# Causal Contrasts

- The causal contrast we choose to look at will depend on the study.

- We might be interested in comparing specific fixed programs, `\(E[Y(\bar{A} = \bar{a}_1)] - E[Y(\bar{A} = \bar{a}_2)]\)`, such as
  + Always treat vs never treat: `\(\bar{a}_1 = (1, 1, \dots, 1)\)`, `\(\bar{a}_2 = (0, 0, \dots, 0)\)`
  + Treat early and continue vs begin treatment later and continue: `\(\bar{a}_1 = (1, 1, \dots, 1)\)`, `\(\bar{a}_2 = (0,\dots, 0, 1, \dots, 1)\)`.

- Or we could compare one or more dynamic strategies `\(g\)`, `\(E[Y(g)]\)`, such as:
  + Always treat vs treat only when symptoms are present.

---
# DAG 1

- With your partner(s), draw a DAG for two time points for the following scenario:

- Patients are enrolled in a randomized trial of HIV medication.

- At each time point CD4 count ( `\(L\)` ) is measured. CD4 count is affected by the most recent treatment and the most recent past value of CD4 count.

- Probability of receiving treatment 1 at time `\(k\)` depends on CD4 count at time `\(k\)`, but not on past treatment or past CD4 count measurements.

- The outcome `\(Y\)` depends on treatment at all time points but does not depend on CD4 count.

---
# DAG 1

---
# DAG 2

- To your DAG add an unmeasured variable `\(U\)` which is also time-varying.
  + `\(U\)` represents disease progression.

- `\(U\)` at time `\(k\)` is affected by the most recent treatment value and most recent past value of `\(U\)`.
- `\(U\)` at time `\(k\)` affects CD4 count at time `\(k\)`, the next value of `\(U\)`, and `\(Y\)`. - You can assume that `\(U_k\)` is statistically before `\(L_k\)`. --- # DAG 3 - Draw the same scenario except that treatment `\(A_k\)` does not depend on CD4 count but does depend on the most recent treatment `\(A_{k-1}\)`. --- # DAG 4 - Suppose that we now have an observational study. - Patients have checkups once a month. At each appointment, treatment is adjusted. - Doctors will take into account past treatment, current values of `\(L\)`, and other factors which are affected by `\(U\)`. --- # SWIGs for Static Strategies - Draw a SWIG for the DAG below for the treatment strategy `\((A_0 = a_0, A_1 = a_1).\)` <center> <img src="img/9_dag2s.png" width="55%" /> </center> --- # SWIGs for Static Strategies - Draw a SWIG for the DAG below for the treatment strategy `\((A_0 = a_0, A_1 = a_1).\)` <center> <img src="img/9_swig2s.png" width="85%" /> </center> --- # Static Sequential Exchangeability - Static sequential exchangeability says that `$$Y(\bar{a}) \ci A_k \vert\ \bar{A}_{k-1} = \bar{a}_{k-1}, \bar{L}_k\qquad k = 0, 1, \dots K$$` - Does static sequential exchangeability hold in our example? <center> <img src="img/9_swig2s.png" width="65%" /> </center> --- # Static Sequential Exchangeability - We can see that `\(Y(a_0, a_1) \ci A_0\)` ( `\(A_0\)` is disconnected) - In the SWIG we have `\(Y(a_0, a_1) \ci A_1(a_0) \vert A_0 = a_0, L_1(a_0)\)`. + By consistency `\(A_0 = a_0 \Rightarrow A_1 = A_1(a_0)\)` and + `\(A_0 = a_0 \Rightarrow L_1 = L_1(a_0)\)` - So `\(Y(a_0, a_1) \ci A_1 \vert A_0 = a_0, L_1\)` <center> <img src="img/9_swig2s.png" width="60%" /> </center> --- # Static Sequential Exchangeability - Does static sequential exchangeability hold in the SWIG below? <center> <img src="img/9_swig3.png" width="80%" /> </center> --- # Static Sequential Exchangeability - Does static sequential exchangeability hold in the SWIG below? 
<center> <img src="img/9_swig4.png" width="80%" /> </center>

---
# SWIGs for Dynamic Treatment Strategies

- Suppose that we want to estimate the counterfactual `\(E[Y(g)]\)` where `\(g\)` is the dynamic treatment strategy "treat only if `\(L_k = 1\)`".
  + Recall that the SWIG represents the hypothetical world of the intervention, not the world in which we collected our data.

- Our intervention *introduces an arrow* from `\(L_1(g_0)\)` to the value of `\(A_1\)` under the intervention.

<center> <img src="img/9_swig2dyn.png" width="75%" /> </center>

---
# SWIGs for Dynamic Treatment Strategies

- The dotted arrow is created by the proposed intervention.
  + It is not a result of the experimental design or underlying causal structure.

- The dotted arrow functions just like a solid arrow for computing d-separation.
  + It is dotted so that we know it was introduced by the intervention.

<center> <img src="img/9_swig2dyn.png" width="75%" /> </center>

---
# Sequential Exchangeability for Dynamic Treatment Strategies

- Sequential exchangeability for `\(Y(g)\)` holds if
`$$Y(g) \ci A_k \vert \bar{A}_{k-1} = g(\bar{A}_{k-2}, \bar{L}_{k-1}), \bar{L}_k \qquad k = 0, 1, \dots, K$$`

- This definition applies whether `\(g\)` is static or dynamic, random or deterministic.

- Also called *sequential conditional exchangeability*.

---
# Sequential Exchangeability for Dynamic Treatment Strategies

- Does sequential exchangeability hold for `\(Y(g)\)` in the dynamic intervention?

<center> <img src="img/9_swig2dyn.png" width="75%" /> </center>

--

- We can see that `\(Y(g) \ci A_0\)` and `\(Y(g) \ci A_1(g_0) \vert\ L_1(g_0), A_0 = g_0\)`.

- Using consistency, `\(Y(g) \ci A_1 \vert L_1, A_0 = g_0\)`.

---
# Sequential Exchangeability for Dynamic Treatment Strategies

<center> <img src="img/9_swig3dyn.png" width="75%" /> </center>

--

- We don't have `\(Y(g) \ci A_0\)`.
  + They are connected by the `\(A_0 - W_0 - L_1(g_0) - g_1 - Y(g)\)` path.
  + The `\(g_1\)` node does not block the path because it is not fixed.
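---
# Aside: Simulating `\(Y(g)\)`

The interventional world that a SWIG depicts can also be explored by simulation: draw each unit forward in time, but set each `\(A_k\)` by the rule `\(g\)` instead of by the trial's assignment mechanism. Below is a minimal Python sketch; the structural equations and all constants in them are hypothetical, chosen only to make the example runnable.

```python
import random

def mean_y(n, strategy=None, seed=0):
    """Simulate a two-period trial and return the average outcome.
    If strategy is None, treatment is randomized as in the trial;
    otherwise A_k = strategy(k, L_k). All structural equations below
    are hypothetical illustrations, not taken from the slides."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        L0 = int(rng.random() < 0.5)
        A0 = strategy(0, L0) if strategy else int(rng.random() < 0.5)
        # Treatment-confounder feedback: A0 affects the next covariate.
        L1 = int(rng.random() < (0.3 if A0 else 0.7))
        A1 = strategy(1, L1) if strategy else int(rng.random() < 0.5)
        total += 60 + 10 * L1 - 5 * A1 + rng.gauss(0, 1)
    return total / n

# E[Y(g)] for the dynamic strategy g: "treat only if L_k = 1"
ey_g = mean_y(100_000, strategy=lambda k, l: l)
```

Under this particular data-generating process, setting `\(A_1 = L_1\)` trades the `\(+10 L_1\)` term against the `\(-5 A_1\)` term, so the simulated `\(E[Y(g)]\)` lands near 62.5.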
---
# Positivity

- Let `\(f_{\bar{A}_{k-1}, \bar{L}_k}\)` be the joint pdf for the treatment history before point `\(k\)` and the covariate history.

- For time-varying treatment, positivity requires that
`$$f_{\bar{A}_{k-1}, \bar{L}_k}(\bar{a}_{k-1}, \bar{l}_k) > 0 \Rightarrow f_{A_{k} \vert \bar{A}_{k-1}, \bar{L}_k}(a_k \vert \bar{a}_{k-1}, \bar{l}_k) > 0$$`

- If we are interested in a particular strategy, `\(g\)`, the condition only needs to hold for treatment histories compatible with `\(g\)` ( `\(a_k = g_k(\bar{a}_{k-1}, \bar{l}_k)\)` ).

- This condition says that given past treatment history and covariates, any treatment consistent with the strategy should be possible.

---
# Consistency

- For a point treatment, consistency requires that `\(A = a \Rightarrow Y(a) = Y\)`.

- For a static strategy, the condition `\(\bar{A} = \bar{a} \Rightarrow Y(\bar{a}) = Y\)` is sufficient.

- For dynamic strategies, if `\(A_k = g_k(\bar{A}_{k-1}, \bar{L}_k)\)` for all `\(k\)` then `\(Y(g) = Y\)`.

---
# Estimation for Static Treatment Strategies

- Consider our previous example:

<center> <img src="img/9_dag2sw.png" width="55%" /> </center>

- In this example, we could estimate the average effect of `\(A_0\)` or the average effect of `\(A_1\)` using our previous techniques.

- Suppose we want to compare the strategy always treat, `\(Y(1, 1)\)`, with the strategy never treat, `\(Y(0, 0)\)`.

<!-- - Can we use our previous strategies to estimate the joint counterfactuals `\(Y(1,1)\)` and `\(Y(0,0)\)`? -->

---
# Example

- Data below were aggregated from a trial of 32,000 units.
  + The trial conforms to our previous DAG.
  + Treatment at time 0 is random with probability 0.5.
  + Treatment at time 1 depends only on `\(L_1\)`: `\(P[A_1 = 1 \vert L_1 = 1] = 0.8\)`, `\(P[A_1 = 1 \vert L_1 = 0] = 0.4\)`.
  + In these data, there is no effect of treatment on `\(Y\)`.
<table class="table" style="width: auto !important; float: left; margin-right: 10px;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 52 </td> </tr> <tr> <td style="text-align:right;"> 9600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 52 </td> </tr> </tbody> </table> <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 4800 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 3200 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td 
style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 44 </td> </tr> <tr> <td style="text-align:right;"> 6400 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 44 </td> </tr> </tbody> </table> --- # Example <table class="table" style="width: auto !important; float: left; margin-right: 10px;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 52 </td> </tr> <tr> <td style="text-align:right;"> 9600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 52 </td> </tr> </tbody> </table> <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> 
</thead> <tbody> <tr> <td style="text-align:right;"> 4800 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 3200 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 44 </td> </tr> <tr> <td style="text-align:right;"> 6400 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 44 </td> </tr> </tbody> </table> + Within each stratum of `\((A_0, L_1)\)`, there is no treatment effect. + There is also no average effect of `\(A_0\)`: + `\(E[Y \vert A_0 = 0] = \frac{40\cdot 84 + 120 \cdot 52}{160} = 60\)` + `\(E[Y \vert A_0 = 1] = \frac{80\cdot 76 + 80 \cdot 44}{160} = 60\)` --- # Example <table class="table" style="width: auto !important; float: left; margin-right: 10px;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 2400 </td> <td 
style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 52 </td> </tr> <tr> <td style="text-align:right;"> 9600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 52 </td> </tr> </tbody> </table> <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 4800 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 3200 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 44 </td> </tr> <tr> <td style="text-align:right;"> 6400 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 44 </td> </tr> </tbody> </table> - We want to estimate `\(E[Y(1,1)] - E[Y(0, 0)]\)` - We could try computing the associational difference `\(E[Y \vert A_0 = 1, A_1 = 1] - E[Y \vert A_0 = 0, A_1 = 0]\)`: + `\(E[Y \vert A_0 = 1, A_1 = 1] = \frac{32\cdot 76 + 64 \cdot 44}{96} = 54.67\)` + `\(E[Y \vert A_0 = 0, A_1 = 0] = \frac{24 \cdot 84 + 24 \cdot 52}{48} = 68\)` -- - Problem: We have confounding from `\(L_1\)`. 
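---
# Example: Checking the Naive Contrast in Code

The confounded contrast above can be verified directly from the aggregated counts. A minimal Python sketch (the table values are copied from the slides; the helper function is ours):

```python
# Aggregated trial data: (N, A0, L1, A1, mean Y), from the slides.
rows = [
    (2400, 0, 0, 0, 84), (1600, 0, 0, 1, 84),
    (2400, 0, 1, 0, 52), (9600, 0, 1, 1, 52),
    (4800, 1, 0, 0, 76), (3200, 1, 0, 1, 76),
    (1600, 1, 1, 0, 44), (6400, 1, 1, 1, 44),
]

def cond_mean(a0, a1):
    """Associational mean E[Y | A0 = a0, A1 = a1] from the counts."""
    sel = [(n, y) for n, A0, L1, A1, y in rows if (A0, A1) == (a0, a1)]
    return sum(n * y for n, y in sel) / sum(n for n, _ in sel)

# cond_mean(1, 1) is about 54.67 while cond_mean(0, 0) is 68, so the
# naive contrast is about -13.33 even though the true effect is zero.
naive_contrast = cond_mean(1, 1) - cond_mean(0, 0)
```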
--- # Example <table class="table" style="width: auto !important; float: left; margin-right: 10px;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 52 </td> </tr> <tr> <td style="text-align:right;"> 9600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 52 </td> </tr> </tbody> </table> <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 4800 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 3200 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 76 </td> 
</tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 44 </td> </tr> <tr> <td style="text-align:right;"> 6400 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 44 </td> </tr> </tbody> </table>

- To account for confounding by `\(L_1\)`, we could try computing the associational difference within strata of `\(L_1\)` and then standardizing.

- Stratifying on `\(L_1\)`:
`$$E[Y \vert A_0 = 1, A_1 = 1, L_1 = 0] - E[Y \vert A_0 = 0, A_1 = 0, L_1 = 0] = 76-84 = -8$$`
`$$E[Y \vert A_0 = 1, A_1 = 1, L_1 = 1] - E[Y \vert A_0 = 0, A_1 = 0, L_1 = 1] = 44-52 = -8$$`

--

- Problem: `\(L_1\)` is a collider. Conditioning on `\(L_1\)` induces an association between `\(A_0\)` and `\(Y\)`.

---
# G-Formula

- The g-formula for point treatments has been the basis of the IPW, standardization, and doubly robust methods we have seen so far:

--

`$$E[Y(a)] = \sum_l E[Y \vert A = a, L = l]f_L(l)$$`

- Integral version for continuous `\(L\)`:
`$$E[Y(a)] = \int_{l} E[Y \vert A = a, L = l] d F_L(l)$$`

--

- To derive this, we used the iterated expectation formula, exchangeability, and consistency:
`$$E[Y(a)] = E_L[ E_Y[ Y(a) \vert L]] = \\\ E_L[ E_Y[ Y(a) \vert A = a, L]] = \\\ E_{L}[E[Y \vert A = a, L]] = \int_l E[Y \vert A = a, L = l] dF_L(l)$$`

---
# G-Formula for Static Treatment Strategies

- The g-formula for two time points is
`$$E[Y(a_0, a_1)] = \sum_l E[Y \vert A_0 = a_0, A_1 = a_1, L_1 = l]f_{L_1 \vert A_0}(l \vert a_0)$$`

- First intuition: Write
`$$E[Y(a_0, a_1)] = \sum_{l_1} E[Y(a_0, a_1) \vert L_1(a_0) = l_1]P[L_1(a_0) = l_1]$$`

- Use sequential exchangeability to
  + Replace `\(E[Y(a_0, a_1) \vert L_1(a_0) = l_1]\)` with `\(E[Y \vert A_0 = a_0, A_1 = a_1, L_1 = l_1]\)` and
  + `\(P[L_1(a_0) = l_1]\)` with `\(P[L_1 = l_1 \vert A_0 = a_0]\)`.

---
# Sequential Exchangeability

- Our
assumption that `\(P[L_1 = l_1 \vert A_0 = a_0] = P[L_1(a_0) = l_1]\)` and `\(E[Y(a_0, a_1) \vert L_1(a_0) = l_1] = E[Y \vert A_0 = a_0, A_1 = a_1, L_1 = l_1]\)` requires a stronger form of sequential exchangeability:
`$$\left(Y(\bar{a}_k), \underline{L}_{k+1}(\bar{a}_k)\right) \ci A_{k} \vert \bar{A}_{k-1} = \bar{a}_{k-1}, \bar{L}_k$$`

- Where `\(\underline{L}_{k+1}\)` is all covariates *after* time `\(k\)`.

- It turns out that the G-formula holds under only our weaker condition of sequential exchangeability for `\(Y(\bar{a})\)`.

- Static sequential exchangeability is sufficient for identifying static treatment strategies.

<!-- --- -->
<!-- # G-Formula for Static Treatment Strategies -->
<!-- - The g-formula can be derived by applying the iterated expectation formula, even without a causal interpretation of either `\(E[Y \vert A_0 = a_0, A_1 = a_1, L_1 = l_1]\)` or `\(P[L_1 \vert A_0 = a_0]\)` -->
<!-- $$E[Y(a_0, a_1)] = E_{L_1} [ E_{Y}[Y(a_0, a_1) \vert L_1]] = \\\ -->
<!-- E_{L_1} [ E_{Y}[Y(a_0, a_1) \vert A_0 = a_0, A_1 = a_1, L_1]] = \\\ -->
<!-- \sum_{l}E[Y(a_0, a_1) \vert A_0 = a_0, A_1 = a_1, L_1 = l]P[L_1 \vert A_1 = a_1] = \\\ -->
<!-- \sum_{l}E[Y \vert A_0 = a_0, A_1 = a_1, L_1 = l]P[L_1 \vert A_1 = a_1]$$ -->

---
# G-Formula for Our Example

<table class="table" style="width: auto !important; float: left; margin-right: 10px;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 84 </td>
</tr> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 52 </td> </tr> <tr> <td style="text-align:right;"> 9600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 52 </td> </tr> </tbody> </table> <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 4800 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 3200 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 44 </td> </tr> <tr> <td style="text-align:right;"> 6400 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 44 </td> </tr> </tbody> </table> `$$E[Y(a_0, a_1)] = \sum_l E[Y \vert A_0 = a_0, A_1 = a_1, L_1 = l]f_{L_1 \vert A_0}(l \vert a_0)$$` -- + `\(E[Y \vert A_0 = 1, A_1 = 1, L_1 = 0]P[L_1 = 0 \vert A_0 = 1] = 76 \cdot \frac{1}{2} = 38\)` + `\(E[Y \vert A_0 = 1, A_1 = 1, L_1 = 1]P[L_1 = 1 \vert A_0 = 1] = 44 \cdot \frac{1}{2} = 22\)` + `\(E[Y(1, 1)] = 38 + 22 = 60\)` -- + 
`\(E[Y \vert A_0 = 0, A_1 = 0, L_1 = 0]P[L_1 = 0 \vert A_0 = 0] = 84 \cdot \frac{1}{4} = 21\)`
  + `\(E[Y \vert A_0 = 0, A_1 = 0, L_1 = 1]P[L_1 = 1 \vert A_0 = 0] = 52 \cdot \frac{3}{4} = 39\)`
  + `\(E[Y(0, 0)] = 21 + 39 = 60\)`

---
# General Version of G-Formula for Static Treatments

- The g-formula for a static treatment strategy generalizes to
`$$E[Y(\bar{a})] = \sum_\bar{l} E[Y \vert \bar{A} = \bar{a},\bar{L}= \bar{l}]\prod_{k = 0}^Kf(l_k \vert \bar{a}_{k-1}, \bar{l}_{k-1})$$`
or
`$$\int_l E[Y \vert \bar{A} = \bar{a},\bar{L}= \bar{l}] \prod_{k = 0}^K dF (l_k \vert \bar{a}_{k-1}, \bar{l}_{k-1})$$`

---
# G-Formula for Dynamic Treatment Strategies

- In a static deterministic strategy, `\(a_k\)` is fixed in advance, so under the intervention each `\(a_k\)` occurs with probability 1 and no extra term is needed.

- For dynamic or random strategies, this is not the case, and we need to add a term to the g-formula:
`$$E[Y(g)] = \sum_{\bar{a}, \bar{l}} E[Y \vert \bar{A} = \bar{a},\bar{L}= \bar{l}]\prod_{k = 0}^Kf(l_k \vert \bar{a}_{k-1}, \bar{l}_{k-1})\prod_{k=0}^K f^{int}(a_k \vert \bar{a}_{k-1}, \bar{l}_k)$$`

- `\(f^{int}\)` is the conditional probability of `\(a_k\)` given the history *under the proposed intervention*.

---
# Inverse Probability Weighting

- We can generalize the IPW strategy we have been using for a point treatment to the time-varying regime:
`$$W^A = \prod_{k = 0}^{K} \frac{1}{f(A_k \vert \bar{A}_{k-1}, \bar{L}_{k})}$$`

--

- As before, we can stabilize the weights as
`$$SW^A = \prod_{k = 0}^{K} \frac{f(A_k \vert \bar{A}_{k-1})}{f(A_k \vert \bar{A}_{k-1}, \bar{L}_{k})}$$`

- If there are baseline covariates, `\(L_0\)`, we can condition on `\(L_0\)` in both numerator and denominator:
`$$SW^A = \prod_{k = 0}^{K} \frac{f(A_k \vert \bar{A}_{k-1}, L_0)}{f(A_k \vert \bar{A}_{k-1}, \bar{L}_{k}, L_0)}$$`

- Only the model for the denominator needs to be correct.

---
# Inverse Probability Weighting

- Just like before, weighting subjects creates a pseudo-population in which treatment and confounders are uncorrelated.
- So we can compute the counterfactual mean as simply the conditional mean in the pseudo-population:
`$$E[Y(a_0, a_1)] = E_{ps}[Y \vert A_0 = a_0, A_1 = a_1]$$`

---
# IP Weighting Example

<center> <img src="img/9_fig211.png" width="60%" /> </center>

- Compute unstabilized weights in the example.

- Compute the sample size in each stratum. How big is the pseudo-population?

---
# IP Weighting Example

<center> <img src="img/9_fig211.png" width="65%" /> </center>

- Compute stabilized weights in the example.

- How big is the pseudo-population created by the stabilized weights?

---
# IP Weighting Example

<center> <img src="img/9_fig213.png" width="80%" /> </center>

---
# Using IP Weights Non-Parametrically

+ Once we have computed `\(W^{\bar{A}}\)` or `\(SW^{\bar{A}}\)` we can estimate `\(E[Y(\bar{a})]\)` as
`$$\hat{E}\left[W_i^{\bar{A}}Y_i I(\bar{A}_i = \bar{a}) \right] = \frac{1}{N}\sum_{i = 1}^{N} W^{\bar{A}}_i Y_i I(\bar{A}_i = \bar{a})$$`

- We could have used either stabilized or unstabilized weights.

- Notice that we are only making use of samples whose observed treatment history is identical to the proposed intervention.

- Suppose we want to compare "always treat" and "never treat" strategies.
  + An equivalent procedure to IP weighting is to censor everyone who changed treatment (non-adherent units).
  + We then combine weights for treatment and weights for censoring:
`$$\frac{1}{P[A_0 = a]} \times \frac{1}{P[A_1 = a \vert A_0 = a, L_1]}$$`

---
# Weights for Dynamic Treatments

<center> <img src="img/9_fig213cropped.png" width="80%" /> </center>

- Suppose we want to compare dynamic regimes `\(g = (0, L_1)\)` and `\(g^{\prime} = (0, 1-L_1)\)`.

- We can see that, conditional on `\(A_0\)` and `\(L_1\)`, treatment choice at `\(A_1\)` doesn't matter, so we should find that `\(E[Y(g)] - E[Y(g^{\prime})] = 0\)`.
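---
# Checking the Dynamic-Regime Weights in Code

This prediction can be checked numerically with unstabilized weights. A minimal Python sketch (the counts and assignment probabilities come from the slides; the helper names are ours):

```python
# Aggregated trial data: (N, A0, L1, A1, mean Y), from the slides.
rows = [
    (2400, 0, 0, 0, 84), (1600, 0, 0, 1, 84),
    (2400, 0, 1, 0, 52), (9600, 0, 1, 1, 52),
    (4800, 1, 0, 0, 76), (3200, 1, 0, 1, 76),
    (1600, 1, 1, 0, 44), (6400, 1, 1, 1, 44),
]
P_A0 = 0.5                # P[A0 = 1]: randomized
P_A1 = {0: 0.4, 1: 0.8}   # P[A1 = 1 | L1]: trial design

def ipw_mean(g):
    """E[Y(g)] via unstabilized IP weighting; g maps (k, L_k) to A_k."""
    num = den = 0.0
    for n, a0, l1, a1, y in rows:
        if a0 != g(0, None) or a1 != g(1, l1):
            continue  # use only units whose history follows g
        pa0 = P_A0 if a0 == 1 else 1 - P_A0
        pa1 = P_A1[l1] if a1 == 1 else 1 - P_A1[l1]
        w = 1.0 / (pa0 * pa1)  # W = 1 / prod_k f(A_k | past)
        num += n * w * y
        den += n * w
    return num / den

g = lambda k, l: 0 if k == 0 else l          # treat at time 1 iff L1 = 1
g_prime = lambda k, l: 0 if k == 0 else 1 - l
# ipw_mean(g) and ipw_mean(g_prime) both come out to 60.
```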
---
# Weights for Dynamic Treatments

<center> <img src="img/9_fig213cropped.png" width="80%" /> </center>

- Using the `\(W^{\bar{A}}\)` weights gives the right answer:
`$$E[Y(g)] = \frac{80\cdot 84 + 240 \cdot 52}{80 + 240} = 60 \qquad E[Y(g^{\prime})] = \frac{80\cdot 84 + 240 \cdot 52}{80 + 240} = 60$$`

- But using `\(SW^{\bar{A}}\)` gives the wrong answer. Why?
`$$E[Y(g)] = \frac{12\cdot 84 + 84 \cdot 52}{12 + 84} = 56 \qquad E[Y(g^{\prime})] = \frac{28\cdot 84 + 36 \cdot 52}{28 + 36} = 66$$`

---
# Weights for Dynamic Treatments

- The problem is the numerator of the stabilized weights.

- To estimate `\(E[Y(g)]\)`, we can limit ourselves only to units with `\(\bar{A}_i = (g_0(L_{0,i}), g_1(A_{0,i}, \bar{L}_{1,i}), \dots, g_K(\bar{A}_{k-1,i}, \bar{L}_{k,i}))\)`.

- For convenience, abbreviate `\(g_k(\bar{A}_{k-1}, \bar{L}_k)\)` as `\(g_k\)`.

- The denominator of the stabilized weight is the same as before:
`$$\prod_{k=0}^K \frac{1}{f(A_k = g_k \vert \bar{A}_{k-1} = g_{k-1}, \bar{L}_k)} = \prod_{k=0}^{K}\frac{1}{f(A_k \vert \bar{A}_{k-1}, \bar{L}_{k})}$$`

- The numerator is not the same. The numerator should be
`$$\prod_{k=0}^{K}f(A_k = g_k \vert \bar{A}_{k-1} = g_{k-1}) = \prod_{k=0}^K \sum_{l} f(A_k = g_k \vert \bar{A}_{k-1} = g_{k-1}, \bar{L}_{k} = l) P[\bar{L}_k = l \vert \bar{A}_{k-1} = g_{k-1}]$$`

---
# Weights for Dynamic Treatments

<center> <img src="img/9_fig213cropped.png" width="80%" /> </center>

- The numerator of the stabilized weights for `\(g = (0, L_1)\)` for the first and fourth row should be
`$$(0.5)(0.6\cdot 0.25 + 0.8 \cdot 0.75) = (0.5)(0.75)$$`

- The numerator of the stabilized weights for `\(g^{\prime} = (0, 1-L_1)\)` for the second and third row should be
`$$(0.5)(0.4\cdot 0.25 + 0.2 \cdot 0.75) = (0.5)(0.25)$$`

---
# Estimating Weights

- If `\(L_k\)` is high dimensional or there are many time points, we will need to assume a parametric model for `\(f(A_k \vert \bar{A}_{k-1}, \bar{L}_k)\)`.
+ We might assume that `\(A_k\)` depends only on the most recent treatment and covariates.
  + Or on some summary of the past history.

- We can fit one model (e.g. logistic regression) at each time point:
`$$E[A_k \vert \bar{A}_{k-1}, \bar{L}_{k}] = \beta_{0,k} + \beta_{1,k} A_{k-1} + \beta_{2,k} cum_{-5}(\bar{A}_{k-1}) + \beta_{3,k} L_{k}$$`
  + `\(cum_{-5}(\bar{A}_{k-1})\)` is the number of times treated out of the previous five treatment times.

- Alternatively, we could assume that some coefficients are shared across time points and fit a pooled model, possibly with some time effects:
`$$E[A_k \vert \bar{A}_{k-1}, \bar{L}_{k}] = \beta_{0,k} + \beta_1 A_{k-1} + \beta_2 cum_{-5}(\bar{A}_{k-1}) + \beta_3 L_{k} + \beta_4 A_{k-1} k$$`

---
# Non-Parametric Estimation

- If we are totally non-parametric and there is no effect of treatment on confounders, then estimating the effect of a treatment strategy `\(\bar{a}\)` or `\(g\)` is very similar to estimating the effect of a point treatment of assignment to a particular regime.

- In fact, if there is no effect of treatment history on time-varying confounders, `\(f(L_k \vert \bar{A}_{k-1}) = f(L_k)\)` and the G-formula for time-varying treatment reduces to the regular point-treatment G-formula.

- HR call the presence of effects of past treatment on the time-varying confounders "treatment-confounder feedback".

- Whether or not there is treatment-confounder feedback, if we are willing to make parametric assumptions, we can borrow information from units with similar treatment histories.

---
# Marginal Structural Models

- Just as we did before, we can use our IP weights to fit parametric marginal structural models.

- For example, we might assume that the total treatment effect of `\(\bar{a}\)` only depends on the total number of times treated and not on the timing of the treatment:
`$$E[Y(\bar{a})] = \beta_0 + \beta_1 cum(\bar{a})$$`

---
# Marginal Structural Models

- Once we have proposed a marginal structural mean model, we can fit it using the pseudo-population created by weighting the data.
`$$E_{ps}[Y \vert \bar{A}] = \beta_0 + \beta_1 cum(\bar{A})$$` - `\(\hat{\beta}_1\)` estimates the causal effect of increasing the number of treated periods by 1. - Variance can be estimated by the bootstrap or, conservatively, by the robust sandwich estimator. - Testing `\(\hat{\beta}_1 = 0\)` gives a test of the strong null that treatment at any time is unrelated to outcome, `\(Y(\bar{a}) = Y\)` for all `\(\bar{a}\)`. --- # Marginal Structural Models for Effect Modification - We could propose a marginal structural model that includes effect modification `$$E_{ps}[Y \vert \bar{A}] = \beta_0 + \beta_1 cum(\bar{A}) + \beta_2 V + \beta_3 cum(\bar{A}) V$$` - What are `\(\beta_1\)`, `\(\beta_2\)` and `\(\beta_3\)`? --- # Assumptions - For correct inference using IP weighting + a marginal structural model we need: -- - Consistency, sequential positivity, sequential conditional exchangeability - Correct propensity-score model - Correct marginal structural model --- # Parametric G-Formula - When we only had a single point intervention, we could use outcome regression plus standardization as a plug-in estimator of the g-formula. - We needed to estimate `\(E[Y \vert A, L]\)` but not `\(f_L(l)\)`, the density of covariates. - To estimate `\(E[Y(a)]\)` we replaced each person's treatment value with `\(a\)` and then estimated `\(\hat{Y}_i(a) = \hat{E}[Y \vert A = a, L = L_i]\)`. + We then approximated the integral `$$\int_l E[Y \vert A = a, L = l]f_L(l) dl$$` with the sum `$$\frac{1}{N} \sum_{i = 1}^N \hat{Y}_i(a)$$` --- # Parametric G-Formula - There is an analog of this strategy for time-varying treatments. - Recall the integral form of the G-formula for static time-varying treatments: `$$\int E[Y \vert \bar{A} = \bar{a},\bar{L}= \bar{l}] \prod_{k = 0}^K dF (l_k \vert \bar{a}_{k-1}, \bar{l}_{k-1})$$` - In the time-varying g-formula, we clearly need to estimate `\(E[Y \vert \bar{A}, \bar{L}]\)`. - Can we use our same standardization trick to avoid estimating the covariate density?
-- - No --- # Parametric G-Formula - Imagine we try the standardization trick with two time points: - We first set `\(A_0 = a_0\)` and `\(A_1 = a_1\)` for everyone in the data set. - What is the problem? -- - The value of the covariates `\(L_1\)` depends on `\(A_0\)`. + We need to replace `\(L_1\)` with `\(L_1(a_0)\)` but these values are not observed. --- # Simulating Covariates - We need to propose a parametric model for the density `\(f(l_k \vert \bar{a}_{k-1}, \bar{l}_{k-1})\)`. - We can then simulate covariate histories conditional on the intervention of interest from our estimated model. <!-- - Finally, we compute `\(E[Y(\bar{a})]\)` by standardizing over the data set with simulated covariates replaced. --> --- # Parametric G-Formula Algorithm Step 1. Fit parametric models for + `\(m(\bar{A}, \bar{L}; \theta) = E[Y \vert \bar{A}, \bar{L}]\)` + `\(e_{L_k}(\bar{A}_{k-1}, \bar{L}_{k-1}; \beta) = f(L_k \vert \bar{A}_{k-1}, \bar{L}_{k-1})\)` + `\(e_{L_k}\)` is a `\(p\)`-dimensional density, where `\(p\)` is the dimension of `\(L_{k}\)`. + We might propose a component-wise model for `\(e_{L_k}\)`. + Note that `\(e_{L_k}\)` is a model for a full density, not just a conditional expectation.
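--- # Parametric G-Formula: Sketch

- A minimal sketch of the fit-then-simulate procedure for two time points, in Python rather than R. The data-generating model, coefficient values, and variable names below are all invented for illustration, and the covariate model for `\(L_1\)` is assumed to be normal with a linear mean.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50_000

# Hypothetical observational data with treatment-confounder feedback
# (A0 affects L1). All coefficients are invented for illustration.
L0 = rng.normal(size=N)
A0 = rng.binomial(1, 1 / (1 + np.exp(-0.5 * L0)))
L1 = 0.7 * L0 - 0.5 * A0 + rng.normal(size=N)
A1 = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * L1 + A0))))
Y = 1.0 + 0.6 * L1 + A0 + A1 + rng.normal(size=N)

def ols(X, y):
    """Least-squares coefficients."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Step 1: fit the outcome model m and the covariate model e_{L1}
# (L1 | A0, L0 modeled as normal with a linear mean).
theta = ols(np.column_stack([np.ones(N), A0, A1, L0, L1]), Y)
Xl = np.column_stack([np.ones(N), A0, L0])
beta = ols(Xl, L1)
sigma = np.std(L1 - Xl @ beta)

# Step 2: keep each unit's L0, set A0 = A1 = 1 ("always treat"),
# and simulate L1 from the fitted covariate model.
a0 = a1 = 1.0
L1_sim = beta[0] + beta[1] * a0 + beta[2] * L0 + sigma * rng.normal(size=N)

# Step 3: plug the intervened histories into the fitted outcome model.
Y_hat = theta[0] + theta[1] * a0 + theta[2] * a1 + theta[3] * L0 + theta[4] * L1_sim

# Step 4: average. In this simulation the true E[Y(1,1)] works out to 2.7.
print(round(Y_hat.mean(), 2))
```

- With correctly specified models the average lands near the true `\(E[Y(1,1)]\)`; variance would typically come from bootstrapping the entire procedure.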
--- # Parametric G-Formula Algorithm Step 2: - Start with the original data and delete everything after time 0 <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:left;"> \(L_{0}\) </th> <th style="text-align:left;"> \(A_{0}\) </th> <th style="text-align:left;"> \(L_{1}\) </th> <th style="text-align:left;"> \(A_{1}\) </th> <th style="text-align:left;"> \(Y\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> \(l_{0, 1}\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> \(l_{0, 2}\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> </tr> <tr> <td style="text-align:left;"> N </td> <td style="text-align:left;"> \(l_{0, N}\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> </tbody> </table> --- # Parametric G-Formula Algorithm Step 2: - Fill `\(A_0\)` in with the value dictated by the intervention, `\(g\)` <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:left;"> \(L_{0}\) </th> <th style="text-align:left;"> \(A_{0}\) </th> <th style="text-align:left;"> \(L_{1}\) </th> <th style="text-align:left;"> \(A_{1}\) </th> <th 
style="text-align:left;"> \(Y\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> \(l_{0, 1}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,1})\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> \(l_{0, 2}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,2})\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> </tr> <tr> <td style="text-align:left;"> N </td> <td style="text-align:left;"> \(l_{0, N}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,N})\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> </tbody> </table> --- # Parametric G-Formula Algorithm Step 2: - Simulate values for `\(L_1\)` by sampling `\(\tilde{l}_{1,i}(g)\)` from `\(e_{L_1}(g_0(L_{0,i}), L_{0,i}; \hat{\beta})\)` <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:left;"> \(L_{0}\) </th> <th style="text-align:left;"> \(A_{0}\) </th> <th style="text-align:left;"> \(L_{1}\) </th> <th style="text-align:left;"> \(A_{1}\) </th> <th style="text-align:left;"> \(Y\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> \(l_{0, 1}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,1})\) </td> <td
style="text-align:left;color: red !important;"> \(\tilde{l}_{1,1}(g)\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> \(l_{0, 2}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,2})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,2}(g)\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> </tr> <tr> <td style="text-align:left;"> N </td> <td style="text-align:left;"> \(l_{0, N}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,N})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,N}(g)\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> </tbody> </table> --- # Parametric G-Formula Algorithm Step 2: - Repeat for all subsequent time-points <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:left;"> \(L_{0}\) </th> <th style="text-align:left;"> \(A_{0}\) </th> <th style="text-align:left;"> \(L_{1}\) </th> <th style="text-align:left;"> \(A_{1}\) </th> <th style="text-align:left;"> \(Y\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> \(l_{0, 1}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,1})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,1}(g)\) </td> <td style="text-align:left;color: red !important;"> \(g_1(g_0(l_{0,1}), \bar{\tilde{l}}_{1,1}(g))\) </td> <td 
style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> \(l_{0, 2}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,2})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,2}(g)\) </td> <td style="text-align:left;color: red !important;"> \(g_1(g_0(l_{0,2}), \bar{\tilde{l}}_{1,2}(g))\) </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> </tr> <tr> <td style="text-align:left;"> N </td> <td style="text-align:left;"> \(l_{0, N}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,N})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,N}(g)\) </td> <td style="text-align:left;color: red !important;"> \(g_1(g_0(l_{0,N}), \bar{\tilde{l}}_{1,N}(g))\) </td> <td style="text-align:left;"> </td> </tr> </tbody> </table> --- # Parametric G-Formula Algorithm Step 3: - Fill in `\(\hat{Y}_i\)` by plugging previous values of treatment and covariates into the fitted outcome model `\(m\)`. 
<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:left;"> \(L_{0}\) </th> <th style="text-align:left;"> \(A_{0}\) </th> <th style="text-align:left;"> \(L_{1}\) </th> <th style="text-align:left;"> \(A_{1}\) </th> <th style="text-align:left;"> \(Y\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> \(l_{0, 1}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,1})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,1}(g)\) </td> <td style="text-align:left;color: red !important;"> \(g_1(g_0(l_{0,1}), \bar{\tilde{l}}_{1,1}(g))\) </td> <td style="text-align:left;color: red !important;"> \(\hat{Y}_1(g)\) </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> \(l_{0, 2}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,2})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,2}(g)\) </td> <td style="text-align:left;color: red !important;"> \(g_1(g_0(l_{0,2}), \bar{\tilde{l}}_{1,2}(g))\) </td> <td style="text-align:left;color: red !important;"> \(\hat{Y}_2(g)\) </td> </tr> <tr> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> </tr> <tr> <td style="text-align:left;"> N </td> <td style="text-align:left;"> \(l_{0, N}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,N})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,N}(g)\) </td> <td style="text-align:left;color: red !important;"> \(g_1(g_0(l_{0,N}), \bar{\tilde{l}}_{1,N}(g))\) </td> 
<td style="text-align:left;color: red !important;"> \(\hat{Y}_N(g)\) </td> </tr> </tbody> </table> -- Step 4: + Compute the mean `\(\frac{1}{N} \sum_{i=1}^N \hat{Y}_i(g)\)` --- # Simulating Covariates - Since we are simulating, we might as well create more data. - Rather than starting with the original data set, resample a large number of observations, `\(S\)`, with replacement from the original data. + In the resampled data, the procedure is the same: keep `\(L_0\)` and generate all subsequent data from the fitted models. + Since we are sampling covariates from a distribution, this procedure helps reduce the variance of our estimate. - Alternatively, instead of resampling, we could replicate the original data several times. --- # Assumptions - For valid inference using the parametric g-formula we need: - Consistency, sequential positivity, sequential conditional exchangeability - Correct model for `\(E[Y \vert \bar{A}, \bar{L}]\)` - Correct model for the density of `\(L_k\)` given covariate and treatment history. --- # Parametric G-Formula Implementation - The R package `gfoRmula` implements the estimation and simulation procedure we have described. - Can handle: + Binary and continuous outcomes + Time to event outcomes - Can estimate effects of both static and dynamic treatments. - Allows a variety of specifications for the outcome model and covariate models, including lagged effects and cumulative effects. <!-- --- --> <!-- # Example --> <!-- - Built in data `binary_eofdata` contains data for 2,500 individuals measured at 7 time points. --> <!-- - There are 3 time-varying covariates. --> <!-- + `cov1` is binary --> <!-- + `cov2` is continuous --> <!-- + `cov3` is categorical with 6 values. --> <!-- + Treatment is continuous. --> --- # G-Null Paradox - In the DAG below, the strict null holds - there is no effect of treatment at any time on `\(Y\)`. - However, `\(Y\)` is correlated with `\(A_0\)` and `\(A_1\)` due to the confounder `\(U\)`.
- So `\(h_Y = E[Y \vert L_1, A_0, A_1]\)` will be a function of `\(A_0\)` and `\(A_1\)`. - Likewise, our estimate of `\(f(L_1 \vert A_0)\)` will depend on `\(A_0\)`. <center>
</center> --- # G-Null Paradox - Robins and Wasserman (1997) show that unless they are saturated, parametric models cannot correctly represent the g-formula under the null. - Suppose `\(L_1\)` is binary and `\(A_1\)` and `\(A_0\)` are continuous. - We fit the models `$$E[Y \vert L_1, A_1, A_0] = m(L_1, A_1, A_0; \theta) = \theta_0 + \theta_1 L_1 + \theta_2 A_1 + \theta_3 A_0$$` `$$P(L_1 = 1 \vert a_0) = e(L_1 = 1, A_0; \beta) = \frac{\exp(\beta_0 + \beta_1 A_0)}{1 + \exp(\beta_0 + \beta_1 A_0)}$$` - Plugging into the g-formula `$$E[Y(a_0, a_1)] = \sum_{l = 0}^1 m(l, a_1, a_0; \theta)e(l, a_0; \beta) =\\\ \theta_0 + \theta_2 a_1 + \theta_3 a_0 + \theta_1 \frac{\exp(\beta_0 + \beta_1 a_0)}{1 + \exp(\beta_0 + \beta_1 a_0)}$$` --- # G-Null Paradox - Under the strict null, `\(E[Y(a_0, a_1)]\)` does not depend on `\(a_0\)` and `\(a_1\)`. - We can see in our model that `\(\hat{E}[Y(a_0, a_1)]\)` always depends on `\(a_0\)` and `\(a_1\)` unless `\(\theta_2\)`, `\(\theta_3\)`, and `\(\beta_1\)` are all 0. + But this can't occur because we know `\(Y\)` is correlated with `\(A_1\)` and `\(A_0\)`. - If we had access to `\(U\)` and could model it correctly, this wouldn't be a problem. - Essentially, our parametric model does not include the strict null. --- # Beyond the Strict Null - It may be impossible to correctly specify the parametric G-formula even when the strict null does not hold. - For example, using the same models as before, the problem occurs in the DAG below, where there is an effect only of `\(A_1\)` on `\(Y\)`. - McGrath, Young, and Hernán (2022) show that non-negligible bias can occur in some non-null models *even* using flexible model specifications. - Previous results show that in at least some settings, bias from the g-null paradox is negligible. <center>
</center> --- # G-Null Paradox - By contrast, consider marginal structural models with IP weighting: - Under the strict null, `\(E[Y(\bar{a})]\)` does not depend on `\(\bar{a}\)`. - Under the strict null, the marginal structural mean model will never be mis-specified (as long as it contains an intercept parameter). - So the IP weighting method does not suffer from the g-null paradox. --- # Repeated Measures - Suppose that rather than observing only a single `\(Y\)` at the end of the study, we observe `\(Y_k\)` at every time point. - For example, suppose we are interested in the effect of highly active antiretroviral therapy (HAART) on CD4 count over time. + Example from Cole et al. (2007) - We can combine information across time points if we are willing to assume a common marginal structural model for each outcome. --- # Repeated Measures - Cole et al. (2007) use a marginal structural model for CD4 count at time `\(k\)` including a function of the cumulative time on HAART `$$E[Y_{k}(\bar{a}_k) \vert L_{0}] = \beta_{0,k} + \beta_{1}^{\top}L_{0} + \beta_2^\top h\left(\sum_{j =0}^k a_{j}\right)$$` - `\(h()\)` is a function that categorizes the cumulative time on treatment into four categories. - To fit this model, we need to assume that nobody was on treatment before the study, `\(A_{k} \equiv 0\)` for `\(k < 0\)`. + In this case that assumption is fine because the study starts when HAART became available. + The study also excludes anyone who was already on another form of ART. --- # Repeated Measures - Cole et al. compute weights `\(W_{ik}^{A}\)` for confounding by time-varying covariates and `\(W_{ik}^{C}\)` for censoring. - Note that the time-varying confounders *include* past values of CD4 count. - For any given time `\(k\)`, we could use our previous strategy and fit the marginal structural model weighted by `\(W_{ik}^{A}W_{ik}^C\)`.
- Our fitted model would let us compute the average counterfactual value of CD4 count that we would have observed at time `\(k\)` under various static interventions up to time `\(k\)`. --- # Repeated Measures - To combine across time points, we now think of our study as `\(K\)` (nested) shorter studies. - In study `\(k\)`, the outcome is `\(Y_k\)`. Past values of `\(Y\)` are time-varying covariates which may be confounders. <center> <img src="img/9_timevaryingY2.png" width="90%" /> </center> --- # Repeated Measures - Since we are using the same MSM in each study, we can fit one large pooled regression. - Each person contributes one row per time point after baseline. - The row for person `\(i\)` at time `\(k\)` is weighted by `\(W_{i,k}^A W_{i,k}^C\)`. - We then fit the MSM to the expanded data set using robust standard errors to account for repeated measurements of the same person. --- # Repeated Measures - This strategy works whenever there is an MSM that plausibly applies to all time points. - We relied on knowing the treatment value before time 0, usually assumed to be 0. - Alternatively, we could assume a "short memory". For example `$$E[Y_{k} \vert L_{0}, \bar{A}_{k}] = \beta_{0,k} + \beta_{1}^{\top}L_{0} + \beta_2 A_k + \beta_3 A_{k-1} + \beta_4 A_{k-2}$$` - In this model the outcome only depends on the three most recent treatments. - Without knowing treatment prior to the study, we can't use `\(Y_0\)` and `\(Y_1\)`, but we can fit the model for all time points from 2 onward. --- # Time to Event Outcomes - The treatment of time-to-event data is very similar to that of repeated measures data. + We have a repeated measure of death status, `\(D_k\)`. - We need a marginal structural model for the counterfactual hazard at time `\(k\)`. + Either use the Cox RR model + Or the logistic approximation - If we are using the Cox model, time-varying weights are used to adjust each individual's contribution to the "at risk" set at time `\(k\)`.
- The logistic model is fit using pooled data. - We've already seen an analysis like this in the NHS study. --- # Parametric G-Formula for Time to Event - To apply the parametric G-formula, we need a model for the hazard as a function of past history and time. - And a model (or models) for the covariate density as a function of past history. - For each person, we simulate time-varying covariates at each time. - We also compute the hazard at each time. We do not need to simulate death, only compute the probability of death. - We can then compute the counterfactual survival probability for each person. + And then average to find the average counterfactual survival. --- # Iterated Conditional Expectation (ICE) - Another way to express the G-formula is as a nested expectation. - For just two time points this is `$$E[Y(a_0, a_1)] = \color{purple}{E_{L_0}\bigg[ }\color{blue}{E_{L_1 \vert A_0 = a_0, L_0}\Big[} \color{green}{E\big[\color{red}{Y} \vert A_0 = a_0, A_1 = a_1, L_0, L_1\big] } \color{blue}{\Big \vert A_0 = a_0, L_0\Big]} \color{purple}{\bigg]}$$` --- # ICE Estimation - The nested expectation formulation gives us an alternative estimation strategy. - Define `$$Q_{K+1} = Y\\\ Q_{K} = E[Y \vert\bar{L}_{K}, \bar{A}_{K} = \bar{a}_{K}]\\\ \vdots\\\ Q_m = E[Q_{m+1} \vert \bar{L}_{m}, \bar{A}_{m} = \bar{a}_{m}]\\\ \vdots\\\ Q_0 = E[Q_1 \vert L_0, A_0 = a_0]$$` - We can propose a model for `\(E[Q_{m+1} \vert \bar{L}_{m}, \bar{A}_{m}]\)` and then iteratively fit each model. - When we get all the way down to 0, `\(E[Q_0]\)` will be an estimator of `\(E[Y(\bar{a})]\)`. --- # ICE Estimation - Example with two time points and a static intervention `\(\bar{a} = (a_0, a_1)\)`.
- In the first step (nothing really to do here), we use our definition that `\(Q_{2} = Y\)` `$$E[Y(a_0, a_1)] = \color{purple}{E_{L_0}\bigg[ }\color{blue}{E_{L_1 \vert A_0 = a_0, L_0}\Big[} \color{green}{E\big[\color{red}{Q_2} \vert A_0= a_0, A_1 = a_1, L_0, L_1\big] } \color{blue}{\Big \vert A_0 = a_0, L_0\Big]} \color{purple}{\bigg]}$$` -- - Let's propose a model: `$$E[Y\vert \bar{L}_{1}, \bar{A}_{1}] = \theta_{0, 1} + \theta_{1,1}cum(\bar{A}_{1}) + \theta_{2,1} L_{1}$$` - We can fit this as a regular regression. --- # ICE Estimation - Fitting this model gives us an estimate of the green part of the expression `$$\hat{Q}_1 = \hat{E}[Y \vert \bar{L}_1, A_1 = a_1, A_0]$$` which we plug into the expression `$$E[Y(a_0, a_1)] = \color{purple}{E_{L_0}\bigg[ }\color{blue}{E_{L_1 \vert A_0 = a_0, L_0}\Big[} \color{green}{\hat{Q}_1} \color{blue}{\Big \vert A_0 = a_0, L_0\Big]} \color{purple}{\bigg]}$$` - In order to fit the next regression, we estimate `\(\hat{Q}_1\)` for all the covariate values observed in our data. - We compute `\(\hat{Q}_{1,i}\)` for each person in the study by substituting `\(a_1\)` for `\(A_{1,i}\)` but leaving `\(L_{1,i}\)`, `\(L_{0,i}\)` and `\(A_{0,i}\)` at their natural values. --- # ICE Estimation - Next, use these fitted values to fit the same form of model one time point earlier `$$E[\hat{Q}_1 \vert L_{0}, A_{0}] = \theta_{0, 0} + \theta_{1,0}cum(A_0) + \theta_{2,0} L_{0}$$` - This gives us an estimate of `\(Q_0 = E[Q_1 \vert A_0 = a_0, L_0]\)` that we can plug into the expression. `$$E[Y(a_0, a_1)] = \color{purple}{E_{L_0}\bigg[ }\color{blue}{\hat{Q}_0} \color{purple}{\bigg]}$$` - We compute `\(\hat{Q}_{0,i}\)` for each person in the study by substituting `\(a_0\)` for `\(A_{0,i}\)` but leaving `\(L_{0,i}\)` at its natural value. - Finally we estimate `\(\hat{E}[Y(\bar{a})] = \frac{1}{N}\sum_{i=1}^N \hat{Q}_{0,i}\)` --- # ICE Estimation - The ICE estimator will be valid if the outcome regression was correct at each time point.
- An advantage over the parametric G-formula approach is that we didn't have to estimate the joint distribution of `\(L\)`. --- # Doubly Robust Estimator - We can turn the ICE estimator into a doubly robust estimator by adding a special covariate based on the propensity score to each regression. - Recall the "special covariate" regression strategy for estimating `\(Y(a)\)`: 1. Estimate `\(\hat{\pi}\)` using a `\(\pi\)`-model 2. Propose an outcome model `\(E[Y \vert A, L] = m(A, L; \theta)\)` 3. Estimate a model `\(E[Y \vert A, L] = m(A, L; \theta) + \phi_1 A/\hat{\pi} + \phi_2 (1-A)/(1-\hat{\pi})\)` 4. Compute the standardized estimates `$$E[Y(1)] = \frac{1}{N}\sum_{i=1}^N \left(m(1, L_i; \hat{\theta}) + \hat{\phi}_1\frac{1}{\hat{\pi}_i} \right)\\\ E[Y(0)] = \frac{1}{N}\sum_{i=1}^N \left(m(0, L_i; \hat{\theta}) + \hat{\phi}_2\frac{1}{1-\hat{\pi}_i} \right)$$` --- # Doubly Robust Estimator for Time Varying Treatment - We will use the same strategy in combination with the ICE estimation strategy. - At each regression step we compute the weights `$$W^{\bar{A}_m} = \prod_{k=0}^m \frac{1}{f(A_k \vert \bar{A}_{k-1}, \bar{L}_k)}$$` and `$$W^{\bar{A}_m, a_m = 1} = \frac{W^{\bar{A}_{m-1}}}{f(A_m = 1 \vert \bar{A}_{m-1}, \bar{L}_m)}$$` --- # Doubly Robust Estimator for Time Varying Treatment - In each regression step, we add `\(I(\bar{A}_{m-1} = \bar{a}_{m-1}) \hat{W}^{\bar{A}_m}\)` to the regression. - In the estimation step, we replace `\(A_m\)` with `\(a_m\)` *and* `\(W^{\bar{A}_m}\)` with `\(W^{\bar{A}_m, a_m = 1}\)`. <!-- --- --> <!-- # Doubly Robust Estimator for Time Varying Treatment --> <!-- - We will use a similar procedure with some modifications. --> <!-- 1a. Estimate time-varying weights (one per each person-time point) --> <!-- 1b. Compute the modified time-varying weight corresponding to the treatment strategy.
We will use "always treated" so `\(A_m = 1\)` --> <!-- `$$W^{\bar{A}_m, a_m = 1} = W^{\bar{A}_{m-1}} \frac{1}{f(A_m = 1 \vert \bar{A}_{m-1}, \bar{L}_m)}$$` --> <!-- --- --> <!-- # Doubly Robust Estimator for Time Varying Treatment --> <!-- 2a. Start at time `\(K\)` and fit an outcome model for `\(E[Y \vert \bar{A}_{K}, \bar{L}_{K}]\)` including `\(\hat{W}^{\bar{A}_K}\)` as a covariate. For example, --> <!-- `$$E[ Y \vert \bar{A}_K, \bar{L}_K] = \theta_{0, K} + \theta_{1,K}cum(\bar{A}_K) + \theta_{2,K} L_K + \theta_{3,K} \hat{W}^{\bar{A}_K}$$` --> <!-- Define `\(\hat{T}_{K+1} = Y\)`. --> <!-- 2b. Estimate `\(\hat{T}_{K}\)` under the treatment strategy in which `\(A_{K}\)` was set to 1. --> <!-- + To do this, we replace `\(A_K\)` in our data with 1 and `\(\hat{W}^{\bar{A}_K}\)` with `\(\hat{W}^{\bar{A}_{K-1}, a_K = 1}\)` --> <!-- + And then take the average of the predicted values with the new data. --> <!-- --- --> <!-- # Doubly Robust Estimator for Time Varying Treatment --> <!-- 2c. Proceed backwards to the next time point. Fit the same model, except now `\(\hat{T}_{K}\)` rather than `\(Y\)` is the outcome. --> <!-- `$$E[ \hat{T}_{K} \vert \bar{A}_{K-1}, \bar{L}_{K-1}] = \theta_{0, K} + \theta_{1,K-1}cum(\bar{A}_{K-1}) + \theta_{2,K-1} L_{K-1} + \theta_{3,K-1} \hat{W}^{\bar{A}_{K-1}}$$` --> <!-- 2d. Estimate `\(\hat{T}_{K-1}\)` as the fitted values from this model, replacing `\(A_{K-1}\)` with the intervention value `\(a_{K-1} = 1\)` and `\(\hat{W}^{\bar{A}_{K-1}}\)` with `\(\hat{W}^{\bar{A}_{K-2}, a_{K-1} = 1}\)`. --> <!-- --- --> <!-- # Doubly Robust Estimator for Time Varying Treatment --> <!-- 2e. Keep going all the way back to time point 0. --> <!-- - When we get all the way back to time 0, the average `\(\frac{1}{N}\sum_{i = 1}^N \hat{T}_{0,i}\)` is a doubly robust estimator of `\(Y(\bar{a})\)`. 
--> --- # Doubly Robust Estimator for Time Varying Treatment Our estimator is valid if: 1) The treatment model is correct at all times OR 2) The outcome model is correct OR 3) The treatment model is correct for times 0 to `\(k\)` and the outcome model is correct for times `\(k+1\)` to `\(K\)` for any `\(k < K\)` - This property is called `\(k + 1\)` robustness.
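--- # Doubly Robust Estimator: Sketch

- A minimal Python sketch of the clever-covariate ICE construction for two time points. The data-generating model and coefficients are invented for illustration, and the *true* propensity scores are plugged in for brevity; in practice `\(f(A_k \vert \bar{A}_{k-1}, \bar{L}_k)\)` would come from fitted propensity models.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

# Hypothetical data; all coefficients invented for illustration.
L0 = rng.normal(size=N)
p0 = 1 / (1 + np.exp(-0.5 * L0))             # true P(A0 = 1 | L0)
A0 = rng.binomial(1, p0)
L1 = 0.7 * L0 - 0.5 * A0 + rng.normal(size=N)
p1 = 1 / (1 + np.exp(-(0.5 * L1 + A0)))      # true P(A1 = 1 | A0, L1)
A1 = rng.binomial(1, p1)
Y = 1.0 + 0.6 * L1 + A0 + A1 + rng.normal(size=N)

a0 = a1 = 1  # the "always treat" strategy

def ols(X, y):
    """Least-squares coefficients."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Densities of the treatments actually received.
f0 = np.where(A0 == 1, p0, 1 - p0)
f1 = np.where(A1 == 1, p1, 1 - p1)

# Time-1 regression of Y on history plus I(A0 = a0) * W^{A_1};
# the prediction step sets A1 = a1 and swaps in W^{A_0, a_1 = 1}.
C1 = (A0 == a0) / (f0 * f1)
C1_int = (A0 == a0) / (f0 * p1)
X1 = np.column_stack([np.ones(N), A0, A1, L0, L1, C1])
X1_int = np.column_stack([np.ones(N), A0, np.full(N, a1), L0, L1, C1_int])
Q1 = X1_int @ ols(X1, Y)

# Time-0 regression of Q1 on (A0, L0) plus W^{A_0} = 1/f(A0 | L0);
# the prediction step sets A0 = a0 and swaps in 1/f(A0 = 1 | L0).
X0 = np.column_stack([np.ones(N), A0, L0, 1 / f0])
X0_int = np.column_stack([np.ones(N), np.full(N, a0), L0, 1 / p0])
Q0 = X0_int @ ols(X0, Q1)

# Standardize: the mean of Q0 estimates E[Y(1,1)] (2.7 in this simulation).
print(round(Q0.mean(), 2))
```

- Because the outcome models here are correctly specified, the clever-covariate coefficients are estimated near zero; the point of the construction is that the estimator also remains consistent when the outcome models are wrong but the treatment models are right.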