class: center, middle, inverse, title-slide

# L9: Time-Varying Treatment
### Jean Morrison
### University of Michigan
### 2022-02-16 (updated: 2022-02-23)

---
# Example

`\(\newcommand{\ci}{\perp\!\!\!\perp}\)`

- Patients are treated for a disease over time.

- At each appointment, the treatment decision for the next period is made, possibly based on current or past symptoms.

- We observe an outcome `\(Y\)`.

<center>
</center>

---
# Notation and Conventions

- Bar notation indicates the history of a variable: `\(\bar{A}_k = (A_0, \dots, A_k)\)`.

- In statistical time, `\(A_k\)` is the last variable at time `\(k\)`.
  + Covariates `\(L_k\)` are measurements that are taken after treatment `\(A_{k-1}\)` is given but before treatment `\(A_k\)` is given.

- Timing aligns for all units.
  + We will often talk about time points as though they are evenly spaced (e.g. every month), but this is not required.

- Time starts at 0.

---
# Treatment Programs

- In this setting, we might be interested in the effect of the entire course of treatment `\(\bar{A} = (A_0, A_1)\)`.

- We are modeling a joint intervention.

- With 2 time points, there are only four static treatment strategies.

- With `\(k\)` time points there are `\(2^k\)` static treatment strategies.

- How do we know which to compare?

---
# Treatment Strategies

- A treatment strategy is a rule for determining `\(A_k\)` from a unit's past treatment and covariate values:
`$$g = (g_0(\bar{a}_{-1}, l_0), \dots, g_K(\bar{a}_{K-1}, \bar{l}_K))$$`

- A treatment strategy is static if it does not depend on any covariates, only past treatments.
  + In a static strategy, we could write out the entire program at the beginning of the study.
  + Ex: Treat every other month.
  + Ex: Treat for only the first two time points.

- A treatment strategy is dynamic if it does depend on covariates.
  + Ex: Treat if `\(L_{k-1}\)` was high.
  + Ex: If `\(L_{k-1}\)` is high, switch treatment, so `\(A_{k} = 1-A_{k-1}\)`. Otherwise set `\(A_k = A_{k-1}\)`.

---
# Sequentially Randomized Trials

- In a sequentially randomized trial, treatment `\(A_{k,i}\)` is assigned randomly with `\(P[A_{k,i} = a]\)` possibly depending on `\(\bar{A}_{k-1}\)` and `\(\bar{L}_{k}\)`.

--

- Example: Every patient starts on treatment 0. Every month a random set of patients are assigned to switch to treatment 1 and stay on that treatment for the rest of the study.
- Patients with high values of `\(L_{k}\)` may have a higher probability of starting treatment.

- Example: Treatment is assigned randomly at every time point.

- Patients with high values of `\(L_{k}\)` have a higher probability of switching treatments.

--

- A random strategy can never be better than the optimal deterministic strategy.
  + We would never recommend a random strategy for general treatment of patients.
  + But random strategies are necessary when the optimal strategy is unknown.

---
# Causal Contrasts

- The causal contrast we choose to look at will depend on the study.

- We might be interested in comparing specific fixed programs, `\(E[Y(\bar{A} = \bar{a}_1)] - E[Y(\bar{A} = \bar{a}_2)]\)`, such as
  + Always treat vs never treat: `\(\bar{a}_1 = (1, 1, \dots, 1)\)`, `\(\bar{a}_2 = (0, 0, \dots, 0)\)`
  + Treat early and continue vs begin treatment later and continue: `\(\bar{a}_1 = (1, 1, \dots, 1)\)`, `\(\bar{a}_2 = (0,\dots, 0, 1, \dots, 1)\)`.

- Or we could compare one or more dynamic strategies `\(g\)`, `\(E[Y(g)]\)`, such as:
  + Always treat vs treat only when symptoms are present.

---
# DAG 1

- With your partner(s), draw a DAG for two time points for the following scenario:

- Patients are enrolled in a randomized trial of HIV medication.

- At each time point CD4 count ( `\(L\)` ) is measured. CD4 count is affected by the most recent treatment and the most recent past value of CD4 count.

- Probability of receiving treatment 1 at time `\(k\)` depends on CD4 count at time `\(k\)`, but not on past treatment or past CD4 count measurements.

- The outcome `\(Y\)` depends on treatment at all time points but does not depend on CD4 count.

---
# DAG 1

---
# DAG 2

- To your DAG add an unmeasured variable `\(U\)` which is also time-varying.
  + `\(U\)` represents disease progression.

- `\(U\)` at time `\(k\)` is affected by the most recent treatment value and most recent past value of `\(U\)`.
- `\(U\)` at time `\(k\)` affects CD4 count at time `\(k\)`, the next value of `\(U\)`, and `\(Y\)`. - You can assume that `\(U_k\)` is statistically before `\(L_k\)`. --- # DAG 3 - Draw the same scenario except that treatment `\(A_k\)` does not depend on CD4 count but does depend on the most recent treatment `\(A_{k-1}\)`. --- # DAG 4 - Suppose that we now have an observational study. - Patients have checkups once a month. At each appointment, treatment is adjusted. - Doctors will take into account past treatment, current values of `\(L\)`, and other factors which are affected by `\(U\)`. --- # SWIGs for Static Strategies - Draw a SWIG for the DAG below for the treatment strategy `\((A_0 = a_0, A_1 = a_1).\)` <center> <img src="img/9_dag2s.png" width="55%" /> </center> --- # SWIGs for Static Strategies - Draw a SWIG for the DAG below for the treatment strategy `\((A_0 = a_0, A_1 = a_1).\)` <center> <img src="img/9_swig2s.png" width="85%" /> </center> --- # Static Sequential Exchangeability - Static sequential exchangeability says that `$$Y(\bar{a}) \ci A_k \vert\ \bar{A}_{k-1} = \bar{a}_{k-1}, \bar{L}_k\qquad k = 0, 1, \dots K$$` - Does static sequential exchangeability hold in our example? <center> <img src="img/9_swig2s.png" width="65%" /> </center> --- # Static Sequential Exchangeability - We can see that `\(Y(a_0, a_1) \ci A_0\)` ( `\(A_0\)` is disconnected) - In the SWIG we have `\(Y(a_0, a_1) \ci A_1(a_0) \vert A_0 = a_0, L_1(a_0)\)`. + By consistency `\(A_0 = a_0 \Rightarrow A_1 = A_1(a_0)\)` and + `\(A_0 = a_0 \Rightarrow L_1 = L_1(a_0)\)` - So `\(Y(a_0, a_1) \ci A_1 \vert A_0 = a_0, L_1\)` <center> <img src="img/9_swig2s.png" width="60%" /> </center> --- # Static Sequential Exchangeability - Does static sequential exchangeability hold in the SWIG below? <center> <img src="img/9_swig3.png" width="80%" /> </center> --- # Static Sequential Exchangeability - Does static sequential exchangeability hold in the SWIG below? 
<center> <img src="img/9_swig4.png" width="80%" /> </center>

---
# SWIGs for Dynamic Treatment Strategies

- Suppose that we want to estimate the counterfactual `\(E[Y(g)]\)` where `\(g\)` is the dynamic treatment strategy "treat only if `\(L_k = 1\)`".
  + Recall that the SWIG represents the hypothetical world of the intervention, not the world in which we collected our data.

- Our intervention *introduces an arrow* from `\(L_1(g_0)\)` to the value of `\(A_1\)` under the intervention.

<center> <img src="img/9_swig2dyn.png" width="75%" /> </center>

---
# SWIGs for Dynamic Treatment Strategies

- The dotted arrow is created by the proposed intervention.
  + It is not a result of the experimental design or underlying causal structure.

- The dotted arrow functions just like a solid arrow for computing d-separation.
  + It is dotted so that we know it was introduced by the intervention.

<center> <img src="img/9_swig2dyn.png" width="75%" /> </center>

---
# Sequential Exchangeability for Dynamic Treatment Strategies

- Sequential exchangeability for `\(Y(g)\)` holds if
`$$Y(g) \ci A_k \vert \bar{A}_{k-1} = g(\bar{A}_{k-2}, \bar{L}_{k-1}), \bar{L}_k \qquad k = 0, 1, \dots, K$$`

- This definition applies whether `\(g\)` is static or dynamic, random or deterministic.

- Also called *sequential conditional exchangeability*.

---
# Sequential Exchangeability for Dynamic Treatment Strategies

- Does sequential exchangeability hold for `\(Y(g)\)` in the dynamic intervention?

<center> <img src="img/9_swig2dyn.png" width="75%" /> </center>

--

- We can see that `\(Y(g) \ci A_0\)` and `\(Y(g) \ci A_1(g_0) \vert\ L_1(g_0), A_0 = g_0\)`.

- Using consistency, `\(Y(g) \ci A_1 \vert L_1, A_0 = g_0\)`.

---
# Sequential Exchangeability for Dynamic Treatment Strategies

<center> <img src="img/9_swig3dyn.png" width="75%" /> </center>

--

- We don't have `\(Y(g) \ci A_0\)`.
  + They are connected by the `\(A_0 - W_0 - L_1(g_0) - g_1 - Y(g)\)` path.
  + The `\(g_1\)` node does not block the path because it is not fixed.
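---
# Aside: Simulating `\(Y(g)\)`

The interventional world that a SWIG depicts can also be explored by simulation: draw each unit forward in time, but set each `\(A_k\)` by the rule `\(g\)` instead of by the trial's assignment mechanism. Below is a minimal Python sketch; the structural equations and all constants in them are hypothetical, chosen only to make the example runnable.

```python
import random

def mean_y(n, strategy=None, seed=0):
    """Simulate a two-period trial and return the average outcome.
    If strategy is None, treatment is randomized as in the trial;
    otherwise A_k = strategy(k, L_k). All structural equations below
    are hypothetical illustrations, not taken from the slides."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        L0 = int(rng.random() < 0.5)
        A0 = strategy(0, L0) if strategy else int(rng.random() < 0.5)
        # Treatment-confounder feedback: A0 affects the next covariate.
        L1 = int(rng.random() < (0.3 if A0 else 0.7))
        A1 = strategy(1, L1) if strategy else int(rng.random() < 0.5)
        total += 60 + 10 * L1 - 5 * A1 + rng.gauss(0, 1)
    return total / n

# E[Y(g)] for the dynamic strategy g: "treat only if L_k = 1"
ey_g = mean_y(100_000, strategy=lambda k, l: l)
```

Under this particular data-generating process, setting `\(A_1 = L_1\)` trades the `\(+10 L_1\)` term against the `\(-5 A_1\)` term, so the simulated `\(E[Y(g)]\)` lands near 62.5.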
---
# Positivity

- Let `\(f_{\bar{A}_{k-1}, \bar{L}_k}\)` be the joint pdf for the treatment history before point `\(k\)` and the covariate history.

- For time-varying treatment, positivity requires that
`$$f_{\bar{A}_{k-1}, \bar{L}_k}(\bar{a}_{k-1}, \bar{l}_k) > 0 \Rightarrow f_{A_{k} \vert \bar{A}_{k-1}, \bar{L}_k}(a_k \vert \bar{a}_{k-1}, \bar{l}_k) > 0$$`

- If we are interested in a particular strategy, `\(g\)`, the condition only needs to hold for treatment histories compatible with `\(g\)` ( `\(a_k = g_k(\bar{a}_{k-1}, \bar{l}_k)\)` ).

- This condition says that given past treatment history and covariates, any treatment consistent with the strategy should be possible.

---
# Consistency

- For a point treatment, consistency requires that `\(A = a \Rightarrow Y(a) = Y\)`.

- For a static strategy, the condition `\(\bar{A} = \bar{a} \Rightarrow Y(\bar{a}) = Y\)` is sufficient.

- For dynamic strategies, if `\(A_k = g_k(\bar{A}_{k-1}, \bar{L}_k)\)` for all `\(k\)` then `\(Y(g) = Y\)`.

---
# Estimation for Static Treatment Strategies

- Consider our previous example:

<center> <img src="img/9_dag2sw.png" width="55%" /> </center>

- In this example, we could estimate the average effect of `\(A_0\)` or the average effect of `\(A_1\)` using our previous techniques.

- Suppose we want to compare the strategy always treat, `\(Y(1, 1)\)`, with the strategy never treat, `\(Y(0, 0)\)`.

<!-- - Can we use our previous strategies to estimate the joint counterfactuals `\(Y(1,1)\)` and `\(Y(0,0)\)`? -->

---
# Example

- Data below were aggregated from a trial of 32,000 units.
  + The trial conforms to our previous DAG.
  + Treatment at time 0 is random with probability 0.5.
  + Treatment at time 1 depends only on `\(L_1\)`: `\(P[A_1 = 1 \vert L_1 = 1] = 0.8\)`, `\(P[A_1 = 1 \vert L_1 = 0] = 0.4\)`.
  + In these data, there is no effect of treatment on `\(Y\)`.
<table class="table" style="width: auto !important; float: left; margin-right: 10px;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 52 </td> </tr> <tr> <td style="text-align:right;"> 9600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 52 </td> </tr> </tbody> </table> <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 4800 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 3200 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td 
style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 44 </td> </tr> <tr> <td style="text-align:right;"> 6400 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 44 </td> </tr> </tbody> </table> --- # Example <table class="table" style="width: auto !important; float: left; margin-right: 10px;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 52 </td> </tr> <tr> <td style="text-align:right;"> 9600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 52 </td> </tr> </tbody> </table> <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> 
</thead> <tbody> <tr> <td style="text-align:right;"> 4800 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 3200 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 44 </td> </tr> <tr> <td style="text-align:right;"> 6400 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 44 </td> </tr> </tbody> </table> + Within each stratum of `\((A_0, L_1)\)`, there is no treatment effect. + There is also no average effect of `\(A_0\)`: + `\(E[Y \vert A_0 = 0] = \frac{40\cdot 84 + 120 \cdot 52}{160} = 60\)` + `\(E[Y \vert A_0 = 1] = \frac{80\cdot 76 + 80 \cdot 44}{160} = 60\)` --- # Example <table class="table" style="width: auto !important; float: left; margin-right: 10px;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 2400 </td> <td 
style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 52 </td> </tr> <tr> <td style="text-align:right;"> 9600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 52 </td> </tr> </tbody> </table> <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 4800 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 3200 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 44 </td> </tr> <tr> <td style="text-align:right;"> 6400 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 44 </td> </tr> </tbody> </table> - We want to estimate `\(E[Y(1,1)] - E[Y(0, 0)]\)` - We could try computing the associational difference `\(E[Y \vert A_0 = 1, A_1 = 1] - E[Y \vert A_0 = 0, A_1 = 0]\)`: + `\(E[Y \vert A_0 = 1, A_1 = 1] = \frac{32\cdot 76 + 64 \cdot 44}{96} = 54.67\)` + `\(E[Y \vert A_0 = 0, A_1 = 0] = \frac{24 \cdot 84 + 24 \cdot 52}{48} = 68\)` -- - Problem: We have confounding from `\(L_1\)`. 
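---
# Example: Checking the Naive Contrast in Code

The confounded contrast above can be verified directly from the aggregated counts. A minimal Python sketch (the table values are copied from the slides; the helper function is ours):

```python
# Aggregated trial data: (N, A0, L1, A1, mean Y), from the slides.
rows = [
    (2400, 0, 0, 0, 84), (1600, 0, 0, 1, 84),
    (2400, 0, 1, 0, 52), (9600, 0, 1, 1, 52),
    (4800, 1, 0, 0, 76), (3200, 1, 0, 1, 76),
    (1600, 1, 1, 0, 44), (6400, 1, 1, 1, 44),
]

def cond_mean(a0, a1):
    """Associational mean E[Y | A0 = a0, A1 = a1] from the counts."""
    sel = [(n, y) for n, A0, L1, A1, y in rows if (A0, A1) == (a0, a1)]
    return sum(n * y for n, y in sel) / sum(n for n, _ in sel)

# cond_mean(1, 1) is about 54.67 while cond_mean(0, 0) is 68, so the
# naive contrast is about -13.33 even though the true effect is zero.
naive_contrast = cond_mean(1, 1) - cond_mean(0, 0)
```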
--- # Example <table class="table" style="width: auto !important; float: left; margin-right: 10px;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 52 </td> </tr> <tr> <td style="text-align:right;"> 9600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 52 </td> </tr> </tbody> </table> <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 4800 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 3200 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 76 </td> 
</tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 44 </td> </tr> <tr> <td style="text-align:right;"> 6400 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 44 </td> </tr> </tbody> </table>

- To account for confounding by `\(L_1\)`, we could try computing the associational difference within strata of `\(L_1\)` and then standardizing.

- Stratifying on `\(L_1\)`:
`$$E[Y \vert A_0 = 1, A_1 = 1, L_1 = 0] - E[Y \vert A_0 = 0, A_1 = 0, L_1 = 0] = 76-84 = -8$$`
`$$E[Y \vert A_0 = 1, A_1 = 1, L_1 = 1] - E[Y \vert A_0 = 0, A_1 = 0, L_1 = 1] = 44-52 = -8$$`

--

- Problem: `\(L_1\)` is a collider. Conditioning on `\(L_1\)` induces an association between `\(A_0\)` and `\(Y\)`.

---
# G-Formula

- The g-formula for point treatments has been the basis of the IPW, standardization, and doubly robust methods we have seen so far:

--

`$$E[Y(a)] = \sum_l E[Y \vert A = a, L = l]f_L(l)$$`

- Integral version for continuous `\(L\)`:
`$$E[Y(a)] = \int_{l} E[Y \vert A = a, L = l] d F_L(l)$$`

--

- To derive this, we used the iterated expectation formula, exchangeability, and consistency:
`$$E[Y(a)] = E_L[ E_Y[ Y(a) \vert L]] = \\\ E_L[ E_Y[ Y(a) \vert A = a, L]] = \\\ E_{L}[E[Y \vert A = a, L]] = \int_l E[Y \vert A = a, L = l] dF_L(l)$$`

---
# G-Formula for Static Treatment Strategies

- The g-formula for two time points is
`$$E[Y(a_0, a_1)] = \sum_l E[Y \vert A_0 = a_0, A_1 = a_1, L_1 = l]f_{L_1 \vert A_0}(l \vert a_0)$$`

- First intuition: Write
`$$E[Y(a_0, a_1)] = \sum_{l_1} E[Y(a_0, a_1) \vert L_1(a_0) = l_1]P[L_1(a_0) = l_1]$$`

- Use sequential exchangeability to
  + Replace `\(E[Y(a_0, a_1) \vert L_1(a_0) = l_1]\)` with `\(E[Y \vert A_0 = a_0, A_1 = a_1, L_1 = l_1]\)` and
  + `\(P[L_1(a_0) = l_1]\)` with `\(P[L_1 = l_1 \vert A_0 = a_0]\)`.

---
# Sequential Exchangeability

- Our
assumption that `\(P[L_1 = l_1 \vert A_0 = a_0] = P[L_1(a_0) = l_1]\)` and `\(E[Y(a_0, a_1) \vert L_1(a_0) = l_1] = E[Y \vert A_0 = a_0, A_1 = a_1, L_1 = l_1]\)` requires a stronger form of sequential exchangeability:
`$$\left(Y(\bar{a}_k), \underline{L}_{k+1}(\bar{a}_k)\right) \ci A_{k} \vert \bar{A}_{k-1} = \bar{a}_{k-1}, \bar{L}_k$$`

- Where `\(\underline{L}_{k+1}\)` is all covariates *after* time `\(k\)`.

- It turns out that the G-formula holds under only our weaker condition of sequential exchangeability for `\(Y(\bar{a})\)`.

- Static sequential exchangeability is sufficient for identifying static treatment strategies.

<!-- --- -->
<!-- # G-Formula for Static Treatment Strategies -->
<!-- - The g-formula can be derived by applying the iterated expectation formula, even without a causal interpretation of either `\(E[Y \vert A_0 = a_0, A_1 = a_1, L_1 = l_1]\)` or `\(P[L_1 \vert A_0 = a_0]\)` -->
<!-- $$E[Y(a_0, a_1)] = E_{L_1} [ E_{Y}[Y(a_0, a_1) \vert L_1]] = \\\ -->
<!-- E_{L_1} [ E_{Y}[Y(a_0, a_1) \vert A_0 = a_0, A_1 = a_1, L_1]] = \\\ -->
<!-- \sum_{l}E[Y(a_0, a_1) \vert A_0 = a_0, A_1 = a_1, L_1 = l]P[L_1 \vert A_1 = a_1] = \\\ -->
<!-- \sum_{l}E[Y \vert A_0 = a_0, A_1 = a_1, L_1 = l]P[L_1 \vert A_1 = a_1]$$ -->

---
# G-Formula for Our Example

<table class="table" style="width: auto !important; float: left; margin-right: 10px;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 84 </td>
</tr> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 52 </td> </tr> <tr> <td style="text-align:right;"> 9600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 52 </td> </tr> </tbody> </table> <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 4800 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 3200 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 44 </td> </tr> <tr> <td style="text-align:right;"> 6400 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 44 </td> </tr> </tbody> </table> `$$E[Y(a_0, a_1)] = \sum_l E[Y \vert A_0 = a_0, A_1 = a_1, L_1 = l]f_{L_1 \vert A_0}(l \vert a_0)$$` -- + `\(E[Y \vert A_0 = 1, A_1 = 1, L_1 = 0]P[L_1 = 0 \vert A_0 = 1] = 76 \cdot \frac{1}{2} = 38\)` + `\(E[Y \vert A_0 = 1, A_1 = 1, L_1 = 1]P[L_1 = 1 \vert A_0 = 1] = 44 \cdot \frac{1}{2} = 22\)` + `\(E[Y(1, 1)] = 38 + 22 = 60\)` -- + 
`\(E[Y \vert A_0 = 0, A_1 = 0, L_1 = 0]P[L_1 = 0 \vert A_0 = 0] = 84 \cdot \frac{1}{4} = 21\)`
  + `\(E[Y \vert A_0 = 0, A_1 = 0, L_1 = 1]P[L_1 = 1 \vert A_0 = 0] = 52 \cdot \frac{3}{4} = 39\)`
  + `\(E[Y(0, 0)] = 21 + 39 = 60\)`

---
# General Version of G-Formula for Static Treatments

- The g-formula for a static treatment strategy generalizes to
`$$E[Y(\bar{a})] = \sum_\bar{l} E[Y \vert \bar{A} = \bar{a},\bar{L}= \bar{l}]\prod_{k = 0}^Kf(l_k \vert \bar{a}_{k-1}, \bar{l}_{k-1})$$`
or
`$$\int_l E[Y \vert \bar{A} = \bar{a},\bar{L}= \bar{l}] \prod_{k = 0}^K dF (l_k \vert \bar{a}_{k-1}, \bar{l}_{k-1})$$`

---
# G-Formula for Dynamic Treatment Strategies

- In a static deterministic strategy, `\(a_k\)` is fixed in advance, so under the intervention each `\(a_k\)` occurs with probability 1 and no extra term is needed.

- For dynamic or random strategies, this is not the case, and we need to add a term to the g-formula:
`$$E[Y(g)] = \sum_{\bar{a}, \bar{l}} E[Y \vert \bar{A} = \bar{a},\bar{L}= \bar{l}]\prod_{k = 0}^Kf(l_k \vert \bar{a}_{k-1}, \bar{l}_{k-1})\prod_{k=0}^K f^{int}(a_k \vert \bar{a}_{k-1}, \bar{l}_k)$$`

- `\(f^{int}\)` is the conditional probability of `\(a_k\)` given the history *under the proposed intervention*.

---
# Inverse Probability Weighting

- We can generalize the IPW strategy we have been using for a point treatment to the time-varying regime:
`$$W^A = \prod_{k = 0}^{K} \frac{1}{f(A_k \vert \bar{A}_{k-1}, \bar{L}_{k})}$$`

--

- As before, we can stabilize the weights as
`$$SW^A = \prod_{k = 0}^{K} \frac{f(A_k \vert \bar{A}_{k-1})}{f(A_k \vert \bar{A}_{k-1}, \bar{L}_{k})}$$`

- If there are baseline covariates, `\(L_0\)`, we can condition on `\(L_0\)` in both numerator and denominator:
`$$SW^A = \prod_{k = 0}^{K} \frac{f(A_k \vert \bar{A}_{k-1}, L_0)}{f(A_k \vert \bar{A}_{k-1}, \bar{L}_{k}, L_0)}$$`

- Only the model for the denominator needs to be correct.

---
# Inverse Probability Weighting

- Just like before, weighting subjects creates a pseudo-population in which treatment and confounders are uncorrelated.
- So we can compute the counterfactual mean as simply the conditional mean in the pseudo-population:
`$$E[Y(a_0, a_1)] = E_{ps}[Y \vert A_0 = a_0, A_1 = a_1]$$`

---
# IP Weighting Example

<center> <img src="img/9_fig211.png" width="60%" /> </center>

- Compute unstabilized weights in the example.

- Compute the sample size in each stratum. How big is the pseudo-population?

---
# IP Weighting Example

<center> <img src="img/9_fig211.png" width="65%" /> </center>

- Compute stabilized weights in the example.

- How big is the pseudo-population created by the stabilized weights?

---
# IP Weighting Example

<center> <img src="img/9_fig213.png" width="80%" /> </center>

---
# Using IP Weights Non-Parametrically

+ Once we have computed `\(W^{\bar{A}}\)` or `\(SW^{\bar{A}}\)` we can estimate `\(E[Y(\bar{a})]\)` as
`$$\hat{E}\left[W_i^{\bar{A}}Y_i I(\bar{A}_i = \bar{a}) \right] = \frac{1}{N}\sum_{i = 1}^{N} W^{\bar{A}}_i Y_i I(\bar{A}_i = \bar{a})$$`

- We could have used either stabilized or unstabilized weights.

- Notice that we are only making use of samples whose observed treatment history is identical to the proposed intervention.

- Suppose we want to compare "always treat" and "never treat" strategies.
  + An equivalent procedure to IP weighting is to censor everyone who changed treatment (non-adherent units).
  + We then combine weights for treatment and weights for censoring:
`$$\frac{1}{P[A_0 = a]} \times \frac{1}{P[A_1 = a \vert A_0 = a, L_1]}$$`

---
# Weights for Dynamic Treatments

<center> <img src="img/9_fig213cropped.png" width="80%" /> </center>

- Suppose we want to compare dynamic regimes `\(g = (0, L_1)\)` and `\(g^{\prime} = (0, 1-L_1)\)`.

- We can see that, conditional on `\(A_0\)` and `\(L_1\)`, treatment choice at `\(A_1\)` doesn't matter, so we should find that `\(E[Y(g)] - E[Y(g^{\prime})] = 0\)`.
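---
# Checking the Dynamic-Regime Weights in Code

This prediction can be checked numerically with unstabilized weights. A minimal Python sketch (the counts and assignment probabilities come from the slides; the helper names are ours):

```python
# Aggregated trial data: (N, A0, L1, A1, mean Y), from the slides.
rows = [
    (2400, 0, 0, 0, 84), (1600, 0, 0, 1, 84),
    (2400, 0, 1, 0, 52), (9600, 0, 1, 1, 52),
    (4800, 1, 0, 0, 76), (3200, 1, 0, 1, 76),
    (1600, 1, 1, 0, 44), (6400, 1, 1, 1, 44),
]
P_A0 = 0.5                # P[A0 = 1]: randomized
P_A1 = {0: 0.4, 1: 0.8}   # P[A1 = 1 | L1]: trial design

def ipw_mean(g):
    """E[Y(g)] via unstabilized IP weighting; g maps (k, L_k) to A_k."""
    num = den = 0.0
    for n, a0, l1, a1, y in rows:
        if a0 != g(0, None) or a1 != g(1, l1):
            continue  # use only units whose history follows g
        pa0 = P_A0 if a0 == 1 else 1 - P_A0
        pa1 = P_A1[l1] if a1 == 1 else 1 - P_A1[l1]
        w = 1.0 / (pa0 * pa1)  # W = 1 / prod_k f(A_k | past)
        num += n * w * y
        den += n * w
    return num / den

g = lambda k, l: 0 if k == 0 else l          # treat at time 1 iff L1 = 1
g_prime = lambda k, l: 0 if k == 0 else 1 - l
# ipw_mean(g) and ipw_mean(g_prime) both come out to 60.
```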
---
# Weights for Dynamic Treatments

<center> <img src="img/9_fig213cropped.png" width="80%" /> </center>

- Using the `\(W^{\bar{A}}\)` weights gives the right answer:
`$$E[Y(g)] = \frac{80\cdot 84 + 240 \cdot 52}{80 + 240} = 60 \qquad E[Y(g^{\prime})] = \frac{80\cdot 84 + 240 \cdot 52}{80 + 240} = 60$$`

- But using `\(SW^{\bar{A}}\)` gives the wrong answer. Why?
`$$E[Y(g)] = \frac{12\cdot 84 + 84 \cdot 52}{12 + 84} = 56 \qquad E[Y(g^{\prime})] = \frac{28\cdot 84 + 36 \cdot 52}{28 + 36} = 66$$`

---
# Weights for Dynamic Treatments

- The problem is the numerator of the stabilized weights.

- To estimate `\(E[Y(g)]\)`, we can limit ourselves only to units with `\(\bar{A}_i = (g_0(L_{0,i}), g_1(A_{0,i}, \bar{L}_{1,i}), \dots, g_K(\bar{A}_{k-1,i}, \bar{L}_{k,i}))\)`.

- For convenience, abbreviate `\(g_k(\bar{A}_{k-1}, \bar{L}_k)\)` as `\(g_k\)`.

- The denominator of the stabilized weight is the same as before:
`$$\prod_{k=0}^K \frac{1}{f(A_k = g_k \vert \bar{A}_{k-1} = g_{k-1}, \bar{L}_k)} = \prod_{k=0}^{K}\frac{1}{f(A_k \vert \bar{A}_{k-1}, \bar{L}_{k})}$$`

- The numerator is not the same. The numerator should be
`$$\prod_{k=0}^{K}f(A_k = g_k \vert \bar{A}_{k-1} = g_{k-1}) = \prod_{k=0}^K \sum_{l} f(A_k = g_k \vert \bar{A}_{k-1} = g_{k-1}, \bar{L}_{k} = l) P[\bar{L}_k = l \vert \bar{A}_{k-1} = g_{k-1}]$$`

---
# Weights for Dynamic Treatments

<center> <img src="img/9_fig213cropped.png" width="80%" /> </center>

- The numerator of the stabilized weights for `\(g = (0, L_1)\)` for the first and fourth row should be
`$$(0.5)(0.6\cdot 0.25 + 0.8 \cdot 0.75) = (0.5)(0.75)$$`

- The numerator of the stabilized weights for `\(g^{\prime} = (0, 1-L_1)\)` for the second and third row should be
`$$(0.5)(0.4\cdot 0.25 + 0.2 \cdot 0.75) = (0.5)(0.25)$$`

---
# Estimating Weights

- If `\(L_k\)` is high dimensional or there are many time points, we will need to assume a parametric model for `\(f(A_k \vert \bar{A}_{k-1}, \bar{L}_k)\)`.
+ We might assume that `\(A_k\)` depends only on the most recent treatment and covariates.
  + Or on some summary of the past history.

- We can fit one model (e.g. logistic regression) at each time point:
`$$E[A_k \vert \bar{A}_{k-1}, \bar{L}_{k}] = \beta_{0,k} + \beta_{1,k} A_{k-1} + \beta_{2,k} cum_{-5}(\bar{A}_{k-1}) + \beta_{3,k} L_{k}$$`
  + `\(cum_{-5}(\bar{A}_{k-1})\)` is the number of times treated out of the previous five treatment times.

- Alternatively, we could assume that some coefficients are shared across time points and fit a pooled model, possibly with some time effects:
`$$E[A_k \vert \bar{A}_{k-1}, \bar{L}_{k}] = \beta_{0,k} + \beta_1 A_{k-1} + \beta_2 cum_{-5}(\bar{A}_{k-1}) + \beta_3 L_{k} + \beta_4 A_{k-1} k$$`

---
# Non-Parametric Estimation

- If we are totally non-parametric and there is no effect of treatment on confounders, then estimating the effect of a treatment strategy `\(\bar{a}\)` or `\(g\)` is very similar to estimating the effect of a point treatment of assignment to a particular regime.

- In fact, if there is no effect of treatment history on time-varying confounders, `\(f(L_k \vert \bar{A}_{k-1}) = f(L_k)\)` and the G-formula for time-varying treatment reduces to the regular point-treatment G-formula.

- HR call the presence of effects of past treatment on the time-varying confounders "treatment-confounder feedback".

- Whether or not there is treatment-confounder feedback, if we are willing to make parametric assumptions, we can borrow information from units with similar treatment histories.

---
# Marginal Structural Models

- Just as we did before, we can use our IP weights to fit parametric marginal structural models.

- For example, we might assume that the total treatment effect of `\(\bar{a}\)` only depends on the total number of times treated and not on the timing of the treatment:
`$$E[Y(\bar{a})] = \beta_0 + \beta_1 cum(\bar{a})$$`

---
# Marginal Structural Models

- Once we have proposed a marginal structural mean model, we can fit it using the pseudo-population created by weighting the data.
`$$E_{ps}[Y \vert \bar{A}] = \beta_0 + \beta_1 cum(\bar{A})$$` - `\(\hat{\beta}_1\)` estimates the causal effect of increasing the number of treated periods by 1. - Variance can be estimated by the bootstrap or, conservatively, by the robust sandwich estimator. - Testing `\(\hat{\beta}_1 = 0\)` gives a test of the strong null that treatment at any time is unrelated to outcome, `\(Y(\bar{a}) = Y\)` for all `\(\bar{a}\)`. --- # Marginal Structural Models for Effect Modification - We could propose a marginal structural model that includes effect modification `$$E_{ps}[Y \vert \bar{A}] = \beta_0 + \beta_1 cum(\bar{A}) + \beta_2 V + \beta_3 cum(\bar{A}) V$$` - What are `\(\beta_1\)`, `\(\beta_2\)` and `\(\beta_3\)`? --- # Assumptions - For correct inference using IP weighting + a marginal structural model we need: -- - Consistency, sequential positivity, sequential conditional exchangeability - Correct propensity-score model - Correct marginal structural model --- # Parametric G-Formula - When we only had a single point intervention, we could use outcome regression plus standardization as a plug-in estimator of the g-formula. - We needed to estimate `\(E[Y \vert A, L]\)` but not `\(f_L(l)\)`, the density of covariates. - To estimate `\(E[Y(a)]\)` we replaced each person's treatment value with `\(a\)` and then estimated `\(\hat{Y}_i(a) = \hat{E}[Y \vert A = a, L = L_i]\)`. + We then approximated the integral `$$\int_l E[Y \vert A = a, L = l]f_L(l) dl$$` with the sum `$$\frac{1}{N} \sum_{i = 1}^N \hat{Y}_i(a)$$` --- # Parametric G-Formula - There is an analog of this strategy for time-varying treatments. - Recall the integral form of the G-formula for static time-varying treatments: `$$\int E[Y \vert \bar{A} = \bar{a},\bar{L}= \bar{l}] \prod_{k = 0}^K dF (l_k \vert \bar{a}_{k-1}, \bar{l}_{k-1})$$` - In the time-varying g-formula, we clearly need to estimate `\(E[Y \vert \bar{A}, \bar{L}]\)`. - Can we use our same standardization trick to avoid estimating the covariate density?
-- - No --- # Parametric G-Formula - Imagine we try the standardization trick with two time points: - We first set `\(A_0 = a_0\)` and `\(A_1 = a_1\)` for everyone in the data set. - What is the problem? -- - The value of the covariates `\(L_1\)` depends on `\(A_0\)`. + We need to replace `\(L_1\)` with `\(L_1(a_0)\)` but these values are not observed. --- # Simulating Covariates - We need to propose a parametric model for the density `\(f(l_k \vert \bar{a}_{k-1}, \bar{l}_{k-1})\)`. - We can then simulate covariate histories conditional on the intervention of interest from our estimated model. <!-- - Finally, we compute `\(E[Y(\bar{a})]\)` by standardizing over the data set with simulated covariates replaced. --> --- # Parametric G-Formula Algorithm Step 1. Fit parametric models for + `\(m(\bar{A}, \bar{L}; \theta) = E[Y \vert \bar{A}, \bar{L}]\)` + `\(e_{L_k}(\bar{A}_{k-1}, \bar{L}_{k-1}; \beta) = f(L_k \vert \bar{A}_{k-1}, \bar{L}_{k-1})\)` + `\(e_{L_k}\)` is a `\(p\)`-dimensional density, where `\(p\)` is the dimension of `\(L_{k}\)`. + We might propose a component-wise model for `\(e_{L_k}\)`. + Note that `\(e_{L_k}\)` is a model for a full density, not just a conditional expectation.
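--- # Parametric G-Formula: Sketch

- A minimal sketch of the fit-then-simulate procedure for two time points, in Python rather than R. The data-generating model, coefficient values, and variable names below are all invented for illustration, and the covariate model for `\(L_1\)` is assumed to be normal with a linear mean.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50_000

# Hypothetical observational data with treatment-confounder feedback
# (A0 affects L1). All coefficients are invented for illustration.
L0 = rng.normal(size=N)
A0 = rng.binomial(1, 1 / (1 + np.exp(-0.5 * L0)))
L1 = 0.7 * L0 - 0.5 * A0 + rng.normal(size=N)
A1 = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * L1 + A0))))
Y = 1.0 + 0.6 * L1 + A0 + A1 + rng.normal(size=N)

def ols(X, y):
    """Least-squares coefficients."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Step 1: fit the outcome model m and the covariate model e_{L1}
# (L1 | A0, L0 modeled as normal with a linear mean).
theta = ols(np.column_stack([np.ones(N), A0, A1, L0, L1]), Y)
Xl = np.column_stack([np.ones(N), A0, L0])
beta = ols(Xl, L1)
sigma = np.std(L1 - Xl @ beta)

# Step 2: keep each unit's L0, set A0 = A1 = 1 ("always treat"),
# and simulate L1 from the fitted covariate model.
a0 = a1 = 1.0
L1_sim = beta[0] + beta[1] * a0 + beta[2] * L0 + sigma * rng.normal(size=N)

# Step 3: plug the intervened histories into the fitted outcome model.
Y_hat = theta[0] + theta[1] * a0 + theta[2] * a1 + theta[3] * L0 + theta[4] * L1_sim

# Step 4: average. In this simulation the true E[Y(1,1)] works out to 2.7.
print(round(Y_hat.mean(), 2))
```

- With correctly specified models the average lands near the true `\(E[Y(1,1)]\)`; variance would typically come from bootstrapping the entire procedure.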
--- # Parametric G-Formula Algorithm Step 2: - Start with the original data and delete everything after time 0 <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:left;"> \(L_{0}\) </th> <th style="text-align:left;"> \(A_{0}\) </th> <th style="text-align:left;"> \(L_{1}\) </th> <th style="text-align:left;"> \(A_{1}\) </th> <th style="text-align:left;"> \(Y\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> \(l_{0, 1}\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> \(l_{0, 2}\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> </tr> <tr> <td style="text-align:left;"> N </td> <td style="text-align:left;"> \(l_{0, N}\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> </tbody> </table> --- # Parametric G-Formula Algorithm Step 2: - Fill `\(A_0\)` in with the value dictated by the intervention, `\(g\)` <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:left;"> \(L_{0}\) </th> <th style="text-align:left;"> \(A_{0}\) </th> <th style="text-align:left;"> \(L_{1}\) </th> <th style="text-align:left;"> \(A_{1}\) </th> <th 
style="text-align:left;"> \(Y\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> \(l_{0, 1}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,1})\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> \(l_{0, 2}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,2})\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> </tr> <tr> <td style="text-align:left;"> N </td> <td style="text-align:left;"> \(l_{0, N}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,N})\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> </tbody> </table> --- # Parametric G-Formula Algorithm Step 2: - Simulate values for `\(L_1\)` by sampling `\(\tilde{l}_{1,i}(g)\)` from `\(e_{L_1}(g_0(L_{0,i}), L_{0,i}; \hat{\beta})\)` <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:left;"> \(L_{0}\) </th> <th style="text-align:left;"> \(A_{0}\) </th> <th style="text-align:left;"> \(L_{1}\) </th> <th style="text-align:left;"> \(A_{1}\) </th> <th style="text-align:left;"> \(Y\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> \(l_{0, 1}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,1})\) </td> <td
style="text-align:left;color: red !important;"> \(\tilde{l}_{1,1}(g)\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> \(l_{0, 2}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,2})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,2}(g)\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> </tr> <tr> <td style="text-align:left;"> N </td> <td style="text-align:left;"> \(l_{0, N}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,N})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,N}(g)\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> </tbody> </table> --- # Parametric G-Formula Algorithm Step 2: - Repeat for all subsequent time-points <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:left;"> \(L_{0}\) </th> <th style="text-align:left;"> \(A_{0}\) </th> <th style="text-align:left;"> \(L_{1}\) </th> <th style="text-align:left;"> \(A_{1}\) </th> <th style="text-align:left;"> \(Y\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> \(l_{0, 1}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,1})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,1}(g)\) </td> <td style="text-align:left;color: red !important;"> \(g_1(g_0(l_{0,1}), \bar{\tilde{l}}_{1,1}(g))\) </td> <td 
style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> \(l_{0, 2}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,2})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,2}(g)\) </td> <td style="text-align:left;color: red !important;"> \(g_1(g_0(l_{0,2}), \bar{\tilde{l}}_{1,2}(g))\) </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> </tr> <tr> <td style="text-align:left;"> N </td> <td style="text-align:left;"> \(l_{0, N}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,N})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,N}(g)\) </td> <td style="text-align:left;color: red !important;"> \(g_1(g_0(l_{0,N}), \bar{\tilde{l}}_{1,N}(g))\) </td> <td style="text-align:left;"> </td> </tr> </tbody> </table> --- # Parametric G-Formula Algorithm Step 3: - Fill in `\(\hat{Y}_i\)` by plugging previous values of treatment and covariates into the fitted outcome model `\(m\)`. 
<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:left;"> \(L_{0}\) </th> <th style="text-align:left;"> \(A_{0}\) </th> <th style="text-align:left;"> \(L_{1}\) </th> <th style="text-align:left;"> \(A_{1}\) </th> <th style="text-align:left;"> \(Y\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> \(l_{0, 1}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,1})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,1}(g)\) </td> <td style="text-align:left;color: red !important;"> \(g_1(g_0(l_{0,1}), \bar{\tilde{l}}_{1,1}(g))\) </td> <td style="text-align:left;color: red !important;"> \(\hat{Y}_1(g)\) </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> \(l_{0, 2}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,2})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,2}(g)\) </td> <td style="text-align:left;color: red !important;"> \(g_1(g_0(l_{0,2}), \bar{\tilde{l}}_{1,2}(g))\) </td> <td style="text-align:left;color: red !important;"> \(\hat{Y}_2(g)\) </td> </tr> <tr> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> </tr> <tr> <td style="text-align:left;"> N </td> <td style="text-align:left;"> \(l_{0, N}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,N})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,N}(g)\) </td> <td style="text-align:left;color: red !important;"> \(g_1(g_0(l_{0,N}), \bar{\tilde{l}}_{1,N}(g))\) </td> 
<td style="text-align:left;color: red !important;"> \(\hat{Y}_N(g)\) </td> </tr> </tbody> </table> -- Step 4: + Compute the mean `\(\frac{1}{N} \sum_{i=1}^N \hat{Y}_i(g)\)` --- # Simulating Covariates - Since we are simulating, we might as well create more data. - Rather than starting with the original data set, resample a large number of observations, `\(S\)`, with replacement from the original data. + In the resampled data, the procedure is the same: keep `\(L_0\)` and generate all subsequent data from the fitted models. + Since we are sampling covariates from a distribution, this procedure helps reduce the variance of our estimate. - Alternatively, instead of resampling, we could replicate the original data several times. --- # Assumptions - For valid inference using the parametric g-formula we need: - Consistency, sequential positivity, sequential conditional exchangeability - Correct model for `\(E[Y \vert \bar{A}, \bar{L}]\)` - Correct model for the density of `\(L_k\)` given covariate and treatment history. --- # Parametric G-Formula Implementation - The R package `gfoRmula` implements the estimation and simulation procedure we have described. - Can handle: + Binary and continuous outcomes + Time to event outcomes - Can estimate effects of both static and dynamic treatments. - Allows a variety of specifications for the outcome model and covariate models, including lagged effects and cumulative effects. <!-- --- --> <!-- # Example --> <!-- - Built in data `binary_eofdata` contains data for 2,500 individuals measured at 7 time points. --> <!-- - There are 3 time-varying covariates. --> <!-- + `cov1` is binary --> <!-- + `cov2` is continuous --> <!-- + `cov3` is categorical with 6 values. --> <!-- + Treatment is continuous. --> --- # G-Null Paradox - In the DAG below, the strict null holds - there is no effect of treatment at any time on `\(Y\)`. - However, `\(Y\)` is correlated with `\(A_0\)` and `\(A_1\)` due to the confounder `\(U\)`.
- So `\(h_Y = E[Y \vert L_1, A_0, A_1]\)` will be a function of `\(A_0\)` and `\(A_1\)`. - Likewise, our estimate of `\(f(L_1 \vert A_0)\)` will depend on `\(A_0\)`. <center>
</center> --- # G-Null Paradox - Robins and Wasserman (1997) show that unless they are saturated, parametric models cannot correctly represent the g-formula under the null. - Suppose `\(L_1\)` is binary and `\(A_1\)` and `\(A_0\)` are continuous. - We fit the models `$$E[Y \vert L_1, A_1, A_0] = m(L_1, A_1, A_0; \theta) = \theta_0 + \theta_1 L_1 + \theta_2 A_1 + \theta_3 A_0$$` `$$P(L_1 = 1 \vert a_0) = e(L_1 = 1, A_0; \beta) = \frac{\exp(\beta_0 + \beta_1 A_0)}{1 + \exp(\beta_0 + \beta_1 A_0)}$$` - Plugging into the g-formula `$$E[Y(a_0, a_1)] = \sum_{l = 0}^1 m(l, a_1, a_0; \theta)e(l, a_0; \beta) =\\\ \theta_0 + \theta_2 a_1 + \theta_3 a_0 + \theta_1 \frac{\exp(\beta_0 + \beta_1 a_0)}{1 + \exp(\beta_0 + \beta_1 a_0)}$$` --- # G-Null Paradox - Under the strict null, `\(E[Y(a_0, a_1)]\)` does not depend on `\(a_0\)` and `\(a_1\)`. - We can see in our model that `\(\hat{E}[Y(a_0, a_1)]\)` always depends on `\(a_0\)` and `\(a_1\)` unless `\(\theta_2\)`, `\(\theta_3\)`, and `\(\beta_1\)` are all 0. + But this can't occur because we know `\(Y\)` is correlated with `\(A_1\)` and `\(A_0\)`. - If we had access to `\(U\)` and could model it correctly, this wouldn't be a problem. - Essentially, our parametric model does not include the strict null. --- # Beyond the Strict Null - It may be impossible to correctly specify the parametric G-formula even when the strict null does not hold. - For example, using the same models as before, the problem occurs in the DAG below, where there is an effect only of `\(A_1\)` on `\(Y\)`. - McGrath, Young, and Hernán (2022) show that non-negligible bias can occur in some non-null models *even* using flexible model specifications. - Previous results show that in at least some settings, bias from the g-null paradox is negligible. <center>
</center> --- # G-Null Paradox - By contrast, consider marginal structural models with IP weighting: - Under the strict null, `\(E[Y(\bar{a})]\)` does not depend on `\(\bar{a}\)`. - Under the strict null, the marginal structural mean model will never be mis-specified (as long as it contains an intercept parameter). - So the IP weighting method does not suffer from the g-null paradox. --- # Repeated Measures - Suppose that rather than observing only a single `\(Y\)` at the end of the study, we observe `\(Y_k\)` at every time point. - For example, suppose we are interested in the effect of highly active antiretroviral therapy (HAART) on CD4 count over time. + Example from Cole et al. (2007) - We can combine information across time points if we are willing to assume a common marginal structural model for each outcome. --- # Repeated Measures - Cole et al. (2007) use a marginal structural model for CD4 count at time `\(k\)` including a function of the cumulative time on HAART `$$E[Y_{k}(\bar{a}_k) \vert L_{0}] = \beta_{0,k} + \beta_{1}^{\top}L_{0} + \beta_2^\top h\left(\sum_{j =0}^k a_{j}\right)$$` - `\(h()\)` is a function that categorizes the cumulative time on treatment into four categories. - To fit this model, we need to assume that nobody was on treatment before the study, `\(A_{k} \equiv 0\)` for `\(k < 0\)`. + In this case that assumption is fine because the study starts when HAART became available. + The study also excludes anyone who was already on another form of ART. --- # Repeated Measures - Cole et al. compute weights `\(W_{ik}^{A}\)` for confounding by time-varying covariates and `\(W_{ik}^{C}\)` for censoring. - Note that the time-varying confounders *include* past values of CD4 count. - For any given time `\(k\)`, we could use our previous strategy and fit the marginal structural model weighted by `\(W_{ik}^{A}W_{ik}^C\)`.
- Our fitted model would let us compute the average counterfactual value of CD4 count that we would have observed at time `\(k\)` under various static interventions up to time `\(k\)`. --- # Repeated Measures - To combine across time points, we now think of our study as `\(K\)` (nested) shorter studies. - In study `\(k\)`, the outcome is `\(Y_k\)`. Past values of `\(Y\)` are time-varying covariates which may be confounders. <center> <img src="img/9_timevaryingY2.png" width="90%" /> </center> --- # Repeated Measures - Since we are using the same MSM in each study, we can fit one large pooled regression. - Each person contributes one row per time point after baseline. - The row for person `\(i\)` at time `\(k\)` is weighted by `\(W_{i,k}^A W_{i,k}^C\)`. - We then fit the MSM to the expanded data set using robust standard errors to account for repeated measurements of the same person. --- # Repeated Measures - This strategy works whenever there is an MSM that plausibly applies to all time points. - We relied on knowing the treatment value before time 0, usually assumed to be 0. - Alternatively, we could assume a "short memory". For example `$$E[Y_{k} \vert L_{0}, \bar{A}_{k}] = \beta_{0,k} + \beta_{1}^{\top}L_{0} + \beta_2 A_k + \beta_3 A_{k-1} + \beta_4 A_{k-2}$$` - In this model the outcome only depends on the three most recent treatments. - Without knowing treatment prior to the study, we can't use `\(Y_0\)` and `\(Y_1\)`, but we can fit the model for all time points from 2 onward. --- # Time to Event Outcomes - The treatment of time-to-event data is very similar to that of repeated measures data. + We have a repeated measure of death status, `\(D_k\)`. - We need a marginal structural model for the counterfactual hazard at time `\(k\)`. + Either use the Cox RR model + Or the logistic approximation - If we are using the Cox model, time-varying weights are used to adjust each individual's contribution to the "at risk" set at time `\(k\)`.
- The logistic model is fit using pooled data. - We've already seen an analysis like this in the NHS study. --- # Parametric G-Formula for Time to Event - To apply the parametric G-formula, we need a model for the hazard as a function of past history and time. - And a model (or models) for the covariate density as a function of past history. - For each person, we simulate time-varying covariates at each time. - We also compute the hazard at each time. We do not need to simulate death, only compute the probability of death. - We can then compute the counterfactual survival probability for each person. + And then average to find the average counterfactual survival. --- # Iterated Conditional Expectation (ICE) - Another way to express the G-formula is as a nested expectation. - For just two time points this is `$$E[Y(a_0, a_1)] = \color{purple}{E_{L_0}\bigg[ }\color{blue}{E_{L_1 \vert A_0 = a_0, L_0}\Big[} \color{green}{E\big[\color{red}{Y} \vert A_0 = a_0, A_1 = a_1, L_0, L_1\big] } \color{blue}{\Big \vert A_0 = a_0, L_0\Big]} \color{purple}{\bigg]}$$` --- # ICE Estimation - The nested expectation formulation gives us an alternative estimation strategy. - Define `$$Q_{K+1} = Y\\\ Q_{K} = E[Y \vert\bar{L}_{K}, \bar{A}_{K} = \bar{a}_{K}]\\\ \vdots\\\ Q_m = E[Q_{m+1} \vert \bar{L}_{m}, \bar{A}_{m} = \bar{a}_{m}]\\\ \vdots\\\ Q_0 = E[Q_1 \vert L_0, A_0 = a_0]$$` - We can propose a model for `\(E[Q_{m+1} \vert \bar{L}_{m}, \bar{A}_{m}]\)` and then iteratively fit each model. - When we get all the way down to 0, `\(E[Q_0]\)` will be an estimator of `\(E[Y(\bar{a})]\)`. --- # ICE Estimation - Example with two time points and a static intervention `\(\bar{a} = (a_0, a_1)\)`.
- In the first step (nothing really to do here), we use our definition that `\(Q_{2} = Y\)` `$$E[Y(a_0, a_1)] = \color{purple}{E_{L_0}\bigg[ }\color{blue}{E_{L_1 \vert A_0 = a_0, L_0}\Big[} \color{green}{E\big[\color{red}{Q_2} \vert A_0= a_0, A_1 = a_1, L_0, L_1\big] } \color{blue}{\Big \vert A_0 = a_0, L_0\Big]} \color{purple}{\bigg]}$$` -- - Let's propose a model: `$$E[Y\vert \bar{L}_{1}, \bar{A}_{1}] = \theta_{0, 1} + \theta_{1,1}cum(\bar{A}_{1}) + \theta_{2,1} L_{1}$$` - We can fit this as a regular regression. --- # ICE Estimation - Fitting this model gives us an estimate of the green part of the expression `$$\hat{Q}_1 = \hat{E}[Y \vert \bar{L}_1, A_1 = a_1, A_0]$$` which we plug into the expression `$$E[Y(a_0, a_1)] = \color{purple}{E_{L_0}\bigg[ }\color{blue}{E_{L_1 \vert A_0 = a_0, L_0}\Big[} \color{green}{\hat{Q}_1} \color{blue}{\Big \vert A_0 = a_0, L_0\Big]} \color{purple}{\bigg]}$$` - In order to fit the next regression, we estimate `\(\hat{Q}_1\)` for all the covariate values observed in our data. - We compute `\(\hat{Q}_{1,i}\)` for each person in the study by substituting `\(a_1\)` for `\(A_{1,i}\)` but leaving `\(L_{1,i}\)`, `\(L_{0,i}\)` and `\(A_{0,i}\)` at their natural values. --- # ICE Estimation - Next, use these fitted values to fit the same form of model one time point earlier `$$E[\hat{Q}_1 \vert L_{0}, A_{0}] = \theta_{0, 0} + \theta_{1,0}cum(A_0) + \theta_{2,0} L_{0}$$` - This gives us an estimate of `\(Q_0 = E[Q_1 \vert A_0 = a_0, L_0]\)` that we can plug into the expression. `$$E[Y(a_0, a_1)] = \color{purple}{E_{L_0}\bigg[ }\color{blue}{\hat{Q}_0} \color{purple}{\bigg]}$$` - We compute `\(\hat{Q}_{0,i}\)` for each person in the study by substituting `\(a_0\)` for `\(A_{0,i}\)` but leaving `\(L_{0,i}\)` at its natural value. - Finally we estimate `\(\hat{E}[Y(\bar{a})] = \frac{1}{N}\sum_{i=1}^N \hat{Q}_{0,i}\)` --- # ICE Estimation - The ICE estimator will be valid if the outcome regression was correct at each time point.
- An advantage over the parametric G-formula approach is that we didn't have to estimate the joint distribution of `\(L\)`. --- # Doubly Robust Estimator - We can turn the ICE estimator into a doubly robust estimator by adding a special covariate based on the propensity score to each regression. - Recall the "special covariate" regression strategy for estimating `\(Y(a)\)`: 1. Estimate `\(\hat{\pi}\)` using a `\(\pi\)`-model 2. Propose an outcome model `\(E[Y \vert A, L] = m(A, L; \theta)\)` 3. Estimate a model `\(E[Y \vert A, L] = m(A, L; \theta) + \phi_1 A/\hat{\pi} + \phi_2 (1-A)/(1-\hat{\pi})\)` 4. Compute the standardized estimates `$$E[Y(1)] = \frac{1}{N}\sum_{i=1}^N \left(m(1, L_i; \hat{\theta}) + \hat{\phi}_1\frac{1}{\hat{\pi}_i} \right)\\\ E[Y(0)] = \frac{1}{N}\sum_{i=1}^N \left(m(0, L_i; \hat{\theta}) + \hat{\phi}_2\frac{1}{1-\hat{\pi}_i} \right)$$` --- # Doubly Robust Estimator for Time Varying Treatment - We will use the same strategy in combination with the ICE estimation strategy. - At each regression step we compute the weights `$$W^{\bar{A}_m} = \prod_{k=0}^m \frac{1}{f(A_k \vert \bar{A}_{k-1}, \bar{L}_k)}$$` and `$$W^{\bar{A}_m, a_m = 1} = \frac{W^{\bar{A}_{m-1}}}{f(A_m = 1 \vert \bar{A}_{m-1}, \bar{L}_m)}$$` --- # Doubly Robust Estimator for Time Varying Treatment - In each regression step, we add `\(I(\bar{A}_{m-1} = \bar{a}_{m-1}) \hat{W}^{\bar{A}_m}\)` to the regression. - In the estimation step, we replace `\(A_m\)` with `\(a_m\)` *and* `\(W^{\bar{A}_m}\)` with `\(W^{\bar{A}_m, a_m = 1}\)`. <!-- --- --> <!-- # Doubly Robust Estimator for Time Varying Treatment --> <!-- - We will use a similar procedure with some modifications. --> <!-- 1a. Estimate time-varying weights (one per each person-time point) --> <!-- 1b. Compute the modified time-varying weight corresponding to the treatment strategy.
We will use "always treated" so `\(A_m = 1\)` --> <!-- `$$W^{\bar{A}_m, a_m = 1} = W^{\bar{A}_{m-1}} \frac{1}{f(A_m = 1 \vert \bar{A}_{m-1}, \bar{L}_m)}$$` --> <!-- --- --> <!-- # Doubly Robust Estimator for Time Varying Treatment --> <!-- 2a. Start at time `\(K\)` and fit an outcome model for `\(E[Y \vert \bar{A}_{K}, \bar{L}_{K}]\)` including `\(\hat{W}^{\bar{A}_K}\)` as a covariate. For example, --> <!-- `$$E[ Y \vert \bar{A}_K, \bar{L}_K] = \theta_{0, K} + \theta_{1,K}cum(\bar{A}_K) + \theta_{2,K} L_K + \theta_{3,K} \hat{W}^{\bar{A}_K}$$` --> <!-- Define `\(\hat{T}_{K+1} = Y\)`. --> <!-- 2b. Estimate `\(\hat{T}_{K}\)` under the treatment strategy in which `\(A_{K}\)` was set to 1. --> <!-- + To do this, we replace `\(A_K\)` in our data with 1 and `\(\hat{W}^{\bar{A}_K}\)` with `\(\hat{W}^{\bar{A}_{K-1}, a_K = 1}\)` --> <!-- + And then take the average of the predicted values with the new data. --> <!-- --- --> <!-- # Doubly Robust Estimator for Time Varying Treatment --> <!-- 2c. Proceed backwards to the next time point. Fit the same model, except now `\(\hat{T}_{K}\)` rather than `\(Y\)` is the outcome. --> <!-- `$$E[ \hat{T}_{K} \vert \bar{A}_{K-1}, \bar{L}_{K-1}] = \theta_{0, K} + \theta_{1,K-1}cum(\bar{A}_{K-1}) + \theta_{2,K-1} L_{K-1} + \theta_{3,K-1} \hat{W}^{\bar{A}_{K-1}}$$` --> <!-- 2d. Estimate `\(\hat{T}_{K-1}\)` as the fitted values from this model, replacing `\(A_{K-1}\)` with the intervention value `\(a_{K-1} = 1\)` and `\(\hat{W}^{\bar{A}_{K-1}}\)` with `\(\hat{W}^{\bar{A}_{K-2}, a_{K-1} = 1}\)`. --> <!-- --- --> <!-- # Doubly Robust Estimator for Time Varying Treatment --> <!-- 2e. Keep going all the way back to time point 0. --> <!-- - When we get all the way back to time 0, the average `\(\frac{1}{N}\sum_{i = 1}^N \hat{T}_{0,i}\)` is a doubly robust estimator of `\(Y(\bar{a})\)`. 
--> --- # Doubly Robust Estimator for Time Varying Treatment Our estimator is valid if: 1) The treatment model is correct at all times OR 2) The outcome model is correct OR 3) The treatment model is correct for times 0 to `\(k\)` and the outcome model is correct for times `\(k+1\)` to `\(K\)` for any `\(k < K\)` - This property is called `\(k + 1\)` robustness.
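--- # Doubly Robust Estimator: Sketch

- A minimal Python sketch of the clever-covariate ICE construction for two time points. The data-generating model and coefficients are invented for illustration, and the *true* propensity scores are plugged in for brevity; in practice `\(f(A_k \vert \bar{A}_{k-1}, \bar{L}_k)\)` would come from fitted propensity models.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

# Hypothetical data; all coefficients invented for illustration.
L0 = rng.normal(size=N)
p0 = 1 / (1 + np.exp(-0.5 * L0))             # true P(A0 = 1 | L0)
A0 = rng.binomial(1, p0)
L1 = 0.7 * L0 - 0.5 * A0 + rng.normal(size=N)
p1 = 1 / (1 + np.exp(-(0.5 * L1 + A0)))      # true P(A1 = 1 | A0, L1)
A1 = rng.binomial(1, p1)
Y = 1.0 + 0.6 * L1 + A0 + A1 + rng.normal(size=N)

a0 = a1 = 1  # the "always treat" strategy

def ols(X, y):
    """Least-squares coefficients."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Densities of the treatments actually received.
f0 = np.where(A0 == 1, p0, 1 - p0)
f1 = np.where(A1 == 1, p1, 1 - p1)

# Time-1 regression of Y on history plus I(A0 = a0) * W^{A_1};
# the prediction step sets A1 = a1 and swaps in W^{A_0, a_1 = 1}.
C1 = (A0 == a0) / (f0 * f1)
C1_int = (A0 == a0) / (f0 * p1)
X1 = np.column_stack([np.ones(N), A0, A1, L0, L1, C1])
X1_int = np.column_stack([np.ones(N), A0, np.full(N, a1), L0, L1, C1_int])
Q1 = X1_int @ ols(X1, Y)

# Time-0 regression of Q1 on (A0, L0) plus W^{A_0} = 1/f(A0 | L0);
# the prediction step sets A0 = a0 and swaps in 1/f(A0 = 1 | L0).
X0 = np.column_stack([np.ones(N), A0, L0, 1 / f0])
X0_int = np.column_stack([np.ones(N), np.full(N, a0), L0, 1 / p0])
Q0 = X0_int @ ols(X0, Q1)

# Standardize: the mean of Q0 estimates E[Y(1,1)] (2.7 in this simulation).
print(round(Q0.mean(), 2))
```

- Because the outcome models here are correctly specified, the clever-covariate coefficients are estimated near zero; the point of the construction is that the estimator also remains consistent when the outcome models are wrong but the treatment models are right.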