class: center, middle, inverse, title-slide .title[ # L10: Time Varying Treatment Part 1 ] .author[ ### Jean Morrison ] .institute[ ### University of Michigan ] .date[ ### Lecture on 2023-02-15 (updated: 2023-02-22) ] --- `\(\newcommand{\ci}{\perp\!\!\!\perp}\)` ## Lecture Outline 1. Introduction 1. Sequential Exchangeability 1. G-Formula For Time-Varying Treatments 1. IP Weighting for Time-Varying Treatments 1. G-Computation for Time-Varying Treatments, G-Null Paradox --- # 1. Introduction --- ## Example - Patients are treated for a disease over time. - At each appointment, the treatment decision for the next period is made, possibly based on current or past symptoms or treatments. - We observe a single outcome `\(Y\)` after all of the treatments are delivered. --- ## Example - A simple possible DAG for a setting with two time-points is below. - `\(L_1\)` represents symptoms, or perhaps how symptoms have changed since time 0. <center> <img src="img/9_dag2s.png" width="45%" /> </center> --- ## Notation and Conventions - Bar notation indicates the history of a variable `\(\bar{A}_k = (A_0, \dots, A_k)\)`. - By convention, `\(A_k\)` is the last variable at time `\(k\)`. + Covariates `\(L_k\)` are measurements that are taken after treatment `\(A_{k-1}\)` is given but before treatment `\(A_k\)` is given. - Timing aligns for all units. + We will often talk about time points as though they are evenly spaced (e.g. every month), but this is not required. - Time starts at 0. --- ## Treatment Programs - We might be interested in the effect of an entire course of treatment `\(\bar{A} = (A_0, A_1, \dots, A_K)\)`. - i.e. we are interested in the effect of a joint intervention on treatment at all time-points. - With 2 time points and a binary treatment, there are only four possible courses of treatment. - With `\(K\)` time points there are `\(2^K\)` treatment courses. - In fact, there are even more treatments that we could consider than these `\(2^K\)`! --- ## Treatment Strategies - A treatment strategy, `\(g\)` is a rule for determining `\(A_k\)` from a unit's past covariate values `$$g = (g_0(\bar{a}_{-1}, l_0), \dots, g_K(\bar{a}_{K-1}, \bar{l}_K))$$` - A treatment strategy is static if it **does not** depend on any covariates, only past treatments. + In a static strategy, we could write out the entire program at the beginning of the study. + Ex: Treat every other month + Ex: Treat for only the first two time points. - A treatment strategy is dynamic if it **does** depend on covariates. + Ex: Treat if `\(L_{k-1}\)` was high. + Ex: If `\(L_{k-1}\)` is high, switch treatment, so `\(A_{k} = 1-A_{k-1}\)`. Otherwise set `\(A_k = A_{k-1}\)`. --- ## Sequentially Randomized Trials - In a sequentially randomized trial, treatment `\(A_{k,i}\)` is assigned randomly with `\(P[A_{k,i} = a]\)` possibly depending on `\(\bar{A}_{k-1}\)` and `\(\bar{L}_{k-1}\)`. -- - Example: Every patient starts on treatment 0. Every month a random set of patients are assigned to switch to treatment 1 and stay on that treatment for the rest of the study. - Patients with high values of `\(L_{k}\)` may have a higher probability of starting treatment. - Example: Treatment is assigned randomly at every time point. - Patients with high values of `\(L_{k}\)` have a higher probability of switching treatments. --- ## Sequentially Randomized Trials - A random strategy will never be as good as the optimal deterministic strategy. - We would never recommend a random strategy for general treatment of patients. 
- But random strategies are necessary when the optimal strategy is unknown. --- ## Causal Contrasts - The causal contrast we choose to look at will depend on the study. - We might be interested in comparing specific fixed programs, `\(E[Y(\bar{A} = \bar{a})] - E[Y(\bar{A} = \bar{a}^\prime)]\)` such as + Always treat vs never treat: `\(\bar{a} = (1, 1, \dots, 1)\)`, `\(\bar{a}^\prime = (0, 0, \dots, 0)\)` + Treat early and continue vs begin treatment later and continue: `\(\bar{a} = (1, 1, \dots, 1)\)`, `\(\bar{a}^\prime = (0,\dots, 0, 1, \dots, 1)\)`. - Or we could compare one or more dynamic strategies `\(g\)`, `\(E[Y(g)]\)` such as: + Always treat vs treat only when symptoms are present. - In the next few lectures, we will always assume that the causal contrast of interest is defined a priori. There are also whole fields of research on determining the optimal treatment regime from observational data. --- # 2. Sequential Exchangeability --- ## Example - Consider the DAG we saw earlier: <center> <img src="img/9_dag2sw.png" width="55%" /> </center> - With your partner, propose a method to estimate `\(E[Y(a_0)]\)` and a method to estimate `\(E[Y(a_1)]\)`. --- ## Example <center> <img src="img/9_dag2sw.png" width="55%" /> </center> - What should we do if we want to estimate `\(E[Y(a_0, a_1)]\)`? Should we condition on `\(L_1\)` or not? --- ## Example - Data below were aggregated from a trial of 32,000 units. + The trial conforms to our previous DAG. + Treatment at time 0 is random with probability 0.5. + Treatment at time 1 depends only on `\(L_1\)`: `\(P[A_1 = 1 \vert L_1 = 1] = 0.8\)`, `\(P[A_1 = 1 \vert L_1 = 0] = 0.4\)`. <table class="table" style="width: auto !important; float: left; margin-right: 10px;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 52 </td> </tr> <tr> <td style="text-align:right;"> 9600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 52 </td> </tr> </tbody> </table> <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 4800 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 3200 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td>
<td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 44 </td> </tr> <tr> <td style="text-align:right;"> 6400 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 44 </td> </tr> </tbody> </table> - Take a minute to calculate the average effect of treatment at time 0 and the average effect of treatment at time 1 separately. --- # Example <table class="table" style="width: auto !important; float: left; margin-right: 10px;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 52 </td> </tr> <tr> <td style="text-align:right;"> 9600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 52 </td> </tr> </tbody> </table> <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 4800 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 3200 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 44 </td> </tr> <tr> <td style="text-align:right;"> 6400 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 44 </td> </tr> </tbody> </table> + There is no average effect of `\(A_0\)`: + `\(E[Y \vert A_0 = 0] = \frac{4000\cdot 84 + 12000 \cdot 52}{16000} = 60\)` + `\(E[Y \vert A_0 = 1] = \frac{8000\cdot 76 + 8000 \cdot 44}{16000} = 60\)` + Within each stratum of `\((A_0, L_1)\)`, the expected value of `\(Y\)` is equal regardless of `\(A_1\)`. Therefore there is no average effect of `\(A_1\)` on `\(Y\)` and there is no effect modification by `\(A_0\)`. 
+ Therefore there can be no effect of the joint intervention on `\(A_0\)` and `\(A_1\)`. --- # Example <table class="table" style="width: auto !important; float: left; margin-right: 10px;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 52 </td> </tr> <tr> <td style="text-align:right;"> 9600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 52 </td> </tr> </tbody> </table> <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 4800 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 3200 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 44 </td> </tr> <tr> <td style="text-align:right;"> 6400 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 44 </td> </tr> </tbody> </table> - We want to estimate `\(E[Y(1,1)] - E[Y(0, 0)]\)`. We've seen that our answer should be 0. - We could try computing `\(E[Y \vert A_0 = 1, A_1 = 1] - E[Y \vert A_0 = 0, A_1 = 0]\)`: + `\(E[Y \vert A_0 = 1, A_1 = 1] = \frac{3200\cdot 76 + 6400 \cdot 44}{9600} = 54.67\)` + `\(E[Y \vert A_0 = 0, A_1 = 0] = \frac{2400 \cdot 84 + 2400 \cdot 52}{4800} = 68\)` -- - Problem: Confounding from `\(L_1\)`.
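---
## Example: The Naive Contrast in R

- A small added illustration: the sketch below rebuilds the aggregated table as a data frame, here called `dat` (a name chosen for convenience), and reproduces the naive comparison above. The counts and stratum means come from the table; everything else is just one way to do the computation.

```r
# Aggregated trial data: one row per (A0, L1, A1) stratum
dat <- data.frame(
  N  = c(2400, 1600, 2400, 9600, 4800, 3200, 1600, 6400),
  A0 = c(0, 0, 0, 0, 1, 1, 1, 1),
  L1 = c(0, 0, 1, 1, 0, 0, 1, 1),
  A1 = c(0, 1, 0, 1, 0, 1, 0, 1),
  Y  = c(84, 84, 52, 52, 76, 76, 44, 44)   # stratum means of Y
)

# Naive comparison of "always treat" vs "never treat"
m11 <- with(subset(dat, A0 == 1 & A1 == 1), weighted.mean(Y, N))  # 54.67
m00 <- with(subset(dat, A0 == 0 & A1 == 0), weighted.mean(Y, N))  # 68
m11 - m00   # about -13.3, even though the true joint effect is 0
```

- Later code sketches reuse this `dat` object.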
--- # Example <table class="table" style="width: auto !important; float: left; margin-right: 10px;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 52 </td> </tr> <tr> <td style="text-align:right;"> 9600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 52 </td> </tr> </tbody> </table> <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 4800 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 3200 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 44 </td> </tr> <tr> <td style="text-align:right;"> 6400 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 44 </td> </tr> </tbody> </table> - We could try computing the associational difference within strata of `\(L_1\)` and then standardizing. - Stratifying on `\(L_1\)`: `$$E[Y \vert A_0 = 1, A_1 = 1, L_1 = 0] - E[Y \vert A_0 = 0, A_1 = 0, L_1 = 0] = 76-84 = -8$$` `$$E[Y \vert A_0 = 1, A_1 = 1, L_1 = 1] - E[Y \vert A_0 = 0, A_1 = 0, L_1 = 1] = 44-52 = -8$$` -- - Problem: `\(L_1\)` is a collider between `\(A_0\)` and `\(U\)`. 
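---
## Example: Stratifying on `\(L_1\)` in R

- A short added sketch, reusing the `dat` data frame from the earlier R slide: stratifying on `\(L_1\)` and then standardizing still returns -8 rather than the true value of 0, because conditioning on the collider `\(L_1\)` links `\(A_0\)` and `\(U\)`.

```r
# Stratum-specific differences, conditioning on L1 (a collider), then standardizing
d_l <- sapply(0:1, function(l) {
  m11 <- with(subset(dat, A0 == 1 & A1 == 1 & L1 == l), weighted.mean(Y, N))
  m00 <- with(subset(dat, A0 == 0 & A1 == 0 & L1 == l), weighted.mean(Y, N))
  m11 - m00
})
d_l                                            # -8 in both strata of L1
p_l <- with(dat, tapply(N, L1, sum) / sum(N))  # marginal distribution of L1
sum(d_l * p_l)                                 # -8: adjusting for L1 does not recover 0
```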
--- ## Estimating the Effect in the Example - To estimate the effect of the joint intervention in our example, we can start by looking at the SWIG <center> <img src="img/9_swig2s.png" width="85%" /> </center> --- ## Estimating the Effect in the Example - From the SWIG, we can determine four conditional independence statements $$ `\begin{split} &Y(a_0, a_1) \ci A_1(a_0) \vert A_0, L_1(a_0)\qquad &\text{(1)}\\ &Y(a_0, a_1) \ci A_0\qquad&\text{(2)}\\ &Y(a_0, a_1) \ci A_0 \vert L_1(a_0)\qquad&\text{(3)}\\ &L_1(a_0) \ci A_0 \qquad&\text{(4)} \end{split}` $$ <center> <img src="img/9_swig2s.png" width="55%" /> </center> --- ## Estimating the Effect in the Example - Now we will use these statements, consistency, and the law of total probability to compute `\(E[Y(a_0, a_1)]\)`. - First step: Use the law of total probability $$ E[Y(a_0, a_1)] = \sum_l E[Y(a_0, a_1) \vert L_1(a_0) = l]P[L_1(a_0) = l] $$ - Now we need to use our conditional independence relations and consistency to write each term as something we can estimate from the data. --- ## Estimating the Effect in the Example - Our first conditional independence relation is `$$Y(a_0, a_1) \ci A_1(a_0) \vert A_0, L_1(a_0)$$` - By consistency, if `\(A_0 = a_0\)` then `\(A_1 = A_1(a_0)\)` and `\(L_1 = L_1(a_0)\)`, so this relation means that `$$Y(a_0, a_1) \ci A_1 \vert A_0 = a_0, L_1$$` - So $$ `\begin{split} E[Y(a_0, a_1) &\vert L_1(a_0)] \\ = &E[Y(a_0, a_1) \vert A_0 = a_0, L_1(a_0)] \qquad \text{(CI relation 3)}\\ = & E[Y(a_0, a_1) \vert A_0 = a_0, L_1] \qquad \text{(Consistency)}\\ = & E[Y \vert A_1 = a_1, A_0 = a_0, L_1] \qquad \text{(CI relation 1 and consistency)} \end{split}` $$ --- ## Estimating the Effect in the Example - For the second part of our formula, `\(P[L_1(a_0) = l] = P[L_1 = l \vert A_0 = a_0]\)` by CI relation 4 and consistency. - Putting it all together we have $$ `\begin{split} E[Y(a_0, a_1)] = &\sum_l E[Y(a_0, a_1) \vert L_1(a_0) = l]P[L_1(a_0) = l]\\ = & \sum_l E[Y \vert A_1 = a_1, A_0 = a_0, L_1 = l]P[L_1 = l \vert A_0 = a_0] \end{split}` $$ - It will turn out that even though we need CI relations (3) and (4) to give causal interpretation to the parameters in the formula, the formula works if only (1) and (2) hold. + We didn't need (2) to derive this formula, but we will use it later.
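---
## Example: The G-Formula in R

- Here is a minimal added sketch (reusing the `dat` data frame from the earlier R slide) that implements the formula we just derived; the next slide carries out the same calculation by hand.

```r
# E[Y(a0, a1)] = sum_l E[Y | A1 = a1, A0 = a0, L1 = l] * P(L1 = l | A0 = a0)
gform <- function(a0, a1) {
  out <- 0
  for (l in 0:1) {
    EY  <- with(subset(dat, A0 == a0 & A1 == a1 & L1 == l), weighted.mean(Y, N))
    pL  <- with(subset(dat, A0 == a0), sum(N[L1 == l]) / sum(N))
    out <- out + EY * pL
  }
  out
}
gform(0, 0)                # 60
gform(1, 1)                # 60
gform(1, 1) - gform(0, 0)  # 0, as expected under the null
```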
--- ## Estimating the Effect in the Example <table class="table" style="width: auto !important; float: left; margin-right: 10px;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 2400 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 52 </td> </tr> <tr> <td style="text-align:right;"> 9600 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 52 </td> </tr> </tbody> </table> <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> N </th> <th style="text-align:right;"> \(A_{0}\) </th> <th style="text-align:right;"> \(L_{1}\) </th> <th style="text-align:right;"> \(A_{1}\) </th> <th style="text-align:right;"> \(\bar{Y}\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 4800 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 3200 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 76 </td> </tr> <tr> <td style="text-align:right;"> 1600 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 44 </td> </tr> <tr> <td style="text-align:right;"> 6400 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 44 </td> </tr> </tbody> </table> - We can now apply our formula: $$ E[Y(a_0, a_1)] = \sum_l E[Y \vert A_1 = a_1, A_0 = a_0, L_1 = l]P[L_1 = l \vert A_0 = a_0] $$ $$ `\begin{split} &E[Y(0, 0)] = 84 \cdot \frac{4000}{16000} + 52 \cdot \frac{12000}{16000} = 60\\ &E[Y(1, 1)] = 76 \cdot \frac{8000}{16000} + 44 \cdot \frac{8000}{16000} = 60 \end{split}` $$ --- ## Sequential Exchangeability - In our example, CI relations (1) and (2) are a version of **sequential exchangeability**. - For time-varying treatments, sequential exchangeability is the condition that will allow us to identify treatment effects. - We will also need slightly updated concepts of positivity and consistency. - The **time-varying G-formula** will be the formula that identifies these effects. + We applied the time-varying G-formula in our example. + Formal definition coming in the next section.
--- ## Static Sequential Exchangeability - Static sequential exchangeability says that `$$Y(\bar{a}) \ci A_k \vert\ \bar{A}_{k-1} = \bar{a}_{k-1}, \bar{L}_k\qquad k = 0, 1, \dots, K$$` - The joint counterfactual outcome is conditionally independent of each treatment **given** the treatment history before time `\(k\)` and the covariate history through time `\(k\)`. --- ## Static Sequential Exchangeability in the Example - In our example, we have two time points so there are two CI relations required to satisfy sequential exchangeability: $$ `\begin{split} &Y(a_0, a_1) \ci A_1 \vert A_0 = a_0, L_1\\ &Y(a_0, a_1) \ci A_0 \end{split}` $$ - From the SWIG we showed that $$ `\begin{split} &Y(a_0, a_1) \ci A_1 \vert A_0 = a_0, L_1(a_0)\\ &Y(a_0, a_1) \ci A_0 \end{split}` $$ - We can then conclude that `\(Y(a_0, a_1) \ci A_1 \vert A_0 = a_0, L_1\)` from consistency. --- ## Static Sequential Exchangeability - Does static sequential exchangeability hold in the SWIG below? <center> <img src="img/9_swig3.png" width="80%" /> </center> --- ## Static Sequential Exchangeability - Does static sequential exchangeability hold in the SWIG below? <center> <img src="img/9_swig4.png" width="80%" /> </center> --- ## Sequential Exchangeability for Dynamic Treatment Strategies - Sequential exchangeability for `\(Y(g)\)` holds if `$$Y(g) \ci A_k \vert \bar{A}_{k-1} = g(\bar{A}_{k-2}, \bar{L}_{k-1}), \bar{L}_k \qquad k = 0, 1, \dots, K$$` - This definition applies if `\(g\)` is static or dynamic, random or deterministic. - Also called **sequential conditional exchangeability**. --- ## SWIGs for Dynamic Treatment Strategies - Suppose that we want to estimate the counterfactual `\(E[Y(g)]\)` where `\(g\)` is a dynamic treatment strategy "treat only if `\(L_k = 1\)`". + Recall that the SWIG represents the hypothetical world of the intervention, not the observational world. - Our intervention **introduces an arrow** from `\(L_1(g_0)\)` to the value of `\(A_1\)` in the interventional world. <center> <img src="img/9_swig2dyn.png" width="65%" /> </center> --- ## SWIGs for Dynamic Treatment Strategies - The dotted arrow is created by the proposed intervention. + It is not a result of the experimental design or underlying causal structure. - The dotted arrow functions just like a solid arrow for computing d-separation. + It is dotted so that we know it was introduced by the intervention. <center> <img src="img/9_swig2dyn.png" width="75%" /> </center> --- ## Sequential Exchangeability for Dynamic Treatment Strategies - Does sequential exchangeability hold for `\(Y(g)\)` in the dynamic intervention? <center> <img src="img/9_swig2dyn.png" width="75%" /> </center> -- - We can see that `\(Y(g) \ci A_0\)` and `\(Y(g) \ci A_1(g_0) \vert\ L_1(g_0), A_0 = g_0\)` - Using consistency, `\(Y(g) \ci A_1 \vert L_1, A_0 = g_0\)` --- ## Sequential Exchangeability for Dynamic Treatment Strategies <center> <img src="img/9_swig3dyn.png" width="75%" /> </center> -- - We don't have `\(Y(g) \ci A_0\)` + They are connected by the `\(A_0 - W_0 - L_1(g_0) - g_1 - Y(g)\)` path + The `\(g_1\)` node does not block the path because it's not fixed. --- ## Positivity - Let `\(f_{\bar{A}_{k-1}, \bar{L}_k}\)` be the joint density of the treatment history through time `\(k-1\)` and the covariate history through time `\(k\)`.
- For time-varying treatments, positivity requires that `$$f_{\bar{A}_{k-1}, \bar{L}_k}(\bar{a}_{k-1}, \bar{l}_k) > 0 \Rightarrow f_{A_{k} \vert \bar{A}_{k-1}, \bar{L}_k}(a_k \vert \bar{a}_{k-1}, \bar{l}_k) > 0$$` - If we are interested in a particular strategy, `\(g\)`, the condition only needs to hold for treatment histories compatible with `\(g\)` (i.e. `\(a_k = g_k(\bar{a}_{k-1}, \bar{l}_k)\)`). - This condition says that given past treatment history and covariates, any treatment consistent with the strategy should be possible. --- ## Consistency - For a point treatment, consistency requires that `\(A = a \Rightarrow Y(a) = Y\)`. - For a static strategy, the condition `\(\bar{A} = \bar{a} \Rightarrow Y(\bar{a}) = Y\)` is sufficient. - For dynamic strategies, if `\(A_k = g_k(\bar{A}_{k-1}, \bar{L}_k)\)` for all `\(k\)` then `\(Y(g) = Y\)`. --- # 3. Time-Varying G-Formula --- ## G-Formula - The g-formula for point treatments has been the basis of the IPW, standardization, and doubly robust methods we have seen so far: -- `$$E[Y(a)] = \sum_l E[Y \vert A = a, L = l]f_L(l)$$` - Integral version for continuous `\(L\)`, `$$E[Y(a)] = \int_{l} E[Y \vert A = a, L = l] d F_L(l)$$` --- ## G-Formula for Static Treatment Strategies - The G-Formula for two time points is `$$E[Y(a_0, a_1)] = \sum_l E[Y \vert A_0 = a_0, A_1 = a_1, L_1 = l]f_{L_1 \vert A_0}(l \vert a_0)$$` - This should look familiar; we used it in the example earlier. --- ## G-Formula for Static Treatment Strategies - We saw earlier that static sequential exchangeability holds in this graph. <center> <img src="img/9_swig3.png" width="65%" /> </center> - However, `\(E[Y \vert A_0=a_0, A_1 = a_1, L_1 = l] \neq E[Y(a_0, a_1) \vert L_1(a_0)]\)` and `\(P[L_1 = l \vert A_0 = a_0] \neq P[L_1(a_0) = l]\)`. - Nevertheless, the G-formula still holds for this graph. --- ## G-Formula as Iterated Expectations - Suppose that we have two time points and static sequential exchangeability holds: $$ `\begin{split} &Y(a_0, a_1) \ci A_1 \vert A_0 = a_0, L_1\\ &Y(a_0, a_1) \ci A_0 \end{split}` $$ - Think of `\(Y(a_0, a_1)\)` as `\(Y(a_1)(a_0)\)`. We could also write it as `\(Y(a_0)(a_1)\)` but the first ordering is more useful. - Re-write the second relation as `\(Y(a_1)(a_0) \ci A_0\)`. From this we can see that if we knew `\(Y(a_1)\)` we could calculate `$$E[Y(a_0, a_1)] = E[Y(a_1) \vert A_0 = a_0]$$` --- ## G-Formula as Iterated Expectations $$ `\begin{split} &Y(a_0, a_1) \ci A_1 \vert A_0 = a_0, L_1\\ &Y(a_0, a_1) \ci A_0 \end{split}` $$ - Now re-write the first relation as `\(Y(a_1) \ci A_1 \vert A_0 = a_0, L_1\)` using consistency. This means that $$ `\begin{split} E[Y(a_0, a_1)] = &E[Y(a_1) \vert A_0 = a_0]\\ =& \sum_l E[Y \vert A_1 = a_1, A_0 = a_0, L_1 = l]P[L_1 = l\vert A_0 = a_0] \end{split}` $$ - The first line is our result from the last slide. - The second line follows from the point-treatment G-formula. --- ## General Version of G-Formula for Static Treatments - The G-formula for a static treatment strategy generalizes to `$$E[Y(\bar{a})] = \sum_\bar{l} E[Y \vert \bar{A} = \bar{a},\bar{L}= \bar{l}]\prod_{k = 0}^Kf(l_k \vert \bar{a}_{k-1}, \bar{l}_{k-1})$$` or `$$\int_l E[Y \vert \bar{A} = \bar{a},\bar{L}= \bar{l}] \prod_{k = 0}^K dF (l_k \vert \bar{a}_{k-1}, \bar{l}_{k-1})$$` --- ## G-Formula for Dynamic Treatment Strategies - In a static deterministic strategy `\(a_k\)` can be completely determined ahead of time. - For dynamic or random strategies, we need to add a term to the G-formula.
`$$E[Y(g)] = \sum_{\bar{a}, \bar{l}} E[Y \vert \bar{A} = \bar{a},\bar{L}= \bar{l}]\prod_{k = 0}^Kf(l_k \vert \bar{a}_{k-1}, \bar{l}_{k-1})\prod_{k=0}^K f^{int}(a_k \vert \bar{a}_{k-1}, \bar{l}_k)$$` - `\(f^{int}\)` is the conditional probability of `\(a_k\)` given the history *under the proposed intervention*. + For a deterministic strategy, `\(f^{int}(a_k \vert \bar{a}_{k-1}, \bar{l}_k)\)` is 1 if `\(a_k = g_k(\bar{a}_{k-1}, \bar{l}_k)\)` and 0 otherwise. --- # 4. IP Weighting for Time-Varying Treatments --- ## Inverse Probability Weighting - We can generalize the IPW strategy we have been using for a point treatment to the time-varying setting. `$$W^{\bar{A}} = \prod_{k = 0}^{K} \frac{1}{f(A_k \vert \bar{A}_{k-1}, \bar{L}_{k})}$$` -- - As before, we can stabilize the weights as `$$SW^{\bar{A}} = \prod_{k = 0}^{K} \frac{f(A_k \vert \bar{A}_{k-1})}{f(A_k \vert \bar{A}_{k-1}, \bar{L}_{k})}$$` - If there are baseline covariates, `\(L_0\)`, we can condition on `\(L_0\)` in both numerator and denominator `$$SW^{\bar{A}} = \prod_{k = 0}^{K} \frac{f(A_k \vert \bar{A}_{k-1}, L_0)}{f(A_k \vert \bar{A}_{k-1}, \bar{L}_{k}, L_0)}$$` - Only the model for the denominator needs to be correct. --- ## Inverse Probability Weighting - Just like before, weighting subjects creates a pseudo-population in which treatment and the measured confounders are independent. - So we can compute the counterfactual mean simply as the conditional mean in the pseudo-population `$$E[Y(a_0, a_1)] = E_{ps}[Y \vert A_0 = a_0, A_1 = a_1]$$` --- ## IP Weighting Example <center> <img src="img/9_fig211.png" width="60%" /> </center> - Compute unstabilized weights in the example. - Compute the sample size in each stratum. How big is the pseudo-population? --- ## IP Weighting Example <center> <img src="img/9_fig211.png" width="60%" /> </center> - Compute stabilized weights in the example. - How big is the pseudo-population created by the stabilized weights? --- ## IP Weighting Example <center> <img src="img/9_fig213.png" width="80%" /> </center> --- ## Using IP Weights Non-Parametrically + Once we have computed `\(W^{\bar{A}}\)` or `\(SW^{\bar{A}}\)` we can estimate `\(E[Y(\bar{a})]\)` as `$$\frac{\hat{E}\left[W_i^{\bar{A}}Y_i I(\bar{A}_i = \bar{a}) \right]}{\hat{E}[W^{\bar{A}}_i I(\bar{A}_i = \bar{a})]}= \frac{\sum_{i = 1}^{N} W^{\bar{A}}_i Y_i I(\bar{A}_i = \bar{a})}{\sum_{i =1}^N W_i^{\bar{A}}I(\bar{A}_i = \bar{a})}$$` - We could have used either stabilized or unstabilized weights. With unstabilized weights, the denominator will always equal `\(N\)`. - Notice that we are only making use of samples with observed treatment history identical to the proposed intervention. --- ## Weights for Dynamic Treatments <center> <img src="img/9_fig213cropped.png" width="80%" /> </center> - To compute the non-parametric IP weighted estimate for `\(E[Y(0, 0)]\)` we only need the data for the units that received treatment `\((0, 0)\)`. - Using the unstabilized weights: `$$E[Y(0, 0)] = 84 \cdot \frac{8000}{32000} + 52 \cdot \frac{24000}{32000} = 60$$` --- ## Using IP Weights Non-Parametrically - An equivalent way to think of IP weights used non-parametrically is as censoring weights for "non-adherence". - Suppose we want to compare "always treat" and "never treat" strategies. - We first censor anyone who did not adhere to one of these strategies and think of our study as now a study of the point treatment assigned at time 0 among those who fully adhere. - We compute the censoring weights `$$W^{C}_i = \frac{1}{P[A_1 = A_{0,i} \vert A_0 = A_{0,i}, L_1, L_0]}$$` - And then compute the confounding weights `$$W_i^{L} = \frac{1}{P[A_0 = A_{0,i} \vert L_0]}$$` - So the total weights are the product of `\(W^{L}\)` and `\(W^{C}\)`.
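---
## Example: IP Weights in R

- An added sketch of the weight calculations for the two-time-point example, again reusing the `dat` data frame from the earlier R slide. The treatment probabilities are estimated non-parametrically from the counts, so the pseudo-population sizes answer the questions posed above.

```r
# Treatment probabilities estimated from the aggregated counts
p_A0        <- with(dat, tapply(N, A0, sum) / sum(N))                        # P(A0 = a0)
dat$pA0     <- p_A0[as.character(dat$A0)]
dat$pA1     <- dat$N / with(dat, ave(N, A0, L1, FUN = sum))                  # P(A1 = a1 | A0, L1)
dat$pA1marg <- with(dat, ave(N, A0, A1, FUN = sum) / ave(N, A0, FUN = sum))  # P(A1 = a1 | A0)

dat$W  <- 1 / (dat$pA0 * dat$pA1)   # unstabilized weights
dat$SW <- dat$pA1marg / dat$pA1     # stabilized weights (the A0 term is f(A0)/f(A0) = 1)
c(sum(dat$N * dat$W), sum(dat$N * dat$SW))   # pseudo-population sizes: 128000 and 32000

# Non-parametric IPW estimate of E[Y(a0, a1)]
ipw <- function(a0, a1) {
  s <- subset(dat, A0 == a0 & A1 == a1)
  sum(s$N * s$W * s$Y) / sum(s$N * s$W)
}
ipw(0, 0)   # 60
ipw(1, 1)   # 60
```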
--- ## Weights for Dynamic Treatments <center> <img src="img/9_fig213cropped.png" width="80%" /> </center> - Suppose we want to compare dynamic regimes `\(g = (0, L_1)\)` and `\(g^{\prime} = (0, 1-L_1)\)`. - We can see that, conditional on `\(A_0\)` and `\(L_1\)`, treatment choice at `\(A_1\)` doesn't matter so we should find that `\(E[Y(g)] - E[Y(g^{\prime})] = 0\)`. --- ## Weights for Dynamic Treatments <center> <img src="img/9_fig213cropped.png" width="80%" /> </center> - Using the `\(W^{\bar{A}}\)` we can compute `\(E[Y(g)]\)` and `\(E[Y(g^\prime)]\)` using the non-parametric approach. - The first and fourth rows follow the `\((0, L_1)\)` treatment plan while the second and third rows follow `\((0, 1-L_1)\)`. --- ## Weights for Dynamic Treatments <center> <img src="img/9_fig213cropped.png" width="80%" /> </center> - Computing the counterfactual means for two dynamic treatments: `$$E[Y(g)] = \frac{1}{N}\sum_{i = 1}^N Y_i W_i^{\bar{A}} I(\bar{A}_i = g) = \frac{8000\cdot 84 + 24000 \cdot 52}{32000} = 60$$` `$$E[Y(g^{\prime})] = \frac{1}{N}\sum_{i = 1}^N Y_i W_i^{\bar{A}} I(\bar{A}_i = g^\prime) = \frac{8000\cdot 84 + 24000 \cdot 52}{32000} = 60$$` --- ## Weights for Dynamic Treatments <center> <img src="img/9_fig213cropped.png" width="80%" /> </center> - We can do the same calculation using `\(SW^{\bar{A}}\)`, but we get the wrong answer. `$$E[Y(g)] = \frac{1200 \cdot 84 + 8400 \cdot 52}{1200 + 8400} = 56$$` `$$E[Y(g^{\prime})] = \frac{2800\cdot 84 + 3600 \cdot 52}{2800 + 3600} = 66$$` --- ## Weights for Dynamic Treatments - The problem is the numerator of the stabilized weights. - We calculated the numerator as `\(f(A_1 = A_{1,i} \vert A_{0} = A_{0,i})\)`. Under a dynamic strategy, `\(A_1\)` is not fixed in advance but varies with `\(L_1\)`, so this numerator differs across covariate strata and does not cancel out of the weighted average. - This means that we cannot use these stabilized weights for dynamic interventions. --- ## Estimating Weights - If `\(L_k\)` is high dimensional or there are many time points, we will need to assume a parametric model for `\(f(A_k \vert \bar{A}_{k-1}, \bar{L}_k)\)`. + We might assume that `\(A_k\)` depends only on the most recent treatment and covariates. + Or some summary of the past history. - We can fit one model (e.g. logistic regression) at each time point. `$$E[A_k \vert \bar{A}_{k-1}, \bar{L}_{k-1}] = \beta_{0,k} + \beta_{1,k} A_{k-1} + \beta_{2,k} cum_{-5}(\bar{A}_{k-1}) + \beta_{3,k} L_{k-1}$$` + `\(cum_{-5}(\bar{A}_{k-1})\)` is the number of times treated in the previous five time points. --- ## Estimating Weights - Alternatively, we could assume that some coefficients are shared across time points and fit a pooled model, possibly with some time effects. `$$E[A_k \vert \bar{A}_{k-1}, \bar{L}_{k-1}] = \beta_{0,k} + \beta_1 A_{k-1} + \beta_2 cum_{-5}(\bar{A}_{k-1}) + \beta_3 L_{k-1} + \beta_4 A_{k-1} k$$` - This is a more commonly used approach than fitting one model at every time point. - To fit this model, convert the data into "long" format with one row per person-time combination. + Add columns for any time-dependent covariates. - We now want to fit a marginal model with repeated measures, so we can use GEE. --- ## Non-Parametric Estimation - If we are totally non-parametric and there is no effect of treatment on confounders, then estimating the effect of a treatment strategy `\(\bar{a}\)` or `\(g\)` is very similar to estimating the effect of a point treatment: assignment to a particular regime. - HR call effects of treatment on confounders "treatment-confounder feedback".
- If there is no effect of treatment history on time-varying confounders, `\(f(L_k \vert \bar{A}_{k-1}, \bar{L}_{k-1}) = f(L_k \vert \bar{L}_{k-1})\)` and the G-formula for time-varying treatment reduces to the regular point-treatment G-formula (with `\(\bar{L}\)` playing the role of the baseline covariates). - Whether or not there is treatment-confounder feedback, if we are willing to make parametric assumptions, we can borrow information from units with similar treatment histories. --- ## Marginal Structural Models - Just as we did before, we can use our IP weights to fit parametric marginal structural models. - For example, we might assume that the total treatment effect of `\(\bar{a}\)` only depends on the total number of times treated and not on the timing of the treatment. `$$E[Y(\bar{a})] = \beta_0 + \beta_1 cum(\bar{a})$$` --- ## Marginal Structural Models - Once we have proposed a marginal structural mean model, we can fit it using the pseudo-population created by weighting the data. `$$E_{ps}[Y \vert \bar{A}] = \beta_0 + \beta_1 cum(\bar{A})$$` - `\(\hat{\beta}_1\)` estimates the causal effect of increasing the number of treated periods by one. - Variance from bootstrap or, conservatively, from the robust sandwich estimator. - Testing `\(\beta_1 = 0\)` gives a test of the strong null that treatment at any time is unrelated to outcome, `\(Y(\bar{a}) = Y\)` for all `\(\bar{a}\)`. --- ## Marginal Structural Models for Effect Modification - We could propose a marginal structural model that includes effect modification `$$E_{ps}[Y \vert \bar{A}, V] = \beta_0 + \beta_1 cum(\bar{A}) + \beta_2 V + \beta_3 cum(\bar{A}) V$$` - What are `\(\beta_1\)`, `\(\beta_2\)` and `\(\beta_3\)`? --- ## Assumptions - For correct inference using IP weighting + a marginal structural model we need: -- - Consistency, sequential positivity, sequential conditional exchangeability - Correct propensity-score model - Correct marginal structural model --- # 5. G-Computation and the G-Null Paradox --- ## Parametric G-Formula - When we only had a single point intervention, we could use outcome regression plus standardization as a plug-in estimator of the g-formula. - We needed to estimate `\(E[Y \vert A, L]\)` but not `\(f_L(l)\)`, the density of covariates. - To estimate `\(E[Y(a)]\)` we replaced each person's treatment value with `\(a\)` and then estimated `\(\hat{Y}_i(a) = \hat{E}[Y \vert A = a, L = L_i]\)`. + We then approximated the integral `$$\int_l E[Y \vert A = a, L = l]f_L(l) dl$$` with the sum `$$\frac{1}{N} \sum_{i = 1}^N \hat{Y}_i(a)$$` --- ## Parametric G-Formula - There is an analog of this strategy for time-varying treatments. - Recall the integral form of the G-formula for static time-varying treatments: `$$\int E[Y \vert \bar{A} = \bar{a},\bar{L}= \bar{l}] \prod_{k = 0}^K dF (l_k \vert \bar{a}_{k-1}, \bar{l}_{k-1})$$` - In the time-varying g-formula, we clearly need to estimate `\(E[Y \vert \bar{A}, \bar{L}]\)`. - Can we use our same standardization trick to avoid estimating the covariate density? -- - No --- ## Parametric G-Formula - Imagine we try the standardization trick with two time points: - We first set `\(A_0 = a_0\)` and `\(A_1 = a_1\)` for everyone in the data set. - What is the problem? -- - The value of the covariates `\(L_1\)` depends on `\(A_0\)`. + We need to replace `\(L_1\)` with `\(L_1(a_0)\)` but these values are not observed. --- ## Simulating Covariates - We need to propose a parametric model for the density `\(f(l_k \vert \bar{a}_{k-1}, \bar{l}_{k-1})\)`. - We can then simulate covariate histories conditional on the intervention of interest from our estimated model.
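---
## Parametric G-Formula: Sketch in R

- The next slides walk through the algorithm step by step. As a preview, here is a condensed, added sketch for two time points and a static regime `\((a_0, a_1)\)`. It assumes a hypothetical individual-level data frame `df` with columns `L0`, `A0`, `L1`, `A1`, `Y` (binary `\(L_1\)`, continuous `\(Y\)`); the model forms are illustrative choices, not the only valid specification.

```r
# Step 1: fit the outcome model and a model for the density of L1 given history
fit_L1 <- glm(L1 ~ A0 + L0, family = binomial, data = df)  # f(L1 | A0, L0)
fit_Y  <- lm(Y ~ L0 + A0 + L1 + A1, data = df)             # E[Y | A-bar, L-bar]

gcomp <- function(a0, a1, S = 1e5) {
  # Step 2: resample baseline covariates, set A0 by the regime, simulate L1, set A1
  sim <- df[sample(nrow(df), S, replace = TRUE), "L0", drop = FALSE]
  sim$A0 <- a0
  sim$L1 <- rbinom(S, 1, predict(fit_L1, newdata = sim, type = "response"))
  sim$A1 <- a1
  # Steps 3-4: predict Y from the fitted outcome model and average
  mean(predict(fit_Y, newdata = sim))
}

gcomp(1, 1) - gcomp(0, 0)   # estimate of E[Y(1, 1)] - E[Y(0, 0)]
```

- For a dynamic regime, `A0` and `A1` would instead be set as functions of the simulated covariate history.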
<!-- - Finally, we compute `\(E[Y(\bar{a})]\)` by standardizing over the data set with simulated covariates replaced. --> --- ## Parametric G-Formula Algorithm Step 1. Fit parametric models for + `\(m(\bar{A}, \bar{L}; \theta) = E[Y \vert \bar{A}, \bar{L}]\)` + `\(e_{L_k}(\bar{A}_{k-1}, \bar{L}_{k-1}; \beta) = f(L_k \vert \bar{A}_{k-1}, \bar{L}_{k-1})\)` + `\(e_{L_k}\)` is a `\(p\)`-dimensional density, where `\(p\)` is the dimension of `\(L_{k}\)`. + We might propose a component-wise model for `\(e_{L_k}\)`. + Note that `\(e_{L,k}\)` is an estimate of a density, not just the expectation. --- ## Parametric G-Formula Algorithm Step 2: - Start with the original data and delete everything after time 0 <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:left;"> \(L_{0}\) </th> <th style="text-align:left;"> \(A_{0}\) </th> <th style="text-align:left;"> \(L_{1}\) </th> <th style="text-align:left;"> \(A_{1}\) </th> <th style="text-align:left;"> \(Y\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> \(l_{0, 1}\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> \(l_{0, 2}\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> </tr> <tr> <td style="text-align:left;"> N </td> <td style="text-align:left;"> \(l_{0, N}\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> </tbody> </table> --- ## Parametric G-Formula Algorithm Step 2: - Fill `\(A_0\)` in with the value dictated by the intervention, `\(g\)` <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:left;"> \(L_{0}\) </th> <th style="text-align:left;"> \(A_{0}\) </th> <th style="text-align:left;"> \(L_{1}\) </th> <th style="text-align:left;"> \(A_{1}\) </th> <th style="text-align:left;"> \(Y\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> \(l_{0, 1}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,1})\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> \(l_{0, 2}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,2})\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> </tr> <tr> <td 
style="text-align:left;"> N </td> <td style="text-align:left;"> \(l_{0, N}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,N})\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> </tbody> </table> --- ## Parametric G-Formula Algorithm Step 2: - Simulate values for `\(L_1\)` by sampling `\(\tilde{l}_{1,i}(g)\)` from `\(e_{L_1}(g_1(L_{0,i}), L_{0,i}; \hat{\beta})\)` <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:left;"> \(L_{0}\) </th> <th style="text-align:left;"> \(A_{0}\) </th> <th style="text-align:left;"> \(L_{1}\) </th> <th style="text-align:left;"> \(A_{1}\) </th> <th style="text-align:left;"> \(Y\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> \(l_{0, 1}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,1})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,1}(g)\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> \(l_{0, 2}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,2})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,2}(g)\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> </tr> <tr> <td style="text-align:left;"> N </td> <td style="text-align:left;"> \(l_{0, N}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,N})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,N}(g)\) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> </tbody> </table> --- ## Parametric G-Formula Algorithm Step 2: - Repeat for all subsequent time-points <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:left;"> \(L_{0}\) </th> <th style="text-align:left;"> \(A_{0}\) </th> <th style="text-align:left;"> \(L_{1}\) </th> <th style="text-align:left;"> \(A_{1}\) </th> <th style="text-align:left;"> \(Y\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> \(l_{0, 1}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,1})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,1}(g)\) </td> <td style="text-align:left;color: red !important;"> \(g_1(g_0(l_{0,1}), \bar{\tilde{l}}_{1,1}(g))\) </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> \(l_{0, 2}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,2})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,2}(g)\) </td> <td style="text-align:left;color: red !important;"> \(g_1(g_0(l_{0,2}), \bar{\tilde{l}}_{1,2}(g))\) </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) 
</td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> </tr> <tr> <td style="text-align:left;"> N </td> <td style="text-align:left;"> \(l_{0, N}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,N})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,N}(g)\) </td> <td style="text-align:left;color: red !important;"> \(g_1(g_0(l_{0,N}), \bar{\tilde{l}}_{1,N}(g))\) </td> <td style="text-align:left;"> </td> </tr> </tbody> </table> --- ## Parametric G-Formula Algorithm Step 3: - Fill in `\(\hat{Y}_i\)` by plugging previous values of treatment and covariates into the fitted outcome model `\(m\)`. <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> ID </th> <th style="text-align:left;"> \(L_{0}\) </th> <th style="text-align:left;"> \(A_{0}\) </th> <th style="text-align:left;"> \(L_{1}\) </th> <th style="text-align:left;"> \(A_{1}\) </th> <th style="text-align:left;"> \(Y\) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> \(l_{0, 1}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,1})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,1}(g)\) </td> <td style="text-align:left;color: red !important;"> \(g_1(g_0(l_{0,1}), \bar{\tilde{l}}_{1,1}(g))\) </td> <td style="text-align:left;color: red !important;"> \(\hat{Y}_1(g)\) </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> \(l_{0, 2}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,2})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,2}(g)\) </td> <td style="text-align:left;color: red !important;"> \(g_1(g_0(l_{0,2}), \bar{\tilde{l}}_{1,2}(g))\) </td> <td style="text-align:left;color: red !important;"> \(\hat{Y}_2(g)\) </td> </tr> <tr> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> <td style="text-align:left;color: red !important;"> \(\vdots\) </td> </tr> <tr> <td style="text-align:left;"> N </td> <td style="text-align:left;"> \(l_{0, N}\) </td> <td style="text-align:left;color: red !important;"> \(g_0(l_{0,N})\) </td> <td style="text-align:left;color: red !important;"> \(\tilde{l}_{1,N}(g)\) </td> <td style="text-align:left;color: red !important;"> \(g_1(g_0(l_{0,N}), \bar{\tilde{l}}_{1,N}(g))\) </td> <td style="text-align:left;color: red !important;"> \(\hat{Y}_N(g)\) </td> </tr> </tbody> </table> -- Step 4: + Compute the mean `\(\frac{1}{N} \hat{Y}_i(g)\)` --- ## Simulating Covariates - Since we are simulating, we might as well create more data. - Rather than starting with the original data set, resample with replacement, a large number of observations, `\(S\)`, from the original data. + In the resampled data, the procedure is the same: keep `\(L_0\)` and generate all subsequent data from fitted models. + Since we are sampling covariates from a distribution, this procedure helps reduce the variance of our estimate. 
- Alternatively, instead of resampling, we could replicate the original data several times. --- ## Assumptions - For valid inference using the parametric g-formula we need: - Consistency, sequential positivity, sequential conditional exchangeability - Correct model for `\(E[Y \vert \bar{A}, \bar{L}]\)` - Correct model for the density of `\(L_k\)` given covariate and treatment history. --- ## Parametric G-Formula Implementation - The R package `gfoRmula` implements the estimation and simulation procedure we have described. - Can handle: + Binary and continuous outcomes + Time to event outcomes - Can estimate effects of both static and dynamic treatments. - Allows a variety of specifications for outcome model and covariate models including lagged effects and cumulative effects. <!-- --- --> <!-- # Example --> <!-- - Built in data `binary_eofdata` contains data for 2,500 individuals measured at 7 time points. --> <!-- - There are 3 time-varying covariates. --> <!-- + `cov1` is binary --> <!-- + `cov2` is continuous --> <!-- + `cov3` is categorical with 6 values. --> <!-- + Treatment is continuous. --> --- ## G-Null Paradox - In the DAG below, the strict null holds - there is no effect of treatment at any time on `\(Y\)`. - However `\(Y\)` is correlated with `\(A_0\)` and `\(A_1\)` due to the confounder `\(U\)`. - So `\(h_Y = E[Y \vert L_1, A_0, A_1]\)` will not be constant across values of `\(A_0\)` and `\(A_1\)`. - Our estimate of `\(f(L_1 \vert A_0)\)` will also not be a constant function of `\(A_0\)`. <center>
</center> --- ## G-Null Paradox - Robins and Wasserman (1997) show that unless the parametric models are saturated, parametric models cannot correctly represent the g-formula under the null. - Suppose `\(L_1\)` is binary and `\(A_1\)` and `\(A_0\)` are continuous. - We fit the models `$$E[Y \vert L_1, A_1, A_0] = m(L_1, A_1, A_0; \theta) = \theta_0 + \theta_1 L_1 + \theta_2 A_1 + \theta_3 A_0$$` `$$P(L_1 = 1 \vert a_0) = e(L_1 = 1, A_0; \beta) = \frac{\exp(\beta_0 + \beta_1 A_0)}{1 + \exp(\beta_0 + \beta_1 A_0)}$$` - Plugging into the g-formula `$$E[Y(a_0, a_1)] = \sum_{l = 0}^1 m(l, a_1, a_0; \theta)e(l, a_0; \beta) =\\\ \theta_0 + \theta_2 a_1 + \theta_3 a_0 + \theta_1 \frac{\exp(\beta_0 + \beta_1 a_0)}{1 + \exp(\beta_0 + \beta_1 a_0)}$$` --- ## G-Null Paradox - Under the strict null `\(E[Y(a_0, a_1)]\)` does not depend on `\(a_0\)` and `\(a_1\)`. - We can see in our model that `\(\hat{E}[Y(a_0, a_1)]\)` always depends on `\(a_0\)` and `\(a_1\)` unless `\(\theta_2\)`, `\(\theta_3\)`, and `\(\beta_1\)` are all 0. + But this can't occur because we know `\(Y\)` is correlated with `\(A_1\)` and `\(A_0\)` - If we had access to `\(U\)` and could model it correctly, this wouldn't be a problem. - Essentially, our parametric model does not include the strict null. --- ## Beyond the Strict Null - It may be impossible to correctly specify the parametric G-formula even when the strict null does not hold. - For example, the problem occurs in the DAG below where there is an effect only of `\(A_1\)` on `\(Y\)` using the same model we've been using. - McGrath, Young, and Hernán (2022) show that non-negligible bias can occur in some non-null models *even* using flexible model specifications. - Previous results showed that in at least some settings, bias from the g-null paradox is negligible. <center>
</center> --- ## Beyond the Strict Null - The figure below shows simulation results from McGrath, Young, and Hernán (2022). - More flexible models reduce the amount of bias but bias can still be substantial. <center> <img src="img/9_gnull.png" width="85%" /> </center> --- ## G-Null Paradox - By contrast, consider marginal structural models fit with IP weighting. - Under the strict null, `\(E[Y(\bar{a})]\)` does not depend on `\(\bar{a}\)`. - Under the strict null, the marginal structural mean model will never be mis-specified (as long as it contains an intercept parameter). - So the IP weighting method does not suffer from the g-null paradox. --- ## Summary - For time-varying exposures with treatment-confounder feedback, we need to use the time-varying G-formula. - We saw two estimation strategies: IP weighting and G-computation (the parametric G-formula). - IP weighting for time-varying exposures differs from IP weighting for point exposures in the way the weights are calculated. + We can also use a parametric marginal structural model that combines information across time-points. - The parametric G-formula approach differs from G-computation for point exposures in that we need to simulate values for covariates measured after the first time-point. + This requires a density model for every time-varying covariate.