class: center, middle, inverse, title-slide

.title[
# L5: IP Weighting and G-Computation
]
.author[
### Jean Morrison
]
.institute[
### University of Michigan
]
.date[
### Lecture on 2023-01-30 (updated: 2024-02-07)
]

---
`\(\newcommand{\ci}{\perp\!\!\!\perp}\)`

## Lecture Plan

1. Intro
2. Inverse Probability Weighting
    - IP weighting for binary and continuous treatment
    - Marginal structural models
    - Censoring weights
3. G-Computation

---
# 1. Introduction

---
## Modeling

- So far we have learned two strategies for estimating `\(E[Y(a)]\)` from data when we need to adjust for confounding or selection bias.
- In IP weighting, we re-weight our observations by `\(1/P[A = A_i \vert L_i]\)`, the inverse of the probability that they received the treatment they actually received, given confounders `\(L_i\)`.
- In standardization, we first stratify the data by `\(L\)`, compute `\(E[Y \vert A = a, L]\)` within each stratum, and then take a weighted sum of these averages.

---
## Models

- Formally, a **model** is a class of probability distributions `\(\mathcal{M} = \lbrace P_\theta : \theta \in \Theta \rbrace\)`.
- `\(\theta\)` is a (possibly infinite) vector of parameters and `\(\Theta\)` is the parameter space.
- A model is called **parametric** when `\(\theta\)` is finite dimensional and `\(\mathcal{M}\)` is not the set of all possible probability distributions.
- If `\(\theta\)` is infinite dimensional, `\(\mathcal{M}\)` may be called **non-parametric** or **semi-parametric**.
  + Semi-parametric models have a finite-dimensional, structured component and an infinite-dimensional component.
- In this lecture we will talk about parametric models.

---
## Our Models So Far

- So far in examples, we have always used **saturated** models.
  + Saturated models contain all possible probability distributions for the observed data.
- We assume that `\(A\)`, `\(Y\)`, and any confounders have a discrete number of levels and that we have enough data to estimate quantities like `\(E[Y \vert A, L]\)` in each stratum of `\(A\)` and `\(L\)`.
- This has allowed us to focus on **identification**, i.e. whether we could estimate a parameter with infinite data.
- However, in the real world, we don't get infinite data, and it is impractical to assume we can always estimate `\(E[Y \vert A, L]\)` in every stratum of `\(A\)` and `\(L\)`.
- We will need **modeling** to go further.

---
## G-Formula

- The formula we saw previously as the "stratification" formula is also called the g-formula
$$
E[Y(a)] = \sum_l E[Y \vert A = a, L = l]P[L = l]
$$
- We showed in L1 that an alternate expression of the g-formula is
$$
E[Y(a)] = E\left[ \frac{I(A = a) Y}{f(A \vert L)}\right]
$$
- We can use either of these expressions to obtain parametric estimates of `\(E[Y(a)]\)`.
- In G-computation (also called standardization), we estimate `\(E[Y \vert A, L]\)` and plug it into the first expression.
- In IP weighting, we estimate `\(f(A \vert L)\)` (the propensity score) and use the second expression.
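- To make the equivalence concrete, below is a toy simulation (not part of the original lecture; all names and numbers are made up for illustration) in which both expressions of the g-formula recover `\(E[Y(1)]\)`.

```r
# Toy illustration: with discrete A and L and known f(A | L),
# the two forms of the g-formula agree.
set.seed(42)
n <- 1e5
L <- rbinom(n, 1, 0.4)
A <- rbinom(n, 1, ifelse(L == 1, 0.7, 0.3))
Y <- rnorm(n, mean = 2 * A + L)

# Standardization form: sum_l E[Y | A = 1, L = l] P[L = l]
EY1_std <- sum(sapply(0:1, function(l) mean(Y[A == 1 & L == l]) * mean(L == l)))

# IP weighting form: E[ I(A = 1) Y / f(A | L) ]
fA1_L <- ifelse(L == 1, 0.7, 0.3)  # P[A = 1 | L], known in this simulation
EY1_ipw <- mean((A == 1) * Y / fA1_L)

c(EY1_std, EY1_ipw)  # both are close to the true E[Y(1)] = 2 + E[L] = 2.4
```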
---
# 2. Inverse Probability Weighting

---
## NHEFS Data

- Cigarette smokers aged 25-74 were recruited around 1971 and given a survey.
- Ten years later, participants were given a follow-up survey.
- We are interested in estimating the effect of quitting smoking on weight change on the additive scale.

---
## Variables

- Our exposure is binary (whether or not a person quit smoking between the first and second survey).
- The outcome is the change in weight in kg.
- What other features would you want to know in order to estimate the causal effect?

--

- For now we will assume that the following variables are sufficient:
  + Sex (0: male, 1: female)
  + Age (in years)
  + Race (0: white, 1: other)
  + Education (5 categories)
  + Intensity of smoking (cigarettes per day)
  + Duration of smoking (years)
  + Physical activity in daily life (3 categories)
  + Recreational exercise (3 categories)
  + Weight (kg)

---
## Limitations

- For now we will focus on individuals with complete data.
  + There are some people who did not fill in the second survey.
  + For these people we know covariates but don't know either the exposure or the outcome.
  + Additionally, there are 63 people who did fill in the second survey but whose weight is unknown.
  + More on this later.
- Individuals may have quit at different times.
  + We could think of `\(A\)` as a time-varying treatment (coming up later).

---
## Simple Analysis

- The simplest analysis is to compare the mean weight change between quitters and non-quitters.

<table> <thead> <tr> <th style="text-align:right;"> Quit Smoking </th> <th style="text-align:right;"> Average Weight Change (kg) </th> <th style="text-align:right;"> N </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1.98 </td> <td style="text-align:right;"> 1163 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4.53 </td> <td style="text-align:right;"> 403 </td> </tr> </tbody> </table>

---
## Quitters v Non-Quitters

<table> <thead> <tr> <th style="text-align:left;"> Variable </th> <th style="text-align:left;"> Did Not Quit Smoking </th> <th style="text-align:left;"> Quit Smoking </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Total </td> <td style="text-align:left;"> 1163 </td> <td style="text-align:left;"> 403 </td> </tr> <tr> <td style="text-align:left;"> Age, years </td> <td style="text-align:left;"> 42.8 (11.8) </td> <td style="text-align:left;"> 46.2 (12.2) </td> </tr> <tr> <td style="text-align:left;"> Female </td> <td style="text-align:left;"> 53.4% (621) </td> <td style="text-align:left;"> 45.4% (183) </td> </tr> <tr> <td style="text-align:left;"> Non-White </td> <td style="text-align:left;"> 14.6% (170) </td> <td style="text-align:left;"> 8.9% (36) </td> </tr> <tr> <td style="text-align:left;"> Cigarettes/day </td> <td style="text-align:left;"> 21.2 (11.5) </td> <td style="text-align:left;"> 18.6 (12.4) </td> </tr> <tr> <td style="text-align:left;"> Years smoking </td> <td style="text-align:left;"> 24.1 (11.7) </td> <td style="text-align:left;"> 26 (12.7) </td> </tr> <tr> <td style="text-align:left;"> Weight, kg </td> <td style="text-align:left;"> 70.3 (15.2) </td> <td style="text-align:left;"> 72.4 (15.6) </td> </tr> <tr> <td style="text-align:left;"> College </td> <td style="text-align:left;"> 9.9% (115) </td> <td style="text-align:left;"> 15.4% (62) </td> </tr> <tr> <td style="text-align:left;"> Little exercise </td> <td style="text-align:left;"> 37.9% (441) </td> <td style="text-align:left;"> 40.7% (164) </td> </tr> <tr> <td style="text-align:left;"> Inactive life </td> <td style="text-align:left;"> 8.9% (104) </td> <td style="text-align:left;"> 11.2% (45) </td> </tr> </tbody> </table>

- Entries are mean (SD) for continuous variables and % (n) for categorical variables.

---
## Estimating Weights

- In our data, `\(L\)` is a vector of nine measurements.
- We cannot compute `\(P[A = 1 \vert L ]\)` within every (or any) stratum of `\(L\)` because every participant has their own unique value of `\(L\)`.
- We have to fit a parametric model.
- What would you do?
---
## Weights Estimated by Logistic Regression

- We will use a logistic regression model to predict `\(P[A \vert L]\)`.
- The goal is to get the best possible estimate of the probability of treatment given `\(L\)`, so we should fit a flexible model.
- Once we fit the model, we can estimate the probability of quitting for each person as `\(\pi_i = P[A = 1 \vert L = L_i]\)`. These are the probability-scale fitted values from the logistic model.
- The weight for individual `\(i\)` will be equal to `\(\frac{1}{\pi_i}\)` if `\(A_i = 1\)` (the person actually did quit) or `\(\frac{1}{1-\pi_i}\)` if `\(A_i = 0\)` (the person did not quit).

---
## Weights Estimated by Logistic Regression

```r
fit <- glm(qsmk ~ sex + race + age + I(age ^ 2) +
             as.factor(education) + smokeintensity + I(smokeintensity ^ 2) +
             smokeyrs + I(smokeyrs ^ 2) + as.factor(exercise) +
             as.factor(active) + wt71 + I(wt71 ^ 2),
           family = binomial(), data = dat)

dat$w <- ifelse(dat$qsmk == 0,
                1 / (1 - predict(fit, type = "response")),
                1 / (predict(fit, type = "response")))
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.054   1.230   1.373   1.996   1.990  16.700
```

- Why is the mean value close to 2?

---
## Positivity

- *Structural violations* of positivity occur when it is impossible for people with some levels of confounders to receive a particular level of treatment.
  + If we have structural violations, we cannot use IP weighting or standardization.
  + We need to restrict our inference to relevant strata of `\(L\)`.
- *Random violations* of positivity occur when certain combinations of `\(L\)` and `\(A\)` are missing from our data by chance.
  + We are using a parametric model, so we are able to smooth over unobserved covariate values.
  + We are able to predict for strata that were not observed in our data.
  + We should be careful about predicting outside the range of the observed data.

---
## Assessing Positivity in Propensity Scores

- If positivity holds, propensity scores should be bounded away from 0 and 1:
  + Scores very close to 0 or very close to 1 suggest that there are some strata of `\(L\)` that have essentially no chance of receiving one of the two treatments.
  + We might get scores close to 0 or 1 if there is perfect separation by one or a combination of confounders.
  + Scores close to 0 or 1 suggest there could be structural positivity violations.
- Propensity scores should have approximately the same range in both groups.
  + Non-overlapping ranges indicate that there are some regions of confounder space with only cases or only controls in our study.
  + If we believe the PS model, we can trust the weights and assume this is due to random positivity violations.
  + In some approaches based on propensity scores (matching and subclassification), this will cause a problem.

---
## Propensity Scores and Positivity

<img src="5_modeling1_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---
## Horvitz-Thompson Estimator

- The *Horvitz-Thompson estimator* is one of two options for estimating `\(E[Y(a)]\)` using our estimated weights.
- The Horvitz-Thompson estimator just plugs our estimated weights into our previous formula.
$$
\hat{E}[Y(a)] = \hat{E}\left[\frac{I(A = a)Y}{f(A\vert L)} \right] = \frac{1}{n}\sum_{i = 1}^n I(A_i = a)Y_i W_i
$$
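- As a concrete illustration of the formula above, here is a minimal sketch (not part of the original analysis) that computes the Horvitz-Thompson estimates directly, assuming the complete-case data `dat` and the weights `w` from the earlier chunk.

```r
# Horvitz-Thompson estimates of E[Y(0)], E[Y(1)], and their difference,
# computed by hand from the estimated IP weights.
n <- nrow(dat)
EY1_ht <- sum((dat$qsmk == 1) * dat$wt82_71 * dat$w) / n
EY0_ht <- sum((dat$qsmk == 0) * dat$wt82_71 * dat$w) / n
c(EY0 = EY0_ht, EY1 = EY1_ht, ACE = EY1_ht - EY0_ht)
```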
---
## Hajek Estimator

- The *Hajek estimator* fits the linear model
$$
E[Y \vert A] = \theta_0 + \theta_1 A
$$
by weighted least squares, weighting individuals by our estimated IP weights.
- It is equivalent to
`$$\frac{ \hat{E}\left[\frac{I(A = a)Y}{f(A\vert L)} \right]}{ \hat{E}\left[\frac{I(A = a)}{f(A\vert L)} \right]} = \frac{\sum_{i = 1}^n I(A_i = a)Y_i W_i}{\sum_{i = 1}^n I(A_i = a) W_i}$$`

---
## Hajek Estimator

- The Hajek estimator is unbiased for
`$$\frac{ E\left[\frac{I(A = a)Y}{f(A\vert L)} \right]}{ E\left[\frac{I(A = a)}{f(A\vert L)} \right]}$$`
- If positivity holds, then
$$
E\left[\frac{I(A = a)}{f(A\vert L)} \right] = 1
$$
- The Hajek estimator is guaranteed to be between 0 and 1 for dichotomous outcomes. The Horvitz-Thompson estimator is not.

---
## IP Weighted Effect Estimate in NHEFS

- This code computes the Hajek estimator:

```r
f_w_lm <- lm(wt82_71 ~ qsmk, data = dat, weights = w)
f_w_lm
```

```
## 
## Call:
## lm(formula = wt82_71 ~ qsmk, data = dat, weights = w)
## 
## Coefficients:
## (Intercept)         qsmk  
##       1.780        3.441
```

- We estimate that if nobody had quit smoking, the average weight gain would have been 1.8 kg.
- If everyone had quit smoking, the average weight gain would have been 5.2 kg.
- The average causal effect of quitting smoking on weight gain is 3.4 kg.

---
## Estimating the Variance of the Estimate

- The raw variance estimate from weighted least squares does not account for uncertainty in the weights.
- We need to account for having estimated the IP weights.
- Options:
  1. Derive the variance analytically.
  2. Bootstrap the variance.
  3. Use a robust sandwich variance estimate.
- Option 3 is conservative but easier than options 1 and 2.
- Option 3 can be achieved by fitting with GEE using an independent working correlation matrix, or by using the `HC0` option in `vcovHC`.

---
## Variance Estimate in the NHEFS Data

```r
library("geepack")
f_w_gee <- geeglm(wt82_71 ~ qsmk, data = dat, weights = w,
                  id = seqn, corstr = "independence")
beta <- coef(f_w_gee)
SE <- coef(summary(f_w_gee))[, 2]
lcl <- beta - qnorm(0.975) * SE
ucl <- beta + qnorm(0.975) * SE
cbind(beta, lcl, ucl)
```

```
##                 beta      lcl      ucl
## (Intercept) 1.779978 1.339514 2.220442
## qsmk        3.440535 2.410587 4.470484
```

```r
library(sandwich)
# Equivalent alternative using sandwich
v_wlm_robust <- vcovHC(f_w_lm, type = "HC0")
all.equal(v_wlm_robust, vcov(f_w_gee))
```

```
## [1] TRUE
```

---
## Stabilized Weights

- Using weights `\(W^{A} = 1/f(A \vert L)\)`, we create a pseudo-population in which (heuristically) each person is matched by someone exactly like them who received the opposite treatment.
- Our pseudo-population is twice as big as our actual sample, so the mean of `\(W^A\)` is 2.
- In the pseudo-population, the frequency of treatment `\(A = 1\)` is 50%.
- We could have created other pseudo-populations.

---
## Stabilized Weights

- Our requirement is that, in the pseudo-population, the probability of treatment is independent of `\(L\)`.
- But different people could have different probabilities of treatment.
- To create stabilized weights, we construct a pseudo-population in which the probability of receiving each treatment is the same as the frequency of the treatment in our original sample.
$$
SW^A = \frac{f(A)}{f(A \vert L)}
$$

---
## Stabilized Weights

- In our data, `\(A\)` is binary, so we can compute `\(f(1)\)` as the proportion of quitters in the data.

```r
p1 <- mean(dat$qsmk); p1
```

```
## [1] 0.2573436
```

- In the pseudo-population created by the stabilized weights, each person in our data set corresponds to 26% of a treated person and 74% of an untreated person.

```r
dat <- dat %>%
  mutate(pa = ifelse(qsmk == 0, 1 - p1, p1),
         sw = pa * w)
summary(dat$sw)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3312  0.8665  0.9503  0.9988  1.0793  4.2977
```
- The expected value of `\(SW^A\)` is 1 because the pseudo-population is the same size as the observed data.

---
## Estimation Using the Stabilized Weights

```r
f_sw_gee <- geeglm(wt82_71 ~ qsmk, data = dat, weights = sw,
                   id = seqn, corstr = "independence")
beta <- coef(f_sw_gee)
SE <- coef(summary(f_sw_gee))[, 2]
lcl <- beta - qnorm(0.975) * SE
ucl <- beta + qnorm(0.975) * SE
cbind(beta, lcl, ucl)
```

```
##                 beta      lcl      ucl
## (Intercept) 1.779978 1.339514 2.220442
## qsmk        3.440535 2.410587 4.470484
```

- These are exactly the results we saw with the unstabilized weights.

---
## Why Use Stabilized Weights

- Differences between stabilized and non-stabilized weights only occur when the model for `\(E[Y \vert A]\)` is not saturated.
- When the model is not saturated, stabilized weights typically result in greater efficiency.

---
## Marginal Structural Models

- In the IP weighting strategy, we create a pseudo-population in which `\(A\)` is independent of `\(L\)` and then fit the model
`$$E_{ps}[Y \vert A ] = \theta_0 + \theta_1 A$$`
- If we believe our conditional exchangeability assumption `\(Y(a) \ci A \vert L\)` and the propensity score model, then in the pseudo-population, `\(E_{ps}[Y \vert A = a] = E[Y(a)]\)`.
- So the parameter `\(\theta_1\)` is interpretable as the average causal effect `\(E[Y(1)] - E[Y(0)]\)`.

--

- We have proposed a model for the average counterfactual:
`$$E[Y(a)] = \beta_0 + \beta_1 a$$`

--

- This model is *marginal* because we have marginalized (averaged) over the values of all other covariates.
- It is structural because it is a model for a counterfactual `\(E[Y(a)]\)`.

---
## Modeling Continuous Treatments

- With a binary treatment, we could construct a saturated model.
- For continuous (or highly polytomous) treatments, we can't do that and will have to use a parametric model instead.
- In the NHEFS data, let `\(A\)` be the change in smoking intensity ( `\(A = 10\)` indicates that a person increased their smoking by 10 cigarettes per day).
- We will limit to only those who smoked 25 or fewer cigarettes per day at baseline ( `\(N = 1,162\)` ).
- We can propose a model for our *dose-response curve*
$$
E[Y(a)] = \beta_0 + \beta_1 a + \beta_2 a^2
$$

---
## Estimating Weights for the Continuous Treatment

- To use stabilized weights, we need to estimate `\(f(A \vert L)\)` and `\(f(A)\)`.
- Both of these are probability density functions, which are hard to estimate!
- We will assume that both `\(f(A \vert L)\)` and `\(f(A)\)` are normal with constant variance across participants.
  + These assumptions are almost certainly wrong.
- We will estimate the mean of `\(f(A \vert L)\)` via linear regression.
- IP weighted estimates for continuous treatments can be very sensitive to the choice of weights.

---
## Estimating Weights for the Continuous Treatment

- We fit a linear model for change in smoking intensity including:
  + sex, race, education, exercise, life activity
  + age and age squared
  + baseline smoking intensity and baseline smoking intensity squared
  + years smoked and years smoked squared
  + baseline weight and baseline weight squared
- The fitted values of this model give us `\(E[A \vert L]\)`.
- We can use the variance of the residuals as an estimate of the variance of `\(A \vert L\)`.
- We then compute `\(f(A \vert L)\)` using `dnorm`.
---
## Estimating Weights for the Continuous Treatment

```r
dat.s <- filter(dat, smokeintensity <= 25)

fal_fit <- lm(smkintensity82_71 ~ as.factor(sex) + as.factor(race) +
                age + I(age ^ 2) + as.factor(education) +
                smokeintensity + I(smokeintensity ^ 2) +
                smokeyrs + I(smokeyrs ^ 2) +
                as.factor(exercise) + as.factor(active) +
                wt71 + I(wt71 ^ 2),
              data = dat.s)
fal_mean <- predict(fal_fit, type = "response")
fal <- dnorm(dat.s$smkintensity82_71, fal_mean, summary(fal_fit)$sigma)
```

---
## Estimating Weights for the Continuous Treatment

- To estimate `\(f(A)\)`, we use the same procedure, except that the estimate of `\(E[A]\)` is just the average change in smoking intensity.

```r
fa_fit <- lm(smkintensity82_71 ~ 1, data = dat.s)
fa_mean <- predict(fa_fit, type = "response")
fa <- dnorm(dat.s$smkintensity82_71, fa_mean, summary(fa_fit)$sigma)

dat.s$sw_cont <- fa / fal
dat.s$w_cont <- 1 / fal
```

---
## Estimating Weights for the Continuous Treatment

- For comparison, we compute both the stabilized weights and the non-stabilized weights.

```r
summary(dat.s$sw_cont)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1938  0.8872  0.9710  0.9968  1.0545  5.1023
```

```r
summary(dat.s$w_cont)
```

```
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     25.4     26.4     30.0   1078.1     47.5 428611.4
```

- The non-stabilized weights are much more variable!

---
## Fitting the Continuous Treatment Model

- We then fit the marginal structural model
$$
E[Y(a)] = \beta_0 + \beta_1 a + \beta_2 a^2
$$
using exactly the same strategy we used before (a sketch of this fit is shown after the table).

<table> <thead> <tr> <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Stabilized Weights</div></th> <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Non-Stabilized Weights</div></th> </tr> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> Estimate </th> <th style="text-align:left;"> 95% CI </th> <th style="text-align:left;"> </th> <th style="text-align:left;"> Estimate </th> <th style="text-align:left;"> 95% CI </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Intercept </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> (1.4, 2.6) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 7.7 </td> <td style="text-align:left;"> (1, 14) </td> </tr> <tr> <td style="text-align:left;"> Smk Int </td> <td style="text-align:left;"> -0.11 </td> <td style="text-align:left;"> (-0.17, -0.047) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 0.1 </td> <td style="text-align:left;"> (-0.21, 0.42) </td> </tr> <tr> <td style="text-align:left;"> Smk Int Squared </td> <td style="text-align:left;"> 0.0027 </td> <td style="text-align:left;"> (-0.002, 0.0074) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> -0.01 </td> <td style="text-align:left;"> (-0.024, 0.0039) </td> </tr> </tbody> </table>
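- For reference, here is a minimal sketch (not shown in the original slides) of how the stabilized-weight column above could be fit, reusing `dat.s`, `sw_cont`, and `seqn` from the earlier chunks; swapping in `w_cont` would give the non-stabilized fit.

```r
# Weighted quadratic MSM fit, using the same GEE / robust-SE approach as the
# binary-treatment analysis (assumes geepack is loaded and dat.s, sw_cont,
# and seqn exist as defined above).
msm_cont <- geeglm(wt82_71 ~ smkintensity82_71 + I(smkintensity82_71^2),
                   data = dat.s, weights = sw_cont,
                   id = seqn, corstr = "independence")
beta <- coef(msm_cont)
SE <- coef(summary(msm_cont))[, 2]
cbind(beta, lcl = beta - qnorm(0.975) * SE, ucl = beta + qnorm(0.975) * SE)
```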
---
## Continuous Model Results

<img src="5_modeling1_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---
## Issues with the Continuous Treatment Model

- We had to make strong assumptions about the distribution of `\(A \vert L\)`.
- There are structural positivity issues -- it is not possible for everyone to reduce their smoking by 80 cigarettes per day.

---
## Effect Modification in Marginal Structural Models

- We might be interested in estimating the causal effect within strata of `\(V\)`.
- We can propose a marginal structural model that includes `\(V\)`
$$
E[Y(a) \vert V] = \beta_0 + \beta_1 a + \beta_2 a V + \beta_3 V
$$
- If `\(V\)` and `\(A\)` are both dichotomous, this model is saturated.
- It is not really fully marginal any more, but we still call it a marginal structural model.

---
## Interpreting Parameters in the Effect Modification Model

- Suppose that `\(V\)` is sex (0: Male, 1: Female) and `\(A\)` is quitting/not quitting.
$$
E[Y(a) \vert V] = \beta_0 + \beta_1 a + \beta_2 a V + \beta_3 V
$$
- `\(\beta_1\)` is the causal effect of quitting for men.
- `\(\beta_1 + \beta_2\)` is the causal effect of quitting for women.
- What is `\(\beta_3\)`?

--

- `\(\beta_3\)` is not a causal parameter because we haven't made any exchangeability assumptions about `\(V\)`.
- `\(\beta_3\)` is the difference between `\(E[Y(0)]\)` in females and `\(E[Y(0)]\)` in males, but it is not the causal effect of sex.

---
## Stabilized Weights for the EM Model

- We have two choices for stabilized weights
$$
\frac{f(A)}{f(A \vert L)} \qquad \text{or} \qquad \frac{f(A\vert V)}{f(A \vert L)}
$$
- Which will be more efficient?

--

- Conditioning on `\(V\)` in the numerator will make the numerator closer to the denominator.
- The second choice of weights will be less variable.
- So the second choice of weights is more efficient.

---
## Stabilized Weights for the EM Model

- We need to compute `\(f(A \vert V)\)` in our data.
- We can just use the stratified means.

```r
fav_fit <- lm(qsmk ~ sex, data = dat)
dat$pav <- ifelse(dat$qsmk == 0,
                  1 - predict(fav_fit, type = "response"),
                  predict(fav_fit, type = "response"))
dat <- mutate(dat, sw_em = pav * w)
summary(dat$sw_em)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2930  0.8751  0.9554  0.9989  1.0803  3.8011
```

---
## Fitting the Effect Modification Model

```r
msm_em <- geeglm(wt82_71 ~ qsmk * sex, data = dat,
                 weights = sw_em, id = seqn,
                 corstr = "independence")
beta <- coef(msm_em)
SE <- coef(summary(msm_em))[, 2]
lcl <- beta - qnorm(0.975) * SE
ucl <- beta + qnorm(0.975) * SE
cbind(beta, lcl, ucl)
```

```
##                     beta        lcl       ucl
## (Intercept)  1.784446876  1.1771686 2.3917252
## qsmk         3.521977634  2.2341453 4.8098100
## sex         -0.008724784 -0.8883884 0.8709389
## qsmk:sex    -0.159478525 -2.2097583 1.8908013
```

- We have no strong evidence of effect modification by sex.

---
## Unweighted Regression vs IP Weighting

- What if we had conditioned on every variable in `\(L\)`?
- The stabilized weights would be 1, so we would just be running a regular regression.
- If we believe in the model for `\(E[Y(a) \vert L]\)`, then the regression coefficient is interpretable as a (conditional) causal parameter.
- Using the marginal model with IP weighting, we need to believe that our estimates of `\(f(A \vert L)\)` are accurate.
- In some cases, solving a prediction problem is easier than guessing the correct structural model for all confounders.

---
## Censoring and Missing Data With IP Weights

- In our analysis, we have removed 63 individuals who filled in the survey in 1982 but who have a missing value for weight.
<table> <thead> <tr> <th style="text-align:right;"> Quit Smoking </th> <th style="text-align:right;"> Number Missing </th> <th style="text-align:right;"> Percent Missing </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 38 </td> <td style="text-align:right;"> 3.2 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 25 </td> <td style="text-align:right;"> 5.8 </td> </tr> </tbody> </table>

- We have more missing data for people who quit smoking than for people who did not, so we could have selection bias.

---
## Weights for Censoring

- We learned in L4 that, to correct for selection bias, we need to find a set of variables `\(L_c\)` such that `\(Y(C = 0) \ci C \vert L_c\)`.
- We then need to estimate `\(W^C = 1/P[C = 0 \vert A, L_c]\)`, or the stabilized version
`$$SW^C = \frac{P[C = 0 \vert A]}{P[C = 0 \vert A, L_c]}$$`
- We will use the same set of variables we used to compute the weights for confounding.
- The total weight will be `\(SW^C \cdot SW^A\)`.

---
## Weights for Censoring

- We will estimate `\(f(C=0 \vert A, L)\)` using a logistic model adjusting for
  + the exposure
  + sex, race, age, education, exercise, life activity
  + linear and quadratic terms for smoking intensity, duration, and baseline weight.
- The `dat.u` data frame includes data for all participants.

```r
dat.u$cens <- ifelse(is.na(dat.u$wt82), 1, 0)

fcal_fit <- glm(cens ~ as.factor(qsmk) + as.factor(sex) +
                  as.factor(race) + age + I(age ^ 2) +
                  as.factor(education) + smokeintensity +
                  I(smokeintensity ^ 2) + smokeyrs + I(smokeyrs ^ 2) +
                  as.factor(exercise) + as.factor(active) +
                  wt71 + I(wt71 ^ 2),
                family = binomial(), data = dat.u)
fc0al <- 1 - predict(fcal_fit, type = "response")
```

---
## Weights for Censoring

- We can estimate the numerator of the stabilized weight, `\(f(C = 0 \vert A)\)`, using the average within levels of `\(A\)`.

```r
fca_fit <- glm(cens ~ as.factor(qsmk), family = binomial(), data = dat.u)
fc0a <- 1 - predict(fca_fit, type = "response")
sw.c <- fc0a / fc0al
dat$sw.c <- sw.c[!is.na(dat.u$wt82)]
dat$sw.comb <- dat$sw * dat$sw.c
```

```r
summary(dat$sw.c)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9442  0.9782  0.9858  0.9990  1.0035  1.7180
```

```r
summary(dat$sw.comb)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3390  0.8593  0.9442  0.9974  1.0749  4.1829
```
---
## Combined Weights Results

- Below are the results using the combined weights for both censoring and confounding.

<table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> Estimate </th> <th style="text-align:left;"> 95% CI </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Intercept </td> <td style="text-align:left;"> 1.69 </td> <td style="text-align:left;"> (1.23, 2.14) </td> </tr> <tr> <td style="text-align:left;"> Quit Smoking </td> <td style="text-align:left;"> 3.4 </td> <td style="text-align:left;"> (2.36, 4.45) </td> </tr> </tbody> </table>

- Compared with the previous results:

<table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> Estimate </th> <th style="text-align:left;"> 95% CI </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Intercept </td> <td style="text-align:left;"> 1.78 </td> <td style="text-align:left;"> (1.34, 2.22) </td> </tr> <tr> <td style="text-align:left;"> Quit Smoking </td> <td style="text-align:left;"> 3.44 </td> <td style="text-align:left;"> (2.41, 4.47) </td> </tr> </tbody> </table>

---
## Survey Weights

- Typically, nationally representative surveys won't use a completely random sampling scheme.
  + They will try to over-sample smaller groups.
  + This allows them to get better estimates for less common covariate values.
- Over-sampling introduces selection bias that needs to be corrected.
- If the survey was done well, a specific over-sampling rule was used.
  + For example, you might select census tracts randomly according to a covariate-informed probability and then select houses randomly within each census tract.
- This allows the survey administrators to compute `\(P[C = 0 \vert L]\)` for every participant.

---
## Survey Weights

- If `\(L\)` includes every factor affecting study inclusion, then `\(P[C = 0 \vert A, L] = P[C = 0 \vert L]\)` for any exposure `\(A\)`.
- So we could use the same set of weights regardless of what exposure we are looking at.
- Usually, survey administrators will compute `\(P[ C = 0 \vert L]\)` and distribute `\(1/P[C = 0 \vert L]\)` as "survey weights".
  + Survey weights are censoring weights!
- If you have survey data that come with survey weights, be sure to read about how these are calculated.
  + Most likely, if you want to estimate a parameter for the larger population, you will need to use the survey weights.

---
# 3. G-Computation (Standardization)

---
## Standardization

- In standardization, we compute an estimate of `\(E[Y \vert A = a, L = l]\)` and plug it into the g-formula:
$$
E[Y(a)] = \sum_l E[Y \vert A = a, L = l]P[L = l]
$$
- If `\(L\)` is continuous, we need to integrate
$$
E[Y(a)] = \int_l E[Y \vert A = a, L = l]f(l) dl
$$

---
## Outcome Modeling

- In the standardization formula, we need to estimate `\(E[Y \vert A = a, L = l]\)`.
- To do this we fit a linear regression.

```r
f_y <- glm(wt82_71 ~ qsmk + sex + race + age + I(age * age) +
             as.factor(education) + smokeintensity +
             I(smokeintensity * smokeintensity) + smokeyrs +
             I(smokeyrs * smokeyrs) + as.factor(exercise) +
             as.factor(active) + wt71 + I(wt71 * wt71) +
             qsmk * smokeintensity,
           data = dat.u)
summary(f_y)$coefficients %>% head()
```
---
## Outcome Modeling

- This first step of fitting the outcome model is just fitting a regular linear regression.
- Like we said when we talked about MSMs, if we believe the model, then we can use the coefficients to calculate a *conditional* causal effect of treatment within strata of `\(L\)`.
- If there were no interactions between `\(A\)` and `\(L\)` in the model, the coefficient on `\(A\)` would give the marginal treatment effect.
  + This is true using a linear model but would not be true in a model with a non-linear link.
- An outcome model with no interactions assumes **homogeneity**, i.e. the average treatment effect is the same in every level of `\(L\)`.
  + Typically, homogeneity is considered to be a very strong assumption.

---
## Outcome Modeling

- Our outcome model included an interaction between `\(A\)` and smoking intensity.
- In this model, the coefficient on `\(A\)` is the average treatment effect within strata of smoking intensity.
- To estimate the average (marginal) treatment effect, we need to average over the distribution of smoking intensity, or more generally the distribution of `\(L\)`.

---
## Standardizing (G-Computation)

- Fortunately, we do not need to compute `\(f(L)\)`.
- Using the iterated expectation formula, we can write
$$
\int_l E[Y \vert A = a, L = l]f(l)dl = E\left[ E\left[Y \vert A = a, L \right]\right]
$$
- So we can estimate the standardized mean as
$$
\frac{1}{n}\sum_{i=1}^n\hat{E}[Y \vert A = a, L_i]
$$

---
## Standardizing

- For each person, we need to compute two values, `\(\hat{E}[Y \vert A = 0, L_i]\)` and `\(\hat{E}[Y\vert A = 1, L_i]\)`.
- These are predicted values that we can get out of our previous regression.
- We predict each person's outcome with `\(A = 0\)` and with `\(A = 1\)`.

```r
# Make two copies of the data
dat0 <- dat1 <- dat.u
# In one copy everyone got treatment 0
dat0$qsmk <- 0
# In the second copy everyone got treatment 1
dat1$qsmk <- 1

Y0 <- predict(f_y, newdata = dat0, type = "response") %>% mean()
Y1 <- predict(f_y, newdata = dat1, type = "response") %>% mean()
cat(round(Y1, digits = 2), "-", round(Y0, digits = 2), "=",
    round(Y1 - Y0, digits = 2))
```

```
## 5.18 - 1.66 = 3.52
```

---
## Censoring

- In our case, we have assumed that our set of confounders is also sufficient to adjust for censoring.
- In the previous step, we predicted weight gain for all subjects, even those who had missing values in the original data.
- If our outcome model is correct *and* `\(Y(C = 0) \ci C \vert L\)`, then this is sufficient to eliminate selection bias if `\(L\)` is observed without censoring.
- If not, we may need to add additional variables specifically to deal with censoring.
- Recall that the s-backdoor criterion gives us a way to select variables to condition on.

---
## Estimating the Variance

- To compute the variance of the estimate, we need to bootstrap.
- We will repeatedly re-sample our data with replacement.
- Each time, we recompute the standardized effect we just computed in the observed data.
- The standard deviation of our bootstrap samples consistently estimates the standard error of our estimate.
---
## Bootstrapping the Variance

- Step 1: Write a function to compute the standardized mean.

```r
# Step 1: Write a function to compute the standardized mean
std_mean_func <- function(d){
  boot_fit <- glm(wt82_71 ~ qsmk + sex + race + age + I(age * age) +
                    as.factor(education) + smokeintensity +
                    I(smokeintensity * smokeintensity) + smokeyrs +
                    I(smokeyrs * smokeyrs) + as.factor(exercise) +
                    as.factor(active) + wt71 + I(wt71 * wt71) +
                    qsmk * smokeintensity,
                  data = d)
  d0 <- d1 <- d
  d0$qsmk <- 0
  d1$qsmk <- 1
  Y0 <- predict(boot_fit, newdata = d0, type = "response") %>% mean()
  Y1 <- predict(boot_fit, newdata = d1, type = "response") %>% mean()
  return(Y1 - Y0)
}
```

---
## Bootstrapping the Variance

- Step 2: Repeatedly re-sample the data and compute the standardized mean in each re-sampled data set.

```r
set.seed(1)
samples <- replicate(n = 500,
                     expr = sample(seq(nrow(dat.u)), size = nrow(dat.u),
                                   replace = TRUE))
res <- apply(samples, 2, FUN = function(ix){
  boot_dat <- dat.u[ix,]
  boot_est <- std_mean_func(boot_dat)
  return(boot_est)
})
se <- sd(res)
eff <- Y1 - Y0
ci <- eff + c(-1, 1) * qnorm(0.975) * se
cat(round(eff, digits = 2), "(", round(ci, digits = 2), ")")
```

```
## 3.52 ( 2.58 4.46 )
```

---
## Effect Modification in G-Computation

- If we want to estimate the average causal effect stratified by `\(V\)`, we can simply stratify by `\(V\)` before averaging our predicted values.

```r
dat.u$Y0 <- predict(f_y, newdata = dat0, type = "response")
dat.u$Y1 <- predict(f_y, newdata = dat1, type = "response")

dat.u %>%
  group_by(sex) %>%
  summarize(Y0 = mean(Y0), Y1 = mean(Y1))
```

---
## IP Weighting vs G-Computation

- When we use non-parametric (saturated) models, IP weighting and standardization are the same.
- In parametric models, they rely on different assumptions and so are not the same.
- In the IP weighted analysis, we need to trust our model for `\(f(A \vert L)\)`.
- In the standardized analysis, we need to trust our model for `\(E[Y \vert A, L]\)`.
- In either case, if our parametric models are wrong, our predictions will be biased and so our estimators will be biased.
- It may be wise to compute both and compare (see the sketch on the final slide).

---
## IP Weighting vs G-Computation

- Chatton et al (2020) present simulation results comparing IPW, G-computation, and two methods we haven't talked about yet.
- In their simulation scenario, G-computation always had a lower MSE than IPW.
- For IPW, they calculated the variance using robust SEs (as we did earlier) and found slight over-coverage (~97% for a 95% target).
- They used parametric simulation to compute the variance of the G-computation estimate. This gave close to target coverage but slightly under in some cases (~94%).

---
## Chatton et al Results

<img src="5_modeling1_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />

---
## IP Weighting vs G-Computation

- Snowden et al (2010) describe G-computation as rarely used in epidemiology due to lack of awareness.
- However, G-computation can be more precise than IP weighting in some circumstances.
  + IP weighting can be inefficient when weights are very large.
- Vansteelandt and Keiding note that IPW performs better when the covariate distribution is very different in cases and controls.
  + Depending on the model used for G-computation, the dominant group may have too much influence over the estimate in these regions.
  + G-computation may under-estimate the uncertainty about the counterfactuals in these regions.
- Some people suggest that we shouldn't be trying to estimate the counterfactuals in these regions at all, since in order to estimate them, we need to extrapolate into regions with a low density of data points.
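- Following the earlier suggestion to compute both and compare, here is a minimal sketch (not part of the original slides) that puts the two point estimates side by side, assuming the objects created earlier in the lecture: `dat` with the stabilized weights `sw`, and the outcome model `f_y` with the copies `dat0` and `dat1`.

```r
# IP weighting (Hajek: weighted least squares with stabilized weights)
ipw_est <- coef(lm(wt82_71 ~ qsmk, data = dat, weights = sw))["qsmk"]

# G-computation (difference in standardized predicted outcomes)
gcomp_est <- mean(predict(f_y, newdata = dat1, type = "response")) -
  mean(predict(f_y, newdata = dat0, type = "response"))

round(c(IPW = unname(ipw_est), G_comp = gcomp_est), 2)
```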