(updated: 2025-02-24)
\newcommand{\ci}{\perp\!\!\!\perp}
Lecture materials from HR Chapter 14 and from Vansteelandt and Joffe (2014).
This is probably the least familiar estimation method we will see for non-time-varying treatments.
G-estimation is fairly uncommon in the applied literature, but it has several advantages:
G-Estimation is also useful for time-varying data, which we will see later.
Recall IP weighting: we propose a marginal structural model (MSM), for example E[Y(a)] = \beta_0 + \beta_1 a
We then estimate the parameters of the MSM by fitting a regression, weighting each observation by \frac{1}{f(A_i \vert L_i)}.
The model above is marginal because we marginalized over the distribution of L.
We need to get the propensity model (the model for f(A \vert L)) right.
We also need to get the MSM right (this is only hard if A is not binary).
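As a quick sketch of the full IP-weighting recipe (a minimal, hypothetical example: it assumes a data frame dat with binary treatment A, confounder L, and outcome Y):

```r
# Propensity model for f(A | L)
ps_fit <- glm(A ~ L, family = binomial(), data = dat)
# Weight each observation by 1 / f(A_i | L_i)
dat$w <- ifelse(dat$A == 1, 1 / fitted(ps_fit), 1 / (1 - fitted(ps_fit)))
# Fit the MSM E[Y(a)] = beta_0 + beta_1 a by weighted regression
msm_fit <- lm(Y ~ A, data = dat, weights = w)
coef(msm_fit)["A"]  # estimate of E[Y(1)] - E[Y(0)]
```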
In G-Computation, we propose a model for E[Y \vert A, L].
This model is also structural because we are assuming that E[Y \vert A = a, L] = E[Y(a) \vert L].
However, it is a conditional structural model because we are conditioning on L.
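For contrast, a minimal G-computation sketch on the same hypothetical data (the model form is an assumption, not a prescription):

```r
# Outcome model for E[Y | A, L], with a treatment-confounder interaction
out_fit <- glm(Y ~ A * L, data = dat)
# Predict everyone's outcome under A = 1 and under A = 0, then standardize
y1 <- predict(out_fit, newdata = transform(dat, A = 1))
y0 <- predict(out_fit, newdata = transform(dat, A = 0))
mean(y1 - y0)  # G-computation estimate of the ATE
```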
Suppose we have a single, continuous confounder L, a binary treatment A, and a continuous outcome Y.
We are interested in estimating the ATE, E[Y(1)] - E[Y(0)].
Suppose we specify a conditional structural model, E[Y \vert A = a, L ] = E[Y(a) \vert L] = \beta_0 + \beta_1 a + \beta_2 L + \beta_3 a L
We could think of this as really specifying two structural models:
E[Y(0) \vert L] = \beta_0 + \beta_2 L
E[Y(1) \vert L] = (\beta_0 + \beta_1) + (\beta_2 + \beta_3) L = E[Y(0) \vert L] + \beta_1 + \beta_3 L
In order to estimate the ATE, we need to estimate \beta_1 and \beta_3.
The parameters in the formula for E[Y(0) \vert L] are nuisance parameters.
If we are using G-computation, we need to estimate them, but they don't tell us anything about the causal effect.
Instead, we could specify a model only for the contrast: E[Y(1) \vert L] - E[Y(0) \vert L] = \beta_1 + \beta_3 L
This model is semiparametric because we don't specify a parametric form for E[Y(0) \vert L].
It is not marginal because we are conditioning on L.
If we know \beta_1 and \beta_3, we can calculate the ATE by standardizing over L, just as we would in G-computation
E[Y(1)] - E[Y(0)] = \beta_1 + \frac{1}{n}\sum_{i = 1}^n \beta_3 L_i
Suppose we have a semiparametric structural mean model (SMM) E[Y(a)-Y(0) \vert L] = \beta a
How do we estimate \beta if we can never observe both Y(a) and Y(0) for the same person?
To motivate the method, first assume the causal effect is homogeneous, meaning it is exactly the same for everyone.
This means that for each person, Y_i(1)- Y_i(0) = \beta
Validity of G-estimation doesn't actually require homogeneity but it is a convenient place to start.
Re-write as Y_i(0) = Y_i(1) - \beta
By consistency, if A_i = a then Y_i = Y_i(a) so Y_i(0) = Y_i - \beta A_i
This means that if we know \beta, we can fill in each unit's missing counterfactual.
Let H_i(x) = Y_i - x A_i
We want to find the value of x that will make H_i(x) equal to Y_i(0).
Now we will use exchangeability. Exchangeability says that Y(0) \ci A \vert L, i.e., H(\beta) \ci A \vert L.
For any given value of x, we can compute H_i(x) = Y_i - x A_i for every person in the study.
We can estimate \beta as \hat{\beta} = x^* where H(x^*) is statistically independent of A conditional on L, in our sample.
L | A | Y(0) = H(x) | Y(1) | L | A | Y(0) = H(x) | Y(1) |
---|---|---|---|---|---|---|---|
-1 | 0 | 0.27 | 0.27+x | 0 | 1 | 0.1-x | 0.1 |
-1 | 0 | -0.31 | -0.31+x | 0 | 1 | 0.29-x | 0.29 |
-1 | 0 | -0.1 | -0.1+x | 0 | 1 | 0.48-x | 0.48 |
-1 | 1 | 0.71-x | 0.71 | 0 | 1 | 0.45-x | 0.45 |
-1 | 1 | -0.2-x | -0.2 | 1 | 0 | 1.03 | 1.03+x |
-1 | 1 | 0.88-x | 0.88 | 1 | 0 | 1.14 | 1.14+x |
-1 | 1 | 0.53-x | 0.53 | 1 | 0 | 1.24 | 1.24+x |
0 | 0 | 0.05 | 0.05+x | 1 | 1 | 1.42-x | 1.42 |
0 | 0 | 0.13 | 0.13+x | 1 | 1 | 1.68-x | 1.68 |
0 | 0 | 0.77 | 0.77+x | 1 | 1 | 0.95-x | 0.95 |
How would we know if H(x) were independent of A \vert L?
If we have a guess for x, we could fit a regression like logit \left( P[A = 1 \vert H(x), L] \right) = \alpha_0 + \alpha_1 H(x) + \alpha_2 L
If \alpha_1 is not statistically different from 0, then our data are consistent with H(x) \ci A \vert L.
We could find the values of x that make this true by doing a grid search, repeatedly fitting the regression for multiple candidate x's.
A good point estimate of \beta is the value of x that gives \hat{\alpha}_1(x) closest to 0.
Alternatively, we can find x^* by solving the estimating equation
\sum_{i = 1}^n H_i(x)(A_i - E[A \vert L_i]) = 0
Substituting H_i(x) = Y_i - x A_i and solving for x gives the closed form
x^* = \frac{\sum_{i=1}^n Y_i (A_i - E[A \vert L_i])}{\sum_{i=1}^n A_i (A_i - E[A \vert L_i])}
Suppose that \hat{x} is the solution to \min_{x} \vert \hat{\alpha}_1(x) \vert found via grid search.
For every value of x we try, we get a regression fit including a p-value for \hat{\alpha}_1(x) = 0. Call this p-value p_1(x).
We can get a confidence interval for \hat{x} by inverting this p-value.
The 95% confidence interval for \hat{x} is the set of values \lbrace x : p_1(x) > 0.05 \rbrace.
Alternatively, we can bootstrap using the estimating equations version.
```r
library(geepack)
library(dplyr)

grid <- seq(0.4, 0.6, by = 0.01)
Hx_coefs <- purrr::map_dfr(grid, function(x){
  dat$Hx <- dat$Y - x * dat$A
  # Using GEE for robust SEs
  ea_fit <- geeglm(A ~ as.factor(L) + Hx, family = binomial(), data = dat,
                   id = id, corstr = "independence")
  data.frame("Est" = summary(ea_fit)$coefficients["Hx", "Estimate"],
             "SE"  = summary(ea_fit)$coefficients["Hx", "Std.err"],
             "p"   = summary(ea_fit)$coefficients["Hx", "Pr(>|W|)"])
})
Hx_coefs <- Hx_coefs %>%
  mutate(ci_lower = Est - 1.96 * SE,
         ci_upper = Est + 1.96 * SE,
         pns = p > 0.05)
```
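To read the point estimate and the test-inversion confidence interval off this grid, we can do something like the following (a small follow-up sketch; attaching the x column is our own bookkeeping, not part of the original output):

```r
Hx_coefs$x <- grid
x_hat <- Hx_coefs$x[which.min(abs(Hx_coefs$Est))]  # x giving alpha_1-hat closest to 0
ci <- range(Hx_coefs$x[Hx_coefs$pns])              # x's whose test of alpha_1 = 0 has p > 0.05
```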
```r
gest_func <- function(dat){
  ps_model <- glm(A ~ as.factor(L), family = binomial(), data = dat)
  dat$ea <- predict(ps_model, type = "response")
  est <- with(dat, sum(Y * (A - ea)) / sum(A * (A - ea)))
  return(est)
}
boot_samps <- replicate(n = 500, expr = {
  gest_func(dat[sample(seq(nrow(dat)), size = nrow(dat), replace = TRUE), ])
})
est <- gest_func(dat)
round(est, digits = 2)
```
## [1] 0.49
round(est + c(-1, 1)*qnorm(0.975)*sd(boot_samps), digits = 2)
## [1] 0.45 0.52
We have motivated G-estimation under the very strong assumption that Y_i(1)-Y_i(0) = \beta for everyone.
If this is the case, then H_i(\beta) = Y_i - \beta A_i is equal to Y_i(0).
However, we are only making use of mean exchangeability: E[Y(0) \vert L, A] = E[Y(0) \vert L].
So we only need H(\beta) to have the same conditional mean as Y(0), i.e. E[H(\beta) \vert L] = E[Y(0) \vert L]
This is true as long as we get the semiparametric structural mean model for E[Y(1) - Y(0) \vert L] correct.
When there is censoring, we want to estimate E[Y(A=1, C=0) - Y(A=0, C=0)].
We can estimate censoring weights W^C = \frac{1}{P[C = 0 \vert A, L]}, weight the population by these, and then apply G-estimation.
Using the grid search method, this means using W^{C} as weights when we fit the model for E[A \vert H(x), L].
Using the estimating equations version, we add weights to the sums in numerator and denominator,
x^* = \frac{\sum_{i: C_i = 0} W^C_i Y_i (A_i - E[A \vert L_i])}{\sum_{i:C_i = 0} W^C_i A_i (A_i - E[A \vert L_i])}
Step 1: Compute censoring weights
```r
library(readr)

dat <- read_csv("../../data/nhefs.csv")
dat$cens <- ifelse(is.na(dat$wt82), 1, 0)
cw_fit <- glm(cens == 0 ~ qsmk + sex + race + age + I(age^2) +
                as.factor(education) + smokeintensity + I(smokeintensity^2) +
                smokeyrs + I(smokeyrs^2) + as.factor(exercise) +
                as.factor(active) + wt71 + I(wt71^2),
              data = dat, family = binomial())
dat$wc <- 1 / predict(cw_fit, type = "response")
```
Step 2: Estimate x by grid search.
```r
grid <- seq(2, 5, by = 0.1)
Hx_coefs <- purrr::map_dfr(grid, function(x){
  dat$Hx <- dat$wt82_71 - x * dat$qsmk
  # Using GEE for robust SEs; censoring weights enter via `weights = wc`
  gest_fit <- geeglm(qsmk ~ sex + race + age + I(age*age) + as.factor(education) +
                       smokeintensity + I(smokeintensity*smokeintensity) +
                       smokeyrs + I(smokeyrs*smokeyrs) + as.factor(exercise) +
                       as.factor(active) + wt71 + I(wt71*wt71) + Hx,
                     family = binomial(), data = dat, weights = wc,
                     id = seqn, corstr = "independence",
                     subset = which(!is.na(wt82_71)))
  data.frame("Est" = summary(gest_fit)$coefficients["Hx", "Estimate"],
             "SE"  = summary(gest_fit)$coefficients["Hx", "Std.err"],
             "p"   = summary(gest_fit)$coefficients["Hx", "Pr(>|W|)"])
})
Hx_coefs <- Hx_coefs %>%
  mutate(ci_lower = Est - 1.96 * SE,
         ci_upper = Est + 1.96 * SE,
         pns = p > 0.05)
```
Alternatively, we can compute the estimating-equations solution directly:

```r
ps_model <- glm(qsmk ~ sex + race + age + I(age*age) + as.factor(education) +
                  smokeintensity + I(smokeintensity*smokeintensity) +
                  smokeyrs + I(smokeyrs*smokeyrs) + as.factor(exercise) +
                  as.factor(active) + wt71 + I(wt71*wt71),
                family = binomial(), data = dat)
dat$ea <- predict(ps_model, type = "response")
g_est_num   <- with(dat, sum((1 - cens) * wc * wt82_71 * (qsmk - ea), na.rm = TRUE))
g_est_denom <- with(dat, sum((1 - cens) * wc * qsmk * (qsmk - ea)))
round(g_est_num / g_est_denom, digits = 2)
```
## [1] 3.51
In order for the G-estimation estimate to be consistent, we need to get two things right:
The semiparametric structural mean model, E[Y(a) \vert A = a, L] - E[Y(0) \vert A = a, L]
The model for E[A \vert L].
Suppose there is a region of L containing many untreated units but very few treated units. In G-computation, we may get extrapolation bias because the outcome model will smooth and borrow from surrounding samples to predict Y(1) for all of the untreated units in this region.
On the other hand, IP weighting could have very high variance due to enormous weights for the few treated units in the region.
G-estimation handles this region more gracefully. Recall the estimating-equations solution:
x^* = \frac{\sum_{i=1}^n 1_{C_i=0}W^C_i Y_i (A_i - E[A \vert L_i])}{\sum_{i=1}^n 1_{C_i=0}W^C_i A_i (A_i - E[A \vert L_i])}
Untreated units in the problematic region will have A_i - E[A \vert L_i] \approx 0, so they contribute almost nothing to either sum.
Treated units will have A_i - E[A \vert L_i] \approx 1, but there are few of them, so the region as a whole gets little weight.
Vansteelandt and Joffe argue that, in the binary A case, mis-specifying the model as E[Y(1) \vert L] - E[Y(0) \vert L] = \beta_1 when there is in fact effect modification still gives interpretable results.
Suppose the true model is E[Y(1) \vert L] - E[Y(0) \vert L] = \omega(L).
Then the G-estimator based on the incorrect no-effect-modification model converges to \frac{E\left[ Var(A \vert L)\,\omega(L)\right ]}{E\left [ Var(A \vert L)\right ]}, a weighted average of stratum-specific treatment effects that gives the most weight to strata with higher values of Var(A \vert L), i.e., the strata with the most overlap between treated and control.
Now suppose we want to allow effect modification by L: E[Y(a)-Y(0) \vert L] = \beta_1 a + \beta_2 a L
The corresponding residual is H_i(x_1, x_2) = Y_i - x_1 A_i - x_2 A_i L_i
We now have two parameters to estimate, so we need to perform a search over a two dimensional grid.
We include both H(x_1, x_2) and its interaction with L in the exposure model:
logit\left( P[A = 1 \vert H(x_1, x_2), L] \right) = \alpha_0 + \alpha_1 H(x_1, x_2) + \alpha_2 H(x_1, x_2) L + \alpha_3 L
At x_1 = \beta_1 and x_2 = \beta_2, the \alpha_1 and \alpha_2 terms will both be 0.
We have some choices about how we put H(x_1, x_2) into the exposure model.
For example, we could use
logit\left( P[A = 1 \vert H(x_1, x_2), L] \right) = \alpha_0 + \alpha_1 H(x_1, x_2) + \alpha_2 H(x_1, x_2) L + \alpha_3 L
or
logit\left( P[A = 1 \vert H(x_1, x_2), L] \right) = \alpha_0 + \alpha_1 H(x_1, x_2) + \alpha_2 H(x_1, x_2) L^2 + \alpha_3 L
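A minimal sketch of the two-dimensional grid search using the first exposure model above (toy data frame dat with A, L, Y assumed; grid ranges are arbitrary):

```r
# Candidate (x1, x2) pairs
grid2 <- expand.grid(x1 = seq(0, 1, by = 0.05), x2 = seq(-0.5, 0.5, by = 0.05))
obj <- apply(grid2, 1, function(xs) {
  dat$Hx <- dat$Y - xs[1] * dat$A - xs[2] * dat$A * dat$L
  fit <- glm(A ~ Hx + I(Hx * L) + L, family = binomial(), data = dat)
  # distance of (alpha_1-hat, alpha_2-hat) from (0, 0)
  sum(coef(fit)[c("Hx", "I(Hx * L)")]^2)
})
grid2[which.min(obj), ]  # (x1, x2) making both coefficients closest to 0
```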
More generally, we can specify a structural mean model through a link function g: g\left \lbrace E[Y(a) \vert A =a, L]\right \rbrace - g\left \lbrace E[Y(0) \vert A =a, L]\right \rbrace = \gamma(L, a; \beta)
The corresponding H(x) for each link is
\begin{split} &H(x) = Y - \gamma(L, A; x) \qquad &g\ \text{identity}\\ &H(x) = Y\exp[-\gamma(L, A; x)]\qquad &g\ \text{log}\\ &H(x) = \text{expit}\left( \text{logit}(E[Y \vert L, A]) - \gamma(L, A; x) \right) \qquad &g\ \text{logit} \end{split}
The fact that E[Y \vert L, A] shows up in the definition of H(x) if g is logit makes the odds ratio formulation more difficult to deal with.
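For intuition, here is a hedged sketch of the grid search under the log link with \gamma(L, A; x) = x A, so H(x) = Y\exp(-xA) (it assumes a positive outcome Y in the toy data):

```r
grid <- seq(-1, 1, by = 0.02)
alpha1 <- sapply(grid, function(x) {
  dat$Hx <- dat$Y * exp(-x * dat$A)  # H(x) for the log link
  coef(glm(A ~ L + Hx, family = binomial(), data = dat))["Hx"]
})
grid[which.min(abs(alpha1))]  # x giving alpha_1-hat closest to 0
```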
Suppose x has dimension k.
Next, we either grid search for a value of x such that H(x) is statistically independent of A given L (feasible for small k),
or we solve the estimating equations
\sum_{i = 1}^n 1_{C_i = 0}W^C_i H_i(x)(d(A_i, L_i) - d(E[A \vert L_i], L_i)) = 0_k where d is a function that outputs a k-dimensional vector and 0_k is the k-dimensional zero vector.
A natural choice is d(A, L) = E\left\lbrace \frac{\partial H(x)}{\partial x} \Big\vert A, L\right\rbrace
For our two-parameter example, H_i(x_1, x_2) = Y_i - x_1 A_i - x_2 A_i L_i, so (up to sign) this choice of d gives the estimating equations
\sum_{i = 1}^n H_i(x) \begin{pmatrix}A_i - E[A \vert L_i]\\L_i(A_i - E[A \vert L_i]) \end{pmatrix}= \begin{pmatrix}0 \\ 0 \end{pmatrix}
Writing D_i = A_i - E[A \vert L_i], this becomes \sum_{i = 1}^n (Y_i - x_1 A_i - x_2 A_i L_i) D_i\begin{pmatrix}1\\L_i \end{pmatrix}= \begin{pmatrix}0 \\ 0 \end{pmatrix}
x_1 \begin{pmatrix} \sum_{i = 1}^n A_i D_i\\ \sum_{i = 1}^n A_i D_iL_i \end{pmatrix} + x_2\begin{pmatrix}\sum_{i = 1}^n A_i L_i D_i\\\sum_{i = 1}^n A_i L_i^2 D_i \end{pmatrix} = \begin{pmatrix}\sum_{i = 1}^n Y_i D_i\\\sum_{i = 1}^n Y_i D_iL_i \end{pmatrix}
In matrix form,
\begin{pmatrix} \sum_{i = 1}^n A_i D_i & \sum_{i = 1}^n A_i D_i L_i\\ \sum_{i = 1}^n A_i D_i L_i & \sum_{i = 1}^n A_i L_i^2 D_i\end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix}\sum_{i = 1}^n Y_i D_i\\ \sum_{i = 1}^n Y_i D_i L_i \end{pmatrix}
so
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} \sum_{i = 1}^n A_i D_i & \sum_{i = 1}^n A_i L_i D_i\\ \sum_{i = 1}^n A_i D_i L_i & \sum_{i = 1}^n A_i L_i^2 D_i\end{pmatrix}^{-1}\begin{pmatrix}\sum_{i = 1}^n Y_i D_i\\ \sum_{i = 1}^n Y_i D_i L_i \end{pmatrix}
Suppose we believe there is effect modification by smoking intensity. Let L be smoking intensity and W the remaining confounders. We could use the SMM E[Y(a) \vert L, W] - E[Y(0) \vert L, W] = \beta_1 a + \beta_2 a L
We include W to make explicit that this model conditions on all of the confounders; we are saying that there is effect modification only by smoking intensity.
For this we can use the solution to the example that we just worked out.
```r
dat$wc[dat$cens != 0] <- 0
dat$rem_w <- with(dat, wc * (qsmk - ea))
S <- with(dat, c(sum(wt82_71 * rem_w, na.rm = TRUE),
                 sum(smokeintensity * wt82_71 * rem_w, na.rm = TRUE)))
R <- with(dat, c(sum(qsmk * rem_w),
                 sum(smokeintensity * qsmk * rem_w),
                 sum(smokeintensity * qsmk * rem_w),
                 sum(smokeintensity^2 * qsmk * rem_w)))
R <- matrix(R, nrow = 2, byrow = TRUE)
x <- solve(R) %*% S
round(x, digits = 2)
```
##      [,1]
## [1,] 2.92
## [2,] 0.03
```r
# Standardize to get the ATE
ate <- x[1] + mean(dat$smokeintensity * x[2])
round(ate, digits = 2)
```
## [1] 3.54
To make the estimator more robust, we can replace H(x) with H(x) - E[H(x) \vert L].
To do this, we first fit an outcome model for E[Y \vert L, A = 0].
If H_i(x) = Y_i - xA_i, then H_i(x) - \hat{E}[H_i(x) \vert L_i] = (Y_i - \hat{Y}_{i,0}) - xA_i, where \hat{Y}_{i,0} is the fitted value from that outcome model.
The resulting estimator will be consistent if either the outcome model in the untreated or the propensity model is correct (double robustness).
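A minimal sketch of this doubly robust version in the toy example (model forms are assumptions, not prescriptions):

```r
# Outcome model among the untreated for E[Y | L, A = 0]
out_fit <- glm(Y ~ as.factor(L), data = subset(dat, A == 0))
dat$y0hat <- predict(out_fit, newdata = dat)
# Propensity model for E[A | L]
dat$ea <- fitted(glm(A ~ as.factor(L), family = binomial(), data = dat))
# Solve sum{ ((Y - y0hat) - x A) (A - ea) } = 0 for x
est_dr <- with(dat, sum((Y - y0hat) * (A - ea)) / sum(A * (A - ea)))
```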
One nice feature of G-estimation for continuous treatments is that we only need a model for E[A \vert L].
For IP weighting, we need a model for the full conditional density f(A \vert L), which is much harder to get right.
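For example, with a continuous treatment we might model E[A \vert L] by ordinary linear regression (a hedged sketch; the closed-form solution is unchanged):

```r
ea_fit <- lm(A ~ L, data = dat)    # model only the conditional mean of A
dat$ea <- predict(ea_fit)
est <- with(dat, sum(Y * (A - ea)) / sum(A * (A - ea)))
```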
In G-Estimation we propose a semiparametric structural mean model for the causal contrast of interest conditional on L, with parameters \boldsymbol{\beta}.
We also propose a model for E[A \vert L].
We then obtain a function H(x) such that E[H(\boldsymbol{\beta}) \vert L] = E[Y(0) \vert L].
Finally, we search for the value of x that makes H(x) statistically independent of A conditional on L.
G-Estimation avoids some of the problems of G-Computation and IP Weighting.
The gesttools R package came out in 2022.