
Lecture 9: G-Estimation

Jean Morrison

University of Michigan

Lecture on 2025-02-24

(updated: 2025-02-24)

1 / 46

\newcommand{\ci}{\perp\!\!\!\perp}

Lecture Plan

  1. G-Estimation

Lecture materials from HR Chapter 14 and from Vansteelandt and Joffe (2014).

2 / 46

G-Estimation

  • This is probably the least familiar estimation method we will see for non-time varying treatments.

  • G-estimation is fairly uncommon in the applied literature, but it has several advantages:

    • For a continuous treatment, we only have to specify E[A \vert L] instead of f(A \vert L).
    • Potentially more efficient than IP weighting.
    • Less extrapolation bias than G-computation.
  • G-Estimation is also useful for time-varying data, which we will see later.

3 / 46

Marginal Structural Models

  • In IP weighting, we estimate a propensity model f(A \vert L) and propose a marginal structural model (MSM) such as

E[Y(a)] = \beta_0 + \beta_1 a

  • We then estimate the parameters of the MSM by fitting a regression, weighting each observation by \frac{1}{f(A_i \vert L_i)}.

  • The model above is marginal because we marginalized over the distribution of L.

  • We need to get the propensity model (the model for f(A \vert L)) right.

  • We also need to get the MSM right (only hard if A is not binary).

4 / 46

Conditional Outcome Models

  • In G-Computation, we propose a model for E[Y \vert A, L].

  • This model is also structural because we are assuming that E[Y \vert A = a, L] = E[Y(a) \vert L].

  • However, it is a conditional structural model because we are conditioning on L.

5 / 46

Semiparametric Structural Mean Models

  • Suppose we have a single, continuous confounder L, a binary treatment A, and a continuous outcome Y.

  • We are interested in estimating the ATE, E[Y(1)] - E[Y(0)].

  • Suppose we specify a conditional structural model, E[Y \vert A = a, L ] = E[Y(a) \vert L] = \beta_0 + \beta_1 a + \beta_2 L + \beta_3 a L

  • We could think of this as really specifying two structural models:

E[Y(0) \vert L] = \beta_0 + \beta_2 L

E[Y(1) \vert L] = (\beta_0 + \beta_1) + (\beta_2 + \beta_3) L = E[Y(0) \vert L] + \beta_1 + \beta_3 L

6 / 46

Semiparametric Structural Mean Models

  • Since we are only interested in the ATE, we really only care about the second formula,

E[Y(1) \vert L] = E[Y(0) \vert L] + \beta_1 + \beta_3 L
  • In order to estimate the ATE, we need to estimate \beta_1 and \beta_3.

  • The parameters in the formula for E[Y(0) \vert L] are nuisance parameters.

  • If we are using G-computation, we need to estimate them, but they don't tell us anything about the causal effect.

7 / 46

Semiparametric Structural Mean Models

  • A semiparametric structural mean model only specifies a model for the target contrast (in our case difference in counterfactual means).

E[Y(1) \vert L] - E[Y(0) \vert L] = \beta_1 + \beta_3 L

  • This model is semiparametric because we don't specify a parametric form for E[Y(0) \vert L].

  • It is not marginal because we are conditioning on L.

  • If we know \beta_1 and \beta_3, we can calculate the ATE by standardizing over L, just as we would in G-computation

E[Y(1)] - E[Y(0)] = \beta_1 + \frac{1}{n}\sum_{i = 1}^n \beta_3 L_i
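A minimal R sketch of this standardization step, assuming hypothetical objects b1 and b3 holding the estimates of \beta_1 and \beta_3 and a data frame dat with the confounder column L:

# Standardize over the empirical distribution of L;
# b1 and b3 are hypothetical estimates of beta_1 and beta_3
ate_hat <- b1 + mean(b3 * dat$L)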

8 / 46

Semiparametric Structural Mean Models

(Figure omitted from text export.)

9 / 46

G-Estimation

  • Suppose we have a semiparametric structural mean model (SMM) E[Y(a)-Y(0) \vert L] = \beta a

  • How do we estimate \beta if we can never observe both Y(a) and Y(0) for the same person?

10 / 46

G-Estimation

  • To motivate the method, first assume the causal effect is homogeneous, meaning it is exactly the same for everyone.

  • This means that for each person, Y_i(1)- Y_i(0) = \beta

  • Validity of G-estimation doesn't actually require homogeneity but it is a convenient place to start.

11 / 46

G-Estimation

  • Re-write as Y_i(0) = Y_i(1) - \beta

  • By consistency, if A_i = a then Y_i = Y_i(a) so Y_i(0) = Y_i - \beta A_i

  • This means that if we know \beta, we can fill in each unit's missing counterfactual.

  • Let H_i(x) = Y_i - x A_i

  • We want to find the value of x that will make H_i(x) equal to Y_i(0).

12 / 46

G-Estimation

  • Now we will use exchangeability. Exchangeability says that Y(0) \ci A \vert L, i.e., H(\beta) \ci A \vert L.

  • For any given value of x, we can compute H_i(x) = Y_i - x A_i for every person in the study.

  • We can estimate \beta as \hat{\beta} = x^* where H(x^*) is statistically independent of A conditional on L, in our sample.

13 / 46

Example with Categorical L

 L   A   Y(0) = H(x)   Y(1)      |   L   A   Y(0) = H(x)   Y(1)
-1   0    0.27          0.27+x   |   0   1    0.1-x         0.1
-1   0   -0.31         -0.31+x   |   0   1    0.29-x        0.29
-1   0   -0.1          -0.1+x    |   0   1    0.48-x        0.48
-1   1    0.71-x        0.71     |   0   1    0.45-x        0.45
-1   1   -0.2-x        -0.2      |   1   0    1.03          1.03+x
-1   1    0.88-x        0.88     |   1   0    1.14          1.14+x
-1   1    0.53-x        0.53     |   1   0    1.24          1.24+x
 0   0    0.05          0.05+x   |   1   1    1.42-x        1.42
 0   0    0.13          0.13+x   |   1   1    1.68-x        1.68
 0   0    0.77          0.77+x   |   1   1    0.95-x        0.95
  • The best estimate of \beta is the value of x that makes H(x) as close to independent of A \vert L as possible.
14 / 46

G-Estimation

  • How would we know if H(x) were independent of A \vert L?

  • If we have a guess for x, we could fit a regression like

logit \left( P[A = 1 \vert H(x), L] \right) = \alpha_0 + \alpha_1 H(x) + \alpha_2 L

  • If \alpha_1 is not statistically different from 0, then our data are consistent with H(x) \ci A \vert L.

  • We could find the values of x that make this true by doing a grid search, repeatedly fitting the regression for multiple candidate x's.

  • A good point estimate of \beta is the value of x that gives \hat{\alpha}_1(x) closest to 0.

15 / 46

G-Estimation

  • Alternatively, minimizing the association between H_i(x) = Y_i - x A_i and A conditional on L is equivalent to solving the estimating equation:

\sum_{i = 1}^n H_i(x)(A_i - E[A \vert L_i]) = 0

  • This is solved by

x^* = \frac{\sum_{i=1}^n Y_i (A_i - E[A \vert L_i])}{\sum_{i=1}^n A_i (A_i - E[A \vert L_i])}

16 / 46

Variance of the Estimate

  • Suppose that \hat{x} is the solution to \min_{x} \vert \hat{\alpha}_1(x) \vert found via grid search.

  • For every value of x we try, we get a regression fit including a p-value for \hat{\alpha}_1(x) = 0. Call this p-value p_1(x).

  • We can get a confidence interval for \beta by inverting this p-value.

  • The 95% confidence interval is the set of values \lbrace x : p_1(x) > 0.05 \rbrace.

  • Alternatively, we can bootstrap using the estimating equations version.

17 / 46

Example with Categorical L

  • In our categorical L example, to perform the grid search, we write a function that computes \hat{\alpha}_1(x) for each candidate value of x.

library(geepack)  # geeglm, for robust (sandwich) SEs
library(dplyr)

grid <- seq(0.4, 0.6, by = 0.01)
Hx_coefs <- purrr::map_dfr(grid, function(x){
  dat$Hx <- dat$Y - x * dat$A
  # Using GEE for robust SEs
  ea_fit <- geeglm(A ~ as.factor(L) + Hx,
                   family = binomial(), data = dat,
                   id = id, corstr = "independence")
  data.frame("Est" = summary(ea_fit)$coefficients["Hx", "Estimate"],
             "SE"  = summary(ea_fit)$coefficients["Hx", "Std.err"],
             "p"   = summary(ea_fit)$coefficients["Hx", "Pr(>|W|)"])
})
Hx_coefs <- Hx_coefs %>% mutate(ci_lower = Est - 1.96*SE,
                                ci_upper = Est + 1.96*SE,
                                pns = p > 0.05)
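Using the Hx_coefs table above, a short sketch of how the point estimate and the test-inversion confidence interval from the previous slide could be extracted:

# Point estimate: the x whose alpha_1 estimate is closest to 0
x_hat <- grid[which.min(abs(Hx_coefs$Est))]
# 95% CI by test inversion: all x where alpha_1(x) = 0 is not rejected
ci <- range(grid[Hx_coefs$pns])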
18 / 46

G-Estimation in Categorical Example

(Figure omitted from text export.)

19 / 46

Estimating Equations in Categorical Example

gest_func <- function(dat){
  # Propensity model for E[A | L]
  ps_model <- glm(A ~ as.factor(L),
                  family = binomial(), data = dat)
  dat$ea <- predict(ps_model, type = "response")
  # Closed-form solution to the estimating equation
  est <- with(dat, sum(Y*(A - ea))/sum(A*(A - ea)))
  return(est)
}
# Nonparametric bootstrap for the standard error
boot_samps <- replicate(n = 500,
                        expr = {gest_func(dat[sample(seq(nrow(dat)),
                                                     size = nrow(dat),
                                                     replace = TRUE), ])})
est <- gest_func(dat)
round(est, digits = 2)
## [1] 0.49
round(est + c(-1, 1)*qnorm(0.975)*sd(boot_samps), digits = 2)
## [1] 0.45 0.52
20 / 46

Beyond Homogeneous Effects

  • We have motivated G-estimation under the very strong assumption that Y_i(1)-Y_i(0) = \beta for everyone.

  • If this is the case, then H_i(\beta) = Y_i - \beta A_i is equal to Y_i(0).

  • However, we are only making use of mean exchangeability: E[Y(0) \vert L, A] = E[Y(0) \vert L].

  • So we only need H(\beta) to have the same conditional mean as Y(0), i.e. E[H(\beta) \vert L] = E[Y(0) \vert L]

  • This is true as long as we get the semiparametric structural mean model for E[Y(1)]-E[Y(0)] correct.

21 / 46

Censoring

  • When there is censoring, we want to estimate E[Y(A = 1, C =0) - Y(A =0, C =0)].

  • We can estimate censoring weights W^C = \frac{1}{P[C = 0 \vert A, L]}, weight the population by these, and then apply G-estimation.

  • Using the grid search method, this means using W^{C} as weights when we fit the model for E[A \vert H(x), L].

  • Using the estimating equations version, we add weights to the sums in numerator and denominator,

x^* = \frac{\sum_{i: C_i = 0} W^C_i Y_i (A_i - E[A \vert L_i])}{\sum_{i:C_i = 0} W^C_i A_i (A_i - E[A \vert L_i])}

22 / 46

G-Estimation in NHEFS

Step 1: Compute censoring weights

library(readr)  # read_csv

dat <- read_csv("../../data/nhefs.csv")
dat$cens <- ifelse(is.na(dat$wt82), 1, 0)
# Model for P[C = 0 | A, L]
cw_fit <- glm(cens == 0 ~ qsmk + sex + race + age + I(age^2)
              + as.factor(education) + smokeintensity + I(smokeintensity^2)
              + smokeyrs + I(smokeyrs^2) + as.factor(exercise)
              + as.factor(active) + wt71 + I(wt71^2),
              data = dat, family = binomial())
dat$wc <- 1/predict(cw_fit, type = "response")
23 / 46

Grid Search G-Estimation in NHEFS

Step 2: Estimate x by grid search.

grid <- seq(2, 5, by = 0.1)
Hx_coefs <- purrr::map_dfr(grid, function(x){
  dat$Hx <- dat$wt82_71 - x * dat$qsmk
  # Using GEE for robust SEs
  gest_fit <- geeglm(qsmk ~ sex + race + age + I(age*age) + as.factor(education)
                     + smokeintensity + I(smokeintensity*smokeintensity) + smokeyrs
                     + I(smokeyrs*smokeyrs) + as.factor(exercise) + as.factor(active)
                     + wt71 + I(wt71*wt71) + Hx,
                     family = binomial(), data = dat, weights = wc,
                     id = seqn, corstr = "independence",
                     subset = which(!is.na(wt82_71)))
  data.frame("Est" = summary(gest_fit)$coefficients["Hx", "Estimate"],
             "SE"  = summary(gest_fit)$coefficients["Hx", "Std.err"],
             "p"   = summary(gest_fit)$coefficients["Hx", "Pr(>|W|)"])
})
Hx_coefs <- Hx_coefs %>% mutate(ci_lower = Est - 1.96*SE,
                                ci_upper = Est + 1.96*SE,
                                pns = p > 0.05)
24 / 46

G-Estimation in NHEFS

(Figure omitted from text export.)

25 / 46

Estimating Equations in NHEFS

ps_model <- glm(qsmk ~ sex + race + age + I(age*age) + as.factor(education)
                + smokeintensity + I(smokeintensity*smokeintensity) + smokeyrs
                + I(smokeyrs*smokeyrs) + as.factor(exercise) + as.factor(active)
                + wt71 + I(wt71*wt71),
                family = binomial(), data = dat)
dat$ea <- predict(ps_model, type = "response")
# Censoring-weighted estimating equation solution
g_est_num <- with(dat, sum((1-cens)*wc*wt82_71*(qsmk - ea), na.rm = TRUE))
g_est_denom <- with(dat, sum((1-cens)*wc*qsmk*(qsmk - ea)))
round(g_est_num/g_est_denom, digits = 2)
## [1] 3.51
26 / 46

Assumptions

  • In order for the G-estimation estimate to be consistent, we need to get two things right:

    • The semiparametric structural mean model for E[Y(a) \vert A = a, L] - E[Y(0) \vert A = a, L].
    • The model for E[A \vert L].

27 / 46

Effect of Positivity Violations in G-Estimation

  • Suppose we have a region of covariate space with almost no treated units. In this region E[A \vert L] will be close to 0.
  • In G-computation, we may get extrapolation bias because the outcome model will smooth over and borrow information from neighboring regions of covariate space to predict Y(1) for the untreated units in this region.

    • This will make our estimates highly model dependent.
  • On the other hand, IP weighting could have very high variance due to enormous weights for the few treated units in the region.

28 / 46

Effect of Positivity Violations in G-Estimation

  • Look again at the estimating equations version of the G-estimation estimator:

x^* = \frac{\sum_{i=1}^n 1_{C_i=0}W^C_i Y_i (A_i - E[A \vert L_i])}{\sum_{i=1}^n 1_{C_i=0}W^C_i A_i (A_i - E[A \vert L_i])}

  • Untreated units in the problematic region will have A_i - E[A \vert L_i] \approx 0.

  • Treated units will have A_i - E[A \vert L_i] \approx 1.

  • In G-estimation, the untreated units will barely contribute to the sum, so G-estimation will not suffer from extrapolation bias.
    • Treated units will contribute, but the weight of their contribution is not extreme, preventing high variance.
29 / 46

Effect Modification

  • In G-estimation, we are estimating E[Y(a)-Y(0) \vert L, A = a], the treatment effect conditional on L.
    • If there is any effect modification by variables in L, we have to include that in the model.
    • To get the marginal effect, we need to standardize as in G-computation.
  • This is different from the MSMs we use for IP weighting.
    • In that case, we only put in effect modifiers if we find them interesting.
30 / 46

Effect Modification

  • Vansteelandt and Joffe argue that, in the binary A case, mis-specifying the model as E[Y(1) \vert L] - E[Y(0) \vert L] = \beta_1 when there is effect modification still gives interpretable results.

  • Suppose the true model is E[Y(1) \vert L] - E[Y(0) \vert L ] = \omega(L)

  • Then the G-estimator using the incorrect model with no effect modification converges to

\frac{E\left[ Var(A \vert L)\,\omega(L)\right ]}{E\left [ Var(A \vert L)\right ]}

  • This limit is a weighted average of the stratum-specific treatment effects, giving the most weight to strata with high Var(A \vert L), i.e., the strata with the most overlap between treated and controls.

31 / 46

Multi-Parameter Structural Mean Models

  • Suppose we propose the model

E[Y(a)-Y(0) \vert L] = \beta_1 a + \beta_2 a L

  • The corresponding "H" model is

H_i(x_1, x_2) = Y_i - x_1 A_i - x_2 A_i L_i

  • We are looking for values x_1^* and x_2^* such that H(x_1^*, x_2^*) \ci A \vert L
32 / 46

Multi-Parameter Structural Mean Models

  • We now have two parameters to estimate, so we need to perform a search over a two dimensional grid.

  • We can again fit an exposure model, now including both H(x_1, x_2) and its interaction with L:

logit\left( P[A = 1 \vert H(x_1, x_2), L] \right) = \alpha_0 + \alpha_1 H(x_1, x_2) + \alpha_2 H(x_1, x_2) L + \alpha_3 L

  • At x_1 = \beta_1 and x_2 = \beta_2, the \alpha_1 and \alpha_2 terms will both be 0.
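A minimal two-dimensional grid search sketch, assuming the same setup as the earlier categorical example (a data frame dat with columns Y, A, L, and id); the grid ranges here are purely illustrative:

library(geepack)

# Illustrative grid; in practice center it on plausible effect sizes
grid2 <- expand.grid(x1 = seq(0, 1, by = 0.05),
                     x2 = seq(-0.5, 0.5, by = 0.05))
fits <- purrr::pmap_dfr(grid2, function(x1, x2){
  dat$Hx <- dat$Y - x1 * dat$A - x2 * dat$A * dat$L
  # Exposure model: alpha_0 + alpha_3 L + alpha_1 Hx + alpha_2 Hx*L
  fit <- geeglm(A ~ L * Hx, family = binomial(), data = dat,
                id = id, corstr = "independence")
  data.frame(x1 = x1, x2 = x2,
             a1 = coef(fit)[["Hx"]], a2 = coef(fit)[["L:Hx"]])
})
# Point estimate: the (x1, x2) making both alpha estimates closest to 0
best <- fits[which.min(fits$a1^2 + fits$a2^2), ]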

33 / 46

Multi-Parameter Structural Mean Models

  • We have some choices about how we put H(x_1, x_2) into the exposure model.

  • For example, we could use

logit\left( P[A = 1 \vert H(x_1, x_2), L] \right) = \alpha_0 + \alpha_1 H(x_1, x_2) + \alpha_2 H(x_1, x_2) L + \alpha_3 L

or

logit\left( P[A = 1 \vert H(x_1, x_2), L] \right) = \alpha_0 + \alpha_1 H(x_1, x_2) + \alpha_2 H(x_1, x_2) L^2 + \alpha_3 L

  • It turns out that the choice only affects efficiency and not consistency, and we can derive the locally most efficient choice.
34 / 46

General Formulation

  • In general we start with a structural mean model

g\left \lbrace E[Y(a) \vert A =a, L]\right \rbrace - g\left \lbrace E[Y(0) \vert A =a, L]\right \rbrace = \gamma(a, l ; x^*)

  • From this, we define

\begin{aligned} H(x) &= Y - \gamma(L, A; x) & &g\ \text{identity}\\ H(x) &= Y\exp[-\gamma(L, A; x)] & &g\ \text{log}\\ H(x) &= \text{expit}\left( \text{logit}(E[Y \vert L, A]) - \gamma(L, A; x) \right) & &g\ \text{logit} \end{aligned}

  • The fact that E[Y \vert L, A] shows up in the definition of H(x) if g is logit makes the odds ratio formulation more difficult to deal with.

    • There is no consistent G-estimator for the causal odds ratio.
  • Suppose x has dimension k.

35 / 46

General Formulation

  • Next we either grid search for a value of x such that H(x) is statistically independent of A - E[A \vert L] (possible for small k)

  • Or we solve the estimating equations

\sum_{i = 1}^n 1_{C_i = 0}W^C_i H_i(x)\left(d(A_i, L_i) - d(E[A \vert L_i], L_i)\right) = 0_k

where d is a function that outputs a k-dimensional vector and 0_k is the k-dimensional zero vector.

  • d can be any function of A and L, but there is a most efficient choice.
36 / 46

Efficient choice of d

  • Under homoskedasticity, meaning that the variance of H(x) is constant given A and L, the best choice of d is

d(A, L) = E\left\lbrace \frac{\partial H(x)}{\partial x} \,\Big\vert\, A, L\right\rbrace

  • Under the two parameter model from a few slides ago where we had

H_i(x_1, x_2) = Y_i - x_1 A_i - x_2 A_i L_{i}

  • The best choice of d is d(A,L) = \begin{pmatrix}A \\ A L \end{pmatrix}.
    • We dropped the minus sign since we can just multiply both sides of the estimating equations by -1.
37 / 46

Multi-Parameter Example

  • In our example multi-parameter model we need to solve (dropping censoring weights for simplicity)

\sum_{i = 1}^n H_i(x) \begin{pmatrix}A_i - E[A \vert L_i]\\L_i(A_i - E[A \vert L_i]) \end{pmatrix}= \begin{pmatrix}0 \\ 0 \end{pmatrix}

  • Let D_i = A_i - E[A \vert L_i].

\sum_{i = 1}^n (Y_i - x_1 A_i - x_2 A_i L_i) D_i\begin{pmatrix}1\\L_i \end{pmatrix}= \begin{pmatrix}0 \\ 0 \end{pmatrix}

38 / 46

Multi-Parameter Example

  • Separating terms and rearranging:

x_1 \sum_{i = 1}^n A_i D_i\begin{pmatrix}1\\L_i \end{pmatrix} + x_2\sum_{i = 1}^n A_i L_i D_i\begin{pmatrix}1\\L_i \end{pmatrix} = \sum_{i = 1}^n Y_i D_i\begin{pmatrix}1\\L_i \end{pmatrix}

x_1 \begin{pmatrix} \sum_{i = 1}^n A_i D_i\\ \sum_{i = 1}^n A_i D_iL_i \end{pmatrix} + x_2\begin{pmatrix}\sum_{i = 1}^n A_i L_i D_i\\\sum_{i = 1}^n A_i L_i^2 D_i \end{pmatrix} = \begin{pmatrix}\sum_{i = 1}^n Y_i D_i\\\sum_{i = 1}^n Y_i D_iL_i \end{pmatrix}

39 / 46

Multi-Parameter Example

  • Solving for (x_1, x_2):

\begin{pmatrix} \sum_{i = 1}^n A_i D_i & \sum_{i = 1}^n A_i D_i L_i\\ \sum_{i = 1}^n A_i D_i L_i & \sum_{i = 1}^n A_i L_i^2 D_i\end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix}\sum_{i = 1}^n Y_i D_i\\ \sum_{i = 1}^n Y_i D_i L_i \end{pmatrix}

\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} \sum_{i = 1}^n A_i D_i & \sum_{i = 1}^n A_i L_i D_i\\ \sum_{i = 1}^n A_i D_i L_i & \sum_{i = 1}^n A_i L_i^2 D_i\end{pmatrix}^{-1}\begin{pmatrix}\sum_{i = 1}^n Y_i D_i\\ \sum_{i = 1}^n Y_i D_i L_i \end{pmatrix}

40 / 46

Multi-Parameter Model in NHEFS

  • Suppose we believe there is effect modification by smoking intensity. Let L be smoking intensity and W the remaining confounders. We could use the SMM

E[Y(a) \vert L, W] - E[Y(0) \vert L, W] = \beta_1 a + \beta_2 a L

  • We include W to make explicit that this model is conditional on all of the confounders. We are explicitly saying we think there is only effect modification by smoking intensity.

  • For this we can use the solution to the example that we just worked out.

41 / 46

Multi-Parameter Model in NHEFS

# Zero out contributions from censored subjects
dat$wc[dat$cens != 0] <- 0
dat$rem_w <- with(dat, wc*(qsmk - ea))
# Right-hand side vector of the linear system
S <- with(dat, c(sum(wt82_71*rem_w, na.rm = TRUE),
                 sum(smokeintensity*wt82_71*rem_w, na.rm = TRUE)))
# Left-hand side 2x2 matrix
R <- with(dat, c(sum(qsmk*rem_w), sum(smokeintensity*qsmk*rem_w),
                 sum(smokeintensity*qsmk*rem_w), sum(smokeintensity^2*qsmk*rem_w)))
R <- matrix(R, nrow = 2, byrow = TRUE)
x <- solve(R) %*% S
round(x, digits = 2)
## [,1]
## [1,] 2.92
## [2,] 0.03
# Standardize to get the ATE
ate <- x[1] + mean(dat$smokeintensity*x[2])
round(ate, digits = 2)
## [1] 3.54
42 / 46

Doubly Robust G-Estimation

  • To make the estimator more robust, we can replace H(x) with H(x) - E[H(x) \vert L].

    • This works using either the estimating equation approach or the grid search approach.
  • To do this, we first fit an outcome model for E[Y \vert L, A = 0].

  • If H_i(x) = Y_i - xA_i, then H_i(x) - E[H_i(x) \vert L] = (Y_i - \hat{Y}_{i,0}) - xA_i, where \hat{Y}_{i,0} is the prediction from the outcome model fit in the untreated.

  • The resulting estimator will be consistent if either the outcome model in the untreated or the propensity model is correct.
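A minimal sketch of this doubly robust variant, modifying the earlier gest_func from the categorical L example; the helper name dr_gest_func and the outcome-model specification are hypothetical:

dr_gest_func <- function(dat){
  # Propensity model for E[A | L], as before
  ps_model <- glm(A ~ as.factor(L), family = binomial(), data = dat)
  dat$ea <- predict(ps_model, type = "response")
  # Outcome model fit among the untreated only (hypothetical specification)
  y0_model <- lm(Y ~ as.factor(L), data = subset(dat, A == 0))
  dat$y0hat <- predict(y0_model, newdata = dat)
  # Same estimating equation, with H(x) centered at the estimate of E[H(x) | L]
  with(dat, sum((Y - y0hat)*(A - ea))/sum(A*(A - ea)))
}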

43 / 46

Continuous Treatments

  • One nice feature of G-estimation for continuous treatments is that we only need a model for E[A \vert L].

  • For IP weighting, we need a model for f(A \vert L) which is much harder to get right.
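A minimal sketch for a continuous treatment under the homogeneous-effect SMM E[Y(a) - Y(0) \vert L] = \beta a, assuming a data frame dat with columns Y, A, and L; the linear model for E[A \vert L] is an illustrative choice:

# Model only the conditional mean E[A | L]; no density f(A | L) is needed
ea_model <- lm(A ~ L, data = dat)
dat$ea <- predict(ea_model)
# Same closed-form estimating-equation solution as in the binary case
x_star <- with(dat, sum(Y*(A - ea))/sum(A*(A - ea)))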

44 / 46

Summary

  • In G-Estimation we propose a semiparametric structural mean model for the causal contrast of interest conditional on L, with parameters \boldsymbol{\beta}.

  • We also propose a model for E[A \vert L].

  • We then obtain a function H(x) such that E[H(\boldsymbol{\beta}) \vert L] = E[Y(0) \vert L].

  • Finally, we search for the value of x that makes H(x) statistically independent of A \vert L.

45 / 46

Summary

  • G-Estimation avoids some of the problems of G-Computation and IP Weighting

    • Extrapolation bias (G-Computation)
    • Large variance (IP Weighting)
  • The gesttools R package came out in 2022.

    • Primarily focused on time-varying exposures and outcomes but also works for single time point data.
46 / 46
