class: center, middle, inverse, title-slide

.title[
# L5: IP Weighting and G-Computation
]
.author[
### Jean Morrison
]
.institute[
### University of Michigan
]
.date[
### Lecture on 2023-01-30 (updated: 2024-02-07)
]

---
`\(\newcommand{\ci}{\perp\!\!\!\perp}\)`

## Lecture Plan

1. Intro
2. Inverse Probability Weighting
    - IP weighting for binary and continuous treatment
    - Marginal structural models
    - Censoring weights
3. G-Computation

---
# 1. Introduction

---
## Modeling

- So far we have learned two strategies for estimating `\(E[Y(a)]\)` from data when we need to adjust for confounding or selection bias.
- In IP weighting, we re-weight our observations by `\(1/P[A = A_i \vert L_i]\)`, the inverse of the probability that they received the treatment they actually received, given confounders `\(L_i\)`.
- In standardization, we first stratify the data by `\(L\)`, compute `\(E[Y \vert A = a, L]\)` within each stratum, and then take a weighted sum of these averages.

---
## Models

- Formally, a **model** is a class of probability distributions `\(\mathcal{M} = \lbrace P_\theta : \theta \in \Theta \rbrace\)`.
- `\(\theta\)` is a (possibly infinite) vector of parameters and `\(\Theta\)` is the parameter space.
- A model is called **parametric** when `\(\theta\)` is finite dimensional and `\(\mathcal{M}\)` is not the set of all possible probability distributions.
- If `\(\theta\)` is infinite dimensional, `\(\mathcal{M}\)` may be called **non-parametric** or **semi-parametric**.
  + Semi-parametric models have a finite-dimensional, structured component and an infinite-dimensional component.
- In this lecture we will talk about parametric models.

---
## Our Models So Far

- So far in examples, we have always used **saturated** models.
  + Saturated models contain all possible probability distributions for the observed data.
- We assume that `\(A\)`, `\(Y\)`, and any confounders have a discrete number of levels and that we have enough data to estimate quantities like `\(E[Y \vert A, L]\)` in each stratum of `\(A\)` and `\(L\)`.
- This has allowed us to focus on **identification**, i.e. whether we could estimate a parameter with infinite data.
- However, in the real world, we don't get infinite data, and it is impractical to assume we can always estimate `\(E[Y \vert A, L]\)` in every stratum of `\(A\)` and `\(L\)`.
- We will need **modeling** to go further.

---
## G-Formula

- The formula we saw previously as the "stratification" formula is also called the g-formula
$$
E[Y(a)] = \sum_l E[Y \vert A = a, L = l]P[L = l]
$$
- We showed in L1 that an alternate expression of the g-formula is
$$
E[Y(a)] = E\left[ \frac{I(A = a) Y}{f(A \vert L)}\right]
$$
- We can use either of these expressions to obtain parametric estimates of `\(E[Y(a)]\)`.
- In G-computation (also called standardization), we estimate `\(E[Y \vert A, L]\)` and plug it into the first expression.
- In IP weighting, we estimate `\(f(A \vert L)\)` (the propensity score) and use the second expression.
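- To make the equivalence concrete, below is a toy simulation (not part of the original lecture; all names and numbers are made up for illustration) in which both expressions of the g-formula recover `\(E[Y(1)]\)`.

```r
# Toy illustration: with discrete A and L and known f(A | L),
# the two forms of the g-formula agree.
set.seed(42)
n <- 1e5
L <- rbinom(n, 1, 0.4)
A <- rbinom(n, 1, ifelse(L == 1, 0.7, 0.3))
Y <- rnorm(n, mean = 2 * A + L)

# Standardization form: sum_l E[Y | A = 1, L = l] P[L = l]
EY1_std <- sum(sapply(0:1, function(l) mean(Y[A == 1 & L == l]) * mean(L == l)))

# IP weighting form: E[ I(A = 1) Y / f(A | L) ]
fA1_L <- ifelse(L == 1, 0.7, 0.3)  # P[A = 1 | L], known in this simulation
EY1_ipw <- mean((A == 1) * Y / fA1_L)

c(EY1_std, EY1_ipw)  # both are close to the true E[Y(1)] = 2 + E[L] = 2.4
```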
---
# 2. Inverse Probability Weighting

---
## NHEFS Data

- Cigarette smokers aged 25-74 were recruited around 1971 and given a survey.
- Ten years later, participants were given a follow-up survey.
- We are interested in estimating the effect of quitting smoking on weight change on the additive scale.

---
## Variables

- Our exposure is binary (whether or not a person quit smoking between the first and second survey).
- The outcome is the change in weight in kg.
- What other features would you want to know in order to estimate the causal effect?

--

- For now we will assume that the following variables are sufficient:
  + Sex (0: male, 1: female)
  + Age (in years)
  + Race (0: white, 1: other)
  + Education (5 categories)
  + Intensity of smoking (cigarettes per day)
  + Duration of smoking (years)
  + Physical activity in daily life (3 categories)
  + Recreational exercise (3 categories)
  + Weight (kg)

---
## Limitations

- For now we will focus on individuals with complete data.
  + There are some people who did not fill in the second survey.
  + For these people we know covariates but don't know either the exposure or the outcome.
  + Additionally, there are 63 people who did fill in the second survey but whose weight is unknown.
  + More on this later.
- Individuals may have quit at different times.
  + We could think of `\(A\)` as a time-varying treatment (coming up later).

---
## Simple Analysis

- The simplest analysis is to compare the mean weight change between quitters and non-quitters.

<table> <thead> <tr> <th style="text-align:right;"> Quit Smoking </th> <th style="text-align:right;"> Average Weight Change (kg) </th> <th style="text-align:right;"> N </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1.98 </td> <td style="text-align:right;"> 1163 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4.53 </td> <td style="text-align:right;"> 403 </td> </tr> </tbody> </table>

---
## Quitters v Non-Quitters

<table> <thead> <tr> <th style="text-align:left;"> Variable </th> <th style="text-align:left;"> Did Not Quit Smoking </th> <th style="text-align:left;"> Quit Smoking </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Total </td> <td style="text-align:left;"> 1163 </td> <td style="text-align:left;"> 403 </td> </tr> <tr> <td style="text-align:left;"> Age, years </td> <td style="text-align:left;"> 42.8 (11.8) </td> <td style="text-align:left;"> 46.2 (12.2) </td> </tr> <tr> <td style="text-align:left;"> Female </td> <td style="text-align:left;"> 53.4% (621) </td> <td style="text-align:left;"> 45.4% (183) </td> </tr> <tr> <td style="text-align:left;"> Non-White </td> <td style="text-align:left;"> 14.6% (170) </td> <td style="text-align:left;"> 8.9% (36) </td> </tr> <tr> <td style="text-align:left;"> Cigarettes/day </td> <td style="text-align:left;"> 21.2 (11.5) </td> <td style="text-align:left;"> 18.6 (12.4) </td> </tr> <tr> <td style="text-align:left;"> Years smoking </td> <td style="text-align:left;"> 24.1 (11.7) </td> <td style="text-align:left;"> 26 (12.7) </td> </tr> <tr> <td style="text-align:left;"> Weight, kg </td> <td style="text-align:left;"> 70.3 (15.2) </td> <td style="text-align:left;"> 72.4 (15.6) </td> </tr> <tr> <td style="text-align:left;"> College </td> <td style="text-align:left;"> 9.9% (115) </td> <td style="text-align:left;"> 15.4% (62) </td> </tr> <tr> <td style="text-align:left;"> Little exercise </td> <td style="text-align:left;"> 37.9% (441) </td> <td style="text-align:left;"> 40.7% (164) </td> </tr> <tr> <td style="text-align:left;"> Inactive life </td> <td style="text-align:left;"> 8.9% (104) </td> <td style="text-align:left;"> 11.2% (45) </td> </tr> </tbody> </table>

- Entries are mean (SD) for continuous variables and % (n) for categorical variables.

---
## Estimating Weights

- In our data, `\(L\)` is a vector of nine measurements.
- We cannot compute `\(P[A = 1 \vert L ]\)` within every (or any) stratum of `\(L\)` because every participant has their own unique value of `\(L\)`.
- We have to fit a parametric model.
- What would you do?
---
## Weights Estimated by Logistic Regression

- We will use a logistic regression model to predict `\(P[A \vert L]\)`.
- The goal is to get the best possible estimate of the probability of treatment given `\(L\)`, so we should fit a flexible model.
- Once we fit the model, we can estimate the probability of quitting for each person as `\(\pi_i = P[A = 1 \vert L = L_i]\)`. These are the probability-scale fitted values from the logistic model.
- The weight for individual `\(i\)` will be equal to `\(\frac{1}{\pi_i}\)` if `\(A_i = 1\)` (the person actually did quit) or `\(\frac{1}{1-\pi_i}\)` if `\(A_i = 0\)` (the person did not quit).

---
## Weights Estimated by Logistic Regression

```r
fit <- glm(qsmk ~ sex + race + age + I(age ^ 2) +
             as.factor(education) + smokeintensity + I(smokeintensity ^ 2) +
             smokeyrs + I(smokeyrs ^ 2) + as.factor(exercise) +
             as.factor(active) + wt71 + I(wt71 ^ 2),
           family = binomial(), data = dat)

dat$w <- ifelse(dat$qsmk == 0,
                1 / (1 - predict(fit, type = "response")),
                1 / (predict(fit, type = "response")))
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.054   1.230   1.373   1.996   1.990  16.700
```

- Why is the mean value close to 2?

---
## Positivity

- *Structural violations* of positivity occur when it is impossible for people with some levels of confounders to receive a particular level of treatment.
  + If we have structural violations, we cannot use IP weighting or standardization.
  + We need to restrict our inference to relevant strata of `\(L\)`.
- *Random violations* of positivity occur when certain combinations of `\(L\)` and `\(A\)` are missing from our data by chance.
  + We are using a parametric model, so we are able to smooth over unobserved covariate values.
  + We are able to predict for strata that were not observed in our data.
  + We should be careful about predicting outside the range of the observed data.

---
## Assessing Positivity in Propensity Scores

- If positivity holds, propensity scores should be bounded away from 0 and 1:
  + Scores very close to 0 or very close to 1 suggest that there are some strata of `\(L\)` that have essentially no chance of receiving one of the two treatments.
  + We might get scores close to 0 or 1 if there is perfect separation by one or a combination of confounders.
  + Scores close to 0 or 1 suggest there could be structural positivity violations.
- Propensity scores should have approximately the same range in both groups.
  + Non-overlapping ranges indicate that there are some regions of confounder space with only cases or only controls in our study.
  + If we believe the PS model, we can trust the weights and assume this is due to random positivity violations.
  + In some approaches based on propensity scores (matching and subclassification), this will cause a problem.

---
## Propensity Scores and Positivity

<img src="5_modeling1_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---
## Horvitz-Thompson Estimator

- The *Horvitz-Thompson estimator* is one of two options for estimating `\(E[Y(a)]\)` using our estimated weights.
- The Horvitz-Thompson estimator just plugs our estimated weights into our previous formula.
$$
\hat{E}[Y(a)] = \hat{E}\left[\frac{I(A = a)Y}{f(A\vert L)} \right] = \frac{1}{n}\sum_{i = 1}^n I(A_i = a)Y_i W_i
$$
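- As a concrete illustration of the formula above, here is a minimal sketch (not part of the original analysis) that computes the Horvitz-Thompson estimates directly, assuming the complete-case data `dat` and the weights `w` from the earlier chunk.

```r
# Horvitz-Thompson estimates of E[Y(0)], E[Y(1)], and their difference,
# computed by hand from the estimated IP weights.
n <- nrow(dat)
EY1_ht <- sum((dat$qsmk == 1) * dat$wt82_71 * dat$w) / n
EY0_ht <- sum((dat$qsmk == 0) * dat$wt82_71 * dat$w) / n
c(EY0 = EY0_ht, EY1 = EY1_ht, ACE = EY1_ht - EY0_ht)
```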
---
## Hajek Estimator

- The *Hajek estimator* fits the linear model
$$
E[Y \vert A] = \theta_0 + \theta_1 A
$$
by weighted least squares, weighting individuals by our estimated IP weights.
- It is equivalent to
`$$\frac{ \hat{E}\left[\frac{I(A = a)Y}{f(A\vert L)} \right]}{ \hat{E}\left[\frac{I(A = a)}{f(A\vert L)} \right]} = \frac{\sum_{i = 1}^n I(A_i = a)Y_i W_i}{\sum_{i = 1}^n I(A_i = a) W_i}$$`

---
## Hajek Estimator

- The Hajek estimator is unbiased for
`$$\frac{ E\left[\frac{I(A = a)Y}{f(A\vert L)} \right]}{ E\left[\frac{I(A = a)}{f(A\vert L)} \right]}$$`
- If positivity holds, then
$$
E\left[\frac{I(A = a)}{f(A\vert L)} \right] = 1
$$
- The Hajek estimator is guaranteed to be between 0 and 1 for dichotomous outcomes. The Horvitz-Thompson estimator is not.

---
## IP Weighted Effect Estimate in NHEFS

- This code computes the Hajek estimator:

```r
f_w_lm <- lm(wt82_71 ~ qsmk, data = dat, weights = w)
f_w_lm
```

```
## 
## Call:
## lm(formula = wt82_71 ~ qsmk, data = dat, weights = w)
## 
## Coefficients:
## (Intercept)         qsmk  
##       1.780        3.441
```

- We estimate that if nobody had quit smoking, the average weight gain would have been 1.8 kg.
- If everyone had quit smoking, the average weight gain would have been 5.2 kg.
- The average causal effect of quitting smoking on weight gain is 3.4 kg.

---
## Estimating the Variance of the Estimate

- The raw variance estimate from weighted least squares does not account for uncertainty in the weights.
- We need to account for having estimated the IP weights.
- Options:
  1. Derive the variance analytically.
  2. Bootstrap the variance.
  3. Use a robust sandwich variance estimate.
- Option 3 is conservative but easier than options 1 and 2.
- Option 3 can be achieved by fitting with GEE using an independent working correlation matrix, or by using the `HC0` option in `vcovHC`.

---
## Variance Estimate in the NHEFS Data

```r
library("geepack")
f_w_gee <- geeglm(wt82_71 ~ qsmk, data = dat, weights = w,
                  id = seqn, corstr = "independence")
beta <- coef(f_w_gee)
SE <- coef(summary(f_w_gee))[, 2]
lcl <- beta - qnorm(0.975) * SE
ucl <- beta + qnorm(0.975) * SE
cbind(beta, lcl, ucl)
```

```
##                 beta      lcl      ucl
## (Intercept) 1.779978 1.339514 2.220442
## qsmk        3.440535 2.410587 4.470484
```

```r
library(sandwich)
# Equivalent alternative using sandwich
v_wlm_robust <- vcovHC(f_w_lm, type = "HC0")
all.equal(v_wlm_robust, vcov(f_w_gee))
```

```
## [1] TRUE
```

---
## Stabilized Weights

- Using weights `\(W^{A} = 1/f(A \vert L)\)`, we create a pseudo-population in which (heuristically) each person is matched by someone exactly like them who received the opposite treatment.
- Our pseudo-population is twice as big as our actual sample, so the mean of `\(W^A\)` is 2.
- In the pseudo-population, the frequency of treatment `\(A = 1\)` is 50%.
- We could have created other pseudo-populations.

---
## Stabilized Weights

- Our requirement is that, in the pseudo-population, the probability of treatment is independent of `\(L\)`.
- But different people could have different probabilities of treatment.
- To create stabilized weights, we construct a pseudo-population in which the probability of receiving each treatment is the same as the frequency of the treatment in our original sample.
$$
SW^A = \frac{f(A)}{f(A \vert L)}
$$

---
## Stabilized Weights

- In our data, `\(A\)` is binary, so we can compute `\(f(1)\)` as the proportion of quitters in the data.

```r
p1 <- mean(dat$qsmk); p1
```

```
## [1] 0.2573436
```

- In the pseudo-population created by the stabilized weights, each person in our data set corresponds to 26% of a treated person and 74% of an untreated person.

```r
dat <- dat %>%
  mutate(pa = ifelse(qsmk == 0, 1 - p1, p1),
         sw = pa * w)
summary(dat$sw)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3312  0.8665  0.9503  0.9988  1.0793  4.2977
```
- The expected value of `\(SW^A\)` is 1 because the pseudo-population is the same size as the observed data.

---
## Estimation Using the Stabilized Weights

```r
f_sw_gee <- geeglm(wt82_71 ~ qsmk, data = dat, weights = sw,
                   id = seqn, corstr = "independence")
beta <- coef(f_sw_gee)
SE <- coef(summary(f_sw_gee))[, 2]
lcl <- beta - qnorm(0.975) * SE
ucl <- beta + qnorm(0.975) * SE
cbind(beta, lcl, ucl)
```

```
##                 beta      lcl      ucl
## (Intercept) 1.779978 1.339514 2.220442
## qsmk        3.440535 2.410587 4.470484
```

- These are exactly the results we saw with the unstabilized weights.

---
## Why Use Stabilized Weights

- Differences between stabilized and non-stabilized weights only occur when the model for `\(E[Y \vert A]\)` is not saturated.
- When the model is not saturated, stabilized weights typically result in greater efficiency.

---
## Marginal Structural Models

- In the IP weighting strategy, we create a pseudo-population in which `\(A\)` is independent of `\(L\)` and then fit the model
`$$E_{ps}[Y \vert A ] = \theta_0 + \theta_1 A$$`
- If we believe our conditional exchangeability assumption `\(Y(a) \ci A \vert L\)` and the propensity score model, then in the pseudo-population, `\(E_{ps}[Y \vert A = a] = E[Y(a)]\)`.
- So the parameter `\(\theta_1\)` is interpretable as the average causal effect `\(E[Y(1)] - E[Y(0)]\)`.

--

- We have proposed a model for the average counterfactual:
`$$E[Y(a)] = \beta_0 + \beta_1 a$$`

--

- This model is *marginal* because we have marginalized (averaged) over the values of all other covariates.
- It is structural because it is a model for a counterfactual `\(E[Y(a)]\)`.

---
## Modeling Continuous Treatments

- With a binary treatment, we could construct a saturated model.
- For continuous (or highly polytomous) treatments, we can't do that and will have to use a parametric model instead.
- In the NHEFS data, let `\(A\)` be the change in smoking intensity ( `\(A = 10\)` indicates that a person increased their smoking by 10 cigarettes per day).
- We will limit to only those who smoked 25 or fewer cigarettes per day at baseline ( `\(N = 1,162\)` ).
- We can propose a model for our *dose-response curve*
$$
E[Y(a)] = \beta_0 + \beta_1 a + \beta_2 a^2
$$

---
## Estimating Weights for the Continuous Treatment

- To use stabilized weights, we need to estimate `\(f(A \vert L)\)` and `\(f(A)\)`.
- Both of these are probability density functions, which are hard to estimate!
- We will assume that both `\(f(A \vert L)\)` and `\(f(A)\)` are normal with constant variance across participants.
  + These assumptions are almost certainly wrong.
- We will estimate the mean of `\(f(A \vert L)\)` via linear regression.
- IP weighted estimates for continuous treatments can be very sensitive to the choice of weights.

---
## Estimating Weights for the Continuous Treatment

- We fit a linear model for change in smoking intensity including:
  + sex, race, education, exercise, life activity
  + age and age squared
  + baseline smoking intensity and baseline smoking intensity squared
  + years smoked and years smoked squared
  + baseline weight and baseline weight squared
- The fitted values of this model give us `\(E[A \vert L]\)`.
- We can use the variance of the residuals as an estimate of the variance of `\(A \vert L\)`.
- We then compute `\(f(A \vert L)\)` using `dnorm`.
---
## Estimating Weights for the Continuous Treatment

```r
dat.s <- filter(dat, smokeintensity <= 25)

fal_fit <- lm(smkintensity82_71 ~ as.factor(sex) + as.factor(race) +
                age + I(age ^ 2) + as.factor(education) +
                smokeintensity + I(smokeintensity ^ 2) +
                smokeyrs + I(smokeyrs ^ 2) +
                as.factor(exercise) + as.factor(active) +
                wt71 + I(wt71 ^ 2),
              data = dat.s)
fal_mean <- predict(fal_fit, type = "response")
fal <- dnorm(dat.s$smkintensity82_71, fal_mean, summary(fal_fit)$sigma)
```

---
## Estimating Weights for the Continuous Treatment

- To estimate `\(f(A)\)`, we use the same procedure, except that the estimate of `\(E[A]\)` is just the average change in smoking intensity.

```r
fa_fit <- lm(smkintensity82_71 ~ 1, data = dat.s)
fa_mean <- predict(fa_fit, type = "response")
fa <- dnorm(dat.s$smkintensity82_71, fa_mean, summary(fa_fit)$sigma)

dat.s$sw_cont <- fa / fal
dat.s$w_cont <- 1 / fal
```

---
## Estimating Weights for the Continuous Treatment

- For comparison, we compute both the stabilized weights and the non-stabilized weights.

```r
summary(dat.s$sw_cont)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1938  0.8872  0.9710  0.9968  1.0545  5.1023
```

```r
summary(dat.s$w_cont)
```

```
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     25.4     26.4     30.0   1078.1     47.5 428611.4
```

- The non-stabilized weights are much more variable!

---
## Fitting the Continuous Treatment Model

- We then fit the marginal structural model
$$
E[Y(a)] = \beta_0 + \beta_1 a + \beta_2 a^2
$$
using exactly the same strategy we used before (a sketch of this fit is shown after the table).

<table> <thead> <tr> <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Stabilized Weights</div></th> <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Non-Stabilized Weights</div></th> </tr> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> Estimate </th> <th style="text-align:left;"> 95% CI </th> <th style="text-align:left;"> </th> <th style="text-align:left;"> Estimate </th> <th style="text-align:left;"> 95% CI </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Intercept </td> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> (1.4, 2.6) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 7.7 </td> <td style="text-align:left;"> (1, 14) </td> </tr> <tr> <td style="text-align:left;"> Smk Int </td> <td style="text-align:left;"> -0.11 </td> <td style="text-align:left;"> (-0.17, -0.047) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 0.1 </td> <td style="text-align:left;"> (-0.21, 0.42) </td> </tr> <tr> <td style="text-align:left;"> Smk Int Squared </td> <td style="text-align:left;"> 0.0027 </td> <td style="text-align:left;"> (-0.002, 0.0074) </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> -0.01 </td> <td style="text-align:left;"> (-0.024, 0.0039) </td> </tr> </tbody> </table>
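- For reference, here is a minimal sketch (not shown in the original slides) of how the stabilized-weight column above could be fit, reusing `dat.s`, `sw_cont`, and `seqn` from the earlier chunks; swapping in `w_cont` would give the non-stabilized fit.

```r
# Weighted quadratic MSM fit, using the same GEE / robust-SE approach as the
# binary-treatment analysis (assumes geepack is loaded and dat.s, sw_cont,
# and seqn exist as defined above).
msm_cont <- geeglm(wt82_71 ~ smkintensity82_71 + I(smkintensity82_71^2),
                   data = dat.s, weights = sw_cont,
                   id = seqn, corstr = "independence")
beta <- coef(msm_cont)
SE <- coef(summary(msm_cont))[, 2]
cbind(beta, lcl = beta - qnorm(0.975) * SE, ucl = beta + qnorm(0.975) * SE)
```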
---
## Continuous Model Results

<img src="5_modeling1_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---
## Issues with the Continuous Treatment Model

- We had to make strong assumptions about the distribution of `\(A \vert L\)`.
- There are structural positivity issues -- it is not possible for everyone to reduce their smoking by 80 cigarettes per day.

---
## Effect Modification in Marginal Structural Models

- We might be interested in estimating the causal effect within strata of `\(V\)`.
- We can propose a marginal structural model that includes `\(V\)`
$$
E[Y(a) \vert V] = \beta_0 + \beta_1 a + \beta_2 a V + \beta_3 V
$$
- If `\(V\)` and `\(A\)` are both dichotomous, this model is saturated.
- It is not really fully marginal any more, but we still call it a marginal structural model.

---
## Interpreting Parameters in the Effect Modification Model

- Suppose that `\(V\)` is sex (0: Male, 1: Female) and `\(A\)` is quitting/not quitting.
$$
E[Y(a) \vert V] = \beta_0 + \beta_1 a + \beta_2 a V + \beta_3 V
$$
- `\(\beta_1\)` is the causal effect of quitting for men.
- `\(\beta_1 + \beta_2\)` is the causal effect of quitting for women.
- What is `\(\beta_3\)`?

--

- `\(\beta_3\)` is not a causal parameter because we haven't made any exchangeability assumptions about `\(V\)`.
- `\(\beta_3\)` is the difference between `\(E[Y(0)]\)` in females and `\(E[Y(0)]\)` in males, but it is not the causal effect of sex.

---
## Stabilized Weights for the EM Model

- We have two choices for stabilized weights
$$
\frac{f(A)}{f(A \vert L)} \qquad \text{or} \qquad \frac{f(A\vert V)}{f(A \vert L)}
$$
- Which will be more efficient?

--

- Conditioning on `\(V\)` in the numerator will make the numerator closer to the denominator.
- The second choice of weights will be less variable.
- So the second choice of weights is more efficient.

---
## Stabilized Weights for the EM Model

- We need to compute `\(f(A \vert V)\)` in our data.
- We can just use the stratified means.

```r
fav_fit <- lm(qsmk ~ sex, data = dat)
dat$pav <- ifelse(dat$qsmk == 0,
                  1 - predict(fav_fit, type = "response"),
                  predict(fav_fit, type = "response"))
dat <- mutate(dat, sw_em = pav * w)
summary(dat$sw_em)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2930  0.8751  0.9554  0.9989  1.0803  3.8011
```

---
## Fitting the Effect Modification Model

```r
msm_em <- geeglm(wt82_71 ~ qsmk * sex, data = dat,
                 weights = sw_em, id = seqn,
                 corstr = "independence")
beta <- coef(msm_em)
SE <- coef(summary(msm_em))[, 2]
lcl <- beta - qnorm(0.975) * SE
ucl <- beta + qnorm(0.975) * SE
cbind(beta, lcl, ucl)
```

```
##                     beta        lcl       ucl
## (Intercept)  1.784446876  1.1771686 2.3917252
## qsmk         3.521977634  2.2341453 4.8098100
## sex         -0.008724784 -0.8883884 0.8709389
## qsmk:sex    -0.159478525 -2.2097583 1.8908013
```

- We have no strong evidence of effect modification by sex.

---
## Unweighted Regression vs IP Weighting

- What if we had conditioned on every variable in `\(L\)`?
- The stabilized weights would be 1, so we would just be running a regular regression.
- If we believe in the model for `\(E[Y(a) \vert L]\)`, then the regression coefficient is interpretable as a (conditional) causal parameter.
- Using the marginal model with IP weighting, we need to believe that our estimates of `\(f(A \vert L)\)` are accurate.
- In some cases, solving a prediction problem is easier than guessing the correct structural model for all confounders.

---
## Censoring and Missing Data With IP Weights

- In our analysis, we have removed 63 individuals who filled in the survey in 1982 but who have a missing value for weight.
<table> <thead> <tr> <th style="text-align:right;"> Quit Smoking </th> <th style="text-align:right;"> Number Missing </th> <th style="text-align:right;"> Percent Missing </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 38 </td> <td style="text-align:right;"> 3.2 </td> </tr> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 25 </td> <td style="text-align:right;"> 5.8 </td> </tr> </tbody> </table>

- We have more missing data for people who quit smoking than for people who did not, so we could have selection bias.

---
## Weights for Censoring

- We learned in L4 that, to correct for selection bias, we need to find a set of variables `\(L_c\)` such that `\(Y(C = 0) \ci C \vert L_c\)`.
- We then need to estimate `\(W^C = 1/P[C = 0 \vert A, L_c]\)`, or the stabilized version
`$$SW^C = \frac{P[C = 0 \vert A]}{P[C = 0 \vert A, L_c]}$$`
- We will use the same set of variables we used to compute the weights for confounding.
- The total weight will be `\(SW^C \cdot SW^A\)`.

---
## Weights for Censoring

- We will estimate `\(f(C=0 \vert A, L)\)` using a logistic model adjusting for
  + the exposure
  + sex, race, age, education, exercise, life activity
  + linear and quadratic terms for smoking intensity, duration, and baseline weight.
- The `dat.u` data frame includes data for all participants.

```r
dat.u$cens <- ifelse(is.na(dat.u$wt82), 1, 0)

fcal_fit <- glm(cens ~ as.factor(qsmk) + as.factor(sex) +
                  as.factor(race) + age + I(age ^ 2) +
                  as.factor(education) + smokeintensity +
                  I(smokeintensity ^ 2) + smokeyrs + I(smokeyrs ^ 2) +
                  as.factor(exercise) + as.factor(active) +
                  wt71 + I(wt71 ^ 2),
                family = binomial(), data = dat.u)
fc0al <- 1 - predict(fcal_fit, type = "response")
```

---
## Weights for Censoring

- We can estimate the numerator of the stabilized weight, `\(f(C = 0 \vert A)\)`, using the average within levels of `\(A\)`.

```r
fca_fit <- glm(cens ~ as.factor(qsmk), family = binomial(), data = dat.u)
fc0a <- 1 - predict(fca_fit, type = "response")
sw.c <- fc0a / fc0al
dat$sw.c <- sw.c[!is.na(dat.u$wt82)]
dat$sw.comb <- dat$sw * dat$sw.c
```

```r
summary(dat$sw.c)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9442  0.9782  0.9858  0.9990  1.0035  1.7180
```

```r
summary(dat$sw.comb)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3390  0.8593  0.9442  0.9974  1.0749  4.1829
```
---
## Combined Weights Results

- Below are the results using the combined weights for both censoring and confounding.

<table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> Estimate </th> <th style="text-align:left;"> 95% CI </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Intercept </td> <td style="text-align:left;"> 1.69 </td> <td style="text-align:left;"> (1.23, 2.14) </td> </tr> <tr> <td style="text-align:left;"> Quit Smoking </td> <td style="text-align:left;"> 3.4 </td> <td style="text-align:left;"> (2.36, 4.45) </td> </tr> </tbody> </table>

- Compared with the previous results:

<table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> Estimate </th> <th style="text-align:left;"> 95% CI </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Intercept </td> <td style="text-align:left;"> 1.78 </td> <td style="text-align:left;"> (1.34, 2.22) </td> </tr> <tr> <td style="text-align:left;"> Quit Smoking </td> <td style="text-align:left;"> 3.44 </td> <td style="text-align:left;"> (2.41, 4.47) </td> </tr> </tbody> </table>

---
## Survey Weights

- Typically, nationally representative surveys won't use a completely random sampling scheme.
  + They will try to over-sample smaller groups.
  + This allows them to get better estimates for less common covariate values.
- Over-sampling introduces selection bias that needs to be corrected.
- If the survey was done well, a specific over-sampling rule was used.
  + For example, you might select census tracts randomly according to a covariate-informed probability and then select houses randomly within each census tract.
- This allows the survey administrators to compute `\(P[C = 0 \vert L]\)` for every participant.

---
## Survey Weights

- If `\(L\)` includes every factor affecting study inclusion, then `\(P[C = 0 \vert A, L] = P[C = 0 \vert L]\)` for any exposure `\(A\)`.
- So we could use the same set of weights regardless of what exposure we are looking at.
- Usually, survey administrators will compute `\(P[ C = 0 \vert L]\)` and distribute `\(1/P[C = 0 \vert L]\)` as "survey weights".
  + Survey weights are censoring weights!
- If you have survey data that come with survey weights, be sure to read about how these are calculated.
  + Most likely, if you want to estimate a parameter for the larger population, you will need to use the survey weights.

---
# 3. G-Computation (Standardization)

---
## Standardization

- In standardization, we compute an estimate of `\(E[Y \vert A = a, L = l]\)` and plug it into the g-formula:
$$
E[Y(a)] = \sum_l E[Y \vert A = a, L = l]P[L = l]
$$
- If `\(L\)` is continuous, we need to integrate
$$
E[Y(a)] = \int_l E[Y \vert A = a, L = l]f(l) dl
$$

---
## Outcome Modeling

- In the standardization formula, we need to estimate `\(E[Y \vert A = a, L = l]\)`.
- To do this we fit a linear regression.

```r
f_y <- glm(wt82_71 ~ qsmk + sex + race + age + I(age * age) +
             as.factor(education) + smokeintensity +
             I(smokeintensity * smokeintensity) + smokeyrs +
             I(smokeyrs * smokeyrs) + as.factor(exercise) +
             as.factor(active) + wt71 + I(wt71 * wt71) +
             qsmk * smokeintensity,
           data = dat.u)
summary(f_y)$coefficients %>% head()
```
---
## Outcome Modeling

- This first step of fitting the outcome model is just fitting a regular linear regression.
- Like we said when we talked about MSMs, if we believe the model, then we can use the coefficients to calculate a *conditional* causal effect of treatment within strata of `\(L\)`.
- If there were no interactions between `\(A\)` and `\(L\)` in the model, the coefficient on `\(A\)` would give the marginal treatment effect.
  + This is true using a linear model but would not be true in a model with a non-linear link.
- An outcome model with no interactions assumes **homogeneity**, i.e. the average treatment effect is the same in every level of `\(L\)`.
  + Typically, homogeneity is considered to be a very strong assumption.

---
## Outcome Modeling

- Our outcome model included an interaction between `\(A\)` and smoking intensity.
- In this model, the coefficient on `\(A\)` is the average treatment effect within strata of smoking intensity.
- To estimate the average (marginal) treatment effect, we need to average over the distribution of smoking intensity, or more generally the distribution of `\(L\)`.

---
## Standardizing (G-Computation)

- Fortunately, we do not need to compute `\(f(L)\)`.
- Using the iterated expectation formula, we can write
$$
\int_l E[Y \vert A = a, L = l]f(l)dl = E\left[ E\left[Y \vert A = a, L \right]\right]
$$
- So we can estimate the standardized mean as
$$
\frac{1}{n}\sum_{i=1}^n\hat{E}[Y \vert A = a, L_i]
$$

---
## Standardizing

- For each person, we need to compute two values, `\(\hat{E}[Y \vert A = 0, L_i]\)` and `\(\hat{E}[Y\vert A = 1, L_i]\)`.
- These are predicted values that we can get out of our previous regression.
- We predict each person's outcome with `\(A = 0\)` and with `\(A = 1\)`.

```r
# Make two copies of the data
dat0 <- dat1 <- dat.u
# In one copy everyone got treatment 0
dat0$qsmk <- 0
# In the second copy everyone got treatment 1
dat1$qsmk <- 1

Y0 <- predict(f_y, newdata = dat0, type = "response") %>% mean()
Y1 <- predict(f_y, newdata = dat1, type = "response") %>% mean()
cat(round(Y1, digits = 2), "-", round(Y0, digits = 2), "=",
    round(Y1 - Y0, digits = 2))
```

```
## 5.18 - 1.66 = 3.52
```

---
## Censoring

- In our case, we have assumed that our set of confounders is also sufficient to adjust for censoring.
- In the previous step, we predicted weight gain for all subjects, even those who had missing values in the original data.
- If our outcome model is correct *and* `\(Y(C = 0) \ci C \vert L\)`, then this is sufficient to eliminate selection bias if `\(L\)` is observed without censoring.
- If not, we may need to add additional variables specifically to deal with censoring.
- Recall that the s-backdoor criterion gives us a way to select variables to condition on.

---
## Estimating the Variance

- To compute the variance of the estimate, we need to bootstrap.
- We will repeatedly re-sample our data with replacement.
- Each time, we recompute the standardized effect we just computed in the observed data.
- The standard deviation of our bootstrap samples consistently estimates the standard error of our estimate.
---
## Bootstrapping the Variance

- Step 1: Write a function to compute the standardized mean.

```r
# Step 1: Write a function to compute the standardized mean
std_mean_func <- function(d){
  boot_fit <- glm(wt82_71 ~ qsmk + sex + race + age + I(age * age) +
                    as.factor(education) + smokeintensity +
                    I(smokeintensity * smokeintensity) + smokeyrs +
                    I(smokeyrs * smokeyrs) + as.factor(exercise) +
                    as.factor(active) + wt71 + I(wt71 * wt71) +
                    qsmk * smokeintensity,
                  data = d)
  d0 <- d1 <- d
  d0$qsmk <- 0
  d1$qsmk <- 1
  Y0 <- predict(boot_fit, newdata = d0, type = "response") %>% mean()
  Y1 <- predict(boot_fit, newdata = d1, type = "response") %>% mean()
  return(Y1 - Y0)
}
```

---
## Bootstrapping the Variance

- Step 2: Repeatedly re-sample the data and compute the standardized mean in each re-sampled data set.

```r
set.seed(1)
samples <- replicate(n = 500,
                     expr = sample(seq(nrow(dat.u)), size = nrow(dat.u),
                                   replace = TRUE))
res <- apply(samples, 2, FUN = function(ix){
  boot_dat <- dat.u[ix,]
  boot_est <- std_mean_func(boot_dat)
  return(boot_est)
})
se <- sd(res)
eff <- Y1 - Y0
ci <- eff + c(-1, 1) * qnorm(0.975) * se
cat(round(eff, digits = 2), "(", round(ci, digits = 2), ")")
```

```
## 3.52 ( 2.58 4.46 )
```

---
## Effect Modification in G-Computation

- If we want to estimate the average causal effect stratified by `\(V\)`, we can simply stratify by `\(V\)` before averaging our predicted values.

```r
dat.u$Y0 <- predict(f_y, newdata = dat0, type = "response")
dat.u$Y1 <- predict(f_y, newdata = dat1, type = "response")

dat.u %>%
  group_by(sex) %>%
  summarize(Y0 = mean(Y0), Y1 = mean(Y1))
```

---
## IP Weighting vs G-Computation

- When we use non-parametric (saturated) models, IP weighting and standardization are the same.
- In parametric models, they rely on different assumptions and so are not the same.
- In the IP weighted analysis, we need to trust our model for `\(f(A \vert L)\)`.
- In the standardized analysis, we need to trust our model for `\(E[Y \vert A, L]\)`.
- In either case, if our parametric models are wrong, our predictions will be biased and so our estimators will be biased.
- It may be wise to compute both and compare (see the sketch on the final slide).

---
## IP Weighting vs G-Computation

- Chatton et al (2020) present simulation results comparing IPW, G-computation, and two methods we haven't talked about yet.
- In their simulation scenario, G-computation always had a lower MSE than IPW.
- For IPW, they calculated the variance using robust SEs (as we did earlier) and found slight over-coverage (~97% for a 95% target).
- They used parametric simulation to compute the variance of the G-computation estimate. This gave close to target coverage but slightly under in some cases (~94%).

---
## Chatton et al Results

<img src="5_modeling1_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />

---
## IP Weighting vs G-Computation

- Snowden et al (2010) describe G-computation as rarely used in epidemiology due to lack of awareness.
- However, G-computation can be more precise than IP weighting in some circumstances.
  + IP weighting can be inefficient when weights are very large.
- Vansteelandt and Keiding note that IPW performs better when the covariate distribution is very different in cases and controls.
  + Depending on the model used for G-computation, the dominant group may have too much influence over the estimate in these regions.
  + G-computation may under-estimate the uncertainty about the counterfactuals in these regions.
- Some people suggest that we shouldn't be trying to estimate the counterfactuals in these regions at all, since in order to estimate them, we need to extrapolate into regions with a low density of data points.
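- Following the earlier suggestion to compute both and compare, here is a minimal sketch (not part of the original slides) that puts the two point estimates side by side, assuming the objects created earlier in the lecture: `dat` with the stabilized weights `sw`, and the outcome model `f_y` with the copies `dat0` and `dat1`.

```r
# IP weighting (Hajek: weighted least squares with stabilized weights)
ipw_est <- coef(lm(wt82_71 ~ qsmk, data = dat, weights = sw))["qsmk"]

# G-computation (difference in standardized predicted outcomes)
gcomp_est <- mean(predict(f_y, newdata = dat1, type = "response")) -
  mean(predict(f_y, newdata = dat0, type = "response"))

round(c(IPW = unname(ipw_est), G_comp = gcomp_est), 2)
```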