Simulate multivariate GWAS data

Usage

sim_mv(
  N,
  J,
  h2,
  pi,
  G = 0,
  est_s = FALSE,
  R_obs = NULL,
  R_E = NULL,
  R_LD = NULL,
  af = NULL,
  snp_effect_function = "normal",
  snp_info = NULL,
  sporadic_pleiotropy = TRUE,
  pi_exact = FALSE,
  h2_exact = FALSE
)

Arguments

N: GWAS sample size. N can be a scalar, vector, or matrix. If N is a scalar, all GWAS have the same sample size and there is no overlap between studies. If N is a vector, each element of N specifies the sample size of the corresponding GWAS and there is no overlap between studies. If N is a matrix, N_ii specifies the sample size of study i and N_ij specifies the number of samples present in both study i and study j. The elements of N must be positive but non-integer values will not generate an error.
J: Number of variants to simulate
h2: A scalar or vector giving the heritability of each trait.
pi: A scalar or vector giving the expected proportion of direct effect SNPs for each trait.
G: Matrix of direct effects. Rows correspond to the 'from' trait and columns correspond to the 'to' trait, so G[1,2] is the direct effect of trait 1 on trait 2. G should have 0 on the diagonal.
est_s: If TRUE, return estimates of se(`beta_hat`). If FALSE, the exact standard error of `beta_hat` is returned. Defaults to FALSE.
R_obs: Total observational correlation between traits. R_obs won't impact summary statistics unless there is sample overlap. See Details for default behavior.
R_E: Total correlation of the environmental components only. R_E and R_obs are alternative methods of specifying trait correlation. Use only one of these two options. R_E may be phased out in the future.
R_LD: Optional list of LD blocks. R_LD should have class list. Each element of R_LD can be either a) a matrix, b) a sparse matrix (class dsCMatrix) or c) an eigen decomposition (class eigen). All elements should be correlation matrices, meaning that they have 1 on the diagonal and are positive definite. See Details and vignettes.
af: Optional vector of allele frequencies. If R_LD is not supplied, af can be a scalar, vector or function. If af is a function it should take a single argument (n) and return a vector of n allele frequencies (See Examples). If R_LD is supplied, af must be a vector with length equal to the size of the supplied LD pattern (See Examples).
snp_effect_function: Optional function to generate variant effects. snp_effect_function can be a single function or list of functions of length equal to the number of traits (see Details).
snp_info: Optional data.frame of variant information to be passed to variant effect functions. If R_LD is specified, snp_info should have number of rows equal to the size of the supplied LD pattern. Otherwise snp_info should have J rows.
sporadic_pleiotropy: Allow sporadic pleiotropy between traits. Defaults to TRUE.
pi_exact: If TRUE, the number of direct effect SNPs for each trait will be exactly equal to round(pi*J).
h2_exact: If TRUE, the heritability of each trait will be exactly `h2`.
return_dat: Useful development option, not recommend for general users.

Value

A list with the following elements:

Simulated effect estimates and standard errors are contained in matrices

+ beta_hat: Effect estimates for each trait

+ se_beta_hat Standard error of effect estimates, equal to sqrt(1/N_m *Var(G_j)).

+ s_estimate Estimate of se_beta_hat. Present only if est_s = TRUE.

Four matrices contain direct and total marginal and joint SNP-trait associations: + direct_SNP_effects_marg and direct_SNP_effects_joint give direct effects of SNPs on traits. These are the same if there is no LD. If there is LD, direct_SNP_effects_marg is the direct component of the expected marginal association.

+ beta_marg and beta_joint give the total SNP effects (direct and indirect). These are the same if there is no LD. If there is LD, beta_marg is the total expected marginal association. i.e. beta_marg is the expected value of beta_hat.

Four matrices describe the covariance of traits and the row correlation of effect estimates

+ Sigma_G Genetic variance-covariance matrix. Diagonal elements are equal to heritability.

+ Sigma_E Environmental variance-covariance matrix.

+ trait_corr Population correlation of traits. trait_corr = Sigma_G + Sigma_E.

+ R Row correlation of beta_hat - beta_marg, equal to trait_corr scaled by the overlap proportion matrix.

Finally,

+ snp_info Is a data frame with variant information. If R_LD was omitted, snp_info contains only the allele frequency of each variant. If R_LD was included, snp_info also contains block and replicate information as well as any information supplied to the snp_info input parameter.

Details

This function generates GWAS summary statistics from a linear SEM specified by the matrix G. The previous "xyz" mode is now deprecated. If you are used to using this function in xyz mode, you can generate the corresponding G using the xyz_to_G function (see examples).

G should be square (#traits by #traits) matrix with 0s on the diagonal. All traits have variance 1, so G[i,j]^2 is the proportion of variance of trait j explained by the direct effect of trait i. You will get an error if you specify a cyclic DAG. There are also some combinations of DAG and heritability that are impossible and will give an error. For example in a two trait DAG where trait 1 has an effect of \(x\) on trait 2, and trait 1 has a heritability of \(h^2_1\), trait 2 must have a heritability of at least \((x^2)*(h^2_1)\).

To generate data with LD, supply R_LD which can be a list of matrices, sparse matrices, or eigen-decompositions. These matrices are interpreted as blocks in a block-diagonal LD matrix. The af argument must be provided and should be a vector with length equal to the total size of the LD pattern (the sum of the sizes of each block). R_LD does not need to have the same size as J. The LD pattern will be repeated or subset as necessary to generate the desired number of variants.

If R_obs is NULL (default value), we assume that direct environmental effects on each trait are independent and all environmental correlation results from the relationships specified in G. Alternatively R_obs can be any positive definite correlation matrix.

The snp_effect_function argument supplies a custom function to generate standardized direct variant effects. By default, standardized direct effects are generated from a normal distribution. This means that in the default mode, all variants have equal expected heritability explained regardless of allele frequency or other features. The function(s) passed to snp_effect_function will override this default. This function should take three arguments, n, sd, and snp_info. The output should be a vector of length n with expected total sum of squares equal to sd^2. Note: Even though the final effects returned will usually be on the per-allele scale, snp_effect_function should return standardized (per-sd) effect sizes.

Examples

# Two traits with no causal relationship and some environmental correlation
# specify completely overlapping GWAS
N <- matrix(1000, nrow = 2, ncol =2)
R_obs <- matrix(c(1, 0.3, 0.3, 1), nrow = 2, ncol = 2)
dat <- sim_mv(G = 2, N = N, J = 20000, h2 = c(0.4, 0.3), pi = 1000/20000,
               R_obs = R_obs)
#> SNP effects provided for 20000 SNPs and 2 traits.
dat$R # This is the true correlation of the estimation error of beta_hat
#>      [,1] [,2]
#> [1,]  1.0  0.3
#> [2,]  0.3  1.0
cor(dat$beta_hat - dat$beta_marg) # Should be similar to dat$R
#>           [,1]      [,2]
#> [1,] 1.0000000 0.2935818
#> [2,] 0.2935818 1.0000000

# The af argument can be a scalar, vector, or function.
dat <- sim_mv(N = N, J = 20000, h2 = c(0.4, 0.3), pi = 1000/20000,
               G = 2, R_obs = R_obs, af = function(n){rbeta(n = n, 1, 5)})
#> SNP effects provided for 20000 SNPs and 2 traits.

# A very simple example with LD
# Use a pattern of two small blocks of LD
A1 <- matrix(0.7, nrow = 10, ncol = 10)
diag(A1) <- 1
A2 <- matrix(0.1, nrow = 6, ncol = 6)
diag(A2) <- 1
# If using LD, af should have the same size as the LD pattern
af <- runif(n = 16)
dat <- sim_mv(N = N, J = 20000, h2 = c(0.4, 0.3), pi = 1000/20000,
               G = 2, R_obs = R_obs, R_LD = list(A1, A2), af = af)
#> SNP effects provided for 20000 SNPs and 2 traits.

# Use xyz_to_G to generate G from xyz specification
myG <- xyz_to_G(tau_xz = c(0.2, -0.3), tau_yz = c(0.1, 0.25),
        dir_xz = c(1, -1), dir_yz = c(1,1), gamma = 0)
# If N is a scalar or a vector, there is no sample overlap
dat <- sim_mv(N = 10000, J = 20000, h2 = rep(0.4, 4),
              pi = c(500, 500, 1000, 1000)/20000,
              G = myG)
#> SNP effects provided for 20000 SNPs and 4 traits.
plot(dat$beta_marg[,3], dat$beta_marg[,1])
abline(0, dat$total_trait_effects[3,1])