Simulate summary statistics — sim

Simulate summary statistics from a specified factor structure

Usage

sim_lf(
  F_mat,
  N,
  J,
  h2_trait,
  omega,
  h2_factor,
  pi_L,
  pi_theta,
  est_s = FALSE,
  R_E = NULL,
  R_obs = NULL,
  R_LD = NULL,
  af = NULL,
  sporadic_pleiotropy = TRUE,
  h2_exact = FALSE,
  pi_exact = FALSE,
  snp_effect_function_L = "normal",
  snp_effect_function_theta = "normal",
  snp_info = NULL
)

Arguments

F_mat: factor matrix M by K (M = number of traits, K = number of factors)
N: GWAS sample size. N can be a scalar, vector, or matrix. If N is a scalar, all GWAS have the same sample size and there is no overlap between studies. If N is a vector, each element of N specifies the sample size of the corresponding GWAS and there is no overlap between studies. If N is a matrix, N_ii specifies the sample size of study i and N_ij specifies the number of samples present in both study i and study j. The elements of N must be positive but non-integer values will not generate an error.
J: Total number of SNPs to generate
h2_trait: Heritability of each trait. Length M vector.
omega: Proportion of trait heritability mediated by factors. Length M vector.
h2_factor: Heritability of each factor. Length K vector.
pi_L: Proportion of non-zero elements in L_k. Length K factor
pi_theta: Proportion of non-zero elements in theta. Scalar or length M vector.
est_s: If TRUE, return estimates of se(beta_hat).
R_E: Correlation between environmental trait components not mediated by factors. M by M pd matrix.
R_obs: Observational correlation between traits. M by M pd matrix. At most one of R_E and R_obs can be specified.
R_LD: List of LD blocks. R_LD should have class list. Each element of R_LD can be either a) a matrix, b) a sparse matrix (class dsCMatrix) or c) an eigen decomposition (class eigen). All elements should be correlation matrices, meaning that they have 1 on the diagonal and are positive definite.
af: Optional vector of allele frequencies. If R_LD is not supplied, af can be a scalar, vector or function. If af is a function it should take a single argument (n) and return a vector of n allele frequencies (See Examples). If R_LD is supplied, af must be a vector with length equal to the size of the supplied LD pattern (See Examples).
sporadic_pleiotropy: Allow sporadic pleiotropy between traits. Defaults to TRUE.
h2_exact: If TRUE, the heritability of each trait will be exactly h2.
pi_exact: If TRUE, the number of direct effect SNPs for each trait will be exactly equal to round(pi*J).
snp_effect_function_L, snp_effect_function_theta: Optional function to generate variant effects in L or theta. snp_effect_function_L/theta can be a single function or list of functions of length equal to the number of factors/number of traits.
snp_info: Optional data.frame of variant information to be passed to variant effect functions. If R_LD is specified, snp_info should have number of rows equal to the size of the supplied LD pattern. Otherwise snp_info should have J rows.

Details

This function will generate GWAS summary statistics for M traits with K common factors. The matrix F_mat provides the effects of each factor on each trait, F_mat[i,j] gives the effect of factor j on trait i. The rows of F_mat will be scaled in order to provide desired proportion of hertiability of each trait explained by factors but the relative size and sign of elements within rows will be retained.

A random factor matrix can be generated using generate_random_F (see Examples).

It is possible to supply a non-feasible set of parameters. Usually this occurs if the heritability of the factors is low but the heritability of the traits is high leading to a contradiction. The function will return an error if this happens.

Trait covariance: Each trait is composed of four independent components, the genetic component mediated by factors, the environmental component mediated by factors, the genetic component not mediated by factors, and the environmental component not mediated by factors. Therefore, the total trait covariance can be decomposed into the sum of four corresponding covariance matrices.

Cov(T) = Sigma_FG + Sigma_FE + Sigma_GDir + Sigma_EDir

We assume that all cross-trait genetic sharing is explained by the factors so that Sigma_GDir is diagonal. Each factor is a sum of a genetic component and an environmental components and factors are independent (both genetic and environmental components) are independent across factors. This means that Sigma_FG = F S_FG F^T and Sigma_FE = F S_FE F^T where S_FG and S_FE are diagonal matrices. The parameter R_E specifies the correlation of the residual environmental component (i.e. R_E = cov2cor(Sigma_EDir). Alternatively, if R_obs is specified, Sigma_EDir will be chosen to give the desired observational correlation. In the returned object, Sigma_G is equal to the sum of the two genetic covariance components and Sigma_E is equal to the sum of the two environmental components. R gives the overall trait correlation matrix multiplied by the overlap proportion matrix, which is equal to the correlation in the error terms of beta_hat (See Examples).

Examples

myF <- generate_random_F(K = 3, M = 10, nz_factor = c(2, 3, 2),
                        omega = rep(0.8, 10),
                        h2_trait = rep(0.6, 10), pad = TRUE)
dat <- sim_lf(myF, N = 10000, J = 20000, h2_trait = rep(0.6, 10),
                      omega = rep(0.8, 10), pi_L = 0.1, pi_theta = 0.1)
#> SNP effects provided for 20000 SNPs and 10 traits.

myF <- diag(2)
N <- matrix(c(10000, 8000, 8000, 10000), nrow = 2)
R_E <- matrix(c(1, 0.6, 0.6, 1), nrow = 2)
dat <- sim_lf(F_mat = myF, N = N, J = 20000, h2_trait = rep(0.6, 2), omega = rep(1, 2), h2_factor = rep(1, 2),
                       pi_L = 0.1, pi_theta = 0.1, R_E = R_E)
#> SNP effects provided for 20000 SNPs and 2 traits.
dat$R
#>           [,1]      [,2]
#> [1,] 1.0000000 0.1846886
#> [2,] 0.1846886 1.0000000
cor(dat$beta_hat[,1]-dat$beta_joint[,1], dat$beta_hat[,2]-dat$beta_joint[,2])
#> [1] 0.1809276