Cox Mixed-Effects Model for Genome-Wide Association Studies

Liang He

2019-09-10

Overview

Time-to-event is one of the most important phenotypes in genetic epidemiology. The R package coxmeg provides a set of utilities to fit a Cox mixed-effects model and to perform efficient genome-wide association analysis of time-to-event phenotypes with such a model.

Installation

Most-recent version

Functions

The current version provides four functions.

Fit a Cox mixed-effects model with a sparse relatedness matrix

We illustrate how to use coxmeg to fit a Cox mixed-effects model with a sparse relatedness matrix. We first simulate a block-diagonal relatedness matrix for a cohort consisting of 200 families, each of which has five members.

library(coxmeg)
## Loading required package: Rcpp
library(MASS)
library(Matrix)
n_f <- 200
mat_list <- list()
size <- rep(5,n_f)
offd <- 0.5
for(i in 1:n_f)
{
  mat_list[[i]] <- matrix(offd,size[i],size[i])
  diag(mat_list[[i]]) <- 1
}
sigma <- as.matrix(bdiag(mat_list))
sigma = as(sigma,'dgCMatrix')

We use ‘dgCMatrix’ to save memory. Next, we simulate random effects and time-to-event outcomes assuming a constant baseline hazard function. We assume that the variance component is 0.2. We also simulate a risk factor with log(HR)=0.1.

n = nrow(sigma)
tau_var <- 0.2
x <- mvrnorm(1, rep(0,n), tau_var*sigma)
pred = rnorm(n,0,1)
myrates <- exp(x+0.1*pred-1)
y <- rexp(n, rate = myrates)
cen <- rexp(n, rate = 0.02 )
ycen <- pmin(y, cen)
outcome <- cbind(ycen,as.numeric(y <= cen))
head(outcome)
##            ycen  
## [1,] 10.3919174 1
## [2,]  2.0724135 1
## [3,]  2.3335449 1
## [4,]  0.6183988 1
## [5,]  1.3229176 1
## [6,]  2.1235979 1
sigma[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##                         
## [1,] 1.0 0.5 0.5 0.5 0.5
## [2,] 0.5 1.0 0.5 0.5 0.5
## [3,] 0.5 0.5 1.0 0.5 0.5
## [4,] 0.5 0.5 0.5 1.0 0.5
## [5,] 0.5 0.5 0.5 0.5 1.0
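
To give a rough idea of the memory saved by the sparse format, the following sketch compares the size of a dense and a sparse copy of the same block-diagonal matrix. It rebuilds the matrix so that it runs on its own; the names blocks, sigma_dense and sigma_sparse are introduced here for illustration only.

```r
library(Matrix)

# Rebuild the block-diagonal relatedness matrix: 200 families of five members
n_f <- 200
blocks <- lapply(seq_len(n_f), function(i) {
  m <- matrix(0.5, 5, 5)
  diag(m) <- 1
  m
})
sigma_sparse <- bdiag(blocks)            # sparse matrix from the Matrix package
sigma_dense <- as.matrix(sigma_sparse)   # ordinary dense base R matrix

# Compare the memory footprint (in bytes) of the two representations
size_dense <- as.numeric(object.size(sigma_dense))
size_sparse <- as.numeric(object.size(sigma_sparse))
c(dense = size_dense, sparse = size_sparse)
```

Only the 5000 within-family entries are stored in the sparse copy, whereas the dense copy stores all one million entries.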

We fit a Cox mixed-effects model using the coxmeg function. We set dense=FALSE to indicate that the relatedness matrix is sparse. However, the function will automatically treat it as dense if there are more than 50% non-zero elements in the matrix. We set order=1 to use the first-order approximation of the inverse Hessian matrix in the optimization.

re = coxmeg(outcome,sigma,pred,order=1,dense=FALSE)
## Remove 0 subjects censored before the first failure.
## There is/are 1 predictors. The sample size included is 1000.
## The relatedness matrix is treated as sparse.
re
## $beta
##           [,1]
## [1,] 0.1725636
## 
## $HR
##          [,1]
## [1,] 1.188347
## 
## $sd_beta
## [1] 0.03799398
## 
## $p
##              [,1]
## [1,] 5.575808e-06
## 
## $tau
## [1] 0.2989253
## 
## $iter
## [1] 18
## 
## $rank
## [1] 1000
## 
## $nsam
## [1] 1000
## 
## $int_ll
## [1] 11573.21

In the above result, tau is the estimated variance component, and int_ll is -2*log(lik) of the integrated/marginal likelihood of tau.
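
The remaining fields are related in a simple way. As a minimal sketch, assuming that p is a two-sided Wald p-value (which is consistent with the printed values), HR and p can be recovered from beta and sd_beta. The confidence interval at the end is an addition for illustration and is not part of the coxmeg output.

```r
# Estimates copied from the coxmeg output above
beta <- 0.1725636
sd_beta <- 0.03799398

hr <- exp(beta)            # hazard ratio
z <- beta / sd_beta        # Wald statistic
p <- 2 * pnorm(-abs(z))    # two-sided p-value under a standard normal

# 95% Wald confidence interval for the hazard ratio (illustrative addition)
ci <- exp(beta + c(-1, 1) * qnorm(0.975) * sd_beta)

c(HR = hr, z = z, p = p)
ci
```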

Note that when the relatedness matrix is symmetric positive definite (SPD), setting dense=FALSE lets coxmeg exploit sparsity whether it is the relatedness matrix or its inverse that is sparse. When the relatedness matrix is symmetric positive semidefinite (SPSD), however, coxmeg can exploit sparsity only if its inverse is sparse. If the relatedness matrix is SPSD and its inverse is dense, setting dense=FALSE will result in worse performance. In that case, it is better to use dense=TRUE, or to convert the relatedness matrix to an SPD or block-diagonal matrix if possible.
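
Before choosing dense=FALSE, it can help to check how sparse a matrix and its inverse actually are. The sketch below uses a toy block-diagonal matrix (4 families of 5 members, a hypothetical size chosen to keep the example small) and base R only. The object names and the tolerance of 1e-10 are assumptions made for this illustration.

```r
# Toy block-diagonal relatedness matrix: 4 families of 5 members
block <- matrix(0.5, 5, 5)
diag(block) <- 1
sigma_toy <- kronecker(diag(4), block)

# Fraction of non-zero entries in the matrix itself
density <- mean(sigma_toy != 0)

# Fraction of non-zero entries in the inverse; the tolerance guards
# against round-off noise in the numerical inverse
inv_toy <- solve(sigma_toy)
density_inv <- mean(abs(inv_toy) > 1e-10)

c(matrix = density, inverse = density_inv)
```

For a block-diagonal matrix the inverse is block-diagonal as well, so both fractions are equal here and well below the 50% threshold mentioned earlier.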

We compare the results with those from coxme. The estimates differ slightly because the two packages use different approximations of the log-determinant when estimating the variance component. For the same reason, the integrated log-likelihoods cannot be compared directly.

library(coxme)
## Loading required package: survival
## Loading required package: bdsmatrix
## 
## Attaching package: 'bdsmatrix'
## The following object is masked from 'package:base':
## 
##     backsolve
bls <- c(1)
for(i in (size[1]-1):1)
{bls <- c(bls, c(rep(offd,i),1))}
tmat <- bdsmatrix(blocksize=size, blocks=rep(bls,n_f),dimnames=list(as.character(1:n),as.character(1:n)))
re_coxme = coxme(Surv(outcome[,1],outcome[,2])~as.matrix(pred)+(1|as.character(1:n)), varlist=list(tmat),ties='breslow')
re_coxme
## Cox mixed-effects model fit by maximum likelihood
## 
##   events, n = 953, 1000
##   Iterations= 7 35 
##                     NULL Integrated  Fitted
## Log-likelihood -5637.324  -5619.088 -5421.2
## 
##                    Chisq     df          p   AIC     BIC
## Integrated loglik  36.47   2.00 1.2024e-08 32.47   22.75
##  Penalized loglik 432.25 176.15 0.0000e+00 79.94 -776.11
## 
## Model:  Surv(outcome[, 1], outcome[, 2]) ~ as.matrix(pred) + (1 | as.character(1:n)) 
## Fixed coefficients
##                      coef exp(coef)   se(coef)    z       p
## as.matrix(pred) 0.1727242  1.188538 0.03804476 4.54 5.6e-06
## 
## Random effects
##  Group             Variable Std Dev   Variance 
##  as.character.1.n. Vmat.1   0.5487300 0.3011047

Perform GWAS of an age-at-onset phenotype with a sparse relatedness matrix

We illustrate how to perform a GWAS using the coxmeg_plink function. This function supports plink bed files. We provide example files in the package. The example plink files include 20 SNPs and 3000 subjects from 600 families. The following code performs a GWAS for all SNPs in the example bed files. The coxmeg_plink function will write a temporary .gds file for the SNPs in the folder specified by tmp_dir. The user needs to specify a tmp_dir to store the temporary file when bed is provided. The temporary file is removed after the analysis is done.

library(coxmeg)
bed = system.file("extdata", "example_null.bed", package = "coxmeg")
bed = substr(bed,1,nchar(bed)-4)
pheno = system.file("extdata", "ex_pheno.txt", package = "coxmeg")
cov = system.file("extdata", "ex_cov.txt", package = "coxmeg")

## building a relatedness matrix
n_f <- 600
mat_list <- list()
size <- rep(5,n_f)
offd <- 0.5
for(i in 1:n_f)
{
  mat_list[[i]] <- matrix(offd,size[i],size[i])
  diag(mat_list[[i]]) <- 1
}
sigma <- as.matrix(bdiag(mat_list))

re = coxmeg_plink(pheno,sigma,bed=bed,tmp_dir=tempdir(),cov_file=cov,detap=TRUE,dense=FALSE,verbose=FALSE)
## Excluding 0 SNP on non-autosomes
## Excluding 0 SNP (monomorphic: TRUE, MAF: 0.05, missing rate: 0)
re
## $summary
##     snp.id chromosome position allele      afreq   index         beta
## 1   null_0          1        1    d/D 0.30983333  null_0  0.015672101
## 2   null_1          1        2    d/D 0.23466667  null_1  0.019439150
## 3   null_2          1        3    D/d 0.14033333  null_2 -0.049845757
## 4   null_3          1        4    D/d 0.16183333  null_3  0.044130767
## 5   null_4          1        5    d/D 0.19933333  null_4  0.028473176
## 6   null_5          1        6    D/d 0.11800000  null_5 -0.114319159
## 7   null_6          1        7    d/D 0.09483333  null_6 -0.017981231
## 8   null_7          1        8    D/d 0.49683333  null_7 -0.004207897
## 9   null_8          1        9    d/D 0.31366667  null_8 -0.063741849
## 10  null_9          1       10    D/d 0.49183333  null_9 -0.008409562
## 11 null_10          1       11    d/D 0.34833333 null_10 -0.013581479
## 12 null_11          1       12    D/d 0.25100000 null_11  0.037508301
## 13 null_12          1       13    d/D 0.17500000 null_12 -0.017215848
## 14 null_13          1       14    D/d 0.06333333 null_13 -0.068207724
## 15 null_14          1       15    D/d 0.20833333 null_14 -0.013965386
## 16 null_15          1       16    d/D 0.17050000 null_15  0.002172773
## 17 null_16          1       17    D/d 0.33550000 null_16  0.004762350
## 18 null_17          1       18    d/D 0.26633333 null_17  0.001786995
## 19 null_18          1       19    D/d 0.09433333 null_18 -0.016052310
## 20 null_19          1       20    d/D 0.11650000 null_19 -0.022398126
##           HR    sd_beta           p
## 1  1.0157956 0.02938524 0.593803537
## 2  1.0196293 0.03222054 0.546298835
## 3  0.9513762 0.03860368 0.196628160
## 4  1.0451190 0.03701019 0.233106387
## 5  1.0288824 0.03432500 0.406811816
## 6  0.8919732 0.04234095 0.006934636
## 7  0.9821795 0.04655562 0.699325464
## 8  0.9958009 0.02717805 0.876957699
## 9  0.9382472 0.02958441 0.031195036
## 10 0.9916257 0.02730686 0.758108827
## 11 0.9865103 0.02859980 0.634872392
## 12 1.0382206 0.03113254 0.228282858
## 13 0.9829315 0.03628637 0.635183349
## 14 0.9340664 0.05698849 0.231357835
## 15 0.9861317 0.03431600 0.684034201
## 16 1.0021751 0.03685682 0.952990554
## 17 1.0047737 0.02859957 0.867749134
## 18 1.0017886 0.03098518 0.954009439
## 19 0.9840758 0.04731969 0.734435643
## 20 0.9778508 0.04231689 0.596600710
## 
## $tau
## [1] 0.04028041
## 
## $rank
## [1] 3000
## 
## $nsam
## [1] 3000

The above code first retrieves the full path of the files. If the full path is not given, coxmeg_plink will search the current working directory. The file name of the bed file should not include the suffix (.bed). The phenotype and covariate files have the same format as used in plink, and the IDs must be consistent with the bed files. Specifically, the phenotype file should include four columns including family ID, individual ID, time, and status. The covariate file always starts with two columns, family ID and individual ID. Missing values in the phenotype and covariate files are denoted by -9 and NA, respectively. In the current version, the coxmeg_plink function does not impute genotypes itself, and only SNPs without missing values will be analyzed, so it will be better to use imputed genotype data.
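
As an illustration of the file layout described above, the following sketch writes a toy phenotype file and a toy covariate file and reads them back. The IDs, values, file names and the covariate sex are hypothetical. The sketch also assumes that the files are whitespace-delimited without a header line, and that status is coded as 1 for an event and 0 for censoring, as in the simulated outcome earlier.

```r
# Hypothetical toy data: 2 families with 2 members each
pheno_toy <- data.frame(
  fid = c("F1", "F1", "F2", "F2"),
  iid = c("I1", "I2", "I3", "I4"),
  time = c(61.2, 70.5, -9, 55.0),  # -9 marks a missing phenotype value
  status = c(1, 0, -9, 1)
)
cov_toy <- data.frame(
  fid = pheno_toy$fid,
  iid = pheno_toy$iid,
  sex = c(1, 2, 1, NA)             # NA marks a missing covariate value
)

# Write whitespace-delimited files without a header line (an assumption)
pheno_path <- file.path(tempdir(), "toy_pheno.txt")
cov_path <- file.path(tempdir(), "toy_cov.txt")
write.table(pheno_toy, pheno_path, quote = FALSE,
            row.names = FALSE, col.names = FALSE)
write.table(cov_toy, cov_path, quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Read the files back to check the layout
pheno_back <- read.table(pheno_path)
cov_back <- read.table(cov_path)
pheno_back
cov_back
```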

The coxmeg_plink function first estimates the variance component with only the covariates, and then uses it to analyze each SNP after filtering. These two steps can be done separately as follows. The first command, without bed, only estimates the variance component tau, and the second command uses the estimated tau to analyze the SNPs.

re = coxmeg_plink(pheno,sigma,cov_file=cov,detap=TRUE,dense=FALSE,verbose=FALSE)
re
## $tau
## [1] 0.04028041
## 
## $iter
## [1] 15
## 
## $rank
## [1] 3000
## 
## $nsam
## [1] 3000
re = coxmeg_plink(pheno,sigma,bed=bed,tmp_dir=tempdir(),tau=re$tau,cov_file=cov,detap=TRUE,dense=FALSE,verbose=FALSE)
## Excluding 0 SNP on non-autosomes
## Excluding 0 SNP (monomorphic: TRUE, MAF: 0.05, missing rate: 0)
re
## $summary
##     snp.id chromosome position allele      afreq   index         beta
## 1   null_0          1        1    d/D 0.30983333  null_0  0.015672101
## 2   null_1          1        2    d/D 0.23466667  null_1  0.019439150
## 3   null_2          1        3    D/d 0.14033333  null_2 -0.049845757
## 4   null_3          1        4    D/d 0.16183333  null_3  0.044130767
## 5   null_4          1        5    d/D 0.19933333  null_4  0.028473176
## 6   null_5          1        6    D/d 0.11800000  null_5 -0.114319159
## 7   null_6          1        7    d/D 0.09483333  null_6 -0.017981231
## 8   null_7          1        8    D/d 0.49683333  null_7 -0.004207897
## 9   null_8          1        9    d/D 0.31366667  null_8 -0.063741849
## 10  null_9          1       10    D/d 0.49183333  null_9 -0.008409562
## 11 null_10          1       11    d/D 0.34833333 null_10 -0.013581479
## 12 null_11          1       12    D/d 0.25100000 null_11  0.037508301
## 13 null_12          1       13    d/D 0.17500000 null_12 -0.017215848
## 14 null_13          1       14    D/d 0.06333333 null_13 -0.068207724
## 15 null_14          1       15    D/d 0.20833333 null_14 -0.013965386
## 16 null_15          1       16    d/D 0.17050000 null_15  0.002172773
## 17 null_16          1       17    D/d 0.33550000 null_16  0.004762350
## 18 null_17          1       18    d/D 0.26633333 null_17  0.001786995
## 19 null_18          1       19    D/d 0.09433333 null_18 -0.016052310
## 20 null_19          1       20    d/D 0.11650000 null_19 -0.022398126
##           HR    sd_beta           p
## 1  1.0157956 0.02938524 0.593803537
## 2  1.0196293 0.03222054 0.546298835
## 3  0.9513762 0.03860368 0.196628160
## 4  1.0451190 0.03701019 0.233106387
## 5  1.0288824 0.03432500 0.406811816
## 6  0.8919732 0.04234095 0.006934636
## 7  0.9821795 0.04655562 0.699325464
## 8  0.9958009 0.02717805 0.876957699
## 9  0.9382472 0.02958441 0.031195036
## 10 0.9916257 0.02730686 0.758108827
## 11 0.9865103 0.02859980 0.634872392
## 12 1.0382206 0.03113254 0.228282858
## 13 0.9829315 0.03628637 0.635183349
## 14 0.9340664 0.05698849 0.231357835
## 15 0.9861317 0.03431600 0.684034201
## 16 1.0021751 0.03685682 0.952990554
## 17 1.0047737 0.02859957 0.867749134
## 18 1.0017886 0.03098518 0.954009439
## 19 0.9840758 0.04731969 0.734435643
## 20 0.9778508 0.04231689 0.596600710
## 
## $tau
## [1] 0.04028041
## 
## $rank
## [1] 3000
## 
## $nsam
## [1] 3000
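
In a GWAS, the per-SNP p-values are usually compared against a threshold that accounts for multiple testing. The following sketch applies a Bonferroni correction to the 20 p-values copied from the summary table above. The choice of Bonferroni and of alpha = 0.05 is an assumption made for illustration, not something coxmeg does itself.

```r
# Wald p-values copied from the $summary table above
p <- c(0.593803537, 0.546298835, 0.196628160, 0.233106387, 0.406811816,
       0.006934636, 0.699325464, 0.876957699, 0.031195036, 0.758108827,
       0.634872392, 0.228282858, 0.635183349, 0.231357835, 0.684034201,
       0.952990554, 0.867749134, 0.954009439, 0.734435643, 0.596600710)
names(p) <- paste0("null_", 0:19)

alpha <- 0.05
threshold <- alpha / length(p)    # Bonferroni threshold for 20 tests
hits <- names(p)[p < threshold]   # SNPs passing the corrected threshold

# Equivalent view: Bonferroni-adjusted p-values
p_adj <- p.adjust(p, method = "bonferroni")

threshold
hits
head(sort(p_adj), 3)
```

No SNP passes the corrected threshold in this example; the smallest p-value belongs to null_5.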

When the genotypes of a group of SNPs are stored in a matrix, the function coxmeg_m can be used to perform GWAS for each of the SNPs. Similarly, coxmeg_m first estimates the variance component without the SNPs. In the following example, we simulate 10 independent SNPs, and use coxmeg_m to perform association analysis.

geno = matrix(rbinom(nrow(sigma)*10,2,runif(nrow(sigma)*10,0.05,0.5)),nrow(sigma),10)
pheno_m = read.table(pheno)
re = coxmeg_m(geno,pheno_m[,3:4],sigma,detap=TRUE,dense=FALSE,verbose=FALSE)
re
## $summary
##            beta        HR    sd_beta          p
## 1   0.021467532 1.0216996 0.02956773 0.46781054
## 2  -0.004205801 0.9958030 0.03011438 0.88892795
## 3  -0.013959685 0.9861373 0.02996833 0.64134818
## 4  -0.025261625 0.9750548 0.02902697 0.38414671
## 5   0.065897194 1.0681169 0.02940553 0.02502739
## 6   0.007653429 1.0076828 0.03003647 0.79887407
## 7  -0.038480171 0.9622508 0.02961391 0.19380838
## 8  -0.009497958 0.9905470 0.02980277 0.74995871
## 9   0.037853723 1.0385793 0.02933529 0.19691816
## 10 -0.042613350 0.9582818 0.02999035 0.15534527
## 
## $tau
## [1] 0.04052206
## 
## $rank
## [1] 3000
## 
## $nsam
## [1] 3000
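
Note that the simulation above draws a separate allele frequency for every genotype entry, so all 10 SNPs share the same expected allele frequency. If SNPs with distinct allele frequencies are wanted, one frequency can be drawn per SNP instead. The following is a minimal sketch of that alternative; the seed and the names freq, geno_alt and afreq are arbitrary choices for this illustration.

```r
set.seed(1)                       # arbitrary seed for reproducibility
n <- 3000                         # number of subjects
n_snp <- 10                       # number of SNPs

freq <- runif(n_snp, 0.05, 0.5)   # one allele frequency per SNP
# Each column holds the genotypes (0, 1 or 2 copies) of one SNP
geno_alt <- sapply(freq, function(f) rbinom(n, 2, f))

# Observed allele frequency of each simulated SNP
afreq <- colMeans(geno_alt) / 2
round(cbind(simulated = freq, observed = afreq), 3)
```

The matrix geno_alt has the same layout as geno (subjects in rows, SNPs in columns), so it could be passed to coxmeg_m in the same way.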

Perform GWAS of an age-at-onset phenotype with a dense relatedness matrix

When the relatedness matrix is dense and large (>5000), it is more efficient to specify dense=TRUE and to use the preconditioned conjugate gradient solver (solver=2) and stochastic Lanczos quadrature (detap=TRUE) in the optimization. These options can be specified as follows.

re = coxmeg_plink(pheno,sigma,bed=bed,tmp_dir=tempdir(),cov_file=cov,detap=TRUE,dense=TRUE,verbose=FALSE,solver=2)
## Excluding 0 SNP on non-autosomes
## Excluding 0 SNP (monomorphic: TRUE, MAF: 0.05, missing rate: 0)
re
## $summary
##     snp.id chromosome position allele      afreq   index         beta
## 1   null_0          1        1    d/D 0.30983333  null_0  0.015404190
## 2   null_1          1        2    d/D 0.23466667  null_1  0.019314010
## 3   null_2          1        3    D/d 0.14033333  null_2 -0.049315969
## 4   null_3          1        4    D/d 0.16183333  null_3  0.044002515
## 5   null_4          1        5    d/D 0.19933333  null_4  0.028347057
## 6   null_5          1        6    D/d 0.11800000  null_5 -0.113929813
## 7   null_6          1        7    d/D 0.09483333  null_6 -0.018283745
## 8   null_7          1        8    D/d 0.49683333  null_7 -0.004175230
## 9   null_8          1        9    d/D 0.31366667  null_8 -0.063689854
## 10  null_9          1       10    D/d 0.49183333  null_9 -0.008400074
## 11 null_10          1       11    d/D 0.34833333 null_10 -0.013541845
## 12 null_11          1       12    D/d 0.25100000 null_11  0.037145202
## 13 null_12          1       13    d/D 0.17500000 null_12 -0.017170157
## 14 null_13          1       14    D/d 0.06333333 null_13 -0.068107205
## 15 null_14          1       15    D/d 0.20833333 null_14 -0.014179297
## 16 null_15          1       16    d/D 0.17050000 null_15  0.002301227
## 17 null_16          1       17    D/d 0.33550000 null_16  0.004810214
## 18 null_17          1       18    d/D 0.26633333 null_17  0.001500359
## 19 null_18          1       19    D/d 0.09433333 null_18 -0.016065045
## 20 null_19          1       20    d/D 0.11650000 null_19 -0.022188725
##           HR    sd_beta           p
## 1  1.0155234 0.02929760 0.599038719
## 2  1.0195017 0.03211922 0.547625222
## 3  0.9518803 0.03847821 0.199962083
## 4  1.0449850 0.03689630 0.233026252
## 5  1.0287527 0.03422378 0.407508865
## 6  0.8923206 0.04223117 0.006980651
## 7  0.9818824 0.04643159 0.693744838
## 8  0.9958335 0.02711199 0.877610197
## 9  0.9382960 0.02947246 0.030695669
## 10 0.9916351 0.02723227 0.757732112
## 11 0.9865494 0.02850902 0.634785424
## 12 1.0378437 0.03103654 0.231376041
## 13 0.9829764 0.03617303 0.635024121
## 14 0.9341603 0.05681236 0.230601970
## 15 0.9859208 0.03420779 0.678504352
## 16 1.0023039 0.03674958 0.950069799
## 17 1.0048218 0.02852559 0.866089239
## 18 1.0015015 0.03090305 0.961277504
## 19 0.9840633 0.04715458 0.733337748
## 20 0.9780556 0.04220244 0.599048887
## 
## $tau
## [1] 0.03614275
## 
## $rank
## [1] 3000
## 
## $nsam
## [1] 3000

The above command estimates HRs and reports p-values. Instead, a score test, which is computationally much more efficient, can be used by specifying score=TRUE.

re = coxmeg_plink(pheno,sigma,bed=bed,tmp_dir=tempdir(),tau=re$tau,cov_file=cov,detap=TRUE,dense=TRUE,verbose=FALSE,solver=2,score=TRUE)
## Excluding 0 SNP on non-autosomes
## Excluding 0 SNP (monomorphic: TRUE, MAF: 0.05, missing rate: 0)
re
## $summary
##     snp.id chromosome position allele      afreq   index  score_test
## 1   null_0          1        1    d/D 0.30983333  null_0 0.276484586
## 2   null_1          1        2    d/D 0.23466667  null_1 0.361595590
## 3   null_2          1        3    D/d 0.14033333  null_2 1.641833015
## 4   null_3          1        4    D/d 0.16183333  null_3 1.422774170
## 5   null_4          1        5    d/D 0.19933333  null_4 0.686367612
## 6   null_5          1        6    D/d 0.11800000  null_5 7.296248610
## 7   null_6          1        7    d/D 0.09483333  null_6 0.155135978
## 8   null_7          1        8    D/d 0.49683333  null_7 0.023749750
## 9   null_8          1        9    d/D 0.31366667  null_8 4.664929704
## 10  null_9          1       10    D/d 0.49183333  null_9 0.095176585
## 11 null_10          1       11    d/D 0.34833333 null_10 0.225605581
## 12 null_11          1       12    D/d 0.25100000 null_11 1.432454321
## 13 null_12          1       13    d/D 0.17500000 null_12 0.225330766
## 14 null_13          1       14    D/d 0.06333333 null_13 1.437458476
## 15 null_14          1       15    D/d 0.20833333 null_14 0.171816004
## 16 null_15          1       16    d/D 0.17050000 null_15 0.003922069
## 17 null_16          1       17    D/d 0.33550000 null_16 0.028467336
## 18 null_17          1       18    d/D 0.26633333 null_17 0.002358340
## 19 null_18          1       19    D/d 0.09433333 null_18 0.115959311
## 20 null_19          1       20    d/D 0.11650000 null_19 0.276623990
##              p
## 1  0.599014656
## 2  0.547621418
## 3  0.200074157
## 4  0.232947327
## 5  0.407402645
## 6  0.006909873
## 7  0.693674770
## 8  0.877523367
## 9  0.030784684
## 10 0.757696555
## 11 0.634801543
## 12 0.231364309
## 13 0.635007819
## 14 0.230551057
## 15 0.678502885
## 16 0.950063988
## 17 0.866014793
## 18 0.961267763
## 19 0.733458949
## 20 0.598922560
## 
## $tau
## [1] 0.03614275
## 
## $rank
## [1] 3000
## 
## $nsam
## [1] 3000

In the results, score_test is the score test statistic, which follows a chi-squared distribution with one degree of freedom under the null hypothesis.
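
The reported p-values can be reproduced from the score statistics with a chi-squared distribution with one degree of freedom. The sketch below does this for two SNPs from the table above; the values are copied from that output.

```r
# Score statistics for null_5 and null_8, copied from the table above
score_test <- c(null_5 = 7.296248610, null_8 = 4.664929704)

# Upper-tail probability of a chi-squared distribution with 1 df
p <- pchisq(score_test, df = 1, lower.tail = FALSE)
p
```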

Handle positive semidefinite relatedness matrices

We now assume that the first two subjects in the sample are monozygotic twins. In this case, the relatedness matrix becomes positive semidefinite. Specifying spd=FALSE will tell coxmeg_plink to handle a positive semidefinite relatedness matrix.

sigma[2,1] = sigma[1,2] = 1
re = coxmeg_plink(pheno,sigma,cov_file=cov,detap=TRUE,dense=FALSE,verbose=FALSE,spd=FALSE)
## Warning in chol.default(x, pivot = TRUE): the matrix is either rank-
## deficient or indefinite
re
## $tau
## [1] 0.04038302
## 
## $iter
## [1] 15
## 
## $rank
## [1] 2999
## 
## $nsam
## [1] 3000

The warning indicates that the relatedness matrix is not full rank. Because there is a twin pair in the sample, the rank of the relatedness matrix is less than the sample size. If the user is not sure whether the relatedness matrix is positive definite or positive semidefinite, it is better to use spd=FALSE.
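
Whether a relatedness matrix is positive definite can be checked before the analysis. The following sketch uses a toy matrix with 2 families of 5 members (a hypothetical size chosen to keep the example small) and compares the smallest eigenvalue and the rank with and without a monozygotic twin pair. The helper min_eig is defined here for illustration only.

```r
# Toy block-diagonal relatedness matrix: 2 families of 5 members
block <- matrix(0.5, 5, 5)
diag(block) <- 1
sigma_toy <- kronecker(diag(2), block)

# Make subjects 1 and 2 monozygotic twins
sigma_twin <- sigma_toy
sigma_twin[1, 2] <- sigma_twin[2, 1] <- 1

# Smallest eigenvalue: positive for SPD, (numerically) zero for a singular matrix
min_eig <- function(m) {
  min(eigen(m, symmetric = TRUE, only.values = TRUE)$values)
}
c(spd = min_eig(sigma_toy), twin = min_eig(sigma_twin))

# Numerical rank: full without the twin pair, one less with it
c(spd = qr(sigma_toy)$rank, twin = qr(sigma_twin)$rank)
```

With the twin pair, rows 1 and 2 are identical, so the rank drops by one, in the same way that the rank reported above drops from 3000 to 2999.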