SPACox: SaddlePoint Approximation implementation of a survival...
In WenjianBI/SPACox: SaddlePoint Approximation implementation of survival analysis

View source: R/Library.r

A fast and accurate method for a genome-wide survival analysis on a large-scale dataset.

SPACox(
  obj.null,
  Geno.mtx,
  Cutoff = 2,
  impute.method = "fixed",
  missing.cutoff = 0.15,
  min.maf = 1e-04,
  CovAdj.cutoff = 5e-05,
  G.model = "Add"
)

`obj.null`	an R object returned from function SPACox_Null_Model()
`Geno.mtx`	a numeric genotype matrix with each row as an individual and each column as a genetic variant. Column names (genetic variant IDs) and row names (subject IDs) are required. Missing genotypes should be coded as NA. Both hard-called and imputed genotype data are supported.
`Cutoff`	a numeric value (default: 2) to specify the standard deviation cutoff to be used. If the test statistic lies within the standard deviation cutoff, its p value is calculated based on a normal distribution approximation; otherwise, its p value is calculated based on a saddlepoint approximation.
`impute.method`	a character string (default: "fixed") to specify the method to impute missing genotypes. "fixed" imputes missing genotypes (NA) by assigning the mean genotype value (i.e. 2p where p is MAF).
`missing.cutoff`	a numeric value (default: 0.15) to specify the cutoff of the missing rates. Any variant with missing rate higher than this cutoff will be excluded from the analysis.
`min.maf`	a numeric value (default: 0.0001) to specify the cutoff of the minimal MAF. Any SNP with MAF < cutoff will be excluded from the analysis.
`CovAdj.cutoff`	a numeric value (default: 5e-5). If the p value is less than this cutoff, an additional technique is used to adjust for covariates.
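
The effect of `missing.cutoff`, `min.maf` and the "fixed" imputation can be illustrated in base R. This is a minimal sketch of the rules as described above, not the package's internal code; the toy matrix `G` and the helper names `alt.freq`, `keep` and `G.imputed` are assumptions made for illustration.

```r
# Toy genotype matrix: 10 subjects x 3 variants, NA marks a missing genotype.
G <- matrix(c(0, 1, NA, 2, 0, 1, 0, 0, 1, 0,     # SNP-1: one missing call
              0, 0, 0, 0, 0, 0, 0, 0, 0, 0,      # SNP-2: monomorphic
              NA, NA, NA, 0, 1, 2, 0, 0, 1, 0),  # SNP-3: three missing calls
            nrow = 10,
            dimnames = list(paste0("IID-", 1:10), paste0("SNP-", 1:3)))

missing.cutoff <- 0.15
min.maf <- 1e-04

missing.rate <- colMeans(is.na(G))          # fraction of NA calls per variant
alt.freq <- colMeans(G, na.rm = TRUE) / 2   # allele frequency from observed calls
MAF <- pmin(alt.freq, 1 - alt.freq)         # minor allele frequency

# Variants with a missing rate above the cutoff or MAF below the cutoff are dropped.
keep <- missing.rate <= missing.cutoff & MAF >= min.maf

# Mean imputation: replace each NA with the mean observed genotype of that variant.
G.imputed <- G
for (j in seq_len(ncol(G))) {
  G.imputed[is.na(G[, j]), j] <- 2 * alt.freq[j]
}

print(data.frame(missing.rate, MAF, keep))
```

In this toy example only SNP-1 passes both filters: SNP-2 has MAF 0 and SNP-3 has a missing rate of 0.3.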

To run SPACox, the following two steps are required:

Step 1. Use function SPACox_Null_Model() to fit a null Cox model.
Step 2. Use function SPACox() to calculate a p value for each genetic variant.

SPACox uses a hybrid strategy with both saddlepoint approximation and normal distribution approximation. Generally speaking, the saddlepoint approximation is more accurate than, but a little slower than, the traditional normal distribution approximation. Hence, when the score statistic is close to 0 (i.e. the p value is not small), we use the normal distribution approximation; when the score statistic is far away from 0 (i.e. the p value is small), we use the saddlepoint approximation. The argument 'Cutoff' specifies the standard deviation cutoff that separates the two cases.
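
The switching rule can be sketched as follows. This only illustrates which approximation would be chosen for a given statistic; it does not implement the saddlepoint approximation itself, and the function name `hybrid_choice` is hypothetical (it is not part of SPACox). Whether the package treats a statistic exactly at the cutoff as inside or outside is an assumption here.

```r
# Decide, per variant, which approximation the hybrid strategy would use.
# Stat and Var are the score statistic and its estimated variance.
hybrid_choice <- function(Stat, Var, Cutoff = 2) {
  z <- Stat / sqrt(Var)                 # standardized score statistic
  use.normal <- abs(z) < Cutoff         # within Cutoff standard deviations of 0
  p.norm <- 2 * pnorm(-abs(z))          # two-sided normal-approximation p value
  data.frame(z = z,
             p.value.norm = p.norm,
             method = ifelse(use.normal, "normal", "saddlepoint"),
             stringsAsFactors = FALSE)
}

res <- hybrid_choice(Stat = c(0.5, -3, 12), Var = c(1, 4, 9))
print(res)
```

With the default cutoff of 2, the first two statistics (z = 0.5 and z = -1.5) fall in the normal-approximation region, and the third (z = 4) would be passed to the saddlepoint approximation.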

To calibrate the score statistics, SPACox uses martingale residuals, which are calculated via the R package survival. All extensions (such as strata, ties, left-censoring) supported by package survival can also be used in SPACox. Time-varying covariates are also supported by splitting each subject into several observations. Simulation studies and real data analyses indicate that SPACox works well if one subject corresponds to 2~3 observations. However, if there are more than 4 observations per subject, SPACox has not been fully evaluated and the results should be interpreted carefully.
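
For readers unfamiliar with martingale residuals, the sketch below shows how they can be obtained from a null Cox model with package survival. It is an illustration using simulated data and standard survival functions (`coxph()`, `Surv()`, `residuals()`); it is an assumption that SPACox_Null_Model() performs a similar computation internally, and the object names are made up for this example.

```r
library(survival)

set.seed(1)
N <- 200
Phen <- data.frame(time  = runif(N),
                   event = rbinom(N, 1, 0.5),
                   Cov1  = rnorm(N),
                   Cov2  = rbinom(N, 1, 0.5))

# Null model: covariates only, no genotype term.
fit.null <- coxph(Surv(time, event) ~ Cov1 + Cov2, data = Phen)

# One martingale residual per subject: observed minus expected number of events.
mresid <- residuals(fit.null, type = "martingale")

summary(mresid)
sum(mresid)   # numerically zero for a Cox model
```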

Sometimes, the order of subjects differs between the phenotype data and the genotype data, which could lead to errors. To avoid that, we ask users to specify the IDs of both the phenotype data (pIDs) and the genotype data (gIDs) when fitting the null model. Users are responsible for checking the consistency between pIDs and formula, and the consistency between gIDs and Geno.mtx.
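
A quick user-side consistency check before fitting the null model might look like the sketch below. It uses only base R; the toy IDs and the object names `pos` and `G.aligned` are assumptions for illustration, and the check says nothing about how SPACox matches IDs internally.

```r
# Toy IDs: phenotype and genotype data list the subjects in different orders.
pIDs <- c("IID-3", "IID-1", "IID-2")
gIDs <- c("IID-1", "IID-2", "IID-3", "IID-4")

G <- matrix(0:7, nrow = 4,
            dimnames = list(gIDs, c("SNP-1", "SNP-2")))

# IDs should be unique within each data source.
stopifnot(!anyDuplicated(pIDs), !anyDuplicated(gIDs))

# Subjects with phenotype data but no genotype row.
missing.in.geno <- setdiff(pIDs, gIDs)

# Row positions of the phenotype subjects inside the genotype matrix.
pos <- match(pIDs, gIDs)

# Genotype rows reordered to follow the phenotype order (for inspection only).
G.aligned <- G[pos, , drop = FALSE]
print(G.aligned)
```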

An R matrix with the following columns:

`MAF`	Minor allele frequencies
`missing.rate`	Missing rates
`p.value.spa`	p value (recommended) from a saddlepoint approximation.
`p.value.norm`	p value from a normal distribution approximation.
`Stat`	score statistics
`Var`	estimated variances of the score statistics
`z`	z values corresponding to the score statistics
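
Because the return value is a matrix rather than a data frame, columns are selected with `[ , "name"]` instead of `$`. The sketch below uses a small made-up matrix with a subset of the columns listed above; the numbers are invented purely for illustration, and the 5e-8 threshold is the conventional genome-wide significance level, not a value taken from this page.

```r
# Made-up stand-in for a SPACox() result (values are NOT real output).
res <- matrix(c(0.10, 0.25, 0.02,    # MAF
                0.00, 0.01, 0.00,    # missing.rate
                0.5,  1e-9, 0.03),   # p.value.spa
              nrow = 3,
              dimnames = list(paste0("SNP-", 1:3),
                              c("MAF", "missing.rate", "p.value.spa")))

# Variants below the chosen significance threshold.
hits <- res[res[, "p.value.spa"] < 5e-8, , drop = FALSE]

# All variants ranked from smallest to largest saddlepoint p value.
ranked <- res[order(res[, "p.value.spa"]), , drop = FALSE]

print(hits)
print(ranked)
```

Using `drop = FALSE` keeps the result a matrix even when only one variant is selected.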

# Simulate phenotype and genotype data
N = 10000
nSNP = 1000
MAF = 0.1
Phen.mtx = data.frame(ID = paste0("IID-",1:N),
                      event=rbinom(N,1,0.5),
                      time=runif(N),
                      Cov1=rnorm(N),
                      Cov2=rbinom(N,1,0.5))
Geno.mtx = matrix(rbinom(N*nSNP,2,MAF),N,nSNP)

# NOTE: The row and column names of genotype matrix are required.
rownames(Geno.mtx) = paste0("IID-",1:N)
colnames(Geno.mtx) = paste0("SNP-",1:nSNP)
Geno.mtx[1:10,1]=NA   # please use NA for missing genotype

# Attach the survival package so that we can use its function Surv()
library(survival)
obj.null = SPACox_Null_Model(Surv(time,event)~Cov1+Cov2, data=Phen.mtx,
                             pIDs=Phen.mtx$ID, gIDs=rownames(Geno.mtx))
SPACox.res = SPACox(obj.null, Geno.mtx)

# We recommend using the column 'p.value.spa' to associate genotype with time-to-event phenotypes
head(SPACox.res)

## Missing data in response/indicator variables is also supported. Please do not remove the pIDs of subjects with missing data; the program will do it.
Phen.mtx$event[2] = NA
Phen.mtx$Cov1[5] = NA
obj.null = SPACox_Null_Model(Surv(time,event)~Cov1+Cov2, data=Phen.mtx,
                             pIDs=Phen.mtx$ID, gIDs=rownames(Geno.mtx))
SPACox.res = SPACox(obj.null, Geno.mtx)

# Example code using the survival package directly
coxph(Surv(time,event)~Cov1+Cov2+Geno.mtx[,1], data=Phen.mtx)