weizhouUMICH/SAIGE: Efficiently controlling for case-control imbalance and sample relatedness in single-variant association tests (SAIGE) and controlling for sample relatedness in region-based association tests in large cohorts and biobanks (SAIGE-GENE)

SAIGE is an R package that implements the Scalable and Accurate Implementation of GEneralized mixed model. It uses the saddlepoint approximation (SPA) (Imhof, J. P., 1961; Kuonen, D., 1999; Dey, R. et al., 2017) and large-scale optimization techniques to calibrate for unbalanced case-control ratios in logistic mixed model score tests (Chen, H. et al., 2016) in large PheWAS. Estimated effect sizes are provided. VCF, BCF and SAV are supported as input formats for dosages via the SAVVY library, and BGEN is supported as an input format for dosages via the BGEN library. QR decomposition is used for the covariate matrix. Models with no covariates are allowed. A large number of genetic markers (e.g. > 600,000 markers) can be used to construct the GRM. Fixed a bug in the Tstat output. Use the coefficient of variation for trace estimators and variance ratio estimation. Fixed a bug in colSums() when there is no covariate. BETA and Tstat are now for the alt allele for both quantitative and binary traits. Added options for leave-one-chromosome-out (LOCO) and for the coefficient of variation cutoffs for trace estimates and variance ratio estimates. Updated the savvy library. Added conditional analysis and gene-based tests with a sparse GRM. Added options for categorical variance ratios and sparseSigma for single-variant analysis of quantitative traits (binary traits need more work). Added step 1 using the GMMAT package. Works with R-3.5.1. Updated eta for fitting the null GLMM for binary traits. Added a script createSparseGRM.R to create a sparse GRM. Use sparse matrices for gene-based tests. Fixed a bug in step 1 (updating eta). Replaced pcg with solve for sparseGRM. Remove markers with negative variance from gene-based tests. Sped up step 1 for quantitative traits. Use the sparse GRM to estimate the initial tau. Fixed a bug in step 2 when phi is very small. Added the missing arguments cateVarRatioMinMACVecExclude and cateVarRatioMaxMACVecInclude to the step 1 functions. Compared to 0.35.5.6, this version has the master branch merged in.
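The note above about using the coefficient of variation for trace estimators can be illustrated with a generic randomized trace estimator. This is a minimal sketch, not SAIGE's actual code: the function name `estimate_trace`, its arguments, the example matrix, and the choice to define the CV as the standard error of the mean divided by the mean are all assumptions made for illustration.

```r
# Hutchinson-style randomized trace estimator with a CV-based stopping rule.
# Hypothetical helper for illustration; not part of the SAIGE package.
estimate_trace <- function(A, nrun = 30, cv_cutoff = 0.0025, max_run = 2000) {
  n <- nrow(A)
  draws <- numeric(0)
  repeat {
    # Rademacher probe vectors: each entry is +1 or -1 with equal probability
    U <- matrix(sample(c(-1, 1), n * nrun, replace = TRUE), nrow = n)
    # u' A u for every probe vector (one value per column of U)
    draws <- c(draws, colSums(U * (A %*% U)))
    # Assumed CV definition: standard error of the mean over the mean
    cv <- (sd(draws) / sqrt(length(draws))) / abs(mean(draws))
    # Stop once the estimate is stable enough, or when the probe budget is spent
    if (cv <= cv_cutoff || length(draws) >= max_run) break
  }
  list(trace = mean(draws), cv = cv, n_probes = length(draws))
}

# Usage on a simulated symmetric positive definite matrix
set.seed(1)
X <- matrix(rnorm(500 * 200), nrow = 500)
A <- crossprod(X) / 500 + diag(200)
res <- estimate_trace(A)
true_trace <- sum(diag(A))
```

If the CV is still above the cutoff after the first batch of probe vectors, another batch is drawn, so the cutoff trades run time against the stability of the estimate.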
The following notes are copied from master; they were missed when master-gene was branched off from master. Updated SAIGE because the SPAtest library was updated (bug fixed). Fixed a bug in the savvy library for VCF and SAV format input. Fixed a bug in step 1 for binary traits, which may not affect step 1 results much. Break when tau is 0 for quantitative traits in step 1. Added code to check the sample size in the sample file and the dosage file. Remove covariates in case of perfect separation. Handle missing dosages.
0.35.6.3: add a function to extract the diagonal of the GRM.
0.35.7: clean code.
0.35.8: merge 0.35.7 into 0.29.7 in the master branch.
0.35.8.1: fix some errors in the documentation and the warning message for binary traits.
0.35.8.2: minor changes: fix an error message, change MAC to MAF, add a line to check whether the chromosome in the plink file is numeric, add rsid to the header when the input file is bgen.
0.35.8.3: fix a bug in the function getCovM_nopcg, which affected the conditional analysis for binary traits. Merge hyacz/master to use cget to manage superlu.
0.35.8.5: implement the function to account for unbalanced case-control ratios in gene-based tests.
0.35.8.6: fixed the output bug when the genotype matrix has rank 1 for binary phenotypes, and added an argument minMAFtoConstructGRM for step 0 and step 1 to allow users to specify the minimum MAF of markers used to construct the GRM.
0.35.8.7: fixed the bug when no covariate is specified, added an argument IsOutputNinCaseCtrl for step 2 to output sample sizes in cases and controls for binary traits in the output file, and fixed the out-of-boundary bug for LOCO.
0.35.8.8: fixes a matrix inversion issue in the null model and adds an optional argument for the null computation to remove binary covariates with low counts (contributed by juhis).
0.36.1: 1. fixed the freq calculation for mean imputation of missing genotypes in plinkFile 2. Diagonal elements of the GRM are now estimated using markers in plinkFile with MAF >= minMAFforGRM 3.
Conditional analysis for gene- or region-based tests for binary traits now accounts for case-control imbalance 4. plain dosage files are no longer supported for step 2, so no external boost_iostream library is needed.
0.36.2: The option weights.beta.common is not yet fully developed, so weights.beta.common is set equal to weights.beta.rare for now. Instead of outputting NA for SKAT-O p-values when the function SKAT:::Met_SKAT_Get_Pvalue fails, output 2*min(SKAT p, Burden p, 0.5). Allow specifying customized weights for markers in gene- or region-based tests with the step 2 arguments weightsIncludeinGroupFile and weights_for_G2_cond.
0.36.3: add the option IsOutputBETASEinBurdenTest in step 2 to output effect sizes for burden tests.
0.36.3.1: minor changes.
0.36.3.2: fixed a bug for gene-based conditioning tests with multiple conditioning markers, and re-check markers after dropping samples with missing dosages/genotypes in gene-based tests.
0.36.3.3: trying to fix a bug when there is only one sample or marker left after dropping samples with missing dosages/genotypes in gene-based tests.
0.36.4: add an option includeNonautoMarkersforVarRatio in step 1. If TRUE, non-autosomal markers are also used for variance ratio estimation, which makes the algorithm more appropriate for association tests of non-autosomal markers; use the new function with sparse sigma for p-values of single variants in gene-based tests; assign AF to be 0 if all samples have missing genotypes or dosages.
0.36.4.1: trying to fix a bug when minMAFforGRM is set and LOCO=TRUE.
0.36.4.2: fix a typo to extract p.value.
0.36.5: fix an issue for LOCO=TRUE. This issue was introduced when the option minMAFforGRM was introduced.
0.36.5.1: add the option SPAcutoff. If the test statistic lies within the standard deviation cutoff of the mean, the p-value based on the traditional score test is returned. Otherwise, SPA is applied. The default value of SPAcutoff is 2 (corresponding to p.value.NA of 0.05).
0.36.6:
add an option IsOutputHetHomCountsinCaseCtrl to output the heterozygous and homozygous counts in cases and controls.
0.37: fixed a bug in AC values when bgen input with missing dosages was used and the missing dosages were mean imputed.
0.38: further fixed the bug in outputting allele 2 when bgen input with missing dosages was dropped; sampleFile is no longer needed if a VCF file is used in Step 2; add --IsOverwriteVarianceRatioFile in step 1 to overwrite the variance ratio file.
0.39: fixed a sample reading error for the conditional analysis when VCF input is used for step 2.
0.39.1: add an option --IsOutputlogPforSingle to output log(P) for single-variant association tests. v0.39.1 requires SPAtest 3.1.2.
0.39.2: add three options --sampleFile_male, --X_PARregion, --is_rewrite_XnonPAR_forMales for chromosome X association tests, in which genotypes/dosages in the non-PAR region of males are multiplied by 2.
0.39.3: add five options --sexCol, --FemaleCode, --FemaleOnly, --MaleCode, --MaleOnly to perform sex-specific Step 1.
0.39.4: use a sparse matrix to represent the genotype matrix in gene-based tests to save memory.
0.41: improve the LOCO feature; implement LOCO for gene- and region-based tests (requires --chrom to be specified); with the minInfo cutoff, if the input VCF files do not contain info scores, info is output as NA and markers are not filtered out. Fixed an issue when subsetting pre-calculated terms (regress X out of G) to drop missing dosages.
0.42: fixed a bug in the variance ratio adjustment when accounting for case-control imbalance in gene-based tests. minMAC is set to 1/(2*N) instead of 0 if is_rewrite_XnonPAR_forMales=TRUE.
0.42.1: uncomment isSparse=FALSE for quantitative traits. This was commented out for testing in 0.42.
0.43: further modify the sparse version of the score test for quantitative traits. This causes slightly different association test results for variants with MAF < 0.05 for quantitative traits.
LOCO = TRUE is now the default for step 1 and step 2. In step 2, --chrom needs to be specified when LOCO=TRUE.
0.43.1: with LOCO=TRUE, remove model results for other chromosomes to reduce memory usage in Step 2.
0.43.2: add scripts to calculate the effective sample size in Step 1 for binary traits.
0.43.3: the error "FALis_rewrite_XnonPAR_forMalesSE not found" has been fixed.
0.44: fixed the error "Phi_ccadj[-indexNeg, -indexNeg]"; inverse normalization is only performed for quantitative traits; for step 2, bgen input requires the sample file, while vcf input does not require a separate sample file. If the sample file is not provided, sample ids are read from the vcf file.
0.44.1: fixed the error " X %*% Z : non-conformable arguments" for monomorphic variants. Merged Jonathon's code to update savvy to savvy 2.0. For markers in VCF or SAV files without imputation info R2 values, the imputationInfo column will be 1 in the output file, so the markers will not be removed by minInfo.
0.44.2: add an option useSparseGRMtoFitNULL to allow fitting the null model using the sparse GRM, and add options to collapse the ultra-rare variants in set-based tests.
0.44.5: 1. speed up the computation when LOCO=TRUE by rewriting the leave-one-chromosome-out code in Step 1 for more efficient parallel computation 2. speed up the single-variant association tests when running gene-based tests.
0.44.6: set --method_to_CollapseUltraRare="absence_or_presence" as the default to collapse ultra-rare variants with MAC <= 10. We call this version SAIGE-GENE+. SAIGE-GENE+ has well-controlled type I error rates when the maximum MAF cutoff (maxMAFforGroupTest) is lower than 1%, e.g. 0.01% or 0.1%.
0.44.6.1: add the function CCT for Cauchy combination.
0.44.6.2: add extdata/extractNglmm.R to extract the effective sample size without running Step 1. extdata/cmd_extractNeff.sh has the pipeline. The effective sample size (Nglmm) is calculated differently than in previous versions.
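For readers unfamiliar with the Cauchy combination mentioned under 0.44.6.1, the idea can be sketched in a few lines of R. This is a minimal illustration of the general Cauchy combination formula, not the package's CCT function: the name `cauchy_combine`, its arguments, and the equal default weights are assumptions, and no special handling of extremely small p-values is included.

```r
# Cauchy combination of p-values: each p-value is mapped to a standard Cauchy
# variate, the variates are averaged with weights, and the average is mapped
# back to a p-value. Hypothetical helper; not SAIGE's CCT implementation.
cauchy_combine <- function(pvals, weights = NULL) {
  stopifnot(all(pvals > 0 & pvals < 1))
  if (is.null(weights)) weights <- rep(1, length(pvals))
  weights <- weights / sum(weights)            # weights must sum to 1
  stat <- sum(weights * tan((0.5 - pvals) * pi))
  # Upper tail of the standard Cauchy distribution
  pcauchy(stat, lower.tail = FALSE)
}

# Usage: combine p-values from two tests of the same gene
p_combined <- cauchy_combine(c(0.01, 0.02))
```

A single p-value is returned unchanged, and the combined value of several p-values is dominated by the smallest ones.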
0.44.6.4: make IsOutputlogPforSingle work for quantitative traits. Remove the rsid in the output when the input is bgen.
0.45: comment out the part that estimates the effective sample sizes, which may not converge and may take very long; use <= instead of < for maxMAF in the gene-based tests.
0.45.1: calculate the effective sample size for quantitative traits.
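The SPAcutoff rule described under 0.36.5.1 can be sketched as follows for a binary trait. This is a simplified illustration under stated assumptions, not SAIGE's or SPAtest's implementation: the function name `score_test_spa` and its arguments are hypothetical, genotypes are assumed non-negative, the fitted null probabilities `mu` are taken as given, and no variance ratio or relatedness adjustment is applied.

```r
# Score test with a saddlepoint fallback for a binary trait.
# g: non-negative genotypes/dosages, y: 0/1 phenotype,
# mu: fitted null probabilities (assumed given), spa_cutoff: SD cutoff.
score_test_spa <- function(g, y, mu, spa_cutoff = 2) {
  s <- sum(g * (y - mu))             # score statistic, mean 0 under the null
  v <- sum(g^2 * mu * (1 - mu))      # null variance of the score
  z <- s / sqrt(v)
  p_norm <- 2 * pnorm(-abs(z))
  # Within the cutoff: return the traditional normal-approximation p-value
  if (abs(z) < spa_cutoff) return(list(p.value = p_norm, method = "normal"))

  # Cumulant generating function of the score and its first two derivatives
  K  <- function(t) sum(log(1 - mu + mu * exp(g * t))) - t * sum(g * mu)
  K1 <- function(t) sum(g * mu * exp(g * t) / (1 - mu + mu * exp(g * t))) - sum(g * mu)
  K2 <- function(t) sum(g^2 * mu * (1 - mu) * exp(g * t) / (1 - mu + mu * exp(g * t))^2)

  upper <- sum(g * (1 - mu))         # largest attainable score
  lower <- -sum(g * mu)              # smallest attainable score
  tail_prob <- function(q) {
    # P(S > q) for q > 0 and P(S < q) for q < 0; zero outside the support
    if (q >= upper || q <= lower) return(0)
    t_hat <- uniroot(function(t) K1(t) - q, interval = c(-1, 1),
                     extendInt = "yes", tol = 1e-8)$root
    w <- sign(t_hat) * sqrt(2 * (t_hat * q - K(t_hat)))
    u <- t_hat * sqrt(K2(t_hat))
    pnorm(w + log(u / w) / w, lower.tail = (q < 0))
  }
  p_spa <- min(1, tail_prob(abs(s)) + tail_prob(-abs(s)))
  list(p.value = p_spa, method = "SPA")
}

# Usage: 200 carriers, 1% prevalence, 8 cases among the carriers
g  <- c(rep(1, 200), rep(0, 1800))
mu <- rep(0.01, 2000)
y  <- c(rep(1, 8), rep(0, 1992))
res <- score_test_spa(g, y, mu)
p_normal <- 2 * pnorm(-abs(sum(g * (y - mu)) / sqrt(sum(g^2 * mu * (1 - mu)))))
```

With a strongly unbalanced trait like this one, the score distribution is right-skewed, so the normal approximation understates the upper tail and the saddlepoint p-value comes out larger.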

Package details

Author: Wei Zhou, Zhangchen Zhao, Seunggeun Lee, Cristen Willer
Maintainer: Wei Zhou <zhowei@umich.edu>
License: GPL (>= 2)
Version: 0.45.1
Package repository: GitHub (weizhouUMICH/SAIGE)
Installation: Install the latest version of this package by entering the following in R:
install.packages("remotes")
remotes::install_github("weizhouUMICH/SAIGE")
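After installation, it can help to confirm which version ended up in the library. A minimal sketch using only base R utilities; the helper name `installed_version` is hypothetical.

```r
# Return the installed version of a package as a string, or NA if it is absent.
# Hypothetical helper for illustration; uses only base R and utils.
installed_version <- function(pkg) {
  # system.file() returns "" when the package is not installed
  if (nzchar(system.file(package = pkg))) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

# Usage: prints the version string, or NA if SAIGE is not installed
print(installed_version("SAIGE"))
```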
weizhouUMICH/SAIGE documentation built on May 6, 2022, 12:34 a.m.