knitr::opts_chunk$set(eval = FALSE)
Firstly, combine data from COVID-19 testing results, hospital inpatient data, and death records to generate multiple COVID-19 related phenotypes, using the makePhenotypes function. The risk.factor phenotype, extracts and formats several of the most well-known risk factors for COVID-19.
## Obtain phenotypes pheno <- makePhenotypes(ukb.data = "ukb0000.tab", res.eng = "20210426_covid19_result_england.txt", res.wal = "20210426_covid19_result_wales.txt", res.sco = "20210426_covid19_result_scotland.txt", death.file = "20210408_death.txt", death.cause.file = "20210408_death_cause.txt", hesin.file = "20210502_hesin.txt", hesin_diag.file = "20210502_hesin_diag.txt", hesin_oper.file = "20210502_hesin_oper.txt", hesin_critical.file "20210222_hesin_critical.txt", code.file "coding19.tsv") ## Obtain covariates covariates <- risk.factor(ukb.data = "ukb0000.tab", ABO.data = "covid19_misc.txt.gz", hesin.file = "20210502_hesin.txt", res.eng = "20210426_covid19_result_england.txt", res.wal = "20210426_covid19_result_wales.txt", res.sco = "20210426_covid19_result_scotland.txt")
Now carry out sample and variant level QC, using data supplied by UK Biobank.
For sampleQC, ancestry must be specified as "WhiteBritish" or "all".
For variantQC, MAF and INFO (imputation quality) filters can be specified, or defaults to MAF = 0.001 and INFO = 0.5.
sampleQC(ukb.data = "ukb0000.tab", withdrawnFile = "w0000_20200820.csv", ancestry="all", software="SAIGE") variantQC(snpQC = "ukb_snp_qc.txt", mfiDir = "qcDir", mafFilt=0.001, infoFilt=0.5)
Lists of IDs to include/exclude are outputted, with appropriate format.
If selected software is "plink":
If selected software is "SAIGE":
The below commands generates a phenotype file that can be used to run a GWAS using SAIGE, of COVID +ve individuals vs COVID -ve.
By default, this function appends 20PCs of ancestry, and genotyping array to the phenotype file. User can also (optionally) specify additional covariates (eg as generated by the "risk.factor" function, above).
makeGWASFiles(ukb.data = "ukb0000.tab", pheno = pheno, covariates = covariates, phe.name = "pos.neg", cov.name = c("sex", "age"), includeSampsFile = sampleInclude_SAIGE.txt, software = "SAIGE", prefix = "phenotypes_SAIGE")
This outputs the file phenotypes_SAIGE.txt
Filtering of the genetic data requires:
For running SAIGE, we need to create:
1. a Plink format file, with SNPs to be used for calculating the genetic relatedness matrix.
Directly genotyped data comes in Plink format, and split by chromosome. ```{bash filter plink}
for CHR in {1..22}; do echo ukb_cal_chr${CHR}_v2.bed ukb_snp_chr${CHR}_v2.bim ukb0000.fam >> plinkList.txt done
plink \ --merge-list plinkList.txt \ --make-bed \ --out allChr_temp
plink \ --bfile cleaningTemp/allChr_temp \ --keep sampleInclude_plink.txt \ --extract snpsIncludeRSIDs_GRM.txt \ --make-bed \ --out grmSNPs
rm allChr_temp*
This creates binary Plink files: grmSNPs.bed/fam/bim. 2. filtered bgen files for running the GWAS Imputed data comes in bgen format, and split by chromosome. These files should be filtered prior to running to the GWAS, but be kept split by chromosome, so the GWAS can be run in parallel. ```{bash filter bgen} ## Filter chromosome 1 CHR=1 qctool \ -g ukb_imp_chr${CHR}_v3.bgen \ -s ukb0000_imp_chr${CHR}_v3_s487296.sample \ -incl-snpids snpIncludeRSIDs_minMa0.001_minInfo0.8.txt \ -incl-samples sampleInclude_SAIGE.txt \ -og filtered_chr${CHR}.bgen \ -os filtered_chr${CHR}.sample
We like to run SAIGE using a container image, with Singularity (https://sylabs.io/singularity/), and this is what is shown below.
SAIGE can be installed and run in many ways. See https://github.com/weizhouUMICH/SAIGE for details (including the download for the container (Docker) image).
SAIGE runs in two steps:
Here we are fitting the null model, with COVID +ve individuals vs COVID -ve as the phenotype ("pos.neg"), and sex, age, 10PCs of ancestry, and genotyping array as covariates.
```{bash saige step 1}
singularity run ./saige_0.36.3.2.sif step1_fitNULLGLMM.R \ --plinkFile=grmSNPs \ --phenoFile=phenotypes_SAIGE.txt \ --phenoCol=pos.neg \ --covarColList=sex,age,array,pc1,pc2,pc3,pc4,pc5,pc6,pc7,pc8,pc9,pc10 \ --sampleIDColinphenoFile=ID \ --traitType=binary \ --outputPrefix=nullModel \ --nThreads=4 \ --LOCO=FALSE
This outputs the files nullModel.rda and nullModel.varianceRatio.txt to be used in step 2. 2. association analysis Using the output from step 1, run the association analysis. Note - a little reformatting of the data is needed first: the sample file header needs to be removed, and the bgen files needs indexing with bgenix (bgenix is included in the SAIGE download). ```{bash saige step 2} # Remove header from sample file sed -e '1,2d' filtered_chr${CHR}.sample > filtered_chr${CHR}_noHead.sample # Index bgen file bgenix -g filtered_chr${CHR}.bgen -index -clobber # Run GWAS singularity run ./saige_0.36.3.2.sif step2_SPAtests.R \ --bgenFile=filtered_chr${CHR}.bgen \ --bgenFileIndex=filtered_chr${CHR}.bgen.bgi \ --minMAF=0.001 \ --sampleFile=filtered_chr${CHR}_noHead.sample \ --GMMATmodelFile=nullModel.rda \ --varianceRatioFile=nullModel.varianceRatio.txt \ --SAIGEOutputFile=chr${CHR}_results.txt \ --numLinesOutput=2 \ --IsOutputAFinCaseCtrl=TRUE
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.