Using UKB.COVID19 package for GWAS with SAIGE

knitr::opts_chunk$set(eval = FALSE)

Generate phenotypes and covariates

Firstly, combine data from COVID-19 testing results, hospital inpatient data, and death records to generate multiple COVID-19 related phenotypes, using the makePhenotypes function. The risk.factor phenotype, extracts and formats several of the most well-known risk factors for COVID-19.

## Obtain phenotypes

pheno <- makePhenotypes(ukb.data = "ukb0000.tab", 
                        res.eng = "20210426_covid19_result_england.txt", 
                        res.wal = "20210426_covid19_result_wales.txt", 
                        res.sco = "20210426_covid19_result_scotland.txt", 
                        death.file = "20210408_death.txt",  
                        death.cause.file = "20210408_death_cause.txt", 
                        hesin.file = "20210502_hesin.txt", 
                        hesin_diag.file = "20210502_hesin_diag.txt", 
                        hesin_oper.file = "20210502_hesin_oper.txt", 
                        hesin_critical.file "20210222_hesin_critical.txt", 
                        code.file "coding19.tsv")

## Obtain covariates

covariates <- risk.factor(ukb.data = "ukb0000.tab", 
                        ABO.data = "covid19_misc.txt.gz", 
                        hesin.file =  "20210502_hesin.txt", 
                        res.eng = "20210426_covid19_result_england.txt", 
                        res.wal = "20210426_covid19_result_wales.txt", 
                        res.sco = "20210426_covid19_result_scotland.txt")

Running a GWAS

Quality control and files for GWAS

Now carry out sample and variant level QC, using data supplied by UK Biobank.
For sampleQC, ancestry must be specified as "WhiteBritish" or "all".
For variantQC, MAF and INFO (imputation quality) filters can be specified, or defaults to MAF = 0.001 and INFO = 0.5.

sampleQC(ukb.data = "ukb0000.tab", 
         withdrawnFile = "w0000_20200820.csv", 
         ancestry="all", 
         software="SAIGE") 

variantQC(snpQC = "ukb_snp_qc.txt", 
          mfiDir = "qcDir", 
          mafFilt=0.001, 
          infoFilt=0.5) 

Sample QC outputs

Lists of IDs to include/exclude are outputted, with appropriate format.

If selected software is "plink":

If selected software is "SAIGE":

Variant QC outputs

Create phenotype files for input

The below commands generates a phenotype file that can be used to run a GWAS using SAIGE, of COVID +ve individuals vs COVID -ve.
By default, this function appends 20PCs of ancestry, and genotyping array to the phenotype file. User can also (optionally) specify additional covariates (eg as generated by the "risk.factor" function, above).

makeGWASFiles(ukb.data = "ukb0000.tab", 
              pheno = pheno, 
              covariates = covariates, 
              phe.name = "pos.neg", 
              cov.name = c("sex", "age"), 
              includeSampsFile = sampleInclude_SAIGE.txt,
              software = "SAIGE", 
              prefix = "phenotypes_SAIGE")

This outputs the file phenotypes_SAIGE.txt

Data filtering

Filtering of the genetic data requires:

For running SAIGE, we need to create:
1. a Plink format file, with SNPs to be used for calculating the genetic relatedness matrix.

Directly genotyped data comes in Plink format, and split by chromosome. ```{bash filter plink}

list of Plink files

for CHR in {1..22}; do echo ukb_cal_chr${CHR}_v2.bed ukb_snp_chr${CHR}_v2.bim ukb0000.fam >> plinkList.txt done

merge chromosomes together

plink \ --merge-list plinkList.txt \ --make-bed \ --out allChr_temp

extract variants and European samples for GRM calculations

plink \ --bfile cleaningTemp/allChr_temp \ --keep sampleInclude_plink.txt \ --extract snpsIncludeRSIDs_GRM.txt \ --make-bed \ --out grmSNPs

rm allChr_temp*

This creates binary Plink files: grmSNPs.bed/fam/bim.   


2. filtered bgen files for running the GWAS

Imputed data comes in bgen format, and split by chromosome. These files should be filtered prior to running to the GWAS, but be kept split by chromosome, so the GWAS can be run in parallel.  

```{bash filter bgen}

## Filter chromosome 1

CHR=1

qctool \
  -g ukb_imp_chr${CHR}_v3.bgen  \
  -s ukb0000_imp_chr${CHR}_v3_s487296.sample \
  -incl-snpids snpIncludeRSIDs_minMa0.001_minInfo0.8.txt \
  -incl-samples sampleInclude_SAIGE.txt \
  -og filtered_chr${CHR}.bgen \
  -os filtered_chr${CHR}.sample

Running a GWAS with SAIGE

We like to run SAIGE using a container image, with Singularity (https://sylabs.io/singularity/), and this is what is shown below.

SAIGE can be installed and run in many ways. See https://github.com/weizhouUMICH/SAIGE for details (including the download for the container (Docker) image).

SAIGE runs in two steps:

  1. Fitting the null model

Here we are fitting the null model, with COVID +ve individuals vs COVID -ve as the phenotype ("pos.neg"), and sex, age, 10PCs of ancestry, and genotyping array as covariates.

```{bash saige step 1}

singularity run ./saige_0.36.3.2.sif step1_fitNULLGLMM.R \ --plinkFile=grmSNPs \ --phenoFile=phenotypes_SAIGE.txt \ --phenoCol=pos.neg \ --covarColList=sex,age,array,pc1,pc2,pc3,pc4,pc5,pc6,pc7,pc8,pc9,pc10 \ --sampleIDColinphenoFile=ID \ --traitType=binary \ --outputPrefix=nullModel \ --nThreads=4 \ --LOCO=FALSE

 This outputs the files nullModel.rda and nullModel.varianceRatio.txt to be used in step 2.  

 2. association analysis

 Using the output from step 1, run the association analysis. 

 Note - a little reformatting of the data is needed first: the sample file header needs to be removed, and the bgen files needs indexing with bgenix (bgenix is included in the SAIGE download).  

```{bash saige step 2}

# Remove header from sample file
sed -e '1,2d' filtered_chr${CHR}.sample > filtered_chr${CHR}_noHead.sample

# Index bgen file
bgenix -g filtered_chr${CHR}.bgen -index -clobber

# Run GWAS
singularity run ./saige_0.36.3.2.sif step2_SPAtests.R   \
        --bgenFile=filtered_chr${CHR}.bgen \
        --bgenFileIndex=filtered_chr${CHR}.bgen.bgi \
        --minMAF=0.001 \
        --sampleFile=filtered_chr${CHR}_noHead.sample \
        --GMMATmodelFile=nullModel.rda \
        --varianceRatioFile=nullModel.varianceRatio.txt \
        --SAIGEOutputFile=chr${CHR}_results.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE


Try the UKB.COVID19 package in your browser

Any scripts or data that you put into this service are public.

UKB.COVID19 documentation built on March 18, 2022, 8:03 p.m.