Sample Quality Check for NGS Data using SeqSQC package

Quick start

The QC process in SeqSQC included five different steps: missing rate check, sex check, inbreeding coefficient check, Identity-by-descent (IBD) check, and population outlier check. Problematic samples identified in each step could be removed from later steps. The entire sample level QC was wrapped up in one function: sampleQC. By executing this function, a problematic sample list with criteria from each QC step as well as a QC report with interactive plots in html format will be generated.

Here we use an exemplar NGS dataset as a study cohort to demonstrate the execution of the wrap-up function sampleQC. The example dataset, with four EUR (European) samples and one AFR (African) sample, is assembled from the 1000 Genomes Project. We labeled the one AFR sample as EUR to mimic a population outlier. Samples from the 1000 Genome Project are whole-genome sequencing dataset. To mimic whole exome sequencing (WES) data, we kept in the vcf file only the variants in capture regions of Agilent SureSelect Human Exon v5, one of the most popular WES capture platforms to date.

The code chunk below assumes that you have a vcf file called infile with samples from a single population, a sample annotation file called sample.annot with sample name, population (e.g., AFR, EUR, EAS (East Asian), SAS (South Asian), or ASN (Asian)) and gender info (male/female), and a bed file called cr which contains capture regions. User need to specify the name for output files, incuding the gds file containing genotypes, and the SeqSQC(discussed in section below) object with QC results.

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("SeqSQC")
library(SeqSQC)

The wrap-up function sampleQC transforms the vcf file into SeqSQC object, or take the SeqSQC object generated from LoadVfile directly as input, evaluates all sample QC steps (as a convenient all inclusive method), outputs the QC results into your designated directory, and generates a sample QC report in html format. We recommend saving the output SeqSQC object into an RData.

This is an example when we have input as vcf file, sample annotation file, and capture region bed file. Note that our example vcf file only include variants in chromosome 1, and the capture region only include the

infile <- system.file("extdata", "example_sub.vcf", package="SeqSQC")
sample.annot <- system.file("extdata", "sampleAnnotation.txt", package="SeqSQC")
cr <- system.file("extdata", "CCDS.Hs37.3.reduced_chr1.bed", package="SeqSQC")
outdir <- tempdir()
outfile <- file.path(outdir, "testWrapUp")
seqfile <- sampleQC(vfile = infile, output = outfile, capture.region = cr, sample.annot = sample.annot, format.data = "NGS", format.file = "vcf", QCreport = FALSE)
save(seqfile, file="seqfile.RData")

This is an example when we directly use SeqSQC object as input, and evaluate all sample QC steps.

load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))

seqfile <- sampleQC(vfile = seqfile, output = "testWrapUp", QCreport = TRUE)
save(seqfile, file="seqfile.Rdata")

Detailed descriptions of data structure and functionality usage of SeqSQC are provided as below. Users are recommended to use these specific QC functions for each QC step if they want to skip any of them by using sampleQC().

Data preparation

Input data

SeqSQC takes VCF file as input.

seqfile <- LoadVfile(vfile = infile, output = outfile, capture.region = cr, sample.annot = sample.annot)

SeqSQC class

We define an object class SeqSQC to store the genotype data, variant / sample annotation data, and the sample QC results.

load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
slotNames(seqfile)

A SeqSQC object is a list of two objects. The first object is the filepath of the GDS (discussed in section below) file which stores the genotype information from the original VCF file.

gdsfile(seqfile)

The second object is a list of sample information and QC results, which include the dimension (# of samples and variants), sample annotation, and QC results for sample missing rate, sex check, inbreeding outlier check, IBD check, and population outlier check.

QCresult(seqfile)
head(QCresult(seqfile)$sample.annot)

GDS class

The genotype information and variant annotation from the input VCF file will be stored in Genomic Data Structure (GDS) format (r Biocpkg("gdsfmt")). Compared to VCF format, the GDS format could increase the storage efficiency by 5 fold and data access speed by 2-3 fold. We only include the bi-allelic single nucleotide variants (SNVs) from the VCF input for sample QC analysis. Other information from the VCF file, including the chromosome, position, snp rs id, reference allele, alternative allele, quality score and filter can also be passed into the gds file. The functions SeqOpen and closefn.gds can be used to open and close the gds file in SeqSQC object. It is recommended to close the gds file once it has been opened.

showfile.gds(closeall=TRUE)
dat <- SeqOpen(seqfile)
dat
closefn.gds(dat)

Standard workflow

Sample missing rate check

Samples with missing rate > 0.1 are considered problematic. The function MissingRate and plotQC(QCstep = "MissingRate") calculate and plot the sample missing rate respectively.

The result from sample missing rate check contains three columns: sample name, sample missing rate, and an indicator of whether the sample has a missing rate greater than 0.1. The value of the outlier column is set to NA for benchmark samples. When running the QC process through the wrap-up function sampleQC, problematic samples identified in each QC step are automatically remove before getting to the next step. However when a QC step is executed separately, users need to specify the list of problematic samples to be removed using the remove.samples option.

seqfile <- MissingRate(seqfile, remove.samples=NULL)
res.mr <- QCresult(seqfile)$MissingRate
tail(res.mr)

The function plotQC(QCstep = "MissingRate") generates the plot for the sample missing rate, where the problematic samples with missing rate greater than 0.1 are highlighted in the plot. The default plot generated is not interactive. Users can generate the interactive plot in each QC step by specifying the interactive argument to be TRUE. The interactive plot allows users to visually inspect the QC result by putting the cursor on samples of interest.

plotQC(seqfile, QCstep = "MissingRate")

In this plot, all five samples in the study cohort have missing rate of zero.

Sex check

After filtering out the pseudo-autosomal region in X chromosome, we calculate the sample inbreeding coefficient with variants on X chromosome for all samples in the study cohort and for benchmark samples of the same population as the study cohort. The function SexCheck predicts the sample gender and plotQC(QCstep = "SexCheck") draws the plot for X chromosome inbreeding coefficients.

The result contains sample name, reported gender, X chromosome inbreeding coefficient, and predicted gender. Samples are predicted to be female or male if the inbreeding coefficient is below 0.2, or greater than 0.8. The samples with discordant reported gender and predicted gender are considered as problematic. When the inbreeding coefficient is within the range of [0.2, 0.8], "0" will be shown in the column of pred.sex to indicate ambiguous gender, which won't be considered as problematic.

seqfile <- SexCheck(seqfile, remove.samples=NULL)
res.sexc <- QCresult(seqfile)$SexCheck
tail(res.sexc)

The function plotQC(QCstep = "SexCheck") generates the plot for the inbreeding coefficient on X chromosome where samples are labeled with different color according to their self-reported gender. If there is any sample detected to be gender mismatched by SeqSQC, it will be highlighted with a red circle.

plotQC(seqfile, QCstep = "SexCheck")

In this plot, none of the five samples in the study cohort have dis-concordant self-reported and predicted genders.

Inbreeding check

Using LD-pruned autosomal variants, we calculate the inbreeding coefficient for each sample in the study cohort and for benchmark samples of the same population as the study cohort. Samples with inbreeding coefficients that are five standard deviations beyond the mean are considered problematic. The function Inbreeding and plotQC(QCstep = "Inbreeding") calculates and plots the inbreeding coefficients respectively.

The result contains sample name, inbreeding coefficient, and an indicator of whether the inbreeding coefficient is five standard deviation beyond the mean. For Benchmark samples the indicator column is set to be "NA".

seqfile <- Inbreeding(seqfile, remove.samples=NULL)
res.inb <- QCresult(seqfile)$Inbreeding
tail(res.inb)

The function plotQC(QCstep = "Inbreeding") generates the plot for the inbreeding coefficient. Problematic samples with extreme inbreeding coefficients will be highlighted in the plot.

plotQC(seqfile, QCstep = "Inbreeding")

In this plot, none of the five samples in the study cohort have extreme inbreeding coefficients.

IBD check

Using LD-pruned autosomal variants, we calculate the IBD coefficients for all sample pairs. we then predict related sample pairs in study cohort by using the support vector machine (SVM) method with linear kernel and the known relatedness embedded in benchmark data as training set. All predicted related pairs are also required to have coefficient of kinship ≥ 0.08. The sample with higher missing rate in each related pair is selected for removal from further analysis. The function IBD calculates the IBD coefficients for each sample pair and predicts the relatedness for samples in the study cohort. The function plotQC(QCstep = "IBD") draws the descent coefficients, K0 and K1, for each pair.

The result contains sample names, the descent coefficients k0, k1 and kinship, self-reported relationship and predicted relationship for each pair of samples. Sample pairs with discordant self-reported and predicted relationship are considered as problematic.

seqfile <- IBD(seqfile, remove.samples=NULL)
res.ibd <- QCresult(seqfile)$IBD
head(res.ibd)

The function plotQC(QCstep = "IBD") draws the descent coefficients k0 and k1 for each sample pair. The relationship for each sample pair are labelled by different colors. If there is any problematic sample pair detected to be cryptically related by SeqSQC, it will be highlighted with a red circle.

plotQC(seqfile, QCstep = "IBD")

In this plot, none of the five samples in the study cohort have cryptic relationship with each other.

Population outlier check

Using LD-pruned autosomal variants we calculate the eigenvectors and eigenvalues for principle component analysis (PCA). We use the benchmark samples as training dataset, and predict the population group for each sample in the study cohort using the top four principle components. Samples with discordant predicted and self-reported population groups are considered problematic. The function PCA performs the PCA analysis and identifies population outliers in study cohort.

The result contains sample name, reported population, data resource (benchmark / study cohort), the first four eigenvectors and predicted population. In the example data, we identified one population outlier. The sample HG02585 was reported as EUR but predicted to be AFR. HG02585 is indeed a AFR sample. We put it as an intended population outlier in the example data.

seqfile <- PCA(seqfile, remove.samples=NULL)
res.pca <- QCresult(seqfile)$PCA
tail(res.pca)

The function plotQC(QCstep = "PCA") generates the plot of first two principle components for each sample. Samples with different population are labelled by different colors. If there is any population outlier detected by SeqSQC, it will be highlighted with a red circle.

plotQC(seqfile, QCstep = "PCA")

In the following interactive plot, the intended population outlier HG02585 was grouped closer to the AFR benchmark samples, and far from the other four EUR samples. Users could easily pick it out and remove it from downstream analysis.

plotQC(seqfile, QCstep = "PCA", interactive=TRUE)

Summary of QC results

Problematic sample list

The SeqSQC function ProblemList generates a data frame including all problematic samples, and the reasons for removal recommendation. We recommend users to execute this function after finishing all QC steps in standard workflow, so that they can get a full list of problematic samples.

problemList(seqfile)
save(seqfile, file="seqfile.Rdata")

reporting of results

After finish all or some of the QC steps, the users can use RenderReport to generate the report in html format with optional interactive plots.

RenderReport(seqfile, output="report.html", interactive=TRUE)

How to get help for SeqSQC

Any SeqSQC question can be posted to the Bioconductor support site, which serves as a searchable knowledge base of questions and answers:

https://support.bioconductor.org

Posting a question and tagging with “SeqSQC” will automatically send an alert to the package authors to respond on the support site.

Session info

sessionInfo()


Try the SeqSQC package in your browser

Any scripts or data that you put into this service are public.

SeqSQC documentation built on Nov. 8, 2020, 5:03 p.m.