PCACheck: Population outlier check with SeqSQC object input file.

View source: R/PCACheck.R

PCACheck R Documentation

Population outlier check with SeqSQC object input file.

Description

Function to perform principal component analysis (PCA) on all samples and to infer sample ancestry.

Usage

PCACheck(
  seqfile,
  remove.samples = NULL,
  npcs = 4,
  LDprune = TRUE,
  missing.rate = 0.1,
  ss.cutoff = 300,
  maf = 0.01,
  hwe = 1e-06,
  ...
)

Arguments

seqfile

SeqSQC object, which includes the merged gds file for study cohort and benchmark.

remove.samples

a vector of sample names to exclude from the PCA calculation. These could be problematic samples identified in previous QC steps, or user-defined samples.

npcs

the number of principal components to use for the population prediction in the SVM model. The default value is 4, and it must be <= 10.

LDprune

whether to use the LD-pruned SNP set. The default is TRUE.

missing.rate

only SNPs with a missing rate <= missing.rate are used; if NaN, no threshold is applied. The default, missing.rate = 0.1, filters out variants with a missing rate greater than 10%.

ss.cutoff

the minimum sample size (300 by default) to apply the MAF filter. This sample size is the sum of study samples and the benchmark samples of the same population as the study cohort.

maf

only SNPs with minor allele frequency >= maf are used if the sample size described under ss.cutoff is greater than ss.cutoff; otherwise NaN is used, meaning no MAF threshold. The default is 0.01.

hwe

only SNPs with a Hardy-Weinberg equilibrium p-value >= hwe are used if the sample size described under ss.cutoff is greater than ss.cutoff; otherwise no HWE threshold is applied. The default is 1e-6.

...

Arguments to be passed to other methods.
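
The interplay between ss.cutoff, maf, hwe and missing.rate described above can be sketched in plain R. This is only an illustration under stated assumptions, not the package's internal code: choose_snp_filters is a hypothetical helper, the comparison "greater than ss.cutoff" is taken from the maf description, NaN is assumed to stand for "no threshold" for hwe as well, and the genotype matrix is made up.

```r
# Hypothetical helper (not part of SeqSQC): pick SNP filter thresholds
# following the rule described in the argument list.
choose_snp_filters <- function(n_samples, missing.rate = 0.1, ss.cutoff = 300,
                               maf = 0.01, hwe = 1e-06) {
  large_enough <- n_samples > ss.cutoff
  list(
    missing.rate = missing.rate,
    maf = if (large_enough) maf else NaN,  # NaN: no MAF threshold
    hwe = if (large_enough) hwe else NaN   # assumption: NaN also means no HWE threshold
  )
}

small <- choose_snp_filters(n_samples = 120)  # below ss.cutoff: MAF/HWE filters off
large <- choose_snp_filters(n_samples = 450)  # above ss.cutoff: MAF/HWE filters on

# Toy genotype matrix (rows = samples, columns = SNPs, NA = missing call).
geno <- cbind(
  snp1 = c(0, 1, 2, 0, 1, 1, 0, 2, 1, 0),
  snp2 = c(NA, 1, NA, 0, 2, NA, 1, 0, 0, 1),
  snp3 = c(2, 2, 1, 0, 0, 1, 1, 0, 2, 1)
)

# Per-SNP missing rate, then keep SNPs with missing rate <= missing.rate.
snp_missing <- colMeans(is.na(geno))
kept <- names(snp_missing)[snp_missing <= small$missing.rate]
```

Here snp2 has three missing calls out of ten samples, so it exceeds the 10% threshold and is dropped, while snp1 and snp3 are kept.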

Details

Using LD-pruned autosomal variants (by default), we calculate the eigenvectors and eigenvalues for principal component analysis (PCA). We use the benchmark samples as the training dataset, and predict the population group for each sample in the study cohort based on the top eigenvectors (four by default, see npcs). Samples with discordant predicted and self-reported population groups are considered problematic. The function PCACheck performs the PCA and identifies population outliers in the study cohort.
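
The train-on-benchmark, predict-on-study idea can be illustrated with base R alone. The sketch below is not how PCACheck is implemented: it uses simulated genotypes, stats::prcomp instead of a GDS-based PCA, and a simple nearest-centroid rule as a stand-in for the SVM model. The population labels "A" and "B" and all variable names are made up for the example.

```r
set.seed(42)
n_snp <- 50

# Simulated allele frequencies for two hypothetical populations.
freq_a <- runif(n_snp, 0.1, 0.5)
freq_b <- runif(n_snp, 0.5, 0.9)

# Draw genotypes (0/1/2 allele counts); rows = samples, columns = SNPs.
sim_geno <- function(n, freq) {
  t(replicate(n, rbinom(length(freq), size = 2, prob = freq)))
}

geno <- rbind(
  sim_geno(20, freq_a), sim_geno(20, freq_b),  # benchmark samples
  sim_geno(5, freq_a), sim_geno(5, freq_b)     # study samples
)
label <- c(rep("A", 20), rep("B", 20), rep(NA, 10))  # NA: to be predicted
is_benchmark <- !is.na(label)

# PCA on all samples together; keep the top npcs components.
npcs <- 4
pcs <- prcomp(geno, center = TRUE, scale. = FALSE)$x[, seq_len(npcs)]

# "Train" on benchmark samples: one centroid per population in PC space.
bench_pcs <- pcs[is_benchmark, , drop = FALSE]
bench_label <- label[is_benchmark]
pops <- sort(unique(bench_label))
centroids <- sapply(pops, function(p) {
  colMeans(bench_pcs[bench_label == p, , drop = FALSE])
})  # npcs x populations

# Predict each study sample as the population with the nearest centroid.
study_pcs <- pcs[!is_benchmark, , drop = FALSE]
predicted <- unname(apply(study_pcs, 1, function(x) {
  pops[which.min(colSums((centroids - x)^2))]
}))
```

A nearest-centroid rule is used only to keep the example dependency-free; a support vector machine plays this role in the method described above.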

Value

a data frame with sample name, reported population, data resource (benchmark vs study cohort), the first four eigenvectors and the predicted population.
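
Given a result of this shape, population outliers are the study samples whose reported and predicted populations disagree. The snippet below is a sketch on a made-up data frame; the column names (sample, pop, type, pred.pop) and the population labels are assumptions for illustration and may not match the columns PCACheck actually returns. The eigenvector columns are omitted for brevity.

```r
# Toy result table with hypothetical column names.
res <- data.frame(
  sample   = c("S1", "S2", "S3", "B1"),
  pop      = c("PopA", "PopA", "PopB", "PopB"),       # self-reported population
  type     = c("study", "study", "study", "benchmark"),
  pred.pop = c("PopA", "PopB", "PopB", "PopB"),       # predicted population
  stringsAsFactors = FALSE
)

# Restrict to the study cohort and flag discordant samples.
study <- res[res$type == "study", ]
outliers <- study[study$pop != study$pred.pop, ]
outliers$sample  # "S2"
```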

Author(s)

Qian Liu qliu7@buffalo.edu

Examples

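# Load the example SeqSQC object bundled with the package.
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
# Rebuild the object so that it points at the installed example GDS file.
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
# Run the population outlier check.
seqfile <- PCACheck(seqfile, remove.samples=NULL, LDprune=TRUE, missing.rate=0.1)
# Extract the PCA results and show the last rows.
res.pca <- QCresult(seqfile)$PCA
tail(res.pca)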

Liubuntu/SeqSQC documentation built on April 12, 2024, 6:39 p.m.