pcair: PC-AiR: Principal Components Analysis in Related Samples

Description Usage Arguments Details Value Author(s) References See Also Examples

Description

pcair is used to perform a Principal Components Analysis using genome-wide SNP data for the detection of population structure in a sample. Unlike a standard PCA, PC-AiR accounts for sample relatedness (known or cryptic) to provide accurate ancestry inference that is not confounded by family structure.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
## S4 method for signature 'gds.class'
pcair(gdsobj, kinobj = NULL, divobj = NULL,
                            kin.thresh = 2^(-11/2), div.thresh = -2^(-11/2),
                            unrel.set = NULL, 
                            sample.include = NULL, snp.include = NULL,
                            num.cores = 1L, verbose = TRUE, ...)
## S4 method for signature 'SNPGDSFileClass'
pcair(gdsobj, ...)
## S4 method for signature 'GdsGenotypeReader'
pcair(gdsobj, ...)
## S4 method for signature 'GenotypeData'
pcair(gdsobj, ...)
## S4 method for signature 'SeqVarGDSClass'
pcair(gdsobj, ...)

Arguments

gdsobj

An object providing a connection to a GDS file.

kinobj

A symmetric matrix of pairwise kinship coefficients for every pair of individuals in the sample: upper and lower triangles must both be filled; diagonals should be self-kinship or set to a non-missing constant value. This matrix is used for partitioning the sample into the 'unrelated' and 'related' subsets. See 'Details' for how this interacts with kin.thresh and unrel.set. IDs for each individual must be set as the column names of the matrix. This matrix may also be provided as a GDS object; see 'Details'.

divobj

A symmetric matrix of pairwise ancestry divergence measures for every pair of individuals in the sample: upper and lower triangles must both be filled; diagonals should be set to a non-missing constant value. This matrix is used for partitioning the sample into the 'unrelated' and 'related' subsets. See 'Details' for how this interacts with div.thresh. IDs for each individual must be set as the column names of the matrix. This matrix may be identical to kinobj. This matrix may be NULL to ignore ancestry divergence. This matrix may also be provided as a GDS object; see 'Details'.

kin.thresh

Threshold value on kinobj used for declaring each pair of individuals as related or unrelated. The default value is 2^(-11/2) ~ 0.022, corresponding to 4th degree relatives. See 'Details' for how this interacts with kinobj.

div.thresh

Threshold value on divobj used for deciding if each pair of individuals is ancestrally divergent. The default value is -2^(-11/2) ~ -0.022. See 'Details' for how this interacts with divobj.

unrel.set

An optional vector of IDs for identifying individuals that are forced into the unrelated subset. See 'Details' for how this interacts with kinobj.

sample.include

An optional vector of IDs for selecting samples to consider for either set.

snp.include

An optional vector of snp or variant IDs to use in the analysis.

num.cores

The number of cores to use.

verbose

Logical indicator of whether updates from the function should be printed to the console; the default is TRUE.

...

Additional arguments to pass to snpgdsPCA, such as eigen.cnt to control the number of PCs returned.

Details

The basic premise of PC-AiR is to partition the entire sample of individuals into an ancestry representative 'unrelated subset' and a 'related set', perform standard PCA on the 'unrelated subset', and predict PC values for the 'related subset'.

We recommend using software that accounts for population structure to estimate pairwise kinship coefficients to be used in kinobj. Any pair of individuals with a pairwise kinship greater than kin.thresh will be declared 'related.' Kinship coefficient estimates from the KING-robust software are typically used as measures of ancestry divergence in divobj. Any pair of individuals with a pairwise divergence measure less than div.thresh will be declared ancestrally 'divergent'. Typically, kin.thresh and div.thresh are set to be the amount of error around 0 expected in the estimate for a pair of truly unrelated individuals.

There are multiple ways to partition the sample into an ancestry representative 'unrelated subset' and a 'related subset'. In all of the scenarios described below, the set of all samples is limited to those in sample.include when it is specified (i.e. not NULL):

If kinobj is specified, divobj is specified, and unrel.set = NULL, then the PC-AiR algorithm is used to find an 'optimal' partition of all samples (see 'References' for a paper describing the PC-AiR algorithm).

If kinobj is specified, divobj is specified, and unrel.set is specified, then all individuals with IDs in unrel.set are forced in the 'unrelated subset' and the PC-AiR algorithm is used to partition the rest of the sample; this is especially useful for including reference samples of known ancestry in the 'unrelated subset'.

If kinobj is specified, and divobj = NULL, then kinobj is used to define the unrelated set but ancestry divergence is ignored.

If kinobj = NULL, and unrel.set is specified, then all individuals with IDs in unrel.set are put in the 'unrelated subset' and the rest of the individuals are put in the 'related subset'.

If kinobj = NULL, and unrel.set = NULL, then the function will perform a "standard" PCA analysis.

NOTE: kinobj and divobj may be identical.

Value

An object of class 'pcair'. A list including:

vectors

A matrix of principal components; each column is a principal component. Sample IDs are provided as rownames. The number of PCs returned can be adjusted by supplying the eigen.cnt argument, which is passed to snpgdsPCA.

values

A vector of eigenvalues matching the principal components. These values are determined from the standard PCA run on the 'unrelated subset'.

rels

A vector of IDs for individuals in the 'related subset'.

unrels

A vector of IDs for individuals in the 'unrelated subset'.

kin.thresh

The threshold value used for declaring each pair of individuals as related or unrelated.

div.thresh

The threshold value used for determining if each pair of individuals is ancestrally divergent.

sample.id

A vector of IDs for the samples used in the analysis.

nsamp

The total number of samples in the analysis.

nsnps

The total number of SNPs used in the analysis.

varprop

The variance proportion for each principal component.

call

The function call passed to pcair.

method

A character string. Either "PC-AiR" or "Standard PCA" identifying which method was used for computing principal components. "Standard PCA" is used if no relatives were identified in the sample.

Author(s)

Matthew P. Conomos

References

Conomos M.P., Miller M., & Thornton T. (2015). Robust Inference of Population Structure for Ancestry Prediction and Correction of Stratification in the Presence of Relatedness. Genetic Epidemiology, 39(4), 276-293.

Manichaikul, A., Mychaleckyj, J.C., Rich, S.S., Daly, K., Sale, M., & Chen, W.M. (2010). Robust relationship inference in genome-wide association studies. Bioinformatics, 26(22), 2867-2873.

See Also

pcairPartition for a description of the function used by pcair that can be used to partition the sample into 'unrelated' and 'related' subsets without performing PCA. plot.pcair for plotting. kingToMatrix for creating a matrix of pairwise kinship coefficient estimates from KING output text files that can be used for kinobj or divobj. GWASTools for a description of the package containing the following functions: GenotypeData for a description of creating a GenotypeData class object for storing sample and SNP genotype data, MatrixGenotypeReader for a description of reading in genotype data stored as a matrix, and GdsGenotypeReader for a description of reading in genotype data stored as a GDS file. Also see snpgdsBED2GDS in the SNPRelate package for a description of converting binary PLINK files to GDS. The generic functions summary and print.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
# file path to GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
# read in GDS data
HapMap_geno <- gdsfmt::openfn.gds(gdsfile)
# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")
# run PC-AiR
mypcair <- pcair(HapMap_geno, kinobj = HapMap_ASW_MXL_KINGmat, 
                 divobj = HapMap_ASW_MXL_KINGmat)
gdsfmt::closefn.gds(HapMap_geno)

Example output

Using kinobj and divobj to partition samples into unrelated and related sets
Working with 173 samples
Identifying relatives for each sample using kinship threshold 0.0220970869120796
Identifying pairs of divergent samples using divergence threshold -0.0220970869120796
Partitioning samples into unrelated and related sets...
Unrelated Set: 97 Samples 
Related Set: 76 Samples
Performing PCA on the Unrelated Set...
Principal Component Analysis (PCA) on genotypes:
Excluding 0 SNP on non-autosomes
Excluding 3 SNPs (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
    # of samples: 97
    # of SNPs: 19,997
    using 1 thread
    # of principal components: 32
PCA:    the sum of all selected genotypes (0,1,2) = 1022535
CPU capabilities: Double-Precision SSE2
Thu Feb 11 08:11:22 2021    (internal increment: 1180)

[..................................................]  0%, ETC: ---        
[==================================================] 100%, completed, 0s
Thu Feb 11 08:11:22 2021    Begin (eigenvalues and eigenvectors)
Thu Feb 11 08:11:22 2021    Done.
Predicting PC Values for the Related Set...
SNP Loading:
    # of samples: 97
    # of SNPs: 19,997
    using 1 thread
    using the top 32 eigenvectors
SNP Loading:    the sum of all selected genotypes (0,1,2) = 1022535
Thu Feb 11 08:11:22 2021    (internal increment: 9456)

[..................................................]  0%, ETC: ---        
[==================================================] 100%, completed, 0s
Thu Feb 11 08:11:22 2021    Done.
Sample Loading:
    # of samples: 76
    # of SNPs: 19,997
    using 1 thread
    using the top 32 eigenvectors
Sample Loading:    the sum of all selected genotypes (0,1,2) = 799404
Thu Feb 11 08:11:22 2021    (internal increment: 12072)

[..................................................]  0%, ETC: ---        
[==================================================] 100%, completed, 0s
Thu Feb 11 08:11:22 2021    Done.

GENESIS documentation built on Jan. 30, 2021, 2:01 a.m.