or relatedness

pcrelate

R Documentation

PC-Relate: Model-Free Estimation of Recent Genetic Relatedness

Description

pcrelate is used to estimate kinship coefficients, IBD sharing probabilities, and inbreeding coefficients using genome-wide SNP data. PC-Relate accounts for population structure (ancestry) among sample individuals through the use of ancestry representative principal components (PCs) to provide accurate relatedness estimates due only to recent family (pedigree) structure.

Usage

## S4 method for signature 'GenotypeIterator'
pcrelate(gdsobj,
	pcs,
	scale = c('overall', 'variant', 'none'),
	ibd.probs = TRUE,
	sample.include = NULL,
	training.set = NULL,
	sample.block.size = 5000,
	maf.thresh = 0.01,
	maf.bound.method = c('filter', 'truncate'),
	small.samp.correct = TRUE,
	BPPARAM = bpparam(),
	verbose = TRUE)
## S4 method for signature 'SeqVarIterator'
pcrelate(gdsobj,
	pcs,
	scale = c('overall', 'variant', 'none'),
	ibd.probs = TRUE,
	sample.include = NULL,
	training.set = NULL,
	sample.block.size = 5000,
	maf.thresh = 0.01,
	maf.bound.method = c('filter', 'truncate'),
	small.samp.correct = TRUE,
	BPPARAM = bpparam(),
	verbose = TRUE)
samplesGdsOrder(gdsobj, sample.include)
calcISAFBeta(gdsobj,
        pcs,
        sample.include,
        training.set = NULL,
	    BPPARAM = bpparam(),
        verbose = TRUE)
pcrelateSampBlock(gdsobj,
        betaobj,
        pcs,
        sample.include.block1,
        sample.include.block2,
	scale = c('overall', 'variant', 'none'),
	ibd.probs = TRUE,
	maf.thresh = 0.01,
	maf.bound.method = c('filter', 'truncate'),
	BPPARAM = bpparam(),
	verbose = TRUE)
correctKin(kinBtwn, kinSelf,
        pcs,
        sample.include = NULL)
correctK2(kinBtwn, kinSelf,
        pcs,
        sample.include = NULL,
        small.samp.correct = TRUE)
correctK0(kinBtwn)

Arguments

`gdsobj`	An object of class `SeqVarIterator` from the package SeqVarTools, or an object of class `GenotypeIterator` from the package GWASTools, containing the genotype data for the variants and samples to be used for the analysis.
`pcs`	A matrix of principal components (PCs) to be used for ancestry adjustment. Each column represents a PC, and each row represents an individual. IDs for each individual must be set as the row names of the matrix.
`scale`	A character string taking the values 'overall', 'variant', or 'none' indicating how genotype values should be standardized. This should be set to 'overall' (the default) in order to do a PC-Relate analysis; see 'Details' for more information.
`ibd.probs`	Logical indicator of whether pairwise IBD sharing probabilities (k0, k1, k2) should be estimated; the default is TRUE.
`sample.include`	A vector of IDs for samples to include in the analysis. If NULL, all samples in `gdsobj` are included.
`training.set`	An optional vector of IDs identifying which samples to use for estimation of the ancestry effect when estimating individual-specific allele frequencies. If NULL, all samples in sample.include are used. See 'Details' for more information.
`sample.block.size`	The number of individuals to read-in/analyze at once; the default value is 5000. See 'Details' for more information.
`maf.thresh`	Minor allele frequency threshold; if an individual's estimated individual-specific minor allele frequency at a SNP is less than this value, that indivdiual will either have that SNP excluded from the analysis or have their estimated indivdiual-specific minor allele frequency truncated to this value, depending on `maf.bound.method`. The default value is 0.01.
`maf.bound.method`	How individual-specific minor allele frequency estimates less that `maf.thresh` are handled. When set to ‘filter' (default), SNPs for which an individual’s estimated individual-specific minor allele frequency are below `maf.thresh` are excluded from the analysis for that individual. When set to 'truncate', estimated individual-specific minor allele frequncies below `maf.thresh` have their value set to `maf.thresh`.
`small.samp.correct`	Logical indicator of whether to implement a small sample correction. The default is `TRUE`, but must be set to `FALSE` if `sample.block.size` is less than the number of samples or if `scale` is `'none'`.
`BPPARAM`	A `BiocParallelParam` object to process blocks of variants in parallel. If not provided, the default back-end returned by `bpparam` will be used.
`verbose`	Logical indicator of whether updates from the function should be printed to the console; the default is TRUE.
`betaobj`	Outut of `calcISAFBeta`.
`sample.include.block1`	A vector of IDs for samples to include in block 1.
`sample.include.block2`	A vector of IDs for samples to include in block 2.
`kinBtwn`	Output of `pcrelateSampBlock`.
`kinSelf`	Output of `pcrelateSampBlock`.

Details

The basic premise of PC-Relate is to estimate kinship coefficients, IBD sharing probabilities, and inbreeding coefficients that reflect recent family (pedigree) relatedness by conditioning out genetic similarity due to distant population structure (ancestry) with ancestry representative principal components (PCs).

It is important that the PCs used in pcs to adjust for ancestry are representative of ancestry and NOT family structure, so we recommend using PCs calculated with PC-AiR (see: pcair).

pcrelate uses the BiocParallel package to process iterator chunks in parallel. See the BiocParallel documentation for more information on the default behaviour of bpparam and how to register different parallel backends. If serial execution is desired, set BPPARAM=BiocParallel::SerialParam(). Note that parallel execution requires more RAM than serial execution.

In order to perform relatedness estimation, allele frequency estimates are required for centering and scaling genotype values. Individual-specific allele frequencies calculated for each individual at each SNP using the PCs specified in pcs are used. There are muliple choices for how genotype values are scaled. When scale is 'variant', centered genotype values at each SNP are divided by their expected variance under Hardy-Weinberg equilibrium. When scale is 'overall', centered genotype values at all SNPs are divided by the average across all SNPs of their expected variances under Hardy-Weinberg equilibrium; this scaling leads to more stable behavior when using low frequency variants. When scale is 'none', genotype values are only centered and not scaled; this won't provide accurate kinship coefficient estimates but may be useful for other purposes. Set scale to 'overall' to perform a standard PC-Relate analysis; this is the default. If scale is set to 'variant', the estimators are very similar to REAP.

The optional input training.set allows the user to specify which samples are used to estimate the ancestry effect when estimating individual-specific allele frequencies. Ideally, training.set is a set of mutually unrelated individuals. If prior information regarding pedigree structure is available, this can be used to select training.set, or if pcair was used to obtain the PCs, then the individuals in the PC-AiR 'unrelated subset' can be used. If no prior information is available, all individuals should be used.

The sample.block.size can be specified to alleviate memory issues when working with very large data sets. If sample.block.size is smaller than the number of individuals included in the analysis, then individuals will be analyzed in separate blocks. This reduces the memory required for the analysis, but genotype data must be read in multiple times for each block (to analyze all pairs), which increases the number of computations required.

calcISAFBeta and pcrelateSampBlock are provided as separate functions to allow parallelization for large sample sizes. pcrelate calls both of these functions internally. When calling these functions separately, use samplesGdsOrder to ensure the sample.include argument is in the same order as the GDS file. Use correctKin, correctK2, and correctK0 after all sample blocks have been completed.

Value

An object of class 'pcrelate'. A list including:

`kinBtwn`	A data.frame of estimated pairwise kinship coefficients and IBD sharing probabilities (if `ibd.probs` is `TRUE`).
`kinSelf`	A data.frame of estimated inbreeding coefficients.

Author(s)

Matthew P. Conomos

References

Conomos M.P., Reiner A.P., Weir B.S., & Thornton T.A. (2016). Model-free Estimation of Recent Genetic Relatedness. American Journal of Human Genetics, 98(1), 127-148.

Examples

library(GWASTools)

# file path to GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
# read in GDS data
HapMap_geno <- GdsGenotypeReader(filename = gdsfile)
# create a GenotypeData class object
HapMap_genoData <- GenotypeData(HapMap_geno)
# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")
# run PC-AiR
mypcair <- pcair(HapMap_genoData, kinobj = HapMap_ASW_MXL_KINGmat, 
                divobj = HapMap_ASW_MXL_KINGmat)

# create a GenotypeBlockIterator object
HapMap_genoData <- GenotypeBlockIterator(HapMap_genoData)
# run PC-Relate
mypcrel <- pcrelate(HapMap_genoData, pcs = mypcair$vectors[,1,drop=FALSE],
				training.set = mypcair$unrels,
				BPPARAM=BiocParallel::SerialParam())
head(mypcrel$kinBwtn)
head(mypcrel$kinSelf)

grm <- pcrelateToMatrix(mypcrel)
dim(grm)

close(HapMap_genoData)

smgogarten/GENESIS documentation built on Feb. 4, 2025, 4:28 p.m.