preprocess.genetic.data: A function to pre-process case-parent triad or...

View source: R/preprocess.genetic.data.R

preprocess.genetic.data  R Documentation

A function to pre-process case-parent triad or disease-discordant sibling data.

Description

This function performs several pre-processing steps, intended for use before function run.gadgets.

Usage

preprocess.genetic.data(
  case.genetic.data,
  complement.genetic.data = NULL,
  father.genetic.data = NULL,
  mother.genetic.data = NULL,
  ld.block.vec = NULL,
  bp.param = bpparam(),
  snp.sampling.probs = NULL,
  categorical.exposures = NULL,
  continuous.exposures = NULL,
  mother.snps = NULL,
  child.snps = NULL,
  lower.order.gxe = FALSE
)

Arguments

case.genetic.data

The genetic data of the disease-affected children from case-parent trios or disease-discordant sibling pairs. If searching for maternal SNPs that are related to risk of disease in the child, some of the columns in case.genetic.data may contain maternal SNP genotypes (see argument mother.snps for how to indicate which SNP columns correspond to maternal genotypes). Columns are SNP allele counts, and rows are individuals. This object may either be of class 'matrix' OR of class 'big.matrix'. If of class 'big.matrix', it must be file-backed as type 'integer' (see the bigmemory package for more information). The ordering of the columns must be consistent with the LD structure specified in ld.block.vec. The genotypes cannot be dosages imputed with uncertainty.
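
The restriction to whole-number allele counts can be checked before calling the function. The following is a minimal sketch in base R. The helper is.hard.call is hypothetical and not part of the package, and it ignores missing values (how the package treats NA is not covered here).

```r
# Hypothetical helper (not part of the package): TRUE if every non-missing
# entry is an allele count of exactly 0, 1, or 2.
is.hard.call <- function(geno) {
  vals <- geno[!is.na(geno)]
  all(vals %in% c(0, 1, 2))
}

# Made-up genotype matrices for illustration.
hard.calls <- matrix(c(0L, 1L, 2L, 1L), nrow = 2)
dosages    <- matrix(c(0.1, 1.0, 1.9, 1.0), nrow = 2)

hard.ok   <- is.hard.call(hard.calls)  # hard calls pass the check
dosage.ok <- is.hard.call(dosages)     # fractional dosages fail it

# Whole numbers stored as doubles can be converted to integer storage.
whole.numbers <- matrix(c(0, 1, 2, 1), nrow = 2)
storage.mode(whole.numbers) <- "integer"
```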

complement.genetic.data

A genetic dataset for the controls corresponding to the genotypes in case.genetic.data. For SNPs that correspond to the affected child in case.genetic.data, the corresponding column in complement.genetic.data should be set equal to mother allele count + father allele count - case allele count. If using disease-discordant siblings, this argument should be the genotypes for the unaffected siblings. For SNPs in case.genetic.data that represent maternal genotypes (if any), the corresponding column in complement.genetic.data should be the paternal genotypes for that SNP. Regardless, complement.genetic.data may be an object of either class 'matrix' OR of class 'big.matrix'. If of class 'big.matrix', it must be file-backed as type 'integer' (see the bigmemory package for more information). Columns are SNP allele counts, rows are families. If not specified, father.genetic.data and mother.genetic.data must be specified. The genotypes cannot be dosages imputed with uncertainty.
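
As a minimal sketch of the complement calculation described above, assuming small made-up genotype matrices (the object names mom.toy, dad.toy, case.toy, and complement.toy are hypothetical and not part of the package):

```r
# Toy allele counts (0, 1, or 2): rows are families, columns are SNPs.
mom.toy  <- matrix(c(1L, 2L, 0L,
                     1L, 1L, 2L), nrow = 2, byrow = TRUE)
dad.toy  <- matrix(c(1L, 0L, 1L,
                     2L, 1L, 0L), nrow = 2, byrow = TRUE)
case.toy <- matrix(c(2L, 1L, 1L,
                     1L, 1L, 1L), nrow = 2, byrow = TRUE)

# Complement = mother allele count + father allele count - case allele count,
# computed element-wise for every family and SNP.
complement.toy <- mom.toy + dad.toy - case.toy
```

Because all three inputs are stored as integers, the result is also an integer matrix with the same dimensions as case.toy.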

father.genetic.data

The genetic data for the fathers of the cases in case.genetic.data. This should only be specified when searching for epistasis or GxGxE effects based only on case-parent triads, and not when searching for maternal SNPs that are related to the child's risk of disease. Columns are SNP allele counts, rows are individuals. This object may either be of class 'matrix' OR of class 'big.matrix'. If of class 'big.matrix', it must be file-backed as type 'integer' (see the bigmemory package for more information). The genotypes cannot be dosages imputed with uncertainty.

mother.genetic.data

The genetic data for the mothers of the cases in case.genetic.data. This should only be specified when searching for epistasis or GxGxE effects based only on case-parent triads, and not when searching for maternal SNPs that are related to the child's risk of disease. Columns are SNP allele counts, rows are individuals. This object may either be of class 'matrix' OR of class 'big.matrix'. If of class 'big.matrix', it must be file-backed as type 'integer' (see the bigmemory package for more information). The genotypes cannot be dosages imputed with uncertainty.

ld.block.vec

An integer vector specifying the linkage blocks of the input SNPs. As an example, for 100 candidate SNPs, suppose we specify ld.block.vec <- c(25, 50, 25). This vector indicates that the input genetic data has 3 distinct linkage blocks, with SNPs 1-25 in the first linkage block, 26-75 in the second block, and 76-100 in the third block. Note that this means the ordering of the columns (SNPs) in case.genetic.data must be consistent with the LD blocks specified in ld.block.vec. In the absence of outside information, a reasonable default is to consider SNPs to be in LD if they are located on the same biological chromosome. If case.genetic.data includes both maternal and child SNP genotypes, we recommend considering any maternal SNP and any child SNP located on the same nominal biological chromosome as 'in linkage'. E.g., we recommend considering any maternal SNPs located on chromosome 1 as being 'linked' to any child SNPs located on chromosome 1, even though, strictly speaking, the maternal and child SNPs are located on separate pieces of DNA. If not specified, ld.block.vec defaults to assuming all input SNPs are in linkage, which may be overly conservative and could adversely affect performance.
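
The block layout in the example above can be sketched in base R as follows. The chromosome labels in snp.chr are hypothetical, and the sketch assumes the SNP columns are already sorted by chromosome.

```r
# Block sizes from the example above: 100 SNPs in three linkage blocks.
ld.block.vec <- c(25, 50, 25)

# Block label for each SNP column (1 = first block, and so on).
block.id <- rep(seq_along(ld.block.vec), times = ld.block.vec)

# Last column index of each block.
block.ends <- cumsum(ld.block.vec)

# Hypothetical chromosome label for each SNP column, sorted by chromosome.
snp.chr <- rep(c("chr1", "chr2", "chr3"), times = c(25, 50, 25))

# Run lengths of the labels give one block size per chromosome.
ld.from.chr <- rle(snp.chr)$lengths
```

Using rle() only works when all SNPs from one chromosome sit in adjacent columns; otherwise the same chromosome would be split into several runs.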

bp.param

The BPPARAM argument to be passed to bplapply when estimating marginal disease associations for each SNP. If using a cluster computer, this parameter needs to be set with care. See BiocParallel::bplapply for more details.

snp.sampling.probs

A vector indicating the sampling probabilities of the SNPs in case.genetic.data. SNPs will be sampled in the genetic algorithm proportional to the values specified. If not specified, chi-square statistics of association will be computed for each SNP by default, and sampling will be proportional to the square root of those statistics. If user-specified, the values of snp.sampling.probs need not sum to 1; they only need to be positive real numbers. See argument prob from function sample for more details.
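
The following sketch illustrates weighted sampling with the prob argument of sample, using made-up chi-square statistics. It only demonstrates the weighting described above; how the package draws SNPs internally is not shown here, and all object names are hypothetical.

```r
set.seed(1)

# Hypothetical chi-square statistics for four SNPs.
chisq.toy <- c(0.5, 4, 9, 1)

# Sampling weights proportional to the square root of the statistics.
weights <- sqrt(chisq.toy)

# The weights need not sum to 1; these are the probabilities they imply
# for a single draw.
implied.probs <- weights / sum(weights)

# Draw two distinct SNP column indices using the unnormalized weights.
drawn <- sample(seq_along(weights), size = 2, replace = FALSE, prob = weights)
```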

categorical.exposures

(experimental) A matrix or data.frame of integers representing categorical exposures for the cases in case.genetic.data. Defaults to NULL, which will result in GADGETS looking for epistatic interactions, rather than SNP by exposure interactions. categorical.exposures should not be missing any data; families with missing exposure data should be removed from the analysis prior to input.

continuous.exposures

(experimental) A matrix or data.frame of numeric values representing continuous exposures corresponding to the families in case.genetic.data. Defaults to NULL, which will result in GADGETS searching for epistatic interactions, rather than SNP by exposure interactions. continuous.exposures should not be missing any data; families with missing exposure data should be removed from the analysis prior to input.
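
Removing families with missing exposure data before input can be sketched as below. The exposure names, values, and object names are all made up for illustration; the same row filter would also need to be applied to the parental genotype matrices.

```r
# Hypothetical exposure data for four families; family 3 has a missing value.
exposures.toy <- data.frame(
  smoke = c(0L, 1L, NA, 1L),         # categorical exposure coded as integers
  bmi   = c(22.5, 31.0, 27.3, 24.8)  # continuous exposure
)

# Toy case genotypes: four families (rows) by two SNPs (columns).
case.toy <- matrix(c(0L, 1L, 2L, 1L,
                     1L, 1L, 0L, 2L), nrow = 4)

# Keep only families with complete exposure data.
keep <- complete.cases(exposures.toy)
exposures.clean <- exposures.toy[keep, , drop = FALSE]
case.clean      <- case.toy[keep, , drop = FALSE]

# Split into the two exposure inputs.
categorical.toy <- exposures.clean[, "smoke", drop = FALSE]
continuous.toy  <- exposures.clean[, "bmi", drop = FALSE]
```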

mother.snps

If searching for maternal SNPs that are associated with disease in the child, the indices of the maternal SNP columns in object case.genetic.data. Otherwise does not need to be specified.

child.snps

If searching for maternal SNPs that are associated with disease in the child, the indices of the child SNP columns in object case.genetic.data. Otherwise does not need to be specified.
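
As a sketch of how the inputs for a maternal-SNP search could be laid out, following the argument descriptions above: child and maternal genotypes share case.genetic.data, child columns are paired with the complement, and maternal columns are paired with the paternal genotypes. The toy matrices and the names case.in and complement.in are hypothetical.

```r
# Toy data: two families (rows), two SNPs on one chromosome (columns).
child.toy <- matrix(c(2L, 1L,
                      1L, 1L), nrow = 2, byrow = TRUE)
mom.toy   <- matrix(c(1L, 2L,
                      1L, 1L), nrow = 2, byrow = TRUE)
dad.toy   <- matrix(c(1L, 0L,
                      2L, 1L), nrow = 2, byrow = TRUE)

# Columns 1-2 hold child genotypes, columns 3-4 hold maternal genotypes.
case.in     <- cbind(child.toy, mom.toy)
child.snps  <- 1:2
mother.snps <- 3:4

# Child columns pair with the complement (mother + father - child);
# maternal columns pair with the paternal genotypes.
complement.in <- cbind(mom.toy + dad.toy - child.toy, dad.toy)

# All four columns lie on the same chromosome, so they form one LD block.
ld.block.vec <- ncol(case.in)
```

With SNPs from several chromosomes, the columns would need to be ordered so that child and maternal SNPs from the same chromosome fall in the same block of ld.block.vec.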

lower.order.gxe

(experimental) A boolean indicating whether, if multiple exposures of interest are input, E-GADGETS should search only for genetic interactions with the joint combination of exposures (i.e., GxGxExE interactions), or if it should additionally search for lower-order interactions that involve subsets of the exposures that were input (i.e., GxGxE in addition to GxGxExE). The default, FALSE, restricts the search to GxGxExE interactions. Users should be cautious about including large numbers of input exposures, and, if they do, very cautious about setting this argument to TRUE.
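
To give a rough sense of why caution is advised, the sketch below counts the non-empty subsets of a set of exposures. This is only a counting illustration under the assumption that any non-empty subset could be a candidate; exactly which subsets the package examines is not specified here, and the exposure names are made up.

```r
# Hypothetical exposure names.
exposure.names <- c("smoking", "alcohol", "bmi")
k <- length(exposure.names)

# Joint combination only: a single exposure set.
full.set <- list(exposure.names)

# Every non-empty subset of the exposures, from single exposures up to
# the full set.
all.subsets <- unlist(
  lapply(seq_len(k), function(m) combn(exposure.names, m, simplify = FALSE)),
  recursive = FALSE
)

# The number of non-empty subsets grows as 2^k - 1.
n.subsets <- length(all.subsets)
```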

Value

A list containing the following:

case.genetic.data

A matrix of case/maternal genotypes.

complement.genetic.data

A matrix of complement/sibling/paternal genotypes. If running E-GADGETS, this is set to a 1x1 matrix whose single entry is 0, and is not used.

mother.genetic.data

If running E-GADGETS, a matrix of maternal genotypes; otherwise a 1x1 matrix whose single entry is 0.0, which is not used.

father.genetic.data

If running E-GADGETS, a matrix of paternal genotypes; otherwise a 1x1 matrix whose single entry is 0.0, which is not used.

chisq.stats

A vector of chi-square statistics corresponding to marginal SNP-disease associations, if snp.sampling.probs is not specified, and snp.sampling.probs otherwise.

ld.block.vec

A vector equal to cumsum(ld.block.vec).

exposure.mat

A design matrix of the input categorical and continuous exposures, if specified. Otherwise NULL.

E_GADGETS

A boolean indicating whether a GxGxE search is desired.

mother.snps

A vector of the column indices of maternal SNPs in case.genetic.data, set to NULL if not applicable.

child.snps

A vector of the column indices of child SNPs in case.genetic.data, set to NULL if not applicable.

Examples


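# Load the package that provides the example data and the function
library(epistasisGA)

# Example genotype data for cases, fathers, and mothers
data(case)
data(dad)
data(mom)
case <- as.matrix(case)
dad <- as.matrix(dad)
mom <- as.matrix(mom)

# Pre-process the first 10 SNPs, treated as a single linkage block
res <- preprocess.genetic.data(case[, 1:10],
                               father.genetic.data = dad[ , 1:10],
                               mother.genetic.data = mom[ , 1:10],
                               ld.block.vec = c(10))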


mnodzenski/epistasisGA documentation built on Jan. 17, 2023, 7:07 p.m.