disteg: Calculate distance between two gene expression data sets

disteg R Documentation

Description

Calculate a distance between all pairs of individuals, comparing the observed eQTL genotypes in a cross with the eQTL genotypes inferred from a gene expression data set.

Usage

disteg(
  cross,
  pheno,
  pmark,
  min.genoprob = 0.99,
  k = 20,
  min.classprob = 0.8,
  classprob2drop = 1,
  repeatKNN = TRUE,
  max.selfd = 0.3,
  phenolabel = "phenotype",
  weightByLinkage = FALSE,
  map.function = c("haldane", "kosambi", "c-f", "morgan"),
  verbose = TRUE
)

Arguments

cross

An object of class "cross" containing data for a QTL experiment. See the help file for qtl::read.cross() in the R/qtl package (https://rqtl.org). There must be a phenotype named "id" or "ID" that contains the individual identifiers.

pheno

A data frame of phenotypes (generally gene expression data), stored as individuals x phenotypes. The row names must contain individual identifiers.

pmark

Pseudomarkers that are closest to the genes in pheno, as output by find.gene.pseudomarker().

min.genoprob

Threshold on genotype probabilities; if the maximum probability is less than this, the observed genotype is taken as NA.

k

Number of nearest neighbors to consider in forming a k-nearest neighbor classifier.

min.classprob

Minimum proportion of neighbors with a common class to make a class prediction.

classprob2drop

If an individual is inferred to have a genotype mismatch with classprob greater than this value, treat it as an outlier, drop it from the analysis, and repeat the KNN construction without it.

repeatKNN

If TRUE, repeat the k-nearest neighbor classification a second time, after omitting individuals who do not appear to be self-self matches.

max.selfd

Maximum distance from self (as proportion of mismatches between observed and predicted eQTL genotypes); individuals with a larger distance are excluded from the second round of k-nearest neighbor.

phenolabel

Label for expression phenotypes to place in the output distance matrix.

weightByLinkage

If TRUE, weight the eQTL to account for their relative positions (for example, two tightly linked eQTL would each count about half as much as an isolated eQTL).

map.function

Map function (relating genetic distance to recombination fraction); used only if weightByLinkage is TRUE.

verbose

If TRUE, give verbose output.

Details

We consider the expression phenotypes in batches, grouped by the pseudomarker to which they are closest. For each batch, we pull the genotype probabilities at the corresponding pseudomarker. The individuals that are in common between cross and pheno, and whose maximum genotype probability is at least min.genoprob, are used to form a classifier of eQTL genotype from expression values, using k-nearest neighbor (the function class::knn()). The classifier is applied to all individuals with expression data, to give a predicted eQTL genotype. (If the proportion of the k nearest neighbors with a common class is less than min.classprob, the predicted eQTL genotype is left as NA.)
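
The classification step for a single batch might be sketched as follows. This is a minimal illustration on simulated data, not the internal code of disteg(); the objects geno, expr and maxprob are hypothetical stand-ins for the observed genotypes, the expression values and the maximum genotype probabilities at one pseudomarker.

```r
library(class)  # provides knn()

set.seed(1)
# simulated data (hypothetical): 90 individuals, 2 expression traits,
# 3 genotype classes at one eQTL
geno <- rep(1:3, each = 30)
expr <- cbind(trait1 = rnorm(90, mean = geno * 3),
              trait2 = rnorm(90, mean = -geno * 3))

# hypothetical maximum genotype probabilities; three are uncertain
maxprob <- rep(1, 90)
maxprob[c(5, 35, 65)] <- 0.6

min.genoprob <- 0.99
min.classprob <- 0.8
k <- 20

# train only on individuals whose genotype is confidently called
keep <- maxprob >= min.genoprob

# classify all individuals with expression data
pred <- knn(train = expr[keep, , drop = FALSE], test = expr,
            cl = factor(geno[keep]), k = k, prob = TRUE)

# proportion of the k neighbors that voted for the winning class
classprob <- attr(pred, "prob")

# leave the prediction as NA when the vote is not decisive enough
infg <- as.integer(as.character(pred))
infg[classprob < min.classprob] <- NA

table(observed = geno, inferred = infg, useNA = "ifany")
```

With prob = TRUE, class::knn() attaches the winning vote proportion as the "prob" attribute, which is what makes the min.classprob cutoff straightforward to apply.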

If repeatKNN is TRUE, we repeat the construction of the k-nearest neighbor classifier after first omitting individuals whose proportion of mismatches between observed and inferred eQTL genotypes is greater than max.selfd.
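
The selection of individuals for the second round can be sketched as below, assuming the observed and inferred eQTL genotypes are held in two matrices with the same individuals in the rows. The matrices obsg and infg here are small made-up examples, and the code is only an illustration of the rule, not the package's implementation.

```r
set.seed(2)
n.ind <- 8
n.eqtl <- 10

# hypothetical observed genotypes (individuals x eQTL)
obsg <- matrix(sample(1:3, n.ind * n.eqtl, replace = TRUE), nrow = n.ind,
               dimnames = list(paste0("ind", 1:n.ind),
                               paste0("eqtl", 1:n.eqtl)))

# hypothetical inferred genotypes: identical, except that "ind3"
# mismatches everywhere (as for a sample mix-up) and one call is missing
infg <- obsg
infg["ind3", ] <- ifelse(obsg["ind3", ] == 1, 2, 1)
infg["ind1", 1] <- NA

max.selfd <- 0.3

# distance from self: proportion of mismatches, ignoring missing calls
selfd <- rowMeans(obsg != infg, na.rm = TRUE)

# individuals retained for rebuilding the classifier
keep <- names(selfd)[!is.na(selfd) & selfd <= max.selfd]
keep
```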

Finally, we calculate the distance between the observed eQTL genotypes for each individual in cross and the inferred eQTL genotypes for each individual in pheno, as the proportion of mismatches between the observed and inferred eQTL genotypes.
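
As a small worked illustration of this distance, the following sketch computes the proportion of mismatches for every (cross individual, pheno individual) pair, skipping eQTL where either genotype is missing. The helper mismatch_dist() and the toy matrices are hypothetical and are not part of the package.

```r
# proportion of mismatches between each row of obsg and each row of infg;
# also returns the number of eQTL compared for each pair
mismatch_dist <- function(obsg, infg) {
  stopifnot(ncol(obsg) == ncol(infg))
  d <- denom <- matrix(NA_real_, nrow(obsg), nrow(infg),
                       dimnames = list(rownames(obsg), rownames(infg)))
  for (i in seq_len(nrow(obsg))) {
    for (j in seq_len(nrow(infg))) {
      ok <- !is.na(obsg[i, ]) & !is.na(infg[j, ])
      denom[i, j] <- sum(ok)
      if (denom[i, j] > 0) {
        d[i, j] <- sum(obsg[i, ok] != infg[j, ok]) / denom[i, j]
      }
    }
  }
  list(d = d, denom = denom)
}

# toy data: 2 individuals with genotypes, 3 with expression, 4 eQTL
obsg <- rbind(A = c(1, 2, 3, 1),
              B = c(2, 2, 1, NA))
infg <- rbind(x = c(1, 2, 3, 1),
              y = c(2, 2, 1, 3),
              z = c(1, NA, 3, 2))

res <- mismatch_dist(obsg, infg)
res$d      # A matches x perfectly; B matches y perfectly
res$denom  # number of eQTL contributing to each distance
```

Keeping the denominators alongside the distances is useful because a distance based on very few eQTL is much less reliable than one based on many.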

If weightByLinkage is TRUE, we use weights on the mismatch proportions for the various eQTL, taking into account their linkage. Two tightly linked eQTL will each be given half the weight of a single isolated eQTL.
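
One way such weights could behave is sketched below. This is an assumption for illustration only and is not necessarily the scheme used inside disteg(): each eQTL is weighted by the reciprocal of its summed correlation with all eQTL, where the correlation between two loci is taken as 1 - 2r and r is the recombination fraction from the Haldane map function. The positions and chromosome labels are made up.

```r
# Haldane map function: genetic distance (cM) -> recombination fraction
haldane_rf <- function(d_cM) 0.5 * (1 - exp(-d_cM / 50))

# hypothetical eQTL: two tightly linked on chr 1, one alone on chr 2
chr <- c("1", "1", "2")
pos <- c(10, 10.5, 40)   # positions in cM

# pairwise recombination fractions; unlinked loci get r = 1/2
rf <- haldane_rf(abs(outer(pos, pos, "-")))
rf[outer(chr, chr, "!=")] <- 0.5

# assumed weighting: reciprocal of the summed correlations (1 - 2r)
corr <- 1 - 2 * rf
wts <- 1 / rowSums(corr)
wts   # roughly 0.5, 0.5, 1

# weighted versus unweighted mismatch proportion for one pair
mismatch <- c(TRUE, TRUE, FALSE)
unweighted <- mean(mismatch)
weighted <- sum(wts * mismatch) / sum(wts)
c(unweighted = unweighted, weighted = weighted)
```

Under this assumed scheme the two linked eQTL together count about as much as the single isolated one, so a mismatch at both of them moves the distance less than it would without weights.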

Value

A matrix with nind(cross) rows and nrow(pheno) columns, containing the distances. The individual IDs are in the row and column names. The matrix is assigned class "lineupdist".

The names of the genes that were used to construct the classifier are saved in an attribute "retained".

The observed and inferred eQTL genotypes are saved as attributes "obsg" and "infg".

The denominators of the proportions that form the inter-individual distances are in the attribute "denom".
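
The attributes can be pulled out with attr(). The sketch below builds a small mock object with the same class and attribute names as described above, purely so that the code runs without the package data; the values are invented, and a real result would come from disteg().

```r
# mock object mimicking the documented structure (values are made up)
ids <- c("mouse1", "mouse2")
d <- structure(
  matrix(c(0.0, 0.9,
           0.8, 0.1), nrow = 2, byrow = TRUE,
         dimnames = list(ids, ids)),
  class = "lineupdist",
  retained = c("gene1", "gene2"),
  denom = matrix(2, nrow = 2, ncol = 2, dimnames = list(ids, ids))
)

# genes used to construct the classifier
attr(d, "retained")

# number of eQTL behind each distance
attr(d, "denom")

# plain numeric matrix of distances, with the class removed
dmat <- unclass(d)

# self-self distances, for IDs present in both rows and columns
common <- intersect(rownames(dmat), colnames(dmat))
selfd <- dmat[cbind(common, common)]
names(selfd) <- common
selfd
```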

Author(s)

Karl W Broman, broman@wisc.edu

See Also

distee(), summary.lineupdist(), pulldiag(), omitdiag(), findCommonID(), find.gene.pseudomarker(), calc.locallod(), plot.lineupdist(), class::knn(), plotEGclass()

Examples

library(qtl)
library(lineup)

# load example data
data(f2cross, expr1, pmap, genepos)


# calculate QTL genotype probabilities
f2cross <- calc.genoprob(f2cross, step=1)

# find nearest pseudomarkers
pmark <- find.gene.pseudomarker(f2cross, pmap, genepos)

# line up individuals
id <- findCommonID(f2cross, expr1)

# calculate LOD score for local eQTL
locallod <- calc.locallod(f2cross[,id$first], expr1[id$second,], pmark)

# take those with LOD > 25
expr1s <- expr1[,locallod>25,drop=FALSE]

# calculate distance between individuals
#     (prop'n mismatches between obs and inferred eQTL geno)
d <- disteg(f2cross, expr1s, pmark)

# plot distances
plot(d)

# summary of apparent mix-ups
summary(d)

# plot of classifier for first and second eQTL
par(mfrow=c(2,1), las=1)
plotEGclass(d)
plotEGclass(d, 2)


kbroman/lineup documentation built on May 10, 2023, 6:02 p.m.