knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(knitr)
library(RecordMatching)

Overview

This vignette demonstrates a step-by-step guide for record-matching between SNP profiles to STR profiles when two samples are from the identical individual using linkage disequilibrium between samples. The method is based on Kim et al. 2018.

Extensions to related individuals use the same pipeline as detailed in this vignette only with different input files.

Setup

External software requirements

In order to run the pipeline, there are two external software requires.

Set environment variables

We first need to set the base directory where the output files of the record matching pipeline will be saved. We also set the full path to the BEAGLE jar file and vcftools executable. The following environment variables need to be set.

The above variables will be used in all functions in the record-matching pipeline and is set by setup function. To run this vignette, set the environment variables in the setup function below to user's own local paths:

setup(base.dir='(path to local save directory)',
      bgl.jar='(full path to dir)/beagle.(version).jar',
      vcf.exe='(full path to dir)/vcftools')

Input files requirements

Each file should contain information about single STR locus and/or its surrounding SNPs only.

In addition, HapMap GrCh36, GrCh37, or GrCh38 genetic maps with cM units in PLINK format are required. Each file (map.f) should contain a map info of a single chromosome. They can be downloaded from the BEAGLE web.

In this vignette, all the above files are provided as a package data.

Markers

In this example, we will use the following 13 CODIS loci.

marker.f <- system.file("extdata", "marker_positions.txt",
                        package="RecordMatching", mustWork=TRUE)
marker.info <- read.table(marker.f, header=TRUE, as.is=TRUE)
kable(marker.info)

One can include specific loci only by subsetting the marker.info as shown in the example below. For a quick testing of the pipeline, start with one locus.

loci.include <- c(3,5)
marker.info <- marker.info[loci.include,]
n.loci <- dim(marker.info)[1]

In general, including more loci results in the better record-matching accuracy. In this vignette, all 13 available CODIS loci are used.

Phase SNP-STR reference panel

If the reference panel is unphased, it must be phased first using BEAGLE. Any missing values are also imputed in this step. phase.ref function saves the output phased reference panel to (base.dir)/ref_phased directory. The output file name is the same as ref.f but with _phs postfix.

From the STR-SNP reference panel, each STR marker's allele frequency is computed using ref.al.freq function. The output is saved to (base.dir)/ref_alfrq directory. Each marker's allele frequency file has ref_(marker name).frq format.

for (m in marker.info$name) {
    ref.f <- system.file("extdata", "twin_test", "reference",
                         paste0("ref_", m, ".vcf"),
                         package="RecordMatching", mustWork=TRUE)

    # if reference panel is unphased, phase them. 
    phase.ref(ref.f)

    # otherwise, go straight to allele frequency
    ref.al.freq(ref.f, m)
}

Impute STR genotypes and compute compute genotype probabilities

We now impute genotypes of individuals at a STR marker locus from its surrounding SNPs snp.f using a reference panel ref.f and a genetic map map.fby running impute.str function.

Note that the reference panel must be phased and have no missing values for imputation with BEAGLE.

The imputed and phased SNP-STR vcf file is stored at (base.dir)/imputed_str where the base.dir was set by running setup at the beginning of the pipeline. From the imputed SNP-STR haplotype, the impute.str function also extracts imputed genotypes and genotype probabilities at the STR locus and save it in the (base.dir)/imputed_str folder. The output file name containing imputed genotypes probabilities is imp_str_(marker name).GP.FORMAT.

for (i in 1:dim(marker.info)[1]) {
    m <- marker.info$name[i]
    chr <- marker.info$chr[i]
    snp.f <- system.file("extdata", "twin_test", "SNP",
                         paste("snp_", m, ".vcf", sep=''),
                         package="RecordMatching", mustWork=TRUE)
    ref.f <- system.file("extdata", "twin_test", "reference",
                         paste("ref_", m, ".vcf", sep=''),
                         package="RecordMatching", mustWork=TRUE)
    map.f <- system.file("extdata", "twin_test", "plink.GRCh36.map",
                         paste("plink.chr", chr, '.GRCh36.map', sep=''),
                         package="RecordMatching", mustWork=TRUE)

    impute.str(snp.f=snp.f, ref.f=ref.f, map.f=map.f, marker=m)
}

Compute match score matrix

We characterize the familial relationship between two individuals $A$ and $B$ by a vector of the three Cotterman identity coefficients, $\mathbf{\Delta} = (\Delta_0, \Delta_1, \Delta_2)$, respectively representing the probabilities of three identity states $C_0, C_1, C_2$. Each $\Delta_k$ represents the probability that two diploid individuals share $k$ alleles identically by descent at an autosomal locus .

For two non-inbred individuals, individual $A$ with a STR profile $R_A$ and individual $B$ with a SNP profile $S_B$, we test a hypothesis that $A$ and $B$ are related with a relationship defined by $\mathbf{\Delta}_\text{test}$, against the null hypothesis in which the two individuals are unrelated. To test the hypothesis, we use the log-likelihood-ratio match score:

$$\lambda(R_A, S_B) = \ln [\mathbb{P}(R_A \mid S_B, \mathbf{\Delta}_\text{test})] - \ln[\mathbb{P}(R_A) ].$$

Assuming independence of the STR loci, so that genotypes at separate STRs are independent, the match score can be expressed as a sum of log-likelihood ratios across STR loci: $$\lambda(R_A, S_B) = \sum_{\ell=1}^{L} \Bigg[\ln [\mathbb{P}(R_{A\ell} \mid S_{B\ell}, \mathbf{\Delta}\text{test}) ] - \ln [ \mathbb{P}(R{A\ell}) ] \Bigg].$$ Three relatedness scenarios considered in Kim et al. 2018 are

Here, we demonstrate the case $\mathbf{\Delta}\text{test}=(0,0,1)$ when the true relationship is $\mathbf{\Delta}\text{true}=(0,0,1)$. Other test hypothesis can be examined by updating cott.vec below.

cott.vec <- c(0,0,1) 

llr.single.irt <- list()

for (i in 1:dim(marker.info)[1]) {
    m <- marker.info$name[i]
    str.f <- system.file("extdata", "twin_test", "STR",
                         paste("str_", m, ".GT.FORMAT", sep=''),
                         package="RecordMatching", mustWork=TRUE)
    llr.single.irt[[i]] <- comp.match.mat(cott.vec=cott.vec, marker=m, str.f=str.f)
    names(llr.single.irt)[i] <- m
}

mat <- Reduce('+', llr.single.irt)

The final match score matrix using r n.loci loci is shown below. The following figure corresponds to Figure 2(A) in Kim et al. 2018.

set.seed(312)
brks.crse <- c(-1000,-40,-30,-20,-10,0,10,20,1000)
rand.order <- sample.int(dim(mat)[1])
par(mar=c(1,1,0,0))
image(mat[rand.order, rand.order], axes=FALSE, asp=1, frame.plot=FALSE,
      col=rev(heat.colors(length(brks.crse) - 1)), breaks=brks.crse)
mtext('STR Profile', side=1)
mtext('SNP Profile', side=2)

Compute match accuracies

From the match score matrix, we now compute the record-matching accuracy. We consider four different match-assignment scenarios: one-to-one matching, one-to-many matching with a query SNP profile, one-to-many matching with a query STR profile, and needle-in-haystack matching.

In one-to-one matching, we assume that it is already known that each profile in the SNP dataset has exactly one true relative in the STR dataset and vice versa. To find the pairing of profiles that maximizes the sum of match scores across all paired profiles, we use the Hungarian algorithm. Record-matching accuracy is defined as the fraction of STR profiles matched to the correct SNP profiles.

The match accuracies under all four scenarios can be computed using comp.match.acc function.

match.acc <- comp.match.acc(mat)
match.acc <- comp.match.acc(mat)
kable(match.acc)


jk2236/RecordMatching documentation built on Dec. 21, 2021, 12:10 a.m.