knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 10, fig.height = 10, out.width = "500px", out.height = "500px" )
The repvar package is designed to provide a simple way of finding the minimum number of samples needed to represent all variables in a given sparse binary matrix. The impetus for this was driven by a need to resequence a small set of samples from a population genetic study to serve as positive controls for allele calls.
The main algorithm is in the function rpv_indices()
. In it's simplest form,
the algorithm works like so:
There are four functions in this package:
rpv_indices()
searches a matrix with the above algorithm and returns a
vector of rownames for that matrix.rpv_find()
randomizes the rows of the matrix and performs rpv_indices()
on the randomizations to return a list of candidates to be the minimum set.rpv_stats()
computes diversity indices for a matrix.rpv_image()
allows you to visualize your matrix as an image.Because there are several possibilities for minimum number of samples, the data
should be randomized and iterated over several times. This can be done with
rpv_find()
:
library("repvar") data("monilinia") # First, view our data loci <- sapply(strsplit(colnames(monilinia), "[.]"), "[", 1) rpv_image(monilinia, f = loci) # Now randomly sample 200 times to get a set of vectors set.seed(2018) id_list <- rpv_find(monilinia, n = 200, cut = TRUE, progress = FALSE) id_list lengths(id_list)
We can see here that we have 3 possible sets of samples that have 30 samples
each. We can compare them using rpv_stats()
, which calculate entropy
statistics:
do.call("rbind", lapply(id_list, function(i) rpv_stats(monilinia[i, ])))
From this, we can see that they are all relatively equivalent, but the first set has the least missing data and the greatest evenness (E5).
Let's see what samples this represents:
rpv_image(monilinia, f = loci, highlight = id_list[[1]])
rpv_image(monilinia[id_list[[1]], ], f = loci)
To demonstrate why we need to shuffle the data, we will be using a data set representing 264 samples of Monilinia fructicola collected from peach orchards in Georgia genotyped at 13 microsatellite loci representing 95 alleles. The data are stored in a 264x95 matrix where each row represents a sample and each column represents an allele. Black cells indicate the presence of an allele, blue cells indicate an absence and white cells indicate missing data.
library("repvar") data("monilinia") loci <- sapply(strsplit(colnames(monilinia), "[.]"), "[", 1) rpv_image(monilinia, f = loci)
Immediately you should be able to see that some loci---like CHMFc4---have few alleles that are fairly well-represented while other loci---like SEA---have several alleles with the majority of them represented by a handful of samples.
If we run rpv_indices()
on our data, we get a vector of samples that represent
all of the alleles:
print(ids <- rpv_indices(monilinia)) length(ids) colSums(monilinia[ids, ]) # all of the variables are represented
Let's highlight where these sampels came from:
rpv_image(monilinia, f = loci, highlight = ids)
If we visualize just these samples, we can see that each locus is accounted for at least once:
rpv_image(monilinia[ids, ], f = loci)
However, if we shuffle the data set:
set.seed(2018) monilinia_shuffled <- monilinia[sample(nrow(monilinia)), ] rpv_image(monilinia_shuffled, f = loci) # shuffled data
We get a different answer:
print(new_ids <- rpv_indices(monilinia_shuffled)) par(mfrow = c(1, 2)) rpv_image(monilinia, f = loci, highlight = new_ids, newplot = FALSE, # avoid creating a new plot col = c("grey95", "grey10"), # color background grey idcol = c("#B15928", "#FFFF99") # reverse foreground colors ) rpv_image(monilinia, f = loci, highlight = ids, newplot = FALSE, # avoid creating a new plot col = c("grey95", "grey10"), # color background grey idcol = c("#B15928", "#FFFF99") # reverse foreground colors ) par(mfrow = c(1, 1))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.