In zkamvar/repvar: Extract Samples to Represent All Variables

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 10,
  fig.height = 10,
  out.width = "500px", 
  out.height = "500px"
)

What the repvar package does

The repvar package is designed to provide a simple way of finding the minimum number of samples needed to represent all variables in a given sparse binary matrix. The impetus for this was driven by a need to resequence a small set of samples from a population genetic study to serve as positive controls for allele calls.

How it works

The main algorithm is in the function rpv_indices(). In it's simplest form, the algorithm works like so:

count the number of observations in each column and sort them
set a counter to count the number of columns that haven't been accounted for
create a vector to store the sample names
while the counter is greater than zero i. record the first sample from each of the columns with the smallest number of observations ii. take the union of those sample names and any other sample names that have been collected iii. remove the columns these samples represent from the original column list iv. set the counter to the number of columns left in the original column list
verify that all the columns have been accounted for by the sample names
return the vector of sample names

Functions in this package

There are four functions in this package:

rpv_indices() searches a matrix with the above algorithm and returns a vector of rownames for that matrix.
rpv_find() randomizes the rows of the matrix and performs rpv_indices() on the randomizations to return a list of candidates to be the minimum set.
rpv_stats() computes diversity indices for a matrix.
rpv_image() allows you to visualize your matrix as an image.

Quickstart

Because there are several possibilities for minimum number of samples, the data should be randomized and iterated over several times. This can be done with rpv_find():

library("repvar")
data("monilinia")
# First, view our data
loci <- sapply(strsplit(colnames(monilinia), "[.]"), "[", 1)
rpv_image(monilinia, f = loci)

# Now randomly sample 200 times to get a set of vectors
set.seed(2018)
id_list <- rpv_find(monilinia, n = 200, cut = TRUE, progress = FALSE)
id_list
lengths(id_list)

We can see here that we have 3 possible sets of samples that have 30 samples each. We can compare them using rpv_stats(), which calculate entropy statistics:

do.call("rbind", lapply(id_list, function(i) rpv_stats(monilinia[i, ])))

From this, we can see that they are all relatively equivalent, but the first set has the least missing data and the greatest evenness (E5).

Let's see what samples this represents:

rpv_image(monilinia, f = loci, highlight = id_list[[1]])

rpv_image(monilinia[id_list[[1]], ], f = loci)

Detailed Example: Monilinia fructicola

To demonstrate why we need to shuffle the data, we will be using a data set representing 264 samples of Monilinia fructicola collected from peach orchards in Georgia genotyped at 13 microsatellite loci representing 95 alleles. The data are stored in a 264x95 matrix where each row represents a sample and each column represents an allele. Black cells indicate the presence of an allele, blue cells indicate an absence and white cells indicate missing data.

library("repvar")
data("monilinia")
loci <- sapply(strsplit(colnames(monilinia), "[.]"), "[", 1)
rpv_image(monilinia, f = loci)

Immediately you should be able to see that some loci---like CHMFc4---have few alleles that are fairly well-represented while other loci---like SEA---have several alleles with the majority of them represented by a handful of samples.

If we run rpv_indices() on our data, we get a vector of samples that represent all of the alleles:

print(ids <- rpv_indices(monilinia))
length(ids)
colSums(monilinia[ids, ]) # all of the variables are represented

Let's highlight where these sampels came from:

rpv_image(monilinia, f = loci, highlight = ids)

If we visualize just these samples, we can see that each locus is accounted for at least once:

rpv_image(monilinia[ids, ], f = loci)

However, if we shuffle the data set:

set.seed(2018)
monilinia_shuffled <- monilinia[sample(nrow(monilinia)), ]
rpv_image(monilinia_shuffled, f = loci) # shuffled data

We get a different answer:

print(new_ids <- rpv_indices(monilinia_shuffled))
par(mfrow = c(1, 2))
rpv_image(monilinia, f = loci, highlight = new_ids, 
          newplot = FALSE,                 # avoid creating a new plot
          col = c("grey95", "grey10"),     # color background grey
          idcol = c("#B15928", "#FFFF99")  # reverse foreground colors
          )
rpv_image(monilinia, f = loci, highlight = ids, 
          newplot = FALSE,                 # avoid creating a new plot
          col = c("grey95", "grey10"),     # color background grey
          idcol = c("#B15928", "#FFFF99")  # reverse foreground colors
          )
par(mfrow = c(1, 1))