Tutorial for the R MMD package"

knitr::opts_chunk$set(echo = TRUE)

The MMD package was developed in [1]. This tutorial illustrates the functioning of the MMD package using Campylobacter isolates from human and five animal sources (Cattle, Chicken, Pig, Sheep, and WB). For illustration purposes, the example uses genotypes of reduced size consisting 10 isolates per host reservoir described with genotypes of 10 SNPs. The data used in this tutorial comes with the MMD package. Other datasets studied in Ref. [1] are available from this link . To use these examples, simply download the data from the link.

Source attribution

Setting the parameters for source attribution

Set the file names of the file csv containing genotype data and file pop containing population names used in this example:

datafile <- system.file("extdata", "Campylobacter_10SNP_HlW.csv", package = "MMD")
popfile <- system.file("extdata", "Campylobacter_10SNP_HlW.pop", package = "MMD")

Note that the full path to the files should be specified. For example:

datafile
popfile

This example searches for the data from the MMD package which is found using system.file. In general, however, in this step you should specify the path to the directory where you have stored the data you want to analyse.

Set the name of the population to be attributed. In this case, we want to attribute Campylobacter isolates from "Human":

ToAttribute <- "Human"

Introduce the population names of the genotypes to be used as sources (these are available from the file *.pop):

sourcenames <- c("Cattle","Chicken","Pig","Sheep","WB")

Set the maximum number of loci to be used in the analysis:

NL <- 100

NL was set to 100 loci as an example but the program will use 10 loci since the genotypes in the file Campylobacter_10SNP_HlW.csv consist of 10 SNPs.

Running the function MMD_attr for source attribution (verbose=FALSE by default even if verbose is not specified; set verbose=TRUE to see a progress bar for computations if you wish):

attribution <- MMD_attr(datafile,popfile,NL,sourcenames,ToAttribute,verbose=FALSE)

Source attribution results:

In this example, the source attribution results were stored in attribution:

attribution

The list contains 7 items:

For example, a bar chart for the mean of the probability of attribution of Campylobacter human isolates to food sources can be easily obtained from attribution as follows:

barplot(height = attribution[[3]]$mean,names.arg = attribution[[3]]$Sources,las=2,cex.names = 0.8,main=NULL, ylab = expression("Attribution probability, p"[s]))

Self-attribution

This example illustrates self-attribution of Campylobacter isolates from the pig reservoir of the dataset used above. The genotypes from pig are split into training and test datasets. The test dataset is defined by randomly drawing 50% of the isolates from pig reservoir; the remaining isolates define the training dataset.

Setting the parameters

<!--To start with, we set the name of the data file (without extension), population name of the genotypes to be attributed, population name of the genotypes to be used as sources and maximum number of loci to be used:

datadir <- "~/Work/Source_Attribution/R_projects/project"
dataname <- "Campylobacter_10SNP_HlW"
NL <- 100
sourcenames <- c("Cattle","Chicken","Pig","Sheep","WB")

-->

To run self-attribution, the function MMD_attr can be executed as follows:

SelfAttribution <- MMD_attr(datafile,popfile,NL,sourcenames,ToAttribute = "Pig",SelfA = "yes",optq = "yes", pqmin = 0, pqmax = 0.5, np=1000, fSelfA = 0.5)

Here,
ToAttribute is set to the "Pig" reservoir which will be analysed for self-attribution.
SelfA is set to "yes" to do self-attribution (SelfA is set to "no" by default in the function MMD_attr, i.e. it is set by default to do source attribution instead of self-attribution).
fSelfA is the fraction of genotypes in the "Pig" reservoir used to define the test dataset. Here, this is set to 0.5 which, in fact, is the default value.
optq is set to "yes" to optimise the q-quantile for self-attribution accuracy (optq is set to "no" by default in the function MMD_attr).
Optimisation of the q-quantile is done by considering a partition of np values in the interval [pqmin,pqmax] of the q-quantile. The value of verbose was not explicitly indicated and the function will use the default value, verbose=FALSE, so that a progress bar will not be displayed.

Self-attribution results

The results are stored in the list SelfAttribution:

SelfAttribution

The list contains 8 items:

In this case, the bar chart for the probability that isolates of unknown origin are attributed to each of the sources can be obtained as follows:

barplot(height = SelfAttribution[[3]]$mean,names.arg = SelfAttribution[[3]]$Sources,las=2,cex.names = 0.8,main=NULL, ylab = expression("Attribution probability, p"[s]))

Despite the small number of SNPs used in the example dataset, it is interesting that the program predicts a relatively high probability of attribution to the correct source of pig Campylobacter isolates.

Selection of informative loci

The package allows informative loci to be selected in terms of entropy and redundancy measures.

Entropy measures

MMD_Entropy provides entropy measures for individual loci. It can be called as follows:

EntropyLoci <- MMD_Entropy(datafile,popfile,NL,sourcenames)

Again, here you can specify verbose=TRUE to see a progress bar for computations if you wish.

Outputs of MMD_Entropy

The results give a list of 7 items:

EntropyLoci

Using the list EntropyLoci one can easily plot histograms for the different entropies,

histHlT <- hist(EntropyLoci[[6]]$HlT,col="#1C86EE",main=NULL,xlab=expression(paste("H"^"T")),ylab="Frequency")
histHlB <- hist(EntropyLoci[[6]]$HlB,col="#1C86EE",main=NULL,xlab=expression(paste("H"^"B")),ylab="Frequency")
histHlW <- hist(EntropyLoci[[6]]$HlW,col="#1C86EE",main=NULL,xlab=expression(paste("H"^"W")),ylab="Frequency")

or a scatterplot of the entropy between sources vs. entropy within sources for each locus:

plot(cbind(EntropyLoci[[6]]$HlW,EntropyLoci[[6]]$HlB),col="#1C86EE",main=NULL, xlab=expression(paste("H"^"W")),ylab=expression(paste("H"^"B")))

Redundancy of loci

In a sequential selection of loci, the redundancy of the n-th locus can be calculated using the funtion MMD_Rn as follows:

RedundancyLoci <- MMD_Rn(datafile,popfile,NL,sourcenames)

Outputs of MMD_Rn

The results give a list of 9 items:

RedundancyLoci

For example, one can plot the redundancy Rn of loci as a function of n and the redundacy of loci sorted in increasing order:

plot(RedundancyLoci[[6]],col="#EE4000",main=NULL,log="x",type="l",xlab="n-th selected locus",ylab=expression("R"[n]))
plot(RedundancyLoci[[8]],col="#EE4000",main=NULL,log="x",type="l",xlab="n-th selected locus",ylab=expression("R"[n]))

References

[1] Perez-Reche, F.J., Rotariu, O., Lopes, B.S., Forbes, K.J., Strachan, N.J.C. Mining whole genome sequence data to efficiently attribute individuals to source populations, Scientific Reports 10, 12124 (2020). https://www.nature.com/articles/s41598-020-68740-6



Try the MMD package in your browser

Any scripts or data that you put into this service are public.

MMD documentation built on Jan. 26, 2021, 5:05 p.m.