knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

This tutorial will show you how to load three replicates of a sample from files and align them with some peptides. We'll show the basic process first, and then give some details about how you would go about classifying the spectra based on peptide information. To summarise, the process involves the following steps:

How to load data

Loading a reference data set

The bacollite package comes with some data files that we can use during this tutorial, but we need some R code to find where they are on our system. Let's get R to tell us the folder that the data is in:

fpath <- system.file("extdata",package="bacollite")

We'll be using the fpath variable in what follows. With your own data, you would know where the samples are on your file system and you'd do something like fpath <- "c:\mydata\"

The MALDI data files that we are using are called:

20150326_SF_YG65_94_I11.txt  
20150326_SF_YG65_94_I14.txt  
20150326_SF_YG65_94_I17.txt

You can see that the last part of the file name before the ".txt" is the spot reference, and there are three spots: I11, I14 and I17. We need this information to use the load.sample function to load the three samples into R, like this:

library(bacollite)
froot = sprintf("%s/20150326_SF_YG65_94_",fpath)
sample <- load.sample(froot = froot,spots = c("I11","I14","I17"),name="folio 42")

Let's go through this line by line:

Let's have a look at the structure of the sample object using R's str function - this will show you what the arguments mean:

str(sample)

This tells us that sample is a list of five pieces of data:

Each of the three sample data entries in the list is a data frame containing the MALDI spectrum data that we'll use for the analysis.

That's it for loading data. There are several additional options that you can use to load things like CSV files, or files with different extensions, which will be the subject of a future vignette.

How to load some peptides

There are several sets of petide sequence that come with bacollite, but here we'll just use some simple ones to demonstrate the process. The bacollite package comes with some sets of peptides that we can use to descriminate between samples from sheep, cow and goat, as described in paper [1] in the README file. lets have a look at these files. The peptides in dm_sheep are:

require(knitr)
kable(dm_sheep)

Those in dm_cow are:

kable(dm_cow)

...and those in dm_goat are:

kable(dm_goat)

Generating deer peptides.

This section shows how you would add another species set of markers, using peptides with the same collagen position and hydroxylation level as the sheep/cow/goat peptides shown above. Here we have a concatenated sequence for deer collagen 1A1 and 1A2 (todo: note where this sequence has come from):

GPMGPSGPRGIPGPPGAPGPQGFQGPPGEPGEPGASGPMGPRGPPGPPGKNGDDGEAGKPGRPGERGPPGPQGARGIPGTAGIPGMKXXXGFSGI
DGAKGDAGPAGPKGEPGSPGENGAPGQMGPRXXXXXXGRPGAPGPAGARGNDGATGAAGPPGPTGPAGPPGFPGAVGAKGEAGPQGPRGSEGPQG
VRGEPGPPGPAGAAGPAGNPGADGQPGAKGANGAPGIAGAPGFPGARGPSGPQGPSGPPGPKGNSGEPGAPGSKGDTGAKGEPGPTGIQGPPGPA
GEEGKRXXXGEPGPAGIPGPPGERXXXXXXGFPGADGVAGPKGPAGERGSPGPAGPKGSPGEAGRPGEAGIPGAKGITGSPGSPGPDGKTGPPGP
AGQDGRPGPPGPPGARGQAGVMGFPGPKGAAGEPGKAGERGVPGPPGAVGPAGKDGEAGAQGPPGPAGPAGERGEQGPAGSPGFQGIPGPAGPPG
EAGKPGEQGVPGDIGAPGPSGARXXXXXXXXXGVQGPPGPAGPRGANGAPGNDGAKGDAGAPGAPGSQGAPGIQGMPGERGAAGIPGPKGDRGDA
GPKGADGAPGKDGVRGITGPIGPPGPAGAPGDKGETGPSGPAGPTGARGAPGDRGEPGPPGPAGFAGPPGADGQPGAKXXXXXXXXXGDAGPPGP
AGPAGPPGPIGNVGAPGPKXXXGSAGPPGATGFPGAAGRVGPPGPSGNAGPPGPPGPAGKXXXXXXRGETGPAGRPGEVGPPGPPGPAGEKGAPG
ADGPAGAPGTPGPQGIAGQRGVVGIPGQRXXXGFPGIPGPSGEPGKQGPSGASGERGPPGPMGPPGIAGPPGESGREGAPGAEGSPGRDGSPGPK
GDRGETGPAGPPGAPGAPGAPGPVGPAGKSGDRGETGPAGPAGPIGPVGARGPAGPQGPRGDKGETGEQGDRXXXXXXGFSGIQGPPGPPGSPGE
QGPSGASGPAGPRGPPGSAGTPGKDGINGIPGPIGPPGPRXXTGDAGPAGPPGPPGPPGPPGPPKGPMGIMGPRGPPGASGAPGPQGFQGPPGEP
GEPGQTGPAGARXXXXXXXXAGEDGHPGKPGRPGERGVVGPQGARGFPGTPGIPGFKXXXGHNGIDGIKGQPGAPGVKGEPGAPGENGTPGQTGA
RXXXXXXGRVGAPGPAGARGSDGSVGPVGPAGPIGSAGPPGFPGAPGPKGEIGPVGNPGPAGPAGPRGEVGIPGISGPVGPPGNPGANGIPGAKG
AAGIPGVAGAPGIPGPRGIPGPVGAAGATGARGIVGEPGPAGSKGESGNKGEPGAVGQPGPPGPSGEEGKRGSTGEIGPAGPPGPPGIRXXXXXX
XXXXXXXXAGVMGPAGSRXXXXXXXXXGPNGDSGRPGEPGIMGPRGFPGSPGNIGPAGKEGPVGIPGIDGRPGPIGPAGARGEPGNIGFPGPKGP
TGDPGKAGEKGHAGIAGARXXXXXXXXXXXXXXXXXXXXXXXXGEQGPAGPPGFQGIPGPAGTAGEAGKPGERGIPGEFGIPGPAGARXXXGPPG
ESGAAGPAGPIGSRGPSGPPGPDGNKGEPGVVGAPGTAGPSGPSGIPGERGAAGIPGGKGEKGETGIRXXXXXXXXXXXXGAPGAVGAPGPAGAN
GDRGEAGPAGPAGPAGPRXXXXXXGEVGPAGPNGFAGPAGAAGQPGAKGERXXXGPKGENGPVGPTGPVGAAGPSGPNGPPGPAGSRGDGGPPGA
TGFPGAAGRTGPPGPSGISGPPGPPGPAGKEGIRGPRGDQGPVGRTGETGASGPPGFAGEKGPSGEPGTAGPPGTPGPQGIIGPPGFIGIPGSRX
XXGIPGVAGSVGEPGPIGIAGPPGARGPPGNVGNPGVNGAPGEAGRDGNPGNDGPPGRDGQPGHKGERGYPGNAGPVGTAGAPGPQGPVGPTGKH
GNRGEPGPAGAVGPAGAVGPRGPSGPQGIRGDKGEPGDKGPRGIPGIKGHNGIQGIPGIAGHHGDQGAPGAVGPAGPRGPAGPSGPAGKDGRTGQ
PGAVGPAGIRGSQGSQGPAGPPGPPGPPGPPGPS

We can load this sequence into an R variable like this:

deerseq <- "GPMGPSGPRGIPGPPGAPGPQGFQGPPGEPGEPGASGPMGPRGPPGPPGKNGDDGEAGKPGRPGERGPPGPQGARGIPGTAGIPGMKXXXGFSGIDGAKGDAGPAGPKGEPGSPGENGAPGQMGPRXXXXXXGRPGAPGPAGARGNDGATGAAGPPGPTGPAGPPGFPGAVGAKGEAGPQGPRGSEGPQGVRGEPGPPGPAGAAGPAGNPGADGQPGAKGANGAPGIAGAPGFPGARGPSGPQGPSGPPGPKGNSGEPGAPGSKGDTGAKGEPGPTGIQGPPGPAGEEGKRXXXGEPGPAGIPGPPGERXXXXXXGFPGADGVAGPKGPAGERGSPGPAGPKGSPGEAGRPGEAGIPGAKGITGSPGSPGPDGKTGPPGPAGQDGRPGPPGPPGARGQAGVMGFPGPKGAAGEPGKAGERGVPGPPGAVGPAGKDGEAGAQGPPGPAGPAGERGEQGPAGSPGFQGIPGPAGPPGEAGKPGEQGVPGDIGAPGPSGARXXXXXXXXXGVQGPPGPAGPRGANGAPGNDGAKGDAGAPGAPGSQGAPGIQGMPGERGAAGIPGPKGDRGDAGPKGADGAPGKDGVRGITGPIGPPGPAGAPGDKGETGPSGPAGPTGARGAPGDRGEPGPPGPAGFAGPPGADGQPGAKXXXXXXXXXGDAGPPGPAGPAGPPGPIGNVGAPGPKXXXGSAGPPGATGFPGAAGRVGPPGPSGNAGPPGPPGPAGKXXXXXXRGETGPAGRPGEVGPPGPPGPAGEKGAPGADGPAGAPGTPGPQGIAGQRGVVGIPGQRXXXGFPGIPGPSGEPGKQGPSGASGERGPPGPMGPPGIAGPPGESGREGAPGAEGSPGRDGSPGPKGDRGETGPAGPPGAPGAPGAPGPVGPAGKSGDRGETGPAGPAGPIGPVGARGPAGPQGPRGDKGETGEQGDRXXXXXXGFSGIQGPPGPPGSPGEQGPSGASGPAGPRGPPGSAGTPGKDGINGIPGPIGPPGPRXXTGDAGPAGPPGPPGPPGPPGPPKGPMGIMGPRGPPGASGAPGPQGFQGPPGEPGEPGQTGPAGARXXXXXXXXAGEDGHPGKPGRPGERGVVGPQGARGFPGTPGIPGFKXXXGHNGIDGIKGQPGAPGVKGEPGAPGENGTPGQTGARXXXXXXGRVGAPGPAGARGSDGSVGPVGPAGPIGSAGPPGFPGAPGPKGEIGPVGNPGPAGPAGPRGEVGIPGISGPVGPPGNPGANGIPGAKGAAGIPGVAGAPGIPGPRGIPGPVGAAGATGARGIVGEPGPAGSKGESGNKGEPGAVGQPGPPGPSGEEGKRGSTGEIGPAGPPGPPGIRXXXXXXXXXXXXXXAGVMGPAGSRXXXXXXXXXGPNGDSGRPGEPGIMGPRGFPGSPGNIGPAGKEGPVGIPGIDGRPGPIGPAGARGEPGNIGFPGPKGPTGDPGKAGEKGHAGIAGARXXXXXXXXXXXXXXXXXXXXXXXXGEQGPAGPPGFQGIPGPAGTAGEAGKPGERGIPGEFGIPGPAGARXXXGPPGESGAAGPAGPIGSRGPSGPPGPDGNKGEPGVVGAPGTAGPSGPSGIPGERGAAGIPGGKGEKGETGIRXXXXXXXXXXXXGAPGAVGAPGPAGANGDRGEAGPAGPAGPAGPRXXXXXXGEVGPAGPNGFAGPAGAAGQPGAKGERXXXGPKGENGPVGPTGPVGAAGPSGPNGPPGPAGSRGDGGPPGATGFPGAAGRTGPPGPSGISGPPGPPGPAGKEGIRGPRGDQGPVGRTGETGASGPPGFAGEKGPSGEPGTAGPPGTPGPQGIIGPPGFIGIPGSRXXXGIPGVAGSVGEPGPIGIAGPPGARGPPGNVGNPGVNGAPGEAGRDGNPGNDGPPGRDGQPGHKGERGYPGNAGPVGTAGAPGPQGPVGPTGKHGNRGEPGPAGAVGPAGAVGPRGPSGPQGIRGDKGEPGDKGPRGIPGIKGHNGIQGIPGIAGHHGDQGAPGAVGPAGPRGPAGPSGPAGKDGRTGQPGAVGPAGIRGSQGSQGPAGPPGPPGPPGPPGPS"

Then we can generate all the peptides we see if we digested the sequence with trypsin like this:

deerpep <- parse.seq(deerseq,max.missed.cleaves = 1)

Note here that we are generating all peptides with up to 1 missed cleave here. OK, that gives us a set of all peptides possible from Deer. We can find the sequences by looking at the seqpos variable for the other species, and looking for a peptide in that vicinity. We want peptides that correspond to the positions in the collagen that we used in the sheep cow goat study:

require(knitr, quietly = T)
kable(dm_sheep)

The sheep peptide with masses at 1180 and 1196 has a seqpos value of 1997, so let's list the deer peptides with a seqpos in a range around that value:

require(knitr, quietly = 0)
dp1 <- deerpep[deerpep$seqpos>1992 & deerpep$seqpos<2002 & deerpep$nglut == 0 & (deerpep$nhyd == 0 | deerpep$nhyd == 1) & deerpep$missed.cleaves == 0,]
kable(dp1)

Great, it looks like we've found it, and the sequence is identical to sheep.

Let's repeat the procedure for the seqpos around 1776:

require(knitr, quietly = 0)
dp2 <- deerpep[deerpep$seqpos>1771 & deerpep$seqpos<1781 & deerpep$nglut == 0 & (deerpep$nhyd == 4 | deerpep$nhyd == 5) & deerpep$missed.cleaves == 0,]
kable(dp2)

OK, the final peptide has a missed cleave, so it's a bit more tricky to create. First, let's find the peptide

require(knitr, quietly = 0)
dp3 <- deerpep[deerpep$seqpos>585 & deerpep$seqpos<595 & deerpep$nglut == 0 & (deerpep$nhyd == 2) & deerpep$missed.cleaves == 1,]
kable(dp3)

Now we can make the set of deer peptides (note we are putting them in the same order as the sheep peptides at this point)

dm_deer <- rbind(dp1,dp3,dp2)
kable(dm_deer)

Run the alignment

OK, now we've got a sample and some peptides, we can do alignments, using the ms_fit function. Let's do this for each set of markers:

sheep_fit <- ms_fit(peptides = dm_sheep,sample = sample,doplot = F,force=T,gauss = 0.2)
cow_fit   <- ms_fit(peptides = dm_cow,  sample = sample,doplot = F,force=T,gauss = 0.2)
goat_fit  <- ms_fit(peptides = dm_goat, sample = sample,doplot = F,force=T,gauss = 0.2)
deer_fit  <- ms_fit(peptides = dm_deer, sample = sample,doplot = F,force=T,gauss = 0.2)

That's nice, but it's not very informative - even if we do str(goat_fit). If we want to see nice plots of these, we can set the doplot option to TRUE and make a nice figure:

par(mfrow=c(4,5))
gauss <- 0.2
sheep_fit <- ms_fit(peptides = dm_sheep, sample = sample,doplot = T,force=T,gauss = 0.2)
cow_fit   <- ms_fit(peptides = dm_cow,   sample = sample,doplot = T,force=T,gauss = 0.2)
goat_fit  <- ms_fit(peptides = dm_goat,  sample = sample,doplot = T,force=T,gauss = 0.2)
deer_fit  <- ms_fit(peptides = dm_deer,  sample = sample,doplot = T,force=T,gauss = 0.2)

The information in these plots is as follows: - The x axis shows the mass range under consideration - The y axis is a scale from 0 to 1 - Each sample is plotted with a coloured line and a grey line - aligned samples 1,2 and 3 are coloured red, green and blue respectively. There original (unshifted) mass positions are also indicated with a grey line. This is useful for bad alignments, because it becomes clear that - The intensity of each sample is scaled by the highest peak in the same spectrum. This allows for a better comparison across many peptides, becuase you see clearly where peaks are small. - The isotopic distribution of the target peptides is shown as a seris of five "pinheads". These are scaled by the relative value of the most common isotope. For lower masses, this is usually the first isotope but for higher masses it is usually the second. - As much information as possible is placed in the title of each plot, but this can be overkill if you want to see mmultiple plots in one figure as shown above.

At this point, we have a set of correlation data for each replicate sample with the peptides for each species. The next step is to use this data to carry out a classification of the sample.

Perform a classification

We carry out a classification by passing the data objects we created with ms_fit into the classfier function cor_id. Let's write this as if it was a function first, then call it - that'll help us explain the process.

First we need to create a list of the alignments, and a vector of names for the species ID like this:

cordata <- list()
cordata[[1]]<-sheep_fit
cordata[[2]]<-cow_fit
cordata[[3]]<-goat_fit
cordata[[4]]<-deer_fit
corlab <- c("Sheep","Cow","Goat","Deer")

The reason we put this data into a list is because it makes it easier to process each species in turn using the same code. If we have a list we can use a loop to iterate through every sample.

#remove objects created by previous runs so we can illustrate the process using str()
rm(cld)

The first stage in the classification is to determine whether a correlation between a sample and a peptide is a 'hit'. Rather than use a single threshold on the correlation, we use a range of thresholds (stored in the variable corlim below) and see if each alignment is above each value in that range. The code below organises this process:

#Arguments
cd <- cordata
labs <- corlab

#Function
corlim = seq(0,1,0.05)
scores <- vector(length = length(cd))
scores[] <- 0
laglim <- 0.6

#massage the raw cordata into a form we can work with: 
cld <- list()
for(cc in 1:length(cd)){
  cld[[cc]] <- corlim_data(cd[[cc]],laglim)
}

We can look at the structure of the cld object, which gives us an idea of what the code above achieves:

str(cld)

As you can see, we have a list of four data frames, one for each species. Each data frame has a set of values for each correleation threshold under consideration. The nh values give the number of hits for each threshold, and the sc value is the sum of the ion counts for the peaks that are classed as a hit (we won't be using this variable for this analysis).

The main part of the classification is to combine this data into a score for each candidate species ID. Scores are accumulated for each correlation threshold value. For each given value and number of 'hits'

#initialise the scores for each sample
for(ss in 1:length(cld))
  cld[[ss]]$cumscore = 0

for(cl in 1:length(corlim)){
  for(ss in 1:length(cld)){
    nh <- cld[[ss]]$nh[cl]
    maxonh<-0
    for(tt in 1:length(cld)){
      if(ss != tt){
        maxonh <- max(maxonh,cld[[tt]]$nh[cl])
      }
    }

    if(nh > maxonh){
      cld[[ss]]$cumscore[cl] <- (nh-maxonh)*corlim[cl]
    }
  }
}

With these values to hand, we can create a structure to hold the result:

result <- data.frame("id" = corlab, "score" = 0)

for(ss in 1:length(cld)){
  message(sprintf("Score for species %d (%s) = %f" ,ss,corlab[ss],sum(cld[[ss]]$cumscore)))
  result$score[ss] <- sum(cld[[ss]]$cumscore)
}

This gives a score fore each species. In this case, the sample is unambiguously classified as cow. Sometimes, where a sample is noisy or contaminated, there can be non-zero scores for more than one species. This is useful, as it gives us an indication as to whether the automated classification is particularly strong or not.

The final stage in the analysis is to generate a graphical representation of the scoring process. Let's create that, and then discuss its features

title = sprintf("Sample '%s': manual ID: '%s'; Calc ID: '%s'\nscores ",sample$name,"unknown",result$id[result$score == max(result$score)])

for(ss in 1:nrow(result)){
  title = sprintf("%s %s = %0.3f",title,result$id[ss],result$score[ss])
}

par(mar = c(4,4,5,4))
plot(NA,xlab="Correlation Threshold",ylab = "Number of Hits",ylim=c(0,15),xlim = c(0,1), main = title)

points(x=corlim,y=cld[[1]]$nh,col="green")
points(x=corlim,y=cld[[2]]$nh,col="red")
points(x=corlim,y=cld[[3]]$nh,col="blue")
points(x=corlim,y=cld[[4]]$nh,col="black")

lines(x=corlim,y=cld[[1]]$nh,col="green")
lines(x=corlim,y=cld[[2]]$nh,col="red")
lines(x=corlim,y=cld[[3]]$nh,col="blue")
lines(x=corlim,y=cld[[4]]$nh,col="black")


legend("topright",legend = corlab,col=c("green","red","blue","black"),lty = 1,pch=1)

This gives us a clear graphical representation of the correlation scores. Let's make some observations about this plot:



bioarch-sjh/bacollite documentation built on Oct. 7, 2022, 3:34 p.m.