RFLPtools: Analysis of DNA fragment samples and standalone BLAST report files


Package RFLPtools aims at

see also [@Flessa13].

In this short vignette we describe and demonstrate the available functions.

We start with loading the package.


RFLP data

As a first step we can perform a QC of the samples using function RFLPqc. This is possible, if the first band corresponds to the total length of the fragment. The QC consists of a comparison of the length of the first band with the sum of the lengths of the remaining bands for each sample. If the sum is smaller than QC.lo times the length of the first band or larger than QC.up times the length of the first band, respectively, a text message is printed.

Dir <- system.file("extdata", package = "RFLPtools") # input directory 
filename <- file.path(Dir, "AZ091016_report.txt")
RFLP1 <- read.rflp(file = filename)

RFLP2 <- RFLPqc(RFLP1, rm.band1 = FALSE) # identical to RFLP1
identical(RFLP1, RFLP2)


RFLP4 <- RFLPqc(RFLP1, rm.band1 = TRUE, QC.rm = TRUE)

We load some example data and compute the Euclidean distance ...

res <- RFLPdist(RFLPdata)
names(res) ## number of bands

Of course, we can also use other well-known distances implemented in function dist.

res1 <- RFLPdist(RFLPdata, distfun = function(x) dist(x, method = "manhattan"))
res2 <- RFLPdist(RFLPdata, distfun = function(x) dist(x, method = "maximum"))

Correlation distances can for instance be applied by using function corDist of package MKomics.

res3 <- RFLPdist(RFLPdata, distfun = corDist)

As we obtain a list of dist objects we can easily perform hierarchical clustering.

plot(hclust(res[[1]]), main = "Euclidean distance")
plot(hclust(res1[[1]]), main = "Manhattan distance")
plot(hclust(res2[[1]]), main = "Maximum distance")
plot(hclust(res3[[1]]), main = "Pearson correlation distance")

For splitting the dendrogram into clusters we apply function cutree.

clust4bd <- hclust(res[[2]])
cgroups50 <- cutree(clust4bd, h=50)

Another possibility to display the similarity of the samples are so-called (dis-)similarity matrices that can for instance be generated by function simPlot of package \pkg{MKomics}.

myCol <- colorRampPalette(brewer.pal(8, "RdYlGn"))(128)
ord <- order.dendrogram(as.dendrogram(hclust(res[[1]])))
temp <- as.matrix(res[[1]])
simPlot(temp[ord,ord], col = rev(myCol), minVal = 0, 
        labels = colnames(temp), title = "(Dis-)Similarity Plot")

We can also use function levelplot of package lattice to display (dis-)similarity matrices.

print(levelplot(temp[ord,ord], col.regions = rev(myCol),
          at = do.breaks(c(0, max(temp)), 128),
          xlab = "", ylab = "",
          ## Rotate labels of x-axis
          scales = list(x = list(rot = 90)),
          main = "(Dis-)Similarity Plot"))

If some bands may be missing, we can apply function RFLPdist2 specifying the number of missing bands we expect.

## Euclidean distance
res0 <- RFLPdist(RFLPdata, nrBands = 9)
res1 <- RFLPdist2(RFLPdata, nrBands = 9, nrMissing = 1)
res2 <- RFLPdist2(RFLPdata, nrBands = 9, nrMissing = 2)
res3 <- RFLPdist2(RFLPdata, nrBands = 9, nrMissing = 3)

Again hierarchical clustering of the results is straight forward.

plot(hclust(res0), main = "0 bands missing")
plot(hclust(res1), main = "1 band missing")
plot(hclust(res2), main = "2 bands missing")
plot(hclust(res3), main = "3 bands missing")

There are also ways to handle low-bp bands, which are likely to be absent in some of the samples. First, one can use function RFLPlod to remove all bands below a given threshold LOD before further analyses. For displaying the data we use function RFLPplot.

RFLPdata.lod <- RFLPlod(RFLPdata, LOD = 60)
par(mfrow = c(1, 2))
RFLPplot(RFLPdata, nrBands = 4, ylim = c(40, 670))
RFLPplot(RFLPdata.lod, nrBands = 4, ylim = c(40, 670))
title(sub = "After applying RFLPlod")

In addition, one can specify LOD in a call to function RFLPdist2. If LOD is specified, it is assumed that missing bands only occur for molecular weights smaller than LOD.

res0 <- RFLPdist(RFLPdata, nrBands = 4)
res1.lod <- RFLPdist2(RFLPdata, nrBands = 4, nrMissing = 1, LOD = 60)
ord <- order.dendrogram(as.dendrogram(hclust(res1.lod)))
temp <- as.matrix(res1.lod)
simPlot(temp[ord,ord], col = rev(myCol), minVal = 0, 
        labels = colnames(temp), 
        title = "(Dis-)Similarity Plot\n1 band missing below LOD")

We can also make a comparison to reference data.

RFLPrefplot(RFLPdata, RFLPref, nrBands = 9, cex.axis = 0.8)

BLAST data

To analyze tabular report files generated with standalone BLAST from NCBI (see https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download), a function for reading the BLAST report files is included (read.blast).

Possible steps are:

1) Install NCBI BLAST 2) Generate and import database(s) 3) Apply BLAST with options outfmt and out; e.g.

blastn -query Testquery -db Testdatabase -outfmt 6 -out out.txt


blastn -query Testquery -db Testdatabase -outfmt 10 -out out.csv

One could also call BLAST from inside R by using function system

system("blastn -query Testquery -db Testdatabase -outfmt 6 -out out.txt")

4) Read in the results

## -outfmt 6
test.res <- read.blast(file = "out.txt")


## -outfmt 10
test.res <- read.blast(file = "out.csv", sep = ",")

We now read in a example file included in folder extdata of our package.

Dir <- system.file("extdata", package = "RFLPtools") # input directory 
filename <- file.path(Dir, "BLASTexample.txt")
BLAST1 <- read.blast(file = filename)

This example BLAST data is also available as loadable example data.


The loaded data.frame can be used to compute similarities between the BLASTed sequences via function simMatrix. This function includes the following steps:

1) the length of each sequence (LS) comprised in the input data file is extracted.

2) if there is more than one comparison for one sequence including different parts of the respective sequence, that one with maximum base length is chosen.

3) the number of matching bases (mB) is calculated by multiplying two variables given in the BLAST output: the identity between sequences (%) and the number of nucleotides divided by 100.

4) the resulting value is rounded to the next integer.

5) the similarity is calculated by dividing mB by LS and saved in the corresponding similarity matrix.

If the similarity of a combination is not shown in the BLAST report file (because the similarity was lower than 70%), this comparison is included in the similarity matrix with the result zero.

res <- simMatrix(BLASTdata)

Optionally, the range of sequence length can be specified to exclude sequences, which were too short or too long, respectively.

res1 <- simMatrix(BLASTdata, sequence.range = TRUE, Min = 100, Max = 450)
res2 <- simMatrix(BLASTdata, sequence.range = TRUE, Min = 500)

We display the similarity matrix.

simPlot(res2, col = myCol, minVal = 0, cex.axis = 0.5,
        labels = colnames(res2), title = "(Dis-)Similarity Plot")

Alternatively, we can again use function levelplot of package lattice.

txt <- trellis.par.get("add.text")
txt$cex <- 0.5
trellis.par.set("add.text" = txt)
myCol <- colorRampPalette(brewer.pal(8, "RdYlGn"))(128)
print(levelplot(res2, col.regions = myCol,
          at = do.breaks(c(0, max(res2)), 128),
          xlab = "", ylab = "", 
          ## Rotate labels of x axis
          scales = list(x = list(rot = 90)),
          main = "(Dis-)Similarity Plot"))

We can also convert the similarity matrix to an object of S3 class "dist".

res.d <- sim2dist(res2)

After the conversion we can for instance perform hierarchical clustering.

## hierarchical clustering
plot(hclust(res.d), cex = 0.7)




Try the RFLPtools package in your browser

Any scripts or data that you put into this service are public.

RFLPtools documentation built on Feb. 8, 2022, 5:06 p.m.