In larssnip/microRMS: Processing and analysis of RMS data

knitr::opts_chunk$set(echo = TRUE)

Installation

To install directly from GitHub you first need to install the packages devtools. Start R and:

install.packages("devtools")

or if you use RStudio, use the meny Tools - Install Packages... You may also need the package githubinstall. If so, install it the same way as above.

Next, you should be able to install this package by

devtools::install_github("larssnip/microrms")

The R functions RMSobject() and readMapper() calls upon the external software vsearch that must be installed and available on the system, see VSEARCH on GitHub. You need to know how to start vsearch on your system, and these functions have the argument vsearch.exe where you specify the exact text that will run vsearch. As an example, in Windows the command to start vsearch is simply "vsearch", but in case you run it as a singularity container the corresponding command would perhaps be "singularity exec <container_name> vsearch" or something along that line.

Below are two short tutorial, to get you going. The first shows a basic analysis on a very small data set. The second digs more into the resolution of this method.

Tutorial 1 - microbial community composition

This is a short step-by-step tutorial on a small toy example to illustrate a typical RMS study. We want to estimate the abundance of various taxa in a microbial community. Just as for shotgun metagenome data, this is a 'closed reference' type of assignment. This means we have a set of already sequenced genomes, and estimate the relative abundances of these in the sequenced samples. Instead of full shotgun sequencing, we have used the Reduced Metagenome Sequencing approach, i.e. sequencing of amplicons obtained by restriction enzyme cutting of the genomes. The use of RMS to discover new taxa is out of the scope of this tutorial.

You need to install the tidyverse and ggdendro R packages as these are used in the example code below, in addition to the microrms package and its dependencies.

Download the archive RMS_tutorial.tar.gz, and unzip it to some folder named RMS_tutorial. It should contain the folders gnm, fastq, fasta, frg and tmp, where the latter three are empty. There should also a file gold_standard.txt. The last folder, ecoli_frg, is used in the second tutorial below.

Start R/RStudio and make the RMS_tutorial your working directory for this R session, e.g.

setwd("C:/my_tutorials/RMS_tutorial")   # edit this

where you need to replace the path above with the correct one depending on where you unzipped the archive.

Create a new R-script, copy the code chunks below into it, and save it in the RMStutorial folder. Step through this, and inspect the results.

After some code chunks below there are some output lines, usually starting with ##. Do not copy these into your R-script.

Creating the RMS object

First, we need a collection of sequenced genomes that cover the community we want to study. For this tutorial, we have collected 10 genomes in the gnm/ folder. From every genome, we collect all RMS fragments, cluster these, and create the fragment cluster copy number matrix which is the central data structure of this method. In real life your genome collection would be larger, hundreds or thousands of genomes, depending on the environment you study.

All information is stored in an rms object, which is simply a list with several tables and matrices. It is convenient to have all data assembled into one object like this, and since this is a simple list you have full access to all components.

Collecting RMS fragments

First, we collect the RMS fragments from each genome, and store these as fasta-formatted files in a separate folder. This is something you do once for each genome. You also find the file gnm/genome_table.txt among the genomes, containing a small table with one row for each genome. It is required you have such a metadata table with information about the genomes. This table must contain the columns genome_id and genome_file, but may contain any other genome metadata columns in addition to these.

The genome_id should be a text without spaces, and unique to each genome. It is added to the header-lines of the fragment fasta files, to indicate the genome of origin for all fragments. Here we used the prefix of the genome fasta filenames as genome_id.

The genome_file column specify the genome fasta files, but without the path to where they are located. Keep it this way. The reason is we re-use these filenames for the fragment fasta files, but in another folder. Thus, we supply the path as a separate argument when needed.

The first code chunk reads genomes, collect their RMS fragments, and store these as fasta files in the frg/ folder:

library(tidyverse)
library(microrms)
gnm.dir <- "gnm"   # genome fasta files are found here
frg.dir <- "frg"   # fragment fasta files end up here
genome.tbl <- suppressMessages(read_delim("gnm/genome_table.txt", delim = "\t"))
for(i in 1:nrow(genome.tbl)){
  readFasta(file.path(gnm.dir, genome.tbl$genome_file[i])) %>% 
    getRMSfragments(genome.id = genome.tbl$genome_id[i]) %>% 
    writeFasta(out.file = file.path(frg.dir, genome.tbl$genome_file[i]))
}

We notice all these genomes have hundreds or even thousands of RMS fragments. If one of these genomes (or a very close relative) is present in a sample, these fragments will be amplified and sequenced in the RMS method.

This job is done once for each genome, and the fragment files are stored (genome files are no longer needed). Thus, if you later want to add more genomes, you only run this for the new genomes. Notice that the fragment files have names identical to the genome files, and must be in a separate folder. Thus, the genome_file column in genome.tbl is used to name both genome and fragment files.

Also, note the readfasta()/writeFasta() functions from the microseq package handles gzipped files. Since the genome_file names have the extension .gz the fragment files will also be compressed.

The RMS object

Select the genomes you want to be able to recognize later. This means you may slice() or filter() your genome.tbl to only contain the rows (=genomes) you are interested in. Here we use all genomes.

The creation of the RMS object requires the use of the vsearch software for clustering. The function RMSobject() will not work unless vsearch is a valid command in your system/computer.

rms.obj <- RMSobject(genome.tbl, frg.dir, vsearch.exe = "vsearch", tmp = "tmp")

Note that you need to specify some folder for temporary output ("tmp"). You may delete its content after the computations are finished.

The clustering of these fragments results in more than 17000 clusters, using the default identity, see ?RMSobject. The resulting object rms.obj is a list with two tables and a matrix.

The rms.obj$Genome.tbl is a copy of the genome.tbl, but has got two new columns, and should be inspected right away. The column N_clusters lists the number of fragment clusters in each genome, and N_unique how many of these are unique to each genome:

print(rms.obj$Genome.tbl)

In this case there are plenty of unique clusters for all genomes. If two or more genomes are very closely related, this number will shrink towards zero, making the recognition of each impossible.

The rms.obj$Cluster.tbl is a table listing information about all fragment clusters, one row for each cluster. Here is a listing of the first entries:

head(rms.obj$Cluster.tbl)

Very similar fragments (identity above the threshold set by the identity argument to RMSobject) will belong to the same cluster. Closely related genome will typically share several fragments. The Cluster.tbl indicate how many genomes contain each fragment (N.genomes), and also which genomes (Members). Note this table has a Header and a Sequence column, making it possible to write this to a fasta file using writeFasta().

The rms.obj$Cpn.mat is the copy number matrix. This is a central data structure of the RMS method. It has one row for each fragment cluster, and one column for each genome. Here are the first few rows and columns:

print(rms.obj$Cpn.mat[1:10,1:4])

The copy numbers indicate how many copies of a given fragment cluster are found in a given genome. By far most copy numbers are zero, i.e. a fragment is usually found in one or a few genomes only. For this reason, the Cpn.mat is represented as a sparse data matrix in R, using the Matrix package. This save a lot of memory, at the cost of slightly slower computations. The zero elements are displayed as dots above. Most nonzero elements are 1, but some fragment clusters may occur several times in some genomes.

Genome similarity and resolution

When estimating the relative abundances of taxa, there is always a lower resolution, i.e. we cannot separate between genomes who are too similar. This will apply to all methods, not only RMS. For 16S sequencing, the resolution is usually at the genus rank, i.e. we cannot in general separate species from each other. With full shotgun sequencing, the resolution is higher, and we can separate species and even some strains within species if the sequencing is deep enough and the strains are different enough. With RMS we may also separate within species.

In this toy example, we have chosen genomes who are all of different species, even if six of them are from the same genus (Bacteroides). Thus, the problem of too similar genomes is probably not a big one here. In the next tutorial below, we dig deeper into this problem.

The similarity between genomes is, in the RMS context, simply the similarity between their respective columns in the copy number matrix (rms.obj$Cpn.mat). If two genomes have very similar columns here, it means they have more or less the same fragments, and by sequencing these we cannot distinguish if we are actually seing one or the other of these two genomes.

One way of quantifying the pairwise difference between the genomes is to compute the correlation distance between all genomes based on the copy number matrix. A distance of 0.0 means identical genomes (identical columns in the copy number matrix), and the largest possible distance of 2.0 means the genomes have orthogonal copy number vectors. From such distances we can compute a hierarchical clustering and display the genomes in a dendrogram:

library(ggdendro)
D <- corrDist(rms.obj$Cpn.mat)
tree <- hclust(as.dist(D), method = "single")
ggd <- ggdendrogram(dendro_data(tree),
                    rotate = T,
                    theme_dendro = F) +
  labs(x = "", y = "Correlation distance")
print(ggd)

All branches look deep and nice here, with correlation distances above 0.9. If branches get too shallow (correlation distance close to zero) in this tree, the genomes in such clades will be difficult/impossible to separate since they share too many fragments. The solution to this is to cluster the genomes, which is illustrated in the next tutorial below.

Processing reads

The read processing means essentially taking the fastq files from sequencing as input and producing a fasta file as output, for each sample. This step is independent of what we did above. You only do this once for each sample, and it is the resulting fasta files we use in the downstream analysis.

The data from sequencing are the paired fastq files in the folder fastq/. Again, it is recommended to have a table like the one in fastq/sample_table.txt with metadata about each sample. There should always be a column sample_id with a unique text for each sample. Also, this sample.tbl has two columns R1_file and R2_file specifying the corresponding fastq file names.

In addition we also create the required column fasta_file below, containing the name of the resulting fasta files with reads for each sample. Here we create this from the sample_id, which means this text must be useful as a file name prefix (e.g. no / or spaces inside). The table may contain this column already, with no need to create it. Again, paths should not be part of any file names, we supply them as separate inputs. Below we output these files to the fasta folder.

There is no R function for doing the read processing, since this may be done in many different ways. Here is an R script with some suggested code for doing this processing using the vsearch software. Note the explicit decompression of the fastq-files. This is only added here in case you run this on a Windows 10 computer, on which we have found vsearch is not capable of reading gzipped files. On all other system these code-chunks should be deleted, and vsearch will read compressed files directly. The (long) screen output from this is hidden in this document:

#################
### The settings
vsearch.exe <- "vsearch"  # command to start vsearch
fq.dir <- "fastq"         # path to folder with (input) fastq files
fa.dir <- "fasta"         # path to folder with (output) fasta files
tmp.dir <- "tmp"          # temporary files, delete in the end
PCR.forward.primer <- "GACTGCGTACCAATTC"
PCR.reverse.primer <- "GATGAGTCCTGAGTAA"
min.read.length <- 30
maxee <- 0.02

#####################
### The sample table
sample.tbl <- suppressMessages(read_delim("fastq/sample_table.txt",
                                          delim = "\t")) %>% 
  mutate(fasta_file = str_c(sample_id, ".fasta"))

############################################################
### Looping over all samples
### 1) Filtering by maxee, discarding read-pairs
### 2) Merging read-pairs
### 3) Trimming primers from merged reads
### 4) Trimming primers from un-merged reads
### 5) Writing all reads to fasta-file
### 6) De-replicating and saving one fasta-file per sample
###########################################################
Nf <- str_length(PCR.forward.primer)
Nr <- str_length(PCR.reverse.primer)
for(i in 1:nrow(sample.tbl)){
  ##-- code chunk only needed for Windows 10 computers
  R.utils::gunzip(file.path(fq.dir, sample.tbl$R1_file[i]))
  sample.tbl$R1_file[i] <- str_remove(sample.tbl$R1_file[i], ".gz$")
  R.utils::gunzip(file.path(fq.dir,sample.tbl$R2_file[i]))
  sample.tbl$R2_file[i] <- str_remove(sample.tbl$R2_file[i], ".gz$")
  ##-- end code chunk for Windows 10

  cat("\n\n##### VSEARCH quality filtering sample", sample.tbl$sample_id[i], "...\n")
  cmd <- paste(vsearch.exe,
               "--fastq_filter", file.path(fq.dir, sample.tbl$R1_file[i]),
               "--reverse",      file.path(fq.dir, sample.tbl$R2_file[i]),
               "--fastq_maxee_rate", maxee,
               "--fastqout", file.path(tmp.dir, "filtered_R1.fq"),
               "--fastqout_rev", file.path(tmp.dir, "filtered_R2.fq"))
  system(cmd)

  cat("\n\n##### VSEARCH mergings read-pairs...\n")
  cmd <- paste(vsearch.exe,
               "--fastq_mergepairs", file.path(tmp.dir, "filtered_R1.fq"),
               "--reverse",          file.path(tmp.dir, "filtered_R2.fq"),
               "--fastq_allowmergestagger",
               "--fastq_minmergelen", min.read.length,
               "--fastaout", file.path(tmp.dir, "merged.fa"),
               "--fastqout_notmerged_fwd", file.path(tmp.dir, "notmerged_R1.fq"),
               "--fastqout_notmerged_rev", file.path(tmp.dir, "notmerged_R2.fq"))
  system(cmd)

  cat("\n\n##### VSEARCH trimming primers from merged reads...\n")
  cmd <- paste(vsearch.exe,
               "--fastx_filter", file.path(tmp.dir, "merged.fa"),
               "--fastq_stripleft",  Nf,
               "--fastq_stripright", Nr,
               "--fastq_minlen", min.read.length,
               "--relabel", "'size=2;pair'",
               "--fastaout", file.path(tmp.dir, "merged_filt.fa"))
  system(cmd)

  cat("\n\n##### VSEARCH trimming primers from un-merged reads...\n")
  cmd <- paste(vsearch.exe,
               "--fastq_filter", file.path(tmp.dir, "notmerged_R1.fq"),
               "--fastq_stripleft", Nf,
               "--fastq_minlen", min.read.length,
               "--relabel", str_c("'size=1;notmerged_R1_'"),
               "--fastaout", file.path(tmp.dir, "notmerged_R1_filt.fa"))
  system(cmd)
  cmd <- paste(vsearch.exe,
               "--fastq_filter", file.path(tmp.dir, "notmerged_R2.fq"),
               "--fastq_stripleft", Nr,
               "--fastq_minlen", min.read.length,
               "--relabel", str_c("'size=1;notmerged_R2_'"),
               "--fastaout", file.path(tmp.dir, "notmerged_R2_filt.fa"))
  system(cmd)

  cat("\n\n##### VSEARCH adding all reads to one fasta-file...\n")
  ok <- file.append(file1 = file.path(tmp.dir, "merged_filt.fa"),
                    file2 = file.path(tmp.dir, "notmerged_R1_filt.fa"))
  cmd <- paste(vsearch.exe,
               "--fastx_revcomp", file.path(tmp.dir, "notmerged_R2_filt.fa"),
               "--fastaout", file.path(tmp.dir, "notmerged_R2_filt_rc.fa"))
  system(cmd)
  ok <- file.append(file1 = file.path(tmp.dir, "merged_filt.fa"),
                    file2 = file.path(tmp.dir, "notmerged_R2_filt_rc.fa"))

  cat("\n\n##### VSEARCH de-replicating sample", sample.tbl$sample_id[i], "...\n")
  cmd <- paste(vsearch.exe,
               "--derep_fulllength", file.path(tmp.dir, "merged_filt.fa"),
               "--minuniquesize", 1,
               "--minseqlength", min.read.length,
               "--sizein --sizeout",
               "--relabel", str_c(sample.tbl$sample_id[i], ":uread_"),
               "--output", file.path(fa.dir, sample.tbl$fasta_file[i]))
  system(cmd)

  ##-- code chunk only needed for Windows 10 computers
  R.utils::gzip(file.path(fq.dir, sample.tbl$R1_file[i]))
  sample.tbl$R1_file[i] <- str_c(sample.tbl$R1_file[i], ".gz")
  R.utils::gzip(file.path(fq.dir, sample.tbl$R2_file[i]))
  sample.tbl$R2_file[i] <- str_c(sample.tbl$R2_file[i], ".gz")
  ##-- end code chunk for Windows 10
}

Note that you should have created the folders fasta and tmp before running this script. The first is where the resulting fasta files appear. The second is just temporary files. They may be nice to have for debugging, in case something goes wrong, but should be deleted in the end.

In this script we remove primers from the reads, but in many cases this has been done as part of de-mutliplexing, and then these steps should be omitted. We also set minimum read length to 30 and maximum error rate to 0.02. These may be edited, depending on your data. You may find it strange that the fastq-files in fastq are not compressed (.gz). This is very common, and the only reason they are not here is that we have found the vsearch has some problems reading gzipped files under Windows 10. If you run this on a windows 10 computer, we reccomend de-compressing the files used as input to vsearch, to be on the safe side, e.g. by using gunzip() and gzip() in R

Before we are done with this step, we add the sample.tbl to our rms.obj. Note that sample.tbl must have at least the two columns sample_id and fasta_file for the downstream analysis (the R1_file and R2_file may still be present but are no longer needed):

rms.obj <- addSampleTable(rms.obj, sample.tbl)

In this way we have the information about our samples in the same object as we have all other information. Inspect the rms.obj to verify the Sample.tbl is now another element in this list. Note that the function used above will replace an existing Sample.tbl inside the rms.obj if the latter already contains such a table. This means you may re-use the genome collection (and the corresponding Genome.tbl, Cluster.tbland Cpn.mat) with other samples (reads) by adding a different Sample.tbl.

Mapping reads to clusters

The next step is to map reads from each sample to the fragment clusters, and obtain a readcount matrix. This matrix has one column for each sample, and one row for each fragment cluster, similar to an OTU or ASV matrix for 16S amplicon data.

The readcount matrix

The readMapper() function needs the rms.obj with information about fragment clusters (rms.obj$Cluster.tbl) and samples (rms.obj$Sample.tbl), and uses vsearch to search with reads against the fragment cluster centroids. The path to the fasta files with processed reads is also required, since the rms.obj$Sample.tbl has the filenames, but not their path.

rms.obj <- readMapper(rms.obj, fa.dir, vsearch.exe = "vsearch", tmp.dir = "tmp")

The matrix Readcount.mat is added as a new component to the returned rms.obj. There should be one column for each sample and one row for each fragment cluster. Here are the first rows:

print(rms.obj$Readcount.mat[1:10,1:4])

The readMapper() also adds information about how many reads are in each sample, and how many mapped to some RMS fragments. This appears as two new columns in the rms.obj$Sample.tbl:

print(rms.obj$Sample.tbl)

Length normalization

We must expect the readcounts from RMS amplicons to have some length bias, due to the PCR amplification. We may plot and see if this is indeed the case:

rms.obj$Cluster.tbl %>% 
  select(Length) %>% 
  bind_cols(as_tibble(rms.obj$Readcount.mat)) %>% 
  pivot_longer(cols = -Length, names_to = "Sample", values_to = "Readcounts") %>%  
  ggplot() +
  geom_point(aes(x = Length, y = Readcounts), alpha = 0.3) +
  scale_y_log10() +
  facet_wrap(~Sample) -> plt
print(plt)

Note the log-transformed y-axes. Since we have limited fragments to lengths of 30 to 500 only, the bias is not severe, but the shortest and longest fragments have slightly lower readcounts. Let us try to normalize:

rms.obj.norm <- normLength(rms.obj)

We make the same plot with the normalized data:

rms.obj.norm$Cluster.tbl %>% 
  select(Length) %>% 
  bind_cols(as_tibble(rms.obj.norm$Readcount.mat)) %>% 
  pivot_longer(cols = -Length, names_to = "Sample", values_to = "Readcounts") -> tbl
plt %+% tbl %>% print

There is some effect on the most extreme lengths, as usual. Note the log-transformed y-axes, giving the illusion of large distortions in the smallest readcounts. They actually change little compared to the more normal readcount values.

If we decide to stick to the normalized data, we do not need both rms.obj and rms.obj.norm:

rms.obj <- rms.obj.norm
rm(rms.obj.norm)

Estimating abundances

Finally, we estimate the abundances of the various genomes in our samples.

The Constrained Ordinary Least Square (COLS)

Abundance estimation is done by the rmscols() function. The supplied rms.obj must contain the cluster copy number matrix rms.obj$Cpn.mat and the readcounts rms.obj$Readcount.mat. The idea is to look for a linear combination of genome abundances that, given cluster copy numbers, best explains the observed readcounts in a sample:

abd.mat <- rmscols(rms.obj)

The abd.mat is a matrix with one row for each genome, and one column for each sample. The numbers in a column are the relative abundances for each genome in the corresponding sample.

We may plot the results as stacked bar charts:

abd.mat %>% 
  as_tibble(rownames = "genome_id") %>% 
  pivot_longer(cols = -genome_id, names_to = "sample_id", values_to = "Estimated") -> long.tbl

ggplot(long.tbl) +
  geom_col(aes(x = sample_id, y = Estimated, fill = genome_id), color = "black")

Comparing to gold standard

The sequencing fastq files in fastq/ are simulated data. The art software (https://www.ncbi.nlm.nih.gov/pubmed/22199392) was used to simulate Illumina HiSeq reads. Biases typical for RMS data (i.e. due to fragment length and fragment-specific amplification bias) were added as described in Snipen et al, 2020.

The file gold_standard.txt contains the actual relative abundances of all genomes in all samples. Let us compare the estimated abundances from above to this, and also replace genome_id with organism_name in the figure legend:

suppressMessages(read_delim("gold_standard.txt", delim = "\t")) %>%
  left_join(rms.obj$Genome.tbl, by = "genome_id") %>% 
  select(genome_id, organism_name, starts_with("Sample")) %>% 
  pivot_longer(cols = c(-genome_id, -organism_name), names_to = "sample_id", values_to = "Gold.standard") %>% 
  full_join(long.tbl, by = c("genome_id", "sample_id")) %>% 
  pivot_longer(cols = c(-genome_id, -sample_id, -organism_name), names_to = "Type", values_to = "Abundance") %>% 
  ggplot() +
  geom_col(aes(x = Type, y = Abundance, fill = organism_name), color = "black") +
  facet_wrap(~sample_id)

New data

This forms a code template, and by replacing the genomes in gnm/ by your genomes, and the sequencing data in fastq/ by your own data, it should be possible to run an analysis.

Beware that for large collections of genomes (thousands), computations will be much slower than in this tutorial. You may also need a computer with a lot of memory, even if the copy number matrix has been implemented as a sparse matrix here.

Tutorial 2 - subspecies resolution

As mentioned above, the RMS method has a potential for estimating abundances down to below the species rank, i.e.separating strains. This is in fact one of the strengths of RMS. The reason for this sensitivity is the fact that we know a priori which fragments belong to which genomes.

If two genomes are very similar, they will also share many RMS fragments. As with all methods, there will always be a lower resolution, i.e. it is impossible to 'see' the difference between two identical genomes! How close can two genomes be, and still be separated by RMS?

The key to the resolution by RMS is the copy number matrix mentioned above. When estimating the abundances, its covariance matrix must be inverted. If two or more genomes are too similar, the corresponding columns of the copy number matrix are also very similar. This lead to a singular, or close-to singular, covariance matrix that cannot be inverted. Even if the matrix can be inverted, similar columns means results will be very unstable, leading to poor estimates. This effect can be measured by the condition value of the covariance matrix of the copy number matrix. If this is very large, the results will be unstable.

In the folder ecoli_frg we have the RMS fragment files for a random selection of 50 Escherichia coli genomes. It also contains the genome table, and we use this to create an RMS object:

library(tidyverse)
library(microrms)
ecoli.tbl <- read_delim("ecoli_frg/genome_table.txt", delim = "\t")
ecoli.rms.obj <- RMSobject(ecoli.tbl, frg.dir = "ecoli_frg", vsearch.exe = "vsearch", tmp = "tmp")

Notice that these 50 genomes produces 'only' 8602 fragment clusters, which is half of what the 10 genomes in tutorial 1 had. This is of course because here all genomes are from the same species, and they share many fragments. By inspecting the Genome.tbl we can see how many unique clusters each genome has:

print(ecoli.rms.obj$Genome.tbl)

The last two columns tell us there are around 1000 fragments in an E. coli genome, but the number of unique fragments is much smaller, even down to zero in a couple of cases. Thus, we cannot expect to be able to separate all these genomes by RMS.

Let us compute correlation distances, and make a dendrogram, like we did in tutorial 1:

library(ggdendro)
d <- corrDist(ecoli.rms.obj$Cpn.mat)
tree <- hclust(as.dist(d), method = "single")
ggd <- ggdendrogram(dendro_data(tree),
                    rotate = T,
                    theme_dendro = F) +
  labs(x = "", y = "Correlation distance", title = "All E. coli genomes")
print(ggd)

It is clear that some distances are close to zero. We may also compute the condition value from the copy number matrix:

print(conditionValue(ecoli.rms.obj$Cpn.mat))

How large condition value can we tolerate? The perfect value is 1, but this is never achieved with real data. a value below 10 is extremely good. The condition value for the copy number matrix in tutorial 1 is around 3.7, and we saw how all those genomes separated nicely in the dendrogram. Condition values below 100 or even 1000 is still quite acceptable, and even 10 000 is not all that bad. Going above this, we must expect some substantial errors in some abundance estimates. When later running rmscols() to estimate abundances, the number of iterations is also a measure indicating how separable the genomes are. A larger condition value means more iterations are needed, and less precise estimates.

Let us 'prune' the genome collection such that we get a smaller set of genomes, that we are capable of separating better. This is done by genome clustering. Genomes who are too similar are clustered into a group, and only the centroide genome in this group is used in the analysis, as a representative for the group. We use the function genomeClustering() for this:

ecoli.rms.obj.cls100 <- genomeClustering(ecoli.rms.obj, max.cond = 100, verbose = T)

Here we specified that we tolerate a condition value of maximum 100. You would typically choose values of $10^1$, $10^2$, $10^3$ or $10^4$ here, even if any value above 1 is allowed. The resulting ecoli.rms.obj.cls100 has a new Genome.tbl that we should inspect:

print(ecoli.rms.obj.cls100$Genome.tbl)

It has 27 rows, instead of the original 50, i.e. 23 genomes less. These 'lost' genomes are now represented by some other, and the new column members_genome_id lists the original genome_id to all members of each cluster (comma separated). Also, notice there are now many more unique fragments for each cluster centroide genome.

We may now re-compute the condition value and the dendrogram:

print(conditionValue(ecoli.rms.obj.cls100$Cpn.mat))

d <- corrDist(ecoli.rms.obj.cls100$Cpn.mat)
tree <- hclust(as.dist(d), method = "single")
ggd <- ggdendrogram(dendro_data(tree),
                    rotate = T,
                    theme_dendro = F) +
  labs(x = "", y = "Correlation distance", title = "Clustered E. coli genomes")
print(ggd)

Note that no correlation distance are now below 0.20, and we expect to be able to separate these genomes fairly well. In Snipen et al (2021){target="blank"} we demonstrate this on a much larger collection of E. coli genomes. Note also that this is something you do in silico prior to any experimental efforts, since it only involves the sequenced genomes. Having sequenced some samples, you proceed as in tutorial 1 in order to estimate how abundant these clusters are in the samples.

larssnip/microRMS documentation built on July 19, 2023, 1:06 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

larssnip/microRMS
Processing and analysis of RMS data

In larssnip/microRMS: Processing and analysis of RMS data

Installation

Tutorial 1 - microbial community composition

Creating the RMS object

Collecting RMS fragments

The RMS object

Genome similarity and resolution

Processing reads

Mapping reads to clusters

The readcount matrix

Length normalization

Estimating abundances

The Constrained Ordinary Least Square (COLS)

Comparing to gold standard

New data

Tutorial 2 - subspecies resolution

R Package Documentation

Browse R Packages

We want your feedback!

larssnip/microRMS Processing and analysis of RMS data

In larssnip/microRMS: Processing and analysis of RMS data

Installation

Tutorial 1 - microbial community composition

Creating the RMS object

Collecting RMS fragments

The RMS object

Genome similarity and resolution

Processing reads

Mapping reads to clusters

The readcount matrix

Length normalization

Estimating abundances

The Constrained Ordinary Least Square (COLS)

Comparing to gold standard

New data

Tutorial 2 - subspecies resolution

R Package Documentation

Browse R Packages

We want your feedback!

larssnip/microRMS
Processing and analysis of RMS data