README.md

Rbec: a tool for analysis of amplicon sequencing data from synthetic microbial communities

Our manuscript is currently available in ISME Communications.

Rbec is a tool for analysing amplicon sequencing data from synthetic communities (SynComs), where the reference sequences for each strain are already available. Rbec can accurately correct PCR and sequencing errors, identify intra-species polymorphic variation, and detect contaminations in SynCom amplicon data.

Content

1. Algorithm description

2. Installation

3. Usage

    --Parameters

    --Output files

    --Characterizing microbial communities in SynComs

    --Detecting contaminants

    --Dependence on accurate reference sequences

4. Example starting from raw data

5. Credits

Algorithm description

Dereplication and assignment of initial abundances

Initial abundances for each sequence present in the reference database are inferred on the basis of the number of exact matches found in the uncorrected sequencing reads. First, merged reads are dereplicated into unique tags in a sample-wise manner. Sequence quality scores for each unique tag are averaged over the scores of all identical copies of that sequence and for every residue. Each reference sequence is assigned an initial abundance equal to the number of identical copies of that unique tag found in the sample. If a reference sequence does not have any tags that exactly matches it, the strain from which the reference sequence is derived is marked as ‘absent’ from the sample.

Assignment of erroneous reads to error-generating reference sequences

Unique tags that do not exactly match any sequence in the reference database are initially assumed to originate from erroneous sequencing reads generated by a given reference sequence. In order to identify the most likely error-generating reference sequence for each unique tag, k-mer (k=7) distances between each unique tag and each reference sequence are calculated. We use k-mer distances for pairwise comparisons between unique tag queries and reference sequences instead of computationally intensive global alignments to improve the time performance of the algorithm. In addition, these distance calculations can be performed in parallel to further increase performance (see parameter ‘threads’). The reference sequence showing the lowest k-mer distance to the query unique tag is marked as the candidate error-producing sequence from which the corresponding erroneous sequences originate. If multiple candidates with the same k-mer distance are found, only the reference sequence with the highest initial abundance is considered as the original error-generating sequence, as sequences with higher abundances are more likely to generate erroneous sequences.

Estimation of the transition and error matrices

To calculate the probability that an unique tag is produced by a given error-generating reference sequence, transition and error matrices need to be estimated. The transition matrix is a 20 by 43 matrix where the rows represent the transition combinations (e.g., A→A, A→T, A→G, A→C, T→T, …, C→G, C→C; including insertions), and the columns represent the sequence quality scores. This transition matrix can be estimated by performing a global alignment between a random set of subsampled unique sequences (or ‘tags’; 5,000 reads by default) and the reference sequences. Entries in the transition matrix are calculated by counting the number of each transition combination along the length of the alignment. The log-transformed transition matrix is then fitted with a weighted loess function to generate the error matrix.

Calculation of error-generation probabilities

We assume that the mismatches between query and reference are generated independently, so the rate at which a unique tag i is produced from the error-generating reference j, designated λ, is calculated by the product over the error probabilities at each position of the alignment l:

Where L is the total length of the alignment. Similar to the error-aware model implemented in DADA2 [1], the abundance probability of each unique tag is calculated using the Poisson distribution:

Where E is the expectation of the Poisson distribution, ai is the abundance of a unique tag i, and aj' is the aggregated abundances of all unique tags assigned to a reference sequence j. Unique tags with a P-value lower than 10-40, and an expectation lower than 0.05 are discarded. This expectation cut-off is intended to retain tags that could be produced at least once by the reference with the probability above 5%. The aim of this step is to retain tags that are generated from intra-strain amplicon sequence variants, which show high abundance relative to the reference sequence but do not exceed the P-value cut-off. It is possible that, for certain experiments, modifying these parameters could be useful. For instance, in a community containing very low abundance strains, lowering the minimum expectation threshold might increase the sensitivity of the algorithm. Similarly, if the presence of low-abundance contaminants closely related to a reference strain is of particular concern, increasing the minimum P-value threshold will help identify potential contaminants, at the risk of generating a larger number of false positives. As discussed above, Rbec is able to identify polymorphic paralogues of the marker gene for a given strain. By summing up the number of (polymorphic) paralogues identified for each reference sequence from each strain, we can obtain an estimate of the copy number of the marker gene for each strain in the SynCom. Normalization of the output relative abundances using this internally inferred copy number estimate is conducted by Rbec by default (see also documentation regarding output files), although this can be controlled by the user (see parameter ‘cn’). However, identical paralogues of the marker gene will not be detected, and the copy number inferred by Rbec may be underestimated. Additionally, the user can provide a table with copy number information for each strain (e.g., obtained from whole-genome sequences, if available) for abundance normalization.

Iterative correction of unique tags

Tags above the P-value or E threshold are then randomly subsampled (5 000 reads by default) and aligned to the reference sequences in an iterative process. In each iteration, the error matrix is updated with tags corrected during the last iteration. The iterations continue until the number of corrected reads falls below two fixed thresholds, which we set to determine whether the iteration should stop or not. These two thresholds correspond to the absolute and relative differences in the number of corrected reads between the present and previous iterations, and are calculated as follows:

Nk and Nk-1 denote the number of corrected reads in the kth and k-1th iteration respectively. The threshold based on relative differences is used to appropriately stop the iterative process for samples with low sequencing depths, since they can easily satisfy the cut-off based on absolute differences. Once both of these two conditions are met, iterations stop and each reference sequence is assigned an abundance equal to the aggregated abundance of all its assigned unique tags.

Installation

Rbec is a free R package. Installation of Rbec requires the 'dada2' package (https://benjjneb.github.io/dada2/dada-installation.html) to installed beforehand manually with the following command:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("dada2")

The latest version of Rbec can be installed with devtools:

devtools::install_github("PengfanZhang/Rbec", dependencies = TRUE)

Alternatively, you can download from Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Rbec")

Usage

Parameters

fastq: path of the FASTQ file containing the merged amplicon sequencing reads (Ns are not allowed in the reads).

reference: path to the nucleotide FASTA file containing the unique reference sequences. Each sequence must be in one line (Ns are not allowed).

outdir: path to the output directory.

threads: number of threads used, default 1.

sampling_size: sampling size for calculating the error matrix, default 5,000.

ascii: ASCII characters used to encode phred scores (33 or 64), default 33.

min_cont_obs_abd: minimum observed abundance for a non-corrected unique tag to be considered a potential contaminant (default 200).

min_cont_abd: minimum relative abundance for a non-corrected unique tag to be considered a potential contaminant (default 0.03).

min_E: minimum value expectation of the Poisson distribution for identification of paralogues (default 0.05).

min_P: minimum P-value of the Poisson distribution required to correct a sequencing read (default 1e-40).

ref_seeker: method used for finding the candidate error-producing reference sequence for a tag showing identical lowest k-mer distance to multiple references. 1 for the abundance-based method; 2 for the transition probability-based method, default 1.

cn: path to a tab-separated table containing the copy number of the marker gene for each strain. The first column of the table should contain the strain IDs, and the second the copy number estimates (real values are allowed). If no table is provided, Rbec will normalize the abundance based on the internally inferred copy number (which tends to be un underestimate). Rbec will output the normalized as well as the non-normalized abundance tables. Default NULL.

Output files

strain_table.txt: tab-separated file containing the strain-level community composition table.

strain_table_normalized.txt: tab-separated file strain-level community composition table, normalized using copy-number information (see also documentation on the parameter ‘cn’).

contamination_seq.fna: nucleotide FASTA file containing the sequences of potential contaminants.

rbec.log: text file containing the percentage of corrected reads per sample, which can be used to predict potentially contaminated (see also parameter ‘min_cont_abd’).

paralogue_seq.fna: nucleotide FASTA file including sequences of polymorphic paralogues sequences found for each strain.

lambda_final.out: test file containing the lambda and P-values of the Poisson distribution for each unique tag.

error_matrix_final.out: text file containing the error matrix used in the final iteration.

Characterizing microbial communities in SynComs

To generate microbial community compsition profiles, the Rbec function can be used. Below there is an example using a small test amplicon mock community data.

fq <- system.file("extdata", "test_raw_merged_reads.fastq.gz", package="Rbec")

ref <- system.file("extdata", "test_ref.fasta", package="Rbec")

Rbec(fq, ref, tempdir(), 1, 500, 33)

Users can merge the profiling tables from different samples into a single one by using 'mergeSequenceTables' function in DADA2.

Detecting contaminants

One of the main sources of technical variation in gnotobiotic experiments is caused by microbial contaminations occurring during the development of the experiment or already present during input SynCom preparation. One of the features of Rbec is the assessment of likely contaminated samples based the recruitment ratio of sequencing reads across samples. When analysing data with Rbec, a separate log file is provided as an output for each sample. To predict contaminated samples, a text file containing a list of all paths to the log files needs to be provided before running the following command:

log_path <- list.files(paste(path.package("Rbec"), "inst/extdata/contamination_test", sep="/"), recursive=TRUE, full.names=TRUE)

log_file <- tempfile()

writeLines(log_path, log_file)

Contam_detect(log_file, tempdir())

This command will generate a plot showing the distribution of percentages of corrected reads across the whole sample set and a log file with predicted contaminated samples are generated. As a general rule, 90% or more of reads should be corrected in clean SynCom samples.

Dependence on accurate reference sequences

Since Rbec is a reference-based method for error correction in sequencing reads, the accuracy of the reference sequence would critically influence the result. Inference of marker gene sequences from draft genome assemblies or via Sanger sequencing might lead to errors that may negatively impact the accuracy of the results. If errors are present in the reference sequence of a certain strain and no reads can be perfectly aligned to that reference (with an initial abundance of 0 for that reference), Rbec flags this strain as absent with an abundance of 0. One tricky way to overcome this issue in which references are not 100% accurate is that you can look up the contamination sequences outputted by the Rbec function at the first round and align the contamination sequences to the references of strains that are missing in the community. When this occurs, the correct sequence will be flagged by Rbec as a putative contaminant. The user can use this information to replace the erroneous entry in the reference database and re-run Rbec. This should be done with care to avoid true contaminants to be mistaken by members of the original input SynCom when their phylogenetic distance is low.

Example starting from raw data

In the testdata folder, you can find the rawdata and the corresponding reference file from a real synthetic community. In this section, we show a walkthrough example of how to pre-process the raw data to be analyzed with Rbec.

Rbec can use as an input FASTQ files from either single- or pair-end sequencing data. In addition, it requires a database of reference sequences in FASTA format.

Preparation of reference sequences

Sequences in the reference database should be truncated to the region exactly matching the amplicon reads. For example, If we sequence the V5-V7 region of the synthetic bacterial community, the reference sequence of each strain in the reference database should also be truncated to retain the V5-V7 region only rather than the full-length of 16S rRNA sequences. To truncate the reference sequences into a specific region, a tool such as cutadapt can be used:

cutadapt -g AACMGGATTAGATACCC -a GGAAGGTGGGGATGACGT -n 2 -o testdata/reference_V5V7.fasta testdata/reference_full.fasta

Rbec currently only supports the reference database in a non-wrapped format. Towards this end, the following Unix command can be used to transform the wrapped sequences into non-wrapped ones.

awk 'BEGIN{seqs=""}{if(/^>/){if(seqs!=""){print seqs;seqs=""}; print $0} else{seqs=seqs""$0}}END{print seqs}' testdata/reference_V5V7.fasta > testdata/reference_V5V7_nonwrapped.fasta

or use Seqkit

seqkit seq -w 0 -o testdata/reference_V5V7_nonwrapped.fasta testdata/reference_V5V7.fasta

Preprocess of the raw data

Firstly, pair-end raw reads should be merged by using any reads merging tools:

flash2 testdata/test_R1.fastq.gz testdata/test_R2.fastq.gz -M 250 -o test -x 0.25 -d testdata

Amplicon reads should be manually filtered by excluding short reads or reads with ambiguous bases using USEARCH or a similar software before running Rbec. Below is an example using USEARCH:

usearch -fastq_filter testdata/test.extendedFrags.fastq -fastqout testdata/test.extendedFrags.filtered.fastq -fastq_maxns 0 -fastq_minlen 200

Start with Rbec

With the filtered, merged raw reads FASTQ file and the reference database FASTA, you can invoke Rbec in your R environment to explore your samples:

Rbec("testdata/test.extendedFrags.filtered.fastq", "testdata/reference_V5V7_nonwrapped.fasta", 1, 2000)

Credits

Pengfan Zhang (pzhang@mpipz.mpg.de); Ruben Garrido-Oter (garridoo@mpipz.mpg.de)



PengfanZhang/Rbec documentation built on Sept. 13, 2023, 1 a.m.