RMSobject: Constructing an RMS object

View source: R/dbase.R

RMSobject    R Documentation

Constructing an RMS object

Description

Constructs an RMS object with information about a set of genomes.

Usage

RMSobject(
  genome.tbl,
  frg.dir,
  vsearch.exe = "vsearch",
  identity = 0.99,
  min.length = 30,
  max.length = 500,
  verbose = TRUE,
  threads = 1,
  tmp.dir = "tmp"
)

Arguments

genome.tbl

A table (data.frame or tibble) with genome information, see below.

frg.dir

Path to folder with fragment fasta files.

vsearch.exe

Text with the VSEARCH executable command.

identity

The sequence identity for clustering fragments (0.0-1.0).

min.length

Minimum fragment length (integer).

max.length

Maximum fragment length (integer).

verbose

Turn on/off output text during processing (logical).

threads

Number of threads to be used by vsearch (integer).

tmp.dir

Name of the folder for temporary output. It is created if it does not already exist.

Details

The genome.tbl has one row for each genome to include in the RMS database. It must have a column named genome_file containing the names of the fasta files that hold the RMS fragments from each genome. Use getRMSfragments to create these fasta files, ensuring the fasta headers follow the pattern <genome.ID>_RMSx, where <genome.ID> is some text unique to each genome and x is some integer. The genome.tbl may contain other columns as well, but genome_file is required.
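The requirements above can be checked before calling RMSobject. The following is a minimal sketch in base R, not part of microRMS: the file names, the genome_id column, the folder name "frg" and the two helper functions are hypothetical, and the header check assumes the header consists of exactly <genome.ID>_RMSx with nothing after it.

```r
# Hypothetical genome table: one row per genome; only genome_file is required.
genome.tbl <- data.frame(
  genome_id   = c("GID1", "GID2"),                    # optional extra column
  genome_file = c("GID1_frg.fasta", "GID2_frg.fasta"),
  stringsAsFactors = FALSE
)

# Illustrative helper (not part of microRMS): verify the required column.
check_genome_tbl <- function(tbl) {
  is.data.frame(tbl) &&
    "genome_file" %in% colnames(tbl) &&
    is.character(tbl$genome_file)
}

# Illustrative helper: does a fasta header follow <genome.ID>_RMSx ?
valid_rms_header <- function(header) {
  grepl("^.+_RMS[0-9]+$", header)
}

headers <- c("GID1_RMS1", "GID1_RMS27", "GID2-RMS3")
ok_tbl <- check_genome_tbl(genome.tbl)
ok_hdr <- valid_rms_header(headers)
print(ok_tbl)
print(ok_hdr)

# The call itself needs microRMS, VSEARCH and the fragment files in place,
# so it is only shown as a comment here:
# rms.obj <- RMSobject(genome.tbl, frg.dir = "frg")
```

The third header uses a hyphen instead of an underscore, so it fails the check.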

The vsearch.exe is the exact command used to invoke the VSEARCH software. This is normally just "vsearch", but if you run VSEARCH in a Singularity container (or any other container) it may be something like "srun singularity exec <container_name> vsearch".
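A small sketch of how the two forms of the command might be prepared in base R. The container image name is hypothetical, and the PATH check only applies to the plain "vsearch" command, not to a container prefix.

```r
# Default: VSEARCH is on the PATH and invoked directly.
vsearch.exe <- "vsearch"

# Sys.which() returns "" when the executable cannot be found on the PATH.
vsearch.path <- unname(Sys.which(vsearch.exe))
found <- nzchar(vsearch.path)
if (found) {
  message("VSEARCH found at: ", vsearch.path)
} else {
  message("VSEARCH not on the PATH; a container command may be needed")
}

# Container variant, following the pattern in the text above.
container <- "my_vsearch.sif"   # hypothetical image name
vsearch.container <- paste("srun singularity exec", container, "vsearch")
print(vsearch.container)
```

Either string can then be passed as the vsearch.exe argument.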

Value

A list with the following objects: Cluster.tbl, Cpn.mat and Genome.tbl.

The Cluster.tbl is a tibble with data about all fragment clusters. It contains columns with data about each cluster, including the centroid Sequence and its Header, making it possible to write the table to a fasta file using writeFasta.
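To illustrate why the Header and Sequence columns are enough to produce a fasta file, here is a base R sketch on a toy table. The header texts and sequences are made up, and writeLines is used only as a stand-in so the example runs without extra packages; with a real RMS object the table would normally be passed to writeFasta instead.

```r
# Toy stand-in for Cluster.tbl: only the two columns named in the text.
cluster.tbl <- data.frame(
  Header   = c("cluster_1", "cluster_2"),      # hypothetical header texts
  Sequence = c("ACGTACGTAA", "TTGACCGTAC"),    # hypothetical centroid sequences
  stringsAsFactors = FALSE
)

# Interleave ">Header" and Sequence lines: rbind() stacks them as two rows,
# and as.vector() reads the result column by column.
fasta.lines <- as.vector(rbind(paste0(">", cluster.tbl$Header),
                               cluster.tbl$Sequence))

out.file <- tempfile(fileext = ".fasta")
writeLines(fasta.lines, out.file)
print(readLines(out.file))
```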

The Cpn.mat is the copy number matrix, implemented as a sparse dgeMatrix from the Matrix package. It has one row for each fragment cluster and one column for each genome. This is the central data structure for de-convolving the genome content from read-count data, see rmscols.
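The role of the copy number matrix can be illustrated with a toy example. This sketch uses a small dense base R matrix rather than a Matrix package object, made-up copy numbers and noise-free read counts, and it recovers the genome abundances by ordinary least squares. It is only meant to show the idea; it is not the estimation method used by rmscols.

```r
# Toy copy number matrix: 5 fragment clusters (rows) x 3 genomes (columns).
cpn.mat <- matrix(c(1, 0, 0,
                    0, 1, 0,
                    1, 1, 0,
                    0, 0, 2,
                    2, 0, 1),
                  nrow = 5, byrow = TRUE,
                  dimnames = list(paste0("cluster_", 1:5),
                                  c("genomeA", "genomeB", "genomeC")))

# Hypothetical true genome abundances in a sample.
abundance <- c(genomeA = 10, genomeB = 5, genomeC = 2)

# Expected read count per cluster: copy numbers weighted by abundance.
read.counts <- cpn.mat %*% abundance

# Recover the abundances from the read counts by least squares.
# qr.solve() handles the overdetermined (5 x 3) system.
estimate <- qr.solve(cpn.mat, read.counts)
print(round(drop(estimate), 6))
```

With noise-free counts the least squares solution reproduces the abundances used to generate them; real read-count data would not fit exactly.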

The Genome.tbl is a copy of the argument genome.tbl, but with the columns N_cluster and N_unique added. For each genome, these contain the number of fragment clusters and the number of fragment clusters unique to that genome.
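The two added columns can be related to the copy number matrix as follows. This is a sketch on a toy base R matrix, and it assumes that a cluster counts as unique to a genome when that genome is the only one with a non-zero copy number for it; how RMSobject computes the columns internally is not stated here.

```r
# Toy copy number matrix: 5 fragment clusters (rows) x 3 genomes (columns).
cpn.mat <- matrix(c(1, 0, 0,
                    2, 0, 0,
                    1, 1, 0,
                    0, 1, 1,
                    0, 0, 3),
                  nrow = 5, byrow = TRUE,
                  dimnames = list(paste0("cluster_", 1:5),
                                  c("genomeA", "genomeB", "genomeC")))

# A cluster is present in a genome when its copy number is above zero.
present <- cpn.mat > 0

# Number of clusters found in each genome.
N_cluster <- colSums(present)

# Clusters found in exactly one genome, counted per genome.
single <- rowSums(present) == 1
N_unique <- colSums(present[single, , drop = FALSE])

print(data.frame(N_cluster, N_unique))
```

Here genomeB shares both of its clusters with other genomes, so its N_unique is zero.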

Author(s)

Lars Snipen.

See Also

getRMSfragments.


larssnip/microRMS documentation built on July 19, 2023, 1:06 a.m.