lsux: Extract variable/conserved domains from LSU rDNA

lsux R Documentation

Extract variable/conserved domains from LSU rDNA

Description

Extracts alternating variable and conserved domains from the contiguous rDNA regions which form the eukaryotic ribosomal large subunit, i.e. 5.8S RNA, ITS2, and 28S RNA. For the purposes of this document, this region will be referred to as the 32S precursor RNA, as in humans, although its actual size in Svedberg units varies between lineages.

Usage

lsux(
  seq,
  cm_5.8S = system.file(file.path("extdata", "RF00002.cm"), package = "inferrnal"),
  cm_32S = system.file(file.path("extdata", "fungi_32S.cm"), package = "LSUx"),
  glocal = TRUE,
  global = FALSE,
  ITS1 = FALSE,
  cpu = NULL,
  mxsize = NULL,
  quiet = TRUE
)

Arguments

seq

(single filename readable by readBStringSet, XStringSet-class, ShortRead-class, or character vector) sequences to extract regions from

cm_5.8S

(filename) covariance model for 5.8S rRNA

cm_32S

(filename) covariance model for 32S pre-rRNA (5.8S, ITS2, and LSU)

glocal

(logical scalar) if TRUE, use glocal alignment in cmsearch

global

(logical scalar) if TRUE, use global alignment in cmalign

ITS1

(logical scalar) if TRUE, include sequence fragment before 5.8S (if any) as ITS1

cpu

(integer scalar) number of threads to use in Infernal calls

mxsize

(double scalar or vector) passed on to cmalign. If length is greater than 1, then if cmalign fails, it will be retried with subsequent values.

quiet

(logical scalar) passed on to cmsearch
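
The retry behaviour described for mxsize can be pictured with a small base R sketch. This is only an illustration of the idea "try each value in turn until one succeeds", not the code LSUx actually runs; fake_align and try_each are hypothetical names, and the 2048 threshold is invented for the example.

```r
# Hypothetical stand-in for an alignment step that fails when the
# allowed matrix size is too small (threshold invented for illustration).
fake_align <- function(mxsize) {
  if (mxsize < 2048) stop("matrix size limit exceeded")
  paste("aligned with mxsize =", mxsize)
}

# Try each value in order and return the first successful result.
try_each <- function(values, f) {
  for (v in values) {
    result <- tryCatch(f(v), error = function(e) NULL)
    if (!is.null(result)) return(result)
  }
  stop("all supplied values failed")
}

# 1024 fails, 2048 succeeds, 4096 is never tried.
outcome <- try_each(c(1024, 2048, 4096), fake_align)
```

Under this reading, a call such as lsux(seq, mxsize = c(1024, 4096)) would only use the larger, more memory-hungry limit for inputs where the smaller one fails.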

Details

Input sequences should contain, at a minimum, a significant fraction of the 5.8S RNA, which is used to define the 5' end of 32S. Any base pairs before the 5' end of 5.8S will be considered to be ITS1 (ITS1 = TRUE) or discarded (ITS1 = FALSE). Input sequences should not extend past the end of the 32S model at the 3' end.

LSUx requires two covariance models: one for 5.8S, which is used in cmsearch, and one for 32S, which is used in cmalign.

The 5.8S model can be RF00002 from Rfam (the default), or an equivalent. It must be calibrated using cmcalibrate from Infernal.

The 32S model must include annotations in the reference line ("#=GC RF" in the seed alignment) to distinguish conserved and variable regions. The annotations should be sequential characters in the range "1..9A..Z" for conserved domains, "v" for variable domains, and "." for unaligned gaps in the seed alignment. In the output, the conserved domains will be named "5_8S", "LSU1", "LSU2", ...; the variable domains will be named "ITS2", "V2", "V3", ...
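
As a rough illustration of how such a reference line maps to domain names, the base R sketch below collapses a made-up, very short RF string into runs and labels them using the region names listed under Value ("5_8S", "ITS2", "LSU1", "V2", ...). The string rf is hypothetical, and this is not the parsing code used inside LSUx.

```r
# Hypothetical, shortened reference line; real "#=GC RF" lines are far longer.
rf <- "1111..11vvvvv2222vvv33333"

# Drop "." (unaligned seed columns), then collapse into runs of equal characters.
chars <- strsplit(gsub(".", "", rf, fixed = TRUE), "")[[1]]
runs <- rle(chars)

is_var <- runs$values == "v"
labels <- character(length(runs$values))

# Conserved runs: the first is 5.8S, later ones are LSU1, LSU2, ...
n_cons <- sum(!is_var)
labels[!is_var] <- c("5_8S", paste0("LSU", seq_len(n_cons - 1)))[seq_len(n_cons)]

# Variable runs: the first is ITS2, later ones are V2, V3, ...
n_var <- sum(is_var)
labels[is_var] <- c("ITS2", paste0("V", seq_len(n_var - 1) + 1))[seq_len(n_var)]

# One row per domain, with its width in annotated seed columns.
domains <- data.frame(region = labels, width = runs$lengths)
```

Here domains lists the five regions in 5' to 3' order; the widths count annotated seed columns only and say nothing about the length of each domain in a particular input sequence.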

Two example models are included, both based on the RDP fungal LSU CM, and annotated with variable regions according to Raué (1988). The first, system.file(file.path("extdata", "fungi_32S.cm"), package = "LSUx"), includes the full LSU region. The second, system.file(file.path("extdata", "fungi_32S_LR5.cm"), package = "LSUx"), is truncated at the binding site of the LR5 primer, and should be faster for input sequences which do not extend past that point. The seed alignments are also provided.

If generating similar truncated alignments with different endpoints, it is critical to remove unpaired secondary structure elements from the "#=GC SS_cons" line of the seed alignment.
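
One way to find the affected positions is to look for brackets whose partner was cut off by the truncation. The sketch below is a minimal base R example under several assumptions: it handles only the four bracket pairs "<>", "()", "[]" and "{}", it ignores pseudoknot letters, it expects the whole structure line as a single string, and it uses "." as the replacement character. The function name unpair_orphans is hypothetical and not part of LSUx.

```r
# Replace bracket characters whose partner is missing with ".".
unpair_orphans <- function(ss) {
  chars <- strsplit(ss, "")[[1]]
  open <- c("<", "(", "[", "{")
  close <- c(">", ")", "]", "}")
  stack <- integer(0)                 # positions of currently open brackets
  orphan <- logical(length(chars))
  for (i in seq_along(chars)) {
    if (chars[i] %in% open) {
      stack <- c(stack, i)
    } else if (chars[i] %in% close) {
      n <- length(stack)
      if (n > 0 && match(chars[stack[n]], open) == match(chars[i], close)) {
        stack <- stack[-n]            # matched pair: pop the opener
      } else {
        orphan[i] <- TRUE             # closer without a matching opener
      }
    }
  }
  orphan[stack] <- TRUE               # openers whose partner was cut off
  chars[orphan] <- "."
  paste(chars, collapse = "")
}

# The four "(" at the 3' end lost their partners when the alignment was cut.
fixed <- unpair_orphans("<<<..>>>..((((...")
```

The result should still be checked by rebuilding the model from the edited seed alignment, since a real Stockholm file may wrap the structure line across several blocks.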

Value

a tibble with one row for each region found for each input sequence. The columns are:

seq_id (character)

the sequence name from seq

length (integer)

the length of the original sequence in base pairs

region (character)

the name of the found domain. Can be "5_8S", "ITS2", "LSU1", "V2", "LSU2", "V3", etc.

start (integer)

the starting base for that domain in this sequence.

end (integer)

as start, but giving the end base for the domain.
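
The start and end columns can be used to cut each region out of its source sequence. The sketch below uses a hand-made data frame shaped like the value described above and a toy sequence; both are hypothetical, and it assumes that start and end are 1-based and inclusive.

```r
# Hypothetical result rows with the columns described above.
regions <- data.frame(
  seq_id = c("seq1", "seq1", "seq1"),
  length = c(30L, 30L, 30L),
  region = c("5_8S", "ITS2", "LSU1"),
  start  = c(1L, 11L, 21L),
  end    = c(10L, 20L, 30L),
  stringsAsFactors = FALSE
)

# Toy input sequence, named by its sequence ID.
seqs <- c(seq1 = "AAAAAAAAAACCCCCCCCCCGGGGGGGGGG")

# Look up each row's sequence by ID and extract positions start..end.
regions$subseq <- unname(
  substr(seqs[regions$seq_id], regions$start, regions$end)
)
```

For real data held in an XStringSet, the same lookup by seq_id applies, with the Biostrings subsetting functions in place of substr.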

Examples

# the sample data was amplified with primers ITS1 and LR5, so the truncated
# cm is appropriate.
seq <- system.file("extdata/sample.fasta", package = "inferrnal")
cm_32S_trunc <- system.file(
    file.path("extdata", "fungi_32S_LR5.cm"),
    package = "LSUx"
)
lsux(seq, cm_32S = cm_32S_trunc, ITS1 = TRUE, cpu = 1)

brendanf/LSUx documentation built on April 7, 2024, 9:27 p.m.