LTRharvest: Run LTRharvest to predict putative LTR Retrotransposons
In HajkD/LTRpred: De novo functional annotation of retrotransposons

LTRharvest

R Documentation

Run LTRharvest to predict putative LTR Retrotransposons

Description

This function implements an interface between R and the LTRharvest command line tool to predict putative LTR retrotransposons from R.

Usage

LTRharvest(
  genome.file,
  index.file = NULL,
  range = c(0, 0),
  seed = 30,
  minlenltr = 100,
  maxlenltr = 3500,
  mindistltr = 4000,
  maxdistltr = 25000,
  similar = 70,
  mintsd = 4,
  maxtsd = 20,
  vic = 60,
  overlaps = "no",
  xdrop = 5,
  mat = 2,
  mis = -2,
  ins = -3,
  del = -3,
  motif = NULL,
  motifmis = 0,
  output.path = NULL,
  verbose = TRUE
)

Arguments

`genome.file`	path to the genome file in `fasta` format.
`index.file`	specify the name of the enhanced suffix array index file that is computed by `suffixerator`. This option can be used if the suffix array index file was generated previously, e.g. during an earlier call of this function. In this case the index file does not need to be re-computed for new analyses. This is particularly useful when running `LTRharvest` with different parameter settings.
`range`	define the genomic interval in which predicted LTR retrotransposons shall be reported. If `range[1] = 1000` and `range[2] = 10000`, candidates are only reported if they start after position 1000 and end before position 10000 in their respective sequence coordinates. If `range = c(0, 0)` (default), the entire genome is scanned.
`seed`	the minimum length for the exact maximal repeats. Only repeats with the specified minimum length are considered in all subsequent analyses. Default is `seed = 30`.
`minlenltr`	minimum LTR length. Default is `minlenltr = 100`.
`maxlenltr`	maximum LTR length. Default is `maxlenltr = 3500`.
`mindistltr`	minimum distance of LTR starting positions. Default is `mindistltr = 4000`.
`maxdistltr`	maximum distance of LTR starting positions. Default is `maxdistltr = 25000`.
`similar`	minimum similarity value between the two LTRs in percent. Default is `similar = 70`.
`mintsd`	minimum length of target site duplications (TSDs). If no search for TSDs shall be performed, then specify `mintsd = NULL`. Default is `mintsd = 4`.
`maxtsd`	maximum length of target site duplications (TSDs). If no search for TSDs shall be performed, then specify `maxtsd = NULL`. Default is `maxtsd = 20`.
`vic`	number of nucleotide positions left and right (the vicinity) of the predicted boundary of an LTR that will be searched for TSDs and/or one motif (if specified). Default is `vic = 60`.
`overlaps`	specify how overlapping LTR retrotransposon predictions shall be treated. If `overlaps = "no"` is selected, neither nested nor overlapping predictions are reported in the output. If `overlaps = "best"` is selected and two or more predictions are nested or overlap, only the LTR retrotransposon prediction with the highest similarity between its LTRs is reported. If `overlaps = "all"` is selected, all LTR retrotransposon predictions are reported, whether or not they are nested or overlap. Default is `overlaps = "no"`.
`xdrop`	specify the xdrop value (> 0) for extending a seed repeat in both directions allowing for matches, mismatches, insertions, and deletions. The xdrop extension process stops as soon as the extension involving matches, mismatches, insertions, and deletions has a score smaller than T - X, where T denotes the largest score seen so far and X is the xdrop value. Default is `xdrop = 5`.
`mat`	specify the positive match score for the X-drop extension process. Default is `mat = 2`.
`mis`	specify the negative mismatch score for the X-drop extension process. Default is `mis = -2`.
`ins`	specify the negative insertion score for the X-drop extension process. Default is `ins = -3`.
`del`	specify the negative deletion score for the X-drop extension process. Default is `del = -3`.
`motif`	specify 2 nucleotides for the starting motif and 2 nucleotides for the ending motif at the beginning and the end of each LTR, respectively. Only palindromic motif sequences, where the motif sequence is equal to its complementary sequence read backwards, are allowed, e.g. `motif = "tgca"`. Type the nucleotides without any space separating them. If this option is set but no allowed number of mismatches is specified by the argument `motifmis`, a search for the exact motif is conducted. If `motif = NULL` (default), no motif is specified and candidate pairs are not screened for potential motifs.
`motifmis`	allowed number of mismatches in the motif specified in `motif`. The number of mismatches must lie in the interval [0,3]. Default is `motifmis = 0`.
`output.path`	a path/folder to store all results returned by `LTRharvest`. If `output.path = NULL` (Default) then a folder with the name of the input genome file will be generated in the current working directory of R and all results are then stored in this folder.
`verbose`	logical value indicating whether or not detailed information shall be printed on the console.
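
Several of the arguments above carry explicit constraints (a positive `xdrop`, a positive match score, negative mismatch, insertion, and deletion scores, `motifmis` in [0,3], a four-nucleotide palindromic `motif`, and three permitted values for `overlaps`). The following is a minimal sketch of how these constraints could be checked before calling `LTRharvest()`. The helper `check_ltrharvest_args()` is hypothetical and not part of LTRpred; it only restates the rules documented above and makes no claim about how the package itself validates its input.

```r
# Hypothetical helper (NOT part of LTRpred): collects violations of the
# constraints stated in the argument descriptions above.
check_ltrharvest_args <- function(xdrop = 5, mat = 2, mis = -2, ins = -3,
                                  del = -3, motif = NULL, motifmis = 0,
                                  overlaps = "no") {
  problems <- character(0)

  if (xdrop <= 0)
    problems <- c(problems, "xdrop must be > 0")
  if (mat <= 0)
    problems <- c(problems, "mat must be a positive score")
  if (any(c(mis, ins, del) >= 0))
    problems <- c(problems, "mis, ins and del must be negative scores")
  if (!overlaps %in% c("no", "best", "all"))
    problems <- c(problems, "overlaps must be 'no', 'best' or 'all'")
  if (motifmis < 0 || motifmis > 3)
    problems <- c(problems, "motifmis must lie in [0,3]")

  if (!is.null(motif)) {
    m <- tolower(motif)
    if (!grepl("^[acgt]{4}$", m)) {
      problems <- c(problems, "motif must consist of exactly 4 nucleotides")
    } else {
      # Complement each base, then read the result backwards
      complement <- chartr("acgt", "tgca", m)
      revcomp <- paste(rev(strsplit(complement, "")[[1]]), collapse = "")
      if (revcomp != m)
        problems <- c(problems, "motif must equal its reverse complement")
    }
  }

  problems
}

# "tgca" is palindromic in this sense, "tgaa" is not
check_ltrharvest_args(motif = "tgca")
check_ltrharvest_args(motif = "tgaa")
```

The helper returns a character vector of messages rather than stopping at the first violation, so all problems with a parameter set can be reported at once.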

Details

The LTRharvest function provides an interface to the LTRharvest command line tool and furthermore takes care of the entire folder handling, output parsing, and data processing of the LTRharvest prediction.

Internally, a folder named output.path_ltrharvest is generated and all results returned by LTRharvest are stored in this folder. These files (see section Value) are then parsed and returned as a list of data.frames by this function.

LTRharvest can be used independently or as an initial pre-computation step for detecting LTR retrotransposons with LTRdigest.

Value

The LTRharvest function generates the following output files:

*_BetweenLTRSeqs.fsa : DNA sequences of the region between the LTRs in fasta format.
*_Details.tsv : A spreadsheet containing detailed information about the predicted LTRs.
*_FullLTRRetrotransposonSeqs.fsa : DNA sequences of the entire predicted LTR retrotransposon.
*_index.fsa : The suffix array index file used to predict putative LTR retrotransposons with LTRharvest.
*_Prediction.gff : A spreadsheet containing detailed additional information about the predicted LTRs (partially redundant with the *_Details.tsv file).

The ' * ' is a placeholder for the name of the input genome file.
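
To make the naming scheme concrete, the sketch below builds the expected output file names for a given genome file. The helper `expected_ltrharvest_files()` is hypothetical and not part of LTRpred. It assumes that the placeholder corresponds to the base name of the genome file without its extension, which is not confirmed by this documentation and should be checked against the actual output folder.

```r
# Hypothetical helper (NOT part of LTRpred): lists the output file names
# described above for a given genome file.
# ASSUMPTION: '*' is the genome file's base name without its extension.
expected_ltrharvest_files <- function(genome.file) {
  prefix <- tools::file_path_sans_ext(basename(genome.file))
  suffixes <- c(
    "_BetweenLTRSeqs.fsa",
    "_Details.tsv",
    "_FullLTRRetrotransposonSeqs.fsa",
    "_index.fsa",
    "_Prediction.gff"
  )
  paste0(prefix, suffixes)
}

expected_ltrharvest_files("Hsapiens_ChrY.fa")
```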

Author(s)

Hajk-Georg Drost

References

D Ellinghaus, S Kurtz and U Willhoeft. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics (2008). 9:18.

Most argument specifications are adapted from the User manual of LTRharvest.

Examples

## Not run: 

# Run LTRharvest on the partial H. sapiens Y chromosome using default parameters
LTRharvest(genome.file = system.file("Hsapiens_ChrY.fa", package = "LTRpred"))

## End(Not run)
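
A further sketch, using only arguments from the Usage section, shows a run with non-default settings: a stricter LTR similarity threshold, `overlaps = "best"`, and a `tgca` motif with one allowed mismatch. The specific values are illustrative assumptions, not recommendations. The call is guarded so that it is attempted only if LTRpred is installed and the example genome is found; it additionally requires a working LTRharvest command line installation.

```r
# Example genome shipped with LTRpred (as in the example above);
# system.file() returns "" if the package or file is not available.
genome <- system.file("Hsapiens_ChrY.fa", package = "LTRpred")

# Illustrative, non-default settings (assumed values, not recommendations)
custom_args <- list(
  genome.file = genome,
  similar     = 85,      # require at least 85% similarity between the LTRs
  overlaps    = "best",  # keep only the best of overlapping predictions
  motif       = "tgca",  # palindromic start/end motif
  motifmis    = 1        # allow one mismatch in the motif
)

# Attempt the run only if LTRpred and the example genome are available
if (requireNamespace("LTRpred", quietly = TRUE) && nzchar(genome)) {
  result <- do.call(LTRpred::LTRharvest, custom_args)
}
```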

HajkD/LTRpred documentation built on April 22, 2022, 4:35 p.m.