mk.reference: Makes a reference file for Salmon

View source: R/mk.reference.r

mk.reference    R Documentation

Makes a reference file for Salmon

Description

This function creates the decoys and the transcriptome that will be used by Salmon. It also creates a reference file used to import the estimates after the Salmon run. The user can supply an unprocessed RepeatMasker file, from which co-transcribed and overlapping repeats have not been removed, with the RepMask argument; a file without co-transcribed repeats but still containing overlapping repeats with the RepMask.clean argument; or a file free of both co-transcribed and overlapping repeats with the RepMask.ovlp.clean argument. When the file contains co-transcribed repeats, the user must set rm.cotrans = T, and when it contains overlapping repeats, the user must set overlapping = T.

Usage

mk.reference(
  RepMask,
  overlapping = T,
  by = "classRep",
  trme,
  threads = 1,
  annot_by = "transcripts",
  rule = c(80, 80, 80),
  best.by = "total_repeat_length",
  outdir,
  over.res = "HS",
  trpt.length = NULL,
  ...
)
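
As a rough illustration of the signature above, the following sketch assembles the arguments in a list and calls mk.reference() only when the package and the input files are actually available. The file names (RM_output.out, transcriptome.fa, salmon_reference) are placeholders, and the package name ExplorATE is an assumption based on the repository name shown on this page.

```r
# Placeholder inputs: replace with your own RepeatMasker output and transcriptome.
args <- list(
  RepMask     = "RM_output.out",        # hypothetical RepeatMasker output file
  overlapping = TRUE,                   # the file still contains overlapping repeats
  by          = "classRep",
  trme        = "transcriptome.fa",     # hypothetical FASTA transcriptome
  threads     = 1,
  annot_by    = "transcripts",
  rule        = c(80, 80, 80),
  best.by     = "total_repeat_length",
  outdir      = "salmon_reference",     # hypothetical output directory
  over.res    = "HS"
)

# Run only if the package (assumed name: ExplorATE) and the inputs exist.
inputs_ok <- file.exists(args$RepMask) && file.exists(args$trme)
if (requireNamespace("ExplorATE", quietly = TRUE) && inputs_ok) {
  ref <- do.call(ExplorATE::mk.reference, args)
}
```

Keeping the arguments in a list makes it easy to inspect or log the exact settings before the (potentially long) run starts.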

Arguments

RepMask

RepeatMasker output file. If rm.cotrans = F, the file is assumed not to contain co-transcribed repeats. If overlapping = F, the file is assumed not to contain overlapping repeats.

overlapping

Indicates whether the RepMask file contains overlapping repeats (TRUE) or not (FALSE). When it does, the ovlp.res() function is used to resolve them, and the resolution criterion (higher score, longer element or lower Kimura distance) must be indicated with the over.res argument.

by

The column by which the repeats will be classified: 'classRep' (default) or 'namRep'.

trme

Transcriptome in FASTA format.

threads

Number of cores to use for processing. Default is threads = 1.

annot_by

A character vector indicating whether the annotations should be made by "transcripts" or by "fragments". When annot_by = "transcripts", the proportion of each transposon class/family in each transcript is calculated and the transcript is annotated with the class/family with the highest coverage.
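
The "transcripts" mode described above can be pictured with a small base R sketch. It is not the package's implementation: the table and its column names (transcript, classRep, rep_len, trpt_len) are made up for illustration. It sums the repeat length per class within each transcript, converts it to a percentage of the transcript length, and keeps the class with the highest coverage.

```r
# Toy repeat table; column names are hypothetical, not the package's.
hits <- data.frame(
  transcript = c("t1", "t1", "t1", "t2"),
  classRep   = c("LINE", "LINE", "SINE", "LTR"),
  rep_len    = c(300, 200, 100, 450),
  trpt_len   = c(1000, 1000, 1000, 500)
)

# Total repeat length per transcript and class.
cov <- aggregate(rep_len ~ transcript + classRep + trpt_len, data = hits, FUN = sum)
cov$pct <- 100 * cov$rep_len / cov$trpt_len

# Keep, for each transcript, the class with the highest coverage.
cov  <- cov[order(cov$transcript, -cov$pct), ]
best <- cov[!duplicated(cov$transcript), c("transcript", "classRep", "pct")]
```

In this toy table, t1 is annotated as LINE (50% coverage) and t2 as LTR (90% coverage).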

rule

A numerical vector indicating, in order, the minimum percentage of identity, the minimum percentage of the transcript length covered by the class/family repeat, and the minimum length (in base pairs) of the repeat to be analyzed. Example: c(80, 60, 100) indicates that repeats with at least 80% identity that cover at least 60% of the transcript and are at least 100 bp long will be annotated as target TEs. Default is c(80, 80, 80).
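
To make the three thresholds concrete, here is a minimal sketch that applies c(80, 60, 100) to a toy table. The column names (pct_identity, pct_of_trpt, rep_len) are hypothetical and the filter is only an illustration of the rule as described, not the package's code.

```r
rule <- c(80, 60, 100)  # min % identity, min % of transcript, min length (bp)

# Toy per-transcript summary; column names are hypothetical.
reps <- data.frame(
  transcript   = c("t1", "t2", "t3", "t4"),
  pct_identity = c(92, 75, 85, 88),
  pct_of_trpt  = c(70, 90, 40, 65),
  rep_len      = c(350, 500, 400, 90)
)

# A repeat is kept only if it passes all three thresholds.
keep <- reps$pct_identity >= rule[1] &
  reps$pct_of_trpt >= rule[2] &
  reps$rep_len >= rule[3]
target_TEs <- reps[keep, ]
```

Only t1 passes all three thresholds here: t2 fails on identity, t3 on transcript coverage, and t4 on repeat length.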

best.by

Defines whether only the best match for each transcript/sequence id should be returned (best.by = NULL returns all matches for the sequence). The best match can be chosen by the longest repeat length ('total_repeat_length') or by the highest percent identity ('per_divergence'). The mk.reference() function uses the best.by argument when references are annotated by transcripts (annot_by = 'transcripts'), and its default here is 'total_repeat_length'.

outdir

Output directory

over.res

Indicates the method by which overlapping repeats are resolved. HS: higher score, bases are assigned to the element with the highest score. LS: longer element, bases are assigned to the longest element. LD: lower divergence, bases are assigned to the element with the least divergence. If both elements have the same characteristics, the bases are assigned to the first element.
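
The three criteria and the tie rule can be sketched for a single pair of overlapping repeats as follows. This is a toy function, not the package's ovlp.res(); the name pick_winner and the fields score, length and divergence are invented for the example.

```r
# Toy resolver for two overlapping repeats; not the package's ovlp.res().
# Each repeat is a list with hypothetical fields: score, length, divergence.
pick_winner <- function(a, b, over.res = c("HS", "LS", "LD")) {
  over.res <- match.arg(over.res)
  a_wins <- switch(over.res,
    HS = a$score >= b$score,            # higher score; a tie goes to the first
    LS = a$length >= b$length,          # longer element; a tie goes to the first
    LD = a$divergence <= b$divergence   # lower divergence; a tie goes to the first
  )
  if (a_wins) "first" else "second"
}

r1 <- list(score = 1200, length = 300, divergence = 12.5)
r2 <- list(score =  900, length = 450, divergence =  8.0)

pick_winner(r1, r2, "HS")  # "first"
pick_winner(r1, r2, "LS")  # "second"
pick_winner(r1, r2, "LD")  # "second"
```

The same pair of repeats can therefore be resolved differently depending on the criterion, which is why the choice of over.res matters for the final annotation.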

trpt.length

A data.frame with two columns: the first column must contain the names of the transcripts, and the second the length of each transcript. The default is trpt.length = NULL, in which case the lengths for each transcript are taken from the RepeatMasker file.
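
One possible way to build such a two-column data.frame is to read the lengths from the transcriptome FASTA file with base R, as sketched below. The sketch writes a tiny temporary FASTA file so that it is self-contained. It assumes that every record has at least one sequence line and that the transcript name is the header text up to the first whitespace; whether those names match the ones in your RepeatMasker file is something to check on your own data.

```r
# Write a tiny FASTA file so the sketch is self-contained.
fa <- tempfile(fileext = ".fa")
writeLines(c(">t1 some description", "ACGTAC", "GGA", ">t2", "TTTT"), fa)

lines  <- readLines(fa)
is_hdr <- startsWith(lines, ">")

# Transcript name: header text up to the first whitespace, without ">".
ids <- sub("^>([^[:space:]]+).*$", "\\1", lines[is_hdr])

# Sum the sequence-line lengths within each record.
record <- cumsum(is_hdr)
lens   <- tapply(nchar(lines[!is_hdr]), record[!is_hdr], sum)

# First column: transcript names; second column: transcript lengths.
trpt.length <- data.frame(transcript = ids, length = as.integer(lens))
```

For large transcriptomes a dedicated FASTA reader would be faster, but the resulting table has the same shape.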

rm.cotrans

logical vector indicating whether co-transcribed repeats should be removed

align

RepeatMasker .align (alignments) file.

anot

annotation file in outfmt6 format. It is necessary when the option rm.cotrans = T

gff3

gff3 file. It is necessary when the option rm.cotrans = T

stranded

logical vector indicating if the library is strand specific

cleanTEsProt

logical vector indicating whether a search for TE-related proteins (e.g. transposases, integrases, env, reverse transcriptase, etc.) should be carried out. We recommend using a curated annotation file from which these genes have been excluded; therefore the default option is F. When T is selected, the search is performed against a database obtained from UniProt, so we recommend that the subject sequence ids in the annotation file follow this format (e.g. "CO1A2_MOUSE"/"sp|Q01149|CO1A2_MOUSE"/"tr|H9GLU4|H9GLU4_ANOCA").
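
Before enabling this option, it may help to check that the subject ids in the annotation file look like the examples above. The sketch below is only an illustration: the pattern used for the check is an assumption about what a UniProt-style entry name looks like, not a rule taken from the package.

```r
subject_ids <- c("CO1A2_MOUSE", "sp|Q01149|CO1A2_MOUSE", "tr|H9GLU4|H9GLU4_ANOCA")

# Drop everything up to the last "|" to keep only the entry name.
entry_names <- sub("^.*\\|", "", subject_ids)

# Assumed shape of an entry name: uppercase letters/digits, "_", uppercase letters/digits.
looks_ok <- grepl("^[A-Z0-9]+_[A-Z0-9]+$", entry_names)
```

Ids for which looks_ok is FALSE are worth inspecting by hand before running the search.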

featureSum

Returns statistics related to the characteristics of the transcripts. Requires a gff3 file. If TRUE, returns a list of the

ignore.aln.pos

The RepeatMasker alignments file may show discrepancies in repeat positions with respect to the output file. If over.res = "LD" is selected, you can choose whether to take into account the positions in the alignments file or to use the average per repeat class (default).


FemeniasM/ExplorATEproject documentation built on Nov. 30, 2022, 5:26 p.m.