STAR.align.folder: Align all libraries in folder with STAR

Description Usage Arguments Details Value See Also Examples

View source: R/STAR.R

Description

Does either all files as paired end or single end, so if you have mix, split them in two different folders.
If STAR halts at .... loading genome, it means the STAR index was aborted early, then you need to run: STAR.remove.crashed.genome(), with the genome that crashed, and rerun.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
STAR.align.folder(
  input.dir,
  output.dir,
  index.dir,
  star.path = STAR.install(),
  fastp = install.fastp(),
  paired.end = FALSE,
  steps = "tr-ge",
  adapter.sequence = "auto",
  min.length = 20,
  mismatches = 3,
  trim.front = 0,
  max.multimap = 10,
  alignment.type = "Local",
  max.cpus = min(90, detectCores() - 1),
  wait = TRUE,
  include.subfolders = "n",
  resume = NULL,
  multiQC = TRUE,
  script.folder = system.file("STAR_Aligner", "RNA_Align_pipeline_folder.sh", package =
    "ORFik"),
  script.single = system.file("STAR_Aligner", "RNA_Align_pipeline.sh", package =
    "ORFik")
)

Arguments

input.dir

path to fast files to align, the valid input files will be search for from formats: fast files (.fasta, .fastq, .fq, or.fa) with or without compression of .gz. Also either paired end or single end reads. Pairs will automatically be detected from similarity of naming, usualy with a .1 and .2 in the end. If files are renamed, where pairs are not similarily named, this process will fail to find correct pairs.

output.dir

directory to save indices, default: paste0(dirname(arguments[1]), "/STAR_index/"), where arguments is the arguments input for this function.

index.dir

path to STAR index folder. Path returned from ORFik function STAR.index, when you created the index folders.

star.path

path to STAR, default: STAR.install(), if you don't have STAR installed at default location, it will install it there, set path to a runnable star if you already have it.

fastp

path to fastp trimmer, default: install.fastp(), if you have it somewhere else already installed, give the path. Only works for unix (linux or Mac OS), if not on unix, use your favorite trimmer and give the output files from that trimmer as input.dir here.

paired.end

a logical: default FALSE, alternative TRUE. If TRUE, will auto detect pairs by names. If yes running on a folder: The folder must then contain an even number of files and they must be named with the same prefix and sufix of either _1 and _2, 1 and 2, etc. If SRR numbers are used, it will start on lowest and match with second lowest etc.

steps

a character, default: "tr-ge", trimming then genome alignment
steps of depletion and alignment wanted: The posible candidates you can use are:

  • tr : trim reads

  • co : contamination merged depletion

  • ph : phix depletion

  • rR : rrna depletion

  • nc : ncrna depletion

  • tR : trna depletion

  • ge : genome alignment

  • all: run steps: "tr-co-ge" or "tr-ph-rR-nc-tR-ge", depending on if you have merged contaminants or not

If not "all", a subset of these ("tr-co-ph-rR-nc-tR-ge")
If co (merged contaminants) is used, non of the specific contaminants can be specified, since they should be a subset of co.
The step where you align to the genome is usually always included, unless you are doing pure contaminant analysis. For Ribo-seq and TCP(RCP-seq) you should do rR (ribosomal RNA depletion), so when you made the STAR index you need the rRNA step, either use rRNA from .gtf or manual download. (usually just download a Silva rRNA database for SSU&LSU at: https://www.arb-silva.de/) for your species.

adapter.sequence

character, default: "auto". Auto detect adapter using fastp adapter auto detection, checking first 1.5M reads. (auto detect adapter, is not very reliable for Ribo-seq, so then you must include a manually specified, else alignment will most likely fail!). If already trimmed or trimming not wanted: adapter.sequence = "disable" .You can manually assign adapter like: "ATCTCGTATGCCGTCTTCTGCTTG" or "AAAAAAAAAAAAA". You can also specify one of the three presets:

  • illumina (standard for 100 bp sequencing): AGATCGGAAGAGC

  • small_RNA (standard for ~50 bp sequencing): TGGAATTCTCGG

  • nextera: CTGTCTCTTATA

min.length

20, minimum length of aligned read without mismatches to pass filter.

mismatches

3, max non matched bases. Excludes soft-clipping, this only filters reads that have defined mismatches in STAR. Only applies for genome alignment step.

trim.front

0, default trim 0 bases 5'. For Ribo-seq set use 0. Ignored if tr (trim) is not one of the arguments in "steps"

max.multimap

numeric, default 10. If a read maps to more locations than specified, will skip the read. Set to 1 to only get unique mapping reads. Only applies for genome alignment step. The depletions are allowing for multimapping.

alignment.type

default: "Local": standard local alignment with soft-clipping allowed, "EndToEnd" (global): force end-to-end read alignment, does not soft-clip.

max.cpus

integer, default: min(90, detectCores() - 1), number of threads to use. Default is minimum of 90 and maximum cores - 1. So if you have 8 cores it will use 7.

wait

a logical (not NA) indicating whether the R interpreter should wait for the command to finish, or run it asynchronously. This will be ignored (and the interpreter will always wait) if intern = TRUE. When running the command asynchronously, no output will be displayed on the Rgui console in Windows (it will be dropped, instead).

include.subfolders

"n" (no), do recursive search downwards for fast files if "y".

resume

default: NULL, continue from step, lets say steps are "tr-ph-ge": (trim, phix depletion, genome alignment) and resume is "ge", you will then use the assumed already trimmed and phix depleted data and start at genome alignment, useful if something crashed. Like if you specified wrong STAR version, but the trimming step was completed. Resume mode can only run 1 step at the time.

multiQC

logical, default TRUE. Do mutliQC comparison of STAR alignment between all the samples. Outputted in aligned/LOGS folder. See ?STAR.multiQC

script.folder

location of STAR index script, default internal ORFik file. You can change it and give your own if you need special alignments.

script.single

location of STAR single file alignment script, default internal ORFik file. You can change it and give your own if you need special alignments.

Details

Can only run on unix systems (Linux and Mac), and requires minimum 30GB memory on genomes like human, rat, zebrafish etc.
If for some reason the internal STAR alignment bash script will not work for you, like if you have a very small genome. You can copy the internal alignment script, edit it and give that as the Index script used for this function.
The trimmer used is fastp (the fastest I could find), works on mac and linux. If you want to use your own trimmer set file1/file2 to the location of the trimmed files from your program.
A note on trimming from creator of STAR about trimming: "adapter trimming it definitely needed for short RNA sequencing. For long RNA-seq, I would agree with Devon that in most cases adapter trimming is not advantageous, since, by default, STAR performs local (not end-to-end) alignment, i.e. it auto-trims." So trimming can be skipped for longer reads.

Value

output.dir, can be used as as input in ORFik::create.experiment

See Also

Other STAR: STAR.align.single(), STAR.allsteps.multiQC(), STAR.index(), STAR.install(), STAR.multiQC(), STAR.remove.crashed.genome(), getGenomeAndAnnotation(), install.fastp()

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
# First specify directories wanted
annotation.dir <- "~/Bio_data/references/Human"
fastq.input.dir <- "~/Bio_data/raw_data/Ribo_seq_subtelny/"
bam.output.dir <- "~/Bio_data/processed_data/Ribo_seq_subtelny_2014/"

## Download some SRA data and metadata
# info <- download.SRA.metadata("DRR041459", fastq.input.dir)
# download.SRA(info, fastq.input.dir, rename = FALSE)
## Now align 2 different ways, without and with contaminant depletion

## No contaminant depletion:
# annotation <- getGenomeAndAnnotation("Homo sapiens", annotation.dir)
# index <- STAR.index(annotation)
# STAR.align.folder(fastq.input.dir, bam.output.dir,
#                   index, paired.end = FALSE)

## All contaminants merged:
# annotation <- getGenomeAndAnnotation(
#    organism = "Homo_sapiens",
#    phix = TRUE, ncRNA = TRUE, tRNA = TRUE, rRNA = TRUE,
#    output.dir = annotation.dir
#    )
# index <- STAR.index(annotation)
# STAR.align.folder(fastq.input.dir, bam.output.dir,
#                   index, paired.end = FALSE,
#                   steps = "tr-ge")

ORFik documentation built on March 27, 2021, 6 p.m.