STAR.align.folder | R Documentation |
Aligns all files in a folder as either paired end or single end,
so if you have a mix, split them into two different folders.
If STAR halts at '.... loading genome', it means a previous STAR run
with that index was aborted early. You then need to run
STAR.remove.crashed.genome() with the genome that crashed, and rerun.
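To check whether a folder mixes paired-end and single-end libraries before calling the function, file names can be grouped by their stem. The sketch below is only an illustration in base R: `classify_reads` is a hypothetical helper, not part of ORFik, and it assumes mates are marked by a trailing `_1`/`_2` or `.1`/`.2` before the extension, which may differ from the matching ORFik performs internally.

```r
# Hypothetical helper (not an ORFik function): classify read files as
# paired or single end from their names alone.
classify_reads <- function(files) {
  base <- basename(files)
  # Strip the fasta/fastq extension and an optional .gz
  stem <- sub("\\.(fastq|fq|fasta|fa)(\\.gz)?$", "", base)
  # Assumed mate marker: "_1"/"_2" or ".1"/".2" at the end of the stem
  mate <- ifelse(grepl("[._][12]$", stem),
                 sub("^.*[._]([12])$", "\\1", stem), NA)
  sample <- sub("[._][12]$", "", stem)
  # A sample counts as paired only if both mate 1 and mate 2 are present
  has_both <- vapply(split(mate, sample),
                     function(m) all(c("1", "2") %in% m), logical(1))
  paired <- !is.na(mate) & unname(has_both[sample])
  data.frame(file = files, sample = sample,
             layout = ifelse(paired, "paired", "single"),
             stringsAsFactors = FALSE)
}

files <- c("ctrl_1.fastq.gz", "ctrl_2.fastq.gz", "treated.fq", "lone_1.fastq")
layout <- classify_reads(files)
print(layout)
```

Files reported as `single` and `paired` could then be moved to separate folders and aligned with `paired.end = FALSE` and `paired.end = TRUE` respectively.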
STAR.align.folder(
input.dir,
output.dir,
index.dir,
star.path = STAR.install(),
fastp = install.fastp(),
paired.end = FALSE,
steps = "tr-ge",
adapter.sequence = "auto",
quality.filtering = FALSE,
min.length = 20,
mismatches = 3,
trim.front = 0,
max.multimap = 10,
alignment.type = "Local",
allow.introns = TRUE,
max.cpus = min(90, BiocParallel::bpparam()$workers),
wait = TRUE,
include.subfolders = "n",
resume = NULL,
multiQC = TRUE,
keep.contaminants = FALSE,
keep.contaminants.type = c("bam", "fastq")[1],
keep.unaligned.genome = FALSE,
script.folder = system.file("STAR_Aligner", "RNA_Align_pipeline_folder.sh", package =
"ORFik"),
script.single = system.file("STAR_Aligner", "RNA_Align_pipeline.sh", package = "ORFik")
)
input.dir |
path to the fasta/fastq files to align. Valid input files are searched for by format: (".fasta", ".fastq", ".fq", or ".fa"), with or without .gz compression. Reads can be either paired end or single end. Pairs are detected automatically from similarity of naming, separated by something like .1 and .2 at the end. If files have been renamed so that pairs are no longer similarly named, this process will fail to find the correct pairs! |
output.dir |
directory to save the output of the pipeline steps, such as the aligned files. As shown in the usage above, it has no default and must be specified. |
index.dir |
path to STAR index folder. Path returned from ORFik function STAR.index, when you created the index folders. |
star.path |
path to STAR, default: STAR.install(). If STAR is not installed at the default location, it will be installed there. Set this to the path of a runnable STAR if you already have it. |
fastp |
path to the fastp trimmer, default: install.fastp(). If you already have it installed somewhere else, give that path. Only works on unix (Linux or Mac OS); if not on unix, use your favorite trimmer and give the output files from that trimmer as input.dir here. |
paired.end |
a logical: default FALSE, alternative TRUE. If TRUE, pairs are
auto detected by name. Can not be a combination of both TRUE and FALSE! |
steps |
a character, default: "tr-ge": trimming, then genome alignment.
If not "all", a subset of these ("tr-co-ph-rR-nc-tR-ge"). |
adapter.sequence |
character, default: "auto". Auto detect the adapter using fastp
adapter auto detection, which checks the first 1.5M reads. Auto detection will
not work 100% of the time (for example if the library is of low quality). In that case
you must rerun this function with the adapter specified, taken from the fastp adapter analysis
or from FASTQC or other adapter detection tools, else alignment will most likely fail!
If the reads are already trimmed or trimming is not wanted, set
adapter.sequence = "disable". You can manually assign an adapter like
"ATCTCGTATGCCGTCTTCTGCTTG" or "AAAAAAAAAAAAA". You can also specify one of the three
presets:
Paired end auto detection uses the overlap sequence of pairs; to use the slower, more secure paired end adapter detection, specify "autoPE". |
quality.filtering |
logical, default FALSE. Not needed for modern
library prep of RNA-seq, Ribo-seq etc (usually < ~ 0.5
If you are aligning bad quality data, set this to TRUE.
|
min.length |
20, minimum length of aligned read without mismatches to pass filter. Anything under 20 is dangerous, as chance of random hits will become high! |
mismatches |
3, max non matched bases. Excludes soft-clipping, this only filters reads that have defined mismatches in STAR. Only applies for genome alignment step. |
trim.front |
0, default trim 0 bases 5'. For Ribo-seq use default 0. Ignored if tr (trim) is not one of the arguments in "steps" |
max.multimap |
numeric, default 10. If a read maps to more locations than specified, the read is skipped. Set to 1 to only get uniquely mapping reads. Only applies to the genome alignment step; the depletion steps allow multimapping. |
alignment.type |
default: "Local": standard local alignment with soft-clipping allowed, "EndToEnd" (global): force end-to-end read alignment, does not soft-clip. |
allow.introns |
logical, default TRUE. Allow large gaps of N in reads during genome alignment. If FALSE, sets --alignIntronMax to 1 (no introns). NOTE: You will still get some spliced reads if you assigned a gtf at the index step. |
max.cpus |
integer, default: min(90, BiocParallel::bpparam()$workers), as in the usage above. |
wait |
a logical (not |
include.subfolders |
"n" (no). If "y", search recursively downwards for fasta/fastq files. |
resume |
default: NULL. Continue from a given step. Say steps are "tr-ph-ge" (trim, phix depletion, genome alignment) and resume is "ge": the function will then use the data assumed to be already trimmed and phix depleted, and start at genome alignment. Useful if something crashed, for example if you specified the wrong STAR version but the trimming step was completed. Resume mode can only run 1 step at a time. |
multiQC |
logical, default TRUE. Do a multiQC comparison of the STAR alignment between all the samples. Output is placed in the aligned/LOGS folder. See ?STAR.multiQC |
keep.contaminants |
logical, default FALSE. Create and keep bam files of reads aligning to contaminants. The default is to keep only the unaligned fastq reads, which are processed further in the "ge" genome alignment step. Useful if you want to do further processing on contaminants, like coverage of specific tRNAs etc. |
keep.contaminants.type |
character, default "bam". If aligned files of contaminants are kept, which format to output them as. Only "bam" is supported for now; Fasta / Fastq will be implemented later. |
keep.unaligned.genome |
logical, default FALSE. Create and keep the reads that did not align at the genome alignment step. The default is to keep only the aligned bam file. Useful if you want to do further processing on plasmids/custom sequences. |
script.folder |
location of the STAR folder alignment script, default internal ORFik file. You can change it and give your own if you need special alignments. |
script.single |
location of STAR single file alignment script, default internal ORFik file. You can change it and give your own if you need special alignments. |
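The interplay of `steps` and `resume` described above can be illustrated with a small base R sketch. `plan_steps` is a hypothetical helper, not ORFik's implementation. It assumes `steps` is either "all" or a dash-separated subset of "tr-co-ph-rR-nc-tR-ge", and it reads "Resume mode can only run 1 step" literally, scheduling only the resumed step; that reading is an assumption.

```r
# Hypothetical helper (not part of ORFik): interpret `steps` and `resume`.
valid_steps <- c("tr", "co", "ph", "rR", "nc", "tR", "ge")

plan_steps <- function(steps = "tr-ge", resume = NULL) {
  # "all" is taken to mean every step, in the documented order
  requested <- if (identical(steps, "all")) {
    valid_steps
  } else {
    strsplit(steps, "-", fixed = TRUE)[[1]]
  }
  unknown <- setdiff(requested, valid_steps)
  if (length(unknown) > 0) {
    stop("Unknown step(s): ", paste(unknown, collapse = ", "))
  }
  if (is.null(resume)) {
    return(list(assumed_done = character(0), to_run = requested))
  }
  if (length(resume) != 1 || !(resume %in% requested)) {
    stop("resume must be exactly one of the requested steps")
  }
  pos <- match(resume, requested)
  # Steps before `resume` are assumed finished; only `resume` is scheduled
  list(assumed_done = requested[seq_len(pos - 1)], to_run = resume)
}

plan <- plan_steps("tr-ph-ge", resume = "ge")
print(plan)
```

With `resume = "ge"`, the trimmed and phix-depleted output of the earlier steps is treated as already present, which matches the crash-recovery scenario in the `resume` description.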
Can only run on unix systems (Linux, Mac and WSL (Windows Subsystem Linux)),
and requires a minimum of 30GB memory on genomes like human, rat, zebrafish etc.
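For genomes of this size, the warning under `min.length` about random hits below 20 bases can be made concrete with a back-of-the-envelope estimate. The sketch assumes a uniformly random genome sequence, exact matches on both strands, and a genome size of roughly 3.1e9 bases for human; these are simplifying assumptions, not values used by STAR or ORFik.

```r
# Expected number of exact chance matches of a random read of length k
# in a random genome of size G, counting both strands: 2 * G / 4^k
expected_random_hits <- function(k, genome_size = 3.1e9) {
  2 * genome_size / 4^k
}

k <- c(15, 18, 20, 25)
hits <- expected_random_hits(k)
print(data.frame(read_length = k, expected_hits = signif(hits, 3)))
```

Under these assumptions a 15-mer is expected to match by chance several times, while at 20 bases the expectation drops well below one. Real genomes contain repeats, so actual chance hits are more frequent than this idealized estimate suggests.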
If for some reason the internal STAR alignment bash script does not work for you,
for example if you want more customization of the STAR/fastp arguments,
you can copy the internal alignment script,
edit it, and give the edited copy as the script used by this function.
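Copying the internal script before editing it could look like the sketch below. It reuses the `system.file()` call shown in the usage section; the destination file name `my_RNA_Align_pipeline.sh` is an arbitrary choice for illustration, and the copy only happens if ORFik is installed.

```r
# Locate the single-file alignment script shipped with ORFik.
# system.file() returns "" when the package or file is not found.
internal_script <- system.file("STAR_Aligner", "RNA_Align_pipeline.sh",
                               package = "ORFik")
# Arbitrary destination for the editable copy (hypothetical file name)
custom_script <- file.path(tempdir(), "my_RNA_Align_pipeline.sh")

if (nzchar(internal_script)) {
  copied <- file.copy(internal_script, custom_script, overwrite = TRUE)
  message("Copied to ", custom_script, ": ", copied)
  # After editing the copy, pass it on, e.g.:
  # STAR.align.folder(input.dir, output.dir, index.dir,
  #                   script.single = custom_script)
} else {
  message("ORFik script not found; is ORFik installed?")
}
```

The same pattern applies to `script.folder` with "RNA_Align_pipeline_folder.sh". Keeping the copy outside the package directory means a later ORFik update does not overwrite the edits.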
The trimmer used is fastp (the fastest I could find), which also works on
Linux, Mac and WSL (Windows Subsystem Linux).
If you want to use your own trimmer, set input.dir to the location of
the trimmed files from your program.
A note on trimming from the creator of STAR:
"adapter trimming it definitely needed for short RNA sequencing.
For long RNA-seq, I would agree with Devon that in most cases adapter trimming
is not advantageous, since, by default, STAR performs local (not end-to-end) alignment,
i.e. it auto-trims." So trimming can be skipped for longer reads.
output.dir, can be used as input in ORFik::create.experiment
Other STAR: STAR.align.single(), STAR.allsteps.multiQC(), STAR.index(), STAR.install(), STAR.multiQC(), STAR.remove.crashed.genome(), getGenomeAndAnnotation(), install.fastp()
# First specify directories wanted (temp directory here)
config_file <- tempfile()
#config.save(config_file, base.dir = tempdir())
#config <- ORFik::config(config_file)
## Yeast RNA-seq samples (small genome)
#project <- ORFik::config.exper("chalmers_2012", "Saccharomyces_cerevisiae", "RNA-seq", config)
#annotation.dir <- project["ref"]
#fastq.input.dir <- project["fastq RNA-seq"]
#bam.output.dir <- project["bam RNA-seq"]
## Download some SRA data and metadata (subset to 50k reads)
# info <- download.SRA.metadata("SRP012047", outdir = fastq.input.dir)
# info <- info[1:2,] # Subset to 2 first libraries
# download.SRA(info, fastq.input.dir, rename = FALSE, subset = 50000)
## No contaminant depletion:
# annotation <- getGenomeAndAnnotation("Saccharomyces cerevisiae", annotation.dir)
# index <- STAR.index(annotation)
# STAR.align.folder(fastq.input.dir, bam.output.dir,
# index, paired.end = FALSE) # Trim, then align to genome
## Human Ribo-seq sample (NB! very large genome and libraries!)
## Requires >= 32 GB memory
#project <- ORFik::config.exper("subtelny_2014", "Homo_sapiens", "Ribo-seq", config)
#annotation.dir <- project["ref"]
#fastq.input.dir <- project["fastq Ribo-seq"]
#bam.output.dir <- project["bam Ribo-seq"]
## Download some SRA data and metadata (full libraries)
# info <- download.SRA.metadata("DRR041459", fastq.input.dir)
# download.SRA(info, fastq.input.dir, rename = FALSE)
## Now align 2 different ways, without and with contaminant depletion
## No contaminant depletion:
# annotation <- getGenomeAndAnnotation("Homo sapiens", annotation.dir)
# index <- STAR.index(annotation)
# STAR.align.folder(fastq.input.dir, bam.output.dir,
# index, paired.end = FALSE)
## All contaminants merged:
# annotation <- getGenomeAndAnnotation(
# organism = "Homo_sapiens",
# phix = TRUE, ncRNA = TRUE, tRNA = TRUE, rRNA = TRUE,
# output.dir = annotation.dir
# )
# index <- STAR.index(annotation)
# STAR.align.folder(fastq.input.dir, bam.output.dir,
# index, paired.end = FALSE,
# steps = "tr-co-ge") # Trim, deplete merged contaminants, then align to genome