STAR_run: Map reads to genome
In e-myers/rnaseq: Process, Analyze and Visualize RNA-seq Data


Map reads to a genome using STAR. Most parameter descriptions are taken from the STAR manual.

STAR_run(readFilesIn, genomeDir, starPath = "/opt/STAR/bin/MacOSX_x86_64/",
  outDest = "./", outSuffix = "", outTrimString = "", runThreadN = 1,
  readFilesInR2 = NULL, settings = NULL, settingsOverride = NULL,
  quantMode = NULL)

`readFilesIn`	Character - Files with sequences to map (for paired-end data, the R1 files)
`genomeDir`	String - Path to directory where genome indices were generated by STAR
`starPath`	String - Path to directory with STAR executable
`outDest`	String - Directory where output files should be saved
`outSuffix`	String - Appended to the original filename to form the output filename
`outTrimString`	String - Removed from output filenames. Do not include the input file extension, which is already trimmed from basename(readFilesIn)
`runThreadN`	Numeric - Number of threads (cores) to use
`readFilesInR2`	Character - If data are paired, the R2 files
`settings`	String - "ENCODE_long" or "ENCODE_short" to use ENCODE settings for long or short RNA; if NULL (the default), STAR's defaults are used
`settingsOverride`	String - Additional arguments passed to STAR; any parameter given here that also appears in the ENCODE settings overrides the ENCODE value
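How `outDest`, `outSuffix` and `outTrimString` combine into an output name can be sketched as follows. This is only an illustration under stated assumptions, not the package's actual code: `make_out_prefix` is a hypothetical helper, and it assumes that `outTrimString` is matched literally and that only the final file extension is stripped.

```r
# Hypothetical helper, not part of the rnaseq package: shows one way the
# output filename prefix could be derived from an input path.
make_out_prefix <- function(readFile, outDest = "./", outSuffix = "",
                            outTrimString = "") {
  # Drop the directory and the final file extension (e.g. ".fastq")
  stem <- tools::file_path_sans_ext(basename(readFile))
  # Remove the trim string literally (assumption: no regex matching)
  if (nzchar(outTrimString)) {
    stem <- sub(outTrimString, "", stem, fixed = TRUE)
  }
  # file.path() joins with "/", so strip any trailing slash from outDest first
  file.path(sub("/+$", "", outDest), paste0(stem, outSuffix))
}

prefix <- make_out_prefix("trimmed/sampleA_R1_001_trimmed.fastq",
                          outDest = "sandbox/packtest/",
                          outTrimString = "_R1_001_trimmed",
                          outSuffix = "_mapped")
print(prefix)
```

With these made-up inputs the prefix is "sandbox/packtest/sampleA_mapped", which shows why the file extension should be left out of `outTrimString`: it has already been removed before trimming.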

Take a list of fastq files containing reads, and get alignments with STAR.

STAR's defaults are the defaults here. To use ENCODE settings for long or short RNA instead, set "settings" to "ENCODE_long" or "ENCODE_short". Use settingsOverride to override specific parameter values in the ENCODE settings and to set any other parameters you want. This argument is a string that is appended, as-is, to the command issued to the command line.

If an input file doesn't exist, you'll get an error. If output for an input file already exists (at least a Log.out file), that input file is skipped.

For paired-end data, readFilesIn are the R1 files and readFilesInR2 are the R2 files. The filenames need to contain R1 / R2 or r1 / r2, which appears to be the usual naming convention.

TIME: ~10m per 5G fastq. 3-6 hours for an entire total-RNA dataset (~25-30 samples). Nuc-seq took only ~30m.

COMMAND LINE EXAMPLE, using ENCODE settings for long RNA (if you want to play with the parameters while looking at just one file, this might be easiest):

/opt/STAR/bin/MacOSX_x86_64/STAR --genomeDir $genomeDir --readFilesIn $pathTrimmed$sn$trimmedSuffix --outFilterType BySJout --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --outSAMtype 'BAM SortedByCoordinate' --outFileNamePrefix $pathMapped$sn$mappedSuffix
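The override behaviour described above (a value in settingsOverride wins over the same parameter in the ENCODE settings) can be sketched with named character vectors. This is an assumption-laden illustration rather than the package's implementation: `merge_star_settings` and `star_arg_string` are hypothetical helpers, and the preset simply repeats the long-RNA values from the command-line example.

```r
# Illustrative preset: the ENCODE long-RNA values quoted in the command-line
# example, stored as a named character vector.
encode_long <- c(
  outFilterType = "BySJout",
  outFilterMultimapNmax = "20",
  alignSJoverhangMin = "8",
  alignSJDBoverhangMin = "1",
  outFilterMismatchNmax = "999",
  outFilterMismatchNoverLmax = "0.04",
  alignIntronMin = "20",
  alignIntronMax = "1000000",
  alignMatesGapMax = "1000000"
)

# Hypothetical helper: values in `override` replace same-named presets,
# and parameters not in the preset are appended.
merge_star_settings <- function(preset, override = character(0)) {
  if (length(override) > 0) {
    preset[names(override)] <- override
  }
  preset
}

# Hypothetical helper: turn the named vector into "--name value" arguments.
star_arg_string <- function(settings) {
  paste(paste0("--", names(settings)), settings, collapse = " ")
}

merged <- merge_star_settings(
  encode_long,
  c(outFilterMismatchNmax = "10", outFilterMatchNmin = "8")
)
cat(star_arg_string(merged), "\n")
```

Merging before building the string means each parameter appears only once in the final command, so the result does not depend on how STAR treats a parameter that is given twice.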

Emma Myers

# Example using paired-end data.
# dataPath is assumed to be defined already and to end with a path separator.
gdir = '/Volumes/CodingCLub1/STAR_stuff/indexes/refGene_gtf_maxLen75/'
fastqs = dir(paste(dataPath, 'trimmed', sep=''), pattern='fastq', full.names=TRUE)
r1 = fastqs[which(regexpr("R1", fastqs) > 0)]
r2 = fastqs[which(regexpr("R2", fastqs) > 0)]
# Use ENCODE settings for short RNA, except for outFilterMatchNmin
STAR_run(r1, genomeDir=gdir, outDest='sandbox/packtest/', outTrimString='_R1_001_trimmed',
  readFilesInR2=r2, runThreadN=8, settings="ENCODE_short",
  settingsOverride=c("--outFilterMatchNmin", "8"))
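
Because the R1 and R2 vectors in the example are selected independently by filename pattern, it can be worth confirming that they line up before mapping. The following is a minimal sketch, not part of the package: `check_pairs` is a hypothetical helper, the filenames are made up, and it assumes the read tag is the first "R1"/"r1" occurrence in each filename.

```r
# Hypothetical helper, not part of the rnaseq package: confirms that each R1
# file has an R2 partner whose name differs only in the read tag.
check_pairs <- function(r1, r2) {
  if (length(r1) != length(r2)) {
    stop("Different numbers of R1 and R2 files")
  }
  # Swap the first R1/r1 tag for R2/r2 (assumption: the first occurrence in
  # the filename is the read tag) and compare with the supplied R2 names.
  expected_r2 <- sub("([Rr])1", "\\12", basename(r1))
  mismatched <- expected_r2 != basename(r2)
  if (any(mismatched)) {
    stop("Unpaired files: ", paste(basename(r1)[mismatched], collapse = ", "))
  }
  invisible(TRUE)
}

# Made-up filenames for illustration
r1_demo <- c("trimmed/s1_R1_001_trimmed.fastq", "trimmed/s2_R1_001_trimmed.fastq")
r2_demo <- c("trimmed/s1_R2_001_trimmed.fastq", "trimmed/s2_R2_001_trimmed.fastq")
check_pairs(r1_demo, r2_demo)
```

The check stops with an error if the two vectors differ in length or order, which is easier to diagnose than a mismatch discovered after hours of alignment.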