make_trim: Make trimmed and quality filtered fastq files
In Danis102/seqpac: Seqpac: A Framework for smallRNA analysis in R using Sequence-Based Counts

make_trim

R Documentation

Make trimmed and quality filtered fastq files

Description

make_trim uses parallel processing to generate trimmed reads.

Usage

make_trim(
  input,
  output,
  indels = TRUE,
  concat = 12,
  check_mem = FALSE,
  threads = 1,
  chunk_size = NULL,
  polyG = c(type = NULL, min = NULL, mismatch = NULL),
  adapt_3_set = c(type = "hard_rm", min = 10, mismatch = 0.1),
  adapt_3 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACTA",
  adapt_5_set = c(type = NULL, min = NULL, mismatch = NULL),
  adapt_5 = NULL,
  seq_range = c(min = NULL, max = NULL),
  quality = c(threshold = 20, percent = 0.8)
)

Arguments

`input`	Character indicating the path to a directory containing fastq formatted sequence files either compressed (.fastq.gz) or text (.fastq).
`output`	Character indicating the path to a directory where the compressed trimmed fastq files should be saved (as basename trim.fastq.gz).
`indels`	Logical whether indels should be allowed in the mismatch between read and adaptor sequences. Note that indel trimming will only be done on fragments that have failed to be trimmed by other means. Default=TRUE.
`concat`	Integer setting the threshold for trimming concatemer-like adaptors. Even when adaptor synthesis and ligation are strictly controlled, concatemer-like ("di-") adaptor sequences are formed. When `concat` is an integer, `make_trim` will search all trimmed sequences for additional adaptor sequence using a shorter version of the provided adaptor in `adapt_3`. The length of the short adaptor is controlled by `adapt_3_set$min`. If an additional, shorter, adaptor sequence is found in a read, trimming will only occur if the resulting sequence is <= concat shorter in length than the first trimmed sequence. Thus, if `concat`=12, a read with a NNNNN-XXXXX-YYYYYYYYY composition, where N is 'real' nucleotides and X/Y are two independent adaptor sequences, the trimming will result in NNNNN-XXXXX. If instead `concat`=5 then trimming will result in NNNNN. Note that setting `concat` too low will result in trimming of real nucleotides that just happend to share sequence with the adaptor. As default `concat`=12, which have been carefully evaluated in relation to the risk of trimming real sequence. If `concat`=NULL, concatemer-like adaptor trimming will not be done.
`check_mem`	Logical whether a memory check should be performed. A memory check estimates the approximate memory usage given the fastq file sizes and number of threads. The check happens before entering the parallel trimming loop, and will give an immediate warning given intermediary risky memory estimates, and an error if the risk is very high. Note, this option depends on the parallel and benchmarkme packages. (default=FALSE)
`threads`	Integer stating the number of parallel jobs. Note, that reading multiple fastq files drains memory, using up to 10Gb per fastq file. To avoid crashing the system due to memory shortage, make sure that each thread on the machine have at least 10 Gb of memory available, unless your fastq files are very small or very large. Use `parallel::detectcores()` to see available threads on the machine. Never use all threads! To run large fastq on low end computer use 'chunk_size'.
`chunk_size`	Integer indicating the number of reads to be read into memory in each chunk. When chunk_size=NULL (default), the whole fastq will be processed as one. Thus, running 5 threads will keep 5 full fastq files in memory simultaneously (fast but demanding). Setting chunk_size to an integer will instead read only a portion of each fastq at a time and then append the trimmed results on disc (slow but less demanding). Try using chunk_size=1000000 reads, less will be extremely slow. Rather decrease threads if you encounter memory problems.
`polyG`	Character vector with three inputs named 'type', 'min' and 'mismatch' that controls poly G trimming. This trimming might be necessary for fastq files from two channel illumina sequencers (e.g. NextSeq and NovaSeq) where no color signals are interpreted as 'G', but where read failure sometimes also results in long stretches of 'no-signal-Gs'. Works similar to `adapt_3_set` but instead of an adaptor sequence a poly 'G' string is constructed from the 'min' input option. Thus, if polyG=c(type="hard_rm", min=10, mismatch=0.1) then sequences containing 'GGGGGGGGGGNNNNN...etc' with 10 percent mismatch will be removed from the output fastq file.
`adapt_3_set`	Character vector with three inputs named 'type', 'min' and 'mismatch' that controls 3' trimming. As default, adapt_3_set=c(type="hard_rm", min=10, mismatch=0.1). Options for each input are: type= 'hard_trim': Trims all up to the very last nucleotide. 'hard_rm': Same as 'hard_trim' but removes untrimmed sequences. 'hard_save': Same as 'hard_trim' but saves all untrimmed sequences in new fastq file. 'soft_trim': No trimming on the last 3' nucleotides equal to 1/2 of 'min'. 'soft_rm': Same as 'soft_trim' but removes untrimmed sequences. 'soft_save': Same as 'soft_trim' but saves all untrimmed sequences in new fastq file. min= Integer that controls the short adaptor/poly-G/concatamer trimming and soft trimming. When min=10, a short version of the adaptor is generated containing the 10 first nucleotides. The short version of the adaptor therefore controls the minimum number of overlaps between adaptor/poly-G. Short adaptor trimming will only occur on untrimmed reads that have failed trimming with full adaptor. The min option also controls how many terminal nucleotides that will escape trimming when type is set to 'soft' (1/2 of min). mismatch= Numeric controlling the percent mismatch. For instance, if min=10 and mismatch=0.1, then 1 mismatch is allowed in the minimum overlap.
`adapt_3`	Character string defining the 3' adaptor sequence. In most library preparation protocols, the 3' adaptor is the only adaptor needed trimming.
`adapt_5_set`	Currently not supported, but will be in future updates. Same as `adapt_3_set` but controls the behavior of 5' trimming when a sequence is provided in `adapt_5`.
`adapt_5`	Character string defining a 5' adaptor sequence. 5' adaptor sequences are only trimmed in specialized protocols or in very short paired-end sequencing.
`seq_range`	Numeric vector with two inputs named 'min' and 'max', that controls the sequence size filter. For example, if `seq_range=c(min=15, max=50)` the function will extract sequences in the range between 15-50 nucleotides after trimming. As default, `seq_range=c(min=NULL, max=NULL)` and will retain all trimmed sequences.
`quality`	Numeric vector with two inputs named 'threshold' and 'percent' that controls the fastq quality filter. For example, if `quality=c(threshold=20, percent=0.8)` (default) the function will extract sequences that pass a fastq phred score >=20 in 80 percent of the nucleotides after trimming. If `quality=c(threshold=NULL, percent=NULL)` then all sequences will be retained (not default!).

Details

Given a file path to sequence files in fastq format (extension .fastq.gz or .fastq) this function will first trim the sequences from adaptor sequence, and then apply sequence size and phred score quality filters. On systems with low memory or very large fastq files, users may choose to handle each file in chunks on the expense of time performance.

Value

Exports a compressed trimmed fastq file to the output folder (file name: basename+trim.fastq.gz). If any type="save" has been set, an additional fastq file (file name includes 'REMOVED') will be exported. In addition, overview statisics (progress report) will be returned as a data.frame.

Examples


############################################################ 
### Seqpac fastq trimming with the make_trim function 
### (for more streamline options see the make_counts function) 
 ############################################################ 
### Seqpac fastq trimming with the make_counts function 
### using default settings for NEBNext small RNA adaptor 

# Seqpac includes strongly down-sampled smallRNA fastq.
sys_path = system.file("extdata", package = "seqpac", mustWork = TRUE)
input <- list.files(path = sys_path, pattern = "fastq", all.files = FALSE,
                full.names = TRUE)

# Run make_trim using NEB-next adaptor
# (Here we store the trimmed files in the input folder)
# (but you may choose where to store them using the output option)

input <- unique(dirname(input))  # mak_trim searches for fastq in input dir
output <- tempdir()              # save in tempdir


list.files(input, pattern = "fastq") #before

prog_report  <-  make_trim(
       input=input, output=output, threads=1, 
       adapt_3_set=c(type="hard_rm", min=10, mismatch=0.1), 
       adapt_3="AGATCGGAAGAGCACACGTCTGAACTCCAGTCACTA", 
       polyG=c(type="hard_trim", min=20, mismatch=0.1),
       seq_range=c(min=14, max=70),
       quality=c(threshold=20, percent=0.8))
       
list.files(path = output, pattern = "fastq") #after

# How did it go? Check progress report:  
prog_report     
       

## The principle:
# input <- "/some/path/to/untrimmed_fastq_folder"
# output <- "/some/path/to/output_folder_for_trimmed_fastq"
# 
# prog_report  <-  make_trim(
#      input=input, output=output, 
#      threads=<how_many_parallel>, 
#      check_mem=<should_memory_be_checked?>, 
#      adapt_3_set=<options_for_main_trimming, 
#      adapt_3=<sequence_to be trimmed, 
#      polyG=<Illumina_Nextseq_type_poly_G_trimming>,
#      seq_range=<what_sequence_range_to_save>,
#      quality=<quality_filtering_options>))

Danis102/seqpac documentation built on Aug. 26, 2023, 10:15 a.m.

Danis102/seqpac index

README.md

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Danis102/seqpac
Seqpac: A Framework for smallRNA analysis in R using Sequence-Based Counts

make_trim: Make trimmed and quality filtered fastq files
In Danis102/seqpac: Seqpac: A Framework for smallRNA analysis in R using Sequence-Based Counts

Make trimmed and quality filtered fastq files

Description

Usage

Arguments

Details

Value

See Also

Examples

Related to make_trim in Danis102/seqpac...

R Package Documentation

Browse R Packages

We want your feedback!

Danis102/seqpac Seqpac: A Framework for smallRNA analysis in R using Sequence-Based Counts

make_trim: Make trimmed and quality filtered fastq files In Danis102/seqpac: Seqpac: A Framework for smallRNA analysis in R using Sequence-Based Counts

Make trimmed and quality filtered fastq files

Description

Usage

Arguments

Details

Value

See Also

Examples

Related to make_trim in Danis102/seqpac...

R Package Documentation

Browse R Packages

We want your feedback!

Danis102/seqpac
Seqpac: A Framework for smallRNA analysis in R using Sequence-Based Counts

make_trim: Make trimmed and quality filtered fastq files
In Danis102/seqpac: Seqpac: A Framework for smallRNA analysis in R using Sequence-Based Counts