preprocess: Filtering of minicircle sequences

View source: R/preprocess.R

preprocessR Documentation

Filtering of minicircle sequences

Description

The preprocess function is used to filter minicircle sequences based on sequence length and circularization success. When minicircle sequences are assembled with KOMICS, individual fasta files are generated for each sample. This function allows you to filter these sequences based on their length and whether they are circularized or not. The filtered sequences are then written into individual FASTA files in the current working directory.

Usage

preprocess(files, groups, circ = TRUE, min = 500, max = 1500, writeDNA = TRUE)

Arguments

files

a character vector containing the names of the fasta files. Each file corresponds to the minicircle sequences of a specific sample. The file names should be in the format sampleA.minicircles.fasta, sampleB.minicircles.fasta, and so on (output of KOMICS).

groups

a factor specifying the group (e.g., species) to which each sample belongs. It should have the same length as the list of files, indicating the group assignment for each sample.

circ

a logical parameter that determines whether non-circularized minicircle sequences should be included or excluded from the filtering process. By default, non-circularized sequences are excluded (circ = TRUE). If you are interested in including non-circularized sequences, you can set the parameter to FALSE.

min

the minimum length threshold for filtering minicircle sequences. Sequences with a length below this threshold will be excluded. The default value is set to 500.

max

the maximum length threshold for filtering minicircle sequences. Sequences with a length above this threshold will be excluded. The default value is set to 1500.

writeDNA

a logical parameter that determines whether the filtered minicircle sequences should be written in FASTA format to the current working directory. By default, the filtered sequences are written (writeDNA = TRUE). If you are only interested in other output values like plots and summary, you can set this parameter to FALSE.

Value

samples

the sample names based on the input files.

N_MC

a table containing the sample names, the corresponding group assignment, and the number of minicircle sequences (N_MC) before and after filtering.

plot

a barplot visualizing the number of minicircle sequences per sample before and after filtering.

summary

the total number of minicircle sequences before and after filtering.

Examples

require(ggplot2)
data(exData)

### setwd("")

### run function
table(exData$species)
pre <- preprocess(files = system.file("extdata", exData$fastafiles, package="rKOMICS"),
                  groups = exData$species,
                  circ = TRUE, min = 500, max = 1200, writeDNA = FALSE)

pre$summary 

### visualize results
barplot(pre$N_MC[,"beforefiltering"], 
        names.arg = pre$N_MC[,1], las=2, cex.names=0.4)

### alter plot
pre$plot + labs(caption = paste0('N of MC sequences before and after filtering, ', Sys.Date()))


rKOMICS documentation built on July 9, 2023, 7:46 p.m.