preprocessing: Preprocess Experimental Data for Genomic Analysis

View source: R/preprocessing.R

preprocessingR Documentation

Preprocess Experimental Data for Genomic Analysis

Description

This function orchestrates a pipeline for preprocessing genomic data, including filtering annotations, splitting transcripts into windows, retrieving bedgraph values, and generating a final annotated table.

Usage

preprocessing(exptabpath, gencodepath, windsize, maptrackpath,
blacklistpath, genomename = NA, nbcputrans = 1, finaltabpath = tempdir(),
finaltabname = "anno.tsv", tmpfold = file.path(tempdir(), "tmptepr"),
saveobjectpath = tempdir(), savefinaltable = TRUE, reload = FALSE,
showtime = FALSE, showmemory = FALSE, deletetmp = TRUE, chromtab = NA,
forcechrom = FALSE, verbose = TRUE)

Arguments

exptabpath

Character. Path to the experiment table file.

gencodepath

Character. Path to the Gencode annotation file.

windsize

Integer. Window size for splitting transcripts.

maptrackpath

Character. Path to the mappability track file.

blacklistpath

Character. Path to the blacklist file.

genomename

Character. Name of the genome assembly (e.g., "hg38"). Default is NA. If left to NA, chromtab should be provided.

nbcputrans

Integer. Number of CPUs to use for transcript processing. Default is 1.

finaltabpath

Character. Path where the final annotated table will be saved. Default is tempdir().

finaltabname

Character. Name of the final annotated table file. Default is anno.tsv.

tmpfold

Character. Path to a temporary folder for intermediate files. Default is file.path(tempdir(), "tmptepr").

saveobjectpath

Character. Path to save intermediate objects. Default is tempdir().

savefinaltable

Logical. Whether to save the final table to disk. Default is TRUE.

reload

Logical. Whether to reload intermediate objects if available. Default is FALSE.

showtime

Logical. Whether to display timing information. Default is FALSE.

showmemory

Logical. Whether to display memory usage information. Default is FALSE.

deletetmp

Logical. Whether to delete temporary files after processing. Default is TRUE.

chromtab

A Seqinfo object retrieved with the rtracklayer method SeqinfoForUCSCGenome. If NA, the method is called automatically and the genomename should be provided. Default is NA.

forcechrom

Logical indicating if the presence of non-canonical chromosomes in chromtab (if not NA) should trigger an error. Default is FALSE.

verbose

Logical. Whether to display detailed progress messages. Default is TRUE.

Details

The 'preprocessing' function performs several key tasks: 1. Filters Gencode annotations to retrieve "transcript" annotations. 2. Differentiates between protein-coding (MANE_Select) and long non-coding (lncRNA, Ensembl_canonical) transcripts. 3. Splits transcripts into windows of size 'windsize'. 4. Processes bedgraph files to retrieve values, exclude blacklisted regions, and retain high-mappability intervals. 5. Generates a final annotated table with scores derived from the above steps.

Temporary files created during processing are optionally deleted at the end.

Value

A data frame representing the final table containing transcript information and scores on 'windsize' windows for all experiments defined in the experiment table (exptabpath).

See Also

[retrieveanno], [makewindows], [blacklisthighmap], [createtablescores]

Examples


## Data
exptabpath <- system.file("extdata", "exptab-preprocessing.csv", package = "tepr")
gencodepath <- system.file("extdata", "gencode-chr13.gtf", package = "tepr")
maptrackpath <- system.file("extdata", "k50.umap.chr13.hg38.0.8.bed",
  package = "tepr")
blacklistpath <- system.file("extdata", "hg38-blacklist-chr13.v2.bed",
    package = "tepr")
windsize <- 200
genomename <- "hg38"

## Copying bedgraphs to the current directory
expdfpre <- read.csv(exptabpath)
bgpathvec <- sapply(expdfpre$path, function(x) system.file("extdata", x,
    package = "tepr"))
expdfpre$path <- bgpathvec
write.csv(expdfpre, file = "exptab-preprocessing.csv", row.names = FALSE,
    quote = FALSE)
exptabpath <- "exptab-preprocessing.csv"

## Testing preprocessing
finaltabtest <- preprocessing(exptabpath, gencodepath, windsize, maptrackpath,
    blacklistpath, genomename = genomename, verbose = FALSE)


tepr documentation built on June 8, 2025, 10:46 a.m.