FiltPacBio: FiltPacBio Function (Filter)

View source: R/Filter.R

FiltPacBio R Documentation

Description

Filter out data from contigs or modifications that do not meet the selection criteria.

Usage

FiltPacBio(
  grangesPacBioGFF,
  gposPacBioCSV = NULL,
  cContigToBeRemoved = NULL,
  dnastringsetGenome,
  nContigMinSize = -1,
  listPctSeqByContig,
  nContigMinPctOfSeq = -1,
  listMeanCovByContig,
  nContigMinCoverage = -1,
  cParamNameForFilter = NULL,
  listMeanParamByContig = NULL,
  nContigFiltParamLoBound = NULL,
  nContigFiltParamUpBound = NULL,
  nFiltParamLoBoundaries = NULL,
  nFiltParamUpBoundaries = NULL,
  cFiltParamBoundariesToInclude = NULL,
  nModMinIpdRatio = NULL,
  nModMinScore = NULL,
  nModMinCoverage = NULL,
  listFdrEstByThrIpdRatio = NULL,
  listFdrEstByThrScore = NULL
)

Arguments

grangesPacBioGFF

A GRanges object containing the PacBio GFF data to be filtered, or a list of GRanges objects (one per motif sequence associated with the modification) containing the PacBio GFF data to be filtered.

gposPacBioCSV

An UnStitched GPos object containing PacBio CSV data to be filtered. Defaults to NULL.

cContigToBeRemoved

Names of contigs for which the data will be removed. gposPacBioCSV must be provided if using this argument. Defaults to NULL.

dnastringsetGenome

A DNAStringSet object containing the sequence for each contig.

nContigMinSize

Minimum size for contigs to keep. Contigs with a size below this value will be removed. gposPacBioCSV must be provided if using this argument. Defaults to -1 (= no filter).

listPctSeqByContig

A list containing, for each strand, the percentage of sequencing for each contig. This list must be composed of 2 data frames (one per strand) named f_strand and r_strand. Each data frame must have a "refName" column giving the names of the contigs and a "seqPct" column giving the percentage of sequencing. gposPacBioCSV must be provided if using this argument.

nContigMinPctOfSeq

Minimum percentage of sequencing for contigs to keep. Contigs with a percentage below this value will be removed. gposPacBioCSV must be provided if using this argument. Defaults to -1 (= no filter).

listMeanCovByContig

A list containing, for each strand, the mean coverage for each contig. This list must be composed of 2 data frames (one per strand) named f_strand and r_strand. Each data frame must have a "refName" column giving the names of the contigs and a "mean_coverage" column giving the mean coverage. gposPacBioCSV must be provided if using this argument.

nContigMinCoverage

Minimum mean coverage for contigs to keep. Contigs with a mean coverage below this value will be removed. gposPacBioCSV must be provided if using this argument. Defaults to -1 (= no filter).
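
The per-contig lists described above (listPctSeqByContig and listMeanCovByContig) share the same shape. The following is a minimal base R sketch of that shape, using made-up contig names and values; in practice the lists would normally come from GetSeqPctByContig and GetMeanParamByContig, as in the Examples section. How FiltPacBio combines the two strands internally is not stated in this page, so the rule used below (a contig is dropped if it fails on at least one strand) is only an assumption for illustration.

```r
# Hypothetical contig names and values, for illustration only.
listPctSeqByContig <- list(
  f_strand = data.frame(
    refName = c("contigA", "contigB", "contigC"),
    seqPct  = c(99.2, 80.5, 97.0),
    stringsAsFactors = FALSE
  ),
  r_strand = data.frame(
    refName = c("contigA", "contigB", "contigC"),
    seqPct  = c(98.7, 96.1, 97.4),
    stringsAsFactors = FALSE
  )
)

nContigMinPctOfSeq <- 95

# Names of the contigs that fall below the threshold, strand by strand.
belowByStrand <- lapply(listPctSeqByContig, function(df) {
  df$refName[df$seqPct < nContigMinPctOfSeq]
})

# Assumption: a contig is dropped if it fails on at least one strand.
contigsToDrop <- unique(unlist(belowByStrand, use.names = FALSE))
print(contigsToDrop)
```

A list for listMeanCovByContig would look the same, with a "mean_coverage" column in place of "seqPct".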

cParamNameForFilter

A character vector giving the name of the parameter to be filtered. Must correspond to the name of one column in the object provided with grangesPacBioGFF. Defaults to NULL.

listMeanParamByContig

A list containing, for each strand, the mean of a given parameter for each contig. This list must be composed of 2 data frames (one per strand) named f_strand and r_strand. Each data frame must have a "refName" column giving the names of the contigs and a "mean_"[parameter name] column giving the mean of the given parameter. If not NULL, contigs whose mean lies too far from the mean of the parameter over all contigs (that is, outside the interval centered on that mean) are removed. Defaults to NULL.

nContigFiltParamLoBound

A numeric value to be subtracted from the mean of the given parameter of all contigs (this gives the lower bound of the interval centered on the mean). Defaults to NULL.

nContigFiltParamUpBound

A numeric value to be added to the mean of the given parameter of all contigs (calculates the upper bound of the interval centered on the mean). Defaults to NULL.
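
To make the interval around the mean concrete, here is a minimal base R sketch for a single strand. The contig names and values are invented, the overall mean is taken as the plain mean of the per-contig means, and both bounds are treated as inclusive; these are assumptions for illustration, not a description of the package internals.

```r
# Hypothetical per-contig means of a parameter (here "ipdRatio") for one strand.
meanByContig <- data.frame(
  refName       = c("contigA", "contigB", "contigC", "contigD"),
  mean_ipdRatio = c(1.00, 1.10, 0.90, 2.00),
  stringsAsFactors = FALSE
)

nContigFiltParamLoBound <- 0.5
nContigFiltParamUpBound <- 0.5

# Mean of the parameter over all contigs, then the interval around it.
globalMean <- mean(meanByContig$mean_ipdRatio)
lowerBound <- globalMean - nContigFiltParamLoBound
upperBound <- globalMean + nContigFiltParamUpBound

# Assumption: both bounds are treated as inclusive in this sketch.
keep <- meanByContig$mean_ipdRatio >= lowerBound &
  meanByContig$mean_ipdRatio <= upperBound
contigsKept <- meanByContig$refName[keep]
print(contigsKept)
```

Here the overall mean is 1.25, so the interval runs from 0.75 to 1.75 and only contigD falls outside it.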

nFiltParamLoBoundaries

A numeric vector giving the lower boundaries of the intervals. Must have the same length as "nFiltParamUpBoundaries". Defaults to NULL.

If this parameter is provided, the function will remove modifications which have values of the given parameter that are not included in the intervals provided with "nFiltParamLoBoundaries" and "nFiltParamUpBoundaries".

nFiltParamUpBoundaries

A numeric vector giving the upper boundaries of the intervals. Must have the same length as "nFiltParamLoBoundaries". Defaults to NULL.

If this parameter is provided, the function will remove modifications which have values of the given parameter that are not included in the intervals provided with "nFiltParamLoBoundaries" and "nFiltParamUpBoundaries".

cFiltParamBoundariesToInclude

A character vector describing which interval boundaries must be included in the intervals provided. Can be "upperOnly" (only upper boundaries), "lowerOnly" (only lower boundaries), "both" (both upper and lower boundaries) or "none" (do not include upper and lower boundaries). If NULL, both upper and lower boundaries will be included (= "both"). Defaults to NULL.
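
The effect of the boundary options can be sketched in base R as follows. The helper inAnyInterval is hypothetical and not part of DNAModAnnot; it assumes one boundary option per interval (recycled if a single value is given) and treats NULL as "both", as described above.

```r
# Hypothetical helper, not part of DNAModAnnot:
# returns TRUE for each value of x that falls in at least one interval.
inAnyInterval <- function(x, lower, upper, include = NULL) {
  stopifnot(length(lower) == length(upper))
  if (is.null(include)) include <- "both"
  # Assumption: one option per interval, recycled when needed.
  include <- rep_len(include, length(lower))
  hit <- rep(FALSE, length(x))
  for (i in seq_along(lower)) {
    aboveLower <- if (include[i] %in% c("both", "lowerOnly")) {
      x >= lower[i]
    } else {
      x > lower[i]
    }
    belowUpper <- if (include[i] %in% c("both", "upperOnly")) {
      x <= upper[i]
    } else {
      x < upper[i]
    }
    hit <- hit | (aboveLower & belowUpper)
  }
  hit
}

values <- c(10, 20, 25, 30, 50)

# Intervals [10, 20) and (30, 50): only 10 is retained.
keptMixed <- inAnyInterval(values, lower = c(10, 30), upper = c(20, 50),
                           include = c("lowerOnly", "none"))

# Default (NULL = "both"): intervals [10, 20] and [30, 50].
keptBoth <- inAnyInterval(values, lower = c(10, 30), upper = c(20, 50))
```

Values for which the result is FALSE correspond to modifications that would be removed under these assumptions.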

nModMinIpdRatio

Minimum ipdRatio for all Modifications to be kept. Modifications with an ipdRatio below this value will be removed. Defaults to NULL (no filter).

nModMinScore

Minimum score for all Modifications to be kept. Modifications with a score below this value will be removed. Defaults to NULL (no filter).

nModMinCoverage

Minimum coverage for all Modifications to be kept. Modifications with a coverage below this value will be removed. Defaults to NULL (no filter).

listFdrEstByThrIpdRatio

A list of thresholds on ipdRatio for each motif associated with the modification. Defaults to NULL.

listFdrEstByThrScore

A list of thresholds on score for each motif associated with the modification. Defaults to NULL.

Value

A list with the filtered gposPacBioCSV (element "csv") and the filtered grangesPacBioGFF (element "gff").

Examples

# loading genome
myGenome <- Biostrings::readDNAStringSet(system.file(
  package = "DNAModAnnot", "extdata",
  "ptetraurelia_mac_51_sca171819.fa"
))
myGrangesGenome <- GetGenomeGRanges(myGenome)

# Preparing a gposPacBioCSV and a grangesPacBioGFF datasets
myGrangesPacBioGFF <-
  ImportPacBioGFF(
    cPacBioGFFPath = system.file(
      package = "DNAModAnnot", "extdata",
      "ptetraurelia.modifications.sca171819.gff"
    ),
    cNameModToExtract = "m6A",
    cModNameInOutput = "6mA",
    cContigToBeAnalyzed = names(myGenome)
  )
myGposPacBioCSV <-
  ImportPacBioCSV(
    cPacBioCSVPath = system.file(
      package = "DNAModAnnot", "extdata",
      "ptetraurelia.bases.sca171819.csv"
    ),
    cSelectColumnsToExtract = c(
      "refName", "tpl", "strand", "base",
      "score", "ipdRatio", "coverage"
    ),
    lKeepExtraColumnsInGPos = TRUE, lSortGPos = TRUE,
    cContigToBeAnalyzed = names(myGenome)
  )

# Preparing ParamByStrand Lists
myPct_seq_csv <- GetSeqPctByContig(myGposPacBioCSV, grangesGenome = myGrangesGenome)
myMean_cov_list <- GetMeanParamByContig(
  grangesData = myGposPacBioCSV,
  dnastringsetGenome = myGenome,
  cParamName = "coverage"
)

# Filtering
myFiltered_data <- FiltPacBio(
  grangesPacBioGFF = myGrangesPacBioGFF,
  gposPacBioCSV = myGposPacBioCSV, cContigToBeRemoved = NULL,
  dnastringsetGenome = myGenome, nContigMinSize = 1000,
  listPctSeqByContig = myPct_seq_csv, nContigMinPctOfSeq = 95,
  listMeanCovByContig = myMean_cov_list, nContigMinCoverage = 20
)
myFiltered_data$csv
myFiltered_data$gff

AlexisHardy/DNAModAnnot documentation built on Feb. 27, 2023, 12:03 a.m.