FiltContig: FiltContig Function (Filter)

View source: R/Filter.R

FiltContigR Documentation

FiltContig Function (Filter)

Description

Filter out data from contigs that do not reach criterias of selection.

Usage

FiltContig(
  gposModBasePos,
  grangesModPos,
  cContigToBeRemoved = NULL,
  dnastringsetGenome,
  nContigMinSize = -1,
  listPctSeqByContig,
  nContigMinPctOfSeq = 95,
  listMeanCovByContig,
  nContigMinCoverage = 20
)

Arguments

gposModBasePos

An UnStitched GPos object containing PacBio CSV data to be filtered.

grangesModPos

A GRanges object containing PacBio GFF data to be filtered.

cContigToBeRemoved

Names of contigs for which the data will be removed. Defaults to NULL.

dnastringsetGenome

A DNAStringSet object containing the sequence for each contig.

nContigMinSize

Minimum size for contigs to keep. Contigs with a size below this value will be removed. Defaults to -1 (= no filter).

listPctSeqByContig

List containing, for each strand, the percentage of sequencing for each contig. This list must be composed of 2 dataframes (one by strand) called f_strand and r_strand. In each dataframe, "refName" column returning names of contigs and "seqPct" column returning percentage of sequencing.

nContigMinPctOfSeq

Minimum percentage of sequencing for contigs to keep. Contigs with a percentage below this value will be removed. Defaults to 95.

listMeanCovByContig

List containing, for each strand, the mean of coverage for each contig. This list must be composed of 2 dataframes (one by strand) called f_strand and r_strand. In each dataframe, "refName" column returning names of contigs and "mean_coverage" column returning mean of coverage.

nContigMinCoverage

Minimum mean coverage for contigs to keep. Contigs with a mean coverage below this value will be removed. Defaults to 20.

Value

A list with filtered gposModBasePos and filtered grangesModPos.

Examples

# loading genome
myGenome <- Biostrings::readDNAStringSet(system.file(
  package = "DNAModAnnot", "extdata",
  "ptetraurelia_mac_51_sca171819.fa"
))
myGrangesGenome <- GetGenomeGRanges(myGenome)

# Preparing a gposPacBioCSV and a grangesPacBioGFF datasets
myGrangesPacBioGFF <-
  ImportPacBioGFF(
    cPacBioGFFPath = system.file(
      package = "DNAModAnnot", "extdata",
      "ptetraurelia.modifications.sca171819.gff"
    ),
    cNameModToExtract = "m6A",
    cModNameInOutput = "6mA",
    cContigToBeAnalyzed = names(myGenome)
  )
myGposPacBioCSV <-
  ImportPacBioCSV(
    cPacBioCSVPath = system.file(
      package = "DNAModAnnot", "extdata",
      "ptetraurelia.bases.sca171819.csv"
    ),
    cSelectColumnsToExtract = c(
      "refName", "tpl", "strand", "base",
      "score", "ipdRatio", "coverage"
    ),
    lKeepExtraColumnsInGPos = TRUE, lSortGPos = TRUE,
    cContigToBeAnalyzed = names(myGenome)
  )

# Preparing ParamByStrand Lists
myPct_seq_csv <- GetSeqPctByContig(myGposPacBioCSV, grangesGenome = myGrangesGenome)
myMean_cov_list <- GetMeanParamByContig(
  grangesData = myGposPacBioCSV,
  dnastringsetGenome = myGenome,
  cParamName = "coverage"
)

# Filtering
myFiltered_data <- FiltContig(myGposPacBioCSV, myGrangesPacBioGFF,
  cContigToBeRemoved = NULL,
  dnastringsetGenome = myGenome, nContigMinSize = 1000,
  listPctSeqByContig = myPct_seq_csv, nContigMinPctOfSeq = 95,
  listMeanCovByContig = myMean_cov_list, nContigMinCoverage = 20
)
myFiltered_data$csv
myFiltered_data$gff

AlexisHardy/DNAModAnnot documentation built on Feb. 27, 2023, 12:03 a.m.