In AlexisHardy/DNAModAnnot: Toolbox for DNA Modifications filtering and annotation

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

DNAModAnnot is a R package providing a comprehensive toolkit for the analysis and annotation of DNA modifications (e.g. 6-methyladenine (6mA)). Its modular architecture allows the analysis of modification detection performed using Pacific Biosciences (PacBio) kineticsTools or Oxford Nanopore Technologies via DeepSignal software.

This documentation provides a guided use of this package using PacBio input data. By following this tutorial, you will learn to:

1st step: load and filter the genome and DNA modification data
2nd step: analyze the global distribution and motif of DNA modification data
3rd step: filter DNA modification data using False Discovery Rate estimations
4th step: analyze the DNA modification patterns within genomic annotations

Installation

This package is available via GitHub ( https://github.com/AlexisHardy/DNAModAnnot ) and can be installed using the following code:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

BiocManager::install(c('Biostrings', 'BSgenome', 'Gviz', 'seqLogo'))

Then, you can install DNAModAnnot using the tar.gz file from GitHub repository:

setwd("path/to/package/file/")
install.packages("DNAModAnnot_0.0.0.9018.tar.gz", repos = NULL, type = 'source')

Or you can directly install from GitHub using devtools package:

install.packages("devtools")
library(devtools)
install_github("AlexisHardy/DNAModAnnot")

You should then be able to load the package into your R session with:

library(DNAModAnnot)

Required data

DNAModAnnot can load data generated by the PacBio platform after using the Pacific Biosciences (PacBio) kineticsTools on PacBio RSI/RSII/Sequel/SequelII data. The following output files will be used by DNAModAnnot to perform a stringent analysis of detected modification patterns:

modifications.csv file (PacBioCSV): contains kinetics from all bases sequenced.
modifications.gff file (PacBioGFF): contains kinetics for all modified bases detected along with methylation level (or fraction) for each of those modifications.

It is also possible for DNAModAnnot to load data generated by the Nanopore platform after using the DeepSignal software. Here, the output of the call_modification_frequency.py script (containing modification detection parameters for all target sites sequenced) will then be required (cf Deepsignal documentation).

1st step: load and filter the genome and DNA modification data

Before analyzing modifications distribution, the package DNAModAnnot proposes various functions to assess the quality of sequencing and filter out bias (e.g. lack of coverage) coming from some scaffolds/bases.

You can load a genome (fasta file) using readDNAStringSet from Biostrings package.
The GetGenomeGRanges function will return a GRanges object representing the scaffolds which will be used by various functions later.

# Loading genome assembly (fasta file or DNAStringSet)
ptetraurelia_genome_fa <- system.file(
  package = "DNAModAnnot", "extdata",
  "ptetraurelia_mac_51_sca171819.fa"
)
ptetraurelia_genome <- Biostrings::readDNAStringSet(ptetraurelia_genome_fa)
ptetraurelia_genome_range <- GetGenomeGRanges(ptetraurelia_genome)

The GetAssemblyReport function returns a classic report (similar to reports from QUAST software) in a data.frame object.

# Statistics summary of genome
report_assembly <- GetAssemblyReport(
  dnastringsetGenome = ptetraurelia_genome,
  cOrgAssemblyName = "ptetraurelia_mac_51"
)

The GetContigCumulLength function returns a data.frame with scaffolds size and cumulated size. This data.frame can be used by DrawContigCumulLength function to plot cumulated length by contig.

# Cumulative length of genome assembly
contig_cumulative_length <- GetContigCumulLength(ptetraurelia_genome)
DrawContigCumulLength(
  nContigCumsumLength = contig_cumulative_length$cumsum_Mbp_length,
  cOrgAssemblyName = "ptetraurelia_mac_51",
  lGridInBackground = FALSE
)

*Cumulative length (Mbp) of 3 contigs from ptetraurelia_mac_51 genome assembly.* Contigs are ordered from largest to smallest contig. Here, only data from 3 contigs of similar size are used.

The modifications.csv (PacBioCSV) and modifications.gff (PacBioGFF) files can be loaded using the ImportPacBioCSV and ImportPacBioGFF functions respectively.

ImportPacBioCSV will import PacBioCSV file as an Unstitched GPos object.

Note that SortGPos=TRUE will take a longer time for the importation but will allow to greatly reduce the size of the GPos output.

ImportPacBioGFF will import PacBioGFF file as a GRanges object.

Note that ImportPacBioGFF will import one modification type at a time that you must precise using cNameModToExtract option.

# loading SMRT-seq data (PacBio gff and csv files)
PacBioGFF_path <- system.file(package = "DNAModAnnot", "extdata", "ptetraurelia.modifications.sca171819.gff")
PacBioGFF_granges <- ImportPacBioGFF(
  cPacBioGFFPath = PacBioGFF_path,
  cNameModToExtract = "m6A",
  cModNameInOutput = "6mA",
  cContigToBeAnalyzed = names(ptetraurelia_genome)
)
PacBioCSV_path <- system.file(package = "DNAModAnnot", "extdata", "ptetraurelia.bases.sca171819.csv")
PacBioCSV_gpos <- ImportPacBioCSV(
  cPacBioCSVPath = PacBioCSV_path,
  cSelectColumnsToExtract = c(
    "refName", "tpl", "strand",
    "base", "score",
    "ipdRatio", "coverage"
  ),
  lKeepExtraColumnsInGPos = TRUE,
  lSortGPos = TRUE,
  cContigToBeAnalyzed = names(ptetraurelia_genome)
)

You can use the GetSeqPctByContig function with the PacBioCSV GPos object to retrieve, into a list, the percentage of sequenced bases by contig. This list can be used for filtering with the FiltPacBio function.
You can also draw a barplot by strand using the DrawBarplotBothStrands function with the content of this list.

# Percentage of sequencing by scaffold and by strand
contig_percentage_sequencing <- GetSeqPctByContig(PacBioCSV_gpos,
  grangesGenome = ptetraurelia_genome_range
)
DrawBarplotBothStrands(
  nParamByContigForward = contig_percentage_sequencing$f_strand$seqPct,
  nParamByContigReverse = contig_percentage_sequencing$r_strand$seqPct,
  cContigNames = contig_percentage_sequencing$f_strand$refName,
  cGraphName = "Percentage of sequencing per contig"
)

*Percentage of sequenced bases (percentage of sequencing) per contig and per strand.* Contigs are ordered from largest to smallest contig. Upper part: forward strand; lower part: reverse strand. Percentage of sequencing = 100 x Number of sequenced bases/Number of total bases\*. \*According to the genome assembly sequence provided. Here, the percentage of sequencing is very low because only a small sample is provided. The percentage of sequencing is similar between the contigs/strands: there is no sequencing bias in one of the contigs.

The same approach can be used for any numeric parameter available in the PacBioCSV GPos and PacBioGFF GRanges objects. You can calculate the mean by scaffold for a numeric parameter using the GetMeanParamByContig function. Then, you can use the output list to represent barplots by strand using the DrawBarplotBothStrands function again.

# Mean coverage by scaffold and by strand
contig_mean_coverage <- GetMeanParamByContig(
  grangesData = PacBioCSV_gpos,
  dnastringsetGenome = ptetraurelia_genome,
  cParamName = "coverage"
)
DrawBarplotBothStrands(
  nParamByContigForward = contig_mean_coverage$f_strand$mean_coverage,
  nParamByContigReverse = contig_mean_coverage$r_strand$mean_coverage,
  cContigNames = contig_mean_coverage$f_strand$refName,
  cGraphName = "Mean Coverage per contig"
)

*Mean coverage of sequenced bases per contig and per strand.* Contigs are ordered from largest to smallest contig. Upper part: forward strand; lower part: reverse strand. Here, no bias can be observed: the mean coverage is between 120 and 100 for all contigs/strands.

To represent the distribution of a numeric parameter, the DrawDistriHistBox function allows you to plot an histogram along a boxplot showing the global range of this parameter.

# coverage distribution of all bases sequenced
DrawDistriHistBox(PacBioCSV_gpos$coverage,
  cGraphName = "Coverage distribution of all bases sequenced",
  cParamName = "Coverage",
  lTrimOutliers = FALSE
)

*Coverage distribution of all sequenced bases.* Grey bars: density of coverage values. Red line: smooth representation of the density of coverage values. Boxplot: represents the coverage distribution while highlighting the limit with the outliers values. Here, the coverage values are concentrated around 115 but there are some outliers below 70.

You can use the FiltPacBio function to filter the PacBioCSV GPos and PacBioGFF GRanges objects.
This function will return a list with the new PacBioCSV GPos as the first element of the list, and the new PacBioGFF GRanges as the second element of the list.
Some options of this function require PacBioCSV GPos object or the sequence of the genome (dnastringsetGenome object). (see FiltPacBio documentation for more information)

# First filter: contigs
PacBio_filtered_data <- FiltPacBio(
  gposPacBioCSV = PacBioCSV_gpos,
  grangesPacBioGFF = PacBioGFF_granges,
  cContigToBeRemoved = NULL,
  dnastringsetGenome = ptetraurelia_genome,
  nContigMinSize = 1000,
  listPctSeqByContig = contig_percentage_sequencing,
  nContigMinPctOfSeq = 1,
  listMeanCovByContig = contig_mean_coverage,
  nContigMinCoverage = 20
)
PacBioCSV_gpos_filt1 <- PacBio_filtered_data$csv
PacBioGFF_granges_filt1 <- PacBio_filtered_data$gff

2nd step: analyze the global distribution and motif of DNA modification data

This part will describe some tools which can be used before and/or after filtering based on False Discovery (FDR) estimations.

You can use the GetModReport function to obtain a data.frame giving global informations about modification distribution (Modification counts, ratio, motifs associated, mean of parameters provided...).

# report basic Mod for remaining scaffolds
report_modifications <- GetModReportPacBio(
  grangesGenome = ptetraurelia_genome_range,
  grangesPacBioGFF = PacBioGFF_granges_filt1,
  gposPacBioCSV = PacBioCSV_gpos_filt1,
  cOrgAssemblyName = "ptetraurelia_mac_51",
  dnastringsetGenome = ptetraurelia_genome,
  cBaseLetterForMod = "A",
  cModNameInOutput = "6mA"
)

You can use the GetModRatioByContig function with the PacBioGFF GRanges object to retrieve, in a list, the modification ratio (ModRatio) by scaffold.
Again, you can draw a barplot by strand using DrawBarplotBothStrands function with the content of this list.

# Modif ratio by scaffold and by strand
contig_modification_ratio <- GetModRatioByContig(PacBioGFF_granges_filt1,
  PacBioCSV_gpos_filt1[PacBioCSV_gpos_filt1$base == "A"],
  dnastringsetGenome = ptetraurelia_genome,
  cBaseLetterForMod = "A"
)
DrawBarplotBothStrands(
  nParamByContigForward = contig_modification_ratio$f_strand$Mod_ratio,
  nParamByContigReverse = contig_modification_ratio$r_strand$Mod_ratio,
  cContigNames = contig_modification_ratio$f_strand$refName,
  cGraphName = "Modif/Base ratio per contig (Sequenced sites only)"
)

*Modification ratio (Modified bases / Sequenced bases) per contig and per strand.* Contigs are ordered from largest to smallest contig. Upper part: forward strand; lower part: reverse strand. Modification ratio = Number of Modified bases / Number of Sequenced bases. Here, the modification ratio is similar between strands but not between contigs: this must be considered as it could result in a potential bias in the analysis of some annotated features.

You can reveal enrichment (or enrichment + depletion) around modified bases by plotting a logo with the DrawModLogo function based on the Logolas package.

Here, you need to provide the sequence around each modified base. You can get those sequences by using the PacBioGFF GRanges object with the GetGRangesWindowSeqandParam function and retrieving the sequence column.
You can decide the length of the sequence to be compared around the modifications using the nUpstreamBpToAdd and nDownstreamBpToAdd options.

# Sequence logo associated to Modif detected
PacBioGFF_granges_with_sequence <- GetGRangesWindowSeqandParam(PacBioGFF_granges_filt1,
  ptetraurelia_genome_range,
  dnastringsetGenome = ptetraurelia_genome,
  nUpstreamBpToAdd = 5,
  nDownstreamBpToAdd = 5
)
DrawModLogo(
  dnastringsetSeqAroundMod = as(PacBioGFF_granges_with_sequence$sequence, "DNAStringSet"),
  nGenomicBgACGT = c(0.35, 0.15, 0.15, 0.35), cYunit = "ic_hide_bg",
  nPositionsToAnnotate = c(6), cAnnotationText = c("6mA"), nTagTextFontSize = 12
)

*Modification sequence logo with enrichment score.* Enrichment score is calculated for each position: negative values reflects base depletion while positive values means enrichment. Here, the 6mA are aligned on the 6th base position: 6mA is associated with AT or AG motifs; a depletion of T at position 5 and a depletion of A at position 8 can be observed.

The FiltPacBio function can be used to filter PacBioGFF GRanges only depending on the other arguments provided. (see FiltPacBio documentation for more information)

# Second filter: fraction
PacBioGFF_granges_filt2 <- FiltPacBio(
  grangesPacBioGFF = PacBioGFF_granges_filt1,
  cParamNameForFilter = "frac",
  nFiltParamLoBoundaries = 0.05, nFiltParamUpBoundaries = 1,
  cFiltParamBoundariesToInclude = "upperOnly"
)$gff

3rd step: filter the DNA modification data using False Discovery Rate estimations

DNAModAnnot provides tools to estimate False Discovery Rate (FDR) by threshold on parameters associated to modification detection.
These FDR estimations can help deciding filters to be used for filtering DNA modifications to be kept.

First, you can use the ExtractQuantifByModMotif function to retrieve a list containing the following elements:

the names of the motifs over-represented with the modification
the same motifs + the position of the modification in these motifs
a table containing the percentage of modiﬁcations in each motif tested.
a PacBioGFF GRangesList with one GRanges object by motif over-represented with the modification.
The minimum proportion to define motifs as "over-represented" with the modification can be modified using the nModMotifMinProp option.

The PacBioGFF GRangesList object will be used to keep only data from motifs over-represented.

# Extract Mod Data by motif over-represented (at least % of motifs)
motif_pct_and_PacBioGFF_grangeslist <- ExtractListModPosByModMotif(
  grangesModPos = PacBioGFF_granges_filt2,
  grangesGenome = ptetraurelia_genome_range,
  dnastringsetGenome = ptetraurelia_genome,
  nUpstreamBpToAdd = 0, nDownstreamBpToAdd = 1,
  nModMotifMinProp = 0.05,
  cBaseLetterForMod = "A",
  cModNameInOutput = "6mA"
)

Before estimating FDR, you must convert the PacBioCSV GPos object into a GRanges object. This will allows you to extract the sequence with the PacBioCSV data (using the GetGRangesWindowSeqandParam function) which is required for later steps.

Then, GetFdrEstListByThresh function can be used to retrieve a list (by motif over-represented) of data.frames. Each data.frame contains the FDR estimated by threshold (for the provided parameter) (and the adjusted FDR ~= cumulative minimum FDR).

This FDR estimation list can be used by the GetFdrBasedThreshLimit function to retrieve, in a list, the threshold associated to a chosen FDR.

Note that this chosen FDR is defined by nFdrPropForFilt option. The threshold will be defined by the closest value below this level of FDR for each motif (except for some motifs if lUseBestThrIfNoFdrThr=TRUE is used).

This FDR estimation list can also be used by the DrawFdrEstList function to represent for each motif the distribution of FDR estimation by threshold.

# Motif FDR estimation for each motif over-represented (at least % of motifs)
BaseCSV_granges_filt1 <- as(PacBioCSV_gpos_filt1[PacBioCSV_gpos_filt1$base == "A"], "GRanges")
BaseCSV_granges_with_sequence <- GetGRangesWindowSeqandParam(
  grangesData = BaseCSV_granges_filt1,
  grangesGenome = ptetraurelia_genome_range,
  dnastringsetGenome = ptetraurelia_genome,
  nUpstreamBpToAdd = 0,
  nDownstreamBpToAdd = 1
)
score_fdr_by_motif_list <- GetFdrEstListByThresh(
  grangesDataWithSeq = BaseCSV_granges_with_sequence,
  grangesDataWithSeqControl = NULL,
  cNameParamToTest = "score",
  nRoundDigits = 1,
  cModMotifsAsForeground = motif_pct_and_PacBioGFF_grangeslist$motifs_to_analyse
)
score_fdr_by_motif_limit <- GetFdrBasedThreshLimit(score_fdr_by_motif_list,
  nFdrPropForFilt = 0.05,
  lUseBestThrIfNoFdrThr = TRUE
)
DrawFdrEstList(
  listFdrEstByThr = score_fdr_by_motif_list,
  cNameParamToTest = "score",
  nFdrPropForFilt = 0.05
)

*False Discovery Rate (FDR) estimation (adjusted) by score threshold and by motif over-represented with the modification.* The red lines represent the 5% FDR estimation level (or the closest FDR estimation level below 5%) along its associated threshold on score. Here, for a 5% FDR estimation filter on score, only AT motifs reach a FDR estimation below 5% (using a score threshold of 49).

Finally, the output of the GetFdrBasedThreshLimit function can be used for filtering by motif using the FiltPacBio function.

Note that grangesModPos must be a GRangesList object in this case.

PacBioGFF_grangeslist_filt <- FiltPacBio(
  grangesPacBioGFF = motif_pct_and_PacBioGFF_grangeslist$GRangesbyMotif,
  listFdrEstByThrIpdRatio = NULL,
  listFdrEstByThrScore = score_fdr_by_motif_limit
)$gff

4th step: analyze the DNA modification patterns within genomic annotations

This part will describe the tools that can be used after filtering based on False Discovery (FDR) estimations and can be launched for each motif independently.

You can use the following commands to select data for one motif:

# Select one motif for the analysis by motif
motifs_base <- motif_pct_and_PacBioGFF_grangeslist$motifs_to_analyse[motif_pct_and_PacBioGFF_grangeslist$motifs_to_analyse == "AT"]
motifs_modifications <- motif_pct_and_PacBioGFF_grangeslist$mod_motif[motifs_base == motif_pct_and_PacBioGFF_grangeslist$motifs_to_analyse]
PacBioGFF_granges_filtAT <- PacBioGFF_grangeslist_filt[[motifs_base]]
BaseCSV_granges_filtAT <- BaseCSV_granges_with_sequence[BaseCSV_granges_with_sequence$sequence == motifs_base, ]

To analyse the distribution of modifications in the genome, you must provide an annotation file as a GRanges object.
The package rtracklayer contains various functions for file importation as GRanges objects depending on the format of your annotation file.
For some functions, the feature "intergenic" will be required for comparison between genes and intergenic regions. Here, you can use the PredictMissingAnnotation function to complete the annotation with "intergenic" features (and/or introns/exons if your annotation provides exons or introns positions).

# Loading GFF annotation file (Minimum annotation required: gene (seqnames, start, end, strand))
annotations_path <- system.file(package = "DNAModAnnot", "extdata", "ptetraurelia_mac_51_annotation_v2.0_sca171819.gff3")
annotations_range <- rtracklayer::readGFFAsGRanges(annotations_path)
annotations_range <- PredictMissingAnnotation(
  grangesAnnotations = annotations_range,
  grangesGenome = ptetraurelia_genome_range,
  cFeaturesColName = "type",
  cGeneCategories = c("gene"),
  lAddIntronRangesUsingExon = TRUE
)

Counts by feature

The GetModBaseCountsByFeature function returns the counts of the DNA modification (and its unmodified base) for each feature provided.
The DrawModBasePropByFeature function can then calculate (and draw as a barplot) the proportion of modified (or unmodified) base between features to be compared.

# Mod annotation by feature
annotations_range_ModBase_counts <- GetModBaseCountsByFeature(
  grangesAnnotations = annotations_range,
  grangesModPos = PacBioGFF_granges_filtAT,
  gposModTargetBasePos = BaseCSV_granges_filtAT,
  lIgnoreStrand = FALSE
)
DrawModBasePropByFeature(
  grangesAnnotationsWithCounts = annotations_range_ModBase_counts,
  cFeaturesToCompare = c("gene", "intergenic"),
  lUseCountsPerkbp = TRUE,
  cBaseMotif = motifs_base,
  cModMotif = motifs_modifications
)

*6mAT and any AT cumulated proportion (from counts per kbp) between genes and intergenic regions.* Here, 6mAT is enriched in genes and its proportion is not explained by the proportion of AT sites (which is more present in intergenic regions).

Quantitative parameter by Feature Modification counts categories

You can also compare a quantitative parameter versus Modifications counts in features.

Here you can load an example with RNA-seq data for the annotation provided:

# Loading additional Data: RNA-seq dataset
expression_file_path <- system.file(package = "DNAModAnnot", "extdata", "ptetraurelia.gene_expression.sca171819.tsv")
expression_dataframe <- read.table(
  file = expression_file_path,
  header = TRUE, sep = "\t", dec = ",",
  quote = "\"", fill = FALSE
)
expression_dataframe <- expression_dataframe[, c("ID", "T30")]

You need to load the quantitative parameter to be compared with modification counts as a new column within the annotation (+counts) GRanges object.

# merge RNA-seq data with Annotation GRanges object
genes_range_ModBase_counts_param <- annotations_range_ModBase_counts[annotations_range_ModBase_counts$type == "gene"]
GenomicRanges::mcols(genes_range_ModBase_counts_param) <- merge(
  x = GenomicRanges::mcols(genes_range_ModBase_counts_param),
  by.x = "Name",
  y = expression_dataframe,
  by.y = "ID"
)

Then, you can use the DrawParamPerModBaseCategories function. This will plot the distribution of the quantitative parameter provided by category of modified (and unmodified) base counts.

# Comparison of quantitative parameter with Mod annotation
DrawParamPerModBaseCategories(
  grangesAnnotationsWithCounts = genes_range_ModBase_counts_param,
  cParamColname = "T30",
  cParamFullName = "Gene expression at T30",
  cParamYLabel = "Normalized RNA-seq read counts (T30)",
  cSelectFeature = "gene",
  lUseCountsPerkbp = TRUE,
  cBaseMotif = motifs_base,
  cModMotif = motifs_modifications,
  lBoxPropToCount = FALSE
)

*Gene expression at T30 distribution per categories of 6mAT (or AT) counts per kbp.* Gene expression at T30 is estimated using the normalized RNA-seq read counts from Arnaiz et al. 2017. There seems to be a slight increase of gene expression in genes enriched in 6mA (which is not linked to an increase in AT proportion). Here, increasing the number of genes could help having a clearer result.

Counts within feature

The GetModBaseCountsWithinFeature function returns the counts of modified base (and its unmodified target base) within each feature provided.
For this, each feature provided is cut into a specific number of windows (defined by the nWindowsNb option) and counts are returned for each window of each feature.

The DrawModBaseCountsWithinFeature function can then represent the distribution within features through a barplot.

# Mod annotation within feature
gene_annotation_range <- annotations_range[annotations_range$type == "gene", ]
gene_annotation_range <- GetModBaseCountsWithinFeature(
  grangesAnnotations = gene_annotation_range,
  grangesModPos = PacBioGFF_granges_filtAT,
  gposModTargetBasePos = BaseCSV_granges_filtAT,
  lIgnoreStrand = FALSE,
  nWindowsNb = 20
)
DrawModBaseCountsWithinFeature(
  grangesAnnotationsWithCountsByWindow = gene_annotation_range,
  cFeatureName = "gene",
  cBaseMotif = motifs_base,
  cModMotif = motifs_modifications
)

*Cumulated 6mAT (or AT) proportion within gene.* Genes are cut into windows of relative size and cumulated 6mAT or AT proportion is computed for each window. Here, the cumulated AT proportion remain similar along genes (except a small enrichment at 3' extremity) while 6mAT is enriched at the beginning of genes.

Distance from feature

The GetDistFromFeaturePos function returns, for each feature provided, its distance, in bp, from a modification (using a window of specific size around each feature).
If the feature provided is larger than 1bp, each feature boundaries will be used instead for computing the distance.

If the lGetGRangesInsteadOfListCounts option is set to FALSE, the GetDistFromFeaturePos function will return instead a list of dataframes giving the counts (or proportion) of modified (or unmodified) bases by distance toward the feature.
This list can be represented on a plot using DrawModBasePropDistFromFeature function.

You can also add 1 or 2 additional axis on the plot to compare modification signal with another parameters by using the AddToModBasePropDistFromFeaturePlot function.
Here is an example using MNase-seq data to represent nucleosomes:

you can load a Bam file using the BamFile function from RSamtools package and the readGAlignments function from GenomicAlignments package.
Then, you can retrieve the central position of each read with the GetGposCenterFromGRanges function.
Finally, you can use again the GetDistFromFeaturePos function before using the AddToModBasePropDistFromFeaturePlot function.

# ModBase distance from feature/feature limit
Mod_distance_feature_countslist <- GetDistFromFeaturePos(
  grangesAnnotations = annotations_range,
  cSelectFeature = "gene",
  grangesData = PacBioGFF_granges_filtAT,
  lGetGRangesInsteadOfListCounts = FALSE,
  lGetPropInsteadOfCounts = TRUE,
  cWhichStrandVsFeaturePos = "both", nWindowSizeAroundFeaturePos = 600,
  lAddCorrectedDistFrom5pTo3p = TRUE,
  cFeaturePosNames = c("TSS", "TTS")
)
Base_distance_feature_countslist <- GetDistFromFeaturePos(
  grangesAnnotations = annotations_range,
  cSelectFeature = "gene",
  grangesData = BaseCSV_granges_filtAT,
  lGetGRangesInsteadOfListCounts = FALSE,
  lGetPropInsteadOfCounts = TRUE,
  cWhichStrandVsFeaturePos = "both", nWindowSizeAroundFeaturePos = 600,
  lAddCorrectedDistFrom5pTo3p = TRUE,
  cFeaturePosNames = c("TSS", "TTS")
)
DrawModBasePropDistFromFeature(
  listModCountsDistDataframe = Mod_distance_feature_countslist,
  listBaseCountsDistDataframe = Base_distance_feature_countslist,
  cFeaturePosNames = c("TSS", "TTS"),
  cBaseMotif = motifs_base,
  cModMotif = motifs_modifications
)

# ModBase and Reads center (from Bam file) distance from feature/feature limit
bamfile_path <- system.file(package = "DNAModAnnot", "extdata", "PTET_MonoNuc_3-2new.pe.sca171819.sorted.bam")
bamfile_object <- Rsamtools::BamFile(file = bamfile_path)
bamfile_ranges <- as(GenomicAlignments::readGAlignments(bamfile_object), "GRanges")
bamfile_ranges <- GetGposCenterFromGRanges(grangesData = bamfile_ranges)
bamfile_distance_feature_countslist <- GetDistFromFeaturePos(
  grangesAnnotations = annotations_range,
  cSelectFeature = "gene",
  grangesData = bamfile_ranges,
  lGetGRangesInsteadOfListCounts = FALSE,
  lGetPropInsteadOfCounts = FALSE,
  cWhichStrandVsFeaturePos = "both", nWindowSizeAroundFeaturePos = 600,
  lAddCorrectedDistFrom5pTo3p = TRUE,
  cFeaturePosNames = c("TSS", "TTS")
)
AddToModBasePropDistFromFeaturePlot(
  dPosCountsDistFeatureStart = bamfile_distance_feature_countslist[[1]],
  dPosCountsDistFeatureEnd = bamfile_distance_feature_countslist[[2]],
  cSubtitleContent = "Along with nucleosome center distance (MonoNuc_3-2newreplicate)",
  cParamYLabel = "Nucleosome center count (MonoNuc_3-2newreplicate)",
  cParamColor = "cyan3",
  lAddAxisOnLeftSide = TRUE
)

*Distribution of 6mAT sites, AT sites and Nucleosome centers distance from Transcription Start Sites (TSS; left panel) or Transcription Termination Sites (TTS; right panel).* Proportion is computed for each base position around TSS and TTS. Grey dotted vertical line: position of the TSS or TTS. Red: 6mAT proportion; grey: AT proportion; blue: nucleosome center counts (represents the positions/accumulation of nucleosomes). No peaks of AT sites can be observed: those are evenly distributed around TSS and TTS. Here, a few peaks of 6mAT signal appears between peaks of the Nucleosome periodic signal downstream TSS: 6mAT seems slightly more enriched between well-positioned nucleosomes downstream TSS.

Local visualisation

You can see locally the distribution of modifications (for any modification or for some motifs) by using the Gviz package.
After preparing a set of tracks, you can display them using the plotTracks function from Gviz package. (see Gviz package for more information)

DNAModAnnot provides several functions which can be used alongside the Gviz package for local visualisation.

The ExportFilesForGViz function allows the user to export files which can be used for streaming (except for the gff3 format) with the plotTracks function from Gviz package. Here, this makes possible to use streaming also for genomic annotations using the bam format (and using the ImportBamExtendedAnnotationTrack function when making tracks). (see ExportFilesForGViz function documentation for more information)

ipdRatio6mABigwig <- system.file(package = "DNAModAnnot", "extdata", "ipdRatio6mATsites.bw")
ptet51GenesBam <- system.file(package = "DNAModAnnot", "extdata", "ptet51Genes.bam")

# #Gviz file making for vignette (not run)
ExportFilesForGViz(
  cFileNames = c(
    ipdRatio6mABigwig,
    ptet51GenesBam
  ),
  listObjects = list(
    PacBioGFF_granges_filtAT,
    annotations_range[annotations_range$type == "gene"]
  ),
  cFileFormats = c("bw", "bam"),
  lBigwigParametersByStrand = c(TRUE, NA),
  cBigwigParameters = c("ipdRatio", NA),
  cBamXaParameters = c(NA, "Name"),
  dnastringsetGenome = ptetraurelia_genome
)

To plot an Ideogram, you can use the IdeogramTrack function from Gviz package. But if you do not have any information on cytobands, you can use the AdaptedIdeogramTrackWithoutBandsData function to plot an empty representation of chromosome.

If the chromosome names that you are using are present in the UCSC database, you can use the command options(ucscChromosomeNames=TRUE) before using the Gviz package. Otherwise, you must use options(ucscChromosomeNames=FALSE) command instead.

cContigToViz <- unique(GenomicRanges::seqnames(ptetraurelia_genome_range))

# Generating Ideogram TRACK--------
options(ucscChromosomeNames = FALSE)
trackIdeogram <- AdaptedIdeogramTrackWithoutBandsData(
  grangesGenome = ptetraurelia_genome_range,
  cContigToViz = cContigToViz,
  cOrgAssemblyName = "ptetraurelia_mac_51"
)

When making tracks using files generated with the ExportFilesForGViz function, you must add stream = TRUE to allow streaming when displaying tracks.

# Generating classic Gviz tracks

# GenomeAxis TRACK------
trackGenomeAxis <- Gviz::GenomeAxisTrack(cex = 1)

# Sequence TRACK--------
trackSequence <- Gviz::SequenceTrack(ptetraurelia_genome_fa,
  chromosome = cContigToViz,
  add53 = TRUE,
  complement = FALSE, cex = 0.8, stream = TRUE
)

# DATATRACK--------
trackData6mATipdRatio <- Gviz::DataTrack(ipdRatio6mABigwig,
  stream = TRUE, name = "6mAT\nipdRatio", type = "histogram",
  col.histogram = c("red"), fill = "red",
  background.title = "darkred", col = NULL
)

trackDataNuclCoverage <- Gviz::DataTrack(bamfile_path,
  stream = TRUE, name = "Nucleosome\ncoverage\n(MNase-seq)",
  type = "histogram",
  col.histogram = c("blue"), fill = "blue",
  background.title = "darkblue", col = NULL
)

To display genomic annotations (in adapted bam file; see ExportFilesForGViz function documentation) using streaming, you must use the ImportBamExtendedAnnotationTrack function as the import function (via stream and importFunction options while making the annotation track).

# Generating Annotation TRACK with streaming--------
trackAnnotation <- Gviz::AnnotationTrack(ptet51GenesBam,
  name = "Gene", stacking = "squish",
  stream = TRUE, importFunction = ImportBamExtendedAnnotationTrack,
  group = "tag", groupAnnotation = "tag",
  just.group = "below",
  fontcolor.group = "black", fontsize.group = 18,
  fill = "lightblue", shape = "fixedArrow",
  arrowHeadWidth = 50, lwd = 3,
  background.title = "darkblue"
)

In order to allow the names of the genomic features to be displayed, the "mapping" sub-list of the generated annotation track must be completed with the new "id" and "group" values.

trackAnnotation@mapping <- list(id = "tag", group = "tag")

Once the tracks are all generated, you can use the plotTracks function from Gviz package to display them on a chosen location (defined by the chromosome,from and to options).

# Plotting Tracks--------
Gviz::plotTracks(
  trackList = list(
    trackIdeogram, trackGenomeAxis,
    trackData6mATipdRatio, trackSequence,
    trackDataNuclCoverage, trackAnnotation
  ),
  chromosome = "scaffold51_17",
  from = 361000, to = 365000
)

*Local visualisation of 6mAT sites distribution along their ipdRatio and MNase-seq coverage.* Here, clusters of 6mAT signal are more frequently enriched between peaks of MNase-seq coverage downstream Transcription Start Sites.

sessionInfo()

AlexisHardy/DNAModAnnot documentation built on Feb. 27, 2023, 12:03 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

AlexisHardy/DNAModAnnot
Toolbox for DNA Modifications filtering and annotation

In AlexisHardy/DNAModAnnot: Toolbox for DNA Modifications filtering and annotation

Introduction

Installation

Required data

1st step: load and filter the genome and DNA modification data

2nd step: analyze the global distribution and motif of DNA modification data

3rd step: filter the DNA modification data using False Discovery Rate estimations

4th step: analyze the DNA modification patterns within genomic annotations

Counts by feature

Quantitative parameter by Feature Modification counts categories

Counts within feature

Distance from feature

Local visualisation

R Package Documentation

Browse R Packages

We want your feedback!

AlexisHardy/DNAModAnnot Toolbox for DNA Modifications filtering and annotation

In AlexisHardy/DNAModAnnot: Toolbox for DNA Modifications filtering and annotation

Introduction

Installation

Required data

1st step: load and filter the genome and DNA modification data

2nd step: analyze the global distribution and motif of DNA modification data

3rd step: filter the DNA modification data using False Discovery Rate estimations

4th step: analyze the DNA modification patterns within genomic annotations

Counts by feature

Quantitative parameter by Feature Modification counts categories

Counts within feature

Distance from feature

Local visualisation

R Package Documentation

Browse R Packages

We want your feedback!

AlexisHardy/DNAModAnnot
Toolbox for DNA Modifications filtering and annotation