dagLogo Vignette

suppressPackageStartupMessages({
    library(dagLogo)
    library(biomaRt)
    library(UniProt.ws)
    library(motifStack)
    library(Biostrings)
    library(grDevices)
    })

Introduction

A sequence logo has been widely used as a graphical representation of nucleic acid sequence motifs. There is a R package seqlogo[@Oliver2006] for drawing sequence logos for a single DNA motif. There is another R package motifStack[@ou2018motifstack] for depicting individual sequence motif as well as multiple motifs for amino acid (AA), DNA and RNA sequences.

IceLogo[@Colaert2009] is a tool developed in Java for visualizing significantly conserved sequence patterns in a list of aligned AA sequences against a set of background sequences. Compared to webLogo[@Crooks2004], which relies on information theory, IceLogo builds on probability theory. It is reported that IceLogo has a more dynamic nature and is more appropriate for analysis of conserved sequence patterns.

However, IceLogo can only compare conserved sequences to reference sequences at the individual amino acid level. As we know, some conserved sequence patterns are not conserved at the individual amino acid level, but conserved at the level of amino acid group characteristic of their physical and chemical properties, such as charge and hydrophobicity.

Here we developed a R/Bioconductor package dagLogo, for visualizing significantly conserved sequence patterns relative to a proper background set of sequences, with or without grouping amino acid residuals based on their physical and chemical properties. Figure 1 shows the flowchart of performing analysis using dagLogo. Comparing to existing tools, dagLogo allows aligned or not aligned subsequences of different length as input; Provides more options and functions to generate various background sets that can be tailored to fit the experimental design; Both significantly over- and under-represented amino acid residues can be plotted; AA residues can be grouped and statistical significance test can be performed at the group level.

Figure 1. Flowchart of performing analysis using _dagLogo_. Two ways to prepare an object of Proteome are colored in greenish and yellowish, while two alternative ways to build an object of dagPeptides are colored in blue and red..

Figure 1. Flowchart of performing analysis using dagLogo. Two ways to prepare an object of Proteome are colored in greenish and yellowish, while two alternative ways to build an object of dagPeptides are colored in blue and red.

Step-by-step guide on using dagLogo

First load the library dagLogo

library(dagLogo)

Step 1: Fetching peptide sequences from BioMart

Case 1: Fetch sequences using the fetchSequence function in biomaRt package given a list of gene identifiers and the corresponding positions of the anchoring AA.

##just in case biomaRt server does not response
if (interactive())
{
    try({
            mart <- useMart("ensembl")
            fly_mart <-
            useDataset(mart = mart, dataset = "dmelanogaster_gene_ensembl")
            dat <- read.csv(system.file("extdata", "dagLogoTestData.csv",
                                        package = "dagLogo"))
            seq <- fetchSequence(IDs = as.character(dat$entrez_geneid),
                                 anchorPos = as.character(dat$NCBI_site),
                                 mart = fly_mart,
                                 upstreamOffset = 7,
                                 downstreamOffset = 7)
            head(seq@peptides)
   })
}

Case 2: Fetch sequences using the fetchSequence function in biomaRt package given a list of gene identifiers and the corresponding peptide subsequences of interest with the anchoring AA marked such as asterisks or lower case of one or more AA letters.

if (interactive())
{
    try({
            mart <- useMart("ensembl")
            fly_mart <-
            useDataset(mart = mart, dataset = "dmelanogaster_gene_ensembl")
            dat <- read.csv(system.file("extdata", "dagLogoTestData.csv",
                                        package = "dagLogo"))
            seq <- fetchSequence(IDs = as.character(dat$entrez_geneid),
                                 anchorAA = "*",
                                 anchorPos = as.character(dat$peptide),
                                 mart = fly_mart,
                                 upstreamOffset = 7,
                                 downstreamOffset = 7)
            head(seq@peptides)
    })
}

In the following example, the anchoring AA is marked as lower case "s" for amino acid serine.

if(interactive()){
    try({
            dat <- read.csv(system.file("extdata", "peptides4dagLogo.csv",
                                package = "dagLogo"))
            ## cleanup the data
            dat <- unique(cleanPeptides(dat, anchors = "s"))

            mart <- useMart("ensembl")
            human_mart <-
            useDataset(mart = mart, dataset = "hsapiens_gene_ensembl")
            seq <- fetchSequence(IDs = toupper(as.character(dat$symbol)),
                                type = "hgnc_symbol",
                                anchorAA = "s",
                                anchorPos = as.character(dat$peptides),
                                mart = human_mart,
                                upstreamOffset = 7,
                                downstreamOffset = 7)
            head(seq@peptides)
    })
}

The function cleanPeptides can be used to select a subset of data to analyze when input data contains multiple anchoring AAs, represented as lower case of AAs.

if(interactive()){
    dat <- read.csv(system.file("extdata", "peptides4dagLogo.csv",
                package="dagLogo"))
    dat <- unique(cleanPeptides(dat, anchors = c("s", "t")))
    mart <- useMart("ensembl", "hsapiens_gene_ensembl")
    seq <- fetchSequence(toupper(as.character(dat$symbol)), 
                         type="hgnc_symbol",
                         anchorAA=as.character(dat$anchor),
                         anchorPos=as.character(dat$peptides), 
                         mart=mart, 
                         upstreamOffset=7, 
                         downstreamOffset=7)
    head(seq@peptides)
}

Similarly, peptide sequences can be fetched from an object of Proteome.

Case 3: Prepare an object of dagPeptides using prepareProteome and formatSequence functions sequentially given a list of unaligned/aligned ungapped peptide sequences.

dat <- unlist(read.delim(system.file("extdata", "grB.txt", package = "dagLogo"),
                        header = F, as.is = TRUE))
##prepare proteome from a fasta file, 
##the fastq file is subset of human proteome for this vignette. 
proteome <- prepareProteome(fasta = system.file("extdata",
                                                "HUMAN.fasta",
                                                package = "dagLogo"), 
                            species = "Homo sapiens")
##prepare an object of dagPeptides
seq <- formatSequence(seq = dat, proteome = proteome, upstreamOffset = 14,
                      downstreamOffset = 15)

Step 2: Building background models

Once you have an object of dagPeptides in hand, you can start to build a background model for statistical significance test. The background could be a set of random subsequences of a whole proteome or your inputs. To build a background model from a whole proteome, an object of Proteome is required. Sequences provided by a fasta file or downloaded from the UniProt database can be used to prepare a Proteome object. Case 3 in step 1 shows how to prepare a Proteome object from a fasta file. The following code snippet shows how to prepare an object of Proteome using the UniProt.ws package.

if(interactive()){
    proteome <- prepareProteome("UniProt", species = "Homo sapiens")
}

The prepared Proteome object can be used as a background model for the following statistical significance test using Fisher’s exact test or Z-test.

bg_fisher <- buildBackgroundModel(seq, background = "wholeProteome", 
                                  proteome = proteome, testType = "fisher")
bg_ztest <- buildBackgroundModel(seq, background = "wholeProteome", 
                                 proteome = proteome, testType = "ztest")

Step 3: Statistical significance test for differential usage of amino acids with or without grouping

Statistical significance test can be performed at the AA level without making any change to the formatted and aligned amino acids. Alternatively, statistical significance test can be performed at the AA group level, where amino acids are grouped based on their physical or chemical properties. To group the AAs, the formatted and aligned AA symbols are replaced by a new set of symbols representing their corresponding groups. For example, if AA charge is of your interest, then group symbols "P", "N" and "U" are used to replace amino acids with positive charge, negative charge and no charge respectively. A few pre-built grouping schemes have been made available for you to specify as follows.

## no grouping
t0 <- testDAU(seq, dagBackground = bg_ztest)

## grouping based on chemical properties of AAs.
t1 <- testDAU(dagPeptides = seq, dagBackground = bg_ztest,
              groupingScheme = "chemistry_property_Mahler_group")

## grouping based on the charge of AAs.
t2 <- testDAU(dagPeptides = seq, dagBackground = bg_ztest, 
              groupingScheme = "charge_group")

## grouping based on the consensus similarity.
t3 <- testDAU(dagPeptides = seq, dagBackground = bg_ztest, 
              groupingScheme = "consensus_similarity_SF_group")

## grouping based on the hydrophobicity. 
t4 <- testDAU(dagPeptides = seq, dagBackground = bg_ztest, 
              groupingScheme = "hydrophobicity_KD")

## In addition, dagLogo allows users to use their own grouping 
## Scheme. The following example shows how to supply a customized 
## scheme. Please note that the user-supplied grouping is named 
## as "custom_group" internally.
## Add a grouping scheme based on the level 3 of BLOSUM50
color = c(LVIMC = "#33FF00", AGSTP = "#CCFF00",
          FYW = '#00FF66', EDNQKRH = "#FF0066")
symbol = c(LVIMC = "L", AGSTP = "A", FYW = "F", EDNQKRH = "E")
group = list(
    LVIMC = c("L", "V", "I", "M", "C"), 
    AGSTP = c("A", "G", "S", "T", "P"),
    FYW = c("F", "Y", "W"),
    EDNQKRH = c("E", "D", "N", "Q", "K", "R", "H"))
addScheme(color = color, symbol = symbol, group = group) 
t5 <- testDAU(dagPeptides = seq, dagBackground = bg_ztest, 
              groupingScheme = "custom_group")

Step 4: Visualize significance test results

We can use a heatmap or logo to display the statistical significance test results.

##Plot a heatmap to show the results
dagHeatmap(t0) 
## dagLogo showing differentially used AAs without grouping
dagLogo(t0) 
## dagLogo showing AA grouped based on properties of individual a amino acid.
dagLogo(t1, groupingSymbol = getGroupingSymbol(t1@group), legend = TRUE)
## grouped on the basis of charge.
dagLogo(t2, groupingSymbol = getGroupingSymbol(t2@group), legend = TRUE)
## grouped on the basis of consensus similarity.
dagLogo(t3, groupingSymbol = getGroupingSymbol(t3@group), legend = TRUE)
## grouped on the basis of hydrophobicity.
dagLogo(t4, groupingSymbol = getGroupingSymbol(t4@group), legend = TRUE)

Session Info

sessionInfo()


Try the dagLogo package in your browser

Any scripts or data that you put into this service are public.

dagLogo documentation built on Dec. 5, 2020, 2 a.m.