carepat: Cis-acting regulatory element predictions for Arabidopsis,...

View source: R/Wimtrap.R

carepat    R Documentation

Cis-acting regulatory element predictions for Arabidopsis, tomato, rice and maize

Description

Predicts the location of transcription factor binding sites (i.e. cis-acting regulatory elements) in various conditions for Arabidopsis thaliana, Solanum lycopersicum, Oryza sativa and Zea mays. The function integrates 7 pre-built general models, each built from a more or less extensive set of genomic data and trained on a different organism/condition. Almost all of these models integrate the degree of chromatin opening (DHS: DNase I hypersensitive sites) and the results of digital genomic footprinting (DGF: digital genomic footprints) in the conditions that can be studied using carepat. These are genomic data with high predictive potential (see Details).

Usage

carepat(
  organism = c("Arabidopsis thaliana", "Solanum lycopersicum", "Oryza sativa",
    "Zea mays"),
  condition = c("seedlings", "flowers", "roots", "roots_non_hairs", "seed_coats",
    "seedlings_dark7d", "seedlings_dark7dLD24h", "seedlings_dark7dlight3h",
    "seedlings_dark7dlight30min", "seedlings_heatshock", "ripening_fruits",
    "immature_fruits"),
  TFnames = NULL,
  pfm = NULL,
  show_annotations = FALSE,
  score_threshold = 0.5
)

Arguments

organism

"Arabidopsis thaliana", "Solanum lycopersicum", "Oryza sativa" or "Zea mays"

condition

Character indicating the studied condition. For Arabidopsis thaliana: "seedlings", "flowers", "roots", "roots_non_hairs", "seed_coats", "seedlings_dark7d", "seedlings_dark7dLD24h", "seedlings_dark7dlight3h", "seedlings_dark7dlight30min" or "seedlings_heatshock". For Solanum lycopersicum: "ripening_fruits" or "immature_fruits". For Oryza sativa: "seedlings" or "roots". For Zea mays: "seedlings".
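The organism/condition pairing above can be checked before calling carepat. The following is a minimal sketch: the condition names are spelled as in the Usage signature, while `valid_conditions` and `check_condition()` are hypothetical names introduced here for illustration and are not part of the package.

```r
# Valid organism/condition pairs, transcribed from the Usage signature
# and the description of the `condition` argument.
valid_conditions <- list(
  "Arabidopsis thaliana" = c("seedlings", "flowers", "roots",
                             "roots_non_hairs", "seed_coats",
                             "seedlings_dark7d", "seedlings_dark7dLD24h",
                             "seedlings_dark7dlight3h",
                             "seedlings_dark7dlight30min",
                             "seedlings_heatshock"),
  "Solanum lycopersicum" = c("ripening_fruits", "immature_fruits"),
  "Oryza sativa"         = c("seedlings", "roots"),
  "Zea mays"             = "seedlings"
)

# Hypothetical helper (not exported by the package): stops with an
# informative message when the pair is not documented as supported.
check_condition <- function(organism, condition) {
  if (!organism %in% names(valid_conditions)) {
    stop("Unknown organism: ", organism)
  }
  allowed <- valid_conditions[[organism]]
  if (!condition %in% allowed) {
    stop("Condition '", condition, "' is not available for ", organism,
         ". Choose one of: ", paste(allowed, collapse = ", "))
  }
  invisible(TRUE)
}

check_condition("Oryza sativa", "roots")  # passes silently
```

Running such a check first gives a clearer error than a failure deep inside the prediction step, for example when "flowers" is requested for Zea mays.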

TFnames

Character vector setting the name(s) of the studied transcription factors. These names have to follow the AGI (Arabidopsis)/Solyc (tomato) nomenclature to allow the retrieval of the motifs from the PlantTFDB database. Otherwise, if you input the motifs from a local file through pfm, these names have to be among those described in that file.

pfm

Path to a file including the position frequency or weight matrices (PFMs or PWMs) of the motifs recognized by the considered transcription factors (training and/or studied TFs). This file can be in different formats, determined from the file extension: raw pfm (".pfm"), jaspar (".jaspar"), meme (".meme"), transfac (".transfac"), homer (".motif") or cis-bp (".txt"). pfm can be set to NULL (default value) if you provide the results of pattern-matching obtained from an external source (see the argument matches).
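Because the motif format is inferred from the file extension, it can help to confirm how a given file name would be interpreted. The sketch below only restates the extension list given above; `pfm_formats` and `guess_pfm_format()` are hypothetical names, and the package's own detection logic may differ (for instance in how it treats upper-case extensions).

```r
# Extension-to-format mapping, as listed in the description of `pfm`.
pfm_formats <- c(pfm      = "raw pfm",
                 jaspar   = "jaspar",
                 meme     = "meme",
                 transfac = "transfac",
                 motif    = "homer",
                 txt      = "cis-bp")

# Hypothetical helper (not part of the package): returns the format
# implied by the extension of `path`, or stops if it is not listed.
guess_pfm_format <- function(path) {
  ext <- tools::file_ext(path)
  if (!ext %in% names(pfm_formats)) {
    stop("Unsupported motif file extension: '", ext, "'")
  }
  unname(pfm_formats[ext])
}

guess_pfm_format("my_motifs.jaspar")
guess_pfm_format("my_motifs.motif")
```

Note that a cis-bp file is recognized only by the generic ".txt" extension, so a text file in another format would presumably need to be renamed to its proper extension first.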

show_annotations

A logical. Default = FALSE. If TRUE, the annotation of the potential binding sites with the genomic features extracted from their genomic regions will be output.

score_threshold

A numeric between 0 and 1. Sets the minimum prediction score, as output by the TFBSmodel, required to call a potential binding site a binding site of a studied transcription factor in the studied condition. The higher the threshold, the higher the specificity and the lower the sensitivity. Default = 0.5.
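The effect of the threshold can be illustrated on a toy table. This is a sketch under stated assumptions: the scores and the column names `site` and `score` are made up, `keep_sites()` is a hypothetical helper, and whether the package compares with `>=` or `>` is not stated in this page.

```r
# Toy prediction table; sites and scores are invented for illustration.
predictions <- data.frame(
  site  = c("site1", "site2", "site3", "site4"),
  score = c(0.35, 0.55, 0.72, 0.91)
)

# Hypothetical helper (not part of the package): keeps the rows whose
# score reaches the threshold (assumed here to be an inclusive bound).
keep_sites <- function(pred, score_threshold = 0.5) {
  stopifnot(is.numeric(score_threshold),
            score_threshold >= 0, score_threshold <= 1)
  pred[pred$score >= score_threshold, , drop = FALSE]
}

nrow(keep_sites(predictions))       # default threshold: 3 sites kept
nrow(keep_sites(predictions, 0.8))  # stricter threshold: 1 site kept
```

Raising the threshold discards the lower-scoring candidates, which trades sensitivity for specificity as described above.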

Details

The following table details, for each organism-condition that can be studied using carepat, the model that is used: the training organism-condition from which it was obtained and the genomic features it integrates.

Organism (studied = training) | Studied condition | Training condition | Genomic features
Arabidopsis thaliana | whole seedlings ("seedlings") | whole seedlings ("seedlings") | Layers 1, 2, 3, 4 + Full layer 5
Arabidopsis thaliana | flowers in stages 4-5 ("flowers") | flowers in stages 4-5 ("flowers") | Layers 1, 2, 3, 4 + DHS, Cme
Arabidopsis thaliana | seedling roots ("roots") | whole seedlings ("seedlings") | Layers 1, 2, 3, 4 + DHS
Arabidopsis thaliana | non-hair part of seedling roots ("roots_non_hairs") | whole seedlings ("seedlings") | Layers 1, 2, 3, 4 + DHS
Arabidopsis thaliana | seed coats, 4 days after anthesis ("seed_coats") | whole seedlings ("seedlings") | Layers 1, 2, 3, 4 + DHS
Arabidopsis thaliana | heat-shocked seedlings ("seedlings_heatshock") | whole seedlings ("seedlings") | Layers 1, 2, 3, 4 + DHS
Arabidopsis thaliana | dark-grown seedlings ("seedlings_dark7d") | whole seedlings ("seedlings") | Layers 1, 2, 3, 4 + DHS
Arabidopsis thaliana | dark-grown seedlings exposed to 30 min of light ("seedlings_dark7dlight30min") | whole seedlings ("seedlings") | Layers 1, 2, 3, 4 + DHS
Arabidopsis thaliana | dark-grown seedlings exposed to 3 h of light ("seedlings_dark7dlight3h") | whole seedlings ("seedlings") | Layers 1, 2, 3, 4 + DHS
Arabidopsis thaliana | dark-grown seedlings exposed to a long day cycle ("seedlings_dark7dLD24h") | whole seedlings ("seedlings") | Layers 1, 2, 3, 4 + DHS
Solanum lycopersicum | ripening fruits ("ripening_fruits") | ripening fruits ("ripening_fruits") | Layers 1, 2, 3, 4 + DHS, Cme, H3K27me3
Solanum lycopersicum | immature fruits ("immature_fruits") | ripening fruits ("ripening_fruits") | Layers 1, 2, 3 + DHS, Cme, H3K27me3
Oryza sativa | whole seedlings ("seedlings") | whole seedlings ("seedlings") | Layers 1, 2, 3, 4 + DHS, Cme, H3K36me3, H3K27ac, H3K27me3, H3K4me3, H3K9ac, H4K12ac
Oryza sativa | seedling roots ("roots") | whole seedlings ("seedlings") | Layers 1, 2, 3, 4 + DHS, Cme, H3K36me3, H3K27ac, H3K27me3, H3K4me3, H3K9ac, H4K12ac
Zea mays | whole seedlings ("seedlings") | whole seedlings ("seedlings") | Layers 1, 2, 3, 4 + DHS, Cme

The different layers of genomic features are composed of the following:

  • Layer 1: results of pattern-matching (log10 p-value of the score and local density of matches)

  • Layer 2: phastCons-scored conserved elements (for Arabidopsis and tomato) and conserved non-coding sequences (for Arabidopsis only)

  • Layer 3: position on the gene (promoter, proximal promoter, 5'untranslated region, coding sequence, intron, 3'untranslated region, downstream region, distance to the transcription start and termination site of the gene)

  • Layer 4: local signals of digital footprints

  • Layer 5: local signals of Chromatin state, Cytosine methylation (Cme), Histone 2A.Z positioning (H2AZ), DNA looping (Dloop), Nucleosome positioning (Nuc), Histone 2B monoubiquitination (H2BuB), Monomethylation on lysine 4 of histone 3 (H3K4me1), Dimethylation on lysine 4 of histone 3 (H3K4me2), Trimethylation on lysine 4 of histone 3 (H3K4me3), Dimethylation on lysine 9 of histone 3 (H3K9me2), Monomethylation on lysine 27 of histone 3 (H3K27me1), Trimethylation on lysine 27 of histone 3 (H3K27me3), Trimethylation on lysine 36 of histone 3 (H3K36me3), Acetylation on lysine 9 of histone 3 (H3K9ac), Acetylation on lysine 14 of histone 3 (H3K14ac), Acetylation on lysine 18 of histone 3 (H3K18ac), Acetylation on lysine 27 of histone 3 (H3K27ac), Acetylation on lysine 56 of histone 3 (H3K56ac), Phosphorylation on threonine 3 of histone 3 (H3T3ph), Acetylation on lysine 5 of histone 4 (H4K5ac), Acetylation on lysine 8 of histone 4 (H4K8ac), Acetylation on lysine 12 of histone 4 (H4K12ac), Acetylation on lysine 16 of histone 4 (H4K16ac).

The sources of the data used to train the models and, if applicable, to transfer them to the studied conditions are described in the file "Sources.ods" on the "RiviereQuentin/carepat" repository.

Value

A data.table listing the predicted binding sites. The 'TF' column annotates the potential binding sites with their cognate transcription factor. Additionally, the data.table gives, for each potential binding site, the chromosomal coordinates, the closest transcript (relative to the transcription start site) and the prediction score. Optionally (see show_annotations), the data.table also includes the genomic features used to make the predictions. NB: The chromosomal coordinates are expressed according to the following assemblies: TAIR10 (Arabidopsis thaliana), SL3.0 (Solanum lycopersicum), IRGSP-1.0 (Oryza sativa) and Zm-B73-REFERENCE-NAM-5.0 (Zea mays).

See Also

plotPredictions() to visualize the results for a given potential target gene.

Examples

# Predictions of the binding sites of "AT2G46830" in flowers of Arabidopsis
CCA1predictions.flowers <- carepat(organism = "Arabidopsis thaliana",
                                   condition = "flowers",
                                   TFnames = "AT2G46830")
# Predictions of the binding sites of "Solyc00g024680.1" in immature fruits of tomato
DOF24predictions.immature <- carepat(organism = "Solanum lycopersicum",
                                     condition = "immature_fruits",
                                     TFnames = "Solyc00g024680.1")


RiviereQuentin/Wimtrap documentation built on June 29, 2024, 7:17 p.m.