run_kmer_tsma: k-mer-based Transcript Set Motif Analysis

View source: R/main.R

Description

Calculates the enrichment of putative binding sites in foreground sets versus a background set, using k-mers to identify the putative binding sites.

Usage

run_kmer_tsma(
  foreground_sets,
  background_set,
  motifs = NULL,
  k = 6,
  fg_permutations = 5000,
  kmer_significance_threshold = 0.01,
  produce_plot = TRUE,
  p_adjust_method = "BH",
  p_combining_method = "fisher",
  n_cores = 1
)

Arguments

foreground_sets

list of foreground sets; a foreground set is a character vector of DNA or RNA sequences (not both) and a strict subset of the background_set

background_set

character vector of DNA or RNA sequences that constitute the background set

motifs

a list of motifs used to score the specified sequences. If motifs is NULL (the default), all Transite motifs are used.

k

length of k-mer, either 6 for hexamers or 7 for heptamers

fg_permutations

number of foreground permutations

kmer_significance_threshold

p-value threshold for significance, e.g., 0.05 or 0.01 (used for volcano plots)

produce_plot

if TRUE, volcano plots and distribution plots are created

p_adjust_method

see p.adjust

p_combining_method

one of the following: Fisher (1932) ("fisher"), Stouffer (1949) and Liptak (1958) ("SL"), Mudholkar and George (1979) ("MG"), or Tippett (1931) ("tippett") (see p_combine)

n_cores

number of computing cores to use
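
To make the role of fg_permutations more concrete, the following is a minimal base-R sketch of a Monte Carlo permutation test, assuming that random foreground sets of the same size are drawn from the background. The function permutation_p_value() and the toy statistic frac_uuu() are hypothetical names used only for illustration; they are not part of transite, and the package's own procedure may differ in detail.

```r
# Hypothetical helper (not part of transite): one-sided Monte Carlo p-value.
# Random "foreground" sets of the same size are drawn from the background,
# and the observed statistic is compared with the resulting distribution.
permutation_p_value <- function(fg, bg, statistic, n_permutations = 1000) {
  observed <- statistic(fg)
  permuted <- replicate(
    n_permutations,
    statistic(sample(bg, length(fg)))
  )
  # add-one correction keeps the estimate away from exactly zero
  (sum(permuted >= observed) + 1) / (n_permutations + 1)
}

# Toy statistic (an assumption for this sketch): fraction of sequences
# that contain the substring "UUU"
frac_uuu <- function(seqs) mean(grepl("UUU", seqs, fixed = TRUE))

bg <- c("UUUA", "UUUC", "ACGA", "CCGA", "GGCA", "ACCA", "GACA", "CAGA")
fg <- bg[1:2]

set.seed(1)
p <- permutation_p_value(fg, bg, frac_uuu, n_permutations = 1000)
```

More permutations give a finer-grained p-value estimate at the cost of run time, which is the trade-off that fg_permutations controls.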

Details

Motif transcript set analysis can be used to identify RNA-binding proteins whose targets are significantly overrepresented or underrepresented in certain sets of transcripts.

The aim of Transcript Set Motif Analysis (TSMA) is to identify the overrepresentation and underrepresentation of potential RBP targets (binding sites) in a set (or sets) of sequences, i.e., the foreground set, relative to the entire population of sequences. The latter is called the background set; it can be composed of all sequences of the genes of a microarray platform, all sequences of an organism, or any other meaningful superset of the foreground sets.

The k-mer-based approach breaks the sequences of foreground and background sets into k-mers and calculates the enrichment on a k-mer level. In this case, motifs are not represented as position weight matrices, but as lists of k-mers.
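
The idea of k-mer-level enrichment can be sketched in base R as follows. This is an illustration only: count_kmers() and kmer_enrichment() are hypothetical helpers, enrichment is assumed here to be the ratio of relative k-mer frequencies in foreground and background, and no homopolymer correction or significance test is applied. The package functions listed under See Also may compute these quantities differently.

```r
# Hypothetical helper (not part of transite): count all overlapping k-mers
# in a character vector of sequences; returns a named numeric vector.
count_kmers <- function(seqs, k = 6) {
  kmers <- unlist(lapply(seqs, function(s) {
    n <- nchar(s) - k + 1
    if (n < 1) return(character(0))  # sequence shorter than k
    substring(s, seq_len(n), seq_len(n) + k - 1)
  }))
  counts <- table(kmers)
  setNames(as.numeric(counts), names(counts))
}

# Hypothetical helper: enrichment assumed to be the ratio of relative
# k-mer frequencies (foreground / background).
kmer_enrichment <- function(fg, bg, k = 6) {
  fg_counts <- count_kmers(fg, k)
  bg_counts <- count_kmers(bg, k)
  # the foreground is a subset of the background, so the background
  # k-mers cover every k-mer that occurs in the foreground
  fg_aligned <- unname(fg_counts[names(bg_counts)])
  fg_aligned[is.na(fg_aligned)] <- 0
  data.frame(
    kmer = names(bg_counts),
    foreground_count = fg_aligned,
    background_count = unname(bg_counts),
    enrichment = (fg_aligned / sum(fg_counts)) /
      (unname(bg_counts) / sum(bg_counts)),
    stringsAsFactors = FALSE
  )
}

# toy input: the foreground is a strict subset of the background
bg <- c("AAAAAAA", "CCCCCCC")
fg <- bg[1]
enrichment_df <- kmer_enrichment(fg, bg, k = 6)
```

An enrichment value above 1 indicates a k-mer that is relatively more frequent in the foreground than in the background; a value below 1 indicates depletion.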

Statistically significantly enriched or depleted k-mers are then used to calculate a score for each RNA-binding protein, which quantifies its target overrepresentation.
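
As a rough illustration of what p_adjust_method = "BH" and p_combining_method = "fisher" refer to, the sketch below adjusts a set of made-up per-k-mer p-values with stats::p.adjust and combines them with Fisher's method. The p-values are invented for the example, and fisher_combine() is a hypothetical stand-in; it is not the package's p_combine function, whose interface is not shown here.

```r
# Made-up p-values for the k-mers of one motif (illustration only)
p_values <- c(0.001, 0.02, 0.04, 0.30)

# Benjamini-Hochberg adjustment for multiple testing (base R)
adjusted <- p.adjust(p_values, method = "BH")

# Hypothetical helper: Fisher's method. Under the null hypothesis,
# -2 * sum(log(p)) follows a chi-squared distribution with
# 2 * length(p) degrees of freedom.
fisher_combine <- function(p) {
  statistic <- -2 * sum(log(p))
  pchisq(statistic, df = 2 * length(p), lower.tail = FALSE)
}

combined <- fisher_combine(p_values)
```

Fisher's method assumes independent p-values; overlapping k-mers are not strictly independent, so a combined value of this kind is best read as a summary score rather than an exact probability.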

Value

A list of lists (one for each transcript set) with the following components:

enrichment_df the result of compute_kmer_enrichment
motif_df
motif_kmers_dfs
volcano_plots volcano plots for each motif (see draw_volcano_plot)
perm_test_plots plots of the empirical distribution of k-mer enrichment values for each motif
enriched_kmers_combined_p_values
depleted_kmers_combined_p_values

See Also

Other TSMA functions: draw_volcano_plot(), run_matrix_tsma()

Other k-mer functions: calculate_kmer_enrichment(), check_kmers(), compute_kmer_enrichment(), count_homopolymer_corrected_kmers(), draw_volcano_plot(), estimate_significance_core(), estimate_significance(), generate_kmers(), generate_permuted_enrichments(), run_kmer_spma()

Examples

# define simple sequence sets for foreground and background
foreground_set1 <- c(
  "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU",
  "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU",
  "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU",
  "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA",
  "AUAGAC", "AGUUC", "CCAGUAA"
)
foreground_set2 <- c("UUAUUUA", "AUCCUUUACA", "UUUUUUU", "UUUCAUCAUU")
foreground_sets <- list(foreground_set1, foreground_set2)
background_set <- unique(c(foreground_set1, foreground_set2, c(
  "CCACACAC", "CUCAUUGGAG", "ACUUUGGGACA", "CAGGUCAGCA",
  "CCACACCGG", "GUCAUCAGU", "GUCAGUCC", "CAGGUCAGGGGCA"
)))

# run k-mer based TSMA with all Transite motifs (recommended):
# results <- run_kmer_tsma(foreground_sets, background_set)

# run TSMA with one motif:
motif_db <- get_motif_by_id("M178_0.6")
results <- run_kmer_tsma(foreground_sets, background_set, motifs = motif_db)
## Not run: 
# define example sequence sets for foreground and background
foreground_set1 <- gsub("T", "U", transite:::ge$foreground1_df$seq)
foreground_set2 <- gsub("T", "U", transite:::ge$foreground2_df$seq)
foreground_sets <- list(foreground_set1, foreground_set2)
background_set <- gsub("T", "U", transite:::ge$background_df$seq)

# run TSMA with all Transite motifs
results <- run_kmer_tsma(foreground_sets, background_set)

# run TSMA with a subset of Transite motifs
results <- run_kmer_tsma(foreground_sets, background_set,
  motifs = get_motif_by_rbp("ELAVL1"))

# run TSMA with user-defined motif
toy_motif <- create_kmer_motif(
  "toy_motif", "example RBP",
  c("AACCGG", "AAAACG", "AACACG"), "example type", "example species", "user"
)
results <- run_kmer_tsma(foreground_sets, background_set,
  motifs = list(toy_motif))

## End(Not run)

transite documentation built on Nov. 8, 2020, 5:27 p.m.