run_matrix_spma: Matrix-based Spectrum Motif Analysis
run_matrix_spma

R Documentation

Matrix-based Spectrum Motif Analysis

Description

SPMA helps to illuminate the relationship between RBP binding evidence and the transcript sorting criterion, e.g., fold change between treatment and control samples.

Usage

run_matrix_spma(
  sorted_transcript_sequences,
  sorted_transcript_values = NULL,
  transcript_values_label = "transcript value",
  motifs = NULL,
  n_bins = 40,
  midpoint = 0,
  x_value_limits = NULL,
  max_model_degree = 1,
  max_cs_permutations = 1e+07,
  min_cs_permutations = 5000,
  max_hits = 5,
  threshold_method = "p_value",
  threshold_value = 0.25^6,
  max_fg_permutations = 1e+06,
  min_fg_permutations = 1000,
  e = 5,
  p_adjust_method = "BH",
  n_cores = 1,
  cache = paste0(tempdir(), "/sc/")
)

Arguments

`sorted_transcript_sequences`	named character vector of ranked sequences (only containing upper case characters A, C, G, T), where the names are RefSeq identifiers and sequence type qualifiers (`"3UTR"`, `"5UTR"` or `"mRNA"`), separated by `"|"`, e.g. `"NM_010356|3UTR"`. Names are only used to cache results. The sequences in `sorted_transcript_sequences` must be ranked (i.e., sorted). Commonly used sorting criteria are measures of differential expression, such as fold change or signal-to-noise ratio (e.g., between treatment and control samples in gene expression profiling experiments).
`sorted_transcript_values`	vector of sorted transcript values, i.e., the fold change or signal-to-noise ratio or any other quantity that was used to sort the transcripts that were passed to `run_matrix_spma` or `run_kmer_spma` (default value is `NULL`). These values are displayed as a semi-transparent area over the enrichment value heatmaps of spectrum plots.
`transcript_values_label`	label of transcript sorting criterion (e.g., `"log fold change"`, default value is `"transcript value"`), only shown if `!is.null(sorted_transcript_values)`
`motifs`	a list of motifs that is used to score the specified sequences. If `is.null(motifs)` then all Transite motifs are used.
`n_bins`	specifies the number of bins into which the sequences will be divided. Valid values are between 7 and 100.
`midpoint`	for enrichment values the midpoint should be `1`, for log enrichment values `0` (defaults to `0`)
`x_value_limits`	sets limits of the x-value color scale (used to harmonize color scales of different spectrum plots), see `limits` argument of `continuous_scale` (defaults to `NULL`, i.e., the data-dependent default scale range)
`max_model_degree`	maximum degree of polynomial
`max_cs_permutations`	maximum number of permutations performed in Monte Carlo test for consistency score
`min_cs_permutations`	minimum number of permutations performed in Monte Carlo test for consistency score
`max_hits`	maximum number of putative binding sites per mRNA that are counted
`threshold_method`	either `"p_value"` (default) or `"relative"`. If `threshold_method` equals `"p_value"`, the default `threshold_value` is `0.25^6`, which is the lowest p-value that can be achieved by hexamer motifs, the shortest supported motifs. If `threshold_method` equals `"relative"`, the default `threshold_value` is `0.9`, which is 90% of the maximum PWM score.
`threshold_value`	semantics of the `threshold_value` depend on `threshold_method` (default is 0.25^6)
`max_fg_permutations`	maximum number of foreground permutations performed in Monte Carlo test for enrichment score
`min_fg_permutations`	minimum number of foreground permutations performed in Monte Carlo test for enrichment score
`e`	integer-valued stop criterion for the enrichment score Monte Carlo test: the permutation process is aborted after observing `e` random enrichment values that are more extreme than the actual enrichment value
`p_adjust_method`	adjustment of p-values from Monte Carlo tests to avoid alpha error accumulation, see `p.adjust`
`n_cores`	the number of cores that are used
`cache`	either logical or path to a directory where scores are cached. The scores of each motif are stored in a separate file that contains a hash table with RefSeq identifiers and sequence type qualifiers as keys and the number of putative binding sites as values. If `cache` is `FALSE`, scores will not be cached.
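
The default p-value threshold of `0.25^6` can be reproduced with simple counting. The sketch below assumes a uniform nucleotide background and that exactly one hexamer attains the top score; both are assumptions made for illustration and are not taken from the package code.

```r
# Number of distinct hexamers over a four-letter alphabet
n_hexamers <- 4^6

# Smallest achievable p-value if exactly one of these hexamers reaches the
# maximum score (assumption: uniform background, illustration only)
min_p <- 1 / n_hexamers

print(n_hexamers)
print(min_p == 0.25^6)
```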

Details

In order to investigate how motif targets are distributed across a spectrum of transcripts (e.g., all transcripts of a platform, ordered by fold change), Spectrum Motif Analysis visualizes the gradient of RBP binding evidence across all transcripts.
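
The binning of a ranked transcript list might look like the following minimal sketch. The helper `split_into_bins`, the toy sequences, and the choice of the full list as background are assumptions for illustration; the package's own binning may differ in detail.

```r
# Toy ranked data: 10 made-up sequences, already sorted by a transcript value
seqs <- c("ACGUAC", "UUGACA", "GGCAUU", "CAUGCA", "AUAUGC",
          "GCGCAU", "UACGGA", "CCAUAG", "AGGCUA", "UUUACG")
names(seqs) <- paste0("TX", sprintf("%02d", seq_along(seqs)), "|3UTR")

# Hypothetical helper: assign consecutive ranks to n_bins bins of
# near-equal size while preserving the original order
split_into_bins <- function(sorted_seqs, n_bins) {
  bin_id <- ceiling(seq_along(sorted_seqs) * n_bins / length(sorted_seqs))
  split(sorted_seqs, bin_id)
}

# Each bin serves as one foreground set; here the complete ranked list
# is used as the background set (an assumption of this sketch)
foreground_sets <- split_into_bins(seqs, n_bins = 5)
background_set <- seqs

print(lengths(foreground_sets))
```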

The matrix-based approach skips the k-merization step of the k-mer-based approach and instead scores the transcript sequence as a whole with a position specific scoring matrix.

For each sequence in foreground and background sets and each sequence motif, the scoring algorithm evaluates the score for each sequence position. Positions with a relative score greater than a certain threshold are considered hits, i.e., putative binding sites.
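
As a rough illustration of position-wise scoring, the sketch below slides a toy position weight matrix along one sequence and reports positions whose score exceeds 90% of the maximum achievable score. The matrix values, the helper `score_positions`, and the exact definition of the relative threshold are assumptions; this is not the package's scoring routine.

```r
# Toy position weight matrix for a 3-nt motif
# (rows: motif positions, columns: nucleotides; made-up weights)
pwm <- matrix(c( 2, -1, -1, -1,
                -1, -1,  2, -1,
                -1, -1, -1,  2),
              nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("A", "C", "G", "U")))

# Hypothetical helper: score every window of motif width in the sequence
score_positions <- function(sequence, pwm) {
  chars <- strsplit(sequence, "")[[1]]
  width <- nrow(pwm)
  n_pos <- length(chars) - width + 1
  if (n_pos < 1) return(numeric(0))
  vapply(seq_len(n_pos), function(start) {
    window <- chars[start:(start + width - 1)]
    # sum the weight of the observed nucleotide at each motif position
    sum(pwm[cbind(seq_len(width), match(window, colnames(pwm)))])
  }, numeric(1))
}

# Best possible score: sum of the row maxima
max_score <- sum(apply(pwm, 1, max))

scores <- score_positions("CAGUAGUC", pwm)

# Positions scoring above 90% of the maximum are treated as hits
hits <- which(scores > 0.9 * max_score)
print(hits)
```

In this toy case the motif `AGU` occurs twice, so two start positions are reported as putative binding sites.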

By scoring all sequences in foreground and background sets, a hit count for each motif and each set is obtained, which is used to calculate enrichment values and associated p-values in the same way in which motif-compatible hexamer enrichment values are calculated in the k-mer-based approach. P-values are adjusted with one of the available adjustment methods.
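
The following sketch shows one way a Monte Carlo test with an early stop criterion and subsequent p-value adjustment could be put together. The enrichment statistic (ratio of mean hit counts), the helper `mc_test`, and the way `min_perm`, `max_perm`, and `e` interact are assumptions for illustration and may not match the package's implementation; only `p.adjust` is a standard R function.

```r
set.seed(1)  # make the toy run reproducible

# Hypothetical enrichment statistic: ratio of mean hit counts per sequence
enrichment <- function(fg_hits, bg_hits) mean(fg_hits) / mean(bg_hits)

# Hypothetical Monte Carlo test: draw random foreground sets from the
# background and count how often they are at least as extreme as observed
mc_test <- function(bg_hits, fg_idx, min_perm = 1000, max_perm = 2000, e = 5) {
  stopifnot(sum(bg_hits) > 0)
  observed <- abs(log(enrichment(bg_hits[fg_idx], bg_hits)))
  n_extreme <- 0
  n_perm <- 0
  while (n_perm < max_perm) {
    idx <- sample(length(bg_hits), length(fg_idx))
    random <- abs(log(enrichment(bg_hits[idx], bg_hits)))
    n_perm <- n_perm + 1
    if (random >= observed) n_extreme <- n_extreme + 1
    # stop early once enough extreme values have been seen
    if (n_perm >= min_perm && n_extreme >= e) break
  }
  list(p_value = (n_extreme + 1) / (n_perm + 1), n_perm = n_perm)
}

# Toy hit counts: the first 20 sequences carry many hits
bg_hits <- c(rep(5, 20), rep(1, 180))
enriched <- mc_test(bg_hits, fg_idx = 1:20)
unremarkable <- mc_test(bg_hits, fg_idx = 101:120)

# Adjust the p-values for multiple testing (Benjamini-Hochberg)
adjusted <- p.adjust(c(enriched$p_value, unremarkable$p_value), method = "BH")
print(adjusted)
```

The early stop saves work for unremarkable bins, where extreme random values appear quickly, while clearly enriched bins run until the permutation limit.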

An advantage of the matrix-based approach is the possibility of detecting clusters of binding sites. This can be done by counting regions with many hits using positional hit information or by simply applying a hit count threshold per sequence, e.g., only sequences with more than some number of hits are considered. Homotypic clusters of RBP binding sites may play a role similar to that of clusters of transcription factors.
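
Both cluster detection strategies can be sketched in a few lines of base R. The hit positions, the helper `max_hits_in_window`, and the window size and threshold are hypothetical values chosen for illustration only.

```r
# Hypothetical hit start positions per sequence
hit_positions <- list(
  tx1 = c(10, 14, 19, 250),
  tx2 = c(5, 300),
  tx3 = integer(0)
)

# Strategy 1: keep only sequences with more than min_hits hits
min_hits <- 2
has_many_hits <- lengths(hit_positions) > min_hits

# Strategy 2: largest number of hits whose start positions fall within
# any window of `window` nucleotides
max_hits_in_window <- function(positions, window = 20) {
  if (length(positions) == 0) return(0L)
  positions <- sort(positions)
  max(vapply(positions, function(p) {
    sum(positions >= p & positions < p + window)
  }, integer(1)))
}

cluster_sizes <- vapply(hit_positions, max_hits_in_window, integer(1))
print(has_many_hits)
print(cluster_sizes)
```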

Value

A list with the following components:

`foreground_scores`	the result of `score_transcripts` for the foreground sets (the bins)
`background_scores`	the result of `score_transcripts` for the background set
`enrichment_dfs`	a list of data frames, returned by `calculate_motif_enrichment`
`spectrum_info_df`	a data frame with the SPMA results
`spectrum_plots`	a list of spectrum plots, as generated by `score_spectrum`
`classifier_scores`	a list of classifier scores, as returned by `classify_spectrum`

Examples

# load the package and its example data set
library(transite)
background_df <- transite:::ge$background_df
# sort sequences by signal-to-noise ratio
background_df <- dplyr::arrange(background_df, value)
# character vector of named and ranked (by signal-to-noise ratio) sequences
background_seqs <- gsub("T", "U", background_df$seq)
names(background_seqs) <- paste0(background_df$refseq, "|",
  background_df$seq_type)

results <- run_matrix_spma(background_seqs,
                           sorted_transcript_values = background_df$value,
                           transcript_values_label = "signal-to-noise ratio",
                           motifs = get_motif_by_id("M178_0.6"),
                           n_bins = 20,
                           max_fg_permutations = 10000)

## Not run: 
results <- run_matrix_spma(background_seqs,
                           sorted_transcript_values = background_df$value,
                           transcript_values_label = "SNR") 
## End(Not run)

kkrismer/transite documentation built on July 13, 2024, 8:01 a.m.