score_spectrum: Calculates spectrum scores and creates spectrum plots
In kkrismer/transite: RNA-binding protein motif analysis

score_spectrum

R Documentation

Calculates spectrum scores and creates spectrum plots

Description

Spectrum scores are a means to evaluate if a spectrum has a meaningful (i.e., biologically relevant) or a random pattern.

Usage

score_spectrum(
  x,
  p_values = array(1, length(x)),
  x_label = "log enrichment",
  sorted_transcript_values = NULL,
  transcript_values_label = "transcript value",
  midpoint = 0,
  x_value_limits = NULL,
  max_model_degree = 3,
  max_cs_permutations = 1e+07,
  min_cs_permutations = 5000,
  e = 5
)

Arguments

`x`	vector of values (e.g., enrichment values, normalized RBP scores) per bin
`p_values`	vector of p-values (e.g., significance of enrichment values) per bin
`x_label`	label of values (e.g., `"enrichment value"`)
`sorted_transcript_values`	vector of sorted transcript values, i.e., the fold change or signal-to-noise ratio or any other quantity that was used to sort the transcripts that were passed to `run_matrix_spma` or `run_kmer_spma` (default value is `NULL`). These values are displayed as a semi-transparent area over the enrichment value heatmaps of spectrum plots.
`transcript_values_label`	label of transcript sorting criterion (e.g., `"log fold change"`, default value is `"transcript value"`), only shown if `!is.null(sorted_transcript_values)`
`midpoint`	for enrichment values the midpoint should be `1`, for log enrichment values `0` (defaults to `0`)
`x_value_limits`	sets limits of the x-value color scale (used to harmonize color scales of different spectrum plots), see `limits` argument of `continuous_scale` (defaults to `NULL`, i.e., the data-dependent default scale range)
`max_model_degree`	maximum degree of polynomial
`max_cs_permutations`	maximum number of permutations performed in Monte Carlo test for consistency score
`min_cs_permutations`	minimum number of permutations performed in Monte Carlo test for consistency score
`e`	integer-valued stop criterion for consistency score Monte Carlo test: aborting permutation process after observing `e` random consistency values with more extreme values than the actual consistency value

Details

One way to quantify the meaningfulness of a spectrum is to calculate the deviance between the linear interpolation of the scores of two adjoining bins and the score of the middle bin, for each position in the spectrum. The lower the score, the more consistent the trend in the spectrum plot. Formally, the local consistency score x_c is defined as

x_c = \frac{1}{n} \sum_{i = 1}^{n - 2}{|\frac{s_i + s_{i + 2}}{2} - s_{i + 1}|}.

In order to obtain an estimate of the significance of a particular score x_c', Monte Carlo sampling is performed by randomly permuting the coordinates of the scores vector s and recomputing x_c. The probability estimate \hat{p} is given by the lower tail version of the cumulative distribution function

\hat{Pr}(T(x)) = \frac{\sum_{i = 1}^n 1(T(y_i) \le T(x)) + 1}{n + 1},

where 1 is the indicator function, n is the sample size, i.e., the number of performed permutations, and T equals x_c in the above equation.

An alternative approach to assess the consistency of a spectrum plot is via polynomial regression. In a first step, polynomial regression models of various degrees are fitted to the data, i.e., the dependent variable s (vector of scores), and orthogonal polynomials of the independent variable b (vector of bin numbers). Secondly, the model that reflects best the true nature of the data is selected by means of the F-test. And lastly, the adjusted R^2 and the sum of squared residuals are calculated to indicate how well the model fits the data. These statistics are used as scores to rank the spectrum plots. In general, the polynomial regression equation is

y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_m x_i^m + \epsilon_i,

where m is the degree of the polynomial (usually m \le 5), and \epsilon_i is the error term. The dependent variable y is the vector of scores s and x to x^m are the orthogonal polynomials of the vector of bin numbers b. Orthogonal polynomials are used in order to reduce the correlation between the different powers of b and therefore avoid multicollinearity in the model. This is important, because correlated predictors lead to unstable coefficients, i.e., the coefficients of a polynomial regression model of degree m can be greatly different from a model of degree m + 1.

The orthogonal polynomials of vector b are obtained by centering (subtracting the mean), QR decomposition, and subsequent normalization. Given the dependent variable y and the orthogonal polynomials of b x to x^m, the model coefficients \beta are chosen in a way to minimize the deviance between the actual and the predicted values characterized by

M(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_m x^m

M = argmin_{M}{(\sum_{i = 1}^n{L(y_i, M(x_i))})},

where L(actual value, predicted value) denotes the loss function.

Ordinary least squares is used as estimation method for the model coefficients \beta. The loss function of ordinary least squares is the sum of squared residuals (SSR) and is defined as follows SSR(y, \hat{y}) = \sum_{i = 1}^n{(y_i - \hat{y}_i)^2}, where y are the observed data and \hat{y} the model predictions.

Thus the ordinary least squares estimate of the coefficients \hat{\beta} (including the intercept \hat{\beta}_0) of the model M is defined by

\hat{\beta} = argmin_{\beta}{(\sum_{i = 1}^n{(y_i - \beta_0 - \sum_{j = 1}^m{\beta_j x_i^j})^2})}.

After polynomial models of various degrees have been fitted to the data, the F-test is used to select the model that best fits the data. Since the SSR monotonically decreases with increasing model degree (model complexity), the relative decrease of the SSR between the simpler model and the more complex model must outweigh the increase in model complexity between the two models. The F-test gives the probability that a relative decrease of the SSR between the simpler and the more complex model given their respective degrees of freedom is due to chance. A low p-value indicates that the additional degrees of freedom of the more complex model lead to a better fit of the data than would be expected after a mere increase of degrees of freedom.

The F-statistic is calculated as follows

F = \frac{(SSR_1 - SSR_2) / (p_2 - p_1)}{SSR_2 / (n - p_2)},

where SSR_i is the sum of squared residuals and p_i is the number of parameters of model i. The number of data points, i.e., bins, is denoted as n. F is distributed according to the F-distribution with df_1 = p_2 - p_1 and df_2 = n - p_2.

Value

A list object of class SpectrumScore with the following components:

`adj_r_squared`	adjusted `R^2` of polynomial model
`degree`	maximum degree of polynomial
`residuals`	residuals of polynomial model
`slope`	coefficient of the linear term of the polynomial model (spectrum "direction")
`f_statistic`	statistic of the F-test
`f_statistic_p_value`	p-value of F-test
`consistency_score`	normalized sum of deviance between the linear interpolation of the scores of two adjoining bins and the score of the middle bin, for each position in the spectrum
`consistency_score_p_value`	obtained by Monte Carlo sampling (randomly permuting the coordinates of the scores vector)
`consistency_score_n`	number of permutations
`plot`

Examples

# random spectrum
score_spectrum(runif(n = 40, min = -1, max = 1), max_model_degree = 1)

# two random spectrums with harmonized color scales
plot(score_spectrum(runif(n = 40, min = -1, max = 1), max_model_degree = 1,
     x_value_limits = c(-2.0, 2.0)))
plot(score_spectrum(runif(n = 40, min = -2, max = 2), max_model_degree = 1,
     x_value_limits = c(-2.0, 2.0)))

# random spectrum with p-values
score_spectrum(runif(n = 40, min = -1, max = 1),
               p_values = runif(n = 40, min = 0, max = 1),
               max_model_degree = 1)

# random spectrum with sorted transcript values
log_fold_change <- log(runif(n = 1000, min = 0, max = 1) /
                           runif(n = 1000, min = 0, max = 1))
score_spectrum(runif(n = 40, min = -1, max = 1),
               sorted_transcript_values = sort(log_fold_change),
               max_model_degree = 1)

# non-random linear spectrum
signal <- seq(-1, 0.99, 2 / 40)
noise <- rnorm(n = 40, mean = 0, sd = 0.5)
score_spectrum(signal + noise, max_model_degree = 1,
               max_cs_permutations = 100000)

# non-random quadratic spectrum
signal <- seq(-1, 0.99, 2 / 40)^2 - 0.5
noise <- rnorm(n = 40, mean = 0, sd = 0.2)
score_spectrum(signal + noise, max_model_degree = 2,
               max_cs_permutations = 100000)

kkrismer/transite documentation built on July 13, 2024, 8:01 a.m.