noisyR count matrix approach workflow"
In noisyr: Noise Quantification in High Throughput Sequencing Output

wzxhzdk:0 The *noisyR* package is an end-to-end pipeline for quantifying and removing technical noise from HTS datasets. The three main pipeline steps are: 1. similarity calculation 1. noise quantification 1. noise removal Each step can be finely tuned using hyperparameters; optimal, data-driven values for these parameters are also determined. The package and some applications are described in more detail in [this preprint](https://www.biorxiv.org/content/10.1101/2021.01.17.427026v2) and is actively maintained on https://github.com/Core-Bioinformatics/noisyR. The **count matrix approach** uses the original, un-normalised count matrix, as provided after alignment and feature quantification; each sample is processed individually, only the relative expressions across samples are compared. Relying on the hypothesis that the majority of genes are not DE, most of the evaluations are expected to point towards a high similarity across samples. # Installation To install the package, first install all bioconductor dependencies: wzxhzdk:1 Then, you can install *noisyR* (and all its other dependencies) from CRAN: wzxhzdk:2 To install the latest stable version from GitHub, first install CRAN dependencies: wzxhzdk:3 # Preprocessing First, load *noisyR*: wzxhzdk:4 For this demonstration we will be using a subset of the count matrix for an experiment included in [a 2019 paper by Yang et al](https://www.sciencedirect.com/science/article/pii/S2405471219301152). Rows represent genes/features and columns represent samples: wzxhzdk:5 Note that when reading from a file R typically returns a data frame. To convert to a matrix we use the function *cast_matrix_to_numeric()*. This also converts values to numeric (in case they were read as characters). Any values that are not coercible to numeric are replaced by 0. wzxhzdk:6 # Running *noisyR* The full *noisyR* pipeline can be run through *noisyr()*, choosing "counts" for the count matrix approach (internally calls *noisyr_counts()*). The user can pass many arguments to this function, which alter the behaviour of the different pipeline steps, as discussed in the breakdown below. wzxhzdk:7 The output of the noise removal is a denoised matrix that can be passed on to other methods for downstream analysis. wzxhzdk:8 # Pipeline breakdown While for most applications, running *noisyr()* is sufficient, it may be useful to run individual pipeline steps manually. For example, the user may want to create summary figures for the denoising, store the noise thresholds obtained or other intermediary outputs, or try out different options without rerunning all steps. In this section, we look deeper into *noisyr()* and break down the three main steps that are performed. ## Similarity calculation We can then run the similarity calculation using *calculate_expression_similarity_counts()*: wzxhzdk:9 Users can select a similarity measure to assess the localised consistency in expression across samples (dissimilarity measures are inverted). See the *philentropy* package documentation for more information on the different distances. The full list of available metrics can be viewed by: wzxhzdk:10 By default, the window length is 10% of the number of rows in the matrix, as it has proven effective empirically. A different window can be specified by the *n.elements.per.window* parameter. The optimal window length can also be estimated by seeking stability of output (but this can be computationally intensive for large datasets): wzxhzdk:11 Window length optimisation can be turned on using the *optimise.window.length.logical* parameter in *noisyr()* or *noisyr_counts()*. Plots of the abundance-correlation relation can be generated through the *plot_expression_similarity()* function: wzxhzdk:12 > As expected, we observe low correlation values for low abundances and a steady increase towards 1 as the abundance increases. This is based on the expectation that most genes are not differentially expressed and have consistent expression, but at low abundances the stochastic nature of transcription and sequencing gives rise to noise. The local maximum at very low abundances is due to strings of zeros driving the correlation higher than expected. These are ggplot objects, and can thus be modified and combined intuitively. For example, plotting all the line plots together: wzxhzdk:13 ## Noise quantification Using the output of the similarity calculation, we can compute the signal to noise threshold in each sample: wzxhzdk:14 Here we used the default parameters: a similarity threshold of 0.25 and the Boxplot-IQR method. There are several methods available, which can be viewed with the *get_methods_calculate_noise_threshold()* function: wzxhzdk:15 The first three methods are just calculating the minimum of the density plot for all genes (a common, fast approach). This usually provides a rough, overestimated signal to noise threshold. The rest of the methods use either the (smoothed) line plot or the boxplot to find the noise threshold given a similarity (correlation/distance) threshold. It is recommended that the method with the least coefficient of variation across all samples is chosen for noise removal. This can also be applied to compute the correlation/distance threshold instead of supplying it manually, which is especially useful for non-correlation measures which don't have a standard range. For example, by looking to minimise the coefficient of variation, we get a correlation threshold of 0.21 and the loess10 smoothing method for this dataset (by default all methods from *get_methods_calculate_noise_threshold()* are used: wzxhzdk:16 We can then call *calculate_noise_threshold()* with our optimised parameters: wzxhzdk:17 Parameter optimisation can be turned on by specifying vectors of values for *similarity.threshold* and/or *method.chosen* (a subset of *get_methods_calculate_noise_threshold()*) when calling *noisyr()* or *noisyr_counts()*. ## Noise removal To produce the denoised count matrix, the function *remove_noise_from_matrix()* is used with a specified vector of noise thresholds (usually calculated by *calculate_noise_threshold()*). wzxhzdk:18 The behaviour of *remove_noise_from_matrix()* can be further modified: * __*add.threshold*__ whether to add the noise threshold to each entry (default) or set each entry under the noise threshold to the noise threshold. * __*average.threshold*__ whether the noise thresholds for the different samples are averaged (default) or used individually. The latter should be especially avoided if the thresholds have high variance, as it could intoduce artificial differences in the data. * __*remove.noisy.features*__ whether genes/features whose expression is under the noise threshold in every sample should be removed from the matrix (default) or not Because of these defaults, passing the mean of the thresholds gives a slightly different result (different # of genes fully under the noise threshold: wzxhzdk:19 By supplying the corresponding parameters to *noisyr()*, we obtain the same final result from the full pipeline: wzxhzdk:20 # Downstream analysis The denoised matrix can be used instead of the raw count matrix for downstream analysis. Here we present a simple example of a differential expression (DE) analysis and compare the two. We create a function to perform the same DE pipeline on both matrices, using [the edgeR package](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2796818/). wzxhzdk:21 wzxhzdk:22 > We observe the distribution of genes in the volcano plots becoming a lot tighter for the denoised matrix. For the raw matrix, there are a lot of genes with low p-values and high log-fold changes that are barely called DE. Those "whiskers" are corrected for the denoised matrix. We can also see the number of differentially expressed genes has been reduced. wzxhzdk:23