degCovariates: Find correlation between pcs and covariates
In lpantano/DEGreport: Report of DEG analysis

degCovariates

R Documentation

Find correlation between pcs and covariates

Description

This function will calculate the pcs using prcomp function, and correlate categorical and numerical variables from metadata. The size of the dots indicates the importance of the metadata, for instance, when the range of the values is pretty small (from 0.001 to 0.002 in ribosimal content), the correlation results is not important. If black stroke lines are shown, the correlation analysis has a FDR < 0.05 for that variable and PC. Only significant variables according the linear model are colored. See details to know how this is calculated.

Usage

degCovariates(
  counts,
  metadata,
  fdr = 0.1,
  scale = FALSE,
  minPC = 5,
  correlation = "kendall",
  addCovDen = TRUE,
  legacy = FALSE,
  smart = TRUE,
  method = "lm",
  plot = TRUE
)

Arguments

`counts`	normalized counts matrix
`metadata`	data.frame with samples metadata.
`fdr`	numeric value to use as cutoff to determine the minimum fdr to consider significant correlations between pcs and covariates.
`scale`	boolean to determine wether counts matrix should be scaled for pca. default FALSE.
`minPC`	numeric value that will be used as cutoff to select only pcs that explain more variability than this.
`correlation`	character determining the method for the correlation between pcs and covariates.
`addCovDen`	boolean. Whether to add the covariates dendograme to the plot to see covariates relationship. It will show `degCorCov()` dendograme on top of the columns of the heatmap.
`legacy`	boolean. Whether to plot the legacy version.
`smart`	boolean. Whether to avoid normalization of the numeric covariates when calculating importance. This is not used if `legacy = TRUE`. See @details for more information.
`method`	character. Whether to use `lm` to calculate the significance of the variable during reduction step. See @details for more information.
`plot`	Whether to plot or not the correlation matrix.

Details

This method is adapeted from Daily et al 2017 article. Principal components from PCA analysis are correlated with covariates metadata. Factors are transformed to numeric variables. Correlation is measured by cor.test function with Kendall method by default.

The size of the dot, or importance, indicates the importance of the covariate based on the range of the values. Covariates where the range is very small (like a % of mapped reads that varies between 0.001 to 0.002) will have a very small size (0.1*max_size). The maximum value is set to 5 units. To get to importance, each covariate is normalized using this equation: 1 - min(v/max(v)), and the minimum and maximum values are set to 0.01 and 1 respectively. For instance, 0.5 would mean there is at least 50% of difference between the minimum value and the maximum value. Categorical variables are plot using the maximum size always, since it is not possible to estimate the variability. By default, it won't do v/max(v) if the values are already between 0-1 or 0-100 (already normalized values as rates and percentages). If you want to ignore the importance, use legacy = TRUE.

Finally, a linear model is used to calculate the significance of the covariates effect on the PCs. For that, this function uses lm to regress the data and uses the p-value calculated by each variable in the model to define significance (pvalue < 0.05). Variables with a black stroke are significant after this step. Variables with grey stroke are significant at the first pass considering p.value < 0.05 for the correlation analysis.

Value

: list:

plot, heatmap showing the signifcance of the variables.
corMatrix, correlation, p-value, FDR values for each covariate and PCA pais
pcsMatrix: PCs loading for each sample
scatterPlot: plot for each significant covariate and the PC values.
significants: contains the significant covariates using a linear model to predict the coefficient of covariates that have some color in the plot. All the significant covariates from the liner model analysis are returned.

Author(s)

: Lorena Pantano, Victor Barrera, Kenneth Daily and Thanneer Malai Perumal

References

Daily, K. et al. Molecular, phenotypic, and sample-associated data to describe pluripotent stem cell lines and derivatives. Sci Data 4, 170030 (2017).

Examples

data(humanGender)
library(DESeq2)
idx <- c(1:10, 75:85)
dse <- DESeqDataSetFromMatrix(assays(humanGender)[[1]][1:1000, idx],
  colData(humanGender)[idx,], design=~group)
res <- degCovariates(log2(counts(dse)+0.5), colData(dse))
res <- degCovariates(log2(counts(dse)+0.5),
  colData(dse), legacy = TRUE)
res$plot
res$scatterPlot[[1]]

lpantano/DEGreport documentation built on Feb. 28, 2024, 12:01 a.m.