knitr::opts_chunk$set(comment = "#", collapse = TRUE)
load("plots_corr.RData")
library(metan)
library(DT) # Used to make the tables
# Function to make HTML tables
print_table <- function(table, rownames = FALSE, digits = 3, ...){
datatable(table, rownames = rownames, extensions = 'Buttons',
          options = list(scrollX = TRUE, 
                         dom = '<<t>Bp>',
                         buttons = c('copy', 'excel', 'pdf', 'print')), ...) %>%
    formatSignif(columns = c(as.numeric(which(sapply(table, class) == "numeric"))), digits = digits)}

See the section Rendering engine to know how HTML tables were generated.

Find possible outliers in a data frame

The function find_outlier() may be used to identify possible outliers in a dataframe. It is suggested that before applying any statistical procedures, outliers be checked out.

library(metan)
data_out <- data_ge2
data_out[12, 4] = 5
find_outliers(data_out, var = PH, plots = TRUE)

To check the outliers in different levels of a factor, use the argument by. As an example, we will find possible outliers for each level of the factor ENV.

find_outliers(data_ge2, PH, by = ENV)

To group by more than one variable, use the function group_by() is used.

data_ge2 %>%
  group_by(ENV) %>%
  find_outliers(PH)

Correlations

Linear and partial correlation coefficients

Pearson's linear correlation does not consider the influence a set of traits on the relationship between two traits. For example, the hypothetical correlation of r = 0.9 between x and y may be due to the influence of a third trait or group of traits acting together. To identify this linear effect between x and y controlling statistically the effect of others traits, the partial correlation is used. From Pearson's simple correlation matrix, the partial correlation is calculated by the following equation:

$$ {r_{xy.m}} = \frac{{ - {a_{xy}}}}{{\sqrt {{a_{xx}}{a_{yy}}} }} $$

Where ${r_{xy.m}}$ is the partial correlation coefficient between the traits * x * and * y *, excluding the effects of the * m * remaining traits of the set; $- {a_{ij}}$ is the inverse element of the correlation matrix corresponding to xy, ${a_{ii}}{a_{jj}}$ are the diagonal elements of the inverse matrix of correlation associated with trait x and y , respectively. The significance of this correlation is also tested by the test * t * according to the following expression:

$$ t_{calc} = r_{xy.m} \sqrt \frac{n-v}{1-r_{xy.m}^2} $$

Where $t_{calc}$ is the calculated Student * t * statistic; $ r_{xy.m} $ is the partial correlation coefficient for the traits x and y excluding the effect of the other * m * traits; * n * is the number of observations; and * v * is the number of traits. Both the linear and partial correlation coefficients may be obtained using the function lpcor().

lpc1 <- 
  data_ge2 %>% 
  select(starts_with("N")) %>% 
  lpcor()
print_table(lpc1$results)

# Compute the correlations for each level of the factor ENV
lpc2 <- lpcor(data_ge2, by = ENV)
print(lpc2)

Using the pairs_mantel() function, it is possible to compute a Mantel's test [@Mantel1967] for all pairwise correlation matrices of the above example.

lpc2 %>% pairs_mantel(names = paste("H", 1:4, sep = ""))

This same plot may be obtained by passing correlation matrices with the same dimension to an object of class lpcor and then applying the function pairs_mantel().

as.lpcor(cor(data_ge2[1:30, 5:ncol(data_ge2)]),
         cor(data_ge2[31:60, 5:ncol(data_ge2)]),
         cor(data_ge2[61:90, 5:ncol(data_ge2)]),
         cor(data_ge2[91:120, 5:ncol(data_ge2)]),
         cor(data_ge2[121:150, 5:ncol(data_ge2)])) %>%
  pairs_mantel(diag = TRUE,
               pan.spacing = 0,
               shape.point = 21,
               col.point = "black",
               fill.point = "red",
               size.point = 1.5,
               alpha.point = 0.6,
               main = "My own plot",
               alpha = 0.2)

Graphical and numerical visualization of a correlation matrix

The function corr_coef() can be used to compute Pearson producto-moment correlation coefficients with p-values. A correlation heat map can be created with the function plot().

# All numeric variables
ccoef <- corr_coef(data_ge2)
plot(ccoef)

We can use a select helper function to select variables. Here, we will select variables that starts with "C" OR ends with "D" using union_var().

ccoef2 <- corr_coef(data_ge2, union_var("C", "D"))
print(ccoef2, digits = 2)

The function corr_plot() may be used to visualize (both graphically and numerically) a correlation matrix. Pairwise of scatterplots are produced and may be shown in the upper or lower diagonal, which may be seen as a nicer and customizable ggplot2-based version of the pairs()base R function.

a <- corr_plot(data_ge2, CD, EL, PERK, NKR)

b <- 
corr_plot(data_ge2, CD, EL, PERK, NKR,
          lower = NULL,
          upper = "corr")
c <- 
corr_plot(data_ge2, CD, EL, PERK, NKR,
          shape.point = 19,
          size.point = 2,
          alpha.point = 0.5,
          alpha.diag = 0,
          pan.spacing = 0,
          diag.type = "boxplot",
          col.sign = "gray",
          alpha.sign = 0.3,
          axis.labels = TRUE)

d <- 
corr_plot(data_ge2, CD, EL, PERK, NKR, CW, NKE,
          prob = 0.01,
          shape.point = 21,
          col.point = "black",
          fill.point = "orange",
          size.point = 2,
          alpha.point = 0.6,
          maxsize = 4,
          minsize = 2,
          smooth = TRUE,
          size.smooth = 1,
          col.smooth = "black",
          col.sign = "cyan",
          col.up.panel = "black",
          col.lw.panel = "black",
          col.dia.panel = "black",
          pan.spacing = 0,
          lab.position = "tl")
arrange_ggplot(a, b, c, d, tag_levels = "a")

It is also possible to use a categorical variable of the data to map the scatterplot by colors.

corr_plot(iris, col.by = Species)

The function corr_coef() can be used to compute Pearson product-moment correlation coefficients with p-values. A correlation heat map can be created with the function plot()

(co)variance and correlations for designed experiments

The function covcor_design() may be used to compute genetic, phenotypic and residual correlation/(co)variance matrices through Analysis of Variance (ANOVA) method using randomized complete block design (RCBD) or completely randomized design (CRD).

The phenotypic ($r_p$), genotypic ($r_g$) and residual ($r_r$) correlations are computed as follows:

$$ r^p_{xy} = \frac{cov^p_{xy}}{\sqrt{var^p_{x}var^p_{y}}} \ r^g_{xy} = \frac{cov^g_{xy}}{\sqrt{var^g_{x}var^g_{y}}} \ r^r_{xy} = \frac{cov^r_{xy}}{\sqrt{var^r_{x}var^r_{y}}} $$

Using Mean Squares (MS) from the ANOVA method, the variances (var) and covariances (cov) are computed as follows:

$$ cov^p_{xy} = [(MST_{x+y} - MST_x - MST_y)/2]/r \ var^p_x = MST_x / r \ var^p_y = MST_y / r \ cov^r_{xy} = (MSR_{x+y} - MSR_x - MSR_y)/2 \ var^r_x = MSR_x \ var^r_y = MSR_y \ cov^g_{xy} = [(cov^p_{xy} \times r) - cov^r_{xy}]/r \ var^g_x = (MST_x - MSE_x)/r \ var^g_y = (MST_x - MSE_y)/r \ $$

where MST is the mean square for treatment, MSR is the mean square for residuals, and r is the number of replications.

The function covcor_design() returns a list with the matrices of (co)variances and correlations. Specific matrices may be returned using the argument type, as shown bellow.

Genetic correlations

# environment A1
data <- subset(data_ge2, ENV == "A1")
gcor <- covcor_design(data, gen = GEN, rep = REP,
                      resp = c(PH, EH, NKE, TKW, CL, CD, CW, KW),
                      type = "gcor") %>% 
  as.data.frame()
print_table(gcor, rownames = TRUE)

Phenotypic correlations

pcor <- covcor_design(data, gen = GEN, rep = REP,
                      resp = c(PH, EH, NKE, TKW, CL, CD, CW, KW),
                      type = "pcor") %>% 
  as.data.frame()
print_table(pcor, rownames = TRUE)

Residual correlations

rcor <- covcor_design(data,
                      gen = GEN,
                      rep = REP,
                      resp = c(PH, EH, NKE, TKW, CL, CD, CW, KW),
                      type = "rcor") %>% 
  as.data.frame()
print_table(rcor, rownames = TRUE)

Residual (co)variance matrix

In this example we will obtain the residual (co)variance for each environment.

cov <- covcor_design(data_ge2,
                     gen = GEN,
                     rep = REP,
                     resp = c(PH, EH, NKE, TKW, CL, CD, CW, KW),
                     type = "rcov")

The residual (co)variance matrix and the means (obtained using type = "means") may be used into the function mahala() to compute the Mahalanobis distance

res <- covcor_design(data, GEN, REP,
                     resp = c(PH, EH, NKE, TKW, CL, CD, CW, KW),
                     type = "rcov")
means <- covcor_design(data, GEN, REP,
                       resp = c(PH, EH, NKE, TKW, CL, CD, CW, KW),
                       type = "means")

D2 <- mahala(.means = means, covar = res, inverted = FALSE) %>% 
  as.data.frame()
print_table(D2, rownames = TRUE)

Nonparametric confidence interval for Pearson's correlation

Recently, a Gaussian-independent estimator for the confidence interval of Pearson's correlation coefficient was proposed by @Olivoto2018. This estimator is based on the sample size and strength of associations and may be estimated using the function corr_ci(). It is possible to estimate the confidence interval by declaring the sample size (n) and the correlation coefficient (r), or using a dataframe. The following code computes the confidence interval and make a plot to show the results.

# Use a data frame
corr_ci(data_ge2, contains("E"), verbose = FALSE) %>%
plot_ci()

In the following examples, the confidence interval is calculated by declaring the sample size (n) and the correlation coefficient (r). Using the argument by = ENV the confidence interval can be calculated within each level of the factor ENV.

# Inform n and r
corr_ci(n = 145, r = 0.34)

# Compute the confidence for each level of ENV
CI2 <- corr_ci(data_ge2, contains("E"), by = ENV)

Sample size planning

corr_ss(r = 0.6, CI = 0.1)

Collinearity diagnostic

The following codes compute a complete collinearity diagnostic of a correlation matrix of predictor traits. Several indicators, such as Variance Inflation Factor, Condition Number, and Matrix Determinant are considered [@Olivoto2017f; @Olivoto2017c] The diagnostic may be performed using: (i) correlation matrices; (ii) dataframes, or (iii) an object of class group_factor, which split a dataframe into subsets based on one or more grouping factors.

Using a correlation matrix, which was estimated earlier

cor_data <- lpc1$linear.mat
n <- nrow(data_ge2)
cold1 <- colindiag(cor_data, n = n)

Using a dataframe

cold2 <- colindiag(data_ge2)

Perform the diagnostic for each level of the factor ENV

cold3 <- colindiag(data_ge2 , by = ENV)

Path analysis

Using KW as response trait and all other ones as predictors.

pcoeff <- path_coeff(data_ge2, resp = KW)

Declaring the predictor traits

pcoeff2 <-
  path_coeff(data_ge2,
             resp = KW,
             pred = c(PH, NKE, TKW),
             verbose = FALSE)
print(pcoeff2)

Selecting traits to be excluded from the analysis.

pcoeff2 <- 
  path_coeff(data_ge2, 
             resp = KW,
             pred = c(PH, EH, NKE, TKW),
             exclude = TRUE,
             verbose = FALSE)

Selecting a set of predictors with minimal multicollinearity

pcoeff3 <- 
  path_coeff(data_ge2,
             resp = KW,
             brutstep = TRUE,
             maxvif = 5)

Compute the analysis for each level of environment

pcoeff4 <- 
  path_coeff(data_ge2,
             resp = KW,
             pred = c(PH, EH, NKE, TKW),
             by = ENV)

Canonical correlation analysis

First of all, we will rename the plant-related traits PH, EH EP with the suffix _PLA to show the usability of the select helper contains().

data_cc <- rename(data_ge2,
                  PH_PLA = PH,
                  EH_PLA = EH,
                  EP_PLA = EP)

# Type the variable names
cc1 <- can_corr(data_cc,
                FG = c(PH_PLA, EH_PLA, EP_PLA),
                SG = c(EL, ED, CL, CD, CW, KW))

# Use select helpers
cc2 <- can_corr(data_cc,
                FG = contains("_PLA"),
                SG = c(EL, ED, CL, CD, CW, KW))

Clustering analysis

Using function clustering()

All rows and all numeric variables from data

d1 <- clustering(data_ge2)

Based on the mean for each genotype

mean_gen <-
 data_ge2 %>%
 mean_by(GEN) %>%
 column_to_rownames("GEN")

d2 <- clustering(mean_gen)

The S3 generic function plot() may be used to plot the dendrogram generated by the function clustering(). A dashed line is draw at the cutpoint suggested according to @Mojena1977.

plot(d2)

According to the suggested cutpoint, two clusters are formed. The number of clusters may also be found using intensive computation. I suggest the package pcvlust, an R package for assessing the uncertainty in hierarchical cluster analysis. The implementation may be see below.

library(pvclust)
pv_clust <- pvclust(t(d2$data), nboot = 100, method.dist = "euclidean")
plot(pv_clust, hang = -1, cex = 0.5)
pvrect(pv_clust, alpha = 0.95)

Indicating the variables to compute the distances

It is possible to indicate the variables from the data_ge2 to compute the distances. To do that is easy. You should only provide a comma-separated list of unquoted variable names after the .data argument. For example, to compute the distances between the genotypes based on the variables NKR, TKW, and NKE, the following arguments should be used.

d3 <- clustering(mean_gen, NKR, TKW, NKE)

Select variables for compute the distances

When selvar = TRUE is used, an algorithm for variable selection is implemented. See ?clustering for more details.

d4 <- clustering(mean_gen, selvar = TRUE)

The distances were computed using the variables ED, CW, KW, NKR, TKW, and NKE. By using these variables the highest cophenetic correlation coefficient (0.8658) was observed. The Mantel's correlation estimated with the distance matrix of Model 10 (selected variables) with the original distance matrix (estimated with all variables) was near to 1, suggesting that the deletion of the variables to compute the distance don't affect significantly the computation of the distances.

Extending the dendrogram Functionality

The package dendextend offers a set of functions for extending 'dendrogram' objects in R. A simple example is given bellow.

library(dendextend)
d4$hc %>%
  color_labels(k = 2, col = c("red", "blue")) %>%
  branches_color(k = 2, col = c("red", "blue")) %>%
  highlight_branches_lwd() %>%
plot(horiz = TRUE, xlab = "Euclidean distance")

Compute the distances for each environment

d6 <- clustering(data_ge2,
                 by = ENV,
                 nclust = 2)

Check the correlation between distance matrices

The function pairs_mantel() may be used to check the relationships between the distance matrices when the clustering is performed for each level of a grouping factor. In this example, we have four distance matrices corresponding to four environments.

pairs_mantel(d6, names = c("A1", "A2", "A3", "A4"))

The low values of correlation between the distance matrices suggest that the genotype clustering should vary significantly among environments.

Mahalanobis distance

Based on designed experiments

# Environment E1
data_E1 <- subset(data_ge2, ENV == "A1")
D2_des <- mahala_design(data_E1,
                        gen = GEN,
                        rep = REP,
                        resp = c(PH, EH, NKE, TKW, CL, CD, CW, KW)) %>% 
  as.data.frame()
print_table(D2_des, rownames = TRUE)

Compute one distance for each environment

To compute the Mahalanobis distance for each environment (or any grouping variable) we can use the argument by. Note that pairs_mantel() is used to compare compute the Mantel's test for each combination of distance matrices. Let's do it.

data_ge2 %>%
mahala_design(gen = GEN,
              rep = REP,
              resp = c(PH, EH, NKE, TKW, CL, CD, CW, KW),
              by = ENV)

If I have the matrices of means and covariances

Lets suppose we want compute the Mahalanobis' distance for each pairwise genotype comparison based on cob-related traits. Note that the function select(contains("C")) is used to select the cob-relate traits, after computing the mean for each genotype.

means <- 
  data_ge2 %>%
  mean_by(GEN) %>% 
  column_to_rownames("GEN") %>%
  select(contains("C"))

The next step is to compute the variance-covariance matrix for the means. The first approach combines R base functions with some functions from metan package to compute the covariance matrix. Of course, the simplest way is by using cov().

# Compute the covariance matrix (by hand)
cov_mat <- matrix(0, 4, 4)
dev_scores <- sweep(means, 2, colMeans(means), FUN = "-")
comb_v <- comb_vars(dev_scores, FUN = "*")
cov_mat[lower.tri(cov_mat, diag = F)] <- colSums(comb_v/(nrow(means) - 1))
rownames(cov_mat) <- colnames(cov_mat) <- colnames(means)
cov_mat <- make_sym(cov_mat, diag = diag(var(means)))

# Compute the covariance using cov()
covmat2 <- cov(means)

# Check if the two matrices are equal
all.equal(cov_mat, covmat2)

After computing the means and covariance matrices we are able to compute the Mahalanobis distance using the function mahala().

D2 <- mahala(means, covar = cov_mat)

# Dendrogram
D2 %>% as.dist() %>% hclust() %>% as.dendrogram() %>% plot()

Rendering engine {#rendering}

This vignette was built with pkgdown. All tables were produced with the package DT using the following function.

library(DT) # Used to make the tables
# Function to make HTML tables
print_table <- function(table, rownames = FALSE, digits = 3, ...){
datatable(table, rownames = rownames, extensions = 'Buttons',
          options = list(scrollX = TRUE, 
                         dom = '<<t>Bp>',
                         buttons = c('copy', 'excel', 'pdf', 'print')), ...) %>%
    formatSignif(columns = c(as.numeric(which(sapply(table, class) == "numeric"))), digits = digits)}

References



TiagoOlivoto/WAASB documentation built on March 20, 2024, 4:18 p.m.