PCA_biplot: The PCA biplot with loadings

View source: R/PCA_biplot.R

PCA_biplotR Documentation

The PCA biplot with loadings

Description

[Stable]

  • PCA_biplot() creates a PCA (Principal Component Analysis) biplot with loadings for the new rYWAASB index, used for the simultaneous selection of genotypes by trait and the WAASB index. It displays the rYWAASB, rWAASB and rWAASBY indices (r: ranked) together in one biplot, for a better differentiation of genotypes. In the PCA biplots, the colors of the variables are controlled by their contributions (contrib) and squared cosines (cos2).

Usage

PCA_biplot(datap, lowt = FALSE)

Arguments

datap

The data set

lowt

A logical parameter indicating whether lower values of the trait are preferred. For grain yield, for example, higher values are preferred (lowt = FALSE); for plant height, lower values are preferred (lowt = TRUE).

Details

PCA is a machine-learning and dimension-reduction technique. It is used to simplify large data sets by extracting a smaller set of variables that preserves the significant patterns and trends (1). According to Johnson and Wichern (2007), PCA explains the variance-covariance structure of a set of variables X_1, X_2, ..., X_p with a few linear combinations of these variables. The common objectives of PCA are (1) data reduction and (2) interpretation.

Biplot and PCA: The biplot is a method for visually representing both the rows and columns of a data table. The table is approximated by a rank-two matrix product, with the aim of displaying the rows and columns together in a plane. The techniques used in a biplot typically involve an eigendecomposition, similar to the one used in PCA, and the biplot is commonly computed from mean-centered and scaled data (2). For scaling, each variable can be transformed as follows:

z = (x - x̄) / s(x)

where s(x) denotes the sample standard deviation of x, calculated as:

s(x) = sqrt( sum_{i=1}^{n} (x_i - x̄)^2 / (n - 1) )

Algebra of PCA: As Johnson and Wichern (2007) state (3), let the random vector X' = [X_1, X_2, ..., X_p] have the covariance matrix Σ with eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_p ≥ 0.
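As a quick illustration (not part of PCA_biplot() itself), the scaling above matches what base R's scale() computes:

```r
# Standardize a numeric vector: z = (x - mean(x)) / s(x)
x <- c(2, 4, 6, 8, 10)
s_x <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))  # sample standard deviation
z <- (x - mean(x)) / s_x
# Base R's scale() applies the same centering and scaling:
all.equal(as.numeric(scale(x)), z)  # TRUE
```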

Consider the linear combinations:

Y_1 = a'_1 X = a_11 X_1 + a_12 X_2 + ... + a_1p X_p
Y_2 = a'_2 X = a_21 X_1 + a_22 X_2 + ... + a_2p X_p
...
Y_p = a'_p X = a_p1 X_1 + a_p2 X_2 + ... + a_pp X_p

where

Var(Y_i) = a'_i Σ a_i,  i = 1, 2, ..., p
Cov(Y_i, Y_k) = a'_i Σ a_k,  i, k = 1, 2, ..., p

The principal components refer to the uncorrelated linear combinations \mjseqnY_1, Y_2, ..., Y_p which aim to have the largest possible variances.

For the random vector X' = [X_1, X_2, ..., X_p] with associated covariance matrix Σ, let Σ have the eigenvalue-eigenvector pairs (λ_1, e_1), (λ_2, e_2), ..., (λ_p, e_p), where, as above, λ_1 ≥ λ_2 ≥ ... ≥ λ_p ≥ 0.

Then the i-th principal component is:

Y_i = e'_i X = e_i1 X_1 + e_i2 X_2 + ... + e_ip X_p,  i = 1, 2, ..., p

where

Var(Y_i) = e'_i Σ e_i = λ_i,  i = 1, 2, ..., p
Cov(Y_i, Y_k) = e'_i Σ e_k = 0,  i ≠ k

and:

σ_11 + σ_22 + ... + σ_pp = sum_{i=1}^{p} Var(X_i) = λ_1 + λ_2 + ... + λ_p = sum_{i=1}^{p} Var(Y_i).

Interestingly, the total population variance σ_11 + σ_22 + ... + σ_pp equals λ_1 + λ_2 + ... + λ_p.
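This identity is easy to verify numerically with base R's eigen() (a sketch on simulated data, unrelated to the internals of PCA_biplot()):

```r
# Check: the eigenvalues of the sample covariance matrix S
# sum to the total sample variance sum(diag(S)).
set.seed(1)
X <- matrix(rnorm(100 * 3), ncol = 3)
S <- cov(X)
lambda <- eigen(S)$values            # lambda_1 >= lambda_2 >= lambda_3 >= 0
all.equal(sum(lambda), sum(diag(S))) # TRUE
```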

Other quantities that are significant in a PCA are:

  1. The proportion of the total variance explained by the k-th principal component:

     λ_k / (λ_1 + λ_2 + ... + λ_p),  k = 1, 2, ..., p

  2. The correlation coefficients between the components Y_i and the variables X_k:

     ρ(Y_i, X_k) = e_ik sqrt(λ_i) / sqrt(σ_kk),  i, k = 1, 2, ..., p
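Both quantities can be computed from an eigendecomposition in base R. The sketch below uses simulated data and the correlation matrix (so that σ_kk = 1); it is an illustration, not how PCA_biplot() is implemented:

```r
set.seed(2)
X <- matrix(rnorm(50 * 4), ncol = 4)
e <- eigen(cor(X))
# 1. Proportion of total variance explained by each component:
prop <- e$values / sum(e$values)
# 2. Correlations rho(Y_i, X_k) = e_ik * sqrt(lambda_i) / sqrt(sigma_kk);
#    with a correlation matrix, sigma_kk = 1:
rho <- e$vectors %*% diag(sqrt(e$values))
# Each variable's squared correlations over all components sum to 1:
rowSums(rho^2)
```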

Please note that PCA can be performed on covariance or correlation matrices, and the data should generally be centered beforehand.
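With prcomp() this choice corresponds to the center and scale. arguments. A small sketch (the variable names are illustrative), showing why the choice matters when variances differ:

```r
set.seed(3)
dat <- data.frame(small = rnorm(30, sd = 1), large = rnorm(30, sd = 10))
p_cov <- prcomp(dat, center = TRUE, scale. = FALSE)  # covariance-based PCA
p_cor <- prcomp(dat, center = TRUE, scale. = TRUE)   # correlation-based PCA
p_cov$sdev^2  # dominated by the high-variance variable 'large'
p_cor$sdev^2  # variables contribute on an equal footing; eigenvalues sum to 2
```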

Value

Returns a list of data frames.

Author(s)

Ali Arminian abeyran@gmail.com

References

(1) https://builtin.com

(2) https://pca4ds.github.io/biplot-and-pca.html.

(3) Johnson, R.A. and Wichern, D.W. 2007. Applied Multivariate Statistical Analysis. Pearson Prentice Hall. 773 p.

Examples

# Case 1: for maize dataset, grain yield

data(maize)
PCA_biplot(maize) # or: PCA_biplot(maize, lowt = FALSE)

# Case 2: for days to maturity (dm) trait of chickpea

data(dm)
PCA_biplot(dm, lowt = TRUE)


rYWAASB documentation built on June 10, 2025, 9:12 a.m.