panPca: Principal component analysis of a pan-matrix
In micropan: Microbial Pan-Genome Analysis


Computes a principal component decomposition of a pan-matrix, with possible scaling and weightings.

panPca(pan.matrix, scale = 0, weights = rep(1, ncol(pan.matrix)))

`pan.matrix`	A pan-matrix, see `panMatrix` for details.
`scale`	An optional scale to control how copy numbers should affect the distances.
`weights`	Vector of optional weights of gene clusters.

A principal component analysis (PCA) can be computed for any matrix, including a pan-matrix. The principal components will in this case be linear combinations of the gene clusters. One major idea behind PCA is to truncate the space: instead of considering the genomes as points in a high-dimensional space spanned by all gene clusters, we look for a few ‘smart’ combinations of the gene clusters and visualize the genomes in a low-dimensional space spanned by these directions.

The scale can be used to control how copy number differences play a role in the PCA. Usually we assume that going from 0 to 1 copy of a gene is the major change in a genome, and that going from 1 to 2 (or more) copies matters less. Prior to computing the PCA, the pan.matrix is transformed according to the following affine mapping: if the original value in pan.matrix is x, and x is not 0, then the transformed value is 1 + (x-1)*scale. Note that with scale=0.0 (default) this results in 1 regardless of how large x was. In this case the PCA only distinguishes between presence and absence of gene clusters. If scale=1.0 the value x is left untransformed. In this case the difference between 1 copy and 2 copies is just as big as between 1 copy and 0 copies. For any scale between 0.0 and 1.0 the transformed value is shrunk towards 1, but a certain effect of larger copy numbers is still present. In this way you can decide whether, and to what degree, the PCA should be affected by differences in copy numbers beyond 1.
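The mapping described above can be tried out on a small toy matrix. The following is only a sketch in base R: the helper `scale_copies` and the matrix `pm` are hypothetical names made up for illustration, not part of micropan, and the code is not necessarily how `panPca` implements the transformation internally.

```r
# Toy pan-matrix: 3 genomes (rows) x 4 gene clusters (columns), entries are copy numbers.
pm <- matrix(c(0, 1, 3,
               1, 1, 1,
               2, 0, 5,
               0, 0, 1),
             nrow = 3,
             dimnames = list(c("G1", "G2", "G3"), c("C1", "C2", "C3", "C4")))

# Hypothetical helper (not a micropan function): applies 1 + (x - 1) * scale
# to every non-zero entry and leaves zeros untouched.
scale_copies <- function(x, scale = 0) {
  stopifnot(scale >= 0, scale <= 1)
  present <- x > 0
  x[present] <- 1 + (x[present] - 1) * scale
  x
}

scaled0    <- scale_copies(pm, scale = 0)    # presence/absence only
scaledhalf <- scale_copies(pm, scale = 0.5)  # copy numbers shrunk towards 1
scaled1    <- scale_copies(pm, scale = 1)    # copy numbers left as they are
```

With scale = 0.5, for example, a cluster present in 3 copies is mapped to 1 + (3-1)*0.5 = 2, so extra copies still count, but less than the step from absence to presence.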

The PCA may also up- or downweight some clusters compared to others. The vector weights must contain one value for each column in pan.matrix. The default is to use flat weights, i.e. all clusters count equally. See geneWeights for alternative weighting strategies.
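To illustrate what one weight per column means, the sketch below multiplies each column of a toy matrix by its weight before any PCA would be computed. That the weights enter as plain column-wise multipliers is an assumption made here for illustration; the matrix `pm` and the weight values in `w` are invented.

```r
# Toy presence/absence pan-matrix: 3 genomes x 3 gene clusters.
pm <- matrix(c(1, 1, 0,
               1, 0, 0,
               1, 1, 1),
             nrow = 3,
             dimnames = list(c("G1", "G2", "G3"), c("C1", "C2", "C3")))

# Made-up weights, one per gene cluster (column).
w <- c(1, 2, 0.5)
stopifnot(length(w) == ncol(pm))

# Assumption: weighting is a column-wise multiplication.
weighted <- sweep(pm, 2, w, `*`)
```

Under this assumption, a weight above 1 stretches a cluster's column and so increases its share of the total variance, while a weight below 1 shrinks it.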

A list with three tables:

Evar.tbl has two columns, one listing the component number and one listing the relative explained variance for each component. The relative explained variance always sums to 1.0 over all components. This value indicates the importance of each component, and it is always in descending order, the first component being the most important. This is typically the first result you look at after a PCA has been computed, as it indicates how many components (directions) you need to capture the bulk of the total variation in the data.

Scores.tbl has a column listing the GID.tag for each genome, and then one column for each principal component. The columns are ordered corresponding to the components in Evar.tbl. The scores are the coordinates of each genome in the principal component space.

Loadings.tbl is similar to Scores.tbl but contains values for each gene cluster instead of each genome. The columns are ordered corresponding to the components in Evar.tbl. The loadings are the contributions from each gene cluster to the principal component directions. NOTE: Only gene clusters having a non-zero variance are used in a PCA. Gene clusters with the same value for every genome have no impact and are discarded from Loadings.tbl.
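The three quantities described above can be reproduced in spirit with base R's `prcomp` on a toy matrix. This is a sketch under assumptions, not the code behind `panPca`: the matrix `pm` is invented, and using `prcomp` with its default centering is a choice made for this illustration.

```r
# Toy pan-matrix: 5 genomes x 4 gene clusters. C2 is present in every genome,
# so it has zero variance across genomes.
pm <- matrix(c(1, 1, 1, 0, 0,
               1, 1, 1, 1, 1,
               1, 1, 0, 0, 0,
               0, 0, 0, 1, 1),
             nrow = 5,
             dimnames = list(paste0("G", 1:5), paste0("C", 1:4)))

# Keep only gene clusters that vary across genomes.
keep <- apply(pm, 2, sd) > 0

# PCA on the remaining columns (prcomp centers the columns by default).
pca <- prcomp(pm[, keep, drop = FALSE])

# Relative explained variance per component; sums to 1 and is in descending order.
evar <- pca$sdev^2 / sum(pca$sdev^2)

# Scores: one row per genome. Loadings: one row per retained gene cluster.
scores   <- pca$x
loadings <- pca$rotation
```

Here `loadings` has no row for C2, because that cluster was dropped before the decomposition, while `scores` still has one row for every genome.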

Lars Snipen and Kristian Hovde Liland.

distManhattan, geneWeights.

# Loading a pan-matrix in this package
data(xmpl.panmat)

# Computing panPca
ppca <- panPca(xmpl.panmat)

## Not run: 
# Plotting explained variance
library(ggplot2)
ggplot(ppca$Evar.tbl) +
  geom_col(aes(x = Component, y = Explained.variance))
# Plotting scores
ggplot(ppca$Scores.tbl) +
  geom_text(aes(x = PC1, y = PC2, label = GID.tag))
# Plotting loadings
ggplot(ppca$Loadings.tbl) +
  geom_text(aes(x = PC1, y = PC2, label = Cluster))

## End(Not run)