VIDTAXA | R Documentation |
Identification of the different taxa based on the morphological variability observed in a Principal Components Analysis or a Correspondence Analysis.
VIDTAXA(data, var, labels, cat=NULL, analysis="PCA", por=80, k=NULL, pthreshold=0.05, ellipse=FALSE, convex=FALSE, dim=c(1,2), size=c(1,5), showCluster=TRUE, VIF=FALSE, VARSEDIG=TRUE, BUBBLE=TRUE, threshold=10, method="overlap", minimum=TRUE, ResetPAR=TRUE, PAR=NULL, PCA=NULL, SCATTERPLOT=NULL, HCLUST=NULL, CLUSTER=NULL, BOXPLOT=NULL, mfrowBOXPLOT=NULL, LabelCat=NULL, COLOR=NULL, COLORC=NULL, COLORB=NULL, PCH=NULL, XLIM=NULL, YLIM=NULL, XLAB=NULL, YLAB=NULL, ylabBOXPLOT=NULL, LEGEND=NULL, MTEXT= NULL, TEXTvar=NULL, TEXTlabels=NULL, arrows=TRUE, larrow=0.7, colArrows="black", quadratic=FALSE, file1="Output.txt", file2="Cat loadings.csv", file3="Descriptive statistics of clusters.csv", file4="Original data and cluster number.csv", file5="Var loadings-Linear.csv", file6="Cat loadings-Linear.csv", file7="Table cross-validation-Linear.csv", file8="Cases cross-validation-Linear.csv", file9="Table cross-validation-Quadratic.csv", file10="Cases cross-validation-Quadratic.csv", file11="Plots VARSEDIG.pdf", file12="U Mann-Whitney test.csv", na="NA", dec=",", row.names=TRUE)
data |
Data file. |
var |
Variables that are included in the analysis. |
labels |
Variable that provides the label displayed for each case. |
cat |
Optionally, it is possible to specify a variable to show a grouping in the plot of the Principal Components or Correspondence analyses. |
analysis |
If it is "PCA" a Principal Components analysis is carried out, whereas a Correspondence analysis is performed if the selection is "CA". |
por |
Cut-off threshold specifying the cumulative variance percentage, to determine how many axes are selected from the Principal Components or Correspondence analyses. By default it is 80%, which means that the axes are selected until reaching an accumulated variance percentage of 80%. |
k |
Number of clusters into which the Dendrogram is divided. If it is NULL, the algorithm automatically selects the maximum number of clusters into which the Dendrogram can be divided, that is, those groups that are statistically different in at least one variable according to the U Mann-Whitney test. |
pthreshold |
Threshold probability of the U Mann-Whitney test. |
ellipse |
If it is TRUE, the ellipses at the 0.5 (inner ellipse) and 0.95 (outer ellipse) levels are depicted for each category of the variable cat. These levels can be modified by passing the argument levels=c(0.5,0.95) to the function scatterplot through the argument SCATTERPLOT. If it is TRUE, the ellipses of the clusters in the Discriminant analysis and in the polar coordinate plot of the VARSEDIG algorithm are also calculated. |
convex |
If it is TRUE, the convex hull is calculated for each category in the plot of the Principal Components or Correspondence analyses, but only if some variable has been selected in the argument cat. If TRUE, the convex hull of the clusters is also calculated in the Discriminant analysis and in the polar coordinate plot of the VARSEDIG algorithm. |
dim |
Vector with two values indicating the axes that are shown in the plot of the Principal Components or Correspondence analyses. |
size |
Size range of bubbles. Two values: minimum and maximum size. |
showCluster |
If it is TRUE, the number of each cluster is shown in the Dendrogram. |
VIF |
If it is TRUE, the variance inflation factor (VIF) is used to select the highly correlated variables and, therefore, uncorrelated variables are excluded from the Principal Components analysis. |
VARSEDIG |
If it is TRUE, the VARSEDIG algorithm is performed. |
BUBBLE |
If it is TRUE, the BUBBLE plot is depicted. |
threshold |
Cut-off value for the VIF. |
method |
Three different methods for prioritizing the variables according to their capacity for discrimination can be used in the VARSEDIG algorithm. If the method is "overlap", a density curve is obtained for each variable and the overlap of the area under the curve between the two groups of the variable group is estimated for all variables. Those variables with lower overlap should have better discrimination capacities and, hence, all variables are ordered from lowest to highest overlap; in other words, from the highest to lowest discrimination capacity. If the method is "Monte-Carlo", a Monte-Carlo test is performed comparing all values of group 1 with group 2, and all values of group 2 with 1. The variables are prioritized from the variable with the lowest mean of all p-values (highest discrimination capacity) to the variable with the highest mean of all p-values (lowest discrimination capacity). If the method is "logistic regression", then a binomial logistic regression is calculated and only significant variables are selected for further analyses with the regression performed by steps using the Akaike Information Criterion (AIC). |
minimum |
If it is TRUE, the algorithm is designed to find a significant discrimination between both groups with the minimum possible number of significant variables. Therefore, only the variables with higher discrimination capacity are selected. If it is FALSE, the algorithm selects all significant variables, and not only those with higher discrimination capacity. This argument is only valid with the methods "Monte-Carlo" and "overlap", and it is useful in those cases where discrimination between the groups is difficult and requires including as many variables as possible. |
ResetPAR |
If it is FALSE, the default settings of the function PAR are not restored and those defined by the user for previous graphics are maintained. |
PAR |
It gives access to the PAR function, which allows many different aspects of the graphs to be modified. |
PCA |
It accesses the prcomp function of the stats package. |
SCATTERPLOT |
It accesses the function scatterplot of the car package. |
HCLUST |
You may access the function hclust of the stats package. |
CLUSTER |
Access to the function that allows the graphic representation of the Dendrogram to be modified. |
BOXPLOT |
It allows the characteristics of the boxplot to be specified. |
mfrowBOXPLOT |
It specifies the layout of the boxplot panel. It is a vector with two numbers, for example c(2,5), which means that the boxplots are arranged in 2 rows and 5 columns. |
LabelCat |
Vector with the names of the clusters in the boxplots. There must be as many names as clusters. |
COLOR |
It modifies the colours in the plot of the Principal Components or Correspondence analyses. There must be as many colours as groups in the variable cat. |
COLORC |
It modifies the colours of the clusters in the Dendrogram. There must be as many colours as clusters. |
COLORB |
It modifies the colours of the clusters in the boxplots. There must be as many colours as clusters. |
PCH |
Vector with the symbols in the plot of the Principal Components or Correspondence analyses. There must be as many symbols as groups in the variable cat. If it is NULL, they are assigned automatically starting with the symbol 15. |
XLIM, YLIM |
Vectors with the axes limits X and Y in the plot of the Principal Components or Correspondence analyses. |
XLAB, YLAB |
Legends of the axes X and Y in the plot of the Principal Components or Correspondence analyses. |
ylabBOXPLOT |
Vector with the legends of the Y axes of the boxplots. There should be as many legends as variables. |
LEGEND |
It allows a legend to be included or modified in the plot of the Principal Components or Correspondence analyses. |
MTEXT |
It allows text to be added in the margins of the plot of the Principal Components or Correspondence analyses. |
TEXTvar |
It allows the labels of the variables to be modified in the plot of the Principal Components or Correspondence analyses. |
TEXTlabels |
It allows the labels of the cases to be modified in the plot of the Principal Components or Correspondence analyses. |
arrows |
If it is TRUE, the arrows are shown in the plot of the Principal Components or Correspondence analyses. |
larrow |
It modifies the length of the arrows in the plot of the Principal Components or Correspondence analyses. |
colArrows |
Colours of the arrows in the plot of the Principal Components or Correspondence analyses. |
quadratic |
If TRUE, a Quadratic Discriminant Analysis is performed, in addition to the Linear Discriminant Analysis. |
file1 |
TXT FILE. Name of the output file with the results. |
file2 |
CSV FILE. Name of the output file with the coordinates of the cases in the plot of the Principal Components or Correspondence analyses. |
file3 |
CSV FILE. Name of the output file with the descriptive statistics of each variable for each of the clusters obtained in the Dendrogram. |
file4 |
CSV FILE. Name of the output file with the original data of the variables and the cluster to which each case belongs. |
file5 |
CSV FILE. Name of the output file with the coordinates of the variables in the Linear Discriminant Analysis plot. |
file6 |
CSV FILE. Name of the output file with the coordinates of the categories in the Linear Discriminant Analysis plot. |
file7 |
CSV FILE. Name of the output file with the prediction table using the cross-validation of the Linear Discriminant Analysis. |
file8 |
CSV FILE. Name of the output file with the group to which each case belongs and the prediction of the Discriminant Analysis using the cross-validation of the Linear Discriminant Analysis. |
file9 |
CSV FILE. Name of the output file with the predictions table using the cross-validation of the Quadratic Discriminant Analysis. |
file10 |
CSV FILE. Name of the output file with the group to which each case belongs and the prediction of the Discriminant Analysis using the cross-validation of the Quadratic Discriminant Analysis. |
file11 |
PDF File. Name of the output file with the graphics obtained from the VARSEDIG algorithm. |
file12 |
CSV FILE. Name of the output file with the obtained probabilities of comparing all the variables among all the clusters with the U Mann-Whitney test. |
na |
CSV FILES. Text that is used in the cells without data. |
dec |
CSV FILES. It defines if a comma "," or a dot "." is used as decimal separator. |
row.names |
CSV FILES. Either a logical value that defines whether row identifiers are written, or a vector with a text for each of the rows. |
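The "overlap" idea described for the argument method can be illustrated with a minimal sketch in base R. This is not the VARSEDIG implementation: the helper density_overlap is hypothetical, the iris data set stands in for real morphometric data, and the bandwidth, grid and integration rule are assumptions made only for illustration.

```r
# Hypothetical helper: overlap of two kernel density curves on a common grid.
density_overlap <- function(x, y, n = 512) {
  rng <- range(c(x, y))
  # Extend the grid so that the tails of both curves are covered
  pad <- 3 * max(bw.nrd0(x), bw.nrd0(y))
  from <- rng[1] - pad
  to <- rng[2] + pad
  dx <- density(x, from = from, to = to, n = n)
  dy <- density(y, from = from, to = to, n = n)
  # Area under the pointwise minimum of both curves (rectangle rule)
  step <- dx$x[2] - dx$x[1]
  sum(pmin(dx$y, dy$y)) * step
}

# Two groups from a stand-in data set
g1 <- iris[iris$Species == "setosa", 1:4]
g2 <- iris[iris$Species == "versicolor", 1:4]
overlap <- sapply(names(g1), function(v) density_overlap(g1[[v]], g2[[v]]))

# Lowest overlap first, i.e. highest discrimination capacity first
ranking <- sort(overlap)
print(round(ranking, 3))
```

Variables at the top of ranking would be the first candidates for discriminating the two groups under this criterion.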
The aim of this analysis is to determine which statistically different groups are formed when applying a Principal Components or Correspondence analysis.
The first axis in a Principal Components analysis or Correspondence analysis is the linear combination of the original variables that has maximum variance. The second axis is the linear combination of the original variables with maximum variance under the added condition that it is independent of (orthogonal to) the first. Proceeding in this way, all the principal components can be obtained and, being independent of each other, they contain different information. The independence or absence of correlation means that the new variables or components do not share common information. Each principal component, therefore, explains the maximum possible residual variability (that which has not already been explained by the previous components). Therefore, in a Principal Components or Correspondence analysis the cases are differentiated according to the variables that have greater variability. The idea of the analysis is to determine if statistically different groups are formed in association with the variability observed in the variables.
This analysis can be useful to find different groups when you really do not know what they are. For example, it can be used to find different species using morphometric variables, without really knowing how many potential species there are and to what species each individual belongs. However, it is important to note that different groups will only be detected if the variables that have more variability give rise to different groups. It is possible that a variable does not present a great variability, but is important for discriminating groups. This type of differentiation, based on variables that do not have high variance, would not be detected in this analysis.
To detect the potential groups being formed, a Dendrogram is applied to the scores obtained from the axes that absorb a greater variance. By default, the axes that absorb 80% of the variability are chosen, but this value can be modified by the user.
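The step above can be sketched with functions from the stats package. This is only an illustration under stated assumptions: the iris data set stands in for real data, the variables are standardized, hclust is run with its default linkage on Euclidean distances, and k = 3 is chosen by hand. The settings that VIDTAXA uses internally may differ.

```r
# PCA on standardized variables (iris is a stand-in data set)
X <- iris[, 1:4]
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Cumulative percentage of variance absorbed by the axes
cumvar <- cumsum(pca$sdev^2) / sum(pca$sdev^2) * 100
por <- 80
n_axes <- which(cumvar >= por)[1]

# Dendrogram on the scores of the retained axes
scores <- pca$x[, seq_len(n_axes), drop = FALSE]
tree <- hclust(dist(scores))    # default linkage, an assumption
clusters <- cutree(tree, k = 3) # k fixed by hand for this sketch
print(table(clusters))
```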
Subsequently, a Discriminant Analysis is carried out to determine if the clusters that have been generated are well discriminated, that is, to determine the number of correctly identified cases in each cluster.
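A minimal sketch of this check, using lda from the MASS package with its built-in leave-one-out cross-validation (CV=TRUE). The iris species are used here as a stand-in for the cluster membership obtained from the Dendrogram; whether VIDTAXA builds its cross-validation table in exactly this way is an assumption.

```r
library(MASS)  # recommended package distributed with R

# Stand-in for cluster membership: the three iris species
grp <- iris$Species
fit <- lda(iris[, 1:4], grouping = grp, CV = TRUE)  # leave-one-out predictions

# Cross-validation table and percentage of correctly classified cases
cv_table <- table(observed = grp, predicted = fit$class)
pct_correct <- sum(diag(cv_table)) / sum(cv_table) * 100
print(cv_table)
print(round(pct_correct, 1))
```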
Next, a U Mann-Whitney test is performed to determine if there are significant differences in the variables between the clusters.
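For a single pair of clusters, the comparison can be sketched with wilcox.test from the stats package. The cluster numbers are again a stand-in (the iris species), and exact=FALSE is an assumption made here to avoid warnings about ties.

```r
X <- iris[, 1:4]
cl <- as.integer(iris$Species)  # stand-in for cluster numbers

# p-value of the U Mann-Whitney test for each variable, cluster 1 versus 2
pvals <- sapply(X, function(v) {
  wilcox.test(v[cl == 1], v[cl == 2], exact = FALSE)$p.value
})

pthreshold <- 0.05
significant <- names(pvals)[pvals <= pthreshold]
print(signif(pvals, 3))
print(significant)
```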
Finally, the algorithm of the VARSEDIG function is applied (for more details see Guisande et al., 2016). With this algorithm it is possible to determine if all the cases of each cluster are statistically different from the other clusters.
The idea of this function is to find the largest possible number of clusters with the highest discrimination percentage. To do this, the user should perform tests modifying both the cut-off threshold of cumulative variance percentage, which determines how many axes are selected from the Principal Components (by default por=80), and the variables to be included, eliminating those that are not correlated and are not useful in the Principal Components or Correspondence analyses, as well as those that have little discrimination power in the Discriminant Analysis.
FUNCTIONS
The Correspondence analysis was performed with the ca function of the package ca (Greenacre & Pardo, 2006; Greenacre, 2007; Nenadic & Greenacre, 2007; Greenacre, 2013). The Principal Components Analysis was performed with the prcomp function of the stats package. The vif function of the usdm package was used for the calculation of VIF (Naimi et al., 2014; Naimi, 2017). To perform the biplot graph the scatterplot function of the car package was used (Fox et al., 2018). The arrows are depicted with the function Arrows of the package IDPmisc (Locher & Ruckstuhl, 2014). The convex hull is estimated with the function chull of the package grDevices. The KMO test was performed with the function KMO of the package psych (Revelle, 2018). Bartlett's test of sphericity was performed with the function bart_spher of the package REdaS (Maier, 2015). The U Mann-Whitney test is performed with the wilcox.test function of the stats package. The comparison between clusters with the VARSEDIG algorithm is done with the VARSEDIG function of the VARSEDIG package (Guisande et al., 2016; Guisande, 2019). The Linear Discriminant Analysis was performed with the functions candisc of the candisc package (Friendly, 2007; Friendly & Fox, 2017) and lda of the MASS package (Venables & Ripley, 2002; Ripley et al., 2018). The Quadratic Discriminant Analysis was performed with the function qda of the MASS package (Venables & Ripley, 2002; Ripley et al., 2018). The graph with one dimension in the Discriminant analysis was performed with the function plot.cancor of the candisc package (Friendly, 2007; Friendly & Fox, 2017).
EXAMPLE
The example consists of analysing the morphometric variability of several species of Scorpaeniformes. The aim is to find how many groups are statistically different based on the morphometric variability observed in the Principal Components analysis. Solely for graphic presentation in the Principal Components plot, the genus is used as the category (cat="Genus"). It is important to highlight that the category is not used for any statistical analysis; it is simply used to group the cases with ellipses or with the convex hull in the Principal Components plot.
The analysis is performed by eliminating the variables that are not correlated, for which VIF=TRUE is specified. Therefore, the first result obtained is the VIF values of the variables. Those variables with a VIF lower than the threshold are not included in the Principal Components analysis.
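As an illustration of what a VIF value is, the sketch below computes it in base R as 1 / (1 - R^2), where R^2 comes from regressing each variable on all the others. VIDTAXA itself relies on the vif function of the usdm package; this hand-written version, the iris stand-in data and the exact comparison against the threshold are assumptions for illustration only.

```r
X <- iris[, 1:4]  # stand-in data set

# VIF of each variable: 1 / (1 - R^2) from regressing it on the others
vif_values <- sapply(names(X), function(v) {
  f <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})

threshold <- 10
# As described in this help page, variables below the threshold are dropped
kept <- names(vif_values)[vif_values >= threshold]
print(round(vif_values, 2))
print(kept)
```

The same values can be read from the diagonal of the inverse correlation matrix, diag(solve(cor(X))), which is a convenient cross-check.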
The second statistic obtained is the KMO test, which indicates whether the variables are adequate for the Principal Components analysis. The value must be greater than 0.5. Therefore, all variables that do not have a value greater than 0.5 could be eliminated from the analysis. If the value is exactly 0.5, it means that it was not possible to estimate the KMO.
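The Kaiser-Meyer-Olkin measure compares the squared correlations with the squared partial correlations between variables. The following is a minimal base R sketch of that textbook definition on a stand-in data set; VIDTAXA obtains the statistic from the function KMO of the psych package, so this code is only an illustration and may differ in details from that function.

```r
X <- iris[, 1:4]  # stand-in data set
R <- cor(X)
P <- solve(R)

# Partial correlations derived from the inverse correlation matrix
# (the sign is irrelevant here because only squares are used)
Q <- -P / sqrt(outer(diag(P), diag(P)))
diag(R) <- 0
diag(Q) <- 0

# Overall KMO and the measure of sampling adequacy of each variable
kmo_overall <- sum(R^2) / (sum(R^2) + sum(Q^2))
kmo_per_variable <- colSums(R^2) / (colSums(R^2) + colSums(Q^2))
print(round(kmo_overall, 2))
print(round(kmo_per_variable, 2))
```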
The next statistic that appears is Bartlett's test of sphericity, which tests whether the correlation matrix is an identity matrix, which would indicate that the factor model is inappropriate. A p-value smaller than the level of significance allows the null hypothesis to be rejected, concluding that there is correlation. Therefore, for the Principal Components analysis to be valid, the probability must be less than 0.05, as it is in this case.
Figure VIDTAXA.1 shows that the variability observed in the Principal Components analysis allows the genera to be clearly differentiated.
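Bartlett's statistic can be written as chi2 = -((n - 1) - (2p + 5) / 6) * log(det(R)) with p(p - 1) / 2 degrees of freedom, where n is the number of cases, p the number of variables and R the correlation matrix. A minimal base R sketch on a stand-in data set follows; VIDTAXA uses bart_spher from the REdaS package, so this hand-written version is only illustrative.

```r
X <- iris[, 1:4]  # stand-in data set
n <- nrow(X)
p <- ncol(X)
R <- cor(X)

# Bartlett's test of sphericity: H0 is that R is an identity matrix
chi2 <- -((n - 1) - (2 * p + 5) / 6) * log(det(R))
df <- p * (p - 1) / 2
p_value <- pchisq(chi2, df, lower.tail = FALSE)

print(c(chi2 = chi2, df = df, p_value = p_value))
```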
Figure VIDTAXA.1. Principal Components analysis showing the variability observed in the genera.
The first axis accounts for 54%, the second for 25.3% and the third for 8.5% of the variance observed. The first three axes explain 87.8% of the variance. Since the default value of por=80 was selected, these three Principal Component axes are selected.
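The selection of axes reported above can be reproduced as a short piece of arithmetic. The percentages are those given in the text; the comparison rule (first axis at which the cumulative percentage reaches por) is an assumption about how the cut-off is applied.

```r
# Percentage of variance of the first three axes, as reported in the text
variance <- c(54, 25.3, 8.5)
cumvar <- cumsum(variance)  # 54.0, 79.3, 87.8

por <- 80
# Number of axes needed to reach the cumulative percentage por
n_axes <- which(cumvar >= por)[1]
print(cumvar)
print(n_axes)
```

Two axes accumulate 79.3%, which is below 80%, so the third axis is also retained.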
Figure VIDTAXA.2 shows the Dendrogram, in which 6 clusters are formed that correspond to the six genera used in this example.
Figure VIDTAXA.2. Dendrogram with the scores of the axes selected from the Principal Components analysis.
Figure VIDTAXA.3 shows the differences between clusters for each of the variables. For instance, the difference in M21 for cluster 1 and in M6 for cluster 5 is clear.
Figure VIDTAXA.3. Boxplot obtained for each of the variables with the averaged values for each cluster.
The Discriminant Analysis shows that it is possible to correctly discriminate 100% of cases by cross-validation with the Linear method. The first discriminant axis explains most of the variability and discriminates well between the 6 clusters (Figure VIDTAXA.4). Many variables seem to be important for the discrimination since the arrows are not small. Figure VIDTAXA.5 shows the first two discriminant axes and shows the differences between the 6 clusters.
Figure VIDTAXA.4. Axis I of the Discriminant analysis.
Figure VIDTAXA.5. Axes I and II of the Discriminant analysis.
The next test to determine if the clusters are statistically different was the comparison of the variables between the clusters. The results of the U Mann-Whitney test are shown in Figure VIDTAXA.6. For clusters to be different, there must be at least one statistically different variable when comparing each cluster with all the others. The graph shows that in the comparison between all the clusters there is always a point, that is, there is always at least one variable that is different. In fact, the smallest number of statistically different variables, a total of 14, was observed between cluster 2 and cluster 4. Therefore, from the comparison of the variables between clusters with the U Mann-Whitney test, it is concluded that the clusters are statistically different from each other.
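The numbers behind such a bubble chart can be sketched as a matrix that counts, for every pair of clusters, how many variables are statistically different. The iris species are a stand-in for the cluster numbers, and exact=FALSE is an assumption made to avoid warnings about ties; this is not the code used by VIDTAXA.

```r
X <- iris[, 1:4]
cl <- as.integer(iris$Species)  # stand-in for cluster numbers
k <- max(cl)
pthreshold <- 0.05

# counts[i, j] = number of variables that differ between clusters i and j
counts <- matrix(NA_integer_, k, k, dimnames = list(1:k, 1:k))
for (i in 1:(k - 1)) {
  for (j in (i + 1):k) {
    p <- sapply(X, function(v) {
      wilcox.test(v[cl == i], v[cl == j], exact = FALSE)$p.value
    })
    counts[i, j] <- counts[j, i] <- sum(p <= pthreshold)
  }
}
print(counts)

# Every pair of clusters needs at least one differing variable
all_different <- all(counts[upper.tri(counts)] >= 1)
print(all_different)
```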
Figure VIDTAXA.6. Plot where the bubbles show the number of variables that are statistically different (p <= 0.05) between clusters.
Finally, the plots obtained from applying the VARSEDIG algorithm, whose objective is to compare all the clusters with each other, are saved in a PDF file.
Figure VIDTAXA.7 shows the example of the comparison of cluster 1 with cluster 2. It is observed that the variable that discriminates significantly between both clusters is M22 (upper right panel). The Monte-Carlo test showed that the individual in cluster 1 that most resembles cluster 2 (lower left panel) does not have significant differences in the polar coordinate axes X and Y (p = 0.1).
The individual in cluster 2 that most resembles cluster 1 (bottom right panel) is very close to the significance threshold on both the polar coordinate axes X and Y (p = 0.077). Therefore, it cannot be concluded that clusters 1 and 2 are different. The same process would be followed to compare the rest of the clusters.
Figure VIDTAXA.7. Plots obtained from the VARSEDIG algorithm. The comparison between clusters 1 and 2 is shown.
Therefore, according to the Discriminant Analysis and the tests performed with the U Mann-Whitney test, the clusters are statistically different from each other, but the VARSEDIG algorithm showed that not all clusters are statistically different. However, it is very important to emphasize that the VARSEDIG algorithm considers two statistically different groups if the case that most resembles each group is statistically different using the Monte-Carlo test. The Monte-Carlo test needs a large number of cases in each group for detecting significant differences. That is, it is possible that, as it was shown in the comparison of cluster 1 with cluster 2, the cases of both groups that resemble each other are not within the point cloud of the other group, but due to the low number of cases in each group, it is not possible to determine that the difference is not due to chance.
The following outputs are obtained:
1. A TXT file with the VIF (if the argument VIF=TRUE), the correlations between variables, the Kaiser-Meyer-Olkin (KMO) test, the Bartlett sphericity test and the results of the Principal Components or Correspondence analyses. The file is called by default "Output.txt".
2. A CSV FILE with the coordinates for each case of the Principal Components or Correspondence analyses. The file is called by default "Cat loadings.csv".
3. A CSV FILE with the descriptive statistics of each variable for each of the clusters obtained in the Dendrogram. The file is called by default "Descriptive statistics of clusters.csv".
4. A CSV FILE with the original data of the variables and the cluster to which each case belongs. The file is called by default "Original data and cluster number.csv".
5. A CSV FILE with the coordinates of the variables in the Linear Discriminant Analysis plot. The file is called by default "Var loadings-Linear.csv".
6. A CSV FILE with the coordinates of the categories in the Linear Discriminant Analysis plot. The file is called by default "Cat loadings-Linear.csv".
7. A CSV FILE with the predictions table using the cross-validation of Linear Discriminant Analysis. The file is called by default "Table cross-validation-Linear.csv".
8. A CSV FILE with the group to which each case belongs and the prediction of the Discriminant Analysis using the cross-validation of the Linear Discriminant Analysis. The file is called by default "Cases cross-validation-Linear.csv".
9. A CSV file with the predictions table using the cross-validation of the Quadratic Discriminant Analysis. The file is called by default "Table cross-validation-Quadratic.csv".
10. A CSV file with the group to which each case belongs and the prediction of the Discriminant Analysis using the cross-validation of the Quadratic Discriminant Analysis. The file is called by default "Cases cross-validation-Quadratic.csv".
11. A CSV file with the obtained probabilities of comparing all the variables among all the clusters with the U Mann-Whitney test. The file is called by default "U Mann-Whitney test.csv".
12. A PDF file with the graphics obtained from the VARSEDIG algorithm.
13. A scatterplot of the Principal Components or Correspondence analyses.
14. A Dendrogram grouping by clusters according to the scores of the Principal Components or Correspondence analyses.
15. A graphic panel with a boxplot for each variable comparing the values of these variables between each of the clusters obtained in the Dendrogram.
16. A Graph of the Discriminant Analysis showing the influence of the variables on the discriminant axis I, differentiating the different clusters.
17. A graph of the Discriminant Analysis showing the scores of the discriminant axes I and II, differentiating the different clusters.
18. A bubble chart with the number of variables that are statistically different between clusters.
Fox, J., Weisberg, S., Adler, D., Bates, D., Baud-Bovy, G., Ellison, S., Firth, D., Friendly, M., Gorjanc, G., Graves, S., Heiberger, R., Laboissiere, R., Monette, G., Murdoch, D., Nilsson, H., Ogle, D., Ripley, B., Venables, W. & Zeileis, A. (2018) Companion to Applied Regression. R package version 3.0-0. Available at: https://CRAN.R-project.org/package=car.
Friendly, M. & Fox, J. (2017) Visualizing Generalized Canonical Discriminant and Canonical Correlation Analysis. R package version 0.8-0. Available at: https://CRAN.R-project.org/package=candisc.
Friendly, M. (2007). HE plots for Multivariate General Linear Models. Journal of Computational and Graphical Statistics, 16: 421-444.
Greenacre, M. (2007) Correspondence Analysis in Practice. Second Edition. London: Chapman & Hall / CRC.
Greenacre, M. (2013). Simple, Multiple and Joint Correspondence Analysis. R package version 0.53. Available at: https://CRAN.R-project.org/package=ca.
Greenacre, M.J. & Pardo, R. (2006) Subset correspondence analysis: visualizing relationships among a selected set of response categories from a questionnaire survey. Sociological Methods and Research, 35: 193-218.
Guisande, C., Vari, R.P., Heine, J., Garcia-Rosello, E., Gonzalez-Dacosta, J., Perez-Schofield, B.J., Gonzalez-Vilas, L. & Pelayo-Villamil, P. (2016) VARSEDIG: an algorithm for morphometric characters selection and statistical validation in morphological taxonomy. Zootaxa, 4162: 571-580.
Guisande, C. (2019) An Algorithm for Morphometric Characters Selection and Statistical Validation in Morphological Taxonomy. R package version 2.0. Available at: https://CRAN.R-project.org/package=VARSEDIG.
Locher, R. & Ruckstuhl, A. (2014) Utilities of Institute of Data Analyses and Process Design. R package version 1.1.17. Available at: https://CRAN.R-project.org/package=IDPmisc.
Maier, M.J. (2015) Companion Package to the Book 'R: Einfuehrung durch angewandte Statistik'. R package version 0.9.3. Available at: https://CRAN.R-project.org/package=REdaS.
Naimi, B. (2017) Uncertainty analysis for species distribution models. R package version 1.1-18. Available at: https://CRAN.R-project.org/package=usdm.
Naimi, B., Hamm, N.A.S., Groen, T.A., Skidmore, A.K., & Toxopeus, A.G. (2014) Where is positional uncertainty a problem for species distribution modelling? Ecography, 37: 191-203.
Nenadic, O. & Greenacre, M. (2007) Correspondence analysis in R, with two- and three-dimensional graphics: The ca package. Journal of Statistical Software, 20: 1-13.
Revelle, W. (2018) Procedures for Psychological, Psychometric, and Personality Research. R package version 1.8.4. Available at: https://CRAN.R-project.org/package=psych.
Ripley, B., Venables, B., Bates, D.M., Hornik, K., Gebhardt, A. & Firth, D. (2018) Support Functions and Datasets for Venables and Ripley's MASS. R package version 7.3-50. Available at: https://CRAN.R-project.org/package=MASS.
Rizopoulos, D. (2006) ltm: An R package for latent variable modelling and item response theory analyses. Journal of Statistical Software, 17: 1-25.
Rizopoulos, D. (2018) Latent Trait Models under IRT. R package version 1.1-1. Available at: https://CRAN.R-project.org/package=ltm.
Venables, W.N. & Ripley, B.D. (2002) Modern Applied Statistics with S. Springer, fourth edition, New York. https://www.stats.ox.ac.uk/pub/MASS4/.
## Not run:
data(scorpaeniformes)

VIDTAXA(data=scorpaeniformes, var=c("M2","M3","M4","M5","M6","M7",
"M8","M9","M10","M11","M12","M13","M14","M15","M16","M19","M20",
"M21","M22","M23","M24","M25","M26","M27"), labels="Genus",
cat="Genus", VIF=TRUE, convex=TRUE)

## End(Not run)