knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

1. Load Dependencies

library(LRCoDa)
library(robCompositions)
library(dplyr)
library(sjmisc)
library(robustbase)
library(forcats)

2. Load data

Load the dataset "gemas" from the package robCompositions and recode one level from "" to "Not Specified", because the function LRCoDa does not allow empty names for the levels.

data("gemas")

gemas <- gemas %>% 
  mutate(soilclass = fct_recode(soilclass, "Not specified" = ""))

3. Introduction of the use of the factor variable extension with LRCoDa

With the newly released package LRCoDa we are able to use factor variables for linear regression on raw compositional data. The factor variables however are not allowed as Internal variables of a compositions. Further, as with the function lmCoDaX from the package robCompositions we have the choice between the "robust" and "classical" method.

First, we show an example where we apply the robust method to the dataset "gemas" and use the "soilclass" as the factor variable, the trace elements as composition and "sand" as target variable.

X <- dplyr::select(gemas, c(soilclass, Al:Zr))
y <- gemas$sand
LR_robust_1 <- LRCoDa(y = y, X = X, external = 'soilclass', method = 'robust', 
                      max_refinement_steps = 500)
LR_robust_1$ilr

We can access the summary output of the Linear Regression as known from the function lmCoDaX with \$ilr (for the ilr output) and \$lm (for the lm output). In addition, we can clearly see which is the external factor variable and which are the variables from the composition by the prefix of the variables.

The output of the robust method is further differentiated from the output using lmCoDaX because we use lmrob for the robust calculation of the linear regression in LRCoDa instead of ltsReg in lmCoDaX.

3.1. Limitations on the use of factor variables

We list some scenarios where we show the limitations of using factor variables with the LRCoDa function.

3.1.1. More than one factor variable is not allowed

Only one factor variable can be used for the model. The function automatically detect when two variables with datatype "Factor" and/or "Character" are. We return an error message if more than one factor variable is used.

X <- dplyr::select(gemas, c(soilclass, COUNTRY, Al:Zr))
LRCoDa(y = gemas$sand, X = X, external = c('COUNTRY', 'soilclass'))

3.1.2. The factor has to be specified

When the external factor column is not specified in the function input, but X contains a variable with factors or characters, the function returns an error message that "Variable with datatype character or factor have to be defined as external".

X <- dplyr::select(gemas, c(soilclass, Al:Zr))
LRCoDa(y = gemas$sand, X = X)

3.1.3. Factor column must be specified as a name in the function input

The external factor column must be specified exactly as in the help function described. It is not possible to fill in the whole column.

X <- dplyr::select(gemas, c(soilclass, Al:Zr))
LRCoDa(y = gemas$sand, X = X, external = X %>% select(soilclass))

4. Introduction of the use of the external variable extension with LRCoDa

With the newly released package LRCoDa we are also able to use numerical external variables that are not part of the composition. Here we also have the possibility to choose between the "robust" and "classical" method and can combine them with a factor variable.

We show an example where we combine external variables with a factor variables and of course the composition (naming as internal variables). The function automatically detects which variable is the factor variable from the external variables. We use the "gemas" dataset, where "MeanTemp" the external variable is, "soilclass" the factor variable, the trace elements the composition ("Internal" variables) and "sand" the target variable.

X <- dplyr::select(gemas, c(MeanTemp, soilclass, Al:Zr))
y <- gemas$sand
LR_robust_2 <- LRCoDa(y = y, X = X, external = c('MeanTemp', 'soilclass'), 
                      method = 'robust', max_refinement_steps = 500)
LR_robust_2$ilr

We can see here that the different types of variables have different prefixes with which we can easily identify them.

4.1 Possibilities and Limitations of using external variables

We list the possibilities and limitations of using the external variables

4.1.1. Multiple numerical external variables are allowed and supported

We allow and support the use of multiple numerical external variables in the same model.

X <- dplyr::select(gemas, c(MeanTemp, AnnPrec, Al:Zr))
LRCoDa(y = gemas$sand, X = X, external = c('MeanTemp', 'AnnPrec'), 
       method = "classical")

4.1.2. External column has to be specified as name in the function input

The external column must be specified exactly as in the help function described. It is not possible to fill in the whole column.

X <- dplyr::select(gemas, c(MeanTemp, AnnPrec, Al:Zr))
LRCoDa(y = gemas$sand, X = X, external = X %>% select(c('MeanTemp', 'AnnPrec')))

5. Use of lmrob for robust fitting

With the newly introduced use of factor variables for linear regression on raw compositional data, we need to change the fitting algorithm from the old ltsReg (used in lmCoDaX) to lmrob (used in LRCoDa). This change offers many advantages when working with factor variables and also with external variables compared to ltsReg which is very limited for this application. Also, lmrob determines robustness weights between 0 and 1, while ltsReg only works with 0 and 1 for the robustness weigths. However, there are also new challenges introduced with lmrob, as shown in the example below. Also, the backwards compatibility of the function when fitting with the robust method is no longer given due to this change.

X <- dplyr::select(gemas, c(MeanTemp, AnnPrec, soilclass, Al:Zr))
LRCoDa(y = gemas$sand, X = X, external = c('MeanTemp', 'AnnPrec', 'soilclass'), 
       max_refinement_steps = 200)

The algorithm does not converge with the default refinement steps 200 as defined in the function lmrob. Therefore, we implemented the parameter max_refinement_steps as input to the function LRCoDa to change the maximum steps in such cases.

X <- dplyr::select(gemas, c(MeanTemp, AnnPrec, soilclass, Al:Zr))
LRCoDa(y = gemas$sand, X = X, external = c('MeanTemp', 'AnnPrec', 'soilclass'), 
       max_refinement_steps = 500)

And now with max_refinement_steps set to 500 the algorithm converges.

6. further improvements

The norm of the pivot coordinate calculation can now also be defined in the function input, in contrast to the function lmCoDaX. The default value is "orthonorm" as it is preset in the function pivotCoord and therefore also the one used in lmCoDaX function.

X <- dplyr::select(gemas, c(Al:Zr))
LRCoDa(y = gemas$sand, X = X, pivot_norm = "orthogonal", method = "classical")


Wiedi97/LRCoDa-Package documentation built on June 30, 2022, 10:51 a.m.