knitr::opts_chunk$set(echo = FALSE)
In order to facilitate any attempt at reproducing this study, here is a depiction of the system information used to prepare and analyse the data from the survey on tarping operations for the control of Japanese knotweed s.l. (Reynoutria spp.).
rm(list=ls()) sessionInfo()
All data and code used in this study can furthermore be found here: https://github.com/mrelnoob/jk.dusz.tarping \ \
library(jk.dusz.tarping) zzz_mydata_cleaned <- jk.dusz.tarping::clean_my_data() .pardefault <- par() # To save the default graphical parameters (in case I want to restore them).
\
In our efforts to build valid and meaningful models to explain or predict the success/failure of tarping operations for the control of Japanese knotweed s.l., we had to learn more about our data.
To explain the variations of our five response variables (i.e. efficiency, high_eff, lreg_edges, lreg_tarpedarea, and reg_overlaps), we created four slightly different sub-datasets from our global dataset. These sub-datasets differed in their sample size and in the potential explanatory variables they contained (because of our working hypotheses). They varied in their sample size because of the varying numbers of missing values in their associated response variables (i.e. NAs), and they varied in the potential explanatory variables they contained because we did not hypothesized that all response variables would be explained by the same set of explanatory variables. The present documents presents, for each one of the five response variables, the results of the exploratory data analyses and basic assumption checks performed on their respective sub-datasets.
Various aspects of our data (e.g. outliers, variables distribution, multicollinearity among predictors, homogeneity of variances, independence, potential interactions, multivariate relationships, linearity etc.) were explored in this document, mostly following Bolker (2007), Zuur et al. (2010), and Harrell (2015). However, as the response variables were of different nature (e.g. binary, continuous), they would be modelled using different types of models with varying assumptions:
Naturally, most actual models assumptions were checked after model building and fitting. Our intention here was simply to improve our knowledge of the data and identify potentially problematic variables upstream. As our global dataset contained many potential predictors (p = 85), much effort was indeed directed toward reducing their number and keeping only the most reliable and important ones relative to our hypotheses. In other words, we tried to reduce the dimensionality of the data.
For the meaning of all variables, the reader is referred to Appendix §§§ *** §§§ !!!!!! \ \
\ The first thing to do here was to prepare the sub-dataset that will be used to model the responses of the efficiency variables (i.e. efficiency and high_eff). To do so, the following steps have been performed:
missForest
package (Stekhoven & Buehlmann, 2012). The table below presents the out-of-bag (OOB) imputation error estimates for each type of variable (computed as the Normalized Root Mean Squared Error (NRMSE) for numeric variables, and the Proportion of Falsely Classified entries (PFC) for categorical variables). OOB values near 1 indicate that imputations are not reliable, while values close to 0 indicate very low imputation errors. For the meaning of the variables, please refer to the attached documentation.
erad <- jk.dusz.tarping::model_datasets(response.var = "efficiency") erad %>% dplyr::select(-liner_geomem, -agri_geomem,-woven_geotex, -mulching_geotex, -pla_geotex, -other_unknown, -grammage, -age, -thickness, -pierced_tarpinstall, -add_control_type) %>% dplyr::filter(eff_eradication != "NA") %>% dplyr::filter(tarping_duration >= 2) %>% dplyr::filter(xp_id != "54") -> erad # Operation n°54 was destroyed because of vandalism. ### Imputation of missing values using `missForest` (we could also have worked with `MICE`): erad %>% dplyr::select(-manager_id, -xp_id) %>% as.data.frame() -> erad_mis # missForest only accepts data.frames or matrices (not tibbles). Variables must also be numeric or factors only (no character, date, difftime... classes)! imput <- missForest::missForest(xmis = erad_mis, verbose = TRUE) # To create a small summary table for the OOB errors (for each variable, with the argument `variablewise = TRUE` in missForest): # imput_error <- data.frame(cbind( # sapply(erad_mis, function(y) sum(length(which(is.na(y))))), # Number of imputed values (NAs) # imput$OOBerror), # Out-of-bag error (OOB) for each variable # row.names = colnames(erad_mis)) # To get the name of the variables # dplyr::rename(imput_error, nb_imputed_values = 'X1', oob_error = 'X2') -> imput_error
knitr::kable(imput$OOBerror)
Here are a few descriptive statistics of our sub-dataset.
erad[,3:ncol(erad)] <- imput$ximp # Replacing th original dataset with the imputed one! summary(erad) rm(erad_mis, imput)
\ \
\ We performed a normed-PCA on some of our environmental variables for which we did not have strong direct hypotheses (i.e. difficulty_access, shade, forest, ruggedness, granulometry) but we kept slope, obstacles, and flood because we suspected they might have a major influence on the success/failure of tarping.
xxx <- dplyr::select(.data = erad, c(difficulty_access, shade, forest, ruggedness, granulometry)) # In order to get optimum results, we imputed missing values using the *Regularised Iterative PCA algorithm* (Josse & Husson, 2013) of the `missMDA` package (Josse & Husson, 2016). # But first, we need to estimate how many components are required to correctly impute the missing values: #nb <- missMDA::estim_ncpPCA(xxx, ncp.max=5) # Can be time consuming. Here, the best result is 0 but we will take 2 as it is (almost) the second best option. # Then we impute the missing values and extract the imputed dataset: #res.imput <- missMDA::imputePCA(X = xxx, method = "Regularized", scale = TRUE, ncp = 2) #xxx <- res.imput$completeObs # Gives the completed dataset! # Normed-PCA: res.pca <- FactoMineR::PCA(X = xxx, scale.unit = TRUE, graph = FALSE) # To get the correlation circle for this PCA, with a variable coloration according to their contribution to the first 2 principal components: factoextra::fviz_pca_var(res.pca, col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE) # As the first axis (PC) of my PCA satisfactorily synthesizes a fairly large amount of the variance of my 5 variables, I will use the coordinates of the tarping operations on this axis as a synthetic variable: ttt <- res.pca$ind$coord[,1] erad %>% dplyr::select(-difficulty_access, -shade, -forest, -ruggedness) %>% dplyr::rename(coarse_env = "granulometry") -> erad erad$coarse_env <- ttt # And I use this occasion to reduce the dataset and replace one variable with my new synthetic variable. # To plot side by side: #gridExtra::grid.arrange(FIG_1, FIG_2, ncol = 2)
As the first axis (PC) of the PCA satisfactorily synthesized a large amount (r round(res.pca$eig[1,2],1)
%) of the variance of these 5 environmental variables, we used operations' coordinates on this axis to build a synthetic variable to summarize their environment. This new variable was named coarse_env and replaced the other 5 variables in the dataset. Operations with positive values for this variable were located in sites that tended to be forested (and thus shaded), difficult to access, with uneven ground and a coarse soil texture (and vice-versa).
We also performed another normed-PCA on some of our follow-up variables that we thought could be synthesized (i.e. add_control, repairs, and freq_monitoring).
erad %>% dplyr::select(freq_monitoring, repairs, add_control) %>% dplyr::mutate(repairs = as.numfactor(x = repairs), add_control = as.numfactor(x = add_control))-> xxx # Normed-PCA: res.pca <- FactoMineR::PCA(X = xxx, scale.unit = TRUE, graph = FALSE) # To get the correlation circle for this PCA, with a variable coloration according to their contribution to the first 2 principal components: factoextra::fviz_pca_var(res.pca, col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE) # As the first axis (PC) of my PCA satisfactorily synthesizes a fairly large amount of the variance of my 5 variables, I will use the coordinates of the tarping operations on this axis as a synthetic variable: ttt <- res.pca$ind$coord[,1] erad %>% dplyr::rename(followups = "freq_monitoring") -> erad erad$followups <- ttt # And I use this occasion to reduce the dataset and replace one variable with my new synthetic variable.
As the first axis (PC) of the PCA satisfactorily synthesized a large amount (r round(res.pca$eig[1,2],1)
%) of the variance of the 3 follow-up variables, we used operations' coordinates on this axis to build a synthetic variable to summarize their follow-up conditions. This new variable was named followups and thus replaced freq_monitoring in the dataset (we decided to keep repairs and add_control as is). High positive values for this variable represent operations that were frequently monitored and where additional control and repairs were performed (and vice-versa).
\
\
\ The next step in our data exploration was to look for potential outliers in the data using boxplots and Cleveland dotplots on our quantitative variables.
### IMPORTANT NOTE: as ggplot2 requires data in the long format to be able to do multi-panels plots (known as facets), we need to transform our data using the pivot_longer() function of {tidyr}. # To keep only numeric columns: num.erad <- erad[, sapply(erad, is.numeric)]; or we can do: # With ggplot2, it is time-consuming! #erad %>% #purrr::keep(is.numeric) %>% # Keep only numeric columns #tidyr::pivot_longer(cols = tidyselect::everything()) %>% # Pivot every columns #ggplot2::ggplot(ggplot2::aes(x = name, y = value)) + #ggplot2::geom_boxplot(fill = "lightsalmon") + #ggplot2::theme_minimal() -> ppp #ppp + ggplot2::facet_wrap( ~ name, scales = "free") #RColorBrewer::display.brewer.all() To display all the colors in {RColorBrewer} rm(res.pca, res.imput, xxx, ttt, nb) jk.dusz.tarping::uni.boxplots(dataset = erad)
jk.dusz.tarping::uni.dotplots(dataset = erad)
By the look of these graphs, it is clear that we have many extreme values (possible outliers). Yet, extreme values are not necessarily true outliers:
simu.var <- dplyr::select(.data = erad, elevation, freq_monitoring, stand_surface, distance, sedicover_height, trench_depth, tarping_duration) jk.dusz.tarping::uni.simudistrib(simu.var = simu.var, distribution = "normal") jk.dusz.tarping::uni.simudistrib(simu.var = simu.var, distribution = "log-normal") jk.dusz.tarping::uni.simudistrib(simu.var = simu.var, distribution = "poisson") rm(simu.var)
After analysing random samples drawn from log-normal distributions (based on the parameters of our potentially problematic variables), we have good reason to believe that only stand_surface should be problematic. Consequently, we chose to delete the spotted outlier within stand_surface (i.e. the stand_surface of ca. 5000 m^2^).
The other variables do not seem have values that are too extreme to be possibly drawn from Normal, log-Normal or Poisson distributed variables, except longitude. Yet, we do not intend on truly using this variable at this point (we included it in case we need to account for some spatial autocorrelation in the models).
erad <- erad[-(which.max(erad$stand_surface)),]
\ \
\ Although we already gained pretty good insights into the probability distribution followed by our variables, it is usually interesting and often useful to have a precise view of the variables' distributions.\ Consequently, we described the shapes of the distribution of our quantitative variables using histograms as well as skewness and kurtosis values. Skewness is a measure of the symmetry of a variable's distribution (a perfectly normal distribution has a skewness of 0), while kurtosis gives information on the combined amount of probability contained in the two tails of a distribution (often reported as excess kurtosis, that is kurtosis - 3, since a perfectly normally distributed variable has a kurtosis of 3). There is much debate regarding what constitutes "acceptable" values for these statistics or whether this question actually makes any sense at all. Yet, various authors suggest that values of ±2 should not be too problematic for procedures requiring approximately normally distributed variables (Field, 2009; Gravetter & Wallnau, 2014). Nonetheless, many statistical methods do not possess a direct normality assumption (linear regression models, for instance and contrarily to popular beliefs, do not assume the normality of either the response variable nor of any predictor/covariate but do assume that the replicated observations of the response variable for any fixed predictor value should be normally distributed (Montgomery & Peck, 1992); an assumption that is often difficult to verify, hence the usual focus on models residuals). For all these reasons, here, we only used these statistics as descriptive tools to foresee potential future complications related to the distribution of our variables, but not as bases for actual decision making.
uni.histograms(dataset = erad) num.data <- erad[, sapply(erad, is.numeric)] tab <- data.frame(moments::skewness(x = num.data), moments::kurtosis(x = num.data)-3) knitr::kable(x = tab, digits = 3, col.names = c("Skewness", "Excess kurtosis")) rm(num.data, tab)
From the above histograms and table, we can see that: Very few of our sampled variables are close to approximate a normal distribution: perhaps latitude, coarse_env, and followups. Surprisingly, efficiency possesses "acceptable" skewness and kurtosis values yet is clearly far from a normal variable (left-dissymmetric and bounded), perhaps attesting the limited usefulness of these statistics. Most variables are skewed to the right but in relatively reasonable proportions. Some variables (e.g. elevation, stand_surface, distance, trench_depth and sedicover_height*) have skewness and/or kurtosis values possibly high enough to have undue influence on regression coefficients. These variables will perhaps require some transformations. \ \
\
To investigate potential correlations and collinearity among our variables, we started by computing a correlation matrix of the data. As most of our factors were either ordered factors (i.e. ordinal variables) or binary factors (i.e. boolean), we could use them in a correlation matrix. To do that however, computations were based on Spearman's rank correlation coefficients.
To avoid giving too much weight to our own tarping experiments (those performed in Chalon s/ Saone with the French national railway company), we removed them from the data to compute correlations: their influence was accounted for in further analyses by the use of mixed-effect models.
# To convert all ordered factors into numeric variables usable in a correlation matrix: erad %>% dplyr::filter(!manager_id == 1) %>% # We remove our own experiments as they share many features that will strongly influence correlations (it should not be a problem for modelling as we will use mixed-effect models). dplyr::select(-manager_id, -xp_id, -goals) %>% dplyr::mutate_if(.predicate = is.factor, .funs = as.numfactor) -> num.erad # as.numfactor() is my own custom function (see '01_data_cleaning.R'). # To compute the correlation matrix: res.cor.erad <- round(stats::cor(num.erad, use = "complete.obs", method = "spearman"), 2) # To compute a matrix of correlation p-values: res.pcor.erad <- ggcorrplot::cor_pmat(x = num.erad, method = "spearman") ggcorrplot::ggcorrplot(res.cor.erad, type = "upper", outline.col = "white", ggtheme = ggplot2::theme_gray, colors = c("#6D9EC1", "white", "#E46726"), p.mat = res.pcor.erad, insig = "blank") rm(num.erad, res.cor.erad, res.pcor.erad)
Many interesting patterns can be observed in this correlation matrix. Among them, we can see that:
It is worth reminding that accounting for multivariate relationships, interactions and the effect of our tarping operations (dropped for the current correlation matrix) will certainly affect the observed patterns. Still, at this stage and in accordance with the hierarchy of hypotheses we would like to test for the efficiency variables, we decided to drop the following variables: woven_geotex, geotex, and degradations.
erad %>% dplyr::select(-geotex, -weedsp_geotex, -degradation) -> erad
As multicollinearity is one of the biggest issues for inference modelling, assessing which explanatory variables and covariates are collinear and could possibly be problematic during modelling stages is important: for instance by computing Variance Inflation Factors (VIF) for each variable. However, as only a small fraction of the variables will actually be used together in the likely a priori models that we will build, we intend on computing VIF as part of the models diagnostic checks and not here. Furthermore, as several variables were categorical and as we will not be working with models fitted with ordinary least squares, we will use Generalized VIF instead of VIF (Fox & Monette, 1992). \ \
\ To further study distributions and bivariate correlations, we decided to plot every numeric predictor against each others. It gives us the possibility to start investigating potential multivariate outliers.
GGally::ggpairs(erad[, c("coarse_env", "distance", "flood", "followups", "elevation", "obstacles", "stand_surface", "sedicover_height", "slope", "tarping_duration")])
This figure confirms that most variables are far from being normally distributed but that's not necessarily a problem. If many scatterplots appear relatively compact and consistent, all highly skewed variables (e.g. stand_surface, sedicover_height, etc.) tend to produce what seem to be multivariate outliers that could become potentially problematic for modelling. In case it happens, transforming these variables may be useful. \ \
\ The next step in our data exploration was to plot the response variables against each predictors and covariates in order to assess the linearity and homoscedasticity of these relationships and help identify potential multivariate outliers.
### To plot all variables against each other, we can use pairs() or ggpairs(): # ggplot2::theme_set(ggplot2::theme_bw()) # erad %>% dplyr::select(-xp_id, -goals) %>% GGally::ggpairs() ### To plot a response variable against all predictors/covariates, we will have to do use a for loop or the facet_wrap function of ggplot (or equivalent). # For the "efficiency" variable (bounded continuous variable): erad %>% dplyr::select(-manager_id, -xp_id, -goals, -eff_eradication, -high_eff, -latitude, -longitude) %>% dplyr::mutate_if(is.factor, as.numfactor) %>% tidyr::pivot_longer(-c(efficiency), # Means that all columns of `mydata` need to be reshaped except "efficiency" (works as well with the form `!efficiency`, and we can place it somewhere else: actually, it is the `cols =` argument)! We have to work in the long-format because that's how the tidyverse and thus ggplot2 love to work... names_to = "Predictor", # Gives the name of the new grouping variable while "values_to" gives the name of the variable that will be created from the data stored in the cell value. values_to = "Value") %>% ggplot2::ggplot(ggplot2::aes(x = Value, y = efficiency)) + ggplot2::facet_wrap(~Predictor, scales = "free_x", strip.position = "bottom") + ggplot2::geom_point(size = 1.5, alpha = 0.8, col = "palevioletred4") + ggplot2::scale_y_continuous(limits=c(0,10)) + ggplot2::geom_smooth(method = "loess", na.rm = TRUE, size = 1, col = "plum1") + # If you want to affect something related to the input dataframe, you have to put the command in an aes() function (e.g. if we want a smoother for each value of "fully_tarped", we should have wrote ggplot2::aes(col = as.factor(fully_tarped)))! # If a "loess" smoother is used, sometimes the smoother line won't be plotted, I don't know why. ggplot2::scale_colour_manual(values = c("plum1", "palevioletred4")) + ggplot2::labs(x = "Predictor value", y = "Efficiency score", col = "Fully tarped") + ggthemes::theme_hc(style = "darkunica") + ggplot2::theme(panel.border = ggplot2::element_blank(), legend.key = ggplot2::element_blank(), strip.background = ggplot2::element_blank(), strip.text = ggplot2::element_text(colour = "gray60", size = 10), strip.placement = "outside", axis.line.x = ggplot2::element_line(colour = "darkorange4", size=0.5), axis.line.y = ggplot2::element_line(colour = "darkorange4", size=0.5), legend.position = "right")
Various interesting observations can be made from these scatterplots:
\ Since high_eff will be modelled using logistic regressions, it is very important to ensure the respect of the linearity assumption between the outcome on the logit scale and each numeric predictor, for instance using Box-Tidwell tests or scatterplots. The main issue here is to get the values of the response variable on the logit scale. To do so, we have to build a logistic model incorporating all potential predictors to extract expected probabilities. A drawback from this approach is that the toy model can easily be misspecified in the process: for that reason, we actually checked this assumption iteratively for each a priori candidate model built for analyses (the scatterplots displayed below only serve to illustrate what transformations were actually applied to linearize relationships).
eff2 <- erad[,c("high_eff", "distance", "followups", "obstacles", "sedicover_height", "slope", "stand_surface", "tarping_duration")] model <- glm(high_eff ~., data = eff2, family = binomial) # Predict the probability (p) of high efficacy: probabilities <- predict(model, type = "response") # Transforming predictors mydata <- eff2 %>% dplyr::select_if(is.numeric) %>% dplyr::mutate("stand_surface (log)" = log(stand_surface)) %>% dplyr::mutate("distance (log+1)" = log(distance + 1)) %>% dplyr::mutate("sedicover_height (log+1)" = log(sedicover_height+1)) # These transformations were made to linearize the relationships predictors <- colnames(mydata) # Bind the logit and tidying the data for plot (ggplot2, so long format) mydata <- mydata %>% dplyr::mutate(logit = log(probabilities/(1-probabilities))) %>% tidyr::gather(key = "predictors", value = "predictor.value", -logit) # Create scatterplot ggplot2::ggplot(mydata, ggplot2::aes(y = logit, x = predictor.value))+ ggplot2::geom_point(size = 0.5, alpha = 0.5) + ggplot2::geom_smooth(method = "loess") + ggplot2::theme_bw() + ggplot2::facet_wrap(~predictors, scales = "free_x") rm(eff2, model, probabilities, predictors, mydata)
This figure shows that log-transforming distance, sedicover_height and stand_surface acceptably linearized their relationship with high_eff expressed on the logit scale.
More general scatterplots can further be used to ensure the absence of bivariate (quasi-)perfect separation between a predictor and the response variable, which is highly problematic for logistic models fitted using Maximum Likelihood (Allison, 2004).
erad %>% dplyr::select(-manager_id, -xp_id, -goals, -eff_eradication, -efficiency, -latitude, -longitude) %>% dplyr::mutate_if(is.factor, as.numfactor) %>% tidyr::pivot_longer(-c(high_eff), names_to = "Predictor", values_to = "Value") %>% ggplot2::ggplot(ggplot2::aes(x = Value, y = high_eff)) + ggplot2::facet_wrap(~Predictor, scales = "free_x", strip.position = "bottom") + ggplot2::geom_point(size = 1.5, alpha = 0.8, col = "coral") + #ggplot2::scale_y_continuous(limits=c(0,10.5)) + ggplot2::geom_smooth(method = "lm", na.rm = TRUE, size = 1, col = "darkred") + ggplot2::scale_colour_manual(values = c("coral", "darkred")) + ggplot2::labs(x = "Predictor value", y = "High efficacy", col = "Plantation") + ggthemes::theme_hc(style = "darkunica") + ggplot2::theme(panel.border = ggplot2::element_blank(), legend.key = ggplot2::element_blank(), strip.background = ggplot2::element_blank(), strip.text = ggplot2::element_text(colour = "gray60", size = 10), strip.placement = "outside", axis.line.x = ggplot2::element_line(colour = "darkorange4", size=0.5), axis.line.y = ggplot2::element_line(colour = "darkorange4", size=0.5), legend.position = "right")
Although most relationships look fine, we can see from these scatterplots that there is a clear case of perfect separation between high_eff and fully_tarped! CONSEQUENTLY ?????? Penalized method or modified variable???? RIDGE??? LASSO? Elastic net? With or without cross-validation???? \ \
As stated before, heteroscedasticity is normally not an issue for beta regression models. Nonetheless, it is always useful to have a clear idea of the distribution of variances, so we decided to draw conditional boxplots of our only "continuous" response variable (i.e. efficiency).
erad %>% dplyr::select(-goals, -latitude, -longitude) -> erad1 jk.dusz.tarping::cond.boxplots(dataset = erad1, Y = 3, Xs = 6:ncol(erad1), outlier = TRUE) rm(erad1)
According to Fox (2008), for simple linear regression models, problematic heterogeneity of variance occurs when the ratio between the largest and smallest variance is 4 or more. Consequently, there is a possibility for the following variables to become problematic for modelling: elevation, flood, stand_surface, tarping_duration, sedicover_height, and trench_depth. Still, actual problematic patterns will be better observable by plotting residuals against model fitted values. As high_eff is binomial, heteroscedasticity is not a problem as variance is assumed to be heterogeneous in logistic regression models. \ \
Finally, to further investigate potentially meaningful interactions among our variables, we used co-plots (i.e. conditioning plots) between some of our predictors and efficiency, based on plausible hypotheses. However, to avoid overloading the present document, we only display some of the most informative plots.
par(.pardefault) # To restore defaults graphical parameters ### Three-way interaction between elevation, fully_tarped, and stand_surface graphics::coplot(efficiency ~ elevation | fully_tarped * log(stand_surface), data = erad, panel=function(x,y,...) { points(x,y, col = wesanderson::wes_palette(n=4, name="GrandBudapest1"), pch = 19); abline(line(y~x)) }, # The function in the `panel` argument is what enables the drawing of the regression lines! In the col argument, I simply created a palette of colours based on the movies of Wes Anderson (just to make the plot prettier). number = 3, overlap = 0.05) # `number` indicates the number of groups by which the quantitative grouping variable (here, stand_surface) should be divided into. The `overlap` argument controls the proportion of points that are shared between the groups (here, 5%). # Note that, sometimes coplot will create several rows for the same variable, but this can be controlled using the `rows` argument (e.g. rows = 1).
This coplot is very interesting because it shows an inconsistent pattern within each category (single row or single column) suggesting that our sample prevents us to satisfactorily test a three-way interaction between those variables. Additionally, we can see that there are very few observations of not entirely tarped knotweed stands (n = 17) as well as very few observations above 500m a.s.l. (n = 9; which is not a very high elevation for knotweeds (Martin et al., 2019)). As our dataset was gathered opportunistically it is not surprising, yet it means that our sampled dataset will only be able to highlight strong effects of elevation and of partially tarped control operations, which is not the case here. All this may explain the very odd pattern displayed by the lower-right panel: Figure 2.2, Figure 2.6 and the other panels (save one) seem to indicate that there is a positive correlation between elevation and tarping efficiency; however, the fact that this pattern is reversed for smaller knotweed stands (and thus weaker ones) that are entirely covered does not make any sense. It is therefore possible that the seemingly higher efficiency of tarping at higher elevations actually reflects chance.
As we already have too many candidate predictors, we may consider dropping the elevation variable for further analyses.
### graphics::coplot(efficiency ~ log(stand_surface) | fully_tarped, data = erad, panel=function(x,y,...) { points(x,y, col = wesanderson::wes_palette(n=4, name="GrandBudapest2"), pch = 19); abline(line(y~x)) }, number = 4, overlap = 0.05, rows = 1)
This above coplot seems to indicate that entirely covered (tarped) knotweed stands prevent the drop in tarping efficiency with increasing stand surface. Caution is however still required as the relationships do not seem very strong.
graphics::coplot(efficiency ~ followups | pb_fixation, data = erad, panel=function(x,y,...) { points(x,y, col = wesanderson::wes_palette(n=4, name="Royal2"), pch = 19); abline(line(y~x)) }, number = 2, overlap = 0.05, rows = 1)
Quite logically, this coplot seems to confirm that follow-up operations increase tarping efficiency especially when there are some problems regarding fabrics fixation.
### Finally, I export the dataset to be used for modelling in my "mydata" folder: readr::write_csv(x = erad, file = here::here("mydata", "erad.csv"), append = FALSE) # A problem with this method is that once exported, R will lose track of the variable types (class). Therefore, if we open (read) this .csv file with R, it will specify columns as best it can, that is as double for numeric columns and as character for columns containing letters. For instance, it will import the first column of `erad` (i.e. "manager_id") as a numeric (double) variable when it should be a factor. # A not so ideal solution is to use the `col_types` argument of the readr::read_csv() function to manually specify variable types during import (see the readr vignette for more info).
\ \
\ The first thing to do here was to prepare the sub-dataset that will be used to model the responses of the lreg_edges variable. To do so, the following steps have been performed:
missForest
package (Stekhoven & Buehlmann, 2012). The table below presents the out-of-bag (OOB) imputation error estimates for each type of variable (computed as the Normalized Root Mean Squared Error (NRMSE) for numeric variables, and the Proportion of Falsely Classified entries (PFC) for categorical variables). OOB values near 1 indicate that imputations are not reliable, while values close to 0 indicate very low imputation errors. For the meaning of the variables, please refer to the attached documentation.
redges <- jk.dusz.tarping::model_datasets(response.var = "edges") redges <- dplyr::mutate(.data = redges, lreg_edges = as.factor(x = lreg_edges), reg_elsewhere = as.factor(x = reg_elsewhere), uprootexcav = as.factor(x = uprootexcav)) ### Imputation of missing values using `missForest` (we could also have worked with `MICE`): redges %>% dplyr::select(-manager_id, -xp_id) %>% as.data.frame() -> redges_mis # missForest only accepts data.frames or matrices (not tibbles). Variables must also be numeric or factors only (no character, date, difftime... classes)! imput <- missForest::missForest(xmis = redges_mis, verbose = TRUE)
knitr::kable(imput$OOBerror)
Here are a few descriptive statistics of our sub-dataset.
redges[,3:ncol(redges)] <- imput$ximp # Replacing the original dataset with the imputed one! summary(redges) rm(redges_mis, imput)
\ \
\ We performed a normed-PCA on some of our environmental variables for which we did not have strong direct hypotheses (i.e. difficulty_access, shade, forest, ruggedness, granulometry) but we kept slope, obstacles, and flood because we suspected they might have a major influence on the success/failure of tarping.
xxx <- dplyr::select(.data = redges, c(difficulty_access, shade, forest, ruggedness, granulometry)) # Normed-PCA: res.pca <- FactoMineR::PCA(X = xxx, scale.unit = TRUE, graph = FALSE) # To get the correlation circle for this PCA, with a variable coloration according to their contribution to the first 2 principal components: factoextra::fviz_pca_var(res.pca, col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE) # As the first axis (PC) of my PCA satisfactorily synthesizes a fairly large amount of the variance of my 5 variables, I will use the coordinates of the tarping operations on this axis as a synthetic variable: ttt <- res.pca$ind$coord[,1] redges %>% dplyr::select(-difficulty_access, -shade, -forest, -ruggedness) %>% dplyr::rename(coarse_env = "granulometry") -> redges redges$coarse_env <- ttt # And I use this occasion to reduce the dataset and replace one variable with my new synthetic variable.
As the first axis (PC) of our PCA satisfactorily synthesized a large amount (r round(res.pca$eig[1,2],1)
%) of the variance of the 5 environmental variables, we used operations' coordinates on this axis to build a synthetic variable to summarize their environment. This new variable was named coarse_env and thus replaced the other 5 variables in the dataset. Operations with positive values for this variable were located in sites that tend to be forested (and thus shaded), difficult to access, with uneven ground and a coarse soil texture (and vice-versa).
We also performed another normed-PCA on some of our follow-up variables that we thought could be synthesized (i.e. add_control, repairs, and freq_monitoring).
redges %>% dplyr::select(freq_monitoring, repairs, add_control) %>% dplyr::mutate(repairs = as.numfactor(x = repairs), add_control = as.numfactor(x = add_control))-> xxx # Normed-PCA: res.pca <- FactoMineR::PCA(X = xxx, scale.unit = TRUE, graph = FALSE) # To get the correlation circle for this PCA, with a variable coloration according to their contribution to the first 2 principal components: factoextra::fviz_pca_var(res.pca, col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE) # As the first axis (PC) of my PCA satisfactorily synthesizes a fairly large amount of the variance of my 5 variables, I will use the coordinates of the tarping operations on this axis as a synthetic variable: ttt <- res.pca$ind$coord[,1] redges %>% dplyr::rename(followups = "freq_monitoring") -> redges redges$followups <- ttt # And I use this occasion to reduce the dataset and replace one variable with my new synthetic variable.
As the first axis (PC) of our PCA satisfactorily synthesized a large amount (r round(res.pca$eig[1,2],1)
%) of the variance of the 3 follow-up variables, we used operations' coordinates on this axis to build a synthetic variable to summarize their follow-up conditions. This new variable was named followups and thus replaced freq_monitoring in the dataset (we decided to keep repairs and add_control as is). High positive values for this variable represent operations that were frequently monitored and where additional control and repairs were performed (and vice-versa).
\
\
\ The next step in our data exploration was to look for potential outliers in the data using boxplots and Cleveland dotplots on our quantitative variables.
rm(res.pca, xxx, ttt) jk.dusz.tarping::uni.boxplots(dataset = redges)
jk.dusz.tarping::uni.dotplots(dataset = redges)
By the look of these graphs, it is clear that we have many extreme values (possible outliers). Yet, extreme values are not necessarily true outliers:
simu.var <- dplyr::select(.data = redges, latitude, freq_monitoring, distance, sedicover_height, trench_depth) jk.dusz.tarping::uni.simudistrib(simu.var = simu.var, distribution = "normal") jk.dusz.tarping::uni.simudistrib(simu.var = simu.var, distribution = "log-normal") jk.dusz.tarping::uni.simudistrib(simu.var = simu.var, distribution = "poisson") rm(simu.var)
After analysing random samples drawn from log-normal distributions (based on the parameters of our potentially problematic variables), we have good reason to believe that only stand_surface should be problematic. Consequently, we chose to delete the spotted outlier within stand_surface (i.e. the stand_surface of ca. 5000 m^2^).
The other variables do not have distribution that a transformation could not fix, except perhaps latitude and longitude. Yet, we do not intend on truly using these variables at this point (we included them in case we need to account for some spatial autocorrelation in the models).
redges <- redges[-(which.max(redges$stand_surface)),]
\ \
\ Although we already gained pretty good insights into the probability distribution followed by our variables, it is usually interesting and often useful to have a precise view of the variables' distributions.\ Consequently, we described the shapes of the distribution of our quantitative variables using histograms as well as skewness and kurtosis values. Skewness is a measure of the symmetry of a variable's distribution (a perfectly normal distribution has a skewness of 0), while kurtosis gives information on the combined amount of probability contained in the two tails of a distribution (often reported as excess kurtosis, that is kurtosis - 3, since a perfectly normally distributed variable has a kurtosis of 3). There is much debate regarding what constitutes "acceptable" values for these statistics or whether this question actually makes any sense at all. Yet, various authors suggest that values of ±2 should not be too problematic for procedures requiring approximately normally distributed variables (Field, 2009; Gravetter & Wallnau, 2014). Nonetheless, many statistical methods do not possess a direct normality assumption (linear regression models, for instance and contrarily to popular beliefs, do not assume the normality of either the response variable nor of any predictor/covariate but do assume that the replicated observations of the response variable for any fixed predictor value should be normally distributed (Montgomery & Peck, 1992); an assumption that is often difficult to verify, hence the usual focus on models residuals). For all these reasons, here, we only used these statistics as descriptive tools to foresee potential future complications related to the distribution of our variables, but not as bases for actual decision making.
uni.histograms(dataset = redges) num.data <- redges[, sapply(redges, is.numeric)] tab <- data.frame(moments::skewness(x = num.data), moments::kurtosis(x = num.data)-3) knitr::kable(x = tab, digits = 3, col.names = c("Skewness", "Excess kurtosis")) rm(num.data, tab)
From the above histograms and table, we can see that: Very few of our sampled variables are close to approximate a normal distribution. Many variables are skewed to the right but in relatively reasonable proportions. Some variables (e.g. elevation, stand_surface, distance, and sedicover_height*) have skewness and/or kurtosis values possibly high enough to have undue influence on regression coefficients. These variables will perhaps require some transformations. \ \
\
To investigate potential correlations and collinearity among our variables, we started by computing a correlation matrix of the data. As most of our factors were either ordered factors (i.e. ordinal variables) or binary factors (i.e. boolean), we could use them in a correlation matrix. To do that however, computations were based on Spearman's rank correlation coefficients.
To avoid giving too much weight to our own tarping experiments (those performed in Chalon s/ Saone with the French national railway company), we removed them from the data to compute correlations: their influence was accounted for in further analyses by the use of mixed-effect models.
# To convert all ordered factors into numeric variables usable in a correlation matrix: redges %>% dplyr::filter(!manager_id == 1) %>% # We remove our own experiments as they share many features that will strongly influence correlations (it should not be a problem for modelling as we will use mixed-effect models). dplyr::select(-manager_id, -xp_id) %>% dplyr::mutate_if(.predicate = is.factor, .funs = as.numfactor) -> num.redges # as.numfactor() is my own custom function (see '01_data_cleaning.R'). # To compute the correlation matrix: res.cor.redges <- round(stats::cor(num.redges, use = "complete.obs", method = "spearman"), 2) # To compute a matrix of correlation p-values: res.pcor.redges <- ggcorrplot::cor_pmat(x = num.redges, method = "spearman") ggcorrplot::ggcorrplot(res.cor.redges, type = "upper", outline.col = "white", ggtheme = ggplot2::theme_gray, colors = c("#6D9EC1", "white", "#E46726"), p.mat = res.pcor.redges, insig = "blank") rm(num.redges, res.cor.redges, res.pcor.redges)
Many interesting patterns can be observed in this correlation matrix. Among them, we can see that:
It is worth reminding that accounting for multivariate relationships, interactions and the effect of our tarping operations (dropped for the current correlation matrix) will certainly affect the observed patterns. Still, at this stage and in accordance with the hierarchy of hypotheses we would like to test for the lreg_edges variable, we decided to drop the following variables: geotex and degradation.
redges %>% dplyr::select(-geotex, -degradation) -> redges
As multicollinearity is one of the biggest issues for inference modelling, assessing which explanatory variables and covariates are collinear and could possibly be problematic during modelling stages is important: for instance by computing Variance Inflation Factors (VIF) for each variable. However, as only a small fraction of the variables will actually be used together in the likely a priori models that we will build, we intend on computing VIF as part of the models diagnostic checks and not here. Furthermore, as several variables were categorical and as we will not be working with models fitted with ordinary least squares, we will use Generalized VIF instead of VIF (Fox & Monette, 1992). \ \
\ To further study distributions and bivariate correlations, we decided to plot every numeric predictor against each others. It gives us the possibility to start investigating potential multivariate outliers.
GGally::ggpairs(redges[, c("coarse_env", "distance", "flood", "followups", "obstacles", "stand_surface", "slope", "tarping_duration", "trench_depth")])
This figure confirms that most variables are far from being normally distributed but that's not necessarily a problem. If many scatterplots appear relatively compact and consistent, all highly skewed variables (e.g. stand_surface, trench_depth etc.) tend to produce what seem to be multivariate outliers that could become potentially problematic for modelling. In case it happens, transforming these variables may be useful. \ \
\ Since lreg_edges will be modelled using logistic regressions, it is very important to ensure the respect of the linearity assumption between the outcome on the logit scale and each numeric predictor, for instance using Box-Tidwell tests or scatterplots. The main issue here is to get the values of the response variable on the logit scale. To do so, we have to build a logistic model incorporating all potential predictors to extract expected probabilities. A drawback from this approach is that the toy model can easily be misspecified in the process: for that reason, we actually checked this assumption iteratively for each a priori candidate model built for analyses (the scatterplots displayed below only serve to illustrate what transformations were actually applied to linearize relationships).
eff2 <- redges[,c("lreg_edges", "distance", "followups", "obstacles", "slope", "stand_surface", "tarping_duration", "trench_depth")] model <- glm(lreg_edges ~., data = eff2, family = binomial) # Predict the probability (p) of regrowths at the edge: probabilities <- predict(model, type = "response") # Transforming predictors mydata <- eff2 %>% dplyr::select_if(is.numeric) %>% dplyr::mutate("stand_surface (log)" = log(stand_surface)) %>% dplyr::mutate("distance (log+1)" = log(distance + 1)) # These transformations were made to linearize the relationships predictors <- colnames(mydata) # Bind the logit and tidying the data for plot (ggplot2, so long format) mydata <- mydata %>% dplyr::mutate(logit = log(probabilities/(1-probabilities))) %>% tidyr::gather(key = "predictors", value = "predictor.value", -logit) # Create scatterplot ggplot2::ggplot(mydata, ggplot2::aes(y = logit, x = predictor.value))+ ggplot2::geom_point(size = 0.5, alpha = 0.5) + ggplot2::geom_smooth(method = "loess") + ggplot2::theme_bw() + ggplot2::facet_wrap(~predictors, scales = "free_x") rm(eff2, model, probabilities, predictors, mydata)
This figure shows that log-transforming distance and stand_surface could help linearize their relationship with lreg_edges expressed on the logit scale.
More general scatterplots can further be used to ensure the absence of bivariate (quasi-)perfect separation between a predictor and the response variable, which is highly problematic for logistic models fitted using Maximum Likelihood (Allison, 2004).
### To plot a response variable against all predictors/covariates, we will have to do use a for loop or the facet_wrap function of ggplot (or equivalent). # For the "lreg_edges" variable (bounded continuous variable): redges %>% dplyr::select(-manager_id, -xp_id, -latitude, -longitude) %>% dplyr::mutate_if(is.factor, as.numfactor) %>% tidyr::pivot_longer(-c(lreg_edges), # Means that all columns of `mydata` need to be reshaped except "lreg_edges" & "fully_tarped" (works as well with the form `!lreg_edges`, and we can place it somewhere else: actually, it is the `cols =` argument)! We have to work in the long-format because that's how the tidyverse and thus ggplot2 love to work... names_to = "Predictor", # Gives the name of the new grouping variable while "values_to" gives the name of the variable that will be created from the data stored in the cell value. values_to = "Value") %>% ggplot2::ggplot(ggplot2::aes(x = Value, y = lreg_edges)) + ggplot2::facet_wrap(~Predictor, scales = "free_x", strip.position = "bottom") + ggplot2::geom_point(size = 1.5, alpha = 0.8, col = "palevioletred4") + ggplot2::scale_y_continuous(limits=c(0,1)) + ggplot2::geom_smooth(method = "lm", na.rm = TRUE, size = 1, col = "plum1") + ggplot2::labs(x = "Predictor value", y = "Regrowth on the edges") + ggthemes::theme_hc(style = "darkunica") + ggplot2::theme(panel.border = ggplot2::element_blank(), legend.key = ggplot2::element_blank(), strip.background = ggplot2::element_blank(), strip.text = ggplot2::element_text(colour = "gray60", size = 10), strip.placement = "outside", axis.line.x = ggplot2::element_line(colour = "darkorange4", size=0.5), axis.line.y = ggplot2::element_line(colour = "darkorange4", size=0.5), legend.position = "right")
There is no apparent sign of bivariate (quasi-)perfect separation, so regular logistic models can be used as well as multimodel inference. \ \
Finally, to further investigate potentially meaningful interactions among our variables, we used co-plots (i.e. conditioning plots) between some of our predictors and lreg_edges, based on plausible hypotheses. However, as lreg_edges is a binary factor, the use of coplots was more complicated than for a numeric variable. Visualizing categorical variables as a function of other variables, especially quantitative ones, is not ideal in a regular conditioning plot as points will tend to be stacked on each other and as logistic curves should be fitted instead of regression lines, which is quite tedious to automatise in R. Consequently, we chose to draw spineplots within the coplot panels (conditioning spineplots) by transforming quantitative variables into two-level factors ("low" and "high" - based on whether observations had values below or above average, meaning that the number of observations is not distributed equitably between the two categories). The plots created could be viewed as some sort of graphical contingency tables. Unfortunately, combining spineplots with coplots creates a graphic bug in the drawing of the axes (linked to the coplot function internals) that we could not fix yet. We apologize for that.
To avoid overloading the present document, only the most informative plots are displayed.
par(.pardefault) # To restore defaults graphical parameters redges %>% dplyr::mutate(distance = ifelse(distance <= mean(distance), "Short", "Long")) %>% dplyr::mutate(distance = factor(distance, levels=c("Short", "Long"))) -> rrr pan.fun = function(x, y, ...){ usr <- par("usr"); on.exit(par(usr)) # Frankly, I don't understand what these two lines do (only new = TRUE seems to be doing something), and the spineplot messes with the axes of the coplots below. par(usr = c(usr[1:2], 0, 1.5), new=T) spineplot(as.factor(x), as.factor(y), main="", xlab="", ylab="", axes=F, col = c("#FF9933","#FFCC99")) } ### Three-way interaction between distance, stand_surface and trench_depth: graphics::coplot(lreg_edges ~ distance | log(stand_surface) * trench_depth, data = rrr, panel = pan.fun, number = 3, overlap = 0.01, rows = 1)
This conditioning spineplot is quite informative because it shows a changing pattern as we move from bottom to top, but globally not from left to right. It could be the sign of an existing interaction between distance and trench depth, but not with stand surface: with increasing trench depth, the proportion of tarping operations that found no knotweed regrowths at the edges (i.e. light orange rectangles) when tarping distances were above average (i.e. right-hand rectangles within each of the 9 panels) increased, regardless of the size of the knotweed stands. Actually, it is hard to say if stand surface did not play any role here but it appears that there was no unequivocal trend.
redges %>% dplyr::mutate(slope = ifelse(slope <= mean(slope), "Gentle", "Steep")) %>% dplyr::mutate(slope = factor(slope, levels=c("Gentle", "Steep"))) -> rrr pan.fun = function(x, y, ...){ usr <- par("usr"); on.exit(par(usr)) # Frankly, I don't understand what these two lines do (only new = TRUE seems to be doing something), and the spineplot messes with the axes of the coplots below. par(usr = c(usr[1:2], 0, 1.5), new=T) spineplot(as.factor(x), as.factor(y), main="", xlab="", ylab="", axes=F, col = c("#FF9933","#FFCC99")) } ### Two-way interaction between slope and obstacles: graphics::coplot(lreg_edges ~ slope | obstacles, data = rrr, panel = pan.fun, number = 3, overlap = 0.01, rows = 1)
This above conditioning spineplot seems to indicate that the effect of increasing slopes on the proportion of regrowths at the edge increased when there were many obstacles on the tarped area.
redges %>% dplyr::mutate(obstacles = ifelse(obstacles <= mean(obstacles), "Low", "High")) %>% dplyr::mutate(obstacles = factor(obstacles, levels=c("Low", "High"))) -> rrr pan.fun = function(x, y, ...){ usr <- par("usr"); on.exit(par(usr)) # Frankly, I don't understand what these two lines do (only new = TRUE seems to be doing something), and the spineplot messes with the axes of the coplots below. par(usr = c(usr[1:2], 0, 1.5), new=T) spineplot(as.factor(x), as.factor(y), main="", xlab="", ylab="", axes=F, col = c("#FF9933","#FFCC99")) } ### Three-way interaction between obstacles, geomem and tarpfix_multimethod: graphics::coplot(lreg_edges ~ obstacles | geomem * tarpfix_multimethod, data = rrr, panel = pan.fun, number = 3, overlap = 0.01, rows = 1)
This very interesting conditioning spineplot seems to indicate that when not much efforts had been taken to fix fabrics to the ground, increasing obstacles increased the likelihood of finding regrowths at the edge of the tarped area. However, when multiple methods were used for fixation, this trend seemed to be quite reduced except when geomembranes were used! This could mean that geomembranes are harder to properly fix to the ground and/or around obstacles (compared to geotextiles).
### Two-way interaction between obstacles and obstacles: graphics::coplot(lreg_edges ~ obstacles | distance, data = rrr, panel = pan.fun, number = 3, overlap = 0.01, rows = 1)
This figure suggests that the effect of increasing obstacle proportion on the likelihood of finding regrowths at the edge of the tarped area might me be reduced when larger covering distances are applied.
rm(rrr) ### Finally, I export the dataset to be used for modelling in my "mydata" folder: readr::write_csv(x = redges, file = here::here("mydata", "redges.csv"), append = FALSE)
\ \
\ The first thing to do here was to prepare the sub-dataset that will be used to model the responses of the lreg_tarpedarea variable. To do so, the following steps have been performed:
missForest
package (Stekhoven & Buehlmann, 2012). The table below presents the out-of-bag (OOB) imputation error estimates for each type of variable (computed as the Normalized Root Mean Squared Error (NRMSE) for numeric variables, and the Proportion of Falsely Classified entries (PFC) for categorical variables). OOB values near 1 indicate that imputations are not reliable, while values close to 0 indicate very low imputation errors. For the meaning of the variables, please refer to the attached documentation.
rtarped <- jk.dusz.tarping::model_datasets(response.var = "tarpedarea") rtarped <- dplyr::mutate(.data = rtarped, lreg_tarpedarea = as.factor(x = lreg_tarpedarea), uprootexcav = as.factor(x = uprootexcav)) ### Imputation of missing values using `missForest` (we could also have worked with `MICE`): rtarped %>% dplyr::select(-manager_id, -xp_id) %>% as.data.frame() -> rtarped_mis # missForest only accepts data.frames or matrices (not tibbles). Variables must also be numeric or factors only (no character, date, difftime... classes)! imput <- missForest::missForest(xmis = rtarped_mis, verbose = TRUE)
knitr::kable(imput$OOBerror)
Here are a few descriptive statistics of our sub-dataset.
rtarped[,3:ncol(rtarped)] <- imput$ximp # Replacing the original dataset with the imputed one! summary(rtarped) rm(rtarped_mis, imput)
\ \
\ We performed a normed-PCA on some of our environmental variables for which we did not have strong direct hypotheses (i.e. difficulty_access, shade, forest, ruggedness, granulometry) but we kept slope, obstacles, and flood because we suspected they might have a major influence on the success/failure of tarping.
xxx <- dplyr::select(.data = rtarped, c(difficulty_access, shade, forest, ruggedness, granulometry)) # Normed-PCA: res.pca <- FactoMineR::PCA(X = xxx, scale.unit = TRUE, graph = FALSE) # To get the correlation circle for this PCA, with a variable coloration according to their contribution to the first 2 principal components: factoextra::fviz_pca_var(res.pca, col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE) # As the first axis (PC) of my PCA satisfactorily synthesizes a fairly large amount of the variance of my 5 variables, I will use the coordinates of the tarping operations on this axis as a synthetic variable: ttt <- res.pca$ind$coord[,1] rtarped %>% dplyr::select(-difficulty_access, -shade, -forest, -ruggedness) %>% dplyr::rename(coarse_env = "granulometry") -> rtarped rtarped$coarse_env <- ttt # And I use this occasion to reduce the dataset and replace one variable with my new synthetic variable.
As the first axis (PC) of our PCA satisfactorily synthesized a large amount (r round(res.pca$eig[1,2],1)
%) of the variance of the 5 environmental variables, we used operations' coordinates on this axis to build a synthetic variable to summarize their environment. This new variable was named coarse_env and thus replaced the other 5 variables in the dataset. Operations with positive values for this variable were located in sites that tend to be forested (and thus shaded), difficult to access, with uneven ground and a coarse soil texture (and vice-versa).
We also performed another normed-PCA on some of our follow-up variables that we thought could be synthesized (i.e. add_control, repairs, and freq_monitoring).
rtarped %>% dplyr::select(freq_monitoring, repairs, add_control) %>% dplyr::mutate(repairs = as.numfactor(x = repairs), add_control = as.numfactor(x = add_control))-> xxx # Normed-PCA: res.pca <- FactoMineR::PCA(X = xxx, scale.unit = TRUE, graph = FALSE) # To get the correlation circle for this PCA, with a variable coloration according to their contribution to the first 2 principal components: factoextra::fviz_pca_var(res.pca, col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE) # As the first axis (PC) of my PCA satisfactorily synthesizes a fairly large amount of the variance of my 5 variables, I will use the coordinates of the tarping operations on this axis as a synthetic variable: ttt <- res.pca$ind$coord[,1] rtarped %>% dplyr::rename(followups = "freq_monitoring") -> rtarped rtarped$followups <- ttt # And I use this occasion to reduce the dataset and replace one variable with my new synthetic variable.
As the first axis (PC) of our PCA satisfactorily synthesized a large amount (r round(res.pca$eig[1,2],1)
%) of the variance of the 3 follow-up variables, we used operations' coordinates on this axis to build a synthetic variable to summarize their follow-up conditions. This new variable was named followups and thus replaced freq_monitoring in the dataset (we decided to keep repairs and add_control as is). High positive values for this variable represent operations that were frequently monitored and where additional control and repairs were performed (and vice-versa).
\
\
\ The next step in our data exploration was to look for potential outliers in the data using boxplots and Cleveland dotplots on our quantitative variables.
rm(res.pca, xxx, ttt) jk.dusz.tarping::uni.boxplots(dataset = rtarped)
jk.dusz.tarping::uni.dotplots(dataset = rtarped)
By the look of these graphs, it is clear that we have many extreme values (possible outliers). Yet, extreme values are not necessarily true outliers:
simu.var <- dplyr::select(.data = rtarped, latitude, freq_monitoring, distance, sedicover_height, trench_depth) jk.dusz.tarping::uni.simudistrib(simu.var = simu.var, distribution = "normal") jk.dusz.tarping::uni.simudistrib(simu.var = simu.var, distribution = "log-normal") jk.dusz.tarping::uni.simudistrib(simu.var = simu.var, distribution = "poisson") rm(simu.var)
After analysing random samples drawn from log-normal distributions (based on the parameters of our potentially problematic variables), we have good reason to believe that only stand_surface should be problematic. Consequently, we chose to delete the spotted outlier within stand_surface (i.e. the stand_surface of ca. 5000 m^2^).
The other variables do not have distribution that a transformation could not fix, except perhaps latitude and longitude. Yet, we do not intend on truly using these variables at this point (we included them in case we need to account for some spatial autocorrelation in the models).
rtarped <- rtarped[-(which.max(rtarped$stand_surface)),]
\ \
\ Although we already gained pretty good insights into the probability distribution followed by our variables, it is usually interesting and often useful to have a precise view of the variables' distributions.\ Consequently, we described the shapes of the distribution of our quantitative variables using histograms as well as skewness and kurtosis values. Skewness is a measure of the symmetry of a variable's distribution (a perfectly normal distribution has a skewness of 0), while kurtosis gives information on the combined amount of probability contained in the two tails of a distribution (often reported as excess kurtosis, that is kurtosis - 3, since a perfectly normally distributed variable has a kurtosis of 3). There is much debate regarding what constitutes "acceptable" values for these statistics or whether this question actually makes any sense at all. Yet, various authors suggest that values of ±2 should not be too problematic for procedures requiring approximately normally distributed variables (Field, 2009; Gravetter & Wallnau, 2014). Nonetheless, many statistical methods do not possess a direct normality assumption (linear regression models, for instance and contrarily to popular beliefs, do not assume the normality of either the response variable nor of any predictor/covariate but do assume that the replicated observations of the response variable for any fixed predictor value should be normally distributed (Montgomery & Peck, 1992); an assumption that is often difficult to verify, hence the usual focus on models residuals). For all these reasons, here, we only used these statistics as descriptive tools to foresee potential future complications related to the distribution of our variables, but not as bases for actual decision making.
uni.histograms(dataset = rtarped) num.data <- rtarped[, sapply(rtarped, is.numeric)] tab <- data.frame(moments::skewness(x = num.data), moments::kurtosis(x = num.data)-3) knitr::kable(x = tab, digits = 3, col.names = c("Skewness", "Excess kurtosis")) rm(num.data, tab)
From the above histograms and table, we can see that: Very few of our sampled variables are close to approximate a normal distribution. Many variables are skewed to the right but in relatively reasonable proportions. Some variables (e.g. elevation, stand_surface, distance, and sedicover_height*) have skewness and/or kurtosis values possibly high enough to have undue influence on regression coefficients. These variables will perhaps require some transformations. \ \
\
To investigate potential correlations and collinearity among our variables, we started by computing a correlation matrix of the data. As most of our factors were either ordered factors (i.e. ordinal variables) or binary factors (i.e. boolean), we could use them in a correlation matrix. To do that however, computations were based on Spearman's rank correlation coefficients.
To avoid giving too much weight to our own tarping experiments (those performed in Chalon s/ Saone with the French national railway company), we removed them from the data to compute correlations: their influence was accounted for in further analyses by the use of mixed-effect models.
# To convert all ordered factors into numeric variables usable in a correlation matrix: rtarped %>% dplyr::filter(!manager_id == 1) %>% # We remove our own experiments as they share many features that will strongly influence correlations (it should not be a problem for modelling as we will use mixed-effect models). dplyr::select(-manager_id, -xp_id) %>% dplyr::mutate_if(.predicate = is.factor, .funs = as.numfactor) -> num.rtarped # as.numfactor() is my own custom function (see '01_data_cleaning.R'). # To compute the correlation matrix: res.cor.rtarped <- round(stats::cor(num.rtarped, use = "complete.obs", method = "spearman"), 2) # To compute a matrix of correlation p-values: res.pcor.rtarped <- ggcorrplot::cor_pmat(x = num.rtarped, method = "spearman") ggcorrplot::ggcorrplot(res.cor.rtarped, type = "upper", outline.col = "white", ggtheme = ggplot2::theme_gray, colors = c("#6D9EC1", "white", "#E46726"), p.mat = res.pcor.rtarped, insig = "blank") rm(num.rtarped, res.cor.rtarped, res.pcor.rtarped)
Many interesting patterns can be observed in this correlation matrix. Among them, we can see that:
It is worth reminding that accounting for multivariate relationships, interactions and the effect of our tarping operations (dropped for the current correlation matrix) will certainly affect the observed patterns. Still, at this stage and in accordance with the hierarchy of hypotheses we would like to test for the lreg_tarpedarea variable, we decided to drop the following variables: geotex, pierced_tarpinstall, and pb_trampiercing.
rtarped %>% dplyr::select(-geotex, -pierced_tarpinstall, -pb_trampiercing) -> rtarped
As multicollinearity is one of the biggest issues for inference modelling, assessing which explanatory variables and covariates are collinear and could possibly be problematic during modelling stages is important: for instance by computing Variance Inflation Factors (VIF) for each variable. However, as only a small fraction of the variables will actually be used together in the likely a priori models that we will build, we intend on computing VIF as part of the models diagnostic checks and not here. Furthermore, as several variables were categorical and as we will not be working with models fitted with ordinary least squares, we will use Generalized VIF instead of VIF (Fox & Monette, 1992). \ \
\ To further study distributions and bivariate correlations, we decided to plot every numeric predictor against each others. It gives us the possibility to start investigating potential multivariate outliers.
GGally::ggpairs(rtarped[, c("coarse_env", "flood", "followups", "obstacles", "sedicover_height", "stand_surface", "slope", "tarping_duration")])
This figure confirms that most variables are far from being normally distributed but that's not necessarily a problem. If many scatterplots appear relatively compact and consistent, all highly skewed variables (e.g. stand_surface, sedicover_height etc.) tend to produce what seem to be multivariate outliers that could become potentially problematic for modelling. In case it happens, transforming these variables may be useful. \ \
\ Since lreg_tarpedarea will be modelled using logistic regressions, it is very important to ensure the respect of the linearity assumption between the outcome on the logit scale and each numeric predictor, for instance using Box-Tidwell tests or scatterplots. The main issue here is to get the values of the response variable on the logit scale. To do so, we have to build a logistic model incorporating all potential predictors to extract expected probabilities. A drawback from this approach is that the toy model can easily be misspecified in the process: for that reason, we actually checked this assumption iteratively for each a priori candidate model built for analyses (the scatterplots displayed below only serve to illustrate what transformations were actually applied to linearize relationships).
eff2 <- rtarped[,c("lreg_tarpedarea", "coarse_env", "followups", "obstacles", "sedicover_height", "slope", "stand_surface", "tarping_duration")] model <- glm(lreg_tarpedarea ~., data = eff2, family = binomial) # Predict the probability (p) of regrowths within the tarped area: probabilities <- predict(model, type = "response") # Transforming predictors mydata <- eff2 %>% dplyr::select_if(is.numeric) %>% dplyr::mutate("stand_surface (log)" = log(stand_surface)) %>% dplyr::mutate("sedicover_height (log+1)" = log(sedicover_height+1)) # These transformations were made to linearize the relationships predictors <- colnames(mydata) # Bind the logit and tidying the data for plot (ggplot2, so long format) mydata <- mydata %>% dplyr::mutate(logit = log(probabilities/(1-probabilities))) %>% tidyr::gather(key = "predictors", value = "predictor.value", -logit) # Create scatterplot ggplot2::ggplot(mydata, ggplot2::aes(y = logit, x = predictor.value))+ ggplot2::geom_point(size = 0.5, alpha = 0.5) + ggplot2::geom_smooth(method = "loess") + ggplot2::theme_bw() + ggplot2::facet_wrap(~predictors, scales = "free_x") rm(eff2, model, probabilities, predictors, mydata)
This figure shows that log-transforming sedicover_height and stand_surface helped linearize their relationship with lreg_tarpedarea expressed on the logit scale.
More general scatterplots can further be used to ensure the absence of bivariate (quasi-)perfect separation between a predictor and the response variable, which is highly problematic for logistic models fitted using Maximum Likelihood (Allison, 2004).
### To plot a response variable against all predictors/covariates, we will have to do use a for loop or the facet_wrap function of ggplot (or equivalent). # For the "lreg_tarpedarea" variable (bounded continuous variable): rtarped %>% dplyr::select(-manager_id, -xp_id, -latitude, -longitude) %>% dplyr::mutate_if(is.factor, as.numfactor) %>% tidyr::pivot_longer(-c(lreg_tarpedarea), # Means that all columns of `mydata` need to be reshaped except "lreg_tarpedarea" & "fully_tarped" (works as well with the form `!lreg_tarpedarea`, and we can place it somewhere else: actually, it is the `cols =` argument)! We have to work in the long-format because that's how the tidyverse and thus ggplot2 love to work... names_to = "Predictor", # Gives the name of the new grouping variable while "values_to" gives the name of the variable that will be created from the data stored in the cell value. values_to = "Value") %>% ggplot2::ggplot(ggplot2::aes(x = Value, y = lreg_tarpedarea)) + ggplot2::facet_wrap(~Predictor, scales = "free_x", strip.position = "bottom") + ggplot2::geom_point(size = 1.5, alpha = 0.8, col = "palevioletred4") + ggplot2::scale_y_continuous(limits=c(0,1)) + ggplot2::geom_smooth(method = "lm", na.rm = TRUE, size = 1, col = "plum1") + ggplot2::labs(x = "Predictor value", y = "Regrowth within the tarped area") + ggthemes::theme_hc(style = "darkunica") + ggplot2::theme(panel.border = ggplot2::element_blank(), legend.key = ggplot2::element_blank(), strip.background = ggplot2::element_blank(), strip.text = ggplot2::element_text(colour = "gray60", size = 10), strip.placement = "outside", axis.line.x = ggplot2::element_line(colour = "darkorange4", size=0.5), axis.line.y = ggplot2::element_line(colour = "darkorange4", size=0.5), legend.position = "right")
There is no apparent sign of bivariate (quasi-)perfect separation, so regular logistic models can be used as well as multimodel inference. \ \
Finally and similarly as for other response variables, we wanted to further investigate potentially meaningful interactions among our variables using co-plots (i.e. conditioning plots). However, we did not have many hypotheses regarding interactions for lreg_tarpedarea and as none of the tested ones gave insightful results so we decided to present none.
### Finally, I export the dataset to be used for modelling in my "mydata" folder: readr::write_csv(x = rtarped, file = here::here("mydata", "rtarped.csv"), append = FALSE)
\ \
\ The first thing to do here was to prepare the sub-dataset that will be used to model the responses of the reg_stripsoverlap variable. To do so, the following steps have been performed:
missForest
package (Stekhoven & Buehlmann, 2012). The table below presents the out-of-bag (OOB) imputation error estimates for each type of variable (computed as the Normalized Root Mean Squared Error (NRMSE) for numeric variables, and the Proportion of Falsely Classified entries (PFC) for categorical variables). OOB values near 1 indicate that imputations are not reliable, while values close to 0 indicate very low imputation errors. For the meaning of the variables, please refer to the attached documentation.
roverlaps <- jk.dusz.tarping::model_datasets(response.var = "overlaps") roverlaps <- dplyr::filter(.data = roverlaps, reg_stripsoverlap != "NA") %>% dplyr::filter(strips_overlap != "NA") ### Imputation of missing values using `missForest` (we could also have worked with `MICE`): roverlaps %>% dplyr::select(-manager_id, -xp_id) %>% as.data.frame() -> roverlaps_mis # missForest only accepts data.frames or matrices (not tibbles). Variables must also be numeric or factors only (no character, date, difftime... classes)! imput <- missForest::missForest(xmis = roverlaps_mis, verbose = TRUE)
knitr::kable(imput$OOBerror)
Here are a few descriptive statistics of our sub-dataset.
roverlaps[,3:ncol(roverlaps)] <- imput$ximp # Replacing the original dataset with the imputed one! summary(roverlaps) rm(roverlaps_mis, imput)
\ \
\ We performed a normed-PCA on some of our environmental variables for which we did not have strong direct hypotheses (i.e. difficulty_access, shade, forest, ruggedness, granulometry) but we kept slope, obstacles, and flood because we suspected they might have a major influence on the success/failure of tarping.
xxx <- dplyr::select(.data = roverlaps, c(difficulty_access, shade, forest, ruggedness, granulometry)) # Normed-PCA: res.pca <- FactoMineR::PCA(X = xxx, scale.unit = TRUE, graph = FALSE) # To get the correlation circle for this PCA, with a variable coloration according to their contribution to the first 2 principal components: factoextra::fviz_pca_var(res.pca, col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE) # As the first axis (PC) of my PCA satisfactorily synthesizes a fairly large amount of the variance of my 5 variables, I will use the coordinates of the tarping operations on this axis as a synthetic variable: ttt <- res.pca$ind$coord[,1] roverlaps %>% dplyr::select(-difficulty_access, -shade, -forest, -ruggedness) %>% dplyr::rename(coarse_env = "granulometry") -> roverlaps roverlaps$coarse_env <- ttt # And I use this occasion to reduce the dataset and replace one variable with my new synthetic variable.
As the first axis (PC) of our PCA satisfactorily synthesized a large amount (r round(res.pca$eig[1,2],1)
%) of the variance of the 5 environmental variables, we used operations' coordinates on this axis to build a synthetic variable to summarize their environment. This new variable was named coarse_env and thus replaced the other 5 variables in the dataset. Operations with positive values for this variable were located in sites that tend to be forested (and thus shaded), difficult to access, with uneven ground and a coarse soil texture (and vice-versa).
We also performed another normed-PCA on some of our follow-up variables that we thought could be synthesized (i.e. add_control, repairs, and freq_monitoring).
roverlaps %>% dplyr::select(freq_monitoring, repairs, add_control) %>% dplyr::mutate(repairs = as.numfactor(x = repairs), add_control = as.numfactor(x = add_control))-> xxx # Normed-PCA: res.pca <- FactoMineR::PCA(X = xxx, scale.unit = TRUE, graph = FALSE) # To get the correlation circle for this PCA, with a variable coloration according to their contribution to the first 2 principal components: factoextra::fviz_pca_var(res.pca, col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE) # As the first axis (PC) of my PCA satisfactorily synthesizes a fairly large amount of the variance of my 5 variables, I will use the coordinates of the tarping operations on this axis as a synthetic variable: ttt <- res.pca$ind$coord[,1] roverlaps %>% dplyr::rename(followups = "freq_monitoring") -> roverlaps roverlaps$followups <- ttt # And I use this occasion to reduce the dataset and replace one variable with my new synthetic variable.
As the first axis (PC) of our PCA satisfactorily synthesized a large amount (r round(res.pca$eig[1,2],1)
%) of the variance of the 3 follow-up variables, we used operations' coordinates on this axis to build a synthetic variable to summarize their follow-up conditions. This new variable was named followups and thus replaced freq_monitoring in the dataset (we decided to keep repairs and add_control as is). High positive values for this variable represent operations that were frequently monitored and where additional control and repairs were performed (and vice-versa).
\
\
\ The next step in our data exploration was to look for potential outliers in the data using boxplots and Cleveland dotplots on our quantitative variables.
rm(res.pca, xxx, ttt) jk.dusz.tarping::uni.boxplots(dataset = roverlaps)
jk.dusz.tarping::uni.dotplots(dataset = roverlaps)
By the look of these graphs, it is clear that we have many extreme values (possible outliers). Yet, extreme values are not necessarily true outliers:
simu.var <- dplyr::select(.data = roverlaps, latitude, freq_monitoring, distance, sedicover_height, trench_depth) jk.dusz.tarping::uni.simudistrib(simu.var = simu.var, distribution = "normal") jk.dusz.tarping::uni.simudistrib(simu.var = simu.var, distribution = "log-normal") jk.dusz.tarping::uni.simudistrib(simu.var = simu.var, distribution = "poisson") rm(simu.var)
After analysing random samples drawn from log-normal distributions (based on the parameters of our potentially problematic variables), we have good reason to believe that only stand_surface should be problematic. Consequently, we chose to delete the spotted outlier within stand_surface (i.e. the stand_surface of ca. 5000 m^2^). The other variables do not have distribution that a transformation could not fix, except perhaps latitude and longitude. Yet, we do not intend on truly using these variables at this point (we included them in case we need to account for some spatial autocorrelation in the models).
roverlaps <- roverlaps[-(which.max(roverlaps$stand_surface)),]
\ \
\ Although we already gained pretty good insights into the probability distribution followed by our variables, it is usually interesting and often useful to have a precise view of the variables' distributions.\ Consequently, we described the shapes of the distribution of our quantitative variables using histograms as well as skewness and kurtosis values. Skewness is a measure of the symmetry of a variable's distribution (a perfectly normal distribution has a skewness of 0), while kurtosis gives information on the combined amount of probability contained in the two tails of a distribution (often reported as excess kurtosis, that is kurtosis - 3, since a perfectly normally distributed variable has a kurtosis of 3). There is much debate regarding what constitutes "acceptable" values for these statistics or whether this question actually makes any sense at all. Yet, various authors suggest that values of ±2 should not be too problematic for procedures requiring approximately normally distributed variables (Field, 2009; Gravetter & Wallnau, 2014). Nonetheless, many statistical methods do not possess a direct normality assumption (linear regression models, for instance and contrarily to popular beliefs, do not assume the normality of either the response variable nor of any predictor/covariate but do assume that the replicated observations of the response variable for any fixed predictor value should be normally distributed (Montgomery & Peck, 1992); an assumption that is often difficult to verify, hence the usual focus on models residuals). For all these reasons, here, we only used these statistics as descriptive tools to foresee potential future complications related to the distribution of our variables, but not as bases for actual decision making.
uni.histograms(dataset = roverlaps) num.data <- roverlaps[, sapply(roverlaps, is.numeric)] tab <- data.frame(moments::skewness(x = num.data), moments::kurtosis(x = num.data)-3) knitr::kable(x = tab, digits = 3, col.names = c("Skewness", "Excess kurtosis")) rm(num.data, tab)
From the above histograms and table, we can see that: Very few of our sampled variables are close to approximate a normal distribution. Many variables are skewed to the right but in relatively reasonable proportions. Some variables (e.g. elevation, stand_surface, distance, and sedicover_height*) have skewness and/or kurtosis values possibly high enough to have undue influence on regression coefficients. These variables will perhaps require some transformations. \ \
\
To investigate potential correlations and collinearity among our variables, we started by computing a correlation matrix of the data. As most of our factors were either ordered factors (i.e. ordinal variables) or binary factors (i.e. boolean), we could use them in a correlation matrix. To do that however, computations were based on Spearman's rank correlation coefficients.
To avoid giving too much weight to our own tarping experiments (those performed in Chalon s/ Saone with the French national railway company), we removed them from the data to compute correlations: their influence was accounted for in further analyses by the use of mixed-effect models.
# To convert all ordered factors into numeric variables usable in a correlation matrix: roverlaps %>% dplyr::filter(!manager_id == 1) %>% # We remove our own experiments as they share many features that will strongly influence correlations (it should not be a problem for modelling as we will use mixed-effect models). dplyr::select(-manager_id, -xp_id) %>% dplyr::mutate_if(.predicate = is.factor, .funs = as.numfactor) -> num.roverlaps # as.numfactor() is my own custom function (see '01_data_cleaning.R'). # To compute the correlation matrix: res.cor.roverlaps <- round(stats::cor(num.roverlaps, use = "complete.obs", method = "spearman"), 2) # To compute a matrix of correlation p-values: res.pcor.roverlaps <- ggcorrplot::cor_pmat(x = num.roverlaps, method = "spearman") ggcorrplot::ggcorrplot(res.cor.roverlaps, type = "upper", outline.col = "white", ggtheme = ggplot2::theme_gray, colors = c("#6D9EC1", "white", "#E46726"), p.mat = res.pcor.roverlaps, insig = "blank") rm(num.roverlaps, res.cor.roverlaps, res.pcor.roverlaps)
Many interesting patterns can be observed in this correlation matrix. Among them, we can see that:
It is worth reminding that accounting for multivariate relationships, interactions and the effect of our tarping operations (dropped for the current correlation matrix) will certainly affect the observed patterns. Still, at this stage and in accordance with the hierarchy of hypotheses we would like to test for the reg_stripsoverlap variable, we decided to drop the following variables: geotex, maxveg, uprootexcav and flood.
roverlaps %>% dplyr::select(-geotex, -maxveg, -uprootexcav, -flood) -> roverlaps
As multicollinearity is one of the biggest issues for inference modelling, assessing which explanatory variables and covariates are collinear and could possibly be problematic during modelling stages is important: for instance by computing Variance Inflation Factors (VIF) for each variable. However, as only a small fraction of the variables will actually be used together in the likely a priori models that we will build, we intend on computing VIF as part of the models diagnostic checks and not here. Furthermore, as several variables were categorical and as we will not be working with models fitted with ordinary least squares, we will use Generalized VIF instead of VIF (Fox & Monette, 1992). \ \
\ To further study distributions and bivariate correlations, we decided to plot every numeric predictor against each others. It gives us the possibility to start investigating potential multivariate outliers.
GGally::ggpairs(roverlaps[, c("coarse_env", "obstacles", "sedicover_height", "slope", "stand_surface", "strips_overlap")])
This figure confirms that most variables are far from being normally distributed but that's not necessarily a problem. If many scatterplots appear relatively compact and consistent, all highly skewed variables (i.e. stand_surface, sedicover_height, and strips_overlap) tend to produce what seem to be multivariate outliers that could become potentially problematic for modelling. In case it happens, transforming these variables may be useful. \ \
\ Since reg_stripsoverlap will be modelled using logistic regressions, it is very important to ensure the respect of the linearity assumption between the outcome on the logit scale and each numeric predictor, for instance using Box-Tidwell tests or scatterplots. The main issue here is to get the values of the response variable on the logit scale. To do so, we have to build a logistic model incorporating all potential predictors to extract expected probabilities. A drawback from this approach is that the toy model can easily be misspecified in the process: for that reason, we actually checked this assumption iteratively for each a priori candidate model built for analyses (the scatterplots displayed below only serve to illustrate what transformations were actually applied to linearize relationships).
eff2 <- roverlaps[,c("reg_stripsoverlap", "coarse_env", "obstacles", "sedicover_height", "slope", "stand_surface", "strips_overlap")] model <- glm(reg_stripsoverlap ~., data = eff2, family = binomial) # Predict the probability (p) of regrowths at strip overlaps: probabilities <- predict(model, type = "response") # Transforming predictors mydata <- eff2 %>% dplyr::select_if(is.numeric) %>% dplyr::mutate("stand_surface (log)" = log(stand_surface)) %>% dplyr::mutate("sedicover_height (log+1)" = log(sedicover_height+1)) %>% dplyr::mutate("strips_overlap (sqrt)" = sqrt(strips_overlap)) # These transformations were made to linearize the relationships predictors <- colnames(mydata) # Bind the logit and tidying the data for plot (ggplot2, so long format) mydata <- mydata %>% dplyr::mutate(logit = log(probabilities/(1-probabilities))) %>% tidyr::gather(key = "predictors", value = "predictor.value", -logit) # Create scatterplot ggplot2::ggplot(mydata, ggplot2::aes(y = logit, x = predictor.value))+ ggplot2::geom_point(size = 0.5, alpha = 0.5) + ggplot2::geom_smooth(method = "loess") + ggplot2::theme_bw() + ggplot2::facet_wrap(~predictors, scales = "free_x") rm(eff2, model, probabilities, predictors, mydata)
This figure shows that log-transforming sedicover_height, stand_surface and strips_overlap helped linearize their relationship with reg_stripsoverlap expressed on the logit scale.
More general scatterplots can further be used to ensure the absence of bivariate (quasi-)perfect separation between a predictor and the response variable, which is highly problematic for logistic models fitted using Maximum Likelihood (Allison, 2004).
### To plot a response variable against all predictors/covariates, we will have to do use a for loop or the facet_wrap function of ggplot (or equivalent). # For the "reg_stripsoverlap" variable (bounded continuous variable): roverlaps %>% dplyr::select(-manager_id, -xp_id, -latitude, -longitude) %>% dplyr::mutate_if(is.factor, as.numfactor) %>% tidyr::pivot_longer(-c(reg_stripsoverlap), # Means that all columns of `mydata` need to be reshaped except "reg_stripsoverlap" & "fully_tarped" (works as well with the form `!reg_stripsoverlap`, and we can place it somewhere else: actually, it is the `cols =` argument)! We have to work in the long-format because that's how the tidyverse and thus ggplot2 love to work... names_to = "Predictor", # Gives the name of the new grouping variable while "values_to" gives the name of the variable that will be created from the data stored in the cell value. values_to = "Value") %>% ggplot2::ggplot(ggplot2::aes(x = Value, y = reg_stripsoverlap)) + ggplot2::facet_wrap(~Predictor, scales = "free_x", strip.position = "bottom") + ggplot2::geom_point(size = 1.5, alpha = 0.8, col = "palevioletred4") + ggplot2::scale_y_continuous(limits=c(0,1)) + ggplot2::geom_smooth(method = "lm", na.rm = TRUE, size = 1, col = "plum1") + ggplot2::labs(x = "Predictor value", y = "Regrowth at strip overlaps") + ggthemes::theme_hc(style = "darkunica") + ggplot2::theme(panel.border = ggplot2::element_blank(), legend.key = ggplot2::element_blank(), strip.background = ggplot2::element_blank(), strip.text = ggplot2::element_text(colour = "gray60", size = 10), strip.placement = "outside", axis.line.x = ggplot2::element_line(colour = "darkorange4", size=0.5), axis.line.y = ggplot2::element_line(colour = "darkorange4", size=0.5), legend.position = "right")
There is no apparent sign of bivariate (quasi-)perfect separation, so regular logistic models can be used as well as multimodel inference. \ \
Finally and similarly as for other response variables, we would have liked to further investigate potentially meaningful interactions among our variables using co-plots (i.e. conditioning plots). Unfortunately, as reg_stripsoverlap contains only 22 successes (i.e. n~1~ = 22 and n~0~ = 52), our sample will not enable us to model interactions because it would require estimating three parameters, which is too many for such a small sample (Harrell, 2015).
### Finally, I export the dataset to be used for modelling in my "mydata" folder: readr::write_csv(x = roverlaps, file = here::here("mydata", "roverlaps.csv"), append = FALSE)
\ \
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.