knitr::opts_chunk$set(echo = TRUE)
library(here)
library(dplyr)
library(psych)
library(car)
dat = read.csv(here("Data_folder", "file_name.csv"))
View(dat)
# Check the packaging
dim(dat)
# Look at Data (top + bottom)
head(dat)
tail(dat)
# Run str()
str(dat)
Check the data type of each column (e.g. numeric, character, ...).
Are the data and column classes correctly specified?
If not, convert the column to the correct class.
?as.character
?as.numeric
?as.factor
dat$col_name = as.factor(dat$col_name)
summary(dat)
If a column's class is character, check how often each value occurs:
dat %>% count(col_name)
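The class checks and conversions above can be tried on a toy data frame first. This is a minimal base-R sketch; `toy`, `site` and `depth` are hypothetical names standing in for `dat` and its columns, and the "n/a" entry is invented to show what happens when text cannot be parsed as a number.

```r
# Toy data frame standing in for `dat`; all names and values are hypothetical
toy <- data.frame(
  site  = c("A", "B", "A", "C"),
  depth = c("1.5", "2.0", "n/a", "3.2"),  # numeric values stored as text
  stringsAsFactors = FALSE
)

# Class of every column at a glance
col_classes <- sapply(toy, class)
print(col_classes)

# Convert a text column to numeric; entries that cannot be parsed become NA
depth_num <- suppressWarnings(as.numeric(toy$depth))
n_failed  <- sum(is.na(depth_num) & !is.na(toy$depth))

# Convert a categorical text column to a factor and tabulate its levels
toy$site    <- as.factor(toy$site)
site_counts <- table(toy$site)
print(site_counts)
```

Counting the entries that turned into NA (`n_failed`) before overwriting the column is one way to avoid silently losing values during conversion.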
Validate the data against external information --> check for "weird stuff" (values that contradict known relationships, e.g. a component that exceeds its total).
# Check: if TN >= (NO2 + NO3 + NH4)
# Check: if TP >= PO4
# Check: if PC >= mcyB
dat_check = dat %>%
  mutate(N_check = (TN >= (NO2 + NO3 + NH4)),
         P_check = (TP >= PO4),
         gene_check = (PC >= mcyB))
dat_check %>% select(ID, TN, NO2, NO3, NH4, N_check) %>% filter(N_check == FALSE)
dat_check %>% select(ID, TP, PO4, P_check) %>% filter(P_check == FALSE)
dat_check %>% select(ID, PC, mcyB, gene_check) %>% filter(gene_check == FALSE)
Zero-inflated data: the response variable contains more zeros than would be expected under a Poisson or negative binomial distribution.
# proportion of 0's in the data
dat.tab <- table(dat$col_name == 0)
dat.tab/sum(dat.tab)
# proportion of 0's expected from a Poisson distribution
mu <- mean(dat$col_name)
cnts <- rpois(1000, mu)
dat.tab <- table(cnts == 0)
dat.tab/sum(dat.tab)
Note:
value under FALSE: proportion of non-zero values in data
value under TRUE: proportion of zeros in data
The observed proportion of zeros (TRUE_observed %) far exceeds the proportion that would have been expected (TRUE_expected %).
=> col_name data is zero-inflated
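The simulation above estimates the expected proportion of zeros from 1000 random draws, so it changes slightly from run to run. For a Poisson distribution the same quantity is available exactly as P(Y = 0) = exp(-mu), which `dpois(0, mu)` returns. A minimal sketch on invented counts (the vector `y` is a toy stand-in for `dat$col_name`, built with deliberately too many zeros):

```r
set.seed(1)  # only so that the toy data are reproducible

# Toy counts with extra zeros: 60 added zeros plus 140 Poisson(3) draws
y <- c(rep(0, 60), rpois(140, lambda = 3))

obs_zero <- mean(y == 0)                 # observed proportion of zeros
exp_zero <- dpois(0, lambda = mean(y))   # exact Poisson P(Y = 0) = exp(-mean)

print(c(observed = obs_zero, expected = exp_zero))
```

If the observed proportion is much larger than the expected one, as it is by construction here, the variable is a candidate for a zero-inflated model.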
Collinearity: existence of correlation between covariates.
--> the information that a covariate provides about the response is redundant in the presence of the other covariates.
Solution: remove the variables concerned.
# matrix plot of correlations
select(dat, var_1:var_n) %>%
  psych::pairs.panels(ellipses = FALSE, scale = TRUE)
Possible collinearity:
e.g. TP ~ PO4
e.g. TN ~ NO3
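Reading candidate pairs off a matrix plot gets harder as the number of covariates grows. A numeric companion is to list every pair whose absolute correlation exceeds a cut-off. The sketch below uses base R on simulated covariates; the names `x1` to `x3` and the 0.7 cut-off are assumptions chosen for illustration, not values from this template.

```r
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1
x3 <- rnorm(n)                  # unrelated to the others
covs <- data.frame(x1, x2, x3)

cor_mat   <- cor(covs, use = "pairwise.complete.obs")
threshold <- 0.7                # assumed cut-off, adjust to taste

# Keep each pair once (upper triangle), then filter on |r|
idx <- which(upper.tri(cor_mat) & abs(cor_mat) > threshold, arr.ind = TRUE)
high_pairs <- data.frame(
  var_a = rownames(cor_mat)[idx[, "row"]],
  var_b = colnames(cor_mat)[idx[, "col"]],
  r     = cor_mat[idx]
)
print(high_pairs)
```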
VIF: variance inflation factor.
The smallest possible value of VIF is 1, which indicates absence of collinearity.
Strategy:
* sequentially drop the covariate with highest VIF
* recalculate VIFs
* repeat process until all VIFs are smaller than a pre-selected threshold (3)
# Include the response so that var_y ~ . can find it in dat_vif
dat_vif = select(dat, var_y, var_1:var_n)
lm(var_y ~ ., data = dat_vif) %>% car::vif()
VIFs of var_1, var_2, etc. are bigger than 3.
--> multicollinearity between above explanatory variables.
Remove var_1 (highest VIF) and recalculate VIF.
lm(var_y ~ .-var_1, data = dat_vif) %>% vif()
There are changes in VIFs after removing var_1, especially VIF of var_2 (x1 to x2).
Remove var_3 (highest VIF) and recalculate VIF.
lm(var_y ~ . - var_1 - var_3, data = dat_vif) %>% vif()
There are changes in VIFs after removing var_3, especially var_4 (x1 to x2) and var_5 (x1 to x2).
All VIFs are smaller than 3 --> remaining explanatory variables can be used in next steps.
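The drop-and-recalculate steps above can also be sketched as a loop. To keep the example self-contained it computes each VIF from its definition, 1 / (1 - R^2) of a covariate regressed on the remaining covariates, rather than calling `car::vif()`. The helper names `vif_manual` and `drop_high_vif` are hypothetical, the data are simulated, and the sketch assumes all covariates are numeric.

```r
# VIF of each column: 1 / (1 - R^2) from regressing it on the other columns
vif_manual <- function(X) {
  sapply(names(X), function(v) {
    f  <- reformulate(setdiff(names(X), v), response = v)
    r2 <- summary(lm(f, data = X))$r.squared
    1 / (1 - r2)
  })
}

# Drop the covariate with the highest VIF until all VIFs are below the threshold
drop_high_vif <- function(X, threshold = 3) {
  dropped <- character(0)
  while (ncol(X) > 2) {
    v <- vif_manual(X)
    if (max(v) < threshold) break
    worst   <- names(v)[which.max(v)]
    dropped <- c(dropped, worst)
    X       <- X[, names(X) != worst, drop = FALSE]
  }
  list(kept = names(X), dropped = dropped, vif = vif_manual(X))
}

# Toy covariates: x2 is strongly tied to x1, x3 and x4 are independent
set.seed(7)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)
x3 <- rnorm(n)
x4 <- rnorm(n)

res <- drop_high_vif(data.frame(x1, x2, x3, x4))
print(res)
```

An automated loop is convenient, but inspecting the VIFs after each removal, as done step by step above, lets subject knowledge decide which of two related covariates to keep.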
After detecting and dealing with multicollinearity, the possible explanatory variables are: var_4, var_5, var_6, etc.
Plot response variable vs each covariate (remaining after dealing with collinearity).
select(dat, var_y, var_4, var_5, var_6) %>%
  pairs.panels(ellipses = FALSE, scale = TRUE)
Overdispersion: the variance is larger than the mean
Check in/after model fitting
=> result: Overdispersion (?)
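One common check after fitting is the Pearson dispersion statistic: the sum of squared Pearson residuals divided by the residual degrees of freedom. Values near 1 are consistent with a Poisson model, while values well above 1 suggest overdispersion. The sketch below assumes a Poisson GLM and uses simulated negative binomial counts, so the model and variable names are illustrative only.

```r
set.seed(3)
n <- 300
x <- runif(n)

# Negative binomial counts: variance = mu + mu^2 / size, larger than the mean
y <- rnbinom(n, mu = exp(1 + x), size = 1)

fit <- glm(y ~ x, family = poisson)

# Pearson dispersion statistic: sum of squared Pearson residuals / residual df
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(dispersion)
```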
Interaction: effect of one explanatory variable on the response depends on the value of another explanatory variable
Check in/after model fitting
=> result: No interaction (?)
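One way to check for an interaction after fitting is to compare a model with only main effects against a model that adds the interaction term. The sketch below assumes a linear model and simulated data with a deliberately built-in interaction; for a GLM the same comparison can be made with `anova(..., test = "Chisq")`.

```r
set.seed(11)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + x1 + x2 + 1.5 * x1 * x2 + rnorm(n)   # interaction built in on purpose

fit_add <- lm(y ~ x1 + x2)   # main effects only
fit_int <- lm(y ~ x1 * x2)   # main effects plus x1:x2

# F-test comparing the two nested models
cmp <- anova(fit_add, fit_int)
p_interaction <- cmp[["Pr(>F)"]][2]
print(cmp)
```

A small p-value indicates that the interaction term improves the fit; a large one is consistent with "No interaction".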
Use boxplot or Cleveland dotplot
dotchart()
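A minimal sketch of a Cleveland dotplot on simulated values with one planted extreme observation. The vector `x` is a toy stand-in for a column of `dat`, and the 1.5 * IQR rule used by `boxplot.stats()` is shown only as a numeric companion to the plot, not as a rule for deleting data.

```r
set.seed(5)
x <- c(rnorm(50, mean = 10, sd = 1), 25)   # one planted extreme value

# Cleveland dotplot: value on the x-axis, order of the data on the y-axis
dotchart(x, xlab = "Value", ylab = "Order of the data")

# Indices of the points flagged by the boxplot 1.5 * IQR rule
out_idx <- which(x %in% boxplot.stats(x)$out)
print(out_idx)
```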
After fitting models by ...
=> Outliers: obs.X1 and obs.X2
=> remove
conditional boxplot
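A conditional boxplot shows the response split by the levels of a grouping variable, so that location and spread can be compared between groups. A minimal sketch with the formula interface of `boxplot()`; `toy`, `group` and `value` are hypothetical names and the data are simulated.

```r
set.seed(9)
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 30),
  value = c(rnorm(30, mean = 5), rnorm(30, mean = 7), rnorm(30, mean = 6))
)

# One box per level of `group`
boxplot(value ~ group, data = toy, xlab = "Group", ylab = "Value")

# Numeric companion: median and spread per group
group_medians <- tapply(toy$value, toy$group, median)
group_sds     <- tapply(toy$value, toy$group, sd)
print(rbind(median = group_medians, sd = group_sds))
```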
Use hist() or qqplot()
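A minimal sketch of both plots on simulated, deliberately right-skewed data. Note that base R's `qqplot()` compares two samples with each other, so this sketch uses `qqnorm()` and `qqline()`, which compare one sample against a normal distribution; that substitution is an assumption about what the check is meant to show.

```r
set.seed(13)
x <- rexp(200, rate = 1)   # deliberately right-skewed toy data

hist(x, breaks = 20, main = "Histogram", xlab = "Value")

qqnorm(x)   # sample quantiles against theoretical normal quantiles
qqline(x)   # reference line through the first and third quartiles

# Quick numeric hint of right skew: the mean lies above the median
right_skewed <- mean(x) > median(x)
print(right_skewed)
```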