knitr::opts_chunk$set(echo = TRUE)
library(here)
library(dplyr)
library(psych)
library(car)
dat = read.csv(here("Data_folder", "file_name.csv"))
View(dat)
# Check the packaging
dim(dat)
# Look at the data (top + bottom)
head(dat)
tail(dat)
# Run str()
str(dat)
Check column data type (e.g. numeric, character,...).
Are data and column classes correctly specified?
If not, convert column class into the correct one.
?as.character
?as.numeric
?as.factor
dat$col_name = as.factor(dat$col_name)
summary(dat)
If column class is character, check:
dat %>% count(col_name)
Validating data with external information --> Check "weird stuff"
# Check: if TN >= (NO2 + NO3 + NH4)
# Check: if TP >= PO4
# Check: if PC >= mcyB
dat_check = dat %>%
  mutate(N_check = (TN >= (NO2 + NO3 + NH4)),
         P_check = (TP >= PO4),
         gene_check = (PC >= mcyB))
# List the rows that fail each check
dat_check %>% select(ID, TN, NO2, NO3, NH4, N_check) %>% filter(N_check == FALSE)
dat_check %>% select(ID, TP, PO4, P_check) %>% filter(P_check == FALSE)
dat_check %>% select(ID, PC, mcyB, gene_check) %>% filter(gene_check == FALSE)
Zero-inflated data: the response variable contains more zeros than expected under a Poisson or negative binomial distribution.
# proportion of 0's in the data
dat.tab <- table(dat$col_name == 0)
dat.tab/sum(dat.tab)
# proportion of 0's expected from a Poisson distribution
mu <- mean(dat$col_name)
cnts <- rpois(1000, mu)
dat.tab <- table(cnts == 0)
dat.tab/sum(dat.tab)
Note:
value under FALSE: proportion of non-zero values in data
value under TRUE: proportion of zeros in data
The proportion of zeros observed (TRUE_observed %) far exceeds the proportion that would have been expected (TRUE_expected %).
=> col_name data is zero-inflated
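The expected proportion of zeros can also be computed exactly rather than from simulated rpois() draws, since for a Poisson distribution P(X = 0) = exp(-mu). The sketch below illustrates this on made-up counts; the vector `counts` is a hypothetical stand-in for dat$col_name.

```r
# Hypothetical count data standing in for dat$col_name
set.seed(1)
counts <- c(rep(0, 60), rpois(40, lambda = 3))

# Observed proportion of zeros
p_zero_obs <- mean(counts == 0)

# Expected proportion of zeros under Poisson(mu): P(X = 0) = exp(-mu)
mu <- mean(counts)
p_zero_exp <- dpois(0, lambda = mu)

c(observed = p_zero_obs, expected = p_zero_exp)
```

Using dpois() removes the sampling noise of the simulation, so the comparison depends only on the data.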
Collinearity: existence of correlation between covariates.
--> the information that a covariate provides about the response is redundant in the presence of the other covariates.
Solution: remove concerned variables
# matrix plot of correlations
select(dat, var_1:var_n) %>%
  psych::pairs.panels(ellipses = FALSE, scale = TRUE)
Possible collinearity:
e.g. TP ~ PO4
e.g. TN ~ NO3
VIF: variance inflation factor.
Smallest possible value of VIF = 1 = absence of collinearity.
Strategy:
* sequentially drop the covariate with the highest VIF
* recalculate VIFs
* repeat process until all VIFs are smaller than a pre-selected threshold (3)
# dat_vif must contain the response (var_y) as well as the covariates
dat_vif = select(dat, var_1:var_n)
lm(var_y ~ ., data = dat_vif) %>% car::vif()
VIFs of var_1, var_2, etc. are bigger than 3.
--> multicollinearity between above explanatory variables.
Remove var_1 (highest VIF) and recalculate VIF.
lm(var_y ~ .-var_1, data = dat_vif) %>% vif()
There are changes in VIFs after removing var_1, especially VIF of var_2 (x1 to x2).
Remove var_3 (highest VIF) and recalculate VIF.
lm(var_y ~ . - var_1 - var_3, data = dat_vif) %>% vif()
There are changes in VIFs after removing var_3, especially var_4 (x1 to x2) and var_5 (x1 to x2).
All VIFs are smaller than 3 --> remaining explanatory variables can be used in next steps.
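The drop-and-recalculate strategy can be automated with a loop. The sketch below is one possible version on simulated data, not the template's own code. It computes VIFs in base R as the diagonal of the inverse correlation matrix, which matches the usual definition for numeric covariates, instead of calling car::vif() on a fitted model. The covariate names x1 to x4 and the helper `vif_base()` are hypothetical.

```r
# Simulated covariates (hypothetical names); x2 is nearly a copy of x1
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
x4 <- rnorm(n)
covs <- data.frame(x1, x2, x3, x4)

# VIF of each numeric covariate = diagonal of the inverse correlation matrix
vif_base <- function(df) setNames(diag(solve(cor(df))), names(df))

threshold <- 3
repeat {
  v <- vif_base(covs)
  if (max(v) < threshold) break          # stop once every VIF is below 3
  worst <- names(which.max(v))           # covariate with the highest VIF
  message("Dropping ", worst, " (VIF = ", round(max(v), 1), ")")
  covs <- covs[, names(covs) != worst, drop = FALSE]
}
vif_base(covs)
```

Only one covariate is dropped per pass because removing it changes the VIFs of all the others.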
After detecting and dealing with multicollinearity, the possible explanatory variables are var_4, var_5, var_6, etc.
Plot response variable vs each covariate (remaining after dealing with collinearity).
select(dat, var_y, var_4, var_5, var_6) %>%
  pairs.panels(ellipses = FALSE, scale = TRUE)
Overdispersion: the variance is larger than the mean
Check in/after model fitting
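One common way to run this check, sketched here on simulated counts rather than on `dat`, is to fit a Poisson GLM and divide the Pearson chi-square statistic by the residual degrees of freedom. The variables `x` and `y` and the simulation settings are assumptions made for illustration.

```r
# Simulated overdispersed counts (hypothetical data, not from `dat`)
set.seed(123)
n <- 300
x <- runif(n)
y <- rnbinom(n, mu = exp(0.5 + 1.2 * x), size = 0.8)

# Fit a Poisson GLM, then compare the Pearson chi-square to the residual df
fit <- glm(y ~ x, family = poisson)
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
dispersion   # values well above 1 suggest overdispersion
```

A ratio close to 1 is consistent with the Poisson assumption that the variance equals the mean.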
=> result: Overdispersion (?)
Interaction: effect of one explanatory variable on the response depends on the value of another explanatory variable
Check in/after model fitting
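As a rough illustration of such a check, the sketch below compares a model with and without an interaction term using anova() on nested linear models. The data are simulated with a built-in interaction, and the names `x1`, `x2`, `fit_add` and `fit_int` are hypothetical. The same idea should carry over to other model families, with a suitable test.

```r
# Simulated data where the effect of x1 depends on x2
set.seed(7)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + 0.8 * x1 * x2 + rnorm(n)

fit_add <- lm(y ~ x1 + x2)   # additive model, no interaction
fit_int <- lm(y ~ x1 * x2)   # adds the x1:x2 interaction term

# F-test of the nested models; a small p-value supports keeping the interaction
cmp <- anova(fit_add, fit_int)
cmp
```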
=> result: No interaction (?)
Use boxplot or Cleveland dotplot
dotchart()
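A minimal sketch of both plots on a made-up vector; `x` is hypothetical and contains one artificial outlier. In a Cleveland dotplot the values are plotted against their order in the data, so an isolated point far to the right or left stands out.

```r
# Hypothetical variable: 50 ordinary values plus one artificial outlier
set.seed(3)
x <- c(rnorm(50, mean = 10, sd = 1), 25)

# Cleveland dotplot: value on the x-axis, order of the data on the y-axis
dotchart(x, xlab = "Value", ylab = "Order of the data",
         main = "Cleveland dotplot")

# Boxplot of the same variable for comparison
boxplot(x, horizontal = TRUE, xlab = "Value")
```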
After fitting models by ...
=> Outliers: obs.X1 and obs.X2
=> remove
conditional boxplot
Use hist() or qqplot()
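A short sketch on a simulated, right-skewed variable; `x` is hypothetical. It uses qqnorm() with qqline() for the quantile plot, because qqplot() in base R compares two samples and needs a second vector of reference quantiles.

```r
# Hypothetical right-skewed variable
set.seed(11)
x <- rlnorm(100)

# Histogram: shows the overall shape of the distribution
hist(x, breaks = 20, main = "Histogram", xlab = "Value")

# Normal Q-Q plot: points far from the line indicate departure from normality
qq <- qqnorm(x)
qqline(x)
```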