knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
In this vignette, we illustrate our package with the two real data sets used in the article https://hal.archives-ouvertes.fr/hal-02936779v3.
install.packages("ZIprop") install.packages("knitr") install.packages("kableExtra") install.packages("ggplot2") install.packages("ggrepel") install.packages("ggthemes") install.packages("stringr")
Note that knitr and kableExtra are used to visualize tables, ggplot2, ggrepel and ggthemes to plot graphics and stringr to manipulate char.
suppressPackageStartupMessages(library(ZIprop)) library(knitr) library(kableExtra) library(ggplot2) library(ggrepel) library(ggthemes) library(stringr)
We consider an Equine Influenza outbreak in 2003 in race horses from different training yards in Newmarket. In what follows, we use four discrete factors and one continuous factor computed from the observed variables age, sex and training yard:
The weight is the probability of transmission between horses.
Some of these factors are missing for some target-contributor pairs. Hence, the tests for assessing the effect of a given factor on the transmission probability are applied on the subset of complete data for this factor. The data set used for this study is available in the package or at this link https://doi.org/10.5281/zenodo.4837560.
data(equineDiffFactors) summary(equineDiffFactors)
In this example, only $B=1000$ permutations are used. It is recommended to study the stability of the permutations tests results to choose the best number of permutations.
res_equine = permDT ( equineDiffFactors, ColNameFactor = colnames(equineDiffFactors)[-c(1:3)], B = 1000, nclust = 1, ColNameWeight = "weight", ColNameRecep = "ID.recep", ColNameSource = "ID.source", num_class = "DistYard", other_class = colnames(equineDiffFactors)[-c(1:3,7)], multiple_test = TRUE )
Note that all non-numeric factor are specified in the other_class option.
kbl(res_equine$dt[,c(1,2)], caption = "ALL") %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Factors SameYard, SameSex, DistYard and TransSe are significantly correlated to the transmission probability whereas DiffAg is not.
kbl(res_equine$dt_multi$SameYard[,-c(3:4)], caption = "SameYard") %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
The test statistics of SameYard and DistYard being negative, horses trained in the same yard or in nearby yards have a higher chance to be linked by a transmission. This is a clearly intuitive result certainly due to higher contact rate in shared training areas.
kbl(res_equine$dt_multi$SameSex[,-c(3:4)], caption = "SameSex") %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
The statistics of post-hoc univariate tests for factor SameSex is also negative, which means that the virus better circulates between horses with the same sex.
kbl(res_equine$dt_multi$TransSex[,-c(3:4)], caption = "TransSex") %>% kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Moreover, the post-hoc tests on the TransSex modalities show that only the difference between F $\to$ M - M $\to$ M (and M $\to$ F - M $\to$ M when one considers the corrected p-values) are not significant. The results on the p-values and the sign of the statistics show that transmissions between females are favored compared to all other possible combinations (F $\to$ F transmissions have positive probabilities 1.8 times more than expected under complete randomness.
First of all, only the factor that are significant is chosen to compute the performance indicator and all the NA has to be removed from the data set.
DT_omitna = na.omit(equineDiffFactors[,c(1:3,4,5,7,8),with=F]) summary(DT_omitna )
max_generations and wait_generations are set to low values because the number of factors is low. It is recommended to increase these values id there is no convergence due for example to a higher number of factor.
system.time({I_max = indicator_max( DT_omitna, ColNameFactor = colnames(DT_omitna)[-c(1:3)], ColNameWeight = "weight", bounds = c(-10, 10), max_generations = 50, hard_limit = TRUE, wait_generations = 10, other_class = colnames(DT_omitna)[-c(1:3,6)] )}) I_max$indicator
The performance indicator takes the value $I_{\hat{{\beta}}}(\mathbb{X},Z)=0.21$ using the four selected factors. This relatively low value, which indicates that there is a moderate correlation between the combination of the four factors and the transmissions, can actually be viewed as quite large given the fact that we only consider very basic factors to \textit{predict} the transmissions.
The proba are the mixture probabilities i.e. the similarity between targets (UE countries) and contributors (US and CA states) in terms of mortality dynamics during the first wave of the Covid-19. We consider 29 variables related to economy, demography, health, healthcare system and climate. More precisely, the objective is to identify factors negatively correlated with the response, i.e., the lower the distance between two geographic entities with respect to a given variable, the higher the mixture probability. Consequently, we use the univariate test for continuous factors, which are computed for each target-contributor pair by $x_k^{(i,j)}= |x_k^i - x_k^j|$. The data set used for this study is available in https://doi.org/10.5281/zenodo.4769671.
data(diffFactors) summary(diffFactors)
In this example, only $B=1000$ permutations are used. It is recommended to study the stability of the permutations tests results to choose the best number of permutations.
res = permDT ( diffFactors, ColNameFactor = colnames(diffFactors)[-c(1:3)], B = 1000, nclust = 1, ColNameWeight = "proba", ColNameRecep = "ID.recep", ColNameSource = "ID.source", multiple_test = TRUE )
res$stat2 = res$stat res$stat2[res$pv_neg>0.05] = NA res$Factor = str_remove(rownames(res),"_diff") res$Factor = factor(res$Factor, levels = res$Factor) res$dec = rep(c(0.15,-0.15),50)[1:nrow(res)] res$dec[res$pv_neg>0.05] = NA p = ggplot(res,aes(Factor,pv_neg)) + geom_point(colour = "black") + ylim(-0.1,1) + xlab("") + ylab("P-value") + theme_classic() + geom_rangeframe() + geom_hline(yintercept = 0.05, color ="red") + #geom_text_repel(data = res, aes(label = round(stat2,2)), # point.padding = 1) + geom_label_repel(data = res, aes(label = round(stat2,2)), fill = "white", nudge_y = na.omit(res$dec) )+ ggtitle("Permutation test results") p + theme(axis.text.x = element_text(angle = 45, hjust = 1), plot.title = element_text(hjust = 0.5))
The figure shows the p-values obtained for each factor and the Spearman's correlation for significant factors. We identified eleven impacting factors: hospibed, smokers, lung, healthexp, gdp_capita, fertility, urbanpop, nurses_per_1K, gdp2019, pop_female_0_14, and pop_tot_0_14. The figure shows that Spearman's correlation is negative for significant factors. This result is consistent with our objective (to identify the significant factors negatively correlated to the response).
Only the factor that are significant is chosen to compute the performance indicator. The computation time it is quite long (around seconds).
ind_fact_select = which(colnames(diffFactors) %in% rownames(res[res$pv_neg<0.05,]) ) system.time({I_max = indicator_max( diffFactors, ColNameFactor = colnames(diffFactors)[c(ind_fact_select)], ColNameWeight = "proba", bounds = c(-10, 10), max_generations = 200, hard_limit = TRUE, wait_generations = 50 )}) I_max$indicator
Then, we applied the multivariate analysis based on the eleven significant factors. The performance indicator is equal to $I_{\hat{{\beta}}}(\mathbb{X},Z)=0.73$, which shows that a high monotonous dependency exists between the mixture probabilities and these factors.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.