Studying data
In ZIprop: Permutations Tests and Performance Indicator for Zero-Inflated Proportions Response

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

In this vignette, we illustrate our package with the two real data sets used in the article https://hal.archives-ouvertes.fr/hal-02936779v3.

Install and load package

Install packages

install.packages("ZIprop")
install.packages("knitr")
install.packages("kableExtra")
install.packages("ggplot2")
install.packages("ggrepel")
install.packages("ggthemes")
install.packages("stringr")

Note that knitr and kableExtra are used to visualize tables, ggplot2, ggrepel and ggthemes to plot graphics and stringr to manipulate char.

Load packages

suppressPackageStartupMessages(library(ZIprop))
library(knitr)
library(kableExtra)
library(ggplot2)
library(ggrepel)
library(ggthemes)
library(stringr)

Equine Influenza

We consider an Equine Influenza outbreak in 2003 in race horses from different training yards in Newmarket. In what follows, we use four discrete factors and one continuous factor computed from the observed variables age, sex and training yard:

SameYard: $1$ if the target and contributor are trained in the same yard, $0$ otherwise;
SameSex: $1$ if the target and contributor have the same sex, $0$ otherwise;
DiffAge: $0$ if the target and contributor have the same age, $1$ for a one-year difference and $2$ for more than one year;
DistYard: geographic distance (in km) between the training yards of the target and the contributor;
TransSex: F $\to$ F if a female infected another female, M $\to$ F if a male infected a female, F $\to$ M if a female infected a male and M $\to$ M if a male infected another male.

The weight is the probability of transmission between horses.

Some of these factors are missing for some target-contributor pairs. Hence, the tests for assessing the effect of a given factor on the transmission probability are applied on the subset of complete data for this factor. The data set used for this study is available in the package or at this link https://doi.org/10.5281/zenodo.4837560.

Load the data and summarize them.

data(equineDiffFactors)
summary(equineDiffFactors)

Run permutations test

In this example, only $B=1000$ permutations are used. It is recommended to study the stability of the permutations tests results to choose the best number of permutations.

res_equine = permDT (
  equineDiffFactors,
  ColNameFactor = colnames(equineDiffFactors)[-c(1:3)],
  B = 1000,
  nclust = 1,
  ColNameWeight = "weight",
  ColNameRecep = "ID.recep",
  ColNameSource = "ID.source",
  num_class = "DistYard",
  other_class = colnames(equineDiffFactors)[-c(1:3,7)],
  multiple_test = TRUE
)

Note that all non-numeric factor are specified in the other_class option.

Permutations tests results

kbl(res_equine$dt[,c(1,2)], caption = "ALL") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

Factors SameYard, SameSex, DistYard and TransSe are significantly correlated to the transmission probability whereas DiffAg is not.

kbl(res_equine$dt_multi$SameYard[,-c(3:4)], caption = "SameYard") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

The test statistics of SameYard and DistYard being negative, horses trained in the same yard or in nearby yards have a higher chance to be linked by a transmission. This is a clearly intuitive result certainly due to higher contact rate in shared training areas.

kbl(res_equine$dt_multi$SameSex[,-c(3:4)], caption = "SameSex") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

The statistics of post-hoc univariate tests for factor SameSex is also negative, which means that the virus better circulates between horses with the same sex.

kbl(res_equine$dt_multi$TransSex[,-c(3:4)], caption = "TransSex") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

Moreover, the post-hoc tests on the TransSex modalities show that only the difference between F $\to$ M - M $\to$ M (and M $\to$ F - M $\to$ M when one considers the corrected p-values) are not significant. The results on the p-values and the sign of the statistics show that transmissions between females are favored compared to all other possible combinations (F $\to$ F transmissions have positive probabilities 1.8 times more than expected under complete randomness.

Performance indicator

First of all, only the factor that are significant is chosen to compute the performance indicator and all the NA has to be removed from the data set.

DT_omitna = na.omit(equineDiffFactors[,c(1:3,4,5,7,8),with=F])
summary(DT_omitna )

max_generations and wait_generations are set to low values because the number of factors is low. It is recommended to increase these values id there is no convergence due for example to a higher number of factor.

system.time({I_max = indicator_max(
  DT_omitna,
  ColNameFactor = colnames(DT_omitna)[-c(1:3)],
  ColNameWeight = "weight",
  bounds = c(-10, 10),
  max_generations = 50,
  hard_limit = TRUE,
  wait_generations = 10,
  other_class = colnames(DT_omitna)[-c(1:3,6)]
)})

I_max$indicator

The performance indicator takes the value $I_{\hat{{\beta}}}(\mathbb{X},Z)=0.21$ using the four selected factors. This relatively low value, which indicates that there is a moderate correlation between the combination of the four factors and the transmissions, can actually be viewed as quite large given the fact that we only consider very basic factors to \textit{predict} the transmissions.

Covid-19

The proba are the mixture probabilities i.e. the similarity between targets (UE countries) and contributors (US and CA states) in terms of mortality dynamics during the first wave of the Covid-19. We consider 29 variables related to economy, demography, health, healthcare system and climate. More precisely, the objective is to identify factors negatively correlated with the response, i.e., the lower the distance between two geographic entities with respect to a given variable, the higher the mixture probability. Consequently, we use the univariate test for continuous factors, which are computed for each target-contributor pair by $x_k^{(i,j)}= |x_k^i - x_k^j|$. The data set used for this study is available in https://doi.org/10.5281/zenodo.4769671.

Load the data and summarize them.

data(diffFactors)
summary(diffFactors)

Run permutations test

In this example, only $B=1000$ permutations are used. It is recommended to study the stability of the permutations tests results to choose the best number of permutations.

res = permDT (
  diffFactors,
  ColNameFactor = colnames(diffFactors)[-c(1:3)],
  B = 1000,
  nclust = 1,
  ColNameWeight = "proba",
  ColNameRecep = "ID.recep",
  ColNameSource = "ID.source",
  multiple_test = TRUE
)

Permutations tests results

res$stat2 = res$stat
res$stat2[res$pv_neg>0.05] = NA
res$Factor = str_remove(rownames(res),"_diff")
res$Factor = factor(res$Factor, levels = res$Factor)
res$dec = rep(c(0.15,-0.15),50)[1:nrow(res)]
res$dec[res$pv_neg>0.05] = NA

p = ggplot(res,aes(Factor,pv_neg)) + geom_point(colour = "black") + 
  ylim(-0.1,1) +
  xlab("") + 
  ylab("P-value") +
  theme_classic() +
  geom_rangeframe() + 
  geom_hline(yintercept = 0.05, color ="red") +
  #geom_text_repel(data = res, aes(label = round(stat2,2)),
  #                         point.padding = 1)  +
  geom_label_repel(data = res, aes(label = round(stat2,2)), fill = "white", 
                   nudge_y = na.omit(res$dec) )+
  ggtitle("Permutation test results")

p + theme(axis.text.x = element_text(angle = 45, hjust = 1),
          plot.title = element_text(hjust = 0.5))

The figure shows the p-values obtained for each factor and the Spearman's correlation for significant factors. We identified eleven impacting factors: hospibed, smokers, lung, healthexp, gdp_capita, fertility, urbanpop, nurses_per_1K, gdp2019, pop_female_0_14, and pop_tot_0_14. The figure shows that Spearman's correlation is negative for significant factors. This result is consistent with our objective (to identify the significant factors negatively correlated to the response).

Performance indicator

Only the factor that are significant is chosen to compute the performance indicator. The computation time it is quite long (around seconds).

ind_fact_select = which(colnames(diffFactors) %in%  rownames(res[res$pv_neg<0.05,]) )

system.time({I_max = indicator_max(
  diffFactors,
  ColNameFactor = colnames(diffFactors)[c(ind_fact_select)],
  ColNameWeight = "proba",
  bounds = c(-10, 10),
  max_generations = 200,
  hard_limit = TRUE,
  wait_generations = 50
)})

I_max$indicator

Then, we applied the multivariate analysis based on the eleven significant factors. The performance indicator is equal to $I_{\hat{{\beta}}}(\mathbb{X},Z)=0.73$, which shows that a high monotonous dependency exists between the mixture probabilities and these factors.

Any scripts or data that you put into this service are public.

ZIprop documentation built on June 9, 2021, 5:09 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

ZIprop
Permutations Tests and Performance Indicator for Zero-Inflated Proportions Response

Studying data
In ZIprop: Permutations Tests and Performance Indicator for Zero-Inflated Proportions Response

Introduction

Install and load package

Install packages

Load packages

Equine Influenza

Load the data and summarize them.

Run permutations test

Permutations tests results

Performance indicator

Covid-19

Load the data and summarize them.

Run permutations test

Permutations tests results

Performance indicator

Try the ZIprop package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

ZIprop Permutations Tests and Performance Indicator for Zero-Inflated Proportions Response

Studying data In ZIprop: Permutations Tests and Performance Indicator for Zero-Inflated Proportions Response

Introduction

Install and load package

Install packages

Load packages

Equine Influenza

Load the data and summarize them.

Run permutations test

Permutations tests results

Performance indicator

Covid-19

Load the data and summarize them.

Run permutations test

Permutations tests results

Performance indicator

Try the ZIprop package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

ZIprop
Permutations Tests and Performance Indicator for Zero-Inflated Proportions Response

Studying data
In ZIprop: Permutations Tests and Performance Indicator for Zero-Inflated Proportions Response