library(admix)
The clustering of populations following admixture models is, for now, based on the K-sample test theory (see [@MilhaudPommeretSalhiVandekerkhove2024b]. Consider $K$ samples. For $i=1,...,K$, sample $X^{(i)} = (X_1^{(i)}, ..., X_{n_i}^{(i)})$ follows $$L_i(x) = p_i F_i(x) + (1-p_i) G_i, \qquad x \in \mathbb{R}.$$
We still use IBM approach to perform pairwise hypothesis testing. The idea is to adapt the K-sample test procedure to obtain a data-driven method that cluster the $K$ populations into $N$ subgroups, characterized by a common unknown mixture component. The advantages of such an approach is twofold:
This clustering technique thus allows to cluster unobserved subpopulations instead of individuals. We call this algorithm the K-sample 2-component mixture clustering (K2MC).
We now detail the steps of the algorithm.
Test $H_0$ between $x$ and $y$.
If $H_0$ is not rejected then $S_1 = {x,y}$,\ Else $S_1 = {x}$, $S_{c+1} = {y}$ and then $c=c+1$.
While $S\setminus \bigcup_{k=1}^c S_k = \emptyset$ do
Select $u={\rm argmin}{d(i,j); i\in S_c, j\in S\setminus \bigcup_{k=1}^c S_k}$; Test $H_0$ the simultaneous equality of all the $f_j$, $j\in S_c$ :\ If $H_0$ not rejected, then put $S_c=S_c\bigcup {u}$;\ Else $S_{c+1} = {u}$ and $c = c+1$.
We present a case study with 5 populations to cluster on $\mathbb{R}^+$, with Gamma-Exponential, Exponential-Exponential and Gamma-Gamma mixtures.
## Simulate mixture data: mixt1 <- twoComp_mixt(n = 6000, weight = 0.8, comp.dist = list("gamma", "exp"), comp.param = list(list("shape" = 16, "scale" = 1/4), list("rate" = 1/3.5))) mixt2 <- twoComp_mixt(n = 6000, weight = 0.7, comp.dist = list("gamma", "exp"), comp.param = list(list("shape" = 14, "scale" = 1/2), list("rate" = 1/5))) mixt3 <- twoComp_mixt(n = 6000, weight = 0.6, comp.dist = list("gamma", "gamma"), comp.param = list(list("shape" = 16, "scale" = 1/4), list("shape" = 12, "scale" = 1/2))) mixt4 <- twoComp_mixt(n = 6000, weight = 0.5, comp.dist = list("exp", "exp"), comp.param = list(list("rate" = 1/2), list("rate" = 1/7))) mixt5 <- twoComp_mixt(n = 6000, weight = 0.5, comp.dist = list("gamma", "exp"), comp.param = list(list("shape" = 14, "scale" = 1/2), list("rate" = 1/6))) data1 <- getmixtData(mixt1) data2 <- getmixtData(mixt2) data3 <- getmixtData(mixt3) data4 <- getmixtData(mixt4) data5 <- getmixtData(mixt5) admixMod1 <- admix_model(knownComp_dist = mixt1$comp.dist[[2]], knownComp_param = mixt1$comp.param[[2]]) admixMod2 <- admix_model(knownComp_dist = mixt2$comp.dist[[2]], knownComp_param = mixt2$comp.param[[2]]) admixMod3 <- admix_model(knownComp_dist = mixt3$comp.dist[[2]], knownComp_param = mixt3$comp.param[[2]]) admixMod4 <- admix_model(knownComp_dist = mixt4$comp.dist[[2]], knownComp_param = mixt4$comp.param[[2]]) admixMod5 <- admix_model(knownComp_dist = mixt5$comp.dist[[2]], knownComp_param = mixt5$comp.param[[2]]) ## Look for the clusters: admix_cluster(samples = list(data1,data2,data3,data4,data5), admixMod = list(admixMod1,admixMod2,admixMod3,admixMod4,admixMod5), conf_level = 0.95, tune_penalty = TRUE, tabul_dist = NULL, echo = FALSE, n_sim_tab = 50, parallel = FALSE, n_cpu = 2) #> Call: #> admix_cluster(samples = list(data1, data2, data3, data4, data5), #> admixMod = list(admixMod1, admixMod2, admixMod3, admixMod4, #> admixMod5), conf_level = 0.95, tune_penalty = TRUE, tabul_dist = NULL, #> echo = FALSE, n_sim_tab = 50, parallel = FALSE, n_cpu = 2) #> #> Number of detected clusters across the samples provided: 3. #> #> List of samples involved in each built cluster (inside c()) : #> - Cluster #1: samples c(2, 5) #> - Cluster #2: samples 4 #> - Cluster #3: samples c(1, 3)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.