
EcoCluster

The EcoCluster package implements two Bayesian clustering models that rely on a stick-breaking prior to determine the optimal number of clusters: (1) a mixture model that groups locations according to their species composition, and (2) a Species Archetype Model (SAM) that groups species according to how they respond to environmental covariates.

Results from these models help to reveal the underlying structure of highly multivariate binary data by factorizing these data in different ways.
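
To give some intuition for this prior, the short sketch below is an illustration only (not code from the package, and the exact parametrization used internally may differ): a truncated stick-breaking construction turns a sequence of beta-distributed breaks into group probabilities theta, most of which end up close to zero, which is what lets the model "switch off" unneeded groups.

#illustration (assumptions, not package code): truncated stick-breaking
#construction of group probabilities theta from beta-distributed breaks v
set.seed(1)
ngroup=50
gamma=0.9                           #concentration parameter (assumed role)
v=rbeta(ngroup,1,gamma)             #break proportions
v[ngroup]=1                         #close the last break so theta sums to 1
theta=v*cumprod(c(1,1-v[-ngroup]))  #theta_k = v_k * prod_{j<k} (1 - v_j)
sum(theta)                          #equals 1
round(theta[1:10],3)                #most of the mass falls on the first few groups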

Installation

You can install EcoCluster from GitHub with the following code:

# install.packages("devtools")
devtools::install_github("drvalle1/EcoCluster")

Below, we apply these two clustering methods to simulated data to show that the true number of groups can be successfully recovered.

Mixture model example

We start by generating simulated data:

rm(list=ls(all=TRUE))
set.seed(1)

#general settings
nloc=300 #number of locations
nspp=50  #number of species
ngroup=5 #number of groups

#set parameters
z=sample(1:ngroup,size=nloc,replace=T)         #cluster assignment of each location
phi=matrix(rbeta(ngroup*nspp,1,1),ngroup,nspp) #species composition of each cluster  

#generate data
y=matrix(NA,nloc,nspp)
for (i in 1:nloc){
  phi1=phi[z[i],]
  y[i,]=rbinom(nspp,size=1,prob=phi1)
}
colnames(y)=paste0('spp',1:nspp)
rownames(y)=paste0('loc',1:nloc)

head(y[,1:13])
#>      spp1 spp2 spp3 spp4 spp5 spp6 spp7 spp8 spp9 spp10 spp11 spp12 spp13
#> loc1    0    1    1    0    1    1    0    1    1     1     0     0     0
#> loc2    1    0    0    1    1    0    1    1    0     1     0     1     0
#> loc3    0    0    0    0    1    1    1    0    1     1     1     0     0
#> loc4    1    1    0    0    0    1    1    1    1     0     0     0     0
#> loc5    0    1    1    0    0    1    1    0    0     1     1     0     1
#> loc6    0    0    1    1    0    0    0    0    1     1     0     1     1
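
As a quick sanity check on the simulated data (a sketch, not part of the package), locations assigned to the same cluster should show species occurrence frequencies close to that cluster's composition phi:

#sanity check (sketch, not part of the package): locations assigned to cluster 1
#should show occurrence frequencies close to the true species composition phi[1,]
obs=colMeans(y[z==1,])
plot(phi[1,],obs,xlab='true phi (cluster 1)',ylab='observed frequency')
abline(0,1,col='red')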

To obtain posterior samples of the mixture model parameters, we rely on the function mixture.gibbs. The user needs to specify the following arguments:

- dat: binary data matrix with locations in rows and species in columns
- ngroup: maximum number of location groups allowed by the stick-breaking prior
- nl: vector with the number of trials for each location (all equal to 1 here because the data are presences/absences)
- ngibbs: total number of Gibbs sampler iterations
- burnin: number of initial iterations to discard as burn-in
- a.prior, b.prior: hyperparameters of the beta prior on the occurrence probabilities phi

library('EcoCluster')
ngibbs=1000
res=mixture.gibbs(dat=y,ngroup=50,nl=rep(1,nrow(y)),ngibbs=ngibbs,burnin=ngibbs/2,a.prior=1,b.prior=1)
str(res)
#> List of 5
#>  $ phi  : num [1:501, 1:2500] 0.969 0.945 0.952 0.89 0.953 ...
#>  $ theta: num [1:501, 1:50] 0.174 0.197 0.169 0.19 0.202 ...
#>  $ logl : num [1:501] -7443 -7442 -7432 -7445 -7453 ...
#>  $ z    : int [1:501, 1:300] 2 2 2 2 2 2 2 2 2 2 ...
#>  $ gamma: num [1:501] 1 0.9 0.95 0.85 0.9 0.9 0.9 0.85 0.9 0.9 ...

The output is a list with 5 elements, each storing one row (or value) per retained Gibbs sample:

- phi: samples of the species occurrence probabilities of each group (groups x species, stored as columns)
- theta: samples of the probability of each of the 50 possible groups
- logl: log-likelihood at each retained iteration
- z: samples of the group assignment of each location
- gamma: samples of the stick-breaking prior parameter

The trace plot of the log-likelihood below suggests that the algorithm has successfully converged after the burn-in iterations are discarded.

plot(res$logl,type='l',main='convergence assessment',xlab='Gibbs iterations',ylab='log-likelihood')

The next graph shows that the algorithm successfully identified the 5 true location groups, with the remaining groups being very small or empty.

theta=colMeans(res$theta)
plot(theta[1:20],type='h',xlab='clusters',main='number of groups',ylab='theta',lwd=2)
abline(v=5,col='red',lty=3,lwd=0.1)
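
Because the simulation kept the true assignments z in the workspace, we can also cross-tabulate them against the estimated assignments stored in res$z. This is a post-processing sketch, not a package function; note that group labels are arbitrary, so a successful run simply shows each true group concentrated in a single (possibly relabeled) estimated group.

#post-processing sketch: posterior mode of each location's group assignment
z.mode=apply(res$z,2,function(x) as.integer(names(which.max(table(x)))))
table(true=z,estimated=z.mode)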

Species Archetype model example

We start by generating simulated data:

rm(list=ls(all=TRUE))
library('mvtnorm')
set.seed(1)

#general settings
nloc=300   #number of locations
nspp=50    #number of species
nparam=6   #number of covariates
ngroup1=5  #number of species groups

#create design matrix with covariates
xmat=matrix(rnorm(nparam*nloc),nloc,nparam)

#set parameters
alpha=rnorm(nspp,mean=0,sd=0.4) #intercept of each species
tmp=rnorm(nparam*ngroup1)
betas=matrix(tmp,nparam,ngroup1) #slope parameters for each group
cs=sample(1:ngroup1,size=nspp,replace=T) #cluster assignment for each species

#generate data assuming a probit formulation    
omega=matrix(NA,nloc,nspp)
for (i in 1:nspp){
  media=alpha[i]+xmat%*%betas[,cs[i]]
  omega[,i]=rnorm(nloc,mean=media,sd=1)
}
y=omega.true=omega
y[omega>0]=1
y[omega<0]=0

colnames(y)=paste0('spp',1:nspp)
rownames(y)=paste0('loc',1:nloc)
head(y[,1:13])
#>      spp1 spp2 spp3 spp4 spp5 spp6 spp7 spp8 spp9 spp10 spp11 spp12 spp13
#> loc1    0    1    1    1    1    1    0    0    1     0     1     1     1
#> loc2    1    0    0    0    0    0    0    1    0     0     0     0     0
#> loc3    0    1    0    1    1    0    0    0    1     0     1     1     0
#> loc4    1    1    0    1    1    1    0    1    1     0     1     1     1
#> loc5    0    1    0    0    0    0    0    0    1     1     0     0     0
#> loc6    0    0    0    0    1    1    0    0    1     0     1     1     1
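
Under this probit formulation, the probability that species i is present at a given location is pnorm(alpha[i] + xmat %*% betas[,cs[i]]). The quick check below (a sketch, not part of the package) compares the empirical prevalence of the first species with the mean of these implied probabilities:

#sanity check (sketch): empirical prevalence of species 1 versus the mean
#probit presence probability implied by alpha, betas, and cs
p1=pnorm(alpha[1]+xmat%*%betas[,cs[1]])
c(empirical=mean(y[,1]),expected=mean(p1))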

To obtain posterior samples of the species archetype model parameters, we rely on the function gibbs.SAM. The user needs to specify the following arguments:

- y: binary data matrix with locations in rows and species in columns
- xmat: design matrix with the covariates for each location
- ngroups: maximum number of species groups allowed by the stick-breaking prior
- ngibbs: total number of Gibbs sampler iterations
- burnin: number of initial iterations to discard as burn-in

library('EcoCluster')
ngibbs=1000

res=gibbs.SAM(y=y,xmat=xmat,ngroups=50,ngibbs=ngibbs,burnin=ngibbs/2)
str(res)
#> List of 6
#>  $ theta: num [1:501, 1:50] 0.376 0.442 0.352 0.305 0.371 ...
#>  $ logl : num [1:501] -4289 -4295 -4303 -4311 -4291 ...
#>  $ betas: num [1:501, 1:300] 0.232 0.217 0.21 0.235 0.232 ...
#>  $ alpha: num [1:501, 1:50] 0.178 0.28 0.206 0.3 0.222 ...
#>  $ cs   : int [1:501, 1:50] 4 4 4 4 4 4 4 4 4 4 ...
#>  $ gamma: num [1:501] 0.85 0.7 0.4 0.4 0.35 0.3 0.3 0.35 0.3 0.55 ...

The output is a list with 6 elements, each storing one row (or value) per retained Gibbs sample:

- theta: samples of the probability of each of the 50 possible species groups
- logl: log-likelihood at each retained iteration
- betas: samples of the slope parameters of each group (covariates x groups, stored as columns)
- alpha: samples of the species-specific intercepts
- cs: samples of the group assignment of each species
- gamma: samples of the stick-breaking prior parameter

The trace plot of the log-likelihood below suggests that the algorithm has successfully converged after the burn-in iterations are discarded.

plot(res$logl,type='l',main='convergence assessment',xlab='Gibbs iterations',ylab='log-likelihood')

The next graph shows that the algorithm successfully identified the 5 true species groups.

theta=colMeans(res$theta)
plot(theta[1:20],type='h',xlab='clusters',main='number of species groups',ylab='theta',lwd=2)
abline(v=5,col='red',lty=3,lwd=0.1)
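
As with the mixture model, we can cross-tabulate the true species assignments cs against the estimated assignments stored in res$cs (a post-processing sketch; group labels are again arbitrary):

#post-processing sketch: posterior mode of each species' group assignment
cs.mode=apply(res$cs,2,function(x) as.integer(names(which.max(table(x)))))
table(true=cs,estimated=cs.mode)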


