```{r, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
Denis Valle, Carlos Silva, Marcos Longo, Paulo Brando
The goal of LidarLDA is to fit a modified version of the Latent Dirichlet Allocation (LDA) model to LiDAR data. The main benefit of this mixed-membership model is that it allows grid cells to contain varying proportions of each cluster, whereas more standard hard-clustering methods force each grid cell to belong to a single cluster. The model estimates two sets of parameters: one set characterizes each grid cell through the relative abundance of each cluster (the $\Theta$ matrix), while the other characterizes each cluster through its absorptance probability at each height bin (the $\Phi$ matrix).
## Installation

You can install this package from GitHub with:
```{r, eval=F}
library("devtools")
devtools::install_github("drvalle1/LidarLDA",build_vignettes=T)
```
## Fitting LidarLDA to simulated data

We start by showing how to fit the model to data simulated with 5 clusters. The simulated datasets contain information for 2,601 pixels (rows) and 30 height bins (columns), labeled z1, z2, ..., z30. The data in `sim_y5` consist of the number of returned light pulses whereas the data in `sim_n5` consist of the number of incoming light pulses, for each pixel and height bin. As a result, each element of `sim_y5` is always smaller than or equal to the corresponding element of `sim_n5`.

```r
#library('devtools')
#devtools::install_github("drvalle1/LidarLDA",build_vignettes=T)
library(LidarLDA)

#basic characteristics of the simulated data
dim(sim_y5) #sim_y5 is provided in the LidarLDA package
dim(sim_n5) #sim_n5 is provided in the LidarLDA package
colnames(sim_y5)
colnames(sim_n5)

#remove columns with coordinates and topography information
ind=which(colnames(sim_y5)%in%c('x','y','topo'))
sim_y5a=sim_y5[,-ind]
coord=sim_y5[,ind]
ind=which(colnames(sim_n5)%in%c('x','y','topo'))
sim_n5a=sim_n5[,-ind]

#check that returned pulses never exceed incoming pulses
mean(sim_y5a<=sim_n5a)
```
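To make the structure of these data concrete, the snippet below sketches how data of this form could be generated. The binomial formulation and all object names here are our own illustrative assumptions, not package code.

```r
#a minimal sketch of a plausible generative process (our assumption, not package code)
set.seed(1)
npix=100; nbins=30; nclust=5

#relative abundance of each cluster in each pixel (theta; rows sum to 1)
theta=matrix(rgamma(npix*nclust,0.1),npix,nclust)
theta=theta/rowSums(theta)

#absorptance probability of each cluster in each height bin (phi)
phi=matrix(rbeta(nclust*nbins,1,1),nclust,nbins)

#returned pulses (y) drawn binomially from incoming pulses (n)
n=matrix(500,npix,nbins)
y=matrix(rbinom(npix*nbins,size=n,prob=theta%*%phi),npix,nbins)
```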
We fit the simulated data using the code below. In this code, we assume a maximum of 10 clusters and rely on 10,000 iterations of the Gibbs sampler with a burn-in of 9,000 iterations. Finally, we return only the posterior mean parameter estimates, instead of all the posterior samples, by specifying `theta.post=F` and `phi.post=F`.
```{r, eval=F}
Model.Results=LidarLDA(y=data.matrix(sim_y5a),
                       n=data.matrix(sim_n5a),
                       nclust=10,
                       a.phi=1,b.phi=1,
                       gamma=0.1,ngibbs=10000,
                       nburn=9000,theta.post=F,phi.post=F)
```
We can assess convergence by examining the trace plot of the log-likelihood. This plot suggests that the algorithm has converged.
```r
#using pre-computed results stored as internal data
Model.Results=list(theta=LidarLDA:::theta,
                   llk=LidarLDA:::llk$x,
                   phi=LidarLDA:::phi)

plot(Model.Results$llk,type='l',xlab='Iterations',ylab='Log-likelihood')
```
According to the `theta` matrix (i.e., the matrix that gives the relative abundance of each cluster in each pixel), our model has identified 5 (out of a maximum of 10) main clusters. On average, these first 5 clusters account for 99.8% of all observations in each pixel.
```r
boxplot(Model.Results$theta,xlab='Cluster id',ylab='theta')

#proportion of observations accounted for by the first 5 clusters
sum1=apply(Model.Results$theta[,1:5],1,sum)
mean(sum1)
```
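Note that, in general, the dominant clusters need not occupy the first columns of `theta`. A simple post-hoc summary (our own convenience check, not a package function) ranks the clusters by their average relative abundance:

```r
#average relative abundance of each cluster across pixels, largest first
sort(colMeans(Model.Results$theta),decreasing=TRUE)
```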
Because these are simulated data, there is a clear pattern in how the relative abundance of each cluster changes as a function of topography, as shown below. In this figure, each cluster is depicted with a different color and relative abundance is given by color opacity (i.e., 0 is fully transparent and 1 is fully opaque). Contour lines show topography.
```r
library('ggplot2')

theta=Model.Results$theta[,1:5]
colnames(theta)=paste0('clust',1:5)
theta1=cbind(coord,theta)

res5=ggplot(data=theta1,aes(x=x,y=y)) +
  geom_tile(alpha=theta1$clust1,fill='red') +
  geom_tile(alpha=theta1$clust2,fill='green') +
  geom_tile(alpha=theta1$clust3,fill='cyan') +
  geom_tile(alpha=theta1$clust4,fill='orange') +
  geom_tile(alpha=theta1$clust5,fill='blue') +
  geom_contour(aes(x=x,y=y,z=topo)) +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank()) +
  xlab('')+ylab('')
res5
```
## Comparing time periods with the folding-in operation

It would be tempting to analyze LiDAR data collected for the same area but in a different year (i.e., t=2) simply by re-running LidarLDA on the new dataset. The problem with this approach is that it might identify a different set of clusters, making temporal comparisons challenging. To avoid this problem, we rely on the folding-in operation to compare how the relative abundance of each cluster has changed through time.
In the folding-in operation, the characteristics of each cluster (captured by the $\Phi$ matrix) are kept constant and only the relative abundances of these clusters (captured by the $\Theta$ matrix) are re-estimated. As a result, assuming the grid cells were defined in a comparable way in both time periods, we can directly compare the relative abundances at time 1 (i.e., $\Theta_{t=1}$) with those at time 2 (i.e., $\Theta_{t=2}$). To perform this operation, we rely on the function `LidarLDA_foldin`. As illustrated below, we need the posterior mean (or the posterior samples) of the $\Phi$ matrix, as estimated from the analysis of the data at time t=1 using LidarLDA.
```{r, eval=F}
#data for t=2
colnames(sim_y5_t2) #this dataset is provided as part of the LidarLDA package
colnames(sim_n5_t2) #this dataset is provided as part of the LidarLDA package

#remove unnecessary columns
ind=which(colnames(sim_y5_t2)%in%c('x','y','topo'))
y_t2=sim_y5_t2[,-ind]
coord=sim_y5_t2[,ind]
ind=which(colnames(sim_n5_t2)%in%c('x','y','topo'))
n_t2=sim_n5_t2[,-ind]

#run the folding-in operation, keeping phi fixed at its t=1 estimate
phi=Model.Results$phi
Model.Results.t2=LidarLDA_foldin(y=data.matrix(y_t2),
                                 n=data.matrix(n_t2),
                                 nclust=5,
                                 gamma=0.1,ngibbs=10000,
                                 nburn=9000,
                                 phi.post=F,
                                 phi.estim=data.matrix(phi[1:5,]),
                                 theta.post=F)
```
We can again assess convergence by examining the trace plot of the log-likelihood. This plot suggests that the algorithm has converged.
```r
#using pre-computed results stored as internal data
Model.Results.t2=list(llk=LidarLDA:::llk_t2$x,
                      theta=LidarLDA:::theta_t2)

plot(Model.Results.t2$llk,type='l',xlab='Iterations',ylab='Log-likelihood')
```
By visualizing the relative abundance of each cluster at time t=2, we can see that there has been substantial spatial change. For example, all the clusters have moved upslope, with spatial expansion of the cluster that dominated the lowland areas (i.e., the green cluster) and shrinkage of the highland area dominated by the orange cluster.
```r
theta=Model.Results.t2$theta
colnames(theta)=paste0('clust',1:5)
theta2=cbind(coord,theta)

res5=ggplot(data=theta2,aes(x=x,y=y)) +
  geom_tile(alpha=theta2$clust1,fill='red') +
  geom_tile(alpha=theta2$clust2,fill='green') +
  geom_tile(alpha=theta2$clust3,fill='cyan') +
  geom_tile(alpha=theta2$clust4,fill='orange') +
  geom_tile(alpha=theta2$clust5,fill='blue') +
  geom_contour(aes(x=x,y=y,z=topo)) +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank()) +
  xlab('')+ylab('')
res5
```
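Because the folding-in operation holds the clusters fixed, the two $\Theta$ matrices are directly comparable. As a simple illustration (assuming the pixels appear in the same order in both periods), we can summarize the landscape-level change in each cluster's average abundance:

```r
#mean relative abundance of each cluster at t=1 and t=2
theta.t1=Model.Results$theta[,1:5]
theta.t2=Model.Results.t2$theta
round(rbind(t1=colMeans(theta.t1),t2=colMeans(theta.t2)),3)
```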
## Formatting empirical LiDAR data

To fit this model to empirical LiDAR data, it is important to format the data correctly. Here we show how we formatted an empirical LiDAR dataset.
We start by assuming that we have a matrix called `lidar_data`, which holds the number of returns in each pixel and each height bin. Notice that this matrix also contains the x and y coordinates of each pixel.
```r
library(LidarLDA)
dim(lidar_data) #this dataset is contained within the LidarLDA package
colnames(lidar_data)
```
Next, we remove height bins for which there are few data because these are relatively uninformative. In our case, there are almost no returns in bins above 31.5 m, as shown below. As a result, we sum all of the returns above this threshold and assign these sums to the last height bin (i.e., `z31.5`).
```r
dat=lidar_data

#get rid of height bins with few observations
apply(dat,2,sum)
limite='z31.5'
ind=which(colnames(dat)==limite)
dat1=dat

#all of the returns above this threshold get stored in the highest bin
dat1[,ind]=apply(dat1[,ind:ncol(dat)],1,sum)
dat2=dat1[,1:ind]

#separate the height bins from the coordinate columns
ind=grep('z',colnames(dat2))
coord=dat2[,-ind]
dat3=dat2[,ind]
```
We then create the matrices `y` and `n`, containing the number of returned light pulses and the number of incoming light pulses, respectively, for each pixel and each height bin.
```r
y=z=dat3
nheight=ncol(z)
npix=nrow(z)

#get n matrix containing the incoming light pulses for each pixel and height bin
n=matrix(NA,npix,nheight)
for (i in nheight:2){
  n[,i]=rowSums(z)
  z=z[,-i]
}
n[,1]=y[,1]

#get names
colnames(n)=colnames(y)

#eliminate the first column because y/n is always equal to 1 for that column
n1=n[,-1]
y1=y[,-1]
```
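By construction, returned pulses can never exceed incoming pulses, and the number of incoming pulses can only increase with height because pulses enter the canopy from above. A quick sanity check of the formatted matrices (our own verification, not a package requirement) is:

```r
#returned pulses never exceed incoming pulses
mean(y1<=n1)

#each row of n1 is non-decreasing from the lowest to the highest height bin
all(apply(n1,1,function(x) all(diff(x)>=0)))
```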
Finally, we finish formatting these data by randomly subsampling 500 pulses whenever the number of incoming pulses is greater than 500. This procedure ensures an approximately even number of pulses across height bins and pixels (here arbitrarily capped at 500) and can help the algorithm converge more rapidly.
```r
set.seed(1)
prob=y1/n1
thresh=500

#cap incoming pulses at the threshold and subsample returns accordingly
cond=n1>thresh
n1[cond]=thresh
y1[cond]=rbinom(sum(cond),size=thresh,prob[cond])
```
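With `y1` and `n1` formatted, the model can be fit exactly as for the simulated data. The call below simply mirrors the earlier one; the object name is hypothetical, and the maximum number of clusters, priors, and number of iterations are choices that would typically be adjusted for the dataset at hand:

```{r, eval=F}
#hypothetical object name; settings mirror the simulated-data fit above
Model.Results.empirical=LidarLDA(y=data.matrix(y1),
                                 n=data.matrix(n1),
                                 nclust=10,
                                 a.phi=1,b.phi=1,
                                 gamma=0.1,ngibbs=10000,
                                 nburn=9000,theta.post=F,phi.post=F)
```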