rgeode: GEOmetric Density Estimation.
In LorenzoRimella/RGeode: Geometric Density Estimation



rgeode selects the principal directions of the data and performs inference on them. GEODE can also handle missing data.


rgeode(Y, d = 6, burn = 1000, its = 2000, tol = 0.01, atau = 1/20,
  asigma = 1/2, bsigma = 1/2, starttime = NULL, stoptime = NULL,
  fast = TRUE, c0 = -1, c1 = -0.005)

`Y`	array_like A real input matrix (or data frame) of data, with dimensions (n, D).
`d`	int, optional The conservative upper bound for the true (intrinsic) dimension of the data. The true dimension is assumed to be smaller than d.
`burn`	int, optional Number of burn-in iterations to perform in the Gibbs sampler. It is also the stopping time at which the selection of the principal axes ends.
`its`	int, optional number of iterations that must be performed after the burn-in.
`tol`	double, optional Threshold for adaptively removing redundant dimensions. It is compared with the ratio \frac{\alpha_j^2(t)}{\max_i \alpha_i^2(t)}.
`atau`	double, optional The parameter a_τ of the truncated Exponential (the prior for τ_j).
`asigma`	double, optional The shape parameter a_σ of the truncated Gamma (the prior for σ^2).
`bsigma`	double, optional The rate parameter b_σ of the truncated Gamma (the prior for σ^2).
`starttime`	int, optional Starting time for adaptive pruning. It must be less than the number of burn-in iterations.
`stoptime`	int, optional Stop time for adaptive pruning. It must be less than the number of burn-in iterations.
`fast`	bool, optional If TRUE it is run using fast d-rank SVD. Otherwise it uses the classical SVD.
`c0`	double, optional Additive constant for the exponent of the pruning step.
`c1`	double, optional Multiplicative constant for the exponent of the pruning step.
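
The role of `tol` can be illustrated with a small sketch. The vector `alpha2` is hypothetical (made-up values standing in for the \alpha_j^2(t) at one iteration), and the rule assumes that a dimension is flagged as redundant when its ratio falls below `tol`. The package's actual pruning step also involves `c0`, `c1`, `starttime` and `stoptime`, which are not modelled here.

```r
# Hypothetical values of alpha_j^2(t) for d = 6 candidate dimensions
alpha2 <- c(50, 32, 18, 0.3, 0.2, 0.1)
tol <- 0.01

# Ratio of each alpha_j^2(t) to the largest one
ratio <- alpha2 / max(alpha2)

# Assumption: a dimension is a candidate for removal when its ratio is below tol
redundant <- which(ratio < tol)
redundant
```

With these made-up values the first three dimensions dominate, so only the last three fall under the threshold.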

GEOmetric Density Estimation (rgeode) is a fast algorithm for performing inference on normally distributed data. It consists of two main steps:

Selection of the principal axes of the data.
Adaptive Gibbs sampling, which draws a set of samples from the full conditional posteriors of the parameters of interest and so enables inference.

It takes several inputs: a rectangular (n, D) matrix Y, on which a fast rank-d SVD is run; the conservative upper bound d on the true dimension of the data; and a set of tuning parameters. The conservative upper bound d must be chosen such that d > p, where p is the true dimension, and d << D.
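
As a rough illustration of the first step, the sketch below extracts d candidate principal axes from an (n, D) matrix with d << D. It uses base R's classical `svd` truncated to d right singular vectors, not the fast rank-d SVD that rgeode uses when `fast = TRUE`. The simulated data and the centring step are assumptions made for the example, not a description of the package internals.

```r
set.seed(1)
n <- 100; D <- 50; d <- 5   # d is a conservative upper bound, d << D

# Simulated data: 2 dominant directions plus small noise (illustrative only)
W <- qr.Q(qr(matrix(rnorm(D * 2), D, 2)))        # D x 2 orthonormal basis
Y <- matrix(rnorm(n * 2, sd = 5), n, 2) %*% t(W) +
  matrix(rnorm(n * D, sd = 0.1), n, D)

# Centre the columns, then keep only the first d right singular vectors
Yc <- scale(Y, center = TRUE, scale = FALSE)
s <- svd(Yc, nu = 0, nv = d)

W_hat <- s$v                        # D x d matrix of candidate principal axes
dim(W_hat)
round(s$d[1:d]^2 / sum(s$d^2), 3)   # share of total variance per retained axis
```

Because the true dimension here (2) is below d, the trailing retained axes carry almost no variance; these are the kind of redundant directions that adaptive pruning is meant to remove.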

rgeode returns a list containing the following components:

`InD`	array_like The chosen principal axes.
`u`	matrix Containing the samples from the full conditional posterior of u_js. Each iteration is stored in a column.
`tau`	matrix Containing the sample from the full conditional posterior of tau_js.
`sigmaS`	array_like Containing the sample from the full conditional posterior of sigma.
`W`	matrix Containing the principal singular vectors.
`Miss`	list Containing all the information about missing data. If there are no missing data, this output is not provided. It has the following elements:
`id_m`	array The set of rows with missing data.
`pos_m`	list The positions of the missing values for each row with missing values.
`yms`	list The pseudo-observations substituting the missing data. Each element of the list holds the simulated data for that time.

The missing-data component is filled only when the data contain missing values.
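
A minimal sketch of how the returned list might be inspected. The object `fit` is a stand-in with made-up values, not real rgeode output, and it assumes that `sigmaS` holds one draw per iteration (burn-in followed by `its` draws), which is only suggested by the indexing used in the plotting code of the examples.

```r
# Stand-in for the list returned by rgeode(); the values are made up
burn <- 1000; its <- 2000
set.seed(2)
fit <- list(sigmaS = c(seq(5, 1, length.out = burn),
                       rnorm(its, mean = 1, sd = 0.05)))

# Discard the burn-in draws before summarising the posterior
keep <- (burn + 1):(burn + its)
post_mean <- mean(fit$sigmaS[keep])
post_ci <- quantile(fit$sigmaS[keep], c(0.025, 0.975))

# A list element that was never set is NULL, so this detects whether
# the missing-data component is present
has_missing <- !is.null(fit$Miss)
```

Checking `is.null(fit$Miss)` before reading `id_m`, `pos_m` or `yms` avoids errors on fits obtained from complete data.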

L. Rimella, lorenzo.rimella@hotmail.it

[1] Y. Wang, A. Canale, D. Dunson. "Scalable Geometric Density Estimation" (2016).

library(MASS)
library(RGeode)

####################################################################
# WITHOUT MISSING DATA
####################################################################
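# Define the dataset
D= 200       # ambient dimension
n= 500       # number of observations
d= 10        # conservative upper bound on the true dimension
d_true= 3    # true dimension of the data

set.seed(321)

# Mean and diagonal covariance in the d_true-dimensional subspace
mu_true= runif(d_true, -3, 10)

Sigma_true= matrix(0,d_true,d_true)
diag(Sigma_true)= c(runif(d_true, 10, 100))

# D x d_true matrix with orthonormal columns spanning the subspace
W_true = svd(matrix(rnorm(D*d_true, 0, 1), d_true, D))$v

# Isotropic noise variance
sigma_true = abs(runif(1,0,1))

# Mean and covariance in the ambient D-dimensional space
mu= W_true%*%mu_true
C= W_true %*% Sigma_true %*% t(W_true)+ sigma_true* diag(D)

# Draw n observations from the resulting normal distribution
y= mvrnorm(n, mu, C)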

################################
# GEODE: Without missing data
################################

start.time <- Sys.time() 
GEODE= rgeode(Y= y, d)
Sys.time()- start.time

# SIGMAS
#plot(seq(110,3000,by=1),GEODE$sigmaS[110:3000],ty='l',col=2,
#     xlab= 'Iteration', ylab= 'sigma^2', main= 'Simulation of sigma^2')
#abline(v=800,lwd= 2, col= 'blue')
#legend('bottomright',c('Posterior of sigma^2', 'Stopping time'),
#       lwd=c(1,2),col=c(2,4),cex=0.55, border='black', box.lwd=3)
       
       
####################################################################
# WITH MISSING DATA
####################################################################

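###########################
# Insert NaN
n_m = 5  # number of data vectors selected to contain missing features
d_m = 1  # number of missing features per selected vector

# Rows that will receive missing values
data_miss= sample(seq(1,n),n_m)

# One row of feature indices per selected data vector (n_m x d_m matrix)
features= sample(seq(1,D), d_m)
for(i in 2:n_m)
{
  features= rbind(features, sample(seq(1,D), d_m))
}

for(i in seq_along(data_miss))
{
  
  if(i==length(data_miss))
  {
    # The last selected row drops its first feature index, so it gets
    # d_m - 1 missing values (none when d_m = 1)
    y[data_miss[i],features[i,][-1]]= NaN
  }
  else
  {
    y[data_miss[i],features[i,]]= NaN
  }
  
}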

################################
# GEODE: With missing data
################################
set.seed(321)
start.time <- Sys.time() 
GEODE= rgeode(Y= y, d)
Sys.time()- start.time

# SIGMAS
#plot(seq(110,3000,by=1),GEODE$sigmaS[110:3000],ty='l',col=2,
#     xlab= 'Iteration', ylab= 'sigma^2', main= 'Simulation of sigma^2')
#abline(v=800,lwd= 2, col= 'blue')
#legend('bottomright',c('Posterior of sigma^2', 'Stopping time'),
#       lwd=c(1,2),col=c(2,4),cex=0.55, border='black', box.lwd=3)



####################################################################
####################################################################