CDatanet: An R package to simulate and estimate a Count Data Model with Social Interactions

knitr::opts_chunk$set(echo = FALSE)

Introduction

There is a large and growing literature on peer effects in economics.^[For recent reviews, see @boucher2016some, @de2017econometrics and @bramoulle2019peer.] Recent contributions include, among others, models for discrete data, including binary [e.g., @lee2014binary; @liu2019simultaneous], ordered [e.g., @boucher2018peer], and multinomial [e.g., @guerra2017multinomial]. To my knowledge, however, there are no existing models for count variables with microeconomic foundations, despite these variables being prevalent in survey data (e.g., number of times teenagers have taken drugs, number of times teenagers have drunk alcohol, frequency of participation in sport activities, frequency of participation in extracurricular activities, frequency of restaurant visits, number of physician visits).

I present an incomplete information game with social interactions in which the outcome takes unbounded integer values. I show that it is important to account for the counting nature of the dependent variable. Assuming that this variable is continuous significantly underestimates the peer effects.

I also present an easy-to-use R package---named CDatanet---for implementing the model. This note is a supplement for the package manual. I provide examples with simulated data to show how to use the package. The package is located on GitHub at github.com/ahoundetoungan/CDatanet. All the results of the note can be replicated following the code provided.

The remainder of the paper is organized as follows. Section 2 briefly introduces the model. Sections 3 and 4 provide examples in which the network matrix is exogenous and endogenous respectively. They present a Monte Carlo study to assess the performance of the estimator of the model parameters. They also discuss how to control for the endogeneity of the network. Section 5 concludes.

Model{#model}

Let $\mathcal{V} = {1, \dots, n}$ be a set of $n$ players indexed by $i$ and $y_i$ the observed integer outcome of player $i$ (e.g., the number of cigarettes smoked per day or per week). I denote by $\boldsymbol{G}$ the adjacency matrix such that $G_{ij} = \dfrac{1}{n_i}$ if $i$ knows $j$ and $G_{ij} = 0$ otherwise, where $n_i$ the number of friends of $i$. Let also $\boldsymbol{X}$ be the matrix of explanatory variables.

As in @liu2019simultaneous, I assume that there is a latent variable $y_i^$ which determines the observed count variable $y_i$ . This latent variable $y_i^$ and the observed $y_i$ are linked as follow.

Let $\left(a_q\right){q \in \mathbb{N}}$ be a sequence given by $a_0 = -\infty$, $a_1 \in \mathbb{R}$, and $a_q = a_1 + \gamma(q-1)$ for $q \in \mathbb{N}^$ and $\gamma \in \mathbb{R}_+^$. If $y_i^* \in (a{q}, a_{q + 1}]$ then $y_i = q$.

The first order conditions of the game equilibrium implies that

$$ y_i^* = \lambda\sum_{i = 1}^n \sum_{r = 0}^{\infty} rG_{ij} p_{ir} + \boldsymbol{x}i^{\prime}\boldsymbol{\beta} + \varepsilon{i}$$ where $p_{ir}$ is the probability of $y_i = r$, $\lambda$ is the peer effect and $\varepsilon_{i} \sim \mathcal{N}\left(0, \sigma_{\varepsilon}^2\right)$.

The game equilibrium is given by $$ p_{ir} = \Phi\left(\dfrac{\lambda\sum_{i = 1}^n \sum_{r = 0}^{\infty} rG_{ij} p_{ir} + \boldsymbol{x}i^{\prime}\boldsymbol{\beta} - a_q}{\sigma{\varepsilon}}\right) - \Phi\left(\dfrac{\lambda\sum_{i = 1}^n \sum_{r = 0}^{\infty} rG_{ij} p_{ir} + \boldsymbol{x}i^{\prime}\boldsymbol{\beta} - a{q+1}}{\sigma_{\varepsilon}}\right)$$ where $\Phi$ is the probability density function of $\mathcal{N}(0, 1)$.

To estimate $\boldsymbol{\theta} = (\lambda, \boldsymbol{\beta}^{\prime}, \sigma_{\varepsilon})^{\prime}$, I rely on the Nested Partial Likelihood (NPL) method proposed by @aguirregabiria2007sequential. I refer the reader to @houndetoungan2020cdata for more details. The estimation method is implemented in the package CDatanet.

Examples with exogenous network{#exo}

In this section, I present examples to show how to use the package. I simulate data following the model by assuming that the network matrix is exogenous. I then show how to estimate the model parameters using functions provided by CDatanet. I also use Monte Carlo simulations to assess the performance of the estimator of the model parameters.

Data simulation{#exo.data}

Given the adjacency matrix, the explanatory variables and $\boldsymbol{\theta}$, the function simCDnet can used to simulate data. I assume that there are M sub-networks. The number of individuals in each sub-network is randomly chosen between 100 and 1000. I assume three exogenous variables (including the intercept) and I control for the contextual effects.

set.seed(2020) 
library(CDatanet)
# Groups' size
M      <- 5 # Number of sub-groups
nvec   <- round(runif(M, 100, 1000))
print(nvec)
n      <- sum(nvec)
print(n)

# Parameters
lambda <- 0.4              # peer effects
beta   <- c(2, -1.9, 0.8)  # own effects
gamma  <- c(1.5, -1.2)     # contextual effects
sigma  <- 1.5              # standard deviation of epsilon
theta  <- c(lambda, beta, gamma, sigma)

# X
data   <- data.frame(x1 = rnorm(n, 1, 1), x2 =  rexp(n, 0.4))

To simulate the network matrix, I assume that the number of friends of each individual is randomly chosen between 0 and 30.

# Network
Glist  <- list()
for (m in 1:M) {
  nm           <- nvec[m]
  Gm           <- matrix(0, nm, nm)
  max_d        <- 30
  for (i in 1:nm) {
    tmp        <- sample((1:nm)[-i], sample(0:max_d, 1))
    Gm[i, tmp] <- 1
  }
  rs           <- rowSums(Gm); rs[rs == 0] <- 1
  Gm           <- Gm/rs
  Glist[[m]]   <- Gm
}

I can now simulate the count data. The output of simCDnet includes yst the vector of $y_i^*$, y the vector of $y_i$, yb the vector of $\displaystyle\sum_{r = 0}^{\infty}rp_{ir}$, Gyb the vector of $\displaystyle\sum_{i = 1}^n \sum_{r = 0}^{\infty} rG_{ij} p_{ir}$ and iteration as the number of iterations performed to find the fixed point yb.

ytmp    <- simCDnet(formula = ~ x1 + x2 | x1 + x2, Glist = Glist, theta = theta,
                    data = data)
names(ytmp)
y       <- ytmp$y
# Add y to data
data$y  <- y
# Summarize y
summary(y)
table(y)
# Plot data histogram
library(ggplot2)
print(ggplot(data = data.frame(y = y), aes(x = y)) +
        geom_bar(color = "black", fill = "cyan") +
        theme_bw() + xlab("") + ylab("Frequency"))

Estimation

Using the simulated data, I estimate the count data model as well as the spatial autoregressive Tobit (SART) and the spatial autoregressive (SAR) models. The SART model assumes that the dependent variable is continuous and left-censored at 0. This model can be used to model count data when data contain many zeros. However, the SART model does not account for the integer nature of the dependent variable. This can lead to biased estimates. In addition, the SART model also does not account for the integer nature of the dependent variable and does not assume that the data are censored. One can compare the peer effects using the count data model to those when the dependent variable is assumed continuous. The outputs of the functions that implement these models have a summary class to summarize the results and provide marginal effects.^[Note that the estimates from the SAR model can be directly interpreted as marginal effects.]

#  I will not run the estimates because it takes time. I will just load the saved ouptuts. However, you will get the same results if you run the code on your computer with the seed set above.
load("out.est.Rdata")
#  Count data
CD   <- CDnetNPL(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist,
                  optimizer = "nlm", 
                 data = data, npl.ctr = list(print = FALSE, maxit = 5e3))
summary(CD)
summary(CD)
# SART
SART   <- SARTML(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist,
                optimizer = "nlm", data = data, print = FALSE)
summary(SART)
summary(SART)
#SAR
SAR  <- SARML(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist,
              optimizer = "nlm", data = data, print = FALSE)
summary(SAR)
summary(SAR)

Using the count data model, the peer effect is estimated at 0.38, which is close to the true value, 0.4. By replicating this experience several times, the average of the estimates is likely to be almost equal to the true value (see next section). Note that this parameter is not interpretable. Instead, one should interpret the marginal effect: increasing the expected $y_j$ of the peers by one implies an increase of 0.28 in the expected $y_i$. However, the SART and the SAR model underestimate thos impact at 0.19 and 0.16, respectively.

Monte Carlo Simulations

In this section, I conduct Monte Carlo simulations to assess the performance of the estimator. I use the same settings as in Section 3.1 except that the number of individuals in each sub-network is set to 500. I build the following Monte Carlo function that simulates the data, estimates the three models and returns the estimates.

fMC <- function(s) {
  # Groups' size
  M      <- 5
  nvec   <- rep(500, 5); 
  n      <- sum(nvec)
  # Parameters
  lambda <- 0.4 
  beta   <- c(2, -1.9, 0.8)
  gamma  <- c(1.5, -1.2)
  sigma  <- 1.5   
  theta  <- c(lambda, beta, gamma, sigma)

  # X
  data   <- data.frame(x1 = rnorm(n, 1, 1), x2 =  rexp(n, 0.4))

  # Network
  Glist  <- list()
  for (m in 1:M) {
    nm           <- nvec[m]
    Gm           <- matrix(0, nm, nm)
    max_d        <- 30
    for (i in 1:nm) {
      tmp        <- sample((1:nm)[-i], sample(0:max_d, 1))
      Gm[i, tmp] <- 1
    }
    rs           <- rowSums(Gm); rs[rs == 0] <- 1
    Gm           <- Gm/rs
    Glist[[m]]   <- Gm
  }

  # y
  ytmp   <- simCDnet(formula = ~ x1 + x2 | x1 + x2, Glist = Glist, theta = theta,
                     data = data)
  y      <- ytmp$y
  data$y <- y

  #  Models
  CD     <- CDnetNPL(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist,
                     optimizer = "nlm", 
                     data = data, npl.ctr = list(print = FALSE, maxit = 5e3))
  SART   <- SARTML(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist,
                   optimizer = "nlm", data = data, print = FALSE)
  SAR    <- SARML(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist,
                  optimizer = "nlm", data = data, print = FALSE)
  c(CD$estimate, SART$estimate, SAR$estimate)
}

I run 1000 times the Monte Carlo function fMC and compute the average of each estimate. To make it faster, I run the replications in parallel using doParallel package [@R-doParallel].

library(doParallel)
n.cores  <- 32
replic   <- 1000 #Number of replications
out.mc   <- mclapply(1:replic, fMC, mc.cores = n.cores)
out.mc   <- apply(t(do.call(cbind, out.mc)), 2, mean)
out.mc   <- cbind(theta, out.mc[1:7], out.mc[8:14], out.mc[15:21])
colnames(out.mc) <- c("TrueValue", "CoutData", "SART", "SAR")
print(out.mc)
load("out.mc.Rdata")
print(out.mc)

The estimator of the count data model seems unbiased while the SART and the SAR models underestimate the peer effects.

Examples with endogenous network{#endo}

Peer effects estimation is generally based on the assumption of exogeneity of the adjacency matrix. This means that the probability of link formation is not correlated to the error term in the count data model. Such an assumption is strong as the link formation probability may depend on unobserved characteristics (e.g., gregariousness) that also influence the outcome. In this section, I simulate data with endogenous network (which violates the assumption). I then show that the peer effect is overestimated when one does not control for the endogeneity of the network. I also present a method to control for the endogeneity. Finally, I use Monte Carlo simulations to prove that this method performs well.

Data simulation

I assume three sub-networks. The number of individuals in each sub-network is randomly chosen between 100 and 500.

rm(list = ls())
set.seed(2020) 
# Groups' size
M      <- 3 # Number of sub-groups
nvec   <- round(runif(M, 100, 500))
print(nvec)
n      <- sum(nvec)
print(n)

# Parameters
lambda <- 0.4               # peer effects
beta   <- c(2, -1.9, 0.8)   # own effects
gamma  <- c(1.5, -1.2)      # contextual effects
sigma  <- 1.5               # standard deviation of epsilon
theta  <- c(lambda, beta, gamma, sigma)

# X
data   <- data.frame(x1 = rnorm(n, 1, 1), x2 =  rexp(n, 0.4))

The network matrix follows the dyadic linking model presented in @houndetoungan2020cdata. Let $\boldsymbol{A}$ be the matrix of links such that $A_{ij} = 1$ if $i$ knows $j$ and $A_{ij} = 0$. In other words, $A_{ij} = 1$ if $G_{ij} = \dfrac{1}{n_i}$. For he network formation model, it is easier to work with $\boldsymbol{A}$ (binary data). However, for the count data model, I use $\boldsymbol{G}$ as row-normalized equivalent of $\boldsymbol{A}$.

The probability of link formation, $P_{ij}$ depends on $\Delta x_{1ij} = |x_{1i} - x_{1j}|$ and $\Delta x_{2ij} = |x_{2i} - x_{2j}|$ (observed dyad-specific variables) as well as on unobserved individual-level attributes $\mu_i$ and $\mu_j$. Typically, let $a_{ij}^$ defined by $$a_{ij}^ = \Delta \boldsymbol{x}{ij}^{\prime}\bar{\boldsymbol{\beta}} + \mu_i + \mu_j + \varepsilon{ij}^,$$ where $\varepsilon_{ij}^ \sim logistic$ distribution. I assume that $A_{ij} = 1$ if $a_{ij}^* > 0$. In that case, $P_{ij}$ is given by $$P_{ij} = \dfrac{\exp{\left(\Delta \boldsymbol{x}{ij}^{\prime}\bar{\boldsymbol{\beta}} + \mu_i + \mu_j\right)}}{1 + \exp{\left(\Delta \boldsymbol{x}{ij}^{\prime}\bar{\boldsymbol{\beta}} + \mu_i + \mu_j\right)}}.$$ I also assume that $\mu_i$ is random. If $i$ is from the s-th sub-network, then $\mu_i \sim \mathcal{N}\left(u_{\mu s}, \sigma_{\mu s}^2\right)$. The mean and the variance of $\mu_i$ vary across the sub-networks. This is a way to control for the sub-network heterogeneity as fixed effects in the link formation probability.

# Parameter for network model
betanet  <- c(-2.8, -1.5)    # beta
Glist    <- list()           # adjacency matrix row normalized
Network  <- list()           # adjacency matrix row non-normalized
dX       <- matrix(0, 0, 2)  # observed dyad-specific variables
mu       <- list()           # unobserved individual-level attribute
uu       <- runif(M, -1, 1)  # mean of uu in each sub-network
sigma2u  <- runif(M, 0.5, 4) # variance of uu in each sub-network

# Network
for (m in 1:M) {
  nm           <- nvec[m]
  mum          <- rnorm(nm, uu[m], sqrt(sigma2u[m]))
  Z1           <- matrix(0, nm, nm)  
  Z2           <- matrix(0, nm, nm)

  for (i in 1:nm) {
    for (j in 1:nm) {
      Z1[i, j] <- abs(data$x1[i] - data$x1[j])
      Z2[i, j] <- abs(data$x2[i] - data$x2[j])
    }
  }

  Gm           <- 1*((Z1*betanet[1] + Z2*betanet[2] +
                        kronecker(mum, t(mum), "+") + rlogis(nm^2)) > 0)
  diag(Gm)     <- 0

  diag(Z1)     <- NA
  diag(Z2)     <- NA
  Z1           <- Z1[!is.na(Z1)]
  Z2           <- Z2[!is.na(Z2)]

  dX           <- rbind(dX, cbind(Z1, Z2))

  Network[[m]] <- Gm
  rs           <- rowSums(Gm); rs[rs == 0] <- 1
  Gm           <- Gm/rs
  Glist[[m]]   <- Gm
  mu[[m]]      <- mum
}
mu            <- unlist(mu)

To simulate the dependent variable, I use $\boldsymbol{\mu}$, the vector of $\mu_i$ as additional explanatory variable in the count variable model. I set that $Cor(\mu_i, \varepsilon_i) = \rho$ and $Cor(\sum_{j = 1}^nG_{ij}\mu_i, \varepsilon_i) = \bar{\rho}$. In that case, $\varepsilon_{i} = \rho \sigma_{\varepsilon}\tilde{\mu}i + \bar{\rho} \sigma{\varepsilon}\bar{\tilde{\mu}}i + \tilde{\nu_i}$, where $\tilde{\mu}_i = \dfrac{\mu_i - u{\mu s}}{\sigma_{\mu s}}$, $\bar{\tilde{\mu}}i = \sum{j = 1}^nG_{ij}\tilde{\mu}j$ is the average of $\tilde{\mu}_i$ among $i$'s friends and $\tilde{\nu_i}$ is the new error term following normal distribution of variance $\bar{\sigma}^2{\varepsilon} = \left(1 - \rho^2 - \bar{\rho}^2\right)\sigma^2_{\varepsilon}$.

tmu      <- (mu - rep(uu, nvec))/sqrt(rep(sigma2u, nvec))
data$tmu <- tmu
rho      <- 0.24
rhobar   <- 0.18
thetanet <- c(lambda, beta, sigma*rho, gamma, sigma*rhobar,
              sigma*(1 - rho - rhobar))
ytmp     <- simCDnet(formula = ~ x1 + x2 + tmu | x1 + x2 + tmu,
                     Glist = Glist, theta = thetanet, data = data)
y        <- ytmp$y
# Add y to data
data$y   <- y
# Summarize y
summary(y)
# Plot data histogram
print(ggplot(data = data.frame(y = y), aes(x = y)) +
        geom_bar(color = "black", fill = "cyan") +
        theme_bw() + xlab("") + ylab("Frequency"))

In real-life, $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$ are not observed and they are included in the error term $\varepsilon_i$. This implies that the network is endogenous.

Estimation

When the endogeneity of the network is not taken into account, the peer effects are overestimated. The reason is that the parameter $\lambda$ also captures other effects. The model considers any common shock on $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$ as peer effects. In the example below, the peer effects are estimated at 0.49 when I do not control for the endogeneity of the network.

# Count data model
CDexo <- CDnetNPL(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist,
                  optimizer = "optim", data = data, npl.ctr = list(print = FALSE))
summary(CDexo)

To correct the endogeneity, $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$ should be included in the model as additional explanatory variables. However, these variables are not observed.

I use Markov Chain Monte Carlo (MCMC) to estimate the parameters of the dyadic linking model (these include the unobserved individual-level attributes $\mu_i$). The estimation can be done using the function netformation.

# Dyadic linking model
net <- netformation(network =  Network, formula = ~ dX, fixed.effects = TRUE,
                    mcmc.ctr = list(burnin = 1000, iteration = 5000))
load("out.net.Rdata")
# I copied and pasted the output to save time. But if you run the code on your laptop with the same seed, you will get the same results
cat("0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|


The program successfully executed 

********SUMMARY******** 
n.obs          :  314890 
n.links        :  26325 
K              :  2 
Fixed effects  :  Yes 
Burnin         :  1000 
Iteration      :  6000 

Elapsed time   :  0  HH  7  mm  56  ss 

Average acceptance rate 
                   beta:  0.274125 
                     mu:  0.2697619 ")

I randomly choose some parameters and present their posterior distribution.

# plot simulations
par(mfrow = c(4,2), mar = c(2, 2, 2, 1.9))
plot(net$posterior$beta[,1], type = "l", ylim = c(-2.9, -2.6), col = "blue",
     main = bquote(beta[1]), ylab = "")
abline(h = betanet[1], col = "red")

plot(net$posterior$beta[,2], type = "l", ylim = c(-1.6, -1.4), col = "blue",
     main = bquote(beta[2]), ylab = "")
abline(h = betanet[2], col = "red")

plot(net$posterior$mu[,10], type = "l", col = "blue",
     main = bquote(mu[10]), ylab = "")
abline(h = mu[10], col = "red")

plot(net$posterior$mu[,542], type = "l", col = "blue",
     main = bquote(mu[542]), ylab = "")
abline(h = mu[542], col = "red")

plot(net$posterior$mu[,849], type = "l", col = "blue",
     main = bquote(mu[849]), ylab = "")
abline(h = mu[849], col = "red")

plot(net$posterior$mu[,752], type = "l", col = "blue",
     main = bquote(mu[752]), ylab = "")
abline(h = mu[752], col = "red")

plot(net$posterior$uu[,1], type = "l", col = "blue",
     main = bquote(u[mu][1]), ylab = "")
abline(h = uu[1], col = "red")

plot(net$posterior$sigmamu2[,3], type = "l", col = "blue",
     main =  bquote(sigma[mu][3]^2), ylab = "")
abline(h = sigma2u[3], col = "red")
# plot simulations
par(mfrow = c(4,2), mar = c(2, 2, 2, 1.9))
plot(snet[[1]], type = "l", ylim = c(-2.9, -2.6), col = "blue",
     main = bquote(beta[1]), ylab = "")
abline(h = betanet[1], col = "red")

plot(snet[[2]], type = "l", ylim = c(-1.6, -1.4), col = "blue",
     main = bquote(beta[2]), ylab = "")
abline(h = betanet[2], col = "red")

plot(snet[[3]], type = "l", col = "blue",
     main = bquote(mu[10]), ylab = "")
abline(h = mu[10], col = "red")

plot(snet[[4]], type = "l", col = "blue",
     main = bquote(mu[542]), ylab = "")
abline(h = mu[542], col = "red")

plot(snet[[5]], type = "l", col = "blue",
     main = bquote(mu[849]), ylab = "")
abline(h = mu[849], col = "red")

plot(snet[[6]], type = "l", col = "blue",
     main = bquote(mu[752]), ylab = "")
abline(h = mu[752], col = "red")

plot(snet[[7]], type = "l", col = "blue",
     main = bquote(u[mu][1]), ylab = "")
abline(h = uu[1], col = "red")

plot(snet[[8]], type = "l", col = "blue",
     main =  bquote(sigma[mu][3]^2), ylab = "")
abline(h = sigma2u[3], col = "red")

It stands out that the MCMC converges quickly and the simulations (plotted in blue) are pretty close to the true values (red line).

Using simulations from the posterior distribution, I can construct good estimators for $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$ and use them as additional explanatory variables. In the next example, I use the simulation with the highest posterior density to estimate $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$.

t           <- which.max(net$posterior$log.density)
print(t)
print(t)
muest       <- net$posterior$mu[t,]
uuest       <- net$posterior$uu[t,]
sigma2uest  <- net$posterior$sigmamu2[t,]
tmuest      <- (muest - rep(uuest, nvec))/sqrt(rep(sigma2uest, nvec))
data$tmuest <- tmuest
CDendo      <- CDnetNPL(formula = y ~ x1 + x2 + tmuest, contextual = TRUE, 
                        Glist = Glist, optimizer = "optim", data = data, 
                        npl.ctr = list(print = FALSE))
summary(CDendo)
data$tmuest <- tmuest
summary(CDendo)

The coefficients of $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$ are significant. This confirms that the network is endogenous. Moreover, the peer effects are estimated at 0.40, which is much better than the previous estimate.

The standard errors computed in this example are not valid because they assume that $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$ are observed. To correct the standard error, I replicate simulations of $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$ with replacement from the posterior distribution in order to take into account the variance of the MCMC.

fendo    <- function(s) {
  t         <- sample(3001:8000, 1)
  datat     <- data
  mus       <- net$posterior$mu[t,]
  uus       <- rep(net$posterior$uu[t,], nvec)
  sus       <- rep(net$posterior$sigmamu2[t,], nvec)
  tmu       <- (mus - uus)/sqrt(sus)
  datat$tmu <- tmu

  CDnet    <- CDnetNPL(formula = y ~ x1 + x2 + tmu, contextual = TRUE, 
                       Glist = Glist, optimizer = "optim", data = datat,
                       npl.ctr = list(print = FALSE))

  summary(CDnet, Glist = Glist, data = datat)
}

n.cores    <- 32
replic     <- 1000 #Number of replications
out.endo   <- mclapply(1:replic, fendo, mc.cores = n.cores)
# The output of out.endo is a list of objects from "summary.CDnetNPL" class
# Let's set the class of out.endo as "summary.CDnetNPLs"
class(out.endo) <- "summary.CDnetNPLs"
# I can now summarize the results using print
# Object from "summary.CDnetNPL" has a print method
print(out.endo)
load("out.endo.Rdata")
CDatanet:::print.summary.CDnetNPLs(out.endo)

The standard errors are slightly greater than the previous estimates.

Monte Carlo Simulations

One simulation is not sufficient to confirm that the method used to correct the endogeneity performs well. In this section, I use Monte Carlo simulations to assess the performance of this method. I compare the estimates when the network is assumed exogenous to these when I control for the endogeneity.

I build the following Monte Carlo function which simulates data based on endogenous network, and estimates the model by assuming that the network is exogenous and then endogenous.

fMCendo <- function(s) {
  # parameters
  M        <- 3 
  nvec     <- rep(500, 3)
  n        <- sum(nvec)

  lambda   <- 0.4              # peer effects
  beta     <- c(2, -1.9, 0.8)  # own effects
  gamma    <- c(1.5, -1.2)     # contextual effects
  sigma    <- 1.5              # standard deviation of epsilon
  theta    <- c(lambda, beta, gamma, sigma)

  data     <- data.frame(x1 = rnorm(n, 1, 1), x2 =  rexp(n, 0.4))

  # Network
  betanet  <- c(-2.8, -1.5)    # beta
  Glist    <- list()           # adjacency matrix row normalized
  Network  <- list()           # adjacency matrix row non-normalized
  dX       <- matrix(0, 0, 2)  # observed dyad-specific variables
  mu       <- list()           # unobserved individual-level attribute
  uu       <- runif(M, -1, 1)  # mean of uu in each sub-network
  sigma2u  <- runif(M, 0.5, 4) # variance of uu in each sub-network

  for (m in 1:M) {
    nm           <- nvec[m]
    mum          <- rnorm(nm, uu[m], sqrt(sigma2u[m]))
    Z1           <- matrix(0, nm, nm)  
    Z2           <- matrix(0, nm, nm)

    for (i in 1:nm) {
      for (j in 1:nm) {
        Z1[i, j] <- abs(data$x1[i] - data$x1[j])
        Z2[i, j] <- abs(data$x2[i] - data$x2[j])
      }
    }

    Gm           <- 1*((Z1*betanet[1] + Z2*betanet[2] +
                          kronecker(mum, t(mum), "+") + rlogis(nm^2)) > 0)
    diag(Gm)     <- 0

    diag(Z1)     <- NA
    diag(Z2)     <- NA
    Z1           <- Z1[!is.na(Z1)]
    Z2           <- Z2[!is.na(Z2)]

    dX           <- rbind(dX, cbind(Z1, Z2))

    Network[[m]] <- Gm
    rs           <- rowSums(Gm); rs[rs == 0] <- 1
    Gm           <- Gm/rs
    Glist[[m]]   <- Gm
    mu[[m]]      <- mum
  }
  mu             <- unlist(mu)

  # Data
  tmu      <- (mu - rep(uu, nvec))/sqrt(rep(sigma2u, nvec))
  data$tmu <- tmu
  rho      <- 0.24
  rhobar   <- 0.18
  thetanet <- c(lambda, beta, sigma*rho, gamma, sigma*rhobar,
                sigma*(1 - rho - rhobar))
  ytmp     <- simCDnet(formula = ~ x1 + x2 + tmu | x1 + x2 + tmu,
                       Glist = Glist, theta = thetanet, data = data)
  y        <- ytmp$y
  data$y   <- y

  # Count data model
  CDexo <- CDnetNPL(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist,
                    optimizer = "optim", data = data, npl.ctr = list(print = FALSE))
  # Dyadic linking model
  net   <- netformation(network =  Network, formula = ~ dX, fixed.effects = TRUE,
                        mcmc.ctr = list(burnin = 1000, iteration = 2000), print = FALSE)
  # Endogeneity
  t           <- which.max(net$posterior$log.density)
  muest       <- net$posterior$mu[t,]
  uuest       <- net$posterior$uu[t,]
  sigma2uest  <- net$posterior$sigmamu2[t,]
  tmuest      <- (muest - rep(uuest, nvec))/sqrt(rep(sigma2uest, nvec))
  data$tmuest <- tmuest
  CDendo      <- CDnetNPL(formula = y ~ x1 + x2 + tmuest, contextual = TRUE, 
                          Glist = Glist, optimizer = "optim", data = data, 
                          npl.ctr = list(print = FALSE))
  c(CDexo$estimate, CDendo$estimate)
}

I now run 1000 times the function fMCendo and compute the average of both estimators.

n.cores               <- 5
replic                <- 1000 #Number of replications
out.mcendo            <- mclapply(1:replic, fMCendo, mc.cores = n.cores)
out.mcendo            <- apply(t(do.call(cbind, out.mcendo)), 2, mean)
out.endo              <- cbind(thetanet, c(out.mcendo[c(1:4, NA, 5:6, NA, 7)]),
                               out.mcendo[8:16])
rownames(out.endo)    <- names(out.mcendo[8:16])
colnames(out.endo)    <- c("TrueValue", "Endo-Not-Controlled", "Endo-Controlled")
print(out.endo)
load("out.mcendo.Rdata")
print(out.endo)

The results prove that the method used to correct the endogeneity performs fairly well. The average of the estimates is 0.42 while the true value is 0.40. There is still a small bias. This is certainly due to the number of iterations of the MCMC which is low. Indeed, I use the iteration with the highest posterior density to estimate $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$. Since the number of iterations is low, the estimates may not be very good: the more the number of iterations, the better the simulation with highest posterior density for approximating $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$.

In contrast, the peer effects are overestimated when the network is assumed exogenous. The average of the estimates is 0.52.

Conclusion {#con}

This paper provides technical details on the package CDatanet. It shows with simple and practical examples how to use the package. It provides comparison of the model with the SART and the SAR models. It also shows with Monte Carlo simulations how the model performs when the network matrix is exogenous and endogenous. Simulation results prove that the peer effects are overestimated when the network is endogenous. Lastly, it presents a way to control for the endogeneity of the network.



Try the CDatanet package in your browser

Any scripts or data that you put into this service are public.

CDatanet documentation built on Feb. 18, 2021, 1:07 a.m.