# CDatanet: An R package to simulate and estimate a Count Data Model with Social Interactions In CDatanet: Modeling Count Data with Peer Effects

# Add y to data
data$y <- y # Summarize y summary(y) table(y) # Plot data histogram library(ggplot2) print(ggplot(data = data.frame(y = y), aes(x = y)) + geom_bar(color = "black", fill = "cyan") + theme_bw() + xlab("") + ylab("Frequency"))  ## Estimation Using the simulated data, I estimate the count data model as well as the spatial autoregressive Tobit (SART) and the spatial autoregressive (SAR) models. The SART model assumes that the dependent variable is continuous and left-censored at 0. This model can be used to model count data when data contain many zeros. However, the SART model does not account for the integer nature of the dependent variable. This can lead to biased estimates. In addition, the SART model also does not account for the integer nature of the dependent variable and does not assume that the data are censored. One can compare the peer effects using the count data model to those when the dependent variable is assumed continuous. The outputs of the functions that implement these models have a summary class to summarize the results and provide marginal effects.^[Note that the estimates from the SAR model can be directly interpreted as marginal effects.] # I will not run the estimates because it takes time. I will just load the saved ouptuts. However, you will get the same results if you run the code on your computer with the seed set above. load("out.est.Rdata")  # Count data CD <- CDnetNPL(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist, optimizer = "nlm", data = data, npl.ctr = list(print = FALSE, maxit = 5e3)) summary(CD)  summary(CD)  # SART SART <- SARTML(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist, optimizer = "nlm", data = data, print = FALSE) summary(SART)  summary(SART)  #SAR SAR <- SARML(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist, optimizer = "nlm", data = data, print = FALSE) summary(SAR)  summary(SAR)  Using the count data model, the peer effect is estimated at 0.38, which is close to the true value, 0.4. By replicating this experience several times, the average of the estimates is likely to be almost equal to the true value (see next section). Note that this parameter is not interpretable. Instead, one should interpret the marginal effect: increasing the expected$y_j$of the peers by one implies an increase of 0.28 in the expected$y_i$. However, the SART and the SAR model underestimate thos impact at 0.19 and 0.16, respectively. ## Monte Carlo Simulations In this section, I conduct Monte Carlo simulations to assess the performance of the estimator. I use the same settings as in Section 3.1 except that the number of individuals in each sub-network is set to 500. I build the following Monte Carlo function that simulates the data, estimates the three models and returns the estimates. fMC <- function(s) { # Groups' size M <- 5 nvec <- rep(500, 5); n <- sum(nvec) # Parameters lambda <- 0.4 beta <- c(2, -1.9, 0.8) gamma <- c(1.5, -1.2) sigma <- 1.5 theta <- c(lambda, beta, gamma, sigma) # X data <- data.frame(x1 = rnorm(n, 1, 1), x2 = rexp(n, 0.4)) # Network Glist <- list() for (m in 1:M) { nm <- nvec[m] Gm <- matrix(0, nm, nm) max_d <- 30 for (i in 1:nm) { tmp <- sample((1:nm)[-i], sample(0:max_d, 1)) Gm[i, tmp] <- 1 } rs <- rowSums(Gm); rs[rs == 0] <- 1 Gm <- Gm/rs Glist[[m]] <- Gm } # y ytmp <- simCDnet(formula = ~ x1 + x2 | x1 + x2, Glist = Glist, theta = theta, data = data) y <- ytmp$y
data$y <- y # Models CD <- CDnetNPL(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist, optimizer = "nlm", data = data, npl.ctr = list(print = FALSE, maxit = 5e3)) SART <- SARTML(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist, optimizer = "nlm", data = data, print = FALSE) SAR <- SARML(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist, optimizer = "nlm", data = data, print = FALSE) c(CD$estimate, SART$estimate, SAR$estimate)
}


I run 1000 times the Monte Carlo function fMC and compute the average of each estimate. To make it faster, I run the replications in parallel using doParallel package [@R-doParallel].

library(doParallel)
n.cores  <- 32
replic   <- 1000 #Number of replications
out.mc   <- mclapply(1:replic, fMC, mc.cores = n.cores)
out.mc   <- apply(t(do.call(cbind, out.mc)), 2, mean)
out.mc   <- cbind(theta, out.mc[1:7], out.mc[8:14], out.mc[15:21])
colnames(out.mc) <- c("TrueValue", "CoutData", "SART", "SAR")
print(out.mc)

load("out.mc.Rdata")
print(out.mc)


The estimator of the count data model seems unbiased while the SART and the SAR models underestimate the peer effects.

# Examples with endogenous network{#endo}

Peer effects estimation is generally based on the assumption of exogeneity of the adjacency matrix. This means that the probability of link formation is not correlated to the error term in the count data model. Such an assumption is strong as the link formation probability may depend on unobserved characteristics (e.g., gregariousness) that also influence the outcome. In this section, I simulate data with endogenous network (which violates the assumption). I then show that the peer effect is overestimated when one does not control for the endogeneity of the network. I also present a method to control for the endogeneity. Finally, I use Monte Carlo simulations to prove that this method performs well.

## Data simulation

I assume three sub-networks. The number of individuals in each sub-network is randomly chosen between 100 and 500.

rm(list = ls())
set.seed(2020)
# Groups' size
M      <- 3 # Number of sub-groups
nvec   <- round(runif(M, 100, 500))
print(nvec)
n      <- sum(nvec)
print(n)

# Parameters
lambda <- 0.4               # peer effects
beta   <- c(2, -1.9, 0.8)   # own effects
gamma  <- c(1.5, -1.2)      # contextual effects
sigma  <- 1.5               # standard deviation of epsilon
theta  <- c(lambda, beta, gamma, sigma)

# X
data   <- data.frame(x1 = rnorm(n, 1, 1), x2 =  rexp(n, 0.4))


The network matrix follows the dyadic linking model presented in @houndetoungan2020cdata. Let $\boldsymbol{A}$ be the matrix of links such that $A_{ij} = 1$ if $i$ knows $j$ and $A_{ij} = 0$. In other words, $A_{ij} = 1$ if $G_{ij} = \dfrac{1}{n_i}$. For he network formation model, it is easier to work with $\boldsymbol{A}$ (binary data). However, for the count data model, I use $\boldsymbol{G}$ as row-normalized equivalent of $\boldsymbol{A}$.

The probability of link formation, $P_{ij}$ depends on $\Delta x_{1ij} = |x_{1i} - x_{1j}|$ and $\Delta x_{2ij} = |x_{2i} - x_{2j}|$ (observed dyad-specific variables) as well as on unobserved individual-level attributes $\mu_i$ and $\mu_j$. Typically, let $a_{ij}^$ defined by $$a_{ij}^ = \Delta \boldsymbol{x}{ij}^{\prime}\bar{\boldsymbol{\beta}} + \mu_i + \mu_j + \varepsilon{ij}^,$$ where $\varepsilon_{ij}^ \sim logistic$ distribution. I assume that $A_{ij} = 1$ if $a_{ij}^* > 0$. In that case, $P_{ij}$ is given by $$P_{ij} = \dfrac{\exp{\left(\Delta \boldsymbol{x}{ij}^{\prime}\bar{\boldsymbol{\beta}} + \mu_i + \mu_j\right)}}{1 + \exp{\left(\Delta \boldsymbol{x}{ij}^{\prime}\bar{\boldsymbol{\beta}} + \mu_i + \mu_j\right)}}.$$ I also assume that $\mu_i$ is random. If $i$ is from the s-th sub-network, then $\mu_i \sim \mathcal{N}\left(u_{\mu s}, \sigma_{\mu s}^2\right)$. The mean and the variance of $\mu_i$ vary across the sub-networks. This is a way to control for the sub-network heterogeneity as fixed effects in the link formation probability.

# Parameter for network model
betanet  <- c(-2.8, -1.5)    # beta
Glist    <- list()           # adjacency matrix row normalized
Network  <- list()           # adjacency matrix row non-normalized
dX       <- matrix(0, 0, 2)  # observed dyad-specific variables
mu       <- list()           # unobserved individual-level attribute
uu       <- runif(M, -1, 1)  # mean of uu in each sub-network
sigma2u  <- runif(M, 0.5, 4) # variance of uu in each sub-network

# Network
for (m in 1:M) {
nm           <- nvec[m]
mum          <- rnorm(nm, uu[m], sqrt(sigma2u[m]))
Z1           <- matrix(0, nm, nm)
Z2           <- matrix(0, nm, nm)

for (i in 1:nm) {
for (j in 1:nm) {
Z1[i, j] <- abs(data$x1[i] - data$x1[j])
Z2[i, j] <- abs(data$x2[i] - data$x2[j])
}
}

Gm           <- 1*((Z1*betanet[1] + Z2*betanet[2] +
kronecker(mum, t(mum), "+") + rlogis(nm^2)) > 0)
diag(Gm)     <- 0

diag(Z1)     <- NA
diag(Z2)     <- NA
Z1           <- Z1[!is.na(Z1)]
Z2           <- Z2[!is.na(Z2)]

dX           <- rbind(dX, cbind(Z1, Z2))

Network[[m]] <- Gm
rs           <- rowSums(Gm); rs[rs == 0] <- 1
Gm           <- Gm/rs
Glist[[m]]   <- Gm
mu[[m]]      <- mum
}
mu            <- unlist(mu)


To simulate the dependent variable, I use $\boldsymbol{\mu}$, the vector of $\mu_i$ as additional explanatory variable in the count variable model. I set that $Cor(\mu_i, \varepsilon_i) = \rho$ and $Cor(\sum_{j = 1}^nG_{ij}\mu_i, \varepsilon_i) = \bar{\rho}$. In that case, $\varepsilon_{i} = \rho \sigma_{\varepsilon}\tilde{\mu}i + \bar{\rho} \sigma{\varepsilon}\bar{\tilde{\mu}}i + \tilde{\nu_i}$, where $\tilde{\mu}_i = \dfrac{\mu_i - u{\mu s}}{\sigma_{\mu s}}$, $\bar{\tilde{\mu}}i = \sum{j = 1}^nG_{ij}\tilde{\mu}j$ is the average of $\tilde{\mu}_i$ among $i$'s friends and $\tilde{\nu_i}$ is the new error term following normal distribution of variance $\bar{\sigma}^2{\varepsilon} = \left(1 - \rho^2 - \bar{\rho}^2\right)\sigma^2_{\varepsilon}$.

tmu      <- (mu - rep(uu, nvec))/sqrt(rep(sigma2u, nvec))
data$tmu <- tmu rho <- 0.24 rhobar <- 0.18 thetanet <- c(lambda, beta, sigma*rho, gamma, sigma*rhobar, sigma*(1 - rho - rhobar)) ytmp <- simCDnet(formula = ~ x1 + x2 + tmu | x1 + x2 + tmu, Glist = Glist, theta = thetanet, data = data) y <- ytmp$y
# Add y to data
data$y <- y # Summarize y summary(y) # Plot data histogram print(ggplot(data = data.frame(y = y), aes(x = y)) + geom_bar(color = "black", fill = "cyan") + theme_bw() + xlab("") + ylab("Frequency"))  In real-life,$\tilde{\mu}_i$and$\bar{\tilde{\mu}}_i$are not observed and they are included in the error term$\varepsilon_i$. This implies that the network is endogenous. ## Estimation When the endogeneity of the network is not taken into account, the peer effects are overestimated. The reason is that the parameter$\lambda$also captures other effects. The model considers any common shock on$\tilde{\mu}_i$and$\bar{\tilde{\mu}}_i$as peer effects. In the example below, the peer effects are estimated at 0.49 when I do not control for the endogeneity of the network. # Count data model CDexo <- CDnetNPL(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist, optimizer = "optim", data = data, npl.ctr = list(print = FALSE)) summary(CDexo)  To correct the endogeneity,$\tilde{\mu}_i$and$\bar{\tilde{\mu}}_i$should be included in the model as additional explanatory variables. However, these variables are not observed. I use Markov Chain Monte Carlo (MCMC) to estimate the parameters of the dyadic linking model (these include the unobserved individual-level attributes$\mu_i$). The estimation can be done using the function netformation. # Dyadic linking model net <- netformation(network = Network, formula = ~ dX, fixed.effects = TRUE, mcmc.ctr = list(burnin = 1000, iteration = 5000))  load("out.net.Rdata") # I copied and pasted the output to save time. But if you run the code on your laptop with the same seed, you will get the same results cat("0% 10 20 30 40 50 60 70 80 90 100% [----|----|----|----|----|----|----|----|----|----| **************************************************| The program successfully executed ********SUMMARY******** n.obs : 314890 n.links : 26325 K : 2 Fixed effects : Yes Burnin : 1000 Iteration : 6000 Elapsed time : 0 HH 7 mm 56 ss Average acceptance rate beta: 0.274125 mu: 0.2697619 ")  I randomly choose some parameters and present their posterior distribution. # plot simulations par(mfrow = c(4,2), mar = c(2, 2, 2, 1.9)) plot(net$posterior$beta[,1], type = "l", ylim = c(-2.9, -2.6), col = "blue", main = bquote(beta[1]), ylab = "") abline(h = betanet[1], col = "red") plot(net$posterior$beta[,2], type = "l", ylim = c(-1.6, -1.4), col = "blue", main = bquote(beta[2]), ylab = "") abline(h = betanet[2], col = "red") plot(net$posterior$mu[,10], type = "l", col = "blue", main = bquote(mu[10]), ylab = "") abline(h = mu[10], col = "red") plot(net$posterior$mu[,542], type = "l", col = "blue", main = bquote(mu[542]), ylab = "") abline(h = mu[542], col = "red") plot(net$posterior$mu[,849], type = "l", col = "blue", main = bquote(mu[849]), ylab = "") abline(h = mu[849], col = "red") plot(net$posterior$mu[,752], type = "l", col = "blue", main = bquote(mu[752]), ylab = "") abline(h = mu[752], col = "red") plot(net$posterior$uu[,1], type = "l", col = "blue", main = bquote(u[mu][1]), ylab = "") abline(h = uu[1], col = "red") plot(net$posterior$sigmamu2[,3], type = "l", col = "blue", main = bquote(sigma[mu][3]^2), ylab = "") abline(h = sigma2u[3], col = "red")  # plot simulations par(mfrow = c(4,2), mar = c(2, 2, 2, 1.9)) plot(snet[[1]], type = "l", ylim = c(-2.9, -2.6), col = "blue", main = bquote(beta[1]), ylab = "") abline(h = betanet[1], col = "red") plot(snet[[2]], type = "l", ylim = c(-1.6, -1.4), col = "blue", main = bquote(beta[2]), ylab = "") abline(h = betanet[2], col = "red") plot(snet[[3]], type = "l", col = "blue", main = bquote(mu[10]), ylab = "") abline(h = mu[10], col = "red") plot(snet[[4]], type = "l", col = "blue", main = bquote(mu[542]), ylab = "") abline(h = mu[542], col = "red") plot(snet[[5]], type = "l", col = "blue", main = bquote(mu[849]), ylab = "") abline(h = mu[849], col = "red") plot(snet[[6]], type = "l", col = "blue", main = bquote(mu[752]), ylab = "") abline(h = mu[752], col = "red") plot(snet[[7]], type = "l", col = "blue", main = bquote(u[mu][1]), ylab = "") abline(h = uu[1], col = "red") plot(snet[[8]], type = "l", col = "blue", main = bquote(sigma[mu][3]^2), ylab = "") abline(h = sigma2u[3], col = "red")  It stands out that the MCMC converges quickly and the simulations (plotted in blue) are pretty close to the true values (red line). Using simulations from the posterior distribution, I can construct good estimators for$\tilde{\mu}_i$and$\bar{\tilde{\mu}}_i$and use them as additional explanatory variables. In the next example, I use the simulation with the highest posterior density to estimate$\tilde{\mu}_i$and$\bar{\tilde{\mu}}_i$. t <- which.max(net$posterior$log.density) print(t)  print(t)  muest <- net$posterior$mu[t,] uuest <- net$posterior$uu[t,] sigma2uest <- net$posterior$sigmamu2[t,] tmuest <- (muest - rep(uuest, nvec))/sqrt(rep(sigma2uest, nvec)) data$tmuest <- tmuest
CDendo      <- CDnetNPL(formula = y ~ x1 + x2 + tmuest, contextual = TRUE,
Glist = Glist, optimizer = "optim", data = data,
npl.ctr = list(print = FALSE))
summary(CDendo)

data$tmuest <- tmuest summary(CDendo)  The coefficients of$\tilde{\mu}_i$and$\bar{\tilde{\mu}}_i$are significant. This confirms that the network is endogenous. Moreover, the peer effects are estimated at 0.40, which is much better than the previous estimate. The standard errors computed in this example are not valid because they assume that$\tilde{\mu}_i$and$\bar{\tilde{\mu}}_i$are observed. To correct the standard error, I replicate simulations of$\tilde{\mu}_i$and$\bar{\tilde{\mu}}_i$with replacement from the posterior distribution in order to take into account the variance of the MCMC. fendo <- function(s) { t <- sample(3001:8000, 1) datat <- data mus <- net$posterior$mu[t,] uus <- rep(net$posterior$uu[t,], nvec) sus <- rep(net$posterior$sigmamu2[t,], nvec) tmu <- (mus - uus)/sqrt(sus) datat$tmu <- tmu

CDnet    <- CDnetNPL(formula = y ~ x1 + x2 + tmu, contextual = TRUE,
Glist = Glist, optimizer = "optim", data = datat,
npl.ctr = list(print = FALSE))

summary(CDnet, Glist = Glist, data = datat)
}

n.cores    <- 32
replic     <- 1000 #Number of replications
out.endo   <- mclapply(1:replic, fendo, mc.cores = n.cores)
# The output of out.endo is a list of objects from "summary.CDnetNPL" class
# Let's set the class of out.endo as "summary.CDnetNPLs"
class(out.endo) <- "summary.CDnetNPLs"
# I can now summarize the results using print
# Object from "summary.CDnetNPL" has a print method
print(out.endo)

load("out.endo.Rdata")
CDatanet:::print.summary.CDnetNPLs(out.endo)


The standard errors are slightly greater than the previous estimates.

## Monte Carlo Simulations

One simulation is not sufficient to confirm that the method used to correct the endogeneity performs well. In this section, I use Monte Carlo simulations to assess the performance of this method. I compare the estimates when the network is assumed exogenous to these when I control for the endogeneity.

I build the following Monte Carlo function which simulates data based on endogenous network, and estimates the model by assuming that the network is exogenous and then endogenous.

fMCendo <- function(s) {
# parameters
M        <- 3
nvec     <- rep(500, 3)
n        <- sum(nvec)

lambda   <- 0.4              # peer effects
beta     <- c(2, -1.9, 0.8)  # own effects
gamma    <- c(1.5, -1.2)     # contextual effects
sigma    <- 1.5              # standard deviation of epsilon
theta    <- c(lambda, beta, gamma, sigma)

data     <- data.frame(x1 = rnorm(n, 1, 1), x2 =  rexp(n, 0.4))

# Network
betanet  <- c(-2.8, -1.5)    # beta
Glist    <- list()           # adjacency matrix row normalized
Network  <- list()           # adjacency matrix row non-normalized
dX       <- matrix(0, 0, 2)  # observed dyad-specific variables
mu       <- list()           # unobserved individual-level attribute
uu       <- runif(M, -1, 1)  # mean of uu in each sub-network
sigma2u  <- runif(M, 0.5, 4) # variance of uu in each sub-network

for (m in 1:M) {
nm           <- nvec[m]
mum          <- rnorm(nm, uu[m], sqrt(sigma2u[m]))
Z1           <- matrix(0, nm, nm)
Z2           <- matrix(0, nm, nm)

for (i in 1:nm) {
for (j in 1:nm) {
Z1[i, j] <- abs(data$x1[i] - data$x1[j])
Z2[i, j] <- abs(data$x2[i] - data$x2[j])
}
}

Gm           <- 1*((Z1*betanet[1] + Z2*betanet[2] +
kronecker(mum, t(mum), "+") + rlogis(nm^2)) > 0)
diag(Gm)     <- 0

diag(Z1)     <- NA
diag(Z2)     <- NA
Z1           <- Z1[!is.na(Z1)]
Z2           <- Z2[!is.na(Z2)]

dX           <- rbind(dX, cbind(Z1, Z2))

Network[[m]] <- Gm
rs           <- rowSums(Gm); rs[rs == 0] <- 1
Gm           <- Gm/rs
Glist[[m]]   <- Gm
mu[[m]]      <- mum
}
mu             <- unlist(mu)

# Data
tmu      <- (mu - rep(uu, nvec))/sqrt(rep(sigma2u, nvec))
data$tmu <- tmu rho <- 0.24 rhobar <- 0.18 thetanet <- c(lambda, beta, sigma*rho, gamma, sigma*rhobar, sigma*(1 - rho - rhobar)) ytmp <- simCDnet(formula = ~ x1 + x2 + tmu | x1 + x2 + tmu, Glist = Glist, theta = thetanet, data = data) y <- ytmp$y
data$y <- y # Count data model CDexo <- CDnetNPL(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist, optimizer = "optim", data = data, npl.ctr = list(print = FALSE)) # Dyadic linking model net <- netformation(network = Network, formula = ~ dX, fixed.effects = TRUE, mcmc.ctr = list(burnin = 1000, iteration = 2000), print = FALSE) # Endogeneity t <- which.max(net$posterior$log.density) muest <- net$posterior$mu[t,] uuest <- net$posterior$uu[t,] sigma2uest <- net$posterior$sigmamu2[t,] tmuest <- (muest - rep(uuest, nvec))/sqrt(rep(sigma2uest, nvec)) data$tmuest <- tmuest
CDendo      <- CDnetNPL(formula = y ~ x1 + x2 + tmuest, contextual = TRUE,
Glist = Glist, optimizer = "optim", data = data,
npl.ctr = list(print = FALSE))
c(CDexo$estimate, CDendo$estimate)
}


I now run 1000 times the function fMCendo and compute the average of both estimators.

n.cores               <- 5
replic                <- 1000 #Number of replications
out.mcendo            <- mclapply(1:replic, fMCendo, mc.cores = n.cores)
out.mcendo            <- apply(t(do.call(cbind, out.mcendo)), 2, mean)
out.endo              <- cbind(thetanet, c(out.mcendo[c(1:4, NA, 5:6, NA, 7)]),
out.mcendo[8:16])
rownames(out.endo)    <- names(out.mcendo[8:16])
colnames(out.endo)    <- c("TrueValue", "Endo-Not-Controlled", "Endo-Controlled")
print(out.endo)

load("out.mcendo.Rdata")
print(out.endo)


The results prove that the method used to correct the endogeneity performs fairly well. The average of the estimates is 0.42 while the true value is 0.40. There is still a small bias. This is certainly due to the number of iterations of the MCMC which is low. Indeed, I use the iteration with the highest posterior density to estimate $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$. Since the number of iterations is low, the estimates may not be very good: the more the number of iterations, the better the simulation with highest posterior density for approximating $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$.

In contrast, the peer effects are overestimated when the network is assumed exogenous. The average of the estimates is 0.52.

# Conclusion {#con}

This paper provides technical details on the package CDatanet. It shows with simple and practical examples how to use the package. It provides comparison of the model with the SART and the SAR models. It also shows with Monte Carlo simulations how the model performs when the network matrix is exogenous and endogenous. Simulation results prove that the peer effects are overestimated when the network is endogenous. Lastly, it presents a way to control for the endogeneity of the network.

## Try the CDatanet package in your browser

Any scripts or data that you put into this service are public.

CDatanet documentation built on Feb. 18, 2021, 1:07 a.m.