knitr::opts_chunk$set(echo = FALSE)

There is a large and growing literature on peer effects in economics.^[For recent reviews, see @boucher2016some, @de2017econometrics and @bramoulle2019peer.] Recent contributions include, among others, models for discrete data, including binary [e.g., @lee2014binary; @liu2019simultaneous], ordered [e.g., @boucher2018peer], and multinomial [e.g., @guerra2017multinomial]. To my knowledge, however, there are no existing models for count variables with microeconomic foundations, despite these variables being prevalent in survey data (e.g., number of times teenagers have taken drugs, number of times teenagers have drunk alcohol, frequency of participation in sport activities, frequency of participation in extracurricular activities, frequency of restaurant visits, number of physician visits).

I present an incomplete information game with social interactions in which the outcome takes unbounded integer values. I show that it is important to account for the counting nature of the dependent variable. Assuming that this variable is continuous significantly underestimates the peer effects.

I also present an easy-to-use R package---named **CDatanet**---for implementing the model. This note is a supplement for the package manual. I provide examples with simulated data to show how to use the package. The package is located on GitHub at github.com/ahoundetoungan/CDatanet. All the results of the note can be replicated following the code provided.

The remainder of the paper is organized as follows. Section 2 briefly introduces the model. Sections 3 and 4 provide examples in which the network matrix is exogenous and endogenous respectively. They present a Monte Carlo study to assess the performance of the estimator of the model parameters. They also discuss how to control for the endogeneity of the network. Section 5 concludes.

Let $\mathcal{V} = {1, \dots, n}$ be a set of $n$ players indexed by $i$ and $y_i$ the observed integer outcome of player $i$ (e.g., the number of cigarettes smoked per day or per week). I denote by $\boldsymbol{G}$ the adjacency matrix such that $G_{ij} = \dfrac{1}{n_i}$ if $i$ knows $j$ and $G_{ij} = 0$ otherwise, where $n_i$ the number of friends of $i$. Let also $\boldsymbol{X}$ be the matrix of explanatory variables.

As in @liu2019simultaneous, I assume that there is a latent variable $y_i^*$ which determines the observed count variable $y_i$ . This latent variable $y_i^*$ and the observed $y_i$ are linked as follow.

Let $\left(a_q\right)*{q \in \mathbb{N}}$ be a sequence given by $a_0 = -\infty$, $a_1 \in \mathbb{R}$, and $a_q = a_1 + \gamma(q-1)$ for $q \in \mathbb{N}^ $ and $\gamma \in \mathbb{R}_+^$. If $y_i^* \in (a*{q}, a_{q + 1}]$ then $y_i = q$.

The first order conditions of the game equilibrium implies that

$$ y_i^* = \lambda\sum_{i = 1}^n \sum_{r = 0}^{\infty} rG_{ij} p_{ir} + \boldsymbol{x}*i^{\prime}\boldsymbol{\beta} + \varepsilon*{i}$$
where $p_{ir}$ is the probability of $y_i = r$, $\lambda$ is the peer effect and $\varepsilon_{i} \sim \mathcal{N}\left(0, \sigma_{\varepsilon}^2\right)$.

The game equilibrium is given by
$$ p_{ir} = \Phi\left(\dfrac{\lambda\sum_{i = 1}^n \sum_{r = 0}^{\infty} rG_{ij} p_{ir} + \boldsymbol{x}*i^{\prime}\boldsymbol{\beta} - a_q}{\sigma*{\varepsilon}}\right) - \Phi\left(\dfrac{\lambda\sum_{i = 1}^n \sum_{r = 0}^{\infty} rG_{ij} p_{ir} + \boldsymbol{x}*i^{\prime}\boldsymbol{\beta} - a*{q+1}}{\sigma_{\varepsilon}}\right)$$
where $\Phi$ is the probability density function of $\mathcal{N}(0, 1)$.

To estimate $\boldsymbol{\theta} = (\lambda, \boldsymbol{\beta}^{\prime}, \sigma_{\varepsilon})^{\prime}$, I rely on the Nested Partial Likelihood (NPL) method proposed by @aguirregabiria2007sequential. I refer the reader to @houndetoungan2020cdata for more details. The estimation method is implemented in the package **CDatanet**.

In this section, I present examples to show how to use the package. I simulate data following the model by assuming that the network matrix is exogenous. I then show how to estimate the model parameters using functions provided by **CDatanet**. I also use Monte Carlo simulations to assess the performance of the estimator of the model parameters.

Given the adjacency matrix, the explanatory variables and $\boldsymbol{\theta}$, the function `simCDnet`

can used to simulate data. I assume that there are `M`

sub-networks. The number of individuals in each sub-network is randomly chosen between 100 and 1000. I assume three exogenous variables (including the intercept) and I control for the contextual effects.

set.seed(2020) library(CDatanet) # Groups' size M <- 5 # Number of sub-groups nvec <- round(runif(M, 100, 1000)) print(nvec) n <- sum(nvec) print(n) # Parameters lambda <- 0.4 # peer effects beta <- c(2, -1.9, 0.8) # own effects gamma <- c(1.5, -1.2) # contextual effects sigma <- 1.5 # standard deviation of epsilon theta <- c(lambda, beta, gamma, sigma) # X data <- data.frame(x1 = rnorm(n, 1, 1), x2 = rexp(n, 0.4))

To simulate the network matrix, I assume that the number of friends of each individual is randomly chosen between 0 and 30.

# Network Glist <- list() for (m in 1:M) { nm <- nvec[m] Gm <- matrix(0, nm, nm) max_d <- 30 for (i in 1:nm) { tmp <- sample((1:nm)[-i], sample(0:max_d, 1)) Gm[i, tmp] <- 1 } rs <- rowSums(Gm); rs[rs == 0] <- 1 Gm <- Gm/rs Glist[[m]] <- Gm }

I can now simulate the count data. The output of `simCDnet`

includes `yst`

the vector of $y_i^*$, `y`

the vector of $y_i$, `yb`

the vector of $\displaystyle\sum_{r = 0}^{\infty}rp_{ir}$, `Gyb`

the vector of $\displaystyle\sum_{i = 1}^n \sum_{r = 0}^{\infty} rG_{ij} p_{ir}$ and `iteration`

as the number of iterations performed to find the fixed point `yb`

.

ytmp <- simCDnet(formula = ~ x1 + x2 | x1 + x2, Glist = Glist, theta = theta, data = data) names(ytmp) y <- ytmp$y # Add y to data data$y <- y # Summarize y summary(y) table(y) # Plot data histogram library(ggplot2) print(ggplot(data = data.frame(y = y), aes(x = y)) + geom_bar(color = "black", fill = "cyan") + theme_bw() + xlab("") + ylab("Frequency"))

Using the simulated data, I estimate the count data model as well as the spatial autoregressive Tobit (SART) and the spatial autoregressive (SAR) models. The SART model assumes that the dependent variable is continuous and left-censored at 0. This model can be used to model count data when data contain many zeros. However, the SART model does not account for the integer nature of the dependent variable. This can lead to biased estimates. In addition, the SART model also does not account for the integer nature of the dependent variable and does not assume that the data are censored. One can compare the peer effects using the count data model to those when the dependent variable is assumed continuous. The outputs of the functions that implement these models have a `summary class`

to summarize the results and provide marginal effects.^[Note that the estimates from the SAR model can be directly interpreted as marginal effects.]

# I will not run the estimates because it takes time. I will just load the saved ouptuts. However, you will get the same results if you run the code on your computer with the seed set above. load("out.est.Rdata")

# Count data CD <- CDnetNPL(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist, optimizer = "nlm", data = data, npl.ctr = list(print = FALSE, maxit = 5e3)) summary(CD)

```
summary(CD)
```

# SART SART <- SARTML(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist, optimizer = "nlm", data = data, print = FALSE) summary(SART)

```
summary(SART)
```

#SAR SAR <- SARML(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist, optimizer = "nlm", data = data, print = FALSE) summary(SAR)

```
summary(SAR)
```

Using the count data model, the peer effect is estimated at 0.38, which is close to the true value, 0.4. By replicating this experience several times, the average of the estimates is likely to be almost equal to the true value (see next section). Note that this parameter is not interpretable. Instead, one should interpret the marginal effect: *increasing the expected $y_j$ of the peers by one implies an increase of 0.28 in the expected $y_i$*. However, the SART and the SAR model underestimate thos impact at 0.19 and 0.16, respectively.

In this section, I conduct Monte Carlo simulations to assess the performance of the estimator. I use the same settings as in Section 3.1 except that the number of individuals in each sub-network is set to 500. I build the following Monte Carlo function that simulates the data, estimates the three models and returns the estimates.

fMC <- function(s) { # Groups' size M <- 5 nvec <- rep(500, 5); n <- sum(nvec) # Parameters lambda <- 0.4 beta <- c(2, -1.9, 0.8) gamma <- c(1.5, -1.2) sigma <- 1.5 theta <- c(lambda, beta, gamma, sigma) # X data <- data.frame(x1 = rnorm(n, 1, 1), x2 = rexp(n, 0.4)) # Network Glist <- list() for (m in 1:M) { nm <- nvec[m] Gm <- matrix(0, nm, nm) max_d <- 30 for (i in 1:nm) { tmp <- sample((1:nm)[-i], sample(0:max_d, 1)) Gm[i, tmp] <- 1 } rs <- rowSums(Gm); rs[rs == 0] <- 1 Gm <- Gm/rs Glist[[m]] <- Gm } # y ytmp <- simCDnet(formula = ~ x1 + x2 | x1 + x2, Glist = Glist, theta = theta, data = data) y <- ytmp$y data$y <- y # Models CD <- CDnetNPL(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist, optimizer = "nlm", data = data, npl.ctr = list(print = FALSE, maxit = 5e3)) SART <- SARTML(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist, optimizer = "nlm", data = data, print = FALSE) SAR <- SARML(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist, optimizer = "nlm", data = data, print = FALSE) c(CD$estimate, SART$estimate, SAR$estimate) }

I run 1000 times the Monte Carlo function `fMC`

and compute the average of each estimate. To make it faster, I run the replications in parallel using **doParallel** package [@R-doParallel].

library(doParallel) n.cores <- 32 replic <- 1000 #Number of replications out.mc <- mclapply(1:replic, fMC, mc.cores = n.cores) out.mc <- apply(t(do.call(cbind, out.mc)), 2, mean) out.mc <- cbind(theta, out.mc[1:7], out.mc[8:14], out.mc[15:21]) colnames(out.mc) <- c("TrueValue", "CoutData", "SART", "SAR") print(out.mc)

load("out.mc.Rdata") print(out.mc)

The estimator of the count data model seems unbiased while the SART and the SAR models underestimate the peer effects.

Peer effects estimation is generally based on the assumption of exogeneity of the adjacency matrix. This means that the probability of link formation is not correlated to the error term in the count data model. Such an assumption is strong as the link formation probability may depend on unobserved characteristics (e.g., gregariousness) that also influence the outcome. In this section, I simulate data with endogenous network (which violates the assumption). I then show that the peer effect is overestimated when one does not control for the endogeneity of the network. I also present a method to control for the endogeneity. Finally, I use Monte Carlo simulations to prove that this method performs well.

I assume three sub-networks. The number of individuals in each sub-network is randomly chosen between 100 and 500.

rm(list = ls()) set.seed(2020) # Groups' size M <- 3 # Number of sub-groups nvec <- round(runif(M, 100, 500)) print(nvec) n <- sum(nvec) print(n) # Parameters lambda <- 0.4 # peer effects beta <- c(2, -1.9, 0.8) # own effects gamma <- c(1.5, -1.2) # contextual effects sigma <- 1.5 # standard deviation of epsilon theta <- c(lambda, beta, gamma, sigma) # X data <- data.frame(x1 = rnorm(n, 1, 1), x2 = rexp(n, 0.4))

The network matrix follows the dyadic linking model presented in @houndetoungan2020cdata. Let $\boldsymbol{A}$ be the matrix of links such that $A_{ij} = 1$ if $i$ knows $j$ and $A_{ij} = 0$. In other words, $A_{ij} = 1$ if $G_{ij} = \dfrac{1}{n_i}$. For he network formation model, it is easier to work with $\boldsymbol{A}$ (binary data). However, for the count data model, I use $\boldsymbol{G}$ as row-normalized equivalent of $\boldsymbol{A}$.

The probability of link formation, $P_{ij}$ depends on $\Delta x_{1ij} = |x_{1i} - x_{1j}|$ and $\Delta x_{2ij} = |x_{2i} - x_{2j}|$ (observed dyad-specific variables) as well as on unobserved individual-level attributes $\mu_i$ and $\mu_j$. Typically, let $a_{ij}^*$ defined by
$$a_{ij}^* = \Delta \boldsymbol{x}*{ij}^{\prime}\bar{\boldsymbol{\beta}} + \mu_i + \mu_j + \varepsilon*{ij}^*,$$
where $\varepsilon_{ij}^* \sim logistic$ distribution. I assume that $A_{ij} = 1$ if $a_{ij}^* > 0$. In that case, $P_{ij}$ is given by
$$P_{ij} = \dfrac{\exp{\left(\Delta \boldsymbol{x}*{ij}^{\prime}\bar{\boldsymbol{\beta}} + \mu_i + \mu_j\right)}}{1 + \exp{\left(\Delta \boldsymbol{x}*{ij}^{\prime}\bar{\boldsymbol{\beta}} + \mu_i + \mu_j\right)}}.$$
I also assume that $\mu_i$ is random. If $i$ is from the s-th sub-network, then $\mu_i \sim \mathcal{N}\left(u_{\mu s}, \sigma_{\mu s}^2\right)$. The mean and the variance of $\mu_i$ vary across the sub-networks. This is a way to control for the sub-network heterogeneity as fixed effects in the link formation probability.

# Parameter for network model betanet <- c(-2.8, -1.5) # beta Glist <- list() # adjacency matrix row normalized Network <- list() # adjacency matrix row non-normalized dX <- matrix(0, 0, 2) # observed dyad-specific variables mu <- list() # unobserved individual-level attribute uu <- runif(M, -1, 1) # mean of uu in each sub-network sigma2u <- runif(M, 0.5, 4) # variance of uu in each sub-network # Network for (m in 1:M) { nm <- nvec[m] mum <- rnorm(nm, uu[m], sqrt(sigma2u[m])) Z1 <- matrix(0, nm, nm) Z2 <- matrix(0, nm, nm) for (i in 1:nm) { for (j in 1:nm) { Z1[i, j] <- abs(data$x1[i] - data$x1[j]) Z2[i, j] <- abs(data$x2[i] - data$x2[j]) } } Gm <- 1*((Z1*betanet[1] + Z2*betanet[2] + kronecker(mum, t(mum), "+") + rlogis(nm^2)) > 0) diag(Gm) <- 0 diag(Z1) <- NA diag(Z2) <- NA Z1 <- Z1[!is.na(Z1)] Z2 <- Z2[!is.na(Z2)] dX <- rbind(dX, cbind(Z1, Z2)) Network[[m]] <- Gm rs <- rowSums(Gm); rs[rs == 0] <- 1 Gm <- Gm/rs Glist[[m]] <- Gm mu[[m]] <- mum } mu <- unlist(mu)

To simulate the dependent variable, I use $\boldsymbol{\mu}$, the vector of $\mu_i$ as additional explanatory variable in the count variable model. I set that $Cor(\mu_i, \varepsilon_i) = \rho$ and $Cor(\sum_{j = 1}^nG_{ij}\mu_i, \varepsilon_i) = \bar{\rho}$. In that case, $\varepsilon_{i} = \rho \sigma_{\varepsilon}\tilde{\mu}*i + \bar{\rho} \sigma*{\varepsilon}\bar{\tilde{\mu}}*i + \tilde{\nu_i}$, where $\tilde{\mu}_i = \dfrac{\mu_i - u*{\mu s}}{\sigma_{\mu s}}$, $\bar{\tilde{\mu}}*i = \sum*{j = 1}^nG_{ij}\tilde{\mu}*j$ is the average of $\tilde{\mu}_i$ among $i$'s friends and $\tilde{\nu_i}$ is the new error term following normal distribution of variance $\bar{\sigma}^2*{\varepsilon} = \left(1 - \rho^2 - \bar{\rho}^2\right)\sigma^2_{\varepsilon}$.

tmu <- (mu - rep(uu, nvec))/sqrt(rep(sigma2u, nvec)) data$tmu <- tmu rho <- 0.24 rhobar <- 0.18 thetanet <- c(lambda, beta, sigma*rho, gamma, sigma*rhobar, sigma*(1 - rho - rhobar)) ytmp <- simCDnet(formula = ~ x1 + x2 + tmu | x1 + x2 + tmu, Glist = Glist, theta = thetanet, data = data) y <- ytmp$y # Add y to data data$y <- y # Summarize y summary(y) # Plot data histogram print(ggplot(data = data.frame(y = y), aes(x = y)) + geom_bar(color = "black", fill = "cyan") + theme_bw() + xlab("") + ylab("Frequency"))

In real-life, $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$ are not observed and they are included in the error term $\varepsilon_i$. This implies that the network is endogenous.

When the endogeneity of the network is not taken into account, the peer effects are overestimated. The reason is that the parameter $\lambda$ also captures other effects. The model considers any common shock on $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$ as peer effects. In the example below, the peer effects are estimated at 0.49 when I do not control for the endogeneity of the network.

# Count data model CDexo <- CDnetNPL(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist, optimizer = "optim", data = data, npl.ctr = list(print = FALSE)) summary(CDexo)

To correct the endogeneity, $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$ should be included in the model as additional explanatory variables. However, these variables are not observed.

I use Markov Chain Monte Carlo (MCMC) to estimate the parameters of the dyadic linking model (these include the unobserved individual-level attributes $\mu_i$). The estimation can be done using the function `netformation`

.

# Dyadic linking model net <- netformation(network = Network, formula = ~ dX, fixed.effects = TRUE, mcmc.ctr = list(burnin = 1000, iteration = 5000))

load("out.net.Rdata") # I copied and pasted the output to save time. But if you run the code on your laptop with the same seed, you will get the same results cat("0% 10 20 30 40 50 60 70 80 90 100% [----|----|----|----|----|----|----|----|----|----| **************************************************| The program successfully executed ********SUMMARY******** n.obs : 314890 n.links : 26325 K : 2 Fixed effects : Yes Burnin : 1000 Iteration : 6000 Elapsed time : 0 HH 7 mm 56 ss Average acceptance rate beta: 0.274125 mu: 0.2697619 ")

I randomly choose some parameters and present their posterior distribution.

# plot simulations par(mfrow = c(4,2), mar = c(2, 2, 2, 1.9)) plot(net$posterior$beta[,1], type = "l", ylim = c(-2.9, -2.6), col = "blue", main = bquote(beta[1]), ylab = "") abline(h = betanet[1], col = "red") plot(net$posterior$beta[,2], type = "l", ylim = c(-1.6, -1.4), col = "blue", main = bquote(beta[2]), ylab = "") abline(h = betanet[2], col = "red") plot(net$posterior$mu[,10], type = "l", col = "blue", main = bquote(mu[10]), ylab = "") abline(h = mu[10], col = "red") plot(net$posterior$mu[,542], type = "l", col = "blue", main = bquote(mu[542]), ylab = "") abline(h = mu[542], col = "red") plot(net$posterior$mu[,849], type = "l", col = "blue", main = bquote(mu[849]), ylab = "") abline(h = mu[849], col = "red") plot(net$posterior$mu[,752], type = "l", col = "blue", main = bquote(mu[752]), ylab = "") abline(h = mu[752], col = "red") plot(net$posterior$uu[,1], type = "l", col = "blue", main = bquote(u[mu][1]), ylab = "") abline(h = uu[1], col = "red") plot(net$posterior$sigmamu2[,3], type = "l", col = "blue", main = bquote(sigma[mu][3]^2), ylab = "") abline(h = sigma2u[3], col = "red")

# plot simulations par(mfrow = c(4,2), mar = c(2, 2, 2, 1.9)) plot(snet[[1]], type = "l", ylim = c(-2.9, -2.6), col = "blue", main = bquote(beta[1]), ylab = "") abline(h = betanet[1], col = "red") plot(snet[[2]], type = "l", ylim = c(-1.6, -1.4), col = "blue", main = bquote(beta[2]), ylab = "") abline(h = betanet[2], col = "red") plot(snet[[3]], type = "l", col = "blue", main = bquote(mu[10]), ylab = "") abline(h = mu[10], col = "red") plot(snet[[4]], type = "l", col = "blue", main = bquote(mu[542]), ylab = "") abline(h = mu[542], col = "red") plot(snet[[5]], type = "l", col = "blue", main = bquote(mu[849]), ylab = "") abline(h = mu[849], col = "red") plot(snet[[6]], type = "l", col = "blue", main = bquote(mu[752]), ylab = "") abline(h = mu[752], col = "red") plot(snet[[7]], type = "l", col = "blue", main = bquote(u[mu][1]), ylab = "") abline(h = uu[1], col = "red") plot(snet[[8]], type = "l", col = "blue", main = bquote(sigma[mu][3]^2), ylab = "") abline(h = sigma2u[3], col = "red")

It stands out that the MCMC converges quickly and the simulations (plotted in blue) are pretty close to the true values (red line).

Using simulations from the posterior distribution, I can construct good estimators for $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$ and use them as additional explanatory variables. In the next example, I use the simulation with the highest posterior density to estimate $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$.

t <- which.max(net$posterior$log.density) print(t)

```
print(t)
```

muest <- net$posterior$mu[t,] uuest <- net$posterior$uu[t,] sigma2uest <- net$posterior$sigmamu2[t,] tmuest <- (muest - rep(uuest, nvec))/sqrt(rep(sigma2uest, nvec)) data$tmuest <- tmuest CDendo <- CDnetNPL(formula = y ~ x1 + x2 + tmuest, contextual = TRUE, Glist = Glist, optimizer = "optim", data = data, npl.ctr = list(print = FALSE)) summary(CDendo)

data$tmuest <- tmuest summary(CDendo)

The coefficients of $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$ are significant. This confirms that the network is endogenous. Moreover, the peer effects are estimated at 0.40, which is much better than the previous estimate.

The standard errors computed in this example are not valid because they assume that $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$ are observed. To correct the standard error, I replicate simulations of $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$ with replacement from the posterior distribution in order to take into account the variance of the MCMC.

fendo <- function(s) { t <- sample(3001:8000, 1) datat <- data mus <- net$posterior$mu[t,] uus <- rep(net$posterior$uu[t,], nvec) sus <- rep(net$posterior$sigmamu2[t,], nvec) tmu <- (mus - uus)/sqrt(sus) datat$tmu <- tmu CDnet <- CDnetNPL(formula = y ~ x1 + x2 + tmu, contextual = TRUE, Glist = Glist, optimizer = "optim", data = datat, npl.ctr = list(print = FALSE)) summary(CDnet, Glist = Glist, data = datat) } n.cores <- 32 replic <- 1000 #Number of replications out.endo <- mclapply(1:replic, fendo, mc.cores = n.cores) # The output of out.endo is a list of objects from "summary.CDnetNPL" class # Let's set the class of out.endo as "summary.CDnetNPLs" class(out.endo) <- "summary.CDnetNPLs" # I can now summarize the results using print # Object from "summary.CDnetNPL" has a print method print(out.endo)

load("out.endo.Rdata") CDatanet:::print.summary.CDnetNPLs(out.endo)

The standard errors are slightly greater than the previous estimates.

One simulation is not sufficient to confirm that the method used to correct the endogeneity performs well. In this section, I use Monte Carlo simulations to assess the performance of this method. I compare the estimates when the network is assumed exogenous to these when I control for the endogeneity.

I build the following Monte Carlo function which simulates data based on endogenous network, and estimates the model by assuming that the network is exogenous and then endogenous.

fMCendo <- function(s) { # parameters M <- 3 nvec <- rep(500, 3) n <- sum(nvec) lambda <- 0.4 # peer effects beta <- c(2, -1.9, 0.8) # own effects gamma <- c(1.5, -1.2) # contextual effects sigma <- 1.5 # standard deviation of epsilon theta <- c(lambda, beta, gamma, sigma) data <- data.frame(x1 = rnorm(n, 1, 1), x2 = rexp(n, 0.4)) # Network betanet <- c(-2.8, -1.5) # beta Glist <- list() # adjacency matrix row normalized Network <- list() # adjacency matrix row non-normalized dX <- matrix(0, 0, 2) # observed dyad-specific variables mu <- list() # unobserved individual-level attribute uu <- runif(M, -1, 1) # mean of uu in each sub-network sigma2u <- runif(M, 0.5, 4) # variance of uu in each sub-network for (m in 1:M) { nm <- nvec[m] mum <- rnorm(nm, uu[m], sqrt(sigma2u[m])) Z1 <- matrix(0, nm, nm) Z2 <- matrix(0, nm, nm) for (i in 1:nm) { for (j in 1:nm) { Z1[i, j] <- abs(data$x1[i] - data$x1[j]) Z2[i, j] <- abs(data$x2[i] - data$x2[j]) } } Gm <- 1*((Z1*betanet[1] + Z2*betanet[2] + kronecker(mum, t(mum), "+") + rlogis(nm^2)) > 0) diag(Gm) <- 0 diag(Z1) <- NA diag(Z2) <- NA Z1 <- Z1[!is.na(Z1)] Z2 <- Z2[!is.na(Z2)] dX <- rbind(dX, cbind(Z1, Z2)) Network[[m]] <- Gm rs <- rowSums(Gm); rs[rs == 0] <- 1 Gm <- Gm/rs Glist[[m]] <- Gm mu[[m]] <- mum } mu <- unlist(mu) # Data tmu <- (mu - rep(uu, nvec))/sqrt(rep(sigma2u, nvec)) data$tmu <- tmu rho <- 0.24 rhobar <- 0.18 thetanet <- c(lambda, beta, sigma*rho, gamma, sigma*rhobar, sigma*(1 - rho - rhobar)) ytmp <- simCDnet(formula = ~ x1 + x2 + tmu | x1 + x2 + tmu, Glist = Glist, theta = thetanet, data = data) y <- ytmp$y data$y <- y # Count data model CDexo <- CDnetNPL(formula = y ~ x1 + x2, contextual = TRUE, Glist = Glist, optimizer = "optim", data = data, npl.ctr = list(print = FALSE)) # Dyadic linking model net <- netformation(network = Network, formula = ~ dX, fixed.effects = TRUE, mcmc.ctr = list(burnin = 1000, iteration = 2000), print = FALSE) # Endogeneity t <- which.max(net$posterior$log.density) muest <- net$posterior$mu[t,] uuest <- net$posterior$uu[t,] sigma2uest <- net$posterior$sigmamu2[t,] tmuest <- (muest - rep(uuest, nvec))/sqrt(rep(sigma2uest, nvec)) data$tmuest <- tmuest CDendo <- CDnetNPL(formula = y ~ x1 + x2 + tmuest, contextual = TRUE, Glist = Glist, optimizer = "optim", data = data, npl.ctr = list(print = FALSE)) c(CDexo$estimate, CDendo$estimate) }

I now run 1000 times the function `fMCendo`

and compute the average of both estimators.

n.cores <- 5 replic <- 1000 #Number of replications out.mcendo <- mclapply(1:replic, fMCendo, mc.cores = n.cores) out.mcendo <- apply(t(do.call(cbind, out.mcendo)), 2, mean) out.endo <- cbind(thetanet, c(out.mcendo[c(1:4, NA, 5:6, NA, 7)]), out.mcendo[8:16]) rownames(out.endo) <- names(out.mcendo[8:16]) colnames(out.endo) <- c("TrueValue", "Endo-Not-Controlled", "Endo-Controlled") print(out.endo)

load("out.mcendo.Rdata") print(out.endo)

The results prove that the method used to correct the endogeneity performs fairly well. The average of the estimates is 0.42 while the true value is 0.40. There is still a small bias. This is certainly due to the number of iterations of the MCMC which is low. Indeed, I use the iteration with the highest posterior density to estimate $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$. Since the number of iterations is low, the estimates may not be very good: the more the number of iterations, the better the simulation with highest posterior density for approximating $\tilde{\mu}_i$ and $\bar{\tilde{\mu}}_i$.

In contrast, the peer effects are overestimated when the network is assumed exogenous. The average of the estimates is 0.52.

This paper provides technical details on the package **CDatanet**. It shows with simple and practical examples how to use the package. It provides comparison of the model with the SART and the SAR models. It also shows with Monte Carlo simulations how the model performs when the network matrix is exogenous and endogenous. Simulation results prove that the peer effects are overestimated when the network is endogenous. Lastly, it presents a way to control for the endogeneity of the network.

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.