\definecolor{shadecolor}{rgb}{0.97, 0.97, 0.95}

rmarkdown::find_pandoc(version = '2.9.2.1')
knitr::opts_chunk$set(echo = TRUE)

\newpage

Instrumental variable (IV) procedure {#IV}

We provide the function sim.IV(dnetwork, X, y, replication, power) where dnetwork is the network linking probabilities, X is a matrix of covariates, y (optional) is the vector of outcome, replication (optional, default = 1) is the number of replications, and power (optional, default = 1) is the number of powers of the interaction matrix used to generate the instruments. The function outputs a proxy for Gy and simulated instruments.

Model without contextual effects

The following code provides an example using a sample of 30 networks of size 50 each. For the sake of the example, we assume that linking probabilities are known and drawn from an uniform distribution. We first simulate data. Then, we estimate the linear-in-means model using our IV procedure, using the known linking probabilities to generate approximations of the true network.

library(PartialNetwork)
set.seed(123)
# Number of groups
M             <- 30
# size of each group
N             <- rep(50,M)
# individual effects
beta          <- c(2,1,1.5) 
# endogenous effects
alpha         <- 0.4
# std-dev errors
se            <- 1
# network distribution
distr         <- runif(sum(N*(N-1)))
distr         <- vec.to.mat(distr, N, normalise = FALSE)
# covariates
X             <- cbind(rnorm(sum(N),0,5),rpois(sum(N),7))
# true network
G0            <- sim.network(distr)
# normalise 
G0norm        <- norm.network(G0)
# simulate dependent variable use an external package
y             <- CDatanet::simsar(~ X, contextual = FALSE, Glist = G0norm, 
                                  theta = c(alpha, beta, se))
y             <- y$y
# generate instruments 
instr         <- sim.IV(distr, X, y, replication = 1, power = 1)
GY1c1         <- instr[[1]]$G1y       # proxy for Gy (draw 1)
GXc1          <- instr[[1]]$G1X[,,1]  # proxy for GX (draw 1)
GXc2          <- instr[[1]]$G2X[,,1]  # proxy for GX (draw 2)
# build dataset
# keep only instrument constructed using a different draw than the one used to proxy Gy
dataset           <- as.data.frame(cbind(y, X, GY1c1, GXc1, GXc2)) 
colnames(dataset) <- c("y","X1","X2","G1y", "G1X1", "G1X2", "G2X1", "G2X2") 

Once the instruments are generated, the estimation can be performed using standard tools, e.g. the function ivreg from the AER package. For example, if we use the same draw for the proxy and the instruments, the estimation is "bad".

library(AER)
# Same draws
out.iv1           <- ivreg(y ~ X1 + X2 + G1y | X1 + X2 + G1X1 + G1X2, data = dataset)
summary(out.iv1)

If we use different draws for the proxy and the instruments, the estimation is "good".

# Different draws
out.iv2           <- ivreg(y ~ X1 + X2 + G1y | X1 + X2 + G2X1 + G2X2, data = dataset)
summary(out.iv2)

However, as stated by our Proposition 2, this estimator is biased. We can approximate the bias as follows.

ddS     <- as.matrix(cbind(1, dataset[,c("X1", "X2", "G1y")]))         #\ddot{S} 
dZ      <- as.matrix(cbind(1, dataset[,c("X1", "X2", "G2X1", "G2X2")]))#\dot{Z} 
dZddS   <- crossprod(dZ, ddS)/sum(N)                                  
W       <- solve(crossprod(dZ)/sum(N))                                 
matM    <- solve(crossprod(dZddS, W%*%dZddS), crossprod(dZddS, W))    
maxbias <- apply(sapply(1:1000, function(...){
  dddGy <- peer.avg(sim.network(distr, normalise = TRUE) , y)
  abs(matM%*%crossprod(dZ, dddGy - dataset$G1y)/sum(N))
}), 1, max); names(maxbias) <- c("(Intercept)", "X1", "X2", "G1y")
{cat("Maximal absolute bias\n"); print(maxbias)}

Model with contextual effects

We now assume that the model includes contextual effects. We first generate data.

library(PartialNetwork)
set.seed(123)
# Number of groups
M             <- 30
# size of each group
N             <- rep(50,M)
# individual effects
beta          <- c(2,1,1.5) 
# contextual effects
gamma         <- c(5, -3)
# endogenous effects
alpha         <- 0.4
# std-dev errors
se            <- 1
# network distribution
distr         <- runif(sum(N*(N-1)))
distr         <- vec.to.mat(distr, N, normalise = FALSE)
# covariates
X             <- cbind(rnorm(sum(N),0,5),rpois(sum(N),7))
# true network
G0            <- sim.network(distr)
# normalise 
G0norm        <- norm.network(G0)
# simulate dependent variable use an external package
y             <- CDatanet::simsar(~ X, contextual = TRUE, Glist = G0norm, 
                                  theta = c(alpha, beta, gamma, se))
y             <- y$y
# GX
GX            <- peer.avg(G0norm, X)
# generate instruments 
# we need power = 2 for models with contextual effetcs
instr         <- sim.IV(distr, X, y, replication = 1, power = 2)
GY1c1         <- instr[[1]]$G1y       # proxy for Gy (draw 1)
GXc1          <- instr[[1]]$G1X[,,1]  # proxy for GX (draw 1)
GXc2          <- instr[[1]]$G2X[,,1]  # proxy for GX (draw 2)
GXc2sq        <- instr[[1]]$G2X[,,2]  # proxy for G^2X (draw 2)
# build dataset
# keep only instrument constructed using a different draw than the one used to proxy Gy
dataset           <- as.data.frame(cbind(y, X, GX, GY1c1, GXc1, GXc2, GXc2sq)) 
colnames(dataset) <- c("y","X1","X2", "GX1", "GX2", "G1y", "G1X1", "G1X2", "G2X1", "G2X2",
                       "G2X1sq", "G2X2sq") 

As pointed out in the paper, the IV procedure requires $\mathbf{G}\mathbf{X}$ being observed. In additions, when contextual effects are included, we consider the extended model.

# Different draws
out.iv2           <- ivreg(y ~ X1 + X2 + GX1 + GX2 + G1X1 + G1X2 + G1y | X1 + X2 + GX1 + 
                             GX2 + G1X1 + G1X2 + G2X1 + G2X2 + G2X1sq + G2X2sq, 
                           data = dataset)
summary(out.iv2)

We also compute the maximal absolute bias.

ddS     <- as.matrix(cbind(1, dataset[,c("X1", "X2", "GX1", "GX2", "G1X1", "G1X2", 
                                         "G1y")]))                
dZ      <- as.matrix(cbind(1, dataset[,c("X1", "X2", "GX1", "GX2", "G1X1", 
                                         "G1X2", "G2X1", "G2X2", "G2X1sq", "G2X2sq")]))
dZddS   <- crossprod(dZ, ddS)/sum(N)                                  
W       <- solve(crossprod(dZ)/sum(N))                                 
matM    <- solve(crossprod(dZddS, W%*%dZddS), crossprod(dZddS, W))    
maxbias <- apply(sapply(1:1000, function(...){
  dddGy <- peer.avg(sim.network(distr, normalise = TRUE) , y)
  abs(matM%*%crossprod(dZ, dddGy - dataset$G1y)/sum(N))
}), 1, max); names(maxbias) <- c("(Intercept)", "X1", "X2", "GX1", "GX2", "G1X1", 
                                 "G1X2", "G1y")
{cat("Maximal absolute bias\n"); print(maxbias)}

Simulated Method of Moments {#SGMM}

As shown in the paper (see @boucher2020estimating), our IV estimator is inconsistent. Although the bias is expected to be small, in general, the IV estimator performs less well when the model includes contextual effects. Therefore, we propose a Simulated Method of Moments (SMM) estimator by correcting the bias of the moment function use by the IV estimator. Our SMM estimator is then consistent and also deals with group heterogeneity.

Models without group heterogeneity

We first simulate data.

library(PartialNetwork)
set.seed(123)
# Number of groups
M             <- 100
# size of each group
N             <- rep(30,M)
# individual effects
beta          <- c(2, 1, 1.5, 5, -3) 
# endogenous effects
alpha         <- 0.4
# std-dev errors
se            <- 1 
# network distribution
distr         <- runif(sum(N*(N-1)))
distr         <- vec.to.mat(distr, N, normalise = FALSE)
# covariates
X             <- cbind(rnorm(sum(N),0,5),rpois(sum(N),7))
# true network
G0            <- sim.network(distr)
# normalise 
G0norm        <- norm.network(G0)
# Matrix GX
GX            <- peer.avg(G0norm, X)
# simulate dependent variable use an external package
y             <- CDatanet::simsar(~ X, contextual = TRUE, Glist = G0norm, 
                                  theta = c(alpha, beta, se))
Gy            <- y$Gy
y             <- y$y
# build dataset
dataset           <- as.data.frame(cbind(y, X, Gy, GX)) 
colnames(dataset) <- c("y","X1","X2", "Gy", "GX1", "GX2") 

The estimation can be performed using the function smmSAR (do ?smmSAR to read the help file of the function). The function allows to specify if GX and Gy are observed. We provide an example for each case.

If GX and Gy are observed (instruments will be constructed using the network distribution).

load(url('https://github.com/ahoundetoungan/PartialNetwork/blob/master/datavignettes/smm.rda?raw=true'))
out.smm1       <- smmSAR(y ~ X1 + X2 | Gy | GX1 + GX2, dnetwork = distr, contextual = T, 
                         smm.ctr  = list(R = 1, print = F), data = dataset)
summary(out.smm1)
print(smm$smm1)

If GX is observed and not Gy.

out.smm2       <- smmSAR(y ~ X1 + X2 || GX1 + GX2, dnetwork = distr, contextual = T, 
                         smm.ctr  = list(R = 1, print = F), data = dataset)
summary(out.smm2)
print(smm$smm2)

If Gy is observed and not GX.

out.smm3       <- smmSAR(y ~ X1 + X2 | Gy, dnetwork = distr, contextual = T, 
                         smm.ctr  = list(R = 100, print = F), data = dataset)
summary(out.smm3)
print(smm$smm3)

If neither Gy nor GX are observed.

out.smm4       <- smmSAR(y ~ X1 + X2, dnetwork = distr, contextual = T, 
                         smm.ctr  = list(R = 100, print = F), data = dataset)
summary(out.smm4)
print(smm$smm4)

Models with group heterogeneity

We assume here that every group has an unobserved characteristic which affects the outcome. Let us first simulate the data.

library(PartialNetwork)
set.seed(123)
# Number of groups
M             <- 200
# size of each group
N             <- rep(30,M)
# individual effects
beta          <- c(1, 1, 1.5, 5, -3)
# endogenous effects
alpha         <- 0.4
# std-dev errors
se            <- 1
# network distribution
distr         <- runif(sum(N*(N-1)))
distr         <- vec.to.mat(distr, N, normalise = FALSE)
# covariates
X             <- cbind(rnorm(sum(N),0,5), rpois(sum(N),7))
# Groups' fixed effects
# In order to have groups' heterogeneity correlated to X (fixed effects),
# We consider the quantile of X2 at 25% in the group
eff           <- unlist(lapply(1:M, function(x) 
  rep(quantile(X[(c(0, cumsum(N))[x]+1):(cumsum(N)[x]),2], probs = 0.25), each = N[x])))
print(c("cor(eff, X1)" = cor(eff, X[,1]), "cor(eff, X2)" = cor(eff, X[,2]))) 
# We can see that eff is correlated to X2. We can confirm that the correlation is 
# strongly significant.
print(c("p.value.cor(eff, X1)" = cor.test(eff, X[,1])$p.value,
        "p.value.cor(eff, X2)" = cor.test(eff, X[,2])$p.value))
# true network
G0            <- sim.network(distr)
# normalise 
G0norm        <- norm.network(G0)
# Matrix GX
GX            <- peer.avg(G0norm, X)
# simulate dependent variable use an external package
y             <- CDatanet::simsar(~ -1 + eff + X | X, Glist = G0norm, 
                                  theta = c(alpha, beta, se))
Gy            <- y$Gy
y             <- y$y
# build dataset
dataset           <- as.data.frame(cbind(y, X, Gy, GX)) 
colnames(dataset) <- c("y","X1","X2", "Gy", "GX1", "GX2") 

The group heterogeneity is correlated to $\mathbf{X}$ and induces bias if we do not control for it. In practice, we do not observe eff and we cannot add 200 dummies variables as explanatory variables to the model. We can control for group heterogeneity by taking the difference of each variable with respect to the group average (see @bramoulle2009identification). To do this, we only need to set fixed.effects = TRUE in smmSAR. We provide examples.

If GX is observed and not Gy.

out.smmeff1 <- smmSAR(y ~ X1 + X2 || GX1 + GX2, dnetwork = distr, contextual = T, 
                      fixed.effects = T, smm.ctr  = list(R = 1, print = F), 
                      data = dataset)
summary(out.smmeff1)
print(smm$smme1)

If Gy is observed and not GX.

out.smmeff2 <- smmSAR(y ~ X1 + X2 | Gy, dnetwork = distr, contextual = T, 
                      fixed.effects = T, smm.ctr  = list(R = 100, print = F), 
                      data = dataset)
summary(out.smmeff2)
print(smm$smme2)

If neither Gy nor GX are observed.

out.smmeff3 <- smmSAR(y ~ X1 + X2, dnetwork = distr, contextual = T, fixed.effects = T,
                      smm.ctr  = list(R = 100, print = F), data = dataset)
summary(out.smmeff3)
print(smm$smme3)

As we can see, the estimator with fixed effects has larger variance. Thus, we cannot illustrate its consistency using only one simulation. Therefore, we conduct Monte Carlo simulations.

We construct the function fMC which simulates data and computes the SMM estimator following the same process as described above.

fMC <- function(...){
  # Number of groups
  M             <- 200
  # size of each group
  N             <- rep(30,M)
  # individual effects
  beta          <- c(1, 1, 1.5, 5, -3)
  # endogenous effects
  alpha         <- 0.4
  # std-dev errors
  se            <- 1
  # network distribution
  distr         <- runif(sum(N*(N-1)))
  distr         <- vec.to.mat(distr, N, normalise = FALSE)
  # covariates
  X             <- cbind(rnorm(sum(N),0,5), rpois(sum(N),7))
  # Groups' fixed effects
  # We defined the groups' fixed effect as the quantile at 25% of X2 in the group
  # This implies that the effects are correlated with X
  eff           <- unlist(lapply(1:M, function(x) 
    rep(quantile(X[(c(0, cumsum(N))[x]+1):(cumsum(N)[x]),2], probs = 0.25), each = N[x])))
  # true network
  G0            <- sim.network(distr)
  # normalise 
  G0norm        <- norm.network(G0)
  # Matrix GX
  GX            <- peer.avg(G0norm, X)
  # simulate dependent variable use an external package
  y             <- CDatanet::simsar(~ -1 + eff + X | X, Glist = G0norm, 
                                    theta = c(alpha, beta, se))
  Gy            <- y$Gy
  y             <- y$y
  # build dataset
  dataset           <- as.data.frame(cbind(y, X, Gy, GX)) 
  colnames(dataset) <- c("y","X1","X2", "Gy", "GX1", "GX2") 
  out.smmeff1 <- smmSAR(y ~ X1 + X2 || GX1 + GX2, dnetwork = distr, contextual = T, 
                        fixed.effects = T, smm.ctr  = list(R = 1, print = F), 
                        data = dataset)
  out.smmeff2 <- smmSAR(y ~ X1 + X2 | Gy, dnetwork = distr, contextual = T, 
                        fixed.effects = T, smm.ctr  = list(R = 100, print = F), 
                        data = dataset)
  out.smmeff3 <- smmSAR(y ~ X1 + X2, dnetwork = distr, contextual = T, fixed.effects = T,
                      smm.ctr  = list(R = 100, print = F), data = dataset)
  out         <- data.frame("GX.observed"   = out.smmeff1$estimates,
                            "Gy.observed"   = out.smmeff2$estimates,
                            "None.observed" = out.smmeff3$estimates)
  out
}

We perform 250 simulations.

smm.Monte.C   <- lapply(1:250, fMC)

We compute the average of the estimates

Reduce('+', smm.Monte.C)/250
print(smm$smmmc)

The SMM estimator performs well even when we only have the distribution of the network, $\mathbf{G}\mathbf{X}$ and $\mathbf{G}\mathbf{y}$ are not observed, and the model includes group heterogeneity.

smm <- list(smm1 = out.smm1, smm2 = out.smm2,
            smm3 = out.smm3, smm4 = out.smm4,
            smme1 = out.smmeff1, smme2 = out.smmeff2,
            smme3 = out.smmeff3, smmmc = smm$smmmc)
save(smm, file = "~/Dropbox/Papers - In progress/Partial Network/Package/AH/PartialNetwork/datavignettes/smm.rda")

How to compute the variance when the network distribution is estimated?

load(url('https://github.com/ahoundetoungan/PartialNetwork/blob/master/datavignettes/smmlogit.rda?raw=true'))

In practice, the econometrician does not observed the true distribution of the entire network. They can only have an estimator based on partial network data (see @boucher2020estimating). Because this estimator is used instead of the true distribution, this increases the variance of the SMM estimator. We develop a method to estimate the variance by taking into account the uncertainty related to the estimator of the network distribution.

Assume that the network distribution is logistic, ie, \begin{equation}\dfrac{p_{ij}}{1 - p_{ij}} = \exp(\rho_0 + \rho_1|X_{i1} + X_{j1}| + \rho_2|X_{i2} - X_{j2}|)\label{eq:logit}\end{equation}

and $\mathbb{P}(a_{ij} = 1|\mathbf{P}) = p_{ij}$, where $\mathbf{A}$ is the adjacency matrix, $a_{ij}$ is the ($i$, $j$)-th entry of $\mathbf{A}$ and $\mathbf{P}$ is the matrix of $p_{ij}$. We simulated data following this assumption.

library(PartialNetwork)
set.seed(123)
# Number of groups
M        <- 100
# size of each group
N        <- rep(30,M)
# covariates
X        <- cbind(rnorm(sum(N),0,5),rpois(sum(N),7))
# network formation model parameter
rho      <- c(-0.8, 0.2, -0.1)
# individual effects
beta     <- c(2, 1, 1.5, 5, -3) 
# endogenous effects
alpha    <- 0.4
# std-dev errors
se       <- 1 
# network
tmp      <- c(0, cumsum(N))
X1l      <- lapply(1:M, function(x) X[c(tmp[x] + 1):tmp[x+1],1])
X2l      <- lapply(1:M, function(x) X[c(tmp[x] + 1):tmp[x+1],2])
dist.net <- function(x, y) abs(x - y)
X1.mat   <- lapply(1:M, function(m) {
  matrix(kronecker(X1l[[m]], X1l[[m]], FUN = dist.net), N[m])})
X2.mat   <- lapply(1:M, function(m) {
  matrix(kronecker(X2l[[m]], X2l[[m]], FUN = dist.net), N[m])})
Xnet     <- as.matrix(cbind("Const" = 1, 
                            "dX1"   = mat.to.vec(X1.mat),
                            "dX2"   = mat.to.vec(X2.mat)))
ynet     <- Xnet %*% rho
ynet     <- c(1*((ynet + rlogis(length(ynet))) > 0))
G0       <- vec.to.mat(ynet, N, normalise = FALSE)
# normalise 
G0norm   <- norm.network(G0)
# Matrix GX
GX       <- peer.avg(G0norm, X)
# simulate dependent variable use an external package
y        <- CDatanet::simsar(~ X, contextual = TRUE, Glist = G0norm, 
                                  theta = c(alpha, beta, se))
Gy       <- y$Gy
y        <- y$y
# build dataset
dataset           <- as.data.frame(cbind(y, X, Gy, GX)) 
colnames(dataset) <- c("y","X1","X2", "Gy", "GX1", "GX2") 

We do not observed the true distribution of the network. We observe a subset of ${a_{ij}}$. Let $\mathcal{A}^{obs} = {(i,~j), ~ a_{ij} \text{ is observed}}$ and $\mathcal{A}^{nobs} = {(i,~j), ~ a_{ij} \text{ is not observed}}$. We assume that $|\mathcal{A}^{obs}| \to \infty$ as the sample size grows to $\infty$. Therefore, $\rho_0$, $\rho_1$, and $\rho_2$ can be consistently estimated.

nNet      <- nrow(Xnet) # network formation model sample size
Aobs      <- sample(1:nNet, round(0.3*nNet)) # We observed 30%
# We can estimate rho using the gml function from the stats package
logestim  <- glm(ynet[Aobs] ~ -1 + Xnet[Aobs,], family = binomial(link = "logit"))
slogestim <- summary(logestim)
rho.est   <- logestim$coefficients
rho.var   <- slogestim$cov.unscaled # we also need the covariance of the estimator

We assume that the explanatory variables $X_{i1}$ and $X_{i2}$ are observed for any $i$ in the sample. Using the estimator of $\rho_0$, $\rho_1$, and $\rho_2$, we can also compute $\hat{\mathbf{P}}$, a consistent estimator of $\mathbf{P}$, from Equation (\ref{eq:logit}).

d.logit     <- lapply(1:M, function(x) {
  out       <- 1/(1 + exp(-rho.est[1] - rho.est[2]*X1.mat[[x]] - 
                            rho.est[3]*X2.mat[[x]]))
  diag(out) <- 0
  out})

We can use $\hat{\mathbf{P}}$ for our SMM estimator. We focus on the case where neither $\mathbf{G}\mathbf{X}$ nor $\mathbf{G}\mathbf{y}$ are observed. The same strategy can be used for other cases.

smm.logit   <- smmSAR(y ~ X1 + X2, dnetwork = d.logit, contextual = T, 
                      smm.ctr  = list(R = 100, print = F), data = dataset)
summary(smm.logit)
print(smmlo$smm1)

The variance of the estimator computed above is conditionally on d.logit. It does not take into account the uncertainty related to the estimation of d.logit. To compute the right variance, we need a simulator from the distribution of d.logit. This simulator is a function which samples one network distribution from the distribution of d.logit. For instance, for the logit model, we can sample $\boldsymbol{\rho}$ from $N\Big(\hat{\boldsymbol{\rho}}, \mathbb{V}(\hat{\boldsymbol{\rho}})\Big)$ and then compute $\hat{\mathbf{P}}$. Depending on the network formation model, users can compute the adequate simulator.

fdist        <- function(rho.est, rho.var, M, X1.mat, X2.mat){
  rho.est1   <- MASS::mvrnorm(mu = rho.est, Sigma = rho.var)
  lapply(1:M, function(x) {
  out        <- 1/(1 + exp(-rho.est1[1] - rho.est1[2]*X1.mat[[x]] -
                             rho.est1[3]*X2.mat[[x]]))
  diag(out)  <- 0
  out})
}

The function can be passed into the argument .fun of the summary method. We also need to pass the arguments of the simulator as a list into .args.

fdist_args  <- list(rho.est = rho.est, rho.var = rho.var, M = M, X1.mat = X1.mat, 
                    X2.mat = X2.mat)
summary(smm.logit, dnetwork = d.logit, data = dataset, .fun = fdist, .args = fdist_args,
        sim = 500, ncores = 8) # ncores performs simulations in parallel
print(smmlo$smm2)

We can notice that the variance is larger when we take into account the uncertainty of the estimation of the logit model.

smm2  <- summary(smm.logit, dnetwork = d.logit, data = dataset, .fun = fdist, .args = fdist_args, sim = 500, ncores = 8)
smmlo <- list(smm1 = smm.logit, smm2 = smm2)
save(smmlo, file = "~/Dropbox/Papers - In progress/Partial Network/Package/AH/PartialNetwork/datavignettes/smmlogit.rda")

Bayesian estimator {#BAYES}

The Bayesian estimator is neatly packed in the function mcmcSAR (see the help page of the function in the package, using ?mcmcSAR, for more details on the function).

without network formation model

For the sake of the example, we assume that linking probabilities are known and drawn from an uniform distribution. We first simulate data. Then, we estimate the linear-in-means model using our Bayesian estimator.

In the following example (example I-1, output out.none1), we assume that the network is entirely observed.

We first simulate data.

load(url('https://github.com/ahoundetoungan/PartialNetwork/blob/master/datavignettes/out.none.rda?raw=true'))
#save(out.none1, out.none2.1, out.none2.2, out.none3.1, out.none3.2, file = "~/Dropbox/Papers - In progress/Partial Network/Package/AH/PartialNetwork/datavignettes/out.none.rda")
library(PartialNetwork)
set.seed(123)
# EXAMPLE I: WITHOUT NETWORK FORMATION MODEL 
# Number of groups
M             <- 50
# size of each group
N             <- rep(30,M)
# individual effects
beta          <- c(2,1,1.5) 
# contextual effects
gamma         <- c(5,-3) 
# endogenous effects
alpha         <- 0.4
# std-dev errors
se            <- 1 
# network distribution
distr         <- runif(sum(N*(N-1)))
distr         <- vec.to.mat(distr, N, normalise = FALSE)
# covariates
X             <- cbind(rnorm(sum(N),0,5),rpois(sum(N),7))
# true network
G0            <- sim.network(distr)
# normalize 
G0norm        <- norm.network(G0)
# simulate dependent variable use an external package
y             <- CDatanet::simsar(~ X, contextual = TRUE, Glist = G0norm,
                                  theta = c(alpha, beta, gamma, se))
y             <- y$y
# dataset
dataset       <- as.data.frame(cbind(y, X1 = X[,1], X2 = X[,2])) 

Once the data are simulated, the estimation can be performed using the function mcmcSAR.

# Example I-1: When the network is fully observed 
out.none1     <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = "all",
                         G0 = G0, data = dataset, iteration = 2e4)
summary(out.none1)
summary(out.none1)
plot(out.none1, plot.type = "sim", mar = c(3, 2.1, 1, 1))

For Example I-2, we assume that only 60% of the links are observed.

# Example I-2: When a part of the network is observed 
# 60% of the network data is observed
G0.obs       <- lapply(N, function(x) matrix(rbinom(x^2, 1, 0.6), x))

Estimation out.none2.1 assumes that the sampled network is the true one (inconsistent, peer effects are overestimated).

# replace the non-observed part of the network by 0 (missing links)
G0.start     <- lapply(1:M, function(x) G0[[x]]*G0.obs[[x]])
# Use network with missing data as the true network
out.none2.1  <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = "all",
                        G0 = G0.start,   data = dataset, iteration = 2e4)
summary(out.none2.1) # the peer effets seem overestimated
summary(out.none2.1) # the peer effets seem overestimated
plot(out.none2.1, plot.type = "sim", mar = c(3, 2.1, 1, 1))

\newpage

Estimation out.none2.2 specifies which links are observed and which ones are not. The true probabilities are used to sample un-observed links (consistent).

out.none2.2  <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = G0.obs,
                        G0 = G0.start, data = dataset, 
                        mlinks = list(dnetwork = distr), iteration = 2e4)
summary(out.none2.2)
summary(out.none2.2)
plot(out.none2.2, plot.type = "sim", mar = c(3, 2.1, 1, 1))

For Example I-3, we assume that only linking probabilities are known.

Estimation out.none3.1 assumes the researcher uses a draw from that distribution as the true network (inconsistent, peer effects are overestimated).

# Example I-3: When only the network distribution is available 
# Simulate a fictitious network and use as true network
G0.tmp       <- sim.network(distr)
out.none3.1  <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = "all",
                        G0 = G0.tmp, data = dataset, iteration = 2e4)
summary(out.none3.1)  # the peer effets seem overestimated
summary(out.none3.1)  # the peer effets seem overestimated
plot(out.none3.1, plot.type = "sim", mar = c(3, 2.1, 1, 1))

Estimation out.none3.2 specifies that no link is observed, but that the distribution is known (consistent).

out.none3.2  <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = "none",
                        data = dataset, mlinks = list(dnetwork = distr), iteration = 2e4)
summary(out.none3.2) 
summary(out.none3.2)  
plot(out.none3.2, plot.type = "sim", mar = c(3, 2.1, 1, 1))

With logit model as network formation model

For this example, we assume that links are generated using a simple logit model. We do not observe the true distribution.

We first simulate data.

load(url('https://github.com/ahoundetoungan/PartialNetwork/blob/master/datavignettes/out.logi.rda?raw=true'))
# save(out.logi2.2, out.logi3.2, slogestim, sout.logi4.1, sout.logi4.2, sout.logi4.3, sout.selb1, sout.selb2, file = "~/Dropbox/Papers - In progress/Partial Network/Package/AH/PartialNetwork/datavignettes/out.logi.rda")
# EXAMPLE II: NETWORK FORMATION MODEL: LOGIT 
library(PartialNetwork)
set.seed(123)
# Number of groups
M             <- 50
# size of each group
N             <- rep(30,M)
# individual effects
beta          <- c(2,1,1.5) 
# contextual effects
gamma         <- c(5,-3) 
# endogenous effects
alpha         <- 0.4
# std-dev errors
se            <- 2 
# parameters of the network formation model
rho           <- c(-2, -.5, .2)
# covariates
X             <- cbind(rnorm(sum(N),0,5),rpois(sum(N),7))
# compute distance between individuals 
tmp           <- c(0, cumsum(N))
X1l           <- lapply(1:M, function(x) X[c(tmp[x] + 1):tmp[x+1],1])
X2l           <- lapply(1:M, function(x) X[c(tmp[x] + 1):tmp[x+1],2])
dist.net      <- function(x, y) abs(x - y) 
X1.mat        <- lapply(1:M, function(m) {
  matrix(kronecker(X1l[[m]], X1l[[m]], FUN = dist.net), N[m])})
X2.mat        <- lapply(1:M, function(m) {
  matrix(kronecker(X2l[[m]], X2l[[m]], FUN = dist.net), N[m])})
# true network
Xnet          <- as.matrix(cbind("Const" = 1, 
                                 "dX1"   = mat.to.vec(X1.mat), 
                                 "dX2"   = mat.to.vec(X2.mat)))
ynet          <- Xnet %*% rho
ynet          <- 1*((ynet + rlogis(length(ynet))) > 0)
G0            <- vec.to.mat(ynet, N, normalise = FALSE)
G0norm        <- norm.network(G0)
# simulate dependent variable use an external package
y             <- CDatanet::simsar(~ X, contextual = TRUE, Glist = G0norm, 
                                     theta = c(alpha, beta, gamma, se))
y             <- y$y
# dataset
dataset       <- as.data.frame(cbind(y, X1 = X[,1], X2 = X[,2])) 

For example II-1, we assume that the researcher only observes 60% of the links, but know that the network formation model is logistic.

# Example II-1: When a part of the network is observed 
# 60% of the network data is observed
G0.obs       <- lapply(N, function(x) matrix(rbinom(x^2, 1, 0.6), x))
# replace the non-observed part of the network by 0
G0.start     <- lapply(1:M, function(x) G0[[x]]*G0.obs[[x]])
# Infer the missing links in the network data
mlinks       <- list(model = "logit", mlinks.formula = ~ dX1 + dX2, 
                     mlinks.data = as.data.frame(Xnet))

Once the data are simulated, the estimation can be performed.

out.logi2.2  <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = G0.obs,
                        G0 = G0.start, data = dataset, mlinks = mlinks, 
                        iteration = 2e4)
summary(out.logi2.2)
summary(out.logi2.2) 

Example II-2 disregards the information about observed links (which we used to estimate the logit model) and only uses the asymptotic distribution of the network formation parameters.

# Example II-2: When only the network distribution is available
# Infer the network data
# We only provide estimate of rho and its variance
Gvec         <- mat.to.vec(G0, ceiled = TRUE)
logestim     <- glm(Gvec ~ -1 + Xnet, family = binomial(link = "logit"))
slogestim    <- summary(logestim)
estimates    <- list("rho"     = logestim$coefficients, 
                     "var.rho" = slogestim$cov.unscaled,
                     "N"       = N)
mlinks       <- list(model = "logit", mlinks.formula = ~ dX1 + dX2, 
                     mlinks.data = as.data.frame(Xnet), estimates = estimates)
out.logi3.2  <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = "none", 
                        data = dataset, mlinks = mlinks, iteration = 2e4)
summary(out.logi3.2) 
summary(out.logi3.2) 

Remarks {-}

In the previous example, partial network information is available to estimate rho and var.rho. These estimates are included in the object estimates. Then in the MCMC, we set G0.obs = "none", which means that the entire network will be inferred. In this situation, the posterior distribution of rho is strongly linked to the prior given by the practitioner, i.e. a normal distribution of parameters, the initial estimates of rho and var.rho. Indeed, an additional source of identification of the posterior distribution of rho may come from the spatial autoregressive (SAR) model. However, the available information in the observed part of the network is not used to update rho since it is already used when computing estimates.

From a certain point of view, inferring the observed part of the network can be considered inefficient. It is possible to keep fixed the observed entries of the adjacency matrix. As in example II-1, it is sufficient to set G0.obs = G0.obs instead of G0.obs = "none" in the function mcmcSAR. However, the observed part of the network is assumed to be non-stochastic in this case and will not be used to update rho. As soon as the practitioner gives an initial estimate of rho and var.rho in estimates, no information from the observed part of the network is used to update rho. The initial estimate of rho and var.rho summarizes the information.

We present an example below.

estimates    <- list("rho"     = logestim$coefficients, 
                     "var.rho" = slogestim$cov.unscaled) 
mlinks       <- list(model = "logit", mlinks.formula = ~ dX1 + dX2, 
                     mlinks.data = as.data.frame(Xnet), estimates = estimates)
out.logi4.1  <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = G0.obs,
                        G0 = G0.start, data = dataset, mlinks = mlinks, 
                        iteration = 2e4)
summary(out.logi4.1) 
print(sout.logi4.1)

One can notice that the network formation model estimate is not too different from the initial estimate of rho.

print(slogestim)

It is also possible to update rho in the MCMC using the observed part of the network. A particular case is Example II-1, where the distribution of rho is not given by the practitioner. In this example, the observed part of the network is used to estimate rho and var.rho (as in Example II-2). These estimates are then used as the prior distribution in the MCMC and rho is updated using the available information in the network. If the practitioner wants to define another prior, they can use prior instead of estimate.

This is an example where we assume that the prior distribution of rho is the standard normal distribution.

prior        <- list("rho"     = c(0, 0, 0), 
                     "var.rho" = diag(3)) 
mlinks       <- list(model = "logit", mlinks.formula = ~ dX1 + dX2, 
                     mlinks.data = as.data.frame(Xnet), prior = prior)
out.logi4.2  <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = G0.obs,
                        G0 = G0.start, data = dataset, mlinks = mlinks, 
                        iteration = 2e4)
summary(out.logi4.2) 
print(sout.logi4.2)

The MCMC performs well although the distribution of rho defined in prior is not true. This is because the observed part of the network is used to update rho. In contrast, the MCMC could be inconsistent if one defines estimates as prior because rho will not be updated using the observed entries in the adjacency matrix, but only using information from the outcome y.

With latent space model as network formation model

ARD, @breza2020using

We also offer a function of the estimator in @breza2020using. We first simulate data. We then estimate the model's parameters assuming that the researcher only knows ARD. We present two examples, one for which we observe ARD for the entire population (Example 1) and one for which we observe ARD for only 70% of the population (Example 2).

load(url('https://github.com/ahoundetoungan/PartialNetwork/blob/master/datavignettes/out.ard.rda?raw=true'))
# save(data.plot1, data.plot2, genz, genv, zi, vk, file = "~/Dropbox/Papers - In progress/Partial Network/Package/AH/PartialNetwork/datavignettes/out.ard.rda")

The data is simulated following a procedure similar to the one in @breza2020using.

library(PartialNetwork)
set.seed(123)
# LATENT SPACE MODEL 
N           <- 500
genzeta     <- 1
mu          <- -1.35
sigma       <- 0.37
K           <- 12    # number of traits
P           <- 3     # Sphere dimension 
# ARD parameters
# Generate z (spherical coordinates)
genz        <- rvMF(N, rep(0,P))
# Generate nu  from a Normal(mu, sigma^2) (The gregariousness)
gennu       <- rnorm(N, mu, sigma)
# compute degrees
gend        <- N*exp(gennu)*exp(mu+0.5*sigma^2)*exp(logCpvMF(P,0) - logCpvMF(P,genzeta))
# Link probabilities
distr       <- sim.dnetwork(gennu, gend, genzeta, genz) 
# Adjacency matrix
G           <- sim.network(distr)
# Generate vk, the trait location
genv        <- rvMF(K, rep(0, P))
# set fixed some vk  distant
genv[1,]    <- c(1, 0, 0)
genv[2,]    <- c(0, 1, 0)
genv[3,]    <- c(0, 0, 1)
# eta, the intensity parameter
geneta      <- abs(rnorm(K, 2, 1))
# Build traits matrix
densityatz  <- matrix(0, N, K)
for(k in 1:K){
  densityatz[,k] <- dvMF(genz, genv[k,]*geneta[k])
}
trait       <- matrix(0, N, K)
NK          <- floor(runif(K, 0.8, 0.95)*colSums(densityatz)/apply(densityatz, 2, max))
for (k in 1:K) {
  trait[,k] <- rbinom(N, 1, NK[k]*densityatz[,k]/sum(densityatz[,k]))
}
# Build ADR
ARD         <- G %*% trait
# generate b
genb        <- numeric(K)
for(k in 1:K){
  genb[k]   <- sum(G[,trait[,k]==1])/sum(G)
}

Example 1: we observe ARD for the entire population

# Example1: ARD is observed for the whole population
# initialization 
d0     <- exp(rnorm(N)); b0 <- exp(rnorm(K)); eta0 <- rep(1,K)
zeta0  <- 2; z0 <- matrix(rvMF(N, rep(0,P)), N); v0 <- matrix(rvMF(K,rep(0, P)), K)
# We should fix some vk and bk
vfixcolumn      <- 1:5
bfixcolumn      <- c(3, 7, 9)
b0[bfixcolumn]  <- genb[bfixcolumn]
v0[vfixcolumn,] <- genv[vfixcolumn,]
start           <- list("z" = z0, "v" = v0, "d" = d0, "b" = b0, "eta" = eta0, 
                        "zeta" = zeta0)

The estimation can be performed using the function mcmcARD

# MCMC
estim.ard1      <- mcmcARD(Y = ARD, traitARD = trait, start = start, fixv = vfixcolumn,
                           consb = bfixcolumn, iteration = 5000)  
# plot coordinates of individual 123
i      <- 123
zi     <- estim.ard1$simulations$z[i,,]
invisible(lapply(1:3, function(x) {
  plot(zi[x,], type = "l", ylab = "", col = "blue", ylim = c(-1, 1))
  abline(h = genz[i, x], col = "red")
}))
i      <- 123
oldpar <- par(no.readonly = TRUE)   
par(mfrow = c(3, 1), mar = c(2.1, 2.1, 1, 1))
invisible(lapply(1:3, function(x) {
  plot(zi[x,], type = "l", ylab = "", col = "blue", ylim = c(-1, 1))
  abline(h = genz[i, x], col = "red")
}))
par(oldpar)
# plot coordinates of the trait 8
k     <- 8
vk    <- estim.ard1$simulations$v[k,,]
invisible(lapply(1:3, function(x) {
  plot(vk[x,], type = "l", ylab = "", col = "blue", ylim = c(-1, 1))
  abline(h = genv[k, x], col = "red")
}))
k      <- 8
oldpar <- par(no.readonly = TRUE)   
par(mfrow = c(3, 1), mar = c(2.1, 2.1, 1, 1))
invisible(lapply(1:3, function(x) {
  plot(vk[x,], type = "l", ylab = "", col = "blue", ylim = c(-1, 1))
  abline(h = genv[k, x], col = "red")
}))
par(oldpar)
# plot degree
library(ggplot2)
data.plot1 <- data.frame(True_degree  = gend, 
                         Estim_degree = colMeans(tail(estim.ard1$simulations$d, 2500)))
ggplot(data = data.plot1, aes(x = True_degree, y = Estim_degree)) + 
   geom_abline(col = "red") + geom_point(col = "blue")
library(ggplot2)
ggplot(data = data.plot1, aes(x = True_degree, y = Estim_degree)) + 
   geom_abline(col = "red") + geom_point(col = "blue")

Example 2: we observe ARD for only 70% of the population

# Example2: ARD is observed for 70% population
# sample with ARD 
n          <- round(0.7*N)
# individual with ARD
iselect    <- sort(sample(1:N, n, replace = FALSE))
ARDs       <- ARD[iselect,]
traits     <- trait[iselect,]
# initialization 
d0         <- d0[iselect]; z0 <- z0[iselect,] 
start      <- list("z" = z0, "v" = v0, "d" = d0, "b" = b0, "eta" = eta0, "zeta" = zeta0)

The estimation can be performed using the function mcmcARD

# MCMC
estim.ard2 <- mcmcARD(Y = ARDs, traitARD = traits, start = start, fixv = vfixcolumn,
                      consb = bfixcolumn, iteration = 5000)  
# estimation for non ARD
# we need a logical vector indicating if the i-th element has ARD
hasARD      <- (1:N) %in% iselect
# we use the matrix of traits to estimate distance between individuals
estim.nard2 <- fit.dnetwork(estim.ard2, X = trait, obsARD = hasARD, m = 1) 
# estimated degree
estd          <- estim.nard2$degree
data.plot2    <- data.frame(True_degree  = gend, 
                            Estim_degree = estd, 
                            Has_ARD      = ifelse(hasARD, "YES", "NO"))
ggplot(data = data.plot2, aes(x = True_degree, y = Estim_degree, colour = Has_ARD)) + 
  geom_abline(col = "red") + geom_point() 
ggplot(data = data.plot2, aes(x = True_degree, y = Estim_degree, colour = Has_ARD)) + 
  geom_abline(col = "red") + geom_point() 

Estimating peer effects model with ARD

Given the predicted probabilities, estimated using the estimator proposed by @breza2020using assuming that ARD are observed for the entire population, we implement our Bayesian estimator assuming that the posterior distribution of the linking probabilities are jointly normally distributed.

load(url('https://github.com/ahoundetoungan/PartialNetwork/blob/master/datavignettes/out.lsp.rda?raw=true'))
#save(out.lspa1, out.lspa2, file = "~/Dropbox/Papers - In progress/Partial Network/Package/AH/PartialNetwork/datavignettes/out.lsp.rda")

We first simulate data.

library(PartialNetwork)
set.seed(123)
M             <- 30
N             <- rep(60, M)
genzeta       <- 3
mu            <- -1.35
sigma         <- 0.37
K             <- 12    # number of traits
P             <- 3     # Sphere dimension

# IN THIS LOOP, WE GENERATE DATA FOLLOWING BREZA ET AL. (2020) AND 
# ESTIMATE THEIR LATENT SPACE MODEL FOR EACH SUB-NETWORK.
estimates     <- list()
list.trait    <- list()
G0            <- list()
for (m in 1:M) {
  #######################################################################################
  #######                             SIMULATION STAGE                           ########
  #######################################################################################
  # ARD parameters
  # Generate z (spherical coordinates)
  genz    <- rvMF(N[m], rep(0,P))
  # Generate nu  from a Normal(mu, sigma^2) (The gregariousness)
  gennu   <- rnorm(N[m],mu,sigma)
  # compute degrees
  gend    <- N[m]*exp(gennu)*exp(mu+0.5*sigma^2)*exp(logCpvMF(P,0) - logCpvMF(P,genzeta))
  # Link probabilities
  distr   <- sim.dnetwork(gennu, gend, genzeta, genz) 
  # Adjacency matrix
  G        <- sim.network(distr)
  G0[[m]]  <- G
  # Generate vk, the trait location
  genv     <- rvMF(K, rep(0, P))
  # set fixed some vk  distant
  genv[1,] <- c(1, 0, 0)
  genv[2,] <- c(0, 1, 0)
  genv[3,] <- c(0, 0, 1)
  # eta, the intensity parameter
  geneta   <-abs(rnorm(K, 2, 1))
  # Build traits matrix
  densityatz       <- matrix(0, N[m], K)
  for(k in 1:K){
    densityatz[,k] <- dvMF(genz, genv[k,]*geneta[k])
  }
  trait       <- matrix(0, N[m], K)
  NK          <- floor(runif(K, .8, .95)*colSums(densityatz)/apply(densityatz, 2, max))
  for (k in 1:K) {
    trait[,k] <- rbinom(N[m], 1, NK[k]*densityatz[,k]/sum(densityatz[,k]))
  }
  list.trait[[m]]  <- trait
  # Build ADR
  ARD         <- G %*% trait
  # generate b
  genb        <- numeric(K)
  for(k in 1:K){
    genb[k]  <- sum(G[,trait[,k]==1])/sum(G) + 1e-8
  }

  #######################################################################################
  #######                             ESTIMATION STAGE                           ########
  #######################################################################################
  # initialization 
  d0     <- gend; b0 <- exp(rnorm(K)); eta0 <- rep(1,K); zeta0 <- genzeta 
  z0     <- matrix(rvMF(N[m], rep(0,P)), N[m]); v0 <- matrix(rvMF(K,rep(0, P)), K)
  # We should fix some vk and bk
  vfixcolumn      <- 1:5
  bfixcolumn      <- c(1, 3, 5, 7, 9, 11)
  b0[bfixcolumn]  <- genb[bfixcolumn]
  v0[vfixcolumn,] <- genv[vfixcolumn,]
  start           <- list("z" = z0, "v" = v0, "d" = d0, "b" = b0, "eta" = eta0, 
                          "zeta" = zeta0)
  estimates[[m]]  <- mcmcARD(Y = ARD, traitARD = trait, start = start, fixv = vfixcolumn, 
                             consb = bfixcolumn, sim.d = FALSE, sim.zeta = FALSE, 
                             iteration = 5000, ctrl.mcmc = list(print = FALSE))
}

# SIMULATE DATA FOR THE OUTCOME MODEL
# individual effects
beta          <- c(2,1,1.5) 
# contextual effects
gamma         <- c(5,-3) 
# endogenous effects
alpha         <- 0.4
# std-dev errors
se            <- 1
# covariates
X             <- cbind(rnorm(sum(N),0,5),rpois(sum(N),7))
# Normalise G0
G0norm        <- norm.network(G0)
# simulate dependent variable use an external package
y             <- CDatanet::simsar(~ X, contextual = TRUE, Glist = G0norm,
                                  theta = c(alpha, beta, gamma, se))
y             <- y$y
# dataset
dataset       <- as.data.frame(cbind(y, X1 = X[,1], X2 = X[,2])) 

Once the data are simulated, the estimation can be performed using the function mcmcSAR.

mlinks       <- list(model = "latent space", estimates = estimates)
out.lspa1    <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = "none", 
                        data = dataset, mlinks = mlinks, iteration = 2e4)
summary(out.lspa1) 
summary(out.lspa1) 

Estimating peer effects with partial ARD

Given the predicted probabilities, estimated using the estimator proposed by @breza2020using assuming that ARD are observed for 70% $\sim$ 100% of the population, we implement our Bayesian estimator assuming that the posterior distribution of the linking probabilities are jointly normally distributed.

We first simulate data.

library(PartialNetwork)
set.seed(123)
M             <- 30
N             <- rep(60, M)
genzeta       <- 3
mu            <- -1.35
sigma         <- 0.37
K             <- 12
P             <- 3

# IN THIS LOOP, WE GENERATE DATA FOLLOWING BREZA ET AL. (2020) AND 
# ESTIMATE THEIR LATENT SPACE MODEL FOR EACH SUB-NETWORK.
estimates     <- list()
list.trait    <- list()
obARD         <- list()
G0            <- list()
for (m in 1:M) {
  #######################################################################################
  #######                             SIMULATION STAGE                           ########
  #######################################################################################
  # ARD parameters
  # Generate z (spherical coordinates)
  genz    <- rvMF(N[m], rep(0,P))
  # Generate nu  from a Normal(mu, sigma^2) (The gregariousness)
  gennu   <- rnorm(N[m],mu,sigma)
  # compute degrees
  gend    <- N[m]*exp(gennu)*exp(mu+0.5*sigma^2)*exp(logCpvMF(P,0) - logCpvMF(P,genzeta))
  # Link probabilities
  distr   <- sim.dnetwork(gennu, gend, genzeta, genz) 
  # Adjacency matrix
  G        <- sim.network(distr)
  G0[[m]]  <- G
  # Generate vk, the trait location
  genv     <- rvMF(K, rep(0, P))
  # set fixed some vk  distant
  genv[1,] <- c(1, 0, 0)
  genv[2,] <- c(0, 1, 0)
  genv[3,] <- c(0, 0, 1)
  # eta, the intensity parameter
  geneta   <-abs(rnorm(K, 2, 1))
  # Build traits matrix
  densityatz  <- matrix(0, N[m], K)
  for(k in 1:K){
    densityatz[,k] <- dvMF(genz, genv[k,]*geneta[k])
  }
  trait       <- matrix(0, N[m], K)
  NK          <- floor(runif(K, .8, .95)*colSums(densityatz)/apply(densityatz, 2, max)) 
  for (k in 1:K) {
    trait[,k] <- rbinom(N[m], 1, NK[k]*densityatz[,k]/sum(densityatz[,k]))
  } 
  list.trait[[m]]  <- trait
  # Build ADR
  ARD         <- G %*% trait
  # generate b
  genb        <- numeric(K)
  for(k in 1:K){
    genb[k]  <- sum(G[,trait[,k]==1])/sum(G) + 1e-8
  }
  # sample with ARD 
  n          <- round(runif(1, .7, 1)*N[m])
  # individual with ARD
  iselect    <- sort(sample(1:N[m], n, replace = FALSE))
  hasARD     <- (1:N[m]) %in% iselect
  obARD[[m]] <- hasARD
  ARDs       <- ARD[iselect,]
  traits     <- trait[iselect,]
  #######################################################################################
  #######                             ESTIMATION STAGE                           ########
  #######################################################################################
  # initialization 
  d0         <- gend[iselect]; b0 <- exp(rnorm(K)); eta0 <- rep(1,K); zeta0 <- genzeta
  z0         <- matrix(rvMF(n, rep(0,P)), n); v0 <- matrix(rvMF(K, rep(0, P)), K) 
  # We should fix some vk and bk
  vfixcolumn     <- 1:5
  bfixcolumn     <- c(1, 3, 5, 7, 9, 11)
  b0[bfixcolumn] <- genb[bfixcolumn]; v0[vfixcolumn,] <- genv[vfixcolumn,]
  start          <- list("z" = z0, "v" = v0, "d" = d0, "b" = b0, "eta" = eta0, 
                          "zeta" = zeta0)
  estimates[[m]] <- mcmcARD(Y = ARDs, traitARD = traits, start = start, fixv = vfixcolumn,
                            consb = bfixcolumn,  sim.d = FALSE, sim.zeta = FALSE, 
                            iteration = 5000, ctrl.mcmc = list(print = FALSE))
}

# SIMULATE DATA FOR THE OUTCOME MODEL
# individual effects
beta          <- c(2,1,1.5) 
# contextual effects
gamma         <- c(5,-3) 
# endogenous effects
alpha         <- 0.4
# std-dev errors
se            <- 1
# covariates
X             <- cbind(rnorm(sum(N),0,5),rpois(sum(N),7))
# Normalise G0
G0norm        <- norm.network(G0)
# simulate dependent variable use an external package
y             <- CDatanet::simsar(~ X, contextual = TRUE, Glist = G0norm,
                                  theta = c(alpha, beta, gamma, se))
y             <- y$y
# dataset
dataset       <- as.data.frame(cbind(y, X1 = X[,1], X2 = X[,2])) 

Once the data are simulated, the estimation can be performed using the function mcmcSAR.

mlinks       <- list(model = "latent space", estimates = estimates,
                     mlinks.data = list.trait, obsARD = obARD)
out.lspa2    <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0.obs = "none", 
                        data = dataset, mlinks = mlinks, iteration = 2e4)
summary(out.lspa2)
summary(out.lspa2)

The selection bias issue {#SELECT}

In many applications, missing links are not completely random. For example, an individual who has no friends is less likely to have missing value on their row in the adjacency matrix than someone who has a lot of links. Even regressing a logit/probit network formation model using the entries of the adjacency matrix that are correctly measured will not yield a consistent estimator due to the selection bias issue.

Consider the case of missing links due to unmatched declared friends. This is for example the case of the network data collected by the National Longitudinal Study of Adolescent to Adult Health (Add Health). We know the true number of links for each individual. This information is crucial to address the selection bias issue. When an individual has missing links, we could doubt all the "zeros" on their row in the adjacency matrix. To consistently estimate the SAR model, we can simply assume that information from this row is not true, i.e, the row has "zeros" everywhere in the object G0.obs. Information from individuals without missing links can be used to estimate rho and doubtful rows in the adjacency matrix will be inferred. However, there is a selection bias issue because we do not use rows with missing links to estimate rho. These rows are likely to have a lot of ones. As a consequence, the network formation model estimating using rows with no unmatched friends will simulate fewer links than the true data-generating process (DGP).

A straightforward approach to address the issue is to weight the selected rows that are used to estimate rho (see @manski1977estimation). Every selected row must be weighted by the inverse of the probability of having no unmatched links. To do so, the practitioner can include an input weights in the list mlinks which is an argument of the function mcmcsar. The input weights must be a vector of dimension the number of "ones" in G0.obs.

We consider the following example where the number of missing links is randomly chosen between zero and the true number of links.

library(PartialNetwork)
library(dplyr)
set.seed(123)
# Number of groups
M             <- 50
# size of each group
N             <- rep(30,M)
# individual effects
beta          <- c(2, 1, 1.5)
# contextual effects
gamma         <- c(5,-3)
# endogenous effects
alpha         <- 0.4
# std-dev errors
se            <- 2
# parameters of the network formation model
rho           <- c(-0.5, -.5, .4)
# covariates
X             <- cbind(rnorm(sum(N),0,5),rpois(sum(N),7))
# compute distance between individuals
tmp           <- c(0, cumsum(N))
X1l           <- lapply(1:M, function(x) X[c(tmp[x] + 1):tmp[x+1],1])
X2l           <- lapply(1:M, function(x) X[c(tmp[x] + 1):tmp[x+1],2])
dist.net      <- function(x, y) abs(x - y)
X1.mat        <- lapply(1:M, function(m) {
  matrix(kronecker(X1l[[m]], X1l[[m]], FUN = dist.net), N[m])})
X2.mat        <- lapply(1:M, function(m) {
  matrix(kronecker(X2l[[m]], X2l[[m]], FUN = dist.net), N[m])})
# true network
Xnet          <- as.matrix(cbind("Const" = 1,
                                 "dX1"   = mat.to.vec(X1.mat),
                                 "dX2"   = mat.to.vec(X2.mat)))
ynet          <- Xnet %*% rho
ynet          <- c(1*((ynet + rlogis(length(ynet))) > 0))
G0            <- vec.to.mat(ynet, N, normalise = FALSE)
# number of friends
nfriends      <- unlist(lapply(G0, function(x) rowSums(x)))
# number of missing links
nmislink      <- sapply(nfriends, function(x) sample(0:x, 1))

We now simulate the observed network by removing links from G0 depending on the number of unmatched links. We also simulate the outcome y using the true network.

Gobs          <- list(M)  # The observed network
G0.obs        <- list(M)  # Which information is true and doubtful
for(x in 1:M){
  Gx          <- G0[[x]]
  G0.obsx     <- matrix(1, N[x], N[x]); diag(G0.obsx) <- 0
  csum        <- cumsum(c(0, N))
  nmis        <- nmislink[(csum[x] + 1):csum[x + 1]]
  for (i in 1:N[x]) {
    if(nmis[i] > 0){
      tmp     <- which(c(Gx[i,]) == 1)
      if(length(which(c(Gx[i,]) == 1)) > 1) {
        tmp   <- sample(which(c(Gx[i,]) == 1), nmis[i])
      }
      Gx[i,tmp]   <- 0
      G0.obsx[i,] <- 0
    }
  }
  Gobs[[x]]   <- Gx
  G0.obs[[x]] <- G0.obsx
}
G0norm        <- norm.network(G0)
# simulate dependent variable use an external package
y             <- CDatanet::simsar(~ X, contextual = TRUE, Glist = G0norm,
                                  theta = c(alpha, beta, gamma, se))
y             <- y$y
# data set
dataset       <- as.data.frame(cbind(y, X1 = X[,1], X2 = X[,2]))

Assume that we estimate the network formation by selecting the individual with no unmatched links.

mlinks        <- list(model = "logit", mlinks.formula = ~ dX1 + dX2,
                      mlinks.data = as.data.frame(Xnet))
out.selb1     <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0 = Gobs,
                         G0.obs = G0.obs, data = dataset, mlinks = mlinks,
                         iteration = 2e4)
summary(out.selb1)
print(sout.selb1)

Although the peer effect estimate appears to be correct, the network formation estimator is likely biased. Indeed the estimator of the intercept is $-2.26$ and is significantly lower than its actual value of -$2$. As a result, the network formation model is likely to simulate fewer links than the true DGP. This is an issue if the practitioner uses the model to simulate policy impact on the outcome y or network features, such as centrality.

To control for the selection issue, we can weight selected individuals. The weight depends on the framework, especially how missing links occur. In our example, the number of missing links is chosen randomly between zero and the number of declared links. If the individual has $n_i$ declared friends, the probability of having no unmatched links can be estimated by the proportion of the number of individuals who has no unmatched links in the subset of individuals who declare $n_i$.

G0.obsvec     <- as.logical(mat.to.vec(G0.obs))
Gvec          <- mat.to.vec(Gobs, ceiled = TRUE)[G0.obsvec]
W             <- unlist(data.frame(nfriends = nfriends, nmislink = nmislink) %>%
                       group_by(nfriends) %>% 
                       summarise(w = length(nmislink)/sum(nmislink == 0)) %>% 
                       select(w))
W             <- lapply(1:M, function(x){
  matrix(rep(W[rowSums(G0[[x]]) + 1], each = N[x]), N[x], byrow = TRUE)})
weights       <- mat.to.vec(W)[G0.obsvec]
mlinks        <- list(model = "logit", mlinks.formula = ~ dX1 + dX2,
                      mlinks.data = as.data.frame(Xnet), weights = weights)
out.selb2     <- mcmcSAR(formula = y ~ X1 + X2, contextual = TRUE, G0 = Gobs, 
                         G0.obs = G0.obs, data = dataset, mlinks = mlinks,
                         iteration = 2e4)
summary(out.selb2)
print(sout.selb2)

It is also possible to use selected sample and the weight to estimate rho and var.rho using the function glm. These estimates can be provided to mlinks in estimates. In this case rho will not be updated using the observed part of the network.



ahoundetoungan/PartialNetwork documentation built on March 15, 2024, 4:35 p.m.