options(width = 10000)
knitr::opts_chunk$set(comment = "")
set.seed(121)
my.theme <- function(base_size = 10, base_family = "",
                     remove.ticks.x = FALSE,
                     remove.ticks.y = FALSE,
                     grid.x_colour = NA, grid.y_colour = NA,
                     grid.x_linetype = NA, grid.y_linetype = NA,
                     grid.x_size = NA, grid.y_size = NA,
                     strip_colour = NA, strip_text = "black",
                     background_colour = NA,
                     plot_background_colour = NA,
                     text_colour = NA,
                     tick_colour = "black", tick_length = 0.2,
                     borderless = 0, bordersize = 0.5){ 
    if((!is.na(grid.x_linetype) | !is.na(grid.x_size)) & is.na(grid.x_colour)) {
        grid.x_colour <- "black"
    }
    if((!is.na(grid.y_linetype) | !is.na(grid.y_size)) & is.na(grid.y_colour)) {
        grid.y_colour <- "black"
    }

  if(is.na(grid.x_size)) grid.x_size <- 0.25
  if(is.na(grid.x_linetype)) grid.x_linetype <- 1
  if(is.na(grid.y_size)) grid.y_size <- 0.25
  if(is.na(grid.y_linetype)) grid.y_linetype <- 1

  if(borderless == 2) {
    border <- theme(panel.border = element_blank(),
                    strip.background = element_blank())
  }

  else if(borderless == 1) {
    border <- theme(axis.line = element_line(colour = "black", size = 0.25),
                    panel.border = element_blank(),
                    # panel.background = element_blank(),
                    strip.background = element_blank())
  }
  else if(borderless == 0) border <- theme()
  if(remove.ticks.x == TRUE) border <- border + theme(axis.ticks.x = element_blank())
  if(remove.ticks.y == TRUE) border <- border + theme(axis.ticks.y = element_blank())
  theme(axis.text.x       = element_text(family = base_family, colour = "black", size = base_size, vjust = 1, lineheight = 0.9),
        axis.text.y       = element_text(family = base_family, colour = "black", size = base_size, hjust = 1, lineheight = 0.9),
        axis.ticks        = element_line(colour = tick_colour, size = 0.2),
        axis.ticks.length = unit(tick_length, "lines"),
        axis.title.x      = element_text(family = base_family, face = "bold", size = base_size*0.9, colour = "black", vjust = 0),
        axis.title.y      = element_text(family = base_family, face = "bold", size = base_size*0.9, angle = 90, colour = "black", vjust = 1),
        legend.background = element_blank(),#element_rect(fill = "grey95"),
        legend.key = element_blank(),
        legend.key.size = unit(0.6, "lines"),
        legend.text = element_text(family = base_family, size = base_size, color = "black", face = "plain", lineheight = 1),
        legend.text.align = -1, 
        legend.title = element_blank(),
        legend.title.align = 1,
        legend.position = "top",
        legend.direction = "horizontal",
        legend.justification = c(0, 1),
        panel.background = element_rect(fill = NA),
        panel.border = element_rect(fill = "transparent", colour = "black", size = bordersize),
        panel.grid.major.x = element_line(colour = grid.x_colour, size = grid.x_size, linetype = grid.x_linetype),
        panel.grid.major.y = element_line(colour = grid.y_colour, size = grid.y_size, linetype = grid.y_linetype),
        panel.grid.minor = element_blank(),
        panel.spacing = unit(0.75, "lines"),
        strip.background = element_blank(),
        strip.text.x = element_text(family = base_family, size = base_size, face = "bold", colour = strip_text),
        strip.text.y = element_text(family = base_family, size = base_size, face = "plain", angle = -90, colour = strip_text),
        plot.background = element_rect(fill = NA, colour = NA),
        plot.title = element_text(family = base_family, size = base_size * 1.1, vjust = 0.5, hjust = 0, face = "bold")) +
    border
}

News-sharing Ideology from Social Media Link Data

The [`mediascores`](https://github.com/SMAPPNYU/mediascores){target="_blank"} library in R

**[Gregory Eady](https://google.com){target="_blank"}**^$\dagger$^ and **[Fridolin Linder](https://google.com){target="_blank"}**^$\dagger$^ ^$\dagger$^ NYU [Social Media and Political Participation (SMaPP) Lab](https://smappnyu.org/){target="_blank"} & Department of Politics

1. Introduction {#section1}

This document provides a brief introduction to the R library mediascores{target="_blank"}, which implements a statistical model for news media sharing data (URLs) on social media (e.g. Twitter, Facebook). The goal of the method is to provide a measure of (1) the news-sharing ideology of politicians (e.g. Members of Congress) and ordinary users on social media, and (2) the ideology of the content that they share online.

The logic behind the method is intuitive, and is based on an assumption of homophily: that politicians and ordinary users on social media are more likely to share news stories that align with themselves ideologically than they are stories that are ideologically distant. In the US context, for example, a social media user who is liberal can be expected to share more content from the New York Times or MSNBC than they are content from the Wall Street Journal or Fox News. Conversely, conservatives can be expected to share more content from the Wall Street Journal and Fox News than from the New York Times or MSNBC.

Below, we work through an application of the method to Twitter data that were collected in 2018 from Members of Congress during the 115^th^ congressional session. Our example demonstrates how researchers can use such data to estimate the news-sharing ideology of social media users---in this case, politicians---and also that of the content that they share online.

2. Data {#section2}

The model uses as data the number of stories from each news organization that are shared by each social media user. In this vignette, we use the stories shared by members of the 115^th^ US Congress, US governors, and members of the US executive (e.g. the President and his predecessors). The data come from the tweets and retweets from these politicians, which were collected using the Twitter API at the end of 2018.^[URLs from "quote tweets" are excluded because Twitter users often use the quoted portion of tweets to mock the content of other users.] Because many URLs are provided only in shortened form, we unshorten them as necessary (e.g. nyti.ms/2BUTshD{target="_blank"} ---> nytimes.com/2018/...{target="_blank"}).^[An example of a (parallelized) link unshortener is that developed by Leon Yin{target="_blank"}: urlExpander{target="_blank"}.] The resulting data contains all URLs from tweets and retweets that linked to national news media organizations. To get a feel for these data, we describe them by example below.

To begin, we read in the data.frame that contains the count of the stories from each news media domain that have been tweeted and retweeted by each politician. These data, which are included in the mediascores{target="_blank"} library, also contain basic identifying information about each actor:

library(mediascores)
library(ggplot2)

# Read in Twitter URL-sharing data
data(MOC115)

The identifying information is contained in the first 6 variables of the data:

The remaining columns represent the count of the news stories tweeted or retweeted by each user (i.e. politician) from each news organization, where the column names indicate the organization. To make this concrete, a subset of these columns is shown alongside the names of five well-known US politicians:

columns_to_show <- c("name", "thenation.com", "huffingtonpost.com",
                     "washingtonpost.com", "foxnews.com", "breitbart.com")
politicians_to_show <- c("Ted Cruz", "Paul Ryan", "Susan Collins",
                         "Nancy Pelosi", "Bernie Sanders")
print(MOC115[MOC115$name %in% politicians_to_show, columns_to_show], row.names = FALSE)

For observers of American politics, these data are as one might expect: Bernie Sanders---a relatively far left Senator---shares far more news from the left (e.g. thenation.com, huffingtonpost.com) than he does that from the right (e.g. foxnews.com, breitbart.com). Ted Cruz---a relatively far right Senator---by contrast, shares mostly news from the ideological right.

As we can see, the data used for the model is simply a user-domain count matrix, where the rows indicate users; the columns indicate news media domains; and each cell represents the number of each domain that has been shared by each user. These data form the core set of data for the model, but as we will see, can also be augmented by a group-level indicator variable that represents whether a given user is, for example, a Democrat, Republican, or ordinary user.

3. A Statistical Model of News Sharing {#section3}

To model these data, the mediascores{target="blank"} library is designed to fit the following model:\begin{equation} y{img} \sim \text{NegBin}(\pi_{img}, \omega_i\omega_m) \end{equation}\begin{equation} \pi_{img} = \text{exp}(\alpha_i + \gamma_m - ||\vartheta_i - \zeta_m||^2), \end{equation}

where $y_{img}$ denotes the count of news stories from domain $m$ shared by user $i$ who belongs to group $g \in {$Republican, Democrat$}$ (i.e. the data in the data.frame shown above); $\alpha_i$ denotes a user-specific intercept; $\gamma_m$ denotes a domain-specific intercept; and $\omega_i$ and $\omega_m$ denote user- and domain-specific dispersion parameters respectively.^[The default model, however, excludes the parameter $\omega_i$, which is computationally burdensome and difficult to estimate.] The quantities of interest are denoted by $\vartheta_i$, which represents the news-sharing ideology of user $i$, and $\zeta_m$, the ideology of news media organization $m$. As the term $-||\vartheta_i - \zeta_m||^2$ in the linear predictor makes clear, the larger the ideological spatial distance between the ideology $\vartheta_i$ of the user and the ideology $\zeta_m$ of the news media organization, the less likely the user is to share stories from that organization.

The model is fit in a Bayesian context using the Bayesian inference engine Stan. Priors are placed on the model parameters as follows: \begin{equation} \alpha_i \sim \text{Normal}(\mu_{\alpha}, \sigma_{\alpha}) \end{equation}\begin{equation} \gamma_i \sim \text{Normal}(0, \sigma_{\gamma}), \end{equation}

where uniform prior distributions are placed on the hyperparameters $\mu_{\alpha}$, $\sigma_{\alpha}$ and $\sigma_{\gamma}$.

Group-level information, $g$ (e.g. Democrat, Republican) is then incorporated by placing separate common prior distributions on the batches of parameters denoting users' ideology, $\vartheta_{ig}$, based on their group (party) affiliation: \begin{equation} \vartheta^{(g)}{i} \sim \text{Normal}(\mu{\vartheta}^{(g)}, \sigma_{\vartheta}^{(g)}), \end{equation}

where uniform prior distributions are placed on the hyperparameters $\mu_{\vartheta}^{(g)}$ and $\sigma_{\vartheta}^{(g)}$. Finally, the variance parameters $\omega_i$ and $\omega_m$, are given common distributions $\omega_i \sim \text{InvGamma}(\omega^{(i)}_a, \omega^{(i)}_b)$ and $\omega_m \sim \text{InvGamma}(\omega^{(m)}_a, \omega^{(m)}_b)$, with uniform priors on the hyperparameters.

For identification, the parameters representing news media ideology are centered on 0: $\zeta_m \sim \text{Normal}(0, \sigma_{\zeta})$. Lastly, the direction of the model needs to be set, such that high values represent ideological liberalism or conservatism. Here, we follow the practical solution of @Jackman2001, who suggests allowing the sampler to freely explore the posterior and settle in on one of the two scale directions. We then flip the scale and associated parameters after estimation (if required) depending on anchors defined by the researcher: the researcher is asked to set two news media domains as scale anchors (e.g. {nytimes.com, foxnews.com}), such that the ideology of the first media organization defines the low end of the scale (in this example, liberal), and the second, the high end (in this example, conservative).

4. Model Fitting {#section4}

To fit this model, we use the data as shown in Section 2. We first transform the data.frame into a matrix, removing the columns that provide information about each politician (e.g. their name, party, and nominate score), which are the first 6 columns of the data set:

# Create user-domain count matrix
User_Domain_Counts <- as.matrix(MOC115[, -c(1:6)])

# For reference, name rows by politician's name
rownames(User_Domain_Counts) <- MOC115$name

# Show sub-matrix of first 5 politicians and 5 domains:
print(User_Domain_Counts[1:5, 1:5], row.names = FALSE)

The resulting user-domain count matrix can then be inputted as an argument into the workhorse function of the mediascores{target="_blank"} library, mediascores(), as follows:

# Fit the sharing ideology model
# anchor = c(1, 2) because column 1 in the matrix represents the NY Times
# and column 2 Fox News. As a result lower values of theta_i and zeta_m will
# represent the liberal side of the scale; higher values, the conservative side.
fitted_model <- mediascores(Y = User_Domain_Counts, group = MOC115$group,
                            variational = FALSE, user_variance = FALSE,
                            chains = 4, cores = 4, threads = 4,
                            warmup = 750, iter = 1500, seed = 1,
                            open_progress = TRUE,
                            anchors = c(1, 2))
# Fit the sharing ideology model
# anchor = c(1, 2) because column 1 in the matrix represents the NY Times
# and column 2 Fox News. As a result lower values of theta_i and zeta_m will
# represent the liberal side of the scale; higher values, the conservative side.
system.time({fitted_model <- mediascores(Y = User_Domain_Counts, group = MOC115$group,
                            variational = FALSE, user_variance = FALSE,
                            chains = 4, cores = 4, threads = 4,
                            warmup = 750, iter = 1500, seed = 1,
                            open_progress = TRUE,
                            anchors = c(1, 2))})
# saveRDS(fitted_model, "data/fitted_models/MOC115_fitted_model.rds")
# This is a _local_ file for the vignette only
fitted_model <- readRDS("~/GitHub/mediascores/data/fitted_models/MOC115_fitted_model.rds")

As we can see, we also include the MOC115$group variable to set separate common distributions on the ideology parameters $\theta_i$ of Democrats (group = 1) and Republicans (group = 2). One could alternatively exclude this information, in which a common prior distribution is assigned to all actors included in the model regardless of party affiliation.

5. Model Diagnostics {#section5}

The object returned from the mediascores() function is a standard Stan{target="_blank"} object. As a consequence, for in-depth analysis of the posterior, one can treat the resulting object as such and use open-source visualization tools to examine it (e.g. shinystan{target="_blank"}):

library(shinystan)
shiny_stan(fitted_model)

Here, we focus on a simple helper function built into the mediascores{target="_blank"} library to access one of the key post-estimation statistics, $\hat{R}$, which is commonly used to assess parameter convergence. Below, we use the rhat() function to display the $\hat{R}$ statistic for the two primary quantities of interest, $\vartheta_i$ (the news-sharing ideology of users), and $\zeta_m$ (the ideology of news media organizations):

rhats <- rhat(posterior = fitted_model, pars = c("zeta", "theta"))
print(rhats)
rhats <- rhat(posterior = fitted_model, pars = c("theta", "zeta"))
{print(subset(rhats, rownames(rhats) %in% c("zeta[1]", "zeta[2]", "zeta[3]", "zeta[4]", "zeta[5]")))
cat("...\n")
print(subset(rhats, rownames(rhats) %in% paste0("theta[", (nrow(User_Domain_Counts)-4):nrow(User_Domain_Counts), "]")))}

The function outputs $\hat{R}$ statistics for all specified parameters, including a warning for any values of $\hat{R}$ that are greater than 1.1, the rule-of-thumb cutoff suggested by @Gelman2014. To see if any such $\hat{R}$ statistics are greater than 1.1 we can also simply look at the potential subset of these as follows:

subset(rhats, rhat > 1.1)

Here, we see that none of the specified parameters have $\hat{R}$ statistics greater than 1.1. If we did, the solution would typically be to re-run the model with more post-warmup iterations:

# Setting iter = 2000 (not run)
fitted_model <- mediascores(Y = User_Domain_Counts, group = MOC115$group,
                            variational = FALSE, user_variance = FALSE,
                            warmup = 750, iter = 2000, chains = 4,
                            open_progress = TRUE, seed = 1,
                            # output_samples = 5000, tol_rel_obj = 0.001,
                            anchors = c(1, 2))

For more detail on $\hat{R}$, see the Stan Modeling Language User Guide and Reference Manual{target="_blank"}.

5.1 Divergent transitions {#subsection51}

After model fitting, if you receive a message that there were "divergent transitions" during estimation, there may be validity concerns about the resulting parameter estimates. Details of this issue lie outside the purview of this vignette, but the problem can often be resolved by increasing the value of adapt_delta. This argument can be included in the call to mediascores() as follows: mediascores(..., control = list(adapt_delta = 0.99)), where the default value is 0.8 (max < 1). Values of 0.99 or 0.999 typically resolve the problem. However, note that increasing the value of adapt_delta can substantially increase the time for model fitting. In our experience, this problem is rare for the model presented in this library.

5.2 Maximum treedepth exceeded {#subsection52}

If after model fitting you receive a message that the "maximum treedepth" was exceeded during estimation, the concern is one of efficiency not validity. If desired, you can increase efficiency by increasing the value of the max_treedepth argument, which defaults to 10. As with adapt_delta, this argument can be included in the function call as follows: mediascores(..., control = list(max_treedepth = 15)). Increasing max_treedepth also increases estimation time.

6. Analysis {#section6}

Finally, we examine the estimates of our quantities of interest: (1) the social media news-sharing ideology of Members of Congress, and (2) the ideology of the news media content shared.

To calculate these estimates from the fitted model obect, we can use the point_est() function, which calculates the median of the posterior of each parameter of interest alongside credible intervals:

theta_hat <- point_est(fitted_model, pars = c("theta"), prob = 0.9)
print(theta_hat)
theta_hat <- point_est(fitted_model, pars = c("theta"), prob = 0.9)
{print(subset(theta_hat, rownames(theta_hat) %in% c("theta[1]", "theta[2]", "theta[3]", "theta[4]", "theta[5]")))
cat("...\n")
print(subset(theta_hat, rownames(theta_hat) %in% paste0("theta[", (nrow(User_Domain_Counts)-4):nrow(User_Domain_Counts), "]")))}

Because these data are ordered as they are in our original data, we can merge them into our politicians dataset to see how news-sharing ideology aligns with that of voting ideology (i.e. NOMINATE scores):

MOC115$theta_hat <- theta_hat[, "median"]

# All politicians
round(cor(MOC115$theta_hat, MOC115$nominate, use = "complete.obs"), 2)

# Democrats
round(cor(MOC115$theta_hat[MOC115$party == "Democrat"], MOC115$nominate[MOC115$party == "Democrat"], use = "complete.obs"), 2)

# Republicans
round(cor(MOC115$theta_hat[MOC115$party == "Republican"], MOC115$nominate[MOC115$party == "Republican"], use = "complete.obs"), 2)

We can further graph these estimates as follows:

ggplot(subset(MOC115, role %in% c("Senate", "House")),
       aes(x = theta_hat, y = nominate, color = party,
           fill = party, shape = party)) +
  facet_wrap(~ role) +
  labs(x = expression(bold(paste("Media score (", vartheta, ")"))),
       y = "DW-NOMINATE") +
  coord_cartesian(xlim = c(-1.1, 1.6), ylim = c(-1.1, 1.1), expand = FALSE) +
  geom_point(size = 2.25, shape = 21, stroke = 0.4) +
  scale_colour_manual(values = c("Republican" = "red", "Democrat" = "white")) +
  scale_fill_manual(values = c("Republican" = NA, "Democrat" = "blue")) +
  theme(legend.position = "none")
ggplot(subset(MOC115, role %in% c("Senate", "House")),
       aes(x = theta_hat, y = nominate, color = party,
           fill = party, shape = party)) +
  my.theme(base_size = 11, borderless = 2,
           remove.ticks.x = TRUE, remove.ticks.y = TRUE,
           grid.x_colour = "grey50", grid.y_colour = "grey50",
           grid.x_linetype = 3, grid.y_linetype = 3) +
  facet_wrap(~ role) +
  labs(x = expression(bold(paste("Media score (", vartheta, ")"))),
       y = "DW-NOMINATE") +
  coord_cartesian(xlim = c(-1.1, 1.6), ylim = c(-1.1, 1.1), expand = FALSE) +
  geom_point(size = 2.5, shape = 21, stroke = 0.4) +
  scale_colour_manual(values = c("Republican" = "red", "Democrat" = "white")) +
  scale_fill_manual(values = c("Republican" = NA, "Democrat" = "blue")) +
  theme(legend.position = "none")

We can then calculate estimates of news media ideology as follows:

zeta_hat <- point_est(fitted_model, "zeta", prob = 0.9)

# Collect the estimates of news media ideology in a data.frame
M <- data.frame(media_org = colnames(MOC115)[7:(ncol(MOC115)-1)],
                zeta_hat = zeta_hat[, "median"],
                zeta_hat_5 = zeta_hat[, "5%"],
                zeta_hat_95 = zeta_hat[, "95%"])

# Re-order media_org for plotting from low values of zeta to high
M$media_org <- reorder(M$media_org, -M$zeta_hat)

Lastly, we graph these estimates along with their 90% credible intervals:

M$label <- "right"
M$label[M$zeta_hat > median(M$zeta_hat)] <- "left"

ggplot(M, aes(x = zeta_hat, y = media_org)) +
  coord_cartesian(xlim = c(-3.4, 3.4)) +
  scale_y_discrete(breaks = c()) +
  scale_x_continuous(breaks = -3:3) +
  labs(y = "", x = expression(bold(paste("Media score (", zeta, ")")))) +
  geom_segment(aes(x = zeta_hat_5, xend = zeta_hat_95, y = media_org, yend = media_org),
                   size = 2, color = "grey85") +
  geom_point(size = 1.25, shape = 21, stroke = 0.4, color = "black",
             fill = "black", shape = 21) +
  geom_label(data = subset(M, label == "left"),
             aes(x = zeta_hat_5, label = media_org), color = "black",
             label.r = unit(0, "lines"), label.size = NA,
             label.padding = unit(0.07, "lines"), 
             hjust = 1, size = 2.5, nudge_x = -0.13) +
  geom_label(data = subset(M, label == "right"),
             aes(x = zeta_hat_95, label = media_org), color = "black",
             label.r = unit(0, "lines"), label.size = NA,
             label.padding = unit(0.07, "lines"), 
             hjust = 0, size = 2.5, nudge_x = 0.13)
M$label <- "right"
M$label[M$zeta_hat > median(M$zeta_hat)] <- "left"

ggplot(M, aes(x = zeta_hat, y = media_org)) +
  my.theme(base_size = 11, borderless = 2,
           remove.ticks.x = TRUE,
           grid.x_linetype = 3) +
  coord_cartesian(xlim = c(-3.4, 3.4)) +
  scale_y_discrete(breaks = c()) +
  scale_x_continuous(breaks = -3:3) +
  labs(y = "", x = expression(bold(paste("Media score (", zeta, ")")))) +
  geom_segment(aes(x = zeta_hat_5, xend = zeta_hat_95, y = media_org, yend = media_org),
                   size = 2, color = "grey85") +
  geom_point(size = 1.25, shape = 21, stroke = 0.4, color = "black",
             fill = "black") +
  geom_label(data = subset(M, label == "left"),
             aes(x = zeta_hat_5, label = media_org), color = "black",
             label.r = unit(0, "lines"), label.size = NA,
             label.padding = unit(0.07, "lines"), 
             hjust = 1, size = 2.5, nudge_x = -0.13) +
  geom_label(data = subset(M, label == "right"),
             aes(x = zeta_hat_95, label = media_org), color = "black",
             label.r = unit(0, "lines"), label.size = NA,
             label.padding = unit(0.07, "lines"), 
             hjust = 0, size = 2.5, nudge_x = 0.13)

7. A Note on CPU Parallelization {#section7}

The mediascores{target="blank"} library is coded using map-reduce to permit within-chain CPU parallelization. This allows researchers to use many CPU cores for model fitting and thereby _substantially cut down on estimation time. This can be useful even for small-sized data sets, because the model is computationally demanding. For example, using a 4-core (7 year old) laptop, the model estimated in this vignette (using data from a 590 x 152 matrix) took roughly three and a half hours to fit. On a 20-core computer with within-chain parallelization turned on, model fitting took less than an hour.

To enable within-chain parallelization^[Between-chain parallelization is enabled by default.] requires that the C++ code compiled during the mediascores{target="_blank"} library's installation be done using flags to turn it on. The process to do this unfortunately varies from system to system (e.g. Windows, Unix, Mac), but is not overly complicated. For example, enabling parallelization on the systems to which the author has access required only that the following be added to the ~/.R/Makevars file (create this file if it does not exist):

On Mac OS (10.14.3):

CXX14FLAGS=-O3 -march=native -mtune=native
CXX14FLAGS += -arch x86_64 -ftemplate-depth-256
CXX14FLAGS += -DSTAN_THREADS

On CentOS 7:

CXX14 = icc
CXX14FLAGS = -DSTAN_THREADS
CXX14FLAGS += -O3 -march=native -mtune=native
CXX14FLAGS += -fPIC

For further support regarding your specific system, please consult the following resources: Stan Threading Support{target="_blank"}, Stan MPI Parallelism{target="_blank"} (for cluster-based systems), and the Stan user forums{target="_blank"}.

References




SMAPPNYU/mediascores documentation built on May 18, 2019, 1:30 p.m.