The data for this part can either be created in the second part, or the exact data used in the paper can be downloaded from the online appendix GitHub page.
library(gtdnews)
library(data.table)
file_url = 'https://github.com/maskedforreview/gtdnews/raw/master/GTD_matching/sim_data.rds'
if (!dir.exists('GTD_matching')) dir.create('GTD_matching')  ## make sure the target directory exists
download.file(file_url, 'GTD_matching/sim_data.rds')
g = readRDS('GTD_matching/sim_data.rds')
The object g is a list with 3 data.frames (in data.table format) that contain the results of the document comparison (GTD events to news articles) in a network format. The g$d data.frame is the edgelist with event-article pairs. g$from_meta and g$to_meta contain the meta data for the from (GTD events) and to (Guardian articles) nodes.
g$d ## the matches
g$from_meta ## gtd meta data
g$to_meta ## guardian meta data
The edgelist has already been filtered to only contain event-article pairs where the similarity (weight) was at least 1. Otherwise, the data would have been much larger, and we do not use this information anyway. However, note that for the unsupervised validation method we do need to know the total number of news articles to which each event has been compared, separately for articles published before the event ($N_{before}$) and after the event ($N_{after}$). These numbers can be obtained from the from_meta data, which has the columns from_n (how many comparisons in total) and lag_n (how many comparisons with a negative time difference).
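As an illustration (this is only a sketch, not part of the package; it assumes that lag_n counts the comparisons with articles published before the event, as described above), $N_{before}$ and $N_{after}$ can be derived as:
## sketch: number of comparisons before and after each event
N_before = g$from_meta$lag_n                       ## comparisons with a negative time difference
N_after  = g$from_meta$from_n - g$from_meta$lag_n  ## remaining comparisons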
Also important to know is that for events near the start and end of the period of analysis we do not have valid results, because they do not have a complete comparison window. Since our comparison window ranges from 30 days before the event to 7 days after the event, events in the first month and last week of our data cannot be used. So, we only use events where the .complete_window column in g$from_meta is TRUE.
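As a minimal sketch of this filter (using only the .complete_window column described above), the events with a complete comparison window can be inspected and selected as follows:
## number of events with(out) a complete comparison window
table(g$from_meta$.complete_window)
## keep only the meta data rows with a complete window
from_meta_complete = g$from_meta[g$from_meta$.complete_window, ]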
To produce figure 2, we plot barcharts of the hourdiff column in the edgelist, which gives the time difference between the event and news article in hours. We divide this by 24 to get the difference in days. Three plots are presented side by side (mfrow = c(1,3)), for the similarity thresholds 2, 7 and 14. There is no particular reason for these thresholds, other than that they nicely show high recall (left), high precision (right), and a more balanced threshold with presumably a high F1 (middle).
g$d$daydiff = floor(g$d$hourdiff / 24)  ## time difference in days
par(mar=c(4.5,4,1,4), mfrow=c(1,3))     ## plot side by side
for (thres in c(2,7,14)) {
  agg = g$d[, list(n = sum(weight > thres)), by='daydiff']  ## matches per day difference for this threshold
  setorderv(agg, 'daydiff')
  barplot(agg$n, names.arg = agg$daydiff, cex.names = 0.8,
          xlab = ifelse(thres==7, 'Day difference', ''),
          ylab = ifelse(thres==2, 'Number of matches', ''),
          col  = ifelse(agg$daydiff >= 0, 'black', 'white'))
}
The estimate_validity function uses the method for estimating precision (P), recall (R) and F1 for a range of similarity thresholds. The start and end of the range are specified in the weight_range argument, and the steps argument gives the number of points into which this range is divided. Here we use 200 steps, meaning that we get 200 estimations. The recall_precision_thres argument determines which of these estimations is used for our approximation of $TP_{recall \approx 100}$ (see here for details). Here we use the lowest similarity threshold that has an estimated precision of at least 5%.
est_pr = estimate_validity(g, weight_range = c(1,20), steps=200,
                           recall_precision_thres = 0.05)
est_pr contains the estimated precision and recall (and F1) scores. We will use this below to compare with the gold standard validation.
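For a quick look at these scores (the threshold and F1 columns are used again in the comparison plot further below), you can inspect the first rows:
head(est_pr)  ## estimated precision, recall and F1 per similarity threshold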
Reviewer note: A more generalized version of this function will be added to the MASKED R package.
The gold standard data is included as a (lazy) data object in the R package for this appendix.
gold_matches$description
head(gold_matches$articles)
head(gold_matches$matches)
The gtd_pr function computes the P, R and F1 scores based on the gold standard. The weight_range and steps arguments work in the same way as in the estimate_validity function.
gold_pr = gtd_pr(g, weight_range = c(1,20), steps = 200)
For additional validation of the precision, we manually coded 100 event-article pairs that, according to the algorithm with the estimated threshold of 6.54, are considered a match.
?precision_check
mean(precision_check$correct) ## precision for threshold 6.54
mean(precision_check$correct[precision_check$weight > 10]) ## precision for higher threshold
This data also contains URLs linking to the GTD event description and the Guardian article, as well as comments from the coder for more ambiguous references, clarifying why the match is considered correct or incorrect.
View(precision_check)
Here we compare the est_pr and gold_pr results. Note that this only works if the weight_range and steps arguments for the functions that generated this data are identical.
Plot the F1 scores
par(mfrow=c(1,1), mar=c(4,4,1,0), xpd=F)
plot(est_pr$threshold, est_pr$F1, type='l',
     ylim = c(min(est_pr$F1, gold_pr$F1m), max(est_pr$F1, gold_pr$F1m)+10),
     xlab='Similarity threshold', ylab='F1', lwd=2, col='grey', bty='l')
lines(gold_pr$weight, gold_pr$F1m, lty=1, lwd=1)
Add the vertical bars for the highest F1 scores
top_est = round(est_pr$threshold[est_pr$F1 == max(est_pr$F1)][1], 2)
top_gold = round(gold_pr$weight[gold_pr$F1m == max(gold_pr$F1m)][1], 2)
graphics::abline(v=top_gold, lty=2)
graphics::abline(v=top_est, lty=2, col='darkgrey')
graphics::text(x=top_gold-0.4, y=19, labels=top_gold,
               srt=0, adj=1, font=3, cex=1, family='monospace')
graphics::text(x=top_est+1.85, y=19, labels=top_est,
               srt=0, adj=1, font=3, cex=1, family='monospace')
legend('topright', bty='n', legend = c('estimation', 'gold standard'),
       lty=c(1,1), lwd=c(2,1), col=c('grey', 'black'))
To analyze which events are covered in the news, we transform g so that each row is a unique event, with variables for whether the event matches at least one news article (has_news) and the number of news articles (N_news).

For the analysis in the paper we used a threshold of 6.54 (weight_thres = 6.54) based on our validation analysis. Also, we limited our focus to events with at least one fatal victim (killed > 0).
e = gtd_event_coverage(g, weight_thres = 6.54)
e = subset(e, killed > 0) ## only include fatal GTD events
As reported in the paper, the results of the statistical analysis are quite robust to small to moderate changes in the threshold. Also, if GTD events without fatal victims (around half of all events) are not removed, we mostly find the same effects, but with different effect sizes, and the effect of cultural values distance (one of the two cultural distance variables) disappears. We decided to focus only on fatal events because we believe that for non-fatal attacks the difference between an event and a non-event (or at least a non-terrorist event) is more ambiguous. However, there could still be relevant information pertaining to the gatekeeping function in the coverage of non-fatal events. A more detailed analysis to untangle how predictors of coverage differ for fatal and non-fatal attacks did not fit the scope of our paper, but it can be explored here by skipping the subset step, as sketched below.
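As a sketch of how this could be explored (using the same threshold as above), simply skip the subset step; note that, as explained further below, log(killed) should then be replaced with log1p(killed) in the models:
## sketch: include non-fatal events by skipping the subset step
e_all = gtd_event_coverage(g, weight_thres = 6.54)
## when fitting the models below on e_all, replace log(killed) with log1p(killed)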
It should also be pointed out that the GTD contains more event characteristics than we included here. These can be merged with the data to explore other predictors of coverage.
To create the world maps (figures 6 and 7) we need to locate the region of the GTD events. We use the latitude and longitude of the events to find the closest regions.
e$region = get_region_id(e$lon, e$lat, e$country)
Now we can aggregate to region.
geo = e[, list(sum_news = sum(N_news),    ## total number of news articles
               mean_news = mean(N_news)), ## average news articles per event
        by='region']
The plot_worldmap function wraps the code (mainly using ggplot2) for visualizing the world map. The input is the region name and a numeric score for that region.
plot_worldmap(geo$region, geo$sum_news) ## most coverage in total. not reported in paper
plot_worldmap(geo$region, geo$mean_news) ## most coverage per event. figure 6
For the multilevel analysis we use the lme4 package to fit the model, and the sjPlot package to display the results.
library(lme4)
library(sjPlot)
For the analysis of table 2 we used only the events for which we have the cultural distance. Here we merge the event data e (as created above) with the UK_dist_scores data, which is a data frame with distance measures from the UK to other countries. Since this data only contains countries for which we have the cultural distance (UN countries that participated in the World Values Survey wave 5), this merge only returns a subset of events.
e2 = merge(e, UK_dist_scores, by='country')
Now we fit the generalized linear mixed models (glmer) with a binomial error distribution (which by default uses the logistic link function). Note that if non-fatal GTD events are included in the data, the natural log transformation of killed is undefined for non-fatal events. A solution would be to use a log plus one transformation, for which you can simply replace log(killed) with log1p(killed).
m1 = glmer(has_news ~ 1 + (1 | country),
           data=e2, family = binomial)
m2 = glmer(has_news ~ 1 + log(killed) + suicide + (1 | country),
           data=e2, family = binomial)
m3 = glmer(has_news ~ 1 + log(killed) + suicide + scale(geo_dist) + (1 | country),
           data=e2, family = binomial)
m4 = glmer(has_news ~ 1 + log(killed) + suicide + scale(geo_dist) + value_dist + UN_disagree + (1 | country),
           data=e2, family = binomial)
For the all-countries model in table 2 we perform the same analysis, but without merging with UK_dist_scores. Instead, we merge with UK_geo_dist, which only contains the geographic distance, but for all countries in the data.
e3 = merge(e, UK_geo_dist, by='country')
m1 = glmer(has_news ~ 1 + (1 | country),
           data=e3, family = binomial)
m2 = glmer(has_news ~ 1 + log(killed) + suicide + (1 | country),
           data=e3, family = binomial)
m3 = glmer(has_news ~ 1 + log(killed) + suicide + scale(geo_dist) + (1 | country),
           data=e3, family = binomial)
m30 = glm(has_news ~ 1 + log(killed) + suicide + scale(geo_dist),
          data=e3, family = binomial)
m20 = glm(has_news ~ 1 + log(killed) + suicide + country,
          data=e3, family = binomial)
phtest_glmer(m2, m20)          ## compare the random effects (glmer) and fixed effects (glm) specifications
anova(m1, m2, m3)              ## compare the nested multilevel models
summary(m2)
summary(m20)
tab_model(m2, m20, show.se=T)  ## regression table (sjPlot)
To create figure 7, we first loop over different weight thresholds, fit the full model for each threshold, and store the coefficients of each model.
l = list()
for (weight_thres in seq(1, 20, by=0.5)) {
  print(weight_thres)
  e = gtd_event_coverage(g, weight_thres = weight_thres)
  e = subset(e, killed > 0)
  e = merge(e, UK_dist_scores, by='country')
  full_model = glmer(has_news ~ 1 + log(killed) + suicide + scale(geo_dist) + value_dist + UN_disagree + (1 | country),
                     data=e, family = binomial)
  x = plot_model(full_model)
  x = x$data
  x$threshold = weight_thres
  l[[as.character(weight_thres)]] = x  ## one list element per threshold
}
d = rbindlist(l)
For the following visualization we also need to install the facetscales package. This is not a dependency of the gtdnews package because it is only available on GitHub. The following code installs the package directly from GitHub (note that you need to have the remotes package installed).
remotes::install_github("zeehio/facetscales")
library(ggplot2)
library(facetscales)
## Set nicer labels for the independent variables
d$term = as.character(d$term)
d$term[d$term == 'log(killed)'] = 'fatal victims'
d$term[d$term == 'suicide'] = 'suicide'
d$term[d$term == 'scale(geo_dist)'] = 'geo dist'
d$term[d$term == 'value_dist'] = 'cultural val dist'
d$term[d$term == 'UN_disagree'] = 'UN voting diff'
## make factor, so that ggplot puts the results in the right order
cnames = unique(d$term)
d$term = factor(as.character(d$term), levels=cnames)
## use custom y-axis limits
scales_y <- list(
  `fatal victims` = scale_y_continuous(limits = c(0,3)),
  `suicide` = scale_y_continuous(limits = c(0, 30)),
  `geo dist` = scale_y_continuous(limits = c(0,2)),
  `cultural val dist` = scale_y_continuous(limits = c(0,2)),
  `UN voting diff` = scale_y_continuous(limits = c(0,2))
)
ggplot(data = d, aes(threshold, estimate)) +
  geom_line(color = "black", size = 1) +
  geom_errorbar(aes(ymin=conf.low, ymax=conf.high), colour="black", width=.4) +
  geom_point(color="black") +
  theme(plot.margin = unit(c(-1,0.4,0.4,0.4), "cm")) +
  geom_hline(yintercept=1, color='black', linetype=2) +
  labs(title = "", subtitle = "",
       y = "odds ratio", x = "similarity threshold") +
  facet_grid_sc(term~., scales = list(y = scales_y))