estimate_validity: Estimate weight threshold by estimating precision and recall

Description Usage Arguments Details Value

View source: R/est_validity.r

Description

This function estimates a weight threshold for a document comparison network that serves as an "event matching" task. The "from" documents in the edgelist need to be events, or other types of documents of which you can be sure that the date of the "to" documents cannot precede them.

Usage

estimate_validity(
  g,
  weight_range,
  steps,
  min_weight = NA,
  do_plot = T,
  from_sample = NULL,
  weight_col = "weight",
  n_sample = NA,
  recall_precision_thres = 0.05
)
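A minimal usage sketch, assuming `g` is the edgelist output of newsflow.compare; the weight range and number of steps are illustrative values, not recommendations:

```r
## Hypothetical example: g is the edgelist output of
## newsflow.compare(..., return_as = "edgelist")
res <- estimate_validity(
  g,
  weight_range = c(1, 20),  # min and max weight threshold to evaluate (illustrative)
  steps = 20                # number of thresholds between min and max
)
head(res)  # the estimates are also returned as a data.frame
```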

Arguments

g

The edgelist output of newsflow.compare (use the argument: return_as = "edgelist").

weight_range

A vector of length 2, giving the minimum and maximum weight threshold to evaluate.

steps

The number of weight thresholds, within weight_range, for which to calculate the estimates.

min_weight

Optionally, a minimum weight used only for this calculation.

do_plot

If set to FALSE, do not plot the results (the results are also returned as a data.frame).

weight_col

the name of the column with the weight scores

n_sample

Draw a random sample of events. Overrides from_sample.

recall_precision_thres

To estimate the recall we need to estimate the number of true positives given a low precision (see details). Here you can specify the precision threshold.

from_sample

Optionally, a logical vector of the same length as nrow(g$from_meta) to look only at specific cases.

Details

For the estimation to work best, the settings used in the newsflow.compare function (which creates the g input) should be chosen with care.

We define a true positive as a match between a news document and an event document where the news document indeed covers the event. Accordingly, without actually looking at the news coverage, we can be sure that if the news document was published before the actual occurrence of the event, it is a false positive. We can use this information to get an estimate of the precision, recall and F1 scores. While the exact values of these scores will not be accurate, they can be used to see whether certain differences in preparing or comparing the data (in particular, using different weight thresholds) improve the results.

To calculate these estimates we make the assumption that the probability of a false positive in the matches for a given event is the same before and after the event. We can then calculate this probability as the number of matches before the event divided by the number of news articles before the event to which the event has been compared (for the edgelist output of newsflow.compare, the total number of comparisons is included in the "from_meta" attribute). We can then estimate the number of true positives as the observed number of matches after the event minus the expected number of false positives. To estimate the number of false negatives, we assume that the number of true positives estimated with a low precision threshold is an estimate of the real number of true positives. The precision level used can be specified in the recall_precision_thres argument.
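The estimation logic described above can be sketched as follows. This is a toy illustration with made-up counts, not the package's internal code:

```r
## Toy illustration of the precision estimate, with made-up counts
matches_before  <- 5      # matches to news published before the event
compared_before <- 1000   # news articles before the event that were compared
matches_after   <- 50     # observed matches after the event
compared_after  <- 2000   # news articles after the event that were compared

## Probability of a false positive, assumed equal before and after the event
p_fp <- matches_before / compared_before

## Expected false positives after the event, and estimated true positives
expected_fp <- p_fp * compared_after
est_tp <- matches_after - expected_fp

## Estimated precision among the matches after the event
est_precision <- est_tp / matches_after
```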

Value

A plot of the estimation; the data.frame with the estimates is also returned and can be assigned.


maskedforreview/gtdnews documentation built on April 12, 2021, 11:53 a.m.