time_based_validity: Inspect effects of thresholds on matches over time

View source: R/plot.r

Description

If it can be assumed that matches should only occur within a given time range (e.g., event data should only match news items published after the event occurred), a low-effort validation can be obtained by checking whether the matches actually fall within this time range. This function plots the percentage of matches within a given time range (hourdiff) for different thresholds of the weight column. This can be used to determine a good threshold.
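The idea described above can be sketched in base R. This is a minimal illustration on simulated data, not the package's implementation: the data frame `g` below is made up (it is not output of newsflow.compare), and the threshold grid and the names `validity` and `pct_in_range` are assumptions chosen for the example. Only the column names `weight` and `hourdiff` are taken from the description.

```r
# Simulated edgelist (hypothetical data, not output of newsflow.compare)
set.seed(1)
n <- 1000
g <- data.frame(
  weight   = runif(n),                       # similarity score of each match
  hourdiff = runif(n, min = -24, max = 24)   # hours between event and news item
)

# For the simulation, force strong matches to occur after the event
strong <- g$weight > 0.5
g$hourdiff[strong] <- abs(g$hourdiff[strong])

expected_hourdiff <- c(0, 24)                # expected range, endpoints included
thresholds <- seq(0, 0.9, by = 0.1)          # candidate weight thresholds

# Percentage of retained matches that fall within the expected range
pct_in_range <- sapply(thresholds, function(thres) {
  kept <- g[g$weight >= thres, ]
  in_range <- kept$hourdiff >= expected_hourdiff[1] &
              kept$hourdiff <= expected_hourdiff[2]
  100 * mean(in_range)
})

validity <- data.frame(threshold = thresholds, pct_in_range = pct_in_range)

plot(validity$threshold, validity$pct_in_range, type = "l",
     xlab = "weight threshold", ylab = "% of matches within expected range")
```

In a curve like this, a reasonable threshold is one beyond which the percentage stops rising noticeably, since raising it further removes matches without improving time-based validity.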

Usage

time_based_validity(g, total_hourdiff, expected_hourdiff,
  min_weight = NA, lambda = log(2)/24, breaks = 100,
  hist_breaks = NA)

Arguments

g

The edgelist output of newsflow.compare (use the argument: return_as = "edgelist"). It has to come directly from newsflow.compare, with no intermediate operations such as subsetting, because this function requires certain attributes that are removed if g is changed. Also, the margin_attr argument in newsflow.compare has to be TRUE (the default).

total_hourdiff

The range of the hourdiff value in g. This should be the same as the hour.window in newsflow.compare (if g has not been subset afterwards).

expected_hourdiff

A vector of length 2 that indicates the range (including endpoints) in which you expect matches to occur, based on reasonable assumptions about the data. When matching events to news articles, a very reasonable assumption is that matches occur after the event took place, and a reasonable second assumption is that they occur within a limited amount of time after the event.
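As a small illustration of what "including endpoints" means, the check below treats values equal to either bound as inside the range. The hourdiff values and the range c(0, 36) are made up for the example; this is a sketch of the inclusive comparison, not the function's internal code.

```r
# Hypothetical hourdiff values and expected range (illustration only)
hourdiff <- c(-2, 0, 5, 36, 36.5)
expected_hourdiff <- c(0, 36)

# Inclusive check: values equal to either endpoint count as in range
in_range <- hourdiff >= expected_hourdiff[1] &
            hourdiff <= expected_hourdiff[2]
in_range
# [1] FALSE  TRUE  TRUE  TRUE FALSE
```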

min_weight

Filter out all matches with a weight below this value.

breaks

The number of breaks for the weight threshold

hist_breaks

The number of breaks in the histogram

Value

A plot. The plot data is also returned, so it can be assigned to a variable.


kasperwelbers/restecode documentation built on Feb. 12, 2020, 11:39 a.m.