(This howto (and source extraction in general) is work-in-progress. All feedback appreciated!)
Using graph transformations on the grammatical (dependency) structure of text, we can extract quotes/paraphrases and sources with reasonable accuracy (see e.g. my 2013 LSE Text as data paper). Technically, extracting sources is no different from extracting tokens. However, two strong caveats apply here:
Sources can be extracted using the amcat.gettokens command by specifying sources_nl as the module and adding sources=T as a filter:
library(amcatr)
# Connect to the AmCAT server
conn = amcat.connect("http://preview.amcat.nl")
# Request tokens with source information for a single article
t = amcat.gettokens(conn, project=403, articleset=10284,
                    module="sources_nl", filters=c(sources=T),
                    page_size=1, npages=1)
head(t, n=10)
As can be seen above, this retrieves tokens from the specified articleset, with two additional columns: source_id and source_place.
The former is the quote number within the article, and can be used to match specific quotes to specific sources.
The latter indicates whether a token is from the quote or from its source.
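To illustrate how the two columns work together, here is a minimal sketch that puts the text of each quote next to the text of its source. It runs on a small hand-made data frame rather than on real output: the word column, the token values, and the "source"/"quote" labels in source_place are assumptions made for this example, so check them against what amcat.gettokens actually returns. Because source_id counts quotes within an article, the article id (aid) is used alongside it as the key.

```r
# Toy stand-in for the data frame returned by amcat.gettokens.
# All values, including the "source"/"quote" labels, are assumptions.
tokens <- data.frame(
  aid          = c(1, 1, 1, 1, 1, 1, 1),
  word         = c("Rutte", "zegt", "dat", "het", "goed", "gaat", "."),
  source_id    = c(0, NA, 0, 0, 0, 0, NA),
  source_place = c("source", NA, "quote", "quote", "quote", "quote", NA),
  stringsAsFactors = FALSE
)

# Keep only tokens that belong to a quote or to its source
tagged <- tokens[!is.na(tokens$source_id), ]

# Paste the words together per article, quote number and role
pieces <- aggregate(word ~ aid + source_id + source_place, data = tagged,
                    FUN = paste, collapse = " ")

# One row per quote, with quote text and source text side by side
quotes <- reshape(pieces, idvar = c("aid", "source_id"),
                  timevar = "source_place", direction = "wide")
print(quotes)
```

The resulting data frame has one row per (aid, source_id) pair, with the columns word.quote and word.source holding the reconstructed text.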
As before, the keep= argument and pos1= filters can be used to reduce the amount of information requested:
library(amcatr)
# Connect to the AmCAT server
conn = amcat.connect("http://preview.amcat.nl")
# Request only tokens with pos1 "M" or "N", and keep only the listed columns
t = amcat.gettokens(conn, project=403, articleset=10284,
                    module="sources_nl",
                    filters=c(sources=T, pos1="M", pos1="N"),
                    keep=c("aid", "lemma", "pos1", "source_id", "source_place"),
                    page_size=1, npages=1)
tail(t, n=10)
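With the output reduced to these columns, one possible next step is to link the lemmas found in sources to the lemmas found in their quotes. The sketch below does this with a plain merge on aid and source_id. It uses a hand-made data frame in the shape of the keep= columns above; the lemmas, the "source"/"quote" labels, and the reading of pos1 "M" as names and "N" as nouns are assumptions for illustration only.

```r
# Toy data in the shape of the reduced token list (all values are assumptions)
toy <- data.frame(
  aid          = c(1, 1, 1, 2, 2),
  lemma        = c("Rutte", "economie", "groei", "Samsom", "economie"),
  pos1         = c("M", "N", "N", "M", "N"),
  source_id    = c(0, 0, 0, 0, 0),
  source_place = c("source", "quote", "quote", "source", "quote"),
  stringsAsFactors = FALSE
)

# Split the tokens into source tokens and quote tokens
src <- toy[toy$source_place == "source", c("aid", "source_id", "lemma")]
quo <- toy[toy$source_place == "quote", c("aid", "source_id", "lemma")]

# Pair every source lemma with every lemma in the same quote of the same article
pairs <- merge(src, quo, by = c("aid", "source_id"),
               suffixes = c(".source", ".quote"))

# Cross-tabulate which source lemma co-occurs with which quote lemma
tab <- table(pairs$lemma.source, pairs$lemma.quote)
print(tab)
```

Merging on both aid and source_id matters here: both toy articles have a quote numbered 0, and merging on source_id alone would wrongly pair sources with quotes from other articles.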