dict_match: Match a dictionary to a text column in a data.frame

View source: R/dictionary_functions.r

dict_match R Documentation

Match a dictionary to a text column in a data.frame

Description

dict_match is the most bare-bones version of the dict_* functions. For each match, it returns the indices of the matching rows in the df and dict data.frames. These indices can be used for all sorts of filtering and joining, but the most common use cases are also covered by the dict_filter and dict_add functions.

Usage

dict_match(
  df,
  dict,
  text_col = "text",
  context_col = NULL,
  index_col = NULL,
  mode = c("hits", "terms"),
  keep_longest = TRUE,
  as_ascii = FALSE,
  use_wildcards = TRUE,
  cache = NULL
)

Arguments

df

A data.frame (or tibble, data.table, whatever). The column named in the text_col argument (default "text") will be matched against the dictionary.

dict

A dictionary data.frame or a character vector. If data.frame, needs to have a column called 'string'. When importing a dictionary (e.g., from quanteda.dictionaries or textdata), please check out import_dictionary.

text_col

The column in df with the text to query. Defaults to 'text'.

context_col

Optionally, a column in df with context ids. If used, texts across rows are grouped together so that you can perform Boolean queries across rows. The primary use case is if texts are tokens/words, such as produced by tidytext, udpipe or spacyr.

index_col

Optionally, a column in df with indices for texts within a context. In particular, if texts are tokens, these are the token positions. This is only relevant if not all tokens are used, in which case the positions are otherwise unknown. The indices then need to be provided to correctly match multitoken strings and proximity queries.
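A minimal sketch of how context_col and index_col might be combined when some tokens have been dropped. The column names doc_id and token_id, the stopword list, and the query are assumptions made for illustration. The dict_match call is guarded so the snippet also runs where textquery is not installed, and no output is shown because it has not been verified here.

```r
# Tokens for one document, with their original positions
# (doc_id and token_id are hypothetical column names).
tokens <- data.frame(
  doc_id   = 1,
  token_id = 1:6,
  text     = c("This", "is", "just", "a", "simple", "example"),
  stringsAsFactors = FALSE
)

# Drop some tokens using an illustrative stopword list. Row order no longer
# reflects the true token positions, but token_id still does.
stopwords <- c("is", "a")
kept <- tokens[!tolower(tokens$text) %in% stopwords, ]

# A proximity query in the same form as the queries in the Examples section.
dict <- data.frame(string = "<just example>~3", stringsAsFactors = FALSE)

if (requireNamespace("textquery", quietly = TRUE)) {
  # index_col supplies the original positions of the remaining tokens
  m <- textquery::dict_match(kept, dict,
                             context_col = "doc_id",
                             index_col = "token_id")
  print(m)
}
```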

mode

There are two modes: "hits" and "terms". The "hits" mode prioritizes finding full and unique matches. For example, if we query <climate chang*>~10, then in the text "climate change is changing the world" we'll only find one unique hit for "climate change". Alternatively, in "terms" mode we would match "climate", "change" and "changing". "hits" mode is often what you want for counting occurrences. "terms" mode is especially useful if you are matching a dictionary to tokens, and want to match every token that satisfies the query.
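The difference between the modes can be explored with the example above, using one token per row. This is a sketch: the doc_id column name is an assumption, and the call is guarded so the snippet runs even where textquery is not installed. No output is shown because it has not been verified here.

```r
# The example sentence, one token per row, in a single context.
tokens <- data.frame(
  doc_id = 1,
  text   = c("climate", "change", "is", "changing", "the", "world"),
  stringsAsFactors = FALSE
)
dict <- data.frame(string = "<climate chang*>~10", stringsAsFactors = FALSE)

if (requireNamespace("textquery", quietly = TRUE)) {
  # "hits" mode: prioritizes full, unique matches
  hits <- textquery::dict_match(tokens, dict, context_col = "doc_id", mode = "hits")
  # "terms" mode: every token that satisfies the query
  terms <- textquery::dict_match(tokens, dict, context_col = "doc_id", mode = "terms")
  print(hits)
  print(terms)
}
```

Comparing the number of rows in the two results is a quick way to see how many tokens each mode attributes to the query.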

keep_longest

If TRUE, then in case of overlapping query strings in "hits" mode, the query with the most separate terms is kept. For example, in the text "mr. Bob Smith", the query [smith OR "bob smith"] would match "Bob" and "Smith". If keep_longest is FALSE, the match that is used is determined by the order in the query itself. The same query would then match only "Smith".

as_ascii

If TRUE, perform the search in ASCII. This can be useful if you know the text contains things like accents, and these are either used inconsistently or you simply can't be bothered to type them in your queries.

use_wildcards

Set to FALSE if you want to disable wildcards. This is useful, for instance, if you have a huge dictionary without wildcards that might have ? or * in emoticons and stuff. Note that you can also always escape wildcards with a double backslash (\\? or \\*).
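A short illustration of what the double backslash means in R source code, assuming it refers to how the escape is typed in an R string literal. Only base R is used; the dictionary entries are hypothetical.

```r
# In R source code, "\\?" is a two-character string: a backslash followed by "?".
escaped_q    <- "\\?"
escaped_star <- "\\*"

nchar(escaped_q)                 # 2
cat(escaped_q, "\n", sep = "")   # prints \?

# Hypothetical dictionary entries in which ? and * are meant literally.
dict <- data.frame(string = c("why\\?", "5\\*"), stringsAsFactors = FALSE)
cat(dict$string, sep = "\n")
```

If none of the dictionary strings use wildcards, setting use_wildcards = FALSE avoids the need for escaping altogether.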

cache

Cache the search index to speed up subsequent searches. If cache is a filename or path, the cache will be stored on disk. If cache is a number, the cache will be stored in memory, and the number indicates the maximum number of MB to keep in the cache. If NULL, no cache will be kept.
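The three forms of the cache argument could be used roughly as follows. This is a sketch under assumptions: the temporary file path and the 100 MB limit are illustrative values, and the calls are guarded so the snippet runs even where textquery is not installed.

```r
df <- data.frame(text = c("This is just a simple example", "Simple is good"),
                 stringsAsFactors = FALSE)
dict <- data.frame(string = c("<this is just>", "<a example>~3"),
                   stringsAsFactors = FALSE)

cache_path <- tempfile()  # hypothetical location for an on-disk cache
cache_mb   <- 100         # hypothetical in-memory limit, in MB

if (requireNamespace("textquery", quietly = TRUE)) {
  m_none <- textquery::dict_match(df, dict)                      # cache = NULL: no cache
  m_mem  <- textquery::dict_match(df, dict, cache = cache_mb)    # cache kept in memory
  m_disk <- textquery::dict_match(df, dict, cache = cache_path)  # cache stored on disk
}
```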

Value

dict_match: A data.table with matches, specifying the index of the df (data_index) and the index of the dict (dict_index). If mode = 'hits', a hit_id column indicates which matches of the same dict_index are part of the same hit. If mode = 'terms', a 'term' column shows which terms were matched.
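As a sketch of how the returned indices can be used for filtering and joining, the snippet below uses only base R. The matches table is a hand-written stand-in with the data_index and dict_index columns described above, not actual dict_match output, and the label column in dict is a hypothetical addition.

```r
df <- data.frame(text = c("This is just a simple example", "Simple is good"),
                 stringsAsFactors = FALSE)
dict <- data.frame(string = c("<this is just>", "<a example>~3"),
                   label  = c("phrase", "proximity"),  # hypothetical extra column
                   stringsAsFactors = FALSE)

# Hand-written stand-in for the result of dict_match(df, dict).
matches <- data.frame(data_index = c(1L, 1L), dict_index = c(1L, 2L))

# Filtering: keep the rows of df that have at least one match.
matched_rows <- df[unique(matches$data_index), , drop = FALSE]

# Joining: put the matched text next to the dictionary columns, one row per match.
joined <- data.frame(
  text = df$text[matches$data_index],
  dict[matches$dict_index, , drop = FALSE],
  stringsAsFactors = FALSE
)
print(joined)
```

For the common cases, the dict_filter and dict_add functions mentioned in the Description do this kind of work directly.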

Examples

dict = data.frame(string = c('<this is just>', '<a example>~3'))

## full text matches the text twice, once for each query
full_text = data.frame(text = c('This is just a simple example', 'Simple is good'))
dict_match(full_text, dict)

## tokens in context 'doc_id' matches token 1-3 for query 1, and 4&6 for query 2.
## note that the hit_id also shows that 1-3 belong together (hit_id 1 for query 1)
tokens = data.frame(
   text = c('This','is','just','a','simple','example', 'Simple', 'is','good'),
   doc_id = c(1,1,1,1,1,1,2,2,2))
dict_match(tokens, dict, context_col='doc_id')

kasperwelbers/textquery documentation built on Dec. 24, 2024, 12:47 a.m.