dict_match: Match a dictionary to a text column in a data.frame
In kasperwelbers/textquery: Dictionary search with advanced Boolean operators

Description

dict_match is the most bare-bones version of the dict_* functions. For each match, it returns the row indices in the df and dict data.frames. These indices can be used for all sorts of filtering and joining, but the most common use cases are also covered by the dict_filter and dict_add functions.

Usage

dict_match(
  df,
  dict,
  text_col = "text",
  context_col = NULL,
  index_col = NULL,
  mode = c("hits", "terms"),
  keep_longest = TRUE,
  as_ascii = FALSE,
  use_wildcards = TRUE,
  cache = NULL
)

Arguments

`df`	A data.frame (or tibble, data.table, whatever). The column specified in the text_col argument (default "text") will be matched to the dictionary.
`dict`	A dictionary data.frame or a character vector. If data.frame, needs to have a column called 'string'. When importing a dictionary (e.g., from quanteda.dictionaries or textdata), please check out `import_dictionary`.
`text_col`	The column in df with the text to query. Defaults to 'text'.
`context_col`	Optionally, a column in df with context ids. If used, texts across rows are grouped together so that you can perform Boolean queries across rows. The primary use case is if texts are tokens/words, such as produced by tidytext, udpipe or spacyr.
`index_col`	Optionally, a column in df with indices for texts within a context. In particular, if texts are tokens, these are the token positions. This is only relevant if not all tokens are used (for example, after removing stopwords), because the positions can then no longer be inferred from the row order. The indices then need to be provided to correctly match multitoken strings and proximity queries.
`mode`	There are two modes: "hits" and "terms". The "hits" mode prioritizes finding full and unique matches. For example, if we query <climate chang*>~10, then in the text "climate change is changing the world" we'll only find one unique hit for "climate change". Alternatively, in "terms" mode we would match "climate", "change" and "changing". "hits" mode is often what you want for counting occurrences. "terms" mode is especially useful if you are matching a dictionary to tokens, and want to match every token that satisfies the query.
`keep_longest`	If TRUE, then in case of overlapping query strings in "hits" mode, the query with the most separate terms is kept. For example, in the text "mr. Bob Smith", the query [smith OR "bob smith"] would match "Bob" and "Smith". If keep_longest is FALSE, the match that is used is determined by the order in the query itself. The same query would then match only "Smith".
`as_ascii`	If TRUE, perform the search in ASCII. Can be useful if you know the text contains things like accents, and these are either used inconsistently or you simply can't be bothered to type them in your queries.
`use_wildcards`	Set to FALSE if you want to disable wildcards. This is useful, for instance, if you have a huge dictionary without wildcards that might have ? or * in emoticons and stuff. Note that you can also always escape wildcards with a double backslash (\\? or \\*).
`cache`	Cache the search index to speed up subsequent searches. If cache is a filename or path, the cache will be stored on disk. If cache is a number, the cache will be stored in memory, and the number indicates the maximum number of MB to keep in cache. If NULL, no cache will be kept.
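
The interplay of context_col and index_col is easiest to see with tokens from which some rows have been dropped. The following is a minimal sketch, not taken from the package itself: the column name `token_id`, the ad hoc stopword list and the query string are assumptions made for illustration. The call to dict_match only runs if textquery is installed, and no output is shown because it has not been verified here.

```r
# Tokens for two documents, with each token's original position in its document.
tokens <- data.frame(
  text     = c("This", "is", "just", "a", "simple", "example", "Simple", "is", "good"),
  doc_id   = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  token_id = c(1, 2, 3, 4, 5, 6, 1, 2, 3)  # hypothetical column name
)

# Drop a few stopwords (an ad hoc list, for illustration only).
# After this step the row order no longer reflects the token positions.
stopwords <- c("is", "a")
filtered <- tokens[!tolower(tokens$text) %in% stopwords, ]

# A proximity query written in the same style as the examples in this page.
dict <- data.frame(string = "<just example>~5")

# Pass the original positions via index_col so that proximity is evaluated
# against them rather than against the row order of the filtered data.
if (requireNamespace("textquery", quietly = TRUE)) {
  matches <- textquery::dict_match(
    filtered, dict,
    context_col = "doc_id",
    index_col   = "token_id"
  )
  print(matches)
}
```

If no tokens have been removed, index_col can be left at NULL, since the positions within each context then follow from the row order.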

Value

dict_match: A data.table with matches, specifying the index of the df (data_index) and the index of the dict (dict_index). If mode = 'hits', a hit_id column indicates which matches of the same dict_index are part of the same hit. If mode = 'terms', a 'term' column shows which terms were matched.
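
Because the result only holds indices, the matched text and dictionary string have to be looked up with data_index and dict_index. The sketch below shows one way to do that in base R. The helper `join_matches` is hypothetical (it is not part of textquery), and the `matches` table is a hand-written stand-in with the columns described above; its values are illustrative and not copied from real output.

```r
# Attach the matched rows of `df` and `dict` to a match table that has
# data_index and dict_index columns. Hypothetical helper, not part of textquery.
join_matches <- function(df, dict, matches) {
  cbind(
    as.data.frame(matches),
    text   = df$text[matches$data_index],
    string = dict$string[matches$dict_index]
  )
}

full_text <- data.frame(text = c("This is just a simple example", "Simple is good"))
dict <- data.frame(string = c("<this is just>", "<a example>~3"))

# Stand-in for the result of dict_match(full_text, dict); illustrative values only.
matches <- data.frame(
  data_index = c(1, 1),
  dict_index = c(1, 2),
  hit_id     = c(1, 1)
)

# One row per match, with the text and the query string side by side.
joined <- join_matches(full_text, dict, matches)

# Keep only the rows of df that have at least one match.
matched_rows <- full_text[unique(matches$data_index), , drop = FALSE]
```

For the common cases of filtering df or adding dictionary columns to it, the dict_filter and dict_add functions mentioned in the Description do this work for you.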

Examples

dict = data.frame(string = c('<this is just>', '<a example>~3'))

## full text matches the text twice, once for each query
full_text = data.frame(text = c('This is just a simple example', 'Simple is good'))
dict_match(full_text, dict)

## tokens in context 'doc_id' matches token 1-3 for query 1, and 4&6 for query 2.
## note that the hit_id also shows that 1-3 belong together (hit_id 1 for query 1)
tokens = data.frame(
   text = c('This','is','just','a','simple','example', 'Simple', 'is','good'),
   doc_id = c(1,1,1,1,1,1,2,2,2))
dict_match(tokens, dict, context_col='doc_id')

kasperwelbers/textquery documentation built on Dec. 24, 2024, 12:47 a.m.