dict_filter: Filter a data.frame using a Boolean query
In kasperwelbers/textquery: Dictionary search with advanced Boolean operators

dict_filter

R Documentation

Filter a data.frame using a Boolean query

Description

This is a convenience function for using dictionary search to filter a data.frame.

Usage

dict_filter(
  df,
  dict,
  keep_context = TRUE,
  text_col = "text",
  context_col = NULL,
  index_col = NULL,
  keep_longest = TRUE,
  as_ascii = FALSE,
  use_wildcards = TRUE,
  cache = NULL
)

Arguments

`df`	A data.frame (or a tibble, data.table, or similar). The column named in the text_col argument (default "text") is matched against the dictionary.
`dict`	A dictionary data.frame or a character vector. If data.frame, needs to have a column called 'string'. When importing a dictionary (e.g., from quanteda.dictionaries or textdata), please check out `import_dictionary`.
`keep_context`	If TRUE, all rows within a context are selected if at least one of the rows matches the dictionary. Only relevant if context_col is used.
`text_col`	The column in df with the text to query. Defaults to 'text'.
`context_col`	Optionally, a column in df with context ids. If used, texts across rows are grouped together so that you can perform Boolean queries across rows. The primary use case is if texts are tokens/words, such as produced by tidytext, udpipe or spacyr.
`index_col`	Optionally, a column in df with indices for texts within a context. In particular, if texts are tokens, these are the token positions. This is only relevant if not all tokens are used, and we therefore don't know these positions. The indices then need to be provided to correctly match multitoken strings and proximity queries.
`keep_longest`	If TRUE, then in case of overlapping query strings in unique_hits mode, the query with the most separate terms is kept. For example, in the text "mr. Bob Smith", the query [smith OR "bob smith"] would match "Bob" and "Smith". If keep_longest is FALSE, the match that is used is determined by the order in the query itself. The same query would then match only "Smith".
`as_ascii`	If TRUE, perform the search in ASCII. This can be useful if the text contains characters such as accented letters that are used inconsistently, or that you do not want to type in your queries.
`use_wildcards`	Set to FALSE to disable wildcards. This is useful, for instance, if you have a huge dictionary without wildcards that might contain a literal ? or *, such as in emoticons. Note that you can also always escape wildcards with a double backslash (\\? or \\*).
`cache`	Cache the search index to speed up subsequent searches. If cache is a filename or path, the cache is stored on disk. If cache is a number, the cache is stored in memory, and the number indicates the maximum number of MB to keep in cache. If NULL, no cache is kept.
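
The effect of keep_context when context_col is used can be pictured with a small base-R sketch. This is not the package's implementation: the `matched` vector is a hypothetical stand-in for the rows that the dictionary search would flag, and the variable names are chosen for illustration only.

```r
tokens <- data.frame(
  text   = c("This", "is", "just", "a", "simple", "example", "Simple", "is", "good"),
  doc_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2)
)

# Hypothetical stand-in for the dictionary search: flag rows equal to "example"
matched <- tolower(tokens$text) == "example"

# keep_context = FALSE: only the rows that matched
rows_only <- tokens[matched, ]

# keep_context = TRUE: every row whose context contains at least one match
whole_context <- tokens[tokens$doc_id %in% tokens$doc_id[matched], ]
```

Here `rows_only` holds the single matching token, while `whole_context` holds all six tokens of document 1.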

Value

The input df, in its original class, filtered to the matched rows.

Examples

dict = data.frame(string = c('<this is just>', '<a example>~3'))

full_text = data.frame(text = c('This is just a simple example', 'Simple is good'))

## returns the matched row
dict_filter(full_text, dict)

## dict can also be a character vector for a simple lookup
dict_filter(full_text, 'simple AND good')

tokens = data.frame(
   text = c('This','is','just','a','simple','example', 'Simple', 'is','good'),
   doc_id = c(1,1,1,1,1,1,2,2,2))

## with a context_col, by default all rows of every matched context are returned
dict_filter(tokens, dict, context_col='doc_id')

## but can also return just the matched rows
dict_filter(tokens, dict, context_col='doc_id', keep_context=FALSE)
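
As a further illustration of the index_col argument, the sketch below assumes a token table from which some tokens ("is" and "a") have already been removed, so that row order no longer reflects token positions. The `token_id` column name and the `filtered_tokens` data are made up for this example, and the call is guarded so that it only runs if textquery is installed.

```r
# Tokens of "This is just a simple example" with "is" and "a" removed;
# token_id keeps each token's position in the original text.
filtered_tokens <- data.frame(
  text     = c("This", "just", "simple", "example"),
  doc_id   = c(1, 1, 1, 1),
  token_id = c(1, 3, 5, 6)
)

if (requireNamespace("textquery", quietly = TRUE)) {
  # Pass the original positions so that the multitoken query
  # is matched against real token positions, not row numbers
  hits <- textquery::dict_filter(
    filtered_tokens, "<simple example>",
    context_col = "doc_id", index_col = "token_id"
  )
  print(hits)
}
```

Without index_col, consecutive rows would be treated as adjacent tokens, which is not the case here for "This" and "just".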

kasperwelbers/textquery documentation built on Dec. 24, 2024, 12:47 a.m.