dict_filter: Filter a data.frame using a Boolean query

View source: R/dictionary_functions.r

dict_filter    R Documentation

Description

This is a convenience function for using dictionary search to filter a data.frame.

Usage

dict_filter(
  df,
  dict,
  keep_context = TRUE,
  text_col = "text",
  context_col = NULL,
  index_col = NULL,
  keep_longest = TRUE,
  as_ascii = FALSE,
  use_wildcards = TRUE,
  cache = NULL
)

Arguments

df

A data.frame (or tibble, data.table, etc.). The column named in the text_col argument (default "text") is matched against the dictionary.

dict

A dictionary data.frame or a character vector. If data.frame, needs to have a column called 'string'. When importing a dictionary (e.g., from quanteda.dictionaries or textdata), please check out import_dictionary.

keep_context

If TRUE, all rows within a context are selected if at least one row in that context matches the dictionary.

text_col

The column in df with the text to query. Defaults to 'text'.

context_col

Optionally, a column in df with context ids. If used, texts across rows are grouped together so that you can perform Boolean queries across rows. The primary use case is if texts are tokens/words, such as produced by tidytext, udpipe or spacyr.

index_col

Optionally, a column in df with indices for texts within a context. In particular, if texts are tokens, these are the token positions. This is only relevant if not all tokens are included in df, in which case the original positions cannot be inferred. The indices then need to be provided to correctly match multitoken strings and proximity queries.

keep_longest

If TRUE, then in case of overlapping query strings in unique_hits mode, the query with the most separate terms is kept. For example, in the text "mr. Bob Smith", the query [smith OR "bob smith"] would match "Bob" and "Smith". If keep_longest is FALSE, the match that is used is determined by the order in the query itself. The same query would then match only "Smith".

as_ascii

If TRUE, perform the search in ASCII. This can be useful if you know the text contains things like accents, and these are either used inconsistently or you simply can't be bothered to type them in your queries.

use_wildcards

Set to FALSE if you want to disable wildcards. This is useful, for instance, if you have a huge dictionary without wildcards that might contain ? or * in emoticons and the like. Note that you can also always escape wildcards with a double backslash (\\? or \\*).

cache

Cache the search index to speed up subsequent searches. If cache is a filename or path, the cache will be stored on disk. If cache is a number, the cache will be stored in memory, and the number indicates the maximum number of MB to keep in cache. If NULL, no cache will be kept.

Value

The input df in its original class, filtered to the matched rows.

Examples

dict = data.frame(string = c('<this is just>', '<a example>~3'))

full_text = data.frame(text = c('This is just a simple example', 'Simple is good'))

## returns the matched row
dict_filter(full_text, dict)

## dict can also be a character vector for a simple lookup
dict_filter(full_text, 'simple AND good')

tokens = data.frame(
   text = c('This','is','just','a','simple','example', 'Simple', 'is','good'),
   doc_id = c(1,1,1,1,1,1,2,2,2))

## for rows in a context, by default returns every matched context
dict_filter(tokens, dict, context_col='doc_id')

## but can also return just the matched rows
dict_filter(tokens, dict, context_col='doc_id', keep_context=FALSE)
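
The difference between the two keep_context settings can be illustrated with a minimal base R sketch. This is not how textquery implements the filter; the matched vector is a hypothetical stand-in for the result of a dictionary search, assumed here to hit only the token "example".

```r
## Hypothetical illustration in base R; not textquery's implementation.
tokens <- data.frame(
  text   = c("This", "is", "just", "a", "simple", "example", "Simple", "is", "good"),
  doc_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2)
)

## Assumed search result: pretend the query matched only the token "example"
matched <- tolower(tokens$text) == "example"

## Like keep_context = FALSE: keep only the rows that matched
rows_only <- tokens[matched, ]

## Like keep_context = TRUE: keep every row whose context has at least one match
matched_contexts <- unique(tokens$doc_id[matched])
whole_context <- tokens[tokens$doc_id %in% matched_contexts, ]

nrow(rows_only)      # 1
nrow(whole_context)  # 6
```

Under this assumption, the row-level selection keeps the single matching token, while the context-level selection keeps all six tokens of document 1 and none of document 2.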

kasperwelbers/textquery documentation built on Dec. 24, 2024, 12:47 a.m.