calculate_str_boundary: Identify boundary of string match

calculate_str_boundary    R Documentation

Identify boundary of string match

Description

calculate_str_boundary uses a pair of boundary patterns and a target pattern located between them to identify a chunk of interest within a string.

Usage

calculate_str_boundary(
  string,
  boundaries,
  target,
  match_index = 1,
  return_as_index = TRUE
)

Arguments

string

A character object of length 1.

boundaries

A character vector of length 2 giving the opening and closing boundary patterns, combined with c(), e.g. c('<script>', '</script>').

target

A character object for the REGEX match within boundary.

match_index

Integer, determines which match to use if more than one is found (default: 1).

return_as_index

Logical value. If TRUE (default), returns the start and end positions of the match within the provided string; otherwise returns the exact text of the match.

Details

Although regular expressions can be used directly to achieve similar results (e.g. with lookaheads), this function provides a simple way to find a pattern within a particular boundary. This can be useful when editing HTML files, where one wants to excise or adjust text between tags (e.g. <script></script>). The logic is as follows: (a) identify all points in the string where the boundaries and target are found, (b) calculate the differences between every boundary match and the target match, (c) determine which boundaries are closest to the start and end of the target match, (d) return the entire range spanned by the boundaries and the target, either as a vector of start/end locations or as the text content of that range.
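
Steps (a) to (d) can be illustrated with a minimal base-R sketch. This is not the package's implementation: sketch_str_boundary is a hypothetical helper written only for illustration, and it assumes the boundaries are literal text while the target is a regular expression.

```r
# Hypothetical helper (not part of farrago): a base-R sketch of steps (a)-(d).
# Assumption: boundaries are matched as literal text, target as a regex.
sketch_str_boundary <- function(string, boundaries, target,
                                match_index = 1, return_as_index = TRUE) {
  # (a) Locate every match of a pattern as start/end positions
  locate <- function(pattern, fixed = FALSE) {
    m <- gregexpr(pattern, string, fixed = fixed)[[1]]
    if (m[1] == -1) return(NULL)
    start <- as.integer(m)
    data.frame(start = start, end = start + attr(m, "match.length") - 1L)
  }
  open_loc   <- locate(boundaries[1], fixed = TRUE)
  close_loc  <- locate(boundaries[2], fixed = TRUE)
  target_loc <- locate(target)
  if (is.null(open_loc) || is.null(close_loc) || is.null(target_loc)) {
    stop("boundaries and target must each match at least once")
  }
  if (match_index > nrow(target_loc)) stop("match_index exceeds number of matches")
  tgt <- target_loc[match_index, ]

  # (b) Distances between each boundary match and the chosen target match
  open_gap  <- tgt$start - open_loc$end    # > 0: boundary ends before the target
  close_gap <- close_loc$start - tgt$end   # > 0: boundary starts after the target

  # (c) Keep the nearest opening boundary before and closing boundary after
  open_ok  <- open_gap > 0
  close_ok <- close_gap > 0
  if (!any(open_ok) || !any(close_ok)) stop("target is not enclosed by boundaries")
  start_pos <- open_loc$start[open_ok][which.min(open_gap[open_ok])]
  end_pos   <- close_loc$end[close_ok][which.min(close_gap[close_ok])]

  # (d) Return the full range, as positions or as text
  if (return_as_index) c(start_pos, end_pos) else substr(string, start_pos, end_pos)
}

test_data <- '<head><script>RANDOMTEXT</script><script>TARGET.TEXT, OTHER RANDOMTEXT</script><script>RANDOMTEXT</script></head>'
idx <- sketch_str_boundary(test_data, c('<script>', '</script>'), 'TARGET\\.TEXT')
chunk <- sketch_str_boundary(test_data, c('<script>', '</script>'), 'TARGET\\.TEXT',
                             return_as_index = FALSE)
```

In this sketch only boundaries that lie entirely outside the target match are considered, so a boundary overlapping the target is ignored; the real function may handle such cases differently.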

To vectorize over several strings and patterns, it is recommended to use a for loop, apply family, or purrr functions (e.g. pmap).
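
One way to apply this advice is a thin wrapper around mapply. The sketch below is illustrative only: vectorize_boundary and its fun argument are hypothetical names, the input strings are made up, and the final call assumes the farrago package is installed and exports calculate_str_boundary.

```r
# Hypothetical wrapper: applies a boundary-finding function to paired
# strings and targets, sharing one set of boundaries across all pairs.
vectorize_boundary <- function(strings, targets, boundaries, fun) {
  mapply(
    function(s, t) fun(string = s, boundaries = boundaries, target = t),
    strings, targets,
    SIMPLIFY = FALSE, USE.NAMES = FALSE
  )
}

# Made-up inputs: one target per string
strings <- c("<p>keep</p><p>DROP.ME</p>", "<p>ALSO.DROP</p><p>keep</p>")
targets <- c("DROP\\.ME", "ALSO\\.DROP")

# Assumption: farrago is installed and exports calculate_str_boundary
if (requireNamespace("farrago", quietly = TRUE)) {
  ranges <- vectorize_boundary(strings, targets, c("<p>", "</p>"),
                               farrago::calculate_str_boundary)
}
```

The result is a list with one element per string, which keeps start/end pairs separate and avoids simplification to a matrix.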

Value

Either a vector of start and end points for the match, or a character value of the entire matched range in the provided string.

Examples

## Not run: 
# Load libraries
library(dplyr); library(stringr); library(magrittr)

# Create fake text
test_data <- '<head><script>RANDOMTEXT</script><script>TARGET.TEXT, OTHER RANDOMTEXT</script><script>RANDOMTEXT</script></head>'

# Determine match
target_chunk <- calculate_str_boundary(string = test_data,
                                       boundaries = c('<script>', '</script>'),
                                       target = 'TARGET\\.TEXT')

# Delete the matched range (boundaries included) from the initial text
stringr::str_sub(test_data, target_chunk[1], target_chunk[2]) <- ''

## End(Not run)

al-obrien/farrago documentation built on April 14, 2023, 6:20 p.m.