knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
String-heavy datasets are common in ecology. Species names arrive with typos, observer notes contain free-text descriptions, and site labels mix upper and lower case in ways that break joins. Cleaning and matching these strings is often the slowest part of an analysis pipeline, both in wall-clock time and in the number of lines of R code required.
vectra handles all string operations inside its C engine, operating directly
on raw byte buffers rather than going through R's internal CHARSXP string
pool. When you write mutate(name_clean = tolower(trimws(species))), that
expression is serialized into an execution plan and evaluated in C as data
flows through the pipeline. No temporary R character vectors are allocated, no
intermediate copies exist, and the global string cache that R maintains for
every unique character value is bypassed entirely.
This architecture has two practical consequences. First, string operations
scale linearly with data size and do not suffer from R's O(n log n) string
interning overhead. Second, the CPU-intensive operations (regex matching,
edit-distance computation) can be parallelized with OpenMP because the C
engine owns the memory layout. For datasets with more than a thousand rows,
vectra automatically distributes regex matching in grepl(), along with
levenshtein(), dl_dist(), and jaro_winkler(), across the available cores.
This vignette walks through every string operation vectra supports, from
basic transformations through regex pattern matching to fuzzy joins. We will
use ecological data throughout: species names with realistic errors, site
labels from field surveys, and observer notes that need cleaning before
analysis. The examples are deliberately small so that outputs fit on screen,
but every operation shown here works identically on datasets with millions of
rows, streaming through memory one row group at a time without materializing
the full result until collect() is called.
We will start with a small dataset of tree observations where names need cleaning. The data has trailing spaces, inconsistent case, and a few columns we will want to slice apart.
library(vectra)

trees <- data.frame(
  species = c("Quercus robur ", " fagus sylvatica", "ACER platanoides",
              "Betula pendula", " Pinus SYLVESTRIS "),
  site = c("plot_A01", "plot_B12", "plot_A01", "plot_C03", "plot_B12"),
  observer = c("J. Smith", "A. Mueller", "J. Smith", "B. Novak", "A. Mueller"),
  dbh_cm = c(45.2, 38.1, 22.7, 31.0, 55.3),
  stringsAsFactors = FALSE
)

f <- tempfile(fileext = ".vtr")
write_vtr(trees, f)
trimws() strips leading and trailing whitespace from every value in a
column. The C engine operates on the raw bytes, so the overhead per string is
a single pass to find the first and last non-space characters. We apply it
inside mutate() just as we would in base R.
tbl(f) |>
  mutate(species_clean = trimws(species)) |>
  select(species, species_clean) |>
  collect()
The original column is preserved alongside the cleaned version. If we no
longer need it, a subsequent select() drops it from the pipeline before any
downstream node sees it.
toupper() and tolower() convert ASCII characters in place. These are pure
byte-level operations (A-Z mapped to a-z and vice versa), so they are
effectively free compared to the I/O cost of reading the column from disk.
tbl(f) |>
  mutate(species_lower = tolower(trimws(species)),
         species_upper = toupper(trimws(species))) |>
  select(species_lower, species_upper) |>
  collect()
Chaining tolower(trimws(...)) creates a single expression node in the plan.
The engine evaluates the innermost function first and passes the result
directly to the outer one, with no intermediate allocation visible from R.
nchar() returns the byte length of each string as a double (vectra's
integer type). This is useful for filtering out suspiciously short or long
entries that might indicate data-entry errors.
tbl(f) |>
  mutate(name_len = nchar(trimws(species))) |>
  select(species, name_len) |>
  collect()
Because vectra operates on raw bytes, nchar() returns the byte length, not
the number of Unicode code points. For pure ASCII text (English letters,
digits, common punctuation), bytes and characters are the same thing. But for
multi-byte UTF-8 characters such as accented letters ("ü" = 2 bytes, "é" = 2
bytes) or CJK ideographs (3 bytes each), the reported length will be higher
than what base::nchar() returns in R. The same applies to substr(): its
start and stop positions refer to byte offsets. In ecological data, this
rarely matters because taxonomic names and site codes are overwhelmingly
ASCII, but it is worth keeping in mind if your data contains place names with
diacritics or observer names in non-Latin scripts.
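A quick way to see the distinction in plain R is nchar()'s type argument;
vectra's nchar() corresponds to the type = "bytes" count:

x <- "Müller"
nchar(x, type = "chars")  # 6 characters
nchar(x, type = "bytes")  # 7 bytes: "ü" occupies 2 bytes in UTF-8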
substr(x, start, stop) and substring(x, first, last) both extract a
portion of each string by position. Positions are 1-based, following R
convention. These are useful for pulling fixed-width codes out of identifier
columns.
tbl(f) |>
  mutate(plot_letter = substr(site, 6, 6),
         plot_number = substr(site, 7, 8)) |>
  select(site, plot_letter, plot_number) |>
  collect()
The site codes follow a plot_X## pattern, so character 6 is the letter and
characters 7-8 are the numeric part. Both are returned as strings.
startsWith() and endsWith() return a logical column. They take a literal
string argument (not a column reference) and match it against the beginning or
end of each value. Because the comparison is a simple memcmp on the byte
prefix or suffix, it runs faster than a regex equivalent.
tbl(f) |>
  filter(startsWith(site, "plot_A")) |>
  select(species, site) |>
  collect()
tbl(f) |>
  mutate(is_mueller = endsWith(observer, "Mueller")) |>
  select(observer, is_mueller) |>
  collect()
startsWith and endsWith are particularly useful in filter() for
subsetting data by site code prefixes or file path suffixes, since the C
engine can evaluate them without compiling a regex.
paste() and paste0() concatenate an arbitrary number of arguments into a
single string column. These are N-ary functions: you can pass two columns,
five columns, or a mix of columns and literal strings. The C engine's
two-pass string builder computes the total output length across all arguments
first, allocates a single buffer, and fills it in one sweep.
paste0() joins its arguments with no separator, which is the common case
for building identifiers or composite keys.
tbl(f) |>
  mutate(record_id = paste0(site, "_", trimws(species))) |>
  select(record_id) |>
  collect()
Here we pass three arguments: the site column, a literal underscore, and
the trimmed species name. The engine handles mixed column-and-literal
arguments natively.
paste() uses a space as the default separator, matching base R behaviour.
The sep parameter overrides this.
tbl(f) |>
  mutate(label = paste(observer, site, sep = " @ ")) |>
  select(label) |>
  collect()
When building composite labels from more than two columns, passing all of
them in a single paste() call is cleaner than nesting multiple paste0()
calls.
tbl(f) |>
  mutate(full_label = paste(observer, site, trimws(species), sep = " | ")) |>
  select(full_label) |>
  collect()
The engine constructs each output string in a single allocation regardless of how many arguments are passed, so there is no performance penalty for concatenating five columns versus two.
vectra's grepl(), gsub(), and sub() accept a fixed parameter that
controls whether the pattern is treated as a literal byte sequence or as a
POSIX regular expression. When fixed = TRUE (the default in vectra), the
engine uses a direct substring search, which avoids the cost of compiling a
regex automaton. For exact substring tests and simple replacements, this is
the right choice.
Note that vectra defaults to fixed = TRUE, which differs from base R where
fixed defaults to FALSE. We chose this default because in our experience
the majority of string filtering and replacement tasks in data cleaning
involve literal substrings, not regex patterns. When you need actual regex,
pass fixed = FALSE explicitly. This is covered in the next section.
grepl(pattern, x, fixed = TRUE) returns a boolean column that is TRUE
wherever the pattern appears as a literal substring. We use it here to find
all records where the observer's surname contains "Smith".
tbl(f) |>
  filter(grepl("Smith", observer, fixed = TRUE)) |>
  collect()
Only one row matches. The fixed-string search scans each byte buffer with a Boyer-Moore-style skip, so it scales well even on columns with long string values.
gsub(pattern, replacement, x, fixed = TRUE) replaces every occurrence of
the pattern; sub() replaces only the first. Both accept fixed = TRUE for
literal matching and return a new string column.
A common cleaning task is standardizing separators. Suppose our site codes sometimes use underscores and sometimes hyphens.
messy_sites <- data.frame(
  site = c("plot_A01", "plot-B12", "plot_A01", "plot-C03", "plot_B12"),
  stringsAsFactors = FALSE
)
f2 <- tempfile(fileext = ".vtr")
write_vtr(messy_sites, f2)

tbl(f2) |>
  mutate(site_clean = gsub("-", "_", site, fixed = TRUE)) |>
  collect()
Every hyphen becomes an underscore. Since fixed = TRUE is the default,
writing gsub("-", "_", site) without the explicit argument produces the
same result.
sub() replaces only the first occurrence. This is useful when a pattern
might appear multiple times and we want to change just the leading instance.
notes <- data.frame(
  note = c("tree dead, bark loose, dead branches",
           "alive, healthy canopy",
           "dead standing, no bark"),
  stringsAsFactors = FALSE
)
f3 <- tempfile(fileext = ".vtr")
write_vtr(notes, f3)

tbl(f3) |>
  mutate(note_edited = sub("dead", "DEAD", note, fixed = TRUE)) |>
  select(note, note_edited) |>
  collect()
In the first row, only the first "dead" is uppercased; the second "dead" in
"dead branches" remains unchanged. gsub would have replaced both.
When literal substring matching is not enough, vectra supports full POSIX
regular expressions by passing fixed = FALSE. The C engine compiles each
regex pattern once per batch using regcomp() and applies it across all rows.
For batches with more than 1000 rows, the matching loop is parallelized with
OpenMP, with each thread compiling its own copy of the regex for thread
safety.
The regex dialect is POSIX Extended Regular Expressions (ERE), the same
dialect used by grep -E on Unix systems. This means character classes like
[A-Z], quantifiers +, *, ?, alternation with |, and grouping with
() all work as expected. Backreferences in replacement strings (\\1,
\\2) refer to captured groups. Perl-specific extensions like lookaheads,
lookbehinds, and non-greedy quantifiers (*?) are not supported; this is a
deliberate constraint that keeps the regex engine simple and predictable.
We will build a dataset with species names that have various formatting
issues. The regex ^[A-Z][a-z]+ [a-z]+$ matches properly formatted binomial
names (capital genus, lowercase epithet, single space).
species_list <- data.frame(
  name = c("Quercus robur", "fagus sylvatica", "ACER PLATANOIDES",
           "Betula pendula", "Pinus  sylvestris",  # note the double space
           "Tilia cordata"),
  stringsAsFactors = FALSE
)
f4 <- tempfile(fileext = ".vtr")
write_vtr(species_list, f4)

tbl(f4) |>
  mutate(valid_format = grepl("^[A-Z][a-z]+ [a-z]+$", name, fixed = FALSE)) |>
  collect()
Three names fail: "fagus sylvatica" (lowercase genus), "ACER PLATANOIDES" (all caps), and "Pinus sylvestris" (double space). This kind of validation pass is a useful first step before attempting joins against a reference taxonomy.
gsub() with fixed = FALSE supports POSIX regex in the pattern and
backreferences in the replacement string. Backreferences \\1, \\2, etc.
refer to captured groups in the pattern.
Suppose we want to reformat species names from "Genus epithet" to "G. epithet" for compact display labels.
tbl(f4) |>
  mutate(short_name = gsub("^([A-Za-z])[a-z]+ ([a-z]+)$", "\\1. \\2",
                           name, fixed = FALSE)) |>
  select(name, short_name) |>
  collect()
The pattern captures the first letter of the genus as group \\1 and the
entire epithet as group \\2. Rows that do not match the pattern are returned
unchanged, which is consistent with base R's gsub behaviour.
sub() with regex works the same way but replaces only the first match. This
is handy for stripping a leading prefix that varies in length.
records <- data.frame(
  code = c("ID:001-Quercus", "ID:042-Fagus", "ID:007-Betula", "ID:113-Pinus"),
  stringsAsFactors = FALSE
)
f5 <- tempfile(fileext = ".vtr")
write_vtr(records, f5)

tbl(f5) |>
  mutate(genus = sub("^ID:[0-9]+-", "", code, fixed = FALSE)) |>
  select(code, genus) |>
  collect()
The regex ^ID:[0-9]+- matches the literal "ID:" prefix, one or more digits,
and the trailing hyphen. Replacing it with an empty string leaves just the
genus name.
str_extract(x, pattern) returns the first match of the regex pattern, or
NA if no match is found. When the pattern contains a capture group (round
brackets), str_extract returns the contents of that group rather than the
full match.
This is particularly useful for pulling structured components out of semi-formatted strings. Here we extract numeric plot identifiers from site codes.
sites <- data.frame(
  site_code = c("Forest_Plot_042", "Meadow_Transect_007",
                "Forest_Plot_113", "Wetland_Quad_019"),
  stringsAsFactors = FALSE
)
f6 <- tempfile(fileext = ".vtr")
write_vtr(sites, f6)

tbl(f6) |>
  mutate(plot_num = str_extract(site_code, "([0-9]+)")) |>
  select(site_code, plot_num) |>
  collect()
The capture group ([0-9]+) matches one or more digits. Since there is a
group, str_extract returns the group content. Without the parentheses, it
would return the full match, which in this case is the same.
A more interesting example extracts the habitat type from the site code. The pattern uses a group to capture everything before the first underscore.
tbl(f6) |>
  mutate(habitat = str_extract(site_code, "^([A-Za-z]+)_")) |>
  select(site_code, habitat) |>
  collect()
The group captures "Forest", "Meadow", "Forest", "Wetland" while the
trailing underscore is part of the match but not part of the group. This
approach is cleaner than combining substr with a computed position.
Exact string matching breaks down when data comes from multiple sources with inconsistent spelling. A field botanist writes "Qurecus robur" (transposed letters), a database export has "Fagus silvatica" (single-letter substitution), and a citizen-science platform records "quercus ROBUR" (case mismatch). Fuzzy string matching quantifies how different two strings are and lets us match records that are close enough.
vectra provides three families of distance functions, all usable inside
mutate() and filter(). Each takes two string arguments (columns or
literals) and returns a numeric distance or similarity score. The raw variants
(levenshtein, dl_dist) return integer edit counts; the normalized variants
(levenshtein_norm, dl_dist_norm) divide by the length of the longer
string to produce a value between 0 and 1; and jaro_winkler returns a
similarity score where 1 means identical.
levenshtein(x, y) counts the minimum number of single-character edits
(insertions, deletions, substitutions) needed to transform x into y. The
result is a non-negative integer. levenshtein_norm(x, y) divides by the
length of the longer string, producing a value between 0 (identical) and 1
(completely different).
typos <- data.frame(
  field_name = c("Qurecus robur", "Fagus silvatica", "Acer platanodes",
                 "Betula pendula", "Pinus sylvestrs"),
  ref_name = c("Quercus robur", "Fagus sylvatica", "Acer platanoides",
               "Betula pendula", "Pinus sylvestris"),
  stringsAsFactors = FALSE
)
f7 <- tempfile(fileext = ".vtr")
write_vtr(typos, f7)

tbl(f7) |>
  mutate(lev = levenshtein(field_name, ref_name),
         lev_norm = levenshtein_norm(field_name, ref_name)) |>
  collect()
"Qurecus robur" vs "Quercus robur" has a Levenshtein distance of 2 (the transposed "re" to "er" counts as two operations: delete and insert, or two substitutions depending on the alignment). "Betula pendula" matches exactly, giving a distance of 0.
dl_dist(x, y) extends Levenshtein by counting transpositions of two
adjacent characters as a single edit operation. This better reflects the
kinds of errors humans make when typing. dl_dist_norm(x, y) is the
normalized variant.
tbl(f7) |>
  mutate(dl = dl_dist(field_name, ref_name),
         dl_norm = dl_dist_norm(field_name, ref_name)) |>
  collect()
Compare the first row: "Qurecus" vs "Quercus" is a transposition of "r" and "e". Levenshtein counts this as 2 edits, but Damerau-Levenshtein counts it as 1. For human-typed data, DL distance is usually the better default because transposition errors are common.
jaro_winkler(x, y) returns a similarity score between 0 and 1, where 1
means the strings are identical. Unlike the edit-distance functions, higher
values indicate better matches. Jaro-Winkler gives extra weight to matching
prefixes, making it particularly effective for personal names and short
strings where the first few characters are usually correct.
tbl(f7) |>
  mutate(jw = jaro_winkler(field_name, ref_name)) |>
  collect()
The scores are all above 0.9 for these examples because the errors are small relative to string length. Jaro-Winkler is less informative for long strings with small differences (everything scores above 0.95) but excels at distinguishing between short strings like surnames.
The choice depends on the error profile of your data:
Damerau-Levenshtein (dl_dist / dl_dist_norm): the best general
default. Handles insertions, deletions, substitutions, and transpositions.
Use this for species names, place names, and any free-text field where typos
are the primary error source.
Levenshtein (levenshtein / levenshtein_norm): appropriate when
transpositions are rare or when you need compatibility with systems that
use classical Levenshtein. The scores will be slightly more conservative
(higher distance) than DL for transposed strings.
Jaro-Winkler (jaro_winkler): best for short strings (under 15
characters) and personal names. The prefix weighting helps when the first
few characters are reliable and errors tend to occur later in the string.
Less useful for long species names where a single-character change barely
moves the score.
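To see these trade-offs side by side, all three normalized metrics can be
computed in one mutate() over the typos table from earlier (every function
used here was introduced above):

tbl(f7) |>
  mutate(lev = levenshtein_norm(field_name, ref_name),
         dl  = dl_dist_norm(field_name, ref_name),
         jw  = jaro_winkler(field_name, ref_name)) |>
  collect()

Only the first row, whose error is a pure transposition, shows dl below lev;
jw stays high for every pair because each shares a long correct prefix.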
In practice, we often want to flag records that exceed some distance threshold from a reference value. Here we filter for species entries that are within a normalized DL distance of 0.15 from "Quercus robur".
field_data <- data.frame(
  species = c("Qurecus robur", "Quercus robor", "Fagus sylvatica",
              "Quercus robur", "Quercis rubur", "Betula pendula"),
  plot = c("A1", "A2", "B1", "A3", "B2", "C1"),
  stringsAsFactors = FALSE
)
f8 <- tempfile(fileext = ".vtr")
write_vtr(field_data, f8)

tbl(f8) |>
  mutate(dist = dl_dist_norm(species, "Quercus robur")) |>
  filter(dist < 0.15) |>
  collect()
Only entries close to "Quercus robur" survive the filter. "Fagus sylvatica" and "Betula pendula" are too distant, and "Quercis rubur" sits just outside the boundary: two substitutions over thirteen characters gives a normalized distance of 2/13 ≈ 0.154, so it narrowly fails the 0.15 threshold.
When matching against multiple candidates, sorting by distance gives us the best matches first.
tbl(f8) |>
  mutate(sim = jaro_winkler(species, "Quercus robur")) |>
  arrange(desc(sim)) |>
  collect()
The exact match scores 1.0, and the candidates with small typos rank next. This pattern is useful for building a shortlist of probable matches to review manually.
Fuzzy matching within mutate() works when we already have paired columns to
compare. More often, we have two separate tables and need to find which rows
in a messy dataset correspond to which rows in a clean reference. This is
where fuzzy_join() comes in.
fuzzy_join(x, y, by, method, max_dist) joins two vectra tables using
approximate string matching. The by argument is a named character vector of
length 1 specifying which columns to compare. The result includes all columns
from both sides plus a fuzzy_dist column with the normalized distance of
each match.
ref <- data.frame(
  canonical = c("Quercus robur", "Fagus sylvatica", "Acer platanoides",
                "Betula pendula", "Pinus sylvestris", "Tilia cordata"),
  family = c("Fagaceae", "Fagaceae", "Sapindaceae",
             "Betulaceae", "Pinaceae", "Malvaceae"),
  stringsAsFactors = FALSE
)
f_ref <- tempfile(fileext = ".vtr")
write_vtr(ref, f_ref)

messy <- data.frame(
  field_species = c("Qurecus robur", "Fagus silvatica", "Acer platanodes",
                    "Betla pendula", "Pinis sylvestris"),
  count = c(12L, 7L, 3L, 15L, 9L),
  stringsAsFactors = FALSE
)
f_messy <- tempfile(fileext = ".vtr")
write_vtr(messy, f_messy)

fuzzy_join(
  tbl(f_messy), tbl(f_ref),
  by = c("field_species" = "canonical"),
  method = "dl",
  max_dist = 0.25
) |> collect()
Each messy field name finds its closest match in the reference table. The
fuzzy_dist column shows how far apart they are, which is useful for
flagging matches that might need manual verification (say, anything above
0.15).
The method parameter selects the distance algorithm. The three options are
"dl" (Damerau-Levenshtein, the default), "levenshtein", and "jw"
(Jaro-Winkler). When using "jw", max_dist is interpreted as the maximum
dissimilarity (i.e., 1 - jaro_winkler_score), so max_dist = 0.2 keeps
matches with a Jaro-Winkler score of 0.8 or higher.
fuzzy_join(
  tbl(f_messy), tbl(f_ref),
  by = c("field_species" = "canonical"),
  method = "jw",
  max_dist = 0.15
) |> collect()
Jaro-Winkler produces a tighter match set here because it penalizes differences in the prefix more heavily.
Without blocking, fuzzy_join compares every probe row against every
build row. For large tables this is quadratic. The block_by parameter
specifies a column for exact-match blocking: only pairs that share the same
blocking key are compared.
In a taxonomic context, the natural blocking key is the genus. We split the genus out on both sides and block on it, so "Qurecus robur" is compared only against reference rows whose blocking column is exactly "Quercus".
ref_blocked <- data.frame(
  genus = c("Quercus", "Fagus", "Acer", "Betula", "Pinus", "Tilia"),
  canonical = c("Quercus robur", "Fagus sylvatica", "Acer platanoides",
                "Betula pendula", "Pinus sylvestris", "Tilia cordata"),
  family = c("Fagaceae", "Fagaceae", "Sapindaceae",
             "Betulaceae", "Pinaceae", "Malvaceae"),
  stringsAsFactors = FALSE
)
f_ref2 <- tempfile(fileext = ".vtr")
write_vtr(ref_blocked, f_ref2)

messy_blocked <- data.frame(
  genus_field = c("Quercus", "Fagus", "Acer", "Betula", "Pinus"),
  field_species = c("Qurecus robur", "Fagus silvatica", "Acer platanodes",
                    "Betla pendula", "Pinis sylvestris"),
  count = c(12L, 7L, 3L, 15L, 9L),
  stringsAsFactors = FALSE
)
f_messy2 <- tempfile(fileext = ".vtr")
write_vtr(messy_blocked, f_messy2)

fuzzy_join(
  tbl(f_messy2), tbl(f_ref2),
  by = c("field_species" = "canonical"),
  method = "dl",
  max_dist = 0.25,
  block_by = c("genus_field" = "genus")
) |> collect()
With blocking, each probe row is compared against at most one reference row (the one sharing its genus) instead of all six. On real datasets with thousands of reference species, this reduces the number of distance computations by orders of magnitude.
The n_threads parameter controls how many OpenMP threads the distance
computations use. The default is 4, which is a reasonable starting point. On
a workstation with many cores, increasing it to 8 or 16 can help for large
joins. On a laptop where other processes need CPU time, reducing it to 2 is
reasonable.
fuzzy_join(
  tbl(f_messy), tbl(f_ref),
  by = c("field_species" = "canonical"),
  method = "dl",
  max_dist = 0.25,
  n_threads = 8L
) |> collect()
fuzzy_join() is a streaming operation: the probe side flows through the
pipeline and the build side is materialized into a hash table internally. But
sometimes we need to look up the same reference repeatedly with different
query sets, and rebuilding the hash table each time is wasteful.
materialize() converts a vectra node into a persistent in-memory block. Once
materialized, we can probe it with block_lookup() (exact matching) or
block_fuzzy_lookup() (fuzzy matching) as many times as needed without
re-reading the source data.
block_lookup(block, column, keys) performs a hash-based exact match on a
string column. The hash index is built lazily on the first call and cached for
subsequent calls.
blk <- tbl(f_ref) |>
  select(canonical, family) |>
  materialize()

block_lookup(blk, "canonical", c("Quercus robur", "Betula pendula"))
The result includes a query_idx column that maps each result row back to
the input key vector (1-based). Queries that match multiple rows in the block
return all matches.
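For example, assuming the lookup result collects to an ordinary data frame
(as the printed examples suggest), query_idx lets us attach the original
query string to each matched row:

keys <- c("Quercus robur", "Betula pendula")
res <- block_lookup(blk, "canonical", keys)
res$query <- keys[res$query_idx]  # map each match back to its query
res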
Field data often has inconsistent capitalization. The ci = TRUE parameter
performs case-insensitive matching without requiring a separate lowercased
column in the reference.
block_lookup(blk, "canonical", c("quercus robur", "BETULA PENDULA"), ci = TRUE)
Both queries match despite the case mismatch. Internally, the engine builds a separate case-folded hash index on first use and caches it alongside the case-sensitive one.
block_fuzzy_lookup() computes string distances between query keys and every
row in the block (or, with blocking, only rows sharing a blocking key). The
interface mirrors fuzzy_join() but operates on a materialized block.
blk2 <- tbl(f_ref2) |>
  select(genus, canonical, family) |>
  materialize()

block_fuzzy_lookup(
  blk2, "canonical",
  keys = c("Qurecus robur", "Fagus silvatica"),
  method = "dl",
  max_dist = 0.2,
  block_col = "genus",
  block_keys = c("Quercus", "Fagus"),
  n_threads = 2L
)
The block_col and block_keys arguments work together: the first query
("Qurecus robur") is only compared against rows where genus == "Quercus",
and the second query against rows where genus == "Fagus". This is the
same blocking logic as fuzzy_join() but applied to a materialized block
that can be reused across multiple lookup calls without re-reading the
source data from disk.
String operations in vectra fall into three performance tiers. Understanding which tier an operation belongs to helps predict where the bottleneck will be in a pipeline.
toupper, tolower, trimws, nchar, substr, startsWith, endsWith,
and paste/paste0 all operate on raw bytes with O(n) single-pass
algorithms. On a modern SSD, reading a 100 MB string column takes roughly
50 ms; applying tolower to it adds less than 1 ms. In practice, these
operations contribute negligible overhead compared to the I/O cost of
reading the data from disk.
grepl, gsub, sub, and str_extract compile a POSIX regex (or set up a
fixed-string scanner) and apply it to every row. Fixed-string matching
(fixed = TRUE) is in tier 1 territory. Regex matching cost depends on the
pattern: a simple alternation like ^(A|B|C)$ is nearly as fast as fixed,
while a pattern with nested quantifiers like (.+)* can be exponentially
slow on pathological inputs. For batches larger than 1000 rows, the engine
distributes the matching loop across OpenMP threads with per-thread regex
compilation, so doubling the available cores roughly halves the wall time.
The practical advice: use fixed = TRUE (or omit the parameter, since it
defaults to TRUE) whenever the pattern is a literal string. Reserve
fixed = FALSE for cases where you genuinely need regex features like
character classes, alternations, or capture groups.
levenshtein, dl_dist, and jaro_winkler compute an O(mn) dynamic
programming alignment for each pair of strings, where m and n are the string
lengths. For column-vs-column comparisons in mutate(), this is
O(rows * m * n). For fuzzy joins without blocking, it is O(probe_rows * build_rows *
m * n). Blocking reduces this to O(probe_rows * avg_partition_size * m * n),
which is why it matters so much. A loose max_dist does not change the
number of distance computations, but it increases the number of rows that
survive the threshold, and in a many-to-many fuzzy join this can cause a
combinatorial explosion of output rows. If both tables contain 10,000 rows
and a generous threshold matches each probe row to 50 build rows, the result
has 500,000 rows. Tightening max_dist or adding a block_by column are
the two most effective ways to keep the output manageable.
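For reference, here is the O(m * n) dynamic programme that classical
Levenshtein implements, written as a didactic sketch in plain R (this is not
vectra's internal code, which runs in C):

lev <- function(a, b) {
  m <- nchar(a); n <- nchar(b)
  d <- matrix(0L, m + 1, n + 1)
  d[, 1] <- 0:m                       # cost of deleting i characters of a
  d[1, ] <- 0:n                       # cost of inserting j characters of b
  for (i in seq_len(m)) {
    for (j in seq_len(n)) {
      cost <- (substr(a, i, i) != substr(b, j, j)) * 1L
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1L,  # deletion
                             d[i + 1, j] + 1L,  # insertion
                             d[i, j] + cost)    # substitution
    }
  }
  d[m + 1, n + 1]
}
lev("Qurecus", "Quercus")  # 2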
All three distance functions use OpenMP parallelization when the input exceeds
1000 rows. The work is distributed with schedule(dynamic, 64), meaning
threads grab 64-row chunks as they finish, which balances load well when
string lengths vary.
String-producing expressions (everything that outputs a new string column)
use a two-pass strategy. The first pass computes the output length for each
row without writing any bytes. The second pass allocates a single contiguous
buffer and fills it. This avoids the realloc-and-copy pattern that R uses
internally, which can fragment memory and cause O(n^2) behaviour on long
strings. For paste() with many arguments or gsub with complex
replacements, the two-pass approach keeps memory allocation linear.
For taxonomic name matching, dl_dist_norm with a threshold of 0.15-0.20 is
the workhorse. It catches transpositions (the most common typo type in Latin
binomials), single-letter substitutions, and dropped characters. If your
reference list is large and you can block on genus, the false-positive rate at
0.20 is low enough for automated matching with a manual review pass for
distances above 0.10.
For matching personal names (observers, collectors, authors),
jaro_winkler with a threshold of 0.10-0.15 (remember, this is
dissimilarity, so 0.10 means a similarity score of 0.90 or higher) is the
better choice. Names are short, and the prefix weighting in Jaro-Winkler
exploits the fact that people rarely mistype the first letter of their own
name.
For free-text fields where errors include missing words, reordered words, or
abbreviations, none of these character-level metrics work well. Consider
preprocessing with tolower, trimws, and gsub to normalize the strings
before computing distances.
Start conservative (low max_dist) and increase gradually. A max_dist of
0.10 with DL distance catches most single-character typos in strings of 10+
characters. At 0.20, you start matching strings that differ by two edits,
which increases recall but also increases false positives. Above 0.30, the
match set usually contains too much noise to be useful without manual
filtering.
For Jaro-Winkler, the scale is inverted (higher score = better match). A
max_dist of 0.10 (score >= 0.90) is a reasonable starting point. Below 0.05
(score >= 0.95), you catch only very minor differences.
A useful workflow is to run the fuzzy join once with a generous threshold, sort
the results by fuzzy_dist, and examine the matches near the boundary. This
gives an empirical sense of where true matches end and false positives begin
for your specific data.
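A sketch of that workflow, reusing the messy and reference tables from the
fuzzy join section; the 0.30 threshold and the 0.10 review cut-off are
illustrative choices, not defaults:

matches <- fuzzy_join(
  tbl(f_messy), tbl(f_ref),
  by = c("field_species" = "canonical"),
  method = "dl",
  max_dist = 0.30   # deliberately generous first pass
) |> collect()

# Sort by distance and flag borderline matches for manual review.
matches <- matches[order(matches$fuzzy_dist, decreasing = TRUE), ]
matches$needs_review <- matches$fuzzy_dist > 0.10
matches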
vectra operates on raw bytes and assumes UTF-8 encoding. If your input data
is Latin-1 or another encoding, convert it to UTF-8 before writing to .vtr
format. In R, iconv(x, from = "latin1", to = "UTF-8") handles this. Once
the data is in UTF-8, all string operations work correctly on ASCII characters
and pass through multi-byte sequences without corruption, though the distance
metrics count bytes rather than Unicode code points for multi-byte characters.
This byte-level behaviour has a practical implication for distance
computation: if your strings contain accented characters encoded as multi-byte
UTF-8 sequences (e.g., "u" + combining accent), the edit distance between
"u" and "ü" will be larger than 1 because the engine sees multiple bytes
changing, not a single character. For Latin-alphabet ecological data (species
names, place names), this is rarely a problem because accented characters are
uncommon in standardized taxonomic databases. If your data does include
accented characters, consider normalizing them to ASCII before computing
distances (e.g., replacing "ü" with "u" via gsub).
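One way to perform that normalization in base R is iconv() with a
transliteration target. The exact output is platform-dependent (glibc and
macOS iconv transliterate differently), so inspect the results on your
system:

place <- c("Zürich", "São Paulo", "Kraków")
iconv(place, from = "UTF-8", to = "ASCII//TRANSLIT")
# typically "Zurich" "Sao Paulo" "Krakow"; may yield "?" or NA
# where the platform has no transliteration rule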
A typical cleaning pipeline chains several string operations before matching. Here is a pattern we use repeatedly for species name standardization.
raw_names <- data.frame(
  species = c(" Quercus ROBUR ", "fagus sylvatica.", "Acer platanoides (L.)",
              "BETULA pendula"),
  stringsAsFactors = FALSE
)
f9 <- tempfile(fileext = ".vtr")
write_vtr(raw_names, f9)

tbl(f9) |>
  mutate(clean = tolower(trimws(species))) |>
  mutate(clean = gsub(".", "", clean, fixed = TRUE)) |>
  mutate(clean = gsub(" *\\([a-z]+\\)$", "", clean, fixed = FALSE)) |>  # drop author citations like "(L.)"
  mutate(clean = gsub(" +", " ", clean, fixed = FALSE)) |>
  select(species, clean) |>
  collect()
The pipeline strips whitespace, lowercases everything, removes periods,
drops trailing parenthesised author citations such as "(L.)", and collapses
multiple spaces into one. Each mutate() call creates a streaming node, so
chaining them costs no extra memory. After this cleaning, the strings are
ready for exact or fuzzy matching against a standardized reference.
The most effective matching strategy layers exact and fuzzy methods. Start with an exact join on cleaned names (catching the easy cases at near-zero cost), then run a fuzzy join on the unmatched remainder. This avoids spending CPU cycles on distance computations for records that match perfectly after basic cleaning.
ref_clean <- data.frame(
  canonical = c("quercus robur", "fagus sylvatica", "acer platanoides",
                "betula pendula"),
  status = c("accepted", "accepted", "accepted", "accepted"),
  stringsAsFactors = FALSE
)
f_refc <- tempfile(fileext = ".vtr")
write_vtr(ref_clean, f_refc)

cleaned <- tbl(f9) |>
  mutate(clean = tolower(trimws(species))) |>
  mutate(clean = gsub(".", "", clean, fixed = TRUE)) |>
  mutate(clean = gsub(" *\\([a-z]+\\)$", "", clean, fixed = FALSE)) |>
  mutate(clean = gsub(" +", " ", clean, fixed = FALSE))

# Step 1: exact join on cleaned names
exact <- left_join(
  cleaned,
  tbl(f_refc),
  by = c("clean" = "canonical")
) |> collect()
exact
Records where status is NA did not match exactly and would go to a second
pass with fuzzy_join(). In this example all four match after cleaning, which
is the ideal outcome: the fuzzy join handles only the genuinely ambiguous
cases.
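Had any rows failed the exact join, the second pass would look something
like the following sketch. It assumes the unmatched rows are written back to
a temporary .vtr file, since fuzzy_join() operates on vectra tables:

unmatched <- exact[is.na(exact$status), c("species", "clean")]
if (nrow(unmatched) > 0) {
  f_un <- tempfile(fileext = ".vtr")
  write_vtr(unmatched, f_un)
  fuzzy_join(
    tbl(f_un), tbl(f_refc),
    by = c("clean" = "canonical"),
    method = "dl",
    max_dist = 0.2
  ) |> collect()
}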
This layered strategy reflects how taxonomic name resolution works in practice. The Global Biodiversity Information Facility (GBIF) backbone taxonomy, for instance, contains over 7 million names. Running a fuzzy join against the full list for every input record would be prohibitively expensive. Cleaning first, exact-matching second, and fuzzy-matching only the remainder keeps the computational cost proportional to the number of genuinely problematic records rather than the total dataset size. In a typical citizen-science dataset, 80-90% of records match exactly after basic normalization, leaving only 10-20% for the distance computation stage.