grepsParallel: Parallel pairwise grep-style matching

View source: R/grepsParallel.R

Description

Performs a two-way grep-style analysis on two character vectors using parallel computation. Calculates pairwise matching scores based on a rigid customized routine and returns matching strings ranked from best to worst. The user can influence the algorithm by adjusting the matching parameters.
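
As a rough illustration of what "two-way" means here, the base R sketch below checks whether one string occurs inside another in either direction. It is not the package's scoring routine, which this page does not document. The helper name `twoWayGrepl` is hypothetical, and reading "left" as "x found in y" is an assumption.

```r
# Hypothetical helper, not part of demystas.
# Left grep: does `a` occur inside `b`? Right grep: does `b` occur inside `a`?
# (Which direction the package calls "left" is an assumption here.)
twoWayGrepl <- function(a, b, checkBoth = TRUE, ignore.case = TRUE) {
  if (ignore.case) {
    # Lowercase both sides; grepl() ignores ignore.case when fixed = TRUE
    a <- tolower(a)
    b <- tolower(b)
  }
  left <- grepl(a, b, fixed = TRUE)
  right <- checkBoth && grepl(b, a, fixed = TRUE)
  left || right
}

twoWayGrepl("bat", "bats")                       # left grep matches
twoWayGrepl("foosh", "foos")                     # only the right grep matches
twoWayGrepl("foosh", "foos", checkBoth = FALSE)  # left grep alone finds nothing
```

The sketch only answers whether a match exists; the function documented here additionally scores and ranks the matches.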

Usage

grepsParallel(x, y, noCores, sepx = "\\.", sepy = "\\.",
  limitChar = 0, limitWord = 0, booster = 0.8, wordIgnore = NULL,
  checkBoth = TRUE, ignore.case = TRUE)

Arguments

x

a character vector containing elements to be considered in pairwise grep-analysis. Words are separated by 'sepx'.

y

a character vector containing elements to be considered in pairwise grep-analysis. Words are separated by 'sepy'.

noCores

a numerical value specifying the number of cores to be used for parallel computation.

sepx

a regex-style expression which indicates how words are separated in 'x'. If 'x' is already a final vector and does not need to be segmented, input 'sepx = NULL'. Defaults to "\\."

sepy

a regex-style expression which indicates how words are separated in 'y'. If 'y' is already a final vector and does not need to be segmented, input 'sepy = NULL'. Defaults to "\\."

limitChar

a numerical value from 0 to 1 which provides a lower proportional bound for a word-to-word match to be considered significant. To admit loosely matched words, keep this value low, for example 0.1. To prioritize strongly matched individual words, increase 'limitChar' to a value such as 0.7. Defaults to 0.

limitWord

a numerical value greater than or equal to 0 which provides a proportional filter for significant overall characters matched. Defaults to 0.

booster

a numerical value between 0 and 1 which provides a boost to the matching score of exceptionally well-matched words. For meaningful results, its value should be greater than 'limitWord'. Defaults to 0.8.

wordIgnore

a character vector of words to be ignored while searching for matches. Examples could be redundant words such as "the" or "of". Defaults to NULL.

checkBoth

a logical which indicates whether both left and right grep analyses should be conducted (TRUE), or if only a left grep analysis is necessary (FALSE). Defaults to TRUE.

ignore.case

a logical which indicates if cases should be ignored when matching. Defaults to TRUE.
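
The preprocessing that 'sepx', 'sepy', 'wordIgnore' and 'ignore.case' describe can be approximated in base R as follows. This is a sketch under assumptions: the helper `splitWords` is hypothetical, and the package may implement these steps differently (for example, it may not lowercase the words it returns).

```r
# Hypothetical helper, not part of demystas.
splitWords <- function(v, sep = "\\.", wordIgnore = NULL, ignore.case = TRUE) {
  # sep = NULL: each element is already a final word and is not segmented
  words <- if (is.null(sep)) as.list(v) else strsplit(v, sep)
  # Compare in lowercase when case is to be ignored
  ignore <- if (ignore.case) tolower(wordIgnore) else wordIgnore
  lapply(words, function(w) {
    if (ignore.case) w <- tolower(w)
    w[!(w %in% ignore)]  # drop the words listed in wordIgnore
  })
}

# "The.Bat" is split on the dot, lowercased, and "the" is dropped
splitWords(c("foo.test.xyz", "The.Bat"), wordIgnore = "the")
```

Note that the separator is a regular expression, so a literal dot has to be escaped as "\\.".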

Value

a list containing two matrices. The first, "result", has one row per element of 'x'. Its first column repeats 'x', and the remaining columns hold the matching elements of 'y' for that row, ranked from best to worst as the column number increases. The second, "rank", has the same dimensions as the first; instead of the matches from 'y', it holds the matching scores of the corresponding entries in the first matrix. A score of 99 indicates a perfect match. Perfect matches are isolated for each row.

Author(s)

Atreya Shankar

Examples

## Not run: 

x <- c("foo.test.xyz", "baz.foosh", "bat")
y <- c("ba", "foosba.asd", "bats.at", "foos", "gams.asd")
test <- demystas::grepsParallel(x, y, 2)

## End(Not run)
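
When choosing 'noCores', one common approach is to query the machine with `parallel::detectCores()` and leave one core free. The sketch below assumes that convention, which this page does not prescribe. The `requireNamespace()` guard means the call is skipped when demystas is not installed.

```r
# detectCores() can return NA on some platforms, so fall back to 1
available <- parallel::detectCores()
noCores <- if (is.na(available)) 1L else max(1L, available - 1L)

x <- c("foo.test.xyz", "baz.foosh", "bat")
y <- c("ba", "foosba.asd", "bats.at", "foos", "gams.asd")

# Only call the package function if demystas is available
if (requireNamespace("demystas", quietly = TRUE)) {
  test <- demystas::grepsParallel(x, y, noCores)
}
```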

pik-piam/demystas documentation built on Oct. 26, 2019, 12:15 a.m.