keepCommonText: Extract Longest Common Text Out Of Character Vector

keepCommonText R Documentation

Extract Longest Common Text Out Of Character Vector

Description

This function recovers the single longest text fragment shared by all elements of the character vector txt (searched from the center, head or tail). If several fragments share the maximum length, only the first one is returned.

Usage

keepCommonText(
  txt,
  minNchar = 1,
  side = "center",
  hiResol = TRUE,
  silent = TRUE,
  callFrom = NULL,
  debug = FALSE
)

Arguments

txt

character vector to be treated

minNchar

(integer) minimum number of characters that must remain

side

(character) may be either 'center', 'any', 'terminal', 'left' or 'right'; internal text-segments may only be found with side='center' or side='any'

hiResol

(logical) find the best solution, but at a much higher computational cost (e.g. 3x slower); with hiResol=FALSE the function rather finds an anchor, which may need to be extended

silent

(logical) suppress messages

callFrom

(character) allow easier tracking of messages produced

debug

(logical) display additional messages for debugging

Details

Please note that finding common parts between chains of characters is not a completely trivial task. It is still a topic of ongoing research in the field of sequence alignment, where the chains of characters to be compared get very long. This function uses a k-mer inspired approach. The initial aim of this function was to treat smaller chains of characters (and to find shorter stretches of common text), e.g. column-names.
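
To give an idea of what a k-mer style search looks like, here is a minimal base R sketch. It is not the code used inside keepCommonText; the helper name longestCommonKmer is hypothetical and the brute-force strategy (enumerate all fragments of the shortest element, longest first) is an assumption chosen for readability, not for speed.

```r
# Illustrative sketch only; NOT the implementation used by keepCommonText().
# longestCommonKmer() is a hypothetical helper name.
longestCommonKmer <- function(txt, minNchar = 1) {
  if (length(txt) < 1) return(character(0))
  minNchar <- max(1, minNchar)
  ref <- txt[which.min(nchar(txt))]        # the shortest element limits the fragment length
  n <- nchar(ref)
  if (n < minNchar) return(character(0))
  for (k in n:minNchar) {                  # try the longest k-mers first
    starts <- seq_len(n - k + 1)
    kmers <- unique(substring(ref, starts, starts + k - 1))
    for (km in kmers) {
      # fixed=TRUE: treat the k-mer as literal text, not as a regular expression
      if (all(grepl(km, txt, fixed = TRUE))) return(km)   # first hit of maximal length
    }
  }
  character(0)                             # nothing in common
}

longestCommonKmer(c("abcd_abc_kjh", "bcd_abc123", "cd_abc_po"))   # "cd_abc"
```

Like the function documented here, this sketch stops at the first fragment of maximal length and ignores any other fragment of the same length.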

Important: This function identifies only the first best hit, i.e. other shared/common character-chains of the same length will not be found!

Using the argument hiResol=FALSE it is possible to accelerate the search approximately 3x (with larger character-vectors); however, the very best solution will frequently not be found. In this case the result should rather be considered a 'seed', which allows checking whether further extension may improve the result, i.e. identify a (slightly) longer chain of common characters.
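
The idea of extending a 'seed' can be sketched in base R as follows. This is only an assumption about how such an extension could be done by hand; extendSeed is a hypothetical helper that is not part of wrMisc, and it only considers the first occurrence of the seed in the first element of txt.

```r
# Hypothetical helper (not part of wrMisc): grow a seed by one character at a time,
# to the left or to the right, as long as the longer fragment occurs in all elements.
extendSeed <- function(txt, seed) {
  ref <- txt[1]                                    # first element serves as reference
  repeat {
    st <- as.integer(regexpr(seed, ref, fixed = TRUE))
    if (st < 1) break                              # seed not found in the reference
    en <- st + nchar(seed) - 1L
    cand <- c(if (st > 1) substr(ref, st - 1L, en),            # one more character on the left
              if (en < nchar(ref)) substr(ref, st, en + 1L))   # one more character on the right
    ok <- vapply(cand, function(x) all(grepl(x, txt, fixed = TRUE)), logical(1))
    if (!any(ok)) break                            # no further extension is shared by all
    seed <- cand[which(ok)[1]]
  }
  seed
}

txt1 <- c("abcd_abc_kjh", "bcd_abc123", "cd_abc_po")
extendSeed(txt1, "_ab")                            # "cd_abc"
```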

With longer vectors and longer character chains the search may get demanding on computational resources; the argument hiResol=FALSE allows reducing this at the price of missing the best solution. With this argument, single common/matching characters will not be searched if all text-elements are longer than 500 characters; an empty character vector will be returned.

When argument side is either 'left', 'right' or 'terminal', only terminal common text may be found (a potentially even longer internal text will be lost). Of course, choosing this option makes searches much faster.
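
The difference between a terminal and an internal search can be illustrated with a small base R sketch of a left-anchored search. The helper commonPrefix is hypothetical and does not reproduce the wrMisc code; it simply compares the elements position by position from the left.

```r
# Hypothetical helper, base R only; not the wrMisc implementation.
commonPrefix <- function(txt) {
  if (length(txt) < 1) return(character(0))
  maxLen <- min(nchar(txt))                 # the prefix cannot exceed the shortest element
  keep <- 0
  for (i in seq_len(maxLen)) {
    # stop at the first position where the elements disagree
    if (length(unique(substr(txt, i, i))) > 1) break
    keep <- i
  }
  if (keep > 0) substr(txt[1], 1, keep) else character(0)
}

commonPrefix(c("ddd_ab", "ddd_bcd", "ddd_cde"))              # "ddd_"
commonPrefix(c("abcd_abc_kjh", "bcd_abc123", "cd_abc_po"))   # character(0)
```

The second call returns nothing although the three elements share the internal fragment "cd_abc", which is the kind of loss described above.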

This function does not return the position of the shared/common characters within the text; you may use gregexpr or regexec to locate them.
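
A short sketch of how the positions may be recovered with base R. The fragment "cd_abc" is entered by hand here for illustration (it is shared by all three elements); in practice it would be the value returned by the search.

```r
txt1 <- c("abcd_abc_kjh", "bcd_abc123", "cd_abc_po")
common <- "cd_abc"                                   # fragment assumed for this illustration

# regexpr() reports the first match per element; fixed=TRUE avoids regex interpretation
pos <- regexpr(common, txt1, fixed = TRUE)
startPos <- as.integer(pos)                          # -1 would mean 'not found'
endPos <- startPos + attr(pos, "match.length") - 1L
data.frame(txt = txt1, start = startPos, end = endPos)

# gregexpr() reports all matches, useful when a fragment occurs more than once
allPos <- as.integer(gregexpr("abc", txt1[1], fixed = TRUE)[[1]])
allPos
```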

Value

This function returns a character vector of length=1, i.e. only one (normally the longest) common sequence of characters is identified. If nothing common/shared is found, an empty character-vector is returned.

See Also

Use gregexpr or regexec in grep for locating the identified common characters in the initial query.

Inverse: Trim redundant text (from either side) to keep only the variable part using trimRedundText; you may also look for related functions in package stringr

Examples

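library(wrMisc)

txt1 <- c("abcd_abc_kjh", "bcd_abc123", "cd_abc_po")
keepCommonText(txt1, side="center")        # longest fragment shared by all elements, searched anywhere in the text

txt2 <- c("ddd_ab","ddd_bcd","ddd_cde")
trimRedundText(txt2, side="left")          # inverse operation: remove the redundant text on the left
keepCommonText(txt2, side="center")        # keep only the shared text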

wrMisc documentation built on Sept. 11, 2024, 6:10 p.m.