View source: R/keepCommonText.R
keepCommonText | R Documentation |
This function recovers the single longest common text fragment (from center, head or tail) shared by all elements of the character vector txt.
Only the first of all of the longest solutions will be returned.
keepCommonText(
txt,
minNchar = 1,
side = "center",
hiResol = TRUE,
silent = TRUE,
callFrom = NULL,
debug = FALSE
)
txt |
character vector to be treated |
minNchar |
(integer) minimum number of characters that must remain |
side |
(character) may be either 'center', 'any', 'terminal', 'left' or 'right'; only with |
hiResol |
(logical) find best solution, but at much higher computational cost (eg 3x slower, however |
silent |
(logical) suppress messages |
callFrom |
(character) allow easier tracking of messages produced |
debug |
(logical) display additional messages for debugging |
Please note that finding common parts between character strings is not a completely trivial task. It remains a topic of ongoing research in sequence alignment, where the character strings to be compared get very long. This function uses a k-mer inspired approach. Its initial aim was to treat shorter character strings (and to find shorter stretches of common text), eg column names.
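To make the idea of testing candidate fragments concrete, here is a minimal base-R sketch of a brute-force search for the longest common text. It is not the implementation used by keepCommonText; the helper name naiveCommonText is hypothetical and serves only as illustration. It takes the shortest element as reference, enumerates its fragments from longest to shortest, and returns the first one found in every element.

```r
# Hypothetical helper (NOT part of the package): brute-force search for the
# longest text fragment shared by all elements of 'txt'.
naiveCommonText <- function(txt, minNchar = 1) {
  ref <- txt[which.min(nchar(txt))]        # shortest element serves as reference
  n <- nchar(ref)
  if (n < minNchar) return(character(0))   # nothing long enough can be common
  for (len in n:minNchar) {                # try the longest candidates first
    for (start in 1:(n - len + 1)) {
      cand <- substr(ref, start, start + len - 1)
      # literal (non-regex) search of the candidate in every element
      if (all(grepl(cand, txt, fixed = TRUE))) return(cand)
    }
  }
  character(0)                             # no common fragment found
}

txt1 <- c("abcd_abc_kjh", "bcd_abc123", "cd_abc_po")
naiveCommonText(txt1)    # "cd_abc"
txt2 <- c("ddd_ab", "ddd_bcd", "ddd_cde")
naiveCommonText(txt2)    # "ddd_"
```

Like the function documented here, this sketch stops at the first of the longest solutions. Its cost grows quickly with the length of the strings, which is why such an exhaustive search is only reasonable for short texts such as column names.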
Important: This function identifies only the first best hit, ie other shared/common character chains of the same length will not be found!
Using the argument hiResol=FALSE it is possible to accelerate the search approx 3x (with larger character vectors); however, the very best solution may frequently not be found.
In this case the result should rather be considered a 'seed' that allows checking whether further extension may improve the result, ie identifying a (slightly) longer chain of common characters.
With longer vectors and longer character chains the search may become demanding on computational resources; the argument hiResol=FALSE allows reducing this at the price of possibly missing the best solution.
With this argument, single common/matching characters will not be searched if all text elements are longer than 500 characters; in that case an empty character vector will be returned.
When argument side is either left, right or terminal, only terminal common text may be found (a potentially even longer internal text will be lost).
Of course, choosing this option makes searches much faster.
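The trade-off can be illustrated with a small base-R sketch. The helper commonPrefix below is hypothetical and not part of the package; it only mimics the idea of a left-terminal search, growing the prefix one character at a time until the elements disagree. Note that it returns "" when nothing is shared, which is an assumption of this sketch and not the documented behaviour of keepCommonText.

```r
# Hypothetical helper (NOT part of the package): longest common text at the
# left end of all elements of 'txt'.
commonPrefix <- function(txt) {
  n <- min(nchar(txt))     # the prefix cannot exceed the shortest element
  len <- 0
  # extend the prefix while all elements still agree on it
  while (len < n && length(unique(substr(txt, 1, len + 1))) == 1) {
    len <- len + 1
  }
  substr(txt[1], 1, len)   # "" if not even the first character is shared
}

txt2 <- c("ddd_ab", "ddd_bcd", "ddd_cde")
commonPrefix(txt2)         # "ddd_"

txt1 <- c("abcd_abc_kjh", "bcd_abc123", "cd_abc_po")
commonPrefix(txt1)         # "" although "cd_abc" is shared internally
```

A terminal search needs at most as many comparisons as the shortest element has characters, which is why it is fast. The second call shows the price: the internal fragment "cd_abc" common to all three elements is missed.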
This function does not return the position of the shared/common characters within the text; you may use gregexpr or regexec to locate them.
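As a minimal sketch of this step, the base-R function regexpr (the first-match sibling of gregexpr) can report where a common fragment starts in each element. The value of common is written out by hand here: "cd_abc" is the longest fragment shared by the three example strings, and in practice it would be replaced by the result of the search. Using fixed=TRUE treats the fragment as literal text, so characters with a special meaning in regular expressions need no escaping.

```r
txt1 <- c("abcd_abc_kjh", "bcd_abc123", "cd_abc_po")
common <- "cd_abc"    # assumed here: the common fragment identified beforehand

# first occurrence of the fragment in each element, as literal text
pos <- regexpr(common, txt1, fixed = TRUE)

starts <- as.integer(pos)                          # 3 2 1
ends <- starts + attr(pos, "match.length") - 1     # 8 7 6

# check: cutting out these positions gives back the common fragment
substr(txt1, starts, ends)
```

If the fragment may occur several times per element, gregexpr returns all positions as a list with one entry per element.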
This function returns a character vector of length 1, ie only one (normally the longest) common sequence of characters is identified. If nothing common/shared is found, an empty character vector is returned.
Use gregexpr or regexec (see grep) for locating the identified common characters in the initial query.
Inverse: trim redundant text (from either side) to keep only the variable part using trimRedundText; you may also look for related functions in package stringr
txt1 <- c("abcd_abc_kjh", "bcd_abc123", "cd_abc_po")
keepCommonText(txt1, side="center")   # search common text anywhere within the strings
txt2 <- c("ddd_ab","ddd_bcd","ddd_cde")
trimRedundText(txt2, side="left")     # inverse: remove the redundant text from the left
keepCommonText(txt2, side="center")   # keep only the common text