keepCommonText: Extract Longest Common Text Out Of Character Vector
In wrMisc: Analyze Experimental High-Throughput (Omics) Data

keepCommonText

R Documentation

Description

This function recovers the single longest common text-fragment (from center, head or tail) out of the character vector txt. If several fragments share the maximum length, only the first of them is returned.

Usage

keepCommonText(
  txt,
  minNchar = 1,
  side = "center",
  hiResol = TRUE,
  silent = TRUE,
  callFrom = NULL,
  debug = FALSE
)

Arguments

`txt`	character vector to be treated
`minNchar`	(integer) minimum number of characters that must remain
`side`	(character) may be either 'center', 'any', 'terminal', 'left' or 'right'; only with `side='center'` or `'any'` internal text-segments may be found
`hiResol`	(logical) find best solution, but at much higher computational cost (eg 3x slower; however, `hiResol=FALSE` rather finds an anchor which may need to get extended)
`silent`	(logical) suppress messages
`callFrom`	(character) allow easier tracking of messages produced
`debug`	(logical) display additional messages for debugging

Details

Please note that finding common parts between chains of characters is not a completely trivial task. It is still a topic of ongoing research in sequence alignment, where the chains of characters to be compared get very long. This function uses a k-mer inspired approach. Its initial aim was to treat smaller chains of characters (and to find shorter stretches of common text), eg column-names.
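To illustrate what a k-mer style search can look like, here is a minimal base R sketch. It is not the implementation used by keepCommonText; the helper name longestCommonKmer is hypothetical and the approach is a plain brute-force search, which is only reasonable for short texts such as column-names.

```r
# Hypothetical helper, not part of wrMisc: brute-force search over k-mers.
longestCommonKmer <- function(txt, minNchar = 1) {
  ref <- txt[which.min(nchar(txt))]          # the shortest element bounds the solution
  n <- nchar(ref)
  if (n < minNchar) return(character(0))
  for (k in n:minNchar) {                    # try the longest k-mers first
    starts <- seq_len(n - k + 1)
    kmers <- unique(substring(ref, starts, starts + k - 1))
    for (km in kmers) {
      # return the first k-mer that is present in every element
      if (all(grepl(km, txt, fixed = TRUE))) return(km)
    }
  }
  character(0)                               # nothing shared
}

longestCommonKmer(c("abcd_abc_kjh", "bcd_abc123", "cd_abc_po"))   # "cd_abc"
```

Like the function documented here, this sketch stops at the first hit of maximal length and ignores other fragments of the same length.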

Important: This function identifies only the first best hit, ie other shared/common character-chains of the same length will not be found!

Using the argument hiResol=FALSE it is possible to accelerate the search approx. 3x (with larger character-vectors); however, the very best solution is frequently not found. In this case the result should rather be considered a 'seed' that can be checked for further extension, ie for identifying a (slightly) longer chain of common characters.
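One possible way of extending such a seed is sketched below in base R. The helper extendSeed is hypothetical (it is not part of wrMisc) and works greedily: it only follows the first occurrence of the seed in the first element, so it may still miss the overall best solution.

```r
# Hypothetical helper, not part of wrMisc: grow a seed one character at a time.
extendSeed <- function(txt, seed) {
  stopifnot(nzchar(seed), all(grepl(seed, txt, fixed = TRUE)))
  ref <- txt[1]
  start <- regexpr(seed, ref, fixed = TRUE)[1]   # first occurrence in the reference
  end <- start + nchar(seed) - 1
  isCommon <- function(s) all(grepl(s, txt, fixed = TRUE))
  # extend to the left while the longer fragment is still shared by all elements
  while (start > 1 && isCommon(substr(ref, start - 1, end))) start <- start - 1
  # then extend to the right in the same way
  while (end < nchar(ref) && isCommon(substr(ref, start, end + 1))) end <- end + 1
  substr(ref, start, end)
}

extendSeed(c("abcd_abc_kjh", "bcd_abc123", "cd_abc_po"), seed = "_ab")   # "cd_abc"
```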

With longer vectors and longer character chains the search may get demanding on computational resources; the argument hiResol=FALSE allows reducing this at the price of missing the best solution. With this argument, single common/matching characters are not searched if all text-elements are longer than 500 characters; in that case an empty character vector is returned.

When argument side is either 'left', 'right' or 'terminal', only terminal common text may be found (a potentially even longer internal text will be lost). Of course, choosing this option makes searches much faster.
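The difference between terminal and internal matches can be shown with a small base R sketch. The helpers commonPrefix and commonSuffix are hypothetical and not part of wrMisc; unlike keepCommonText they return "" (rather than an empty vector) when nothing is shared.

```r
# Hypothetical helpers, not part of wrMisc: terminal matches only.
commonPrefix <- function(txt) {
  n <- min(nchar(txt))
  k <- 0
  # advance while the first k+1 characters are identical in all elements
  while (k < n && length(unique(substr(txt, 1, k + 1))) == 1) k <- k + 1
  substr(txt[1], 1, k)
}
commonSuffix <- function(txt) {
  # reverse each string, take the common prefix, reverse the result back
  rev1 <- function(x) vapply(strsplit(x, ""), function(ch) paste(rev(ch), collapse = ""), "")
  rev1(commonPrefix(rev1(txt)))
}

commonPrefix(c("ddd_ab", "ddd_bcd", "ddd_cde"))            # "ddd_"
commonSuffix(c("ab_x1", "cd_x1", "e_x1"))                  # "_x1"
commonPrefix(c("abcd_abc_kjh", "bcd_abc123", "cd_abc_po")) # "" : the internal "cd_abc" is lost
```

Checking only the ends needs a single pass over the leading (or trailing) characters, which is why terminal searches are so much cheaper than searching all internal positions.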

This function does not return the position of the shared/common characters within the text; you may use gregexpr or regexec to locate them.
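Locating a common fragment afterwards could look like the following base R sketch. The value of common is assumed here for illustration; in practice it would come from a call such as keepCommonText(txt1).

```r
txt1 <- c("abcd_abc_kjh", "bcd_abc123", "cd_abc_po")
common <- "cd_abc"    # assumed result, eg as obtained from keepCommonText(txt1)

# first occurrence per element (-1 would mean 'not found')
firstPos <- as.integer(regexpr(common, txt1, fixed = TRUE))
firstPos                                   # 3 2 1

# all occurrences per element, returned as a list
allPos <- lapply(gregexpr("abc", txt1, fixed = TRUE), as.integer)
allPos[[1]]                                # 1 6
```

Using fixed = TRUE treats the fragment as literal text, which avoids surprises when it contains regular-expression metacharacters such as '.' or '('.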

Value

This function returns a character vector of length 1, ie only one (normally the longest) common sequence of characters is identified. If nothing common/shared is found, an empty character vector is returned.

Examples

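library(wrMisc)

txt1 <- c("abcd_abc_kjh", "bcd_abc123", "cd_abc_po")
keepCommonText(txt1, side="center")        # longest common fragment, internal matches allowed

txt2 <- c("ddd_ab","ddd_bcd","ddd_cde")
trimRedundText(txt2, side="left")          # for comparison: remove the redundant text at the left
keepCommonText(txt2, side="center")        # keep the common text instead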

wrMisc documentation built on April 3, 2025, 8:17 p.m.