group_str: Group near elements of string vectors


View source: R/group_str.R

Description

This function groups elements of a string vector (character or string variable) according to the distance ('similarity') between the elements. The more similar two string elements are, the more likely they are to be combined into one group.

Usage

group_str(
  strings,
  precision = 2,
  strict = FALSE,
  trim.whitespace = TRUE,
  remove.empty = TRUE,
  verbose = FALSE,
  maxdist
)

Arguments

strings

Character vector with string elements.

precision

Maximum distance ("precision") between two string elements at which they are still treated as similar or equal. Smaller values mean less tolerance in matching.
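To get a feel for what a distance between two strings looks like, the following sketch uses base R's utils::adist(), which computes the generalized Levenshtein (edit) distance. This is an assumption for illustration only: this page does not state which distance measure group_str uses, and the snippet does not reproduce its grouping (for instance, "Helo" and "Hole" are at edit distance 2, yet "Hole" stays on its own in the default example output).

```r
oldstring <- c("Hello", "Helo", "Hole", "Apple",
               "Ape", "New", "Old", "System", "Systemic")

# Pairwise edit distances (assumed measure, not confirmed for group_str).
d <- adist(oldstring)
dimnames(d) <- list(oldstring, oldstring)
print(d)

# List each pair once (upper triangle) whose distance is at most 2.
near <- which(d <= 2 & upper.tri(d), arr.ind = TRUE)
pairs <- data.frame(a = oldstring[near[, "row"]],
                    b = oldstring[near[, "col"]],
                    distance = d[near])
print(pairs)
```

Lowering the threshold in the last step shrinks the list of candidate pairs, which mirrors the statement that smaller precision values mean less tolerance.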

strict

Logical; if TRUE, value matching is stricter. See 'Examples'.

trim.whitespace

Logical; if TRUE (default), leading and trailing whitespace is removed from string values.

remove.empty

Logical; if TRUE (default), empty string values will be removed from the character vector strings.
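The two clean-up options above can be pictured with base R alone. This is a minimal sketch of what such preprocessing amounts to, not the package's internal code; the vector raw and the order of the two steps are assumptions made for the illustration.

```r
# Hypothetical input with stray whitespace and one empty element.
raw <- c("  Hello", "Helo ", "", "Ape")

# Strip leading and trailing whitespace (cf. trim.whitespace = TRUE).
trimmed <- trimws(raw)

# Drop empty strings (cf. remove.empty = TRUE).
cleaned <- trimmed[nzchar(trimmed)]
print(cleaned)
```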

verbose

Logical; if TRUE, a progress bar is displayed while the distance matrix is computed. Default is FALSE, hence the bar is hidden.

maxdist

Deprecated. Please use precision now.

Value

A character vector where similar string elements (values) are recoded into a new, single value. The return value is of the same length as strings, i.e. grouped elements appear multiple times, so the count for each grouped string is still available (see 'Examples').
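Because the result has one entry per input element, it can be lined up with the original vector. The sketch below hard-codes the grouped vector exactly as it is printed in the example output on this page, so it runs without sjmisc installed; the names lookup and members are only illustrative.

```r
oldstring <- c("Hello", "Helo", "Hole", "Apple",
               "Ape", "New", "Old", "System", "Systemic")

# Result of group_str(oldstring) as shown in the example output,
# hard-coded here so that the snippet needs base R only.
newstring <- c("Hello, Helo", "Hello, Helo", "Hole", "Ape, Apple",
               "Ape, Apple", "New", "Old", "System, Systemic",
               "System, Systemic")

# Lookup table: each original value next to its group label.
lookup <- data.frame(original = oldstring, grouped = newstring)
print(lookup)

# Original members of each group, in their original order.
members <- split(oldstring, newstring)
print(members)
```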

See Also

str_find

Examples

library(sjmisc) # provides group_str() and frq()

oldstring <- c("Hello", "Helo", "Hole", "Apple",
               "Ape", "New", "Old", "System", "Systemic")
newstring <- group_str(oldstring)

# see result
newstring

# count for each group
table(newstring)

# print table to compare original and grouped string
frq(oldstring)
frq(newstring)

# larger groups
newstring <- group_str(oldstring, precision = 3)
frq(oldstring)
frq(newstring)

# be more strict with matching pairs
newstring <- group_str(oldstring, precision = 3, strict = TRUE)
frq(oldstring)
frq(newstring)

Example output

[1] "Hello, Helo"      "Hello, Helo"      "Hole"             "Ape, Apple"      
[5] "Ape, Apple"       "New"              "Old"              "System, Systemic"
[9] "System, Systemic"
newstring
      Ape, Apple      Hello, Helo             Hole              New 
               2                2                1                1 
             Old System, Systemic 
               1                2 

# x <character> 
# total N=9  valid N=9  mean=5.00  sd=2.74
 
      val frq raw.prc valid.prc cum.prc
      Ape   1   11.11     11.11   11.11
    Apple   1   11.11     11.11   22.22
    Hello   1   11.11     11.11   33.33
     Helo   1   11.11     11.11   44.44
     Hole   1   11.11     11.11   55.56
      New   1   11.11     11.11   66.67
      Old   1   11.11     11.11   77.78
   System   1   11.11     11.11   88.89
 Systemic   1   11.11     11.11  100.00
     <NA>   0    0.00        NA      NA


# x <character> 
# total N=9  valid N=9  mean=3.33  sd=2.00
 
              val frq raw.prc valid.prc cum.prc
       Ape, Apple   2   22.22     22.22   22.22
      Hello, Helo   2   22.22     22.22   44.44
             Hole   1   11.11     11.11   55.56
              New   1   11.11     11.11   66.67
              Old   1   11.11     11.11   77.78
 System, Systemic   2   22.22     22.22  100.00
             <NA>   0    0.00        NA      NA


# x <character> 
# total N=9  valid N=9  mean=5.00  sd=2.74
 
      val frq raw.prc valid.prc cum.prc
      Ape   1   11.11     11.11   11.11
    Apple   1   11.11     11.11   22.22
    Hello   1   11.11     11.11   33.33
     Helo   1   11.11     11.11   44.44
     Hole   1   11.11     11.11   55.56
      New   1   11.11     11.11   66.67
      Old   1   11.11     11.11   77.78
   System   1   11.11     11.11   88.89
 Systemic   1   11.11     11.11  100.00
     <NA>   0    0.00        NA      NA


# x <character> 
# total N=9  valid N=9  mean=2.44  sd=1.13
 
               val frq raw.prc valid.prc cum.prc
        Ape, Apple   2   22.22     22.22   22.22
 Hello, Helo, Hole   3   33.33     33.33   55.56
          New, Old   2   22.22     22.22   77.78
  System, Systemic   2   22.22     22.22  100.00
              <NA>   0    0.00        NA      NA


# x <character> 
# total N=9  valid N=9  mean=5.00  sd=2.74
 
      val frq raw.prc valid.prc cum.prc
      Ape   1   11.11     11.11   11.11
    Apple   1   11.11     11.11   22.22
    Hello   1   11.11     11.11   33.33
     Helo   1   11.11     11.11   44.44
     Hole   1   11.11     11.11   55.56
      New   1   11.11     11.11   66.67
      Old   1   11.11     11.11   77.78
   System   1   11.11     11.11   88.89
 Systemic   1   11.11     11.11  100.00
     <NA>   0    0.00        NA      NA


# x <character> 
# total N=9  valid N=9  mean=2.89  sd=1.54
 
              val frq raw.prc valid.prc cum.prc
       Ape, Apple   2   22.22     22.22   22.22
      Hello, Helo   2   22.22     22.22   44.44
        Hole, Old   2   22.22     22.22   66.67
              New   1   11.11     11.11   77.78
 System, Systemic   2   22.22     22.22  100.00
             <NA>   0    0.00        NA      NA

sjmisc documentation built on Dec. 11, 2021, 9:34 a.m.