regexp2df: [!+] Capture information to a dataframe by regular...

View source: R/regexp2df.R

Description

Capture information in substrings of text that match named and unnamed tokens of regular expressions and convert the result to a data frame.

Usage

regexp2df(
  text,
  pattern,
  ignore.case = FALSE,
  perl = TRUE,
  stringsAsFactors = default.stringsAsFactors(),
  ...
)

Arguments

text

The text to be parsed: a character vector where matches are sought, or an object which can be coerced by as.character to a character vector.

pattern

Perl-like regular expression.

ignore.case

logical. If FALSE, pattern matching is case sensitive; if TRUE, case is ignored during matching.

perl

logical. Should Perl-compatible regexps be used?

stringsAsFactors

logical, passed to as.data.frame.

...

Other arguments to be passed to gregexpr.

Value

A data frame with parsed information.

Contribution

This function uses ideas from this answer on github.com.

Note

Internally, the function calls gregexpr with parameter perl = TRUE.
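The mechanism behind this note can be sketched in base R. With perl = TRUE, regexpr and gregexpr attach the attributes capture.start, capture.length and capture.names to their result, and these are enough to cut the captured substrings out of the text. The helper below, capture_groups_sketch, is a hypothetical illustration and not the spMisc implementation. It assumes the pattern contains at least one capture group, and it uses regexpr, so only the first match in each element is taken.

```r
# Hypothetical helper (not part of spMisc): capture groups -> data frame,
# using only base R. Assumes `pattern` has at least one capture group.
capture_groups_sketch <- function(text, pattern) {
  text <- as.character(text)

  # First match per element; perl = TRUE adds the capture.* attributes
  m <- regexpr(pattern, text, perl = TRUE)

  starts      <- attr(m, "capture.start")   # matrix: one column per group
  lens        <- attr(m, "capture.length")  # matrix: one column per group
  group_names <- attr(m, "capture.names")   # "" for unnamed groups

  # Cut out each group's substring for every element of `text`
  columns <- lapply(seq_len(ncol(starts)), function(j) {
    value <- substring(text, starts[, j], starts[, j] + lens[, j] - 1)
    value[m == -1] <- NA_character_         # element did not match at all
    value
  })

  # Give unnamed groups a placeholder name (naming scheme is an assumption)
  unnamed <- !nzchar(group_names)
  group_names[unnamed] <- paste0("X", which(unnamed))
  names(columns) <- group_names

  as.data.frame(columns, stringsAsFactors = FALSE)
}

res <- capture_groups_sketch(
  c("A_111  B_aaa", "A_222  B_bbb", "no match"),
  "A_(?<Part_A>.*)  B_(?<Part_B>.*)"
)
print(res)
```

Elements that do not match the pattern produce NA in every column here. That is a design choice of this sketch and may differ from how regexp2df treats non-matching elements.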

Author(s)

Author: Vilmantas Gegzna. Contributor: MrFlick, who provided ideas on github.com (see section Contribution).

See Also

More about regular expressions used in R: regex
Website handy for creating and testing Perl-like regular expressions (library pcre, version 1, not 2) https://regex101.com/r/dS3iP1/1#pcre

Functions gregexpr, regcapturedmatches; operator %>% from package magrittr.

Other spMisc utilities: bru(), clc(), clear(), fCap(), isFALSE(), list_AddRm(), make.filenames(), open_wd(), printDuration(), st01()

Examples

text1 <- c("A_111  B_aaa",
              "A_222  B_bbb",
              "A_333  B_ccc",
              "A_444  B_ddd",
              "A_555  B_eee")

# Named tokens
pattern1_named_tokens <- 'A_(?<Part_A>.*)  B_(?<Part_B>.*)'

regexp2df(text1, pattern1_named_tokens)
##     Part_A Part_B
## 1    111    aaa
## 2    222    bbb
## 3    333    ccc
## 4    444    ddd
## 5    555    eee

# Unnamed tokens - groups inside parentheses:
pattern1_unnamed_tokens <- 'A_(.*)  B_(.*)'

regexp2df(text1, pattern1_unnamed_tokens)
##       X     X.1
## 1    111    aaa
## 2    222    bbb
## 3    333    ccc
## 4    444    ddd
## 5    555    eee


#----------------------------------------------------------
# Wrong. There must be NO SPACES in a token's name:


## Not run: 
pattern2 <- 'A (?<Part A>.*)  B (?<Part B>.*)'
regexp2df(text1, pattern2)

## Error ...


## End(Not run)
#----------------------------------------------------------
text3 <- c("sn555 ID_O20-5-684_N52_2_Subt2_01.",
              "sn555 ID_O20-5-984_S52_8_Subt10_11.")

pattern3 <- paste0('sn(?<serial_number>.*) ID_(?<ID>.*)_(?<Class>[NS])',
                   '(?<Sector>.*)_(?<Point>.*)_[Ss]ubt.*\\.')

regexp2df(text3, pattern3)

##   serial_number        ID Class Sector Point
## 1           555 O20-5-684     N     52     2
## 2           555 O20-5-984     S     52     8

#----------------------------------------------------------
# List all .R files in your working directory:

regexp2df(dir(),'(?<R_file>.*\\.[rR]$)')


# Do the same by using chaining operator %>%:
library(dplyr)

dir() %>% regexp2df('(?<R_file>.*\\.[rR]$)')

#----------------------------------------------------------
# Capture several types of files:

expr <- paste0('(?<R_file>.*\\.[rR]$)|',
               '(?<Rmd_file>.*\\.[rR]md$)|',
               '(?<CSV_file>.*\\.[cC][sS][vV]$)')
dir() %>% regexp2df(expr)

GegznaV/spMisc documentation built on April 26, 2020, 5:59 p.m.