splitToDF: Read a data.frame from a character vector according to a...


Description

Read a data.frame from a character vector, using a regular expression with named fields to extract values from matching items. The named fields become columns in the result, and each matching item in the input yields one row.

FIXME (eventually): when the stringi package's regexp code can handle named subexpressions, use stri_extract_all_regex(..., simplify=TRUE).

Usage

splitToDF(rx, s, namedOnly = TRUE, validOnly = TRUE, guess = TRUE, ...)

Arguments

rx:

Perl-type regular expression with named fields, as described in ?regex

s:

character vector. Each element must match rx, i.e. must have at least one character matching each named field in rx.

namedOnly:

if TRUE (the default), return columns only for named subexpressions of the regex. Otherwise, a column is returned for every subexpression.

validOnly:

if TRUE (the default), return rows only for elements of s matching rx. Otherwise, a row is returned for each element of s, and rows for those not matching rx are filled with NA.

guess:

if TRUE (the default), paste the columns together with commas and use read.csv to try to return the columns already converted to appropriate types, e.g. integer or real.

...:

additional parameters to read.csv() used when guess is TRUE.
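The guess step described above can be sketched in base R. This is only an illustration under assumptions, not the package's actual code: the list `cols` is a hypothetical stand-in for the character columns produced by the regular expression, and the choice of `header = FALSE` with `col.names` is one plausible way to call read.csv.

```r
# Hypothetical character columns, as they might look after regex extraction.
cols <- list(ts      = c("1425968711.000018004", "1425968712.000011015"),
             preAge  = c("163", "1055"),
             postAge = c("1130", "11655"))

# Paste the columns together with commas, giving one CSV line per row.
lines <- do.call(paste, c(cols, sep = ","))

# Let read.csv guess the column types. header = FALSE because the pasted
# lines carry no header row; the column names are supplied separately.
df <- read.csv(text = lines, header = FALSE, col.names = names(cols))

str(df)
```

Extra arguments passed through ... would presumably end up in the read.csv call, so options such as colClasses could be used to override the guessed types.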

Value

a data.frame. Each column is a vector and corresponds to a named field in rx, going from left to right. Each row in the data.frame corresponds to an item in s which matches rx. If no items of s match rx, the function returns NULL. If guess is TRUE, columns have been converted to their guessed types.

Note

This function serves a similar purpose to read.csv, except that the rules for splitting input lines into columns are much more flexible. Any format which can be described by a regular expression with named fields can be handled. For example, logfile messages often contain extra text, variable field positions, and interspersed unrelated messages. These prevent direct use of functions like read.csv or scan to extract what is really just a data.frame with syntactic sugar and interleaved junk.

For example, if input lines look like this:

s = c( "Mar 10 06:25:11 SG [62442.231077] pps-gpio: PPS @ 1425968711.000018004: pre_age = 163, post_age = 1130",
       "Mar 10 06:25:11 SG [62442.23108] usb-debug: device 45 disconnected",
       "Mar 10 06:25:12 SG [62443.2311] pps-gpio: PPS @ 1425968712.000011015: pre_age = 1055, post_age = 11655",
       "Mar 10 06:25:13 SG [62444.2] dbus[2872]: [system] Successfully activated service 'org.freedesktop.PackageKit'",
       "Mar 10 06:25:13 SG [62444.23] pps-gpio: PPS @ 1425968713.000011275: pre_age = 160, post_age = 12120" )

and we wish to extract timestamps and pre_age and post_age from the pps-gpio messages as a data.frame, we can use this regular expression:

rx = "pps-gpio: PPS @ (?<ts>[0-9]+\\.[0-9]*): pre_age = (?<preAge>[0-9]+), post_age = (?<postAge>[0-9]+)"

splitToDF(rx, s) then gives:

          ts preAge postAge
1 1425968711    163    1130
2 1425968712   1055   11655
3 1425968713    160   12120

where the first column is numeric and the others are integer.
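For readers curious how named fields can be pulled out of matching lines, here is a minimal base R sketch of the same extraction. It is not the package's implementation: `extractNamed` is a hypothetical helper name, it assumes every subexpression in the regular expression is named, and it uses type.convert rather than read.csv to guess column types.

```r
# Hypothetical helper: extract named capture groups into a data.frame.
extractNamed <- function(rx, s) {
  m <- regexpr(rx, s, perl = TRUE)        # perl = TRUE enables (?<name>...)
  hit <- as.vector(m) > 0                 # TRUE for elements of s matching rx
  if (!any(hit)) return(NULL)             # no matches: nothing to return
  st  <- attr(m, "capture.start")[hit, , drop = FALSE]
  len <- attr(m, "capture.length")[hit, , drop = FALSE]
  nm  <- attr(m, "capture.names")
  # One character column per capture group, one entry per matching line.
  out <- lapply(seq_along(nm), function(j)
    substring(s[hit], st[, j], st[, j] + len[, j] - 1))
  names(out) <- nm
  df <- as.data.frame(out, stringsAsFactors = FALSE)
  # Guess column types (integer, numeric, ...) from the character data.
  df[] <- lapply(df, type.convert, as.is = TRUE)
  df
}

s <- c("Mar 10 06:25:11 SG [62442.231077] pps-gpio: PPS @ 1425968711.000018004: pre_age = 163, post_age = 1130",
       "Mar 10 06:25:11 SG [62442.23108] usb-debug: device 45 disconnected",
       "Mar 10 06:25:12 SG [62443.2311] pps-gpio: PPS @ 1425968712.000011015: pre_age = 1055, post_age = 11655",
       "Mar 10 06:25:13 SG [62444.23] pps-gpio: PPS @ 1425968713.000011275: pre_age = 160, post_age = 12120")
rx <- "pps-gpio: PPS @ (?<ts>[0-9]+\\.[0-9]*): pre_age = (?<preAge>[0-9]+), post_age = (?<postAge>[0-9]+)"

df <- extractNamed(rx, s)
str(df)
```

The sketch relies on the capture.start, capture.length and capture.names attributes that regexpr attaches to its result when perl = TRUE; the usb-debug line does not match and so contributes no row.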

Author(s)

John Brzustowski jbrzusto@REMOVE_THIS_PART_fastmail.fm


jbrzusto/motus-R-package documentation built on May 18, 2019, 7:03 p.m.