curateDFtoDF: Curate data.frame into a data.frame

curateDFtoDF    R Documentation

Curate data.frame into a data.frame

Description

Curate a data.frame into a new data.frame by applying regular expression substitution rules to values in one or more of its columns.

Usage

curateDFtoDF(
  x,
  curationL2 = NULL,
  matchWholeString = TRUE,
  trimWhitespace = TRUE,
  whitespace = "_ ",
  expandWhitespace = TRUE,
  keepAllColnames = TRUE,
  verbose = TRUE,
  ...
)

Arguments

x

data.frame

curationL2

list with curation rules as described in Details, or a character vector of YAML file paths, which will be imported into list format using yaml::yaml.load_file().

matchWholeString, trimWhitespace, whitespace, expandWhitespace

arguments passed to curateVtoDF().

keepAllColnames

logical indicating whether to keep all colnames from x in addition to those created during curation. keepAllColnames=FALSE will only keep colnames specifically described in the curationL2 list, while keepAllColnames=TRUE will keep all original colnames, and any colnames added during the curation steps.

verbose

logical indicating whether to print verbose output

...

additional arguments are passed to curateVtoDF()

Details

This function takes a data.frame as input, and uses values in one or more of its columns to create or update columns in an output data.frame. It is useful when the final desired data.frame depends upon values in more than one column of the input data.frame.

Specifically, this function is a wrapper around curateVtoDF().

Typically, curationL2 is derived from YAML-formatted files, and loaded into a list as follows:

curationL2 <- yaml::yaml.load_file("curation.yaml")

The structure of curationL2:

  • curationL2 is a list object whose names(curationL2) are values in colnames(x), and represent the columns of data used as input.

  • each list element in curationL2 is also a list, whose names represent colnames to create or update in the output data.frame.

  • each of these inner lists contains character vectors of length 2, where the first value is a regular expression substitution pattern (see base::gsub()), and the second value is the replacement pattern.

The list is processed in order, and names can be repeated as necessary to apply the proper substitution patterns in the order required. New columns created during the curation may also be used in later curation steps.

Example curation.yaml in YAML format. Note that the leading spaces (indentation) are required, since they define the nesting of each rule.

From_ColnameA:
  To_ColnameC:
  - - patternA
    - replacementA
  - - patternB
    - replacementB
  To_ColnameD:
  - - patternC
    - replacementC
  - - patternD
    - replacementD
From_ColnameB:
  To_ColnameE:
  - - patternE
    - replacementE
  - - patternF
    - replacementF
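For reference, the same rules can be written directly as a nested R list, which may help when no YAML file is involved. This is a sketch that assumes the YAML parser converts each inner two-item sequence into a character vector of length 2, consistent with the structure described above. The pattern and replacement values are the placeholders from the YAML example.

```r
# Nested list equivalent to the example curation.yaml above.
# Level 1: "from" colnames; level 2: "to" colnames;
# level 3: character vectors of c(pattern, replacement).
curationL2 <- list(
   From_ColnameA=list(
      To_ColnameC=list(
         c("patternA", "replacementA"),
         c("patternB", "replacementB")),
      To_ColnameD=list(
         c("patternC", "replacementC"),
         c("patternD", "replacementD"))),
   From_ColnameB=list(
      To_ColnameE=list(
         c("patternE", "replacementE"),
         c("patternF", "replacementF"))))

# Inspect the nesting
str(curationL2)
```

A list in this form can be supplied as the curationL2 argument in place of a list loaded from YAML.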

When a rule targets a colname already present in colnames(x), only values specifically matched by the substitution patterns are modified. For example, this technique can be used to modify the group assignment of a Sample_ID:

Sample_ID:
  Group:
  - - Sample1234
    - WildType

The rules above will match "Sample1234" in the "Sample_ID" column of x, and assign "WildType" to the "Group" column only for matching entries.

In addition to values in colnames(x), the "from" value may also be "rownames", which causes the curation rules to act upon values in rownames(x) instead of values in a specific column of x.

Note that if a "to" column does not already exist, then all values in the "from" column which do not match any substitution pattern will be used to fill the remainder of the "to" column. Once the "to" column exists, then only entries with a matching substitution pattern are replaced using the replacement pattern.
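The two cases described above can be illustrated with base R alone. This is only a sketch of the documented behavior, not the internal code of curateDFtoDF() or curateVtoDF(). The input data.frame and the column name "Group2" are hypothetical; only "Sample1234" and "WildType" come from the rule shown earlier.

```r
# Hypothetical input data for illustration
x <- data.frame(
   Sample_ID=c("Sample1233", "Sample1234", "Sample1235"),
   Group=c("Mutant", "Mutant", "WildType"),
   stringsAsFactors=FALSE)

# Rule: from "Sample_ID", pattern "Sample1234", replacement "WildType",
# with the pattern written in its expanded whole-string form
pattern <- "^.*(Sample1234).*$"
replacement <- "WildType"
matched <- grepl(pattern, x$Sample_ID)

# Case 1: the "to" column "Group" already exists,
# so only rows with a matching pattern are overwritten.
x$Group[matched] <- gsub(pattern, replacement, x$Sample_ID[matched])

# Case 2: the "to" column "Group2" does not exist yet,
# so non-matching rows are filled with the "from" values,
# and matching rows receive the replacement.
x$Group2 <- x$Sample_ID
x$Group2[matched] <- gsub(pattern, replacement, x$Sample_ID[matched])

x
```

In the first case the existing "Mutant" and "WildType" assignments of unmatched rows are left untouched, while in the second case unmatched rows carry their Sample_ID value into the new column.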

For example, for NanoString data, the column "CartridgeWell" can be derived from rownames(x), after which the new column "CartridgeWell" can be used in subsequent curation steps.

Additional notes:

  • The substitution pattern is automatically expanded to match the whole input string, if it does not already do so. For example, supplying "WT" will be expanded to "^.*(WT).*$". However, if the substitution pattern is already "^.*(WT).*$" then it will not be expanded.

  • When the substitution pattern is expanded, the string is also enclosed in parentheses "()", which means the replacement can use "\\1" to use the successfully matched pattern as the output string. For example, if "WT" and "Mutant" are always valid genotypes, then it would be sufficient to define substitution pattern "WT|Mutant" and replacement pattern "\\1".

  • When the substitution pattern is expanded, and the string is enclosed in parentheses, any parentheses in the substitution pattern are therefore one level deeper, for example "file([A-Z]+)" will be expanded to "^.*(file([A-Z]+)).*$". See the example below, where the replacement pattern uses "\\2" to use only the internal parentheses.
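The effect of pattern expansion on capture group numbering can be checked directly with base::gsub(). This is a minimal sketch: expand_pattern() is a hypothetical helper that reproduces the expansion described in the notes above, and the filenames are made-up values; neither is part of the package.

```r
# Hypothetical helper: wrap a pattern as "^.*(pattern).*$"
expand_pattern <- function(pattern) {
   paste0("^.*(", pattern, ").*$")
}

# Made-up input values for illustration
filenames <- c("fileABCDE_WT_Veh", "fileFGHIJ_Mutant_EtOH")

# "\\1" returns whichever alternative was matched
genotype <- gsub(expand_pattern("WT|Mutant"), "\\1", filenames)
genotype
# "WT" "Mutant"

# With internal parentheses, "\\1" is the outer group added by expansion
file_full <- gsub(expand_pattern("file([A-Z]+)"), "\\1", filenames)
file_full
# "fileABCDE" "fileFGHIJ"

# "\\2" is the internal group from the original pattern
file_stem <- gsub(expand_pattern("file([A-Z]+)"), "\\2", filenames)
file_stem
# "ABCDE" "FGHIJ"
```

Because the expanded pattern matches the entire string, the replacement becomes the whole output value rather than a substring edit.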

See Also

Other jam design functions: curateVtoDF(), groups2contrasts()

Examples

set.seed(123);
df <- data.frame(filename=paste(
   paste0("file",
      sapply(1:6, function(i) {
         paste(sample(LETTERS, 5), collapse="")
      })),
   rep(c("WT", "Mut"), each=3),
   rep(c("Veh","EtOH"), 3),
   sep="_"));
df;

# Note a couple ways of accomplishing similar results:
# Genotype matches "WT|wildtype" and replaces with "WT",
# then matches "Mut|mutant" and replaces with "Mut"
#
# Treatment matches "Veh|EtOH" and simply replaces with
# whatever was matched
curationYaml <- c(
"filename:
  Genotype:
  - - WT|wildtype
    - WT
  - - Mut|mutant
    - Mut
  Treatment:
  - - Veh|EtOH
    - \\1
  File:
  - - file([A-Z]+)
    - \\1
  FileStem:
  - - file([A-Z]+)
    - \\2");
# print the curation.yaml to show its structure
cat(curationYaml)
curationL <- yaml::yaml.load(curationYaml);
curateDFtoDF(df, curationL);


jmw86069/splicejam documentation built on April 21, 2024, 4:57 p.m.