curateVtoDF: Curate vector into a data.frame

curateVtoDF  R Documentation

Curate vector into a data.frame

Description

Curate vector into a data.frame

Usage

curateVtoDF(
  x,
  curationL = NULL,
  matchWholeString = TRUE,
  trimWhitespace = TRUE,
  whitespace = "_ ",
  expandWhitespace = TRUE,
  previous = NULL,
  verbose = TRUE,
  ...
)

Arguments

x

character vector as input

curationL

list containing curation rules, as described in Details, or a character vector of YAML files, which will be imported into a list format using yaml::yaml.load_file().

matchWholeString

logical indicating whether to match the whole string for each entry in x. If matchWholeString=TRUE, each substitution pattern is extended where needed so that it matches the whole string.

trimWhitespace

logical indicating whether to trim leading and trailing whitespace characters from x.

whitespace

character vector containing whitespace characters.

expandWhitespace

logical indicating whether substitution patterns should be modified so any whitespace characters in the pattern will match the defined whitespace characters. For example when expandWhitespace=TRUE, the pattern "_KO_" will be modified to "[ _]+KO[ _]+" so the pattern will match " KO " and "_KO_".

previous

optional data.frame whose colnames may be present as names in curationL, or a single vector with length(previous) == length(x). If previous is supplied as a data.frame, and the curation colname is present in colnames(previous), then entries that match no substitution pattern will retain the data in the relevant column of previous. This mechanism allows editing single values in an existing column, based upon pattern matching in another column.

verbose

logical indicating whether to print verbose output.

...

additional arguments are ignored.
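
The effect of expandWhitespace described above can be sketched in base R. This is only an illustration of the documented behavior, not the internal code of curateVtoDF(); the names ws_class, expanded, and samples are hypothetical, and the character class is written out by hand for the default whitespace = "_ ".

```r
# Illustrative sketch only; not the internal code of curateVtoDF().
# Character class covering the default whitespace characters "_" and " ".
ws_class <- "[ _]+"

# Replace each run of whitespace characters in the pattern with the class.
pattern <- "_KO_"
expanded <- gsub(ws_class, ws_class, pattern)
expanded

# The expanded pattern accepts either delimiter, but still requires one
# on each side of "KO".
samples <- c("Sample_KO_1", "Sample KO 1", "SampleKO1")
grepl(expanded, samples)
```

Under these assumptions the first two sample names match and the third does not, because "SampleKO1" has no delimiter around "KO".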

Details

This function is intended to curate a vector into a data.frame with specifically assigned colnames. It is intended to be a more generic method of curating annotations than splitting a character string by some delimiter, for example where the order of annotations may differ from entry to entry, but where there are known patterns that are sufficient to describe an annotation column.

That said, if annotations can be reliably split using a delimiter, that method is often a better choice. In that case, this function may be useful to make input data fit the expected format.

For example, from c("Sample1_WT_LPS_1hour", "Sample2_KO_LPS_2hours") we can tell whether a sample is KO or WT by looking for that substring.

The curationL is a list with the following properties:

  • names(curationL) represent colnames to create in the output data.frame.

  • each list element contains a list of two-element vectors.

  • each two-element vector contains a substitution pattern and a substitution replacement.
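
A list with these properties can also be written directly in R, without YAML. The rules below are hypothetical ones chosen to fit the two sample names in the example above; they are a sketch of the structure, not rules shipped with the package.

```r
# Hypothetical rules for sample names such as "Sample1_WT_LPS_1hour".
curationL <- list(
   Genotype = list(
      c("WT|wildtype", "WT"),
      c("KO|knockout", "KO")),
   Treatment = list(
      c("LPS", "LPS")))

# Each name becomes an output column; each rule is c(pattern, replacement).
names(curationL)
lengths(curationL$Genotype)
```

Writing the list by hand is convenient for a few rules; for longer rule sets the YAML form is usually easier to maintain.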

When matchWholeString=TRUE the substitution patterns are extended to match the whole string, using parentheses around the main pattern. For example if the pattern is "KO" and replacement is "KO", then the pattern is extended to "^.*(KO).*$", so the entire string will be replaced with "KO".
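
The whole-string extension can be tried with base R gsub(). This sketch assumes the pattern is wrapped in parentheses and padded with "^.*" and ".*$"; the names whole_pattern and genotype are hypothetical, and the code is not taken from curateVtoDF() itself.

```r
# Illustrative sketch only; assumes the extension "^.*(pattern).*$".
x <- c("Sample1_WT_LPS_1hour", "Sample2_KO_LPS_2hours")
pattern <- "KO"

# Wrap the pattern in parentheses and anchor it to the whole string.
whole_pattern <- paste0("^.*(", pattern, ").*$")
whole_pattern

# Entries containing "KO" are replaced entirely; gsub() leaves others as-is.
genotype <- gsub(whole_pattern, "KO", x)
genotype
```

Note that plain gsub() leaves the unmatched first entry unchanged. How curateVtoDF() fills unmatched entries depends on the other rules and on the previous argument.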

Typically, curationL is derived from YAML formatted files, and loaded into a list with this type of setup:

curationL <- yaml::yaml.load_file("curation.yaml")

The generic YAML format is as follows:

NewColname_1:
- - patternA
  - replacementA
- - patternB
  - replacementB
NewColname_2:
- - patternC
  - replacementC

A specific example:

Treatment:
- - LPS
  - LPS
- - Control|cntrl|ctrl
  - Control
Genotype:
- - WT|wildtype
  - WT
- - KO|knockout|knock
  - KO
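
To check how this YAML maps onto the list structure described above, the same text can be parsed from a string with yaml::yaml.load(). This is a small sketch that assumes the yaml package is installed; the variable name curationYaml is used only for illustration.

```r
# The specific example above, held in a character string.
curationYaml <- "Treatment:
- - LPS
  - LPS
- - Control|cntrl|ctrl
  - Control
Genotype:
- - WT|wildtype
  - WT
- - KO|knockout|knock
  - KO
"

# Parse the YAML text; top-level keys become the names of the list.
curationL <- yaml::yaml.load(curationYaml)
names(curationL)

# Inspect the nested pattern and replacement pairs.
str(curationL)
```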

See Also

Other jam design functions: curateDFtoDF(), groups2contrasts()

Examples

set.seed(123);
# six example names: a random five-letter file stem, genotype, and treatment
x <- paste(
   paste0("file",
      sapply(1:6, function(i) {
         paste(sample(LETTERS, 5), collapse="")
      })),
   rep(c("WT", "Mut"), each=3),
   rep(c("Veh","EtOH"), 3),
   sep="_");
x;

curationYaml <- c(
"Genotype:
- - WT|wildtype
  - WT
- - Mut|mutant
  - Mut
Treatment:
- - Veh|EtOH
  - \\1
File:
- - file([A-Z]+)
  - \\1
FileStem:
- - file([A-Z]+)
  - \\2");
# print the curation YAML to show its structure
cat(curationYaml)
# parse the YAML text into a list, then curate x into a data.frame
curationL <- yaml::yaml.load(curationYaml);
curateVtoDF(x, curationL);
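
The replacements \\1 and \\2 in the rules above rely on the extra parentheses added when matchWholeString=TRUE. The sketch below reproduces that numbering with base R gsub(), assuming the extension has the form "^.*(pattern).*$"; the names x1, whole_file, file_full, and file_stem are hypothetical, and the code is not the function's own implementation.

```r
# Illustrative sketch only; assumes the extension "^.*(pattern).*$".
x1 <- "fileABCDE_WT_Veh"

# "Veh|EtOH" has no parentheses of its own, so group 1 is the added wrapper.
treatment <- gsub("^.*(Veh|EtOH).*$", "\\1", x1)
treatment

# "file([A-Z]+)" gains an outer group: group 1 is the whole pattern,
# group 2 is the original inner group.
whole_file <- "^.*(file([A-Z]+)).*$"
file_full <- gsub(whole_file, "\\1", x1)
file_stem <- gsub(whole_file, "\\2", x1)
c(file_full, file_stem)
```

This is why the File rule can use \\1 to keep the "file" prefix while the FileStem rule uses \\2 to keep only the letters.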


jmw86069/splicejam documentation built on Nov. 4, 2024, 10:53 a.m.