curateDFtoDF: Curate data.frame into a data.frame
In jmw86069/splicejam: Analysis and Visualization of Gene Splice Variants and Transcriptome Data

curateDFtoDF

R Documentation

Curate data.frame into a data.frame

Description

Curate data.frame into a data.frame

Usage

curateDFtoDF(
  x,
  curationL2 = NULL,
  matchWholeString = TRUE,
  trimWhitespace = TRUE,
  whitespace = "_ ",
  expandWhitespace = TRUE,
  keepAllColnames = TRUE,
  verbose = TRUE,
  ...
)

Arguments

`x`	data.frame
`curationL2`	list with curation rules as described in Details, or a character vector of YAML file paths, which will be imported into list format using `yaml::yaml.load_file()`.
`matchWholeString`, `trimWhitespace`, `whitespace`, `expandWhitespace`	arguments passed to `curateVtoDF()`.
`keepAllColnames`	logical indicating whether to keep all colnames from `x` in addition to those created during curation. `keepAllColnames=FALSE` will only keep colnames specifically described in the `curationL2` list, while `keepAllColnames=TRUE` will keep all original colnames, and any colnames added during the curation steps.
`verbose`	logical indicating whether to print verbose output
`...`	additional arguments are passed to `curateVtoDF()`

Details

This function takes a data.frame as input, and uses one or more of its columns in data curation to create another data.frame. It is useful when the final desired data.frame depends upon values in more than one column of the input data.frame.

Specifically, this function is a wrapper around curateVtoDF().

Typically, curationL2 is derived from YAML formatted files, and loaded into a list like this:

curationL2 <- yaml::yaml.load_file("curation.yaml")

The structure of curationL2:

curationL2 is a list object whose names(curationL2) are values in colnames(x), each naming a column of x used as input.
Each list element in curationL2 is also a list, whose names are the colnames to create or update in the output data.frame.
Each of these inner lists contains character vectors of length 2: a regular expression substitution pattern (see base::gsub), followed by a replacement pattern.

The list is processed in order, and names can be repeated as necessary to apply the proper substitution patterns in the order required. New columns created during the curation may also be used in later curation steps.

Example curation.yaml in YAML format. Note that the leading spaces (indentation) are required by the format.

From_ColnameA:
  To_ColnameC:
  - - patternA
    - replacementA
  - - patternB
    - replacementB
  To_ColnameD:
  - - patternC
    - replacementC
  - - patternD
    - replacementD
From_ColnameB:
  To_ColnameE:
  - - patternE
    - replacementE
  - - patternF
    - replacementF
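The same structure can also be written directly as an R list, which may help when checking what a YAML file is expected to produce. This is a minimal sketch that assumes each pattern/replacement pair is represented as a character vector of length 2, as the structure description above states; the pattern and replacement strings are the same placeholders used in curation.yaml.

```r
# Direct construction of the curation list, mirroring curation.yaml above.
# All pattern/replacement values are placeholders, not real rules.
curationL2 <- list(
   From_ColnameA=list(
      To_ColnameC=list(
         c("patternA", "replacementA"),
         c("patternB", "replacementB")),
      To_ColnameD=list(
         c("patternC", "replacementC"),
         c("patternD", "replacementD"))),
   From_ColnameB=list(
      To_ColnameE=list(
         c("patternE", "replacementE"),
         c("patternF", "replacementF"))))

# "from" colnames, expected to be present in colnames(x)
names(curationL2)
# "to" colnames created or updated from From_ColnameA
names(curationL2$From_ColnameA)
# first rule for To_ColnameC: pattern, then replacement
curationL2$From_ColnameA$To_ColnameC[[1]]
```

Because the list is processed in order, the position of each rule in the list determines when it is applied.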

When the rule creates a colname already present in colnames(x), then only values specifically matched by the substitution patterns are modified. For example, this technique can be used to modify the group assignment of a Sample_ID:

Sample_ID:
  Group:
  - - Sample1234
    - WildType

The rules above will match "Sample1234" in the "Sample_ID" column of x, and assign "WildType" to the "Group" column only for matching entries.

In addition to values in colnames(x), the "from" value may also be "rownames", which causes the curation rules to act upon values in rownames(x) instead of values in a specific column of x.

Note that if a "to" column does not already exist, then all values in the "from" column which do not match any substitution pattern will be used to fill the remainder of the "to" column. Once the "to" column exists, then only entries with a matching substitution pattern are replaced using the replacement pattern.

For example, for NanoString data, the column "CartridgeWell" can be derived from rownames(x), after which the new column "CartridgeWell" can be used in subsequent curation steps.
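The difference between updating an existing "to" column and creating a new one can be sketched in base R. This is only an illustration of the behavior described above, not the implementation used by curateDFtoDF() or curateVtoDF(); the helper apply_rule() and the "Batch" column are hypothetical names introduced for this example.

```r
# Hypothetical helper, not part of splicejam: apply one
# pattern/replacement rule from a "from" column to a "to" column.
apply_rule <- function(df, from, to, pattern, replacement) {
   from_values <- as.character(df[[from]])
   matched <- grepl(pattern, from_values)
   if (!to %in% colnames(df)) {
      # new "to" column: unmatched entries are filled with "from" values
      df[[to]] <- from_values
   }
   # in both cases, only matched entries receive the replacement
   df[[to]][matched] <- gsub(pattern, replacement, from_values[matched])
   df
}

x <- data.frame(
   Sample_ID=c("Sample1234", "Sample5678"),
   Group=c("Mutant", "Mutant"),
   stringsAsFactors=FALSE)

# "Group" already exists: only the matching row is modified
x1 <- apply_rule(x, "Sample_ID", "Group", "^.*(Sample1234).*$", "WildType")
x1$Group
# "WildType" "Mutant"

# "Batch" does not exist: the unmatched row is filled from Sample_ID
x2 <- apply_rule(x, "Sample_ID", "Batch", "^.*(Sample1234).*$", "B1")
x2$Batch
# "B1" "Sample5678"
```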

Additional notes:

The substitution pattern is automatically expanded to match the whole input string, unless it already does so. For example, supplying "WT" results in the expanded pattern "^.*(WT).*$". However, if the substitution pattern is already "^.*(WT).*$" then it will not be expanded.
When the substitution pattern is expanded, it is also enclosed in parentheses "()", which means the replacement can use "\\1" to return the matched text as the output string. For example, if "WT" and "Mutant" are always valid genotypes, then it is sufficient to define the substitution pattern "WT|Mutant" and the replacement pattern "\\1".
Because the expanded pattern is enclosed in parentheses, any parentheses already in the substitution pattern end up one level deeper. For example, "file([A-Z]+)" is expanded to "^.*(file([A-Z]+)).*$". See the example below, where the replacement pattern uses "\\2" to return only the text matched by the internal parentheses.
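The effect of the expanded patterns and the "\\1" and "\\2" replacements can be checked with base::gsub() alone. The sketch below uses the already-expanded patterns quoted in the notes above and two made-up input strings; it does not call curateDFtoDF() and makes no claim about how the expansion itself is implemented.

```r
# Made-up input strings, for illustration only
x <- c("fileABCDE_WT_Veh", "fileFGHIJ_Mutant_EtOH")

# Expanded form of "WT|Mutant"; "\\1" returns the matched genotype
genotype <- gsub("^.*(WT|Mutant).*$", "\\1", x)
genotype
# "WT" "Mutant"

# Expanded form of "file([A-Z]+)"; "\\1" is the outer (added) group
file_name <- gsub("^.*(file([A-Z]+)).*$", "\\1", x)
file_name
# "fileABCDE" "fileFGHIJ"

# "\\2" is the inner group, one level deeper after expansion
file_stem <- gsub("^.*(file([A-Z]+)).*$", "\\2", x)
file_stem
# "ABCDE" "FGHIJ"
```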

Examples

set.seed(123);
df <- data.frame(filename=paste(
   paste0("file",
      sapply(1:6, function(i) {
         paste(sample(LETTERS, 5), collapse="")
      })),
   rep(c("WT", "Mut"), each=3),
   rep(c("Veh","EtOH"), 3),
   sep="_"));
df;

# Note a couple ways of accomplishing similar results:
# Genotype matches "WT|wildtype" and replaces with "WT",
# then matches "Mut|mutant" and replaces with "Mut"
#
# Treatment matches "Veh|EtOH" and simply replaces with
# whatever was matched
curationYaml <- c(
"filename:
  Genotype:
  - - WT|wildtype
    - WT
  - - Mut|mutant
    - Mut
  Treatment:
  - - Veh|EtOH
    - \\1
  File:
  - - file([A-Z]+)
    - \\1
  FileStem:
  - - file([A-Z]+)
    - \\2");
# print the curation.yaml to show its structure
cat(curationYaml)
curationL <- yaml::yaml.load(curationYaml);
curateDFtoDF(df, curationL);

jmw86069/splicejam documentation built on April 14, 2025, 3:12 a.m.
