preprocess_sample_colors: Preprocess PSI or cRPKM data frame using configuration file

View source: R/preprocess_sample_colors.R

preprocess_sample_colors    R Documentation

Preprocess PSI or cRPKM data frame using configuration file

Description

preprocess_sample_colors re-orders PSI or cRPKM sample columns to a specified order and defines sample pools and colors. Order and colors are taken from a pre-defined tab-delimited file (see Details).

Usage

preprocess_sample_colors(data, config, subg = TRUE, expr = FALSE,
  col = NULL, multi_col = NULL)

Arguments

data

An n x (2*m+1) data frame of PSI and quality score values, where n is the number of AS events and m is the number of samples. If expr=TRUE, an n x (m+1) data frame of cRPKM values. In both cases, the first column corresponds to the row metadata. Metadata column values must be unique; duplicated values will be discarded.

config

Filename of the configuration file for data. Also accepts an m x 4 data frame of the configuration file, or an m x 5 data frame if the SubgroupName column is included.

subg

Set to TRUE to define a subgroup structure using the SubgroupName column in config. If FALSE, or if the file does not contain this column, samples will not be subgrouped (a separate subgroup will be defined for each sample, preserving the sample names). If FALSE and the file contains a SubgroupName column, that column will be ignored.

expr

Set to TRUE if formatting a cRPKM table. Otherwise, FALSE.

col

Vector of colors with length matching the number of samples. If specified, this will override the color settings specified in config.

multi_col

Vector of colors with length matching the number of rows in data. If specified, this can be used to define the color corresponding to each event in plot_multievent().

Details

preprocess_sample_colors depends on a pre-defined "sample inventory" database file in tab-delimited format. This file is species-specific and consists of five columns: Order (optional), SampleName, SubgroupName (optional), GroupName, RColorCode (optional). The header is required in the file. The order of the columns is flexible.

For example:

Order    SampleName    SubgroupName    GroupName    RColorCode
1        Oocyte_a      Oocyte          EarlyDev     firebrick4
2        Oocyte_b      Oocyte          EarlyDev     firebrick4
3        Embr_4C_a     Embr_4C         EarlyDev     firebrick4
4        Embr_4C_b     Embr_4C         EarlyDev     firebrick4
5        ESC_CGR8      ESC             ESC          firebrick
etc.

where:

  • Order: A specific ordering of the samples from left to right of the plot.

  • SampleName: Name of the sample. MUST match sample name in input table.

  • SubgroupName: Use to define sample pools that will be plotted in the same data point (see plot_event, plot_expr and plot_multievent).

  • GroupName: Use for plotting the average PSI of samples belonging to the same group. Averages will be calculated from the individual samples, not from the subgroups (to avoid overrepresentation of subgroups with fewer samples).

  • RColorCode: Color name as listed by colors(), or hex color code (#RRGGBB).
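
As a minimal sketch, the example table above could be assembled in base R and saved as a tab-delimited file with a header. The object names (config, config_file) and the temporary file path are illustrative only, and the color check at the end is an assumption about what a usable RColorCode looks like, not a check the package is known to perform.

```r
# Hypothetical config mirroring the example table above
config <- data.frame(
  Order        = 1:5,
  SampleName   = c("Oocyte_a", "Oocyte_b", "Embr_4C_a", "Embr_4C_b", "ESC_CGR8"),
  SubgroupName = c("Oocyte", "Oocyte", "Embr_4C", "Embr_4C", "ESC"),
  GroupName    = c("EarlyDev", "EarlyDev", "EarlyDev", "EarlyDev", "ESC"),
  RColorCode   = c("firebrick4", "firebrick4", "firebrick4", "firebrick4", "firebrick"),
  stringsAsFactors = FALSE
)

# Write it as a tab-delimited file with a header row, then read it back
config_file <- tempfile(fileext = ".tab")
write.table(config, config_file, sep = "\t", quote = FALSE, row.names = FALSE)
config_from_file <- read.delim(config_file, stringsAsFactors = FALSE)

# Assumed sanity check: each color is a named R color or a #RRGGBB hex code
valid_color <- config$RColorCode %in% colors() |
  grepl("^#[0-9A-Fa-f]{6}$", config$RColorCode)
```

Either config_file or the config data frame could then be passed as the config argument.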

The SampleName must match the column names in data. It is possible for config to contain more samples than the data. In this case, the extra samples will be ignored. It is also possible that config contains only a subset of the samples in data. In this case, only the samples specified in the config will be plotted and everything else is ignored.
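
The matching rule above can be illustrated with base R set operations. This is a sketch of the described behavior using made-up sample names, not the package's internal code; config_samples and data_samples are hypothetical vectors.

```r
# Hypothetical sample names: "Neuron_a" is absent from the data,
# "Liver_a" is absent from the config
config_samples <- c("Oocyte_a", "Oocyte_b", "Neuron_a", "ESC_CGR8")
data_samples   <- c("ESC_CGR8", "Liver_a", "Oocyte_b", "Oocyte_a")

# Samples that are kept: present in both, listed in config order
kept <- config_samples[config_samples %in% data_samples]

# Samples that are ignored on either side
ignored_from_config <- setdiff(config_samples, data_samples)
ignored_from_data   <- setdiff(data_samples, config_samples)
```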

If a SampleName or SubgroupName is matched to multiple groups, only the first match will be used. Similarly, if a GroupName is matched to multiple RColorCodes, the first one will be applied to all elements in the group.
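
One way to picture the first-match rule is with duplicated() in base R. The toy config below is invented for illustration, and the steps are an assumption about how such a rule could be applied, not a copy of the function's implementation.

```r
# Hypothetical config with ambiguous entries:
# sample S2 appears in two groups, and group A has two colors
config <- data.frame(
  SampleName = c("S1", "S2", "S2", "S3"),
  GroupName  = c("A", "A", "B", "B"),
  RColorCode = c("red", "blue", "green", "green"),
  stringsAsFactors = FALSE
)

# Keep only the first row seen for each sample
dedup <- config[!duplicated(config$SampleName), ]

# Take the first color seen for each group ...
group_colors <- dedup[!duplicated(dedup$GroupName), c("GroupName", "RColorCode")]

# ... and apply it to every sample in that group
dedup$RColorCode <- group_colors$RColorCode[match(dedup$GroupName, group_colors$GroupName)]
```

In this toy case S2 stays in group A, and all samples in group A end up with the color "red".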

To use the SubgroupNames in config, subg must be set to TRUE and config must contain a SubgroupName column. If either condition is not met, one subgroup will be created for each sample, preserving the sample names and order, and overriding any subgroups in config.
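
The fallback described above amounts to a two-part condition. The helper below, resolve_subgroups, is a hypothetical name introduced only for this sketch; it is not part of the package.

```r
# Hypothetical helper: returns the subgroup label for each sample
resolve_subgroups <- function(config, subg = TRUE) {
  if (subg && "SubgroupName" %in% names(config)) {
    # Both conditions met: use the subgroups from config
    config$SubgroupName
  } else {
    # Otherwise each sample becomes its own subgroup, keeping its name
    config$SampleName
  }
}

# Toy config for illustration
cfg <- data.frame(
  SampleName   = c("Oocyte_a", "Oocyte_b", "ESC_CGR8"),
  SubgroupName = c("Oocyte", "Oocyte", "ESC"),
  stringsAsFactors = FALSE
)

with_subgroups    <- resolve_subgroups(cfg, subg = TRUE)
without_subgroups <- resolve_subgroups(cfg, subg = FALSE)
no_column         <- resolve_subgroups(cfg["SampleName"], subg = TRUE)
```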

The colors in config can be overridden by specifying col. This was mainly added to support the col option provided by plot_event, particularly when config is not provided.

This function is also used for formatting cRPKM input data by setting expr = TRUE.

Value

A list containing:

data

data frame of PSI/cRPKM values with columns re-ordered

qual

data frame of quality scores with columns re-ordered. NULL if expr = TRUE

sample_order

data frame with the order corresponding to each sample name

subgroup

data frame with the subgroup corresponding to each sample. If subg=FALSE, or SubgroupName is not present in config, a subgroup is made for each sample, preserving sample names.

subgroup_order

data frame with the order corresponding to each subgroup.

group

data frame with the group corresponding to each subgroup

group_order

data frame with the order and color corresponding to each group.

multi_col

if multi_col was specified, a data frame with the color corresponding to each event/gene ID. Else, NULL.

config

if a config was supplied, config data frame summarising the order-sample-subgroup-group-color relationships described in sample_order, subgroup, subgroup_order, group and group_order, after correcting for ambiguous relationships. If col was supplied, colors in config are overridden with col. If no config was supplied, one will be composed with default parameters.

original_config

data frame with the config supplied to the function

See Also

plot_event, plot_expr, plot_multievent

Examples

# Tables from vast-tools need formatting before using this function
a <- format_table(psi)
reorderedpsi <- preprocess_sample_colors(a, config = config)

b <- format_table(crpkm, expr = TRUE)
reorderedcrpkm <- preprocess_sample_colors(b, config = config, expr = TRUE)

# Subgroups can be avoided even if the config file has them
reorderedpsi <- preprocess_sample_colors(a, config = config, subg = FALSE)

# Mapping colours to events (e.g. for plotting with plot_multievent)
reorderedpsi <- preprocess_sample_colors(a[1:2,], config = config, multi_col = c("red","blue"))

kcha/psiplot documentation built on March 27, 2022, 4:20 a.m.