parse_pdata: Parse key-value pairs in GEO series matrix file

View source: R/phenodata.R

parse_pdata    R Documentation

Parse key-value pairs in GEO series matrix file

Description

Many GSEs now store sample annotation as key-value pairs in "characteristics_ch*" columns. When that is the case, this function cleans those columns up, turning the keys into column names and the values into column values.

Usage

parse_pdata(data, columns = NULL, sep = ":", split = ";")

Arguments

data

A data.frame-like object; tibbles and data.tables are also accepted.

columns

A character vector naming the columns in data to parse. Each name should end with a match for "(ch\d*)(\.\d*)?". If NULL, all columns whose names start with "characteristics_ch" are used.

sep

A string separating the key from the value in each key-value pair, usually ":".

split

Passed to the strsplit function to separate the key-value items within a column. Default is ";".
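To illustrate the columns argument, here is a minimal base R sketch of which names would qualify. The column names are made up for illustration, and the selection logic only mirrors the description above; it is not taken from the rgeo source.

```r
# Hypothetical column names, loosely modelled on a phenodata table
cols <- c(
    "title", "source_name_ch1", "characteristics_ch1",
    "characteristics_ch1.1", "characteristics_ch2"
)

# Described default when columns = NULL:
# every name starting with "characteristics_ch"
default_cols <- cols[startsWith(cols, "characteristics_ch")]

# Names that end with a match for the documented pattern
valid <- grepl("(ch\\d*)(\\.\\d*)?$", cols)

print(default_cols)
print(cols[valid])
```

Under this reading, "source_name_ch1" ends with a valid suffix and could be passed explicitly, but it would not be picked up by the default.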

Details

A characteristics annotation column usually contains multiple key-value items, so each column is first split by split, and the key-value pairs are then extracted using sep. For each key, a new column is added. Its name is the first group matched by the "(ch\d*)(\.\d*)?$" regex pattern in the original column name, joined to the key by "_" (for example, the key "age" in column "characteristics_ch1" gives "ch1_age"). Its value is the character vector of the values for that key across all key-value pairs.
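The steps described above can be sketched in base R roughly as follows. This is an illustration under stated assumptions, not the rgeo implementation: sketch_parse_column is a hypothetical helper, sep is matched literally, values are kept as character, and keys missing from a sample are filled with NA.

```r
# Hypothetical helper (not part of rgeo): parse one characteristics column
# into a data.frame with one column per key.
sketch_parse_column <- function(x, col_name, sep = ":", split = ";") {
    # First capture group of the column-name pattern, e.g. "ch1"
    m <- regmatches(col_name, regexec("(ch\\d*)(\\.\\d*)?$", col_name))
    prefix <- m[[1]][2]

    # Turn the items of one sample into a named character vector
    parse_items <- function(item) {
        item <- trimws(item)
        item <- item[!is.na(item) & nzchar(item)]
        pos <- regexpr(sep, item, fixed = TRUE) # first occurrence of sep
        item <- item[pos > 0]
        pos <- pos[pos > 0]
        keys <- trimws(substr(item, 1, pos - 1))
        values <- trimws(substr(item, pos + nchar(sep), nchar(item)))
        stats::setNames(values, keys)
    }

    # Split each sample into items, then into key-value pairs
    pairs <- lapply(strsplit(x, split), parse_items)
    keys <- unique(unlist(lapply(pairs, names)))

    # One character column per key; NA where a sample lacks the key
    out <- lapply(keys, function(k) {
        vapply(pairs, function(p) {
            if (k %in% names(p)) p[[k]] else NA_character_
        }, character(1))
    })
    names(out) <- paste(prefix, keys, sep = "_")
    as.data.frame(out, stringsAsFactors = FALSE, check.names = FALSE)
}

x <- c("age: 52; gender: M", "age: 30; gender: F; tissue: brain")
res <- sketch_parse_column(x, "characteristics_ch1")
print(res)
```

Here the two input strings yield the columns ch1_age, ch1_gender, and ch1_tissue, with ch1_tissue missing for the first sample.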

Value

A modified data.frame.

Examples

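# Download the series matrix for GSE53987 into a temporary directory
gse53987 <- rgeo::get_geo(
    "gse53987", tempdir(),
    gse_matrix = TRUE, add_gpl = FALSE,
    pdata_from_soft = FALSE
)
# Extract the sample annotation (phenodata)
gse53987_smp_info <- Biobase::pData(gse53987)
# Insert "; " before each known key so the items can be split on ";"
gse53987_smp_info$characteristics_ch1 <- stringr::str_replace_all(
    gse53987_smp_info$characteristics_ch1,
    "gender|race|pmi|ph|rin|tissue|disease state",
    function(x) paste0("; ", x)
)
# Parse the key-value pairs into new columns
gse53987_smp_info <- rgeo::parse_pdata(gse53987_smp_info)
# Show the original column alongside the parsed "ch1_*" columns
gse53987_smp_info[grepl(
    "^ch1_|characteristics_ch1", names(gse53987_smp_info)
)]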

Yunuuuu/rgeo documentation built on Dec. 23, 2024, 10:01 p.m.