csDEXdataSet: Object to store a csDEX dataset.


View source: R/csdex.R

Description

Initialize an instance of the csDEXdataSet object by constructing a feature-by-condition data matrix with associated metadata.

Usage

csDEXdataSet(data.dir, design.file, type = "count",
    col.condition = "Experiment.target", col.replicate = "File.accession",
    col.testable = "testable", col.read.count = "input.read.count",
    col.additional = c(), data.file.ext = "txt", aggregation = NULL,
    min.bin.count = NULL, min.gene.count = NULL, zero.out = NULL)

Arguments

data.dir

The directory containing the replicate (data) files.

design.file

The design file, with one line per replicate. It must contain at least two columns, denoting the file name (default: "File.accession") and the condition name (default: "Experiment.target"). An optional column can denote the total read count associated with a replicate (default: "input.read.count"). An optional binary column (default: "testable") can be used to test only selected conditions. Other columns can be stored, but will not be used by csDEX.

type

Data type, either "PSI" (percent-spliced in) or "count" (read counts).

col.condition

Name of the column in the design file denoting the experimental condition.

col.replicate

Name of the column in the design file denoting unique file identifiers (without extension).

col.read.count

Name of the column denoting the original input read counts.

col.additional

Additional columns to import (for use in e.g. complex model designs).

col.testable

Optional column with binary (TRUE/FALSE) information for which conditions to test.

data.file.ext

Replicate files extension (default: "txt").

aggregation

Replicate aggregation function. A function for merging the values of the replicates of a sample into a single value. If NULL, the function definition listed under Examples uses mean for PSI data and sum for count data.

min.bin.count

Minimum value at a feature to be considered non-zero. If defined, values below this threshold will be set to zero.

min.gene.count

Minimum sum of expression values across a gene for it to be considered expressed. If defined, all the features associated with a gene below this threshold will be set to zero.

zero.out

Used with PSI data to prevent PSI values other than 0 or 1 for constitutive features. If a value is less than the minimum count (passed internally as min.count), it is set to NA and ignored by the aggregation function.
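As an illustration of how the three filtering options interact, the following sketch mirrors the filtering steps of the function definition listed under Examples on a small made-up replicate matrix. The values, gene names and thresholds are hypothetical, and row names are assumed to follow the gene:bin pattern.

```r
# Toy replicate matrix for one condition: rows are features named "gene:bin",
# columns are replicates. All values are made up for illustration.
repData <- matrix(
  c(10, 0,
     2, 1,
     0, 0,
     1, 0),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("geneA:1", "geneA:2", "geneB:1", "geneB:2"),
                  c("rep1", "rep2"))
)

min.bin.count  <- 2   # hypothetical threshold
min.gene.count <- 5   # hypothetical threshold

# Step 1: values below min.bin.count are set to zero.
repData[repData < min.bin.count] <- 0

# Step 2: per gene and replicate, if the summed value is below min.gene.count,
# all features of that gene are set to NA in that replicate.
genes <- sub(":.*$", "", rownames(repData))
for (g in unique(genes)) {
  inxs <- genes == g
  low <- colSums(repData[inxs, , drop = FALSE]) < min.gene.count
  repData[inxs, low] <- NA
}

# Step 3: aggregate across replicates, ignoring NA; features with no
# remaining values (NaN after mean) are set to 0.
repVec <- apply(repData, 1, mean, na.rm = TRUE)
repVec[is.na(repVec)] <- 0
```

In this toy case geneB never reaches the gene-level threshold, so both of its features end up at zero, while geneA keeps only the values from the replicate in which it passes.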

Details

The main task of this function is to construct a csDEXdataSet object by aggregating the replicate data into one column per sample. Arbitrary aggregation functions can be passed; the most common are mean, sum and max. The options min.bin.count, min.gene.count and zero.out can be used for filtering.
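The effect of the aggregation function can be illustrated on a small made-up matrix. This sketch assumes only that the function is applied row-wise across the replicates of one condition with na.rm = TRUE, as in the function definition listed under Examples; the helper aggregate_replicates is hypothetical and not part of csDEX.

```r
# Hypothetical replicate values for one condition
# (rows: features, columns: replicates).
repData <- matrix(c(4, 6, NA,
                    3, 0, 9),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("geneA:1", "geneA:2"),
                                  c("rep1", "rep2", "rep3")))

# Hypothetical helper: collapse replicates into one value per feature.
aggregate_replicates <- function(x, aggregation) {
  apply(x, 1, aggregation, na.rm = TRUE)
}

by.mean <- aggregate_replicates(repData, mean)
by.sum  <- aggregate_replicates(repData, sum)
by.max  <- aggregate_replicates(repData, max)
```

Any function that accepts a numeric vector and an na.rm argument and returns a single value could be used in the same way.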

All metadata provided by the design file is stored in the colData of the object. The two mandatory columns are specified by col.condition and col.replicate, as defined above. The total experiment read count can be provided along with each replicate, to enable use of the estimateGeneCPM function in downstream analysis.
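A minimal sketch of the expected inputs, assuming a tab-separated design file with a header and headerless two-column replicate files (feature identifier, value), as read by the function definition listed under Examples. All file names, conditions and values are made up, and the final call is left commented out because it requires csDEX to be installed.

```r
# Write a hypothetical design file and replicate files to a temporary directory.
data.dir <- file.path(tempdir(), "csdex_example")
dir.create(data.dir, showWarnings = FALSE)

design <- data.frame(
  File.accession    = c("rep1", "rep2", "rep3", "rep4"),
  Experiment.target = c("CTRL", "CTRL", "KD", "KD"),
  input.read.count  = c(1e6, 1.2e6, 9e5, 1.1e6),   # optional column
  testable          = c(FALSE, FALSE, TRUE, TRUE), # optional column
  stringsAsFactors  = FALSE
)
design.file <- file.path(data.dir, "design.tsv")
write.table(design, design.file, sep = "\t", quote = FALSE, row.names = FALSE)

# Each replicate file holds one feature per line: a "gene:bin" identifier
# and a value, without a header.
features <- c("geneA:001", "geneA:002", "geneB:001")
for (acc in design$File.accession) {
  counts <- data.frame(features, c(10, 5, 3))
  write.table(counts, file.path(data.dir, paste(acc, "txt", sep = ".")),
              sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
}

# With csDEX installed, the object could then be constructed along these lines:
# cds <- csDEXdataSet(data.dir = data.dir, design.file = design.file,
#                     type = "count")
```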

The row metadata is parsed as featureID (individual regions of interest) and groupID (gene).
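The parsing of row identifiers can be sketched as follows, assuming feature identifiers of the form gene:bin as in the function definition listed under Examples, which also derives a binID column. The identifiers below are made up.

```r
# Hypothetical feature identifiers of the form "gene:bin".
featureID <- c("geneA:001", "geneA:002", "geneB:001")

# Split each identifier at ":" into its gene (groupID) and bin (binID) part.
parts <- strsplit(featureID, ":", fixed = TRUE)
rowData <- data.frame(
  featureID = featureID,
  groupID   = vapply(parts, `[`, character(1), 1),
  binID     = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)
row.names(rowData) <- rowData$featureID
```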

Value

An initialized csDEXdataSet object.

Author(s)

Martin Stražar.

Examples

##---- Should be DIRECTLY executable !! ----
##-- ==>  Define data, use random,
##--	or do  help(data=index)  for the standard data sets.

## The function is currently defined as
function (data.dir, design.file, type = "count", col.condition = "Experiment.target", 
    col.replicate = "File.accession", data.file.ext = "txt", 
    aggregation = NULL, min.bin.count = NULL, min.gene.count = NULL, 
    zero.out = NULL) 
{
    if (!(type %in% c("count", "PSI"))) 
        stop("type must be one of ('count', 'PSI')")
    if (is.null(min.bin.count)) 
        min.bin.count = 0
    if (type == "PSI") {
        if (is.null(aggregation)) 
            aggregation = mean
        if (is.null(zero.out)) 
            zero.out = TRUE
        if (is.null(min.gene.count)) 
            min.gene.count = 1
    }
    else if (type == "count") {
        if (is.null(aggregation)) 
            aggregation = sum
        if (is.null(zero.out)) 
            zero.out = FALSE
        if (is.null(min.gene.count)) 
            min.gene.count = 0
    }
    zeroOut <- function(repData, min.count = 0) {
        genes = unlist(lapply(row.names(repData), function(x) strsplit(x, 
            ":")[[1]][1]))
        for (g in unique(genes)) {
            inxs = genes == g
            zeros = which(apply(repData[inxs, ], 2, sum) < min.count)
            repData[inxs, zeros] = NA
        }
        return(repData)
    }
    design = read.csv(design.file, sep = "\t", header = TRUE)
    stopifnot(col.condition %in% colnames(design))
    stopifnot(col.replicate %in% colnames(design))
    conditions = sort(unique(design[, col.condition]))
    n.con = length(conditions)
    message("Processing expression data")
    exprData = NULL
    lib.sizes = NULL
    for (i in 1:length(conditions)) {
        cond = conditions[i]
        message(sprintf("Condition %s", cond))
        replicates = design[design[, col.condition] == cond, 
            col.replicate]
        cond.lib.size = NULL
        if (!is.null(design$input.read.count)) {
            cond.lib.size = aggregation(design[design[, col.condition] == 
                cond, "input.read.count"])
            lib.sizes = c(lib.sizes, cond.lib.size)
        }
        repData = NULL
        n.rep = length(replicates)
        for (j in 1:length(replicates)) {
            rep = replicates[j]
            rep.path = file.path(data.dir, paste(rep, data.file.ext, 
                sep = "."))
            y = read.table(rep.path, header = FALSE, comment.char = "_")
            y$V1 = as.character(y$V1)
            n.row = nrow(y)
            message(sprintf("     Replicate %s, num. rows: %d", 
                rep.path, n.row))
            if (is.null(repData)) {
                repData = matrix(0, ncol = n.rep, nrow = n.row)
                row.names(repData) = y$V1
            }
            repData[y$V1, j] = y$V2
        }
        repData[repData < min.bin.count] = 0
        if (zero.out || min.gene.count > 0) 
            repData = zeroOut(repData, min.gene.count)
        repVec = suppressWarnings(apply(repData, 1, aggregation, 
            na.rm = TRUE))
        repVec[is.infinite(repVec)] = 0
        repVec[is.na(repVec)] = 0
        if (is.null(exprData)) {
            exprData = matrix(0, ncol = n.con, nrow = n.row)
            row.names(exprData) = row.names(repData)
            colnames(exprData) = conditions
        }
        exprData[, i] = repVec
    }
    rowData = data.frame(featureID = row.names(exprData), groupID = unlist(lapply(row.names(repData), 
        function(x) strsplit(x, ":")[[1]][1])), binID = unlist(lapply(row.names(repData), 
        function(x) strsplit(x, ":")[[1]][2])))
    rowData$featureID = as.character(rowData$featureID)
    rowData$groupID = as.character(rowData$groupID)
    rowData$binID = as.character(rowData$binID)
    row.names(rowData) = rowData$featureID
    colData = data.frame(condition = colnames(exprData))
    if (!is.null(lib.sizes)) 
        colData$lib.size = lib.sizes
    new("csDEXdataSet", exprData = exprData, rowData = rowData, 
        colData = colData, dataType = type)
}

mstrazar/csDEX documentation built on May 23, 2019, 8:16 a.m.