NOTES.md

One or two levels in QFeatures (closed)

One or two assay levels could be considered in QFeatures:

This question on the bioc-devel list asks for advice on SE processing, and whether a new SE or a new assay in the original SE should be preferred. While the latter is arguably more elegant, and is also used in SingleAssayExperiment pipelines, it doesn't seem to be common practice when using SummarizedExperiments.

As for QFeatures (or MultiAssayExperiments in general), the two-level approach isn't readily available out-of-the-box, and would require additional developments:

Despite the elegance of the two-level option, it seems that the additional development isn't warranted at this time.

The updateAssay function was originally intended for the two-level approach, i.e. to add an assay to an SE. This is not considered anymore (for now, at least).

There is one exception though. When aggregating features with aggregateFeatures(), a second assay named aggcounts is added, which counts the number of features that were aggregated for each sample and each low-level feature.
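As a rough illustration of what such a count assay can contain, the base R sketch below sums a toy PSM matrix by peptide and builds a companion matrix counting, per sample, how many non-missing rows went into each aggregated row. The toy matrix, the `pep` grouping vector and the choice to ignore NAs are assumptions made for this example; it is not the aggregateFeatures() implementation.

```r
## Toy PSM-level matrix: 4 PSMs (rows) x 2 samples (columns)
psms <- matrix(c(1, 2, NA, 4,
                 5, 6, 7, 8),
               ncol = 2,
               dimnames = list(paste0("psm", 1:4), c("S1", "S2")))

## Hypothetical grouping: PSMs 1-2 belong to pepA, PSMs 3-4 to pepB
pep <- c("pepA", "pepA", "pepB", "pepB")

## Aggregated intensities: sum per peptide and sample, ignoring NAs
agg <- rowsum(psms, group = pep, na.rm = TRUE)

## Companion count matrix: number of non-missing PSMs per peptide and sample
aggcounts <- rowsum((!is.na(psms)) * 1, group = pep)
```

Both matrices share the same dimensions and dimnames, which is what allows them to sit side by side as two assays of a single SE.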

How to add new assays (closed)

  1. Through aggregation with aggregateFeatures.

  2. Processing an SE.

This can/could be done explicitly with addAssay

addAssay(cptac, logTransform(cptac[["peptides"]]), name = "peptides_log")
addAssay(cptac, logTransform(cptac[[1]]), name = "peptides_log")

or implicitly

logTransform(cptac, "peptides", name = "peptides_log")
logTransform(cptac, 1, name = "peptides_log")

  3. Joining SEs (for example multiple TMT batches) (TODO)

joinAssays(QFeatures, c("pep_batch1", "pep_batch2", "pep_batch3"), name = "peptides")
joinAssays(QFeatures, c(1, 2, 3), name = "peptides")

See below.

QFeatures API

Processing functions

Assays

hlpsms <- hlpsms[1:5000, ] ## faster

ft1 <- readQFeatures(hlpsms, ecol = 1:10, name = "psms", fname = "Sequence")
sum(rownames(ft1[[1]]) == "ANLPQSFQVDTSk")
ft1 <- aggregateFeatures(ft1, "psms", fcol = "Sequence",
                         name = "peptides", fun = colSums)
sapply(rownames(ft1), anyDuplicated)
ft1

## subsetting still works
ft2 <- subsetByFeature(ft1, "ANLPQSFQVDTSk")
ft2

The underlying reason this fails is that subsetting a matrix by name returns only the first match when the names aren't unique.

m <- matrix(1:10, ncol = 2)
colnames(m) <- LETTERS[1:2]
rownames(m) <- c("a", letters[1:4])
m
m["a", ]

And of course, this affects SEs ...

se <- SummarizedExperiment(m)
assay(se["a", ])

... and MultiAssayExperiments.

Note that in the example above, "ANLPQSFQVDTSk" is present in both the psms and peptides assays, and the

for (k in setdiff(all_assays_names, leaf_assay_name)) { ... }

loop in .subsetByFeature isn't executed at all. This will need to be investigated. But the behaviour above can be reproduced even when that's not the case. See

hlpsms$Sequence2 <- paste0(hlpsms$Sequence, "2")
ft1 <- readQFeatures(hlpsms, ecol = 1:10, name = "psms", fname = "Sequence2")
...

This could be fixed by switching to indices:

> (i <- which(rownames(m) == "a"))
[1] 1 2
> m[i, ]
  A B
a 1 6
a 2 7
> se[i, ]
class: SummarizedExperiment
dim: 2 2
metadata(0):
assays(1): ''
rownames(2): a a
rowData names(0):
colnames(2): A B
colData names(0):

See issue #91.
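The index-based approach could be wrapped in a small helper, sketched here in base R on the same toy matrix. The function name `subsetRowsByName` is hypothetical and is not part of QFeatures or SummarizedExperiment.

```r
## Hypothetical helper: return ALL rows whose name is in `nms`,
## by converting the names to integer indices first
subsetRowsByName <- function(x, nms) {
    i <- which(rownames(x) %in% nms)
    x[i, , drop = FALSE]
}

m <- matrix(1:10, ncol = 2)
colnames(m) <- LETTERS[1:2]
rownames(m) <- c("a", letters[1:4])

first_only <- m["a", ]                 ## name-based: first match only
all_rows <- subsetRowsByName(m, "a")   ## index-based: both "a" rows
```

Because the helper only relies on `rownames()` and integer subsetting, the same idea should carry over to objects that support both, although that has not been verified here.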

Assay links

Currently, we have

Joining assays (closed)

To combine assays, we also need:

  1. relaxed MatchedAssayExperiment constraints (see #46)

  2. assay links with multiple parent assays (see #52)

combine,MSnSet,MSnSet does two things, i.e. rbind and cbind. Here we need, at least in the first instance, only the cbind part, which already exists as cbind,SummarizedExperiment.

We need a join-type function that adds NAs at the assay level. To do this, we need to have a union of features before rbinding the assays.

As for rowData, we want to

The row data will be accessible through links between assays anyway.

Naming:

joinAssays(QFeatures, c("pep_batch1", "pep_batch2", "pep_batch3"), name = "peptides")
joinAssays(QFeatures, c(1, 2, 3), name = "peptides")

Algorithm:

  1. Find which mcols to keep

  2. Extend with rownames and NAs (depending on type of join)

  3. Order assays

  4. cbind assays (see cbind,SummarizedExperiment)
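The quantitative part of this algorithm (extending to the union of features, ordering, then cbind) can be sketched in base R on plain matrices. The helper name `joinMatrices` and the toy batches are hypothetical, row names are assumed to be unique within each matrix, and the handling of mcols is left out entirely.

```r
## Hypothetical helper: outer-join a list of matrices on their row names
joinMatrices <- function(ms) {
    ## Union of features across all assays
    all_rows <- Reduce(union, lapply(ms, rownames))
    ## Extend each matrix to the union, filling absent features with NA
    extended <- lapply(ms, function(m) {
        out <- matrix(NA_real_, nrow = length(all_rows), ncol = ncol(m),
                      dimnames = list(all_rows, colnames(m)))
        out[rownames(m), ] <- m
        out
    })
    ## Rows now match in name and order, so the matrices can be cbind-ed
    do.call(cbind, extended)
}

## Two toy batches sharing only pepB
b1 <- matrix(1:4, ncol = 2,
             dimnames = list(c("pepA", "pepB"), c("b1_s1", "b1_s2")))
b2 <- matrix(5:8, ncol = 2,
             dimnames = list(c("pepB", "pepC"), c("b2_s1", "b2_s2")))
joined <- joinMatrices(list(b1, b2))
```

An inner join would use `intersect` instead of `union` in the first step; the rest of the sketch would stay the same.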

Do we want a public join for SummarizedExperiments? Discuss with SE maintainers.

Note: if we were to have assays from multiple fractions to be rbinded, we could consider an rbindAssays, mergeFractions, bindFractions, ...

Replacing vs adding assays

Issues https://github.com/rformassspectrometry/QFeatures/issues/193 and https://github.com/rformassspectrometry/QFeatures/issues/186.

Currently, assays are replaced with

  - filterNA()

  - filterFeatures() (and possibly others)

Sometimes, we want to add, rather than replace, for example if we want to test/assess the effect of different filters. This could be defined by the names argument. If missing (default), the assays are replaced. If present and of the same length as i, new assays are added.
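The proposed semantics can be sketched on a plain named list of matrices. Everything in this base R example is hypothetical (the helper `applyToAssays`, the toy filter `dropNA` and the toy data); it only illustrates the replace-versus-add behaviour driven by a missing or supplied `names` argument.

```r
## Hypothetical helper operating on a named list of matrices
applyToAssays <- function(assays, i, fun, names) {
    res <- lapply(assays[i], fun)
    if (missing(names)) {
        ## No names supplied: replace the processed assays in place
        assays[i] <- res
    } else {
        ## Names supplied: keep the originals and append the results
        stopifnot(length(names) == length(i))
        assays[names] <- res
    }
    assays
}

## Toy filter: drop rows containing any NA
dropNA <- function(m) m[!rowSums(is.na(m)), , drop = FALSE]

assays <- list(psms = matrix(c(1, NA, 3, 4), ncol = 2))
replaced <- applyToAssays(assays, "psms", dropNA)
added <- applyToAssays(assays, "psms", dropNA, names = "psms_noNA")
```

With `names` supplied, the original assay stays untouched, so several filters can be compared side by side.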

There are multiple ideas/discussions related to QFeatures becoming very large (and slow). Rather than adding more assays, we could:

  - use logicals for subsetting;

  - use multiple assays within a SingleCellExperiment (or SE), when the dimensions remain identical (for example logTransform());

  - have a unique database to handle and manage all data (assays and rowData).

But we agree that the interface, for the user, should remain simple, i.e. different assays. For now, keep the same philosophy and create new assays for all operations, and start reflecting on a more in-depth refactoring.

See also HDF5 backend issue.

Devel roadmap

scp

scpdata

QFeatures

Tabular input (issue 199)

  1. Single-set case, multiplexed: requires colAnnotation only. Also LF with a re-ordered peptide/protein-level table (runs are missing in this case).
|------+------------+-----------|
| cols | Quant 1..N | more cols |
|      |            |           |
|      |            |           |
|      |            |           |
|------+------------+-----------|
readQFeatures(hlpsms, quantCols = 1:10)
readQFeatures(hlpsms, colAnnotation = colann)

## also possible, but redundant
readQFeatures(hlpsms, colAnnotation = colann, quantCols = 1:10)

  2. Multi-set case, multiplexed: requires colAnnotation and runCol.
|-----+------+------------+-----------|
| Run | cols | Quant 1..N | more cols |
|   1 |      |            |           |
|   1 |      |            |           |
|-----+------+------------+-----------|
|   2 |      |            |           |
|-----+------+------------+-----------|
readQFeatures(hlpsms, quantCols = 1:10, runCol = "file")
readQFeatures(hlpsms, colAnnotation = colann, runCol = "file")

  3. Multi-set case, LF: requires colData and runCol with an optional multiplexing column (for plexDIA).
|-----+------+---------+-----------+-----------|
| Run | cols | Quant 1 | more cols | multiplex |
|   1 |      |         |           |           |
|   1 |      |         |           |           |
|-----+------+---------+-----------+-----------|
|   2 |      |         |           |           |
|-----+------+---------+-----------+-----------|

  4. Special case DIANN. A specialised function that parses the table to case 2.

Users can either use the arguments above or a colAnnotation data.frame (that will become the colData).
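To make the multi-set cases more concrete, the base R sketch below splits a toy long table by its run column and builds one quantitative matrix per run. The column names (`file`, `Sequence`, `Quant1`, `Quant2`) and the run-prefixed sample names are assumptions for illustration only; this is not how readQFeatures() is implemented.

```r
## Toy table with a run column, a feature identifier and two quant columns
tab <- data.frame(file = c("run1", "run1", "run2"),
                  Sequence = c("pepA", "pepB", "pepA"),
                  Quant1 = c(10, 20, 30),
                  Quant2 = c(11, 21, 31))

## One sub-table per run; each would become one set
sets <- split(tab, tab$file)

## One quantitative matrix per set, with run-prefixed column names
mats <- lapply(names(sets), function(run) {
    m <- as.matrix(sets[[run]][, c("Quant1", "Quant2")])
    rownames(m) <- sets[[run]]$Sequence
    colnames(m) <- paste(run, colnames(m), sep = "_")
    m
})
names(mats) <- names(sets)
```

Prefixing the column names with the run keeps sample names unique across sets, which is one possible way of matching them against the rows of a colAnnotation table.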

DIANN data

dfr |>
  diannWider() |>
  readQFeatures()

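readQFeaturesFromDIANN <- function(dfr, multiplexing = NULL, ...) {
    ## Default: use the table as provided
    x <- dfr
    ## .diannWider() is a planned helper; its signature is assumed here
    if (!is.null(multiplexing))
        x <- .diannWider(dfr, multiplexing)
    readQFeatures(x, ...)
}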


rformassspectrometry/Features documentation built on Sept. 25, 2024, 11:30 a.m.