In acidgenomics/r-cellosaurus: Cellosaurus

## nolint start
suppressPackageStartupMessages({
    library(DT)
    library(Cellosaurus)
})
## nolint end

Introduction

The package is designed to import Cellosaurus annotations and serve as a cell line identifier mapping toolkit. The main Cellosaurus() generator function returns a Cellosaurus object, which extends the S4 DFrame class from Bioconductor. Additional functions are defined to work with the Cellosaurus object, notably mapCells(), which provides mapping support to Cellosaurus identifiers directly from cell line names, DepMap identifiers, and ATCC identifiers.

Cellosaurus table

The primary generator functions imports annotations from the cellosaurus.txt file on the FTP server.

object <- Cellosaurus()
print(object)

The object is structured as cells in rows, metadata in columns.

print(colnames(object))

Data is encoded using run-length encoding (Rle from S4Vectors) to lower memory overhead.

This approach provides simple spreadsheet-like access to Cellosaurus annotations, which are more intuitive to users than nested JSON-style lists.

i <- seq(from = 1L, to = 10L)
j <- which(vapply(
    X = as.data.frame(object),
    FUN = is.atomic,
    FUN.VALUE = logical(1L)
))
datatable(
    data = as.data.frame(object)[i, j],
    options = list(scrollX = TRUE)
)

Mapping cell lines

It remains a common problem in cancer research that cell line inventories are only maintained by cell line name, without any systematic standardization against a reference database, such as Cellosaurus, ATCC, or DepMap. The advantage of standardizing upon Cellosaurus identifiers as the primary research resource identifier is that the Cellosaurus database is a superset of cells that include all cells in ATCC and DepMap. The Cellosaurus database also provides nicely curated metadata on problematic cell lines, notably contamination, and common misspellings of cell line names that persist across a number of commercial vendor databases.

The mapCells() function in the package aims to simplify mapping of cell line names and identifiers from other databases, notably DepMap and ATCC, to Cellosaurus identifiers. The function is designed to be as simple as possible and support mixed input in a single call.

cells <- c("ACH-000551", "CCL-240", "Duadi", "THP1", "RAW 264.7")
i <- mapCells(object, cells = cells)
print(i)
datatable(
    data = as.data.frame(object)[i, j],
    options = list(scrollX = TRUE)
)

Excluding problematic cell lines

The Cellosaurus database nicely keeps track of known issues with cell lines, which can be broken down roughly into two classes: "problematic" and "contaminated". It can be useful for downstream analysis to exclude these cell lines -- in particular, we recommend generally excluding any contaminated cell lines but not necessarily all problematic cell lines.

subset <- excludeProblematicCells(object)
print(nrow(subset))
print(any(subset[["isProblematic"]]))
print(any(subset[["isContaminated"]]))

Note that "contaminated" cell lines are a subset of "problematic" cell lines in the database. These in general should be avoided in downstream workflows.

subset <- excludeContaminatedCells(object)
print(nrow(subset))
print(any(subset[["isContaminated"]]))
print(any(subset[["isProblematic"]]))

For reference, search the website for "problematic" cell lines with "Problematic cell line" and "contaminated" cell lines with "Problematic cell line: Contaminated".