## nolint start suppressPackageStartupMessages({ library(DT) library(Cellosaurus) }) ## nolint end
The package is designed to import Cellosaurus annotations and serve as a
cell line identifier mapping toolkit. The main Cellosaurus()
generator
function returns a Cellosaurus
object, which extends the S4 DFrame
class
from Bioconductor. Additional functions are defined to work with the
Cellosaurus
object, notably mapCells()
, which provides mapping support to
Cellosaurus identifiers directly from cell line names, DepMap identifiers,
and ATCC identifiers.
The primary generator functions imports annotations from the cellosaurus.txt
file on the FTP server.
object <- Cellosaurus() print(object)
The object is structured as cells in rows, metadata in columns.
print(colnames(object))
Data is encoded using run-length encoding (Rle
from S4Vectors) to lower
memory overhead.
This approach provides simple spreadsheet-like access to Cellosaurus annotations, which are more intuitive to users than nested JSON-style lists.
i <- seq(from = 1L, to = 10L) j <- which(vapply( X = as.data.frame(object), FUN = is.atomic, FUN.VALUE = logical(1L) )) datatable( data = as.data.frame(object)[i, j], options = list(scrollX = TRUE) )
It remains a common problem in cancer research that cell line inventories are only maintained by cell line name, without any systematic standardization against a reference database, such as Cellosaurus, ATCC, or DepMap. The advantage of standardizing upon Cellosaurus identifiers as the primary research resource identifier is that the Cellosaurus database is a superset of cells that include all cells in ATCC and DepMap. The Cellosaurus database also provides nicely curated metadata on problematic cell lines, notably contamination, and common misspellings of cell line names that persist across a number of commercial vendor databases.
The mapCells()
function in the package aims to simplify mapping of cell line
names and identifiers from other databases, notably DepMap and ATCC, to
Cellosaurus identifiers. The function is designed to be as simple as possible
and support mixed input in a single call.
cells <- c("ACH-000551", "CCL-240", "Duadi", "THP1", "RAW 264.7") i <- mapCells(object, cells = cells) print(i) datatable( data = as.data.frame(object)[i, j], options = list(scrollX = TRUE) )
The Cellosaurus database nicely keeps track of known issues with cell lines, which can be broken down roughly into two classes: "problematic" and "contaminated". It can be useful for downstream analysis to exclude these cell lines -- in particular, we recommend generally excluding any contaminated cell lines but not necessarily all problematic cell lines.
subset <- excludeProblematicCells(object) print(nrow(subset)) print(any(subset[["isProblematic"]])) print(any(subset[["isContaminated"]]))
Note that "contaminated" cell lines are a subset of "problematic" cell lines in the database. These in general should be avoided in downstream workflows.
subset <- excludeContaminatedCells(object) print(nrow(subset)) print(any(subset[["isContaminated"]])) print(any(subset[["isProblematic"]]))
For reference, search the website for "problematic" cell lines with
"Problematic cell line"
and "contaminated" cell lines with
"Problematic cell line: Contaminated"
.
utils::sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.