knitr::opts_chunk$set(echo = TRUE)

This short vignette shows how to standardize the names of the antibody-derived tags in a CITE-seq experiment in SingleCellExperiment format.

Example 1: Kotliarov PBMC

To demonstrate, we will use the "Kotliarov" dataset available in the package "scRNAseq".

Load the Kotliarov CITE-seq data

library("scRNAseq")
library("AbNames")

kotliarov <- KotliarovPBMCData(mode = "adt")
head(rownames(kotliarov))

Create a data frame to allow matching to the CITEseq data set

To directly match to the citeseq data set, some formatting of the antigen names may be required. Here we remove the "_PROT" suffix. An alternative is to first map antigen names to the gene_aliases data set and use both antigens and identifiers to match to the citeseq data set. However, it is still a good idea to remove prefixes and suffixes to avoid false matches. For example, "PROT" is an alias of the gene SLC6A7 and would match if not removed.

You can include information such as catalogue number and antibody clone to assist matching if it is available. When matching to the citseq data set, the names of the columns to use for matching must equal those in the citeseq data set. For example, below we match using only the antigen name, using the column "Antigen". By default, columns "Clone", "Antigen", and "Cat_Number" (catalogue number) are used for matching if available.

# Create a data.frame for matching to the CITEseq data set
# Column "Antigen" will be used for matching
df <- data.frame(Original = rownames(kotliarov),
                 # Remove _PROT (optionally with a space before)
                 Antigen = gsub(" ?_PROT", "", rownames(kotliarov)))
head(df)

Match to the CITEseq data set and review suggested consensus names

It is a good idea to review matches as errors in e.g. catalogue number will lead to incorrect matches.

The function matchToCiteseq will report which columns are used for matching. Note that unless specified otherwise, the entries of the CITEseq data set are matched using the default columns even if these do not exist in the new data set.

df <- matchToCiteseq(df)
print(head(df))

In general, the names of the original experiment matched the most commonly used name in the CITEseq data set, but we can see that some have different "standard" names.

df[! df$Antigen == df$Antigen_std, ]

By looking at the n_matched column, we can see how many entries in the cite-seq data set (includeing the new query data) could be matched to each antigen. Here we see that only AnnexinV and the isotype controls could not be matched by name only. That is, the only match came from the Kotliarov data set itself.

head(df[order(df$n_matched),], 10)

Rename single cell experiment

# To rename the singleCellExperiment, we need a named vector of the new names,
# with names being the original names of the singleCellExperiment.

# "structure" is a method that allows us to create a named vector in one step
new_nms <- structure(df[[2]], names = rownames(kotliarov))
print(head(new_nms))

kotliarov <- renameADT(kotliarov, new_nms)
head(rownames(kotliarov))

Example 2: CiteFuse example data

In this example, the CITE-seq data are stored as an altExp. The procedure for standardising the antibody names is the same.

Matching directly to the CITEseq data set

# Load CiteFuse example data

library(CiteFuse)
data("CITEseq_example", package = "CiteFuse")
sce_citeseq <- preprocessing(CITEseq_example)
rownames(altExp(sce_citeseq, "ADT"))

df <- data.frame(Antigen = rownames(altExp(sce_citeseq, "ADT")))
df <- matchToCiteseq(df)

# Print entries that differ from the original
df[! df$Antigen == df$Antigen_std, ]

# Print the entries with the fewest matches
head(df[order(df$n_matched),], 10)

We recommend checking the consensus names, especially the isotype controls. We have attempted to manually identify the isotype controls in the citeseq dataset. However, it is possible that we have missed some.

Here we see that the consensus name for "IgG1" is "Mouse IgG1, kappa isotype Ctrl". This is because when other studies refer to "IgG1", we can tell from the antibody clone ID that they are referring to an isotype control with unknown specificity. Without further information, we cannot tell whether the antibody in the CiteFuse example data is an isotype control or an anti-human secondary antibody.

Similarly for TCRa and TCRb, we can tell by matching antigens by clone ID that these are more commonly referred to as TCR a/b (alpha/beta) and TCR g/d (gamma/delta) respectively.

Check matching entries in CITEseq data set

Suppose, as in the above example, we are not sure if a consensus name is correct. We may wish to examine the matching entries in the citeseq data set. One way to do this is to load the citeseq data set and check the matching entries using the function ``.

# We load dplyr to simplify data.frame manipulations.
library(dplyr)

# Load citeseq data set
data(citeseq)

tcrb <- citeseq %>%
    # We will first select entries that equal "TCRb"
    dplyr::filter(Antigen == "TCRb") %>%
    # Then subset to just the columns we want to use for matching
    dplyr::select(Antigen, Clone, Cat_Number, ALT_ID)

# Then we will find entries in citeseq that match any of these columns.

# filter_by_union is a function to return rows where any the value of any
# column occurs in a reference data.frame
citeseq %>%
    filter_by_union(tcrb) %>%
    # Select just columns of interest for viewing
    dplyr::select(Study, Antigen, Clone, Cat_Number, ALT_ID)

From the above result we can confirm that two studies have an antigen named "TCRb". By matching on antibody clone, we see that it is more commonly referred to as TCRab or TCR alpha/beta.

Rename SingleCellExperiment

A note about the ALT_IDs

The column ALT_ID in the data set gene_aliases is an ID that allows the antigens in the citeseq data set to be distinguished. If the HGNC ID is sufficient, then the ALT_ID will be the HGNC_ID. If not, it may be an ID from e.g. the protein ontology or the National Cancer Institute thesaurus, or if no stable ID was found, the antibody and clone name.

Session Info

sessionInfo()
# alternative data sets
#library(SingleCellMultiModal)
#cord_blood <- CITEseq(DataType="cord_blood", modes="scADT_Counts",
#                      dry.run=FALSE, DataClass = "SingleCellExperiment")
#cord_blood


HelenLindsay/AbNames documentation built on June 6, 2023, 1:18 p.m.