Overview and preparations

This script applies the Common Cell type Nomenclature (CCN) to an existing hierarchically-structured taxonomy of cell types, in this case Human Middle Temporal Gyrus (MTG): May 2022. More information about the CCN is available at the Allen Brain Map and in the associated eLife publication. This script requires a taxonomy and some cell meta-data as input and outputs several files useful for publication and for ingest into in-process taxonomy services. Please post any thoughts about the CCN to the Community Forum! This scripts relates to the May 2022 version of the human Middle Temporal Gyrus (MTG) that is used for multiple projects and is available in the Transcriptomics Explorer.

There are two files required as input to this script:

  1. dend.RDS: this is an R dendrogram object representing the taxonomy to be annotated. If you used R for cell typing in your manuscript, this is likely a variable that was created at some point during this process and from which your dendrogram images are made. While this assumes a hierarchical structure of the data, additional cell sets of any kind can be made later in the script. Code for converting from other formats to the R dendrogram format is not provided, but please post to the Community Forum if you have questions about this.
  2. metadata.csv: a table which includes a unique identifier for each cell (in this example it is stored in the sample_name column) as well as the corresponding cell type from the dendrogram for each cell (in this example it is stored in the cluster_label column). Additional metadata of any kind can be optionally included in this table.

Example files from an upcoming SEA-AD taxonomy at the Allen Institute (~June 2022) are included with this library.

The general steps of this script are as follows. First, a unique taxonomy_id is chosen, which will be used as a prefix for all the cell set accession IDs. The R dendrogram is read in and used as the starting point for defining cell sets by including both provisional cell types (terminal leaf nodes) and groups of cell types with similar expression patterns (internal nodes). The main script then assigns accession ids and other required columns for each cell set and outputs an intermediate table, along with a minimally annotated dendrogram for visualization. Next, the user manually annotates these cell sets to include common usage terms (aligned aliases), and can also add additional cell sets which can correspond to any combination of cell types in the taxonomy. These steps can be done manually, through a metadata table, or both. This updated table is read back into R and dendrograms are optionally updated to include the new nomenclature information. Next, cells are assigned nomenclature tags corresponding to their cell set assignments. This is automated for any cell sets based on cell types or other available meta-data. Finally, the code produces a set of standardized files for visualization of updated taxonomic structure and for input into in-process databases for cross-taxonomy comparison (described below) or inclusion as supplemental materials for manuscripts utilizing the annotated taxonomy.

Build the nomenclature

The remainder of this script describes how to run the CCN in R. At this point open RStudio (or R) and start running the code below. The first few blocks correspond to housekeeping things needed to get the workspace set up.

Workspace setup

# NOTE: REPLACE THIS LINK BELOW WITH YOUR WORKING DIRECTORY
outputFolder = paste0(getwd(),"/")

setwd(outputFolder)                           # Needed only if copying and pasting in R
knitr::opts_chunk$set(echo = TRUE)            # Needed only for RStudio
knitr::opts_knit$set(root.dir = outputFolder) # Needed only for RStudio

Load required libraries

suppressPackageStartupMessages({
  library(CCN)
  library(dplyr)
  library(dendextend)
  library(data.table)
  library(jsonlite)
})

Define taxonomy variables

taxonomy_id is the name of the taxonomy in the format: CCN, where:

If more than 10 taxonomies are being generated on a single date, it is reasonable to increment or decrement the data by a day to ensure uniqueness of taxonomies. More generally, to keep taxonomy IDs unique, please select a taxonomy_id NOT IN THIS TABLE, and add your taxonomy_ID to the table as needed. Strategies for better tracking of taxonomy IDs are currently under consideration , including a Cell Type Taxonomony Service currently in development for Allen Institute taxonomies, additional databasing options for the Brain Initiative - Cell Census Network (BICCN), a Cell Annotation Platform (CAP) under development as part of the Human Cell Atlas, and likely other options. One future goal will be to centralize these tracking services and replace the above table.

taxonomy_author is the name of a point person for this taxonomy (e.g., someone who is responsible for it's content). This could either be the person who built the taxonomy, the person who uploaded the data, or the first or corresponding author on a relevant manuscript. By default this person is the same for all cell_sets; however, there is opportunity to manually add additional aliases (and associated assignees and citations that may be different from the taxonomy_author) below.

taxonomy_citation is a the citation or permanent data identifier corresponding to the taxonomy (or "" if there is no associated citation). Ideally the DOI for the publication will be used, or alternatively some other permanent link.

taxonomy_id <- "CCN202204130"

taxonomy_author <- "Jeremy Miller"
# In this case this represents the person uploading the data and taxonomy.  Note that the taxonomy itself was  
#   created by Nik Jorstad and refined by Kyle Travaglini.

taxonomy_citation <- "" # Unpublished as of April 2022, but please check the SEA-AD website for a soon-to-be submitted publication!

Define prefixes for cell_set_label

first_label is a named vector (used as prefix for the cell_set_label), where:

Initially this first_label tag was intended for use with the cell_set_label, which has since been deprecated. However, it is still required for the code to run properly and is a useful way to track cell type (e.g., terminal nodes of the tree). We recommend choosing a relevant delimiter (in this case SEAAD-MTG) and setting the numeric index to 1. The result will be set of labeled cell types in the order presented in the dendrogram.

# For a single labeling index for the whole taxonomy
first_label <- setNames("SEAAD_MTG", 1)

# If you want to use Neurons and Non-neurons
#first_label <- setNames(
#    c("Neuron", "Non-neuron"),
#    c(1       , 111)

Define anatomic structures

A new component of the CCN is the concept of an anatomic structure. This represents the location in the brain (or body) from where the data in the taxonomy was collected. Ideally this will be linked to a standard ontology via the ontology_tag. In this case, we choose "middle temporal gyrus" from "UBERON" since UBERON is specifically designed to be a species-agnostic ontology and we are interested in building cross-species brain references. It is worth noting that these structures can be defined separately for each cell set at a later step, but in the initial set-up only a single structure can be input for the entire taxonomy.

structure = "middle temporal gyrus"

A shortcut function for identifying UBERON (or other ontology terms is included in this package for convenience). We use that here to identify the proper term for "middle temporal gyrus".

find_ontology_terms(structure, exact=TRUE, ontology="UBERON")
ontology_tag = "UBERON:0002771"  # or "none" if no tag is available.

Read in the dendrogram

As discussed above, we have provided a dendrogram of human MTG cell types (called "dend") as an example. Any dendrogram of cell types in the "dendrogram" R format, where the desired cluster aliases are in the "labels" field of the dendrogram will work for this code. Other formats might work and will try to be forced into dendrogram format.

# REPLACE THIS LINE OF CODE WITH CODE TO READ IN YOUR DENDROGRAM, IF NEEDED
data(dend) # Example SEA-AD dendrogram

# Attempt to format dendrogram if the input is in a different format
dend <- as.dendrogram(dend)

# Plot the unannotated dendrogram as a sanity check
plot(dend,main="You should see your desired cell type names on the base of this plot")

Assign the nomenclature!

Most of the steps for assigning the initial nomenclature tags from a dendrogram are done in a single function. If you'd prefer to run it line by line (for example if your data is in slightly different format), you can parse the build_nomenclature_table function, which is well-annotated. This function has been reasonably-well commented to try and explain how each section works.

nomenclature_information <- build_nomenclature_table(
  dend,  first_label, taxonomy_id, taxonomy_author, taxonomy_citation, structure, ontology_tag)

The output of this script is list with three components:

The following columns are included in the cell_set_information table. Most of these columns (indicated by a ^) are components of the CCN.

Automatic annotation of cell sets

This section allows you to automatically annotate existing cell sets and add new cell sets as needed based on existing metadata columns. Note that this function is only applicable for metadata that represent groups of cell types (e.g., something like donor_sex would not be appropriate). More specifically, this function will do the following for each selected column:

  1. Identify all values corresponding to that column (e.g., "class_label" includes "Neuronal: GABAergic", "Neuronal: Glutamatergic", and "Non-neuronal and Non-neural")
  2. For each value it (i) finds all relevant dendrogram labels, (ii) generates the corresponding cell set label, and (iii) cross-references this label with the existing table
  3. If the label exists, the new metadata is added to the requested column in the nomenclature table, and if not it will generate a new entree in the table and then add the requested metadata
  4. This processes is repeated for all relevant metadata columns and associated values

In this case we apply it for subclass and class calls. We are including a subset of the data soon to be released as a Transcriptomics Explorer on the Allen Cell Types Database. For this function either cell metadata or cell type metadata will work; however, later in the script linkages of individual cells to cell sets are required, and this can be done directly from a cell metadata file but not a cell set metadata file.

# REPLACE THIS LINE OF CODE WITH CODE TO READ IN YOUR DENDROGRAM, IF NEEDED
data(metadata) # Example SEA-AD dendrogram

cell_set_information <- nomenclature_information$cell_set_information
metadata_columns     <- c("subclass_label", "class_label")
metadata_order       <- c("subclass_order", "class_order")  # Optional column indicating the order to include metadata
annotation_columns   <- c("cell_set_preferred_alias","cell_set_preferred_alias") # Default is cell_set_preferred_alias
cluster_column       <- "cluster_label"  # Column where the cluster information that went into "cell_set_preferred_alias" 
                                         #     is stored (default is "cluster_label")
append               <- FALSE # If TRUE (default), it will append info; if FALSE, it will ignore cases where there is already an entry
cell_set_information <- annotate_nomenclature_from_metadata(cell_set_information, metadata, metadata_columns, 
                                                            metadata_order, annotation_columns, cluster_column, append)

# Since we used aligned alias terms for subclass and class, we will also save these terms in the cell_set_aligned_alias_slot
annotation_columns   <- c("cell_set_aligned_alias","cell_set_aligned_alias") 
append               <- TRUE # If TRUE, it will append info; if FALSE, it will ignore cases where there is already an entry
cell_set_information <- annotate_nomenclature_from_metadata(cell_set_information, metadata, metadata_columns, 
                                                            metadata_order, annotation_columns, cluster_column, append)

Save the initial dendrogram and nomenclature table

Here we save the cell_set_information to a csv file (nomenclature_table.csv) and output the initial_dendrogram to the screen for manual annotation. You can uncomment the pdf- and dev- lines to save as a pdf (initial_dendrogram.pdf) .

#pdf("initial_dendrogram.pdf",height=8,width=15)
plot_dend(nomenclature_information$initial_dendrogram, node_size=3)
#dev.off()

write.csv(cell_set_information,"initial_nomenclature_table.csv",row.names=FALSE)

Manual annotation of cell sets (optional)

This step is where you can update the nomenclature_table.csv file to add or change aliases and structures, or to add additional cell sets entirely. By default all cell sets corresponding to cell types ("leaf" nodes) are initially assigned exactly one alias that are assigned as the preferred alias, and all remaining cell sets (internal nodes) do not have any aliases.

The original_label column can be used to identify cell sets at the stage for updates by matching with the node (or leaf) label shown in the plotted dendrogram. Do not update the cell_set_accession, cell_set_label, or taxonomy_id columns of existing cell sets (although these will need to be provided for added cell sets). If needed cell_set_structures and associated annotation_tags can be updated at this point, although in most cases these will likely remain unchanged.

The cell_set_alias_assignee and cell_set_alias_citation columns can be updated at this point, as described above. Our example MTG file on the GitHub repo includes some examples of how this can be done.

The cell_set_aligned_alias is a special alias slot designed to match cell types across multiple taxonomies. Ideally these terms will be selected from a semi-controlled vocabulary of terms that are agreed-upon in the relevant cell typing community, for example due to their historical significance or includes in a respected ontology, and will also be assigned in the context of a reference taxonomy. For mammalian neocortex we proposed a specific format for such aligned aliases:

A series of such aligned aliases were used in a recent series of studies in mammalian primary motor cortex (M1), and can be found in one of the supplementary tables of the flagship paper. We use the same aligned alias terms here, with the exception that we replace "L4/5 IT" with "L4 IT" since MTG includes a traditional layer 4 while M1 does not.

Additional cell sets can be added at this time as well. This is very important! To do this, take the following steps:

  1. In a new row, define the cell_set_accession as the largest existing value plus one
  2. Set the taxonomy_id to match the other cell sets. Likely cell_set_structures and associated annotation_tags will also match the other cell sets
  3. If the new cell set corresponds to combinations of cell types present in the tree, the cell_set_label must include the numeric components of the cell_set_labels for relevant cell types. For example, if you wanted to build a new cell set that includes "MTG 001", "MTG 002", and "MTG 005", the cell_set_label would be set as "MTG 001-002, 005". If the cell set is unrelated to cell types, it should be given a name distinct from what is shown in the tree (in our example, any name EXCEPT "MTG #").
  4. Any of the alias columns can be set as described above.

Other columns can also be added to this table, if desired, and those columns will be appended to the dendrogram object in the code below as well.

Read in the updated nomenclature

Once this file has been completed, save the result as a csv file, and continue with the code below, using that file name as input.

# Since we didn't change anything we are not reading in the file, but here is how you would do it
#updated_nomenclature <- read.csv("initial_nomenclature_table.csv")
updated_nomenclature <- cell_set_information

Manual annotation within R

In this case we are going to complete the manual annotation steps within R for the purposes of the vignette. This is another option instead of (or in addition to) writing and reading the files.

# Add node labels corresponding to all IT cells, Non-neuronal, and Non-neural cells
updated_nomenclature[updated_nomenclature$cell_set_accession=="CS202204130_201","cell_set_preferred_alias"] <- "IT"
updated_nomenclature[updated_nomenclature$cell_set_accession=="CS202204130_238","cell_set_preferred_alias"] <- "Non-neuronal"
updated_nomenclature[updated_nomenclature$cell_set_accession=="CS202204130_238","cell_set_aligned_alias"]   <- "Non-neuronal"
updated_nomenclature[updated_nomenclature$cell_set_accession=="CS202204130_248","cell_set_preferred_alias"] <- "Non-neural"
updated_nomenclature[updated_nomenclature$cell_set_accession=="CS202204130_248","cell_set_aligned_alias"]   <- "Non-neural"

Label cell set children

Create an additional tag called child_cell_set_accessions, which is a "|"-separated character vector indicating all of the child cell sets in the dendrogram (e.g., "provisional cell types", "leaves", or "terminal nodes"). This is calculated by using the cell_set_label tags and will help with integration into downstream ontology use cases.

updated_nomenclature <- define_child_accessions(updated_nomenclature)

Label cell set parents

Create an additional tag called parent_cell_set_accession, which is defined as the node with the fewest children that contains a cell set. If the cell sets follows a strict hierarchy (e.g., a tree) then the parent will always be the node in the hierarchy directly above it; however, this function still can return parents in more complex, multi-inheritance structured tables. This can be helpful for integration into downstream ontology use cases.

updated_nomenclature <- define_parent_accessions(updated_nomenclature)

Output the nomenclature table

It is useful to review the nomenclature table at this point, and therefore it is output here.

write.csv(updated_nomenclature,"nomenclature_table.csv",row.names=FALSE)

At this point the nomenclature table is finalized. The remaining steps involve checking that things look right and outputting files.

Update and save the dendrogram

Update the dendrogram and plot the results

This code will take the information from the table above and add it to the initial dendrogram object. When plotted the only visible difference will be that the new cell set alias names (if any) will show up to replace the n## labels from the initial plot. However, ALL of the meta-data read in from the table will be added to the relevant nodes or leafs. Cell sets not linked to the tree will be ignored in this step, but will be added to the relevant text files output below.

updated_dendrogram <- update_dendrogram_with_nomenclature(nomenclature_information$initial_dendrogram,updated_nomenclature)

# In this case we are displaying the dendrogram to the screen, but uncomment the lines below to save as a file
#pdf("updated_dendrogram.pdf",height=8,width=15)
plot_dend(updated_dendrogram, node_size=3)
#dev.off()

Review the dendrogram carefully at this point! You should see all of the node labels that you added in the nomenclature table. If you don't, then you'll need to review the outputted nomenclature_table.csv in the context of the initial dendrogram to make sure everything matches, correct any errors in the nomenclature table, and repeat the steps above starting from "Read in the updated nomenclature". If things look correct, then it's time to proceed.

Save the dendrogram in various formats

Plots only show a small fraction of the data available in these dendrogram objects; to see the rest the dendrogram needs to be saved. We find both the R "dendrogram" format and the "json" format useful for different applications at the Allen Institute and code for saving data as both are presented below. First, let's save the R object.

# Save as an R data object
saveRDS(updated_dendrogram, file="dend.RDS")
# Alternative format
# save(updated_dendrogram, file="dend.rda")

Now let's save the json format. Note that only some dendrogram attributes can be converted to a list. If this section crashes you may need to update the "omit_names" variable. Typically vectors of length 1 (e.g., characters, integers, factors) work but more complex attributes are less reliable.

# Convert to a list
dend_list <- dend_to_list(updated_dendrogram, omit_names = c("markers","markers.byCl","class"))

# Save as a json file
dend_JSON <- toJSON(dend_list, complex = "list", pretty = TRUE)
out <- file("dend.json", open = "w")
writeLines(dend_JSON, out)
close(out)

Define cell to cell set mappings

Up to this point the document describes how to apply the CCN to cell sets based on a hierarchical (or non-hierarchical) dendrograms, with an additional manual annotation step. This final section describes how cells within a data set can be mapped onto this nomenclature. Doing this would better allow mapping of cells and cell sets between multiple taxonomies, particularly in cases where multiple taxonomies contain the same cells.

Set up metadata variables

Prior to assigning nomenclature to individual cells, we need to link each cell to the updated nomenclature for each cell set using information from the metadata file read in above (or if needed, a separate file can be read in here). More specifically, we need to create a character vector of cell_set_accession_ids called cell_id that corresponds to each cell used for generating the taxonomy. This variable is used as a starting point to assign all cells to all cell sets. In this example, the metadata file we read in above includes a preferred_alias column (called cluster_label) corresponding to the cell type names from this Transcriptomics Explorer.

# Read in metadata and collect correct columns for sample name and cell set accession id
# Not needed, as it was read above.

# OPTION 1: COLUMN FOR ACCESSION ID ALREADY EXISTS - (not common)
# cell_id <- metadata$cell_set_accession
# samples <- metadata$sample_name

# OPTION 2: NEED TO GENERATE COLUMN FROM DENDROGRAM LABELS
label_col <- "cluster_label"  # Column name with dendrogram labels
cell_id   <- updated_nomenclature[match(metadata[,label_col],updated_nomenclature$cell_set_preferred_alias),
                                  "cell_set_accession"]
cell_id[is.na(cell_id)] = "none"
samples   <- metadata$sample_name

Assign cells to dendrogram cell sets

Next, we want to automatically link each cell to each cell set that is available in the dendrogram. This is done as a single line of code (Option 1). Note: if a dendrogram is not available, the mapping table can be initialized using a single cell set (Option 2).

# OPTION 1: START FROM A DENDROGRAM
mapping <- cell_assignment_from_dendrogram(updated_dendrogram,samples,cell_id)

# OPTION 2: USE ONLY THE `updated_nomenclature` table
#mapping <- data.frame(sample_name=samples, call=((cell_id==cell_id[1])+1-1))
#colnames(mapping) <- c("sample_name",cell_id[1])

The result of this script is a data frame where the first columns corresponds to the cell sample_name. This is the term used at the Allen Institute for a unique cell ID. This ID is unique across all data at the Allen Institute. In principle, a replacement (or additional) unique ID value could be added for importing into external databases. The remaining columns correspond to the probabilities of each cell mapping to each cell type (from the dendrogram). In this case we define hard probabilities (0 = unassigned to cell set; 1 = assigned to cell set) but this could be adapted to reflect real probabilities calculated elsewhere.

Assign cells to remaining cell sets

This section assigns cell sets that were defined as combinations of cell types, but that were NOT included in the above section. As written, this function assumes that the cell_set_label is assigned using the specific format described above. If Option 2 was selected in the previous code block, all cell_set_labels must have the same prefix.

mapping <- cell_assignment_from_groups_of_cell_types(updated_nomenclature,cell_id,mapping)

Finally, we can add cell to cell set mappings specified by any other metadata. No function is required for this mapping. Instead, add new columns to mapping corresponding to the relevant cell set accessions and provide associated probabilities in the matrix entries. In principle, one could also read in text files and add meta-data columns that way, if desired. For most taxonomies, this step can be skipped, as in most cases cell sets are defined exclusively as a combination of cell types. We print the remaining cell sets here, which in most cases should be none.

missed_ids <- setdiff(updated_nomenclature$cell_set_accession,colnames(mapping))
print(paste0(missed_ids,collapse="; "))

Output cell set assignments

Finally, we output the cell to cell_set assignments as a csv file.

# Output cell to cell set assignments
fwrite(mapping,"cell_to_cell_set_assignments.csv")

Create CCN standard file

The final output of the CCN script is a .zip file that contains these files (all described and outputted above):

The resulting .zip file should be shared in manuscripts and other sources as the standard CCN file, for linkage to other taxonomies.

# Add all relevant files to a zip file with desired filename.  dend.json and/or cell_to_cell_set_assignments.csv can be manually removed below if needed

files = c("dend.json","nomenclature_table.csv","cell_to_cell_set_assignments.csv")
zip(paste0("nomenclature.zip"), files=files)

Session info.

sessionInfo()


AllenInstitute/CCN documentation built on April 15, 2023, 10:48 p.m.