This script applies the Common Cell type Nomenclature (CCN) to an existing hierarchically-structured taxonomy of cell types, in this case Human Middle Temporal Gyrus (MTG): May 2022. More information about the CCN is available at the Allen Brain Map and in the associated eLife publication. This script requires a taxonomy and some cell meta-data as input and outputs several files useful for publication and for ingest into in-process taxonomy services. Please post any thoughts about the CCN to the Community Forum! This scripts relates to the May 2022 version of the human Middle Temporal Gyrus (MTG) that is used for multiple projects and is available in the Transcriptomics Explorer.
There are two files required as input to this script:
dend.RDS
: this is an R dendrogram
object representing the taxonomy to be annotated. If you used R for cell typing in your manuscript, this is likely a variable that was created at some point during this process and from which your dendrogram images are made. While this assumes a hierarchical structure of the data, additional cell sets of any kind can be made later in the script. Code for converting from other formats to the R dendrogram
format is not provided, but please post to the Community Forum if you have questions about this. metadata.csv
: a table which includes a unique identifier for each cell (in this example it is stored in the sample_name
column) as well as the corresponding cell type from the dendrogram for each cell (in this example it is stored in the cluster_label
column). Additional metadata of any kind can be optionally included in this table. Example files from an upcoming SEA-AD taxonomy at the Allen Institute (~June 2022) are included with this library.
The general steps of this script are as follows. First, a unique taxonomy_id is chosen, which will be used as a prefix for all the cell set accession IDs. The R dendrogram is read in and used as the starting point for defining cell sets by including both provisional cell types (terminal leaf nodes) and groups of cell types with similar expression patterns (internal nodes). The main script then assigns accession ids and other required columns for each cell set and outputs an intermediate table, along with a minimally annotated dendrogram for visualization. Next, the user manually annotates these cell sets to include common usage terms (aligned aliases), and can also add additional cell sets which can correspond to any combination of cell types in the taxonomy. These steps can be done manually, through a metadata table, or both. This updated table is read back into R and dendrograms are optionally updated to include the new nomenclature information. Next, cells are assigned nomenclature tags corresponding to their cell set assignments. This is automated for any cell sets based on cell types or other available meta-data. Finally, the code produces a set of standardized files for visualization of updated taxonomic structure and for input into in-process databases for cross-taxonomy comparison (described below) or inclusion as supplemental materials for manuscripts utilizing the annotated taxonomy.
The remainder of this script describes how to run the CCN in R. At this point open RStudio (or R) and start running the code below. The first few blocks correspond to housekeeping things needed to get the workspace set up.
# NOTE: REPLACE THIS LINK BELOW WITH YOUR WORKING DIRECTORY outputFolder = paste0(getwd(),"/") setwd(outputFolder) # Needed only if copying and pasting in R knitr::opts_chunk$set(echo = TRUE) # Needed only for RStudio knitr::opts_knit$set(root.dir = outputFolder) # Needed only for RStudio
suppressPackageStartupMessages({ library(CCN) library(dplyr) library(dendextend) library(data.table) library(jsonlite) })
taxonomy_id
is the name of the taxonomy in the format: CCN
If more than 10 taxonomies are being generated on a single date, it is reasonable to increment or decrement the data by a day to ensure uniqueness of taxonomies. More generally, to keep taxonomy IDs unique, please select a taxonomy_id NOT IN THIS TABLE, and add your taxonomy_ID to the table as needed. Strategies for better tracking of taxonomy IDs are currently under consideration , including a Cell Type Taxonomony Service currently in development for Allen Institute taxonomies, additional databasing options for the Brain Initiative - Cell Census Network (BICCN), a Cell Annotation Platform (CAP) under development as part of the Human Cell Atlas, and likely other options. One future goal will be to centralize these tracking services and replace the above table.
taxonomy_author
is the name of a point person for this taxonomy (e.g., someone who is responsible for it's content). This could either be the person who built the taxonomy, the person who uploaded the data, or the first or corresponding author on a relevant manuscript. By default this person is the same for all cell_sets; however, there is opportunity to manually add additional aliases (and associated assignees and citations that may be different from the taxonomy_author) below.
taxonomy_citation
is a the citation or permanent data identifier corresponding to the taxonomy (or "" if there is no associated citation). Ideally the DOI for the publication will be used, or alternatively some other permanent link.
taxonomy_id <- "CCN202204130" taxonomy_author <- "Jeremy Miller" # In this case this represents the person uploading the data and taxonomy. Note that the taxonomy itself was # created by Nik Jorstad and refined by Kyle Travaglini. taxonomy_citation <- "" # Unpublished as of April 2022, but please check the SEA-AD website for a soon-to-be submitted publication!
first_label
is a named vector (used as prefix for the cell_set_label), where:
Initially this first_label
tag was intended for use with the cell_set_label
, which has since been deprecated. However, it is still required for the code to run properly and is a useful way to track cell type (e.g., terminal nodes of the tree). We recommend choosing a relevant delimiter (in this case SEAAD-MTG) and setting the numeric index to 1. The result will be set of labeled cell types in the order presented in the dendrogram.
# For a single labeling index for the whole taxonomy first_label <- setNames("SEAAD_MTG", 1) # If you want to use Neurons and Non-neurons #first_label <- setNames( # c("Neuron", "Non-neuron"), # c(1 , 111)
A new component of the CCN is the concept of an anatomic structure
. This represents the location in the brain (or body) from where the data in the taxonomy was collected. Ideally this will be linked to a standard ontology via the ontology_tag
. In this case, we choose "middle temporal gyrus" from "UBERON" since UBERON is specifically designed to be a species-agnostic ontology and we are interested in building cross-species brain references. It is worth noting that these structures can be defined separately for each cell set at a later step, but in the initial set-up only a single structure can be input for the entire taxonomy.
structure = "middle temporal gyrus"
A shortcut function for identifying UBERON (or other ontology terms is included in this package for convenience). We use that here to identify the proper term for "middle temporal gyrus".
find_ontology_terms(structure, exact=TRUE, ontology="UBERON") ontology_tag = "UBERON:0002771" # or "none" if no tag is available.
As discussed above, we have provided a dendrogram of human MTG cell types (called "dend") as an example. Any dendrogram of cell types in the "dendrogram" R format, where the desired cluster aliases are in the "labels" field of the dendrogram will work for this code. Other formats might work and will try to be forced into dendrogram format.
# REPLACE THIS LINE OF CODE WITH CODE TO READ IN YOUR DENDROGRAM, IF NEEDED data(dend) # Example SEA-AD dendrogram # Attempt to format dendrogram if the input is in a different format dend <- as.dendrogram(dend) # Plot the unannotated dendrogram as a sanity check plot(dend,main="You should see your desired cell type names on the base of this plot")
Most of the steps for assigning the initial nomenclature tags from a dendrogram are done in a single function. If you'd prefer to run it line by line (for example if your data is in slightly different format), you can parse the build_nomenclature_table
function, which is well-annotated. This function has been reasonably-well commented to try and explain how each section works.
nomenclature_information <- build_nomenclature_table( dend, first_label, taxonomy_id, taxonomy_author, taxonomy_citation, structure, ontology_tag)
The output of this script is list with three components:
cell_set_information
: A data.frame (table) of taxonomy information (see below)initial_dendrogram
: The initially inputted dendrogram, EXCEPT that all nodes are labeled with short labels at this point (n1, n2, n3, etc.) for use with manual annotation steps below updated_dendrogram
: A dendrogram updated to include everything in the cell_set_information
table. Node names are set with the preferred_alias tag, which is blank by default (so likely the node names will be missing) The following columns are included in the cell_set_information
table. Most of these columns (indicated by a ^) are components of the CCN.
cell_set_accession
^: The unique identifier (cell set accession id) assigned for each cell set of the format original_label
: The original cell type label in the dendrogram. This is used for QC only but is not part of the CCN. cell_set_label
: A label of the format cell_set_preferred_alias
^: This is the label that will be shown in the dendrogram and should represent what you would want the cell set to be called in a manuscript or product. If the CCN is applied to a published work, this tag would precisely match what is included in the paper. cell_set_aligned_alias
^: This is a special tag designed to match cell types across different taxonomies. We discuss this tag in great detail in our manuscript, and will be discussed briefly below. As output from build_nomenclature_table
, this will be blank for all cell sets. cell_set_additional_aliases
^: Any additional aliases desired for a given cell set, separated by a "|". For example, this allows inclusion of multiple historical names. As output from build_nomenclature_table
, this will be blank for all cell sets. cell_set_structure
^: The structure, as described above. Can be modified for specific cell sets below. Multiple cell_set_structures can be given separated by a "|". cell_set_ontology_tag
^: The ontology_tag, as described above. Can be modified for specific cell sets below. Multiple cell_set_ontology_tags can be given separated by a "|", and must match cell_set_structure above.cell_set_alias_assignee
: By default the taxonomy_author, as described above. In this case, if aliases are assigned by different people, additional assignees can be added by separating using a "|". The format is [preferred_alias_assignee]|[aligned_alias_assignee]|[additional_alias_assignee(s)]. If aliases are added without adding additional assignees it is assumed that the assignee is the same for all aliases. cell_set_alias_citation
: By default the taxonomy_citation, as described above (or can be left blank). In this case, if preferred (or other) aliases are assigned based on a different citation, additional citations can be added by separating using a "|", with the same rules as described by cell_set_alias_assignee. Ideally the DOI for the publication will be used (or another permanent link). taxonomy_id
^: The taxonomy_id, as described above. This should not be changed. This section allows you to automatically annotate existing cell sets and add new cell sets as needed based on existing metadata columns. Note that this function is only applicable for metadata that represent groups of cell types (e.g., something like donor_sex would not be appropriate). More specifically, this function will do the following for each selected column:
In this case we apply it for subclass and class calls. We are including a subset of the data soon to be released as a Transcriptomics Explorer on the Allen Cell Types Database. For this function either cell metadata or cell type metadata will work; however, later in the script linkages of individual cells to cell sets are required, and this can be done directly from a cell metadata file but not a cell set metadata file.
# REPLACE THIS LINE OF CODE WITH CODE TO READ IN YOUR DENDROGRAM, IF NEEDED data(metadata) # Example SEA-AD dendrogram cell_set_information <- nomenclature_information$cell_set_information metadata_columns <- c("subclass_label", "class_label") metadata_order <- c("subclass_order", "class_order") # Optional column indicating the order to include metadata annotation_columns <- c("cell_set_preferred_alias","cell_set_preferred_alias") # Default is cell_set_preferred_alias cluster_column <- "cluster_label" # Column where the cluster information that went into "cell_set_preferred_alias" # is stored (default is "cluster_label") append <- FALSE # If TRUE (default), it will append info; if FALSE, it will ignore cases where there is already an entry cell_set_information <- annotate_nomenclature_from_metadata(cell_set_information, metadata, metadata_columns, metadata_order, annotation_columns, cluster_column, append) # Since we used aligned alias terms for subclass and class, we will also save these terms in the cell_set_aligned_alias_slot annotation_columns <- c("cell_set_aligned_alias","cell_set_aligned_alias") append <- TRUE # If TRUE, it will append info; if FALSE, it will ignore cases where there is already an entry cell_set_information <- annotate_nomenclature_from_metadata(cell_set_information, metadata, metadata_columns, metadata_order, annotation_columns, cluster_column, append)
Here we save the cell_set_information to a csv file (nomenclature_table.csv
) and output the initial_dendrogram to the screen for manual annotation. You can uncomment the pdf- and dev- lines to save as a pdf (initial_dendrogram.pdf
) .
#pdf("initial_dendrogram.pdf",height=8,width=15) plot_dend(nomenclature_information$initial_dendrogram, node_size=3) #dev.off() write.csv(cell_set_information,"initial_nomenclature_table.csv",row.names=FALSE)
This step is where you can update the nomenclature_table.csv
file to add or change aliases and structures, or to add additional cell sets entirely. By default all cell sets corresponding to cell types ("leaf" nodes) are initially assigned exactly one alias that are assigned as the preferred alias, and all remaining cell sets (internal nodes) do not have any aliases.
The original_label
column can be used to identify cell sets at the stage for updates by matching with the node (or leaf) label shown in the plotted dendrogram. Do not update the cell_set_accession
, cell_set_label
, or taxonomy_id
columns of existing cell sets (although these will need to be provided for added cell sets). If needed cell_set_structures and associated annotation_tags can be updated at this point, although in most cases these will likely remain unchanged.
The cell_set_alias_assignee
and cell_set_alias_citation
columns can be updated at this point, as described above. Our example MTG file on the GitHub repo includes some examples of how this can be done.
The cell_set_aligned_alias
is a special alias slot designed to match cell types across multiple taxonomies. Ideally these terms will be selected from a semi-controlled vocabulary of terms that are agreed-upon in the relevant cell typing community, for example due to their historical significance or includes in a respected ontology, and will also be assigned in the context of a reference taxonomy. For mammalian neocortex we proposed a specific format for such aligned aliases:
A series of such aligned aliases were used in a recent series of studies in mammalian primary motor cortex (M1), and can be found in one of the supplementary tables of the flagship paper. We use the same aligned alias terms here, with the exception that we replace "L4/5 IT" with "L4 IT" since MTG includes a traditional layer 4 while M1 does not.
Additional cell sets can be added at this time as well. This is very important! To do this, take the following steps:
cell_set_accession
as the largest existing value plus one taxonomy_id
to match the other cell sets. Likely cell_set_structures and associated annotation_tags will also match the other cell sets Other columns can also be added to this table, if desired, and those columns will be appended to the dendrogram object in the code below as well.
Once this file has been completed, save the result as a csv file, and continue with the code below, using that file name as input.
# Since we didn't change anything we are not reading in the file, but here is how you would do it #updated_nomenclature <- read.csv("initial_nomenclature_table.csv") updated_nomenclature <- cell_set_information
In this case we are going to complete the manual annotation steps within R for the purposes of the vignette. This is another option instead of (or in addition to) writing and reading the files.
# Add node labels corresponding to all IT cells, Non-neuronal, and Non-neural cells updated_nomenclature[updated_nomenclature$cell_set_accession=="CS202204130_201","cell_set_preferred_alias"] <- "IT" updated_nomenclature[updated_nomenclature$cell_set_accession=="CS202204130_238","cell_set_preferred_alias"] <- "Non-neuronal" updated_nomenclature[updated_nomenclature$cell_set_accession=="CS202204130_238","cell_set_aligned_alias"] <- "Non-neuronal" updated_nomenclature[updated_nomenclature$cell_set_accession=="CS202204130_248","cell_set_preferred_alias"] <- "Non-neural" updated_nomenclature[updated_nomenclature$cell_set_accession=="CS202204130_248","cell_set_aligned_alias"] <- "Non-neural"
Create an additional tag called child_cell_set_accessions
, which is a "|"-separated character vector indicating all of the child cell sets in the dendrogram (e.g., "provisional cell types", "leaves", or "terminal nodes"). This is calculated by using the cell_set_label
tags and will help with integration into downstream ontology use cases.
updated_nomenclature <- define_child_accessions(updated_nomenclature)
Create an additional tag called parent_cell_set_accession
, which is defined as the node with the fewest children that contains a cell set. If the cell sets follows a strict hierarchy (e.g., a tree) then the parent will always be the node in the hierarchy directly above it; however, this function still can return parents in more complex, multi-inheritance structured tables. This can be helpful for integration into downstream ontology use cases.
updated_nomenclature <- define_parent_accessions(updated_nomenclature)
It is useful to review the nomenclature table at this point, and therefore it is output here.
write.csv(updated_nomenclature,"nomenclature_table.csv",row.names=FALSE)
At this point the nomenclature table is finalized. The remaining steps involve checking that things look right and outputting files.
This code will take the information from the table above and add it to the initial dendrogram object. When plotted the only visible difference will be that the new cell set alias names (if any) will show up to replace the n## labels from the initial plot. However, ALL of the meta-data read in from the table will be added to the relevant nodes or leafs. Cell sets not linked to the tree will be ignored in this step, but will be added to the relevant text files output below.
updated_dendrogram <- update_dendrogram_with_nomenclature(nomenclature_information$initial_dendrogram,updated_nomenclature) # In this case we are displaying the dendrogram to the screen, but uncomment the lines below to save as a file #pdf("updated_dendrogram.pdf",height=8,width=15) plot_dend(updated_dendrogram, node_size=3) #dev.off()
Review the dendrogram carefully at this point! You should see all of the node labels that you added in the nomenclature table. If you don't, then you'll need to review the outputted nomenclature_table.csv in the context of the initial dendrogram to make sure everything matches, correct any errors in the nomenclature table, and repeat the steps above starting from "Read in the updated nomenclature". If things look correct, then it's time to proceed.
Plots only show a small fraction of the data available in these dendrogram objects; to see the rest the dendrogram needs to be saved. We find both the R "dendrogram" format and the "json" format useful for different applications at the Allen Institute and code for saving data as both are presented below. First, let's save the R object.
# Save as an R data object saveRDS(updated_dendrogram, file="dend.RDS") # Alternative format # save(updated_dendrogram, file="dend.rda")
Now let's save the json format. Note that only some dendrogram attributes can be converted to a list. If this section crashes you may need to update the "omit_names" variable. Typically vectors of length 1 (e.g., characters, integers, factors) work but more complex attributes are less reliable.
# Convert to a list dend_list <- dend_to_list(updated_dendrogram, omit_names = c("markers","markers.byCl","class")) # Save as a json file dend_JSON <- toJSON(dend_list, complex = "list", pretty = TRUE) out <- file("dend.json", open = "w") writeLines(dend_JSON, out) close(out)
Up to this point the document describes how to apply the CCN to cell sets based on a hierarchical (or non-hierarchical) dendrograms, with an additional manual annotation step. This final section describes how cells within a data set can be mapped onto this nomenclature. Doing this would better allow mapping of cells and cell sets between multiple taxonomies, particularly in cases where multiple taxonomies contain the same cells.
Prior to assigning nomenclature to individual cells, we need to link each cell to the updated nomenclature for each cell set using information from the metadata file read in above (or if needed, a separate file can be read in here). More specifically, we need to create a character vector of cell_set_accession_id
s called cell_id
that corresponds to each cell used for generating the taxonomy. This variable is used as a starting point to assign all cells to all cell sets. In this example, the metadata file we read in above includes a preferred_alias
column (called cluster_label
) corresponding to the cell type names from this Transcriptomics Explorer.
# Read in metadata and collect correct columns for sample name and cell set accession id # Not needed, as it was read above. # OPTION 1: COLUMN FOR ACCESSION ID ALREADY EXISTS - (not common) # cell_id <- metadata$cell_set_accession # samples <- metadata$sample_name # OPTION 2: NEED TO GENERATE COLUMN FROM DENDROGRAM LABELS label_col <- "cluster_label" # Column name with dendrogram labels cell_id <- updated_nomenclature[match(metadata[,label_col],updated_nomenclature$cell_set_preferred_alias), "cell_set_accession"] cell_id[is.na(cell_id)] = "none" samples <- metadata$sample_name
Next, we want to automatically link each cell to each cell set that is available in the dendrogram. This is done as a single line of code (Option 1). Note: if a dendrogram is not available, the mapping
table can be initialized using a single cell set (Option 2).
# OPTION 1: START FROM A DENDROGRAM mapping <- cell_assignment_from_dendrogram(updated_dendrogram,samples,cell_id) # OPTION 2: USE ONLY THE `updated_nomenclature` table #mapping <- data.frame(sample_name=samples, call=((cell_id==cell_id[1])+1-1)) #colnames(mapping) <- c("sample_name",cell_id[1])
The result of this script is a data frame where the first columns corresponds to the cell sample_name
. This is the term used at the Allen Institute for a unique cell ID. This ID is unique across all data at the Allen Institute. In principle, a replacement (or additional) unique ID value could be added for importing into external databases. The remaining columns correspond to the probabilities of each cell mapping to each cell type (from the dendrogram). In this case we define hard probabilities (0 = unassigned to cell set; 1 = assigned to cell set) but this could be adapted to reflect real probabilities calculated elsewhere.
This section assigns cell sets that were defined as combinations of cell types, but that were NOT included in the above section. As written, this function assumes that the cell_set_label
is assigned using the specific format described above. If Option 2 was selected in the previous code block, all cell_set_labels must have the same prefix.
mapping <- cell_assignment_from_groups_of_cell_types(updated_nomenclature,cell_id,mapping)
Finally, we can add cell to cell set mappings specified by any other metadata. No function is required for this mapping. Instead, add new columns to mapping
corresponding to the relevant cell set accessions and provide associated probabilities in the matrix entries. In principle, one could also read in text files and add meta-data columns that way, if desired. For most taxonomies, this step can be skipped, as in most cases cell sets are defined exclusively as a combination of cell types. We print the remaining cell sets here, which in most cases should be none.
missed_ids <- setdiff(updated_nomenclature$cell_set_accession,colnames(mapping)) print(paste0(missed_ids,collapse="; "))
Finally, we output the cell to cell_set assignments as a csv file.
# Output cell to cell set assignments fwrite(mapping,"cell_to_cell_set_assignments.csv")
The final output of the CCN script is a .zip file that contains these files (all described and outputted above):
The resulting .zip file should be shared in manuscripts and other sources as the standard CCN file, for linkage to other taxonomies.
# Add all relevant files to a zip file with desired filename. dend.json and/or cell_to_cell_set_assignments.csv can be manually removed below if needed files = c("dend.json","nomenclature_table.csv","cell_to_cell_set_assignments.csv") zip(paste0("nomenclature.zip"), files=files)
Session info.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.