library("dplyr") library("TidyComb")
TidyComb is an R package designed for DrugComb data portal. It provides functions to facilitating the processing of drug (combination) dose response data. It involves three main topics: drug annotation, cell line annotation and drug sensitivity analysis.
This vignette will show a full workstream for processing drug combination data (to make it ready for DrugComb database). Please check other vignettes for more details about drug annotation, cell line annotation and drug sensitivity analysis.
The input data must be a data frame which contains following columns:
template <- read.csv(system.file("template.csv", package = "TidyComb"), stringsAsFactors = FALSE) head(template)
Note: Whithin one block, only one cell_line_name, drug_row, drug_col, conc_r_unit, conc_c_unit. The combination of conc_r or conc_c can not be duplicated in one block. Function CheckTemplate was designed for user to check the format of input data. For example:
When uploading new data for DrugComb database, 11 tables are required: Manually generate: study, assay Half manually generate: drug, cell_line, disease, tissue * Total automatically generate: response, surface, summary, curve, source
You can go to the github wiki page "About_tables" to find the details about each table.
In this section we will generate tables: 'cell_line', 'tissue', 'disease' and index table 'cell_id' (for facilating table matching in following sections).
First of all we need the cell line information from Cellosaurus. We download the whole data base to work directory and save it as file "cellosaurus.xml".
# Download "cellosaurus.html" file download.file("ftp://ftp.expasy.org/databases/cellosaurus/cellosaurus.xml", "cellosaurus.xml")
Next we can use downloaded file "cellosaurus.xml" to search the cellosaurus accession for each cell
cells <- unique(na.omit(template$cell_line_name)) # The cell name vector must be unique and without NA cell.match <- MatchCellAcc(cells, file = system.file("cellosaurus.xml", package = "TidyComb")) print(cell.match)
Note: Here we have to manually check the matched table and make sure we use the correct cellosaurus accession for further process. Searching conflict cell lines on ExpasyCellosaurus web page may be helpful.
In this case, cell line U87 was matched with two different records. We pick the first one (from "ATCC" company) and generate a "cell line name" - "cellosaurus accession" index table for further data processing.
cell.match.clean <- cell.match[-2, c("input_name", "cellosaurus_accession")]
Function GenerateCell
will check cell_line list existing in DruComb database and generate 'cell_line' table and 'cell_line_id' for new cell lines.
cell <- GenerateCell(cell.match.clean$cellosaurus_accession, file = system.file("cellosaurus.xml", package = "TidyComb")) cell$cell_line cell_index <- cell$cell_id %>% left_join(cell.match.clean, by = "cellosaurus_accession")
Function CheckTissue
will check whether there are new tissues (to DrugComb database) in new 'cell_line' table.
new_tisse <- CheckTissue(cell$cell_line$tissue)
In this case, the tissue information for cell line "H-STS" is missing. We have to manually include the information into the cell_line table.
Note:
cell$cell_line$tissue <- "liver" new_tissue <- CheckTissue(c(cell$cell_line$tissue))
We can see that there is no new tissues. Function CheckTissue
returns the tissue ID for 'liver' in DrugComb.
tissue_index <- new_tissue$old tissue_index
If there are new tissues detected, function CheckTissue
will return a list which contains number of total tissues currently in DrugComb, tissues in input data wich already in DrugComb, and new tissues. Whith these information, we can easily generate 'tissue' table for new tissues.
CheckDisease
function is same as CheckTissue
function, but the input data should be a data frame which contains:
new_disease <- CheckDisease(select(cell$cell_line, id = disease_id, dname = disease_name)) disease_index <- rbind.data.frame(new_disease$old, new_disease$new) disease_table <- new_disease$new disease_table
cell_line_table <- cell$cell_line %>% left_join(tissue_index, by = c("tissue" = "tname")) %>% rename(tissue_id = "id.y") %>% select(name, synonyms, cellosaurus_accession, disease_id, id = id.x, tissue_id) cell_line_table
Function GetCid
using the PubChem API to get the CIDs for input drug names.
drug_input <- na.omit(unique(c(template$drug_row, template$drug_col))) drug_cid <- GetCid(drug_input, type = "name") drug_cid
Note: Function GetCid
might return multiple CIDs for some drugs or return NA
for some drugs. In these cases, user should manually fix the CIDs.
In our case drug "ABT-888" get 2 CIDs. Here we keep the first CID.
drug_cid <- drug_cid[!duplicated(drug_cid$input_id), ] drug_cid
new_drug <- CheckDrug(drug_cid$cid)
drug_name <- GetPubNames(cids = new_drug$new) drug_info <- GetPubchemPro(cids = new_drug$new) drug <- full_join(drug_name, drug_info) pub_phase <- GetPubPhase(drug$cid) drug <- left_join(drug, pub_phase, by = "cid")
chembl <- GetChembl(drug$inchikey) drug <- left_join(drug, chembl, by = "inchikey")
Merge clinical trail phase from PubChem and ChEMBL. Pick the maximum clinical trail phase.
drug$phase <- apply(drug[, c("phase", "chembl_phase")], 1, function(x){ max(x, na.rm = TRUE) }) drug <- select(drug, -chembl_phase)
unichem <- GetIds(drug$inchikey) drug <- left_join(drug, unichem, by = "inchikey") drug$chembl_id <- unlist(apply(drug[, c("chembl_id.x", "chembl_id.y")], 1, function(x){ if (length(na.omit(x)) == 0){ return(NA) } else{ return(paste(na.omit(x), collapse = "; ")) } })) drug <- drug[, which(!colnames(drug) %in% c("chembl_id.x", "chembl_id.y"))]
Maching DrugBank and Stitch informations for 'drug' table relys on a local mySQL data base. There is nothing about TidyComb. I'll not show these in the vinegget.
drug$id <- seq(new_drug$n + 1, length.out = nrow(drug))
drug <- select(drug, dname = "name", id, chembl_id, inchikey, smiles, cid, molecular_formula, clinical_phase = "phase", #cid_m, cid_s, stitch_name, drugbank_id = "uni_drugbank", kegg_id = "uni_kegg_c") drug
drug_index <- drug_cid %>% left_join(new_drug$old, by = "cid") %>% left_join(select(drug, cid, id), by = "cid") drug_index$id <- apply(drug_index[, c("id.x", "id.y")], 1, function(x){ na.omit(x) }) drug_index <- drug_index[, which(!colnames(drug_index) %in% c("id.x", "id.y"))] drug_index
tables <- CalculateTemplate(template) names(tables)
pipeline
function generates 5 data frames:
response_with_score
: a table containing calculated synergy scores of each interaction block.
response
: the response table ready for uploading to DrugComb.
summary
: the summary table ready for uploading to DrugComb.
surface
: the curface table ready for uploading to DrugComb.
* curve
: the curve table ready for uploading to DrugComb.
bad_quality <- tables$synergy$block_id[which(tables$synergy$inhibition < -200 | tables$synergy$inhibition > 200)] bad_quality <- unique(bad_quality) summary <- tables$summary %>% mutate(quality = rep(NA, n())) if (length(bad_quality) != 0){ summary$quality[which(summary$block_id %in% bad_quality)] <- "bad" }
curve <- left_join(tables$curve, drug_index, by = c("drug_row" = "input_id")) %>% rename(drug_row_id = "id") %>% left_join(drug_index, by = c("drug_col" = "input_id")) %>% rename(drug_col_id = "id") %>% select(block_id, drug_row_id, drug_col_id, b, c, d, e, model) curve summary <- left_join(summary, drug_index, by = c("drug_row" = "input_id")) %>% rename(drug_row_id = "id") %>% left_join(drug_index, by = c("drug_col" = "input_id")) %>% rename(drug_col_id = "id") %>% left_join(cell_index, by = c("cell_line_name" = "input_name")) %>% rename(cell_line_id = "id") %>% select(block_id, drug_row_id, drug_col_id, cell_line_id, conc_r_unit, conc_c_unit, synergy_zip, synergy_loewe, synergy_hsa, synergy_bliss, ic50_row, ic50_col, ri_row, ri_col, css_row, css_col, css, S, quality) summary
cell_line_table tissue_index disease_table tables$synergy summary tables$surface curve
dir.create("Upload") write.csv(cell_line_table, "Upload/cell_line.csv", row.names = FALSE) write.csv(tissue_index, "Upload/tissue.csv", row.names = FALSE) write.csv(disease_table, "Upload/disease.csv", row.names = FALSE) write.csv(drug, "Upload/drug.csv", row.names = FALSE) write.csv(tables$synergy, "Upload/response.csv", row.names = FALSE) write.csv(summary, "Upload/summary.csv", row.names = FALSE) write.csv(tables$surface, "Upload/surface.csv", row.names = FALSE) write.csv(curve, "Upload/curve.csv", row.names = FALSE)
Tables "study", "assay" requires manual work for extracting information from data sources. Table "source" relys on these two tables. We can not show the workflow for these 3 tables in this vignette. Please check github wiki page for TidyComb for more details.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.