library("dplyr")
library("TidyComb")

TidyComb is an R package designed for DrugComb data portal. It provides functions to facilitating the processing of drug (combination) dose response data. It involves three main topics: drug annotation, cell line annotation and drug sensitivity analysis.

This vignette will show a full workstream for processing drug combination data (to make it ready for DrugComb database). Please check other vignettes for more details about drug annotation, cell line annotation and drug sensitivity analysis.

Input Data

The input data must be a data frame which contains following columns:

template <- read.csv(system.file("template.csv", package = "TidyComb"),
                     stringsAsFactors = FALSE)
head(template)

Note: Whithin one block, only one cell_line_name, drug_row, drug_col, conc_r_unit, conc_c_unit. The combination of conc_r or conc_c can not be duplicated in one block. Function CheckTemplate was designed for user to check the format of input data. For example:

Output Data

When uploading new data for DrugComb database, 11 tables are required: Manually generate: study, assay Half manually generate: drug, cell_line, disease, tissue * Total automatically generate: response, surface, summary, curve, source

You can go to the github wiki page "About_tables" to find the details about each table.

1. Cell line annotation

In this section we will generate tables: 'cell_line', 'tissue', 'disease' and index table 'cell_id' (for facilating table matching in following sections).

1.1 Get cellosaurus accession for cell lines.

First of all we need the cell line information from Cellosaurus. We download the whole data base to work directory and save it as file "cellosaurus.xml".

# Download "cellosaurus.html" file
download.file("ftp://ftp.expasy.org/databases/cellosaurus/cellosaurus.xml", 
              "cellosaurus.xml")

Next we can use downloaded file "cellosaurus.xml" to search the cellosaurus accession for each cell

cells <- unique(na.omit(template$cell_line_name)) # The cell name vector must be unique and without NA
cell.match <- MatchCellAcc(cells, file = system.file("cellosaurus.xml", package = "TidyComb"))
print(cell.match)

Note: Here we have to manually check the matched table and make sure we use the correct cellosaurus accession for further process. Searching conflict cell lines on ExpasyCellosaurus web page may be helpful.

In this case, cell line U87 was matched with two different records. We pick the first one (from "ATCC" company) and generate a "cell line name" - "cellosaurus accession" index table for further data processing.

cell.match.clean <- cell.match[-2, c("input_name", "cellosaurus_accession")]

1.2 Generate tables according to cellosaurus accessions.

  1. 'cell_line' and 'cell_index' table

Function GenerateCell will check cell_line list existing in DruComb database and generate 'cell_line' table and 'cell_line_id' for new cell lines.

cell <- GenerateCell(cell.match.clean$cellosaurus_accession, 
                     file = system.file("cellosaurus.xml", package = "TidyComb"))
cell$cell_line
cell_index <- cell$cell_id %>% 
  left_join(cell.match.clean, by = "cellosaurus_accession")
  1. 'tissue' table and 'tissue_index' table for table matching in the following steps.

Function CheckTissue will check whether there are new tissues (to DrugComb database) in new 'cell_line' table.

new_tisse <- CheckTissue(cell$cell_line$tissue)

In this case, the tissue information for cell line "H-STS" is missing. We have to manually include the information into the cell_line table.

Note:

cell$cell_line$tissue <- "liver"
new_tissue <- CheckTissue(c(cell$cell_line$tissue))

We can see that there is no new tissues. Function CheckTissue returns the tissue ID for 'liver' in DrugComb.

tissue_index <- new_tissue$old
tissue_index

If there are new tissues detected, function CheckTissue will return a list which contains number of total tissues currently in DrugComb, tissues in input data wich already in DrugComb, and new tissues. Whith these information, we can easily generate 'tissue' table for new tissues.

  1. 'disease' table and 'disease_index' table for table matching in the following steps.

CheckDisease function is same as CheckTissue function, but the input data should be a data frame which contains:

new_disease <- CheckDisease(select(cell$cell_line, id = disease_id, dname = disease_name))
disease_index <- rbind.data.frame(new_disease$old, new_disease$new)
disease_table <- new_disease$new
disease_table
  1. Cleaning 'cell_line' table
cell_line_table <- cell$cell_line %>% 
  left_join(tissue_index, by = c("tissue" = "tname")) %>% 
  rename(tissue_id = "id.y") %>% 
  select(name, synonyms, cellosaurus_accession, disease_id, id = id.x, tissue_id)
cell_line_table

2. Prepare Drug information

2.1 Get correct CIDs for drugs

Function GetCid using the PubChem API to get the CIDs for input drug names.

drug_input <- na.omit(unique(c(template$drug_row, template$drug_col)))
drug_cid <- GetCid(drug_input, type = "name")
drug_cid

Note: Function GetCid might return multiple CIDs for some drugs or return NA for some drugs. In these cases, user should manually fix the CIDs.

In our case drug "ABT-888" get 2 CIDs. Here we keep the first CID.

drug_cid <- drug_cid[!duplicated(drug_cid$input_id), ]
drug_cid

2.2 Check whether there are new drugs to DrugComb

new_drug <- CheckDrug(drug_cid$cid)

2.3 Get drug information from multiple databases

  1. PubChem
drug_name <- GetPubNames(cids = new_drug$new)
drug_info <- GetPubchemPro(cids = new_drug$new)
drug <- full_join(drug_name, drug_info)
pub_phase <- GetPubPhase(drug$cid)
drug <- left_join(drug, pub_phase, by = "cid")
  1. ChEMBL
chembl <- GetChembl(drug$inchikey)
drug <- left_join(drug, chembl, by = "inchikey")

Merge clinical trail phase from PubChem and ChEMBL. Pick the maximum clinical trail phase.

drug$phase <- apply(drug[, c("phase", "chembl_phase")], 1, function(x){
  max(x, na.rm = TRUE)
  })
drug <- select(drug, -chembl_phase)
  1. UniChem
unichem <- GetIds(drug$inchikey)
drug <- left_join(drug, unichem, by = "inchikey") 
drug$chembl_id <- unlist(apply(drug[, c("chembl_id.x", "chembl_id.y")], 1,
                        function(x){
                          if (length(na.omit(x)) == 0){
                            return(NA)
                          } else{
                            return(paste(na.omit(x), collapse = "; "))
                          }
                        }))
drug <- drug[, which(!colnames(drug) %in% c("chembl_id.x", "chembl_id.y"))]
  1. DrugBank and Stitch

Maching DrugBank and Stitch informations for 'drug' table relys on a local mySQL data base. There is nothing about TidyComb. I'll not show these in the vinegget.

  1. Add drug id into drug table
drug$id <- seq(new_drug$n + 1, length.out = nrow(drug))
  1. Arrange the column order of drug table and get the output
drug <- select(drug, dname = "name", id, chembl_id, inchikey, smiles, cid, 
               molecular_formula, clinical_phase = "phase", #cid_m, cid_s, stitch_name, 
               drugbank_id = "uni_drugbank", kegg_id = "uni_kegg_c")
drug

2.4 Generate drug_index table for table matching in the following processes

drug_index <- drug_cid %>% 
  left_join(new_drug$old, by = "cid") %>% 
  left_join(select(drug, cid, id), by = "cid")
drug_index$id <- apply(drug_index[, c("id.x", "id.y")], 1, function(x){
  na.omit(x)
})
drug_index <- drug_index[, which(!colnames(drug_index) %in% c("id.x", "id.y"))]
drug_index

3. calculate synergy scores and generate other files for uploading

tables <- CalculateTemplate(template)
names(tables)

pipeline function generates 5 data frames: response_with_score: a table containing calculated synergy scores of each interaction block. response: the response table ready for uploading to DrugComb. summary: the summary table ready for uploading to DrugComb. surface: the curface table ready for uploading to DrugComb. * curve: the curve table ready for uploading to DrugComb.

4. Check and mark dose-response quality for each block

bad_quality <- tables$synergy$block_id[which(tables$synergy$inhibition < -200 |
                                               tables$synergy$inhibition > 200)]
bad_quality <- unique(bad_quality)
summary <- tables$summary %>% 
  mutate(quality = rep(NA, n()))
if (length(bad_quality) != 0){
  summary$quality[which(summary$block_id %in% bad_quality)] <- "bad"
}

5. Replace drug names with drug_id, cell_line_name with cell_line_id

curve <- left_join(tables$curve, drug_index, by = c("drug_row" = "input_id")) %>% 
  rename(drug_row_id = "id") %>% 
  left_join(drug_index, by = c("drug_col" = "input_id")) %>% 
  rename(drug_col_id = "id") %>% 
  select(block_id, drug_row_id, drug_col_id, b, c, d, e, model)
curve
summary <- left_join(summary, drug_index, by = c("drug_row" = "input_id")) %>% 
  rename(drug_row_id = "id") %>% 
  left_join(drug_index, by = c("drug_col" = "input_id")) %>% 
  rename(drug_col_id = "id") %>% 
  left_join(cell_index, by = c("cell_line_name" = "input_name")) %>% 
  rename(cell_line_id = "id") %>% 
  select(block_id, drug_row_id, drug_col_id, cell_line_id, conc_r_unit, 
         conc_c_unit, synergy_zip, synergy_loewe, synergy_hsa, synergy_bliss,
         ic50_row, ic50_col, ri_row, ri_col, css_row, css_col, css, S, quality)
summary

Output and check data

cell_line_table
tissue_index
disease_table
tables$synergy
summary
tables$surface
curve
dir.create("Upload")
write.csv(cell_line_table, "Upload/cell_line.csv", row.names = FALSE)
write.csv(tissue_index, "Upload/tissue.csv", row.names = FALSE)
write.csv(disease_table, "Upload/disease.csv", row.names = FALSE)
write.csv(drug, "Upload/drug.csv", row.names = FALSE)
write.csv(tables$synergy, "Upload/response.csv", row.names = FALSE)
write.csv(summary, "Upload/summary.csv", row.names = FALSE)
write.csv(tables$surface, "Upload/surface.csv", row.names = FALSE)
write.csv(curve, "Upload/curve.csv", row.names = FALSE)

Tables "study", "assay" requires manual work for extracting information from data sources. Table "source" relys on these two tables. We can not show the workflow for these 3 tables in this vignette. Please check github wiki page for TidyComb for more details.



DrugComb/TidyComb documentation built on June 22, 2022, 2:49 a.m.