In DrugComb/TidyComb: Preparing drug combination data for DrugComb

library("dplyr")
library("TidyComb")

TidyComb is an R package designed for DrugComb data portal. It provides functions to facilitating the processing of drug (combination) dose response data. It involves three main topics: drug annotation, cell line annotation and drug sensitivity analysis.

This vignette will show a full workstream for processing drug combination data (to make it ready for DrugComb database). Please check other vignettes for more details about drug annotation, cell line annotation and drug sensitivity analysis.

Input Data

The input data must be a data frame which contains following columns:

block_id: It is the unique identifier for each drug response (or combination) block.
drug_row: It is the name (or ID) of the drug which was added along the rows of drug combination testing matrix.
drug_col: It is the name (or ID) of the drug which was added along the columns of drug combination testing matrix.
cell_line_name: It is the name (or ID) of cell line on which drug combination response was tested.
conc_r_unit: The unit of concentrations used when adding drug_row. For example "uM".
conc_c_unit: The unit of concentrations used when adding drug_col. For example "nM".
conc_r: The concentrations of drug_row was added in the drug combination test.
conc_c: The concentrations of drug_col was added in the drug combination test.
inhibition: The percentage cell growth inhibition at each combinations.

template <- read.csv(system.file("template.csv", package = "TidyComb"),
                     stringsAsFactors = FALSE)
head(template)

Note: Whithin one block, only one cell_line_name, drug_row, drug_col, conc_r_unit, conc_c_unit. The combination of conc_r or conc_c can not be duplicated in one block. Function CheckTemplate was designed for user to check the format of input data. For example:

Output Data

When uploading new data for DrugComb database, 11 tables are required: Manually generate: study, assay Half manually generate: drug, cell_line, disease, tissue * Total automatically generate: response, surface, summary, curve, source

You can go to the github wiki page "About_tables" to find the details about each table.

1. Cell line annotation

In this section we will generate tables: 'cell_line', 'tissue', 'disease' and index table 'cell_id' (for facilating table matching in following sections).

1.1 Get cellosaurus accession for cell lines.

First of all we need the cell line information from Cellosaurus. We download the whole data base to work directory and save it as file "cellosaurus.xml".

# Download "cellosaurus.html" file
download.file("ftp://ftp.expasy.org/databases/cellosaurus/cellosaurus.xml", 
              "cellosaurus.xml")

Next we can use downloaded file "cellosaurus.xml" to search the cellosaurus accession for each cell

cells <- unique(na.omit(template$cell_line_name)) # The cell name vector must be unique and without NA
cell.match <- MatchCellAcc(cells, file = system.file("cellosaurus.xml", package = "TidyComb"))
print(cell.match)

Note: Here we have to manually check the matched table and make sure we use the correct cellosaurus accession for further process. Searching conflict cell lines on ExpasyCellosaurus web page may be helpful.

In this case, cell line U87 was matched with two different records. We pick the first one (from "ATCC" company) and generate a "cell line name" - "cellosaurus accession" index table for further data processing.

cell.match.clean <- cell.match[-2, c("input_name", "cellosaurus_accession")]

1.2 Generate tables according to cellosaurus accessions.

'cell_line' and 'cell_index' table

Function GenerateCell will check cell_line list existing in DruComb database and generate 'cell_line' table and 'cell_line_id' for new cell lines.

cell <- GenerateCell(cell.match.clean$cellosaurus_accession, 
                     file = system.file("cellosaurus.xml", package = "TidyComb"))
cell$cell_line
cell_index <- cell$cell_id %>% 
  left_join(cell.match.clean, by = "cellosaurus_accession")

'tissue' table and 'tissue_index' table for table matching in the following steps.

Function CheckTissue will check whether there are new tissues (to DrugComb database) in new 'cell_line' table.

new_tisse <- CheckTissue(cell$cell_line$tissue)

In this case, the tissue information for cell line "H-STS" is missing. We have to manually include the information into the cell_line table.

Note:

the tissue name must be lower cases.
If the tissue name is "central_nevours_system", DrugComb requires more precise classifications: 'brain', 'nervours_system', or 'spinal_cord_or_other_CNS'.

cell$cell_line$tissue <- "liver"
new_tissue <- CheckTissue(c(cell$cell_line$tissue))

We can see that there is no new tissues. Function CheckTissue returns the tissue ID for 'liver' in DrugComb.

tissue_index <- new_tissue$old
tissue_index

If there are new tissues detected, function CheckTissue will return a list which contains number of total tissues currently in DrugComb, tissues in input data wich already in DrugComb, and new tissues. Whith these information, we can easily generate 'tissue' table for new tissues.

'disease' table and 'disease_index' table for table matching in the following steps.

CheckDisease function is same as CheckTissue function, but the input data should be a data frame which contains:

id: disease ID
dname: disease name

new_disease <- CheckDisease(select(cell$cell_line, id = disease_id, dname = disease_name))
disease_index <- rbind.data.frame(new_disease$old, new_disease$new)
disease_table <- new_disease$new
disease_table

Cleaning 'cell_line' table

cell_line_table <- cell$cell_line %>% 
  left_join(tissue_index, by = c("tissue" = "tname")) %>% 
  rename(tissue_id = "id.y") %>% 
  select(name, synonyms, cellosaurus_accession, disease_id, id = id.x, tissue_id)
cell_line_table

2. Prepare Drug information

2.1 Get correct CIDs for drugs

Function GetCid using the PubChem API to get the CIDs for input drug names.

drug_input <- na.omit(unique(c(template$drug_row, template$drug_col)))
drug_cid <- GetCid(drug_input, type = "name")
drug_cid

Note: Function GetCid might return multiple CIDs for some drugs or return NA for some drugs. In these cases, user should manually fix the CIDs.

In our case drug "ABT-888" get 2 CIDs. Here we keep the first CID.

drug_cid <- drug_cid[!duplicated(drug_cid$input_id), ]
drug_cid

2.2 Check whether there are new drugs to DrugComb

new_drug <- CheckDrug(drug_cid$cid)

2.3 Get drug information from multiple databases

PubChem

drug_name <- GetPubNames(cids = new_drug$new)
drug_info <- GetPubchemPro(cids = new_drug$new)
drug <- full_join(drug_name, drug_info)
pub_phase <- GetPubPhase(drug$cid)
drug <- left_join(drug, pub_phase, by = "cid")

ChEMBL

chembl <- GetChembl(drug$inchikey)
drug <- left_join(drug, chembl, by = "inchikey")

Merge clinical trail phase from PubChem and ChEMBL. Pick the maximum clinical trail phase.

drug$phase <- apply(drug[, c("phase", "chembl_phase")], 1, function(x){
  max(x, na.rm = TRUE)
  })
drug <- select(drug, -chembl_phase)

UniChem

unichem <- GetIds(drug$inchikey)
drug <- left_join(drug, unichem, by = "inchikey") 
drug$chembl_id <- unlist(apply(drug[, c("chembl_id.x", "chembl_id.y")], 1,
                        function(x){
                          if (length(na.omit(x)) == 0){
                            return(NA)
                          } else{
                            return(paste(na.omit(x), collapse = "; "))
                          }
                        }))
drug <- drug[, which(!colnames(drug) %in% c("chembl_id.x", "chembl_id.y"))]

DrugBank and Stitch

Maching DrugBank and Stitch informations for 'drug' table relys on a local mySQL data base. There is nothing about TidyComb. I'll not show these in the vinegget.

Add drug id into drug table

drug$id <- seq(new_drug$n + 1, length.out = nrow(drug))

Arrange the column order of drug table and get the output

drug <- select(drug, dname = "name", id, chembl_id, inchikey, smiles, cid, 
               molecular_formula, clinical_phase = "phase", #cid_m, cid_s, stitch_name, 
               drugbank_id = "uni_drugbank", kegg_id = "uni_kegg_c")
drug

2.4 Generate drug_index table for table matching in the following processes

drug_index <- drug_cid %>% 
  left_join(new_drug$old, by = "cid") %>% 
  left_join(select(drug, cid, id), by = "cid")
drug_index$id <- apply(drug_index[, c("id.x", "id.y")], 1, function(x){
  na.omit(x)
})
drug_index <- drug_index[, which(!colnames(drug_index) %in% c("id.x", "id.y"))]
drug_index

3. calculate synergy scores and generate other files for uploading

tables <- CalculateTemplate(template)
names(tables)

pipeline function generates 5 data frames: response_with_score: a table containing calculated synergy scores of each interaction block. response: the response table ready for uploading to DrugComb. summary: the summary table ready for uploading to DrugComb. surface: the curface table ready for uploading to DrugComb. * curve: the curve table ready for uploading to DrugComb.

4. Check and mark dose-response quality for each block

bad_quality <- tables$synergy$block_id[which(tables$synergy$inhibition < -200 |
                                               tables$synergy$inhibition > 200)]
bad_quality <- unique(bad_quality)
summary <- tables$summary %>% 
  mutate(quality = rep(NA, n()))
if (length(bad_quality) != 0){
  summary$quality[which(summary$block_id %in% bad_quality)] <- "bad"
}

5. Replace drug names with drug_id, cell_line_name with cell_line_id

curve <- left_join(tables$curve, drug_index, by = c("drug_row" = "input_id")) %>% 
  rename(drug_row_id = "id") %>% 
  left_join(drug_index, by = c("drug_col" = "input_id")) %>% 
  rename(drug_col_id = "id") %>% 
  select(block_id, drug_row_id, drug_col_id, b, c, d, e, model)
curve
summary <- left_join(summary, drug_index, by = c("drug_row" = "input_id")) %>% 
  rename(drug_row_id = "id") %>% 
  left_join(drug_index, by = c("drug_col" = "input_id")) %>% 
  rename(drug_col_id = "id") %>% 
  left_join(cell_index, by = c("cell_line_name" = "input_name")) %>% 
  rename(cell_line_id = "id") %>% 
  select(block_id, drug_row_id, drug_col_id, cell_line_id, conc_r_unit, 
         conc_c_unit, synergy_zip, synergy_loewe, synergy_hsa, synergy_bliss,
         ic50_row, ic50_col, ri_row, ri_col, css_row, css_col, css, S, quality)
summary

Output and check data

cell_line_table
tissue_index
disease_table
tables$synergy
summary
tables$surface
curve

dir.create("Upload")
write.csv(cell_line_table, "Upload/cell_line.csv", row.names = FALSE)
write.csv(tissue_index, "Upload/tissue.csv", row.names = FALSE)
write.csv(disease_table, "Upload/disease.csv", row.names = FALSE)
write.csv(drug, "Upload/drug.csv", row.names = FALSE)
write.csv(tables$synergy, "Upload/response.csv", row.names = FALSE)
write.csv(summary, "Upload/summary.csv", row.names = FALSE)
write.csv(tables$surface, "Upload/surface.csv", row.names = FALSE)
write.csv(curve, "Upload/curve.csv", row.names = FALSE)

Tables "study", "assay" requires manual work for extracting information from data sources. Table "source" relys on these two tables. We can not show the workflow for these 3 tables in this vignette. Please check github wiki page for TidyComb for more details.

DrugComb/TidyComb documentation built on June 22, 2022, 2:49 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

DrugComb/TidyComb
Preparing drug combination data for DrugComb

In DrugComb/TidyComb: Preparing drug combination data for DrugComb

Input Data

Output Data

1. Cell line annotation

1.1 Get cellosaurus accession for cell lines.

1.2 Generate tables according to cellosaurus accessions.

2. Prepare Drug information

2.1 Get correct CIDs for drugs

2.2 Check whether there are new drugs to DrugComb

2.3 Get drug information from multiple databases

2.4 Generate drug_index table for table matching in the following processes

3. calculate synergy scores and generate other files for uploading

4. Check and mark dose-response quality for each block

5. Replace drug names with drug_id, cell_line_name with cell_line_id

Output and check data

R Package Documentation

Browse R Packages

We want your feedback!

DrugComb/TidyComb Preparing drug combination data for DrugComb

In DrugComb/TidyComb: Preparing drug combination data for DrugComb

Input Data

Output Data

1. Cell line annotation

1.1 Get cellosaurus accession for cell lines.

1.2 Generate tables according to cellosaurus accessions.

2. Prepare Drug information

2.1 Get correct CIDs for drugs

2.2 Check whether there are new drugs to DrugComb

2.3 Get drug information from multiple databases

2.4 Generate drug_index table for table matching in the following processes

3. calculate synergy scores and generate other files for uploading

4. Check and mark dose-response quality for each block

5. Replace drug names with drug_id, cell_line_name with cell_line_id

Output and check data

R Package Documentation

Browse R Packages

We want your feedback!

DrugComb/TidyComb
Preparing drug combination data for DrugComb