knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(dplyr)
library(cbioportalR)
library(tidyr)
library(gnomeR)
data("clin_collab_df")
The purpose of this vignette is to outline best practices for downloading, quality-checking (QA) and analyzing data generated from MSK IMPACT, a targeted tumor-sequencing test that can detect more than 468 gene mutations and other critical genetic changes in common and rare cancers. Using a hepatocellular cancer case study, we demonstrate a data analysis pipeline built on {cbioportalR} functions that can help users generate reproducible analyses of these data.
For this vignette, we will be using {cbioportalR}, a package to download data from the cBioPortal website. We will also be using {dplyr} and {tidyr} to clean and manipulate the data:
library(gnomeR)
library(cbioportalR)
library(dplyr)
library(tidyr)
To access cBioPortal data using the {cbioportalR} package, you must first set the cBioPortal database with the set_cbioportal_db() function. To access public data, set db = "public". If you are using a private version of cBioPortal, set the db argument to your institution's cBioPortal URL.
set_cbioportal_db(db = "public")
Scenario: You are a data analyst whose collaborator has sent you a clinical file of a cohort of patients with hepatocellular cancer that she is interested in for retrospective data analysis. In particular, she wants to look at IMPACT sequencing data for the cohort and investigate associations between genomic alterations and pathological and clinical characteristics. She asks if you can get the IMPACT data and do the analysis.
She gives you a clinical file with 80 sample IDs: clin_collab_df.
head(clin_collab_df)
The sample IDs are in the cbioportal_id column.
Before using cBioPortal to access the genomic data, you first want to do some QA on the clinical data and make sure it matches up with the clinical data in cBioPortal.
One of the first things to check in your data is whether you have multiple sample IDs from the same patient. Sometimes a clinical file will have a patient ID column as well; this one doesn't, so you can make your own. The patient ID is just the first 9 characters of the cbioportal_id:
clin_collab_df <- clin_collab_df %>%
  mutate(patient_id = substr(cbioportal_id, 1, 9))
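To see what this extraction does, here is a minimal sketch on made-up data. The sample IDs below are hypothetical and only mimic the general shape of MSK IMPACT sample IDs; they are not taken from clin_collab_df.

```r
# Hypothetical sample IDs (made up for illustration only)
sample_ids <- c("P-0000001-T01-IM3", "P-0000001-T02-IM5", "P-0000002-T01-IM3")

# Keep characters 1 through 9, i.e. the patient portion of each ID
patient_ids <- substr(sample_ids, 1, 9)

patient_ids
```

The first two hypothetical samples share a patient ID, which is exactly the situation the next check is meant to detect.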
If there is only one sample per patient, there should be the same number of samples as patients.
clin_collab_df %>%
  summarize(
    patients = length(unique(patient_id)),
    samples = length(unique(cbioportal_id))
  )
So it's clear that we have multiple samples for at least one patient. To find out which patient(s), you can count the patient_id values and filter for counts greater than 1.
multiple_samps <- clin_collab_df %>%
  count(patient_id) %>%
  filter(n > 1)

multiple_samps
There are 2 patients who each have 2 samples in the collaborator's dataset. Filter the dataset to see the cbioportal_id values in question:
clin_collab_df %>% filter( patient_id %in% (multiple_samps$patient_id))
These are patients and samples to ask your collaborator about: Does using both samples make sense? Often the answer is no. If not, which sample is the most appropriate one to include? (To get more information for yourself, you can enter the patient IDs into the cBioPortal website.)
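The duplicate check can be sketched end to end on a small made-up table. The object names toy_clin, toy_multiple and toy_flagged and the IDs in them are hypothetical; only the dplyr verbs mirror the steps used on clin_collab_df.

```r
library(dplyr)

# Hypothetical clinical file: patient P-0000001 contributes two samples
toy_clin <- tibble(
  cbioportal_id = c("P-0000001-T01-IM3", "P-0000001-T02-IM5", "P-0000002-T01-IM3"),
  patient_id = substr(cbioportal_id, 1, 9)
)

# Patients with more than one sample
toy_multiple <- toy_clin %>%
  count(patient_id) %>%
  filter(n > 1)

# Rows of the clinical file that belong to those patients
toy_flagged <- toy_clin %>%
  filter(patient_id %in% toy_multiple$patient_id)

toy_flagged
```

Keeping the flagged rows in their own object makes it easy to export them as a list of questions for the collaborator.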
Next, check that the cbioportal_ids are in the cBioPortal database. To do this, you need to retrieve the clinical data from cBioPortal using the get_clinical_by_sample() function from {cbioportalR}. Set the sample_id parameter to the cbioportal_ids from the collaborator's clinical file. Store the sample data in an object called clin_cbio.
clin_cbio = get_clinical_by_sample(sample_id = clin_collab_df$cbioportal_id)
(You can disregard the warning message for now, though you may be interested in specific clinical attributes later.)
Note: If you are using the public version of cBioPortal, this function will only query the msk_impact_2017 study.
Notice that you now have 2 clinical files: one given to you by the collaborator (clin_collab_df) and one you have retrieved yourself from cBioPortal (clin_cbio).
Here are the first rows of clin_cbio:
head(clin_cbio) %>% as.data.frame()
The sample IDs here are in the sampleId column. You may notice that this file is in "long" format and each sample has multiple rows. Later we will convert this file to "wide" format to do QA checking on attributes.
But the first thing you want to know is whether you are able to find all of the cbioportal_ids from your clin_collab_df file in the clin_cbio file. To do this, use the setdiff() function:
setdiff(clin_collab_df$cbioportal_id, clin_cbio$sampleId)
So there are two sample IDs from your clinical file (clin_collab_df) that are currently not found in cBioPortal (in your clin_cbio file). Include these in the list of cBioPortal questions to ask your collaborator.
(Again, if you want to investigate a bit further, you could enter the patient cBioPortal IDs as queries into the cBioPortal website.)
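Note that setdiff() is directional: it returns the elements of its first argument that are absent from the second. A minimal sketch with made-up IDs (ids_collab and ids_cbio are hypothetical stand-ins for the two ID columns) shows both directions:

```r
# Hypothetical IDs: what the collaborator sent vs. what the query returned
ids_collab <- c("P-0000001-T01-IM3", "P-0000002-T01-IM3", "P-0000003-T01-IM5")
ids_cbio <- c("P-0000001-T01-IM3", "P-0000002-T01-IM3")

# In the collaborator's file but missing from the query result
missing_from_cbio <- setdiff(ids_collab, ids_cbio)

# The reverse direction: returned by the query but not requested
extra_in_cbio <- setdiff(ids_cbio, ids_collab)

missing_from_cbio
extra_in_cbio
```

Because the query was built from the collaborator's IDs, the reverse direction is expected to be empty; checking it anyway is a cheap sanity check.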
Now we need to check whether the clinical information in the collaborator's file (clin_collab_df) matches the clinical information in cBioPortal (in your clin_cbio file).
Look at clin_collab_df again:
head(clin_collab_df)
Aside from cbioportal_id, you have cancer type (ctype) and sample type (primary_mets) variables. Because it's a hepatocellular cancer study, all of the ctype values should be the same. To double-check that, count ctype:
clin_collab_df %>% count(ctype)
So the only variable you can check in this example is primary_mets. To see whether the clin_cbio file has an analogous variable to check it against, first look at the attributes that are available in it.
clin_cbio %>% count(clinicalAttributeId)
To quickly see values associated with a particular attribute, filter by the attribute and count the values. For example:
clin_cbio %>% filter(clinicalAttributeId=="SAMPLE_TYPE") %>% count(value)
The attribute SAMPLE_TYPE looks like the appropriate variable to check primary_mets against. To do this, we will convert clin_cbio to "wide" form (only for the SAMPLE_TYPE variable for now), merge it with clin_collab_df and then cross-tabulate the 2 variables.
To convert clin_cbio to "wide" form:
clin_cbio_wide = clin_cbio %>%
  select(sampleId, clinicalAttributeId, value) %>%
  filter(clinicalAttributeId == "SAMPLE_TYPE") %>%
  pivot_wider(names_from = clinicalAttributeId, values_from = value)
Take a look at the "wide" file:
head(clin_cbio_wide) %>% as.data.frame()
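The long-to-wide reshaping can be sketched on a small made-up table. The object names toy_long and toy_wide, the sample IDs and the values are hypothetical; only the column names sampleId, clinicalAttributeId and value follow the clin_cbio layout described above.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format clinical data: one row per sample and attribute
toy_long <- tibble(
  sampleId = c("S1", "S1", "S2", "S2"),
  clinicalAttributeId = c("SAMPLE_TYPE", "CANCER_TYPE", "SAMPLE_TYPE", "CANCER_TYPE"),
  value = c("Primary", "Hepatobiliary Cancer", "Metastasis", "Hepatobiliary Cancer")
)

# One row per sample, one column per attribute
toy_wide <- toy_long %>%
  pivot_wider(names_from = clinicalAttributeId, values_from = value)

toy_wide
```

pivot_wider() treats every column not named in names_from or values_from as an identifier, which is why it helps to select() only the needed columns first: any leftover column that varies within a sample would split that sample across several rows.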
Now, to check the primary_mets variable from clin_collab_df against the SAMPLE_TYPE variable from clin_cbio_wide, merge the files and tabulate the variables.
clin_merged <- clin_cbio_wide %>%
  left_join(clin_collab_df, by = c("sampleId" = "cbioportal_id"))

clin_merged %>%
  select(primary_mets, SAMPLE_TYPE) %>%
  table()
There is 1 sample that has a value of "Metastasis" for the primary_mets variable but "Primary" for the SAMPLE_TYPE variable. To find the sample ID, filter:
clin_merged %>% filter(primary_mets == "Metastasis" & SAMPLE_TYPE == "Primary")
Include this sample in the list of questions for your collaborator. Either she will need to update her clinical file with the correct value or you/she will have to notify cBioPortal to update their database.
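As a minimal sketch of the whole comparison on made-up data (toy_cbio, toy_collab and the sample IDs are hypothetical), the merge, cross-tabulation and discordance filter can be written as follows. This variant filters on any disagreement rather than one specific pair of values, which is an assumption about what you want to flag.

```r
library(dplyr)

# Hypothetical wide-format cBioPortal data and collaborator file
toy_cbio <- tibble(
  sampleId = c("S1", "S2", "S3"),
  SAMPLE_TYPE = c("Primary", "Metastasis", "Primary")
)
toy_collab <- tibble(
  cbioportal_id = c("S1", "S2", "S3"),
  primary_mets = c("Primary", "Metastasis", "Metastasis")
)

# Merge on the sample ID, which has a different column name in each file
toy_merged <- toy_cbio %>%
  left_join(toy_collab, by = c("sampleId" = "cbioportal_id"))

# Cross-tabulate; useNA = "ifany" also shows samples missing either value
xtab <- table(toy_merged$primary_mets, toy_merged$SAMPLE_TYPE, useNA = "ifany")

# Samples where the two sources disagree (rows with NA are dropped by filter)
discordant <- toy_merged %>%
  filter(primary_mets != SAMPLE_TYPE)

xtab
discordant
```

Here sample S3 is flagged because the collaborator's file and the cBioPortal data assign it different sample types.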