In MSKCC-Epi-Bio/gnomeR: Wrangle and analyze IMPACT and TCGA mutation data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(dplyr)
library(cbioportalR)
library(tidyr)
library(gnomeR)


data("clin_collab_df")

Introduction

The purpose of this vignette is to outline best practices for downloading, QA-ing and analyzing data generated from MSK IMPACT, a targeted tumor-sequencing test that can detect more than 468 gene mutations and other critical genetic changes in common and rare cancers. Using a hepatocellular cancer case study, we demonstrate a data analysis pipeline using {cbioportalR} functions that can help users generate reproducible analyses using this data.

Setting up

For this vignette, we will be using {cbioportalR}, a package to download data from the cBioPortal website. We will also be using {dplyr} and {tidyr} to clean and manipulate the data:

library(gnomeR)
library(cbioportalR)
library(dplyr)
library(tidyr)

To access cBioPortal data using the {cbioportalR} package, you must set the cBioPortal database using the set_cbioportal_db() function. To access public data, set this to db = public. If you are using a private version of cBioPortal, you would set the db argument to your institution's cBioPortal URL.

set_cbioportal_db(db = "public")

Case Study

Scenario: You are a data analyst whose collaborator has sent you a clinical file of a cohort of patients with hepatocellular cancer that she is interested in for retrospective data analysis. In particular, she wants to look at IMPACT sequencing data for the cohort and investigate associations between genomic alterations and pathological and clinical characteristics. She asks if you can get the IMPACT data and do the analysis.

She gives you a clinical file with 80 sample IDs: clin_collab_df.

head(clin_collab_df)

The sample IDs are in the cbioportal_id column.

Before using cBioPortal to access the genomic data, you first want to do some QA on the clinical data and make sure it matches up with the clinical data in cBioPortal.

Check For Multiple Samples Per Patient

One of the first things to check in your data is whether you have multiple sample IDs from the same patient. Sometimes a clinical file will have a patient_ID column as well; this one doesn't, so you can make your own. The patient ID is just the first 9 digits of the cbioportal_id:

clin_collab_df <- clin_collab_df %>%
  mutate(
    patient_id = substr(cbioportal_id, 1, 9)
  )

If there is only one sample per patient, there should be the same number of samples as patients.

clin_collab_df %>%
    summarize( patients = length(unique(patient_id)),
            samples= length(unique(cbioportal_id)))

So it's clear that we have multiple samples per patient. To find out which patient/s, you can count the patient_id values and filter for >1.

multiple_samps <- clin_collab_df %>%
    count(patient_id) %>%
  filter(n > 1)
multiple_samps

There are 2 patients who each have 2 samples in the collaborator's dataset. Filter the dataset to see the cbioportal_id's in question:

clin_collab_df %>% 
  filter(
  patient_id %in% 
    (multiple_samps$patient_id))

These are patients and samples to ask your collaborator about: Does using both samples make sense? Often times the answer is no. And if not, which sample is the most appropriate one to include? (To get more info for yourself, you can enter the patient ids into the cBioPortal website.)

Check That All `cbioportal_ids` Are In cBioPortal Database

To do this, you need to retrieve the clinical data from cBioPortal using {cbioportalR}. You can use the get_clinical_by_sample() function from {cbioportalR} to do this. Set the sample_id parameter to the cbioportal_ids from the clinical collaborator's file.\ Store the sample data in a file called clin_cbio.

clin_cbio = get_clinical_by_sample(sample_id = clin_collab_df$cbioportal_id)

(You can disregard the warning message for now, though you may be interested in specific clinical attributes later.)

Note: If you are using the public version of cBioPortal, this function will only query the msk_impact_2017 study.

Notice that you now have 2 clinical files: one given to you by the collaborator (clin_collab_df) and one you have retrieved yourself from cBioPortal (clin_cbio).

Here's the header of clin_cbio:

head(clin_cbio) %>% as.data.frame()

The sample IDs here are in the sampleId column. You may notice that this file is in "long" format and each sample has multiple rows. Later we will convert this file to "wide" format to do QA checking on attributes.

But the first thing you want to know is whether you are able to find all of the cbioportal_ids from your clin_collab_df file in the clin_cbio file.\ To do this, use the setdiff() function:

setdiff(clin_collab_df$cbioportal_id, clin_cbio$sampleId)

So there are two sample ID's from your clinical file (clin_collab_df) that are currently not found in cBioPortal (in your clin_cbio file). Include these in the list of cBioPortal questions to ask your collaborator.

(Again, if you want to investigate a bit further, you could enter the patient cBioPortal IDs as queries into the cBioPortal website.)

Check Clinical Data Matches cBioPortal Database

Now we need to check whether clinical information in collaborator's file (clin_collab_df) matches clinical information in cBioPortal (in your clin_cbio file).

Look at the clin_collab_df again:

head(clin_collab_df)

Aside from cbioportal_id, you have cancer type (ctype) and sample type (primary_mets) variables. Because it's a hepatocellular cancer study, all of the ctype values will be the same. To double check that, count ctype:

clin_collab_df %>% count(ctype)

So the only variable you can check in this example is the primary_mets. To see if the clin_cbio file has an analogous variable to check, first see the attributes that are available in it.

clin_cbio %>% count(clinicalAttributeId)

To quickly see values associated with a particular attribute, filter by the attribute and count the values. For example:

clin_cbio %>% filter(clinicalAttributeId=="SAMPLE_TYPE") %>% count(value)

The attribute SAMPLE_TYPE looks like the appropriate variable to check primary_mets against. To do this, we will convert clin_cbio to "wide" form (only for the SAMPLE_TYPE variable for now), merge it with clin_collab_df and then cross-tabulate the 2 variables.

To convert clin_cbio to "wide" form:

clin_cbio_wide = clin_cbio %>% 
  select( sampleId, clinicalAttributeId, value) %>%
  filter( clinicalAttributeId == "SAMPLE_TYPE") %>% 
  pivot_wider(names_from = clinicalAttributeId, values_from = value)

Take a look at the "wide" file:

head(clin_cbio_wide) %>% as.data.frame()

Now to check the primary_mets variable from clin_collab_df against the SAMPLE_TYPE variable from clin_cbio_wide, merge the files and tabulate the variables.

clin_merged <- clin_cbio_wide %>% left_join(clin_collab_df, by = c("sampleId" = "cbioportal_id")) 
clin_merged %>% select(primary_mets, SAMPLE_TYPE) %>% table()

There is 1 sample that has a value of "Metastasis" for the primary_mets variable but "Primary" for the SAMPLE_TYPE variable. To find the sample ID, filter:

clin_merged %>% filter(primary_mets == "Metastasis" & SAMPLE_TYPE == "Primary")

Include this sample in the list of questions for your collaborator. Either she will need to update her clinical file with the correct value or you/she will have to notify cBioPortal to update their database.

MSKCC-Epi-Bio/gnomeR documentation built on March 28, 2024, 2:42 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com