In MSKCC-Epi-Bio/gnomeR: Wrangle and analyze IMPACT and TCGA mutation data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

set.seed(20230302)

# exit if user doesn't have synapser, a log in, or access to data.
if (genieBPC:::.is_connected_to_genie() == FALSE){
  knitr::knit_exit()
}

library(gnomeR)
library(dplyr)
library(genieBPC)
library(cbioportalR)

Introduction

This vignette will walk through how to apply {gnomeR} functions to data from AACR Project Genomics Evidence Neoplasia Information Exchange BioPharma Collaborative (GENIE BPC). A broad overview of AACR Project GENIE BPC can be found here, with details on the clinical data structure available on the {genieBPC} package website.

For the purposes of this vignette, we will use the first publicly available GENIE BPC data release of non-small cell lung cancer patients, NSCLC v2.0-public.

Note that the GENIE BPC genomic data are unique in a few particular ways:

These data are heterogeneous such that they come from multiple genomic panels from multiple institutions. Gene aliases across panels may vary.
The GENIE BPC fusions and copy number alterations data may differ from common data formats. Differences are described in the Data Formats subsection below.

Data Access

To gain access to the GENIE BPC data, please follow the instructions on the {genieBPC} pull_data_synapse() vignette to register for a Synapse account. Once your Synapse account is created and you authenticate yourself using genieBPC::set_synapse_credentials(), you'll be ready to pull the GENIE BPC clinical and genomic data from Synapse into your local environment:

library(genieBPC)

# if credentials are not stored in your R environment
set_synapse_credentials(username = "username", password = "password")

# if credentials are stored in your R environment
set_synapse_credentials()

Obtain GENIE BPC Data

# pull NSCLC v2.0-public data from Synapse into the R environment
nsclc_2_0 = pull_data_synapse(cohort = "NSCLC",
                              version = "v2.0-public")

The resulting nsclc_2_0 object is a nested list of datasets, including the mutations, fusions, and copy number data.

nsclc_2_0$NSCLC_v2.0$mutations_extended
nsclc_2_0$NSCLC_v2.0$fusions
nsclc_2_0$NSCLC_v2.0$cna

Note that while the GENIE BPC clinical data are only available via Synapse, the genomic data can be accessed via both Synapse and cBioPortal. Using the {cbioportalR} package, users can pull the GENIE BPC genomic data directly from cBioPortal:

library(cbioportalR)

# connect to the GENIE instance of cBioPortal
cbioportalR::set_cbioportal_db("https://genie.cbioportal.org/api")

# view list of available studies from this instance of the portal
# NSCLC v2.0-public is: nsclc_public_genie_bpc
available_studies()

# obtain genomic data for GENIE BPC NSCLC v2.0-public
mutations_extended_2_0 <- get_mutations_by_study("nsclc_public_genie_bpc")
cna_2_0 <- get_cna_by_study("nsclc_public_genie_bpc")
fusions_2_0 <- get_fusions_by_study("nsclc_public_genie_bpc")

Data Formats

The genomic data for GENIE BPC are stored both on Synapse and in cBioPortal. The data structure differs depending on where the genomic data are downloaded from. Therefore, the remainder of this vignette will proceed by outlining the process of annotating genomic data separately for genomic data downloaded from Synapse and genomic data downloaded from cBioPortal.

Differences Between Synapse and cBioPortal Genomic Data

Please note that pulling genomic GENIE data from Synapse using pull_data_synapse() and pulling GENIE data from cBioPortal may result in small differences in the data due to systematic differences in the processing pipelines employed by Synapse and cBioPortal. These differences may include:

Data formatting - Some data sets (e.g. CNA files) may appear in wide format in Synapse data versus long format in cBioPortal data, or column attributes and names may appear sightly different (e.g. fusions files).
Default filtering - By default, cBioPortal filters out Silent, Intron, IGR, 3'UTR, 5'UTR, 3'Flank and 5'Flank, except for the promoter mutations of the TERT gene. See cBioPortal documentation for more details. These are retained in Synapse processing pipelines.
Hugo Symbols - Some genes have more than one accepted Hugo Symbol and may be referred to differently between data sources (e.g. NSD3 is an alias for WHSC1L1). Some tools exist to help you resolve gene aliases across genomic data sources. See gnomeR::recode_alias(), cbioportal::get_alias() and vignettes from the {gnomeR} and {cbioportalR} for more information on how to use these functions and work with gene aliases.

Selecting a Cohort for Analysis

The following code chunk uses the genieBPC::create_analytic_cohort() to create an analytic cohort of patients diagnosed with stage IV NSCLC of adenocarcinoma histology. Then, for patients with multiple genomic samples, the genieBPC::select_unique_ngs() function chooses the genomic sample with OncoTree code LUAD (if available). For patients with multiple samples with OncoTree code LUAD, we will select the metastatic genomic sample. If any patients have multiple metastatic samples with OncoTree code LUAD, take the latest of the samples.

Note: for patients with exactly one genomic sample, that unique genomic sample will be returned regardless of whether it meets the argument criteria specified below.

# create analytic cohort of patients diagnosed with Stage IV adenocarcinoma
nsclc_2_0_example <- create_analytic_cohort(
  data_synapse = nsclc_2_0$NSCLC_v2.0,
  stage_dx = c("Stage IV"),
  histology = "Adenocarcinoma"
)

# select unique NGS samples for this analytic cohort
nsclc_2_0_samples <- select_unique_ngs(
  data_cohort = nsclc_2_0_example$cohort_ngs,
  oncotree_code = "LUAD",
  sample_type = "Metastasis",
  min_max_time = "max"
)

Create a dataframe of the corresponding panel and sample IDs:

# specify sample panels and IDs
nsclc_2_0_sample_panels <- nsclc_2_0_samples %>% 
  select(cpt_seq_assay_id, cpt_genie_sample_id) %>%
  rename(panel_id = cpt_seq_assay_id,
         sample_id = cpt_genie_sample_id) %>%
  filter(!is.na(panel_id))

Process Data with `create_gene_binary()`

The create_gene_binary() function takes inputs of mutations, fusions, and CNA data and returns a binary matrix with the alteration status for each gene, annotating missingness when genes were not included on a next generation sequencing panel.

It is critical to utilize the specify_panel argument of create_gene_binary(). Samples included in GENIE BPC were sequenced across multiple sequencing platforms, with the genes included varying across panels. Without the specify_panel argument, missingness will not be correctly annotated, and genes that were not tested will be incorrectly documented as not being altered.

Note: you can optionally check and recode any older gene names to their newer Hugo Symbol in your data set by passing the genie option to create_gene_binary(recode_aliases=).

Using the genomic data from Synapse:

The fusions and CNA data as downloaded from Synapse require some modifications prior to being supplied to the gnomeR::create_gene_binary() function.

First, the CNA file can be transposed to match the expected input for create_gene_binary() using pivot_cna_longer():

# transpose CNA data from Synapse
cna_synapse_long <- pivot_cna_longer(nsclc_2_0$NSCLC_v2.0$cna)

Next, the fusions file can be transposed to match the expected input for create_gene_binary()

# transpose fusions data from Synapse
fusions_synapse_long <- reformat_fusion(nsclc_2_0$NSCLC_v2.0$fusions)

Finally, the reformatted genomic data can be supplied to create_gene_binary() to annotate genomic alterations for patients in the analytic cohort of interest.

The CNA data as downloaded from cBioPortal only includes high level CNA (-2, 2), so we will specify high_level_cna_only = TRUE to be consistent with the results based on the genomic data as downloaded from cBioPortal.

Additionally, we will use the built in 'genieoption to check gene aliases (see?create_gene_binary` for more info).

nsclc_2_0_gen_dat_synapse <-
  create_gene_binary(
    mutation = nsclc_2_0$NSCLC_v2.0$mutations_extended,
    cna = cna_synapse_long,
    high_level_cna_only = TRUE,
    fusion = fusions_synapse_long,
    samples = nsclc_2_0_sample_panels$sample_id,
    specify_panel = nsclc_2_0_sample_panels, 
    recode_aliases = "genie"
  )

Using the genomic data from cBioPortal:

nsclc_2_0_gen_dat_cbio <-
  create_gene_binary(
    mutation = mutations_extended_2_0,
    cna = cna_2_0,
    fusion = fusions_2_0,
    samples = nsclc_2_0_sample_panels$sample_id,
    specify_panel = nsclc_2_0_sample_panels, 
    recode_aliases = "genie"
  )

Binary genomic matrices created using the genomic data downloaded from Synapse and cBioPortal should be equal. We will proceed using the nsclc_2_0_gen_dat_cbio object.

Collapse Data with `summarize_by_gene()`

We can summarize the presence of any alteration event (mutation, amplification, deletion, structural variant) with the summarize_by_gene() function, such that each gene is a column that captures the presence of any event regardless of alteration type.

Summarizing the first 10 samples for KRAS alterations:

Using the genomic data from Synapse:

nsclc_2_0_gen_dat_synapse[1:10, ] %>% 
  select(sample_id, KRAS, KRAS.Amp) %>%
  summarize_by_gene()

Analyzing Data

After the data have been transformed into a binary format, we can create summaries and visualizations to better understand the data.

Summarize Data with `tbl_genomic()`

The tbl_genomic() function summarizes the frequency of alteration events from the binary data returned from create_gene_binary() or summarize_by_gene().

Using the genomic data from Synapse:

Summarizing the frequencies of KEAP1, STK11, and SMARCA4 alteration events:

nsclc_2_0_gen_dat_synapse %>% 
  select(sample_id, KEAP1, STK11, SMARCA4) %>%
  tbl_genomic()

Users can subset their data set to only include genes above a certain prevalence frequency threshold before passing to the function using the subset_by_frequency() function.

Below, we summarize alteration events with >=10% frequency:

nsclc_2_0_gen_dat_synapse %>%
  subset_by_frequency(t = 0.1) %>%
  tbl_genomic()

Using the genomic data from cBioPortal:

Summarizing the frequencies of KEAP1, STK11, and SMARCA4 alteration events:

nsclc_2_0_gen_dat_cbio %>%
  select(sample_id, KEAP1, STK11, SMARCA4) %>%
  tbl_genomic()

Summarizing alteration events with >=10% frequency:

nsclc_2_0_gen_dat_cbio %>%
  subset_by_frequency(t = 0.1) %>%
  tbl_genomic()

Data Visualizations

We can use the mutation_viz() function to visualize several aspects of the mutation data, including variant classification, variant type, SNV class and top variant genes.

For the purposes of this vignette we will visualize the genomic data from cBioPortal.

Using the genomic data from cBioPortal:

mutation_viz_gen_dat_cbio <- mutation_viz(mutations_extended_2_0)

mutation_viz_gen_dat_cbio

References

Additional details regarding the GENIE BPC data and the {genieBPC} R package are published in the following papers:

Lavery, J. A., Brown, S., Curry, M. A., Martin, A., Sjoberg, D. D., & Whiting, K. (2023). A data processing pipeline for the AACR project GENIE biopharma collaborative data with the {genieBPC} R package. Bioinformatics (Oxford, England), 39(1), btac796. https://doi.org/10.1093/bioinformatics/btac796
Lavery, J. A., Lepisto, E. M., Brown, S., Rizvi, H., McCarthy, C., LeNoue-Newton, M., Yu, C., Lee, J., Guo, X., Yu, T., Rudolph, J., Sweeney, S., AACR Project GENIE Consortium, Park, B. H., Warner, J. L., Bedard, P. L., Riely, G., Schrag, D., & Panageas, K. S. (2022). A Scalable Quality Assurance Process for Curating Oncology Electronic Health Records: The Project GENIE Biopharma Collaborative Approach. JCO clinical cancer informatics, 6, e2100105. https://doi.org/10.1200/CCI.21.00105

Technical details regarding proper analysis of this data can be found in the following publication:

Brown, S., Lavery, J. A., Shen, R., Martin, A. S., Kehl, K. L., Sweeney, S. M., Lepisto, E. M., Rizvi, H., McCarthy, C. G., Schultz, N., Warner, J. L., Park, B. H., Bedard, P. L., Riely, G. J., Schrag, D., Panageas, K. S., & AACR Project GENIE Consortium (2022). Implications of Selection Bias Due to Delayed Study Entry in Clinical Genomic Studies. JAMA oncology, 8(2), 287–291. https://doi.org/10.1001/jamaoncol.2021.5153
Kehl, K. L., Uno, H., Gusev, A., Groha, S., Brown, S., Lavery, J. A., Schrag, D., & Panageas, K. S. (2023). Elucidating Analytic Bias Due to Informative Cohort Entry in Cancer Clinico-genomic Datasets. Cancer epidemiology, biomarkers & prevention: a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology, 32(3), 344–352. https://doi.org/10.1158/1055-9965.EPI-22-0875
Kehl, K. L., Riely, G. J., Lepisto, E. M., Lavery, J. A., Warner, J. L., LeNoue-Newton, M. L., Sweeney, S. M., Rudolph, J. E., Brown, S., Yu, C., Bedard, P. L., Schrag, D., Panageas, K. S., & American Association of Cancer Research (AACR) Project Genomics Evidence Neoplasia Information Exchange (GENIE) Consortium (2021). Correlation Between Surrogate End Points and Overall Survival in a Multi-institutional Clinicogenomic Cohort of Patients With Non-Small Cell Lung or Colorectal Cancer. JAMA network open, 4(7), e2117547. https://doi.org/10.1001/jamanetworkopen.2021.17547

MSKCC-Epi-Bio/gnomeR documentation built on March 28, 2024, 2:42 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

MSKCC-Epi-Bio/gnomeR
Wrangle and analyze IMPACT and TCGA mutation data

In MSKCC-Epi-Bio/gnomeR: Wrangle and analyze IMPACT and TCGA mutation data

Introduction

Data Access

Obtain GENIE BPC Data

Data Formats

Differences Between Synapse and cBioPortal Genomic Data

Selecting a Cohort for Analysis

Process Data with `create_gene_binary()`

Collapse Data with `summarize_by_gene()`

Analyzing Data

Summarize Data with `tbl_genomic()`

Data Visualizations

References

R Package Documentation

Browse R Packages

We want your feedback!

MSKCC-Epi-Bio/gnomeR Wrangle and analyze IMPACT and TCGA mutation data

In MSKCC-Epi-Bio/gnomeR: Wrangle and analyze IMPACT and TCGA mutation data

Introduction

Data Access

Obtain GENIE BPC Data

Data Formats

Differences Between Synapse and cBioPortal Genomic Data

Selecting a Cohort for Analysis

Process Data with create_gene_binary()

Collapse Data with summarize_by_gene()

Analyzing Data

Summarize Data with tbl_genomic()

Data Visualizations

References

R Package Documentation

Browse R Packages

We want your feedback!

MSKCC-Epi-Bio/gnomeR
Wrangle and analyze IMPACT and TCGA mutation data

Process Data with `create_gene_binary()`

Collapse Data with `summarize_by_gene()`

Summarize Data with `tbl_genomic()`