knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

set.seed(20230302)

# exit if the user doesn't have synapser, a login, or access to the data
if (!genieBPC:::.is_connected_to_genie()) {
  knitr::knit_exit()
}
library(gnomeR)
library(dplyr)
library(genieBPC)
library(cbioportalR)

Introduction

This vignette walks through how to apply {gnomeR} functions to data from the AACR Project Genomics Evidence Neoplasia Information Exchange BioPharma Collaborative (GENIE BPC). A broad overview of AACR Project GENIE BPC can be found here, with details on the clinical data structure available on the {genieBPC} package website.

For the purposes of this vignette, we will use the first publicly available GENIE BPC data release of non-small cell lung cancer patients, NSCLC v2.0-public.

Note that the GENIE BPC genomic data are unique in a few particular ways:

Data Access

To gain access to the GENIE BPC data, please follow the instructions in the {genieBPC} pull_data_synapse() vignette to register for a Synapse account. Once your Synapse account is created and you have authenticated using genieBPC::set_synapse_credentials(), you'll be ready to pull the GENIE BPC clinical and genomic data from Synapse into your local environment:

library(genieBPC)

# if credentials are not stored in your R environment
set_synapse_credentials(username = "username", password = "password")
# if credentials are stored in your R environment
set_synapse_credentials()

Obtain GENIE BPC Data

# pull NSCLC v2.0-public data from Synapse into the R environment
nsclc_2_0 <- pull_data_synapse(cohort = "NSCLC",
                               version = "v2.0-public")

The resulting nsclc_2_0 object is a nested list of datasets, including the mutations, fusions, and copy number data.
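Because the object is a nested list, it helps to look at its levels before extracting a dataset. The sketch below uses a small made-up stand-in rather than real GENIE BPC data: the element names (NSCLC_v2.0, mutations_extended, fusions, cna) mirror those used later in this vignette, while the column names and values are illustrative assumptions.

```r
# Toy stand-in for the object returned by pull_data_synapse().
# Element names follow this vignette; contents are made up for illustration.
toy_pull <- list(
  NSCLC_v2.0 = list(
    mutations_extended = data.frame(
      Tumor_Sample_Barcode = c("S1", "S2"),  # assumed column name
      Hugo_Symbol = c("KRAS", "TP53"),       # assumed column name
      stringsAsFactors = FALSE
    ),
    fusions = data.frame(),
    cna = data.frame()
  )
)

# top level: one element per cohort and data release
names(toy_pull)

# second level: the individual datasets for that release
names(toy_pull$NSCLC_v2.0)

# compact overview of both levels without printing the data
str(toy_pull, max.level = 2, give.attr = FALSE)
```

The same three calls can be applied to the real nsclc_2_0 object once it has been pulled.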

Note that while the GENIE BPC clinical data are only available via Synapse, the genomic data can be accessed via both Synapse and cBioPortal. Using the {cbioportalR} package, users can pull the GENIE BPC genomic data directly from cBioPortal:

library(cbioportalR)

# connect to the GENIE instance of cBioPortal
cbioportalR::set_cbioportal_db("https://genie.cbioportal.org/api")

# view list of available studies from this instance of the portal
# NSCLC v2.0-public is: nsclc_public_genie_bpc
available_studies()
# obtain genomic data for GENIE BPC NSCLC v2.0-public
mutations_extended_2_0 <- get_mutations_by_study("nsclc_public_genie_bpc")
cna_2_0 <- get_cna_by_study("nsclc_public_genie_bpc")
fusions_2_0 <- get_fusions_by_study("nsclc_public_genie_bpc")

Data Formats

The genomic data for GENIE BPC are stored both on Synapse and in cBioPortal, and the data structure differs depending on where the genomic data are downloaded from. The remainder of this vignette therefore outlines the annotation process separately for genomic data downloaded from Synapse and for genomic data downloaded from cBioPortal.

Differences Between Synapse and cBioPortal Genomic Data

Please note that pulling genomic GENIE data from Synapse using pull_data_synapse() and pulling GENIE data from cBioPortal may result in small differences in the data due to systematic differences in the processing pipelines employed by Synapse and cBioPortal. These differences may include:

Selecting a Cohort for Analysis

The following code chunk uses genieBPC::create_analytic_cohort() to create an analytic cohort of patients diagnosed with stage IV NSCLC of adenocarcinoma histology. Then, for patients with multiple genomic samples, genieBPC::select_unique_ngs() chooses the genomic sample with OncoTree code LUAD (if available). For patients with multiple samples with OncoTree code LUAD, it selects the metastatic genomic sample. If a patient has multiple metastatic samples with OncoTree code LUAD, it takes the latest of those samples.

Note: for patients with exactly one genomic sample, that unique genomic sample will be returned regardless of whether it meets the argument criteria specified below.

# create analytic cohort of patients diagnosed with Stage IV adenocarcinoma
nsclc_2_0_example <- create_analytic_cohort(
  data_synapse = nsclc_2_0$NSCLC_v2.0,
  stage_dx = c("Stage IV"),
  histology = "Adenocarcinoma"
)

# select unique NGS samples for this analytic cohort
nsclc_2_0_samples <- select_unique_ngs(
  data_cohort = nsclc_2_0_example$cohort_ngs,
  oncotree_code = "LUAD",
  sample_type = "Metastasis",
  min_max_time = "max"
)
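The selection rules described above amount to a priority ordering within each patient. The following sketch illustrates that idea with {dplyr} on made-up data; it is not how select_unique_ngs() is implemented, and the column names (patient_id, oncotree, sample_type, age_at_seq) are hypothetical.

```r
library(dplyr)

# hypothetical sample-level data; column names and values are illustrative only
toy_ngs <- data.frame(
  patient_id  = c("P1", "P1", "P1", "P2", "P3", "P3"),
  sample_id   = c("P1-a", "P1-b", "P1-c", "P2-a", "P3-a", "P3-b"),
  oncotree    = c("LUAD", "LUAD", "NSCLC", "NSCLC", "LUAD", "LUAD"),
  sample_type = c("Primary", "Metastasis", "Metastasis",
                  "Primary", "Metastasis", "Metastasis"),
  age_at_seq  = c(60, 61, 62, 55, 70, 71),
  stringsAsFactors = FALSE
)

toy_selected <- toy_ngs %>%
  group_by(patient_id) %>%
  arrange(
    desc(oncotree == "LUAD"),           # prefer LUAD samples if available
    desc(sample_type == "Metastasis"),  # then prefer metastatic samples
    desc(age_at_seq),                   # then take the latest sample
    .by_group = TRUE
  ) %>%
  slice(1) %>%                          # keep the top-ranked sample per patient
  ungroup()

toy_selected
```

Patient P2 has a single sample that matches none of the preferences, and it is still returned, which mirrors the note above about patients with exactly one genomic sample.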

Create a data frame of the corresponding panel and sample IDs:

# specify sample panels and IDs
nsclc_2_0_sample_panels <- nsclc_2_0_samples %>% 
  select(cpt_seq_assay_id, cpt_genie_sample_id) %>%
  rename(panel_id = cpt_seq_assay_id,
         sample_id = cpt_genie_sample_id) %>%
  filter(!is.na(panel_id))

Process Data with create_gene_binary()

The create_gene_binary() function takes inputs of mutations, fusions, and CNA data and returns a binary matrix with the alteration status for each gene, annotating missingness when genes were not included on a next generation sequencing panel.

It is critical to utilize the specify_panel argument of create_gene_binary(). Samples included in GENIE BPC were sequenced across multiple sequencing platforms, with the genes included varying across panels. Without the specify_panel argument, missingness will not be correctly annotated, and genes that were not tested will be incorrectly documented as not being altered.
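To see why panel-aware missingness matters, consider the toy example below. It is a base R sketch with made-up panels and samples, not the logic used inside create_gene_binary(); the panel contents and object names are assumptions chosen for illustration.

```r
# made-up panels: panel "A" covers three genes, panel "B" covers only two
toy_panel_genes <- list(
  A = c("KRAS", "STK11", "KEAP1"),
  B = c("KRAS", "STK11")
)

toy_samples <- data.frame(
  sample_id = c("S1", "S2", "S3"),
  panel_id  = c("A", "B", "B"),
  stringsAsFactors = FALSE
)

# naive binary matrix: 1 = alteration reported, 0 = nothing reported
toy_binary <- matrix(
  c(1, 0, 1,
    0, 1, 0,
    0, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(toy_samples$sample_id, c("KRAS", "STK11", "KEAP1"))
)

# set a cell to NA when the gene was not on that sample's panel
toy_annotated <- toy_binary
for (i in seq_len(nrow(toy_samples))) {
  covered <- toy_panel_genes[[toy_samples$panel_id[i]]]
  not_tested <- !(colnames(toy_annotated) %in% covered)
  toy_annotated[i, not_tested] <- NA
}

# KEAP1 alteration frequency with and without panel annotation
mean(toy_binary[, "KEAP1"])                   # denominator: all three samples
mean(toy_annotated[, "KEAP1"], na.rm = TRUE)  # denominator: the one tested sample
```

In the naive matrix, the two samples sequenced on panel "B" count as KEAP1 wild type even though KEAP1 was never tested, which understates the alteration frequency.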

Note: you can optionally check for older gene names in your data set and recode them to their current Hugo Symbol by passing recode_aliases = "genie" to create_gene_binary().

Using the genomic data from Synapse:

The fusions and CNA data as downloaded from Synapse require some modifications prior to being supplied to the gnomeR::create_gene_binary() function.

First, the CNA file can be transposed to match the expected input for create_gene_binary() using pivot_cna_longer():

# transpose CNA data from Synapse
cna_synapse_long <- pivot_cna_longer(nsclc_2_0$NSCLC_v2.0$cna)
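For readers unfamiliar with this reshaping step, the sketch below shows the general wide-to-long idea using tidyr::pivot_longer() on made-up data. It assumes a wide layout with one row per gene and one column per sample; the output column names and the choice to drop copy-neutral cells are assumptions of this sketch, not a description of what pivot_cna_longer() does internally.

```r
library(tidyr)

# made-up wide CNA table: one row per gene, one column per sample (assumed layout)
toy_cna_wide <- data.frame(
  Hugo_Symbol = c("KRAS", "CDKN2A"),
  S1 = c(2, 0),
  S2 = c(0, -2),
  S3 = c(NA, -1),
  stringsAsFactors = FALSE
)

# one row per gene-sample pair
toy_cna_long <- pivot_longer(
  toy_cna_wide,
  cols = -Hugo_Symbol,
  names_to = "sample_id",
  values_to = "alteration"
)

# keep only cells that report a copy number change (a choice made for this sketch)
keep_row <- !is.na(toy_cna_long$alteration) & toy_cna_long$alteration != 0
toy_cna_long <- toy_cna_long[keep_row, ]

toy_cna_long
```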

Next, the fusions file can be reformatted to match the expected input for create_gene_binary() using reformat_fusion():

# reformat fusions data from Synapse
fusions_synapse_long <- reformat_fusion(nsclc_2_0$NSCLC_v2.0$fusions)

Finally, the reformatted genomic data can be supplied to create_gene_binary() to annotate genomic alterations for patients in the analytic cohort of interest.

The CNA data downloaded from cBioPortal include only high-level CNA (-2, 2), so we specify high_level_cna_only = TRUE when processing the Synapse data to keep the results consistent with those based on the cBioPortal data.
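As a rough illustration of what restricting to high-level events means, the base R sketch below keeps only amplifications (2) and deep deletions (-2) from a made-up long-format CNA table. The column names are hypothetical, and this is not the code path used by create_gene_binary().

```r
# made-up long-format CNA data; column names are illustrative only
toy_cna <- data.frame(
  sample_id   = c("S1", "S1", "S2", "S3"),
  hugo_symbol = c("KRAS", "MET", "CDKN2A", "CDKN2A"),
  alteration  = c(2, 1, -2, -1),
  stringsAsFactors = FALSE
)

# keep only high-level events: amplification (2) and deep deletion (-2)
toy_cna_high <- toy_cna[abs(toy_cna$alteration) == 2, ]

toy_cna_high
```

The low-level gain in MET (1) and the shallow deletion in CDKN2A (-1) are dropped, so only the KRAS amplification and the deep CDKN2A deletion remain.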

Additionally, we will use the built-in "genie" option to check gene aliases (see ?create_gene_binary for more info).

nsclc_2_0_gen_dat_synapse <-
  create_gene_binary(
    mutation = nsclc_2_0$NSCLC_v2.0$mutations_extended,
    cna = cna_synapse_long,
    high_level_cna_only = TRUE,
    fusion = fusions_synapse_long,
    samples = nsclc_2_0_sample_panels$sample_id,
    specify_panel = nsclc_2_0_sample_panels, 
    recode_aliases = "genie"
  )

Using the genomic data from cBioPortal:

nsclc_2_0_gen_dat_cbio <-
  create_gene_binary(
    mutation = mutations_extended_2_0,
    cna = cna_2_0,
    fusion = fusions_2_0,
    samples = nsclc_2_0_sample_panels$sample_id,
    specify_panel = nsclc_2_0_sample_panels, 
    recode_aliases = "genie"
  )

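Binary genomic matrices created using the genomic data downloaded from Synapse and cBioPortal should be largely equivalent, apart from the small differences that can arise from the processing pipelines described above. We will proceed using the nsclc_2_0_gen_dat_cbio object.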
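One way to check how closely two binary matrices agree is to align them on sample ID and shared gene columns, then flag the cells that differ. The sketch below does this in base R on two small made-up data frames; the object names are hypothetical, and with real data the column sets of the two matrices may not match exactly.

```r
# two made-up binary matrices with different row and column order
toy_synapse <- data.frame(
  sample_id = c("S1", "S2", "S3"),
  KRAS  = c(1, 0, 0),
  STK11 = c(0, 1, NA),
  stringsAsFactors = FALSE
)
toy_cbio <- data.frame(
  sample_id = c("S3", "S1", "S2"),
  STK11 = c(NA, 0, 1),
  KRAS  = c(0, 1, 1),
  stringsAsFactors = FALSE
)

# align rows by sample ID and restrict to the columns both objects share
shared_cols <- intersect(names(toy_synapse), names(toy_cbio))
a <- toy_synapse[order(toy_synapse$sample_id), shared_cols]
b <- toy_cbio[order(toy_cbio$sample_id), shared_cols]
stopifnot(identical(a$sample_id, b$sample_id))

# flag differing cells; NA vs NA counts as agreement
genes <- setdiff(shared_cols, "sample_id")
differs <- sapply(genes, function(g) {
  x <- a[[g]]
  y <- b[[g]]
  (is.na(x) != is.na(y)) | (!is.na(x) & !is.na(y) & x != y)
})
rownames(differs) <- a$sample_id

# row and column positions of the discordant cells
which(differs, arr.ind = TRUE)
```

Here the only discordant cell is KRAS for sample S2.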

Collapse Data with summarize_by_gene()

We can summarize the presence of any alteration event (mutation, amplification, deletion, structural variant) with the summarize_by_gene() function, such that each gene is a column that captures the presence of any event regardless of alteration type.

Summarizing the first 10 samples for KRAS alterations:

Using the genomic data from Synapse:

nsclc_2_0_gen_dat_synapse[1:10, ] %>% 
  select(sample_id, KRAS, KRAS.Amp) %>%
  summarize_by_gene()
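Conceptually, collapsing by gene takes the maximum across a gene's alteration-specific columns. The base R sketch below shows that idea for KRAS on made-up data. It is an illustration rather than the summarize_by_gene() implementation, and in this sketch a sample that is missing for every KRAS column simply stays missing.

```r
# made-up binary data with separate mutation and amplification columns
toy_gene_binary <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4"),
  KRAS      = c(1, 0, 0, NA),
  KRAS.Amp  = c(0, 1, 0, NA),
  stringsAsFactors = FALSE
)

# any KRAS event: element-wise maximum across the KRAS columns
toy_collapsed <- data.frame(
  sample_id = toy_gene_binary$sample_id,
  KRAS = pmax(toy_gene_binary$KRAS, toy_gene_binary$KRAS.Amp),
  stringsAsFactors = FALSE
)

toy_collapsed
```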

Analyzing Data

After the data have been transformed into a binary format, we can create summaries and visualizations to better understand the data.

Summarize Data with tbl_genomic()

The tbl_genomic() function summarizes the frequency of alteration events from the binary data returned from create_gene_binary() or summarize_by_gene().

Using the genomic data from Synapse:

Summarizing the frequencies of KEAP1, STK11, and SMARCA4 alteration events:

nsclc_2_0_gen_dat_synapse %>% 
  select(sample_id, KEAP1, STK11, SMARCA4) %>%
  tbl_genomic()

Users can restrict their data set to genes at or above a given alteration frequency threshold with the subset_by_frequency() function before passing it to tbl_genomic().

Below, we summarize alteration events with >=10% frequency:

nsclc_2_0_gen_dat_synapse %>%
  subset_by_frequency(t = 0.1) %>%
  tbl_genomic()
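The frequency filter can be pictured as computing each gene's alteration proportion and keeping the columns that meet the threshold. The base R sketch below does this on made-up data, using only samples in which the gene was tested as the denominator; that denominator choice is an assumption of this sketch and should be checked against ?subset_by_frequency.

```r
# made-up binary data for ten samples
toy_freq_binary <- data.frame(
  sample_id = paste0("S", 1:10),
  KRAS  = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0),
  STK11 = c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0),
  KEAP1 = c(1, NA, NA, NA, NA, NA, NA, NA, NA, 0),
  EGFR  = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# alteration frequency per gene among samples where the gene was tested
gene_cols <- setdiff(names(toy_freq_binary), "sample_id")
freq <- colMeans(toy_freq_binary[gene_cols], na.rm = TRUE)

# keep genes altered in at least 10% of tested samples
keep <- names(freq)[freq >= 0.1]
toy_subset <- toy_freq_binary[c("sample_id", keep)]

names(toy_subset)
```

EGFR is dropped because it is never altered, while KEAP1 is kept because one of its two tested samples is altered.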

Using the genomic data from cBioPortal:

Summarizing the frequencies of KEAP1, STK11, and SMARCA4 alteration events:

nsclc_2_0_gen_dat_cbio %>%
  select(sample_id, KEAP1, STK11, SMARCA4) %>%
  tbl_genomic()

Summarizing alteration events with >=10% frequency:

nsclc_2_0_gen_dat_cbio %>%
  subset_by_frequency(t = 0.1) %>%
  tbl_genomic()

Data Visualizations

We can use the mutation_viz() function to visualize several aspects of the mutation data, including variant classification, variant type, SNV class and top variant genes.

For the purposes of this vignette we will visualize the genomic data from cBioPortal.

Using the genomic data from cBioPortal:

mutation_viz_gen_dat_cbio <- mutation_viz(mutations_extended_2_0)

mutation_viz_gen_dat_cbio

References

Additional details regarding the GENIE BPC data and the {genieBPC} R package are published in the following papers:

Technical details regarding proper analysis of this data can be found in the following publication:



MSKCC-Epi-Bio/gnomeR documentation built on March 28, 2024, 2:42 a.m.