knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Load libraries:
library(magrittr) library(dplyr) library(stringr) library(readr) library(KrasAlleleCna)
Establish paths:
pkg_dir <- system.file(package = "KrasAlleleCna") extdata_dir <- system.file("extdata", package = "KrasAlleleCna") data_dir <- system.file("data", package = "KrasAlleleCna")
BAM files were download from Genomic Data Commons (GDC): TCGA-COAD, TCGA-READ, TCGA-AML, TCGA-PAAD, TCGA-DLBC
Tumor purity scores were downloaded from https://gdc.cancer.gov/about-data/publications/pancanatlas and labeled "ABSOLUTE purity/ploidy file - TCGA_mastercalls.abs_tables_JSedit.fixed.txt"
Tumor CNV data were downloaded from https://gdc.cancer.gov/about-data/publications/pancanatlas and labeled "Copy Number - broad.mit.edu_PANCAN_Genome_Wide_SNP_6_whitelisted.seg"
All downloaded files were downloaded to "data-raw/".
The Manifest and Sample Sheet were downloaded from GDC Data Portal after adding the files to the cart. They were saved to "data-raw/".
The sample sheet was loaded and filtered for the desired sample types.
# read in tsv and adjust column names tcga_sample_sheet <- read_tsv(file.path(extdata_dir, "tcga_sample_sheet.tsv")) colnames(tcga_sample_sheet) <- str_replace_all(colnames(tcga_sample_sheet), " ", "_") %>% str_to_lower() # Filter for primary tumors samples tcga_sample_sheet %<>% filter(sample_type %in% c("Primary Blood Derived Cancer - Bone Marrow", "Primary Blood Derived Cancer - Peripheral Blood", "Primary Tumor")) %>% filter(!str_detect(file_name, "gapfillers")) %>% unique() head(tcga_sample_sheet)
There were some case IDs with more than one file. I used a bunch of manual filters to eventually only have one file per sample. You can see these filters in "R/download-createfilenamelists.R".
I began by gathering the IDs with multiple files.
ids_table <- table(tcga_sample_sheet$case_id) ids_table <- ids_table[ids_table > 1] length(ids_table)
Only r length(ids_table)
samples had more than one file, so this was not a wide-spread problem.
tcga_sample_sheet %<>% group_by(case_id) %>% mutate(keep = choose_one_filename(file_name)) %>% filter(keep == file_name) %>% ungroup() # check that all case_IDs are unique n_unique(tcga_sample_sheet$case_id) == nrow(tcga_sample_sheet)
The final list of filenames was saved to "data/tcga_filename_list.txt"
cat(unlist(tcga_sample_sheet$file_id), file = file.path(extdata_dir, "tcga_filename_list.txt"), sep = "\n")
This script is submitted as a batch array, passing as the first argument a file listing the file names from GDC (created in above). Below is the step-by-step process in 'download_process_tcga_bams.sh':
samtools index
indexes the BAM filebcftools mpileup | bcftools call
makes a VCF of variants in KRASbcftools mpileup
creates an annotated VCFannovar
annotates the VCF file to find mutations in KRASThe script was run as a batch array using the following command.
sbatch --array=1-$(wc -l < data-raw/tcga_filename_list.txt) \ bash-scripts/download_process_tcga_bams.sh \ data-raw/tcga_filename_list.txt
Some files failed to download, returning "internal server error" or "" instead of the BAM file. The file names of the files that failed were stored to "data-raw/failed_downloads.txt" and turned into a "RData" file "failed_downloads.RData".
failed_downloads <- readLines(file.path(extdata_dir, "failed_downloads.txt")) failed_downloads <- str_split_fixed(unlist(failed_downloads), ":", 2)[, 1]
I added this information to the downloaded
column of the sample sheet tibble before saving the final data table to "data/"
tidy_tcga_sample_sheet <- tcga_sample_sheet %>% mutate(downloaded = !(file_id %in% failed_downloads)) usethis::use_data(tidy_tcga_sample_sheet, overwrite = TRUE)
Below is a table of the number of failed downloads per TCGA project.
tidy_tcga_sample_sheet %>% filter(!downloaded) %>% group_by(project_id) %>% summarise(failed_downloads = n_distinct(file_id)) %>% knitr::kable(col.names = c("TCGA project", "num. failed downloads")) %>% kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, position = "left")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.