An R Package for Analyzing Gene Expression data from The Cancer Genome Atlas
tcgaRNAML is a collection of functions for exploring the relationship between clinical features and gene expression data from The Cancer Genome Atlas (TCGA) database. The package represents only a small subset of code written for general TCGA data exploration. It relies on the TCGA2STAT package to access and download RNA-seq datasets from the TCGA data portal. The package currently provides two functions: makeROCs and varselectVenn.
The easiest way to install tcgaRNAML is from GitHub:
1. Clone the repository to your local machine.
2. Unzip the contents.
3. Move to the unzipped directory and build from the command line (i.e. R CMD build tcgaRNAML).
4. Install from the command line (i.e. R CMD INSTALL tcgaRNAML_0.1.tar.gz). Note that INSTALL must be uppercase.
5. Restart your R session.
6. Load the package (i.e. library(tcgaRNAML)).
makeROCs generates a multi-Receiver Operating Characteristic (ROC) plot using RNA-seq data from The Cancer Genome Atlas (TCGA) database. Here, the predictors are the normalized expression values of 20501 genes, while the response variable is a user-specified clinical feature. TCGA gene expression data is imported using the TCGA2STAT package. Feature selection is performed, and the remaining features (genes) are processed by five different machine learning classifiers: LASSO-Logistic, K-Nearest Neighbor, Random Forest, a Radial-Kernel Support Vector Machine, and a Sigmoid-Kernel Support Vector Machine. An ROC curve is generated for each classifier, and the curves are combined onto one plot. If a relationship does exist, the multi-ROC plot indicates which machine learning algorithm has the best predictive power.
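As a rough illustration of what a multi-ROC comparison involves, the sketch below computes ROC points and the area under the curve (AUC) for several sets of classifier scores and overlays them on one plot. It uses only base R and simulated data; the score vectors, the helper names roc_points and auc, and the three "classifiers" are assumptions made for the example and are not taken from the tcgaRNAML source.

```r
# Minimal base-R sketch: ROC points and AUC for several classifiers' scores.
# All data are simulated; nothing here comes from tcgaRNAML itself.
set.seed(1)
n <- 200
labels <- rbinom(n, 1, 0.5)   # hypothetical binary clinical feature

# Hypothetical predicted scores from three classifiers of varying quality
scores <- list(
  strong = labels * 1.5 + rnorm(n),
  weak   = labels * 0.5 + rnorm(n),
  random = rnorm(n)
)

# ROC curve: sort by score (high to low), then accumulate TPR and FPR
roc_points <- function(score, label) {
  label <- label[order(score, decreasing = TRUE)]
  tpr <- c(0, cumsum(label == 1) / sum(label == 1))
  fpr <- c(0, cumsum(label == 0) / sum(label == 0))
  data.frame(fpr = fpr, tpr = tpr)
}

# AUC by the trapezoidal rule over the ROC points
auc <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

rocs <- lapply(scores, roc_points, label = labels)
aucs <- sapply(rocs, auc)

# Overlay all curves on one plot, with the chance diagonal for reference
plot(c(0, 1), c(0, 1), type = "l", lty = 2,
     xlab = "False positive rate", ylab = "True positive rate")
cols <- c("red", "blue", "darkgreen")
for (i in seq_along(rocs)) {
  lines(rocs[[i]]$fpr, rocs[[i]]$tpr, col = cols[i])
}
legend("bottomright", col = cols, lty = 1,
       legend = sprintf("%s (AUC = %.2f)", names(aucs), aucs))
```

The classifier whose curve sits furthest above the diagonal has the largest AUC, which is the sense in which one algorithm is said to have the best predictive power.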
Users can specify two different target variables: tumor stage and gender. Additionally, six cancer types are currently supported: Adrenocortical Carcinoma (ACC), Bladder Urothelial Carcinoma (BLCA), Kidney Renal Clear Cell Carcinoma (KIRC), Kidney Renal Papillary Cell Carcinoma (KIRP), Liver Hepatocellular Carcinoma (LIHC), and Thyroid Carcinoma (THCA). Users can also specify "Random" in place of the target variable argument. This will generate a multi-ROC plot with a randomly selected target and cancer type.
makeROCs("THCA", "Gender")
None of the five models achieved an AUC above 0.75, with the LASSO-Logistic model performing slightly better than the other algorithms. In general, predicting gender may not be as clinically useful as, for instance, predicting tumor stage, metastatic status, or vital status.
Here's a multi-panel plot of four other makeROCs runs with varying arguments:
varselectVenn generates a Venn diagram using RNA-seq data from The Cancer Genome Atlas (TCGA) database. Users specify which cancer types to include (varselectVenn currently supports 2- and 3-set Venn diagrams), as well as the target variable to predict. The predictors are the normalized expression values of 20501 genes. The data is processed by a random forest classifier, and the variables (genes) are ranked by their influence on the model’s predictive power. Users specify how many of the high-importance genes to retain, and a Venn diagram is generated and saved in the current working directory as a .tiff file (Tagged Image File Format). The function also returns a list object that specifies which genes were retained for each cancer type, as well as which genes were at the intersection of all specified cancer types.
ex01 <- varselectVenn(c("KIRC","KIRP"), 80, "vitalstatus")
For KIRP and KIRC, all 20501 genes were ranked by their impact on the random forest model's ability to predict vital status from gene expression levels. The top 80 genes were retained from each cancer type, and a Venn diagram was generated. Three genes were found to be highly important in predicting vital status for both cancer types. These can be accessed in the returned list object:
ex01$Intersect
"MTHFD2"
"PTTG1"
"KIF18B"
Each of these three genes is implicated in various types of cancer. Perhaps most notable is MTHFD2, which encodes an enzyme responsible for regulating the balance between DNA methylation and nucleotide synthesis. According to The Human Protein Atlas, MTHFD2 is a prognostic marker in renal cancer, endometrial cancer, and glioma. Additionally, PTTG1 is implicated in a number of cancers, and KIF18B is a marker in liver cancer, pancreatic cancer, and melanoma.
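The selection step behind the Venn diagram (keep the top-ranked genes per cancer type, then intersect the sets) can be sketched in base R as follows. This is only an illustration under stated assumptions: the importance scores are simulated rather than taken from a fitted random forest, the gene identifiers and the helper top_n_genes are hypothetical, and only the Intersect element name mirrors the list returned in the example above.

```r
# Minimal base-R sketch of "top-N genes per cancer type, then intersect".
# Importance scores are simulated; a real run would take them from a
# fitted random forest for each cancer type.
set.seed(42)
genes <- paste0("GENE", 1:1000)   # hypothetical gene identifiers

# Hypothetical importance scores, one named vector per cancer type
importance <- list(
  KIRC = setNames(runif(length(genes)), genes),
  KIRP = setNames(runif(length(genes)), genes)
)

# Return the names of the n highest-scoring genes
top_n_genes <- function(imp, n) {
  names(sort(imp, decreasing = TRUE))[seq_len(n)]
}

n_keep <- 80
retained <- lapply(importance, top_n_genes, n = n_keep)

# Genes retained for every cancer type (the centre of the Venn diagram)
shared <- Reduce(intersect, retained)

# Assemble a list similar in spirit to the one described above
result <- c(retained, list(Intersect = shared))
str(result)
```

Because Reduce applies intersect across the whole list, the same code works for two or three cancer types without modification.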