knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(glab.library)
Originally written by Haroon, 07/12/2023. Recommend running this tutorial by downloading this Rmd file and input data files into the same directory. Use RStudio and run code chunks individually.
The first part of this tutorial will cover how to format the data before the principal component analysis and projection can be done. The tutorial is written in R Markdown format -- similar to Jupyter Notebooks for Python where code chunks are interspersed between descriptive text and can be run.
Data sets may have thousands of different genes making it difficult to see trends in gene expression among samples. Principal Component Analysis makes it easier to see these trends in gene expression data by reducing the number of dimensions. Gene expression data is instead measured by the principal components or directions where the data has the most variance. These principal components can then be plotted, allowing one to see trends in gene expression.
PCA projection can be useful when we have new samples and the corresponding gene expression data; we can compare the original PCA done with the old samples with a PCA containing the new samples by multiplying the new gene expression data by the principal components or loadings of the original PCA and then projecting this new PCA over the original. This preserves the original PCA while allowing one to make comparisons between old and new samples in terms of their gene expression.
Input data (can download from this repository):
The Meso and Kich cell-line gene expression files are read with fread() from the data.table package due to the large file sizes. Both gene expression data sets are data frames, with columns representing the samples and rows representing the different genes. Each value indicates the expression level of a gene for a given sample.
library(data.table) Kich_CellLines <- data.frame(fread("HiSeqV2_KICH.txt")) Meso_CellLines <- data.frame(fread("HiSeqV2_MESO.txt"))
The Meso and Kich cell-line gene expression data sets contain genes which do not code for proteins. The following protein coding gene list is read and will be used to filter each cell-line data set so that they only have protein coding genes.
TOIL protein coding genelist union
: Protein coding gene listprotein.coding <- glab.library::`TOIL protein coding genelist union` #Reformats the protein coding gene list as a dataframe pcg_header <- as.data.frame(colnames(protein.coding)) names(pcg_header) <- colnames(protein.coding) protein.coding <- rbind(pcg_header, protein.coding) colnames(protein.coding) <- "protein_coding_genes" rm(pcg_header)
The filter() function from the dplyr pacakge is used to filter both the Kich and Meso gene expression data sets so that they only have protein coding genes. The files are filtered so that only rows with genes that correspond to the coding genes are included.
library(dplyr) Kich_Filtered <- Kich_CellLines %>% filter(Kich_CellLines[,1] %in% protein.coding$protein_coding_genes) Meso_Filtered <- Meso_CellLines %>% filter(Meso_CellLines[,1] %in% protein.coding$protein_coding_genes)
In order for PCA projection to work, both cell-line data sets must have the same genes. The genes from both data sets are matched and only genes found in both will be included.
#Determines which genes in the SKCM file match those in the UVM file and vice versa Kich_Match_RowNum <- match(Meso_Filtered$sample, Kich_Filtered$sample) Meso_Match_RowNum <- match(Kich_Filtered$sample, Meso_Filtered$sample) #Genes not present in both data sets are filtered out from each Kich_Filtered_GeneMatch <-Kich_Filtered[Kich_Match_RowNum,] Meso_Filtered_GeneMatch <-Meso_Filtered[Meso_Match_RowNum,]
New text files containing the filtered Kich and Meso gene expression data sets are created so the PCA and projection can be done.
write.table(Kich_Filtered_GeneMatch, file = "Kich_CellLines.txt",row.names = F, sep = "\t") write.table(Meso_Filtered_GeneMatch, file = "Meso_CellLines.txt",row.names = F, sep = "\t")
An annotation file is read and will be used to color cell-lines by group (all Kich files go into group "Kich", all Meso files go into group "Meso")
#Creates DF with Kich cell-lines group_annotation <- data.frame(colnames(Kich_Filtered_GeneMatch)[-1], "Kich") #Creates column names for DF colnames(group_annotation) <- c("Sample", "Group") #Meso cell-lines are added to the DF group_annotation <- group_annotation %>% add_row(Sample = colnames(Meso_Filtered_GeneMatch[-1]), Group = "Meso") #Writes annotation file to be used in part 2 write.table(group_annotation, file = "group_annotation.txt")
The second part of this tutorial will cover how to perform the principal component analysis and project PCAs onto each other. A PCA on the Kich gene expression data set will be done before the PCA of the Meso gene expression data set is projected on to it and vice versa.
The PCAs will be colored with the annotation file created in part 1. Each sample will be colored according to what cancer type (Meso or Kich) it is. First, the annotation files from part 1 will be read.
group_annotation <- read.table("group_annotation.txt")
Scripts:
The script above will be called in code directly from the package and will be used to do the PCA on one of the gene expression sets while the other is projected on top.
The projection function is called in directly from the package as shown below. The PCA is done on the Kich gene expression data which used the original samples. The Meso gene expression data which uses new samples is then projected on top of it. The 'file' parameter corresponds to the original PCA (Kich) while the 'file2' parameter corresponds to the projection (Meso).
glab.library::intersect_doPCA_from_file_and_project_second_dataset(file = "Kich_CellLines.txt", file2 = "Meso_CellLines.txt", train_string = "test", fread = F)
The previous intersect function will produce the PCA score file for the original data (Kich in this case) which is inputted into the "file" parameter. The file should end with "prcomp_scores". It also produces the scores matrix file for the projected data (Meso in this case) which is inputted into the "rotated.file" parameter. The file should end with "prcomp_rotated".
The info.name parameter matches the cell-line/sample names from the annotation file with the the cell-line/sample names from the original PCA scores file (Kich in this case). This will map the Kich cell-line names to their respective points on the plot. The info.name2 parameter matches the cell-line/sample names from the annotation file with the the cell-line/sample names from the projected data (Meso in this case). This will map the Meso cell-line names to their respective points on the plot. "group_annotation$Sample" contains the sample names for both the Kich and Meso cell-lines so it is inputted into both parameters.
The info.type parameter matches the cell-line/sample groupings from the annotation file with the the samples/cell-lines from the original PCA scores file (Kich in this case). The info.type2 parameter matches the cell-line/sample groupings from the annotation file with the the samples/cell-lines from the projected data (Meso in this case). "group_annotation$Group" contains the cancer types for both the Kich and Meso cell-lines so it is inputted into both parameters. This will color each cell-line on the plot according to what cancer type it is (either Kich or Meso).
The "PCX" and "PCY" parameters control which principal components will be displayed. In this case, PC1 and PC2 will be displayed.
The "labelsProj" parameter controls whether or not the points from the projected file are labeled. In this case, "labelsProj" is set to true and the labels for the points from the projected file are labeled.
The "labelsOrgPCA" parameter controls whether or not the points from the original PCA file are labeled. In this case, "labelsOrgPCA" is set to false and the labels for the points from the original file are not labeled.
The following projection plots will show different principal components and labels being displayed.
glab.library::plot_pca_projection_labeloptions( file = "Kich_CellLines_prcomp_scores.txt", rotated.file = "Meso_CellLines_test_prcomp_rotated.txt", info.name = group_annotation$Sample, info.type = group_annotation$Group, info.name2 = group_annotation$Sample, info.type2 = group_annotation$Group, PCx = "PC1", PCy = "PC2", ellipse = F, labelsProj = T, labelsOrgPCA = F, title = "PC1 vs PC2: MESO Projection onto KICH PCA")
glab.library::plot_pca_projection_labeloptions( file = "Kich_CellLines_prcomp_scores.txt", rotated.file = "Meso_CellLines_test_prcomp_rotated.txt", info.name = group_annotation$Sample, info.type = group_annotation$Group, info.name2 = group_annotation$Sample, info.type2 = group_annotation$Group, PCx = "PC1", PCy = "PC2", ellipse = F, labelsProj = F, labelsOrgPCA = T, title = "PC1 vs PC2: MESO Projection onto KICH PCA")
glab.library::plot_pca_projection_labeloptions( file = "Kich_CellLines_prcomp_scores.txt", rotated.file = "Meso_CellLines_test_prcomp_rotated.txt", info.name = group_annotation$Sample, info.type = group_annotation$Group, info.name2 = group_annotation$Sample, info.type2 = group_annotation$Group, PCx = "PC2", PCy = "PC3", ellipse = F, labelsProj = T, labelsOrgPCA = F, title = "PC2 vs PC3: MESO Projection onto KICH PCA")
glab.library::plot_pca_projection_labeloptions( file = "Kich_CellLines_prcomp_scores.txt", rotated.file = "Meso_CellLines_test_prcomp_rotated.txt", info.name = group_annotation$Sample, info.type = group_annotation$Group, info.name2 = group_annotation$Sample, info.type2 = group_annotation$Group, PCx = "PC2", PCy = "PC3", ellipse = F, labelsProj = F, labelsOrgPCA = T, title = "PC2 vs PC3: MESO Projection onto KICH PCA")
Now that the Meso data has been projected and plotted over the PCA plot of the Kich data set, let's try and do the opposite. Now the Meso data will be used for the orginal PCA and the its loadings will be applied to the Kich data set so it can be projected over the Meso PCA.
glab.library::intersect_doPCA_from_file_and_project_second_dataset(file = "Meso_CellLines.txt", file2 = "Kich_CellLines.txt", train_string = "test", fread = F)
glab.library::plot_pca_projection_labeloptions( file = "Meso_CellLines_prcomp_scores.txt", rotated.file = "Kich_CellLines_test_prcomp_rotated.txt", info.name = group_annotation$Sample, info.type = group_annotation$Group, info.name2 = group_annotation$Sample, info.type2 = group_annotation$Group, PCx = "PC1", PCy = "PC2", ellipse = F, labelsProj = T, labelsOrgPCA = F, title = "PC1 vs PC2: KICH Projection onto MESO PCA")
glab.library::plot_pca_projection_labeloptions( file = "Meso_CellLines_prcomp_scores.txt", rotated.file = "Kich_CellLines_test_prcomp_rotated.txt", info.name = group_annotation$Sample, info.type = group_annotation$Group, info.name2 = group_annotation$Sample, info.type2 = group_annotation$Group, PCx = "PC1", PCy = "PC2", ellipse = F, labelsProj = F, labelsOrgPCA = T, title = "PC1 vs PC2: KICH Projection onto MESO PCA")
glab.library::plot_pca_projection_labeloptions( file = "Meso_CellLines_prcomp_scores.txt", rotated.file = "Kich_CellLines_test_prcomp_rotated.txt", info.name = group_annotation$Sample, info.type = group_annotation$Group, info.name2 = group_annotation$Sample, info.type2 = group_annotation$Group, PCx = "PC2", PCy = "PC3", ellipse = F, labelsProj = T, labelsOrgPCA = F, title = "PC2 vs PC3: KICH Projection onto MESO PCA")
glab.library::plot_pca_projection_labeloptions( file = "Meso_CellLines_prcomp_scores.txt", rotated.file = "Kich_CellLines_test_prcomp_rotated.txt", info.name = group_annotation$Sample, info.type = group_annotation$Group, info.name2 = group_annotation$Sample, info.type2 = group_annotation$Group, PCx = "PC2", PCy = "PC3", ellipse = F, labelsProj = F, labelsOrgPCA = T, title = "PC2 vs PC3: KICH Projection onto MESO PCA")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.