Erik Christensen 24/03/2021
This is document is a brief example on how data can be prepared for upload to TMExplorer. I’ll use GSE75688 since it has additional columns in the matrix and a separate cell-type info file.
First let’s look at the dataset in TMExplorer.
gse75688 <- queryTME(geo_accession = 'GSE75688')[[1]]
dim(counts(gse75688))
## [1] 57915 563
counts(gse75688)[1:5,1:4]
## BC01_Pooled BC01_Tumor BC02_Pooled BC03_Pooled
## TSPAN6_ENSG00000000003.10 2.33 1.25 43.96 7.64
## TNMD_ENSG00000000005.5 0.00 0.00 0.00 0.00
## DPM1_ENSG00000000419.8 60.70 28.44 74.73 41.41
## SCYL3_ENSG00000000457.9 47.93 4.43 9.89 7.61
## C1orf112_ENSG00000000460.12 4.79 1.67 10.87 0.92
The counts file is a 57915x563 matrix with a single column of rownames. This is the target format. Let’s compare that to the dataset downloaded from GEO.
geo <- read.csv('GSE75688_GEO_processed_Breast_Cancer_raw_TPM_matrix.txt.gz',
sep='\t')
dim(geo)
## [1] 57915 566
geo[1:5,1:4]
## gene_id gene_name gene_type BC01_Pooled
## 1 ENSG00000000003.10 TSPAN6 protein_coding 2.33
## 2 ENSG00000000005.5 TNMD protein_coding 0.00
## 3 ENSG00000000419.8 DPM1 protein_coding 60.70
## 4 ENSG00000000457.9 SCYL3 protein_coding 47.93
## 5 ENSG00000000460.12 C1orf112 protein_coding 4.79
The GEO matrix has three additional columns. In order to match the
target format, we’ll drop the gene_type
column, merge the gene_id
and gene_name
columns, then set the merged result as the rownames.
NOTE: The rownames here can be gene names, gene IDs, some combination of
the two like we have here, but they must be unique (R cannot set
non-unique rownames) and identify the gene (instead of just being
numbers).
geo$gene_type <- NULL
geo$gene_id <- paste(geo$gene_name, geo$gene_id,sep='_')
rownames(geo) <- geo$gene_id
geo$gene_id <- NULL
geo$gene_name <- NULL
dim(geo)
## [1] 57915 563
geo[1:5,1:4]
## BC01_Pooled BC01_Tumor BC02_Pooled BC03_Pooled
## TSPAN6_ENSG00000000003.10 2.33 1.25 43.96 7.64
## TNMD_ENSG00000000005.5 0.00 0.00 0.00 0.00
## DPM1_ENSG00000000419.8 60.70 28.44 74.73 41.41
## SCYL3_ENSG00000000457.9 47.93 4.43 9.89 7.61
## C1orf112_ENSG00000000460.12 4.79 1.67 10.87 0.92
all(geo == counts(gse75688))
## [1] TRUE
After modification, the dimensions are the same and all values in the
matrix match the file in TMExplorer. If this was a new dataset, the
modified geo
object would be what you submit as the genes x cells
matrix.
This dataset also has cell type information available. In order to prepare that for TMExplorer we’ll need to load in the file and remove the extra columns.
geo_cellinfo <- read.csv('GSE75688_final_sample_information.txt',sep='\t')
head(geo_cellinfo)
## sample type index index2 index3
## 1 BC01_02 SC Tumor Tumor Tumor
## 2 BC01_03 SC Tumor Tumor Tumor
## 3 BC01_04 SC Tumor Tumor Tumor
## 4 BC01_05 SC Tumor Tumor Tumor
## 5 BC01_06 SC Tumor Tumor Tumor
## 6 BC01_08 SC Tumor Tumor Tumor
geo_cellinfo$type <- NULL
geo_cellinfo$index <- NULL
geo_cellinfo$index2 <- NULL
names(geo_cellinfo)[2] <- "label"
dim(geo_cellinfo)
## [1] 528 2
head(geo_cellinfo)
## sample label
## 1 BC01_02 Tumor
## 2 BC01_03 Tumor
## 3 BC01_04 Tumor
## 4 BC01_05 Tumor
## 5 BC01_06 Tumor
## 6 BC01_08 Tumor
This file contains the cell type for each cell in the original dataset, as shown below.
table(geo_cellinfo$label)
##
## Bcell Immune Myeloid Stromal Tcell Tumor
## 83 4 38 23 54 326
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.