require('exprsex')
Many gene expression studies are missing sex labels. Additionally, previous studies have shown that metadata sex labels frequently contain errors (Toker et al 2016); however, previous expression-based methods are limited to large mixed sex studies. exprsex
is a program for sex labeling microarray gene expression data that has high performance across a variety of platforms and sample sizes.
Use cases: 1. You have a dataset that you want to sex label with the default model. 2. You have several similar datasets (e.g. same platform) and you want to train a more specific model on that dataset.
exprsex
includes the functionality for downloading a dataset from GEO, log-transforming if needed, finding the probe-to-gene mapping, and converting the data to genes. For more information on gene mapping, please go to the mapping vignette. Note that this step is slow because it includes gene mapping.
study1 <- getPrepGSE("GSE55668", gpl.dir = "../../gpl_ref/") study2 <- getPrepGSE("GSE48452") study3 <- getPrepGSE("GSE14323")
Let's grab the labels and reformat them for easier use. exprsex
allows for numeric sex labels (0="female" and 1="male"), or lowercase character labels (female/male).
# study 1 has sex labels with caps ("Female", "Male"), convert to lowercase sex_lab1 <- study1$GSE55668_series_matrix.txt.gz$pheno$`Sex:ch1` sex_lab1.2 <- unlist(sapply(sex_lab1, tolower)) # study 2 has sex labels formatted as "Sex: ", so we remove this sex_lab2 <- study2$GSE48452_series_matrix.txt.gz$pheno$characteristics_ch1.6 sex_lab2.2 <- sapply(sex_lab2, function(x) gsub("Sex: ", "", x)) # study 3 has no metadata sex labels
We can use default methods to sex label a dataset out of the box. We convert the expression matrix to ranks, and then can simply predict sex labels for the file.
# grab the gene expression matrix expr1 <- study1$GSE55668_series_matrix.txt.gz$gene_mat # add the "missing" rows to the expresion data that are in the gene list, # reorder the genes (this helps with comparison), and convert to ranks expr1.r <- reorderRank(expr1) # predict sex labels data(default_fit) study1_preds <- predSexLab(default_fit, expr1.r, numeric_lab = FALSE)
We can compare these to the sex labels the dataset has and see that they have very good concordance!
confMat <- function(actual, pred){ table(data.frame(cbind(actual, pred))) } confMat(sex_lab1.2, study1_preds)
exprsex
uses scores and a threshold score to classify. We can see these scores if we use another argument to predSexLab
:
// TODO add equations
study1_pred_scores <- predSexLab(default_fit, expr1.r, numeric_lab = FALSE, scores=TRUE) study1_res <- rbind(study1_pred_scores, "actual"=sex_lab1.2) study1_df <- data.frame(t(study1_res)) study1_df$score_m <- sapply(study1_df$score_m, as.numeric) study1_df$score_f <- sapply(study1_df$score_f, as.numeric) # we can easily plot these scores ggplot(data=study1_df, aes(x=score_f, y=score_m))+ geom_point(aes(shape=actual, color=sex))
We can also visualize the results in other ways, looking at the actual values used for labeling (by setting the ret_expr
flag to true).
study1_pred_scores2 <- predSexLab(fit_ex, expr1.r, numeric_lab = FALSE, scores=FALSE, ret_expr=TRUE) # this returns a matrix of m and f genes # // TODO: move this to visualization methods # we can put these together and visualize the results in a PC plot comb <- rbind(study1_pred_scores2$f_mat, study1_pred_scores2$m_mat) comb2 <- comb[apply(comb, 1, function(x) !any(is.na(x))),] pcs <- prcomp(comb2) pc_df <- data.frame(cbind(pcs$rotation[,c("PC1", "PC2")], study1_pred_scores2$sl)) pc_df$PC1 <- sapply(pc_df$PC1, as.numeric) pc_df$PC2 <- sapply(pc_df$PC2, as.numeric) colnames(pc_df)[3] <- "sex" ggplot(pc_df, aes(x=PC1, y=PC2))+ geom_point(aes(color=sex))
The default model from exprsex
has high-performance across a variety of platforms; however, it is often useful to build more specific models for a particular platform or data type.
We have a couple options here: A) Use default male + female-specific gene sets B) Define our own gene sets
Fits include a set of female-specific genes, male-specific genes, and a cutoff threshold.
The trainSexLab
function uses the training data and labels to remove low-variance genes. Sex labels have to be numeric first.
# convert to numeric sex_lab1.2num <- c(0,1)[as.factor(sex_lab1.2)] sex_lab2.2num <- c(0,1)[as.factor(sex_lab2.2)] names(sex_lab1.2num) <- colnames(exp_rank) fit_ex <- trainSexLab(expr1.r, sex_lab1.2num ) # take a look at the fit str(fit_ex)
Run sex labeling.
Compare to the sex labels from the trained fit and the actual sex labels.
# predict using the new fit (warning: this fit was trained on the data!) preds2 <- predSexLab(fit_ex, expr2.r) confMat(sex_lab2.2num,preds2) preds1 <- predSexLab(fit_ex, expr1.r) confMat(sex_lab1.2num, preds1)
exprsex
uses three sets of genes for sex labeling: a list of consensus genes across platform, and male + female-specific genes identified from a meta-analysis.
We provide an existing list of consensus genes in list_genes
but you can create your own based on a set of studies and use it to rank and reorder.
list.dats <- list(study1, study2, study3) consensus.genes <- getConsensusGenes(list.dats) # // TODO - compare # --- now reorder using this --- # expr1 <- study1$GSE55668_series_matrix.txt.gz$gene_mat expr2 <- study2$GSE48452_series_matrix.txt.gz$gene_mat expr3.1 <- study3$`GSE14323-GPL571_series_matrix.txt.gz`$gene_mat expr3.2 <- study3$`GSE14323-GPL96_series_matrix.txt.gz`$gene_mat expr1.r2 <- reorderRank(expr1, gene_list=consensus.genes) expr2.r2 <- reorderRank(expr2, gene_list=consensus.genes) expr3.1r2 <- reorderRank(expr3.1, gene_list=consensus.genes) expr3.2r2 <- reorderRank(expr3.2, gene_list=consensus.genes)
For this case, we just choose to filter the genes by the consensus genes. It is possible you may have another set of genes
# // TODO - fill in
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.