require('exprsex')

Many gene expression studies are missing sex labels. Additionally, previous studies have shown that metadata sex labels frequently contain errors (Toker et al 2016); however, previous expression-based methods are limited to large mixed sex studies. exprsex is a program for sex labeling microarray gene expression data that has high performance across a variety of platforms and sample sizes.

Use cases: 1. You have a dataset that you want to sex label with the default model. 2. You have several similar datasets (e.g. same platform) and you want to train a more specific model on that dataset.

Prepare for sex labeling: Download a dataset and convert it to genes

exprsex includes the functionality for downloading a dataset from GEO, log-transforming if needed, finding the probe-to-gene mapping, and converting the data to genes. For more information on gene mapping, please go to the mapping vignette. Note that this step is slow because it includes gene mapping.

study1 <- getPrepGSE("GSE55668", gpl.dir = "../../gpl_ref/")
study2 <- getPrepGSE("GSE48452")
study3 <- getPrepGSE("GSE14323")

Data Prep: Obtain and reformat sex labels

Let's grab the labels and reformat them for easier use. exprsex allows for numeric sex labels (0="female" and 1="male"), or lowercase character labels (female/male).

# study 1 has sex labels with caps ("Female", "Male"), convert to lowercase
sex_lab1 <- study1$GSE55668_series_matrix.txt.gz$pheno$`Sex:ch1`
sex_lab1.2 <- unlist(sapply(sex_lab1, tolower))

# study 2 has sex labels formatted as "Sex: ", so we remove this
sex_lab2 <- study2$GSE48452_series_matrix.txt.gz$pheno$characteristics_ch1.6
sex_lab2.2 <- sapply(sex_lab2, function(x) gsub("Sex: ", "", x))

# study 3 has no metadata sex labels

Sex labeling a dataset with defaults

We can use default methods to sex label a dataset out of the box. We convert the expression matrix to ranks, and then can simply predict sex labels for the file.

# grab the gene expression matrix
expr1 <- study1$GSE55668_series_matrix.txt.gz$gene_mat

# add the "missing" rows to the expresion data that are in the gene list, 
# reorder the genes (this helps with comparison), and convert to ranks
expr1.r <- reorderRank(expr1) 

# predict sex labels
data(default_fit)
study1_preds <- predSexLab(default_fit, expr1.r, numeric_lab = FALSE)

We can compare these to the sex labels the dataset has and see that they have very good concordance!

confMat <- function(actual, pred){
  table(data.frame(cbind(actual, pred)))
} 

confMat(sex_lab1.2, study1_preds)

Understanding and visualizing the results

exprsex uses scores and a threshold score to classify. We can see these scores if we use another argument to predSexLab:

// TODO add equations

study1_pred_scores <- predSexLab(default_fit, expr1.r, numeric_lab = FALSE, scores=TRUE)

study1_res <- rbind(study1_pred_scores, "actual"=sex_lab1.2)
study1_df <- data.frame(t(study1_res))
study1_df$score_m <- sapply(study1_df$score_m, as.numeric)
study1_df$score_f <- sapply(study1_df$score_f, as.numeric)

# we can easily plot these scores
ggplot(data=study1_df, aes(x=score_f, y=score_m))+
  geom_point(aes(shape=actual, color=sex))

We can also visualize the results in other ways, looking at the actual values used for labeling (by setting the ret_expr flag to true).

study1_pred_scores2 <- predSexLab(fit_ex, expr1.r, numeric_lab = FALSE, scores=FALSE, ret_expr=TRUE)
# this returns a matrix of m and f genes 

# // TODO: move this to visualization methods
# we can put these together and visualize the results in a PC plot
comb <- rbind(study1_pred_scores2$f_mat, study1_pred_scores2$m_mat)
comb2 <- comb[apply(comb, 1, function(x) !any(is.na(x))),]
pcs <- prcomp(comb2)
pc_df <- data.frame(cbind(pcs$rotation[,c("PC1", "PC2")], study1_pred_scores2$sl))
pc_df$PC1 <- sapply(pc_df$PC1, as.numeric)
pc_df$PC2 <- sapply(pc_df$PC2, as.numeric)
colnames(pc_df)[3] <- "sex"

ggplot(pc_df, aes(x=PC1, y=PC2))+
  geom_point(aes(color=sex))

Training a sex labeling model

The default model from exprsex has high-performance across a variety of platforms; however, it is often useful to build more specific models for a particular platform or data type.

We have a couple options here: A) Use default male + female-specific gene sets B) Define our own gene sets

A: Train a model using the provided data and models

Fits include a set of female-specific genes, male-specific genes, and a cutoff threshold. The trainSexLab function uses the training data and labels to remove low-variance genes. Sex labels have to be numeric first.

# convert to numeric
sex_lab1.2num <- c(0,1)[as.factor(sex_lab1.2)] 
sex_lab2.2num <- c(0,1)[as.factor(sex_lab2.2)] 

names(sex_lab1.2num) <- colnames(exp_rank)
fit_ex <- trainSexLab(expr1.r, sex_lab1.2num )

# take a look at the fit
str(fit_ex)

Predict sex labels using this fit

Run sex labeling.

Compare to the sex labels from the trained fit and the actual sex labels.

# predict using the new fit (warning: this fit was trained on the data!)
preds2 <- predSexLab(fit_ex, expr2.r)
confMat(sex_lab2.2num,preds2)

preds1 <- predSexLab(fit_ex, expr1.r)
confMat(sex_lab1.2num, preds1)

B: Identify gene sets and re-train models

exprsex uses three sets of genes for sex labeling: a list of consensus genes across platform, and male + female-specific genes identified from a meta-analysis.

Identify consensus genes

We provide an existing list of consensus genes in list_genes but you can create your own based on a set of studies and use it to rank and reorder.

list.dats <- list(study1, study2, study3)
consensus.genes <- getConsensusGenes(list.dats)

# // TODO - compare

# --- now reorder using this --- #
expr1 <- study1$GSE55668_series_matrix.txt.gz$gene_mat
expr2 <- study2$GSE48452_series_matrix.txt.gz$gene_mat
expr3.1 <- study3$`GSE14323-GPL571_series_matrix.txt.gz`$gene_mat
expr3.2 <- study3$`GSE14323-GPL96_series_matrix.txt.gz`$gene_mat

expr1.r2 <- reorderRank(expr1, gene_list=consensus.genes)
expr2.r2 <- reorderRank(expr2, gene_list=consensus.genes)
expr3.1r2 <- reorderRank(expr3.1, gene_list=consensus.genes)
expr3.2r2 <- reorderRank(expr3.2, gene_list=consensus.genes)

Male + female-specific genes

For this case, we just choose to filter the genes by the consensus genes. It is possible you may have another set of genes

# // TODO - fill in


erflynn/exprsex documentation built on Feb. 23, 2020, 2:34 a.m.