knitr::opts_chunk$set(echo = TRUE, eval = FALSE)

Introduction

The FPML (Footprint Machine Learning) package is internal to the Price-Hood lab. It is intended to provide functions for quickly performing the following operations:

  1. Constructing motif datasets with associated ChIPseq data
  2. Merging these motif datasets with HINT and Wellington footprinting information for a given tissue and seed size
  3. Annotating these merged data with information about TF class, distance from a TSS, and GC content
  4. Using the annotated datasets to construct gradient boosted and linear models, plus assessing these models numerically and graphically.

These categories form the core of the FPML package and roughly partition nearly all of its functions. As such, the categories will be used to guide you through the different operations you'll want to run. After going through each category individually, we will put them all together in the final section to demonstrate exactly how to run the machine learning analysis for the lymphoblast dataset.

One more thing before we jump in: the prerequisites for running the package. I'm assuming you're comfortable looking at the package dependencies and installing those as needed, but even so, you'll still need the following PostgreSQL databases:

  1. The hg38 database
  2. The chipseq database, which includes the lymphoblast chipseq data
  3. The fimo database
  4. Both the HINT and Wellington databases for your tissue of interest, both seeds 16 and 20
  5. Both the HINT and Wellington databases for lymphoblast, both seeds 16 and 20
  6. ChIPseq data for any other tissue you want to use

If you're using just the lymphoblast data, this means 7 databases at a minimum; for any other tissue, it'll mean more like 12. For the purposes of this vignette, I'll assume you're just using the lymphoblast data.
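
If you want to sanity-check your setup before diving in, a quick connectivity test does the trick. Here's a minimal sketch, assuming the trena/trena credentials used later in this vignette and local databases on the default port; the footprint database names vary by tissue and seed, so I've only listed the three fixed ones:

library(DBI)
library(RPostgreSQL)

# Check that the core databases are reachable (credentials/host are assumptions)
for (db in c("hg38", "chipseq", "fimo")) {
  con <- dbConnect(PostgreSQL(), user = "trena", password = "trena",
                   dbname = db, host = "localhost")
  cat(db, ":", length(dbListTables(con)), "tables\n")
  dbDisconnect(con)
}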

A Quick Note on Saving

You'll see as you go along that you're creating a lot of really big datasets. These are a big pain in the butt to deal with: it would be dumb to create a dataset and then let it just disappear, but it's also really annoying to spend all that time saving and loading. My suggestion is to save data as you go along, but be aware that it'll take up a fair amount of space. For instance, for the lymphoblast data...

You can read and write data using load() and save() if you wish, but I tend to use a faster option: the fst package. Similar to an RDS file, an fst file holds just one data frame, but it reads and writes quite fast. Use it like this:

library(fst)
# Write the data with maximum compression (compress runs from 0 to 100)
write.fst(big.data, path = "./big.data.fst", compress = 100)

# Read back the data
big.data <- read.fst("./big.data.fst")

You don't have to compress it, but it does save space. I often find that the files in .fst format are larger than the .Rdata files if I don't compress them. But regardless of space, they should read/write much faster.
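
If you're curious about the actual cost, file.size() makes the comparison easy. A quick sketch, assuming you've saved the same data frame both ways (the .RData path here is hypothetical):

# Compare on-disk sizes (in bytes) of the two formats
save(big.data, file = "./big.data.RData")
file.size("./big.data.fst")
file.size("./big.data.RData")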

Constructing a Motif Dataset (constructMotifDataset.R)

This script contains 4 functions, all of which are aimed at constructing your initial dataset of motifs. If you're only interested in reproducing the lymphoblast dataset, I have included a function called constructLymphoblastDataset() that you can call directly. This function wraps the other functions into one place so you can just run one thing. Most basically, I've made it so you can run it without supplying any arguments:

# Run with all default parameters
all.TFs <- constructLymphoblastDataset()

If you run that command on an Amazon EC2 instance with enough power, it will probably take something like an hour or two, and it will give you ALL the data: several hundred million points. However, there are a few modifications you might want to make. First, it's worth noting that in putting together the dataset, I screen out all duplicate rows using dplyr::distinct(). Without that step, the dataset is more like 2 billion rows, most of it repeated information, so I strongly suggest leaving it be. But if you'd really like to have ALL the raw information, you can get it by setting the distinctFlag to FALSE:

# Run and return all rows, including duplicated rows
all.TFs <- constructLymphoblastDataset(distinctFlag = FALSE)

Now that'll actually go faster, because it skips the duplicate-screening step, but you'll end up with a much more unwieldy dataset.

If you find that, regardless of your distinctFlag value, your dataset is much larger than you'd like, you can take a random sample of it by manipulating the sampleSize parameter. By default, we just return all the data (either distinct or not, depending on your preference), but you can take a smaller sample as follows:

# Run it such that it returns 10 million random rows
all.TFs <- constructLymphoblastDataset(sampleSize = 1e7)

This command DOES use only distinct rows (that's the default), but after finding the distinct rows, it takes a random sample of 10 million of them, something like 2-3% of the full dataset. This is the command I would use if I were replicating exactly what we did for the paper.

There are a couple of other parameters you can specify if your FIMO database is not on your local machine, or if it's on a different port than the default (port = 5432). For instance, if my FIMO database were on Khaleesi on port 5433 (it's currently on the default port 5432), I'd do this:

# Grab data from Khaleesi port 5433
tf.data <- constructLymphoblastDataset(fimoHost = "khaleesi", fimoPort = "5433")

That'll grab the dataset from Khaleesi, so you don't actually have to have the database locally.

There's one more parameter you might actually want to manipulate, and that's the numWorkers parameter. As the name suggests, it determines how many processes to run for putting together the dataset. The function uses the BiocParallel package (for goodness sake, DO NOT run it on Windows, it'll take forever) and by default, it assigns 62 workers. Why 62? Because that's how many TFs we have to look for. If you're working with fewer cores or would just like to ratchet that down, go ahead and change the parameter:

# Run with half as many workers
all.TFs <- constructLymphoblastDataset(numWorkers = 31)

Here, I've told it to run with half as many workers. If you're running on Khaleesi, you'd want to do something like this as you don't have 62 cores to play with.

Other Construction Functions

For purposes of the lymphoblast dataset, the previously described function should be sufficient and you shouldn't really have to look under the hood much. However, if you're debugging or using another tissue, you'll definitely want to look at these too. Most importantly, there's the createTfDf() function, which grabs all motifs from FIMO for a given TF and adds ChIPseq data to them. In order to run it, you'll need 4 things:

  1. The name of the TF you're interested in
  2. A TF-to-motif mapping in list form (see mapTFsToMotifs() below)
  3. The ChIPseq regions for your tissue
  4. The ChIPseq hits for your tissue

Once you've got all of those things, you can run as follows:

# Grab data for one TF
tf.data <- createTfDf(TF, TFs.to.motifs, chipseq.regions, chipseq.hits)

Depending on how many motifs map to your TF, this could take tens of minutes or more like an hour or two. It's worth noting that this is the exact function used by the constructLymphoblastDataset() function; that wrapper merely creates all the proper arguments and then calls this function in parallel. Thus, if you want to know more about how to get the proper arguments, that function is a great place to refer to.
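
For the ChIPseq arguments specifically, here's a sketch of how you might pull them from the chipseq database. The "hits" table is used later in this vignette, but the "regions" table name is my assumption, so defer to the wrapper if it doesn't match:

# Pull the ChIPseq regions and hits (table names partly assumed)
db.chipseq <- DBI::dbConnect(RPostgreSQL::PostgreSQL(),
                             user = "trena", password = "trena",
                             dbname = "chipseq", host = "localhost")
chipseq.regions <- tibble::as_tibble(DBI::dbGetQuery(db.chipseq, "select * from regions"))
chipseq.hits <- tibble::as_tibble(DBI::dbGetQuery(db.chipseq, "select * from hits"))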

As with the lymphoblast code, you can specify more parameters if your FIMO database is not on your local machine, or if it's on a different port than the default. Like before, here's how I'd run things if my FIMO database were on Khaleesi and port 5433:

# Grab data for one TF from Khaleesi port 5433
tf.data <- createTfDf(TF, TFs.to.motifs, chipseq.regions, chipseq.hits,
                      fimoHost = "khaleesi", fimoPort = "5433")

So that's how you'd do things for 1 transcription factor, regardless of where your FIMO database is. The wrapper function for lymphoblast uses this code, and as I've already shown, you can change the parameters there as well.

On a similar subject, it's worth quickly going over how to create your TF-to-motif mapping. Rather than include another database dependency, I've just stored our universal TF-motif mapping (2017_08_23_Motif_TF_Map.RDS) in the extdata folder of this package. Then, to create a list of the proper format for fishing out motifs from FIMO, all you need to do is feed in your chipseq.hits data frame as follows:

# Create the TF-motif mapping in list form
TFs.to.motifs <- mapTFsToMotifs(chipseq.hits)

That will give you the list you need. The final function I've created takes a sample of the proper size. It's very simple, but I found it easier to remember one function call than a couple of commands in sequence. So, if you've made your dataset and you'd like to take a smaller sample of it, simply call it as follows:

# Take a sample of some size
sampled.data <- sampleTfDataset(tf.data, sampleSize = 1e7)

The tf.data should be your full dataframe of motifs and the sampleSize should be whatever integer you want to give it.
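
Under the hood, this is presumably nothing fancier than a seeded random sample; something like this hypothetical sketch (the internals are my guess, not the actual code):

# Hypothetical internals of sampleTfDataset(): a seeded random sample
sampleTfDataset.sketch <- function(tf.data, sampleSize) {
  set.seed(1234)                        # fix the seed so samples are reproducible
  dplyr::sample_n(tf.data, sampleSize)  # draw sampleSize rows at random
}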

To wrap up this section, here are a couple of quick bullet points to sum up:

  1. constructLymphoblastDataset() builds the whole lymphoblast motif dataset; tune distinctFlag, sampleSize, fimoHost/fimoPort, and numWorkers as needed
  2. createTfDf() does the same job for a single TF; it's the workhorse the wrapper calls in parallel
  3. mapTFsToMotifs() turns your chipseq.hits into the TF-to-motif list that createTfDf() needs
  4. sampleTfDataset() takes a random sample of a finished dataset

Merging with Footprint Data (mergeFootprintData.R)

This script contains just 2 functions: one that merges data for a single chromosome, and one that applies it across all chromosomes for lymphoblast. I'll begin by describing the lymphoblast case (aka what you'll probably want to start with), then I'll cover the simpler function to prepare you for cases where you'll have to go off the beaten path.

First, let's talk about the lymphoblast data. We'll assume here that your constructed dataset from the previous step was called fimo.df just for consistency with the script. Assuming that we're running on seed 20 and the footprint databases are all local, I can simply call it as follows:

# Merge in lymphoblast footprints for seed 20
merged.df <- mergeLymphoblastFootprints(fimo.df, 20)

This is a pretty long step, basically the longest of the whole process; figure on this taking maybe a day. And PLEASE run this on something powerful, lest you be at this for a week.

It's worth noting that you can supply the seed number as an integer OR as a string, since it's just pasted into a string inside the script. But you can't do both seeds at once; choose either "16" or "20", and if you want both, run them separately, as shown below. You can join up the datasets later, but you can't search for them at the same time, as they use different databases.
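
So a both-seeds workflow looks like two separate calls; note the seed given as a string in one and an integer in the other, per the above:

# Run the two seeds separately; join the resulting datasets later
merged.16 <- mergeLymphoblastFootprints(fimo.df, "16")
merged.20 <- mergeLymphoblastFootprints(fimo.df, 20)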

Now there is one more option you should know about, which is the host option. By default it's set to localhost, assuming that the databases are local to where you're running the function. But if not, you could specify their location:

# Merge in footprints for seed 16, using databases on Khaleesi
merged.df <- mergeLymphoblastFootprints(fimo.df, 16, host = "khaleesi")

As before, this also assumes we only look at the default port (port = 5432) for databases; if this becomes a problem, we can always change the port option as well:

# Merge in footprints for seed 20, using databases that are local but on port 5433
merged.df <- mergeLymphoblastFootprints(fimo.df, 20, port = "5433")

In the case of the footprint databases, this port option becomes pretty important; there's a whole collection of databases on the other port (5433), so you may have to change this parameter around when you mess with other tissues.

Merging for One Chromosome

As before, the lymphoblast-specific function simply uses a more basic function and does so in a parallel fashion. In this case, the basic function takes a chromosome, the fimo dataframe, and the relevant database tables for HINT/Wellington and for regions/hits. It will go through all of the relevant footprint tables and add footprint data from them to the fimo data frame we fed in.

In the lymphoblast function, this is done through a parallel process that runs 24 workers (1:22, X, Y) at once. But if we're running on just one chromosome, it'll be something like this:

# Run on chromosome Y
merged.Y <- mergeFootprintsOneChrom("Y", fimo.df, 
                                    hint_regions_tbl,
                                    hint_hits_tbl,
                                    well_regions_tbl, 
                                    well_hits_tbl)

Here, I'm assuming you've got connections to the HINT/Wellington tables that are open and ready to go. If you're not sure how to create those connections, mergeLymphoblastFootprints() gives a great example; it essentially amounts to 1) creating the proper database connections, and 2) pointing to the correct tables (see the sketch below). Because you point to the databases BEFORE running this function, you don't have to worry about supplying it with a host or a port.
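
For reference, here's roughly what creating those connections might look like. The database and table names here are assumptions on my part, so treat this as a sketch and defer to mergeLymphoblastFootprints() for the real thing:

# Connect to the seed-20 footprint databases (names are placeholders)
db.hint <- DBI::dbConnect(RPostgreSQL::PostgreSQL(), user = "trena", password = "trena",
                          dbname = "lymphoblast_hint_20", host = "localhost")
db.well <- DBI::dbConnect(RPostgreSQL::PostgreSQL(), user = "trena", password = "trena",
                          dbname = "lymphoblast_wellington_20", host = "localhost")

# Point to the regions and hits tables in each database
hint_regions_tbl <- dplyr::tbl(db.hint, "regions")
hint_hits_tbl <- dplyr::tbl(db.hint, "hits")
well_regions_tbl <- dplyr::tbl(db.well, "regions")
well_hits_tbl <- dplyr::tbl(db.well, "hits")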

So, to once again sum up the section:

  1. mergeLymphoblastFootprints() merges footprints across all 24 chromosomes in parallel; expect it to take on the order of a day
  2. Run seeds 16 and 20 separately and join the results later
  3. Use the host and port options if the footprint databases aren't local or sit on port 5433
  4. mergeFootprintsOneChrom() handles a single chromosome, given open connections to the HINT/Wellington tables

Annotating the Dataset (annotateDataset.R)

The third step is the same regardless of dataset, so there's no need to talk about a separate lymphoblast function. Here's the part where we add TF class, TSS distance, and GC content to the data. The only special thing you'll need here is the ChIPseq data you started with, as it is used for adding TF class data to the correct TFs. You'll also need access to the hg38 database, but we'll assume you have that too.

As before, we'll start with the highest level script, which should annotate any tissue dataset you give it, as long as you also supply the correct ChIPseq hits. We'll assume that your dataset is now merged and called merged.df:

# Add TF class, GC content, and asinh(TSS distance) to dataset
annotated.df <- annotateFootprintData(merged.df, chipseq.hits)

That command will take a little while, but not nearly as long as merging the footprint data did. If the merging step took a day, this step should take more like an hour in my experience. And that's assuming your hg38 database is local; if it's not, you can specify the host and port for it:

# Add all the annotation info from port 5433 on Khaleesi
annotated.df <- annotateFootprintData(merged.df, chipseq.hits, host = "khaleesi", port = "5433")

Now, this function clearly wraps a few different steps into one. We'll go over each step individually.

Individual Annotation Steps

First, we code up the correct TF class data and add it to the data frame. To create this data, which is in one-hot format, we use the createMotifClassMap() function, and we need the ChIPseq data. This function requires a TF-motif map in the form of a list; we saw how to create that list back in the constructMotifDataset.R portion, so we'll do those 2 steps back to back here:

# Create the one-hot format of TF class data
TFs.to.motifs <- mapTFsToMotifs(chipseq.hits)
motif_class_hot <- createMotifClassMap(merged.df, TFs.to.motifs)

This one-hot data frame is separate from the merged dataframe, so you'll have to actually join the two to put them together. If you're lost on how to do that, the annotateFootprintData() function has an example.
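
As a minimal sketch of that join, assuming both frames share a motif-name column (the real join key is whatever annotateFootprintData() uses, so this column name is hypothetical):

# Join the one-hot TF class columns onto the merged data (join key assumed)
class.added <- dplyr::left_join(merged.df, motif_class_hot, by = "motifname")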

The second piece of annotation data we add is the GC content; the function I've written to do this is essentially the one Rory wrote, though I think I may have slimmed it down a bit. Like the previous function, it doesn't add anything to the dataset itself. Rather, it takes a "start", an "end", a "chromosome", and a shoulder size, then returns the GC content for that section of DNA.

In order to use this function (getGCContent()) correctly, I recommend adding it directly to the dataframe via a dplyr::mutate() call. This would look something like:

# Add the GC content column to the merged dataframe
gc.added <- merged.df %>% dplyr::mutate("gc_content" = getGCContent(start, endpos, chrom))

You'll notice that I didn't supply a shoulder size; here, the default is set to 100. If you wanted to increase it, say to 500, you'd just add it as the following:

# Add GC content for 500 bp shoulder size
gc.added <- merged.df %>% dplyr::mutate("gc_content" = getGCContent(start, endpos, chrom, shoulder = 500))

So now we've added TF class and GC content; the final piece is to add TSS distance. Unlike the previous 2 functions, the function I've written here will add the data directly to the data frame. It requires the hg38 database, so you'll need to point to it. By default, it's assumed to be on the local machine on port 5432, so the following should do the trick most of the time:

# Add TSS distance to the dataframe
tss.added <- addTSSDistance(merged.df)

Of course, if the database were elsewhere, say on port 5433 of Khaleesi, then you'd do the following:

# Add TSS distance from port 5433 on Khaleesi
tss.added <- addTSSDistance(merged.df, host = "khaleesi", port = "5433")

And that's about it for annotation. So, to summarize:

  1. annotateFootprintData() adds TF class, GC content, and asinh(TSS distance) in one shot, given your merged data and chipseq.hits
  2. Under the hood, createMotifClassMap() builds the one-hot TF class data, getGCContent() computes GC content (add it via dplyr::mutate()), and addTSSDistance() adds TSS distance directly
  3. Only the TSS step (and the wrapper) needs the hg38 database; use host/port if it isn't local

Building Models (buildModels.R)

Here's the fun (and most complex) section of the pipeline. Up to this point, you've basically been able to just run one command per step to recreate the lymphoblast dataset. Now, however, there are going to be multiple steps no matter what you do. There are actually 9 different functions, and I'm going to talk about them in 3 groups:

  1. Data preparation (2 functions)
  2. Model building (2 functions)
  3. Model evaluation (5 functions)

As you'd likely expect, you have to go through these steps in order, else they won't work too well. Let's jump right in and start with data preparation.

Data Preparation

Coming into this step, you should have a fully-annotated data frame; that is, it has all the following information:

  1. The motif data with associated ChIPseq hit information
  2. HINT and Wellington footprint scores and fractions for your seed(s)
  3. One-hot TF class data
  4. GC content
  5. asinh(TSS distance)

If you don't have all that information in your dataframe, you should identify the step you missed in the earlier sections of the vignette. But if you just ran the 3 big commands I recommended, you should have exactly this.

Now if you're thinking big, your first question should be something like "What if I want to use information from both seeds?" Obviously, this necessitates having 2 data frames rather than 1, and each data frame needs to be complete in its own right. But assuming this is indeed the case, we can jump right into the first preparation function: joinModelData().

As its name implies, this function will take in the dataframes for the 2 different seeds and join them together such that you have double columns for all footprint scores and fractions. Running it is very simple:

joined.df <- joinModelData(full.16, full.20)

That's it; note that the seed 16 data comes first, followed by the seed 20 data. You could theoretically switch their positions, but that would just end up mis-labeling the data, so don't do that. The command should take a few minutes (more than a few if your datasets are large), but once it's done you'll have a joined dataframe with data from both seeds.
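
As a quick sanity check on the result, you can eyeball the doubled columns; the exact naming convention is an assumption here, so adjust the pattern to what names(joined.df) actually shows:

# Footprint score/fraction columns should now appear once per seed
grep("16|20", names(joined.df), value = TRUE)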

Regardless of whether or not you join your data, you'll next need to feed it into the prepModelData() function. This function takes in the data frame and breaks it into training, testing, and validation data. These data are then returned as a 6-member list of matrices, covering the 3 data groups and the 2 data types ("X" data for all the predictors and "y" data for the response).

Confused? Don't be; it's pretty easy to run and get the results. All you really need are your data frame (annotated.df here) and a seed ID ("16", "20", or "both"), and the default options will carry you through:

# Prep model data for both seeds
prepped.data <- prepModelData(annotated.df, "both")

The thing returned to you should be a named list of length 6. Specifically, the members are the "X" (predictor) and "y" (response) matrices for each of the training, testing, and validation sets; they're referred to by names like X_train, X_test, and y_test in the examples below.

Right now we don't really do anything with the validation data, though that is subject to change. More importantly, you don't actually have to do anything with the members of the list themselves if you want to make a model. This exact list is all you'll have to put into the modeling functions to make them work. So don't sweat it!

Biasing the Training Data

A new-ish addition to prepModelData() is the option to bias your training data. You wouldn't really want to bias any other set, as that would be an unfair way to test your model, but as Rory put it, "You can do whatever you want with your training set, so long as the model works".

With that in mind, we suspect that we could possibly get a benefit by biasing the training set to have a larger ratio of ChIPseq hits to non-hits (positives to negatives). Or, from the other angle, we might benefit from having a smaller negative-to-positive ratio. In the ordinary dataset, the ratio of negatives to positives is something on the order of 50-100. If you want to change this ratio by cutting out some negatives, you'll want to manipulate the biasTraining and biasRatio parameters.

For instance, if I want to make the ratio of negatives to positives equal to 10, I'd do the following:

# Bias the data to a 10-1 neg-pos ratio
prepped.data <- prepModelData(annotated.df, "both", 
                              biasTraining = TRUE, biasRatio = 10)

This option utilizes a somewhat buried function that we won't really discuss; all that function does is take a motif, a dataset, and a desired ratio, then return the biased dataset for that motif (a sketch of the idea follows below). We run it for all motifs inside prepModelData() to bias the full training set, but only if the biasTraining flag is set to TRUE (it's FALSE by default).
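
To make the idea concrete, here's my own reconstruction of what that buried function amounts to; the cs_hit column name is an assumption, and the real code may differ:

# Hypothetical sketch: downsample negatives for one motif to the desired ratio
biasMotifData.sketch <- function(motif.df, biasRatio) {
  positives <- dplyr::filter(motif.df, cs_hit == 1)
  negatives <- dplyr::filter(motif.df, cs_hit == 0)
  n.keep <- min(nrow(negatives), biasRatio * nrow(positives))
  dplyr::bind_rows(positives, dplyr::sample_n(negatives, n.keep))
}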

Footprint Cutoffs

Now, if you're paying attention to model prep, you may have noticed that we haven't done any filtering based on footprint score. I haven't forgotten about it; it's just handled by the defaults (two of them, actually). There are two cutoffs:

  1. hintCutoff : exclusive cutoff for HINT (default = -Inf)
  2. wellCutoff : exclusive cutoff for Wellington (default = Inf)

I've set the defaults to either -Inf or Inf to prevent any filtering from happening, at least by default. All HINT scores should be positive and all Wellington scores should be negative, so this works well enough. If you were to change both defaults to 0, you'd be screening out all motifs that don't have a non-zero footprinting score. So you'd do something like this:

# Prep data, but screen out rows w/o non-zero footprint scores
prepped.data <- prepModelData(annotated.df, "both", hintCutoff = 0, wellCutoff = 0)

You can play around with cutoffs and I have; just remember that they're exclusive, so your cutoff won't actually appear in the dataset. And also remember that the relationship between the cutoffs is an OR relationship; that is to say that if you've got a HINT score of 0 but a Wellington score of -1, you'd still appear in the dataset I just created. Hopefully that all makes sense.
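
In pseudo-dplyr terms, the filter amounts to the following; the column names are my assumption:

# Keep a row if EITHER method beats its (exclusive) cutoff; HINT scores are
# positive and Wellington scores are negative, hence the opposite directions
filtered.df <- dplyr::filter(annotated.df,
                             hint_score > hintCutoff | wellington_score < wellCutoff)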

Motif-Only Data

Another thing I included was the ability to exclude the footprint columns, such that we're doing a "motifs-only" sort of analysis. This serves as something of a baseline against which we can measure how much we're adding with the footprints. I included it here, not earlier in the package, because having it here allows you to apply cutoffs and directly compare to the same dataset WITH footprints.

If you want to run without the footprint-related columns, then just switch the related flag to TRUE (it's FALSE by default):

# Prep all the data, but do it only for motifs, without footprints
prepped.data <- prepModelData(annotated.df, "both", motifsOnly = TRUE)

You'll notice that I did all of these steps for the "both" seed option; I think you'll be fine changing that to "16" or "20" without my assistance.

So let's review data preparation before we move to actual model building:

  1. If you're using both seeds, join the two complete data frames with joinModelData() (seed 16 first)
  2. Feed the result into prepModelData() with a seed ID of "16", "20", or "both" to get the 6-member list of train/test/validation matrices
  3. Optionally bias the training set with biasTraining/biasRatio, filter with hintCutoff/wellCutoff, or drop the footprint columns with motifsOnly

Model Building

At this point, you need to have gone through the data preparation steps; I can't stress that enough, because these next steps won't work without the correctly-prepped data. But if we assume that you've done that, then we can move along to actually building the models.

As you're likely aware, there are 2 basic model types we construct: gradient boosted and generalized linear (logistic) regression. You can run just one or you can run both; in each case, you'll want to reserve some time, because this is the second most lengthy step (the most lengthy being the merging). Each type of modeling is run pretty much the same way and returns basically the same information. Essentially:

  1. Feed in the prepped data list from the previous step
  2. Get back a named 2-member list: the model itself and a stats data frame

That's it; it's very straightforward to run, though I'll admit that there's a fair amount going on under the hood. However, you shouldn't really have to worry about that at all; it's mostly just putting things in the correct format and labeling accordingly.

Let's run through each modeling approach in order, starting with the gradient boosted model. As stated above, it's simply run by supplying the prepped data list, in this case by using the buildBoostedModel() function:

boosted.output <- buildBoostedModel(prepped.data)

That's literally it; you'll get a 2-member list back, where each member is named to make sure you know exactly what it is. One (BoostedModel) is the boosted model; you might want that later for prediction or for one of the model evaluation functions. The other (BoostedStats) is a stats data frame created using the model. It contains values that attempt to pinpoint the model's efficacy: things like sensitivity and specificity at different threshold values. While you can study this data frame directly, it's generally best used by feeding it directly into one of the model evaluation functions.

The linear models are run similarly; I say "models" and not "model" because it'll construct one model using all predictors, plus one model for each individual predictor (excluding TF classes). Run it just like you run the boosted model:

linear.output <- buildLinearModels(prepped.data)

Again, we just get a named list of 2 things in return, a model (LinearModel) and a stats data frame (LinearStats). In this case, we only return the full linear model including all predictors; we don't return the other models. The stats data frame, however, includes stats for ALL the models constructed.

That's all for model building; the more interesting part comes when we attempt to evaluate the models we've built.

Model Evaluation

The vast majority of model evaluation takes place graphically, which is the easiest way to understand the fairly large array of numbers we get back in our statistics data frames. Thus, most of the functions described below will be graphical in nature and have examples you can find in the extdata section of the package (they're all .png files). However, let's start with the one non-graphical method.

If we want to look at one number to understand model performance for predicting ChIPseq hits, I'd argue the best number to look at would be the Matthews correlation coefficient (MCC). The maximum MCC value for a model in the statistics data frame tells you the very best performance you get for that model, thus I've written a function called extractMaxMCC() that does this for all the models.
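
For reference, the MCC comes straight from the four confusion matrix counts; here it is as a small R helper using the standard definition:

# Matthews correlation coefficient from confusion matrix counts
mcc <- function(TP, TN, FP, FN) {
  (TP * TN - FP * FN) /
    sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
}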

Simply provide the resulting stats dataframes (boosted and linear) to the function like this:

# Get the maximum MCC for each model
max.MCCs <- extractMaxMCC(boosted.output$BoostedStats, linear.output$LinearStats)

The output will be a table (a named vector) where every model in the statistics data frames is represented by its max MCC. In this way, you'll be able to quickly see which model has the highest max MCC and if you want to pull one out, all you need to do is call it by name. For instance, if I want the max MCC for the gradient boosted model, I'd do:

# Extract the max MCC for the boosted model
max.boosted <- max.MCCs["gradient boosted model"]

The different labels you can call correspond to the unique model names contained in the stats data frames. Now then, let's get on to graphical evaluation methods.

Graphical Evaluation

By now, you're probably pretty familiar with the 3 plots we use to assess performance of different models. There's a bunch of annoying data munging and whatnot that goes into making those plots, but you shouldn't have to worry about that. Rather, you'll just need to supply your statistics data frames, your dataType ("seed16", "seed20", "both", or "motifsOnly"), and a 3-member vector specifying where to save your plot files. These will be .png files, so name them accordingly. For instance, here's how I'd save the 3 plots to whatever local directory I'm in:

# Create the 3 statistical plots locally
plotStatCurves(boosted.output$BoostedStats, linear.output$LinearStats, "both",
               filePaths = c("./MCC_curve.png",
                             "./ROC_curve.png",
                             "./PrecRec_curve.png"))

Using the supplied dataType, the function will arrange all the curves in the correct order and take care of colors. Those 3 curves should form the basis of how you compare different models to one another. Now you might decide that you'd like to hone in on a particular model and examine it. At the current writing, there are 3 functions for doing so; one is specific to the boosted model and the other two are for any model.

First, let's talk about constructing the importance matrix, which is just for a boosted model. The importance matrix gives you some insight into the boosted model you supply, giving you information on the "gain" for each feature. Roughly speaking, the gain is a weighted measure of how much that feature pops up in the ensemble of trees used by the boosted method, so the top feature could be argued to be the most important. While I would caution against reading too far into this, it's a useful way of better understanding the boosted model. But be careful; boosted models are by definition non-linear, so these aren't linear coefficients you can directly compare to one another.

That said, you can construct your importance matrix and save it to a .png file using the plotImportanceMatrix() function:

# Plot the importance matrix locally
plotImportanceMatrix(boosted.output$BoostedModel, 
                     prepped.data$X_train, 
                     filePath = "./Importance_matrix.png")

As shown, you'll need the model itself plus the predictor training data. I just pulled it from the prepped model data list here, so you should be able to do the same.

Now let's talk about a more general evaluation function that can be applied to either linear or boosted models. The idea behind this approach is that once we've selected a model and a threshold that constitutes a predicted "positive", we want to know what that looks like on our test data. Rather than just looking at one MCC value (e.g. MCC = 0.45), we'd often like to see the 4 confusion matrix outcomes on the data; that is, TP/TN/FP/FN for each point. For this purpose, I created a function called createTruthPlot() that will do that for you. It takes a few parameters, but you should basically have all of them already.

Simply choose the threshold that gives you your desired MCC (or fulfills some other criterion) and run it for your model of choice. Here, I'll do it for the full linear model, just to change things up, and I'll use a threshold of 0.125, which looks good on the MCC plot (extdata/testMCC.png):

# Create the "Truth" plot locally and return the data frame
truth.df <- createTruthPlot(linear.output$LinearModel,
                            modelType = "linear", 
                            seed = "both", 
                            threshold = 0.125,
                            X_test = prepped.data$X_test,
                            y_test = prepped.data$y_test,
                            plotFile = "./truthPlot.png")

In addition to our lovely plot, we also get a data frame in return. This data frame contains all the information used to create the plot; most importantly, a Label column giving each test point's confusion matrix outcome (TP/TN/FP/FN), alongside the predictions and actual ChIPseq results used to derive it.

The data frame can be used to immediately evaluate the distribution of each confusion matrix outcome using the table() command:

table(truth.df$Label)

You'll generally see the group being dominated by TN values, as most points are not ChIPseq hits and any decent model will generally identify them correctly.

The final thing we'll address is this mystical "threshold" we've been talking about. If you're looking for what threshold to use, you could go straight to the statistical curves we plotted earlier and pick one from there. But picking from, say, the MCC curve bases your choice on one number. You might want to look at a handful of numbers instead, in particular:

  1. Recall (sensitivity)
  2. Specificity
  3. Precision (PPV)
  4. NPV (negative predictive value)
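
These are the standard confusion matrix rates; written out as R helpers:

# Standard definitions of the four threshold metrics
recall <- function(TP, FN) TP / (TP + FN)        # a.k.a. sensitivity
specificity <- function(TN, FP) TN / (TN + FP)
precision <- function(TP, FP) TP / (TP + FP)     # a.k.a. PPV
npv <- function(TN, FN) TN / (TN + FN)           # negative predictive value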

The basic idea is that if we set our threshold higher, we call fewer points positive: Specificity and Precision tend to go up while Recall and NPV go down, and lowering the threshold does the opposite. Picking a threshold means picking where along that trade-off you want to sit.

That's a bunch of words, and possibly more gobbledygook, but if you want to take a look at a plot that helps visualize it, run the createThresholdPlot() function:

# Create a threshold plot for the boosted model
createThresholdPlot(boosted.output$BoostedStats,
                    modelName = "gradient boosted model",
                    plotFile = "./thresholdPlot.png")

So we're feeding it our stats data frame, which contains all these values; we're also giving it the model name to look for, which corresponds to our boosted model. If you're dealing with the linear data frame, you should take a look at the possible model names (e.g. unique(linear.output$LinearStats$Model.Name)) to determine what to use here. For instance, if I wanted to plot for the full linear model, which is named "linear model (all regressors)" in the stats data frame, I'd do:

# Create a threshold plot for the full linear model
createThresholdPlot(linear.output$LinearStats,
                    modelName = "linear model (all regressors)",
                    plotFile = "./thresholdPlot.png")

As before, we also have to supply a destination for the plot, which I save to the current directory in the above calls. Once you've made the plot, you can look at the 4 values and decide where your best threshold is located. It's worth noting that the 4 values are not all independent; only 3 values are actually independent in the plot and can be used to find the 4th one. But I think it's useful to see all 4 of them anyway. In general, you're trying to find a compromise between Recall/NPV and Specificity/Precision.

Let's finish this section by summing up EVERYTHING we've done here:

  1. prepModelData() (optionally after joinModelData()) turns your annotated data into the 6-member train/test/validation list
  2. buildBoostedModel() and buildLinearModels() each return a model plus a stats data frame
  3. extractMaxMCC() gives you a one-number summary per model; plotStatCurves() draws the MCC, ROC, and precision-recall curves
  4. plotImportanceMatrix() examines the boosted model's features, createThresholdPlot() helps you choose a threshold, and createTruthPlot() shows the confusion matrix outcomes on the test data

Finale: Putting it all together

Up to this point, I've been talking about individual functions. Now I want to give you an example of exactly how I'd run all the steps together to reproduce what we've done for the footprinting paper. Specifically, here are the steps I'll take:

  1. Construct the dataset using 10 million distinct motifs
  2. Merge with both seed 16 and seed 20 footprint data to create 2 data sets that I can merge later
  3. Run each data set through annotation to add TF class, GC content, and TSS distance
  4. Join together the data and prep them for model building
  5. Build both linear and boosted models for the data
  6. Construct statistical curves to evaluate model performance in comparison with one another
  7. Using the boosted model, plot the threshold versus stats metrics to show what threshold to choose
  8. Examine the boosted model to look at its most important features
  9. Using the boosted model, create a truth plot and pull out the confusion matrix values

I'll reference these steps as I go along, but from here it's just going to be one big block of code:

## Important Note: I'm assuming all the databases are local and on port 5432, the default port ##

# 1. Construct the dataset using 10 million distinct motifs
all.TFs <- constructLymphoblastDataset(sampleSize = 1e7)

# 2. Merge with both seed 16 and seed 20 footprint data to create 2 data sets that I can merge later
seed.16 <- mergeLymphoblastFootprints(all.TFs, seedNum = 16) # This will take quite a while
seed.20 <- mergeLymphoblastFootprints(all.TFs, seedNum = 20) # This will take quite a while

####DETOUR####
# Important step: Grab the ChIPseq hits for lymphoblast locally

# Establish the connection to the database
db.chipseq <- DBI::dbConnect(drv=RPostgreSQL::PostgreSQL(),
                             user = "trena",
                             password = "trena",
                             dbname = "chipseq",
                             host= "localhost")                           

# Run the query and then turn it into a tibble (enhanced dataframe)
chipseq.hits <- DBI::dbGetQuery(db.chipseq, "select * from hits")
chipseq.hits <- tibble::as_tibble(chipseq.hits)

####END DETOUR####

# 3. Run each data set through annotation to add TF class, GC content, and TSS distance
annotated.16 <- annotateFootprintData(seed.16, chipseq.hits)
annotated.20 <- annotateFootprintData(seed.20, chipseq.hits)

# 4. Join together the data and prep them for model building (without bias, without filtering)
joined.df <- joinModelData(annotated.16, annotated.20)
prepped.data <- prepModelData(joined.df, seed = "both")

# 5. Build both linear and boosted models for the data
boosted.output <- buildBoostedModel(prepped.data)
linear.output <- buildLinearModels(prepped.data)

# 6. Construct statistical curves to evaluate model performance in comparison with one another (locally)
plotStatCurves(boosted.output$BoostedStats, linear.output$LinearStats, "both",
               filePaths = c("./MCC_curve.png",
                             "./ROC_curve.png",
                             "./PrecRec_curve.png"))

# 7. Using the boosted model, plot the threshold versus stats metrics to show what threshold to choose
createThresholdPlot(boosted.output$BoostedStats,
                    modelName = "gradient boosted model",
                    plotFile = "./thresholdPlot.png")

# 8. Examine the boosted model to look at its most important features (locally)
plotImportanceMatrix(boosted.output$BoostedModel, 
                     prepped.data$X_train, 
                     filePath = "./Importance_matrix.png")

# 9. Using the boosted model, create a truth plot (locally) and pull out the confusion matrix values
truth.df <- createTruthPlot(boosted.output$BoostedModel,
                            modelType = "boosted", 
                            seed = "both", 
                            threshold = 0.3,
                            X_test = prepped.data$X_test,
                            y_test = prepped.data$y_test,
                            plotFile = "./truthPlot.png")
table(truth.df$Label)

And that's it! The above workflow DID introduce another concept: how to get the ChIPseq hits directly from their database. But I'm confident you can handle that. Feel free to copy this entire workflow and change around parameters as you see fit. And let me know if you run into any questions!

Matt Richards


