knitr::opts_chunk$set(collapse = TRUE, cache = FALSE, comment = "#>")
This vignette is the third chapter in the "Pathway Significance Testing with `pathwayPCA`" workflow, providing a detailed look at the Creating Data Objects section of the Quickstart Guide. It builds on the material covered in the "Import and Tidy Data" vignette and outlines the major steps needed to create a data object for analysis with the `pathwayPCA` package. These objects are called `Omics`-class objects.
Before we move on, we will outline our steps. After reading this vignette, you should be able to

1. describe the `Omics` object class, and
2. create `Omics` objects.

First, load the `pathwayPCA` package and the `tidyverse` package suite.
library(tidyverse)
library(pathwayPCA)
Because this is the third chapter in the workflow, we assume that your data are already in the form described in the previous chapter: a tidy assay data frame, phenotype information matched to that assay, and a pathway collection stored in a `pathwayCollection` object.

If you are unsure about any of the three points above (or you don't know what these mean), please review the Import and Tidy Data vignette first. It isn't very long, but it will help you set up your data in the right way. If your data are not in the proper form, the steps in this vignette may be very difficult.
For the purpose of example, we will load some "toy" data: a combined assay / phenotype data frame and a `pathwayCollection` list. These objects already fit the three criteria above. This tidy data set has 656 gene expression measurements (columns) on 250 colon cancer patients (rows).
data("colonSurv_df")
colonSurv_df
Notice that the assay and survival response information have already been merged, so we have two additional columns (for Overall Survival Time and its corresponding death indicator). We also have a small collection of 15 pathways which correspond to our example colon cancer assay.
data("colon_pathwayCollection")
colon_pathwayCollection
str(colon_pathwayCollection$pathways, list.len = 10)
The pathway collection and tidy assay (with matched phenotype information) are all the information we need to create an `Omics`-class data object.

`Omics`-Class Objects Defined

Now that we have our data loaded, we can create an analysis object for the `pathwayPCA` package.
In this package, all primary input data will be in an `Omics` data object. There are three response-specific classes of `Omics*` objects, plus a parent class, but one function (`CreateOmics`) creates all of them. Each class contains a tidy assay and a `pathwayCollection` list. The classes differ in the type of response information they can hold. The classes, and their responses, are

- `OmicsSurv`: a data object for survival information, which includes event time (the time of last follow-up with a subject) and event indicator (did the subject die, or was the observation right-censored).
- `OmicsReg`: a data object for continuous responses (usually a linear regression response).
- `OmicsCateg`: a data object for categorical responses, the dependent variable of a generalized linear model. Currently, we only support binary classification (through logistic regression).
- `OmicsPathway`: a data object with no response. This is the "parent" class for the other three `Omics` classes.
Take a quick look back at the structure of our `colonSurv_df` object. We have a table data frame with the subject response information in the leading columns and the expression design matrix in the rest. Look at the types of the columns of this data frame (the `<dbl>` and `<int>` tags directly under the column names): these tags tell us that the columns contain "double / numeric" (`dbl`) and "integer" (`int`) information. The other tags we could potentially see here are `<chr>` (character), `<lgl>` (logical), or `<fct>` (factor). These tags are important because they identify which "class" of data is in each column.
Here are some examples of how to change data between types. We inspect the first 10 entries of each object.
# Original integer column
head(colonSurv_df$OS_event, 10)

# Integer to Character
head(as.character(colonSurv_df$OS_event), 10)

# Integer to Logical
head(as.logical(colonSurv_df$OS_event), 10)

# Integer to Factor
head(as.factor(colonSurv_df$OS_event), 10)
The `CreateOmics` function puts the response information into specific classes:

- Survival: `numeric` (time) and `logical` (death indicator) vectors.
- Regression: a `numeric` or `integer` vector.
- Categorical: a `factor` vector.

These restrictions are on purpose: the internal data creation functions in the `pathwayPCA` package have very specific requirements about the types of data they take as inputs. This ensures the integrity of your data analysis.
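If your own response columns are stored in other types, one option is to check and coerce them yourself before building the object. The following is a minimal base `R` sketch under stated assumptions: the `pheno_df` data frame and its column names are hypothetical and are not part of the package.

```r
# Hypothetical phenotype data; names and values are made up for illustration
pheno_df <- data.frame(
  time   = c("12.5", "30.1", "7.8"),  # times accidentally stored as character
  status = c(1L, 0L, 1L),             # 1 = event observed, 0 = censored
  group  = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

# Survival response: numeric time and logical event indicator
pheno_df$time   <- as.numeric(pheno_df$time)
pheno_df$status <- as.logical(pheno_df$status)

# Categorical response: factor
pheno_df$group <- as.factor(pheno_df$group)

# Confirm the class of each column
vapply(pheno_df, class, character(1))
```

Checking the classes explicitly like this makes it easier to spot a mistyped column before it reaches the object creation step.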
Creating `Omics` Objects

All new `Omics` objects are created with the `CreateOmics` function. You should use this function to create `Omics`-class objects for survival, regression, or categorical responses. Internally, `CreateOmics` calls a specific creation function for each response type:
- The `CreateOmicsSurv()` function creates an `Omics` object with class `OmicsSurv`. This object will contain:
    - `eventTime`: a `numeric` vector of event times.
    - `eventObserved`: a `logical` vector of death (or other event) indicators. This format precludes the option of recurrent-event survival analysis.
    - `assayData_df`: a tidy `data.frame` or `tibble` of assay data. Rows are observations or subjects; the columns are -Omics measures (e.g. transcriptome). The column names must match a subset of the genes provided in the pathways list (in the `pathwayCollection` object).
    - `pathwayCollection`: a `list` of pathway information, as returned by the `read_gmt` function (see the Import and Tidy Data vignette for more details). The names of the genes in these pathways must match a subset of the genes recorded in the assay data frame (in the `assayData_df` object).
- The `CreateOmicsReg()` function creates an `Omics` object with class `OmicsReg`. This object will contain:
    - `response`: a `numeric` vector of the response.
    - `assayData_df`: a tidy `data.frame` or `tibble` of assay data, as described above.
    - `pathwayCollection`: a `list` of pathway information, as described above.
- The `CreateOmicsCateg()` function creates an `Omics` object with class `OmicsCateg`. In future versions, this function will be able to take in $n$-ary responses and ordered categorical responses, but we only support binary responses for now. This object will contain:
    - `response`: a `factor` vector of the response.
    - `assayData_df`: a tidy `data.frame` or `tibble` of assay data, as described above.
    - `pathwayCollection`: a `list` of pathway information, as described above.

In order to create example `Omics`-class objects, we will consider the overall patient survival time (and corresponding censoring indicator) as our survival response, the event time as our regression response, and the event indicator as our binary classification response.
Survival `Omics` Data Object

Now we are prepared to create our first survival `Omics` object for later analysis with either AES-PCA or Supervised PCA. In the call below, the first column of the `colonSurv_df` data frame is treated as the sample ID, the second as the survival time, the third as the event indicator, and the remaining columns as the assay expression data. Therefore, the four arguments to the `CreateOmics` function will be:

- `assayData_df`: this will be the sample ID column and the expression columns of the `colonSurv_df` data frame (i.e. all but the second and third columns). In `R`, we can remove these two columns by negative subsetting: `colonSurv_df[, -(2:3)]`.
- `pathwayCollection_ls`: this will be the `colon_pathwayCollection` list object. Recall that you can import a `.gmt` file into a `pathwayCollection` object via the `read_gmt` function, or create a `pathwayCollection` list object by hand with the `CreatePathwayCollection` function.
- `response`: this will be the first three columns of the `colonSurv_df` data frame: the sample ID, the survival time stored in the `OS_time` column, and the event indicator stored in the `OS_event` column.
- `respType`: this will be the word `"survival"` or an abbreviation of it.

Also, when you create an `Omics*`-class object, the `CreateOmics()` function prints helpful diagnostic messages about the overlap between the features in the supplied assay data and those in the pathway collection.
colon_OmicsSurv <- CreateOmics(
  assayData_df = colonSurv_df[, -(2:3)],
  pathwayCollection_ls = colon_pathwayCollection,
  response = colonSurv_df[, 1:3],
  respType = "surv"
)
The last three sentences of the output tell you how strong the overlap is between the genes measured in your data and the genes in your pathway collection. This message tells us that 9% of the 676 total genes included in all pathways were not measured in the assay; zero pathways were removed from the pathways list for having too few genes after gene trimming; and the genes in the pathways list cover 93.8% of the 656 genes measured in the assay. The last number is the most important: it measures how well your pathway collection overlaps with the genes measured in your assay. This number should be as close to 100% as possible. These diagnostic messages depend only on the overlap between the pathway collection and the assay, so they are response agnostic.
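To make these percentages concrete, here is a rough base `R` sketch of how such overlap figures could be computed by hand. It is an illustration only, not the package's internal implementation; the `toy_pathways` list and `assay_genes` vector are hypothetical.

```r
# Hypothetical toy inputs (not the colon data)
toy_pathways <- list(
  pathA = c("g1", "g2", "g3"),
  pathB = c("g3", "g4", "g9")
)
assay_genes <- c("g1", "g2", "g3", "g4", "g5")

# All distinct genes named in any pathway
pathway_genes <- unique(unlist(toy_pathways))

# Percentage of pathway genes that were not measured in the assay
pct_missing <- 100 * mean(!(pathway_genes %in% assay_genes))

# Percentage of assay genes that appear in at least one pathway
pct_assay_used <- 100 * mean(assay_genes %in% pathway_genes)

c(pct_missing = pct_missing, pct_assay_used = pct_assay_used)
```

In this toy case one of the five pathway genes (`g9`) is unmeasured, and four of the five assay genes belong to some pathway.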
To view a summary of the contents of the `colon_OmicsSurv` object, simply print it to the `R` console.
colon_OmicsSurv
Also notice that the `CreateOmics()` function stores a "cleaned" copy of the pathway collection. The object creation functions within the `pathwayPCA` package subset the feature data frame by the genes in each pathway. Therefore, if we have genes in the pathways that are not recorded in the data frame, then we will necessarily create missing (`NA`) predictors. To circumvent this issue, we check if each gene in each pathway is recorded in the data frame, and remove from each pathway the genes for which the assay does not have recorded expression levels. However, if we remove genes from pathways which do not have recorded levels in the predictor data frame, we could theoretically remove all the genes from a given pathway. Thus, we also check to make sure that each pathway in the given pathways list still has some minimum number of genes present (defaulting to three or more) after we have removed genes without corresponding expression levels.
The `IntersectOmicsPwyCollct()` function performs both of these actions, and it is called automatically within the object creation step. This function removes the unrecorded genes from each pathway, trims the pathways that have fewer than the minimum number of genes allowed, and returns a "trimmed" pathway collection. If any pathways are removed by this execution, the `pathways` list within the `trimPathwayCollection` object within the `Omics` object will have a character vector of the removed pathways stored as the `"missingPaths"` attribute. Access this attribute with the `attr()` function.
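The two trimming steps described above can be sketched in base `R` as follows. This is a simplified illustration under stated assumptions, not the code inside `IntersectOmicsPwyCollct()`; the `toy_pathways` list, `assay_genes` vector, and `min_genes` threshold are hypothetical.

```r
# Hypothetical toy inputs (not the colon data)
toy_pathways <- list(
  pathA = c("g1", "g2", "g3", "g4"),
  pathB = c("g3", "g8", "g9"),
  pathC = c("g2", "g4", "g5", "g7")
)
assay_genes <- c("g1", "g2", "g3", "g4", "g5")
min_genes <- 3

# Step 1: keep only the genes that were measured in the assay
trimmed <- lapply(toy_pathways, intersect, assay_genes)

# Step 2: drop pathways left with fewer than min_genes genes
keep <- lengths(trimmed) >= min_genes
kept_pathways <- trimmed[keep]

# Record the names of the dropped pathways as an attribute
attr(kept_pathways, "missingPaths") <- names(trimmed)[!keep]

# Retrieve the attribute with attr()
attr(kept_pathways, "missingPaths")
```

Here `pathB` retains only one measured gene after step 1, so it is dropped in step 2 and its name is recorded in the attribute.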
Regression and Categorical `Omics` Data Objects

We create regression- and categorical-type `Omics` data objects identically to survival-type `Omics` objects. We will use the survival time as our toy regression response and the death indicator as the toy classification response.
colon_OmicsReg <- CreateOmics(
  assayData_df = colonSurv_df[, -(2:3)],
  pathwayCollection_ls = colon_pathwayCollection,
  response = colonSurv_df[, 1:2],
  respType = "reg"
)
colon_OmicsReg

colon_OmicsCateg <- CreateOmics(
  assayData_df = colonSurv_df[, -(2:3)],
  pathwayCollection_ls = colon_pathwayCollection,
  response = colonSurv_df[, c(1, 3)],
  respType = "categ"
)
colon_OmicsCateg
Accessing and Editing `Omics`-Class Objects

In order to access or edit a specific component of an `Omics` object, we need to use specific accessor functions. These functions are named for the component they access.

The `get*` functions access the part of the data object you specify. You can save these objects to their own variables, or simply print them to the screen for inspection. Here we print the assay data frame contained in the `colon_OmicsSurv` object to the screen:
getAssay(colon_OmicsSurv)
This function is rather simple: it shows us what object is stored in the `assayData_df` slot of the `colon_OmicsSurv` data object. As we should expect, we see the gene expression columns of the `colonSurv_df` data frame, without the survival time and event indicator.
If we need to edit the assay data frame in the `colon_OmicsSurv` object, we can use the "replacement" syntax of the `getAssay` function. These are the "set" functions, and they use the `getSLOT(object) <- value` syntax. For example, if we wanted to remove all of the genes except for the first ten from the assay data, we can replace this assay data with a subset of the original `colonSurv_df` data frame. The `SLOT` shorthand name is `Assay`, and the replacement value is the first ten gene expression columns of the `colonSurv_df` data frame. Given the column layout used in the `CreateOmics()` calls above (sample ID, survival time, and event indicator in columns 1 through 3), these are columns 4 through 13: `colonSurv_df[, (4:13)]`.

getAssay(colon_OmicsSurv) <- colonSurv_df[, (4:13)]

Now, when we inspect the `colon_OmicsSurv` data object, we see only ten variables measured in the `assayData_df` slot, instead of our original 656.

colon_OmicsSurv

Before we move on, we should reset the data in the `assayData_df` slot to the full set of expression columns:

getAssay(colon_OmicsSurv) <- colonSurv_df[, -(1:3)]
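The `getSLOT(object) <- value` form relies on `R`'s replacement functions: a function whose name ends in `<-` and whose final argument is called `value`. As a rough illustration of the mechanism (not the package's own code), the sketch below defines a toy S4 class with a getter and a matching setter. The `ToyOmics` class and `getToyAssay` functions are hypothetical names chosen to avoid clashing with `pathwayPCA`.

```r
library(methods)

# Hypothetical toy class; the names below are illustrative, not pathwayPCA's
setClass("ToyOmics", representation(assayData_df = "data.frame"))

# Getter generic and method
setGeneric("getToyAssay", function(object) standardGeneric("getToyAssay"))
setMethod("getToyAssay", "ToyOmics", function(object) object@assayData_df)

# Setter generic and method; the "<-" suffix enables getToyAssay(x) <- value
setGeneric("getToyAssay<-", function(object, value) standardGeneric("getToyAssay<-"))
setMethod("getToyAssay<-", "ToyOmics", function(object, value) {
  object@assayData_df <- value
  validObject(object)  # re-check that the slot still holds a data frame
  object
})

toy <- new("ToyOmics", assayData_df = data.frame(g1 = 1:3, g2 = 4:6))

# Replace the assay with a one-column subset of itself
getToyAssay(toy) <- getToyAssay(toy)[, "g1", drop = FALSE]
ncol(getToyAssay(toy))
```

Routing edits through a setter like this, rather than assigning to the slot directly, gives the class a single place to validate new values.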
Here is a table listing each of the "get" and "set" methods for the `Omics` class, and which sub-classes they can access or modify.

| Command | `Omics` Sub-class | Function |
|----------------------------------|:-----------------:|------------------------------------------------------------|
| `getAssay(object)` | All | Extract the `assayData_df` data frame stored in `object`. |
| `getAssay(object) <- value` | All | Set `assayData_df` stored in `object` to `value`. |
| `getSampleIDs(object)` | All | Extract the `sampleIDs_char` vector stored in `object`. |
| `getSampleIDs(object) <- value` | All | Set `sampleIDs_char` stored in `object` to `value`. |
| `getPathwayCollection(object)` | All | Extract the `pathwayCollection` list stored in `object`. |
| `getPathwayCollection(object) <- value` | All | Set `pathwayCollection` stored in `object` to `value`. |
| `getEventTime(object)` | `Surv` | Extract the `eventTime_num` vector stored in `object`. |
| `getEventTime(object) <- value` | `Surv` | Set `eventTime_num` stored in `object` to `value`. |
| `getEvent(object)` | `Surv` | Extract the `eventObserved_lgl` vector stored in `object`. |
| `getEvent(object) <- value` | `Surv` | Set `eventObserved_lgl` stored in `object` to `value`. |
| `getResponse(object)` | `Reg` or `Categ` | Extract the `response` vector stored in `object`. |
| `getResponse(object) <- value` | `Reg` or `Categ` | Set `response` stored in `object` to `value`. |
The `response` vector accessed or edited with the `getResponse` method depends on whether the `object` supplied is a "regression" `Omics`-class object or a "categorical" one. For regression `Omics` objects, `getResponse(object)` and `getResponse(object) <- value` get and set, respectively, the `response_num` slot. However, for categorical `Omics` objects, `getResponse(object)` and `getResponse(object) <- value` get and set, respectively, the `response_fact` slot. This is because regression objects contain `numeric` response vectors while categorical objects contain `factor` response vectors.
`pathwayCollection` List

As we mentioned in the Importing with the `read_gmt` Function subsection of the previous vignette, the `pathwayCollection` object will be modified upon `Omics`-object creation. Before, this list only had two elements, `pathways` and `TERMS` (we skipped importing the "description" field). Now, it has a third element: `setsize`, the number of genes contained in each pathway.
getPathwayCollection(colon_OmicsSurv)
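As a small illustration of what a per-pathway gene count looks like, the base `R` sketch below counts the genes in each element of a hypothetical `toy_pathways` list. It shows the idea behind `setsize` only and is not taken from the package source.

```r
# Hypothetical toy pathway list (not the colon data)
toy_pathways <- list(
  pathA = c("g1", "g2", "g3"),
  pathB = c("g2", "g4")
)

# Number of genes in each pathway, as a named integer vector
setsize <- lengths(toy_pathways)
setsize
```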
We now summarize our steps so far. We have

1. defined the `Omics` class and its three sub-classes: survival, regression, and categorical (and the "parent" class), and
2. created an `Omics` object for each of the three sub-classes.

Now we are prepared to analyze our created data objects with either AES-PCA or Supervised PCA. Please read vignette chapter 4 next: Test Pathway Significance.
Here is the R session information for this vignette:
sessionInfo()