knitr::opts_chunk$set(collapse = TRUE, comment = '#>')
This vignette describes how to prepare data and metadata for a gene expression meta-analysis according to the method of Hughey and Butte (2015).
The first step is to find microarray datasets relevant to what you're interested in. Most datasets will be on NCBI GEO or ArrayExpress, but not all, so it's a good idea to also search the literature.
The study metadata should be a comma-delimited text file, with one row for each study and at least the following columns:
| Name | Description |
|:----------------|:-------------------------------------------------------------------------------|
| `study` | Name of the study, which must be unique. |
| `studyDataType` | Indicates how the expression data is stored (see below for details). |
| `platformInfo` | Microarray platform, used for mapping probes to genes (see below for details). |
There are currently five options for `studyDataType`:
| `studyDataType` | Description |
|:---------------------|:------------------------------------------------------------------------------------|
| `affy_geo` | Raw Affymetrix data from a GEO study. |
| `affy_custom` | Raw Affymetrix data from a non-GEO study (e.g., ArrayExpress). |
| `affy_series_matrix` | Normalized, untransformed, probe-level Affymetrix data in a GEO series matrix file. |
| `series_matrix` | Normalized, log-transformed (or equivalent) data in a GEO series matrix file. |
| `eset_rds` | Normalized, log-transformed (or equivalent) data, already mapped to Entrez Gene IDs, saved as an `ExpressionSet` in an RDS file. |
The options for `platformInfo` depend on the `studyDataType`:
| `studyDataType` | `platformInfo` |
|:---------------------------------------------------|:--------------------------------------------|
| `affy_geo`, `affy_custom`, or `affy_series_matrix` | Name of corresponding BrainArray custom cdf |
| `series_matrix` | Corresponding GPL identifier |
| `eset_rds` | `ready` |
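As a rough illustration of the three required columns, the sketch below builds a small study metadata table in base R, runs a few sanity checks that mirror the rules above, and writes it to a comma-delimited file. The study names, the GPL identifier, and the custom CDF name are placeholders chosen for illustration (the CDF name assumes the usual BrainArray package naming), and the checks are not part of metapredict.

```r
# Hypothetical studies; all names and platform values are illustrative only.
studyMetadata = data.frame(
  study = c('GSE98765', 'EMTAB0001', 'GSE12345', 'localStudy'),
  studyDataType = c('affy_geo', 'affy_custom', 'series_matrix', 'eset_rds'),
  platformInfo = c('hgu133plus2hsentrezgcdf', 'hgu133plus2hsentrezgcdf',
                   'GPL570', 'ready'),
  stringsAsFactors = FALSE)

# Sanity checks based on the requirements described above.
validTypes = c('affy_geo', 'affy_custom', 'affy_series_matrix',
               'series_matrix', 'eset_rds')
stopifnot(!anyDuplicated(studyMetadata$study))
stopifnot(all(studyMetadata$studyDataType %in% validTypes))
isEset = studyMetadata$studyDataType == 'eset_rds'
stopifnot(all(studyMetadata$platformInfo[isEset] == 'ready'))

# Write the table as a comma-delimited text file (here, in a temporary folder).
metadataPath = file.path(tempdir(), 'study_metadata.csv')
write.csv(studyMetadata, metadataPath, row.names = FALSE)
```

Writing with `row.names = FALSE` keeps the file limited to the named columns, so it can be read back with `read.csv(metadataPath, stringsAsFactors = FALSE)`.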
The format of the gene expression data for each study should correspond to its `studyDataType`. All the folders and/or files with expression data should be in one parent folder.
| `studyDataType` | Format of expression data if name of study is `GSE98765` |
|:----------------------------------------|:-----------------------------------------------------------|
| `affy_geo` or `affy_custom` | Folder named `GSE98765` containing cel or cel.gz files. |
| `affy_series_matrix` or `series_matrix` | File from GEO named `GSE98765_series_matrix.txt.gz`. |
| `eset_rds` | RDS file named `GSE98765.rds` containing a Bioconductor `ExpressionSet`. |
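Before running the analysis, it can help to confirm that each study's folder or file is where the table above says it should be. The following is a minimal sketch in base R. `getExpectedPath` is a hypothetical helper written for this example (it is not a metapredict function), and the studies and placeholder files are made up.

```r
# Hypothetical helper: expected folder or file for one study, following the
# naming conventions in the table above.
getExpectedPath = function(study, studyDataType, parentFolder) {
  fileName = switch(
    studyDataType,
    affy_geo = study,
    affy_custom = study,
    affy_series_matrix = paste0(study, '_series_matrix.txt.gz'),
    series_matrix = paste0(study, '_series_matrix.txt.gz'),
    eset_rds = paste0(study, '.rds'),
    stop('Unknown studyDataType: ', studyDataType))
  file.path(parentFolder, fileName)
}

# Demonstration: a temporary parent folder with placeholder data for two of
# three made-up studies.
parentFolder = file.path(tempdir(), 'expression_data')
dir.create(file.path(parentFolder, 'GSE98765'), recursive = TRUE,
           showWarnings = FALSE)
file.create(file.path(parentFolder, 'GSE11111_series_matrix.txt.gz'))

studyMetadata = data.frame(
  study = c('GSE98765', 'GSE11111', 'GSE22222'),
  studyDataType = c('affy_geo', 'series_matrix', 'eset_rds'),
  stringsAsFactors = FALSE)

expectedPaths = mapply(
  getExpectedPath, studyMetadata$study, studyMetadata$studyDataType,
  MoreArgs = list(parentFolder = parentFolder))

# file.exists() returns TRUE for both files and folders.
studyMetadata$found = unname(file.exists(expectedPaths))
print(studyMetadata)
```

Rows with `found` equal to `FALSE` point to studies whose data are missing or misnamed.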
For studies whose `studyDataType` is `affy_geo` or `affy_custom`, install the custom CDF package(s). See the `installCustomCdfPackages` documentation for details.

```r
?metapredict::installCustomCdfPackages
```
For studies whose `studyDataType` is `affy_series_matrix`, download the custom CDF mapping(s). See the `downloadCustomCdfMappings` documentation for details.

```r
?metapredict::downloadCustomCdfMappings
```
For studies whose `studyDataType` is `series_matrix`, check that the platforms are supported.

```r
studyMetadata = read.csv('<path to study metadata file>', stringsAsFactors = FALSE)
metapredict::getUnsupportedPlatforms(studyMetadata)
```
If a platform is not supported, then in the `getStudyData()` function, add another `else if` statement that tells the function how to map probes to Entrez Gene IDs for that platform. Look at the code for currently supported platforms to see examples of how this is done. Then add the platform to the `getSupportedPlatforms()` function.

The sample metadata should be a comma-delimited text file, with one row for each sample and at least the following columns:
| Name | Description |
|:-----------------------------------------|:-------------------------------------------------------------|
| `study` | Name of the corresponding study. |
| `sample` | Name of the sample, which must be unique across all studies. |
| `outcome`, `class`, or something similar | Variable that the meta-analysis will be trying to predict. |
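The requirements in this table can be checked with a few lines of base R. The sketch below uses made-up study and sample names and a `class` column as the outcome. The checks are illustrative and are not part of metapredict.

```r
# Made-up study and sample metadata for illustration.
studyMetadata = data.frame(
  study = c('GSE98765', 'GSE11111'),
  stringsAsFactors = FALSE)

sampleMetadata = data.frame(
  study = c('GSE98765', 'GSE98765', 'GSE11111', 'GSE11111'),
  sample = c('GSM0000001', 'GSM0000002', 'GSM0000003', 'GSM0000004'),
  class = c('case', 'control', 'case', 'control'),
  stringsAsFactors = FALSE)

# Sample names that appear more than once across all studies.
duplicatedSamples = unique(
  sampleMetadata$sample[duplicated(sampleMetadata$sample)])

# Studies named in the sample metadata but absent from the study metadata,
# and studies that have no samples at all.
unknownStudies = setdiff(sampleMetadata$study, studyMetadata$study)
studiesWithoutSamples = setdiff(studyMetadata$study, sampleMetadata$study)

# All three vectors should be empty for consistent metadata.
print(list(duplicatedSamples = duplicatedSamples,
           unknownStudies = unknownStudies,
           studiesWithoutSamples = studiesWithoutSamples))
```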
The format of the sample names depends on the `studyDataType`:
| `studyDataType` | Format of sample names |
|:----------------------------------------|:-------------------------------------------------------------------|
| `affy_geo` or `affy_custom` | Names of the .cel or .cel.gz files (excluding the file extension). |
| `affy_series_matrix` or `series_matrix` | Names of the GSM identifiers from the series matrix file. |
| `eset_rds` | `colnames` of the expression matrix in the `ExpressionSet`. |
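For `affy_geo` or `affy_custom` studies, one way to obtain sample names in the expected format is to list the files in the study's folder and strip the extension. The sketch below does this in base R for a temporary folder with made-up placeholder files. Matching the extension case-insensitively is an assumption of this example, not something the package is known to require.

```r
# Temporary study folder with empty placeholder files (made-up names).
studyFolder = file.path(tempdir(), 'cel_demo', 'GSE98765')
dir.create(studyFolder, recursive = TRUE, showWarnings = FALSE)
file.create(file.path(studyFolder, c('GSM0000001.cel.gz', 'GSM0000002.cel')))

# List the cel and cel.gz files, then remove the extension to get sample names.
celPattern = '\\.cel(\\.gz)?$'
celFiles = list.files(studyFolder, pattern = celPattern, ignore.case = TRUE)
sampleNames = sub(celPattern, '', celFiles, ignore.case = TRUE)
print(sampleNames)
```

The resulting names can then be compared with the `sample` column of the sample metadata, for example with `setdiff()`.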