knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%", dev = "ragg_png" )
The goal of rgeo
is to provide a unified interface for most interactions between R and GEO database.
curl
to download all files, and utilize data.table
to implement all reading and preprocessing process. Reducing the dependencies is the initial purpose of this package since I have experienced several times of code running failure after updating packages when using GEOquery
.R function
.parse_gsm_list
, parse_pdata
, log_trans
and show_geo
.You can install the development version of rgeo from GitHub with:
if (!requireNamespace("pak")) { install.packages("pak", repos = sprintf( "https://r-lib.github.io/p/pak/devel/%s/%s/%s", .Platform$pkgType, R.Version()$os, R.Version()$arch ) ) } pak::pkg_install("Yunuuuu/rgeo")
library(rgeo) library(magrittr)
search_geo
The NCBI uses a search term syntax which can be associated with a specific search field enclosed by a pair of square brackets. So, for instance "Homo sapiens[ORGN]"
denotes a search for Homo sapiens
in the “Organism”
field. Details see https://www.ncbi.nlm.nih.gov/geo/info/qqtutorial.html. We can use the same term to query our desirable results in search_geo
. search_geo
will parse the searching results and return a data.frame
object containing all the records based on the search term. The internal of search_geo
is based on rentrez
package, which provides functions working with the NCBI Eutils API, so we can utilize NCBI API key
to increase the downloading speed, details see https://docs.ropensci.org/rentrez/articles/rentrez_tutorial.html#rate-limiting-and-api-keys.
Providing we want GSE GEO records related to human diabetes, we can get these records by following code, the returned object is a data.frame
:
diabetes_gse_records <- search_geo( "diabetes[ALL] AND Homo sapiens[ORGN] AND GSE[ETYP]" ) head(diabetes_gse_records[1:5])
Then, we can use whatever we're famaliar to filter the searching results. Providing we want GSE datasets with at least 6 diabetic nephropathy samples containing expression profiling. Here is the example code:
diabetes_nephropathy_gse_records <- diabetes_gse_records %>% dplyr::mutate( number_of_samples = stringr::str_match( Contains, "(\\d+) Samples?" )[, 2L, drop = TRUE], number_of_samples = as.integer(number_of_samples) ) %>% dplyr::filter( dplyr::if_any( c(Title, Summary), ~ stringr::str_detect(.x, "(?i)diabetes|diabetic") ), dplyr::if_any( c(Title, Summary), ~ stringr::str_detect(.x, "(?i)nephropathy") ), stringr::str_detect(Type, "(?i)expression profiling"), number_of_samples >= 6L ) head(diabetes_nephropathy_gse_records[1:5])
After filtering, we got r nrow(diabetes_nephropathy_gse_records)
candidate datasets. This can reduce a lot of time of us comparing with refining datasets by reading the summary records.
get_geo
GEO database mainly provides SOFT (Simple Omnibus Format in Text) formatted files for GPL, GSM and GDS entity. SOFT is designed for rapid batch submission and download of data. SOFT is a simple line-based, plain text format, meaning that SOFT files may be readily generated from common spreadsheet and database applications. A single SOFT file can hold both data tables and accompanying descriptive information for multiple, concatenated Platforms, Samples, and/or Series records.
rgeo
provide a GEOSoft
class object to store SOFT file contents, GEOSoft
object contains four slots ("accession", "meta", "datatable", and "columns"). accession
slot stores the GEO accession ID, meta
slot contains the metadata header in the SOFT formatted file, and datatable
slot contains the the data table in SOFT file which is the main data for us to use, along with a columns
slot providing descriptive column header for the datatable
data. We can use the function with the same name of these slots to extract the data.
get_geo
can download SOFT files and preprocess them well, here is some example code to get soft file from GPL
, GSM
and GDS
entity respectively.
gpl <- get_geo("gpl98", tempdir()) gpl head(datatable(gpl)) head(columns(gpl))
gsm <- get_geo("GSM1", tempdir()) gsm head(datatable(gsm)) head(columns(gsm))
gds <- get_geo("GDS10", tempdir()) gds head(datatable(gds)) head(columns(gds))
For GSE entity, there is also a soft file associated with it. But the structure is different with GPL
, GSM
and GDS
entity, rgeo
provide GEOSeries
class to keep contents in GSE soft file. Actually, a GSE soft file contains almost all contents in its subsets soft file including both GPL
and GSM
, so GEOSeries
class provides both gpl
and gsm
slots as a list of GEOSoft
. To download GSE soft file, we just set gse_matrix
to FALSE
in get_geo
function.
gse <- get_geo("GSE10", tempdir(), gse_matrix = FALSE) gse
It's more common to use a series matrix file in our usual analysis workflow, we can also handle it easily in rgeo
, as what we need to do is just set gse_matrix
to TRUE
in get_geo
function, which is also the default value. When gse_matrix
is TRUE
, get_geo
will return a ExpressionSet
object which can interact with lots of Bioconductor packages. There are two parameters controling the processing details when parsing series matrix file. When parsing phenoData from series matrix files directly, it's common to fail to discern characteristics_ch*
columns, which contain the important traits informations of corresponding samples, since many characteristics_ch*
columns in series matrix files often lacks separate strings. pdata_from_soft
, which indicates whether retrieve phenoData from GEO Soft file, can help handle this problem well. When the soft file is large and we don't want to use it, we can set pdata_from_soft
to FALSE
and use parse_pdata
function to parse it manully. Another important parameter is add_gpl
, where FALSE
indicates get_geo
will try to map the current GPL accession id into a Bioconductor annotation package, then we can use the latest bioconductor annotation package to get the up-to-date featureData, otherwise, get_geo
will add featureData from GPL soft file directly.
gse_matix <- get_geo("GSE10", tempdir()) gse_matix
gse_matrix_with_pdata <- get_geo( "gse53987", tempdir(), pdata_from_soft = FALSE, add_gpl = FALSE ) gse_matrix_smp_info <- Biobase::pData(gse_matrix_with_pdata) gse_matrix_smp_info$characteristics_ch1 <- stringr::str_replace_all( gse_matrix_smp_info$characteristics_ch1, "gender|race|pmi|ph|rin|tissue|disease state", function(x) paste0("; ", x) ) gse_matrix_smp_info <- parse_pdata(gse_matrix_smp_info) gse_matrix_smp_info[grepl( "^ch1_|characteristics_ch1", names(gse_matrix_smp_info) )]
get_geo_suppl
GEO stores raw data and processed sequence data files as the external supplementary data files. Sometimes, we may want to preprocess and normalize the rawdata by ourselves, in addition, it's not uncommon that a GSE entity series matrix won't contain the expression matrix, which is almost the case of high-throughout sequencing data. get_geo_suppl
is designed for these conditions. Usually, the expression matrix will be provided in the GSE supplementary files or in the GSM supplementary files.
If the expression matrix is given in the GSE supplementary files, we can download it directly use get_geo_suppl
, which will return a character vector containing the path of downloaded files.
gse160724 <- get_geo_suppl( ids = "GSE160724", tempdir(), pattern = "counts_anno" ) gse160724_dt <- data.table::fread(gse160724) head(gse160724_dt[1:5])
If the expression matrix is given in the GSM supplementary files, in this way, we start from derive all GSM accession ids and then download all GSM supplementary files and combine them into a expression matrix. Although no expression matrix in the series matrix file, it still contains the samples informations.
gse180383_smat <- get_geo( "GSE180383", tempdir(), gse_matrix = TRUE, add_gpl = FALSE, pdata_from_soft = FALSE ) gse180383_smat_cli <- Biobase::pData(gse180383_smat) head(gse180383_smat_cli[1:5]) gse180383_smat_gsmids <- gse180383_smat_cli[["geo_accession"]] gse180383_smat_gsm_suppl <- get_geo_suppl(gse180383_smat_gsmids, tempdir())
Another way, we can also derive sample accession ids from GSE soft files, which is what our laboratory prefers to since we can easily get exact sample traits information as described in the above by utilizing parse_gsm_list
function.
gse180383_soft <- get_geo( "GSE180383", tempdir(), gse_matrix = FALSE ) gse180383_soft_cli <- parse_gsm_list(gsm(gse180383_soft)) head(gse180383_soft_cli[1:5]) gse180383_soft_gsmids <- names(gsm(gse180383_soft)) gse180383_soft_gsm_suppl <- get_geo_suppl(gse180383_soft_gsmids, tempdir())
rgeo
also provide some useful function to help better interact with GEO.
show_geo
function: Require a geo entity id and open GEO Accession site in the default browser.log_trans
function: Require a expression matrix and this function will check whether this expression matrix has experienced logarithmic transformation, if it hasn't, log_trans
will do it. This is a helper function used in GEO2R
. sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.