A RAPID INTRODUCTION TO THE PACKAGE


Why SpiderSeqR?

SpiderSeqR accelerates and simplifies genomic research by offering a new framework for searching and filtering metadata from SRA and GEO databases.

It builds upon the database files from the SRAdb and GEOmetadb packages (which contain metadata of genomic datasets from the NCBI SRA and GEO databases, respectively), integrating them and adding tools that make search and retrieval easier.
Our aim was to make publicly available genomic datasets more accessible and so enable the re-use of existing datasets for high-throughput analysis.
Some of the key features of SpiderSeqR include:

- full-text and criteria-based search across the integrated SRA and GEO metadata (searchAnywhere(), searchForTerm())
- retrieval of records by accession number and conversion between SRA and GEO accession types (searchForAccession(), convertAccession())
- helpers for filtering results and selecting informative columns (filterByTerm(), selectColumns_Overview())
- tools for saving queries and re-running them at a later date (rerunSpiderSeqR())

Installation

The package is currently available on GitHub at https://github.com/ss-lab-cancerunit/SpiderSeqR. To install it, please run:

## If needed, install devtools first: install.packages("devtools")
devtools::install_github("ss-lab-cancerunit/SpiderSeqR")

Setup and Requirements

In order to use SpiderSeqR, please ensure that you have the following installed:

- a recent version of R
- the SpiderSeqR package itself (see Installation above)

Additionally, you will need free hard drive space to store the large database files (we recommend having at least 100 GB of disk space available). The most recent database files took up about 40 GB in total, but this is likely to increase gradually over time as the number of samples in the databases continues to grow.

You must run the following in order to start using SpiderSeqR (this should be done every time a new R session is started).

library(SpiderSeqR)
startSpiderSeqRDemo() # DEMO. Use it to quickly get a flavour of SpiderSeqR

Please note that the above initialises only the demo version, which uses a small extract from the large database files.
It is useful for this vignette and for getting familiar with SpiderSeqR, but to get access to the whole databases, please run the code below:

startSpiderSeqR(path=getwd()) 
## The 'path' argument determines the location of large database files 
## (running this command for the first time can be time consuming, 
## but will be fast on all successive runs provided 
## that the directory does not change)

A note about databases

It is extremely helpful to understand how the information about datasets is organised in SRA and GEO databases.
SRA and GEO have different hierarchy systems for their accession levels.
In SRA these are:

- study (accessions beginning with SRP)
- sample (SRS)
- experiment (SRX)
- run (SRR)

The relationship between higher and lower accession levels is 'one to many', i.e. one study can contain many samples, each of which may correspond to many experiments, each of which in turn can contain multiple runs.
GEO has fewer accession levels, i.e.:

- series (GSE)
- sample (GSM)

As in SRA, each GEO series may contain many samples, but the situation is further complicated by some series playing the role of a 'superseries' (which also have GSE-style accessions), making some series interconnected. This also means that some samples may belong to multiple series.
In spite of these challenges, SpiderSeqR aims to make the conversion between all these levels as seamless as possible.
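
As a brief illustration of these relationships, converting a single study accession with convertAccession() (covered in more detail later in this vignette) returns the corresponding accessions at the other levels. The sketch below uses the example study accession that appears elsewhere in this vignette and assumes it is present in the database files you have loaded.

## A minimal sketch of the accession hierarchy in practice
## (assumes the demo database has been initialised)
library(SpiderSeqR)
startSpiderSeqRDemo()

## One study (SRP) is expected to map to many runs (SRR), with the
## intermediate sample (SRS) and experiment (SRX) accessions alongside
convertAccession("SRP063011")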


SpiderSeqR by Task

For more detailed information, please refer to the documentation of the individual package functions.

The subsections below introduce typical tasks and examples of how they can be completed using the main SpiderSeqR functions.

NOTE: the results of the queries will differ when using the actual databases rather than the small demo extract, and they are likely to take longer to run due to the size of the databases (on the order of a few minutes per search, depending on your machine).

Search for (unknown) experiments - by search terms

There are two functions that achieve this: searchAnywhere() and searchForTerm().

searchAnywhere() performs a full-text search across the two databases and returns any entries that match the search terms. Users can restrict the search to matches within specific accession levels, as shown in the sketch below. For example, the study level usually contains general information about the study aims and may produce spurious matches (e.g. a mention of a related gene in the abstract that is not investigated experimentally). This can be undesirable when searching for a well-studied molecule, as it may result in many matches, most of them irrelevant; it can, however, be useful when searching for a less common gene. It is also possible to filter the results of the search afterwards (see filterByTerm() and filterByTermByAccessionLevel()).
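
For instance, to avoid study-level matches altogether, the search can be restricted to lower accession levels. This is a minimal sketch; the acc_levels argument and the level names follow the usage shown later in this vignette.

## Search only within run- and sample-level records, skipping study-level
## descriptions that often produce spurious matches
df_restricted <- searchAnywhere("trim*", acc_levels = c("run", "gsm"))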

searchForTerm() is a much more specific function, which filters the information for the user according to its inbuilt logic in order to minimise the possibility of unwanted matches. It was built in an attempt to automate, as far as possible, the process of collating a list of samples for high-throughput analysis. In that vein, it also attempts to identify relevant 'inputs' and 'controls' for ChIP-Seq and RNA-Seq studies respectively. However, we strongly recommend verifying the suitability of the matches manually, due to the unavoidable limitations of our function and the variability with which experimental data are annotated in the databases. If searchForTerm() produces too few results, it is worth performing a less restrictive search using searchAnywhere() and comparing the results (a sketch of such a comparison follows the examples below).

TASK: "I would like to find the ChIP-Seq experiments that were done regarding a trim gene (more specific, but less sensitive search)"

searchForTerm(SRA_library_strategy = "ChIP-Seq", gene = "trim")

TASK: "I would like to find the ChIP-Seq experiments that were done regarding a trim gene (broad search)"

searchAnywhere("trim*")

Filter the results

TASK: "I would like to search within the results to further refine them"

df <- searchAnywhere("trim*")

## Filter anywhere in the data frame
df2a <- filterByTerm(df, "input")

## Filter within columns corresponding to accession levels of interest
df2b <- filterByTermByAccessionLevel(query = "input", 
                             df = df, acc_levels = "experiment")

Select columns of interest

TASK: "I find that there are too many columns to view comfortably;" I would rather have fewer columns to view my results

df <- searchAnywhere("trim*")

## Use pre-made column sets:

## Columns with accession numbers only
df_for_view <- selectColumns_Accession(df)

## A suggested set of columns that gives a good overview of the data
df_for_view <- selectColumns_Overview(df)

TASK: "I find that there are too many columns to view comfortably;" I am only interested in a few columns that matter to me

df <- searchAnywhere("trim*")

## Define your set
column_set <- c("run_accession", 
                "study_accession", 
                "gsm",
                "SRA_experiment_title",
                "SRA_sample_attribute")

df_for_view <- selectColumns(df, column_set)

Search for (known) experiments - by accession numbers (searchForAccession)

TASK: "I would like to retrieve a handy list of samples from a study I already know."

df <- searchForAccession("SRP063011") # Search for accession
## Select most informative columns for ease of viewing
selectColumns_Overview(df) 

It is also possible to search for multiple accessions at the same time, provided they belong to only one type:

df <- searchForAccession(c("SRR3624458", "SRR3624459", "SRR3624460"))

Convert between accession types (convertAccession)

TASK: "I have a study accession number and would like to find a list of sample accessions that belong to it."

convertAccession("GSM1173370") 

As with searchForAccession(), you can convert multiple accessions at the same time:

convertAccession(c("SRR3624458", "SRR3624459", "SRR3624460"))

Find other samples in a study (addMissingSamples)

TASK: "I have found samples/runs of interest, but I would now like to retrieve all the entries within their study/series"

df <- searchAnywhere("trimkd", acc_levels = c("run", "gsm"))
## Using selectColumns_Overview to limit the number of columns displayed
df_for_view <- selectColumns_Overview(addMissingSamples(df)) 

Save a query and run it again at a later date (rerunSpiderSeqR)

In the spirit of reproducible research, SpiderSeqR provides tools for running its queries again. This functionality is available for all of its main search functions, i.e. searchForTerm(), searchAnywhere() and searchForAccession(). We believe that this feature may prove useful when you want to check whether there are any new matches to a query of interest after updating the database files (which is done by running startSpiderSeqR(); you can set an expiry date for the function to make sure the files are updated regularly).

TASK: "I would like to save my query to re-use at a later date"

## Enable saving call output in a file
searchAnywhere("stat1", call_output = TRUE) 

TASK: "I have downloaded a new version of the database and would like to re-run a query"

rerunSpiderSeqR("filename")

Update the database files regularly

The database files are updated regularly by the SRAdb and GEOmetadb packages. To find out how old your local files are, you can simply run startSpiderSeqR(). To update the files, set the expiry date to a desired limit whenever you run startSpiderSeqR().

TASK: "I would like to make sure that my database files are no older than one month"

## general_expiry = 30 corresponds to the one-month limit from the task above
startSpiderSeqR(getwd(), general_expiry = 30)

Store the database files in a subdirectory

By default, startSpiderSeqR() tries to locate the database files within the specified directory, including any of its subdirectories. You can therefore store the files in, for example, "./Database_Files", or even split them across "./SRA_files", "./GEO_files" and "./SRR_GSM_files". All the files will be located, and in the case of multiple matches, the user will be prompted to choose the desired file.
Please note, however, that the specified directory remains the default location for downloading database files (even if the three files are currently stored in different locations). Therefore, if you update or download a missing database file, it will be placed in the specified directory, not in the location(s) of the other files. You can move the files to the desired directory and run startSpiderSeqR() again to update the connections to the new file locations.

TASK: "I would like to be more flexible about the location of the database files and store them in a subdirectory without having to specify the exact path"

# For files located e.g. in ./data/database_files, the following will suffice:
startSpiderSeqR(getwd())
# OR:
startSpiderSeqR(".")
# i.e. there is no need to specify the exact path, as in:
# startSpiderSeqR("data/database_files")

Find out more about the databases (getDatabaseInformation)

TASK: "I would like to know what are the experiment types (library_strategy) featured in the database."

## Choose a relevant option from the menu when it appears
getDatabaseInformation()

