knitr::opts_chunk$set(
  tidy=FALSE,
  cache=FALSE,
  dev="png",
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
library(brentlabRnaSeqTools)

all functions have documentation

You can always access this documentation by placing a question mark in front of the name of any function or data object loaded by the brentlabRnaSeqTools package. For example:

?postFastqSheet

database_info stores database information

database_info is a list object which becomes available when you attach the brentlabRnaSeqTools library. You can view a list of the slots by doing this:

?database_info

Alternatively, if you type database_info$ in your console and hit tab, a list of slots will appear. The same is true for values which are themselves lists. For example, if you enter database_info$kn99_urls$ and hit tab, a list of URLs will pop up.
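
You can also list the slots non-interactively with base R (nothing here is specific to brentlabRnaSeqTools):

# top-level slots of the database_info list
names(database_info)

# slots of a nested list, for example the kn99 urls
names(database_info$kn99_urls)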

Usernames and passwords

For functions which upload data to the database, you'll need to use the same username/password you use to log into the frontend data entry system. Typically, this will look something like this:

username: I.SURNAME
password: password123

The database uses this username/password to generate a "token", which is what is actually used to sign in a user to the database system.

So, in order to interact with the database, you sometimes need your "authorization token". This is what getUserAuthToken() does:

# check the documentation
?getUserAuthToken

A valid call to this function looks like this:

username = 'I.SURNAME'
password = 'password123'

my_token = getUserAuthToken(database_info$kn99_urls$token_auth, username, password)

# view your token
print(my_token)

In general, you want to keep your username, password, and token secret. One way to do that in R is to use your .Renviron file and to ensure that it is in your .gitignore. Note: this assumes that you are working in a project directory.

Using a local .Renviron

To make a .Renviron file in your project, do this:

usethis::edit_r_environ("project")

You want to be sure at this point to add the .Renviron file to your .gitignore. Do this by adding .Renviron on a new line in the .gitignore file.
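
If you prefer to do this from R, the usethis package can append the entry for you (a convenience sketch; see ?usethis::use_git_ignore):

# adds .Renviron to the project's .gitignore
usethis::use_git_ignore(".Renviron")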

Next, open the .Renviron file and add some environmental variables. For example, you might do this:

# note: the db_ variables will be used later in this vignette
db_username = 'db_username'
db_password = 'db_password123'
username = 'I.SURNAME'
password = 'password123'
token = 'lalskdfjaslkdf12341klajsdf' # output of getUserAuthToken()

Restart your R session (see the Session menu in the RStudio window), and you can now access your environment variables. Using getUserAuthToken() as an example, you would now do this:

username = Sys.getenv('username')
password = Sys.getenv('password')

my_token = getUserAuthToken(database_info$kn99_urls$token_auth, username, password)

# view your token
print(my_token)
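
Note that Sys.getenv() returns an empty string for an unset variable rather than throwing an error, so a forgotten session restart can fail in confusing ways. A minimal sanity check in base R:

# stop early if the .Renviron variables were not picked up
stopifnot(nzchar(Sys.getenv('username')), nzchar(Sys.getenv('password')))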

From now on, it is assumed that you have a .Renviron file in your project directory with the variables listed above.

Send a fastq sheet to the database

Check the documentation with ?postFastqSheet. Using this function looks like this:

# NOTE: make sure you choose the right organism for your fastq file
database_url = database_info$kn99_urls$FastqFiles
auth_token = Sys.getenv('token')
# note: currently, this function accepts files in .csv, .tsv and .xlsx formats
new_fastq_path = '/path/to/new/fastq.xlsx'

# save the output in a variable. If there is a failure, this is where the error information will be
new_fastq_response = postFastqSheet(database_url, auth_token, new_fastq_path)

If the response is a success (code 200 or 201), then the communication with the database was successful. If it was not, then you'll get a failure (code 400, for example). In that case, save the response variable like so:

# note: write_rds() is from the readr package
# the name might be something like "fastq_response_20210701.rds"
write_rds(new_fastq_response, "database_log/unique_name.rds")

And send it to whoever can use it to figure out what went wrong.
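
A small sketch for generating the dated filename suggested in the comment above (base R plus readr; the database_log directory is carried over from the example):

# build a name like "fastq_response_20210701.rds" from today's date
if (!dir.exists("database_log")) dir.create("database_log")
rds_name = paste0("fastq_response_", format(Sys.Date(), "%Y%m%d"), ".rds")
write_rds(new_fastq_response, file.path("database_log", rds_name))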

Send counts to the database

First, mount the cluster to your local computer so that you have access to the count file generated by the cluster-based QC pipeline. Once you have done that, sending the counts file is similar to sending a fastq sheet:

# NOTE: make sure you choose the right organism for your counts file
database_counts_url = database_info$kn99_urls$Counts
run_number = 12345
auth_token = Sys.getenv("token")
new_counts_path = "/path/to/counts/file"
# See section on `archiveDatabase`
fastq_df = read_csv("data/20210701/fastq.csv")

new_counts_response = postCounts(database_counts_url, run_number, auth_token, new_counts_path, fastq_df)

Archive the database

To pull the current state of the database onto your computer, use archiveDatabase(). Note: you do not need to keep historic copies locally, so clean this up regularly by deleting old archives.

Note: this uses a different username and password than the ones we have been using above, namely the "superuser" credentials. It is assumed that these have been stored in your .Renviron already.

database_host = database_info$kn99_host
database_name = database_info$kn99_db_name
database_user = Sys.getenv("db_username")
database_password = Sys.getenv("db_password")
# this assumes you have a data directory in your current working directory
output_dir = "data"

archiveDatabase(database_host, database_name, database_user, database_password, output_dir)

The output of this function will be a directory, named with today's date, inside the output_dir. It will contain a separate .csv for each table in the database, as well as a combined table. This archive is where the fastq_df in the postCounts example above comes from.
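
For example, to read the fastq table back out of today's archive (a sketch, assuming the directory name follows the YYYYMMDD pattern shown in the postCounts example above):

# path to today's archive, e.g. "data/20210701"
archive_dir = file.path("data", format(Sys.Date(), "%Y%m%d"))
fastq_df = read_csv(file.path(archive_dir, "fastq.csv"))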

create a query sheet for the pipeline

To create a query sheet to run the pipeline (this used to be what the queryDB function on the cluster did), use the function getMetadata(). Note: if you are creating this to put a new run through the pipeline, this needs to happen after you have added the fastq sheet to the database and ensured that the new run is in the lts_sequence directory in the appropriate format.

# as in the archiveDatabase example, these are the "superuser" credentials,
# not your personal credentials or token. It is assumed they are in your .Renviron file.

database_host = database_info$kn99_host
database_name = database_info$kn99_db_name
database_user = Sys.getenv("db_username")
database_password = Sys.getenv("db_password")

combined_df = getMetadata(database_host, database_name, database_user, database_password)

You will next need to filter down to the set of samples you are interested in. Please ask for help with this if you need it. Here is an example which returns only the samples in a given run:

# filter() and the pipe (%>%) come from the dplyr package
run_subset = combined_df %>%
  filter(runNumber == 12345)

Assuming you have the cluster mounted to your local system, you can now save this in your personal scratch rnaseq_pipeline/query directory:

write_csv(run_subset, "/path/to/mounted_scratch/rnaseq_pipeline/query/run_12345.csv")
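
Note: if the cluster is not actually mounted, write_csv() will fail with a path error that can be confusing. A quick base R check you can run first (reusing the placeholder path above):

# confirm the mounted query directory exists before writing
stopifnot(dir.exists("/path/to/mounted_scratch/rnaseq_pipeline/query"))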

Now you would run the pipeline on the cluster as before. Instructions may be found here:

https://github.com/BrentLab/rnaseq_pipeline/wiki

Get raw counts

Getting the raw counts is very similar to getMetadata():

# as in the archiveDatabase example, these are the "superuser" credentials,
# not your personal credentials or token. It is assumed they are in your .Renviron file.

database_host = database_info$kn99_host
database_name = database_info$kn99_db_name
database_user = Sys.getenv("db_username")
database_password = Sys.getenv("db_password")

raw_counts = getRawCounts(database_host, database_name, database_user, database_password)
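
A quick sanity check on the result (plain base R; the exact shape of the returned object depends on getRawCounts(), so treat this as a sketch):

# dimensions and a few sample column names of the raw counts
dim(raw_counts)
head(colnames(raw_counts))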

Creating experiment sets, DESeq objects, etc

See the experiment_sets vignette, which you can find via browseVignettes("brentlabRnaSeqTools").


