library(knitr) knitr::opts_chunk$set(echo = TRUE)
smrtlinkr
is an R package that allows users to communicate with our internal
SmrtLink servers^[A complete list of SmrtLink
servers is available
from: https://confluence.pacificbiosciences.com/display/SL/On-site+SMRT+Link+servers]
to query the database, submit jobs, or obtain information about finished jobs from
reports. This document provides a tutorial of the basic functionality, though
more features are either available in the package or can be upon request.
From your personal computer or a PacBio RStudio server such as bayes:8787
you
can install the smrtlinkr package using the following commands:
install.packages("devtools") library(devtools) install_github("PacificBiosciences/smrtlinkr")
Broadly speaking, SmrtLink servers contain two types of information:
Interacting with the SMRTLink servers entails a serious of commands that can examine the available data on a server, submit a new job by referencing the data on the server as inputs, or examine the results/reports of a job^[A more detailed description of the SMRTLink REST API can be found here: http://smrtflow.readthedocs.io/en/latest/].
Most commands in smrtlinkr
for querying data of some are of the form
fetchXXX(... , server, port)
and require a server and a port to be specified.
These values are particular to each unique server (smrtlink-alpha,
smrtlink-beta, etc.) though by default the commands will query smrtlink-internal
on port 8081.
Both data and jobs are identified as being of particular types, e.g. a reference data type or a subreadset data type. To access data we typically need to know the type and id of an item. The following code displays all the available types of data, and shows how we can use this type information to find all the E. coli reference sequences available on smrtlink-beta.
First to list all the available data types:
library(smrtlinkr) # View all the available DataSet types df = fetchDataSetTypes()
Available Dataset Types
If you know the type of data you are looking for, you can easily load the list of datatypes and query it in R. For example, the code below allows us to find all the E. coli reference sequences known to SMRTLink beta.
# List all the reference data types that are known to the server df = fetchDataSetsByType("references", server = "smrtlink-beta", port = 8081) # Now let's print the path of all the E. coli ones. kable(df[grep("coli", df$name), c("name", "path")])
For both references and subreadsets, there are convenience functions
fetchSubreadSetInfo
and fetchReferenceSetInfo
which allow you to obtain more
information about a particular dataset. For example to find out information
about a particular reference dataset
# Use the reference ID and server information to get more information refId = df$id[20] # Now query for more info info = fetchReferenceSetInfo(refId, server = "smrtlink-beta", port = "8081") info
# Get an id for a particular subreadset id = fetchDataSetsByType("subreads")$id[1] # Now query for more info fetchSubreadSetInfo(id)
Similarly to datasets, jobs in SMRTLink also have a "type", the code below lists all the available job types.
# View all the available job types d = fetchJobTypes() kable(d)
Although users may think of a job as the name of a SMRTLink procedure (e.g. Resequencing)
internally all such jobs are of type pbsmrtpipe
and so this is the type that
will typically be queried to examine results on a smrtlink server. For example
# Let's get all the pbsmrtpipe jobs d = fetchJobsByType("pbsmrtpipe", server = "smrtlink-internal", port = 8081) # Let's only get the resequencing jobs, the type of pbsmrtpipe job is identified # in the comment column # Grep the comment column to find resequencing reseqRows = grep("resequencing", d$comment) # Now let's list 5 of these jobs names and ids d[reseqRows[10:15], c("name", "id")]
Given a set of resequencing jobs, we might want to for example get the underlying reference and subread sets that were used, we can do this with the following function:
paths = fetchSubreadsAndReferencesForResequencingJobs(d[reseqRows[10:15], ]) kable(paths)
smrtlinkr
can also be used to create jobs on SMRTLink, allowing users to
programatically create jobs and then view them on the SMRTlink server. Below is
example code which shows how to create a resequencing job using the postReseqJob
function. You can obtain a subreadset ID and referenceset ID by looking at your
smrtlink server (see data management) or you can also grab it programatically as
shown below.
# To create a job, we'll need a job name, a reference ID and a subreadset ID refName = "pBR322_HindIII_scr_F_EcoRI_tc6_R_unrolled6x" subreadName = "15881P_SB_1G1.25R_A4k" ### OBTAIN A REFERENCE SET ID ### # We can get the reference ID either by looking at SMRTLink or just querying here refs = fetchDataSetsByType("references") # Grep to find the ID of the one we want refID = refs$id[grep(refName, refs$name)] ### OBTAIN A SUBREAD SET ID ### # Same situation subreads = fetchDataSetsByType("subreads") subID = subreads$id[grep(subreadName, subreads$name)] # Now we just submit the job, calling it whatever we want postReseqJob("Submitted from R!", refID, subID)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.