Use devtools::install_github() to install directly from the GitHub repository.
devtools::install_github("seaflow-uw/popcycle", ref = <ref>)
where <ref>
can be a commit, tag, or branch.
To install this package using a reproducible environment with renv, first clone this repository then restore the project based on the renv.lock
file. Note, you use the same version of R that was used to create the renv.lock
file, wich you can find near the top of the ile.
git clone https://github.com/seaflow-uw/popcycle cd popcycle # optionaly switch to desired ref, e.g. git checkout -b 4.0.0 4.0.0 cp renv-Rprofile .Rprofile R -e 'renv::restore()' R CMD INSTALL .
Now to use this environment either start R from this directory, which will load the .Rprofile
file automatically, or run renv::activate("/path-to-git-repo")
from a different directory.
Docker images are available at https://hub.docker.com/r/ctberthiaume/popcycle. The latest images use package versions from renv.lock
and include RStudio. See rocker/rstudio for details on running with RStudio launched in a web browser.
docker pull ctberthiaume/popcycle:<version_tag> docker run -it --rm ctberthiaume/popcycle:<version_tag>
This guide will use the test SeaFlow data included in this repository in the tests/testdata
directory.
The rest of the example code in this guide assumes you have set the working directory for R to the root of the repository.
For example, if you cloned the repo to ~/git/popcycle
, set the R working directory with
setwd("~/git/popcycle")
If using renv
, make sure to activate the environment.
In this case we'll assume it's in the same directory as the repository.
renv::activate(".")
Then create a working directory for this user guide at user-guide-working-dir
with copies of test data files and directories.
working_dir <- "user-guide-working-dir" dir.create(working_dir) setwd(working_dir) db <- "testcruise.db" evt_dir <- "testcruise_evt" opp_dir <- "testcruise_opp" vct_dir <- "testcruise_vct" file.copy("../tests/testdata/testcruise_bare.db", "testcruise.db") dir.create("testcruise_evt") file.copy("../tests/testdata/evt_v2", "testcruise_evt", recursive = TRUE)
Load the popcycle
package. Also load dplyr
for convenient data exploration.
library(popcycle) library(dplyr)
SeaFlow instruments produce metadata files (SFL) that capture instrument metadata and ship underway data for each timepoint sampled.
This data is stored in the sfl
table of the database.
Curation and import of these files is handled by the seaflowpy Python package. This step has already been performed for our test dataset.
Use get_sfl_table(db)
to read SFL data.
sfl <- get_sfl_table(db) sfl
The first task will be to locate the 1µm reference beads and use their positions to create a set of filtering parameters. Later we'll use these parameters to filter for "optimally positioned particles" (OPP) within the water stream's virtual core.
First we'll get a Tibble of EVT files. For testing purposes some of these EVT files are invalid, so we'll subset down to just the good files.
evt_files <- get_evt_files(evt_dir, db = db) good_evt_files <- c( "2014_185/2014-07-04T00-00-02+00-00", "2014_185/2014-07-04T00-03-02+00-00", "2014_185/2014-07-04T01-15-02+00-00", "2014_185/2014-07-04T01-17-02+00-00", "2014_185/2014-07-04T01-30-02+00-00" ) evt_files <- evt_files %>% filter(file_id %in% good_evt_files) evt_files
We'll read the first EVT file and start locating beads.
For this task EVT data should be left in it's original log form (transform = F
).
evt <- readSeaflow(evt_files$path[1], transform = F) head(evt)
Use plot_cytogram()
to visualize particle data.
# plot_cytogram(DF, "D1", "D2", transform = F) # plot_cytogram(DF, "fsc_small", "D1", transform = F) # plot_cytogram(DF, "fsc_small", "D2", transform = F) plot_cytogram(evt, "fsc_small", "pe", transform = F) # plot_cytogram(DF, "fsc_small", "chl_small", transform = F)
Get D1, D2, FSC coordinates of the inflection point (1µm beads location). Click points to draw polygons around bead clusters. Right click to close each polygon.
ip <- inflection_point(evt) ip
Next, convert bead locations to filter parameters.
filter_params <- create_filter_params( inst, fsc = ip$fsc, d1 = ip$d1, d2 = ip$d2, min_d1 = min_d1, min_d2 = min_d2, width = width ) filter_params
plot_filter_cytogram(evt, filter_params)
filter_evt()
will reduce an EVT data frame down to just the focused particles located within the stream's virtual core.
Rows can be filtered down to the 2.5, 50, and 97.5 quantile error bounds using the q2.5
, q50
, and q97.5
boolean columns.
opp <- filter_evt(evt, filter_params) # Grab just the 2.5 quantile particles opp %>% filter(q2.5)
To assess the accuracy of filter parameters it can be helpful to visualize OPP data alongside the three quantiles of bead locations.
par(mfrow=c(2,2)) opp <- popcycle::filter_evt(evt, filter_params) popcycle::plot_cyt(opp, "fsc_small", "chl_small") popcycle::plot_cyt(opp, "fsc_small", "pe") abline(v = filter_params[, 'beads.fsc.small'], lty = 2,col = 2) b <- subset(opp, pe > 40000) popcycle::plot_cyt(b, "fsc_small", "D1") abline(h = filter_params[, 'beads.D1'], lty = 2, col = 2) popcycle::plot_cyt(b, "fsc_small", "D2") abline(h = filter_params[, 'beads.D2'], lty = 2, col = 2)
Finally, save the filter parameters to the database. Each set of saved parameters is assinged a universally unique identifier (UUID).
There are already two sets of filter parameters defined in this test database, so get_filter_table()
will retrieve a total of three sets of parameters with three IDs.
filter_id <- popcycle::save_filter_params(db, filter_params) print(filter_id) get_filter_table(db)
You may repeat this process and define new filter parameters for each section of the cruise with unique beads locations. For example, bead locations may change if PMT gains for forward scatter or phycoerythrin were changed.
The first step is to define a filter_plan
, a data frame or tibble that tracks when different filter parameters should be used throughout a cruise.
You can retrieve the filter plan with get_filter_plan_table()
and save a filter plan with save_filter_plan()
.
In the test database a filter plan table has already been defined with two entries.
filter_plan <- get_filter_plan_table(db) filter_plan
Add an entry to the filter plan table for the new filter parameters.
filter_plan <- bind_rows( filter_plan, tibble::tibble( start_date = lubridate::ymd_hms("2014-07-04 01:00:00"), filter_id = filter_id ) ) print(filter_plan) save_filter_plan(db, filter_plan)
And apply this plan to all files in the cruise with filter_evt_files()
.
If passed NULL
as the list of EVT files to filter, this function will filter all files that have not been filtered or that should have a different set of filter parameters applied based on the filter plan.
OPP data are saved as hourly Parquet files in opp_dir
.
Particle measurement values are saved as transformed (linearized) values on a scale of 1 - 10^3.5.
Also note that only 3-minute files with OPP data in all three quantiles are saved to hourly Parquet files, but summary records for all files processed are stored in the database opp
table.
filter_evt_files(db, evt_dir, NULL, opp_dir)
Summary information about each three minute sample of OPP data can be retrieved from the opp
table with get_opp_table()
.
Each file entry is further divided into a separate row for each filtering quantile.
For example, to filter down to only the 2.5 quantile:
get_opp_table(db) %>% filter(quantile == 2.5)
OPP data can be retrieved by date range or file ID with get_opp_by_date()
and get_opp_by_file()
.
Alternatively, each hourly Parquet file in opp_dir
can be read directly with arrow::read_parquet()
.
Now we're ready to set the gating for the different populations.
WARNINGS: 1. The order in which you gate the different populations is very important, choose it wisely. 2. The gating has to be performed over optimally positioned particles (opp), not over EVT files.
We want gating parameters that will be applicable to most opp
files from the cruise. To optimize your chance of setting the proper gating parameters, do not set them based on one randomly chosen file. Instead, merge/combine as many files as you computer can handle. In this example, we'll combine all OPP files in our test data set.
# get the list of all OPP files from the cruise opp_files <- get_opp_table(db) %>% filter(quantile == 2.5) %>% pull(file_id) opp <- get_opp_by_file(db, opp_dir, opp_files)
The first population to gate is always the beads. We use a manual gating method to classify the population based on fsc_small
and pe
.
Note: This will open an R plot for a cytogram of the OPP data. Draw a polygon around the population (Left click to draw segment, right-click to close the segment and finalize the polygon).
gates.log <- add_manual_classification(opp, "beads", "fsc_small", "pe")
Once the polygon for the beads has been drawn, we can start setting the gates for phytoplankton population:
WARNING: The order in which you gate the different populations is very important
Synechococcus
population based on fsc_small
and pe
Prochlorococcus
population based fsc_small
and chl_small
.Picoeukaryote
population based fsc_small
and chl_small
.In the case of Synechococcus
and Prochlorococcus
population, we use a semi-supervised algorithm (modified from flowDensity
package) to draw ellipses around the population using the function add_auto_classification()
.
This function breaks the cytogram plot into 4 quadrants by density. Using the parameter position = c()
, select the quadrant where you observe the population. Use the examples params provided for gates
, scale
, and min_pe
, but play around with them until you see optimal results. Make sure to append the updated classification onto the current by using the gates.log = gates.log
.
gates.log <- add_auto_classification("synecho", "fsc_small", "pe", position = c(FALSE, TRUE), gates = c(3.0, NA), scale = 0.975, min_pe = 3, gates.log = gates.log) gates.log <- add_auto_classification("prochloro", "fsc_small", "chl_small", position = c(FALSE, TRUE), gates = c(2.0, 0.5), scale = 0.95, gates.log = gates.log)
Once the parameters are defined for Synechococcus
and Prochlorococcus
population, use manual gating to cluster all the phytoplankton cells left (namely Picoeukaryote
population) using manual_classification
. Don’t worry about overlapping populations inside of your polygon. Since this is the last population we are gating, all other gated population will not be included.
gates.log <- add_manual_classification(opp, "picoeuk", "fsc_small", "chl_small", gates.log)
You can test your various gating parameters on the subset of files to ensure accuracy before saving the gating parameters. In this case, we will use the aggreagated OPP tibble of all files and display the output of the classification on a cytogram.
par(mfrow = c(4,5), mar = c(1.75, 1.5, 1.5, 1.5), oma = c(0.5, 0.5, 0.5, 0.5)) opp <- try(classify_opp(opp, gates_log)) try(plot_vct_cytogram(opp, para.x = "fsc_small", para.y = "chl_small"))
Once satisfied, save the gating in the db
. Similar to the save_filter_params
function, save_gating_params
saves the gating parameters and order in which the gating was performed. Every call to save_gating_params
creates a new gating entry in the database which can be retrieved by ID. Then, save the classification. (Note, when redo-doing classification, make sure to always set the gating params for “beads” first. If you skip this step, gates.log
will not be reset, and you classification will still contain the parameters wished to be changed).
gating_id <- save_gating_params(db, gates.log)
Batch gating of OPP data works is analogous to batch EVT filtering.
In place of a filter_plan
, there is a gating_plan
, and in place of filter_evt_files()
there is classify_opp_files()
.
Retrieve the current gating plan with get_gating_plan_table()
.
Once a new gating plan has been modified or creatd, save back to the database with save_gating_plan()
.
Apply this plan to all files in the cruise with classify_opp_files()
.
If passed NULL
as the list of OPP files to filter, this function will classify all files that have not been classified or that should have a different set of gating parameters applied based on the gating plan. VCT classifications are saved as hourly Parquet files in vct_dir
. Each file contains particle data from the source OPP hourly Parquet file as well as columns for classification and calibrated data (cell diameter, and cell carbon content).
classify_opp_files(db, opp_dir, NULL, vct_dir)
Summary information about each three minute sample of VCT classifications can be retrieved from the vct
table with get_vct_table()
.
Each file entry is further divided into a separate row for each filtering quantile and population.
VCT particle data can be retrieved by date range or file ID with get_vct_by_date()
and get_vct_by_file()
.
Alternatively, each hourly Parquet file in vct_dir
can be read directly with arrow::read_parquet()
.
Classified and calibrated data in VCT files can be converted to a lower resolution quantized or gridded format.
For a convenient command-line interface to this functionality and for an example workflow see executable_scripts/psd.R
.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.