Install

GitHub

Use devtools::install_github() to install directly from the GitHub repository.

devtools::install_github("seaflow-uw/popcycle", ref = <ref>)

where <ref> can be a commit, tag, or branch.
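
For example, to install the version tagged 4.0.0:

devtools::install_github("seaflow-uw/popcycle", ref = "4.0.0")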

renv

To install this package using a reproducible environment with renv, first clone this repository, then restore the project from the renv.lock file. Note: use the same version of R that was used to create the renv.lock file, which you can find near the top of the file.

git clone https://github.com/seaflow-uw/popcycle
cd popcycle
# optionally switch to the desired ref, e.g. git checkout -b 4.0.0 4.0.0
cp renv-Rprofile .Rprofile
R -e 'renv::restore()'
R CMD INSTALL .

Now, to use this environment, either start R from this directory, which will load the .Rprofile file automatically, or run renv::activate("/path-to-git-repo") from a different directory.
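
For example, if the repository was cloned to ~/git/popcycle:

renv::activate("~/git/popcycle")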

Docker

Docker images are available at https://hub.docker.com/r/ctberthiaume/popcycle. The latest images use package versions from renv.lock and include RStudio. See rocker/rstudio for details on running with RStudio launched in a web browser.

docker pull ctberthiaume/popcycle:<version_tag>
docker run -it --rm ctberthiaume/popcycle:<version_tag>
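
To run with RStudio served in a web browser, a hedged example following the rocker/rstudio conventions (RStudio listens on port 8787 and requires a PASSWORD environment variable):

docker run -d --rm -p 8787:8787 -e PASSWORD=<password> ctberthiaume/popcycle:<version_tag>

Then open http://localhost:8787 in a browser and log in as user rstudio with the password you chose.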

Setup for this guide

This guide will use the test SeaFlow data included in this repository in the tests/testdata directory. The rest of the example code in this guide assumes you have set the working directory for R to the root of the repository. For example, if you cloned the repo to ~/git/popcycle, set the R working directory with

setwd("~/git/popcycle")

If using renv, make sure to activate the environment. In this case we'll assume it's in the same directory as the repository.

renv::activate(".")

Then create a working directory for this user guide at user-guide-working-dir with copies of test data files and directories.

working_dir <- "user-guide-working-dir"
dir.create(working_dir)
setwd(working_dir)

db <- "testcruise.db"
evt_dir <- "testcruise_evt"
opp_dir <- "testcruise_opp"
vct_dir <- "testcruise_vct"

file.copy("../tests/testdata/testcruise_bare.db", "testcruise.db")
dir.create("testcruise_evt")
file.copy("../tests/testdata/evt_v2", "testcruise_evt", recursive = TRUE)

Load the popcycle package. Also load dplyr for convenient data exploration.

library(popcycle)
library(dplyr)

Import SFL metadata files

SeaFlow instruments produce metadata files (SFL) that capture instrument metadata and ship underway data for each timepoint sampled. This data is stored in the sfl table of the database. Curation and import of these files are handled by the seaflowpy Python package. This step has already been performed for our test dataset. Use get_sfl_table(db) to read SFL data.

sfl <- get_sfl_table(db)
sfl

Filtering

Define filter parameters

The first task will be to locate the 1µm reference beads and use their positions to create a set of filtering parameters. Later we'll use these parameters to filter for "optimally positioned particles" (OPP) within the water stream's virtual core.

Read EVT data

First we'll get a tibble of EVT files. For testing purposes some of these EVT files are invalid, so we'll subset down to just the good files.

evt_files <- get_evt_files(evt_dir, db = db)
good_evt_files <- c(
  "2014_185/2014-07-04T00-00-02+00-00",
  "2014_185/2014-07-04T00-03-02+00-00",
  "2014_185/2014-07-04T01-15-02+00-00",
  "2014_185/2014-07-04T01-17-02+00-00",
  "2014_185/2014-07-04T01-30-02+00-00"
)
evt_files <- evt_files %>% filter(file_id %in% good_evt_files)
evt_files

We'll read the first EVT file and start locating beads. For this task EVT data should be left in its original log form (transform = F).

evt <- readSeaflow(evt_files$path[1], transform = F)
head(evt)

Plot EVT data

Use plot_cytogram() to visualize particle data.

# plot_cytogram(evt, "D1", "D2", transform = F)
# plot_cytogram(evt, "fsc_small", "D1", transform = F)
# plot_cytogram(evt, "fsc_small", "D2", transform = F)
plot_cytogram(evt, "fsc_small", "pe", transform = F)
# plot_cytogram(evt, "fsc_small", "chl_small", transform = F)

Define the inflection point and filter parameters

Get the D1, D2, and FSC coordinates of the inflection point (the 1µm bead location). Click points to draw polygons around bead clusters; right-click to close each polygon.

ip <- inflection_point(evt)
ip

Next, convert bead locations to filter parameters. Note that inst (the instrument serial number) and the min_d1, min_d2, and width filtering parameters are not defined in this guide; supply values appropriate to your instrument and cruise.

filter_params <- create_filter_params(
  inst,
  fsc = ip$fsc,
  d1 = ip$d1,
  d2 = ip$d2,
  min_d1 = min_d1,
  min_d2 = min_d2,
  width = width
)
filter_params

Plot summary of filter parameters for one EVT file

plot_filter_cytogram(evt, filter_params)

Filter one EVT file

filter_evt() will reduce an EVT data frame down to just the focused particles located within the stream's virtual core. Rows can be filtered down to the 2.5, 50, and 97.5 quantile error bounds using the q2.5, q50, and q97.5 boolean columns.

opp <- filter_evt(evt, filter_params)
# Grab just the 2.5 quantile particles
opp %>% filter(q2.5)

Visualize OPP with filter parameters

To assess the accuracy of filter parameters it can be helpful to visualize OPP data alongside the three quantiles of bead locations.

par(mfrow = c(2, 2))
opp <- popcycle::filter_evt(evt, filter_params)

popcycle::plot_cyt(opp, "fsc_small", "chl_small")

popcycle::plot_cyt(opp, "fsc_small", "pe")
abline(v = filter_params[, 'beads.fsc.small'], lty = 2, col = 2)

b <- subset(opp, pe > 40000)
popcycle::plot_cyt(b, "fsc_small", "D1")
abline(h = filter_params[, 'beads.D1'], lty = 2, col = 2)

popcycle::plot_cyt(b, "fsc_small", "D2")
abline(h = filter_params[, 'beads.D2'], lty = 2, col = 2)

Save filter parameters to a database file

Finally, save the filter parameters to the database. Each set of saved parameters is assigned a universally unique identifier (UUID).

There are already two sets of filter parameters defined in this test database, so get_filter_table() will retrieve a total of three sets of parameters with three IDs.

filter_id <- popcycle::save_filter_params(db, filter_params)
print(filter_id)

get_filter_table(db)

You may repeat this process and define new filter parameters for each section of the cruise with unique bead locations. For example, bead locations may change if PMT gains for forward scatter or phycoerythrin were changed.

Batch filtering many files

Filter plan

The first step is to define a filter_plan, a data frame or tibble that tracks when different filter parameters should be used throughout a cruise. You can retrieve the filter plan with get_filter_plan_table() and save a filter plan with save_filter_plan(). In the test database a filter plan table has already been defined with two entries.

filter_plan <- get_filter_plan_table(db)
filter_plan

Add an entry to the filter plan table for the new filter parameters.

filter_plan <- bind_rows(
  filter_plan,
  tibble::tibble(
    start_date = lubridate::ymd_hms("2014-07-04 01:00:00"),
    filter_id = filter_id
  )
)
print(filter_plan)
save_filter_plan(db, filter_plan)

Filter all EVT files in a cruise

Apply this plan to all files in the cruise with filter_evt_files(). If passed NULL as the list of EVT files to filter, this function will filter all files that have not yet been filtered or that should have a different set of filter parameters applied based on the filter plan. OPP data are saved as hourly Parquet files in opp_dir. Particle measurement values are saved as transformed (linearized) values on a scale of 1 to 10^3.5. Also note that only 3-minute files with OPP data in all three quantiles are saved to hourly Parquet files, but summary records for all files processed are stored in the database opp table.

filter_evt_files(db, evt_dir, NULL, opp_dir)

Read OPP data

OPP table

Summary information about each three-minute sample of OPP data can be retrieved from the opp table with get_opp_table(). Each file entry is further divided into a separate row for each filtering quantile. For example, to filter down to only the 2.5 quantile:

get_opp_table(db) %>% filter(quantile == 2.5)

OPP data

OPP data can be retrieved by date range or file ID with get_opp_by_date() and get_opp_by_file(). Alternatively, each hourly Parquet file in opp_dir can be read directly with arrow::read_parquet().
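
A minimal sketch of reading one hourly Parquet file directly (file names depend on the cruise and hour; get_opp_by_file() is demonstrated in the Classification section below):

# List hourly OPP Parquet files and read the first one
opp_parquet <- list.files(opp_dir, pattern = "\\.parquet$", full.names = TRUE, recursive = TRUE)
opp_hour <- arrow::read_parquet(opp_parquet[1])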

Classification

Configure gating parameters

Now we're ready to set the gating for the different populations.

WARNINGS:

  1. The order in which you gate the different populations is very important; choose it wisely.
  2. The gating must be performed on optimally positioned particles (OPP), not on EVT files.

We want gating parameters that will be applicable to most OPP files from the cruise. To optimize your chance of setting the proper gating parameters, do not set them based on one randomly chosen file. Instead, merge as many files as your computer can handle. In this example, we'll combine all OPP files in our test data set.

# get the list of all OPP files from the cruise
opp_files <- get_opp_table(db) %>%
  filter(quantile == 2.5) %>%
  pull(file_id)

opp <- get_opp_by_file(db, opp_dir, opp_files)

The first population to gate is always the beads. We use a manual gating method to classify the population based on fsc_small and pe.

Note: This will open an R plot with a cytogram of the OPP data. Draw a polygon around the population (left-click to draw each segment, right-click to close and finalize the polygon).

gates.log <- add_manual_classification(opp, "beads", "fsc_small", "pe")

[Gating cytogram for beads]

Once the polygon for the beads has been drawn, we can start setting the gates for the phytoplankton populations:

WARNING: The order in which you gate the different populations is very important.

  1. Start by gating the Synechococcus population based on fsc_small and pe.
  2. Next, gate the Prochlorococcus population based on fsc_small and chl_small.
  3. Finally, gate the Picoeukaryote population based on fsc_small and chl_small.

For the Synechococcus and Prochlorococcus populations, we use a semi-supervised algorithm (modified from the flowDensity package) to draw ellipses around each population using the function add_auto_classification().

This function breaks the cytogram plot into four quadrants by density. Using the parameter position = c(), select the quadrant where you observe the population. Use the example values provided for gates, scale, and min_pe as starting points, and adjust them until you see optimal results. Make sure to append each updated classification onto the current log by passing gates.log = gates.log.

gates.log <- add_auto_classification("synecho", "fsc_small", "pe",
                                     position = c(FALSE, TRUE), gates = c(3.0, NA),
                                     scale = 0.975, min_pe = 3, gates.log = gates.log)

gates.log <- add_auto_classification("prochloro", "fsc_small", "chl_small",
                                     position = c(FALSE, TRUE), gates = c(2.0, 0.5),
                                     scale = 0.95, gates.log = gates.log)

Once the parameters are defined for the Synechococcus and Prochlorococcus populations, use manual gating with add_manual_classification() to cluster the remaining phytoplankton cells (namely the Picoeukaryote population). Don't worry about overlapping populations inside your polygon; since this is the last population we are gating, particles already assigned to earlier populations will not be included.

gates.log <- add_manual_classification(opp, "picoeuk", "fsc_small", "chl_small", gates.log)

Gate one OPP data frame / tibble

You can test your gating parameters on a subset of files to ensure accuracy before saving them. In this case, we will use the aggregated OPP tibble of all files and display the output of the classification on a cytogram.

par(mfrow = c(4,5), mar = c(1.75, 1.5, 1.5, 1.5), oma = c(0.5, 0.5, 0.5, 0.5))
opp <- try(classify_opp(opp, gates.log))
try(plot_vct_cytogram(opp, para.x = "fsc_small", para.y = "chl_small"))

Save gating parameters to database file

Once satisfied, save the gating parameters to the database. Like save_filter_params(), save_gating_params() saves the gating parameters and the order in which the gating was performed. Every call to save_gating_params() creates a new gating entry in the database, which can be retrieved by ID. (Note: when redoing classification, always set the gating parameters for "beads" first. If you skip this step, gates.log will not be reset, and your classification will still contain the parameters you wished to change.)

gating_id <- save_gating_params(db, gates.log)

Batch gating many files

Batch gating of OPP data is analogous to batch EVT filtering. In place of a filter_plan there is a gating_plan, and in place of filter_evt_files() there is classify_opp_files().

Gating plan

Retrieve the current gating plan with get_gating_plan_table(). Once a gating plan has been created or modified, save it back to the database with save_gating_plan(), as sketched below.
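
A sketch by analogy with the filter plan above; the gating plan's ID column name (gating_id) is an assumption mirroring the filter plan's filter_id:

gating_plan <- get_gating_plan_table(db)
gating_plan <- bind_rows(
  gating_plan,
  tibble::tibble(
    start_date = lubridate::ymd_hms("2014-07-04 01:00:00"),
    gating_id = gating_id  # assumed column name; gating_id was returned by save_gating_params() above
  )
)
save_gating_plan(db, gating_plan)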

Gate all OPP data in a cruise

Apply this plan to all files in the cruise with classify_opp_files(). If passed NULL as the list of OPP files to classify, this function will classify all files that have not yet been classified or that should have a different set of gating parameters applied based on the gating plan. VCT classifications are saved as hourly Parquet files in vct_dir. Each file contains particle data from the source OPP hourly Parquet file as well as columns for classification and calibrated data (cell diameter and carbon content).

classify_opp_files(db, opp_dir, NULL, vct_dir)

Read VCT data

VCT table

Summary information about each three-minute sample of VCT classifications can be retrieved from the vct table with get_vct_table(). Each file entry is further divided into a separate row for each filtering quantile and population.
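
For example, mirroring the OPP table query above:

get_vct_table(db) %>% filter(quantile == 2.5)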

VCT particle data

VCT particle data can be retrieved by date range or file ID with get_vct_by_date() and get_vct_by_file(). Alternatively, each hourly Parquet file in vct_dir can be read directly with arrow::read_parquet().
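
As with OPP data, a minimal sketch of reading one hourly VCT Parquet file directly:

# List hourly VCT Parquet files and read the first one
vct_parquet <- list.files(vct_dir, pattern = "\\.parquet$", full.names = TRUE, recursive = TRUE)
vct_hour <- arrow::read_parquet(vct_parquet[1])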

Particle size distribution

Classified and calibrated data in VCT files can be converted to a lower resolution quantized or gridded format. For a convenient command-line interface to this functionality and for an example workflow see executable_scripts/psd.R.


