suppressPackageStartupMessages(library(tidyverse)) knitr::opts_chunk$set(warning=FALSE, message=FALSE) options(tibble.width = Inf) # Print all columns of tibbles
This vignette uses the phenoptrExamples
sample data
and functions from the tidyverse to demonstrate
reading and processing cell seg data from multiple fields and samples.
Use list_cell_seg_files
and purrr::map_df
to read all
cell seg data files in a single directory into a single data_frame
.
The result is
similar to reading an inForm merge table.
list_cell_seg_files
takes a directory path as an argument and
returns a list of paths to all the cell_seg_data.txt
files in a directory.
library(phenoptr) library(tidyverse) base_path <- system.file("extdata", "samples", package = "phenoptrExamples") paths <- list_cell_seg_files(base_path) length(paths) paths[1]
purrr::map_df
applies read_cell_seg_data
to each path in paths
. The
data_frame
s returned from each
call to read_cell_seg_data
are combined row-wise to create a single
merged data_frame
. The result is similar to an inForm merge data file.
csd <- purrr::map_df(paths, read_cell_seg_data) dim(csd)
Using table
is one way to summarize the data by Sample Name
or Slide ID
.
This data comes from nine fields taken of three slides.
table(csd$`Sample Name`, csd$Phenotype) %>% addmargins(2, list(Total=sum)) table(csd$`Slide ID`, csd$Phenotype) %>% addmargins(2, list(Total=sum))
Merged data may come from an inForm merge or from combining individual files as shown above. In either case, the data includes entries which come from multiple fields. Computing on merged data requires a few new techniques.
Use dplyr::group_by
and dplyr::summarize
to compute summary statistics for all fields
in a slide. For finer grouping, use multiple arguments to group_by
.
Use dplyr::filter
to select a particular phenotype tissue category.
This example computes the mean PDL1 expression for CD68+
and CK+
cells
in Tumor
, with the mean computed per Slide ID
.
csd %>% filter(`Tissue Category`=='Tumor', Phenotype %in% c('CD68+', 'CK+')) %>% group_by(`Slide ID`, Phenotype) %>% summarize(Mean_PDL1=mean(`Entire Cell PDL1 (Opal 520) Mean`))
Nearest-neighbor distances must be computed per-sample because the X/Y coordinates reported in cell seg data files are all relative to the top-left of the sample.
Use dplyr::group_by
to aggregate across subsets of a full
data set. In this case, we want to group by Sample Name
. Within each
group, use dplyr::do
to call find_nearest_distance
to compute the distance
columns and dplyr::bind_cols
to combine them with the original data.
# Use the same list of phenotypes for each sample phenos <- unique(csd$Phenotype) csd <- csd %>% group_by(`Sample Name`) %>% do(bind_cols(., find_nearest_distance(., phenos))) dim(csd) tail(names(csd), 5)
# Use cached values distances <- read_csv('nearest_distance.csv') csd <- bind_cols(csd, distances) dim(csd) tail(names(csd), 5)
The next example uses group_by
, filter
and summarize
again to
compute the average distance from a tumor cell
(CK+
) to the nearest macrophage (CD68+
), with the averages computed
per Slide ID
.
csd %>% group_by(`Slide ID`) %>% filter(Phenotype=='CK+') %>% # Only tumor cells summarize(mean_dist_to_CD68=round(mean(`Distance to CD68+`), 2))
count_within
for each field in merged datacount_within
is another function that must be computed per field.
This example uses dplyr::group_by
and dplyr::do
to call
count_within
for each field in a merged data file.
The result is a data_frame
with one row per radius per field.
Including Slide ID
in the group_by
arguments doesn't change the grouping,
it causes Slide ID
to be included in the result. This is helpful for
further aggregation.
See the section Aggregate counts and means per slide below for an example which aggregates counts per slide.
csd %>% group_by(`Slide ID`, `Sample Name`) %>% do(count_within(., from='CK+', to='CD68+', radius=15))
count_within
across samplesUse count_within_batch
to count cells within a radius for
multiple tissue categories, phenotypes and fields when the fields have not
been merged.
This example counts CK+
cells having a CD8+
cell within 10 or 25 microns,
and CK+
cells with a CD68+
cell within 10 or 25 microns.
dplyr::glimpse
gives a compact summary of the data.
base_path <- system.file("extdata", "samples", package = "phenoptrExamples") pairs <- list(c('CK+', 'CD8+'), c('CK+', 'CD68+')) radius <- c(10, 25) counts <- count_within_batch(base_path, pairs, radius, verbose=FALSE) %>% select(-source, -category) # Remove unneeded columns glimpse(counts)
# count_with_batch is slow, we cheat by caching the result counts <- read_csv('count_within_batch.csv') glimpse(counts)
Aggregating from_count
, to_count
and from_with
across fields is straightforward, it only requires
simple sums. Aggregating within_mean
requires computing the underlying
count of cells within the radius, summing, and computing a new mean.
(Note: the value of from_count * within_mean
is not reported by
count_with
because it may count cells multiple times.)
counts_per_sample <- counts %>% group_by(slide_id, from, to, radius) %>% summarize(from_count=sum(from_count), to_count=sum(to_count), from_with=sum(from_with), within=sum(from_count*within_mean), within_mean=within/from_count) %>% ungroup %>% select(-within) counts_per_sample
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.