suppressPackageStartupMessages(library(tidyverse))
knitr::opts_chunk$set(warning=FALSE, message=FALSE)
options(tibble.width = Inf) # Print all columns of tibbles

This vignette uses the phenoptrExamples sample data and functions from the tidyverse to demonstrate reading and processing cell seg data from multiple fields and samples.


Read multiple data files

Use list_cell_seg_files and purrr::map_df to read all cell seg data files in a single directory into a single data_frame. The result is similar to reading an inForm merge table.

Find all cell seg data files in a directory

list_cell_seg_files takes a directory path as an argument and returns a list of paths to all the cell_seg_data.txt files in a directory.

library(phenoptr)
library(tidyverse)
base_path <- system.file("extdata", "samples", package = "phenoptrExamples")
paths <- list_cell_seg_files(base_path)
length(paths)
paths[1]

Read and combine files

purrr::map_df applies read_cell_seg_data to each path in paths. The data_frames returned from each call to read_cell_seg_data are combined row-wise to create a single merged data_frame. The result is similar to an inForm merge data file.

csd <- purrr::map_df(paths, read_cell_seg_data)
dim(csd)

Using table is one way to summarize the data by Sample Name or Slide ID. This data comes from nine fields taken of three slides.

table(csd$`Sample Name`, csd$Phenotype) %>% addmargins(2, list(Total=sum))
table(csd$`Slide ID`, csd$Phenotype) %>% addmargins(2, list(Total=sum))

Compute on merged data

Merged data may come from an inForm merge or from combining individual files as shown above. In either case, the data includes entries which come from multiple fields. Computing on merged data requires a few new techniques.

Summarize per slide

Use dplyr::group_by and dplyr::summarize to compute summary statistics for all fields in a slide. For finer grouping, use multiple arguments to group_by. Use dplyr::filter to select a particular phenotype tissue category.

This example computes the mean PDL1 expression for CD68+ and CK+ cells in Tumor, with the mean computed per Slide ID.

csd %>% 
  filter(`Tissue Category`=='Tumor', Phenotype %in% c('CD68+', 'CK+')) %>% 
  group_by(`Slide ID`, Phenotype) %>% 
  summarize(Mean_PDL1=mean(`Entire Cell PDL1 (Opal 520) Mean`))

Add distance columns to merged data

Nearest-neighbor distances must be computed per-sample because the X/Y coordinates reported in cell seg data files are all relative to the top-left of the sample.

Use dplyr::group_by to aggregate across subsets of a full data set. In this case, we want to group by Sample Name. Within each group, use dplyr::do to call find_nearest_distance to compute the distance columns and dplyr::bind_cols to combine them with the original data.

# Use the same list of phenotypes for each sample
phenos <- unique(csd$Phenotype)
csd <- csd %>%
  group_by(`Sample Name`) %>%
  do(bind_cols(., find_nearest_distance(., phenos)))
dim(csd)
tail(names(csd), 5)
# Use cached values
distances <- read_csv('nearest_distance.csv')
csd <- bind_cols(csd, distances)
dim(csd)
tail(names(csd), 5)

Average distance per sample

The next example uses group_by, filter and summarize again to compute the average distance from a tumor cell (CK+) to the nearest macrophage (CD68+), with the averages computed per Slide ID.

csd %>% group_by(`Slide ID`) %>% 
  filter(Phenotype=='CK+') %>% # Only tumor cells
  summarize(mean_dist_to_CD68=round(mean(`Distance to CD68+`), 2))

Compute count_within for each field in merged data

count_within is another function that must be computed per field.

This example uses dplyr::group_by and dplyr::do to call count_within for each field in a merged data file. The result is a data_frame with one row per radius per field.

Including Slide ID in the group_by arguments doesn't change the grouping, it causes Slide ID to be included in the result. This is helpful for further aggregation.

See the section Aggregate counts and means per slide below for an example which aggregates counts per slide.

csd %>% group_by(`Slide ID`, `Sample Name`) %>% 
  do(count_within(., from='CK+', to='CD68+', radius=15))

Aggregate count_within across samples

Compute counts and averages

Use count_within_batch to count cells within a radius for multiple tissue categories, phenotypes and fields when the fields have not been merged.

This example counts CK+ cells having a CD8+ cell within 10 or 25 microns, and CK+ cells with a CD68+ cell within 10 or 25 microns. dplyr::glimpse gives a compact summary of the data.

base_path <- system.file("extdata", "samples", package = "phenoptrExamples")
pairs <- list(c('CK+', 'CD8+'),
             c('CK+', 'CD68+'))
radius <- c(10, 25)
counts <- count_within_batch(base_path, pairs, radius, verbose=FALSE) %>% 
  select(-source, -category) # Remove unneeded columns

glimpse(counts)
# count_with_batch is slow, we cheat by caching the result
counts <- read_csv('count_within_batch.csv')
glimpse(counts)

Aggregate counts and means per slide

Aggregating from_count, to_count and from_with across fields is straightforward, it only requires simple sums. Aggregating within_mean requires computing the underlying count of cells within the radius, summing, and computing a new mean.

(Note: the value of from_count * within_mean is not reported by count_with because it may count cells multiple times.)

counts_per_sample <- counts %>% group_by(slide_id, from, to, radius) %>% 
    summarize(from_count=sum(from_count),
              to_count=sum(to_count),
              from_with=sum(from_with),
              within=sum(from_count*within_mean),
              within_mean=within/from_count) %>%
  ungroup %>% select(-within)
counts_per_sample


PerkinElmer/phenoptrExamples documentation built on May 14, 2019, 2:36 p.m.