An overview of the COPASutils package

In this example, the COPASutils package will be used to read in, process, and analyze data resulting from a Genome Wide Association Study (GWAS) using Caenorhabditis elegans nematodes and a COPAS BIOSORT large particle flow cytometer. This example assumes that the COPASutils package, along with all necessary dependencies, is installed on your local machine. To install the COPASutils package, you can use the command install.packages("COPASutils").
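
For example, from within an R session:

install.packages("COPASutils") # install from CRAN; only needs to be run once per machine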

Note: Some of the commands below will produce warnings when executed on a local machine. The warnings describe issues such as missing or strange values fed to the underlying functions and should not alarm the user. Warnings have been suppressed in this vignette for the sake of aesthetics.

The COPASutils package will first be loaded so that we can utilize the associated functions and example data:

options(width = 120)
library(COPASutils)

Reading Data

Note: A detailed walkthrough of the readPlate function is available in the vignette hosted at AndersenLab.org/Research/Software.
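
As a minimal sketch, assuming a raw COPAS data file exported from the device (the file name here is hypothetical, and readPlate accepts additional arguments, described in the walkthrough above, that control how objects are filtered):

rawPlate <- readPlate("myExperimentPlate.txt") # read a raw COPAS data file into a plate data frame
head(rawPlate)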

Training the SVM used by readPlate

The support vector machine (SVM) utilized in the readPlate function must be trained for each COPAS or BioSorter device because internal parameters vary by device. To train the SVM, run two 96-well microtiter plates through your device. The first plate should contain only objects of interest with the bubble trap engaged, and the second plate should contain no objects of interest, less liquid per well, and have the bubble trap disengaged. These plates will generate object and bubble data, respectively. A variety of R packages (like kernlab) can be used to generate an SVM object for use with the readPlate function. Once the SVM is trained, the readPlate function should be edited to utilize the new SVM object.
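
A rough sketch of this training step using kernlab is shown below. The trainingData data frame, its class column, and the deviceSVM name are hypothetical placeholders; in practice the data frame would combine the measured parameters (TOF, EXT, and the fluorescence channels) from the object and bubble plates, labeled by plate of origin:

library(kernlab)
# `trainingData` is a hypothetical data frame combining the raw measurements from the
# object plate and the bubble plate, with a `class` column labeling each row's origin
trainingData$class <- factor(trainingData$class) # "object" or "bubble"
deviceSVM <- ksvm(class ~ TOF + EXT + green + yellow + red,
                  data = trainingData, type = "C-svc", prob.model = TRUE)

The resulting deviceSVM object could then stand in for the SVM used by readPlate, as described above.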

Normalized optical values

The readPlate function will normalize the optical density and fluorescence values measured for every object. The COPAS device outputs integrated measures of each of these parameters. Therefore, larger objects always have increased optical density or fluorescence values. To better represent the optical density or fluorescence of individual objects, we divide each parameter by the length (time of flight) for every object recorded. This normalization allows the researcher to determine if objects of different size also have different optical density and fluorescence.
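
Conceptually, the normalization is a simple per-object division, sketched below using the rawPlate data frame from the earlier sketch (readPlate performs this step automatically; the snippet is only illustrative):

rawPlate$norm.EXT <- rawPlate$EXT / rawPlate$TOF # extinction normalized by object length
rawPlate$norm.red <- rawPlate$red / rawPlate$TOF # red fluorescence normalized by object length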

Processing Data

We will begin by loading one of the attached data sets, plateData1, into the variable setupRaw. This data set represents the raw data from the setup of a plate used in a Genome Wide Association Study with the worm Caenorhabditis elegans. Now that we have our raw data, we would probably like to get a nice overview of how our setup went (i.e. summary statistics by well). For instance, in the GWAS experiment from which we are examining the data, we would like three worms to be sorted into each well in every other column. We can check these sort data. To summarize the data we can use the summarizePlate function. This function takes, as parameters, an unsummarized plate data frame as well as optional boolean values (TRUE or FALSE) for the quantiles, log, and ends arguments, denoting whether these measures of the distribution should be included:

setupRaw <- plateData1
setupSummarized <- summarizePlate(setupRaw)
colnames(setupSummarized)

We now see that we have many more trait variables, many of which describe the distribution of the originally measured values by well. We can get an even more complete picture by adding in extra quantiles (quantiles argument), log transformed values of EXT and the fluorescence channels (log argument), and the minimum and maximum of each trait (ends argument), as below:

setupSummarized2 <- summarizePlate(setupRaw, quantiles=TRUE, log=TRUE, ends=TRUE)
colnames(setupSummarized2)

We now have a great deal of information describing, in detail, the summary statistics of each of the measured traits. Again, each subset of new trait values can be excluded by leaving the corresponding optional parameter (quantiles, log, or ends) set to FALSE. Each trait follows a specific naming system, wherein any mathematical transformation applied to the original data is added to the beginning of the trait name. For instance, the mean of all of the time of flight data (TOF) for each well can be found in the column mean.TOF. If we wanted the column corresponding to the 25th percentile of the log-transformed extinction data (EXT), we could find it in the column named q25.log.EXT. All of the naming conventions are outlined in the table below:

Statistic | Abbreviation | Example
:---: | :---: | :---:
mean | mean | mean.TOF
median | median | median.EXT
minimum | min | min.yellow
maximum | max | max.green
normalized to time of flight | norm | mean.norm.red
quantiles | qXX | q25.red
log transformed data | log | mean.log.red

Some statistics, such as the normalized and log-transformed values, are calculated before the data are summarized and, as such, have a distribution of their own. For these traits, all mean, median, min, max, and quantile data are available by stringing together names as in the above table.
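
For example, individual summary columns can be accessed directly by these names:

head(setupSummarized2$mean.TOF) # mean time of flight for each well
head(setupSummarized2$q25.log.EXT) # 25th percentile of the log-transformed extinction data for each well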

In the experiment we are examining here, it was desired that no worms be sorted into the wells in the even columns. Therefore, the data we are seeing in these columns are the result of background debris or accidental placement of worms into the wells. We now want to remove these wells before we continue with our analysis. Included in the package is the removeWells function, which does exactly what its name implies:

setupSummarized3 <- removeWells(setupSummarized, c("A2", "A4", "A6")) # All wells you want to remove need to be included here as a vector
head(setupSummarized3[,1:10])

The removeWells function takes as input a summarized plate data frame, as well as a vector of strings corresponding to the wells to be removed, and returns the data frame with all phenotype data for those wells set to NA. An optional drop parameter specifies whether to "NA out" the trait data or to drop those rows from the frame entirely:

setupSummarized4 <- removeWells(setupSummarized, c("A2", "A4", "A6"), drop=TRUE)
head(setupSummarized4[,1:10])

The converse of the removeWells function is the fillWells function. This function fills the trait data for any wells missing from the selected data frame with NAs. If we want to fill the wells back in from our example above, where the rows were dropped from the data frame, we could use the following command:

setupSummarized5 <- fillWells(setupSummarized4)
head(setupSummarized5[,1:10])

One issue when working with 96-well plates is that of placement within the plate. Edge wells, being exposed to more direct dry air currents, may experience greater evaporation, which may have an effect on different traits in those wells. To test this hypothesis, we can utilize the edgeEffect function:

edgeEffect(setupSummarized, "n")

This function takes a summarized plate data frame and the name of the trait to test. In this instance, we tested our setup plate for any effect with respect to the number of worms in each well. The function splits the plate population by wells found on the perimeter of the plate and those found on the interior, then performs a Wilcoxon Rank-Sum test between the two populations for the specified trait. The resultant p-value is returned. Since the returned p-value exceeds our significance threshold of 0.05, we fail to reject the null hypothesis that the two populations are drawn from the same distribution. If we want to simultaneously test all traits, we can simply not specify a trait to be tested. A data frame of all traits and associated p-values will be returned.
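
For example, omitting the trait argument tests every trait at once:

allEdgeEffects <- edgeEffect(setupSummarized) # returns a data frame of traits and associated p-values
head(allEdgeEffects)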

Plotting Data

Now that we have access to both the unsummarized and summarized data, we would like to visualize the results from this plate. The plate in this example was set up with worms in every other column. To confirm that large populations only exist in every other column, we will plot a heat map of the plate representing the population present in each well using the plotTrait function. This function takes a summarized or unsummarized data frame, as well as a string indicating which trait to plot and, optionally, the type of plot to make:

plotTrait(setupSummarized, "n")

We can see that the larger populations of worms are, in fact, only present in every other well, row-wise. We can also see wells that are missing data points, as they are colored gray and are missing the numerical annotation. The returned plot is a ggplot2 object and as such can be manipulated using standard ggplot2 functions and grammar. This extensibility is true for all of the following plotting functions as well. For instance, we can add a title as such:

plotTrait(setupSummarized, "n") + ggtitle("Example Heatmap")

We are not simply limited to heat maps, however. By plotting the raw data, we can get a better feel for the distributions of the traits. We can plot a histogram of the values in each well:

plotTrait(setupRaw, "TOF", type="hist")

Or we can plot a scatter plot between two traits:

plotTrait(setupRaw, "TOF", "EXT", type="scatter")

We may also want to compare how traits differ across plates. Included in the package is a function called plotCompare that accomplishes this task. Here, we'll compare the distributions of the time-of-flight values between the data in setupRaw and a new plate called scoreRaw. These two plates will be entered as a list for the first argument of the function. Then, we'll specify the trait to be compared (TOF in this case). Finally, we'll enter a vector of the plate names as the optional third argument:

scoreRaw <- plateData2
plotCompare(list(setupRaw, scoreRaw), "TOF", plateNames=c("Setup", "Score"))

We can see that side-by-side box plots are plotted for the values in each well. Likewise, we can compare the summarized values between plates by feeding in summarized plate data:

scoreSummarized <- summarizePlate(scoreRaw)
plotCompare(list(setupSummarized, scoreSummarized), "mean.TOF", plateNames=c("Setup", "Score"))

In addition, we can also check for correlation between traits both within and across plates using a function called plotCorMatrix. This function will plot a correlation heat map between all of the traits either within or between summarized plates. Here, we will examine correlations between traits within the summarized setup data:

plotCorMatrix(setupSummarized)

In the above matrix we can see that some traits are positively correlated, some are negatively correlated, and some are completely uncorrelated. We can also examine these patterns between plates as such:

plotCorMatrix(setupSummarized, scoreSummarized)

We can now see an instance where all values of a trait (n.sorted) were equivalent and therefore the correlation could not be calculated. Tiles for these traits are all set to gray.

If we now transition to some new data, representing an experiment in which drug dose response curves were measured, we can utilize the last plot function in the package, plotDR. We will first load our example data set, called doseData, into the variable dosesRaw. We will then summarize these data, filling in the strains argument with a 96-element vector corresponding to the names of the strains across the plate, row-wise.

dosesRaw <- doseData
strains <- rep(c("Strain 1", "Strain 2", "Strain 3", "Strain 4"), each=6, times=4)
dosesSummarized <- summarizePlate(dosesRaw, strains)
doses <- rep(c(0,2.5,5,10,20,NA), times=16)
plotDR(dosesSummarized, dosages=doses, trait="mean.TOF")

We can see that we now need to include the strains vector when summarizing the data, as well as a dosages vector when plotting the dose response. We might also like to see how the strains vary across every trait. We can generate a list of ggplot2 objects using the plotDR_allTraits function:

plotList <- plotDR_allTraits(dosesSummarized, dosages=doses)

We can even access each plot by name using the scheme below:

plotList$median.red
plotList$mean.EXT

Conclusion

The COPASutils package provides quick, streamlined tools for reading, processing, and plotting data resulting from COPAS platform machines. Here, we have analyzed data from several different experiments. Of course, analysis pipelines will vary from project to project, and the pipeline described here may not be the best fit for your data. COPASutils provides a general and flexible workflow for COPAS data and should be easily adaptable to your specific project.


