README.md
In exaexa/DiffSOM: Comparative analysis using EmbedSOM

DiffSOM

A nice power-user-friendly utility for EmbedSOM (http://bioinfo.uochb.cas.cz/embedsom/).

First, install FlowSOM from Bioconductor, using the code snippet from here: https://www.bioconductor.org/packages/release/bioc/html/FlowSOM.html

Then you can install EmbedSOM and DiffSOM using devtools:

devtools::install_github('exaexa/EmbedSOM')
devtools::install_github('exaexa/DiffSOM')

To start out, please see the documentation of the following functions. Using the standard R help (?Load) should give you sufficient examples to run the analysis. The example below follows the same structure. The functions should be run roughly in this order:

Load for gathering the files from disk
Embed for computing the embedding and clustering
Plot for getting overview of the result
PlotDiff for getting overview of sample differences
PlotClusters for getting cleaner picture of clusters
PlotExprs for a view of actual expression values (NEW!)
Gate for selecting subpopulations
PlotSignificance and GetProbs for getting a bit of statistics about the subpopulations

Most of extra parameters (not yet documented) are named similarly as in FlowSOM or in R standard libraries; they are usually just forwarded to corresponding FlowSOM/R functions.

We demonstrate the functionality on data from study of Michelle Makiya (see https://www.ncbi.nlm.nih.gov/pubmed/29885264), downloadable from FlowRepository http://flowrepository.org/id/FR-FCM-ZYND.

First, download all the files en masse using Unix shell, and start R:

mkdir diffsom-test
cd diffsom-test
wget -r -np --content-disposition http://flowrepository.org/experiments/1773/download_ziped_files
find -iname '*.fcs' -exec mv {} . \;
rm -fr flowrepository.org/
R

1. Load the files

We now want to load DiffSOM (assuming it is installed using devtools, as shown above), gather the interesting files (we will compare Baseline_None vs. 12min_VH, 6 files from each) and load them to DiffSOM:

library(DiffSOM)
baseline_files <- dir(pattern='^Patient_.*_Baseline_None.fcs$')
vh_files <- dir(pattern='^Patient_.*_120min_VH.fcs$')
ds <- Load(c(baseline_files, vh_files),
           transform=T, toTransform=c(7:11),
       cells=300000)

We have reduced the cell count to 300k to speed up the preliminary analysis, final analysis may be run on much more cells (the slowdown caused by increased cell count is deterministic and linear, i.e., using 1 million of cells will make all computations roughly 3.3x slower).

2. Run the embedding and clustering

Let's run the embedding and get some pictures about what the result looks like. We additionally fix the marker names to something tangible:

# optional: fix the column names in the loaded FlowSOM object
ds$fs$prettyColnames[7:11] <- c('CCR1', 'SIGLEC8', 'CXCR4', 'CD45', 'CCR3')
colnames(ds$fs$data)[7:11] <- c('CCR1', 'SIGLEC8', 'CXCR4', 'CD45', 'CCR3')

# embedding (put extra importance on CD45 marker)
ds <- Embed(ds, colsToUse=c(1,4,7:11),
            xdim=16, ydim=16,
        importance=c(1,1,1,1,1,2,1))

# verifying that the data is OK
PlotExprs(ds, files=T)
PlotExprs(ds, clusters=T)

#plotting
png('out1.png', 900, 900)
Plot(ds)
dev.off()

The Embed function does the main part of the computation. In this case it takes around 15 seconds on a modest laptop.

This produces the following nice image.

Plot of all cells

Optional: Export the results

ExportData function that can be easily used to create a data frame with all relevant information. That is pretty useful in many situations; most notably, it integrates well with ggplot and allows easy export to various formats.

ExportData(ds)

Additionally to the marker expressions from original files, the cells in the exported data frame also have File parameter with the number of the file where each cell originated from, dimensions EmbedSOM1 and EmbedSOM2 with the embedding, dimensions Precluster1 and Precluster2 that describe the mapping of the cells to the SOM (also, there's a slightly redundant Precluster with the raw SOM vertex ID), and, finally, the Metacluster column with the assigned metacluster ID.

Note: Previously, there was a function ExportFCS function for saving the contents of your DiffSOM object (ds, or any other, like ds2 produced in the next step) to FCS format. That was removed because saving stuff to FCS is too error-prone to be simplified to a single function (moreover, the FCS standard specification is so unhelpful that I have exactly zero idea about how to write the file correctly). You can still get the original functionality using this code:

flowCore::write.FCS('exported.fcs',
                    x=new('flowFrame',
                  exprs=as.matrix(ExportData(ds))))

3. Gate out subpopulations

Let us pretend that we are interested in the CD45-negative part of the cells, which is formed by clusters 9 and 10 in this case. Following code first shows a better picture of the clusters, so that you can verify whether 9 and 10 are really the desired clusters, then it gates the clusters out and saves the subpopulation to DiffSOM object ds2:

PlotClusters(ds, alpha=.3)
# check out the cluster images

ds2 <- Gate(ds, clusters=c(9,10))
ds2 <- Embed(ds2, xdim=16, ydim=16, colsToUse=c(1,4,7,8,9,11))
png('out2.png', 1200, 600)
Plot(ds2)
dev.off()

Plot of CD45- subset

4. Observe differences

Now let's see how this population differs among the samples. Also, we use larger points for plotting to make the relatively low cell number more visible:

png('out2d.png', 1200, 900)
PlotDiff(ds2, pch=20, cex=0.3)
dev.off()

Plot of CD45- subset in files

This doesn't give a very good view of the groups, so let's instead use p-value plots to detect the relevant changes in whole metaclusters and in FlowSOM pre-clusters:

png('out3.png', 600, 300)
par(mar=c(0,0,0,0), mfrow=c(1,2))
PlotSignificance(ds2, control=c(1:6), experiment=c(7:12), paired=T)
PlotSignificance(ds2, control=c(1:6), experiment=c(7:12), paired=T, meta=F)
dev.off()

Plot of CD45- subset group difference significances

Orange colors in the plot show that the p-values hint at significant increase of the cell count in the population in experiment group, blue shows that p-values hint at significant decrease. You can also use paired=F if your samples do not correspond one-to-one.

Colors are only informative; but rigorous statistical testing can be run on a table of relative cell abundances (a.k.a. probabilities) obtained using GetProbs(ds2)$probs:

              [,1]      [,2]         [,3]         [,4]        [,5] ...
 [1,] 0.0010019036 0.2774271 0.0016030458 0.0032060916 0.033663962 ...
 [2,] 0.0023727351 0.4236411 0.0002157032 0.0002157032 0.008412425 ...
 [3,] 0.0008337502 0.4305486 0.0006670002 0.0010005003 0.020010005 ...
 [4,] 0.0032122560 0.2940450 0.0053125772 0.0013590314 0.040400297 ...
 [5,] 0.0005069158 0.2591788 0.0010138316 0.0017379970 0.014338475 ...
 [6,] 0.0018617500 0.3176986 0.0028827097 0.0016815807 0.017536484 ...
 [7,] 0.0010257287 0.3289170 0.0003419096 0.0022224122 0.007949397 ...
 [8,] 0.0006872852 0.4828866 0.0032989691 0.0004123711 0.012371134 ...
 [9,] 0.0011308728 0.4344608 0.0018505192 0.0008224530 0.016654673 ...
[10,] 0.0059383892 0.3122603 0.0013608809 0.0007422987 0.006680688 ...
[11,] 0.0005499878 0.2864214 0.0023832804 0.0011610853 0.023588365 ...
[12,] 0.0017746772 0.3596475 0.0052628358 0.0013463068 0.020622973 ...

Each line in the matrix corresponds to one file from the input, each row corresponds to one metacluster (or pre-cluster if meta=F). Testing can be then run manually using any favorite testing procedure. Note that t.test is provided only as an example -- wilcox.test will probably be better for this general use-case, and exact test to choose highly depends on the experiment type. We examine the combined contents of clusters 2 and 5 to get the exact test results and p-values:

ps <- GetProbs(ds2)$probs
t.test(ps[7:12,2]+ps[7:12,5],
       ps[1:6,2]+ps[1:6,5],
       alternative='greater', paired=T)

THIS IS A PROOF OF CONCEPT that only exists for research and demonstration purposes. Please do not rely on consistency of DiffSOM, availability of updates or long-term existence of this repository. If you find DiffSOM useful and want to use it for anything serious, please maintain a separate forked version to avoid conflicts with possible breaking changes that I will need to add to this repository.

A more sophisticated, fast and clickable software for running this kind of analyses will be made available as soon as possible.

As usual, you are welcome to report any encountered issues or possible ideas for improvement using the GitHub issue tracker.

exaexa/DiffSOM documentation built on Aug. 22, 2019, 2:46 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com