In stranda/unpakathon: Code Used and Data Needed to work with unpak data

The unpakathon package is primarily a datafreeze of the unpak database. It also provides some functions to help work with these data.

This vignette shows how to interact with these data, once the package is installed and loaded into the current R session

load the library

library(unpakathon)

How do I know how fresh the data are?

After loading the package up using 'library()', the date of the snapshot is available:

datestamp

Tables

There are several tables available, each with some sort of help description that you can access by typing ? (e.g. ?phenolong or ?independent)

Here are the tables at the moment and a list of their current column names

#phenolong: long format for all the phenotypic data
names(phenolong)
#phenowide: wide (typical) format for phenotypic data in phenolong
names(independent)
##tdna: tdna data we know what it is.  indexed by accession
names(tdna)
#geneont The tair10 gene ontology indexed by gene
names(geneont)
#allfeat
names(allfeat)
#allmeth
names(allmeth)
#SalkPos
names(SalkPos)
#SalkInsert
names(SalkInsert)

Experiments and Meta Experiments

The experiment table is sort of "subexperiment-oriented". the database now provides a way to lump all the data associated with experiments 1a1, 1a2, 1a3, etc into a meta.experiment called "1". This provides a quick way to do subsetting of the huge data set:

unique(phenolong[,c("experiment","meta.experiment")])

So, to subset the data to exclude farms one could do a couple of easy lines of code. I'm using dplyr throughout on this document, btw.

library(dplyr)
unique(phenolong$experiment)
nofarm.long <- phenolong %>% filter(meta.experiment!='farm')
unique(nofarm.long$experiment)

Or maybe you only want to look at the distribution of individual fruitnums for experiment1

exp1 <- phenolong %>% filter(meta.experiment=="1") %>% filter(variable=="fruitnum") 
library(ggplot2) #might as well use ggplot to start with
ggplot(data=exp1, aes(value, fill=experiment)) + geom_histogram(binwidth=50)

Here's the same exercise but getting averages for every replicate within a growth chamber within an experiment

exp1mn <- phenolong %>% filter(meta.experiment=="1") %>%
  filter(variable=="fruitnum") %>%
  group_by(experiment,accession) %>%
  summarise(value=mean(value,na.rm=T))
ggplot(data=exp1mn, aes(value, fill=experiment)) + geom_histogram(binwidth=50)

I've done all of the above with the long format, but similar approaches could be used with the wide format in phenowide

For example, it is arguably easier to compare phenotypes when they are in the wide format

exp1 <- phenowide %>% filter(meta.experiment=="1")
ggplot(data=exp1,aes(x=days.to.bolt,y=fruitnum))+geom_point()

might be nice to see whats with that point at 765 fruits

Correcting phenotypes for within Growth Chamber variance, phytometers or col

Everything so far has been performed on uncorrected phenotypes.

We can correct phenotypes based on the Treatment, Experiment, and Facility combinations

Here's an example scaling data by the amount of variation in each experiment/facility combination.

I'm using 'phenos' as the name of a vector to capture the names of a couple of phenotypes. This could be a single phenotype or many.

The first real line gives a good intuition into how one can filter on just a few variables in the long format and then "spread" them so that variables can be compared to one another.

The rest of the lines of code create dataframes that have the trait variables adjusted by means of columbia or scaled by sd or both.

phenos = c("fruitnum","days.to.bolt")

plotraw = phenolong %>% filter(variable %in% phenos) %>% spread()

plotscale <- phenolong %>%
  filter(variable %in% phenos) %>%
  scalePhenos(classifier = c("facility","experiment")) %>% spread()

plotadj <- phenolong %>%
  filter(variable %in% phenos) %>%
  colcorrect(classifier = c("facility","experiment")) %>%
  spread()

plot.all <- phenolong %>%
  filter(variable %in% phenos) %>%
  colcorrect(classifier = c("facility","experiment")) %>%
  scalePhenos(classifier = c("facility","experiment")) %>% 
  spread()


plot(fruitnum~days.to.bolt,data=plotraw,main="unadjusted")
plot(fruitnum~days.to.bolt,data=plotscale,main="scaled by std. dev.")
plot(fruitnum~days.to.bolt,data=plotadj,main="adjusted by columbia mean.")
plot(fruitnum~days.to.bolt,data=plot.all,main="scaled and adjusted by columbia mean")

Here is a version with the adjustment done by the mean of the phytometers in each chamber/time combo rather than the mean of columbia alone.

phenos = c("fruitnum","days.to.bolt")

plotadj <- phenolong %>%
  filter(variable %in% phenos) %>%
  phytcorrect(classifier = c("facility","experiment"), pheno=phenos) %>%
  spread()

plot.all <- phenolong %>%
  filter(variable %in% phenos) %>%
  phytcorrect(classifier = c("facility","experiment"), pheno=phenos) %>%
  scalePhenos(classifier = c("facility","experiment")) %>% 
  spread()


plot(fruitnum~days.to.bolt,data=plotraw,main="unadjusted")
plot(fruitnum~days.to.bolt,data=plotscale,main="scaled by std. dev.")
plot(fruitnum~days.to.bolt,data=plotadj,main="adjusted by phyt mean.")
plot(fruitnum~days.to.bolt,data=plot.all,main="scaled and adjusted by phyt mean")

You could easily make 'phenos' longer and look at multivariable relationships:

phenos = c("fruitnum","days.to.bolt","diameter.at.bolt")

plot.all <- phenolong %>%
  filter(variable %in% phenos) %>%
  phytcorrect(classifier = c("facility","experiment"), pheno=phenos) %>%
  scalePhenos(classifier = c("facility","experiment")) %>% 
  spread() 

library(GGally)
ggpairs(plot.all,columns=phenos)

Of course, these are all based on individual data points. It might make sense to take the means of accessions.

phenos = c("fruitnum","days.to.bolt","diameter.at.bolt")

plot.all <- phenolong %>%
  filter(variable %in% phenos) %>%
  phytcorrect(classifier = c("facility","experiment"), pheno=phenos) %>%
  scalePhenos(classifier = c("facility","experiment")) %>% 
  group_by(facility, experiment,variable,accession) %>%    ##group the data before
  summarise(value=mean(value,na.rm=T)) %>%                 ##taking the mean
  spread() 

library(GGally)
ggpairs(plot.all,columns=phenos)