The unpakathon package is primarily a datafreeze of the unpak database. It also provides some functions to help work with these data.
This vignette shows how to interact with these data, once the package is installed and loaded into the current R session
load the library
library(unpakathon)
After loading the package up using 'library()', the date of the snapshot is available:
datestamp
There are several tables available, each with some sort of help description that you can access by typing ?
Here are the tables at the moment and a list of their current column names
#phenolong: long format for all the phenotypic data names(phenolong) #phenowide: wide (typical) format for phenotypic data in phenolong names(independent) ##tdna: tdna data we know what it is. indexed by accession names(tdna) #geneont The tair10 gene ontology indexed by gene names(geneont) #allfeat names(allfeat) #allmeth names(allmeth) #SalkPos names(SalkPos) #SalkInsert names(SalkInsert)
The experiment table is sort of "subexperiment-oriented". the database now provides a way to lump all the data associated with experiments 1a1, 1a2, 1a3, etc into a meta.experiment called "1". This provides a quick way to do subsetting of the huge data set:
unique(phenolong[,c("experiment","meta.experiment")])
So, to subset the data to exclude farms one could do a couple of easy lines of code. I'm using dplyr throughout on this document, btw.
library(dplyr) unique(phenolong$experiment) nofarm.long <- phenolong %>% filter(meta.experiment!='farm') unique(nofarm.long$experiment)
Or maybe you only want to look at the distribution of individual fruitnums for experiment1
exp1 <- phenolong %>% filter(meta.experiment=="1") %>% filter(variable=="fruitnum") library(ggplot2) #might as well use ggplot to start with ggplot(data=exp1, aes(value, fill=experiment)) + geom_histogram(binwidth=50)
Here's the same exercise but getting averages for every replicate within a growth chamber within an experiment
exp1mn <- phenolong %>% filter(meta.experiment=="1") %>% filter(variable=="fruitnum") %>% group_by(experiment,accession) %>% summarise(value=mean(value,na.rm=T)) ggplot(data=exp1mn, aes(value, fill=experiment)) + geom_histogram(binwidth=50)
I've done all of the above with the long format, but similar approaches could be used with the wide format in phenowide
For example, it is arguably easier to compare phenotypes when they are in the wide format
exp1 <- phenowide %>% filter(meta.experiment=="1") ggplot(data=exp1,aes(x=days.to.bolt,y=fruitnum))+geom_point()
might be nice to see whats with that point at 765 fruits
Everything so far has been performed on uncorrected phenotypes.
We can correct phenotypes based on the Treatment, Experiment, and Facility combinations
Here's an example scaling data by the amount of variation in each experiment/facility combination.
I'm using 'phenos' as the name of a vector to capture the names of a couple of phenotypes. This could be a single phenotype or many.
The first real line gives a good intuition into how one can filter on just a few variables in the long format and then "spread" them so that variables can be compared to one another.
The rest of the lines of code create dataframes that have the trait variables adjusted by means of columbia or scaled by sd or both.
phenos = c("fruitnum","days.to.bolt") plotraw = phenolong %>% filter(variable %in% phenos) %>% spread() plotscale <- phenolong %>% filter(variable %in% phenos) %>% scalePhenos(classifier = c("facility","experiment")) %>% spread() plotadj <- phenolong %>% filter(variable %in% phenos) %>% colcorrect(classifier = c("facility","experiment")) %>% spread() plot.all <- phenolong %>% filter(variable %in% phenos) %>% colcorrect(classifier = c("facility","experiment")) %>% scalePhenos(classifier = c("facility","experiment")) %>% spread() plot(fruitnum~days.to.bolt,data=plotraw,main="unadjusted") plot(fruitnum~days.to.bolt,data=plotscale,main="scaled by std. dev.") plot(fruitnum~days.to.bolt,data=plotadj,main="adjusted by columbia mean.") plot(fruitnum~days.to.bolt,data=plot.all,main="scaled and adjusted by columbia mean")
Here is a version with the adjustment done by the mean of the phytometers in each chamber/time combo rather than the mean of columbia alone.
phenos = c("fruitnum","days.to.bolt") plotadj <- phenolong %>% filter(variable %in% phenos) %>% phytcorrect(classifier = c("facility","experiment"), pheno=phenos) %>% spread() plot.all <- phenolong %>% filter(variable %in% phenos) %>% phytcorrect(classifier = c("facility","experiment"), pheno=phenos) %>% scalePhenos(classifier = c("facility","experiment")) %>% spread() plot(fruitnum~days.to.bolt,data=plotraw,main="unadjusted") plot(fruitnum~days.to.bolt,data=plotscale,main="scaled by std. dev.") plot(fruitnum~days.to.bolt,data=plotadj,main="adjusted by phyt mean.") plot(fruitnum~days.to.bolt,data=plot.all,main="scaled and adjusted by phyt mean")
You could easily make 'phenos' longer and look at multivariable relationships:
phenos = c("fruitnum","days.to.bolt","diameter.at.bolt") plot.all <- phenolong %>% filter(variable %in% phenos) %>% phytcorrect(classifier = c("facility","experiment"), pheno=phenos) %>% scalePhenos(classifier = c("facility","experiment")) %>% spread() library(GGally) ggpairs(plot.all,columns=phenos)
Of course, these are all based on individual data points. It might make sense to take the means of accessions.
phenos = c("fruitnum","days.to.bolt","diameter.at.bolt") plot.all <- phenolong %>% filter(variable %in% phenos) %>% phytcorrect(classifier = c("facility","experiment"), pheno=phenos) %>% scalePhenos(classifier = c("facility","experiment")) %>% group_by(facility, experiment,variable,accession) %>% ##group the data before summarise(value=mean(value,na.rm=T)) %>% ##taking the mean spread() library(GGally) ggpairs(plot.all,columns=phenos)
Now we're talking.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.