This package contains common R scripts I use in my day to day data analysis of biological data. The scripts are primarily for plotting and visualisation, with some data organisation thrown in as well.
dplyr
ggplot2
ggrepel
ComplexHeatmap
RColorBrewer
Rmisc
ggpubr
gtools
grid
pBrackets
biomaRt
First install devtools to allow installation from gitub and any other required packages.
install.packages("devtools")
library("devtools")
library(devtools)
library(knitr)
Now install the BioOutputs package.
install_github("KatrionaGoldmann/BioOutputs")
library("BioOutputs")
Create a correlation plot. Taken from kassambara/ggpubr just changed the default arguments.
So we can use the classic example with the mtcars data frames:
kable(head(mtcars))
mpg
cyl
disp
hp
drat
wt
qsec
vs
am
gear
carb
Mazda RX4
21.0
6
160
110
3.90
2.620
16.46
0
1
4
4
Mazda RX4 Wag
21.0
6
160
110
3.90
2.875
17.02
0
1
4
4
Datsun 710
22.8
4
108
93
3.85
2.320
18.61
1
1
4
1
Hornet 4 Drive
21.4
6
258
110
3.08
3.215
19.44
1
0
3
1
Hornet Sportabout
18.7
8
360
175
3.15
3.440
17.02
0
0
3
2
Valiant
18.1
6
225
105
2.76
3.460
20.22
1
0
3
1
bio_corr(mtcars, "qsec", "wt")
The bio_frequency() function generates a frequency table from factor or character vector columns in a data frame. This has the following arguments:
Argument data A data frame containing columns to be counted columns Column names or indices to be counted in data freq.percent Whether the table should include frequency counts, percentages or both (options = c("freq", "percent", "both")). Default="both" include.na Include NA values (options are TRUE/FALSE, default=TRUE) remove.vars Character vector of variables not to be included in the counts (e.g. remove.vars = c("") remove blanks from the count)Then if we want to see the breakdown of, say, the gear column in mtcars we can apply:
kable(bio_frequency(mtcars, "gear"))
3
4
5
Total
gear
15 (47%)
12 (38%)
5 (16%)
n = 32
And if wanted we can remove one variable from the table. This is useful if we have unknowns or the likes.
kable(bio_frequency(mtcars, "gear", remove.vars=c("5")))
3
4
Total
gear
15 (56%)
12 (44%)
n = 27
This function generates a volcano plot from a top table using ggplot.
The function contains many parameters, use ?bio_volcano
to interogate.
toptable <- read.table("https://gist.githubusercontent.com/stephenturner/806e31fce55a8b7175af/raw/1a507c4c3f9f1baaa3a69187223ff3d3050628d4/results.txt", sep="", header=T)
rownames(toptable) = toptable$Gene
bio_volcano(toptable, fc.col="log2FoldChange", label.row.indices=1:10, main="ALS patients carrying the C9ORF72", add.lines=TRUE)
Lets look at the leukemia data set
BiocManager::install("leukemiasEset", version = "3.8")
library(leukemiasEset)
library(limma)
data(leukemiasEset)
ourData <- leukemiasEset[, leukemiasEset$LeukemiaType %in% c("ALL", "NoL")]
ourData$LeukemiaType <- factor(ourData$LeukemiaType)
design <- model.matrix(~ ourData$LeukemiaType)
fit <- lmFit(ourData, design)
fit <- eBayes(fit)
toptable <- topTable(fit)
toptable$pvalue = toptable$P.Value
bio_volcano(toptable, fc.col="logFC", label.row.indices=1:10, main="leukemia", add.lines=FALSE)
This function colours data according to whether it is below or above a defined plane. The plane is plotted as a line and data can be output as either a line or markers/points.
Argument x x column name in data y y column name in data df.data Data frame containing x and y columns x.plane column name for the x axis in df.plane y.plane column name for the y axis in df.plane df.plane Date frame modelling the plane stepwise logical whether to plot the cutoff plane as stepwise or smoothed colours colour vector for higher, lower and plane values (default=c("green", "red", "grey) respectively) inc.equal logical whethere points on the line should be counted as above (dafault=TRUE) labels label for the markers (default=c("above", "below")) type type of plot for data (options include point (dafault), line, stepwise)Lets look at some beaver temperature data.
data(beavers)
df.plane = beaver1
df.data = beaver2
df.plane$temp <- df.plane$temp + 0.5
bio_bires(x="time", y="temp", df.data, x.plane="time", y.plane="temp", df.plane)
This function splits expression data into customisable modules and averages over catagories in a given variable.
Argument exp data frame containing the expression data mod.list A list of modules. Each element contains the list of genes for a modules. The gene names must match the rownames in the exp dataframe. meta Dataframe where each column contains an annotation/tract for samples in the heatmap. The order of samples in meta must match that of exp. cluster.rows The method to use for clustering of rows cols Chacter vector, or named vector to fix the order, defining the colours of each mean.var group. main Title of heatmap show.names Show row names. Logical. mean.subjects Logical to determine whether to add a row for the mean value for all subjects in a group split.var Character defining the meta column to average (mean) over.In this example we will look at two Li modules and a custom one I made up.
exp = rld.syn
meta <- rld.metadata.syn
bio_mods(exp=exp, mod.list=mod_list, meta=meta, mean.var = "Pathotype", cluster.rows = FALSE)
This function plots a line graph comparing two treatment groups over time.
Argument df Data frame containing the data for both groups group.col Column name which corresponding to the group of values in df y.col Column name corresponding to the y-axis values in df x.col Column name corresponding to the x-axis values in df id.col Column name which corresponds to subject ID main Title of pot cols Character vector for colours of lines p.col Colour of p-value textLets look at kidney disease survival in this example from the survival package.
library(survival)
data(kidney)
df <- kidney
df <- df[df$time < 150, ]
bio_treatmentGroups(df, group.col="status", y.col="frail", x.col="time")
This function switches gene (or snp) ids using biomart
Argument IDs list of the Ids you want to convert IDFrom What format these IDs are in (default Ensembl) IDTo What format you want the IDs converted to (default gene names) mart The biomart to use. Typically, for humans you will want ensembl (default). Alternatives can be found at listEnsembl() dataset you want to use. To see the different datasets available within a biomaRt you can e.g. do: mart = useEnsembl('ENSEMBL_MART_ENSEMBL'), followed by listDatasets(mart). attributes list of variables you want outputFor example genes TNF and A1BG:
IDs = c("TNF", "A1BG")
#kable(bio_geneid(IDs, IDFrom='hgnc_symbol', IDTo = 'ensembl_transcript_id'))
This function exports a complex heatmap looking at the expression of different modules which can be customised.
Argument exp Expression Data var Vector classing samples by variables prefix Prefix to Heatmap titles stars whether pvales should be written as numeric or start (default=FALSE) overlay pvalues on fold change Heatmap or besidde (default = TRUE) logp Whether or not to log the pvalues ... Other parameters to pass to Complex Heatmapexp = rld.syn
meta <- rld.metadata.syn
bio_fc_heatmap(exp=exp, var=meta$Pathotype)
Creates boxplots showing the significance between groups.
Argument data Data frame containing columns x and y x, y x and y variable names for drawing. p.cutoff plot p-value if above p.cutoff threshold. To include all comparisons set as NULL. stars Logical. Whether significance shown as numeric or stars method a character string indicating which method to be used for comparing means.c("t.test", "wilcox.test") star.vals a list of arguments to pass to the function symnum for symbolic number coding of p-values. For example, the dafault is symnum.args <- list(cutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1), symbols = c('****', '', '', '', 'ns')). In other words, we use the following convention for symbols indicating statistical significance: ns: p > 0.05; : p <= 0.05; : p <= 0.01; : p <= 0.001; ****: p <= 0.0001Lets look at the iris data:
bio_boxplots(iris, x="Species", y= "Sepal.Width", p.cutoff = 0.0001)
bio_boxplots(iris, x="Species", y= "Sepal.Width", NULL, stars=TRUE)
Creates a sankey plot with heatmaps showing how individuals progress over time.
Argument samp.orders A list of named vectors for sample order at each timepoint. Vector names must correspond to matchable ids.library(lme4)
library(ComplexHeatmap)
data(sleepstudy)
data = sleepstudy
data = data[data$Days %in% c(0, 1, 2), ]
data = data[data$Days !=1 | data$Subject != "308", ]
time.col="Days"
exp.cols = "Reaction"
sub.col = "Subject"
row.order=list()
for(i in unique(data[, time.col])){
hm = Heatmap(data[data[, time.col] == i, exp.cols])
row.order[[paste('timepoint', as.character(i))]] = setNames(unlist(row_order(hm)), data[data[, time.col] == i, sub.col])
}
bio_sankey(samp.order=row.order)
Converts strings to appropriate format for titles according to Chicago style.
Argument titles A character vector of phrases to be converted to titles exeption.words A character vector of words with case to be forced (for example abbreviations and roman numerals) replace.chars A named list of characters to replace in title. This works in order of appearance. E.g. c("\."=" ") replaces fullstops with spaces.shakespeare_plays =c("all's well that ends well", "as you like it","comedy of errors","love's labour's lost", "measure for measure", "merchant of venice","merry wives of windsor","midsummer night's dream","much ado about nothing","taming of the shrew", "tempest", "twelfth night","two gentlemen of verona", "winter's tale", "henry iv, part i","henry iv, part ii", "henry v", "henry vi, part i","henry vi, part ii", "henry vi, part iii", "henry viii","king john", "pericles","richard ii", "richard iii", "antony and cleopatra","coriolanus","cymbeline", "hamlet","julius caesar", "king lear", "macbeth (the scottish play)", "othello", "romeo and juliet","timon of athens", "titus andronicus", "troilus and cressida")
exception.words= c("I", "II", "III", "IV", "V", "VI") # Pass in exceptions
bio_capitalize(as.character(shakespeare_plays[grepl("henry", shakespeare_plays)]), exception.words)
Calculates the module expression from a list of genes.
Argument exp data frame containing the expression data mod.list A list of modules. Each element contains the list of genes for a modules. The gene names must match the rownames in the exp dataframe.bio_mods(exp=exp, mod.list=mod_list)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.