knitr::opts_chunk$set(cache=TRUE,fig.width=6,fig.height=4)

Introduction {#intro}

The ggprovenance package provides functions for visualising data typically used in sediment provenance analysis in the geosciences, as well as helper functions that aid data analysis. Plotting builds on the framework provided by the ggplot2 package, and the output plots are ggplot objects, which allows further modification with ggplot2's functions and easy saving of the graphics. This document is meant for geoscientists who want step-by-step guidelines for using ggprovenance's plotting functions, and who might not be experienced users of the R language.
The "workhorse functions" of the package are:

plotKDE() provides plotting of kernel density estimates (KDEs) of geochronological data for single or suites of samples

plotMDS() provides plotting of multidimensional scaling (MDS) maps for quick visual identification of trends and similar groups within a set of sample data

Working with provenance data using the ggprovenance package in R typically involves three main steps:

  1. load data files, possibly filter and reformat input data for use
  2. plot visualisation of data
  3. save output

Of these steps, step 2 is usually an iterative "trial-and-error" or gradual improvement process to determine the best graphical representation of the data, with a few additional calculations necessary here and there. Thanks to ggplot2's functionality, step 3 is a trivial call to ggsave(). Step 1, loading the data, is usually the most involved and might require users to write their own scripts for the purpose, as it cannot easily be standardised over the wide range of possible data file formats and the individual requirements of each user. This step can be further subdivided into:

1a. loading data files
1b. reformatting data
1c. adding information for visualisation

All basic steps will be presented in workable examples in section 2 of this document. Section 3 illustrates the effects of different settings of the parameters given to the main plotting functions. Section 4 gives a basic generic script that should be easily adaptable to the user's needs. Section 5 contains additional tips & tricks on plotting and data import.

\pagebreak

Basic workflow {#workflow}

First, we need to load the package:

library(ggprovenance)

Loading data

For this document, we use example data files installed with ggprovenance, but the principles are the same for any other data file. Simple example data files are provided in the /extdata subfolder of the package installation folder; these can serve as examples of how to prepare data. Alternatively, the data loading process can be adapted to the existing data files, and the data restructured in R, to obtain objects suitable for the plotting functions. The latter approach is preferable, as it leaves the original data files untouched, and new or altered data is easily replotted by simply re-running the R script instead of preparing intermediate data files by hand. See also section Tips & Tricks for notes on how to import lists of individual data files and MS Excel files.

Loading data files

In the /extdata folder, there is a simple example file of detrital zircon U-Pb ages. The individual measured ages are arranged column-wise, one column per sample, with the sample names in the first line.

table<-read.xls(xls=system.file("extdata", "Tarim.xls", package="ggprovenance"),
                stringsAsFactors=FALSE)
knitr::kable(table[1:6,],format="markdown")

To load this data, you could use the included function read.xls.flat():

agedata<-read.xls.flat(system.file("extdata", "Tarim.xls", package="ggprovenance"))

Two small helper functions are provided with ggprovenance, read.xls.flat() and read.xls.tabbed(). Both are wrappers around gdata's read.xls() function, adapted for two common cases: Excel files with all data in one worksheet, one column per sample, and Excel files with one worksheet per sample. In the latter case, the column name for the data of interest is assumed to be the same in every worksheet. Please note that gdata needs a working Perl installation, which is usually present on *nix-based operating systems but requires extra installation steps under Windows (see section 5).
Another common file format is comma-separated values (csv), which can be read with read.table() and its derivatives from the utils package. The utils package is part of a base installation of R, and thus does not require installing further packages or dependencies.
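
As a minimal sketch of the csv route, the following writes a small, made-up csv file (one column per sample, shorter columns padded with empty cells) and converts it into a named list of numeric vectors. The file contents and the names csv_file, ages_table and ages_list are hypothetical. We assume here that the plotting functions accept such a named list, which the later use of names(agedata) and agedata[["Tb22"]] suggests, but this sketch does not verify that read.xls.flat() returns exactly this structure.

```r
# Hypothetical csv file: one column per sample, shorter columns left empty.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("SampleA,SampleB",
             "105.2,98.7",
             "230.1,412.5",
             "1850.4,"), csv_file)

# Empty cells in numeric columns are read as NA.
ages_table <- read.csv(csv_file, header = TRUE, stringsAsFactors = FALSE)

# Convert to a named list of numeric vectors, dropping the NA padding.
ages_list <- lapply(ages_table, function(x) x[!is.na(x)])

names(ages_list)
lengths(ages_list)
```

Dropping the NA padding per column keeps samples of different sizes from being mistaken for missing measurements.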

Before plotting, we can have a look at some properties of the data:

names(agedata)
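
A few more properties can be checked with base R alone. The sketch below uses a small made-up list (ages, with samples S1 and S2) as a stand-in for agedata, assuming the data is a named list of numeric vectors; the values are for illustration only.

```r
# Illustrative stand-in for agedata: a named list of numeric age vectors.
ages <- list(S1 = c(110, 250, 260, 1900),
             S2 = c(95, 400, 2600))

n_ages    <- sapply(ages, length)            # number of ages per sample
age_range <- sapply(ages, range)             # row 1: youngest, row 2: oldest
oldest    <- max(unlist(ages), na.rm = TRUE) # oldest age over all samples

n_ages
age_range
oldest
```

The number of ages per sample is worth checking early, since samples with very few ages tend to give unreliable density estimates.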

Adding information for visualisation

For colouring and similar purposes, additional information can be added to the data. For example, we might classify our data into general source areas. plotKDE() accepts a data.frame in its categories parameter for this purpose. We prepare a data.frame with the desired properties in columns (here: area) and the sample names set as row.names (see section 3 for an easier way to achieve this).

cats<-data.frame(area=rep("n/a",length(agedata)),stringsAsFactors=FALSE)
row.names(cats)<-names(agedata)
cats$area[grep("Tb04",row.names(cats))]<-"Tarim"
cats$area[grep("Tb21",row.names(cats))]<-"Taklamakan"
cats$area[grep("Tb22",row.names(cats))]<-"Taklamakan"
cats$area[grep("Tb35",row.names(cats))]<-"Kunlun"
cats$area[grep("Tb38",row.names(cats))]<-"Kunlun"
cats$area[grep("Tb50",row.names(cats))]<-"Tian Shan"
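
The same classification can be written more compactly with a named lookup vector. This is only a sketch: the sample names in samples are hypothetical stand-ins for names(agedata), and unlike grep(), which matches substrings, the lookup requires the sample names to match exactly. The names area_of and cats2 are chosen for illustration.

```r
# Hypothetical sample names; in practice use names(agedata).
samples <- c("Tb04", "Tb21", "Tb22", "Tb35", "Tb38", "Tb50")

# Named lookup vector: sample name -> source area.
area_of <- c(Tb04 = "Tarim",
             Tb21 = "Taklamakan", Tb22 = "Taklamakan",
             Tb35 = "Kunlun",     Tb38 = "Kunlun",
             Tb50 = "Tian Shan")

cats2 <- data.frame(area = unname(area_of[samples]),
                    row.names = samples, stringsAsFactors = FALSE)

# Samples missing from the lookup come out as NA; label them "n/a".
cats2$area[is.na(cats2$area)] <- "n/a"
```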

The area name of each data set in agedata is now recorded in cats$area (we'll use this later):

knitr::kable(cats,format="markdown")

Plotting data

To get a quick look at the loaded data, plotKDE() is a good point to start.

plotKDE(agedata)

With the cats variable we generated, and some additional adjustments, we can make this a little clearer. See the examples and help files for details.

names(cats)
plotKDE(agedata,categories=cats,markers="dash",stack="close",limits=c(0,3500),mapping=aes(fill=area))

Saving the output

Since plots generated with ggprovenance are also ggplot objects, ggsave() from package ggplot2 can be utilised for saving in a wide range of file formats.

ggsave("~/test.pdf",width=10,height=8)

Adapt the output path, file format (indicated simply by the file extension) and sizes to your needs. You can also save a specific plot, if you stored it in a variable earlier:

p1<-plotKDE(agedata)

#
# ... a lot of clever code here, generating other plots (p2, p3,...)
#

ggsave(plot=p1,filename="~/test.png",dpi=300,width=10,height=8)
# save p2, p3,...

\pagebreak

Examples and detailed description of parameters {#examples}

In the following, the effects of the many different parameters for the plotting functions are illustrated by example plots. The examples assume agedata loaded as described previously. Many of the parameters can be used together (see examples at the end of this section), but not all possible combinations can be detailed here.

plotKDE()

The full function call to plotKDE is:

plotKDE<-function(ages,title,limits=c(0,max(unlist(agedata),na.rm=TRUE)),
                  plotonly=names(ages),categories,mapping,breaks=NA,
                  bandwidth=NA,splitat=NA,markers=c("none","dash","circle"),
                  logx=FALSE,histogram=FALSE,binwidth=bandwidth,adaptive=TRUE,
                  stack=c("equal","close","dense"),
                  normalise=c("area","height","none"),lowcount=80,...)

See also the help files for details.

Disclaimer: as of the time of writing (r Sys.Date()), not all parameters are functional yet.

The simplest call plots KDEs for all elements of ages. Each plot is assigned an individual colour, the default age range spans from 0 to the oldest age data, bandwidth is chosen automatically.

plotKDE(agedata)

The plotonly parameter allows selecting specific data sets (samples) by their names. The example below has the same effect as subsetting the data provided to the function, e.g. plotKDE(agedata[["Tb22"]]) or plotKDE(agedata$Tb22). Note that the legend is removed automatically, as it is redundant here.

plotKDE(agedata,plotonly=c("Tb22"))

Add histograms. The optional binwidth parameter sets histogram bin width independently from bandwidth.

plotKDE(agedata,plotonly=c("Tb22"),histogram=TRUE,binwidth=50)

Add data markers. Possible values are "dash" for little tick marks and "circle" for semi-transparent circles.

plotKDE(agedata,plotonly=c("Tb22"),markers="dash")

Custom breaks. The breaks parameter overrides limits, i.e. if any breaks lie outside the limits, the limits will be expanded accordingly.

plotKDE(agedata,plotonly=c("Tb22"),breaks=seq(200,2800,200))
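
The expansion rule described above amounts to simple arithmetic, sketched here in base R. This is not the package's internal code, only an illustration of the stated behaviour; the limits value and the name effective_limits are made up for the example.

```r
limits <- c(0, 2500)             # requested x-axis limits (illustrative)
breaks <- seq(200, 2800, 200)    # requested axis breaks

# Expand the limits so that every break falls inside them.
effective_limits <- range(c(limits, breaks))
effective_limits
```

Here the upper limit would grow from 2500 to 2800, because the last break lies beyond the requested range.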

Change x- (time-) limits. Leaving out limits will automatically use the range of values contained in data.

plotKDE(agedata, limits=c(50,600))

Logarithmic x-scale.

plotKDE(agedata,logx=TRUE)

Split the plot at a certain age; the two sub-ranges will occupy equal space (useful to emphasise the younger ages).

#plotKDE(agedata,splitat=600)

Splitting can be combined with custom ranges for the two half plots. A limits parameter of length 4 will override splitat.

#plotKDE(agedata,limits=c(50,600,1600,2800))

Usually, an "optimal bandwidth" is calculated automatically for each data series, and the median of all optimal bandwidths for all data series is set equally for all KDEs. It can also be set manually.

plotKDE(agedata, bandwidth=15)
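
The idea of a shared bandwidth can be sketched with base R's density(). Note the assumptions: the ages are synthetic, and bw.nrd0() (Silverman's rule of thumb) is used as a stand-in for whatever "optimal bandwidth" selector plotKDE() applies internally, so the numbers will differ from those of the package.

```r
set.seed(1)
# Synthetic ages, for illustration only.
ages <- list(S1 = c(rnorm(60, 300, 20), rnorm(40, 1800, 60)),
             S2 = c(rnorm(50, 450, 30), rnorm(30, 2500, 80)))

# One bandwidth per sample; bw.nrd0() is a stand-in selector.
bws <- sapply(ages, bw.nrd0)

# The median of the per-sample bandwidths is shared by all KDEs.
common_bw <- median(bws)

# Evaluate every KDE with the shared bandwidth over a common age range.
lim  <- c(0, max(unlist(ages)))
kdes <- lapply(ages, density, bw = common_bw, from = lim[1], to = lim[2])
```

Using one bandwidth for all samples keeps the curves directly comparable, at the price of smoothing some samples more or less than their own optimum would.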

To plot each KDE with its own optimal bandwidth, set bandwidth=-1. Note that this sometimes leads to unexpected results, e.g. severe oversmoothing in data sets with very few data or a very regular data distribution. However, it gives a good impression of the statistical significance of age peaks. If a peak does not show up unless a small bandwidth is set manually, there is probably not enough data to support this age population in the first place.

plotKDE(agedata,bandwidth=-1)

The categories parameter allows the data to be classified based on the individual values encountered in the supplied data.frame. categories should have as many rows as there are entries in ages, and its row.names should be the same as names(ages). We can use the cats data.frame created in Basic workflow. In parameter mapping, we can then provide an aesthetic mapping for any variable within categories. plotKDE() understands the fill, colour, size and linetype aesthetics, which work much like aesthetics in ggplot2.

plotKDE(agedata,categories=cats, mapping=aes(fill=area))
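
Since categories must line up with the samples, a quick consistency check before plotting can save some debugging. The helper below, check_categories(), is hypothetical and not part of ggprovenance; the small ages list and cats_demo table are made up for the example.

```r
# Hypothetical helper, not part of ggprovenance.
check_categories <- function(ages, categories) {
  stopifnot(is.data.frame(categories),
            nrow(categories) == length(ages),
            setequal(row.names(categories), names(ages)))
  # Return the categories reordered to match the order of the samples.
  categories[names(ages), , drop = FALSE]
}

ages <- list(Tb21 = c(120, 300), Tb04 = c(450, 900))
cats_demo <- data.frame(area = c("Tarim", "Taklamakan"),
                        row.names = c("Tb04", "Tb21"),
                        stringsAsFactors = FALSE)

checked <- check_categories(ages, cats_demo)
checked
```

The check stops with an error if a sample has no category row or if the table contains rows for unknown samples.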

It is often easier to prepare the categories as a separate table/file, and load them from there. To find the optimal data visualisation, this table can then easily be adapted and reloaded until the desired plot is produced:

# load categories:
cats<-read.table(file=system.file("extdata", "categories.csv", package="ggprovenance"),
                 header=TRUE,row.names=1,sep=",",stringsAsFactors=FALSE)
plotKDE(agedata,categories=cats,mapping=aes(fill=area,linetype=type),
        limits=c(0,1200),stack="close",bandwidth=-1)
# adapt categories.csv to needs, re-run, repeat...
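
To show the layout such a file might have, the following sketch writes a small categories file to a temporary location and reads it back with the same read.table() arguments as above. The contents are invented for illustration (in particular the values in the type column) and are not those of the categories.csv shipped with the package; the names cat_file and cats_csv are likewise hypothetical.

```r
# Hypothetical categories file: first column holds the sample names.
cat_file <- tempfile(fileext = ".csv")
writeLines(c("sample,area,type",
             "Tb04,Tarim,river",
             "Tb21,Taklamakan,dune"), cat_file)

# row.names=1 turns the first column into the row names.
cats_csv <- read.table(file = cat_file, header = TRUE, row.names = 1,
                       sep = ",", stringsAsFactors = FALSE)
cats_csv
```

Each further column in the file becomes a variable that can be used in an aesthetic mapping.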

order

title

Examples of combining parameters:

Nice combined colour plot
t.b.a.

Custom breaks on a log scale
t.b.a.

Publication-quality pure black & white plot

plotKDE(agedata,mapping=aes(fill="black"))+theme(panel.background=element_blank(),
    panel.grid=element_blank(),axis.ticks=element_line(colour="black"),
    axis.text=element_text(size=rel(0.8), colour="black"))

Add annotation to plots

g<-plotKDE(agedata,bandwidth=-1,logx=TRUE,limits=c(50,3200),
           plotonly=c("Tb38","Tb35","Tb50","Tb22"),normalise="height")
g+annotate("rect",xmin=c(90,260,385),xmax=c(125,320,490),ymin=0,ymax=1.1,
           fill=c("#FF000022","#00FF0022","#0000FF22"),colour="black")

plotMDS()

plotMDS(mds, diss, col="", sym="", nearest=TRUE, labels=TRUE, symbols=TRUE,
    fcolour=NA, stretch=FALSE)

Other useful functions

plotShepard()

plotShepard(mds, diss, xlab="dissimilarity", ylab="distance", title="")

plotDendrogram()

plotTernary()



thegeologician/ggprovenance documentation built on Sept. 26, 2021, 8:59 a.m.