library(dplyr)
library(ggplot2)
library(jezioro)
library(magrittr)
library(mapproj)
library(rgdal)
library(rioja)
library(rmarkdown)

Introduction

Despite the countless blog posts, textbooks, and online courses focused on R that litter the internet, this document provides an overview of how R is typically used in paleolimnological analyses at PEARL.

The purpose of the guide is to help both new and experienced users refine their many inevitable internet searches prefaced with "r help ..."; even with some familiarity with a given statistical technique, it may be unclear how to perform it using R.

The guide begins with some basic information for users new to R; however, the majority of content is focused on common analyses used in paleolimnological studies and the specific packages and functions that can facilitate them. The guide was created using R Markdown, specifically to allow easy updates and a gradual evolution over time. Therefore, please share any suggestions you may have to improve future versions.

Fundamental Concepts

What is R?

R is an open-source programming language and software environment for statistical computing and graphics. R is freely available for Linux, macOS, and Windows and the current version can be obtained directly from the project website: https://www.r-project.org/.

R vs. RStudio

As R is a command line program, many people choose to control it through a graphical interface. There are several different graphical interfaces or integrated development environments (IDEs) to choose from, but the one most commonly used at PEARL is RStudio. RStudio is an open-source IDE for R, freely available for Linux, macOS, and Windows, and its current version can be obtained directly from its project website: https://www.rstudio.com/.

RStudio is not required for anything described in this guide; it simply provides multiple panes, menus, and buttons that make some aspects of R more accessible. However, for simplicity, the guide assumes RStudio is installed, and that if you choose to use R on its own or with a different IDE, you are capable of translating the described operations to your particular setup.

Objects, Classes, and Functions

Objects are the fundamental elements of the R programming language. Within R, an object is simply a data structure with some attributes. Therefore, essentially everything that you work with during an R session will be an object of some kind.

Most objects have a class attribute that provides some definition regarding the type of data it contains. Classes include things like 'character', 'numeric', 'logical', 'data.frame' and 'function'.

Functions are the principal tools used in R. Functions are objects that can take arguments as input, and return a new object as output. Some are quite simple (e.g. mean returns the mean value of a numeric object), whereas others are much more complex (e.g. those used to perform ordinations). Many different functions are available within a base installation of R; however, as you become familiar with R, it becomes possible to write functions of your own that can be shared via scripts and packages.

Packages

All R functions and datasets are stored in packages, including everything contained within a base install of R. However, this base functionality can be extended through external packages that contain additional functions (along with relevant documentation). It may be helpful to view this relationship as follows: packages are the 'apps' available for the R 'operating system'. The main source for external packages is the Comprehensive R Archive Network (CRAN) repository (the 'app store' in the OS analogy).

It is also possible to install external packages from a local archive file (e.g. .zip or .tar.gz), and this is how the jezioro package (that contains several functions and datasets created at PEARL) is currently distributed. The latest version of jezioro is always available on the PEARL website.

R is able to access the CRAN repository directly, and external packages are typically installed either through the console, or using an IDE such as RStudio.

Installing a package using the console

To install the package rmarkdown using the console, you would enter the following command into the lower left pane of RStudio (the 'Console' pane):

install.packages("rmarkdown")

However, the functions contained within that package do not become available until it is loaded into memory with the library function.

library("rmarkdown")

Installing a package using RStudio

To install a package using RStudio: open the 'Packages' tab in the bottom-right pane, click 'Install', enter the name of the package, and click 'Install' again.

Useful packages in the CRAN repository

The jezioro package contains several functions and data sets written at PEARL and is distributed as an archive file; however many other packages useful to the analyses performed at PEARL are available from the CRAN repository. These include: analogue, corrplot, dplyr, ggplot2, magrittr, mgcv, rioja, rpart, and vegan (all of which appear later in this guide).

Help Documentation

All objects within R should have documentation accessible with the help function. To bring up the documentation for a function, package, or dataset, enter help("object of interest") into the console and the help file should appear in the 'Help' tab of RStudio (in the bottom-right pane). For example, to bring up the documentation for the help function itself, enter:

help("help")

The ? is a shortcut for the help function, so entering the following will also bring up the documentation for the help function.

?help

The help documentation for a package (which should contain a list of all of its functions) is accessed in the same way. So to bring up the help file for the rmarkdown package installed earlier, use:

?rmarkdown

However, this will generate an error if the package is not loaded into memory. To search for help documentation on objects that are installed, but not currently loaded, use a double question mark ??.

??rmarkdown

Basic Usage

The Console

After installing both R and RStudio, starting RStudio will present a screen split into four panes, as shown in the official RStudio Cheat Sheet.

By default these panes are: 'Source' (top left), 'Console' (bottom left), 'Environment/History' (top right), and 'Files/Plots/Packages/Help/Viewer' (bottom right).

Typing directly into the console is the simplest way to use R. For example, entering

2+2

will return the output,

[1] 4

The [1] indicates that the first element of the result (in this case, there is only one element) has a value of 4.

As you start performing more complex operations, you will find that it is easier to enter your input in the 'Source' pane, where it can be saved to a file for later use, rather than retyping everything each time. These R script files typically have an .R extension.

Individual lines can be sent from the 'Source' pane to the console by moving the cursor to the desired line and either hitting the 'Run' button or using the keyboard shortcut 'Ctrl + Enter' (or 'Ctrl + R'). The same can also be done for selections of text.

Assignment

Assigning a numeric value to an object

The next logical step after determining that 2+2 = 4 is storing that information for later use. This is done by 'assigning' the value '4' to an object we will call x, using the assignment operator <- (you could also use =, but it carries some restrictions on where it can be used, whereas <- can be used anywhere).

x <- 4

You should now see x and its value displayed in the 'Environment' pane, which means it is stored in memory and can be called back again using the console.

x

Inspecting the class of x with the class function reveals that it is 'numeric'.

class(x)

This means that it can be used in arithmetic operations.

x + 2

x currently contains only one element (i.e. it has a length of 1), but this can be changed by reassigning multiple values to x using the c or 'combine' function. Below we reassign x with two 8's, changing it to a numeric vector with two elements.

x <- c(8, 8)
x
class(x)
length(x)

Note that x can still be used in arithmetic operations. Adding '5' to x will add '5' to each element of x.

x + 5

Numeric vs. Character

As mentioned earlier, 'numeric' is one of several available classes. Another class that you will frequently encounter is 'character'. A 'character' element is typically a string of alphanumeric characters, such as words.

y <- "blue"
class(y)

We can combine 'character' elements into a vector with the c function.

y <- c(y, "red", y)
y

Predictably, errors occur when you try to perform addition on characters.

y + 2 

However, what happens if you have a vector with both data types?

z <- c(2, 4, 6, "blue")
class(z)
z + 2

So, it only takes one 'character' element to force the assignment of the z object to the 'character' class.

Classes

Vectors

Within R, all objects have a class. The class defines what kind of information the object contains and how it may be used. The previous example demonstrated some characteristics of 'numeric' vs 'character' objects.

In general, data will most often be classified as a vector, data frame, or matrix.

As seen previously, a vector is a sequence of data elements.

x <- c(1,2,3)
y <- c(2,4,6)

Operations can be performed on the vector as a whole.

x+2

Or on individual elements of a vector using [] to indicate which element.

y[3]*2

Or using multiple vectors.

x+y

Matrices

A matrix is a two-dimensional array, so let's make one by combining x and y using the rbind function. Note that in addition to the data elements, table1 also has row names ('x' and 'y') and column indices ([,1], [,2], and [,3]).

table1 <- rbind(x, y)
table1
class(table1)

Individual elements of a matrix can also be accessed using [] to indicate the indices of each dimension separated by a comma (row then column).

table1[2,3]
table1[1,3] + table1[2,1]
table1*2

Matrices vs. Data Frames

A matrix can contain only one type of data (i.e. only 'numeric' or only 'character'). In contrast, a data frame is a table of data that can (but doesn't necessarily) contain mixed data types. Data frames can be created with the data.frame function.

table2 <- data.frame(Col1=c(1, 2), Col2=c("black", "white"))
table2
class(table2)

There are many more classes within R, but initially you will likely spend most of your time working with matrices and data frames. It is important to stay mindful of the differences between these two classes.

A matrix can be changed into a data frame using the as.data.frame function.

x <- as.data.frame(table1)
class(x)
x

Similarly, a data frame can be changed into a matrix by using the as.matrix function. However, matrices must contain only one data type, and a number can be converted into a character, but not vice versa. Thus, when converting table2 to a matrix, the numeric values are converted to character values (denoted in quotation marks), and are no longer available for arithmetic operations.

x <- as.matrix(table2)
class(x)
x
x[1,1]*2

This becomes important when trying to combine matrices of different data types. For example, let's combine table1, a numeric matrix, with x (the copy of table2 with everything converted to 'character' data) using the cbind function.

x <- cbind(table1, x)
x

By default, output inherits its class from the input. So here, x must be a matrix, and matrices can contain only one data type; therefore all the numeric values from table1 are also converted to character values (as denoted by the quotation marks).

class(x)
x[1,3] + x[2,1]

In this case, it would probably make more sense for the output to be a data frame, containing both numeric and character columns. One way to do this is by nesting the as.data.frame function within the cbind function. As we are now combining objects of different classes, R must choose a class for the output object; in this case it defaults to a data frame, as this can be done without changing any of the data in either object.

x <- cbind(table1, as.data.frame(table2))
class(x)
x
x[1,3] + x[2,1]

Functions

The past few examples should have provided some insight into the strength of functions: they can perform very complex operations and can be combined in many ways.

Functions follow the general form: function(input, optional arguments).

For example, the function round will round an input value 'x' to 'n' decimal places when round(x, digits=n) is entered into the console. Replacing 'x' and 'n' with actual numbers gives something like:

round(3.141593, digits=5)
round(3.141593, digits=2) 

The available documentation for a function can be viewed by entering a ? before its name into the console.

?round

A base install of R contains many functions and so far we have already used: class, c, length, rbind, cbind, as.data.frame, as.matrix, and round. It is also possible to define your own (that can incorporate other functions) with the syntax: function(input){operations}.

For example, we can create a function that will calculate what percentage each element of some input data contributes to the total sum of the input: function(x){100*x/sum(x)}. Assigning (<-) this function to an object called percentage then allows it to be applied repeatedly to different datasets.

percentage <- function(x){100*x/sum(x)}

Although not particularly useful on objects containing only a single element, we can use the c function to combine several numbers into a vector, then run percentage on the vector as a whole.

percentage(10)
percentage(c(5,5,10))

Importing and Exporting Files

Most of the data you will want to analyze with R will be generated outside of R, so it will need to be imported before you can work with it. The simplest procedure to import a data file uses read.csv with file.choose(), which opens a dialog box to select the file:

count.data <- read.csv(file=file.choose(), header=T, stringsAsFactors=FALSE)

Similarly, to export an object from within your R environment to a .csv file, use the write.csv function. If we wanted to export the object output.data to a .csv file, we would enter the following into the console to open up a dialog box to select the location of the new file.

write.csv(output.data, file=file.choose())

Graphics

In addition to statistical analyses, one of the main uses of R is the generation of high quality figures. However, creating figures with R is a complex topic, and currently beyond the scope of this guide.

To begin, basic plots can be generated with the plot function.

x <- c(1,2,3,4)
y <- c(2,4,6,8)

plot(x,y, type ='l')
?plot

Complex plots are also possible with the plot function, but there are also alternate graphic systems available through external packages such as ggplot2.

Using Packages

A few different methods for installing external packages were given earlier in this guide. However, before the functions within a package become available for use, the package needs to be loaded into memory.

For example, if you installed the rmarkdown package earlier, trying to bring up the help file for its render function without the package loaded should generate an error.

?render

Load an installed package (in this case rmarkdown) into memory with the library function, and try again.

library(rmarkdown)
?render

Data Analysis Tools

This section lists common data analysis techniques used at PEARL and the functions and/or packages that contain the tools necessary to perform them.

Over time I would like to add some detailed example code for each of these techniques.

Normality

To check the normality of your data, you can use the Shapiro-Wilk normality test, provided in a base installation of R by the shapiro.test function.

shapiro.test(input.data)

If you are dealing with a large number of variables, it may make sense to examine all of them at once using the apply function.

apply(input.data, 2, shapiro.test)

Transformations

Data can be square-root transformed using the sqrt function. Data can be log (base 10) transformed using log10. Note that by default the log function will calculate the natural logarithm (i.e. ln).

sqrt(input.data)
log10(input.data)
log(input.data)

Standardization

The decostand function within the vegan package provides several standardization methods (e.g. 'chi.square' and 'hellinger').

?decostand
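
For example, a minimal sketch of a Hellinger transformation, assuming speciesData is a samples-by-species matrix (as used in the Ordination section below):

library(vegan)
# Hellinger-transform the species matrix
speciesHel <- decostand(speciesData, method="hellinger")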

Z-Scores

Z-scores can be calculated with the scale function available in a base installation of R. Checking the details of the help file reveals that scale has two arguments, center and scale.

?scale

Therefore, z-scores for input.data can be calculated with:

scale(input.data, center=TRUE, scale=TRUE)

Correlation

Correlations (Pearson, Kendall, and Spearman) can be examined using the cor function.

cor(input.data)

A nice correlation matrix can be produced using the corrplot package.

library(corrplot)
newdatacor <- cor(input.data)
corrplot(newdatacor, method = "number")

For example, using the built-in mtcars dataset:

data(mtcars)
M <- cor(mtcars)
corrplot(M, method = "number", col = "black", cl.pos = "n")

Species Diversity (Hill's N2)

Hill's N2 values (Hill 1973)^[Hill MO (1973) Diversity and evenness: a unifying notation and its consequences. Ecology 54: 427-432] can be calculated for your data using the Hill.N2 function provided by the rioja package.

N2 <- Hill.N2(input.data, margin=2)

Rarefaction

Rarefaction is a method to correct for bias in species richness between samples of unequal size, by standardizing across samples to the number of species expected in a sample of the same total size as the smallest sample.

This can be done within R using the rarefy function provided by the vegan package.

?rarefy
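
As a minimal sketch, assuming count.data is a samples-by-species matrix of raw counts (as imported earlier):

library(vegan)
# standardize expected richness to the smallest sample total
raremax <- min(rowSums(count.data))
rarefy(count.data, sample=raremax)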

Age-Depth Modelling

Gamma Dating

At PEARL, we currently use ScienTissiME to produce ^210^Pb chronologies. However, cores dated prior to 2012-2013 were dated with the binford functions provided by the jezioro package, so those functions remain relevant when reanalyzing older cores.

Radiocarbon Dating

For time-scales too long for ^210^Pb analyses, an implementation of the Bacon approach to age-depth modelling (i.e. using Bayesian statistics to reconstruct accumulation histories for deposits, by combining radiocarbon and other dates with prior information; Blaauw and Christen, 2011^[Blaauw M, Christen JA (2011) Flexible paleoclimate age-depth models using an autoregressive gamma process. Bayesian Analysis 6: 457-474]) is provided by the package rbacon.
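
A minimal sketch using the package's main function and the MSB2K example core that ships with rbacon:

library(rbacon)
# run the age-depth model on the bundled example core
Bacon("MSB2K")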

Ordination

Tools to perform ordinations are provided by the vegan package.

A detrended correspondence analysis (DCA) can be performed using the function decorana. For example, to run a DCA on a matrix called "speciesData" (arranged with species as columns and samples as rows):

library(vegan)
outputDCA <- decorana(speciesData)
summary(outputDCA)
plot(outputDCA)

A principal component analysis (PCA) can be performed using the function rda (Note that the "scale" argument can be used to scale values to unit variance).

outputPCA <- rda(speciesData, scale=TRUE)
summary(outputPCA, axes=4)

The biplot function can be used to generate a PCA biplot with species scores indicated by biplot arrows (note that the scaling argument can be used to scale sites (1), species (2), or both (3) by eigenvalues; see ?biplot.rda for more details).

biplot(outputPCA, scaling=2) 

A redundancy analysis (RDA) is performed by the rda function when two matrices are provided as input, one containing the species data and the second containing the environmental data.

outputRDA <- rda(speciesData ~ ., envData)
outputRDA
summary(outputRDA, scaling=0, axes=4)
plot(outputRDA, scaling=2)

To manually perform forward selection on an RDA model, use add1.cca.

testRDA <- rda(speciesData ~ 1, envData) # define the starting model
add1(testRDA, scope=formula(outputRDA), test="perm", pstep=500) # examine F-ratios
testRDA <- update(testRDA, . ~ . + pH) # update the model
add1(testRDA, scope=formula(outputRDA), test="perm", pstep=500) # examine F-ratios again
testRDA <- update(testRDA, . ~ . + Ca) # update the model
add1(testRDA, scope=formula(outputRDA), test="perm", pstep=500) # continue until no more variables improve the model

anova(testRDA, by="terms", step=500) # test terms in the final model
testRDA
summary(testRDA, scaling=0, axes=4)
plot(testRDA, scaling=2)

Finally, examine the variance inflation factors in both the full and reduced models.

vif.cca(outputRDA)
vif.cca(testRDA)

ANOSIM and SIMPER

An ANOSIM can be performed using the anosim function provided by the vegan package. Similarity percentages can be determined with the simper function also provided by the vegan package.
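
A minimal sketch, assuming speciesData is a samples-by-species matrix and groups is a factor assigning each sample to a group:

library(vegan)
# test whether assemblage composition differs among groups
fit <- anosim(speciesData, groups, distance="bray")
summary(fit)
# identify the taxa contributing most to between-group dissimilarity
sim <- simper(speciesData, groups)
summary(sim)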

indVAL?

Transfer Functions

Transfer functions can be built using the WA function provided by the rioja package.

For example, the midge/VWHO calibration set described in detail by Quinlan and Smol^[Quinlan R, Smol JP (2001) Chironomid-based inference models for estimating end-of-summer hypolimnetic oxygen from south-central Ontario shield lakes. Freshwater Biology 46: 1529-1551] ^[Quinlan R, Smol JP (2010) Use of Chaoborus subfossil mandibles in models for inferring past hypolimnetic oxygen. Journal of Paleolimnology 44: 43-50] is provided as a dataset called vwhoQuinlan2010 by the jezioro package.

Therefore the Quinlan and Smol midge/VWHO model can be built using the following code:

library(jezioro)
library(rioja)
data(vwhoQuinlan2010)
vwho.model <- WA(vwhoQuinlan2010[,3:47], vwhoQuinlan2010[,2], tolDW=TRUE)


The Quinlan and Smol papers settled on the WA.inf.tol_XVal model as having the best performance metrics (RMSE=1.98 mg/L, R^2^=0.5998). To confirm that the vwho.model object has the same performance as the model described by Quinlan and Smol, use the crossval function, also provided by rioja.

library(rioja)
rioja::crossval(vwho.model, cv.method="loo", verbose=FALSE, ngroups=10, nboot=100, h.cutoff=0, h.dist=NULL)

Confident that your model is the same as the one described in the manuscripts, it can then be applied down-core to midge sedimentary assemblage data to reconstruct VWHO concentrations.
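
For example, a sketch of such a reconstruction, where coreData is a hypothetical downcore assemblage matrix whose taxa match the calibration set:

# reconstruct VWHO downcore, with bootstrapped sample-specific errors
recon <- predict(vwho.model, newdata=as.matrix(coreData), sse=TRUE, nboot=100)
recon$fit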

Merging Data Sets

Data sets can be merged on common columns using the join function provided by the analogue package.

This can be useful when comparing downcore data sets with a calibration set: join creates a merged data set, adding rows/columns (populated by zeroes) as necessary for those taxa absent from either of the original datasets.

library(analogue)
calSet <- data.frame(taxa1=c(0.25, 0.75, 0.25),taxa2=c(0.20, 0.05, 0.35), taxa3=c(0.55, 0.20, 0.50), row.names=c("cal1", "cal2", "cal3"))

testSet <- data.frame(taxa1=c(0.25, 0.25, 0.50),taxa4=c(0.75, 0.10, 0.15), row.names=c("test1", "test2", "test3"))

join(calSet, testSet, type="outer", split=FALSE)

The aggregate function provided in a base installation of R can also be useful when harmonizing disparate data sets.

Analog Matching

Identifying whether a fossil data set has any close modern analogs within a calibration set (i.e. evaluating a transfer function) can be done using the analog function provided by the analogue package.

The analogue package also contains a vignette with a detailed analog matching example that follows the approach described in Flower et al. 1997^[Flower RJ, Juggins S and Battarbee RW (1997) Matching diatom assemblages in lake sediment cores and modern surface sediment samples: the implications for lake conservation and restoration with special reference to acidified systems. Hydrobiologia 344: 27–40] and Simpson et al. 2005^[Simpson GL, Shilland EM, Winterbottom JM and Keay J (2005) Defining reference conditions for acidified waters using a modern analogue approach. Environmental Pollution 137: 119–133].

library(analogue)
vignette("analogue_methods")

Multivariate Regression Trees

Regression trees can be constructed using the rpart package.
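
As a minimal sketch (univariate case, assuming envData contains a VWHO column to predict from the remaining variables):

library(rpart)
# grow a regression tree and display the splits
fit <- rpart(VWHO ~ ., data=envData)
plot(fit)
text(fit)
printcp(fit) # complexity table to guide pruning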

Breakpoint Analysis

Davies Test
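
This section is still a stub. One option (an assumption; the guide does not yet name a package) is the segmented package, which provides the davies.test function along with tools for fitting breakpoint models:

library(segmented)
# hypothetical example: test for a change in slope of a proxy over depth
fit <- lm(proxy ~ depth, data=input.data)
davies.test(fit, seg.Z=~depth) # Davies test for a non-constant slope
seg <- segmented(fit, seg.Z=~depth) # fit the breakpoint model
summary(seg)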

LOESS

Locally Estimated Scatterplot Smoothing (LOESS) can be performed using the loess function provided by the base stats package.
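
A minimal sketch, assuming hypothetical depth and proxy columns in input.data (the span argument controls the degree of smoothing):

fit <- loess(proxy ~ depth, data=input.data, span=0.75)
plot(input.data$depth, input.data$proxy)
lines(input.data$depth, predict(fit))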

GAMs

Generalized Additive Models (or GAMs) can be constructed using the mgcv package.

GAMs are increasingly applied in paleolimnological analyses because they can estimate the nonlinear trends typical of sediment-core records, accommodate the uneven spacing of samples in time, and account for temporal autocorrelation when assessing where change in a record is significant.

Simpson (2018)^[Simpson GL (2018) Modelling Palaeoecological Time Series Using Generalised Additive Models. Frontiers in Ecology and Evolution doi:10.3389/fevo.2018.00149] provides a detailed introduction to the use of GAMs and their construction using R (the supplementary information for the article contains the annotated R code used to perform the described analyses).
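
As a minimal sketch of this kind of model, assuming hypothetical proxy and age columns in input.data and using REML smoothness selection as recommended by Simpson (2018):

library(mgcv)
# fit a smooth trend of the proxy through time
gam.fit <- gam(proxy ~ s(age, k=15), data=input.data, method="REML")
summary(gam.fit)
plot(gam.fit, shade=TRUE) # fitted smooth with confidence band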

Advanced Usage

Data Manipulation Verbs

The dplyr package provides several functions that act intuitively as verbs for data manipulation.

These functions include: filter, select, mutate, arrange, and summarise.
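
For example, a quick sketch using the built-in mtcars dataset:

library(dplyr)
filter(mtcars, cyl == 4) # keep only rows where cyl is 4
select(mtcars, mpg, cyl) # keep only the mpg and cyl columns
arrange(mtcars, desc(mpg)) # sort rows by descending mpg
summarise(mtcars, mean(mpg)) # collapse to a single summary value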

Forward Pipe: %>%

The %>% operator provided by the magrittr package functions similarly to the Unix pipe "|". It 'pipes' the output from one function directly into the next (as its first argument), allowing intermediate steps to be bypassed and reducing the need for repetition in the code.

Note that due to the special characters in %>%, pulling up its help file requires:

?magrittr::`%>%`

A simple example:

x <- c(5,4,3,2,1)
x %>% mean

A more complicated example using %>% in combination with some of the data frame manipulation verbs provided by dplyr:

Import the vwhoQuinlan2010 dataset from jezioro, and convert it to a data frame.

data(vwhoQuinlan2010)
vwhoQuinlan2010 <- as.data.frame(vwhoQuinlan2010)

Calculate the mean VWHO of those lakes (rows) where TANYSL <10 and MICROP >30.

filter(vwhoQuinlan2010, MICROP>30) %>% filter(TANYSL<10) %>% summarise(mean(VWHO))

Plotting with ggplot2

ggplot2 is a data visualization package that implements the concepts outlined in 'The Grammar of Graphics', a book 'about grammatical rules for creating perceivable graphs'.

The syntax used to produce plots with ggplot2 is quite different from that of the base graphics system. For example:

library(ggplot2)

activities <- read.csv(file="data/ggplot2_activities.csv", header =T)

ggplot(activities, aes(x=Lake, y=Pb210, ymin=Pb210-Pb210err, ymax=Pb210+Pb210err)) +
  theme(legend.position=c(0.90, 0.85), axis.text.x=element_text(size=8)) +
  geom_point(aes(reorder(Lake, -Pb210))) +
  geom_errorbar(aes(reorder(Lake, -Pb210), colour=Interval)) +
  ylab("Total 210Pb Activity (Bq/kg)")

The ggplot2 package is able to construct very complex figures, and there are extensive resources available detailing its use. However, at PEARL we occasionally need to break some of its rules (e.g. multiple y-axes); thus ggplot2 is not the best tool for producing stratigraphies that show both depth and age.

Composite Plots

Basic Maps

With the appropriate shapefiles, basic maps can be generated using ggplot2 and rgdal (Bindings for the 'Geospatial' Data Abstraction Library).

library(ggplot2)
library(rgdal)

For example, to add a few points to an outline of Ontario:

1) Read in the shapefile, and convert it to a data frame for use with ggplot2.

ontario<-readOGR("data/ontarioShapefile/Ont2.shp", layer="Ont2")
ontario_df <- fortify(ontario)

2) Import (or, in this example, create) a data frame containing the data for the plot.

sites <- data.frame(name=c("queens", "lake"), decLatitude=c(44.22, 45.18), decLongitude=c(-76.50, -78.83), stringsAsFactors = FALSE)

3) Generate the plot with geom_polygon and add the points with geom_point. Note that by default this will use a Mercator projection centered on the Prime Meridian.

map <- ggplot() + 
  geom_polygon(data = ontario_df, aes(x = long, y = lat, group = group), fill="grey40") +
  geom_point(data=sites, aes(x=decLongitude, y=decLatitude))
print(map)

4) The projection can be changed with coord_map. For example, to change the previous map to a Lambert projection with true scale between latitudes of 40 and 60 degrees N:

#The projection can be changed using 'coord_map'
map <- map +
  coord_map("lambert", lat0=40, lat1=60)
print(map)

Stratigraphies

Stratigraphies can be produced using the strat.plot function provided by the rioja package.
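
A minimal sketch using the RLGH example dataset bundled with rioja:

library(rioja)
data(RLGH)
# percentage diatom stratigraphy plotted against depth
strat.plot(RLGH$spec, yvar=RLGH$depths$Depth, y.rev=TRUE, scale.percent=TRUE)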

Animated Plots

https://github.com/dgrtwo/gganimate

Word Clouds

Word clouds are a neat way to present and visualize text data. The following example demonstrates generating the word frequency tables (along with text stemming and stem completion) necessary to produce word clouds.

There are several ways to produce a word cloud using R; here, functions from the tm package are used to mine text data, SnowballC to stem text, and wordcloud2 to generate the word clouds. If not already installed on your system, all three packages are available from CRAN, and can be installed with:

install.packages(c("wordcloud2", "tm", "SnowballC"))

Once the packages are installed, make them available for use by loading them:

library(wordcloud2)
library(tm)
library(SnowballC)

Data Import

Word clouds require words, so import your text data into the R environment:

rawText <- readLines(file.choose())

The following example uses the first three paragraphs from the introduction of Volume 1 of the DPER series as input data.

rawText <- "Paleolimnology, the interpretation of past conditions and processes in lake basins, is a multidisciplinary science whose roots extend back nearly two centuries. Despite this long history of development, the science has seen a surge of interest and application over the past decade. Today paleolimnology assumes a pivotal role in paleoclimatic and global change investigations, many fields of environmental science, and hydrocarbon and mineral resource exploration and exploitation. Associated with this dramatic increase in researchactivity involving lake sediments, there has been an equally rapid advance in the techniques and methods used by paleolimnologists. The objective of this volume is to provide a state-of-the-art summary of the major field methods, chronological techniques, and concepts used in the study of large-scale lacustrine basin analysis. This and the other techniques volumes in this series build on the foundation provided by previous compilations of paleoenvironmental techniques, such as Kummel & Raup (1965), Bouma (1969), Carver (1971), Berglund (1986), Gray (1988), Tucker (1988), Warner (1990), and Rutter & Catto (1995), many of which continue to serve as essential handbooks. However, the development of new and different methods for studying lake sediments over the past decade, as well as advancements and modifications to old methods, have provided impetus for a new series of monographs for this rapidly expanding topic. Three additional books from this series deal with other components of paleolimnology. Volume 2 (Last & Smol, 2001) focuses on the vast array of physical, mineralogical, and geochemical parameters that can be applied to the interpretation of lake histories. Volumes 3 and 4 (Smol et al., 2001a, b) address the great range of biological techniques that have been and continue to be such an important aspect of many paleolimnological efforts. Although chapters in each of these books discuss the quantitative aspects of lake sediment interpretations, a separate volume on statistical and data handling approaches in paleolimmnology is currently in preparation (Birks et al., in preparation). Our intent with this series of volumes is to provide sufficient methodological and technical detail to allow both practitioners and newcomers to the area of paleolimnology to use them as handbooks/laboratory manuals and as primary research texts. Although we recognize that the study of pre-Quaternary lakes is a very rapidly growing component in the vast arena of paleolimnology, this volume is directed primarily towards those involved in Quaternary lacustrine deposits and Quaternary environmental change. Nonetheless, many of the techniques and approaches are applicable to other time frames. Finally, we anticipate that many of the techniques discussed in these volumes  apply  equally  well to  marine, estuarine, and other depositional environments, although we have not specifically targeted non-lacustrine settings."

Generate Word Frequency Table

To produce a word cloud, a word frequency table is required (i.e. a two column table with the first column listing each word present in the input text, and the second listing the number of times it appears). This type of text mining can be done using tools provided by the tm package.

Some preprocessing of the input text is necessary. The tm package uses a data structure known as a corpus. So, first collapse the rawText data object into a single character string, then convert it into a corpus using the Corpus function provided by the tm package.

rawText <- paste(unlist(rawText, use.names=FALSE), collapse=" ")
rawText <- Corpus(VectorSource(rawText))

Next, clean the text data within the corpus using tm_map: convert all characters to lower case, and remove any special characters, punctuation, stop words, and extra white space.

cleanCorpus <- function(corpus){
  corpus <- tm_map(corpus, content_transformer(tolower))
  ##Replace all instances of "/", "@" and "|" with a space:
  replaceWithSpace <- content_transformer(function(x , pattern ) gsub(pattern, " ", x))
  corpus <- tm_map(corpus, replaceWithSpace, "/")
  corpus <- tm_map(corpus, replaceWithSpace, "@")
  corpus <- tm_map(corpus, replaceWithSpace, "\\|")
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("en"))
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
}
cleanText <- cleanCorpus(rawText)

Note that you may want to prevent some additional words from appearing in the word cloud, beyond those contained in the tm stop word list. If so, add them to the character vector cutWords.

cutWords <- c("can", "will", "also")
cleanText <- tm_map(cleanText, removeWords, cutWords)

Now the TermDocumentMatrix function within the tm package can be used to produce a frequency table for each word present in cleanText:

createFreqTable <- function(corpus){
  freqTable <- TermDocumentMatrix(corpus)
  freqTable <- as.matrix(freqTable)
  freqTable <- sort(rowSums(freqTable),decreasing=TRUE)
  freqTable <- data.frame(word = names(freqTable), freq=freqTable)
  return(freqTable)
}
cleanFreqTable <- createFreqTable(cleanText)

A quick examination of the frequency table will reveal that individual members of word families such as 'volume' and 'volumes' or 'provide' and 'provided' are tabulated separately. Thus, the frequency table could be simplified by reducing members of each word family to their root and then combining each instance. This process is known as stemming the document.

head(cleanFreqTable, 40)

Text Stemming

To generate a stemmed version of cleanText use the stemDocument function provided by tm and then generate a new frequency table using our createFreqTable function from earlier:

stemText <- tm_map(cleanText, stemDocument)
stemFreqTable <- createFreqTable(stemText)

However, it is unlikely that you would want to use the actual roots in your word cloud (i.e. the root of the 'technique' word family is 'techniqu', etc.):

head(stemFreqTable, 40)

Stem Completion

The next step is to complete the roots using the most frequent occurrence in the input text of the inflected/derived words. This is done using another function provided by tm, stemCompletion; however, it works on character vectors, so both cleanText and stemText must be converted first:

corpus2CharVect <- function(corpus){
  corpus <- unlist(corpus[[1]][1], use.names=FALSE)
  corpus <- strsplit(corpus, " ")
  corpus <- unlist(corpus, use.names=FALSE)
  return(corpus)
}

cleanWords <- corpus2CharVect(cleanText)
stemWords <- corpus2CharVect(stemText)

Complete the contents of stemWords using cleanWords as the dictionary, and convert the output to a corpus, so that the final frequency table can be created.

completeText <- stemCompletion(stemWords, cleanWords, type="prevalent")
completeText <- Corpus(VectorSource(completeText))
completeFreqTable <- createFreqTable(completeText)
head(completeFreqTable, 40)

Generate Word Cloud

The frequency table of the stemmed and completed text will be the input for the wordcloud2 function. The visual appearance of the word cloud can be modified using specific arguments to change size/shape/colour etc. Also note that if the most common words do not appear, it may be because the default sizes are too large for the plot window; this can be fixed by reducing the value given to the size argument.

?wordcloud2
wordcloud2(data=completeFreqTable, size=.5, ellipticity = 0.65, shape="circle")

R Markdown

R Markdown is an authoring framework that allows a single file to both save and execute code, and also generate high quality formatted reports with the rmarkdown package.

This document is maintained in a single file written in R Markdown called labGuide.rmd; the html file is then generated with:

library(rmarkdown)
render("labGuide.rmd")

Additional R Resources

Official Documentation

Websites

Cheat Sheets

Books

References


