```r knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
## Introduction to `muddier` The package \code{muddier} is inspired by the field of fluvial geomorphology. Correcting for the inhertied ages of radiocarbon-dated charcoal in samples of debris-flow deposits in headwater streams concerns a particular application of deriving a PMF from two parent PMFs through convolution. The functions in \code{muddier} are an attempt to abstract the particulars of the application into a more general set of tools for converting emprical observational data into CDFs, PMFs, and for deriving PMFs from multiple parent PMFs. ## Charcoal Data The `muddier` package includes data from Lancaster & Casebeer (2007) on charcoal samples taken from bank stratigraphy adjacent to headwater streams in the OCR. The table `charcoal` contains site data from the field surveys where they collected the charcoal samples. The table `char_pmfs` is the radiocarbon age distribution associated with each sample. To load the data, you can use the `data` function, or simply type `library(muddier)`. ```r library(muddier) # load package data without loading package # data(charcoal, package = 'muddier') # data(char_pmfs, package = 'muddier') # inspect charcoal table head(charcoal)
The charcoal
dataset has 370 observations listed as rows. The rownames of the object and the first column of data are the site ids. At sites where the researchers took multiple samples, the site ids will have letters appended to them (eg. LK-187a, LK-187b, LK-187c). The family
column is the site id stripped of any ending letters, so that all observations from the same site share the same family id. Researchers classified the deposit type of samples as fluvial fines (FF), fluvial gravels (FG) and debris-flow deposits (DF). The facies
column indicates the facies type using the two-letter classification codes (FF, FG, DF). The mn
column is the mean sample age BP as derived from the pmf in char_pmfs
, and rank
is the ordinal rank of each sample from youngest to oldest.
If we examine the dimensions of each dataset, we see char_pmfs
has 370 columns and 4430 rows. The column names match the row names of charcoal
. The row names of char_pmfs
are the year values associated with the probability in the row. If we plot the values of a column against the values of the row names, the result is the age distribution pmf of the sample.
library(magrittr) # compare dimensions of objects dim(charcoal) dim(char_pmfs) years <- char_pmfs %>% rownames %>% as.numeric probs <- char_pmfs[,250] %>% unlist %>% as.numeric plot(years, probs, xlim=c(1000,1400), main='pmf of ob# 250')
To produce these two datasets, I performed several operations on a csv file converted from the Excel file containing the original Lancaster & Casebeer (2007) data. The file used to perform these operations is stored in the data-raw
folder on the github reposity for this project. Compared to the original data, this data has been sorted by mean age, and split into two datasets, one for sample variable data and one for numeric pmf vectors associated with each sample.
Before we begin analyzing the data, I would like to verify that the datasets have been sorted correctly, and that the pmfs all sum to 1. First I will make sure the rownames of charcoal
match the column names of char_pmfs
, then I will test that each column of char_pmfs
sums to 1. Finally, I want to make sure the calculated means in charcoal
match the means of the pmfs in char_pmfs
. To do this, I use the testthat
package which employs the function expect_equal
. If the two arguments are equal to within a tolerance range, the test passes silently. If the two arguments are not equal, the test fails with a warning. Failure causes abortion of vignette construction, so every time you see a test run in code, that means it has passed. I will explicitly mark unexecuted lines of code with the comment # not run
at the top of the code snippet.
library(testthat) years <- char_pmfs %>% rownames %>% as.numeric mns <- lapply(char_pmfs, function(a) weighted.mean(years, a)) # TESTING CODE # # test rownames are equal # test means are equal test_that('charcoal and char_pmfs match', { expect_equal(rownames(charcoal), colnames(char_pmfs)) expect_equal(as.numeric(charcoal$mn), as.numeric(mns)) }) # test the pmfs sum to 1 sums <- lapply(char_pmfs, sum) %>% as.numeric ones <- rep(1, ncol(char_pmfs)) max(sums-ones) # examine tolerance test_that('pmfs in char_pmfs sum to 1', { expect_equal(sums, ones, tolerance = 1e-4) })
Note that the pmfs will not sum exactly to one due to rounding error. First I examine the maximum size of the rounding error with max(sums-ones)
. I explictly set tolerance = 1e-4
to ingore this rounding error. Having passed the initial round of tests, I am confident the two datasets are correctly ordered and match, and we can proceed to calculate inherited age.
charcoal
is organized as a data.table to enable simple subsetting operations by the user. The convenient .N
term for delineating number of observations is convenient for producing summary statistics, when combined with by
indicated the variable for which to count observations. The data.table format allows the user to treat column names as variables when inside brackets, so subsetting a data.table using a logical test is as straightforward as charcoal[facies == 'DF']
.
library(data.table) # subset by facies df_count <- charcoal[facies == 'DF', .N, by = family] ff_count <- charcoal[facies == 'FF', .N, by = family] fg_count <- charcoal[facies == 'FG', .N, by = family] # family ids of sites with > 3 obs df_count[N>3] ff_count[N>3] fg_count[N>3]
Inherited age is the age of the charcoal before becoming entrained in bank deposits. By taking multiple samples of a facies type at a single sampling site, Lancaster & Casebeer (2007) were able to estimate a distribution range around inherited age.
charcoal
is intentionally sorted by rank so that an index of ranks is equivalent to the index of observations by row. Furthermore, the rank number of a sample from charcoal
is equivalent to column number of the pmf for the same sample in char_pmfs
. Rather than searching by site_id, I have elected to use rank number, and so the function rank_list(ftype, dat = charcoal, min_count = 3)
returns the ranks of observations of a given facies type from dat
above a minimum count min_count
, with dat
set to the charcoal data as a default, and minimum count set to 3.
The concluding test in the snippet tells me that the ranks returned by the function rank_list()
have the same family ids as the subsets in the previous example. While function calls in muddier
are relatively succinct, the testing code is generally lengthier, and so I delineate the two portions of the snippet with the comment # TESTING CODE #
.
df_ranks <- rank_list('DF') ff_ranks <- rank_list('FF') fg_ranks <- rank_list('FG') # inspect output df_ranks[[1]] # TESTING CODE # # family ids pulled by rank df_by_ranks <- charcoal[rank %in% unlist(df_ranks), family] %>% factor %>% levels %>% sort ff_by_ranks <- charcoal[rank %in% unlist(ff_ranks), family] %>% factor %>% levels %>% sort fg_by_ranks <- charcoal[rank %in% unlist(fg_ranks), family] %>% factor %>% levels %>% sort # family ids pulled by count df_by_count <- df_count[N>3,family] %>% unlist %>% sort ff_by_count <- ff_count[N>3,family] %>% unlist %>% sort fg_by_count <- fg_count[N>3,family] %>% unlist %>% sort test_that('rank list matches family ids', { expect_equal(df_by_ranks, df_by_count) expect_equal(ff_by_ranks, ff_by_count) expect_equal(fg_by_ranks, fg_by_count) })
I can examine the pmfs associated with a rank list using the function pmf_by_rank(ranks, pmfs = char_pmfs)
. Because the default argument for pmfs
is set to char_pmfs
, this second argument can be ommitted in this instance. When passed a vector of ranks to the first argument ranks
, pmf_by_rank
returns a data.table with pmfs for each rank stored by column, with probabilities stored as row values.
When passed a list of rank vectors, as output by rank_list
, to the first argument ranks
, pmf_by_rank
returns a list of data.tables, each containing the pmfs associated with ranks in the rank list, with probabiliites stored as row values.
To test the functionality of pmf_by_rank
, I am mostly concerned that I have inadvertently scrambled the pmfs. I am already highly confident the rows of both datasets match, and the pmfs sum to 1, so now I would like to ensure they are ordered within each family from youngest to oldest.
# return pmfs by rank df_pmfs <- pmf_by_rank(df_ranks) ff_pmfs <- pmf_by_rank(ff_ranks) fg_pmfs <- pmf_by_rank(fg_ranks) # TESTING CODE # # list wrapper for weighted.mean get_means <- function(pmfs, years) { lapply(pmfs, function(a) lapply(a, function(b) weighted.mean(years, b))) } df_mns <- get_means(df_pmfs, years) ff_mns <- get_means(ff_pmfs, years) fg_mns <- get_means(fg_pmfs, years) test_that('pmfs are sorted in ascending age', { for (i in seq_along(df_mns)) { expect_equal(as.numeric(df_mns[[i]]), sort(as.numeric(df_mns[[i]]))) } for (i in seq_along(ff_mns)) { expect_equal(as.numeric(ff_mns[[i]]), sort(as.numeric(ff_mns[[i]]))) } for (i in seq_along(fg_mns)) { expect_equal(as.numeric(fg_mns[[i]]), sort(as.numeric(fg_mns[[i]]))) } })
If we are trying to characterize the distribution of inherited ages from the range of age distributions at a particular site, we can reason that the charcoal in the samples was no older than the youngest piece of charcoal in the set when the samples became entrained as a bank deposit, simply because the youngest sample had to become charcoal before becoming a bank deposit. By this logic, we can characterize the distribution of inherited ages a given site by subtracting the youngest age distribution from the age distribution of each sample via convolution.
df_cl <- convo_list(df_ranks) ff_cl <- convo_list(ff_ranks) fg_cl <- convo_list(fg_ranks) # TESTING CODE # # slight variation on list wrapper pmf_means <- function(pmfs, years) { lapply(pmfs, function(a) lapply(a, function(b) weighted.mean(sort(years), b))) } dfc_mns <- pmf_means(df_cl, years) ffc_mns <- pmf_means(ff_cl, years) fgc_mns <- pmf_means(fg_cl, years) test_that('convolved pmfs are sorted in ascending age', { for (i in seq_along(dfc_mns)) { expect_equal(as.numeric(dfc_mns[[i]]), sort(as.numeric(dfc_mns[[i]]))) } for (i in seq_along(ffc_mns)) { expect_equal(as.numeric(ffc_mns[[i]]), sort(as.numeric(ffc_mns[[i]]))) } for (i in seq_along(fgc_mns)) { expect_equal(as.numeric(fgc_mns[[i]]), sort(as.numeric(fgc_mns[[i]]))) } }) # because we convolved the pmfs, check that they sum to 1 get_sums <- function(pmfs) { lapply(pmfs, function (a) lapply(a, sum)) %>% unlist %>% as.numeric } dfc_sums <- get_sums(df_cl) dfc_ones <- rep(1, length(dfc_sums)) max(dfc_sums-dfc_ones) # examine tolerance ffc_sums <- get_sums(ff_cl) ffc_ones <- rep(1, length(ffc_sums)) fgc_sums <- get_sums(fg_cl) fgc_ones <- rep(1, length(fgc_sums)) test_that('pmfs in char_pmfs sum to 1', { expect_equal(dfc_sums, dfc_ones, tolerance = 1e-4) expect_equal(ffc_sums, ffc_ones, tolerance = 1e-4) expect_equal(fgc_sums, fgc_ones, tolerance = 1e-4) })
The inherited age convolutions include values below zero. This occurs in the specific case of subtracting the youngest age from itself, which creates a distribution centered on zero, and approximately half of the values in the PMF will be below zero. Because we know that inherited ages must be positive, we can truncate the pmf at zero by setting probabilities at values below zero to zero using the trunc_0
. We then need to renormalize the truncated using the normalize
function, or we can do both at once using the trunc_norm
wrapper.
df_trc <- trunc_list(df_cl) ff_trc <- trunc_list(ff_cl) fg_trc <- trunc_list(fg_cl) # TESTING CODE # # test that truncated pmfs do not contain values below zero # sum probabilities of values below zero in a vector sum_subzero <- function(pmf, index = sort(years)) { pmf <- pmf %>% unlist %>% as.numeric sum(pmf[index<0]) } # list wrapper for sum_subzero subzero_sum <- function(pmfs, index = sort(years)) { lapply(pmfs, function(a) lapply(a, function(b) sum_subzero(b,index))) %>% unlist %>% as.numeric } df_szs <- subzero_sum(df_trc) ff_szs <- subzero_sum(ff_trc) fg_szs <- subzero_sum(fg_trc) # vector of zero to compare to for test df_zs <- rep(0, length(df_szs)) ff_zs <- rep(0, length(ff_szs)) fg_zs <- rep(0, length(fg_szs)) test_that('truncated pmfs do not contain values below zero', { expect_equal(df_szs, df_zs) expect_equal(ff_szs, ff_zs) expect_equal(fg_szs, fg_zs) }) # test that truncated pmfs are renormalized and sum to one list_sums <- function(pmfs) { lapply(pmfs, function(a) lapply(a, sum)) %>% unlist %>% as.numeric } dft_sums <- list_sums(df_trc) dft_ones <- rep(1, length(dft_sums)) max(dft_sums-dft_ones) # examine tolerance fft_sums <- list_sums(ff_trc) fft_ones <- rep(1, length(fft_sums)) fgt_sums <- list_sums(fg_trc) fgt_ones <- rep(1, length(fgt_sums)) test_that('truncated pmfs are renormalized and sum to 1', { expect_equal(dft_sums, dft_ones) expect_equal(fft_sums, fft_ones) expect_equal(fgt_sums, fgt_ones) })
The use of data.tables enables easy application of nested lapply
functions. In this case, the parent list is the set of sites containing multiple samples of a given facies. Each parent element contains a child list of pmfs by observations. Once we have derived the convolved pmfs of inherited age for a given facies, we can sum their distributions and renormalize to produce the inherited age distribution of the facies.
The convenience function rack
converts a list or data.table to a matrix by cbind, provided the elements of the list are of equal length. The pmfs are stored in row values with observations by column, so after racking the list into a large matrix, we can sum the distributions using rowSums
and renormalize using normalize
.
# convert from list of data.tables to matrix to_mat <- function(ls) { lapply(ls, rack) %>% rack } df_dt <- to_mat(df_trc) %>% rowSums %>% normalize ff_dt <- to_mat(ff_trc) %>% rowSums %>% normalize fg_dt <- to_mat(fg_trc) %>% rowSums %>% normalize # plot results plot(sort(years), df_dt, type = 'l', lwd = 3, col = 'brown', xlim = c(0,11000), main = 'DF Inherited Age', xlab = 'inherited age', ylab = 'pmf') plot(sort(years), ff_dt, type = 'l', lwd = 3, col = 'slateblue', xlim = c(0,4000), main = 'FF Inherited Age', xlab = 'inherited age', ylab = 'pmf') plot(sort(years), fg_dt, type = 'l', lwd = 3, col = 'forestgreen', xlim = c(0,10000), main = 'FG Inherited Age', xlab = 'inherited age', ylab = 'pmf')
While muddier
uses pmfs to represent pdfs to perform convolution, the results are often easier to consume as cdfs or log exceedance graphs. The to_cdf
and to_exceed
functions convert pmfs to cdfs and log10 exceedance value respectively.
# convert pmfs to cdfs df_cdf <- df_dt %>% to_cdf ff_cdf <- ff_dt %>% to_cdf fg_cdf <- fg_dt %>% to_cdf # convert pmfs to log 10 exceedance df_exc <- df_dt %>% to_exceed ff_exc <- ff_dt %>% to_exceed fg_exc <- fg_dt %>% to_exceed # plot results plot(sort(years), df_cdf, type = 'l', lwd = 3, col = 'brown', xlim = c(0,11000), ylim = c(0,1), main = 'DF Inherited Age', xlab = 'inherited age', ylab = 'cdf') plot(sort(years), df_exc, type = 'l', lwd = 3, col = 'brown', xlim = c(0,11000), main = 'DF Inherited Age', xlab = 'inherited age', ylab = 'log10 exc') plot(sort(years), ff_cdf, type = 'l', lwd = 3, col = 'slateblue', xlim = c(0,4000), main = 'FF Inherited Age', xlab = 'inherited age', ylab = 'cdf') plot(sort(years), ff_exc, type = 'l', lwd = 3, col = 'slateblue', xlim = c(0,4000), main = 'FF Inherited Age', xlab = 'inherited age', ylab = 'log10 exc') plot(sort(years), fg_cdf, type = 'l', lwd = 3, col = 'forestgreen', xlim = c(0,10000), main = 'FG Inherited Age', xlab = 'inherited age', ylab = 'cdf') plot(sort(years), fg_exc, type = 'l', lwd = 3, col = 'forestgreen', xlim = c(0,10000), main = 'FG Inherited Age', xlab = 'inherited age', ylab = 'log10 exc')
library(parallel) options(mc.cores=detectCores()) set.seed(10101) # boostrap ids with replacement at each site df_bl <- boot_ids(50, df_ranks) # convolve inherited ages at each site df_cbl <- mclapply(df_bl, convo_list) # truncate and normalize at zero df_t <- mclapply(df_cbl, trunc_list) # convert to matrix dfb_pmf <- lapply(df_t, to_mat) %>% rack dfb_pmf_cis <- apply(dfb_pmf, 1, get_cis) # get cis for pmf, cdf, and log 10 exc dfb_cdf <- apply(dfb_pmf, 2, to_cdf) dfb_cdf_cis <- apply(dfb_cdf, 1, get_cis) dfb_exc <- apply(dfb_pmf, 2, to_exceed) dfb_exc_cis <- apply(dfb_exc, 1, get_cis) rownames(dfb_pmf_cis) <- c('lwr','med','upr') rownames(dfb_cdf_cis) <- c('lwr','med','upr') rownames(dfb_exc_cis) <- c('lwr','med','upr') # plot results plot(sort(years), dfb_pmf_cis[3,], type = 'l', lwd = 3, col = 'brown', xlim = c(0,2000), main = 'DF Inherited Age', xlab = 'inherited age', ylab = 'pmf') lines(sort(years), dfb_pmf_cis[2,], lwd = 3, col = 'black') lines(sort(years), dfb_pmf_cis[1,], lwd = 3, col = 'slateblue') lines(sort(years), df_dt, lwd = 3, col = 'goldenrod') legend('topright', legend = c('obs','upr','med','lwr'), fill = c('goldenrod', 'brown', 'black', 'slateblue')) plot(sort(years), dfb_cdf_cis[3,], type = 'l', lwd = 3, col = 'brown', xlim = c(0,7000), main = 'DF Inherited Age', xlab = 'inherited age', ylab = 'cdf') lines(sort(years), dfb_cdf_cis[2,], lwd = 3, col = 'black') lines(sort(years), dfb_cdf_cis[1,], lwd = 3, col = 'slateblue') lines(sort(years), df_cdf, lwd = 3, col = 'goldenrod') legend('bottomright', legend = c('obs','upr','med','lwr'), fill = c('goldenrod', 'brown', 'black', 'slateblue')) plot(sort(years), dfb_exc_cis[3,], type = 'l', lwd = 3, col = 'brown', xlim = c(0,11000), main = 'DF Inherited Age', xlab = 'inherited age', ylab = 'log10 exc') lines(sort(years), dfb_exc_cis[2,], lwd = 3, col = 'black') lines(sort(years), dfb_exc_cis[1,], lwd = 3, col = 'slateblue') lines(sort(years), df_exc, lwd = 3, col = 'goldenrod') legend('topright', legend = c('obs','upr','med','lwr'), fill = c('goldenrod', 'brown', 'black', 'slateblue'))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.