library(knitr)
opts_chunk$set(cache = TRUE,
               message = FALSE,
               warning = FALSE)

This is a tutorial on using the program PyRate for the analysis of cultural evolution. PyRate was originally developed by Daniele Silvestro and colleagues in a series of papers. We include his Python code here in the PyRate/ directory. I highly recommended reading these papers for additional details, Methods in Ecology and Evolution and Systematic Biology.

The first application of this methodology to cultural data can be found in Palgrave Communications.

This tutorial we use a combination of R (for data manipulation and plotting) and Python (run through a terminal window) for analysis. R (required) and R Studio (strongly recommended) can be downloaded from the following websites (R and R Studio).

For more inexperienced users of the R programming environment, many functions require the installation of additional packages. For packages that are not currently installed on the user's local machine they can be installed by the following commands.

Loading the required libraries can be done with the following commands:

library(dplyr)
library(tidyr)
library(stringr)
library(scales)
library(laser) # laser package no longer on CRAN, so we can get it like this: devtools::install_github("cran/laser")
library(RCurl)

If you receive an error message from the above line, you probably need to run install.packages("name_of_library") to download the package and then try again.

Loading the raw data

In this tutorial, we will be working with two data sets. The first contains a list of all the car models produced by American manufacturers between 1896 and 2014. This data was used in the Palgrave Communications paper (Gjesfjeld et al. 2016). By going through this tutorial similar results should be obtained. The second data set contains a list of archaeological pottery ware types from the American Southwest that have had there names and dates randomized.

Let's load the data:

car_models <- read.csv("data/car_models_tutorial.csv", header = TRUE)
pottery_data <- read.csv("data/pottery_data_tutorial.csv", header = TRUE)

The first thing we want to do is look at the structure of our data. We can look at the first few entries by using the following command:

head(car_models)
head(pottery_data)

We can quickly see that the format of the two data sets is different. The car models has a date for the first and last production year of each model, or lineage. In contrast, the pottery data set has multiple dates for each type of ware. This is because these data sets are at different scales. The pottery data contains information for every pottery sherd in our analysis, where as the car model only shows information for each car model (not every single car).

The important difference is that with the pottery data set, we have enough additional information in order to evaluate each pottery ware based on multiple dates. This is particularly useful for archaeological data where we might not be sure that the most recent and oldest ages are completely accurate. Given the detailed historical record of car models, we have high confidence that the first and last years of production for each car model is correct, so we are more interested in knowing the origination and extinction rate rather than the preservation rate. Throughout this tutorial I will refer to the pottery data set as an occurrence data set (as we have dates for each individual occurrence) and the car models as a lineage data set (where we have dates for each lineage). Of course, we can quickly change the pottery data set from occurrence data to lineage data by summarizing the most recent and oldest dates for each ware type into a single entry (i.e. Brown_Ware_A 66 1465). However, we this should only be done if the research is highly confident in the dates being used.

Please also not that the dates in both data sets are in years before present (or BP). This means that 0 is the most recent time unit and the higher the number the further back in time. Please make sure that the your own data follows this date structure (i.e. please do not use Gregorian calendar dates).

Formatting Data For PyRate

In order to use lineage data with PyRate, we must further transform our data into a structure that can be read by the program. We will use the function lineage_pyrate. The main output of this function is a list of three data frames:

library(pyrater)
car_models_species_pyrate_data <- lineage_pyrate(car_models)

#  We can preview each data frame:
lapply(car_models_species_pyrate_data, head)

Lineage data (car model) for PyRate

In the 'species_pyrate_data' data frame we can see that the names of the car models and manufacturers have been replaced by numeric values. The lineage_pyrate function also produces two data frames which provide the keys for linking the numeric values to their names. In all of these files we are loosely borrowing from biology nomenclature by referring to the car models as species and the manufacturers of cars as clades. These terms are used in the most generic way possible with species representing our smallest taxonomic unit (car models or lineages) and clades representing a way to group species, in this case by their manufacturer.

It is possible that you may not have a adequate grouping variable (clade) for lineages. If so, simply adjust the numeric clade value to 0 using the following command.

car_models_species_pyrate_data$species_pyrate_data$clade <- 0

We will also want to write a text file to be used with the PyRate analysis later on. This file will appear in your current working directory

write.table(car_models_species_pyrate_data$species_pyrate_data,
            "lineage_pyrate_data.txt",
            quote=FALSE,
            sep="\t",
            row.names = FALSE)

Formatting Ocurrence Data for PyRate

As discussed above, data sets in occurrence format (pottery_data) provide the opportunity to incoporate uncertainty about the preservation of artifacts when analyzing origination and extinction rates. If a researcher does not have a high level of certainty that the dates of first and last occurrence are correct, such as is often the case with fossils or artifacts, preservation can be informative. Please see the discussion by Silvestro et al. (2014) for further details. This function will produce an R object title occurrence_pyrate_data that structures the pottery data into four categories.

occurrence_pyrate_data <- occurrence_pyrate(pottery_data)

# take a look
head(occurrence_pyrate_data)

The four columns in this object are:

Species: which once again refers the smallest taxonomic unit which is each pottery sherd;

Status: which identifies whether the sherd or ware is still used today, with archaeological data these are all extinct but could be extant with modern technological data

min_age: this refers to the most recent age assigned to each sherd (dates should be in BP)

max_age: this refers to the oldest age assigned to each sherd (dates should be in BP)

trait: this refers to some discrete trait that is known about each entry. In this example, each trait could represent an archaeological site in which the sherd was found at.

The next step is to prepare this occurance data frame for input into Python. This is accomplished by calling the extract.ages function which extracts the ages and creates a readable python file:

extract.ages(occurrence_pyrate_data)

Additional options, such as replicates, can be included in the extract.ages function. See Silvestro et al. 2015 for details.

A python file titled occurrence_pyrate_data_PyRate.py should have been created and available in the working directory.

Summary Plots

Prior to starting the PyRate analysis, it is good practice to get a sense of the data by developing a series of plots. Using the function below, the following plots are created

  1. Log Lineages Through Time
  2. Cumulative Diversity Through Time
  3. Average Lifespan Through Time
  4. Histogram of Lifespans

Car Data (Lineage Data)

# The numeral indicates the x-axis increments with 
# the number 5 indicating 5-year time bins
summary_plots(car_models_species_pyrate_data$species_pyrate_data, 5) 

Pottery Data (Occurrence Data)

Given the long time scales of archaeological or paleontological data, we will want to rescale the temporal bins in our graphs. Therefore, we will change the second value from 5 in the first data set to 100.

summary_plots_occurrence(occurrence_pyrate_data, 100)

Choosing the best underlying model

The first decision to be made in the analysis of cultural data is whether a birth-death model is an appropriate model for the data under analysis. In this analysis, a birth-death model refers to a continuous-time Markov process where in each time step a the unit of analysis (typically a lineage or new product) can either originate (birth) or go extinct (death). Being that the birth-death model is an example of a continuous-time Markov process, the emergence or discontinuation of a lineage in the future depends on the current state of the model in the previous time step. In PyRate, if a birth-death model is favored, this means that the origination or extinction of a cultural lineage or material culture product is influenced by the existing number of lineages in the previous time step. This can be seen as suggestive of incremental innovation where the emergence of a new lineage or product is strongly influenced by the previous lineage or products available, likely through minor modifications of existing traits. A birth-death model is generally suggestive of a strong inheritance pattern between generations of cultural lineages or artifacts. (For example, a new car model that is a combination of traits from other already existing car models).

In contrast to many biological systems, where birth-death models have been shown to be very reliable, it is not always known if a birth-death model is appropriate for cultural data. An alternative model, developed for this research, is an immigration-death model. In this model, the emergence or discontinuation of a lineage or product is independent of the previous states of the model. This models has an underlying Poisson distribution which highlights that the emergence or discontinuation of a cultural lineage or product would occur randomly in each time step. This can be seen as suggestive of more radical innovation or horizontal transmission of traits from individuals or technologies from outside of the direct lineages. (For example, a new car model that has novel traits that were developed in the airline industry).

In order to test between these two models, the data set will be analyzed in two separate analysis using the -mBDI tag. In order to execute the analysis, the following commands will automatically open a terminal window and execute Python code. You need to have Python 2.7 installed for this to work. You can download this from https://www.python.org/. You will also need numpy and scipy for your Python installation. The code below assumes you have Python 2.7 installed at C:/python27/ (more preceisely, the python.exe file, if you are using Windows). This location may be different on your computer. You should check, and then update the code below so it's correct for your computer. On OSX, you probably just need python2.7 instead of C:/python27/python.exe If you would like to see the console output from these functions, add intern = TRUE to the end of the system() function.

Car Models

# set the number of generations. 10,000,000 generations = 10 hrs

n_generations <- 10000

# remember to check the location of your 'python.exe' file for Python 2.7

# birth-death model
# system(paste0("C:/python27/python.exe PyRate/PyRate.py -d lineage_pyrate_data.txt -A 2 -mBDI 0 -n ", n_generations))

mBDI(input = car_models_species_pyrate_data$species_pyrate_data, 
     n_generations, 
     model = 0)


# immigration-death model
system(paste0("C:/python27/python.exe PyRate/PyRate.py -d lineage_pyrate_data.txt -A 2 -mBDI 1 -n ", n_generations)) 

Pottery Data

system(paste0("C:/python27/python.exe PyRate/PyRate.py  occurrence_pyrate_data_PyRate.py -A 2 -fixSE null -mBDI 0 -n ", 
              n_generations)) 

system(paste0("C:/python27/python.exe PyRate/PyRate.py  occurrence_pyrate_data_PyRate.py -A 2 -fixSE null -mBDI 1 -n ", 
              n_generations)) 

Upon executing these commands, a new folder will be created in your working directory called pyrate_mcmc_logs. This analysis uses the default settings of samples every 1,000 generations for 10,000,000 generations. Details about changes to PyRate default settings can be found in Silvestro et al. 2014. Depending on the speed of the computer, with 10,000,000 generations the car model set will take approximately 7 hours to finish and the pottery data will take approximately 3 hours. If access to a computing cluster is available, this is the preferred option. To reduce the run time of the analysis, simply adjust the number of generations from 10 million to 1 million.

Visualizing the Rates of Origination and Extinction

Once the BDMCMC runs are finished, one can easy view the a rate through time plot by using the following commands:

system("C:/python27/python.exe PyRate/PyRate.py -plot pyrate_mcmc_logs/lineage_pyrate_data_0_ID_marginal_rates.log")

# or 

system("C:/python27/python.exe PyRate/PyRate.py -plot pyrate_mcmc_logs/occurrence_pyrate_data_1ID_marginal_rates.log")

These plots will use the default plot setting in PyRate and create both an R file and pdf with the rate-through-time plots. Open the pdf and check out the rates of origination and extinction through time!

If we want to make some more refined plots we will read in the data in R and use the base packages to plot our data.

First, we will plot the lineage car model data.

Note: you must run the -plot commands from the previous section before running these commands

source("pyrate_mcmc_logs/lineage_pyrate_data_0_ID_marginal_rates_RTT.r")
age_rev <-
  2015 + age # adjust for the most recent date in each of the datasets

# Origination
plot(
  age_rev,
  L_hpd_M95,
  type = 'n',
  ylim = c(0, max(L_hpd_M95)),
  xlim = c(min(age_rev), max(age_rev)),
  xlab = "Years",
  ylab = "Rate of Event Per Lineage Per Time Unit",
  main = "Origination"
)
plot_RTT(age_rev, L_hpd_M95, L_hpd_m95, L_mean, "navyblue")

# Extinction
plot(
  age_rev,
  M_hpd_M95,
  type = 'n',
  ylim = c(0, max(M_hpd_M95)),
  xlim = c(min(age_rev), max(age_rev)),
  xlab = "Years",
  ylab = "Rate of Event Per Mineage Per Time Unit",
  main = "Extinction"
)
plot_RTT(age_rev, M_hpd_M95, M_hpd_m95, M_mean, "red")

# Net Diversification
plot(
  age_rev,
  R_hpd_M95,
  type = 'n',
  ylim = c(min(R_hpd_m95), max(R_hpd_M95)),
  xlim = c(min(age_rev), max(age_rev)),
  xlab = "Years",
  ylab = "Rate of Event Per Mineage Per Time Unit",
  main = "Net Diversification"
)
plot_RTT(age_rev, R_hpd_M95, R_hpd_m95, R_mean, "darkgreen")

And the occurrence pottery data

source("pyrate_mcmc_logs/occurrence_pyrate_data_1ID_marginal_rates_RTT.r")
age_rev <-
  1965 + age # adjust for the most recent date in each of the datasets

# Origination (Note that the the ages are slightly reversed in plotting as this data is plotted in BP format unlike the cars which is in a calendar format)
plot(
  age_rev,
  L_hpd_M95,
  type = 'n',
  ylim = c(0, max(na.omit(L_hpd_M95))),
  xlim = c(max(age_rev), min(age_rev)),
  xlab = "Years (BP)",
  ylab = "Rate of Event Per Lineage Per Time Unit",
  main = "Origination"
)
plot_RTT(rev(age_rev), L_hpd_M95, L_hpd_m95, L_mean, "navyblue")

# Extinction
plot(
  age_rev,
  M_hpd_M95,
  type = 'n',
  ylim = c(0, max(na.omit(M_hpd_M95))),
  xlim = c(max(age_rev), min(age_rev)),
  xlab = "Years (BP)",
  ylab = "Rate of Event Per Lineage Per Time Unit",
  main = "Extinction"
)
plot_RTT(rev(age_rev), M_hpd_M95, M_hpd_m95, M_mean, "firebrick")

# Net Diversification
plot(
  age_rev,
  R_hpd_M95,
  type = 'n',
  ylim = c(min(na.omit(R_hpd_m95)), max(na.omit(R_hpd_M95))),
  xlim = c(max(age_rev), min(age_rev)),
  xlab = "Years (BP)",
  ylab = "Rate of Event Per Lineage Per Time Unit",
  main = "Net Diversification"
)
plot_RTT(rev(age_rev), R_hpd_M95, R_hpd_m95, R_mean, "green")

Once you have the plots available in R you can use the wide suite of graphic abilities to make publication quality graphics. Such as the figure from the Palgrave Communications publication.

Figure 2. Origination and Extinction Rates of Car Models Through Time

It is important to note that we are unable to quantitatively test whether a birth-death or immigration-death model best fits our data from the plotting of origination, extinction and net diversification rates. In order to identify the best-fitting model we need to model test between all the most likely models.

Model testing between the BD/ID models

To determine the best fitting model, we want to first know how many different rates fit our data. It is unlikely that the rates of origination and extinction for cultural data through time are constant, but we don't know how many rates best fit the data that we have. In order to determine this, run the following commands

mProb_occurrence_pyrate_data_1BD <- system("C:/python27/python.exe PyRate/PyRate.py  -mProb  pyrate_mcmc_logs/occurrence_pyrate_data_0_BD_mcmc.log", 
       intern = TRUE) 

mProb_occurrence_pyrate_data_1ID <- system("C:/python27/python.exe PyRate/PyRate.py  -mProb  pyrate_mcmc_logs/occurrence_pyrate_data_0_ID_mcmc.log", 
       intern = TRUE) 

The input file paths are:

pyrate_mcmc_logs/occurrence_pyrate_data_1BD_mcmc.log pyrate_mcmc_logs/occurrence_pyrate_data_1ID_mcmc.log pyrate_mcmc_logs/lineage_pyrate_data_1BD_mcmc.log pyrate_mcmc_logs/lineage_pyrate_data_1ID_mcmc.log

PyRate will provide a breakdown of the probability of most likely rate models for both speciation and extinction. In addition the best configuration of rates of also provided. It is probably best to open a spreadsheet (excel or Google sheets) and record these values.

| BD | Origination | Extinction | | ID | Origination | Extinction | |-------------|-------------|------------|---|------------|-------------|------------| | 1-rate | 0.0345 | 0.0494 | | 1-rate | 0.4375 | 0.0423 | | 2-rate | 0.5101 | 0.5104 | | 2-rate | 0.4137 | 0.5136 | | 3-rate | 0.3192 | 0.3088 | | 3-rate | 0.124 | 0.3074 | | 4-rate | 0.1086 | 0.1037 | | 4-rate | 0.0208 | 0.1041 | | 5-rate | 0.022 | 0.0227 | | 5-rate | 0.0038 | 0.0265 | | 6-rate | 0.0048 | 0.0037 | | 6-rate | 0.0002 | 0.0056 | | 7-rate | 0.0006 | 0.0009 | | 7-rate | 0 | 0.0004 | | 8-rate | 0.0001 | 0.0004 | | 8-rate | 0 | 0.0001 | | | | | | | | | | Best BD/ID | | | | Best BD/ID | | | | Birth Rates | Death Rates | Rel. Prob. | | Im. Rates | Death Rates | Rel. Prob. | | 2 | 2 | 0.268 | | 1 | 2 | 0.224 | | 3 | 2 | 0.158 | | 2 | 2 | 0.215 | | 2 | 3 | 0.154 | | 1 | 3 | 0.138 | | 3 | 3 | 0.101 | | 2 | 3 | 0.122 | | 4 | 2 | 0.053 | | 3 | 2 | 0.062 | | 2 | 4 | 0.051 | | | | |

In this example, the best rate configuration for the Birth-Death model for the preservation data set is 2 origination rates (1 rate shift) and 2 extinction rates (1 rate shift), with a 3 origination 2 extinction rate configuration is equally likely. The best rate configuration for the Immigration-Death model is 1 origination rates (no rate shifts) and 1 extinction rates (1 rate shifts).

Once the marginal probabilities are recorded, we need to examine each of the birth-death and immigration-death configuration. We will again use PyRate but with a thermodynamic integration (TI) model rather than the birth-death MCMC to analyze the data. See Silvestro et al. (2014, 2015) for details on the implementation of a TI approach.

In the following code, -A 1 specifies the use of TI, -mL is the number of origination rates (L stands for lambda), -mM is the number of extinction rates (M stands for mu), -mBDI is the birth-death (mBDI 0) or immigration-death (mBDI 1). Each line of code should be run in a separate terminal window (or on a separate screen on a Linux server).

#This example is for lineage data only, see the example below for occurrence data.  

cd ~/Desktop/PyRate_CE_Tutorial
python2.7 PyRate.py -d lineage_pyrate_data.txt -A 1 -mL 2 -mM 2 -mBDI 0 #Most likely
python2.7 PyRate.py -d lineage_pyrate_data.txt -A 1 -mL 3 -mM 2 -mBDI 0 
python2.7 PyRate.py -d lineage_pyrate_data.txt -A 1 -mL 2 -mM 3 -mBDI 0 
python2.7 PyRate.py -d lineage_pyrate_data.txt -A 1 -mL 3 -mM 3 -mBDI 0 
python2.7 PyRate.py -d lineage_pyrate_data.txt -A 1 -mL 4 -mM 2 -mBDI 0 
python2.7 PyRate.py -d lineage_pyrate_data.txt -A 1 -mL 2 -mM 4 -mBDI 0  #Least Likely

python2.7 PyRate.py -d lineage_pyrate_data.txt -A 1 -mL 1 -mM 2 -mBDI 1 #Most likely
python2.7 PyRate.py -d lineage_pyrate_data.txt -A 1 -mL 2 -mM 2 -mBDI 1 
python2.7 PyRate.py -d lineage_pyrate_data.txt -A 1 -mL 1 -mM 3 -mBDI 1 
python2.7 PyRate.py -d lineage_pyrate_data.txt -A 1 -mL 2 -mM 3 -mBDI 1 
python2.7 PyRate.py -d lineage_pyrate_data.txt -A 1 -mL 3 -mM 2 -mBDI 1 #Least Likely


python2.7 PyRate.py preservation_pyrate_data_PyRate.py -A 1 -mL 2 -mM 3 -mBDI 1 -fixSE null 

#Occurrence Data - Only one example listed adjust accordingly for the number of different rate models as highilighted above. Note the inclusion of the -fixSE null command that is necessary to run the A1 model with occurrence data.  

python2.7 PyRate.py occurrence_pyrate_data_PyRate.py -A 1 -mL ? -mM ? -mBDI 0 -fixSE null
python2.7 PyRate.py occurrence_pyrate_data_PyRate.py -A 1 -mL ? -mM ? -mBDI 0 -fixSE null

After the A1 models are finished (which could take a few hours) create a new folder in the PyRate_CE_Tutorial folder titled A1_mcmc_logs. Take each of the _mcmc.log files from the recently finished A1 analyses and place them in this folder. In order to compare these different rate models, we want to calculate the marginal likelihood of each model. To do this we use the following command in a terminal window

cd ~/Desktop/PyRate_CE_Tutorial
python2.7 PyRateContinuous.py -mL ~/Desktop/PyRate_CE_Tutorial/A1_mcmc_logs

PyRate continuous is a complementary program to PyRate the evaluates the correlation between our data and other variables of interest. This above command with produce a number of new files in the A1_mcmc_logs folder. These will include a "cold" version of each mcmc_log and a marginal_likelihoods.txt file. Open the marginal likelihood text file in a spreadsheet and sort the BD_lik column from largest to smallest. The model with the highest value (or closest to zero if the values are negative) is the rate model that best fits the data. In the example data set given here, we can see that the BD_33 rate model is the best fitting model compared to all other models with a marginal likelihood of 14491.35. Furthermore, we can see that all of the birth-death models fit better than the immigration-death models suggesting that a birth-death model is the best fit for our data.

marg_likes<-read.delim("~/Desktop/PyRate_CE_Tutorial/A1_mcmc_log_files_tutorial/marginal_likelihoods.txt",header=T)
arrange(marg_likes,desc(BD_lik))

Finding the Times of Shift Points

Once we have determined the best fitting model, a 3 origination rate and 3 extinction rate model for our example data, we can identify the times that rate shifts were most likely to occur. To do this we need to run the following code:

BD_33<-read.delim("~/Desktop/PyRate_CE_Tutorial/A1_mcmc_log_files_tutorial/ebay_pyrate_american_mBDI_0_BD33_TI_mcmc.log",row.names=1) #This one was chosen as it was the best fitting model
BD_33<-BD_33[1:1000,] #removal of the rest of the TI as we are not interested in it

sp.1<-shift.points.func(BD_33$shift_sp_1, 2014.5) #The year is the most recent year in the dataset plus 0.5
sp.2<-shift.points.func(BD_33$shift_sp_2, 2014.5)
ex.1<-shift.points.func(BD_33$shift_ex_1, 2014.5)
ex.2<-shift.points.func(BD_33$shift_ex_2, 2014.5)
shift_points<-data.frame(sp.1=sp.1[1,1],sp.2=sp.2[1,1],ex.1=ex.1[1,1],ex.2=ex.2[1,1])
shift_points
#NOTE: The number of shifts will also be one less than the number of rates expected. So, if the best fitting model is a model with 4 origination and extinction rates you will need to add a sp.3 and a ex.3 to the code above. 

The resulting R object provides the years for which the rate shifts are most likely to take place. Prior to the final R object (shift points) you are also seeing the probabilities for each particular shift point.

Summary

Well, we have finally finished the this tutorial! In it we have estimated the origination, extinction and net diverisfication rate for two cultural data sets, determined the birth-death model as the best fitting model and identified the temporal shifts points when rate shift were most likely to occur.

The broader contribution of the PyRate method, or more broadly a stratigraphic method, is the ability to evaluate the dynamics of diversification over continuous time with dates. We can see in the example of the car models that there has been a nearly four-fold decrease in the origination and extinction rate of American car models since the 1920s with the most significant decrease occurring during the 1940s and 1950s. Finally, this is the first draft of this tutorial and any mistakes, errors or clarifications would be valuable for future drafts. Please do not hesitate to contact me at egjesfjeld@socgen.ucla.edu with any questions or comments.



benmarwick/pyrater documentation built on May 7, 2019, 10:38 a.m.