knitr::opts_chunk$set(echo = TRUE)

Data Simulation

This tutorial uses the function from package FossilSim to simulate datasets from posterior distribution. We quantify some characteristics of both the simulated and emperical(origin time) datasets using test statistics, here, mean and standard deviation. The goal is to check the model accuracy.

Installation

You might need to install and load these packages:

install.packages("tidyverse")
install.packages("ggplot2")
library(tidyverse)
library(ggplot2)
install.packages("devtools")
library(devtools)

You need to install FossilSim R package for this package to run.

FossilSim is used for simulating trees directly from a fossilized-birth death process using the sim.fbd function.

install.packages("FossilSim")
library(FossilSim)

Now, we install my R package.

devtools::install_github("Anisha-Neupane/R_package_Neupane")
library(NeupaneRpackage)

Now, we load my R package using devtools. load_all() loads a package. It roughly simulates what happens when a package is installed and loaded with library().

devtools::load_all()

Let us download some data for our package:

download.file(url = "https://raw.githubusercontent.com/Anisha-Neupane/R_package_Neupane/master/Data/ants_timevary.log", destfile = "/cloud/project/vignettes/Ants_timevary.log")

Load the data using read_tsv.

all <- read_tsv("/cloud/project/vignettes/Ants_timevary.log")  #read_tsv is used for tab separated ("\") values.
all

Rename speciation_rate, extinction_rate and psi:

lambda <- all$speciation_rate
mu <- all$extinction_rate
psi <- all$psi

Function: Simulation

This function simulates trees while conditioning on the number of sampled extant taxa using sim.fbd.taxa function. It requires the fossil sampling rate parameter psi as an argument.

Now, lets simulate the data using the function simulation:

i = 1
trees <- vector()
while (i <=20) { 
trees[i] <- c(NeupaneRpackage::Simulation(n = 10, numbsim = 1, lambda[i], mu[i], psi[i], frac = .01))
print(trees[i])
i <- i + 1
}

In the above code we have created a loop using 'while' so that we do not need to enter the values manually. Then we have created the variable tree which is a vector to store our data from simulation.

It returns a list of SAtree objects as the simulated trees may contain sampled ancestors as tips with zero-length branches.

We can check the simulated data value using trees[[n]]$root.edge function.

For Example:

trees[[1]]$root.edge

It returns the simulated value from first tree simulated. We can do the same for other 19.

If we want to create an object to store the simulated data, we can create a for loop which collects the data at once for all 20 trees.

i = 1  
simulated_data <- vector()
for (i in 1:20) { 
simulated_data[i] <- c(trees[[i]]$root.edge)
}

We also need to create some variable to put our emperical value which is the origin time from our given dataset, and calculate mean for both.

emperical_data <- all$origin_time
meane <- mean(emperical_data)
means <- mean(simulated_data)

Function treeplot

This function plots a strati graphic range representation of the tree and fossil samples using rangeplot.asymmetric function. It assumes that all speciation events are asymmetric.

Working example:

treeplot(1)

Now, lets create a loop for this function that gives the plot for all the simulated data at once.

i = 1
for (i in 1:20) {
treeplot(i)
} 

Function comparision

This function compares the simulated data with the emperical data using emperical mean. It returns the ggplot of simulated data against emperical mean. If the emperical test statistic value is very different from the simulated ones, our model is not doing a good job of replicating some aspect of the process that generated our data.

We can make a dataframe to keep the necessary data for our convenience:

datas <- data.frame(simulated_data, emperical_data, meane, means)

Working example of the function:

comparision(datasdf = datas)

It returns a curve for the simulated data and a dotted line for emperical mean.

Function effect_size

This function calculate the difference between the emperical test statistic value and the median of simulated test statistic divided by the standard deviation of simulated test statistic. It shows how many standard deviations away from the median is the observed value.

Effect_size(datasdf = datas)


Anisha-Neupane/R_package_Neupane documentation built on Dec. 17, 2021, 8:52 a.m.