knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

The following script carries out data cleaning and reshaping. It also provides some meta data on columns the dan

Citation

Lloyd et al. 2016. Assessing conservation status of resident and migrant birds on Hispaniola with mist-netting. PeerJ. https://peerj.com/articles/1541/

Preliminaries

Load Libraries

library(reshape2)
library(here)
library(lubridate)

Load data

The here() function of the here package is used to construct the correct file path for the raw data

file. <- "inst/extdata/rawdata/jlloyd.7.3.csv"
path. <- here::here(file.)

Getting to know the data

Check to make sure it work

file.exists(path.)

Now load the actualy data

dat <- read.csv(path., stringsAsFactors = F)

Check out data structure

Use basic R functdions to get to know the data

dim(dat)
names(dat)
summary(factor(dat$Date))

Check out data layout

Each row is the capture record for a single bird at a single time point. If a bird was captured multiple times over the course of the study, there would be a row for each capture, regardless of when it occured.

head(dat[,c(1:10)])

Here's an example of data for a bird capture multiple times

i.7caps  <- which(dat$Metal.Band == "135192207")
head(dat[i.7caps,c(1:10)])

Check out data columns

names(dat)

Basic data cleaning

Clean dates

Transform the date column into the standard "Date" class

dat$Date2 <- lubridate::mdy(dat$Date)

Extract year

dat$Year <- lubridate::year(dat$Date2)

Extract month and day

dat$Mo <- lubridate::month(dat$Date2)
dat$Day <- lubridate::day(dat$Date2)

Check out the results

summary(factor(dat$Year))
summary(factor(dat$Mo))

Comments on banding

Look at comments section. Some birds were re-banded, lost bands, etc.

#"not banded"
#"died"

dat$Comments[grep("change",dat$Comments)]
dat$Comments[grep("new",dat$Comments)]
dat$Comments[grep("added",dat$Comments)]
dat$Comments[grep("metal",dat$Comments)]
dat$Comments[grep("band",dat$Comments)]

Comments on effort

Some years they did extra banding outside their main "constant effort" netting. It appears that 8 records are due to this work not associated with their many study goals.

dat$Comments[grep("effort",dat$Comments)]

Create indicator for non-constant effort samples (extras)

i.nonconstant <- grep("constant",dat$Comments)
dat$Comments[i.nonconstant]
dat$effort <- "constant"
dat$effort[i.nonconstant] <- "nonconst"

summary(factor(dat$effort))

How many unique bands?

Summary:

Determine how many unique metal bands in the entire data set. This includes captures from outside the focal years/months and non-constant effort banding, but does not include any birds which were only colored bandeded. This is 1815 bands.

length(unique(dat$Metal.Band))

Focal months/years AND constant-efort only, but again, only metal bands - birds only receiving colored bands are excluded. This is 1359 bands. They report 1597 captures. The difference is likely due color bands.

i.07.all <- which(dat$Year > 1996 & 
                dat$Mo %in% c(1,2,3) &   
                dat$effort == "constant")
length(unique(dat$Metal.Band[i.07.all]))

After I finish all the processing steps (below) I end up with 1518 unique birds. There are therefore 79 unique indivuals I haven't accounted for somehow.

1597-1518

Create unique ID for each bird

The following code creates an unequivocally unique ID for each unique bird by combining its spp, site etc with bands (metal and/or color).

dat$ID <- with(dat, paste(Species, 
                          Location,
                          Metal.Band,
                          Color.bands,
                          TempMarker, sep = ""))

#convert any NAs to an empty string
dat$ID <- gsub("NA","",dat$ID)


length(unique(dat$ID[i.07.all]))

How many birds never banded with either metal nor color bands? When this occurs, the "ID" column I just created will be made up of just the Species and Location

i.noband <- which(dat$ID == paste(dat$Species, dat$Location, sep = ""))
length(i.noband)

Create Code for sampling session

In 2007 and 2008 they revisited one site an additional time in March

Make the default sessions "1"

dat$Session <- 1

ID which rows in the data are from March 2007 or 2008.

i.March.07.08 <- with(dat, which(Year %in% c(2007,2008) &
                Mo == 3))

Code those rows as session "2"

dat$Session[i.March.07.08] <- 2

Check the result

summary(factor(dat$Session))

The "Repeat.New"" column

This column is useful for reproducing Table 2 of Lloyd et al. 2015.

Repeat.New = "A code indicating if the bird had been previously captured and banded."

Examine some details of data

Below are some notes on certain details of the paper that were investigated while working throught the data

Rare species "HILC"

Only occurs once; not in Table 2 in text of paper; just double checking data

dat[which(dat$Species=="HILC"), ]

Rare species RHQD

Only occurs once; not in Table 2 in text of paper; just double checking data

dat[which(dat$Species=="RHQD"), ]

Examine 2nd march sampling data

This is the additionally sampling date that occurred in 2 of the years.

dat$Species <- factor(dat$Species)
summary(dat[i.March.07.08,c("Repeat.New","Species")])

Make Table 2

TODO:display Lloy'd et al's original table here to show what end product is

Note: throughout the paper the paper the authors exclude some additionally sampling dates

"To minimize the potentially confounding effects of seasonal variation in abundance and bird behavior that may affect capture rate, this analysis does not include data collected during the May, November, and December visits." (Lloyd et al. 2016)

Species total per year

First, calc total for each species for each year.

Subset used for data analysis:

dat$Year <- as.numeric(dat$Year)
i.07.ff <- which(dat$Year > 1996 &    #Exclude 1996
                dat$Repeat.New == 1 & #exclude ALL recaps
                dat$Mo %in% c(1,2,3) &  #use only focal months
                dat$effort == "constant") #use only data from constant-effort netting

Reshape the data. Note: I use older data manipulation package reshape2 rather than the Tidyverse's tools like dplyr.

cast1 <- reshape2::dcast(data = dat[i.07.ff,],
               formula = Species + Year  ~ ID,
               value.var = "Species",
               fun.aggregate = length)

dim(cast1)#262 1652? #current = 257 1486

Look at the output: each individual is represented by a column, with typically a 1 in a row for each year they were captured. If they were captured more than once within a year then there should be a value greatter than 1.

dim(cast1)
head(cast1[,1:10])
summary(cast1[,1:10])

Recode 0/1

Recode any values >1 as 1 to indicate just that a bird was captured at least once.

fx01 <- function(x){ifelse(x > 0,1,0)}

cast1[,-c(1:2)] <- apply(cast1[,-c(1:2)],2,fx01)

Calculate total caps

ann.tot <- apply(cast1[,-c(1:2)],1,sum)
ann.tot <- cbind(cast1[,c(1:2)],ann.tot)

Cast to total across years

spp.tot <- reshape2::dcast(data  = ann.tot,
            formula = Species ~ .,
            value.var = "ann.tot",
            fun.aggregate = sum)

      names(spp.tot)[2] <- "spp.total"

Save table 2

#save output
write.csv(spp.tot,
          row.names = F,
                file = f.)

Compare original with reproduced table 2

Not implemented; previous version of spreadsheet in /tables folder.

    # #Load original of table
    # Lloyd.table2 <- read.csv(file = "./data_Lloyd_PeerJ2016/Lloyd_table2_CSV.csv")
    # 
    # #fix name in reproduced version
    # names(spp.tot)[1] <- "spp"
    # 
    # #merge
    # table2.compare <- merge(spp.tot,Lloyd.table2, all = T)
    # 
    # #difference in counts
    # table2.compare$diff <- with(table2.compare,
    #                             Lloyd.N - spp.total)

Look at output

# table2.compare[,c("spp","species",
#                   "spp.total","Lloyd.N",
#                   "diff","B.yr")]
# 
# sum(table2.compare$spp.total)
# sum(table2.compare$Lloyd.N,na.rm = T)

Save merged table

# write.csv(table2.compare,row.names = F,
#           file = "./output_tables/table2_compare.csv")

Reshape Data For Regression Analyses

Look at months when sampling occured

table(as.numeric(dat$Mo),dat$Year)

Set up index for working data

NOTE: There is a random year when they also sampled in May - look into this (see above)

dat$Year <- as.numeric(dat$Year)
i.07.all <- which(dat$Year > 1996 & 
                #dat$Repeat.New == 1 &
                dat$Mo %in% c(1,2,3) &       #originally dat$Mo != 11 &
                dat$effort == "constant")

Check

summary(factor(dat$Year)[i.07.all])

"Cast by Location: PUVI vs. PALO"

This actually casts by ID... not sure why discrepencay in how I describe it and what I did, so I put the title in " "

with(dat, table(Session, Mo,Year))

summary(dat$Session)
summary(factor(dat$Mo))
summary(factor(dat$Year))
summary(factor(dat$Session))

with(dat, table(Year,Session))
cast.by.site <- dcast(data = dat[i.07.all,],
               formula = Species + Year + Location + Session  ~ ID,
               value.var = "Species",
               fun.aggregate = length)


dim(cast.by.site)

#df is very wide b/c there is a row for each individual!
summary(factor(cast.by.site$Year))
summary(factor(cast.by.site$Session))

with(cast.by.site, table(Year,Session,Location))

Recode 0/1

this allows total count of unique individuals to be determined

fx01 <- function(x){ifelse(x > 0,1,0)}
names(cast.by.site)[1:5]
cast.by.site[,-c(1:4)] <- apply(cast.by.site[,-c(1:4)],2,fx01)
summary(factor(cast.by.site$Year))
dim(cast.by.site)

Calcualte total captured for a given year w/in a site

spp.ann.tot <- apply(cast.by.site[,-c(1:4)],1,sum)
length(spp.ann.tot)
spp.ann.tot <- cbind(cast.by.site[,c(1:4)],
                     spp.ann.tot)
dim(spp.ann.tot)

summary(factor(spp.ann.tot$Year))

Check

length(unique(spp.ann.tot$Species))
unique(spp.ann.tot$Year)


with(spp.ann.tot, table(Year,Session,Location))

Add years when no birds were caught

Note that in its current format, there is a row of data ONLY if a species was captured at least once - there are no zeros for the year it wasn't captured

summary(spp.ann.tot$spp.ann.tot ) #min value == 1

Need to add zeros. First, make a dataframe that has a row for every species for every year for both sites. We'll use the handy expand.grid() functions will make all combinations of species-locations-years

all.spp <- unique(spp.ann.tot$Species)
length(all.spp)


Location <- unique(spp.ann.tot$Location)

#remove bad location
Location <- as.character(Location[-3])


Year <- unique(spp.ann.tot$Year)[order(unique(spp.ann.tot$Year))]


#session 1
empty.datframe.sess1 <- expand.grid(Species = all.spp,
                              Year = Year,
                              Location = Location,
                              Session = 1)

length(all.spp)*length(Year)*length(Location)
with(empty.datframe.sess1,
     table(Year, Location))

#session 2
## a 2nd sampling sessions occured at PUVI in March in just these 2 years
empty.datframe.sess2 <- expand.grid(Species = all.spp,
                              Year = c(2007,
                                       2008),
                              Location = "PUVI",
                              Session = 2)


dim(empty.datframe.sess1)  #884 4
dim(empty.datframe.sess2)  #68 4

empty.datframe <- rbind(empty.datframe.sess1,
                              empty.datframe.sess2)

dim(empty.datframe) #952 4

#952 = 884 + 68

There are 884 species-year-site combinations. There is extra stuff for sessions 2

length(all.spp)*length(Location)*length(Year)  #884

Merge the real data with the "empty dataframe. First, I'll show a problematic merge, then I'll correct it

    bad.merge <- merge(spp.ann.tot,empty.datframe)

The orignal and new dataframes are the same size; rows have not been inserted where there are currently zeros

    dim(bad.merge)
    dim(spp.ann.tot)

Add "all = T" to insert NAs into the dataframe where there currently are no entries (ie, years when a species wans't captured currently don't occur)

spp.ann.tot.b <- merge(spp.ann.tot,
                   empty.datframe,
                   all = T)

dim(spp.ann.tot.b)

Check the size; its now as big as the "empty.datafrme"

dim(spp.ann.tot.b)

summary(factor(spp.ann.tot.b$Session))

with(spp.ann.tot.b,table(Year, Session,Location))



summary(spp.ann.tot.b)

Change NAs to zeros

There now should be NAs inserted into the abundance column in years when a species wasn't observed at a particular site

summary(spp.ann.tot.b$spp.ann.tot)

Convert those NAs to zeros

i.NA <- which(is.na(spp.ann.tot.b$spp.ann.tot) == TRUE)
spp.ann.tot.b[i.NA,"spp.ann.tot"] <- 0

Check that zeros have been added

summary(spp.ann.tot.b$spp.ann.tot)
summary(factor(spp.ann.tot.b$Year))

with(spp.ann.tot.b,
     table(Year,Session))

with(spp.ann.tot.b,
     table(Year,Session,Location))

Load effort data

From Table 1 in Lloyd et al 2016. Note that for 2007 and 2009 there were additional sampling sessions in March for one Location

#data directory
d2 <- paste(d.working,"data/data_in", sep = "")

effort <- read.my.csv(file. = "Lloyd_table1_effort",
                      rel.dir. = d2)

Merge count data and effort

#remove stray NA

# i.NA <- which(spp.ann.tot.b$Location == "")
# spp.ann.tot.b <- spp.ann.tot.b[-i.NA,]

spp.ann.tot.b$Location <- factor(as.character(spp.ann.tot.b$Location))


#merge data
spp.ann.tot2<- merge(spp.ann.tot.b,
                     effort,
                     all = T)


dim(spp.ann.tot.b) == dim(spp.ann.tot2)

dim(spp.ann.tot.b) 
dim(spp.ann.tot2)
summary(spp.ann.tot2)

NOTE: things don't line up perfectly b/c of 2 years with zero effort and a few random NAs

Remove date column

i.date.col <- which(names(spp.ann.tot2) == "date")
spp.ann.tot2 <- spp.ann.tot2[,-i.date.col]
with(spp.ann.tot2, 
     table(Year, Location))
i.PALO <- which(spp.ann.tot2$Location == "PALO")

with(spp.ann.tot2[i.PALO,], 
     table(Year, Location))

Remove bad location

i.bad.loc <- which(spp.ann.tot2$Location == "")

spp.ann.tot2 <- spp.ann.tot2[-i.bad.loc,]

Set abundance of NA when net hours = 0

i.net.zero <- which(spp.ann.tot2$net.hours == 0)

spp.ann.tot2$spp.ann.tot[i.net.zero] <- NA

Save final cleaned data

d. <- paste(".",d.working,"data/data_out",sep = "")
f. <- paste("./",d.,"/Lloyd_data_cleaned.csv", sep = "")
write.csv(spp.ann.tot2, 
          row.names = F,
          file = f.)

spp.ann.tot2[which(spp.ann.tot2$Species == "GTGT" &
                     spp.ann.tot2$Location == "PUVI"), ]
spp.ann.tot2[which(spp.ann.tot2$Species == "WEWA" & spp.ann.tot2$spp.ann.tot > 0),]

Recaps per year

dat[which(dat$Species == "WEWA"),]

i.WEWA <- which(dat$Species == "WEWA" & dat$Year > 1996 & 
                dat$Repeat.New == 1 & #exclude ALL recaps
                dat$Mo %in% c(1,2,3) &  #previous was "!= 11"
                dat$effort == "constant")

cast.WEWA <- dcast(data = dat[i.WEWA,],
               formula = Species + Year  ~ ID,
               value.var = "Species",
               fun.aggregate = length)

Recode 0/1

Recode to pres/abs for

fx01 <- function(x){ifelse(x > 0,1,0)}

cast.WEWA[,-c(1:2)] <- apply(cast.WEWA[,-c(1:2)],2,fx01)

Calculate total caps

ann.tot.WEWA <- apply(cast.WEWA[,-c(1:2)],1,sum)
ann.tot.WEWA <- cbind(cast.WEWA[,c(1:2)],ann.tot.WEWA)

Cast to total across years

spp.tot.WEWA <- dcast(data  = ann.tot.WEWA,
      formula = Species ~ .,
      value.var = "ann.tot.WEWA",
      fun.aggregate = sum)

names(spp.tot.WEWA)[2] <- "spp.total"


brouwern/lloyd2016 documentation built on June 8, 2020, 12:36 a.m.