library(knitr)
opts_chunk$set(echo = FALSE, include = TRUE)
load('keepTrack.RData')

Data version

The raw data were received on r format(keepTrack$dataDate, '%d %B %Y').

De-duplication

The raw data started with r keepTrack$nrowInitial rows. De-duplication removed a total of r keepTrack$nrowInitial - keepTrack$nrowDeDup rows. We used the following columns as criteria to check for duplicates (i.e. a record was deemed a duplicate if it matched another record in all of these columns):

keepTrack$keyCol

Note that de-duplication was performed after cleaning all species and geographic names, as detailed below.
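The duplicate check described in this section can be illustrated with a minimal sketch. It is not the code used to produce this report: the data frame `dat`, its columns, and the choice of `keyCol` are hypothetical, and we assume base R's `duplicated()` is an adequate stand-in for the actual procedure.

```r
# Hypothetical records; column names and values are illustrative only
dat <- data.frame(
    species = c("sp_a", "sp_a", "sp_b", "sp_a"),
    island = c("Maui", "Maui", "Maui", "Oahu"),
    date = c("2001-05-01", "2001-05-01", "2001-05-01", "2001-05-01"),
    collector = c("A", "B", "A", "A"),
    stringsAsFactors = FALSE
)

# Assumed key columns: a record is a duplicate if it matches an
# earlier record in all of these columns
keyCol <- c("species", "island", "date")

# duplicated() flags every repeat after the first occurrence
isDup <- duplicated(dat[, keyCol])
datDeDup <- dat[!isDup, ]

# Number of rows removed by de-duplication
nRemoved <- nrow(dat) - nrow(datDeDup)
```

In this sketch the second row matches the first in all key columns and is dropped, even though its `collector` value differs, because `collector` is not among the key columns.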

Cleaning species names

The following species names were corrected (i.e. changed from old_name to new_name):

kable(keepTrack$nameFix, row.names = FALSE)
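A correction table of this form can be applied to a vector of names with a simple lookup. The sketch below is an assumption about how such a table might be used, not the code behind this report; the vector `spp` and the misspellings in `nameFix` are made up for illustration, and only the column names `old_name` and `new_name` come from the text above.

```r
# Hypothetical correction table with the old_name/new_name layout
nameFix <- data.frame(
    old_name = c("Drosophila silvestris", "Drosophila heteroneurra"),
    new_name = c("Drosophila sylvestris", "Drosophila heteroneura"),
    stringsAsFactors = FALSE
)

# Hypothetical species names as they might appear in the raw data
spp <- c("Drosophila silvestris", "Drosophila sylvestris",
         "Drosophila heteroneurra")

# Find which names have an entry in the correction table
i <- match(spp, nameFix$old_name)
needsFix <- !is.na(i)

# Replace only the names that matched; all others are left untouched
sppClean <- spp
sppClean[needsFix] <- nameFix$new_name[i[needsFix]]
```

Names without an entry in `old_name` pass through unchanged, so the table only needs to list the names that require correction.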

Cleaning geographical data

Some island names were inconsistent. The original island names were

keepTrack$islandNameOld

The updated names are

keepTrack$islandNameNew

Some records had low spatial accuracy (designated with a C in the ACC column). Removing those records eliminated a further r keepTrack$nrowDeDup - keepTrack$nrowBadACC rows.

nIslandOut <- nrow(keepTrack$outsideIsland)
islandsGood <- nIslandOut == 0
rec <- ifelse(nIslandOut == 1, 'record', 'records')

Furthermore, we checked that all records fall within the bounds of the islands they were reported from (e.g. that a record from Hawaiʻi Island does indeed fall within the boundary of Hawaiʻi Island). We found r nIslandOut r rec falling outside the island polygons.

cat('These are the records falling outside the island polygons:')
kable(keepTrack$outsideIsland, row.names = FALSE)
cat('These records falling outside the island polygons will be removed unless they can be corrected.')

Cleaning up collection dates

nNoDate <- nrow(keepTrack$noDate)
anyNoDate <- nNoDate > 0
rec <- ifelse(nNoDate == 1, 'date', 'dates')

Dates appeared in multiple formats and were standardized to YYYY-MM-DD. We checked for missing dates and found r nNoDate missing r rec.

cat('Records with missing dates are:')
kable(keepTrack$noDate, row.names = FALSE)
cat('These records with no collection date will be removed unless they can be corrected.')
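The date standardization described in this section could be approached as in the following sketch. It is not the code used for this report: the input vector `raw`, the candidate formats in `fmts`, and the helper `standardizeDate()` are all hypothetical, and the real data may contain formats not covered here.

```r
# Hypothetical raw dates in mixed formats, including missing values
raw <- c("2015-03-07", "07/03/2015", "20150307", "", NA)

# Assumed candidate formats, tried in order
fmts <- c("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")

# Hypothetical helper: parse each element with the first format that fits
standardizeDate <- function(x, formats) {
    out <- rep(as.Date(NA), length(x))
    for (f in formats) {
        todo <- is.na(out)  # only retry elements not yet parsed
        out[todo] <- as.Date(x[todo], format = f)
    }
    out
}

clean <- standardizeDate(raw, fmts)

# Dates print as YYYY-MM-DD by default; anything left as NA is missing
# or in an unrecognized format
nMissing <- sum(is.na(clean))
```

Trying the formats one at a time per element avoids assuming that the whole column shares a single format. Ambiguous day/month orderings still need to be resolved by checking the data source.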

Final dataset

nRecFinal <- with(keepTrack, nrowBadACC - nrow(noDate) - nrow(outsideIsland))

codeCode <- function(x) {
    sprintf('`%s`', x)
}

The final dataset is saved as an R object of class r codeCode(keepTrack$class) from the sp package [@sp] and uses the geographic coordinate reference system r codeCode(keepTrack$proj).

The final dataset contains r nRecFinal records. Below we summarize changes between the raw data and filtered data.

Geographic localities

The following localities were lost after filtering:

kable(keepTrack$geoLost, row.names = FALSE)

Samples per year

The plot below compares the number of records per year in the raw data and after processing.

par(mar = c(3, 3, 0, 0) + 0.5, mgp = c(2, 0.75, 0), tcl = -0.05)
plot(keepTrack$perYr[, c('year', 'nrec_initial')], 
     type = 'l', lwd = 2, 
     xlab = 'Year', ylab = 'Number of records')
points(keepTrack$perYr[, c('year', 'nrec_final')], 
       type = 'l', lwd = 2, col = 'red')

legend('topleft', legend = c('Raw data', 'Post processing'), 
       lty = 1, col = c('black', 'red'), lwd = 2, bty = 'n')
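A per-year summary like the one plotted here could be assembled as in the sketch below. This is an illustration under assumptions, not the code that built `keepTrack$perYr`: the vectors `rawYear` and `finalYear` are made-up collection years, and only the column names `year`, `nrec_initial` and `nrec_final` are taken from the plotting code.

```r
# Hypothetical collection years before and after filtering
rawYear <- c(1990, 1990, 1991, 1992, 1992, 1992)
finalYear <- c(1990, 1992, 1992)

# Use the years present in the raw data as the common set of levels,
# so years that lose all records still appear with a count of zero
yrs <- sort(unique(rawYear))

perYr <- data.frame(
    year = yrs,
    nrec_initial = as.vector(table(factor(rawYear, levels = yrs))),
    nrec_final = as.vector(table(factor(finalYear, levels = yrs)))
)
```

Fixing the factor levels before tabulating keeps the two count columns aligned row by row, which the two-line plot relies on.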

Samples per species

The table below compares the number of records per species in the raw data and after processing. This table should also be checked manually for misspelled names.

names(keepTrack$perSpp) <- c('species', 'raw data', 'post processing')
kable(keepTrack$perSpp, row.names = FALSE)

References



ajrominger/hiDrosophoBioGeo documentation built on Jan. 26, 2022, 6:06 a.m.