Sys.setenv(TZ = 'UTC')
library(plhdbR)
load_plhdb_packages()
The functions read_bio_table and read_fert_table read csv files of biography and fertility data, respectively, created by the download buttons for these tables on the PLHDB website. These functions strip away blank lines and header lines, parse any date/time columns, and return a well-ordered dplyr::tbl_df, an extension of R's data.frame. To pull all the data from a given table, use search criteria like 'Study.ID != 10'. Note that the data are not extensively error-checked at this stage. The functions expect the layout of a PLHDB download, so an ordinary csv file will probably fail to parse or be parsed incorrectly.
# Assuming your file is called "biography_2015_05_20.csv"
lh <- read_bio_table("../data/biography_2015_05_20.csv")
summary(lh)
#> Study.Id Animal.Id Animal.Name
#> rppn-fma: 562 KOM : 4 Susuruka: 4
#> amboseli:1324 RUS : 4 Ahab : 2
#> kakamega: 599 SUS : 4 APOLLO : 2
#> gombe : 313 AFR : 3 APPLE : 2
#> karisoke: 343 AMA : 3 AQUA : 2
#> beza : 993 BUS : 3 (Other) :3436
#> ssr : 307 (Other):4420 NA's : 993
#> Birth.Date Min.Birth.Date
#> Min. :1919-07-02 00:00:00 Min. :1914-07-03 00:00:00
#> 1st Qu.:1988-08-08 00:00:00 1st Qu.:1988-02-01 00:00:00
#> Median :1999-12-13 00:00:00 Median :1999-08-16 00:00:00
#> Mean :1996-10-17 04:21:01 Mean :1996-03-17 05:50:43
#> 3rd Qu.:2007-07-21 00:00:00 3rd Qu.:2007-07-19 00:00:00
#> Max. :2015-03-20 00:00:00 Max. :2015-03-15 00:00:00
#> NA's :19
#> Max.Birth.Date Birth.Date.Distribution Birth.Group
#> Min. :1924-06-30 00:00:00 N: 778 KK : 209
#> 1st Qu.:1989-08-20 06:00:00 U:3663 Matão : 206
#> Median :2000-01-25 12:00:00 1.1 : 193
#> Mean :1997-03-24 15:10:30 2.2 : 174
#> 3rd Qu.:2007-07-20 00:00:00 1 : 170
#> Max. :2015-03-25 00:00:00 (Other):2941
#> NA's :19 NA's : 548
#> Birth.Group.Certainty First.Born Mom.Id Sex
#> C :4066 N:2749 81 : 16 F:1881
#> U : 187 U:1094 137 : 15 M:2011
#> NA's: 188 Y: 598 9160 : 14 U: 549
#> TZ : 14
#> 86 : 13
#> (Other):3461
#> NA's : 908
#> Entry.Date Entry.Type Depart.Date
#> Min. :1963-01-15 00:00:00 B:3092 Min. :1963-04-29 00:00:00
#> 1st Qu.:1992-01-18 00:00:00 C: 457 1st Qu.:1998-08-28 00:00:00
#> Median :2002-01-23 00:00:00 I: 319 Median :2009-03-14 00:00:00
#> Mean :1999-01-10 20:01:59 O: 573 Mean :2005-02-16 09:29:42
#> 3rd Qu.:2008-02-20 00:00:00 3rd Qu.:2013-07-15 00:00:00
#> Max. :2015-12-23 00:00:00 Max. :2015-04-30 00:00:00
#>
#> Depart.Type Depart.Date.Error
#> D:2166 Min. :0.00000
#> E: 107 1st Qu.:0.00000
#> O:1544 Median :0.00000
#> P: 624 Mean :0.06784
#> 3rd Qu.:0.01000
#> Max. :3.00000
#>
# Assuming your file is called "fertility_2015_05_20.csv"
fert <- read_fert_table("../data/fertility_2015_05_20.csv")
summary(fert)
#> Study.Id Animal.Id Start.Date Start.Type
#> rppn-fma:151 BLAN : 8 Min. :1963-05-15 00:00:00 B:1086
#> amboseli:618 DOS- : 8 1st Qu.:1990-07-15 00:00:00 C: 192
#> kakamega:229 KATH : 8 Median :1999-02-02 00:00:00 I: 114
#> gombe :177 SERI : 8 Mean :1997-07-30 08:29:33 O: 459
#> karisoke:210 LIMP : 7 3rd Qu.:2006-11-01 12:00:00
#> beza :252 NYLA : 7 Max. :2015-03-20 00:00:00
#> ssr :214 (Other):1805
#> Stop.Date Stop.Type
#> Min. :1965-02-14 00:00:00 D:847
#> 1st Qu.:1997-06-17 00:00:00 E: 88
#> Median :2008-08-03 00:00:00 O:838
#> Mean :2004-05-08 13:23:37 P: 78
#> 3rd Qu.:2013-06-28 00:00:00
#> Max. :2015-04-30 00:00:00
#>
The functions find_bio_errors and find_fert_errors scan the loaded biography and fertility data, respectively, looking for errors of various kinds. Each function takes as an argument the relevant dplyr::tbl_df generated by the read_..._table functions listed above. find_bio_errors will find dates that are unrealistic as well as duplicate entries for the same (Study.Id, Animal.Id) combination. These are returned as named list elements $error_dates and $error_duplicates. find_fert_errors only scans the date/time fields for errors, since there are multiple fertility entries for some individuals. This is returned in a list with named element $error_dates.
# Check the biography data for errors
bio_errors <- find_bio_errors(lh)
bio_errors$error_dates %>% data.frame()
#> Study.Id Animal.Id Mom.Id Birth.Date Min.Birth.Date Max.Birth.Date
#> 1 karisoke IZI IZU 2005-09-02 <NA> <NA>
#> 2 karisoke FATINF FAT 2014-09-18 <NA> <NA>
#> 3 karisoke BUK KUB 2014-03-15 <NA> <NA>
#> 4 karisoke ISK POP 2005-08-25 <NA> <NA>
#> 5 karisoke MAS IKZ 2014-04-24 <NA> <NA>
#> 6 karisoke UKUINF UKU 2014-12-25 <NA> <NA>
#> 7 karisoke UBO NYB 2013-09-17 <NA> <NA>
#> 8 karisoke TMSINF TMS 2014-10-22 <NA> <NA>
#> 9 karisoke TEKINF TEK 2014-08-01 2014-07-01 2014-09-01
#> 10 karisoke KRNINF KRN 2014-11-17 <NA> <NA>
#> 11 karisoke NDE NAH 2013-10-21 <NA> <NA>
#> 12 karisoke ISN KUB 2013-11-01 <NA> <NA>
#> 13 karisoke GUTINF GUT 2015-02-24 <NA> <NA>
#> 14 karisoke BSHINF BSH 2014-03-31 <NA> <NA>
#> 15 karisoke BIR TAR 2014-05-23 <NA> <NA>
#> 16 karisoke MAHINF MAH 2015-03-10 <NA> <NA>
#> 17 karisoke KRDINF KRD 2015-01-02 <NA> <NA>
#> 18 karisoke TAMINF TAM 2014-04-14 <NA> <NA>
#> Entry.Date Depart.Date
#> 1 2013-08-10 2015-04-30
#> 2 2014-09-18 2015-04-30
#> 3 2014-03-15 2014-09-04
#> 4 2013-08-10 2015-04-30
#> 5 2014-04-24 2015-04-30
#> 6 2014-12-25 2015-02-23
#> 7 2013-09-17 2015-04-30
#> 8 2014-10-22 2014-10-30
#> 9 2015-12-23 2015-04-30
#> 10 2014-11-17 2015-01-30
#> 11 2013-10-21 2015-04-30
#> 12 2013-11-01 2015-04-30
#> 13 2015-02-24 2015-04-30
#> 14 2014-03-31 2014-04-22
#> 15 2014-05-24 2014-10-01
#> 16 2015-03-10 2015-04-30
#> 17 2015-01-02 2015-03-21
#> 18 2014-04-14 2014-05-26
bio_errors$error_duplicates
#> Source: local data frame [1 x 3]
#> Groups: Study.Id
#>
#> Study.Id Animal.Id n_records
#> 1 karisoke SUS 4
# Check the fertility data for errors
fert_errors <- find_fert_errors(fert)
#> No errors found!
fert_errors$error_dates
#> NULL
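The two kinds of biography checks described above can be illustrated on a toy table. This is only a sketch in base R, not the code inside find_bio_errors: the data values are invented, the object names (toy_bio, toy_duplicates, toy_bad_dates) are hypothetical, and the rule "entry must not come after departure" is an assumed example of an unrealistic date, since the package's actual date rules are not spelled out here.

```r
# Toy biography-like data. Column names follow the summary output above;
# all values are invented for illustration.
toy_bio <- data.frame(
  Study.Id    = c("study_a", "study_a", "study_a", "study_b"),
  Animal.Id   = c("A1", "A1", "A2", "A1"),
  Entry.Date  = as.Date(c("2014-01-01", "2014-01-01", "2015-12-23", "1990-07-15")),
  Depart.Date = as.Date(c("2015-04-30", "2015-04-30", "2015-04-30", "1995-01-01")),
  stringsAsFactors = FALSE
)

# Count records per (Study.Id, Animal.Id) and keep combinations that occur
# more than once.
counts <- aggregate(
  list(n_records = rep(1L, nrow(toy_bio))),
  by = list(Study.Id = toy_bio$Study.Id, Animal.Id = toy_bio$Animal.Id),
  FUN = sum
)
toy_duplicates <- counts[counts$n_records > 1, ]

# Assumed date rule (hypothetical): an animal cannot enter the study after
# it has departed.
toy_bad_dates <- toy_bio[toy_bio$Entry.Date > toy_bio$Depart.Date, ]

print(toy_duplicates)
print(toy_bad_dates)
```

Note that the same Animal.Id in two different studies ("A1" in study_a and study_b) is not treated as a duplicate, because the key is the (Study.Id, Animal.Id) pair.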
The function find_mom_id_errors checks to see if all the animals listed in Mom.Id in the biography table for a given study have a corresponding record in Animal.Id. It is important to note that not all of these are errors! In some studies, the mother can be known but excluded from the biography table for a variety of reasons. It would be a good idea to double-check the cases listed below.
find_mom_id_errors(lh)
#> $`rppn-fma`
#> [1] "ANA2"
#>
#> $amboseli
#> [1] "EST" "ETA" "FLU" "JAN" "KUP" "LIS" "LOI" "NUB" "PIN" "RIN" "RUK"
#> [12] "SKI" "TWI"
#>
#> $kakamega
#> character(0)
#>
#> $gombe
#> [1] "PATINA"
#>
#> $karisoke
#> [1] "AMR" "GUY" "IGT" "IZU" "KUG" "UMH" "UMY"
#>
#> $beza
#> character(0)
#>
#> $ssr
#> [1] "BB--" "CH--"
#>
#> attr(,"split_type")
#> [1] "data.frame"
#> attr(,"split_labels")
#> Study.Id
#> 1 rppn-fma
#> 2 amboseli
#> 3 kakamega
#> 4 gombe
#> 5 karisoke
#> 6 beza
#> 7 ssr
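The idea behind this check, finding every Mom.Id that never appears as an Animal.Id within the same study, can be sketched with a per-study set difference. The snippet below is a base-R illustration on invented data and is not taken from the package source; toy_bio and missing_moms are hypothetical names.

```r
# Toy data with invented IDs: in study_a the mother "ZZZ" has no record of
# her own, while in study_b every listed mother is present.
toy_bio <- data.frame(
  Study.Id  = c("study_a", "study_a", "study_a", "study_b", "study_b"),
  Animal.Id = c("AAA", "BBB", "CCC", "DDD", "EEE"),
  Mom.Id    = c(NA, "AAA", "ZZZ", NA, "DDD"),
  stringsAsFactors = FALSE
)

# For each study, list the mothers that lack a matching Animal.Id.
missing_moms <- lapply(split(toy_bio, toy_bio$Study.Id), function(d) {
  moms <- unique(d$Mom.Id[!is.na(d$Mom.Id) & d$Mom.Id != ""])
  sort(setdiff(moms, d$Animal.Id))
})

print(missing_moms)
```

As with the real output, a study with no unmatched mothers yields an empty character vector.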
The function find_first_born_errors checks for two kinds of errors regarding first-born offspring.
first_born_errors <- find_first_born_errors(lh)
First, there should be a known Mom.Id (not blank or NA) for any animal that has "N" or "Y" in the First.Born column: if the animal's first-born status is known ("N" or "Y"), then the researchers must know the mother. Cases that violate this rule are returned in the named list element $unknown_mother_first_born.
first_born_errors$unknown_mother_first_born
#> Source: local data frame [3 x 6]
#>
#> Study.Id Mom.Id Animal.Id Animal.Name Birth.Date First.Born
#> 1 beza NA 131 NA 1990-07-15 N
#> 2 beza NA 9012 NA 1985-07-15 Y
#> 3 beza NA 9023 NA 1979-07-15 N
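The rule can be expressed as a simple row filter. The following is a minimal base-R sketch on invented data, not the package's implementation; toy_bio and toy_unknown_mother are hypothetical names.

```r
# Toy data with invented IDs. "U" means first-born status is unknown,
# so a missing mother is acceptable for that row.
toy_bio <- data.frame(
  Study.Id   = rep("study_a", 4),
  Animal.Id  = c("A1", "A2", "A3", "A4"),
  Mom.Id     = c("M1", NA, "", NA),
  First.Born = c("Y", "N", "Y", "U"),
  stringsAsFactors = FALSE
)

# Mother is unknown if Mom.Id is NA or blank.
mom_unknown <- is.na(toy_bio$Mom.Id) | toy_bio$Mom.Id == ""

# First-born status is known if it is "N" or "Y".
status_known <- toy_bio$First.Born %in% c("N", "Y")

# Rows that violate the rule: known status but unknown mother.
toy_unknown_mother <- toy_bio[mom_unknown & status_known, ]

print(toy_unknown_mother)
```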
Second, any given female should have no more than one first-born offspring. Multiple offspring that are identified as first-born but attributed to the same female are returned in the named list element $multiple_first_born.
first_born_errors$multiple_first_born
#> NULL
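Because the real data contain no such cases, a toy example may help show what this check would flag. The sketch below uses base R and invented data; it is an assumption about how such a check could be written, not the package's own code, and the names toy_bio, first_borns, and toy_multiple are hypothetical.

```r
# Toy data with invented IDs: mother "M1" in study_a has two offspring
# marked as first-born. A different "M1" in study_b has only one.
toy_bio <- data.frame(
  Study.Id   = c("study_a", "study_a", "study_a", "study_b"),
  Animal.Id  = c("A1", "A2", "A3", "A1"),
  Mom.Id     = c("M1", "M1", "M2", "M1"),
  First.Born = c("Y", "Y", "Y", "Y"),
  stringsAsFactors = FALSE
)

# Keep first-born offspring with a known mother.
first_borns <- toy_bio[toy_bio$First.Born %in% "Y" & !is.na(toy_bio$Mom.Id), ]

# Mothers are only unique within a study, so build a combined key.
mom_key <- paste(first_borns$Study.Id, first_borns$Mom.Id, sep = ":")

# Flag every offspring whose mother has more than one first-born.
n_first_born  <- table(mom_key)
repeated_keys <- names(n_first_born)[n_first_born > 1]
toy_multiple  <- first_borns[mom_key %in% repeated_keys, ]

print(toy_multiple)
```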