
library(dplyr)
library(stringr)
library(purrr)
library(tidyr)
library(readr)    # read_csv()
library(ggplot2)  # ggplot(), geom_sf()
library(PITcleanr)

Not all PIT tag data are stored in PTAGIS. This section shows how to read in PIT tag data from other sources, including Biologic csv files and raw files (e.g. .xlsx, .log and .txt file extensions) downloaded directly from PIT tag readers. The readCTH() function can read all of these formats, but the user must indicate the format of each file through the file_type argument.

There are many ways a user might store their various files, and many ways to script how to read them into R. In this vignette, we will suggest one way to do this, but alternatives certainly exist.

Storing Data Files

In this example, we have accumulated a number of PIT tag detection files in a variety of formats. We have saved them all in the same folder and used a naming convention to indicate the format of each one. Files from PTAGIS have "PTAGIS" somewhere in the file name, Biologic csv files have "BIOLOGIC" somewhere in the file name, and we assume all other files are raw files (with a mixture of .xlsx, .log and .txt file extensions).

For raw detection files, which usually come from a single reader, the file does not contain information about what site code those observations come from. The user can add that themselves within R, or the readCTH() function will assume the site code is the first part of the file name before the first underscore, _. Therefore, if the user adopts a naming convention that includes the site code at the beginning of every raw detection file, PITcleanr will assign the correct site code. Otherwise, the user can overwrite the site codes manually after running readCTH().
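The naming convention described above can be illustrated with a minimal base R sketch. The file names here are hypothetical, and this is not the internal code of readCTH(); it only shows what "the first part of the file name before the first underscore" yields.

```r
# Hypothetical raw-file names; the site code is assumed to precede the first underscore
raw_files <- c("LLR_2022_05_01.log", "SUB2_reader_dump.txt", "HRW_20220601.xlsx")

# drop everything from the first underscore onward
site_codes <- sub("_.*$", "", basename(raw_files))
site_codes
```

A file name without the site code at the front (e.g. "2022_LLR.log") would yield "2022" under this rule, which is why the site code would then need to be overwritten manually.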

We set the name and path of the folder with all the detection data (detection_folder), and used the following script to compile a data.frame of various file names and file types. A user can set the detection_folder path to point to where they have stored all their detection files.

detection_folder = system.file("extdata", 
                                "non_ptagis",
                                package = "PITcleanr",
                                mustWork = TRUE) |> 
  paste0("/")

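file_df <- tibble(file_nm = list.files(detection_folder)) |> 
  # keep only entries with a "." in the name (i.e. with a file extension)
  filter(str_detect(file_nm, "\\.")) |> 
  # assign a file type based on the naming convention
  mutate(file_type = case_when(str_detect(file_nm, "BIOLOGIC") ~ "Biologic_csv",
                               str_detect(file_nm, "PTAGIS") ~ "PTAGIS",
                               TRUE ~ "raw"))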

file_df

Reading in Detections

The following script uses the path of the detection_folder, the various file names inside that folder, the file type associated with each file, and the readCTH() function to read all the detections into R and consolidate them into a single data.frame, all_obs. It contains a quick check (try and "try-error") to ensure that if any particular file has trouble being read, the others are still included.

# read them all in
all_obs <-
  file_df |> 
  mutate(obs_df = map2(file_nm,
                       file_type,
                       .f = function(x, y) {
                         try(readCTH(cth_file = paste0(detection_folder, x), 
                                     file_type = y))
                       })) |> 
  mutate(cls_obs = map_chr(obs_df, .f = function(x) class(x)[1])) |> 
  filter(cls_obs != "try-error") |> 
  select(-cls_obs) |> 
  unnest(obs_df) |> 
  distinct()
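The try and "try-error" check used above can be seen in isolation in the following base R sketch. The function toy_reader() is a hypothetical stand-in for readCTH() that fails on one input; nothing here depends on PITcleanr.

```r
# Hypothetical reader that fails on one input, standing in for readCTH()
toy_reader <- function(path) {
  if (grepl("bad", path)) stop("cannot parse ", path)
  data.frame(file = path, n_obs = 1L)
}

paths <- c("a.csv", "bad.csv", "c.csv")

# silent = TRUE suppresses the printed error message; the failure is still captured
results <- lapply(paths, function(p) try(toy_reader(p), silent = TRUE))

# flag the successful reads and keep only those
ok <- !vapply(results, inherits, logical(1), what = "try-error")
good <- do.call(rbind, results[ok])

# files that could not be read
paths[!ok]
```

Listing the failed files, as in the last line, is a useful addition to the filtering step, since otherwise a problematic file is dropped without notice.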

Alternatively, the user can run readCTH() on each detection file, saving each R object separately, and then use bind_rows() to merge all the detections together. One reason to do this might be to manipulate or filter certain files before merging them.
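As a minimal sketch of this alternative, the toy tibbles below stand in for the output of two separate readCTH() calls. The column names and tag codes are illustrative only and do not reflect the actual columns readCTH() returns.

```r
library(dplyr)

# Toy stand-ins for the output of two separate readCTH() calls
# (hypothetical columns and values)
obs_a <- tibble(tag_code = c("3DD.0001", "3DD.0002"), site = "AAA")
obs_b <- tibble(tag_code = c("3DD.0002", "3DD.0003"), site = "BBB")

# manipulate one file before merging, e.g. drop a particular tag
obs_b <- obs_b |> 
  filter(tag_code != "3DD.0003")

# merge all detections together
all_toy <- bind_rows(obs_a, obs_b)
all_toy
```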

The readCTH() function also has an argument to filter out test tags, so their detections are not included in the results. To use it, set the test_tag_prefix argument to the initial alphanumeric characters of the test tags. By default, this is set to "3E7". To keep all test tag detections, set test_tag_prefix = NA. To pull out only test tag detections, use the function readTestTag() instead of readCTH(), and set test_tag_prefix to the appropriate code (e.g. "3E7").
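The effect of a prefix filter can be sketched in base R as follows. The tag codes are made up, and this only mimics the idea of matching on the leading characters of a tag code; it is not the internal code of readCTH() or readTestTag().

```r
# Made-up tag codes; two of them start with the test tag prefix
tag_codes <- c("3E7.0000001234", "3DD.003BC12345", "3E7.00000ABCDE", "384.1B79A2C3D4")
test_tag_prefix <- "3E7"

# TRUE where a tag code begins with the prefix
is_test <- startsWith(tag_codes, test_tag_prefix)

fish_tags <- tag_codes[!is_test]  # analogous to dropping test tags
test_tags <- tag_codes[is_test]   # analogous to keeping only test tags
```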

# PTAGIS configuration
org_config <- buildConfig()

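# Biologic configuration for Henry’s Ranch 
# (the path below is a local one; point it at your own site metadata file)
bio_config <- read_csv("O:Documents/Git/Other/PITcleanr_lite/config/site_metadata_revised.csv",
                        show_col_types = FALSE) |> 
  mutate(
    # convert rkm to character
    across(
      rkm,
      as.character
    ),
    # left-pad antenna IDs with zeros to a width of 2
    across(
      antenna_id,
      ~ str_pad(., 
                width = 2,
                side = "left",
                pad = "0")
    )
  ) |> 
  mutate(
    # rename site code SUB to SUB2
    across(
      site_code,
      ~ recode(., 
               "SUB" = "SUB2")
    ),
    # build node names from site code and antenna ID
    node = paste(site_code, antenna_id,
                 sep = "_")
  )

# combine the PTAGIS and Biologic configurations
my_config <- org_config |> 
  bind_rows(bio_config)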

# compress all detections
test_comp <- compress(all_obs,
                      configuration = my_config |> 
                        select(site_code:node))

# compare the number of rows before and after compression
nrow(all_obs)
nrow(test_comp)
nrow(test_comp) / nrow(all_obs)

# transform into capture histories of 1's and 0's

# put column names of capture history in correct order
ch_col_nms <- my_config |> 
  select(node, rkm, rkm_total) |> 
  distinct() |> 
  arrange(desc(rkm), desc(rkm_total), node) |> 
  filter(node %in% unique(test_comp$node)) |> 
  pull(node)

test_comp |> 
  filter(node %in% ch_col_nms) |> 
  select(tag_code,
         node) |> 
  distinct() |> 
  # mark each tag / node combination as detected
  mutate(seen = 1) |> 
  mutate(across(node,
                ~ factor(.,
                         levels = ch_col_nms))) |> 
  # one column per node, filled with 0 where a tag was not detected
  pivot_wider(names_from = node,
              names_expand = TRUE,
              names_sort = TRUE,
              values_from = seen,
              values_fill = 0) |> 
  # paste the node columns together into a single capture history string
  unite(col = ch,
        -tag_code, 
        sep = "",
        remove = TRUE)
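To make the result of this transformation concrete, here is a small base R sketch on toy data. The tag codes and node names are hypothetical, and it uses table() rather than the pivot_wider() approach above, so it illustrates the idea, not the same code path.

```r
# Toy detections: one row per tag-by-node detection (hypothetical values)
toy <- data.frame(
  tag_code = c("A", "A", "B", "C", "C", "C"),
  node     = c("N1", "N3", "N2", "N1", "N2", "N3")
)

# the order of nodes defines the order of positions in the capture history
node_order <- c("N1", "N2", "N3")

# cross-tabulate tags by nodes; > 0 collapses repeat detections to seen / not seen
seen <- unclass(table(toy$tag_code, factor(toy$node, levels = node_order))) > 0

# paste each row of 0/1 values into a single capture history string
ch <- apply(seen * 1L, 1, paste0, collapse = "")
ch
```

Here tag A was seen at N1 and N3 only, so its capture history is "101"; tag B is "010" and tag C is "111".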

# extract the detection sites as a spatial (sf) object
sites_sf <- extractSites(all_obs,
                         as_sf = TRUE,
                         configuration = my_config)

# whether to include flowlines downstream of the root site
dwn_flw = FALSE
nhd_list = queryFlowlines(sites_sf = sites_sf,
                          root_site_code = "LLR",
                          min_strm_order = 2,
                          dwnstrm_sites = dwn_flw,
                          dwn_min_stream_order_diff = 4)
# compile the upstream and downstream flowlines
flowlines = nhd_list$flowlines
if(dwn_flw) {
  flowlines <- rbind(flowlines,
                     nhd_list$dwn_flowlines)
}

# map the flowlines, basin boundary and detection sites
ggplot() +
  geom_sf(data = flowlines,
          aes(color = as.factor(StreamOrde),
              size = StreamOrde)) +
  scale_color_viridis_d(direction = -1,
                        option = "D",
                        name = "Stream\nOrder",
                        end = 0.8) +
  scale_size_continuous(range = c(0.2, 1.2),
                        guide = 'none') +
  geom_sf(data = nhd_list$basin,
          fill = NA,
          lwd = 2) +
  geom_sf(data = sites_sf,
          size = 4,
          color = "black") +
  ggrepel::geom_label_repel(
    data = sites_sf,
    aes(label = node_site, 
        geometry = geometry),
    size = 2,
    stat = "sf_coordinates",
    min.segment.length = 0,
    max.overlaps = 50
  ) +
  theme_bw() +
  theme(axis.title = element_blank())