knitr::opts_chunk$set(cache=TRUE) library(dplyr) library(ggplot2) library(ggthemes) library(hcphb) library(plotly) library(readr) library(readxl) library(stringdist) library(stringr) library(viridis) data("hcp_rev_sent") data("hcp_cur_sent") data("hcp_rev_all") data("hcp_cur_all") hcp_cur <- hcp_cur_sent hcp_rev <- hcp_rev_sent
As noted in the title, this document is a draft, and is incomplete at this time.
Habitat Conservation Plans (HCPs) are a conservation tool enabled by section 10(a)(1)(B) of the U.S. Endangered Species Act (ESA). In return for taking actions that will benefit ESA-listed species, non-federal entities with an HCP receive an Incidental Take Permit (ITP) that allows such species to be harmed incidental to otherwise legal activities. The U.S. Fish and Wildlife Service (Dept. of the Interior) and National Marine Fisheries Service (Dept. of Commerce) administer the HCP programs. In 1996, the Services finalized their Habitat Conservation Planning Handbook (Handbook), which is used to guide the development of HCPs. In June, 2016, the Services issued a notice in the Federal Register that they were revising the Handbook and released the draft for public comment. At r length(hcp_rev_all)
pages for the revision and r length(hcp_cur_all)
pages in the original (current), trying to determine all of the proposed changes could be rather time-consuming. To aid in the review process, we undertook some basic analyses of the two documents. In the following sections we exclude the tables of contents and acronym lists from analysis, but otherwise, all text returned from pdftools::pdf_text
was analyzed. The analyses in this vignette are available from the R package in which the vignette was written, available from GitHub.
The Handbook revision (REV) has r length(hcp_rev)
chapters or sections and the current (CUR) has r length(hcp_cur)
chapters or sections. Using the sentence tokenizer in openNLP
, we find that these result in r length(unlist(hcp_rev))
sentences for REV and r length(unlist(hcp_cur))
sentences for CUR. Given the volume of information, and lacking a 'Track Changes' version, we would like to have a way to find the most-similar sentences between the two documents.
We use the stringdist
package to identify the most similar sentences between the two documents. We like the default metric, 'optimal string alignment' or 'restricted Damerau-Levenshtein distance' because it is simple to understand: the distance is the sum of changes needed to align two strings A and B:
A <- "This is a string." B <- "This is another string." C <- "This is a strung." stringdist(A, B) stringdist(A, C)
We tokenized the documents using Maxent_Sent_Token_Annotator
from openNLP
, which tends to perform very well, but not perfectly. Because pdf_text
doesn't preserve empty lines, which are part of the pattern for identifying section headings, those headings are concatenated with the first sentence of the section. That is, the sentence extraction is close but not perfect; however, if the heading and text didn't change at all between versions then the result is still an edit distance of 0. But there are differences...lots and lots of differences. Consider the comparison of sentences in REV chapter 1 to the sentences of all the chapters in CUR. First we gather the pairwise distances:
# # Not run here...run on cluster, loaded here # ch1_dists <- lapply(hcp_cur, FUN = stringdistmatrix, b = hcp_rev$ch1) # ch1_maxes <- lapply(hcp_cur, FUN = get_max_dist, hcp_rev$ch1) # ch1_ratio <- calc_dist_ratio(ch1_dists, ch1_maxes) # mins_ls <- lapply(ch1_ratio, FUN = get_min_dists) # mins_df <- create_mins_df(mins_ls, "ch1") load("../data/mins_df.rda") knitr::kable(head(mins_df, 10))
And then we can plot the similarities of REV chapter 1 along the sentences of CUR:
ggplot(data = mins_df, aes(x = rev_idx, y = val)) + geom_point(alpha = 0.2, size = 2) + labs(x = "Match index", y = "String distance (osa)") + geom_hline(yintercept = 0.25, color = substr(viridis(3), 1, 7)[2]) + geom_hline(yintercept = 0.218, color = substr(viridis(3), 1, 7)[1]) + geom_hline(yintercept = 0.3, color = substr(viridis(3), 1, 7)[3]) + theme_hc()
While osa = 0.25 seems like a reasonable cutoff (slightly less than the 0.01 percentile), we can examine a histogram of osa distances to see if that's reasonable:
qplot(mins_df$val, geom="histogram", bins = 40) + theme_hc()
For the time being, we'll use osa < 0.25 as the cutoff for "match" sentences.
Because the test with chapter 1 appears to have worked, we can run all sentences in all REV chapters against all chapters and sentences in CUR:
# # Not run locally; run on cluster, then loaded here # all_dists <- list() # for(i in names(hcp_rev)) { # cur_dists <- lapply(hcp_cur, FUN = stringdistmatrix, b = hcp_rev[[i]]) # cur_maxes <- lapply(hcp_cur, FUN = get_max_dist, hcp_rev[[i]]) # cur_ratio <- calc_dist_ratio(cur_dists, cur_maxes) # cur_min_ls <- lapply(cur_ratio, FUN = get_min_dists) # cur_min_df <- create_mins_df(cur_min_ls, i) # all_dists[[i]] <- cur_min_df # } # # all_dists_df <- dplyr::bind_rows(all_dists) # # save(all_dists_df, file = "../data-raw/all_dists_df.rda") load("../data/all_dists_df.rda") all_dists_df$from <- str_c("current.", all_dists_df$cur_ch, ".", all_dists_df$cur_sent) all_dists_df$to <- str_c("revised.", all_dists_df$rev_ch, ".", all_dists_df$rev_sent) head(all_dists_df) good_dists <- all_dists_df[all_dists_df$val <= 0.25, ] cross_tab <- table(good_dists$rev_ch, good_dists$cur_ch) # knitr::kable(cross_tab) plot_ly(z = cross_tab, type = "heatmap", x = colnames(cross_tab), y = row.names(cross_tab), colors = viridis(7)) %>% layout(title = "Sentence matches of HCP Handbook and revision", xaxis = list(title = "Current", tickangle = 45), yaxis = list(title = "Revision"), legend = list(title = "# matches"), margin = list(b = 50))
We come up with r sum(cross_tab)
matches (with an optimal string alignment ratio <= 0.25) for sentences between the REV and CUR Handbooks. Many of these are repeats, however; only r comb <- paste(good_dists$cur_ch, good_dists$rev_sent, sep = ":"); length(unique(comb))
unique REV sentences match to CUR sentences.
cur_names <- unique(all_dists_df$from) rev_names <- unique(all_dists_df$to) both_name <- c(cur_names, rev_names) dims <- length(cur_names) + length(rev_names) amat <- matrix(data = NA, nrow = dims, ncol = dims) rownames(amat) <- both_name colnames(amat) <- both_name for(i in 1:length(cur_names)) { cur_sub <- dplyr::filter(all_dists_df, from == cur_names[i]) for(j in 1:length(cur_sub$to)) { amat[cur_names[i], cur_names[j]] <- 1 - cur_sub[i,]$val amat[cur_names[j], cur_names[i]] <- 1 - cur_sub[i,]$val } } save(amat, file = "../data-raw/all_invdists_mat.rda") tmp <- data.frame(val = all_dists_df$val, from = all_dists_df$from, to = all_dists_df$to) edgebundleR::edgebundle(amat, cutoff = 0.75, width = 800) # # Looks like I may have to do this in Circos and do a system call... # # no idea if I can get that to show in Rmd..
There have been a significant number of legal and policy developments since the CUR handbook was written in 1996, and we expect that the REV handbook incorporates this new information. For example, in 2015 the White House released a government-wide mitigation policy that should affect how avoidance, minimization, and mitigation requirements are met for HCPs. Here we evaluate the sentence similarities of CUR and REV to these "outside" documents, as available, to understand where and how new information has been incorportated.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.