pick_urls: Extract URLs and email addresses from text
In mvkorpel/pickURL: Extract URLs and Email Addresses from Text

pick_urls extracts individual URLs and email addresses from the input text. The function treats commas and whitespace as separators between individual URLs, as specified for the URL field in the DESCRIPTION file of R packages (R Core Team, 2017). See ‘Commas and Whitespace’ for details.

pick_urls(x, plain_email = all_email, single_item = FALSE,
  all_email = FALSE, collapse_x = FALSE, mailto_alias = c("email",
  "e-mail"), scheme_sub = list(mailto = mailto_alias),
  url_pattern = "^(https?|ftp)://.", email_pattern = "@[^.[]+\\.[^.]+",
  need_scheme = missing(url_pattern) || !isTRUE(nzchar(url_pattern)),
  deobfuscate = TRUE, rm_endpunct = 20)

`x`	a `character` vector containing the input text.
`plain_email`	a `logical` flag. If `TRUE`, the function also looks for plain email addresses (i.e. not formatted as a mailto URL). The default is to use the same value as for `all_email` (`FALSE`).
`single_item`	a `logical` flag. If `TRUE`, the function looks for a single URL or email address per string instead of splitting each string into multiple potential URLs or email addresses. The default is `FALSE`. This setting interacts with `collapse_x`. See ‘Details’.
`all_email`	a `logical` flag. If `TRUE`, individual email addresses are also picked up from mailto (or alias) URLs, and the remaining empty mailto URLs are discarded. This effectively forces `plain_email` to `TRUE`. If `FALSE` (the default), each mailto URL (possibly with multiple email addresses) is returned if `url_pattern` fits.
`collapse_x`	a `logical` flag. If `TRUE`, selected input strings are concatenated, separated by newline characters. This allows URLs in angle brackets and plain email addresses to extend across multiple strings. The default is `FALSE`: don't look across string boundaries.
`mailto_alias`	a `character` vector or `NULL`. Synonyms for the mailto URI scheme. These are case-insensitive and must follow the requirements for a URI scheme name: an ASCII letter, optionally followed by a sequence of letters, digits or characters in the set `"."`, `"+"`, `"-"`.
`scheme_sub`	a `list` of URI scheme substitutions to be made. The default is to substitute `"mailto"` for `"e-mail"` or `"email"` (copied from `mailto_alias`). Each element of the list corresponds to one official form, stored in the name of the element. The element is a `character` vector holding the unofficial forms. The strings are case-insensitive. Use `NULL` or an empty list for no substitutions.
`url_pattern`	a `character` string or `NULL`. Only strings matching this Perl-like regular expression are returned. This is not applied to plain email addresses (see `plain_email` and `email_pattern`). The matching is performed after all other processing. See ‘Details’. If `NULL` or otherwise of zero `length`, no matching is done.
`email_pattern`	a `character` string or `NULL`. Like `url_pattern` but applied to email addresses.
`need_scheme`	a `logical` flag. If `TRUE`, return only strings starting with a technically valid, but not necessarily existing URI scheme followed by a `":"` (see `plain_email`). The default is `TRUE` if `url_pattern` is `missing` or empty, `FALSE` otherwise.
`deobfuscate`	a `logical` flag. If `TRUE` (the default), the function interprets some substrings with an `"at"` word (case insensitive) as email addresses. The actual pattern to match is more complicated, and false positives should be rare.
`rm_endpunct`	a `logical` flag or a `numeric` value that is a whole number or infinite. If `TRUE`, the function removes any `"."`, `"?"` or `"!"` that is suspected to end a sentence. This is useful when no space separates the end punctuation from a URL. If `FALSE`, punctuation is not removed. A `numeric` value sets the memory size, i.e. the maximum number of items (lines) across which a sentence may extend. A smaller number means faster operation. Numbers smaller than `1` are equivalent to `FALSE`, and `Inf` is equivalent to `TRUE`. The default is `20`.
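The `scheme_sub` argument maps unofficial scheme names to an official one. The following base R sketch illustrates the idea of such a substitution on strings that already start with a scheme. The helper `substitute_scheme` is hypothetical and is not part of pickURL; the package's own processing is likely more involved.

```r
# Hypothetical helper (not part of pickURL): replace unofficial scheme
# names at the start of each string with the official form.
substitute_scheme <- function(x, scheme_sub) {
  for (official in names(scheme_sub)) {
    for (alias in scheme_sub[[official]]) {
      prefix <- paste0(tolower(alias), ":")
      # Case-insensitive comparison of the leading scheme
      hit <- startsWith(tolower(x), prefix)
      if (any(hit)) {
        rest <- substring(x[hit], nchar(prefix) + 1L)
        x[hit] <- paste0(official, ":", rest)
      }
    }
  }
  x
}

# Same structure as the documented default of scheme_sub
scheme_sub <- list(mailto = c("email", "e-mail"))
substituted <- substitute_scheme(
  c("E-Mail:user1@example.org", "http://www.example.org/"),
  scheme_sub
)
print(substituted)
```

Strings without a listed alias are returned unchanged.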

If plain_email is FALSE, returns a character vector containing the URLs in x. URL schemes are converted to lowercase, which is the canonical form.

If plain_email is TRUE, returns a list where the first element "url" is the URL vector described above and the second element "email" is a character vector with the email addresses found. See all_email.

If non-ASCII results are present, their Encoding will be "UTF-8".
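The lowercase conversion of URL schemes mentioned above can be illustrated with a short base R sketch. The helper `lower_scheme` and its scheme regular expression are assumptions made for illustration; they are not taken from the package source.

```r
# Hypothetical helper (not part of pickURL): lowercase only the URI
# scheme, i.e. the letters, digits, "+", "." and "-" before the first ":".
lower_scheme <- function(u) {
  m <- regexpr("^[A-Za-z][A-Za-z0-9+.-]*(?=:)", u, perl = TRUE)
  # Only elements with a match are modified
  regmatches(u, m) <- tolower(regmatches(u, m))
  u
}

lowered <- lower_scheme(c("HTTP://Example.org/Path",
                          "MailTo:user1@example.org",
                          "no scheme here"))
print(lowered)
```

The rest of each string, including the host and path, keeps its original case.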

Compatibility with the Internationalized Resource Identifier (IRI) specification (Duerst and Suignard, 2005) has not been assessed carefully. However, the function will accept non-ASCII characters (as opposed to splitting the string).

Invalid UTF-8 strings are handled by keeping the valid (ASCII) bytes and discarding the rest. This means that URLs can still be picked up from a "latin1" encoded string falsely marked as having a "UTF-8" Encoding or when such a string has an "unknown" encoding in a UTF-8 locale.

The function can remove delimiting brackets, some other punctuation or simple LaTeX markup around URLs.
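As a rough illustration of removing delimiters and end punctuation around a URL, the base R sketch below strips one trailing sentence mark and any surrounding brackets or quotes. The helper `strip_wrapping` is hypothetical and much simpler than what the function is described as doing: it does not look at context and does not handle LaTeX markup.

```r
# Hypothetical helper (not part of pickURL): strip a trailing sentence
# mark, then leading and trailing delimiters around a candidate URL.
strip_wrapping <- function(u) {
  u <- sub("[.?!]$", "", u, perl = TRUE)         # one trailing ".", "?" or "!"
  u <- sub("^[<(\\[\"]+", "", u, perl = TRUE)    # leading <, (, [ or "
  sub("[>)\\]\"]+$", "", u, perl = TRUE)         # trailing >, ), ] or "
}

stripped <- strip_wrapping(c("<http://www.example.org/>.",
                             "\"ftp://cran.r-project.org\"",
                             "http://www.example.org/"))
print(stripped)
```

A real implementation has to decide whether a trailing `"?"` or `"."` belongs to the URL, which this sketch does not attempt.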

Note that the function looks for matching delimiting brackets across input string boundaries when single_item is FALSE. rm_endpunct also works across strings. It may therefore be best to process strings that originate from different input files, or from otherwise non-consecutive input lines, separately, with multiple calls to this function. See collapse_x.

The default url_pattern means that the URI scheme must be http, https or ftp and that the scheme must be followed by "://" and at least one character, indicating the presence of an authority component. See regex. For example, setting url_pattern to "^mailto:." would allow email URLs to be returned.
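The effect of the default url_pattern can be checked directly with `grepl`. This sketch only applies the documented patterns to a few strings with `perl = TRUE`; it does not call pick_urls, and the candidate strings are made up for illustration.

```r
# Documented default value of url_pattern
url_pattern <- "^(https?|ftp)://."

candidates <- c("http://www.example.org/",
                "ftp://cran.r-project.org",
                "mailto:user1@example.org",
                "https://")                  # nothing after "://"

default_ok <- grepl(url_pattern, candidates, perl = TRUE)
print(default_ok)

# The alternative pattern mentioned above accepts mailto URLs instead
mailto_ok <- grepl("^mailto:.", candidates, perl = TRUE)
print(mailto_ok)
```

Only the first two candidates pass the default pattern, and only the mailto URL passes the alternative.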

The default email_pattern requires that the domain portion after "@" has at least two parts separated by a ".". This rules out addresses such as "root@localhost" and avoids some false positives. Email addresses whose domain is a literal IP address in square brackets are also dropped.
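The same kind of check works for the default email_pattern. The sketch below applies the documented pattern with `grepl` and `perl = TRUE`; the test addresses are illustrative, and pick_urls itself is not called.

```r
# Documented default value of email_pattern
email_pattern <- "@[^.[]+\\.[^.]+"

addresses <- c("user1@example.org",    # two domain parts: kept
               "root@localhost",       # single domain part: dropped
               "user@[192.0.2.1]")     # bracketed IP literal: dropped

email_ok <- grepl(email_pattern, addresses, perl = TRUE)
print(email_ok)
```

The bracketed address fails because the character right after "@" is "[", which the pattern excludes.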

When single_item is TRUE, no more than one item is extracted from each input string. If collapse_x is also TRUE, then no more than one item is extracted from each group of concatenated input items. The function looks for a URL scheme and a following ":". If there is no match, the first substring that looks like an email address is selected if plain_email is TRUE. A URL between double quotes ("\"") or angle brackets ("<" and ">") takes precedence over a URL without such delimiters. A URL may be rejected by the filtering stage (see url_pattern), in which case it does not matter if an email address was also found: no results are returned for the input string in question.

Commas and Whitespace

The comma is a problematic URL separator, because it is a valid character in some parts of a URL (Berners-Lee, Fielding, and Masinter, 2005). The function estimates which commas should remain as part of a URL. Misclassifications are possible.
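To see why the comma is problematic, consider a naive split on commas and whitespace in base R. This is not how pick_urls works; it only shows what goes wrong without the estimation step described above.

```r
# Naive approach: treat every comma or whitespace run as a separator
naive_split <- function(x) strsplit(x, "[,[:space:]]+")[[1]]

# Works for a simple list of URLs
simple <- naive_split("http://www.example.org/, ftp://cran.r-project.org")
print(simple)

# Breaks a single URL that legitimately contains commas
broken <- naive_split("https://a,b,c@[vf.a1,b2]/foo,bar")
print(broken)
```

The second input is a single URL, yet the naive split returns five fragments.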

Text between a pair of double quotes or angle brackets (URL scheme required after the opening delimiter) is mostly interpreted as representing a single URL, but commas are still checked, and the URL is cut when necessary. It is possible to use multiple lines (but not multiple strings unless collapse_x is TRUE) for a long URL when delimited by "<" and ">": whitespace is removed when it occurs after the ":" that follows the URL scheme.
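The whitespace handling for a URL in angle brackets can be sketched as follows. The helper `join_bracketed` is hypothetical: it assumes a single string that starts with "<" and ends with ">", and it skips the comma checks described above.

```r
# Hypothetical helper (not part of pickURL): join a URL that was wrapped
# over several lines inside "<" and ">".
join_bracketed <- function(x) {
  stopifnot(length(x) == 1L, startsWith(x, "<"), endsWith(x, ">"))
  inner <- substring(x, 2L, nchar(x) - 1L)
  # Position of the ":" that follows the URL scheme
  colon <- as.integer(regexpr(":", inner, fixed = TRUE))
  scheme_part <- substr(inner, 1L, colon)
  rest <- substring(inner, colon + 1L)
  # Whitespace is removed only after the scheme's ":"
  paste0(scheme_part, gsub("[[:space:]]+", "", rest))
}

joined <- join_bracketed("<https://www.example.org/a/very/\n   long/path>")
print(joined)
```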

Whitespace (tabs and spaces) and commas are allowed in two places in plain email addresses: in a double-quoted local part and in a domain literal delimited by square brackets (Resnick, 2008). The function accepts both.

Berners-Lee, T., Fielding, R., and Masinter, L. (2005) Uniform Resource Identifier (URI): Generic syntax. RFC 3986, RFC Editor. https://www.rfc-editor.org/rfc/rfc3986.txt.

Braden, R., editor (1989) Requirements for Internet hosts - application and support. RFC 1123, RFC Editor. https://www.rfc-editor.org/rfc/rfc1123.txt.

Duerst, M., Masinter, L., and Zawinski, J. (2010) The 'mailto' URI scheme. RFC 6068, RFC Editor. https://www.rfc-editor.org/rfc/rfc6068.txt.

Duerst, M. and Suignard, M. (2005) Internationalized Resource Identifiers (IRIs). RFC 3987, RFC Editor. https://www.rfc-editor.org/rfc/rfc3987.txt.

Elz, R. and Bush, R. (1997) Clarifications to the DNS specification. RFC 2181, RFC Editor. https://www.rfc-editor.org/rfc/rfc2181.txt.

Harrenstien, K., Stahl, M., and Feinler, E. (1985) DoD Internet host table specification. RFC 952, RFC Editor. https://www.rfc-editor.org/rfc/rfc952.txt.

Mockapetris, P. (1987) Domain names - concepts and facilities. RFC 1034, RFC Editor. https://www.rfc-editor.org/rfc/rfc1034.txt.

R Core Team (2017) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

Resnick, P., editor (2008) Internet Message Format. RFC 5322, RFC Editor. https://www.rfc-editor.org/rfc/rfc5322.txt.

email1 <- "user1@example.org"
urls <- c("http://www.example.org/",
          "ftp://cran.r-project.org",
          "https://a,b,c@[vf.a1,b2]/foo,bar",
          paste0("mailto:", email1))
phrase <- c(paste0("See ", urls[1], ", ", urls[2], " and"),
            paste0(urls[3], "."))
url_urls <- paste0("With prefix URL:", urls, " and that's all.")
comma_urls <- paste0(urls, collapse=",")
angle_urls <- sub(".", ".\n", paste0("<", urls, ">"), fixed=TRUE)
split_urls <- unlist(strsplit(angle_urls, "\n", fixed=TRUE))

pu1 <- pick_urls(urls)
identical(pu1, urls[1:3])                       # TRUE
pu2 <- pick_urls(urls, url_pattern="")
identical(pu2, urls)                            # TRUE
pu3 <- pick_urls(phrase)
identical(pu3, pu1)                             # TRUE
pu4 <- pick_urls(url_urls, url_pattern="")
identical(pu4, urls)                            # TRUE
pu5 <- pick_urls(urls, url_pattern="", all_email=TRUE)
identical(pu5[["url"]], urls[1:3])              # TRUE
identical(pu5[["email"]], email1)               # TRUE
pu6 <- pick_urls(comma_urls, url_pattern="")
identical(pu6, urls)                            # TRUE
pu7 <- pick_urls(angle_urls, url_pattern="")
identical(pu7, urls)                            # TRUE
pu8 <- pick_urls(split_urls, url_pattern="", collapse_x=TRUE)
identical(pu8, urls)                            # TRUE

emails <- c("user2 at example.org",
            "\"user 3\"(comment) @ localhost",
            "\"user", " 4\"@[::", " 1]")
emails_target <- c("user2@example.org",
                   "\"user 3\"@localhost",
                   "\"user 4\"@[::1]")

pe1 <- pick_urls(emails, plain_email=TRUE)
identical(pe1[["email"]], emails_target[1])     # TRUE
pe2 <- pick_urls(emails, plain_email=TRUE, email_pattern="")
identical(pe2[["email"]], emails_target[1:2])   # TRUE
pe3 <- pick_urls(emails, plain_email=TRUE, email_pattern="",
                 collapse_x=TRUE)
identical(pe3[["email"]], emails_target)        # TRUE
pe4 <- pick_urls(emails, plain_email=TRUE, email_pattern="",
                 deobfuscate=FALSE)
identical(pe4[["email"]], emails_target[2])     # TRUE