pick_urls extracts individual URLs and email addresses from the input text. The function treats commas and whitespace as separators between individual URLs, as specified for the URL field in the DESCRIPTION file of R packages (R Core Team, 2017). See ‘Commas and Whitespace’ for details.
pick_urls(x, plain_email = all_email, single_item = FALSE,
          all_email = FALSE, collapse_x = FALSE,
          mailto_alias = c("email", "e-mail"),
          scheme_sub = list(mailto = mailto_alias),
          url_pattern = "^(https?|ftp)://.", email_pattern = "@[^.[]+\\.[^.]+",
          need_scheme = missing(url_pattern) || !isTRUE(nzchar(url_pattern)),
          deobfuscate = TRUE, rm_endpunct = 20)
x |
a |
plain_email |
a |
single_item |
a |
all_email |
a |
collapse_x |
a |
mailto_alias |
a |
scheme_sub |
a |
url_pattern |
a |
email_pattern |
a |
need_scheme |
a |
deobfuscate |
a |
rm_endpunct |
a |
If plain_email is FALSE, returns a character vector containing the URLs in x. URL schemes are converted to lowercase, which is the canonical form.
If plain_email is TRUE, returns a list where the first element "url" is the URL vector described above and the second element "email" is a character vector with the email addresses found. See all_email.
If non-ASCII results are present, their Encoding will be "UTF-8".
Compatibility with the Internationalized Resource Identifier (IRI) specification (Duerst and Suignard, 2005) has not been assessed carefully. However, the function will accept non-ASCII characters (as opposed to splitting the string).
Invalid UTF-8 strings are handled by keeping the valid (ASCII) bytes and discarding the rest. This means that URLs can still be picked up from a "latin1" encoded string falsely marked as having a "UTF-8" Encoding or when such a string has an "unknown" encoding in a UTF-8 locale.
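The byte filtering described above can be approximated in base R. This is only a sketch, assuming that "keeping the valid (ASCII) bytes" amounts to dropping every byte with a value of 128 or more; keep_ascii_bytes is a hypothetical helper and not part of the package.

```r
# Hypothetical helper (an assumption, not the package's code):
# drop every byte that is outside the ASCII range.
keep_ascii_bytes <- function(s) {
    bytes <- charToRaw(s)
    rawToChar(bytes[as.integer(bytes) < 128L])
}

# Build a latin1 string ("caf" + e-acute byte, then a URL) from raw bytes
# and mark it, falsely, as UTF-8.
bad <- rawToChar(c(charToRaw("caf"), as.raw(0xe9),
                   charToRaw(" http://www.example.org/")))
Encoding(bad) <- "UTF-8"
validUTF8(bad)   # FALSE: the lone 0xe9 byte is not valid UTF-8

# The URL survives intact because it consists of ASCII bytes only.
cleaned <- keep_ascii_bytes(bad)
cleaned          # "caf http://www.example.org/"
```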
The function can remove delimiting brackets, some other punctuation or simple LaTeX markup around URLs.
Note that the function looks for matching delimiting brackets across input string boundaries when single_item is FALSE. Also rm_endpunct works across strings. Therefore it may be best that strings originating from different input files or otherwise non-consecutive input lines are processed separately, with multiple calls to this function. See collapse_x.
The default url_pattern means that the URI scheme must be http, https or ftp and that the scheme must be followed by "://" and at least one character, indicating the presence of an authority component. See regex. For example, setting url_pattern to "^mailto:." would allow email URLs to be returned.
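The effect of the two patterns mentioned above can be checked directly with grepl from base R. This sketch only tests the regular expressions against a few made-up candidate strings; how pick_urls applies url_pattern internally is not shown here and is assumed to be an ordinary regex match.

```r
default_pattern <- "^(https?|ftp)://."
candidates <- c("http://www.example.org/",
                "ftp://cran.r-project.org",
                "mailto:user1@example.org",
                "https://")   # scheme and "://" but nothing after them

# The default keeps http, https and ftp URLs with a non-empty remainder.
keep_default <- grepl(default_pattern, candidates)
candidates[keep_default]

# An alternative pattern that would let mailto URLs through instead.
keep_mailto <- grepl("^mailto:.", candidates)
candidates[keep_mailto]
```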
The default email_pattern requires that the domain portion after "@" has at least two parts separated by a ".". This rules out addresses such as "root@localhost" and avoids some false positives. Also email addresses with a literal IP domain are dropped when the domain is in square brackets.
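The default email_pattern can be tried out on its own with grepl. The addresses below are illustrative, and matching with grepl is an assumption about how the pattern is used; the sketch only shows which strings the regular expression itself accepts.

```r
default_email_pattern <- "@[^.[]+\\.[^.]+"
addresses <- c("user1@example.org",   # domain with two parts
               "root@localhost",      # single-part domain
               "user@[192.0.2.1]")    # literal IP domain in square brackets

# Only the first address has a "@" followed by two dot-separated parts
# that do not start with "[".
ok <- grepl(default_email_pattern, addresses)
addresses[ok]
```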
When single_item is TRUE, no more than one item is extracted from each input string. If also collapse_x is TRUE, then no more than one item is extracted from each group of concatenated input items. The function looks for a URL scheme and a following ":". In case of no match, the first substring looking like an email address is selected if plain_email is TRUE. A URL between double quotes ("\"") or angle brackets ("<" and ">") takes precedence over a URL without such delimiters. A URL may be rejected by the filtering stage (see url_pattern), in which case it does not matter if an email address was also found: no results are returned for the input string in question.
The comma is a problematic URL separator, because it is a valid character in some parts of a URL (Berners-Lee, Fielding, and Masinter, 2005). The function estimates which commas should remain as part of a URL. Misclassifications are possible.
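To see why the comma cannot simply be treated as a separator, consider a naive split with strsplit from base R. This is a counterexample, not the function's algorithm: it cuts a URL that legitimately contains commas into fragments.

```r
urls <- c("http://www.example.org/",
          "https://a,b,c@[vf.a1,b2]/foo,bar")
joined <- paste0(urls, collapse = ",")

# Splitting at every comma breaks the second URL apart.
naive <- strsplit(joined, ",", fixed = TRUE)[[1]]
naive
length(naive)   # 6 pieces, although only 2 URLs were joined
```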
Text between a pair of double quotes or angle brackets (URL scheme required after the opening delimiter) is mostly interpreted as representing a single URL, but commas are still checked, and the URL is cut when necessary. It is possible to use multiple lines (but not multiple strings unless collapse_x is TRUE) for a long URL when delimited by "<" and ">": whitespace is removed when it occurs after the ":" that follows the URL scheme.
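The whitespace removal for a wrapped, angle-bracketed URL can be imitated in base R as follows. The input string and the steps are illustrative assumptions; the sketch handles one well-formed "<...>" item only and does none of the comma checks mentioned above.

```r
# Hypothetical input: one URL wrapped over two lines inside "<" and ">".
wrapped <- "<https://www.example.org/a/very/long/\n  path/to/resource>"

# Strip the delimiters, then split at the first ":" (end of the scheme).
inner <- substr(wrapped, 2L, nchar(wrapped) - 1L)
colon <- regexpr(":", inner, fixed = TRUE)
scheme_part <- substr(inner, 1L, colon)
rest <- substr(inner, colon + 1L, nchar(inner))

# Remove whitespace only from the part that follows the scheme and ":".
unwrapped <- paste0(scheme_part, gsub("[[:space:]]+", "", rest))
unwrapped   # "https://www.example.org/a/very/long/path/to/resource"
```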
Whitespace, i.e. tabs and spaces, and commas are allowed in plain email addresses: in a double quoted local part, or in a domain literal delimited by square brackets (Resnick, 2008). These are accepted by the function.
Berners-Lee, T., Fielding, R., and Masinter, L. (2005) Uniform Resource Identifier (URI): Generic syntax. RFC 3986, RFC Editor. https://www.rfc-editor.org/rfc/rfc3986.txt.
Braden, R., editor (1989) Requirements for Internet hosts - application and support. RFC 1123, RFC Editor. https://www.rfc-editor.org/rfc/rfc1123.txt.
Duerst, M., Masinter, L., and Zawinski, J. (2010) The 'mailto' URI scheme. RFC 6068, RFC Editor. https://www.rfc-editor.org/rfc/rfc6068.txt.
Duerst, M. and Suignard, M. (2005) Internationalized Resource Identifiers (IRIs). RFC 3987, RFC Editor. https://www.rfc-editor.org/rfc/rfc3987.txt.
Elz, R. and Bush, R. (1997) Clarifications to the DNS specification. RFC 2181, RFC Editor. https://www.rfc-editor.org/rfc/rfc2181.txt.
Harrenstien, K., Stahl, M., and Feinler, E. (1985) DoD Internet host table specification. RFC 952, RFC Editor. https://www.rfc-editor.org/rfc/rfc952.txt.
Mockapetris, P. (1987) Domain names - concepts and facilities. RFC 1034, RFC Editor. https://www.rfc-editor.org/rfc/rfc1034.txt.
R Core Team (2017) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
Resnick, P., editor (2008) Internet Message Format. RFC 5322, RFC Editor. https://www.rfc-editor.org/rfc/rfc5322.txt.
email1 <- "user1@example.org"
urls <- c("http://www.example.org/",
"ftp://cran.r-project.org",
"https://a,b,c@[vf.a1,b2]/foo,bar",
paste0("mailto:", email1))
phrase <- c(paste0("See ", urls[1], ", ", urls[2], " and"),
paste0(urls[3], "."))
url_urls <- paste0("With prefix URL:", urls, " and that's all.")
comma_urls <- paste0(urls, collapse=",")
angle_urls <- sub(".", ".\n", paste0("<", urls, ">"), fixed=TRUE)
split_urls <- unlist(strsplit(angle_urls, "\n", fixed=TRUE))
pu1 <- pick_urls(urls)
identical(pu1, urls[1:3]) # TRUE
pu2 <- pick_urls(urls, url_pattern="")
identical(pu2, urls) # TRUE
pu3 <- pick_urls(phrase)
identical(pu3, pu1) # TRUE
pu4 <- pick_urls(url_urls, url_pattern="")
identical(pu4, urls) # TRUE
pu5 <- pick_urls(urls, url_pattern="", all_email=TRUE)
identical(pu5[["url"]], urls[1:3]) # TRUE
identical(pu5[["email"]], email1) # TRUE
pu6 <- pick_urls(comma_urls, url_pattern="")
identical(pu6, urls) # TRUE
pu7 <- pick_urls(angle_urls, url_pattern="")
identical(pu7, urls) # TRUE
pu8 <- pick_urls(split_urls, url_pattern="", collapse_x=TRUE)
identical(pu8, urls) # TRUE
emails <- c("user2 at example.org",
"\"user 3\"(comment) @ localhost",
"\"user", " 4\"@[::", " 1]")
emails_target <- c("user2@example.org",
"\"user 3\"@localhost",
"\"user 4\"@[::1]")
pe1 <- pick_urls(emails, plain_email=TRUE)
identical(pe1[["email"]], emails_target[1]) # TRUE
pe2 <- pick_urls(emails, plain_email=TRUE, email_pattern="")
identical(pe2[["email"]], emails_target[1:2]) # TRUE
pe3 <- pick_urls(emails, plain_email=TRUE, email_pattern="",
collapse_x=TRUE)
identical(pe3[["email"]], emails_target) # TRUE
pe4 <- pick_urls(emails, plain_email=TRUE, email_pattern="",
deobfuscate=FALSE)
identical(pe4[["email"]], emails_target[2]) # TRUE