
pickURL


Extract URLs and email addresses from text using R. All kinds of URIs are supported, not just URLs, and the set of accepted URI schemes can easily be adjusted.

Leading and trailing punctuation is examined. If punctuation appears to be used as a delimiter around a URI, or if a URI is the last part of a sentence, some trailing punctuation may be removed. Comma-separated URI lists are split, but the heuristics used for this may fail, because the comma is a valid character in some parts of a URI. Any technically valid URI is protected from being cut if it is surrounded by angle brackets (<http://www.example.org/>) or double quotes ("http://www.example.org/"). Whitespace is allowed (and removed) within angle brackets, as long as the URI scheme and the following : are not interrupted by whitespace.
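The two ideas above (trimming sentence punctuation, and taking bracketed URIs verbatim) can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper names `strip_trailing_punct` and `extract_bracketed` are hypothetical, and the regular expressions are deliberately much simpler than the heuristics pickURL applies.

```r
# Hypothetical helper (not part of pickURL): drop punctuation that most
# likely ends a sentence rather than belonging to the URI.
strip_trailing_punct <- function(candidate) {
  sub("[.,;:!?]+$", "", candidate)
}

# Hypothetical helper (not part of pickURL): return whatever sits between
# angle brackets, provided it starts with "scheme:" (no whitespace inside
# the scheme or before the colon). Whitespace elsewhere is removed.
extract_bracketed <- function(text) {
  pattern <- "<[A-Za-z][A-Za-z0-9+.-]*:[^<>]*>"
  m <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
  inner <- substr(m, 2, nchar(m) - 1)   # strip the surrounding < and >
  gsub("[[:space:]]+", "", inner)       # remove embedded whitespace
}

strip_trailing_punct("http://www.example.org/.")
extract_bracketed("See <http://www.example.org/a,\n b>, then continue.")
```

In the second call the comma inside the brackets is kept, because the brackets mark the extent of the URI unambiguously; only the line break and the space are removed.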

Some (approximate) validation against the URI specification is performed, for example in the host part of the URI. The program also catches illegal ASCII characters and use of the % character for purposes other than percent-encoding; anything after that, including the illegal character itself, is not considered a part of the URI. The program is generally not aware of possible additional rules applying to URIs following a particular URI scheme. As an exception, the program knows about the structure of mailto URIs.
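As a rough illustration of the truncation rule, the following base R sketch cuts a candidate at the first character that is outside the set allowed by RFC 3986, or at a % that is not followed by two hexadecimal digits. The function name `truncate_at_illegal` is hypothetical and the check is far coarser than what the package does (for example, it does not validate the host part).

```r
# Hypothetical helper (not part of pickURL): keep only the part of the
# candidate that precedes the first illegal character or stray "%".
truncate_at_illegal <- function(uri) {
  # Either a "%" not followed by two hex digits, or any character that is
  # not an RFC 3986 unreserved/reserved character or "%".
  pattern <- "%(?![0-9A-Fa-f]{2})|[^A-Za-z0-9._~:/?#\\[\\]@!$&'()*+,;=%-]"
  pos <- as.integer(regexpr(pattern, uri, perl = TRUE))  # -1 if no match
  bad <- pos > 0
  uri[bad] <- substr(uri[bad], 1, pos[bad] - 1)
  uri
}

truncate_at_illegal(c(
  "http://example.org/a%20b",    # valid percent-encoding, kept whole
  "http://example.org/100%sure", # "%s" is not percent-encoding
  "http://example.org/a b"       # space is not allowed in a URI
))
```

The first element is returned unchanged, while the second and third are cut just before the offending % and space, respectively.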

Installation

With devtools already installed, run the following command in the R console:

devtools::install_github("mvkorpel/pickURL")

Usage

After installing the package, see the help page of function pick_urls.
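For instance, the following lines open the help page and print the function's argument list without assuming anything about its parameters. The `requireNamespace` guard is an addition for illustration, so that the snippet does nothing if the package is not installed.

```r
# Only proceed if pickURL is installed.
has_pickurl <- requireNamespace("pickURL", quietly = TRUE)

if (has_pickurl) {
  # Open the help page mentioned above.
  print(help("pick_urls", package = "pickURL"))
  # Show the formal arguments of the function.
  print(args(pickURL::pick_urls))
}
```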



mvkorpel/pickURL documentation built on May 23, 2019, 10:55 a.m.