get_hrefs: Get hrefs


View source: R/get_hrefs.R

Description

Given a url or an html session, find the links posted on the web page, both relative and external, and return them as absolute urls.

Usage

get_hrefs(x, keep_regex = NULL, omit_regex = NULL,
  omit_bookmarks = TRUE, ...)

Arguments

x

Either a character string giving the website of interest, or a session as defined by the rvest function html_session.

keep_regex

a regular expression; only the found hrefs that match it are kept, see details.

omit_regex

a regular expression; the found hrefs that match it are omitted, see details.

omit_bookmarks

if TRUE, urls containing the "#" symbol will be omitted from the returned urls (Logical, defaults to TRUE)

...

not currently used

Details

There are three options for filtering the set of returned links: keep_regex, omit_regex, and omit_bookmarks. The first two are regular expressions, applied to the set of links in the order keep, then omit. That is, given a character vector of links, using keep_regex and omit_regex is equivalent to the following two lines of code:

> links <- links[grepl(keep_regex, links)]

> links <- links[!grepl(omit_regex, links)]

Both keep_regex and omit_regex are optional. Consider running get_hrefs without filtering and inspecting the returned urls first. Filtering the result post hoc is viable, as is re-running the get_hrefs call with the desired filters.

By default, urls containing the '#' symbol are omitted. Set omit_bookmarks = FALSE to include urls with bookmarks in the return.
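The filtering described above can be sketched in base R without touching the network. This is an illustration only, not the package's internal code: the links vector and both patterns are made up, and the point at which the bookmark filter runs relative to the two regex filters is an assumption (each step only removes elements, so the order does not change the final set).

```r
# Hypothetical links, standing in for the hrefs found on a page.
links <- c(
  "https://example.org/about",
  "https://example.org/blog/post-1",
  "https://example.org/blog/post-1#comments",
  "https://other.example.com/blog/feed.xml"
)

# Illustrative filter settings (made up for this sketch).
keep_regex     <- "/blog/"
omit_regex     <- "\\.xml$"
omit_bookmarks <- TRUE

# Assumed bookmark rule: drop any url containing "#".
if (omit_bookmarks) {
  links <- links[!grepl("#", links, fixed = TRUE)]
}

# Keep first, then omit. A NULL pattern skips that step,
# mirroring the fact that both arguments are optional.
if (!is.null(keep_regex)) links <- links[grepl(keep_regex, links)]
if (!is.null(omit_regex)) links <- links[!grepl(omit_regex, links)]

links
```

Guarding each step with is.null matters in a sketch like this because grepl does not accept a NULL pattern.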

Value

A sna_hrefs object, which is a data.frame with the following columns:

url

<chr> the found urls, modified to be absolute urls

relative

<logical> indicates whether or not the url is relative to the domain of x

The returned object has additional attributes:

session

<session> the html session

If the url or session does not resolve, the returned data.frame will have the aforementioned columns, but will have no rows.
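A minimal sketch of working with a result of this shape follows. The data.frame here is a hand-built stand-in with made-up values, not real get_hrefs output; it carries only the two documented columns and neither the sna_hrefs class nor the session attribute.

```r
# Hypothetical stand-in for a get_hrefs() result; the values are made up.
hrefs <- data.frame(
  url      = c("https://example.org/about", "https://other.example.com/page"),
  relative = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# A url or session that does not resolve gives zero rows,
# so check for that before using the result.
if (nrow(hrefs) == 0L) {
  message("no links returned; the url or session may not have resolved")
} else {
  # Split the links by whether they are relative to the domain of x.
  internal <- hrefs$url[hrefs$relative]
  external <- hrefs$url[!hrefs$relative]
}

# On a real result, the session should be reachable with
# attr(hrefs, "session"), going by the attribute list above.
```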

See Also

vignette(topic = "snaWeb", package = "snaWeb")

Examples

## Not run: 
get_hrefs('neptuneinc.org')

## See the vignette for more details:
vignette("snaWeb", package = "snaWeb")

## End(Not run)

jhollist/snaWeb documentation built on April 7, 2020, 12:49 a.m.