uri_diversity: Quantify URL diversity

View source: R/uri-diversity.R

Description

Compute WSDL Diversity Index, Shannon's evenness index, and Simpson's diversity index for a corpus (collection) of URLs.

Usage

uri_diversity(corpus, corpus_id = uuid::UUIDgenerate(),
  exception_domains = NULL)

url_diversity(corpus, corpus_id = uuid::UUIDgenerate(),
  exception_domains = NULL)

Arguments

corpus

a collection (character vector) of URLs

corpus_id

an identifier (ideally unique) for the collection; will be generated if not provided.

exception_domains

a character vector of domains for which the query string is significant. Normally, the query string is excluded from the canonicalized URI, but in some cases (e.g., youtube.com) it is desirable to have the query string influence the diversity computations.
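
To illustrate the effect of exception_domains, here is a minimal base R sketch of query-string handling during canonicalization. It is not the package's implementation: the helper name canonicalize, the stripping of a leading "www.", and the exact host matching are assumptions made for illustration only.

```r
# Hypothetical helper, NOT part of urldiversity: a rough approximation of
# canonicalization that keeps the query string only for exception domains.
canonicalize <- function(urls, exception_domains = NULL) {
  # Drop the scheme (e.g. "https://") and any fragment ("#...").
  x <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", urls)
  x <- sub("#.*$", "", x)

  # The raw host is everything before the first "/" or "?".
  raw_host <- sub("[/?].*$", "", x)
  rest <- substring(x, nchar(raw_host) + 1)

  # Normalize the host: lower-case, no port, no leading "www." (assumption).
  host <- tolower(raw_host)
  host <- sub(":[0-9]+$", "", host)
  host <- sub("^www\\.", "", host)

  # Strip the query string unless the host is listed as an exception.
  keep_query <- host %in% exception_domains
  rest[!keep_query] <- sub("\\?.*$", "", rest[!keep_query])

  paste0(host, rest)
}

urls <- c(
  "https://www.youtube.com/watch?v=abc",
  "https://www.youtube.com/watch?v=xyz",
  "http://example.com/page?utm=1#top"
)

canonicalize(urls)                 # both YouTube URLs collapse to one URI
canonicalize(urls, "youtube.com")  # the YouTube URLs stay distinct
```

Without the exception, the two video URLs count as the same canonical URI and lower the measured diversity; with it, they are treated as different resources.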

Value

a data frame (tibble) with WSDL, Shannon and Simpson diversity indices for canonical URIs and hostnames.
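
For orientation, the sketch below computes textbook versions of two of the reported indices over a vector of already-canonicalized items (URIs or hostnames). The function names are hypothetical, and the package may use a different variant of each formula, in particular for Simpson's index, which has several common forms. The WSDL index is not reproduced here.

```r
# Hypothetical helpers, NOT part of urldiversity.

# Shannon's evenness: Shannon entropy divided by its maximum, log(S),
# where S is the number of distinct items. Ranges from 0 to 1.
shannon_evenness <- function(x) {
  p <- as.numeric(table(x)) / length(x)
  s <- length(p)
  if (s < 2) return(0)  # a single distinct item has no evenness to measure
  -sum(p * log(p)) / log(s)
}

# Simpson's diversity (one common form): 1 - sum(n_i * (n_i - 1)) / (N * (N - 1)),
# the probability that two items drawn without replacement differ.
simpson_diversity <- function(x) {
  n <- as.numeric(table(x))
  N <- sum(n)
  if (N < 2) return(0)  # undefined for fewer than two items; 0 by convention here
  1 - sum(n * (n - 1)) / (N * (N - 1))
}

hosts <- c("a.com", "a.com", "b.com", "b.com")
shannon_evenness(hosts)
simpson_diversity(hosts)
```

Both functions return 0 for a corpus whose items are all identical and 1 for a corpus whose items are all distinct.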

Note

Algorithm creator: Alexander C. Nwala

Author(s)

Alexander Nwala (anwala@cs.odu.edu); Bob Rudis (bob@rud.is)

References

http://ws-dl.blogspot.com/2018/05/2018-05-04-exploration-of-url-diversity.html

Examples

collection <- readLines(system.file("extdat", "corpus.txt", package = "urldiversity"))
uri_diversity(collection)

hrbrmstr/urldiversity documentation built on May 14, 2019, 4 a.m.