get_wiki_content: Get Wikipedia Text Content from Multiple Pages

View source: R/get_wiki_content.R

get_wiki_content R Documentation

Get Wikipedia Text Content from Multiple Pages

Description

A wrapper around WikipediR::get_page_content that applies some light cleaning to the content and automatically handles pages with a redirect (see the Details section). Furthermore, the function does not stop if the input includes erroneous page names; it simply skips them.

Usage

get_wiki_content(page_names, language = "en", project = "wikipedia",
  rm_bracket_length = 50)

Arguments

page_names

The names of the Wiki pages whose content is to be retrieved (e.g., "Main_Page").

language

The language code of the wiki. By default "en".

project

The wiki project. By default "wikipedia".

rm_bracket_length

Maximum length (number of characters) of bracket content to be removed. Square brackets and their enclosed content are removed if the content has this length or less. By default 50.
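
The length limit on bracket removal can be illustrated with a small base R sketch. This is not the package's implementation; the helper name remove_short_brackets and its regular expression are assumptions made for illustration only.

```r
# Hypothetical helper (not part of textility): removes square brackets whose
# enclosed content is at most `max_len` characters long.
remove_short_brackets <- function(x, max_len = 50) {
  # "[" followed by 0..max_len non-bracket characters, followed by "]"
  pattern <- sprintf("\\[[^\\[\\]]{0,%d}\\]", max_len)
  gsub(pattern, "", x, perl = TRUE)
}

# Short bracket content such as [edit] or [1] is dropped
remove_short_brackets("S is a language[edit] for statistics[1].")

# Bracket content longer than max_len is left untouched
remove_short_brackets("keep [this longer bracket text] here", max_len = 5)
```

A small limit keeps longer bracketed passages, which may be genuine text, while still removing short markers such as edit links and reference numbers.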

Details

The content cleaning includes:

- removal of "non-text" sections (See_also|Notes_and_References|Notes|References|Further_reading|External_links)
- removal of HTML tags, line breaks, references, and other bracket content (e.g., [edit])
- harmonization of blanks
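
As a rough sketch of what such cleaning might look like in base R (the removal of whole sections is left out), consider the following. The helper name clean_wiki_text and all regular expressions are assumptions for illustration and may differ from what the function actually does.

```r
# Hypothetical helper illustrating the kind of cleaning described above;
# the regular expressions are assumptions, not the package's own.
clean_wiki_text <- function(x) {
  x <- gsub("<[^>]+>", " ", x)       # replace HTML tags with a blank
  x <- gsub("[\r\n]+", " ", x)       # replace line breaks with a blank
  # drop short bracket content such as [edit] or [1]
  x <- gsub("\\[[^\\[\\]]{0,50}\\]", "", x, perl = TRUE)
  x <- gsub("\\s+", " ", x)          # collapse runs of whitespace
  trimws(x)                          # strip leading and trailing blanks
}

clean_wiki_text("<p>S is a <b>language</b>[edit]\nfor   statistics[1].</p>")
```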

If a redirect page is hit, the function discards this page (and the associated name) and continues with the page it redirects to.
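
The skipping of erroneous page names mentioned in the Description can be pictured with tryCatch. The sketch below works offline on made-up data; fake_pages, fetch_page and fetch_all are hypothetical names that stand in for the real retrieval step and are not part of the package.

```r
# Hypothetical stand-in for a page fetcher (no network access): returns the
# page text, or signals an error for unknown names.
fake_pages <- c(Main_Page = "Welcome", R_language = "R is a language")
fetch_page <- function(name) {
  if (!name %in% names(fake_pages)) stop("page not found: ", name)
  fake_pages[[name]]
}

# Collect content for all names, skipping those that raise an error.
fetch_all <- function(page_names) {
  res <- lapply(page_names, function(nm) {
    tryCatch(fetch_page(nm), error = function(e) NULL)
  })
  names(res) <- page_names
  # keep only the pages that were fetched successfully
  unlist(res[!vapply(res, is.null, logical(1))])
}

fetch_all(c("Main_Page", "No_such_page", "R_language"))
```

Turning an error into NULL and filtering afterwards means that one bad name does not abort the retrieval of the remaining pages.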

Value

A character vector with Wiki content.

Examples

 
content = get_wiki_content(c("S_(programming_language)", "Eco-sufficiency", "Energy star ratings"))
# [1] "S_(programming_language)"
# [1] "Eco-sufficiency"
# [1] "Energy star ratings"
# [1] "Energy Star"

# note that "Energy star ratings" is actually a page with a redirect
# the function replaces it with the page it redirects to ("Energy Star")

str(content)
# Named chr [1:3] "SParadigm multi-paradigm: imperative, object orientedDeveloper Rick Becker, Allan Wilks, John ChambersFirst&#16"| __truncated__ ...
# - attr(*, "names")= chr [1:3] "S_(programming_language)" "Eco-sufficiency" "Energy star ratings"

manuelbickel/textility documentation built on Nov. 25, 2022, 9:07 p.m.