In jaytimm/quicknews: Some Tools For Working With Digital Media

quicknews

A simple, lightweight news article extractor, with functions for:

(1) retrieving URLs for search-based Google News RSS feeds; (2) parsing RSS feeds; (3) extracting online news article content; and (4) resolving shortened URLs.

quicknews::qnews_build_rss(x = 'jan 6 committee hearings')

Note (to self): Sometimes custom Google News searches return blank XML feeds. Not sure why. Do not change the code; instead, try again later. Check the example RSS feed for a 'jan 6 committee hearings' custom search.

Installation

You can download the development version from GitHub with:

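# Install the development version from GitHub
devtools::install_github("jaytimm/quicknews")
# remotes::install_github("jaytimm/quicknews")

# dplyr supplies select() and left_join(), used in the examples below
library(dplyr)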

Usage

§ Google News RSS feed URL

Specify a custom search term, or leave the parameter empty to get the RSS URL for Google News' top stories. A custom search feed includes the 100 most recent articles; the top-stories feed includes the 40 most recent.

mm <- quicknews::qnews_build_rss(x = 'medical marijuana')
mm
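For orientation, a Google News search feed URL can also be assembled by hand. The sketch below is not the package's implementation: the helper name build_gnews_rss_url and the hl/gl/ceid locale defaults are assumptions for illustration, and qnews_build_rss may construct its URL differently.

```r
# Hypothetical helper, not part of quicknews: builds a Google News RSS URL.
# The locale defaults (hl, gl, ceid) are assumptions for a US English feed.
build_gnews_rss_url <- function(x = NULL, hl = "en-US", gl = "US", ceid = "US:en") {
  locale <- paste0("hl=", hl, "&gl=", gl, "&ceid=", ceid)
  if (is.null(x) || !nzchar(x)) {
    # No search term: fall back to the top-stories feed
    return(paste0("https://news.google.com/rss?", locale))
  }
  # Percent-encode the search term so spaces and symbols are URL-safe
  q <- utils::URLencode(x, reserved = TRUE)
  paste0("https://news.google.com/rss/search?q=", q, "&", locale)
}

search_url <- build_gnews_rss_url("medical marijuana")
top_url <- build_gnews_rss_url()
search_url
```

Calling the helper with no argument mirrors the "leave the parameter empty" case described above.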

§ Parsing RSS feed

mm_feed <- quicknews::qnews_parse_rss(mm) 

mm_feed |>
  select(-link) |> head() |>
  knitr::kable()
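To make the parsing step concrete, here is a minimal sketch of turning RSS items into a data frame with xml2. The helper parse_rss_items, its output columns, and the two-item feed are made up for illustration; the columns returned by qnews_parse_rss may differ.

```r
library(xml2)

# Hypothetical helper, not part of quicknews: one row per <item> in an RSS feed.
parse_rss_items <- function(feed) {
  # read_xml() accepts a URL, a file path, or a literal XML string
  doc <- read_xml(feed)
  items <- xml_find_all(doc, ".//item")
  # Pull one child element per item; items lacking it yield NA
  field <- function(name) xml_text(xml_find_first(items, paste0("./", name)))
  data.frame(
    title = field("title"),
    link = field("link"),
    date = field("pubDate"),
    stringsAsFactors = FALSE
  )
}

# Made-up feed so the example runs offline
rss <- '<rss version="2.0"><channel><title>Demo</title>
  <item><title>First story</title><link>https://example.com/1</link>
    <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate></item>
  <item><title>Second story</title><link>https://example.com/2</link></item>
</channel></rss>'

feed_df <- parse_rss_items(rss)
feed_df
```

A feed with no item elements yields a zero-row data frame here, which is one way a blank XML feed would surface downstream.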

The user can also specify RSS feeds from other sources; below, the health-related feed from BBC News.

quicknews::qnews_parse_rss('http://feeds.bbci.co.uk/news/health/rss.xml') |>
  select(-link) |> head() |>
  knitr::kable()

§ Convert Google link to url proper

This step is new: Google News no longer includes the article's actual URL in its RSS feed, so the Google link has to be converted.

mm_feed$url <- quicknews::qnews_get_rurls(mm_feed$link)

Google link:

mm_feed$link[1]

Proper url:

mm_feed$url[1]

§ Article content

The qnews_extract_article function is designed for multi-threaded text extraction from HTML, via rvest and xml2. It is a simple approach with no Java dependencies. HTML markup, comments, extraneous text, etc. are removed mostly on the basis of node type, node-final punctuation, character length, and a small dictionary of "junk" phrases.

articles <- quicknews::qnews_extract_article(
  x = mm_feed$url[1:5], cores = 1) |>
  left_join(mm_feed)

list(title = strwrap(articles$title[1], width = 60), 
     text = strwrap(articles$text[1], width = 60)[1:5])
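The filtering heuristics described for qnews_extract_article (node type, node-final punctuation, character length, junk phrases) can be sketched roughly as follows. This is an illustration under assumptions, not the package's code: the helper extract_text_sketch, the 40-character threshold, the junk list, and the sample HTML are all invented here.

```r
library(xml2)

# Hypothetical helper, not part of quicknews: keep nodes that look like article text.
extract_text_sketch <- function(html, min_chars = 40,
                                junk = c("subscribe", "sign up", "all rights reserved")) {
  doc <- read_html(html)
  # Node type: only paragraphs and headings are candidates; comments never match
  nodes <- xml_find_all(doc, ".//p | .//h1 | .//h2 | .//h3")
  df <- data.frame(type = xml_name(nodes),
                   text = trimws(xml_text(nodes)),
                   stringsAsFactors = FALSE)
  is_heading <- df$type != "p"
  # Node-final punctuation: body text usually ends a sentence or a quote
  ends_ok <- grepl("[.!?\"')]$", df$text)
  # Character length: very short paragraphs tend to be buttons or captions
  long_ok <- nchar(df$text) >= min_chars
  # Junk dictionary: case-insensitive match on any listed phrase
  is_junk <- grepl(paste(junk, collapse = "|"), df$text, ignore.case = TRUE)
  df[(is_heading | (ends_ok & long_ok)) & !is_junk, , drop = FALSE]
}

# Made-up page so the example runs offline
html <- '<html><body>
  <h1>Clinic opens downtown</h1>
  <p>A new clinic opened on Main Street this week, officials said on Monday.</p>
  <p>Share this</p>
  <p>Subscribe to our newsletter for daily updates delivered to your inbox.</p>
  <p>Related stories you may also like from around the region</p>
  <!-- tracking comment -->
</body></html>'

kept <- extract_text_sketch(html)
kept
```

Headings are exempt from the punctuation and length checks in this sketch because titles rarely end in a period; whether the package treats them the same way is not stated in the source.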

The generic get_site function can be used to scrape text from any URL. It is quick and convenient.

wik <- 'https://en.wikipedia.org/wiki/Generation_Jones' |>
  quicknews::get_site() |>
  subset(type == 'p' & nchar(text) > 3)

strwrap(wik$text[1], width = 60)