A simple, lightweight news article extractor, with functions for:
(1) retrieving URLs for search-based Google News RSS feeds; (2) parsing RSS feeds; (3) extracting online news article content; and (4) resolving shortened URLs.
Note (to self): Sometimes custom Google News searches return blank XML feeds. Not sure why. Do not change the code; instead, try again later. To check, build the example RSS feed for a 'jan 6 committee hearings' custom search:
quicknews::qnews_build_rss(x = 'jan 6 committee hearings')
You can install the development version from GitHub with:
devtools::install_github("jaytimm/quicknews")
# or: remotes::install_github("jaytimm/quicknews")
library(dplyr) # provides select() and left_join(), used in the examples below
Specify a custom search term, or leave the parameter empty to get the RSS URL for Google News' top stories. A custom search feed includes the 100 most recent articles; the top-stories feed includes the 40 most recent.
mm <- quicknews::qnews_build_rss(x = 'medical marijuana')
mm

mm_feed <- quicknews::qnews_parse_rss(mm)
mm_feed |>
  select(-link) |>
  head() |>
  knitr::kable()
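To make the two cases above concrete (custom search versus top stories), here is a minimal sketch of how such a feed URL might be assembled. The helper build_rss_url is hypothetical and not part of quicknews, and the URL layout and locale parameters are assumptions about Google News' public RSS endpoints; the URL that qnews_build_rss actually returns may differ.

```r
# Hypothetical helper, NOT the quicknews implementation.
# Assumes the public Google News RSS endpoints and a US English locale.
build_rss_url <- function(x = NULL) {
  base <- "https://news.google.com/rss"
  locale <- "hl=en-US&gl=US&ceid=US:en"

  # No search term: return the top-stories feed
  if (is.null(x)) {
    return(paste0(base, "?", locale))
  }

  # Percent-encode the search term so spaces and reserved characters are URL-safe
  query <- utils::URLencode(x, reserved = TRUE)
  paste0(base, "/search?q=", query, "&", locale)
}

search_url <- build_rss_url("medical marijuana")
top_url <- build_rss_url()
```

Encoding the term with utils::URLencode keeps multi-word searches intact in the query string.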
The user can also specify RSS feeds from other sources; below, the health-related feed from BBC News.
quicknews::qnews_parse_rss('http://feeds.bbci.co.uk/news/health/rss.xml') |>
  select(-link) |>
  head() |>
  knitr::kable()
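For readers curious about what parsing a feed involves, the following is a rough sketch using xml2 on a small inline RSS document. It is an illustration only: the sample feed is made up, and the column names (title, link, date) are assumptions rather than the exact output of qnews_parse_rss.

```r
library(xml2)

# A made-up, two-item RSS feed used only for illustration
rss <- '<rss version="2.0"><channel>
  <title>Example feed</title>
  <item>
    <title>First story</title>
    <link>https://example.com/a</link>
    <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate>
  </item>
  <item>
    <title>Second story</title>
    <link>https://example.com/b</link>
    <pubDate>Tue, 02 Jan 2024 11:30:00 GMT</pubDate>
  </item>
</channel></rss>'

doc <- read_xml(rss)

# Each <item> node is one article
items <- xml_find_all(doc, ".//item")

# Pull one field per item; xml_find_first keeps rows aligned with items
feed <- data.frame(
  title = xml_text(xml_find_first(items, "./title")),
  link = xml_text(xml_find_first(items, "./link")),
  date = xml_text(xml_find_first(items, "./pubDate")),
  stringsAsFactors = FALSE
)
```

The same approach should apply to any standard RSS 2.0 feed, since each article is an item element under channel.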
Resolving article URLs is a new step, as Google News no longer includes the actual article URL in its RSS feed.
mm_feed$url <- quicknews::qnews_get_rurls(mm_feed$link)
Google link:
mm_feed$link[1]
Resolved URL:
mm_feed$url[1]
The qnews_extract_article function is designed for multi-threaded text extraction from HTML, via rvest and xml2. It is a simple approach with no Java dependencies. HTML markup, comments, extraneous text, etc. are removed mostly via node type, node-final punctuation, character length, and a small dictionary of "junk" phrases.
articles <- quicknews::qnews_extract_article(
  x = mm_feed$url[1:5],
  cores = 1) |>
  left_join(mm_feed)

list(title = strwrap(articles$title[1], width = 60),
     text = strwrap(articles$text[1], width = 60)[1:5])
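The four filtering criteria named above (node type, node-final punctuation, character length, junk phrases) can be sketched in base R as follows. This is a simplified illustration, not the package's actual rules: the sample nodes, the kept node types, the 30-character threshold, and the junk phrase list are all assumptions.

```r
# Made-up nodes standing in for text scraped from an article page
nodes <- data.frame(
  type = c("h1", "p", "p", "p", "li", "p"),
  text = c(
    "State expands medical marijuana program",
    "Lawmakers approved the measure on Tuesday after a lengthy debate.",
    "Subscribe to our newsletter for daily updates.",
    "Read more",
    "Home",
    "The program is expected to begin enrolling patients next year."
  ),
  stringsAsFactors = FALSE
)

# Assumed junk dictionary; the real one in quicknews is not reproduced here
junk_phrases <- c("subscribe to our newsletter", "read more", "all rights reserved")

# 1. Node type: keep headings and paragraphs only
keep_type <- nodes$type %in% c("h1", "h2", "p")

# 2. Node-final punctuation: paragraphs should end like a sentence
keep_punct <- nodes$type != "p" | grepl("[.!?]$", nodes$text)

# 3. Character length: drop very short fragments (threshold is an assumption)
keep_len <- nchar(nodes$text) >= 30

# 4. Junk phrases: drop text matching any dictionary entry
keep_junk <- !grepl(paste(junk_phrases, collapse = "|"), tolower(nodes$text))

clean <- nodes[keep_type & keep_punct & keep_len & keep_junk, ]
```

Combining several cheap checks like this tends to remove navigation links and promotional text without needing a full readability model.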
The generic get_site function can be used to scrape text from any URL. Quick, convenient.
wik <- 'https://en.wikipedia.org/wiki/Generation_Jones' |>
  quicknews::get_site() |>
  subset(type == 'p' & nchar(text) > 3)

strwrap(wik$text[1], width = 60)
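As a rough picture of what a generic scraper of this kind might do, the sketch below uses rvest on a small inline HTML string to build a table with type and text columns, then applies the same subset() call as above. It is only an approximation: the HTML is made up, and the internals and full output of get_site are not shown in this document.

```r
library(rvest)

# Made-up page content used only for illustration
html <- '<html><body>
  <h1>Generation Jones</h1>
  <p>Generation Jones is a social cohort.</p>
  <p>1</p>
</body></html>'

page <- read_html(html)

# Collect heading and paragraph nodes in document order
els <- html_elements(page, "h1, h2, p")

# One row per node: the tag name and its visible text
site <- data.frame(
  type = html_name(els),
  text = html_text2(els),
  stringsAsFactors = FALSE
)

# Same filter as in the example above: paragraphs with more than 3 characters
paras <- subset(site, type == "p" & nchar(text) > 3)
```

The character-length filter drops stray fragments such as footnote markers that survive as their own paragraph nodes.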