README.md

quicknews

A simple, lightweight news article extractor, with functions for:

  1. retrieving URLs for search-based Google News RSS feeds;
  2. parsing RSS feeds;
  3. extracting online news article content; and
  4. resolving shortened URLs.

Note (to self): Custom Google News searches sometimes return blank XML feeds, for reasons that remain unclear. Do not change the code; try again later instead. For an example, check the RSS feed for a custom search on the January 6 committee hearings.

Installation

You can install the development version from GitHub with:

devtools::install_github("jaytimm/quicknews")
# remotes::install_github("jaytimm/quicknews")

Usage

§ Google News RSS feed URL

Specify a custom search keyword, or leave the parameter empty to get the RSS URL for Google News’ top stories. A keyword-search feed includes the 100 most recent articles; the top-stories feed includes the 40 most recent.

mm <- quicknews::qnews_build_rss(x = 'medical marijuana')
mm
## [1] "https://news.google.com/news/rss/search?hl=en-US&gl=US&ceid=US:en&q=%22medical%20marijuana%22"
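As a rough illustration of what goes into such a URL, the sketch below rebuilds the keyword-search form shown above using base R only. The helper name `build_search_rss_url` is hypothetical, and this is not necessarily how `qnews_build_rss` is implemented; the locale parameters are simply copied from the output above.

```r
# Hypothetical helper (not part of quicknews): rebuild the keyword-search
# RSS URL shown above. The locale string is copied from that output.
build_search_rss_url <- function(x) {
  stopifnot(is.character(x), length(x) == 1, nzchar(x))
  base   <- "https://news.google.com/news/rss/search"
  locale <- "hl=en-US&gl=US&ceid=US:en"
  # Wrap the keyword in double quotes (exact-phrase search), then
  # percent-encode it: '"' becomes %22 and a space becomes %20.
  query <- utils::URLencode(paste0('"', x, '"'), reserved = TRUE)
  paste0(base, "?", locale, "&q=", query)
}

url <- build_search_rss_url("medical marijuana")
url
```

Quoting the keyword before encoding is what produces the `%22` at either end of the query string.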

§ Parsing RSS feed

library(dplyr) # provides select() and left_join(), used below

mm_feed <- quicknews::qnews_parse_rss(mm)

mm_feed |>
  select(-link) |> head() |>
  knitr::kable()

| date | source | title |
|:-----|:------------|:-----------------------------------------------------|
| 2023-08-17 | Marijuana Moment | Medical Marijuana Use Linked To Improved Quality Of Life And Better Job Performance For People With Neurological … |
| 2023-08-17 | Crain’s Detroit Business | Does medical marijuana still matter in Michigan? |
| 2023-08-17 | Marijuana Moment | Minnesota Marijuana Regulators Lay Out Roadmap For Implementing Legalization |
| 2023-08-17 | Marijuana Moment | Bipartisan Lawmakers Push VA Secretary To End ‘Detrimental’ Policy Blocking Doctors From Recommending Medical … |
| 2023-08-16 | WPEC | Seniors on marijuana: Local health system sees success in medical marijuana pilot program |
| 2023-08-17 | 4029tv | Medical marijuana tax revenues set to help food insecurity in Arkansas |

RSS feeds from other sources can be parsed as well; the example below uses the health-related feed from BBC News.

quicknews::qnews_parse_rss('http://feeds.bbci.co.uk/news/health/rss.xml') |>
  select(-link) |> head() |>
  knitr::kable()

| date | source | title |
|:---------|:--------------|:-----------------------------------------------|
| 2023-08-16 | BBC News - Health | Junior doctors in Scotland accept new pay offer |
| 2023-08-16 | BBC News - Health | Aerosol fire trend leads to ‘two or three’ burns a week |
| 2023-08-17 | BBC News - Health | Should we be worried about Covid this winter? |
| 2023-08-15 | BBC News - Health | £250m funding for more hospital beds in England this winter |
| 2023-08-17 | BBC News - Health | Killiow House holiday lodges could house NHS staff |
| 2023-08-14 | BBC News - Health | Many cancer waiting time targets set to be dropped in England |
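For readers curious about what parsing an RSS feed involves, here is a minimal sketch using xml2 on a small inline feed. The feed contents, the `field` and `parse_pub_date` helpers, and the resulting column layout are all made up for illustration; this is not the internals of `qnews_parse_rss`.

```r
library(xml2)

# A tiny, invented RSS 2.0 document standing in for a real feed.
rss <- '<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/a</link>
      <pubDate>Thu, 17 Aug 2023 12:00:00 GMT</pubDate>
      <source url="https://example.com">Example Source</source>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/b</link>
      <pubDate>Wed, 16 Aug 2023 08:30:00 GMT</pubDate>
      <source url="https://example.com">Example Source</source>
    </item>
  </channel>
</rss>'

doc   <- read_xml(rss)
items <- xml_find_all(doc, "//item")

# Text of one child element per <item>; NA where the element is missing.
field <- function(name) xml_text(xml_find_first(items, name))

# Convert "Thu, 17 Aug 2023 12:00:00 GMT" to a Date without relying on
# the session locale (month.abb is always English).
parse_pub_date <- function(x) {
  parts <- strsplit(x, " ")
  day   <- as.integer(vapply(parts, `[`, "", 2))
  month <- match(vapply(parts, `[`, "", 3), month.abb)
  year  <- as.integer(vapply(parts, `[`, "", 4))
  as.Date(sprintf("%04d-%02d-%02d", year, month, day))
}

feed <- data.frame(
  date   = parse_pub_date(field("pubDate")),
  source = field("source"),
  title  = field("title"),
  link   = field("link"),
  stringsAsFactors = FALSE
)
feed
```

Each `<item>` becomes one row, which mirrors the date, source, title, and link layout of the tables above.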

§ Convert Google link to proper URL

This step is new: Google News no longer includes the actual article URL in its RSS feed, so the Google link has to be converted back to the source URL.

mm_feed$url <- quicknews::qnews_get_rurls(mm_feed$link)

Google link:

mm_feed$link[1]
## [1] "https://news.google.com/rss/articles/CBMiqwFodHRwczovL3d3dy5tYXJpanVhbmFtb21lbnQubmV0L21lZGljYWwtbWFyaWp1YW5hLXVzZS1saW5rZWQtdG8taW1wcm92ZWQtcXVhbGl0eS1vZi1saWZlLWFuZC1iZXR0ZXItam9iLXBlcmZvcm1hbmNlLWZvci1wZW9wbGUtd2l0aC1uZXVyb2xvZ2ljYWwtZGlzb3JkZXJzLW5ldy1zdHVkeS1maW5kcy_SAQA?oc=5"

Proper URL:

mm_feed$url[1]
## [1] "https://www.marijuanamoment.net/medical-marijuana-use-linked-to-improved-quality-of-life-and-better-job-performance-for-people-with-neurological-disorders-new-study-finds/"
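For the example link above, the last path segment happens to be a URL-safe base64 string that embeds the article URL. The base R sketch below decodes it by hand. This is an observation about this particular link only: `decode_base64url` is a hypothetical helper, this is not claimed to be how `qnews_get_rurls` works, and other Google News links may not decode this way.

```r
# Hypothetical helper: decode URL-safe base64 (alphabet A-Z a-z 0-9 - _)
# into a raw vector, ignoring any leftover bits at the end.
decode_base64url <- function(x) {
  alphabet <- c(LETTERS, letters, 0:9, "-", "_")
  vals <- match(strsplit(x, "")[[1]], alphabet) - 1L
  stopifnot(!anyNA(vals))
  # Expand every character to its 6 bits, most significant bit first.
  bits <- as.vector(sapply(vals, function(v) rev(as.integer(intToBits(v))[1:6])))
  n_bytes <- length(bits) %/% 8
  bits <- matrix(bits[seq_len(n_bytes * 8)], nrow = 8)
  as.raw(colSums(bits * 2^(7:0)))
}

link <- paste0(
  "https://news.google.com/rss/articles/",
  "CBMiqwFodHRwczovL3d3dy5tYXJpanVhbmFtb21lbnQubmV0L21lZGljYWwtbWFyaWp1YW5h",
  "LXVzZS1saW5rZWQtdG8taW1wcm92ZWQtcXVhbGl0eS1vZi1saWZlLWFuZC1iZXR0ZXItam9i",
  "LXBlcmZvcm1hbmNlLWZvci1wZW9wbGUtd2l0aC1uZXVyb2xvZ2ljYWwtZGlzb3JkZXJzLW5l",
  "dy1zdHVkeS1maW5kcy_SAQA?oc=5"
)

# Last path segment, with the query string removed.
token <- sub("\\?.*$", "", basename(link))
bytes <- decode_base64url(token)

# Drop non-printable framing bytes, then pull out the embedded URL.
codes <- as.integer(bytes)
txt   <- rawToChar(bytes[codes >= 32 & codes <= 126])
decoded_url <- regmatches(txt, regexpr("https?://.*", txt))
decoded_url
```

The printable-byte filter is a shortcut; a more careful version would parse the surrounding binary framing rather than discard it.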

§ Article content

The qnews_extract_article function is designed for multi-threaded text extraction from HTML, via rvest and xml2. It is a simple approach with no Java dependencies. HTML markup, comments, extraneous text, etc. are removed mostly on the basis of node type, node-final punctuation, character length, and a small dictionary of “junk” phrases.

articles <- quicknews::qnews_extract_article(
  x = mm_feed$url[1:5], cores = 1) |>
  left_join(mm_feed)

list(title = strwrap(articles$title[1], width = 60), 
     text = strwrap(articles$text[1], width = 60)[1:5])
## $title
## [1] "Medical Marijuana Use Linked To Improved Quality Of Life"   
## [2] "And Better Job Performance For People With Neurological ..."
## 
## $text
## [1] "Medical marijuana use is associated with improved quality" 
## [2] "of life—including better job performance, sleep, appetite" 
## [3] "and energy—according to a new study. Researchers at the"   
## [4] "University of West Attica in Greece published the study in"
## [5] "the journal GeNeDis Neuroscientific Advances on Wednesday"
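The filtering idea described above (node type, node-final punctuation, character length, and a junk-phrase dictionary) can be sketched in base R on a toy table of nodes. The node data, the `junk_phrases` vector, the `keep_node` helper, and the 20-character threshold are all assumptions made for illustration, not the rules `qnews_extract_article` actually applies.

```r
# Toy stand-in for scraped HTML nodes: one row per node.
nodes <- data.frame(
  type = c("h1", "p", "p", "p", "li", "p"),
  text = c("Study links cannabis to quality of life",
           "Researchers published the study on Wednesday.",
           "Share",
           "Subscribe to our newsletter for daily updates.",
           "Home",
           "Participants reported better sleep and appetite."),
  stringsAsFactors = FALSE
)

# Illustrative junk dictionary (assumed, lower case).
junk_phrases <- c("subscribe to our newsletter", "all rights reserved")

# Hypothetical helper: TRUE for nodes that look like article body text.
keep_node <- function(type, text, min_chars = 20) {
  is_paragraph  <- type == "p"                         # node type
  ends_sentence <- grepl("[.!?]$", trimws(text))       # node-final punctuation
  long_enough   <- nchar(text) >= min_chars            # character length
  is_junk       <- grepl(paste(junk_phrases, collapse = "|"),
                         tolower(text))                # junk dictionary
  is_paragraph & ends_sentence & long_enough & !is_junk
}

clean <- nodes[keep_node(nodes$type, nodes$text), ]
clean$text
```

Only the two body sentences survive here: the heading and list item fail the node-type check, "Share" is too short and lacks final punctuation, and the newsletter line matches the junk dictionary.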

The generic get_site function can be used to scrape text from any URL; it is quick and convenient.

wik <- 'https://en.wikipedia.org/wiki/Generation_Jones' |>
  quicknews::get_site() |>
  subset(type == 'p' & nchar(text) > 3)

strwrap(wik$text[1], width = 60)
## [1] "Generation Jones is the social cohort[1][2] of the latter"  
## [2] "half of the baby boomer generation to the first year of"    
## [3] "Generation X.[3][4][5][6] The term Generation Jones was"    
## [4] "first coined by the American cultural commentator Jonathan" 
## [5] "Pontell, who identified the cohort as those born from 1954" 
## [6] "to 1965 in the U.S.,[7] who were children during Watergate,"
## [7] "the oil crisis, and stagflation rather than during the"     
## [8] "1950s, but slightly before Gen X.[8][9]"


jaytimm/quicknews documentation built on Aug. 23, 2023, 12:09 a.m.