README.md
In jaytimm/quicknews: Some Tools For Working With Digital Media

quicknews

A simple, lightweight news article extractor, with functions for:

retrieving URLs for search-based Google News RSS feeds;
parsing RSS feeds;
extracting online news article content; and
resolving shortened URLs.

Note (to self): Custom Google News searches sometimes return blank XML feeds, for reasons that are unclear. Do not change the code; try again later. See the example RSS feed for a January 6 committee hearings custom search.

You can download the development version from GitHub with:

devtools::install_github("jaytimm/quicknews")
# remotes::install_github("jaytimm/quicknews")

Specify a custom search keyword, or leave the parameter empty to get the RSS URL for Google News’ top stories. A keyword search returns the 100 most recent articles in the RSS feed; top stories returns the 40 most recent.

mm <- quicknews::qnews_build_rss(x = 'medical marijuana')
mm

## [1] "https://news.google.com/news/rss/search?hl=en-US&gl=US&ceid=US:en&q=%22medical%20marijuana%22"
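The URL above can be reproduced with a few lines of base R. This is only a sketch of how such a URL might be assembled, not the package's actual code: the function name build_rss_sketch is hypothetical, and the top-stories URL returned for an empty keyword is an assumption.

```r
# Hypothetical sketch: assemble a Google News RSS URL for a keyword search.
build_rss_sketch <- function(x = NULL) {
  base   <- "https://news.google.com/news/rss"
  params <- "hl=en-US&gl=US&ceid=US:en"
  # Assumption: with no keyword, the top-stories feed lives at the base path.
  if (is.null(x)) return(paste0(base, "?", params))
  # Wrap the keyword in double quotes (exact-phrase search), then percent-encode.
  q <- utils::URLencode(paste0('"', x, '"'), reserved = TRUE)
  paste0(base, "/search?", params, "&q=", q)
}

build_rss_sketch("medical marijuana")
```

The quotes and the space are what produce the %22 and %20 sequences seen in the printed URL.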

mm_feed <- quicknews::qnews_parse_rss(mm) 

mm_feed |>
  dplyr::select(-link) |>
  head() |>
  knitr::kable()

| date | source | title |
|:-----|:-------|:------|
| 2023-08-17 | Marijuana Moment | Medical Marijuana Use Linked To Improved Quality Of Life And Better Job Performance For People With Neurological … |
| 2023-08-17 | Crain’s Detroit Business | Does medical marijuana still matter in Michigan? |
| 2023-08-17 | Marijuana Moment | Minnesota Marijuana Regulators Lay Out Roadmap For Implementing Legalization |
| 2023-08-17 | Marijuana Moment | Bipartisan Lawmakers Push VA Secretary To End ‘Detrimental’ Policy Blocking Doctors From Recommending Medical … |
| 2023-08-16 | WPEC | Seniors on marijuana: Local health system sees success in medical marijuana pilot program |
| 2023-08-17 | 4029tv | Medical marijuana tax revenues set to help food insecurity in Arkansas |

The user can also specify RSS feeds from other sources; below, the health-related feed from BBC News.

quicknews::qnews_parse_rss('http://feeds.bbci.co.uk/news/health/rss.xml') |>
  dplyr::select(-link) |>
  head() |>
  knitr::kable()

| date | source | title |
|:-----|:-------|:------|
| 2023-08-16 | BBC News - Health | Junior doctors in Scotland accept new pay offer |
| 2023-08-16 | BBC News - Health | Aerosol fire trend leads to ‘two or three’ burns a week |
| 2023-08-17 | BBC News - Health | Should we be worried about Covid this winter? |
| 2023-08-15 | BBC News - Health | £250m funding for more hospital beds in England this winter |
| 2023-08-17 | BBC News - Health | Killiow House holiday lodges could house NHS staff |
| 2023-08-14 | BBC News - Health | Many cancer waiting time targets set to be dropped in England |
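For readers curious what parsing a feed into a date/source/title/link table involves, here is a minimal sketch using xml2 on a small inline RSS document. It is not the code behind qnews_parse_rss: the feed content is made up, the helper names field and parse_pubdate are hypothetical, and falling back to the channel title when an item has no source element is an assumption.

```r
library(xml2)

# A tiny inline RSS 2.0 document standing in for a downloaded feed (made-up items).
rss <- '<rss version="2.0"><channel>
  <title>Example feed</title>
  <item>
    <title>First headline</title>
    <link>https://example.com/a</link>
    <pubDate>Thu, 17 Aug 2023 10:00:00 GMT</pubDate>
    <source url="https://example.com">Example Source</source>
  </item>
  <item>
    <title>Second headline</title>
    <link>https://example.com/b</link>
    <pubDate>Wed, 16 Aug 2023 08:30:00 GMT</pubDate>
  </item>
</channel></rss>'

doc   <- read_xml(rss)
items <- xml_find_all(doc, ".//item")

# Text of the first matching child of each item; NA when the child is absent.
field <- function(name) xml_text(xml_find_first(items, name))

# Parse "Thu, 17 Aug 2023 ..." without depending on the session locale.
parse_pubdate <- function(x) {
  m <- regmatches(x, regexec("([0-9]{1,2}) ([A-Za-z]{3}) ([0-9]{4})", x))
  as.Date(vapply(m, function(p) {
    sprintf("%s-%02d-%02d", p[4], match(p[3], month.abb), as.integer(p[2]))
  }, character(1)))
}

# Assumption: items without a <source> element take the channel title instead.
channel_title <- xml_text(xml_find_first(doc, ".//channel/title"))
src <- field("source")

feed <- data.frame(
  date   = parse_pubdate(field("pubDate")),
  source = ifelse(is.na(src), channel_title, src),
  title  = field("title"),
  link   = field("link"),
  stringsAsFactors = FALSE
)
feed
```

Matching month names against month.abb rather than using a "%b" date format keeps the parse independent of the user's locale.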

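Resolving shortened URLs is a new step, needed because Google News no longer includes the article's actual URL in its RSS feed. The qnews_get_rurls function maps each Google link to the publisher's URL.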

mm_feed$url <- quicknews::qnews_get_rurls(mm_feed$link)

Google link:

mm_feed$link[1]

## [1] "https://news.google.com/rss/articles/CBMiqwFodHRwczovL3d3dy5tYXJpanVhbmFtb21lbnQubmV0L21lZGljYWwtbWFyaWp1YW5hLXVzZS1saW5rZWQtdG8taW1wcm92ZWQtcXVhbGl0eS1vZi1saWZlLWFuZC1iZXR0ZXItam9iLXBlcmZvcm1hbmNlLWZvci1wZW9wbGUtd2l0aC1uZXVyb2xvZ2ljYWwtZGlzb3JkZXJzLW5ldy1zdHVkeS1maW5kcy_SAQA?oc=5"

Resolved URL:

mm_feed$url[1]

## [1] "https://www.marijuanamoment.net/medical-marijuana-use-linked-to-improved-quality-of-life-and-better-job-performance-for-people-with-neurological-disorders-new-study-finds/"
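In this particular example, the path segment after /rss/articles/ is URL-safe base64, and the decoded bytes contain the article URL as plain text. The base R sketch below decodes that one link. It is an observation about this example only, not a description of how qnews_get_rurls works, and other Google News links may not embed the URL this way. The helper names b64url_to_raw and embedded_url are hypothetical.

```r
# Hypothetical helper: decode URL-safe base64 text into a raw vector (base R only).
b64url_to_raw <- function(s) {
  alphabet <- c(LETTERS, letters, 0:9, "-", "_")
  vals <- match(strsplit(s, "", fixed = TRUE)[[1]], alphabet) - 1
  stopifnot(!anyNA(vals))
  # Six bits per character, most significant bit first.
  bits <- as.vector(vapply(vals, function(v) (v %/% 2^(5:0)) %% 2, numeric(6)))
  # Regroup into whole bytes; leftover padding bits are dropped.
  n_bytes <- length(bits) %/% 8
  m <- matrix(bits[seq_len(n_bytes * 8)], nrow = 8)
  as.raw(colSums(m * 2^(7:0)))
}

# Hypothetical helper: pull the first http(s) URL out of the decoded bytes.
embedded_url <- function(link) {
  id    <- sub("\\?.*$", "", sub("^.*/", "", link))   # last path segment, no query
  bytes <- b64url_to_raw(id)
  ints  <- as.integer(bytes)
  bytes[ints < 0x21 | ints > 0x7e] <- as.raw(0x20)     # blank out non-printable bytes
  txt   <- rawToChar(bytes)
  regmatches(txt, regexpr("https?://[^ ]+", txt))
}

link <- "https://news.google.com/rss/articles/CBMiqwFodHRwczovL3d3dy5tYXJpanVhbmFtb21lbnQubmV0L21lZGljYWwtbWFyaWp1YW5hLXVzZS1saW5rZWQtdG8taW1wcm92ZWQtcXVhbGl0eS1vZi1saWZlLWFuZC1iZXR0ZXItam9iLXBlcmZvcm1hbmNlLWZvci1wZW9wbGUtd2l0aC1uZXVyb2xvZ2ljYWwtZGlzb3JkZXJzLW5ldy1zdHVkeS1maW5kcy_SAQA?oc=5"
embedded_url(link)
```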

The qnews_extract_article function is designed for multi-threaded text extraction from HTML, via rvest and xml2. It is a simple approach with no Java dependencies. HTML markup, comments, extraneous text, etc. are removed mostly based on node type, node-final punctuation, character length, and a small dictionary of “junk” phrases.

articles <- quicknews::qnews_extract_article(
  x = mm_feed$url[1:5], cores = 1) |>
  dplyr::left_join(mm_feed)

list(title = strwrap(articles$title[1], width = 60), 
     text = strwrap(articles$text[1], width = 60)[1:5])

## $title
## [1] "Medical Marijuana Use Linked To Improved Quality Of Life"   
## [2] "And Better Job Performance For People With Neurological ..."
## 
## $text
## [1] "Medical marijuana use is associated with improved quality" 
## [2] "of life—including better job performance, sleep, appetite" 
## [3] "and energy—according to a new study. Researchers at the"   
## [4] "University of West Attica in Greece published the study in"
## [5] "the journal GeNeDis Neuroscientific Advances on Wednesday"
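The four filters described above (node type, node-final punctuation, character length, junk phrases) can be illustrated on a toy page. This is a sketch of the general idea, not the package's implementation: the HTML, the 20-character threshold, the punctuation set, and the junk-phrase list are all assumptions chosen for the example.

```r
library(xml2)

# Made-up page mixing article text with typical boilerplate.
html <- '<html><body>
  <h1>Study links sleep to memory</h1>
  <p>Researchers reported new findings on Wednesday.</p>
  <p>Share this</p>
  <p>Subscribe to our newsletter for daily updates.</p>
  <ul><li>Home</li></ul>
  <p>The study followed 200 adults over two years.</p>
  <!-- comments are not element nodes, so they are never selected -->
</body></html>'

doc   <- read_html(html)
nodes <- xml_find_all(doc, "//h1 | //p | //li")

node_df <- data.frame(
  type = xml_name(nodes),
  text = trimws(xml_text(nodes)),
  stringsAsFactors = FALSE
)

# Hypothetical junk-phrase dictionary and thresholds.
junk      <- c("subscribe to", "share this")
min_chars <- 20

keep <- node_df$type == "p" &                              # node type
  grepl("[.!?]$", node_df$text) &                          # node-final punctuation
  nchar(node_df$text) >= min_chars &                       # character length
  !grepl(paste(junk, collapse = "|"), tolower(node_df$text))  # junk phrases

kept <- node_df[keep, ]
kept$text
```

Each filter catches a different kind of residue here: the list item fails on node type, "Share this" fails on punctuation and length, and the newsletter prompt is only caught by the junk-phrase check.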

The generic get_site function scrapes text from any URL, which makes it a quick and convenient option.

wik <- 'https://en.wikipedia.org/wiki/Generation_Jones' |>
  quicknews::get_site() |>
  subset(type == 'p' & nchar(text) > 3)

strwrap(wik$text[1], width = 60)

## [1] "Generation Jones is the social cohort[1][2] of the latter"  
## [2] "half of the baby boomer generation to the first year of"    
## [3] "Generation X.[3][4][5][6] The term Generation Jones was"    
## [4] "first coined by the American cultural commentator Jonathan" 
## [5] "Pontell, who identified the cohort as those born from 1954" 
## [6] "to 1965 in the U.S.,[7] who were children during Watergate,"
## [7] "the oil crisis, and stagflation rather than during the"     
## [8] "1950s, but slightly before Gen X.[8][9]"

jaytimm/quicknews documentation built on Aug. 23, 2023, 12:09 a.m.
