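A simple, lightweight news article extractor, with functions for building and parsing Google News RSS feeds, resolving Google News links to source URLs, and extracting article text from HTML.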
Note (to self): Sometimes custom Google News searches return blank XML feeds. Not sure why. Do not change the code. Instead, try again later. Check the example RSS feed here for a "jan 6 committee hearings" custom search.
You can download the development version from GitHub with:
devtools::install_github("jaytimm/quicknews")
# remotes::install_github("jaytimm/quicknews")
Specify a custom search keyword, or leave the parameter empty to get the RSS URL for Google News’ top stories. For a custom search, the RSS feed includes the 100 most recent articles; for top stories, the 40 most recent.
mm <- quicknews::qnews_build_rss(x = 'medical marijuana')
mm
## [1] "https://news.google.com/news/rss/search?hl=en-US&gl=US&ceid=US:en&q=%22medical%20marijuana%22"
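For illustration, the search URL above can be reproduced in base R. This is a minimal sketch, not the package's implementation: the helper name build_rss_sketch is hypothetical, and the base URL and locale parameters are simply copied from the output above. It covers only the custom-search case.

```r
# Hypothetical helper (not part of quicknews): rebuild the search URL shown
# above by quoting the search term and percent-encoding it.
build_rss_sketch <- function(x) {
  base <- "https://news.google.com/news/rss/search"
  locale <- "hl=en-US&gl=US&ceid=US:en"  # copied from the output above
  # Wrap the term in double quotes, as in the output above, then encode it
  query <- utils::URLencode(paste0('"', x, '"'), reserved = TRUE)
  paste0(base, "?", locale, "&q=", query)
}

build_rss_sketch("medical marijuana")
```

The quotes and the space are encoded as %22 and %20, which matches the q= parameter in the URL printed above.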
mm_feed <- quicknews::qnews_parse_rss(mm)
mm_feed |>
  dplyr::select(-link) |> head() |>
  knitr::kable()
| date | source | title |
|:-----|:-------|:------|
| 2023-08-17 | Marijuana Moment | Medical Marijuana Use Linked To Improved Quality Of Life And Better Job Performance For People With Neurological … |
| 2023-08-17 | Crain’s Detroit Business | Does medical marijuana still matter in Michigan? |
| 2023-08-17 | Marijuana Moment | Minnesota Marijuana Regulators Lay Out Roadmap For Implementing Legalization |
| 2023-08-17 | Marijuana Moment | Bipartisan Lawmakers Push VA Secretary To End ‘Detrimental’ Policy Blocking Doctors From Recommending Medical … |
| 2023-08-16 | WPEC | Seniors on marijuana: Local health system sees success in medical marijuana pilot program |
| 2023-08-17 | 4029tv | Medical marijuana tax revenues set to help food insecurity in Arkansas |
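To make the shape of the parsed feed concrete, here is a rough sketch of turning RSS items into a table with xml2. It is not the code behind qnews_parse_rss: the inline feed is made up, the helper field() is hypothetical, and the column names are taken from the table above.

```r
library(xml2)

# A tiny, made-up RSS document standing in for a downloaded feed
rss <- '<rss version="2.0"><channel>
  <title>Example feed</title>
  <item>
    <title>First headline</title>
    <link>https://example.com/a</link>
    <pubDate>Thu, 17 Aug 2023 14:00:00 GMT</pubDate>
    <source url="https://example.com">Example Source</source>
  </item>
  <item>
    <title>Second headline</title>
    <link>https://example.com/b</link>
    <pubDate>Wed, 16 Aug 2023 09:30:00 GMT</pubDate>
  </item>
</channel></rss>'

items <- xml_find_all(read_xml(rss), "//item")

# Hypothetical helper: text of one child element per <item>.
# xml_find_first() keeps one slot per item, so missing children become NA.
field <- function(name) xml_text(xml_find_first(items, paste0("./", name)))

feed <- data.frame(
  date   = field("pubDate"),  # left as text; date parsing is locale-dependent
  source = field("source"),
  title  = field("title"),
  link   = field("link"),
  stringsAsFactors = FALSE
)
feed
```

The second item has no source element, so its source value is NA rather than shifting the remaining rows.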
The user can also specify RSS feeds from other sources; below, the health-related feed from BBC News.
quicknews::qnews_parse_rss('http://feeds.bbci.co.uk/news/health/rss.xml') |>
  dplyr::select(-link) |> head() |>
  knitr::kable()
| date | source | title |
|:-----|:-------|:------|
| 2023-08-16 | BBC News - Health | Junior doctors in Scotland accept new pay offer |
| 2023-08-16 | BBC News - Health | Aerosol fire trend leads to ‘two or three’ burns a week |
| 2023-08-17 | BBC News - Health | Should we be worried about Covid this winter? |
| 2023-08-15 | BBC News - Health | £250m funding for more hospital beds in England this winter |
| 2023-08-17 | BBC News - Health | Killiow House holiday lodges could house NHS staff |
| 2023-08-14 | BBC News - Health | Many cancer waiting time targets set to be dropped in England |
Resolving the source URL is a new step, as Google News no longer includes the actual article URL in its RSS feed.
mm_feed$url <- quicknews::qnews_get_rurls(mm_feed$link)
Google link:
mm_feed$link[1]
## [1] "https://news.google.com/rss/articles/CBMiqwFodHRwczovL3d3dy5tYXJpanVhbmFtb21lbnQubmV0L21lZGljYWwtbWFyaWp1YW5hLXVzZS1saW5rZWQtdG8taW1wcm92ZWQtcXVhbGl0eS1vZi1saWZlLWFuZC1iZXR0ZXItam9iLXBlcmZvcm1hbmNlLWZvci1wZW9wbGUtd2l0aC1uZXVyb2xvZ2ljYWwtZGlzb3JkZXJzLW5ldy1zdHVkeS1maW5kcy_SAQA?oc=5"
Resolved URL:
mm_feed$url[1]
## [1] "https://www.marijuanamoment.net/medical-marijuana-use-linked-to-improved-quality-of-life-and-better-job-performance-for-people-with-neurological-disorders-new-study-finds/"
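As an aside on why the two strings above differ: for the link shown, the token after /articles/ appears to be URL-safe base64 wrapping a small binary payload that contains the article URL. The sketch below decodes that one link. It is not how qnews_get_rurls works (the source does not say), the helper name is hypothetical, it assumes the jsonlite package is installed, and links in other formats may not embed the URL at all.

```r
# Hypothetical helper (not part of quicknews): pull an embedded URL out of a
# Google News article link, assuming the token is URL-safe base64.
decode_gnews_link_sketch <- function(link) {
  # Keep only the token between "/articles/" and the query string
  token <- sub("\\?.*$", "", sub("^.*/articles/", "", link))
  # Convert URL-safe base64 to the standard alphabet and restore padding
  token <- chartr("-_", "+/", token)
  token <- paste0(token, strrep("=", (4 - nchar(token) %% 4) %% 4))
  bytes <- as.integer(jsonlite::base64_dec(token))
  # The payload is binary, so drop everything outside printable ASCII
  txt <- rawToChar(as.raw(bytes[bytes >= 32 & bytes <= 126]))
  m <- regmatches(txt, regexpr("https?://[^[:space:]\"]+", txt))
  if (length(m) == 0) NA_character_ else m
}

link <- paste0(
  "https://news.google.com/rss/articles/",
  "CBMiqwFodHRwczovL3d3dy5tYXJpanVhbmFtb21lbnQubmV0L21lZGljYWwtbWFyaWp1YW5h",
  "LXVzZS1saW5rZWQtdG8taW1wcm92ZWQtcXVhbGl0eS1vZi1saWZlLWFuZC1iZXR0ZXItam9i",
  "LXBlcmZvcm1hbmNlLWZvci1wZW9wbGUtd2l0aC1uZXVyb2xvZ2ljYWwtZGlzb3JkZXJzLW5l",
  "dy1zdHVkeS1maW5kcy_SAQA?oc=5"
)
resolved <- decode_gnews_link_sketch(link)
resolved
```

For real use, rely on qnews_get_rurls; this sketch only illustrates the relationship between the two strings for this example.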
The qnews_extract_article function is designed for multi-threaded text extraction from HTML, via rvest and xml2. It is a simple approach, with no Java dependencies. HTML markup, comments, extraneous text, etc. are removed mostly via node type, node-final punctuation, character length, and a small dictionary of “junk” phrases.
articles <- quicknews::qnews_extract_article(
  x = mm_feed$url[1:5], cores = 1) |>
  dplyr::left_join(mm_feed)
list(title = strwrap(articles$title[1], width = 60),
text = strwrap(articles$text[1], width = 60)[1:5])
## $title
## [1] "Medical Marijuana Use Linked To Improved Quality Of Life"
## [2] "And Better Job Performance For People With Neurological ..."
##
## $text
## [1] "Medical marijuana use is associated with improved quality"
## [2] "of life—including better job performance, sleep, appetite"
## [3] "and energy—according to a new study. Researchers at the"
## [4] "University of West Attica in Greece published the study in"
## [5] "the journal GeNeDis Neuroscientific Advances on Wednesday"
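The filtering heuristics described above (node type, node-final punctuation, character length, junk phrases) can be sketched roughly as follows. This is an illustration under assumptions, not the logic inside qnews_extract_article: the inline HTML, the 40-character threshold, and the junk phrase list are all made up for the example.

```r
library(xml2)

# Made-up page fragment standing in for a downloaded article
html <- '<html><body>
  <h1>Example headline</h1>
  <p>Subscribe to our newsletter.</p>
  <p>Share</p>
  <p>Researchers published the study on Wednesday, and the findings were widely reported.</p>
  <!-- a comment -->
  <div>Footer text without final punctuation</div>
</body></html>'

# Node type: keep paragraph nodes only, which also skips comments and divs
nodes <- xml_find_all(read_html(html), "//p")
text <- trimws(xml_text(nodes))

junk <- c("subscribe to our newsletter")  # hypothetical junk phrases

keep <- nchar(text) >= 40 &                              # character length
  grepl("[.!?\"]$", text) &                              # node-final punctuation
  !grepl(paste(junk, collapse = "|"), tolower(text))     # junk dictionary

kept <- text[keep]
kept
```

Only the full sentence survives here: "Share" is too short and lacks final punctuation, and the newsletter prompt is both short and on the junk list.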
The generic get_site function can be used to scrape text from any URL. Quick and convenient.
wik <- 'https://en.wikipedia.org/wiki/Generation_Jones' |>
quicknews::get_site() |>
subset(type == 'p' & nchar(text) > 3)
strwrap(wik$text[1], width = 60)
## [1] "Generation Jones is the social cohort[1][2] of the latter"
## [2] "half of the baby boomer generation to the first year of"
## [3] "Generation X.[3][4][5][6] The term Generation Jones was"
## [4] "first coined by the American cultural commentator Jonathan"
## [5] "Pontell, who identified the cohort as those born from 1954"
## [6] "to 1965 in the U.S.,[7] who were children during Watergate,"
## [7] "the oil crisis, and stagflation rather than during the"
## [8] "1950s, but slightly before Gen X.[8][9]"
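The subset() call above implies that get_site returns a table with at least a type column (the node name) and a text column. A minimal sketch of building such a table with xml2 is shown below. The inline HTML and the choice of node types are assumptions for illustration, and the real function may return additional columns.

```r
library(xml2)

# Made-up page standing in for a scraped site
html <- '<html><body>
  <h1>Example page</h1>
  <p>ok</p>
  <p>A paragraph of body text.</p>
  <ul><li>A list item</li></ul>
</body></html>'

nodes <- xml_find_all(read_html(html), "//h1 | //p | //li")

# One row per node: its tag name and its text content
site <- data.frame(
  type = xml_name(nodes),
  text = trimws(xml_text(nodes)),
  stringsAsFactors = FALSE
)

# Same filter as in the example above: paragraphs with more than 3 characters
paras <- subset(site, type == "p" & nchar(text) > 3)
paras$text
```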