clean_news: Clean retrieved news articles

View source: R/clean_news.R

Description

clean_news wrangles the messy data fetched by get_news, returning a tidy tibble with sensible defaults.

Usage

clean_news(data, min_nchar = 300, as_date = TRUE, drop_vars = TRUE,
  to_lower = TRUE, distinct = TRUE, drop_na = FALSE,
  tif_corpus = FALSE)

Arguments

data

Tbl, returned from get_news.

min_nchar

Integer, specifying the minimum number of characters an article must contain to be kept in the corpus.

as_date

Logical, indicating whether dates should be transformed to class "Date".

drop_vars

Logical, indicating whether all variables other than title, text, discoverDate, and website.domainName should be dropped. The Newsriver API typically returns 26 variables, many of which contain sparse metadata.

to_lower

Logical, indicating whether the title and text variables should be transformed to lowercase.

distinct

Logical, indicating whether only articles with distinct title or text values should be kept, so that duplicated articles are removed.

drop_na

Logical, indicating whether to drop rows containing missing values.

tif_corpus

Logical, indicating whether the returned tibble should be a valid TIF (Text Interchange Format) corpus.
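
To make the effect of these arguments concrete, the following base R sketch mimics them on a toy data frame. It is not the package's implementation: sketch_clean, the raw data, the timestamp format of discoverDate, and the reading of distinct as "drop a row when its title or its text repeats an earlier row" are all assumptions made for illustration. The tif_corpus option is left out.

```r
# Hypothetical stand-in for the tibble returned by get_news.
raw <- data.frame(
  title = c("Storm Hits Coast", "Storm Hits Coast", "Short Note", "Markets Rally"),
  text = c(
    paste(rep("Heavy rain fell overnight.", 20), collapse = " "),
    paste(rep("Heavy rain fell overnight.", 20), collapse = " "),
    "Too short.",
    paste(rep("Markets rose on Tuesday.", 20), collapse = " ")
  ),
  # Assumed ISO-style timestamps; the real API format may differ.
  discoverDate = c("2021-01-02T10:15:00", "2021-01-02T11:00:00",
                   "2021-01-03T08:00:00", "2021-01-04T09:30:00"),
  website.domainName = c("example.com", "example.org", "example.com", NA),
  score = c(1.2, 1.1, 0.4, 0.9),
  stringsAsFactors = FALSE
)

# Illustrative re-creation of the documented arguments (not clean_news itself).
sketch_clean <- function(data, min_nchar = 300, as_date = TRUE,
                         drop_vars = TRUE, to_lower = TRUE,
                         distinct = TRUE, drop_na = FALSE) {
  # min_nchar: keep only articles whose text is long enough.
  keep <- !is.na(data$text) & nchar(data$text) >= min_nchar
  out <- data[keep, , drop = FALSE]

  # as_date: convert the timestamp's date part to class "Date".
  if (as_date) {
    out$discoverDate <- as.Date(substr(out$discoverDate, 1, 10))
  }

  # drop_vars: keep only the four core variables.
  if (drop_vars) {
    core <- c("title", "text", "discoverDate", "website.domainName")
    out <- out[, core, drop = FALSE]
  }

  # to_lower: lowercase title and text.
  if (to_lower) {
    out$title <- tolower(out$title)
    out$text <- tolower(out$text)
  }

  # distinct: drop rows whose title or text repeats an earlier row.
  if (distinct) {
    out <- out[!duplicated(out$title) & !duplicated(out$text), , drop = FALSE]
  }

  # drop_na: drop rows with any missing value.
  if (drop_na) {
    out <- out[stats::complete.cases(out), , drop = FALSE]
  }

  rownames(out) <- NULL
  out
}

cleaned <- sketch_clean(raw)
```

With the defaults, the short article and the duplicate are removed and the extra score column is dropped; setting drop_na = TRUE additionally removes the row whose website.domainName is missing.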

Examples

## Not run: 
clean_news(data = my_tbl)

clean_news(my_tbl, min_nchar = 500, tif_corpus = TRUE)

## End(Not run)

MikeJohnPage/newsrivr documentation built on Jan. 4, 2021, 7:48 p.m.