knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  fig.path = "man/images/"
)

Newsmap: geographical document classifier


Semi-supervised Bayesian model for geographical document classification. Newsmap automatically constructs a large geographical dictionary from a corpus to accurately classify documents. The newsmap package currently contains seed dictionaries in multiple languages: English, German, French, Spanish, Portuguese, Russian, Italian, Hebrew, Arabic, Japanese and Chinese.

The details of the algorithm are explained in Newsmap: semi-supervised approach to geographical news classification. newsmap has also been used in scientific research in various fields (Google Scholar).

How to install

newsmap has been available on CRAN since version 0.6. You can install the package from the RStudio GUI or with the command below.

install.packages("newsmap")

If you want the latest development version, install it by running the following commands in R. You need to have devtools installed beforehand.

install.packages("devtools")
devtools::install_github("koheiw/newsmap")
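Once installed, you can check the package version and see which language-specific seed dictionaries are bundled with it. The snippet below is a minimal sketch using only base R utilities; the dictionary objects follow the data_dictionary_newsmap_* naming pattern (data_dictionary_newsmap_en is used in the example further down).

library(newsmap)

# Confirm the installed version
packageVersion("newsmap")

# List the datasets shipped with the package, including the
# language-specific seed dictionaries (data_dictionary_newsmap_*)
data(package = "newsmap")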

Example

In this example, we use the text analysis package quanteda to preprocess the textual data and train a geographical classification model on a corpus of news summaries collected from Yahoo News via RSS in 2014.

Download example data

download.file('https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1', '~/yahoo-news.RDS')
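The file is saved to your home directory as yahoo-news.RDS. Before reading it, a quick sanity check with base R (a suggestion, not part of the original instructions) confirms that the download completed:

# Confirm the file exists and has a non-zero size
file.exists('~/yahoo-news.RDS')
file.size('~/yahoo-news.RDS')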

Train Newsmap classifier

require(newsmap)
require(quanteda)

# Load data
dat <- readRDS('~/yahoo-news.RDS')
dat$text <- paste0(dat$head, ". ", dat$body)
dat$body <- NULL
corp <- corpus(dat, text_field = 'text')

# Custom stopwords
month <- c('January', 'February', 'March', 'April', 'May', 'June',
           'July', 'August', 'September', 'October', 'November', 'December')
day <- c('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday')
agency <- c('AP', 'AFP', 'Reuters')

# Select training period
sub_corp <- corpus_subset(corp, '2014-01-01' <= date & date <= '2014-12-31')

# Tokenize
toks <- tokens(sub_corp)
toks <- tokens_remove(toks, stopwords('english'), valuetype = 'fixed', padding = TRUE)
toks <- tokens_remove(toks, c(month, day, agency), valuetype = 'fixed', padding = TRUE)

# quanteda v1.5 introduced 'nested_scope' to reduce ambiguity in dictionary lookup
toks_label <- tokens_lookup(toks, data_dictionary_newsmap_en, 
                            levels = 3, nested_scope = "dictionary")
dfmt_label <- dfm(toks_label)

dfmt_feat <- dfm(toks, tolower = FALSE)
dfmt_feat <- dfm_select(dfmt_feat, selection = "keep", '^[A-Z][A-Za-z1-2]+', 
                        valuetype = 'regex', case_insensitive = FALSE) # keep only capitalized words (proper nouns) as model features
dfmt_feat <- dfm_trim(dfmt_feat, min_termfreq = 10)

model <- textmodel_newsmap(dfmt_feat, dfmt_label)

# Features with largest weights
coef(model, n = 7)[c("us", "gb", "fr", "br", "jp")]
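
As the subsetting above suggests, coef() returns the highest-weighted features for each country, keyed by ISO 3166-1 alpha-2 codes. To look at a single country more closely, you can request more features and extract its element; this is a small usage sketch along the same lines as the call above.

# Top 15 features associated with Japan ("jp")
coef(model, n = 15)[["jp"]]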

Predict geographical focus of texts

pred_data <- data.frame(text = as.character(sub_corp), country = predict(model))
knitr::kable(head(pred_data))
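
The predicted country codes can be summarized with base R to see which countries receive the most coverage in the 2014 corpus. This is a small follow-up sketch, not part of the original workflow:

# Ten most frequently predicted countries in the training period
head(sort(table(pred_data$country), decreasing = TRUE), 10)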

