Semi-supervised Bayesian model for geographical document classification. Newsmap automatically constructs a large geographical dictionary from a corpus to accurately classify documents. Currently, the newsmap package contains seed dictionaries in multiple languages, including English, German, French, Spanish, Portuguese, Russian, Italian, Hebrew, Arabic, Japanese, and Chinese.
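The seed dictionaries are ordinary quanteda dictionary objects with a region > subregion > country hierarchy, so they can be inspected directly. Below is a minimal sketch; the specific keys shown (`"ASIA"`, `"EAST"`, `"JP"`) are assumptions about the dictionary's labels and may differ in the installed version.

```r
library(newsmap)

# Top-level keys of the English seed dictionary (world regions)
names(data_dictionary_newsmap_en)

# Seed words for one country, assuming it sits under ASIA > EAST > JP
data_dictionary_newsmap_en[["ASIA"]][["EAST"]][["JP"]]
```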
The details of the algorithm are explained in Newsmap: semi-supervised approach to geographical news classification. newsmap has also been used in scientific research in various fields (see Google Scholar).
newsmap has been available on CRAN since version 0.6. You can install the package using the RStudio GUI or the command:
```r
install.packages("newsmap")
```
If you want the latest version, install it by running these commands in R. You need to have devtools installed beforehand.
```r
install.packages("devtools")
devtools::install_github("koheiw/newsmap")
```
In this example, we use the text analysis package quanteda to preprocess the textual data and train a geographical classification model on a corpus of news summaries collected from Yahoo News via RSS in 2014.
```r
download.file('https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1', '~/yahoo-news.RDS')
```
```r
require(newsmap)
require(quanteda)

# Load data
dat <- readRDS('~/yahoo-news.RDS')
dat$text <- paste0(dat$head, ". ", dat$body)
dat$body <- NULL
corp <- corpus(dat, text_field = 'text')

# Custom stopwords
month <- c('January', 'February', 'March', 'April', 'May', 'June',
           'July', 'August', 'September', 'October', 'November', 'December')
day <- c('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday')
agency <- c('AP', 'AFP', 'Reuters')

# Select training period
sub_corp <- corpus_subset(corp, '2014-01-01' <= date & date <= '2014-12-31')

# Tokenize
toks <- tokens(sub_corp)
toks <- tokens_remove(toks, stopwords('english'), valuetype = 'fixed', padding = TRUE)
toks <- tokens_remove(toks, c(month, day, agency), valuetype = 'fixed', padding = TRUE)

# quanteda v1.5 introduced 'nested_scope' to reduce ambiguity in dictionary lookup
toks_label <- tokens_lookup(toks, data_dictionary_newsmap_en, levels = 3,
                            nested_scope = "dictionary")
dfmt_label <- dfm(toks_label)

dfmt_feat <- dfm(toks, tolower = FALSE)
dfmt_feat <- dfm_select(dfmt_feat, selection = "keep", '^[A-Z][A-Za-z1-2]+',
                        valuetype = 'regex', case_insensitive = FALSE) # include only proper nouns in the model
dfmt_feat <- dfm_trim(dfmt_feat, min_termfreq = 10)

model <- textmodel_newsmap(dfmt_feat, dfmt_label)

# Features with largest weights
coef(model, n = 7)[c("us", "gb", "fr", "br", "jp")]
```
```r
pred_data <- data.frame(text = as.character(sub_corp), country = predict(model))
knitr::kable(head(pred_data))
```
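To get a quick overview of the classification result, the predicted country labels can be tabulated. This is a minimal sketch that reuses the `pred_data` data frame created above; the number of classes shown (10) is arbitrary.

```r
# Count how many documents were assigned to each country
# and show the most frequent classes.
head(sort(table(pred_data$country), decreasing = TRUE), 10)
```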