# rdatasets: Datasets for R

#' arcene
#'
#' Binary response data set.
#'
#' @section UCI abstract: ARCENE's task is to distinguish cancer
#' versus normal patterns from
#' mass-spectrometric data. This is a two-class classification problem
#' with continuous input variables. This dataset is one of 5 datasets
#' of the NIPS 2003 feature selection challenge.
#'
#' @source https://statweb.stanford.edu/~tibs/strong/
#' @source https://archive.ics.uci.edu/ml/datasets/Arcene
"arcene"

#' golub
#'
#' Gene expression data set. Binary response.
#'
#' @source https://statweb.stanford.edu/~tibs/strong/
"golub"

#' dorothea
#'
#' Binary response data set.
#'
#' @section UCI abstract: DOROTHEA is a drug discovery dataset. Chemical
#' compounds represented
#' by structural molecular features must be classified as active (binding
#' to thrombin) or inactive. This is one of 5 datasets of the NIPS 2003
#' feature selection challenge.
#'
#' @source https://archive.ics.uci.edu/ml/datasets/Dorothea
#' @source https://statweb.stanford.edu/~tibs/strong/
"dorothea"

#' gisette
#'
#' Binary response data set.
#'
#' @section UCI abstract: GISETTE is a handwritten digit recognition problem.
#' The problem is to separate the highly confusable digits '4' and '9'. This
#' dataset is one of five datasets of the NIPS 2003 feature selection challenge.
#'
#' @source https://statweb.stanford.edu/~tibs/strong/
#' @source https://archive.ics.uci.edu/ml/datasets/Gisette
"gisette"

#' news20
#'
#' Multi-class training data set from the LIBSVM database.
#'
#' @source Ken Lang. Newsweeder: Learning to filter netnews.
#'   In Proceedings of the Twelfth International Conference on Machine
#'   Learning, pages 331-339, 1995.
#' @source https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html
"news20"

#' cpusmall
#'
#' Regression data set from the LIBSVM database.
#'
#' @source http://www.cs.toronto.edu/~delve/data/datasets.html
#' @source https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html
"cpusmall"

#' zipcode
#'
#' Multi-class training data set.
#'
#' @section From source: Normalized handwritten digits, automatically
#' scanned from envelopes by the U.S. Postal Service. The original
#' scanned digits are binary and of different sizes and orientations; the
#' images here have been deslanted and size normalized, resulting
#' in 16 x 16 grayscale images (Le Cun et al., 1990).
#'
#' @source https://web.stanford.edu/~hastie/ElemStatLearn/
#' @source https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html
"zipcode"

#' physician
#'
#' Count outcome data set related to the paper by Deb and Trivedi (1997). The
#' variable `ofp` (physician office visits) is used as the outcome here.
#'
#' @source Deb, Partha, and Pravin K. Trivedi. "Demand for Medical Care by
#'   the Elderly: A Finite Mixture Approach." Journal of Applied Econometrics,
#'   vol. 12, no. 3, 1997, pp. 313–336. JSTOR, www.jstor.org/stable/2285252.
#' @source https://www.jstatsoft.org/article/view/v027i08
"physician"

#' E2006-tfidf
#'
#' From the source: "10-K reports from thousands of publicly traded U.S. companies,
#' published in 1996–2006 and stock return volatility measurements in the twelve-month
#' period before and the twelve-month period after each report, where available."
#'
#' @source Shimon Kogan, Dimitry Levin, Bryan R. Routledge, Jacob S. Sagi, and Noah A.
#' Smith. "Predicting risk from financial reports with regression". In Proceedings of the
#' North American Association for Computational Linguistics Human Language Technologies
#' Conference, pages 272-280, 2009.
#' @source http://www.cs.cmu.edu/~ark/10K/
#' @source https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf
#' @source https://homes.cs.washington.edu/~nasmith/
"e2006"

#' newsgroup
#'
#' Newsgroup text data set based on the NewsWeeder collection (Lang, 1995).
#'
#' @source http://statweb.stanford.edu/~tibs/strong/
#' @source Lang, K. (1995). NewsWeeder: Learning to Filter Netnews. In A. Prieditis & S.
#'   Russell (Eds.), Machine Learning Proceedings 1995 (pp. 331–339). Morgan Kaufmann.
#'   <https://doi.org/10.1016/B978-1-55860-377-6.50048-7>
"newsgroup"

#' sensorless
#'
#' Multiclass classification data set for sensorless drive diagnosis.
#'
#' From the source: "Features are extracted from electric
#' current drive signals.
#' The drive has intact and defective components. This results in 11 different classes
#' with different conditions. Each condition has been measured several times by 12
#' different operating conditions, this means by different speeds, load moments and
#' load forces. The current signals are measured with a current probe and an oscilloscope
#' on two phases."
#'
#' @source https://archive.ics.uci.edu/ml/datasets/Dataset+for+Sensorless+Drive+Diagnosis
#' @source Chien-Chih Wang, Kent-Loong Tan, Chun-Ting Chen, Yu-Hsiang Lin, S. Sathiya
#'   Keerthi, Dhruv Mahajan, Sellamanickam Sundararajan, and Chih-Jen Lin. Distributed Newton
#'   methods for deep learning. Neural Computation, 30(6):1673-1724, 2018.
"sensorless"