R/datasets.R

#' Wine quality data set
#'
#' @description Two datasets with chemical properties of red and white vinho verde wine samples, from the north of Portugal.
#'
#' @format A list with two data frames for red and white wine.
#'
#' @source \url{https://archive.ics.uci.edu/ml/datasets/wine+quality}
#'
#' @examples
#'
#' wine_quality$red
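#' # A minimal sketch: inspect the list without assuming element names
#' # beyond `red`, which is used above.
#' names(wine_quality)
#' sapply(wine_quality, dim)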
#'
"wine_quality"

#' Data for insurance premium prediction
#'
#' @description A dataset to predict the insurance premium (charges).
#'
#' @format A data frame with 1338 observations (rows) and 7 features (columns). The dataset contains four numerical features (age, bmi, children and charges) and three nominal features (sex, smoker and region) converted into factors, with a numerical value designated for each level. The variable to predict is charges.
#'
#' @source \url{https://www.kaggle.com/noordeen/insurance-premium-prediction}. The original source is the Machine Learning course website (Spring 2017) of Professor Eric Suess.
#'
#' @examples
#'
#' insurance_charges
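#' # A minimal sketch: tabulate the column classes to check them against the
#' # format description (numerical features versus factors).
#' table(sapply(insurance_charges, function(x) class(x)[1]))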
#'
"insurance_charges"

#' default of credit card clients
#'
#' @description A dataset of default payments in Taiwan, used in Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.
#'
#' @format A data frame with 30000 observations. I have relabelled the variables according to their meaning and the definitions provided in the paper. Readers should examine them beforehand, as some inconsistencies may occur.
#'
#' \describe{
#' \item{limit_bal}{Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.}
#' \item{sex}{Gender, encoded as a factor with levels: male, female.}
#' \item{education}{Education, a factor with levels: unknown, graduate, university, high_school, others.}
#' \item{marriage}{Marital status, a factor with levels: unknown, married, single.}
#' \item{age}{Age, in years.}
#' \item{pay_sep-pay_apr}{History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: pay_sep = the repayment status in September, 2005; pay_aug = the repayment status in August, 2005; ...; pay_apr = the repayment status in April, 2005. The measurement scale for the repayment status is: -2 = no consumption; -1 = paid in full; 0 = use of revolving credit; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.}
#' \item{bill_amt_sep-bill_amt_apr}{Amount of bill statement (NT dollar). bill_amt_sep = amount of bill statement in September, 2005; bill_amt_aug = amount of bill statement in August, 2005; ...; bill_amt_apr = amount of bill statement in April, 2005.}
#' \item{pay_amt_sep-pay_amt_apr}{Amount of previous payment (NT dollar). pay_amt_sep = amount paid in September, 2005; pay_amt_aug = amount paid in August, 2005; ...; pay_amt_apr = amount paid in April, 2005.}
#' \item{default}{The target variable, indicating if the customer has defaulted or not, a factor with levels yes, no.}
#' }
#'
#' @source \url{https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#}
#'
#' @examples
#'
#' cc_defaults
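#' # A minimal sketch: share of defaults within each education level, using
#' # the `education` and `default` columns described above.
#' prop.table(table(cc_defaults$education, cc_defaults$default), margin = 1)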
#'
"cc_defaults"

#' A sample of the Titanic dataset for rule mining
#'
#' @format A data frame with 2201 observations and four variables.
#'
#' \describe{
#' \item{Class}{Class where the passenger was travelling (1st, 2nd, 3rd, Crew).}
#' \item{Sex}{Passenger's gender (Female, Male).}
#' \item{Age}{Is the passenger an adult or a child? (Adult, Child).}
#' \item{Survived}{Has the passenger survived? (Yes, No).}
#' }
#'
#' @source \url{http://www.rdatamining.com/data}
#'
#' @examples
#'
#' titanic_raw
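#' # A minimal sketch: cross-tabulate two of the documented variables to get
#' # a first view of the categories before mining rules.
#' table(titanic_raw$Class, titanic_raw$Sex)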
#'
"titanic_raw"

#' A sample of epub transactions for rule mining
#'
#' @description The epub dataset is a transformation of the Epub transaction matrix of the arules package into a data frame. The dataset contains the download history of documents from the electronic publication platform of the Vienna University of Economics and Business Administration. The data was recorded between Jan 2003 and Dec 2008.
#'
#' @format A data frame with 25893 observations and three variables.
#'
#' \describe{
#' \item{transaction_id}{The id of the transaction. A transaction appears in as many rows as epubs were acquired in it.}
#' \item{time_stamp}{Time stamp of the transaction in POSIXct format.}
#' \item{book_code}{The code of the epub.}
#' }
#'
#' @source Original data provided by Michael Hahsler from ePub-WU at \url{https://epub.wu-wien.ac.at}.
#'
#' @examples
#'
#' epub
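#' # A minimal sketch: number of epubs per transaction, relying on the fact
#' # that each transaction spans one row per epub acquired.
#' items_per_transaction <- table(epub$transaction_id)
#' summary(as.vector(items_per_transaction))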
#'
"epub"

#' A dataset of online retail transactions
#'
#' @description A transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 (dd/mm/yyyy) for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
#'
#' @format A data frame with 541909 observations and eight variables.
#'
#' \describe{
#' \item{InvoiceNo}{Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.}
#' \item{StockCode}{Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.}
#' \item{Description}{Product (item) name. Nominal.}
#' \item{Quantity}{The quantities of each product (item) per transaction. Numeric.}
#' \item{InvoiceDate}{Invoice date and time. Numeric, the day and time when each transaction was generated, stored in POSIXct format.}
#' \item{UnitPrice}{Unit price. Numeric, Product price per unit in sterling.}
#' \item{CustomerID}{Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.}
#' \item{Country}{Country name. Nominal, the name of the country where each customer resides.}
#' }
#'
#' @source UCI Machine Learning Repository \url{https://archive.ics.uci.edu/ml/datasets/online+retail}
#'
#' @examples
#'
#' online_retail
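#' # A minimal sketch: flag cancellations, i.e. invoices whose number starts
#' # with the letter 'c'. Matching case-insensitively is a precaution, not
#' # something the documentation above states.
#' cancelled <- grepl("^c", online_retail$InvoiceNo, ignore.case = TRUE)
#' table(cancelled)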
#'
"online_retail"

#' Another dataset of online retail transactions
#'
#' @description A transnational data set which contains all the transactions occurring between 01/12/2009 and 09/12/2010 (dd/mm/yyyy) for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
#'
#' @format A data frame with 525461 observations and eight variables.
#'
#' \describe{
#' \item{InvoiceNo}{Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.}
#' \item{StockCode}{Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.}
#' \item{Description}{Product (item) name. Nominal.}
#' \item{Quantity}{The quantities of each product (item) per transaction. Numeric.}
#' \item{InvoiceDate}{Invoice date and time. Numeric, the day and time when each transaction was generated, stored in POSIXct format.}
#' \item{UnitPrice}{Unit price. Numeric, Product price per unit in sterling.}
#' \item{CustomerID}{Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.}
#' \item{Country}{Country name. Nominal, the name of the country where each customer resides.}
#' }
#'
#' @source UCI Machine Learning Repository \url{https://archive.ics.uci.edu/ml/datasets/Online+Retail+II}
#'
#' @examples
#'
#' online_retail2
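#' # A minimal sketch: check the period covered by the data, assuming the
#' # column is named `InvoiceDate` as documented above.
#' range(online_retail2$InvoiceDate)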
#'
"online_retail2"

#' A sample dataset for hierarchical linear regression
#'
#' @description A (probably artificial) dataset to introduce hierarchical linear regression models.
#'
#' @format A data frame with 100 observations and five variables.
#'
#' \describe{
#'
#' \item{happiness}{The criterion or dependent variable.}
#' \item{age}{A control variable.}
#' \item{gender}{Another control variable, encoded as character with "Male" and "Female" values.}
#' \item{friends}{A predictor variable.}
#' \item{pets}{A predictor variable.}
#' }
#'
#' @source Hierarchical linear regression (University of Virginia Library) \url{https://data.library.virginia.edu/hierarchical-linear-regression/}
#'
#' @examples
#'
#' hlr
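#' # A minimal sketch of a hierarchical (sequential) regression. The order of
#' # entry (controls first, then one predictor at a time) is an assumption.
#' m1 <- lm(happiness ~ age + gender, data = hlr)
#' m2 <- update(m1, . ~ . + friends)
#' m3 <- update(m2, . ~ . + pets)
#' # Compare the nested models with sequential F tests.
#' anova(m1, m2, m3)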
#'
'hlr'

#' OpenFlights airport data
#'
#' @description Data on the airports included in the OpenFlights data set. Last updated in January 2017.
#'
#' \describe{
#' \item{name}{Name of airport.}
#' \item{city}{Main city served by airport.}
#' \item{country}{Country or territory where airport is located, in ISO 3166-1 format.}
#' \item{IATA}{3-letter IATA code.}
#' \item{ICAO}{4-letter ICAO code.}
#' \item{lat}{Decimal degrees, usually to six significant digits. Negative is South, positive is North.}
#' \item{lon}{Decimal degrees, usually to six significant digits. Negative is West, positive is East.}
#' \item{alt}{Altitude in feet.}
#' \item{timezone}{Hours offset from UTC. Fractional hours are expressed as decimals, e.g. India is 5.5.}
#' \item{DST}{Daylight savings time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown).}
#' \item{tz}{Timezone in "tz" (Olson) format, e.g. "America/Los_Angeles".}
#' }
#'
#' @source \url{https://openflights.org/data.html}
#'
#' @examples
#'
#' of_airports
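#' # A minimal sketch: the ten countries with the most airports, using the
#' # `country` column described above.
#' head(sort(table(of_airports$country), decreasing = TRUE), 10)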
#'
'of_airports'

#' OpenFlights route database
#'
#' @description The set of OpenFlights routes. The third party that OpenFlights uses for route data ceased providing updates in June 2014, so the current data is of historical value only.
#'
#' \describe{
#' \item{airline}{2-letter (IATA) or 3-letter (ICAO) code of the airline.}
#' \item{org}{3-letter (IATA) or 4-letter (ICAO) code of the source (origin) airport.}
#' \item{dst}{3-letter (IATA) or 4-letter (ICAO) code of the destination airport.}
#' \item{codeshare}{"Y" if this flight is a codeshare (that is, not operated by Airline, but another carrier), "N" otherwise.}
#' \item{stops}{Number of stops on this flight ("0" for direct).}
#' \item{equipment}{3-letter codes for plane type(s) generally used on this flight, separated by spaces.}
#' }
#'
#' @source \url{https://openflights.org/data.html}
#'
#' @examples
#'
#' of_routes
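#' # A minimal sketch: attach origin coordinates to each route by matching
#' # `org` against the IATA codes in `of_airports`. Routes whose `org` is an
#' # ICAO code, or has no match, get NA coordinates.
#' idx <- match(of_routes$org, of_airports$IATA)
#' head(data.frame(
#'   of_routes[c("org", "dst")],
#'   org_lat = of_airports$lat[idx],
#'   org_lon = of_airports$lon[idx]
#' ))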
#'
'of_routes'

#' adult dataset
#'
#' @description Predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset. The `adult_test` dataset is a test set for those data. There are missing values encoded as `?` in the `workclass` and `occupation` variables.
#'
#' @source <https://archive.ics.uci.edu/dataset/2/adult>
#'
#' @examples
#'
#' adult
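#' # A minimal sketch: recode the `?` placeholders as NA in a copy of the
#' # data. Trimming whitespace is a precaution, and the recoded columns are
#' # returned as factors whatever their original type.
#' adult_clean <- adult
#' for (v in c("workclass", "occupation")) {
#'   x <- as.character(adult_clean[[v]])
#'   x[which(trimws(x) == "?")] <- NA
#'   adult_clean[[v]] <- factor(x)
#' }
#' colSums(is.na(adult_clean[c("workclass", "occupation")]))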
#'
'adult'

#' adult test dataset
#'
#' @description Predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset. The `adult` dataset is the training set for those data. There are missing values encoded as `?` in the `workclass` and `occupation` variables.
#'
#' @source <https://archive.ics.uci.edu/dataset/2/adult>
#'
#' @examples
#'
#' adult_test
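#' # A minimal sketch: check that the test set has the same columns, in the
#' # same order, as the `adult` training set.
#' identical(names(adult), names(adult_test))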
#'
'adult_test'

#' breast cancer dataset
#'
#' @description This breast cancer domain was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. This is one of three domains provided by the Oncology Institute that has repeatedly appeared in the machine learning literature. There are missing values encoded as `?` in the `node_caps` and `breast_quad` variables.
#'
#' @source <https://archive.ics.uci.edu/dataset/14/breast+cancer>
#'
#' @examples
#'
#' breast_cancer
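#' # A minimal sketch: count the `?` placeholders in the two variables that
#' # the description reports as having missing values.
#' sapply(
#'   breast_cancer[c("node_caps", "breast_quad")],
#'   function(x) sum(x == "?", na.rm = TRUE)
#' )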
#'
'breast_cancer'
jmsallan/BAdatasets documentation built on May 7, 2024, 11:45 a.m.