# R/pendigits.R
# In hhcartr: HHCART(G) - A Reflected Feature Space for CART

#' Pen-Based Recognition of Handwritten Digits Dataset.
#'
#' We create a digit database by collecting 250 samples from each of 44 writers. The samples written
#' by 30 writers are used for training, cross-validation and writer dependent testing, and
#' the digits written by the other 14 are used for writer independent testing. This database
#' is also available in the UNIPEN format.
#'
#' We use a WACOM PL-100V pressure sensitive tablet with an integrated LCD display and a
#' cordless stylus. The input and display areas are located in the same place. Attached to
#' the serial port of an Intel 486 based PC, it allows us to collect handwriting samples.
#' The tablet sends \eqn{x} and \eqn{y} tablet coordinates and pressure level values of the pen at
#' fixed time intervals (sampling rate) of 100 milliseconds. These writers are asked to write
#' 250 digits in random order inside boxes of 500 by 500 tablet pixel resolution. Subjects are
#' monitored only during the first entry screens. Each screen contains five boxes with the
#' digits to be written displayed above. Subjects are told to write only inside these boxes.
#' If they make a mistake or are unhappy with their writing, they are instructed to clear the
#' content of a box by using an on-screen button. The first ten digits are ignored because
#' most writers are not familiar with this type of input device, but subjects are not aware
#' of this.
#'
#' In our study, we use only \eqn{(x, y)} coordinate information. The stylus pressure
#' level values are ignored. First we apply normalization to make our representation
#' invariant to translations and scale distortions. The raw data that we capture from the
#' tablet consist of integer values between 0 and 500 (tablet input box resolution). The new
#' coordinates are such that the coordinate which has the maximum range varies between 0 and 100.
#' Usually \eqn{x} stays within this range, since most characters are taller than they are wide
#' and \eqn{y} therefore has the larger range.
#'
#' In order to train and test our classifiers, we need to represent digits as constant length feature
#' vectors. A commonly used technique leading to good results is resampling the \eqn{(x_t, y_t)} points.
#' Temporal resampling (points regularly spaced in time) or spatial resampling (points regularly
#' spaced in arc length) can be used here. Raw point data are already regularly spaced in time but
#' the distance between them is variable. Previous research showed that spatial resampling to
#' obtain a constant number of regularly spaced points on the trajectory yields much better
#' performance, because it provides a better alignment between points. Our resampling algorithm
#' uses simple linear interpolation between pairs of points. The resampled digits are represented
#' as a sequence of \eqn{T} points \eqn{(x_t, y_t)_{t = 1}^T}, regularly spaced in arc length, as opposed to
#' the input sequence, which is regularly spaced in time. So, the input vector size is \eqn{2T}, two
#' times the number of points resampled. We considered spatial resampling to \eqn{T = 8, 12, 16} points in
#' our experiments and found that \eqn{T = 8} gave the best trade-off between accuracy and complexity.
#'
#' @docType data
#'
#' @usage data(pendigits)
#'
#' @format A data frame with 10,992 rows and 17 variables (16 input attributes and a class label):
#' \describe{
#'     \item{cols1-16}{All input attributes are integers in the range 0..100.}
#'     \item{class}{The last column is the class code, an integer in the range 0..9.}
#' }
#'
#' @keywords datasets pendigits
#'
#' @source \url{https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits}
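#'
#' @examples
#' # A minimal sketch of the normalization and spatial resampling described above,
#' # using base R only. This is not the original authors' code: the helper names
#' # normalize_stroke() and resample_stroke(), the shift to a zero minimum, the
#' # toy stroke, and the interleaved (x1, y1, x2, y2, ...) feature order are all
#' # assumptions made for illustration.
#'
#' # Translate the stroke to the origin and scale it so that the coordinate
#' # with the larger range spans 0..100 (aspect ratio is preserved).
#' normalize_stroke <- function(x, y) {
#'   span <- max(diff(range(x)), diff(range(y)))
#'   if (span == 0) span <- 1  # degenerate stroke: a single repeated point
#'   list(x = (x - min(x)) * 100 / span,
#'        y = (y - min(y)) * 100 / span)
#' }
#'
#' # Resample a stroke to n_points points regularly spaced in arc length,
#' # using linear interpolation between consecutive input points.
#' resample_stroke <- function(x, y, n_points = 8) {
#'   seg <- sqrt(diff(x)^2 + diff(y)^2)  # length of each segment
#'   s <- c(0, cumsum(seg))              # cumulative arc length at each point
#'   keep <- c(TRUE, seg > 0)            # drop repeated points so s strictly increases
#'   target <- seq(0, s[length(s)], length.out = n_points)
#'   list(x = approx(s[keep], x[keep], xout = target, rule = 2)$y,
#'        y = approx(s[keep], y[keep], xout = target, rule = 2)$y)
#' }
#'
#' # Toy stroke in tablet coordinates (0..500), unevenly spaced along the path.
#' raw_x <- c(100, 100, 100, 100, 150, 200)
#' raw_y <- c(400, 390, 350, 200, 200, 200)
#'
#' norm <- normalize_stroke(raw_x, raw_y)
#' rs <- resample_stroke(norm$x, norm$y, n_points = 8)
#'
#' # With T = 8 resampled points the feature vector has 2 * T = 16 entries.
#' features <- c(rbind(rs$x, rs$y))
#' length(features)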
"pendigits"