download_kuzushiji_mnist: Download Kuzushiji-MNIST

View source: R/kuzushiji-mnist.R

download_kuzushiji_mnist    R Documentation

Download Kuzushiji-MNIST

Description

Download the Kuzushiji-MNIST database of images of cursive Japanese writing.

Usage

download_kuzushiji_mnist(base_url = kuzushiji_mnist_url, verbose = FALSE)

Arguments

base_url

Base URL where the files are located.

verbose

If TRUE, then download progress will be logged as a message.

Format

A data frame with 785 variables:

px1, px2, px3 ... px784

Integer pixel value, from 0 (white) to 255 (black).

Label

The character, represented by an integer in the range 0-9.

Pixels are organized row-wise. The Label variable is stored as a factor.
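
Because Label is a factor, as.integer() returns the internal level codes rather than the class numbers. The following is a minimal sketch on a small stand-in factor, assuming the levels are the strings "0" to "9"; the toy vector is made up for illustration and is not taken from the dataset.

```r
# Stand-in for kuzushiji$Label; assumes levels "0".."9" as described above.
label <- factor(c(0, 3, 9, 3), levels = 0:9)

# as.integer() on a factor gives the 1-based level codes, not the class values.
codes <- as.integer(label)

# Go through the level names to recover the original 0-9 values.
classes <- as.integer(as.character(label))
```

With the real data frame, the same conversion would apply to kuzushiji$Label, provided the levels are as assumed here.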

There are 70,000 items in the data set. The first 60,000 are the training set, as found in the train-images-idx3-ubyte.gz file. The remaining 10,000 are the test set, from the t10k-images-idx3-ubyte.gz file.

Items in the dataset can be visualized with the show_mnist_digit function.
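
To illustrate what the row-wise pixel layout means, here is a rough sketch that reshapes one item into a 28 x 28 matrix and draws it with base graphics. It uses a synthetic vector in place of a real row, and it is not a description of how show_mnist_digit works internally.

```r
# Synthetic stand-in for one row of the data frame: 784 pixel values
# named px1..px784, numbered here so positions are easy to check.
# For a real row, something like as.numeric(unlist(kuzushiji[i, 1:784]))
# would be needed first (an assumption about the data frame layout).
row_px <- setNames(0:783, paste0("px", 1:784))

# Row-wise layout: the first 28 values form the top row of the image.
img <- matrix(row_px, nrow = 28, ncol = 28, byrow = TRUE)

# image() draws matrix rows along the x axis with y increasing upward,
# so flip the rows and transpose to show the image upright.
# The palette runs from white (low values) to black (high values).
image(t(img[28:1, ]), col = gray(seq(1, 0, length.out = 256)), axes = FALSE)
```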

For more information see https://github.com/rois-codh/kmnist.

Details

Downloads the image and label files for the training and test datasets and converts them to a data frame. The dataset is intended to be a drop-in replacement for the MNIST digits dataset.
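
As a rough illustration of the conversion step, the sketch below reads an image file in the IDX layout used by the original MNIST files (a header of big-endian 32-bit integers followed by unsigned bytes). The helper read_idx_images is hypothetical and is not the package's actual implementation; that the Kuzushiji-MNIST files share this layout is an assumption based on their idx3-ubyte file names. The example writes a tiny synthetic file rather than downloading anything.

```r
# Hypothetical helper, not part of snedata: read an IDX3 image file
# (optionally gzip-compressed) into a matrix with one image per row.
read_idx_images <- function(path) {
  con <- gzfile(path, "rb")
  on.exit(close(con))
  # Header: four big-endian 32-bit integers
  # (magic number 2051, image count, rows, columns).
  header <- readBin(con, what = "integer", n = 4, size = 4, endian = "big")
  stopifnot(header[1] == 2051)
  n_px <- header[3] * header[4]
  # Pixels follow as unsigned bytes, image by image, row by row.
  px <- readBin(con, what = "integer", n = header[2] * n_px,
                size = 1, signed = FALSE)
  m <- matrix(px, nrow = header[2], byrow = TRUE)
  colnames(m) <- paste0("px", seq_len(n_px))
  m
}

# Write a tiny synthetic file (two 2x2 images) to exercise the reader.
tmp <- tempfile(fileext = ".gz")
out <- gzfile(tmp, "wb")
writeBin(c(2051L, 2L, 2L, 2L), out, size = 4, endian = "big")
writeBin(as.raw(c(0, 255, 1, 2, 3, 4, 5, 6)), out)
close(out)

toy <- as.data.frame(read_idx_images(tmp))
```

Label files in the same format would use a shorter header (magic number 2049 and an item count), so a matching reader would differ only in the header handling.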

Value

Data frame containing the Kuzushiji-MNIST dataset.

Note

Originally based on a function by Brendan O'Connor.

References

"KMNIST Dataset" (created by CODH), adapted from "Kuzushiji Dataset" (created by NIJL and others), doi:10.20676/00000341 https://github.com/rois-codh/kmnist

Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., & Ha, D. (2018). Deep Learning for Classical Japanese Literature. arXiv preprint arXiv:1812.01718.

Examples

## Not run: 
# download the data set
kuzushiji <- download_kuzushiji_mnist()

# first 60,000 instances are the training set
kuzushiji_train <- head(kuzushiji, 60000)
# the remaining 10,000 are the test set
kuzushiji_test <- tail(kuzushiji, 10000)

# PCA on 1000 examples
kuzushiji_r1000 <- kuzushiji[sample(nrow(kuzushiji), 1000), ]
pca <- prcomp(kuzushiji_r1000[, 1:784], retx = TRUE, rank. = 2)
# plot the scores of the first two components
plot(pca$x[, 1:2], type = "n")
text(pca$x[, 1:2],
  labels = kuzushiji_r1000$Label,
  col = rainbow(length(levels(kuzushiji_r1000$Label)))[kuzushiji_r1000$Label]
)

## End(Not run)

jlmelville/snedata documentation built on Jan. 13, 2024, 2:06 a.m.