In emmal73/readUCI: Read Data from the UCI Database

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(readUCI)

Introduction to the readUCI package

The University of California Irvine Machine Learning Repository is a resource that provides access to hundreds of datasets. The challenge with using the repository is that understanding how to get the data into R isn't always intuitive and is inaccessible for beginners. readUCI helps make this process easier.

`read_UCI()`

The workhorse function of the package is read_UCI(). It builds off of the functionality of readr::read_delim() and readxl::read_excel() by pasting together a URL from the webpage and the data_file.

To know what to enter as arguments, look at the "Data Folder" on the page of the dataset you are interested in.

The webpage is the text (in quotes) that comes directly after "Index of /ml/machine-learning-databases/". It is also part of the URL for that page.
The data_file is the name of the dataset (also in quotes) you would like to import, also on that page of the "Data Folder".

The other arguments will change based on the structure and format of the data you are importing.

iris <- read_UCI("iris", "iris.data")

Troubleshooting

The tricky thing is that it can be hard to know what information you need to give read_UCI() to make sure your data is in the format you want. Here are some common things to look out for.

Delimiters

las_vegas <- read_UCI(webpage = "00397",
                      data = "LasVegasTripAdvisorReviews-Dataset.csv")
head(las_vegas)

If no delimiter is specified, read_UCI() defaults to a comma. However, in some cases this is not the delimiter used in the file. Here, las_vegas only has 1 column because read_UCI() was looking for commas, while the Las Vegas Strip dataset actually uses semicolons as a delimiter.

las_vegas2 <- read_UCI(webpage = "00397",
                       data = "LasVegasTripAdvisorReviews-Dataset.csv",
                       data_delim = ";")

head(las_vegas2)

You can see that now we have 20 variables because read_UCI() is using the proper delimiter. We can also see that the first row is the actual column names, so we will change the data_col_names argument to be TRUE, meaning the first row will be used as column names instead of as data.

las_vegas3 <- read_UCI(webpage = "00397",
                      data = "LasVegasTripAdvisorReviews-Dataset.csv",
                      data_delim = ";",
                      data_col_names = TRUE
                      )

head(las_vegas3)

Working with Excel files

Because of the way that readxl::read_excel() works, if ".xls" is detected as part of the data_file the data gets written to the disk. This means that if you know you are working with an Excel file, you will need to think about the overwrite argument. The default is FALSE, which protects against overwriting a file that you already have in your directory of the same name. If you get an error with an Excel file, it may be because you are trying to import a dataset of the same name as something in your directory. You can circumvent this error by setting overwrite to TRUE.

Sheets

breast_tissue <- read_UCI(webpage = "00192",
                          data = "BreastTissue.xls",
                          data_overwrite = TRUE) 
head(breast_tissue)

Clearly from what gets read in here is not actually the data. If you were to download the file and open it in Excel, you would see that there are actually two sheets in this file - "Description" and "Data".

If you are working with an Excel file and you get something that doesn't look like data, try a different sheet.

breast_tissue <- read_UCI(webpage = "00192",
                          data = "BreastTissue.xls",
                          data_overwrite = TRUE,
                          sheet = 2) 
head(breast_tissue)

Note that in the code above, data_overwrite = TRUE is required so we can overwrite the version of breast_tissue that is the description.

`preview_names()`

preview_names is a function that builds off of the capabilities of janitor::clean_names(). By default, it puts the column names into snake case, displays the names, and invisibly returns the data. To save this cleaned version of the names, assign the output of the function back into the dataset.

breast_tissue <- preview_names(breast_tissue)

The above displays the column names in snake case, and breast_tissue now uses those column names. Different cases supported by janitor::clean_names() can be specified as additional arguments to the function.

If you would like to not replace the column names but still assign the output, you can add overwrite = FALSE as an additional argument.

`UCI_datasets`

UCI_datasets is a table with all of the currently available datasets and a selection of their features. It is a resource to help you find data you are interested in.

emmal73/readUCI documentation built on Dec. 24, 2019, 1:29 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com