README.md
In emmal73/readUCI: Read Data from the UCI Database

readUCI

readUCI facilitates the process of importing data from the University of California Irvine Machine Learning Repository. These datasets are especially good for machine learning practice, and can be used to create reproducible code examples. As of December, 2019, there are 488 available datasets.

The dataset UCI_datasets has all of the available datasets from the repository and some characteristics, including data types and common tasks to perform with that data.

You can install the package through GitHub:

#devtools::install_github("emmal73/readUCI")

library(readUCI)

Common Use: `read_UCI`

Abalone

abalone <- read_UCI("abalone", "abalone.data")
head(abalone)
#> # A tibble: 6 x 9
#>   X1       X2    X3    X4    X5     X6     X7    X8    X9
#>   <chr> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl>
#> 1 M     0.455 0.365 0.095 0.514 0.224  0.101  0.15     15
#> 2 M     0.35  0.265 0.09  0.226 0.0995 0.0485 0.07      7
#> 3 F     0.53  0.42  0.135 0.677 0.256  0.142  0.21      9
#> 4 M     0.44  0.365 0.125 0.516 0.216  0.114  0.155    10
#> 5 I     0.33  0.255 0.08  0.205 0.0895 0.0395 0.055     7
#> 6 I     0.425 0.3   0.095 0.352 0.141  0.0775 0.12      8

The data that gets imported does not have variable names, as we can see by just calling preview_names.

We can add those manually, based on the information provided in the abalone.names file. This can also be found at the homepage for the Abalone.

abalone_names <- read_UCI("abalone", "abalone.names")
abalone_names[58:79,]
#> # A tibble: 22 x 1
#>    X1                                                                      
#>    <chr>                                                                   
#>  1 7. Attribute information:                                               
#>  2 "   Given is the attribute name"                                        
#>  3 "   brief description.  The number of rings is the value to predict: ei~
#>  4 "   as a continuous value or as a classification problem."              
#>  5 "\tName\t\tData Type\tMeas.\tDescription"                               
#>  6 "\t----\t\t---------\t-----\t-----------"                               
#>  7 "\tSex\t\tnominal\t\t\tM"                                               
#>  8 "\tLength\t\tcontinuous\tmm\tLongest shell measurement"                 
#>  9 "\tDiameter\tcontinuous\tmm\tperpendicular to length"                   
#> 10 "\tHeight\t\tcontinuous\tmm\twith meat in shell"                        
#> # ... with 12 more rows

Here we will manually add the names.

names(abalone) <- c("Sex", "Length", "Diameter", "Height", "Whole Weight", "Shucked Weight", "Viscera Weight", "Shell Weight", "Rings")

Next we can run preview_names to clean up the column names and display the names.

names(abalone) <- preview_names(abalone)
#> [1] "sex"            "length"         "diameter"       "height"        
#> [5] "whole_weight"   "shucked_weight" "viscera_weight" "shell_weight"  
#> [9] "rings"

For your convenience, 5 datasets from the UCI database are imported and cleaned with the package. These are:

adult: used to predict whether income is greater than $50K from census data
flags: contains details of countries and their flags
las_vegas: contains features of online reviews of 21 hotels in Las Vegas
tictactoe: used for binary classification, based on possible tic-tac-toe configurations
wine: used to determine origin of wines based on chemical analysis