This package primarily provides data that is closer to what many people would typically come across in the wild, or is simply more interesting or accessible (in my opinion), and more useful for instruction and workshops. Far too often examples use iris, mtcars, etc. for convenience, but these are actually inconvenient for demonstrating common data and modeling problems, or are too small to be realistic.
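To make the size point concrete, here is a small sketch using only the data sets bundled with base R. The missing-value check is my own choice of illustration for a "common data problem", not something the package prescribes.

```r
# Both data sets ship with base R (package 'datasets'), so nothing needs installing.
sizes <- rbind(
  iris   = dim(iris),
  mtcars = dim(mtcars)
)
colnames(sizes) <- c("rows", "columns")
sizes
# iris has 150 rows and 5 columns; mtcars has 32 rows and 11 columns

# Neither contains missing values, so they cannot be used to show
# missing-data handling without first altering them artificially.
anyNA(iris)    # FALSE
anyNA(mtcars)  # FALSE
```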
This package will provide larger and messier data. The bias is towards data that can be understood regardless of discipline or background. In addition, each data set should have at least several hundred observations, and often many more, but not so many that an analysis or data processing demonstration would take an inordinate amount of time. However, it should have relatively few columns (unless it is meant to demonstrate a 'large p' type of problem or analysis, e.g. penalized regression).
In general the goals are:
In most cases the data has been cleaned up to make it easier to use and understand.
Right now it has:
gapminder_2019: a 2019 pull from gapminder.org/data.
star_wars: several data sets based on the Star Wars API.
instructor_evaluations: a nice-sized data set for mixed/multi-level modeling taken from the lme4 package.
fish: Number of fish caught on camping trips.
pisa: OECD's Programme for International Student Assessment with international scores for math, science, and reading, covering years 2000-2015.
world_happiness: Multiyear data set with country-level scores of 'happiness'. From the 2019 World Happiness Report; includes data from 2005-2018.
sp500: Daily S&P 500 data for a 10-year period covering roughly 5 years before and after the Great Recession low.
wine_reviews, wine_quality: Two data sets regarding wine reviews that can be used for a wide range of standard statistical and machine learning techniques.
google_apps: Ratings and other information for Google Play Store apps.
fashion_train, fasion_test: The 'Fashion MNIST'. Image data for clothing items.
gender_gap, gender_gap_2018: Country level data regarding the World Bank Gender Gap Index.
kiva: Lending information from kiva.org online crowdfunding platform.
water_risk, water_risk_province: Country and province level data regarding water risk.
big_five: Big Five personality traits.
heart_disease: The UCI heart disease data.
retirement: Data on retirement plan participation rate of employees.
movielens: 1 million samples from MovieLens data.
This package is not on CRAN. To install:
devtools::install_github('m-clark/noiris')
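A slightly fuller sketch of the install step, followed by a first look at the data. It assumes that devtools may need to be installed from CRAN first, and that a data set is exported under the name shown in the list above (fish is used here purely as an example); check the listing from data() for the names your installed version actually provides.

```r
# Install devtools only if it is missing (assumption: installing from CRAN is acceptable)
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

devtools::install_github("m-clark/noiris")

# List the data sets the installed package actually ships
data(package = "noiris")

# Load and inspect one of them; 'fish' is assumed to match the name listed above
data("fish", package = "noiris")
str(fish)
```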
To do:
Note to self, see flexmix, poLCA, and other packages. Maybe add classic biochemists for another count data set. Article pub for link models and related.