wine_reviews: Wine Reviews data


Description

Two data sets regarding wine reviews.

Usage


Format

The first is wine_reviews, a CC0 release of data obtained from Kaggle. I combined the two data sets available there and removed duplicates. Not every row has Twitter information, but every row has values for most of the columns. The result is a data frame with more than 160 thousand rows and 13 columns. The reviews can serve as an example for text analysis, specifically sentiment analysis.

country

The country that the wine is from

description

The reviewer's description of the wine

designation

The vineyard within the winery where the grapes that made the wine are from

points

The number of points Wine Enthusiast awarded the wine, on a scale of 80-100

price

The cost for a bottle of the wine

province

The province or state that the wine is from

region_1

The wine growing area in a province or state (e.g. Napa)

region_2

Sometimes a more specific region is given within a wine growing area (e.g. Rutherford inside the Napa Valley)

taster_name

The reviewer's name

taster_twitter_handle

The reviewer's Twitter handle

title

The title of the wine review, which often contains the vintage if you're interested in extracting that feature.

variety

The type of grapes used to make the wine (e.g. Pinot Noir)

winery

The winery that made the wine
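The title column is described above as often containing the vintage. A minimal sketch of pulling it out with base R follows. The titles are made up for illustration, and the pattern (the first standalone four-digit number starting with 19 or 20) is an assumption that will misfire when a winery or wine name itself contains such a number.

```r
# Toy titles in the style of the `title` column (made up for illustration).
titles <- c(
  "Example Winery 2013 Reserve Pinot Noir (Willamette Valley)",
  "Another Estate NV Brut Sparkling (Champagne)",
  "Old Vines 1998 Tempranillo (Rioja)"
)

# Locate the first standalone four-digit year beginning with 19 or 20.
# regexpr() returns -1 for elements with no match.
hits <- regexpr("\\b(19|20)[0-9]{2}\\b", titles, perl = TRUE)

# Start with NA everywhere, then fill in the titles that matched.
vintage <- rep(NA_integer_, length(titles))
vintage[hits > 0] <- as.integer(regmatches(titles, hits))

vintage
```

Titles without a recognizable year, such as non-vintage wines, are left as NA rather than guessed.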

The second data set is wine_quality, obtained from the UCI repository, and the one that I use in my Introduction to Machine Learning document. It has nearly 6500 rows and 15 columns, mostly physicochemical properties of the wine. It can be used for standard regression on the quality score, or for classification of color or 'good' quality. However, more than 90% of the scores are 5-7, so it can also serve as an ordinal regression example with appropriate collapsing.

color

Labels are 'red' and 'white'

white

A binary based on color. White == 1.

fixed_acidity

tartaric acid - g / dm^3

volatile_acidity

acetic acid - g / dm^3

citric_acid

g / dm^3

residual_sugar

g / dm^3

chlorides

sodium chloride - g / dm^3

free_sulfur_dioxide

mg / dm^3

total_sulfur_dioxide

mg / dm^3

density

g / cm^3

pH

pH level

sulphates

potassium sulphate - g / dm^3

alcohol

% by volume

quality

Technically 0 (very bad) - 10 (excellent), but actual scores are from 3 to 9

good

A binary label based on quality. Scores of 6 or greater are 'Good'; all others are 'Bad'.
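The collapsing mentioned for wine_quality can be done in several ways. The sketch below shows one possibility in base R using made-up scores. The cut points (4 or less, 5, 6, 7 or more) are an assumption chosen for illustration, not the grouping used by the package author. It also rebuilds a binary label following the definition of good given above.

```r
# Toy quality scores (made up); the real column is described as ranging 3 to 9.
quality <- c(3, 5, 5, 6, 7, 8, 9, 4, 6, 5)

# Binary label following the `good` definition: 6 or greater is 'Good'.
good <- factor(ifelse(quality >= 6, "Good", "Bad"), levels = c("Bad", "Good"))

# One possible collapsing into an ordered factor (cut points are an assumption).
# Intervals are right-closed: (-Inf, 4], (4, 5], (5, 6], (6, Inf).
quality_ord <- cut(
  quality,
  breaks = c(-Inf, 4, 5, 6, Inf),
  labels = c("<=4", "5", "6", ">=7"),
  ordered_result = TRUE
)

# Cross-tabulate the collapsed scores against the binary label.
table(quality_ord, good)
```

An ordered factor like quality_ord is the kind of response expected by ordinal regression functions such as MASS::polr.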

References

wine_reviews: Kaggle

wine_quality: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. Data obtained from the UCI repository.

Examples


m-clark/noiris documentation built on Sept. 9, 2019, 9:08 a.m.