Two data sets regarding wine reviews.
The first is wine_reviews, a CC0 release of data obtained from
Kaggle. I combined the two data sets available there and removed
duplicates. Not every data point will have Twitter info, but all will have
most of the columns. The result is a data frame of more than 160 thousand
rows and 13 columns. The reviews can serve as an example for text
analysis, specifically sentiment analysis.
The country that the wine is from
The reviewer's description of the wine
The vineyard within the winery where the grapes that made the wine are from
The number of points Wine Enthusiast rated the wine on a scale of 80-100
The cost for a bottle of the wine
The province or state that the wine is from
The wine growing area in a province or state (e.g. Napa)
Sometimes there are more specific regions specified within a wine growing area (e.g. Rutherford inside the Napa Valley)
The reviewer's name
The reviewer's Twitter handle
The title of the wine review, which often contains the vintage if you're interested in extracting that feature.
The type of grapes used to make the wine (e.g. Pinot Noir)
The winery that made the wine
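Extracting the vintage from the review title can be done with a regular expression. The following is a minimal sketch in base R, not part of the package itself. It assumes a vintage is a standalone four-digit year between 1900 and 2099, and the example titles are made up for illustration rather than taken from the data.

```r
# Hypothetical titles in the style of a wine review title; not taken from the data
titles <- c(
  "Example Winery 2013 Reserve Pinot Noir (Napa Valley)",
  "Another Estate NV Brut Sparkling (Champagne)",
  "Old Cellar 1998 Estate Riesling (Mosel)"
)

# Assumption: a vintage is a standalone four-digit year from 1900 to 2099
pattern <- "\\b(19|20)[0-9]{2}\\b"

extract_vintage <- function(x) {
  # regexpr() gives the position of the first match, or -1 when there is none
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_integer_, length(x))
  # regmatches() returns only the matched substrings, in order
  out[m > 0] <- as.integer(regmatches(x, m))
  out
}

vintage <- extract_vintage(titles)
```

Titles without a year, such as non-vintage wines, come back as NA. Only the first year-like number is used, so a winery name that itself contains a year in that range would be picked up instead of the vintage.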
The second data set is wine_quality, obtained from the UCI repository,
and the one that I use in my Introduction
to Machine Learning document. It has nearly 6500 rows and 15 columns,
mostly with physicochemical qualities of the wine. It can be used for
standard regression using the quality score, or classification for color or
'good' quality. However, more than 90% of the scores are 5-7, so
it can also serve as an ordinal regression example with appropriate
collapsing.
Labels are 'red' and 'white'
A binary based on color. White == 1.
tartaric acid - g / dm^3
acetic acid - g / dm^3
g / dm^3
g / dm^3
sodium chloride - g / dm^3
mg / dm^3
mg / dm^3
g / cm^3
pH level
potassium sulphate - g / dm^3
% by volume
Technically 0 (very bad) - 10 (excellent), but actual scores are from 3 to 9
A binary based on quality. Scores of 6 or greater are 'Good'. Labels are 'Good' and 'Bad'
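The two binary columns and the collapsing of quality scores for ordinal regression can be sketched in base R. The column names `color` and `quality`, the toy values, and the three-level cut points are assumptions for illustration. Only the rules White == 1 and 'Good' for scores of 6 or greater come from the descriptions above.

```r
# Toy data standing in for wine_quality; values and column names are assumed
wine <- data.frame(
  color   = c("red", "white", "white", "red", "white"),
  quality = c(3, 5, 6, 7, 9)
)

# Binary indicator based on color: white == 1
wine$white <- as.integer(wine$color == "white")

# 'Good' is a quality score of 6 or greater
wine$good <- factor(ifelse(wine$quality >= 6, "Good", "Bad"),
                    levels = c("Bad", "Good"))

# One possible collapsing for ordinal regression: 5 or less, 6, 7 or more
wine$quality_ord <- cut(wine$quality,
                        breaks = c(-Inf, 5, 6, Inf),
                        labels = c("low", "medium", "high"),
                        ordered_result = TRUE)
```

Because most scores fall between 5 and 7, grouping the sparse tails into the neighboring categories keeps each level reasonably populated. Other cut points may suit a given analysis better.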
wine_reviews: Kaggle
wine_quality: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties. In
Decision Support Systems, Elsevier, 47(4):547-553, 2009.
link; UCI Link