In emeyers/SDS230: Tools for the class Data Exploration and Analysis

$\$

Welcome to the first homework assignment!

$\$

The purpose of this homework is to gain experience using R and R Markdown, to review concepts from Introductory Statistics, to practice analyzing and plotting data in R. Please fill in the appropriate R code and write answers to all questions in the answer section. Once you have completed the assignment, submit a compiled pdf with your answers to Gradescope by 11pm on Sunday September 8th.

If you need help with the homework, please attend the TA office hours which are listed on Canvas and/or ask questions on Ed Discussions. Also, if you have completed the homework, please help others out by answering questions on Ed Discussions, which will count toward your class participation grade.

With this and all homework assignments, please be sure to knit your document often. This will help catch any mistakes right away so you will know where the mistake was made. If you don't knit often, it will be much harder to find any knitting errors and fix them!

library(knitr)
opts_chunk$set(tidy.opts=list(width.cutoff=60)) 
set.seed(123)  # set the random number generator to always give the same random numbers

download.file("https://yale.box.com/shared/static/1syezdzked6v2l3aqqpffmifww0468ez.rda", "misc_data.Rda")

download.file("https://raw.githubusercontent.com/emeyers/SDS230/master/ClassMaterial/images/metal.jpg", "metal.jpg", mode = "wb")

download.file("https://raw.githubusercontent.com/emeyers/SDS230/master/ClassMaterial/data/animals.rda", "animals.rda", mode = "wb")

download.file("https://raw.githubusercontent.com/emeyers/SDS230/master/ClassMaterial/data/metal_bands_single_country.rda", "metal_bands_single_country.rda", mode = "wb")

download.file("https://raw.githubusercontent.com/emeyers/SDS230/master/ClassMaterial/data/hiphop_vocab.rda", "hiphop_vocab.rda", mode = "wb")

$\$

Part 1: R Markdown practice

As we have discussed in class, R Markdown is a great way to create reproducible data analyses. To gain practice using R Markdown, all homework problem sets and your final project will be done in R Markdown.

R Markdown has a number of features that allow the text in your written reports to have better formatting. A cheatsheet for R Markdown formatting can be found here. When answering the questions be sure to knit your R Markdown document very often to catch errors as soon as they are made.

$\$

Part 1.1 (5 points): R Markdown fomatting

Please modify the lines of text below to change their formatting as described:

Make this line bold

Make this line italics

Make this line a second level header

Make this line a bullet point

Make this text a hyperlink to yale.edu

$\$

Part 1.2 (5 points): LaTeX symbols

Please use LaTeX to write Plato's name in Greek below. An app to find LaTeX characters is here available at http://detexify.kirelabs.org/classify.html, or you can use Google. The first character of his name is given below so please start from there.

Note: make sure the ending dollar sign touches the last letter otherwise you will get an error when knitting.

$\Pi$ ...

$\$

Part 2: Using R within R Markdown documents

Let's now practice doing some basic computations in R. As described in class, all code in the R Markdown chunks is executed and the results are shown in the compiled output document (i.e., the code and output will be shown in the compiled pdf document).

$\$

Part 2.1 (5 points): Number journey

Please complete the following steps in the R Markdown chunk below:

Create an object called a and assign the value of 3 to it.
Create an object called x and assign the value 5 to it.
Create an object called k and assign the value 50 to it.
Create an object called y which has the value: $y = a \cdot 10^{x} + k$.
Print the value of y in the R chunk below.

In the answer section below, please write a couple of sentences on whether the object names a, x, k and y are good names to use and explain your reasoning.

Answer: [Describe whether the names a, x, k and y are good object names to use].

$\$

Part 2.2 (5 points): Summing elements in a vector

In the R chunk below, please create a vector called my_vec that has consecutive integers from 1 to 50; i.e., a vector of length 50 that has the numbers 1, 2, 3, ..., 50 (hint: use a colon to create this vector rather than the c() function). Then add all the numbers in this vector together and print the result. Finally, check that you have the right answer using Gauss' formula for the sum of consecutive integers which is $S = \frac{n (n + 1)}{2}$, where n = 50 here.

# Sum integers from 1 to 50 directly






# Use Gauss' formula

$\$

Part 3: Descriptive statistics and plots for quantitative data

In the following exercises, you will create and compare a few plots of quantitative data. Please answer each question, and if you notice any outliers in your data please address them appropriately. Also, be sure to put meaningful labels on your axes and add titles on all plots.

$\$

Part 3.1 (10 points)

The code chunk below loads four vector objects named x1, x2, x3, and x4. Create a side-by-side boxplot that compares these four vectors. Also create a histogram for each of these vectors (4 histograms total). Describe below whether the box plots or histograms are more informative for plotting this data and why.

Note: since this is synthetic data, for this problem you do not need to label your axes but rather you can just use the default labeling.

load("misc_data.Rda")

Answer: [Describe whether boxplots or histograms are more informative here]

$\$

Part 3.2: (12 points)

The R chunk below loads a data frame called animals which has information on 96 animals. The variables in this data frame are:

Species: name of the species
Brain: brain weight in grams
Body: body weight in kilograms
Gestation: gestation period in days
Litter: litter size

Please create a vector object that is called brain_weight, that has just the weights of the animals' brains. Then create a histogram and a boxplot of these brain weights using this vector object. In the answer section below, describe the shape of the distribution and investigate any outliers in the data; i.e., what is causing these outliers and are they surprising? Also describe the advantages that each type of plot has for showing features in the data.

Hint: for this problem you can inspect the data using RStudio's visual displays of data frames. In the future, it will be good to answer all questions by writing code that can show how you came to your conclusions.

load("animals.rda")

# If you would like examine the data set you can type View(animals) on the
# console. However, do not include the View() function in your RMarkdown
# document since it will make the document fail to knit.


# Continue with your code below...

Answer: [Describe advantages of boxplots and histograms for this data and investigate any unusual features of the data. 1-2 paragraphs with ~4-8 sentences total should suffice].

$\$

Part 3.3: (10 points)

Now create a scatter plot of the animals' brain weight as a function of their body weight. Describe what the results show, if the results are what you expected, and any limitations of the plot.

Answer:

$\$

Part 3.4: (10 points)

Now let's create a data frame called lighter_animals that only has animals with body weights of less than 1,000 killograms. Hint: to do this first create a vector of Booleans called lighter_inds that has values of TRUE for animals that weigh less than 1,000 killograms and values of FALSE for animals that weigh 1,000 or more killograms. Then use the lighter_inds vector to create the lighter_animals data frame.

Once you have the lighter_animals data frame recreate the plot of brain weight as a function of body weight. Then describe whether this new plot more clearly shows the relationship between brain and body weight.

Answer:

$\$

Part 4: Descriptive statistics and plots for categorical data

Heavy metal (or simply metal) is a genre of rock music that developed in the late 1960s and early 1970s that is characterized by distortion, extended guitar solos, emphatic beats, and loudness, with lyrics and performances that are sometimes associated with aggression and machismo (see Wikipedia). To practice exploring and visualizing categorical data, let's examine which countries most heavy metal bands come from.

A data file with a list of heavy metal bands was posted on Kaggle, and the code below loads a modified version of this data into an R data frame called metal (in particular, to make our analyses easier, the modified data excludes any metal bands that originate from multiple countries, so based your analyses/answers on this data). According to the Kaggle website, the variables in this data frame are:

band_name: band name
fans: how many fans the band has on the website
formed: when the band formed
origin: the country of origin on the band
split: when the band split
style: the styles of the band

Please use the data loaded below for the following exercises.

load("metal_bands_single_country.rda")

# If you would like examine the data set you can type View(metal) on the
# console. However, do not include the View() function in your RMarkdown
# document since it will make the document fail to knit.

$\$

Part 4.1: (10 points)

Use the table() function to create an object called metal_counts that has the counts of how many metal bands come from each country. What country did the most metal bands come from and what proportion of bands came from that country?

Also create a bar plot and pie chart showing the counts of how many bands come from each country. How do these plots look? How could you make them better?

Answers:

$\$

Part 4.2: (10 points)

Recreate your bar plot and pie chart to only show countries that have at least 150 bands originating from them (reviewing your answer to part 3.4 could be helpful). Also, see if you can create a better color scheme for the pie chart. Hint: To figure out how to change the color of the pie chart, using ? pie could be helpful. You might find it useful to look at this list of color names.

Bonus (0 points): see if you can create a version of the bar chart plotted horizontally where you can see all the country names.

$\$

Part 4.3: (5 points)

The plots in 4.2 are a bit misleading because they ignore all bands that come from many countries where only a few metal bands originated. The pie chart is particularly misleading because it gives the sense that particular countries have a high proportion of bands originating from them when this proportion is only out of the subset of all the countries in the original data set. A way to address this is to create a category called "other countries" that has the total number of bands from all countries that were left out in the original plot (i.e., the total number of bands from countries with less than 150 bands originating from them).

In the R chunk below, create a pie chart that has the category "other countries" which lists the total number of bands that originate from countries that were not included in figures you created in Part 4.2.

Hints:

If you have a vector (or table) called my_vec, you can concatenate the value 27 on to the vector my_vec using the syntax my_vec <- c(my_vec, 27).
You can get the names of a named vector my_vec using the function names(my_vec) which returns a vector of strings. You can also assign names to a vector my_vec using the function names(my_vec) <- vec_of_names, where vec_of_names is a vector of character strings.

$\$

Bonus problem (0 points)

If you prefer hip hop music over metal, another data set is loaded below that contains information on the number of unique words that different hip hop artists have used. The data comes from data world. The variables in this data frame are:

words: number of unique words in lyrical corpus
era: decade when first 35,000 lyrical words were released
rapper: rapper name
id dash separated name, unique for each artist

If you feel inspired, please play around with this data set and you can share any interesting findings or visualizations below. This question, however, is optional and worth no points toward your homework score.

For more interesting visualizations of this and related data see this pudding article

load("hiphop_vocab.rda")

$\$

Reflection (3 points)

Please reflect on how the homework went by going to Canvas, going to the Quizzes link, and clicking on Reflection on homework 1

$\$

metal

emeyers/SDS230 documentation built on Feb. 6, 2025, 4:55 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

emeyers/SDS230
Tools for the class Data Exploration and Analysis

In emeyers/SDS230: Tools for the class Data Exploration and Analysis

Welcome to the first homework assignment!

Part 1: R Markdown practice

Part 1.1 (5 points): R Markdown fomatting

Part 1.2 (5 points): LaTeX symbols

Part 2: Using R within R Markdown documents

Part 2.1 (5 points): Number journey

Part 2.2 (5 points): Summing elements in a vector

Part 3: Descriptive statistics and plots for quantitative data

Part 3.1 (10 points)

Part 3.2: (12 points)

Part 3.3: (10 points)

Part 3.4: (10 points)

Part 4: Descriptive statistics and plots for categorical data

Part 4.1: (10 points)

Part 4.2: (10 points)

Part 4.3: (5 points)

Bonus problem (0 points)

Reflection (3 points)

R Package Documentation

Browse R Packages

We want your feedback!

emeyers/SDS230 Tools for the class Data Exploration and Analysis

In emeyers/SDS230: Tools for the class Data Exploration and Analysis

Welcome to the first homework assignment!

Part 1: R Markdown practice

Part 1.1 (5 points): R Markdown fomatting

Part 1.2 (5 points): LaTeX symbols

Part 2: Using R within R Markdown documents

Part 2.1 (5 points): Number journey

Part 2.2 (5 points): Summing elements in a vector

Part 3: Descriptive statistics and plots for quantitative data

Part 3.1 (10 points)

Part 3.2: (12 points)

Part 3.3: (10 points)

Part 3.4: (10 points)

Part 4: Descriptive statistics and plots for categorical data

Part 4.1: (10 points)

Part 4.2: (10 points)

Part 4.3: (5 points)

Bonus problem (0 points)

Reflection (3 points)

R Package Documentation

Browse R Packages

We want your feedback!

emeyers/SDS230
Tools for the class Data Exploration and Analysis