library(ggplot2)
library(umdpsyc)
library(learnr)
library(gradethis)
data(mtcars)
data("exam_scores")
knitr::opts_chunk$set(exercise.checker = gradethis::grade_learnr)
knitr::opts_chunk$set(echo = FALSE)
# learnr::tutorial_options(exercise.timelimit = 60)

Descriptive Statistics

One of the first steps that you might take once you have some data in R is to look at descriptive statistics. Remember, the types of descriptive statistics that you use will be different depending on whether it is categorical or numerical data. We will look at numerical descriptive statistics and graphs here; check out the categorical-descriptives tutorial in order to learn about how to do descriptive statistics and graphs for categorical data.

Example Dataset: Exam Scores

In the examples for this tutorial, we will use the exam scores dataset.

head(exam_scores)

This is a fictional dataset, created for the purposes of the tutorials provided in the umdpsyc package, but are based on real studies and real results. It contains data about exam scores as well as variables that might be associated with exam scores, such as anxiety levels, method of note-taking, and amount of sleep that the student got before taking the exam.

Measures of Center and Spread

Many of the functions for getting numerical summaries of center and spread are intuitive in R.

|Function|Type of Plot| |---|---| |mean(x)|Finds the mean of a vector x| | median(x) | Finds the median of a vector x. | sd(x) | Finds the standard deviation of a vector x | | variance(x) | Finds the variance of a vector x | | IQR(x) | Finds the IQR of a vector x| | summary(x) | Finds the min, 1st quartile, median, mean, 3rd quartile and max of a vector x, or of each column in a data frame x|

In general, these are used on a vector, rather than a data frame, so you'll need to be careful to specify the correct column within the data frame. Remember that you can use $ notation to extract a specific column from the data frame.

mean(exam_scores$exam1)
sd(exam_scores$exam1)

Practice Problem

How would you find the median of the exam 2 scores?


grade_result(
  pass_if(~ identical(.result, median(exam_scores$exam2)), "Good Job!"),
  fail_if(~ TRUE)
)

Plotting with base R

Visualizing Numerical Variables

The basic functions for plotting numerical variables in R are shown in the table below.

|Function|Type of Plot| |---|---| |hist(x)|histogram| | boxplot(x) | boxplot | | plot(x,y) | scatterplot |

For example, to get a simple histogram, we can use the following command.

hist(exam_scores$exam1) 

We can make adjustments to this graph by adding additional arguments to the hist function. Arguments are the inputs that we provide a function. Above, we only provided one argument: the data to be plotted. We can also specify that we want a different number of bins, for example.

hist(exam_scores$exam1, breaks = 20) # The "breaks = 20" argument sets the number of bins

Boxplots

We can make boxplots using the boxplots function.

boxplot(exam_scores$exam1)

Customizing Plots

Notice that these are very bare-bones plots, without titles or axes labels in some cases. Let's add a simple title to the boxplot above. We can do this using the main argument.

boxplot(exam_scores$exam1, main = "Scores on Exam 1")

Note that the quotations around the text you want to add as the title are very crucial! The main argument takes in a string as the input, so you need to provide it a string variable.

Plotting with qplot

As we mentioned before in the intro-to-r tutorial, there are external packages that contain various functions, data, and other tools to help us do more with R. One of these is the ggplot2 package, which contains many functions for generating visually pleasing graphs.

# Run install.packages if you do not have it installed already.
# install.packages('ggplot2')
library(ggplot2)

The ggplot2 package is incredibly rich and highly customizable for an experienced user. There is an entire book you can read through on using ggplot2 to create graphics, and the code can get quite complicated. We won't go over all of that in this short introduction to producing graphs. Rather, in this tutorial, we'll focus on one function, qplot, which is a quick way of getting nice looking graphs without doing much work.

Using qplot

The idea behind qplot is that it can be used to generate graphs very quickly using just one function. To do this, it tries to detect the type of data being plotted, and generates an appropriate graph based on this. Let's try using qplot with exam scores.

qplot(exam_scores$exam1)

Instead of using breaks as an argument to specify the number of bins we want for the histogram, we use bins with qplot. In addition, we can use the data argument to make it easier to read and parse what is happening in the function. We show two ways of doing the same thing below, with the first commented out.

# This is the same thing
# qplot(exam_scores$exam1, bins = 10)

qplot(exam1, data = exam_scores, bins = 10)

This made a histogram, exactly like we might have wanted. If we wanted a boxplot instead, we can add an additional argument to specify this. Note that we specify y= because we want to plot the graph on the y-axis. The qplot function for boxplots only works for putting the continuous variable on the y-axis -- in the categorical-descriptives, we'll go over side-by-side boxplots which can show the distributions by group in a categorical variable, and we'll have to specify x= there. For now, just remember to specify y= when using qplot for boxplots.

qplot(y = exam1, data = exam_scores, geom = 'boxplot')

If we want to make a scatterplot, we can use it very similar to how we would use plot.

# This does the same thing
# qplot(exam_scores$exam1, exam_scores$exam2)

qplot(exam1, exam2, data = exam_scores)

Adding a title

We can add a title like we did before using the main argument.

qplot(exam1, exam2, data = exam_scores, main = "Scatterplot of scores on two exams")

Exercises

Consider the exam_scores dataset.

data(exam_scores)
head(exam_scores)

Exercise 1

What is the overall mean of all exam scores for exam 2?


grade_result(
  pass_if(~ identical(.result, mean(exam_scores$exam2)), "Good Job!"),
  fail_if(~ TRUE)
)

Exercise 2

What is the mean of exam 1 scores for students who took notes with paper?


grade_result(
  pass_if(~ identical(.result, mean(exam_scores$exam1[exam_scores$note_mode == 1])), "Good Job!"),
  fail_if(~ TRUE)
)

Exercise 3

Create a histogram of the exam 1 scores for all students who used a computer as their method of note-taking. Use 15 bins in the histogram, and use the base R method (not using qplot).


**Hint:** This is a two step process. First, you need to subset, then you need to create the graph.
grade_result(
  pass_if(~ identical(.result, hist(exam_scores[exam_scores$note_mode == 2,'exam1'], breaks = 15)), "Good Job!"),
  pass_if(~ identical(.result, hist(exam_scores$exam1[exam_scores$note_mode == 2], breaks = 15)), "Good Job!"),
  # pass_if(~ isTRUE(all.equal(.result, qplot(exam_scores[exam_scores$note_mode == 2,'exam1'], bins = 15))), "Good Job!"),
  fail_if(~ TRUE)
)

Submitting work

submission_code <- function(UID){
  httr::sha1_hash('numdesc',as.character(UID))
}

Generate your submission code by putting in your UID in the function below. For example, if your UID is 2, then your code should look like submission_code(UID = 2)

# Replace the number below with your UID
submission_code(2)


kimbrianj/umdpsyc documentation built on Jan. 27, 2021, 7:36 a.m.