# R Data Frame Basics Tutorial - Supplement to Statistics Using R: An Integrative Approach

```
library(learnr)
knitr::opts_chunk$set(echo = FALSE)
tutorial_options(exercise.completion = FALSE)

# Objects used in later exercises
X <- rep(0:1, 200)
A <- c("1", "4", "7", "5", "0")
B <- c(1, 4, 7, 5, 0)
```

## Overview

This tutorial, which accompanies the textbook Statistics Using R: An Integrative Approach by Weinberg, Harel, & Abramowitz (Cambridge University Press, 2021), covers the basic use and manipulation of datasets, which are also referred to as data frames, in R. The activities covered in this tutorial are designed to help you understand the examples in each chapter and to complete end-of-chapter exercises in the textbook. It also is designed to help you learn some basic coding skills that will be helpful when working with data frames in R so as to aid your ability to complete more complex analyses and, ultimately, to learn the larger statistical concepts covered throughout the textbook.

To close this tutorial, you will need to exit this tab in your browser window and press Escape within the Console window of RStudio. Note that when you close the tutorial, your progress will be saved until you re-open it next time. To clear your progress before closing the tutorial, click the Start Over button at the bottom of the browser screen.

We will use the Framingham dataset, used throughout the textbook and contained within the sur package, which accompanies the textbook, as an example data frame in the tutorial. The Framingham dataset is based on a longitudinal study investigating factors relating to coronary heart disease. A more complete description of the Framingham dataset may be found in the textbook in Appendix A. Alternatively, you can type `?Framingham` into the Console window and press Enter. This will cause the description of the dataset to open in the help tab of RStudio.

To find out more information about any command or operator used throughout the tutorial, type a `?` before the name of the command or operator in the Console window (in either R or RStudio) and press Enter, or click on the Help tab in the lower right window pane of RStudio.

The answers you provide to the coding exercises are not checked for correctness, but the solution to each exercise is available by clicking on the Solution button along the top of the code box.

## Setting Up with Data Frames

In this section, we describe a couple of basic data structures in R. Then, we explain how to access datasets included in the sur package. Finally, we briefly review common commands for reading in datasets from outside sources other than the sur package.

### Data Frames and Data Structures in R

A data frame is a type of data structure in R that is organized by rows and columns. Typically, each row holds the values recorded for one observation or subject, and each column holds the values of one variable. The columns in a data frame may be named (e.g., by the name of the variable stored in that column), and different columns may hold different types or classes of values (e.g., numbers in one column and non-numerical character strings in another). Each column in a data frame is a vector, which means that all the values (or elements) in that column are of the same type or class (e.g., all numerical or all non-numerical character strings). Other data structures are available in R, including matrices and lists, but they are beyond the scope of this tutorial.
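As a minimal sketch of these ideas, the code below builds a small made-up data frame and checks the class of the whole object and of two of its columns. The name `toy` and all of its values are invented for illustration; they are not taken from the Framingham dataset.

```r
# A hypothetical three-row data frame; names and values are invented.
toy <- data.frame(
  ID  = c(1, 2, 3),
  SEX = c("Men", "Women", "Women"),
  AGE = c(52, 47, 61),
  stringsAsFactors = FALSE  # keep SEX as character in every R version
)

class(toy)      # the object as a whole is a data frame
class(toy$AGE)  # this column is a numeric vector
class(toy$SEX)  # this column is a character vector
```

Each row describes one (imaginary) subject, and each column holds values of a single class.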

### Datasets Within the sur Package

All datasets used in Statistics Using R: An Integrative Approach are contained within the sur package and are readily available as data frames after installing and loading the sur package using first `install.packages("sur")` once per computer that you are using, and then `library(sur)` each time RStudio is opened. For instance, we simply have to type `Framingham` to see the Framingham dataset printed by R. Type `Framingham` below. Then click the Run Code button or place the cursor on the line of code and use a keyboard shortcut: Command+Enter for Mac or Ctrl+Enter for Windows and Linux.

```
# Setup: make the Framingham data frame from the sur package available by name
Framingham <- sur::Framingham

# Print the data frame
Framingham
```

Now that we have accessed and printed the Framingham dataset to the console, we can see some of the information included in the dataset: each row appears to represent data for an individual with a specific identification number (given by the `ID` column). Data for each individual seems to cover both numeric measurements as well as categorical information. We will inspect this data frame in more detail in the coming sections.

### Objects, Functions, and Arguments in R

There are many types of objects in R. As noted above, data frames are a type of data structure, holding a collection of variables. When the name of a data frame is typed into the console, R prints its contents. A package is also an object in R, but it contains an assortment of related data, functions, and other code. If we want R to do something with these objects (other than simply printing data frame contents), we have to give R a command, also known as a function. For example, we used the function `install.packages` to install the sur package to our library. R knew which package we wanted to install because we listed sur as an argument of `install.packages`: we put `"sur"` in the parentheses following the command. Arguments to functions may tell R on what object the command should act, or even how to act on it. Likewise, when we wanted to access the contents of the sur package within an R session, we used the `library` function followed by `sur` in parentheses, telling R to open the sur package from our library. Note that `install.packages` required the argument `sur` to be in quotes while `library` did not.

### Reading in Datasets from Other Sources

If you would like to read into R as a data frame a dataset that is not part of the sur package, but that is, instead, from an outside source, you may do so with one of several R functions. Two popular such R functions are the following:

• `read.csv` -- reads in files with comma-separated values (CSV).

• `read.delim` -- reads in delimited text files. By default it expects values separated by tabs, and its `sep` argument can be set to another separator, such as a space. In this way it allows a greater variety of file types to be read in and converted into data frames.

These functions can even read data in from web addresses, so that the user does not have to download and save the file before reading it into R. We will not be practicing these commands within this tutorial, since they are not needed to access our datasets, but users should know them for when they need to conduct analyses on other datasets. To practice these commands, see the end-of-chapter exercises for Chapter 1 in the Statistics Using R textbook. For further information on these functions, type `?read.csv` or `?read.delim` into the Console window and press Enter, or search for the commands in the Help tab of the lower righthand windowpane of RStudio.
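A minimal sketch of both functions follows. To keep it self-contained, the file contents are supplied inline through the `text` argument rather than read from a file or web address; the column names and values are invented for illustration. In ordinary use, the first argument would be a file path or URL instead.

```r
# Hypothetical file contents, written inline so no file on disk is needed.
csv_text <- "ID,AGE\n1,52\n2,47"    # values separated by commas
tab_text <- "ID\tAGE\n1\t52\n2\t47" # values separated by tabs

from_csv <- read.csv(text = csv_text)
from_tab <- read.delim(text = tab_text)

from_csv
from_tab
```

Both calls should produce a data frame with two rows and the columns `ID` and `AGE`.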

## Inspecting Data Frames

In this section, we show how to view the top and bottom rows of a data frame, how to quickly obtain the dimensions of a data frame, and how to initially examine the structure and variables of a data frame.

Now that you know how to access a dataset, we will show you how to obtain some initial information about it using the Framingham dataset as an example. As you will see below, when you simply type `Framingham`, only some observations (rows) and variables (columns) will be printed in the output window at one time. While this tutorial allows scrolling, outside of this tutorial R has a maximum number of rows and columns it will print to the console at one time. To overcome this limitation and be able to access information more readily in the dataset, we introduce a number of different R commands. In particular, to get an overview of what the data look like we may view the first n rows of the data frame by using the `head` command and typing, not simply `Framingham`, but `head(Framingham)`. By default, R sets n to be 6. The appropriate code is given below: we use the function `head` with the argument `Framingham` to tell R to print the first 6 rows of `Framingham` in the output window. To see the output, hit the Run Code button, or use the shortcut Command+Enter for Mac or Ctrl+Enter for Windows and Linux.

```
head(Framingham)
```

Analogously, we can use the `tail` command to print the last n rows of a data frame in the output window. The default for n for this command also is 6. To print the last 3 rows instead of 6, we specify that n is to be equal to 3 by adding the argument `n = 3` in the `tail` command after a comma as shown below. Click the Run Code button or use a keyboard shortcut to run the code below.

```
tail(Framingham, n = 3)
```

We can see the row numbers of the Framingham dataset printed alongside the data frame in an unnamed column on the left side. From these row numbers we can tell that R printed rows numbered 1-6 when we used the `head` command under default settings, and R printed rows 398-400 when we used the `tail` command with the argument `n = 3`.
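The same behavior can be sketched on a small made-up data frame. The name `toy` and its values are hypothetical and stand in for any data frame.

```r
# A hypothetical ten-row data frame; values are invented.
toy <- data.frame(
  ID  = 1:10,
  AGE = c(52, 47, 61, 39, 44, 58, 66, 35, 49, 70)
)

head(toy)         # first 6 rows by default
head(toy, n = 2)  # first 2 rows
tail(toy, n = 3)  # last 3 rows, labeled with row numbers 8, 9, and 10
```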

### Dimensions

As noted earlier, a data frame in R has data arranged in rows and columns, where, typically, the rows represent observations and the columns represent variables. Accordingly, to determine how many observations a data frame has, we simply need to find the number of rows in the data frame. Likewise, to determine how many variables a data frame has, we simply need to find the number of columns. To do so, we use the command `dim`, which stands for dimension, and type `dim(Framingham)`. Try this command in the code box below.

```
dim(Framingham)
```

The `dim` command returns a vector with the number of rows (observations) as the first element and the number of columns (variables) as the second element.
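As a small sketch on a made-up data frame (the name `toy` and its values are hypothetical), the elements of the vector returned by `dim` can be pulled out with brackets. The base R functions `nrow` and `ncol`, which are not otherwise used in this tutorial, return the same two numbers separately.

```r
# A hypothetical data frame with 5 rows and 3 columns.
toy <- data.frame(
  ID    = 1:5,
  AGE   = c(52, 47, 61, 39, 44),
  SYSBP = c(128, 135, 142, 118, 150)
)

dim(toy)     # rows first, then columns: 5 3
dim(toy)[1]  # number of rows (observations)
dim(toy)[2]  # number of columns (variables)
nrow(toy)    # same as dim(toy)[1]
ncol(toy)    # same as dim(toy)[2]
```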

```
quiz(
question("How many variables does the Framingham dataset have?",
),
question("How many rows are there in the Framingham data frame?",
)
)
```

### Structure

A useful command for learning more about the variables in a dataset is the `str` command, which stands for structure. From this command we may learn about (1) the way in which a dataset is structured (for the Framingham dataset, the data are structured as a `data.frame` as defined earlier), (2) how many rows and columns the dataset has, and (3) the name of each variable along with whether the variable is numeric or non-numeric. Variables that are listed as `numeric` (noted as `num`) are either ratio- or interval-level, and variables that are listed as `factor` (noted as `Factor`) are either nominal- or ordinal-level. Details on working with these two types/classes of variables will be covered in the Data and Variable Types section of the tutorial. A single data frame may contain both numeric and factor variables.
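To see what this looks like on a very small scale, the sketch below runs `str` on a made-up data frame with two numeric columns and one factor column. The name `toy` and its values are hypothetical.

```r
# A hypothetical data frame; SEX is stored explicitly as a factor.
toy <- data.frame(
  ID  = c(1, 2, 3),
  SEX = factor(c("Men", "Women", "Women")),
  AGE = c(52, 47, 61)
)

# Reports the structure, the dimensions, and one line per variable:
# ID and AGE are listed as num, SEX as a Factor with 2 levels.
str(toy)
```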

Inspect the output of running `str` on `Framingham` below and then answer the following question.

```
str(Framingham)
```
```
quiz(
question("What classes of data does the Framingham dataset have? Check all that apply.",
answer("Categorical", message = "There may be categorical variables, but this is not a class of data in R")
)
)
```

## Subsetting Data Frames

It is often the case that a particular analysis will involve only a subset of the variables or observations in a data frame rather than all of them. To carry out an analysis on only a subset of variables or observations, it is first necessary to select that subset. R refers to this as subsetting the data. In this section we review methods for subsetting the data and, in particular, for selecting a particular subset of variables (i.e., columns of the data frame) or a particular subset of observations (i.e., rows of the data frame). As we will describe in detail in the following sections, a subset of columns may be selected by identifying the names of the variables stored in those columns. Columns also may be selected by identifying their position within the data frame (e.g., 1st, 2nd, 20th, etc.). We refer to this number as a column's index (plural: indices).

### Selecting Columns by the Name of a Variable

In R, data frames can be subsetted by selecting rows and/or columns. These rows and columns can be selected by name or by index. First, we will look at how to select a single column by name. Recall the output of `str(Framingham)`, which is shown below.

```
str(Framingham)
```

Each variable name is displayed with a `$` in front of it. This operator is how we reference a specific column within a data frame in R. Say we want to call just the `SEX` variable from the Framingham dataset. We would simply enter `Framingham$SEX` and R would print the values of the `SEX` variable as the output. Try selecting just the variable `AGE1` from `Framingham` using the `$` operator, and run the code below. Remember, you can always view the solution by clicking the Solution button at the top of the code box.

```
Framingham$AGE1
```

Once a variable is selected, there are other operations that can be performed on that variable beyond simply printing the values of that variable in the output window. If we would like to compute the mean of the variable `AGE1`, for example, we would use the `mean` command as shown below. Other commands for summarizing the values of a variable follow the same format and are given throughout the textbook.

```
mean(Framingham$AGE1)
```
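The pattern of selecting a column with `$` and then passing it to a summary function can be sketched on made-up data as follows. The name `toy` and its values are hypothetical; `median` is included only as one further example of a base R summary function.

```r
# A hypothetical four-row data frame; values are invented.
toy <- data.frame(ID = 1:4, AGE = c(52, 47, 61, 40))

toy$AGE          # the AGE column as a vector: 52 47 61 40
mean(toy$AGE)    # (52 + 47 + 61 + 40) / 4 = 50
median(toy$AGE)  # middle of the sorted values: 49.5
```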

### Numeric and Name Slicing

In this section, we use index values to describe how to access a single value or multiple values within a data frame. A standard way to access a value in any row-by-column array is to specify its row index followed by its column index. In R we use brackets `[ ]`, instead of the `$`, following the name of the data frame and place the row and column indices within the brackets separated by a comma. Thus, if we wanted to access and print the value in the second row, fourth column of the `Framingham` dataset, we would type `Framingham[2,4]`.

To verify that we will indeed obtain the value in the second row and fourth column of the Framingham dataset, use the `head` command to print the first three rows of `Framingham` in the output window. Then type `Framingham[2,4]` and verify that it matches the value in the second row and fourth column.

```
head(Framingham, n = 3)
Framingham[2,4]
```

From the `head` output, we note that the variable `AGE1` occupies the fourth column. To select all the values in the entire fourth column we, once again, use the bracket operator, but rather than specifying a single value for the row, as we did earlier, we now leave the row index blank: `Framingham[,4]`. Alternatively, we can use the column name `AGE1` within the bracket instead of the number 4: `Framingham[,"AGE1"]`. Note that when we use the column name within brackets, we must place the name within quotes; we do not use quotes when using the `$` operator.

In the space below, access and print all the values in the `TOTCHOL1` variable from `Framingham` in three ways: (1) use the `$` operator, (2) use brackets with the column index number, and (3) use brackets with the variable name in quotes. We can find the column's index number by examining the output from having executed the `head` command in the previous exercise. Verify that the outputs from the execution of these three commands are identical by checking that the first three values are the same.

```
Framingham$TOTCHOL1
Framingham[,3]
Framingham[,"TOTCHOL1"]
```

If, instead of accessing and printing all the values of a single variable, we wanted to access and print all the values of more than one variable, where the variables are in sequence in the dataset, we can do so using a colon, `:`, within the brackets. The variables may be referred to either by their column indices or by their names. For example, if we wanted to access and print all the values of the four variables `TOTCHOL1`, `AGE1`, `SYSBP1`, and `DIABP1`, we note from our earlier work using the `head` command that these four variables are in columns 3, 4, 5, and 6. Accordingly, we may use the colon operator to access them by referencing their indices as follows: `Framingham[, 3:6]`. It is worth noting that 3 and 6 represent, respectively, the first and last indices of the four variables of interest. Another way to access these four variables is by their names using the following command: `Framingham[, c("TOTCHOL1", "AGE1", "SYSBP1", "DIABP1")]`. When more than one variable is named, the names need to be combined into a single vector using the `c` function, which stands for concatenate. It also is worth noting that the names of the variables within the brackets must be in quotes.

If instead of selecting all rows for a specific column, we wanted to select all columns for specific rows, we again use brackets. By analogy, we now leave the column entry blank within the brackets. Thus, if we wanted to access and print the values of all the variables (columns) for just the first subject, we would type `Framingham[1,]`. If, instead, we wanted a subset of sequential rows, we would, as before, use the colon operator, `:`, separating the first and last indices in the sequential set. For example, if we wanted to access the values for all the variables for only the first three rows, we would type `Framingham[1:3,]`. Note that this is identical to calling `head(Framingham, n = 3)`.

If we want to select indices that are not sequential, we can use the `c` function within the brackets to group the indices of interest together. For instance, if we want all the rows for the second, fifth, and ninth columns, we would call `Framingham[,c(2,5,9)]`. We can also use `c` within the brackets to refer to multiple columns by name. Try calling all rows for the `SYSBP1` and `BMI1` variables below. Note again that variable names need to be in quotes when using brackets to subset.

```
Framingham[,c("SYSBP1","BMI1")]
```
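The bracket forms covered in this section can be collected in one short sketch. It uses a made-up data frame (the name `toy`, its column names, and its values are all hypothetical) so that each result is small enough to check by eye.

```r
# A hypothetical four-row data frame; values are invented.
toy <- data.frame(
  ID    = 1:4,
  SEX   = c("Men", "Women", "Women", "Men"),
  AGE   = c(52, 47, 61, 40),
  SYSBP = c(128, 135, 142, 118),
  stringsAsFactors = FALSE
)

toy[2, 3]                # single value: row 2, column 3 (AGE), which is 47
toy[, 3]                 # all rows of column 3
toy[, "AGE"]             # the same column, selected by name
toy[, 3:4]               # sequential columns 3 through 4
toy[, c("ID", "SYSBP")]  # non-sequential columns, selected by name
toy[1:2, ]               # rows 1 and 2, all columns
toy[c(1, 4), ]           # non-sequential rows 1 and 4, all columns
```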

### Adding, Removing, and Renaming Columns

Commands other than `str` and `head` also provide information about a data frame. One such command is `names`, which prints the names of the variables in a dataset in the Console window. Its syntax is similar to that of the `str` and `head` commands. Execute the `names` command on `Framingham` to see the names of the variables appear in the output window in the same order as they are in the dataset.

```
names(Framingham)
```

```
X <- rep(0:1, 200)
```

For this next exercise, we have already created a new vector of 400 values called `X` that is available for use in this tutorial. `X` is not part of the Framingham dataset, but since `X` has the same number of values as the number of rows in `Framingham`, we can add it to the data frame as a new column. The quickest way to do this is to use an equals sign, `=`, to assign `X` as a new variable in the dataset. So that it is clear that we want this new variable to be part of the `Framingham` dataset, we assign the variable `X` to a name that includes the name of the dataset as well, `Framingham$new_var`. The code for this is displayed below. To be clear, the left side of the assignment tells R that we are adding a new column to `Framingham` and we are naming this column `new_var`; the right side of the assignment tells R that `new_var` will be getting the data contained in our outside variable `X`.

The line of code below will not produce any output in the console when run. To verify that the data of `X` has been added as a new variable in `Framingham` called `new_var`, add code to check the variable names of `Framingham` using the `names` command.

```
Framingham$new_var = X
```

```
Framingham$new_var = X
names(Framingham)
```

`new_var` is not a particularly good name for a variable, as it tells us nothing about what that variable measures, nor does it stick to the all-capitals naming convention of the Framingham dataset. Since `new_var` is just a fake variable that we made up for practice, let's rename it `FAKE`. We can use the `names` command and brackets to assign a new name to `new_var`. On reviewing the output produced by `names(Framingham)`, we note that the output consists of a single row of names. Output consisting of either a single row or a single column may be described as an array of only one dimension. Arrays of one dimension are called vectors. As we have learned, by contrast, data frames consist of two dimensions, both rows and columns. Because `X` was added as a new variable at the end of the list of variables in the Framingham data set, and the Framingham dataset originally had 33 variables, `new_var` became the 34th variable in that dataset. To refer to `new_var` given that it is the 34th element of the vector of names, we use the `names(Framingham)` command followed by the number 34 in brackets as follows: `names(Framingham)[34]`. To change the name of `new_var` to `FAKE`, we assign the name `FAKE` to the 34th element in the `names(Framingham)` vector using the equals sign `=`. In writing this code, we must remember to place the name `FAKE` in quotes because quotes need to be used when we refer to the name of a variable. As a distinction, when we refer to the set of values of a variable, quotes are not used. Write the code to execute this name change and then print all the variable names again to verify that the name `new_var` has been changed to `FAKE`.

```
names(Framingham)[34] = "FAKE"
names(Framingham)
```

Sometimes we want to remove columns from a data frame---perhaps they were created in error or we decide they are unnecessary. We can remove a column by assigning it the object `NULL`. Access the `FAKE` column from `Framingham` and assign it `NULL` using the `=` operator. Then check that the column has been removed by using the `names` command.

```
Framingham$FAKE = NULL
names(Framingham)
```
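The three steps of this section (add a column, rename it, remove it) can be traced end to end on a made-up data frame. The name `toy` and its values are hypothetical; the column index 3 plays the role that 34 plays for `Framingham`.

```r
# A hypothetical data frame with 4 rows and 2 columns.
toy <- data.frame(ID = 1:4, AGE = c(52, 47, 61, 40))
X <- rep(0:1, 2)        # 0 1 0 1: one value per row of toy

toy$new_var = X         # add X as a new, third column
names(toy)              # "ID" "AGE" "new_var"

names(toy)[3] = "FAKE"  # rename the third column
names(toy)              # "ID" "AGE" "FAKE"

toy$FAKE = NULL         # remove the column
names(toy)              # "ID" "AGE"
```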

### Subsetting with Brackets and Logicals

Sometimes we want to look at values of a variable just for a certain group or just for a certain condition, or we want to compare statistics on a variable by group. We can do this using brackets and a logical statement. In R, a logical statement is a statement that is evaluated as either `TRUE` or `FALSE`. For instance, `3 < 5` states that 3 is less than 5. When we enter this in R, the returned value is `TRUE`. If we try `3 > 5`, we would get back `FALSE`. We can also use relational operators with character values: `"dog" == "cat"` comes back `FALSE`, but `"dog" == "dog"` comes back `TRUE`. Note that R is case sensitive, so `"dog" == "DOG"` also comes back `FALSE`. The following relational operators may be used to create logical statements:

• `<` means less than

• `<=` means less than or equal to

• `>` means greater than

• `>=` means greater than or equal to

• `==` means equal to

• `!=` means not equal to

If we specify a variable, which is an entire column of values, on the left side of the logical statement, each value in that variable will be checked against the right side. First, print the `AGE1` variable in `Framingham`. Then, check if the values in this variable are less than 50. You will notice that the value returned is `TRUE` whenever the logical expression is true (i.e., when the value of `AGE1` is less than 50) and otherwise the value returned will be `FALSE`.

```
Framingham$AGE1
Framingham$AGE1 < 50
```

In the previous example, the code `Framingham$AGE1 < 50` produces a returned value of either `TRUE` or `FALSE` for each of the 400 values of `AGE1` depending upon whether the value of `AGE1` was less than 50 or not. We also can use a logical expression to subset a variable and select cases from the dataset for which the logical expression is true. To do so for this example, we would use the command `Framingham$AGE1[Framingham$AGE1 < 50]` to obtain the age values of only those cases with ages less than 50, as shown below. Said another way, we are asking R to return values from `Framingham$AGE1`, but only those for which the statement `Framingham$AGE1 < 50` is true.

```
Framingham$AGE1[Framingham$AGE1 < 50]
```

Let's suppose, instead, we had wanted the systolic blood pressure for only women. In this case, we would use the command: `Framingham$SYSBP1[Framingham$SEX == "Women"]`. Said differently, we would be subsetting the `SYSBP1` variable by whether or not the case is a woman and obtain as output the systolic blood pressure values of just the cases for which the `SEX` variable has the value `"Women"`. Below, try subsetting the `AGE1` variable in the same way to obtain the ages of just the cases for which the `SEX` variable has the value `"Women"`.

```
Framingham$AGE1[Framingham$SEX == "Women"]
```

Let's suppose we now wanted the age values of women whose systolic blood pressure is 130 or more. To obtain these results we would need to include two logical expressions within the brackets, in this case connected by an "and". One logical expression would specify that `SEX == "Women"` and the other that `SYSBP1 >= 130`. In R, "and" is represented by the `&` operator and "or" is represented by the `|` operator. Accordingly, the code for subsetting age to those cases who are women and who have systolic blood pressure greater than or equal to 130 is: `Framingham$AGE1[Framingham$SEX == "Women" & Framingham$SYSBP1 >= 130]`. Below, try using the `&` operator to subset systolic blood pressure (`SYSBP1`) to just the observations for subjects who are women and are 60 years old or older.

```
Framingham$SYSBP1[Framingham$SEX == "Women" & Framingham$AGE1 >= 60]
```

Now let's get the values of `SYSBP1` for the youngest and oldest subjects: those younger than 35 or older than 65.

```
Framingham$SYSBP1[Framingham$AGE1 < 35 | Framingham$AGE1 > 65]
```
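The logical subsetting patterns from this section can be checked by hand on a made-up data frame. The name `toy`, its column names, and its values are hypothetical and chosen so that every result is easy to verify.

```r
# A hypothetical five-row data frame; values are invented.
toy <- data.frame(
  SEX   = c("Men", "Women", "Women", "Men", "Women"),
  AGE   = c(52, 34, 61, 40, 68),
  SYSBP = c(128, 135, 142, 118, 150),
  stringsAsFactors = FALSE
)

toy$AGE < 50                                   # FALSE TRUE FALSE TRUE FALSE
toy$AGE[toy$AGE < 50]                          # 34 40
toy$SYSBP[toy$SEX == "Women"]                  # 135 142 150
toy$SYSBP[toy$SEX == "Women" & toy$AGE >= 60]  # "and": 142 150
toy$SYSBP[toy$AGE < 35 | toy$AGE > 65]         # "or": 135 150
```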

## Data and Variable Types

As mentioned earlier in the tutorial, each column of a data frame is a vector of values that are all of the same type. There are several classes of vectors in R, but in this section, we will limit the discussion to just a few important ones: numeric, character, logical, and factor. We can check the class of a vector with the `class` function.

### Numeric and Character Data

Numeric data are exactly what they sound like: numbers! Numeric vectors typically store values in double precision, which allows for decimals, and they can be mathematically operated upon. We might use numeric vectors to store values for interval- or ratio-level measurements. See Chapter 1 of Statistics Using R: An Integrative Approach for a review of measurement levels of variables.

Character data consist of strings of letters and/or numbers contained in quotes. Character vectors might hold nominal- or ordinal-level measurements, and may require conversion to factor vectors in later stages, but more on this shortly. Numbers in quotes are characters and, as such, cannot be mathematically operated on. Let's look at a quick example of this using two vectors that we will create using the `c` function we first saw in the previous section. In the space below, vector `A` has been assigned the numbers 1, 4, 7, 5, and 0, all in quotes. Create a vector `B` that is assigned those same numbers, but without quotes.

```
A = c("1", "4", "7", "5", "0")
```

```
A = c("1", "4", "7", "5", "0")
B = c(1, 4, 7, 5, 0)
```

In R, we can double every value in a numeric vector by multiplying that vector by 2 using the `*` as the multiplication operator. Check the class of each vector using the `class` command. Then try multiplying the numeric vector by two.

```
class(A)
class(B)
B*2
```

Now check what happens when we try to multiply the character vector by 2.

```
A*2
```

As we can see from the output, `A*2` returns an error because the vector `A` is not a numeric vector. Since all of the elements in `A` contain only numbers, we can easily convert `A` from character data to numeric by applying the function `as.numeric` to `A` and assigning the result the name `A`. This means we will be replacing `A` with a numeric version of itself and overwriting the previous character version. Further, instead of using `class`, we can use `is.numeric` to check if `A` is now numeric. The code for the conversion of `A` to numeric is shown below. Type the appropriate code to check if the conversion worked.

```
A = as.numeric(A)
```

```
A = as.numeric(A)
is.numeric(A)
```
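The whole sequence from this section can be run as one self-contained sketch using the same vectors `A` and `B`. Two details are additions for illustration only: `tryCatch` is used so that the error from `A * 2` is captured as a message instead of stopping the script, and the converted vector is stored under the hypothetical name `A_num` rather than overwriting `A`.

```r
A <- c("1", "4", "7", "5", "0")  # character vector
B <- c(1, 4, 7, 5, 0)            # numeric vector

class(A)  # "character"
class(B)  # "numeric"
B * 2     # 2 8 14 10 0

# A * 2 fails because A is not numeric; capture the error message.
result <- tryCatch(A * 2, error = function(e) conditionMessage(e))
result

# Convert the character vector to numeric, then arithmetic works.
A_num <- as.numeric(A)
is.numeric(A_num)  # TRUE
A_num * 2          # 2 8 14 10 0
```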

### Logical Data

In an earlier section, we described how logical statements in R evaluate to either `TRUE` or `FALSE`. It follows that logical vectors contain only the elements `TRUE` or `FALSE`. Internally, R stores the values of `TRUE` and `FALSE` as 1 and 0, respectively. To demonstrate this, let's print the logical statement that identifies whether a subject is less than 40 years old (based on the `AGE1` variable), and then, let's put this entire statement within the `as.numeric` command.

```
Framingham$AGE1 < 40
as.numeric(Framingham$AGE1 < 40)
```

Notice that each `TRUE` is represented by a 1 and each `FALSE` by a 0. This internal coding using the numbers 1 and 0 makes it possible to perform many operations on logical variables. For example, suppose we wanted to know the number of subjects under age 40 in the `Framingham` dataset. We know from the subsetting section of the tutorial that we can select those values using a logical statement in brackets. We could select these individuals and then get the length of the new vector using the `length` command. However, we could do this more efficiently by summing the logical statement using the `sum` command. R will add all the 1's in the vector and return the total number of cases where a subject's age is less than 40. Try this in the space below: take the sum of the logical statement that returns `TRUE` if an individual in `Framingham` is younger than 40. Note that you do not need to use the `as.numeric` command here because the `sum` command accesses the internal numeric codes, 1 and 0, directly.

```
sum(Framingham$AGE1 < 40)
```

Thus, we find that 57 of the 400 individuals in `Framingham` are under 40 years old.
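The two counting approaches described above can be compared on a short made-up vector. The names `ages` and `under_40` and the values are hypothetical. The final line, which uses `mean` to obtain a proportion, goes one step beyond the text and relies on the same 1/0 coding.

```r
ages <- c(52, 34, 61, 38, 68)  # hypothetical ages
under_40 <- ages < 40          # FALSE TRUE FALSE TRUE FALSE

as.numeric(under_40)    # 0 1 0 1 0
length(ages[under_40])  # subset first, then count: 2
sum(under_40)           # add up the TRUE values directly: 2
mean(under_40)          # proportion under 40: 2 / 5 = 0.4
```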

### Factor Variables: Levels

Factor vectors contain the elements of categorical variables, such as nominal- and ordinal-level measurements. R encodes (internally represents) the levels (categories) of the variable as numbers, but allows the labels of these levels to be strings of numbers or characters. When we classify a vector as a factor variable, R will enter the variable correctly into models as a categorical variable rather than as a numeric variable.

Recall that when we use the `str` command, R prints the class of each variable after its name. We also can check if a specific variable is a factor with `is.factor`. Again from `Framingham`, check if `CURSMOKE1`, the variable that indicates if a subject is a current smoker, is a factor variable. Then check what the categories of `CURSMOKE1` are by running the `levels` command on this variable.

```
is.factor(Framingham$CURSMOKE1)
levels(Framingham$CURSMOKE1)
```

We can see that the levels are "No" and "Yes", but if we wanted to see the underlying numeric coding, we can use the `as.numeric` command as we did earlier with respect to logical variables. When applied to a factor variable, the numeric vector that is produced contains the numeric values that are used to internally represent the categories. Try this for `CURSMOKE1` below.

```
```
```as.numeric(Framingham\$CURSMOKE1)
```

This is helpful, but inefficient. Now let's look at what the `table` command does when run on `CURSMOKE1`.

```table(Framingham\$CURSMOKE1)
```

The `table` command provides counts of each level of the variable: half the subjects in `Framingham` are current smokers and half are not. If we run `table` on two variables, R cross-tabulates the combinations of their levels. For example, running `table(Framingham\$SEX, Framingham\$CURSMOKE1)` gives the following output, which shows smokers and non-smokers by sex.

```table(Framingham\$SEX,Framingham\$CURSMOKE1)
```
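
The same one-way and two-way tabulations can be tried on a tiny made-up sample. The vectors `sex` and `smoker` below, including their labels, are invented for illustration and do not reproduce the `Framingham` values.

```r
# Hypothetical data for six people (invented for illustration)
sex    <- factor(c("Men", "Women", "Women", "Men", "Women", "Men"))
smoker <- factor(c("Yes", "No", "Yes", "Yes", "No", "No"))

one_way <- table(smoker)        # counts for each level of one factor
two_way <- table(sex, smoker)   # rows are sex, columns are smoking status

one_way
two_way
two_way["Men", "Yes"]           # a single cell, selected by level names: 2
```

The first variable passed to `table` defines the rows and the second defines the columns.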

If we run `table` on a factor variable and its numeric conversion, we can see how the levels of that factor variable are represented numerically inside R. Try this for `CURSMOKE1` below.

```
```
```table(Framingham\$CURSMOKE1,as.numeric(Framingham\$CURSMOKE1))
```

From the output we see that "No" is encoded as 1 and "Yes" is encoded as 2. Although we may refer to the levels by name rather than by the numerical values that represent them internally, knowing which numerical value represents each level is important for interpreting the results of statistical analyses. For more about this, see Statistics Using R: An Integrative Approach.

```quiz(
question("Which level of `CURSMOKE1` is given a value of 1?",
)
```

If, for some reason, you would like to change how the levels of a factor variable are represented numerically, you may do so by using the `relevel` command on that variable and setting the argument `ref` to the name of the level that should have the value 1. The level assigned the number 1 is often called the reference level, category, or group. Try making "Yes" the level of `CURSMOKE1` that has the value 1. Assign the result to a new variable in `Framingham` called `CURSMOKE_RL`. Verify that "Yes" is now encoded as 1 and "No" is now encoded as 2 using the `table` command on `CURSMOKE_RL` and its numeric conversion.

```
```
```Framingham\$CURSMOKE_RL = relevel(Framingham\$CURSMOKE1, ref = "Yes")
table(Framingham\$CURSMOKE_RL,as.numeric(Framingham\$CURSMOKE_RL))
```
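
A compact way to check what `relevel` changes, and what it leaves alone, is to try it on a short made-up factor. The names `smoker` and `smoker_rl` and their values are hypothetical.

```r
# Hypothetical factor (invented for illustration)
smoker <- factor(c("Yes", "No", "No", "Yes", "No"))
levels(smoker)                              # "No" "Yes"

# Move "Yes" to the front so that it becomes the reference level
smoker_rl <- relevel(smoker, ref = "Yes")

levels(smoker_rl)                           # "Yes" "No"
as.numeric(smoker_rl)                       # 1 2 2 1 2
as.character(smoker_rl)                     # labels are unchanged
```

Only the internal coding changes; every element keeps its original label.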

### Factor Variables: Recoding

Sometimes we need to recode a numeric variable into a factor variable. We will try this with `SYSBP1` from `Framingham`. `SYSBP1` contains numeric measurements of systolic blood pressure. Let's assume we would like to recode this variable so that values less than 130 are grouped together under the category named "normal" and values greater than or equal to 130 are grouped together under the category named "high". To accomplish this, we will use the `ifelse` command. The `ifelse` command takes three arguments: a logical statement to be evaluated, values to return if the logical statement is true, and values to return if the logical statement is false.

First, let's try an example of how to use `ifelse`. The vector `x` is available in our working environment and contains the numbers 1 through 10. We want to create a new vector `y`, such that any value less than 5 is recoded as "low," and all other values are recoded as "high." Below we have provided the code to print the `x` vector and the logical statement that evaluates whether a value in the `x` vector is less than 5. Notice that `TRUE` is returned for the first four entries (the values 1 through 4), and `FALSE` thereafter.

We now use the `ifelse` command with its three arguments. The first argument, `x < 5`, checks whether each of the values of `x` is less than 5. The second argument specifies the value to be assigned (in this case, "low") to each entry that satisfies the logical statement, `x < 5`, and for which the returned logical value is therefore `TRUE`. The third argument specifies the value to be assigned (in this case, "high") to each entry that does NOT satisfy the logical statement, `x < 5`, and for which the returned logical value is therefore `FALSE`. Add two lines of code below: one to assign to the variable `y` the values produced from the `ifelse` command applied to `x` and another to print `y` to confirm our code worked.

```x <- c(1:10)
```
```x
x < 5
ifelse(x < 5, "low", "high")
```
```x
x < 5
ifelse(x < 5, "low", "high")
y = ifelse(x < 5, "low", "high")
y
```

Now we will try this with the Framingham dataset. Use the `ifelse` command to recode an individual's systolic blood pressure (`SYSBP1`) into a factor variable in such a way that if the blood pressure is greater than or equal to 130 (the returned value is `TRUE`), it is assigned the value "high", and if it is not (the returned value is `FALSE`), it is assigned the value "normal". Assign the result to a new variable in `Framingham` called `SYSBP_CAT`.

```
```
```Framingham\$SYSBP_CAT = ifelse(Framingham\$SYSBP1 >= 130, "high", "normal")
```
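
One detail worth checking when recoding with `ifelse` is what happens to missing values. The sketch below uses a made-up vector `sysbp` (its name and values are assumptions, not `Framingham` data) that contains one `NA`.

```r
# Hypothetical systolic blood pressure readings, one of them missing
sysbp <- c(118, 130, 145, NA, 129)

# Values of 130 or more become "high"; all other non-missing values become "normal"
bp_cat <- ifelse(sysbp >= 130, "high", "normal")

bp_cat          # "normal" "high" "high" NA "normal"
class(bp_cat)   # "character"
```

Because `NA >= 130` evaluates to `NA`, the missing reading stays missing rather than being placed in either category.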

Use the space below to run whatever code is needed to answer the following questions about `SYSBP_CAT`.

```# use this space to run code
# any line that starts with '#' is a comment and will not be evaluated by R
```
```quiz(
question("How many subjects have normal systolic blood pressure?",
question("How many subjects have high systolic blood pressure?",
question("What class of vector is `SYSBP_CAT`?",
)
```

As revealed in that last question, `SYSBP_CAT` is a character vector, not a factor vector. To complete the conversion of our new variable into a factor variable, we would like to set the lowest group (the reference group to be internally coded by the number 1) to be "normal." Note that if we do not set this explicitly, R will set "high" to 1 and "normal" to 2 because the default is to encode the groups alphabetically.

Setting our reference group explicitly is easily done by calling the `factor` command on our `SYSBP_CAT` variable and adding a second argument called `levels` after a comma. We use the `c` function to list the levels in the order in which we would like them to be. Because we would like "normal" to be the first level (internally represented by the number 1), we place "normal" as the first element in the `c` function. We then set the `levels` argument of the `factor` command equal to our `c` function. In the space below, convert `SYSBP_CAT` to a factor with "normal" as the first level. Be sure to assign the result to the same variable so that the changes are saved in `Framingham` and the character version of the variable is overwritten by the factor version. Then, check that you were successful using the `table` command.

```
```
```Framingham\$SYSBP_CAT = factor(Framingham\$SYSBP_CAT, levels = c("normal", "high"))
table(Framingham\$SYSBP_CAT,as.numeric(Framingham\$SYSBP_CAT))
```
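
To compare the default alphabetical coding with an explicit `levels` argument side by side, consider this minimal sketch. The character vector `bp_cat` and the names `default_f` and `explicit_f` are hypothetical.

```r
# Hypothetical character vector of blood pressure categories
bp_cat <- c("high", "normal", "normal", "high", "normal")

default_f  <- factor(bp_cat)                                # alphabetical levels
explicit_f <- factor(bp_cat, levels = c("normal", "high"))  # chosen order

levels(default_f)        # "high" "normal"
as.numeric(default_f)    # 1 2 2 1 2
levels(explicit_f)       # "normal" "high"
as.numeric(explicit_f)   # 2 1 1 2 1
```

The labels are identical in both versions; only the level that receives the code 1 differs.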

### Factor Variables: Ordering

By default, the numerical values assigned to factor variables are unordered in the sense that no level is considered greater or lesser than any other. Said differently, factor variables typically are considered to be nominal-leveled variables wherein the numbers assigned to levels are used merely to distinguish one level from another. Sometimes, however, a factor variable is ordinal-leveled, implying that an ordering of the values assigned to the levels of that variable is meaningful. In such instances, we would like our analytic results and plots to reflect that ordering. For example, a factor variable with levels "small", "medium", and "large" would be an ordinal-leveled factor variable, and it would therefore be important for an interpretation of results to reflect the fact that "large" is greater than "medium", which is greater than "small."

Let's suppose we wanted to add two additional levels to `SYSBP_CAT`: "low" for systolic blood pressure below 90 and "elevated" for systolic blood pressure between 120 and 129.9, inclusive. Before recoding `SYSBP_CAT`, we would need to add "low" and "elevated" as levels of the factor variable. We do this using the `factor` function once more, but this time we add the two new categories to the `levels` argument.

```quiz(
question("Which of the following can be supplied to the `levels` argument of `factor` such that the new factor variable will include all four levels (low, normal, elevated, and high)? Check all that apply. Any reference group is acceptable.",
answer('c("low", "normal", "elevated", "high")', correct = TRUE),
answer('c(levels(Framingham\$SYSBP_CAT), "low", "elevated")', correct = TRUE)
)
)
```

Even though we have multiple options for the `levels` argument, we are going to use `c("low", "normal", "elevated", "high")` because we would like these levels to be ordered from least to greatest. We add ordering to our factor simply by setting the argument `ordered` to `TRUE`. The space below shows the code for assigning a new factoring of `SYSBP_CAT` to a variable named `SYSBP_CAT2`. Add the missing arguments to the `factor` command so that the new variable has all four levels and R knows that they are to be regarded as an ordered factor variable with the order as specified.

```Framingham\$SYSBP_CAT2 = factor(Framingham\$SYSBP_CAT)
```
```Framingham\$SYSBP_CAT2 = factor(Framingham\$SYSBP_CAT,
levels = c("low", "normal", "elevated", "high"),
ordered = TRUE)
```
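
What ordering buys us can be seen with the small, medium, and large example mentioned earlier. This is a sketch on made-up values; the name `size` is hypothetical.

```r
# Hypothetical ordinal variable (invented for illustration)
size <- factor(c("medium", "small", "large", "small"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)

is.ordered(size)      # TRUE
size < "large"        # TRUE TRUE FALSE TRUE
max(size)             # the highest level present: large
```

Comparisons such as `<` and functions such as `max` are meaningful only for ordered factors; applying them to an unordered factor produces `NA` with a warning, or an error, rather than a comparison of the levels.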

If we print our new ordered variable, we see the ordering of the levels at the very bottom, as shown below.

```Framingham\$SYSBP_CAT2
```

So far we have allowed for the possibility of systolic blood pressure falling into one of four categories, but we have not yet told R how to distinguish when an individual has low or elevated blood pressure. This is why the output above shows four possible categories, but only "normal" or "high" actually being used. Now we need to recode our variable to include the two new categories: "low" for systolic blood pressure below 90 and "elevated" for systolic blood pressure between 120 and 129.9, inclusive. Below we have provided the code to recode `SYSBP_CAT2` to "low" for any rows where systolic blood pressure (`SYSBP1`) is less than 90. Try recoding for the "elevated" category in a similar manner. Hint: We will need to evaluate two logical statements to cover the range for "elevated."

```Framingham\$SYSBP_CAT2[Framingham\$SYSBP1 < 90] = "low"
```
```Framingham\$SYSBP_CAT2[Framingham\$SYSBP1 < 90] = "low"
Framingham\$SYSBP_CAT2[Framingham\$SYSBP1 >= 120 & Framingham\$SYSBP1 <= 129.9] = "elevated"
```
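
The whole recoding sequence can be traced on a handful of made-up readings. The vector `sysbp` and its values are assumptions chosen so that every category occurs; they are not `Framingham` data.

```r
# Hypothetical systolic blood pressure readings
sysbp <- c(85, 118, 125, 142, 129.9)

# Start with two categories, but declare all four levels in order
bp_cat <- factor(ifelse(sysbp >= 130, "high", "normal"),
                 levels = c("low", "normal", "elevated", "high"),
                 ordered = TRUE)

# Overwrite selected elements using logical conditions in brackets
bp_cat[sysbp < 90] <- "low"
bp_cat[sysbp >= 120 & sysbp <= 129.9] <- "elevated"

as.character(bp_cat)   # "low" "normal" "elevated" "high" "elevated"
```

Assignment works here only because "low" and "elevated" were already declared as levels; assigning a label that is not among the levels produces `NA` with a warning. Note also that, under this rule, a reading strictly between 129.9 and 130 would remain "normal"; if such readings could occur, the condition `sysbp < 130` would be one alternative upper bound.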

Use the space below to run code that shows counts of each level of `SYSBP_CAT2`, and then answer the following questions.

```# use this space to run code
```
```quiz(
question("How many subjects in the study have elevated systolic blood pressure?",
),
question("How many subjects in the study have low systolic blood pressure?",
)
)
```

The output of using the `table` function on `SYSBP_CAT2` reveals that the "low" level is entirely unused. We can drop this level using the `droplevels` command on our factor variable and assigning the results to the same variable, effectively overwriting it with the version that does not include "low." Try this in the space below. Verify that the level has been dropped by running the `levels` command.

```
```
```Framingham\$SYSBP_CAT2 = droplevels(Framingham\$SYSBP_CAT2)
levels(Framingham\$SYSBP_CAT2)
```
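
As a quick check of what `droplevels` keeps and removes, here is a sketch on a made-up ordered factor in which the level "low" is declared but never used. The name `bp_cat` and its values are hypothetical.

```r
# Hypothetical ordered factor with an unused level ("low")
bp_cat <- factor(c("normal", "high", "elevated", "normal"),
                 levels = c("low", "normal", "elevated", "high"),
                 ordered = TRUE)

table(bp_cat)                  # "low" is listed with a count of 0

bp_cat <- droplevels(bp_cat)   # remove levels that have no observations

levels(bp_cat)                 # "normal" "elevated" "high"
is.ordered(bp_cat)             # TRUE: the ordering is retained
```

The remaining levels keep their relative order, so the factor is still ordered after the unused level is dropped.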

## Descriptive Statistics

*Obtaining descriptive statistics about variables in a dataset is the first step of most analyses (and even the main objective in some cases!). In this section, we review how to obtain these statistics, what to do when data are missing, and what to do when analyses call for complete cases across more than one variable. See chapters 2 through 5 of Statistics Using R: An Integrative Approach for a more complete and in-depth discussion of these statistics and how to access them with R.*

### Descriptive Statistics Overview Using the `summary` Command

To get a rough idea of what the distribution of each of our variables looks like, and whether they contain missing values, we can use the `summary` command. Use the `summary` command on `Framingham` in the code box below and inspect the results.

```
```
```summary(Framingham)
```

As we can see from the output, `summary` gives us the name of each variable in `Framingham` and some basic descriptive statistics about each of them. For numeric variables, we get the minimum and maximum values, the mean and median, and the first and third quartiles (denoted "1st Qu." and "3rd Qu.", respectively). For categorical variables, we get the names of the groups and their counts. Thus, `summary` is a wonderful command for obtaining an overview of our data, but it is less convenient when you need specific statistics for only certain variables.

If any variable has missing values, there will be an additional piece of information at the bottom of the list: a count of `NA` values. `NA` stands for "not available" and is the element/entry that R uses to note a missing value in the column. We will cover more on dealing with missing values later in this section.
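
To see where the `NA` count appears, the sketch below runs `summary` on a tiny made-up data frame. The name `toy` and its columns are invented for illustration.

```r
# Hypothetical data frame with one missing age
toy <- data.frame(
  age    = c(34, 51, NA, 46),
  smoker = factor(c("No", "Yes", "No", "No"))
)

summary(toy)                      # numeric summary for age, counts for smoker

age_summary <- summary(toy$age)   # summary of a single variable
age_summary[["NA's"]]             # number of missing values: 1
```

The entry labeled `NA's` is present only when the variable has at least one missing value.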

### Descriptive Statistics with Complete Data

When there are no missing values in a dataset, it is very simple to obtain descriptive statistics about variables, such as those listed below.

• `length` returns the length of a vector/variable, giving a count of observations for that variable

• `mean` returns the mean value of a vector/variable

• `sd` returns the standard deviation of a vector/variable
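
The three commands listed above can be tried on a short made-up vector with no missing values. The name `scores` and its values are assumptions for this example.

```r
# Hypothetical scores with no missing values
scores <- c(2, 4, 4, 4, 5, 5, 7, 9)

n <- length(scores)   # number of observations: 8
m <- mean(scores)     # arithmetic mean: 5
s <- sd(scores)       # sample standard deviation (divides by n - 1)

n
m
s
```

Note that `sd` computes the sample standard deviation, so the sum of squared deviations is divided by `n - 1` rather than by `n`.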

Use the space below to run code in order to answer the following questions about variables from the `Framingham` dataset.

```# use this space to run code
```
```quiz(
question("How many observations are there for the variable `ID`?",
question("What is the mean age of subjects according to variable `AGE1`?",
question("What class of vector is `DIABP1`?",
question("What is the standard deviation of `SYSBP1`?",
)
```

### Descriptive Statistics with Missing Data

In R, when values of a variable are missing from a vector or data frame, they are represented as `NA`, meaning "Not Available." The `Framingham` dataset includes variables whose measurements were taken at a number of different time points. Because not all subjects participated in the study at all time points, we do not have values for some of the variables for some of the subjects. Rather than these spaces being left blank, the entries of variables for unavailable subjects are listed as `NA`. We can find the number of missing values in a vector/variable by running a logical statement to check if each value is `NA` and then taking the sum of the result. The code below shows how to find the number of missing values for the `AGE3` variable, the age of the subject measured at time point 3. Add code that returns the length of `AGE3`, that is, its total number of entries.

```sum(is.na(Framingham\$AGE3))
```
```sum(is.na(Framingham\$AGE3))
length(Framingham\$AGE3)
```

Even though `AGE3` contains 92 missing values, R reports the length of the vector/variable as 400, the total number of observations in the dataset. The reason for this is that each `NA` occupies an element's space in the vector and, as such, is still counted by the `length` function. To circumvent this issue, we use a function called `na.omit` on the vector to filter out the `NA` values from it. Then we feed this into the `length` function, in the same way that we fed the `is.na` result into the `sum` function above. Try this for `AGE3` in the space below.

```
```
```length(na.omit(Framingham\$AGE3))
```

Now we see that `AGE3` actually has only 308 non-missing values, not 400.
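
The relationship among the total length, the number of missing values, and the number of non-missing values can be checked on a small made-up vector. The name `age3` and its values below are hypothetical and unrelated to the real `AGE3` variable.

```r
# Hypothetical ages with two missing values
age3 <- c(52, NA, 61, 47, NA)

n_total    <- length(age3)            # counts every element, including NA: 5
n_missing  <- sum(is.na(age3))        # number of NA values: 2
n_observed <- length(na.omit(age3))   # number of non-missing values: 3

n_total == n_missing + n_observed     # TRUE
```

An equivalent way to count the non-missing values is `sum(!is.na(age3))`, which negates the logical vector before summing.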

Fortunately, many functions in R, including `mean` and `sd`, come with an optional argument `na.rm` that, when set to `TRUE`, removes all the `NA` values before running the function. In the space below, try running `mean` and `sd` for `AGE3` without the `na.rm` argument, and then with it set to `TRUE`.

```
```
```mean(Framingham\$AGE3)
sd(Framingham\$AGE3)
mean(Framingham\$AGE3, na.rm = TRUE)
sd(Framingham\$AGE3, na.rm = TRUE)
```

From the output, we can see that when there are missing values in a vector and we do not include the `na.rm` argument, R returns `NA` as the calculation's result. In order to obtain the result we seek, based on the non-missing values only, we must include the `na.rm` argument to remove the `NA` values. Alternatively, we may use the command `na.omit` to achieve the same result, as shown below.

```mean(na.omit(Framingham\$AGE3))
sd(na.omit(Framingham\$AGE3))
```
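
That the two approaches agree can be confirmed on a made-up vector. The name `age3` and its values are invented for illustration.

```r
# Hypothetical ages with two missing values
age3 <- c(52, NA, 61, 47, NA)

mean(age3)                              # NA, because missing values are present

m_narm   <- mean(age3, na.rm = TRUE)    # drop NA values inside mean
m_naomit <- mean(na.omit(age3))         # drop NA values before calling mean

s_narm   <- sd(age3, na.rm = TRUE)
s_naomit <- sd(na.omit(age3))

all.equal(m_narm, m_naomit)             # TRUE
all.equal(s_narm, s_naomit)             # TRUE
```

Both routes use only the three non-missing values, so they return identical results.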

### Descriptive Statistics for Paired Data

To find the correlation between two variables, such as height and weight, in a sample of individuals, we need both the height and the weight measure for each individual in that sample. Because each pair of height and weight values comes from a single individual, height and weight are said to be paired. When variables are paired, we must have non-missing values on both variables in order to run the analysis and obtain the results we seek. Accordingly, we need code that limits the analysis to only those rows that have non-missing values on both variables of interest (i.e., rows for which a check of whether both entries are complete returns `TRUE`).

As another example, suppose we wish to compute the correlation between diastolic blood pressure measured at time 1 (`DIABP1`) and at time 3 (`DIABP3`). Because the measurements at the two time points belong to the same person, they are considered to be paired. To limit the analysis to those individuals who have complete data on both measures, we check whether `DIABP1` and `DIABP3` have non-missing values by using the command `complete.cases`. This command does the opposite of `is.na`: `complete.cases` checks each element of a vector and returns `TRUE` if the value is non-missing and `FALSE` if it is missing.

```quiz(
question("Which logical statement returns `TRUE` if a row has non-missing values for *both* `DIABP1` and `DIABP3`?",
answer("`complete.cases(Framingham\$DIABP1) & complete.cases(Framingham\$DIABP3)`", correct = TRUE),
message = "This will return `TRUE` if *either* `DIABP1` *or* `DIABP3` are non-missing for a row."))
)
```

Use brackets and the solution from the previous question to subset the values of `DIABP1` to only those where both `DIABP1` and `DIABP3` are non-missing. Then do the same for `DIABP3`.

```
```
```Framingham\$DIABP1[complete.cases(Framingham\$DIABP1) & complete.cases(Framingham\$DIABP3)]
Framingham\$DIABP3[complete.cases(Framingham\$DIABP1) & complete.cases(Framingham\$DIABP3)]
```
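
Once both variables are restricted to the same complete rows, a paired statistic such as a correlation can be computed. The sketch below uses made-up vectors `dbp1` and `dbp3` (hypothetical names and values) and compares manual subsetting with the `use = "complete.obs"` argument of `cor`.

```r
# Hypothetical paired blood pressure readings with missing values
dbp1 <- c(80, 75, NA, 90, 85)
dbp3 <- c(82, NA, 70, 95, 88)

# TRUE only for rows where both measurements are present
both <- complete.cases(dbp1) & complete.cases(dbp3)
both                # TRUE FALSE FALSE TRUE TRUE
sum(both)           # number of complete pairs: 3

# Correlation using the manually selected complete pairs
r_subset <- cor(dbp1[both], dbp3[both])

# Same result, letting cor drop the incomplete pairs itself
r_builtin <- cor(dbp1, dbp3, use = "complete.obs")

all.equal(r_subset, r_builtin)   # TRUE
```

Passing both vectors in a single call, `complete.cases(dbp1, dbp3)`, gives the same logical vector as combining two separate calls with `&`.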

## Test Your Skills on a New Dataset

In this final section, we present a new dataset: the NELS dataset, available in your environment as `NELS`. Code boxes will be available to help answer quiz questions about the dataset using the skills learned in the previous sections. We encourage you to try to use commands from memory as much as possible, but solution code is available using the Solution button at the top of the code box in case you need assistance. Keep in mind that in R there are often multiple ways to obtain the information sought, so sometimes your approach to finding the solution will not match that of the solution code provided, even though you were still successful in finding the correct information.

### First Impressions of the Dataset

Let's start with some basic information about the dataset. Use the empty code box below to run any commands necessary to answer the quiz questions for this section. Suggested solutions are available by clicking on the Solution button on the code box.

```NELS <- sur::NELS
```
```# use this box to run code
```
```# get observation and variable counts
dim(NELS)

# check variable data classes
str(NELS)
```
```quiz(
question("How many observations are there in the NELS dataset?",
question("How many variables are there in the NELS dataset?",
question("Which **R** classes of data does the NELS dataset have? Check all that apply.",
answer("Categorical", message = "There may be categorical variables, but this is not a class of data in R"),
)
)
```

### Variable Inspection

Let's look at some of the variables more closely now.

```# use this box to run code
```
```# overview of variables (including NAs)
summary(NELS)

# or check individual variables for missing values
sum(is.na(NELS\$hwkin12))
sum(is.na(NELS\$famsize))

# maximum of slfcnc08: find within summary(NELS) or use the following
max(NELS\$slfcnc08)

# mean of ses
mean(NELS\$ses)

# mean of achsls08
mean(NELS\$achsls08, na.rm=TRUE)
mean(na.omit(NELS\$achsls08))

# non-missing achsls08
length(na.omit(NELS\$achsls08))
length(NELS\$achsls08[complete.cases(NELS\$achsls08)])

# non-missing approg
length(na.omit(NELS\$approg))
```
```quiz(
question("Which of the following variables have missing values?",
question("What is the maximum 8th grade self-concept score of students in the NELS dataset? *Hint*: You can find the names and descriptions of variables in `NELS` by entering `?NELS` into the Console window of either R or RStudio.",
question("What is the mean of the `ses` variable?",
question("Which of the following code lines will produce the mean of `achsls08`? Check all that apply.",
question("Which of the following code lines will produce the count (not including `NA` values) of `achsls08`? Check all that apply.",
question("How many non-missing values of `approg` are there?",
)
```

### Information about Subsets of the Data

Now, let's dig a little deeper and investigate more specific details about our data.

```# use this box to run code
```
```# check levels of region variable
levels(NELS\$region)

# females from Northeast, males from South
table(NELS\$region,NELS\$gender)

# mean family size of students from the West
mean(NELS\$famsize[NELS\$region=="West"])

# standard deviation of first 20 slfcnc10 (3 ways)
sd(NELS[1:20,"slfcnc10"])
sd(NELS[1:20, 10])
sd(NELS\$slfcnc10[1:20])

# 151st student cigarette use (2 ways)
NELS[151,"cigarett"]
NELS\$cigarett[151]

# complete pairs of parmarl8 and nursery
sum(complete.cases(NELS\$parmarl8) & complete.cases(NELS\$nursery))
```
```quiz(
question("How many regions does `NELS` cover?",
question("How many female students are from the Northeast?",
question("How many male students are from the South?",
question("What is the mean family size of students from the West?",
question("What is the standard deviation of `slfcnc10` for the first 20 students of the dataset?",
question("Did the 151st student in the dataset ever smoke cigarettes?",
question("How many complete pairs of observations are there for `parmarl8` and `nursery`?",
)
```

### Creating Variables

Finally, let's create some variables and answer questions related to them.

In the code box below, add a variable to `NELS` called `achmatdiff`, which is the difference in math achievement scores from 8th to 12th grade for each student. Remember that you can check variable names and descriptions by running `?NELS` in the Console window of R or RStudio. You can use the `-` operator to subtract one column from another by row.

```
```
```# create achmatdiff
NELS\$achmatdiff = NELS\$achmat12 - NELS\$achmat08
```

Now, use the code box below to run any code necessary to answer the following questions about our new variable.

```# use this box to run code
```
```# minimum, maximum, and mean
summary(NELS\$achmatdiff)

# missing values
sum(is.na(NELS\$achmatdiff))

# standard deviation
sd(NELS\$achmatdiff)
```
```quiz(
question("What is the minimum change in math achievement score?",
question("What is the maximum change in math achievement score?",
question("What is the average change in math achievement score?",
question("How many missing values are there for `achmatdiff`?",
question("What is the standard deviation of `achmatdiff`?",
)
```

Next, let's recode `achmatdiff` to a categorical variable called `achmatcat`, which has the value "negative" when `achmatdiff` has a value less than zero, and "positive" everywhere else. Check the class of `achmatcat`; if it is not a factor variable, change it so that it is. Then check that the levels are "negative" and "positive."

```
```
```NELS\$achmatcat = ifelse(NELS\$achmatdiff < 0, "negative", "positive")
class(NELS\$achmatcat)
NELS\$achmatcat = factor(NELS\$achmatcat)
levels(NELS\$achmatcat)
```

Finally, let's inspect `achmatcat` and check that we seem to have created it correctly. Use the code box below to answer the following questions.

```# use this box to run code
```
```# first 10 rows of achmatdiff and achmatcat
NELS[1:10, c("achmatdiff","achmatcat")]

# factor encoding for achmatcat
table(NELS\$achmatcat,as.numeric(NELS\$achmatcat))

# proportion positive
table(NELS\$achmatcat)
258/500

# achmatcat by region
table(NELS\$achmatcat,NELS\$region)

# average ses for "negative"
mean(NELS\$ses[NELS\$achmatcat == "negative"])
```
```quiz(
question("Print the first 10 observations of `achmatdiff` and `achmatcat`. For these 10 rows, are all negative values of `achmatdiff` paired with \"negative\" for `achmatcat`?",
question("As what number is the category \"positive\" encoded for `achmatcat`?",
question("What proportion of students from the NELS dataset showed a positive change in math achievement from 8th to 12th grade? In other words, what proportion of observations of `achmatcat` are \"positive\"?",
question("Which regions show more students with positive change than negative change in math achievement score? Check all that apply.",
question("What is the average `ses` score for those students with a negative change in math achievement score?",