(PART*) Using R and describing data {-}

Getting started with R

Install R on your computer

R is highly customizable, so that your life is a lot easier if you use your own copy rather than a shared copy on a university computer. To install it on your computer, you can follow these detailed instructions for Windows{target="_blank"} and Mac OS{target="_blank"}. Make sure to install both R and RStudio.

Then start exploring - using R is a skill that comes with practice, not with reading about it. The examples from these notes can be copy-pasted or typed into R (when you move the mouse pointer over a code box you should see a copy-button in the top-right corner).

Some key concepts

If you understand the following concepts, using R will make a lot more sense.

R, RStudio and R packages

You can think about the different parts of R in terms of a computer or smartphone. In that analogy:

For a more comprehensive introduction to R, you can watch this video:

r video_code("kvaMhkQF0H4")

Data and variables in R

Lines starting with a # are comments that are ignored by R. I use them below to explain parts of the code further.

# <- saves a value (on the right) into an object (on the left)
score <- 10

# "" need to be used around text that is not an object name
month <- "January" 

# c() combines several values into one vector that 
# can then be saved as a single object
weekend <- c("Saturday", "Sunday") 
sleep_hours <- c(4, 9)

#Variables can then be combined into dataframes (R's version of tables)
sleep_data <- data.frame(day = weekend, hours = sleep_hours)

#Variables within dataframes are accessed using the $ operator
sum(sleep_data$hours)

Variable classes indicate what kind of data a variable contains. There are four main types

The class() function shows the class of a single variable, the str() and glimpse() functions include the classes of all variables in a dataframe.

Usually, R gets the classes right by itself and converts as required. If you need to change variable classes, you can use the as.numeric(), as.character(), as.logical(), as.factor() etc. functions.

class(sleep_data$hours)
str(sleep_data)

Functions in R

All work in R is based on calling functions that do something. A function is called by its name with () behind. If you just type the name, the function is printed out - you usually don't want this.

#Some functions work without any additional arguments
timestamp() #This shows the time this code segment was run

#Functions that work with given data are more helpful - 
# that data is given as an 'argument' in brackets
print("Hello")
mean(c(1,2,3))

#Additionally, arguments can contain instructions to the function
mean(c(1,2,3, NA), na.rm = TRUE)
# NA is a missing value, na.rm tells the mean function to remove it 
# before calculating the mean

#To save the results of a function into a variable, use the <- operator again
variableName <- mean(c(1,2,3, NA), na.rm = TRUE)

To learn more about the arguments accepted by a specific function, use the ? command to open the help page, e.g., by typing ?mean. Arguments are matched by their name, as with na.rm = above; if no name is given, they are matched by their position in the function call. Usually, only very common arguments, typically the data, should be left unnamed.

Key R packages

The most essential packages for this course are included in the tidyverse - a whole set of packages to make data manipulation and visualisation in R efficient and consistent. They include:

All these packages are loaded with the library(tidyverse) command. There are some additional packages in the tidyverse that need to be loaded separately, using library(packagename). The most relevant for us are:

Using the pacman package

As an alternative to install.packages() and library(), you can also use the p_load function in the pacman package. That function checks first whether the package is installed, installs it if needed, and then loads it. I am using it in the code here so that you easily copy-paste parts of it; in your own code, the choice is up to you. If you want to use pacman, just run install.packages("pacman") and then use pacman::p_load(package1, package2) to load your packages.

Using a single function in a package

If you just need to use one function from a package, you do not need to load the whole package. You can also give R the whole address and use package::function(). I will only do that for pacman::p_load() since that is the only function from that package we will ever use in this course.

RMarkdown

A key strength of R are reproducible analyses. For that, we don't want to type commands and just write down the results, but rather create scripts that can produce results again and again. That also makes changes much less painful, for instance if you've made a mistake at the start.

RMarkdown (.Rmd) files are such scripts that additionally allow to add text and split the code into separate chunks. Within them, grey chunks are code, anything else is not executed. It is just formatted (e.g., with headings and highlights) when you create the report. Markup symbols allow for easy formatting of the text, as per the instructions here. (Most of that is not required in this course, a couple of key symbols such as # to indicate headings will be introduced later.)

Within .Rmd files, you need to Insert new code blocks for the major parts of the analysis. For that, use the insert button at the top of the File Editor window. When you want to test your code, you can also Run it there or by pressing Ctrl+Enter in the line you want to use (or after you have selected multiple lines to run). Finally, to create the report you need to Knit it, again using the button in the File Editor. Without installing anything else, you can only knit to HTML files, but those can be opened in any internet browser, so are good to share with anyone.

Where is my output?

When using RMarkdown files, you might sometimes run code and not see the results you expect in the console. That is because anything visual is shown in the Rmd file below the codeblock (e.g., charts, dataframes). By default, only text output is shown in the console. If you want to see all the output there, click on the gear next to Knit and select Chunk Output in Console.

HELP! What to do when problems appear?

When red text appears in the console, don't panick. Some messages that are often entirely irrelevant are displayed in red, for example when packages are loaded. Warnings are more important, but they just alert you that something might be going wrong - read them, but then decide whether you need to worry. For instance, ggplot prints a warning when data is missing, which might or might not be a problem for you. Only Errors always need to be dealt with as they stop your commands from being executed and the .Rmd file from being knitted. Unfortunately, many R functions give rather cryptic error messages - if you are stuck, first check your code for typos, then Google the error message.

Many errors occur due to common mistakes, that you need to pay attention to:

When you are stuck, it also often helps to Restart R (which unloads all packages) and Clear your workspace (which deletes all objects) - both options are found in the Session menu in RStudio. I would also recommend turning off the option that RStudio saves your Environment when you close it. While that means that you have to explicitly save data or rerun your code when you continue, it can save you a lot of hassle that comes from objects left over from earlier sessions. To turn that off:

  1. Choose Tools > Global Options in RStudio
  2. In the Options, look for Save workspace to .RData on exit and change it to Never.
  3. Click OK to apply the change and close the Options window.

As you progress in R, do not try to remember everything - instead, look things up as needed, in particular things like argument names:

There is also a very active R user community that is willing to help - stackoverflow.com{target="_blank"} is a good place to ask questions about R Code, while stats.stackexchange.com{target="_blank"} is the place to go with questions that are more generally about statistics. In both cases, it pays off to first google and then post questions that are as specific as possible.



LukasWallrich/GoldCoreQuants documentation built on Nov. 27, 2021, 1:58 a.m.