In ideas42/tools42: Tools for Analysis at ideas42

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

Executive Summary

You can use this space to explain what this report is about. What were your objectives, what you found, and how you did it.

Project metadata:

Remember that you can change the project code and type, psychologies used, and testing design in the header. They will automatically show up here:

Project code: r rmarkdown::metadata$project_code
Evaluation: This project uses a r stringr::str_to_upper(rmarkdown::metadata$evaluation) strategy for testing.
Project type: This is a r stringr::str_to_upper(rmarkdown::metadata$project_type) project.
Psychologies used: r rmarkdown::metadata$psychologies

Metadata is useful to the project team but also great for book-keeping. It collects valuable information on our practices that can be used to better inform our future work. Fill it out carefully, and remember to update it whenever it changes!

Data importing and cleaning

Typically, you will receive a dataset that is most likely in an Excel (.xlsx or .xls) or .csv (comma-separated values) format, which you will have to load and clean before analyzing it. In this template we use the penguins sample dataset available in the palmerpenguins package, so we don't actually read in any data. Just remember that you can easily do that with the read_csv (for csv files), read_excel, or read_dta (for Stata files) functions. If you want to learn more about reading in data, a tutorial is available on our knowledge-sharing platform.

First, we load all the packages we'll need:

library(palmerpenguins)
library(visdat)
library(skimr)
library(tidyverse)
library(tools42)

# Using the penguins dataset
data(penguins)

A good way to start is by looking at what variables are available in our dataset, as well as the number of observations. The skim function in the skimr package is a quick and easy way to summarize this. Consider it a beefed-up replacement of the Stata summarize command. You'll see you also get tiny histograms for your continuous data. Visualizing your data is critical and you should always, always spend some time plotting it. You won't understand your data unless you plot it.

skim(penguins)

Great. Looks like most of our data is complete and falls under the expected ranges (that is, there are no crazy values for any of the variables). Let's get a sense of which variables are continuous and which are categorical (character or factor).

We can do that with the vis_dat function, which plots every observation and color-codes it by variable type. We could even make it look nicer by adding the i42 theme and color scheme, since this is a ggplot object.

vis_dat(penguins_raw)

Isn't this cool? There's also a function to see the degree of missingness for each of our variables. It's called vis_miss:

vis_miss(penguins_raw)

Exploratory Analysis

Ok, let's look at the relationships between variables.

This is interesting. I wonder if there's any differences by species!

penguins %>%
  viz_scatter(flipper_length_mm, bill_length_mm) +
  aes(color = species) +
  theme_42() +
  scale_color_manual(values = palette_42("i42_bright"))

Now, a histogram of flipper length:

penguins %>%
  viz_hist(flipper_length_mm) +
  aes(fill = species) +
  facet_wrap(~species) +
  theme_42() +
  scale_fill_manual(values = palette_42("i42_bright"))