knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

This tutorial will show you how to remove unwanted columns and subset particular rows from a dataset.

The shroom package contains a dataset with thousands of data points collected by over 200 students in 14 class sections researching 18 different alleles of 6 different genes For many analyses and explorations of this data we'll want to consider a subset of the data, such as data from:

  1. A single student in a single section of the course
  2. All 3 alleles studied within a single section
  3. All 18 alleles studied within a section
  4. A single gene across the 14 section

Preliminaries

Loading the shroom package & wingscores data

If you haven't already, download the shroom package. See the vignette "Loading the shroom package" for details.

Load shroom into your current R session using library()

library(shroom)

Load the wingscores data into your current R sessions

data("wingscores")

Load packages

We'll use tools from the "tidyverse" to subset the data, particularly the dplyr functions

Loading the dplyr package

We'll also need the dplyr package.

Download it if you haven't already.

## don't run this if you have already downloaded dplyr before
# (remove "#" to run code)

# install.packages("dplyr")

Load dplyr into your current R sessions

library(dplyr)

Getting to know wingscores data

The wingscores data is big: 7754 rows by 16 columns

dim(wingscores)
nrow(wingscores)
ncol(wingscores)

Each row is data on a single fly scored for "wing crinklieness" by a student single studnet.

Many of the columns are not necessary for carrying out an analysis. For example the column "file" is just the original Excel file that the student entered their data in to.

summary(wingscores$file, 5)

Subsetting the shroom wingscores data

Many of the columns in the dataset are not informative for any kind of analysis. For simple, preliminary data exploration and analyses we also don't want all ~7500 rows of data.

What we'll do next is

Remove un-needed columns with dplyr::select

We can remove a single column by using the syntax "select(-column.name)", with a "-" in front of the column name to indicate we want it removed. Note that the word file does not have quotes around it.

wingscores.no.file.name <- wingscores %>% 
  select(-file)

The original dataframe had 16 columns; without the "file" column it just has 15.

ncol(wingscores)
ncol(wingscores.no.file.name)

We can just write out all the columns we don't want, putting a "-" before each column name. Again, no quotation marks involved.

wingscores2 <- wingscores %>% 
                   select(-gene,
                          -seat,
                          -section,
                          -loaded,
                          -stock.num,
                          -file,
                          -temp.C,
                          -group,
                          -gene.allele)

The dataframe now has only 7 columns

dim(wingscores2)

Subset particular rows with filter

There are about 250 students and each is assigned a sequential ID number.

summary(wingscores2$student.ID)

If we just want to look at the #1's (the 1st student) we can use the filter() function. Note the double equals signs ("==") means "equals to" and is different than a single equals sign "=".

wingscores_s1 <- wingscores2 %>% 
   filter(student.ID ==  1)

This results in a dataframe with just the 60 flies which student 1 worked on.

dim(wingscores_s1)

Subsetting by multiple conditions use "&"

We can subset by multiple conditions using "&" between conditions. Let's subset just the female (karyotype of XX, coded "F" in the data) flies from student 1.

wingscores_s1F <- wingscores_s1 %>% 
                     filter(student.ID ==  1 &
                                   sex == "F")

There's one row of data for each fly. Student 1 looked at just 24 flies.

dim(wingscores_s1F)

Subsetting by multiple conditions use "%in%"

We can also subset by multiple conditions using %in% and a vector of possibilities.

Students 1, 2 and 3 are all in section 1 of the lab and all worked on different alleles of the same gene, able. I can select them by filtering by all student IDs "in" a vector (a type of R object) containing the numbers 1, 2 and 3.

The function "in" is a bit odd: %in%

A vector possibilities would look like this: c(1,2,3)

wngscrs_sxn1_able <- wingscores.no.file.name %>% 
                     filter(student.ID %in% c(1,2,3))

These 3 students together looked 159 flies

dim(wngscrs_sxn1_able)

This code below is equivalent to what we just did, using c(1:3) instead of c(1,2,3)

wngscrs_sxn1_able <- wingscores.no.file.name %>% 
                     filter(student.ID %in% c(1:3))

Similarly, I can subset the first 18 students, who all were in the same lab section. "c(1:18)" Produces a vector containing all numbers 1 to 18.

wngscrs_sxn1 <- wingscores.no.file.name %>% 
                     filter(student.ID %in% c(1:18))



brouwern/shroom documentation built on May 24, 2019, 7:54 a.m.