knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This tutorial will show you how to remove unwanted columns and subset particular rows from a dataset.
The shroom package contains a dataset with thousands of data points collected by over 200 students in 14 class sections researching 18 different alleles of 6 different genes For many analyses and explorations of this data we'll want to consider a subset of the data, such as data from:
If you haven't already, download the shroom package. See the vignette "Loading the shroom package" for details.
Load shroom into your current R session using library()
library(shroom)
Load the wingscores data into your current R sessions
data("wingscores")
We'll use tools from the "tidyverse" to subset the data, particularly the dplyr functions
We'll also need the dplyr package.
Download it if you haven't already.
## don't run this if you have already downloaded dplyr before # (remove "#" to run code) # install.packages("dplyr")
Load dplyr into your current R sessions
library(dplyr)
The wingscores data is big: 7754 rows by 16 columns
dim(wingscores) nrow(wingscores) ncol(wingscores)
Each row is data on a single fly scored for "wing crinklieness" by a student single studnet.
Many of the columns are not necessary for carrying out an analysis. For example the column "file" is just the original Excel file that the student entered their data in to.
summary(wingscores$file, 5)
Many of the columns in the dataset are not informative for any kind of analysis. For simple, preliminary data exploration and analyses we also don't want all ~7500 rows of data.
What we'll do next is
We can remove a single column by using the syntax "select(-column.name)", with a "-" in front of the column name to indicate we want it removed. Note that the word file does not have quotes around it.
wingscores.no.file.name <- wingscores %>% select(-file)
The original dataframe had 16 columns; without the "file" column it just has 15.
ncol(wingscores) ncol(wingscores.no.file.name)
We can just write out all the columns we don't want, putting a "-" before each column name. Again, no quotation marks involved.
wingscores2 <- wingscores %>% select(-gene, -seat, -section, -loaded, -stock.num, -file, -temp.C, -group, -gene.allele)
The dataframe now has only 7 columns
dim(wingscores2)
There are about 250 students and each is assigned a sequential ID number.
summary(wingscores2$student.ID)
If we just want to look at the #1's (the 1st student) we can use the filter() function. Note the double equals signs ("==") means "equals to" and is different than a single equals sign "=".
wingscores_s1 <- wingscores2 %>% filter(student.ID == 1)
This results in a dataframe with just the 60 flies which student 1 worked on.
dim(wingscores_s1)
We can subset by multiple conditions using "&" between conditions. Let's subset just the female (karyotype of XX, coded "F" in the data) flies from student 1.
wingscores_s1F <- wingscores_s1 %>% filter(student.ID == 1 & sex == "F")
There's one row of data for each fly. Student 1 looked at just 24 flies.
dim(wingscores_s1F)
We can also subset by multiple conditions using %in% and a vector of possibilities.
Students 1, 2 and 3 are all in section 1 of the lab and all worked on different alleles of the same gene, able. I can select them by filtering by all student IDs "in" a vector (a type of R object) containing the numbers 1, 2 and 3.
The function "in" is a bit odd: %in%
A vector possibilities would look like this: c(1,2,3)
wngscrs_sxn1_able <- wingscores.no.file.name %>% filter(student.ID %in% c(1,2,3))
These 3 students together looked 159 flies
dim(wngscrs_sxn1_able)
This code below is equivalent to what we just did, using c(1:3) instead of c(1,2,3)
wngscrs_sxn1_able <- wingscores.no.file.name %>% filter(student.ID %in% c(1:3))
Similarly, I can subset the first 18 students, who all were in the same lab section. "c(1:18)" Produces a vector containing all numbers 1 to 18.
wngscrs_sxn1 <- wingscores.no.file.name %>% filter(student.ID %in% c(1:18))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.