knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(learnr)
library(tidyverse)
library(useful)
tutorial_options(exercise.timelimit = 60, exercise.blanks = "___+")
Working in RStudio with notebooks
Loading and inspecting data
Working with vectors
Working with matrices
Working with dataframes (the old fashioned way)
Enter commands directly into the console
Write 'scripts': set of R code to execute as a unit (.R files)
Write 'packages': sets of functions that others can use
Write analysis 'notebooks' (R markdown: .Rmd files)
Easily share results and methods in different formats
Encourages good code and analysis practices
Different components of the RStudio interface
Running code lines
Running code chunks
Creating new chunks
Creating, inspecting and clearing variables
There are many good RStudio video tutorials, e.g. https://youtu.be/kfcX5DEMAp4
Why?
Others can understand what you've done (including future you)
Reproducibility and open science
Clarify your logical thought process while writing code
How?
Write descriptive text before each code chunk in R notebooks:
Describe data used, analysis approach
What are you trying to test/show with a given analysis?
Also good to include comments in your code for specifics/lightweight explanation
# add a fixed offset to avoid negative values
new_data <- old_data + offset
# now normalize so max value is 1
new_data <- new_data / max(new_data)
new_data <- my_pipeline_function(new_data, outliers = 'ignore') #drop 6-sigma outliers
Organize your work as 'Projects' in RStudio
Each project has a separate folder, with data, code, results.
You will practice setting up a new project for today's assignment.
Small-scale data are typically stored as text files
Different 'rows' of data appear on different lines
Values for a given 'row' are typically separated by either a comma, or a tab
name,age,member_since
bob,20,2015
frank,40,2015
tammy,15,2016

name	age	member_since
bob	20	2015
frank	40	2015
tammy	15	2016
| Data type | Extension | Function | Package |
|:------------------------|:----------|:------------------|:-------------------|
| Comma separated values | csv | read.csv() | utils (default) |
| | | read_csv() | readr (tidyverse) |
| Tab separated values | tsv | read_tsv() | readr |
| Other delimited formats | txt | read.table() | utils |
| | | read_table() | readr |
| | | read_delim() | readr |
| Excel | xlsx, xls | read_excel() | readxl (tidyverse) |
For loading data tables, we suggest:
Use read_delim (part of the tidyverse megapackage) for text files (.csv, .tsv, .txt)
Use read_excel for loading .xls, .xlsx files
We'll discuss more about loading data matrices later.
For example:
my_table <- read_delim('my_data_file.csv')
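As a minimal, self-contained sketch of the same idea, the snippet below writes the small example table to a temporary file and reads it back. It assumes base R's read.csv() (listed in the table above) instead of read_delim(), so that it runs without extra packages; the temporary file path is only for illustration.

```r
# Write the small example table to a temporary CSV file (illustrative path)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age,member_since",
             "bob,20,2015",
             "frank,40,2015",
             "tammy,15,2016"), csv_path)

# Read it back into a data frame and take a first look
my_table <- read.csv(csv_path)
print(my_table)
str(my_table)
```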
File locations ('paths') can be specified in absolute or relative terms
An 'absolute' path specifies where exactly in your computer's 'directory structure' a given file lives
For example:
imported_data <- read_csv("/Users/jmmcfarl/BootCamp/cp_r_bootcamp/data/data_file.csv")
data_file.csv lives inside the data folder, inside cp_r_bootcamp, etc.
But absolute paths vary from computer to computer, which makes sharing code harder.
This will look for the file inside your 'working directory'
imported_data <- read_csv('data_file.csv')
What's the working directory and how does it get set?
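A short sketch of how to check this: getwd() reports the current working directory, and relative paths are resolved from it. When you open an RStudio project, the working directory is set to the project folder. The file name below is just the illustrative one used above; the file does not need to exist for this snippet to run.

```r
# Show the current working directory; relative paths are resolved from here
wd <- getwd()
print(wd)

# A relative path, and the absolute path it currently resolves to
rel_path <- file.path("data", "data_file.csv")
abs_path <- file.path(wd, rel_path)
print(rel_path)
print(abs_path)
```

setwd() can change the working directory, but it hard-codes a location on your computer, which brings back the sharing problem of absolute paths.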
Organize project-specific data in a 'data' subfolder and use RStudio projects
Use the here function from the here package to find the right relative path within the project.
library(here)
here('data', 'my_data_file.csv')
You can use it like this:
imported_csv <- read_delim(here('data', 'data_file.csv'))
imported_tsv <- read_delim(here('data', 'tab_separated_table.tsv'))
Load the second sheet from an xlsx file
my_table <- read_excel(here('data', 'my_metadata_file.xlsx'), sheet = 2)
This uses the readxl package, which is installed with the tidyverse but not attached by library(tidyverse), so load it with library(readxl)
For matrices:
The size is often too large for read_delim to work well.
The first column is special: 'row names'
Let's use fread from the data.table package
fread creates a slightly different type of table than we want. as_tibble converts to the right kind.
library(data.table)
counts_matrix <- fread(here('data', 'counts_rpkm.csv'))
counts_matrix <- as_tibble(counts_matrix) # convert to a 'tibble'
corner(counts_matrix)
Then we turn one of the columns into row names and convert the result into our data matrix
counts_matrix <- column_to_rownames(counts_matrix, var = 'Gene') # set Gene column to rownames
counts_matrix <- as.matrix(counts_matrix) # formally make it a matrix, not really necessary
corner(counts_matrix)
Always start with some quick inspection of data after reading it into R
Check for:
Column (always) and row (matrix only) names are correct
Make sure data types of columns are correct
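The checks above can be sketched with base R alone. This example assumes the built-in iris data set as a stand-in for a table you have just loaded.

```r
# Use a built-in data set as a stand-in for freshly loaded data
df <- iris

# Are the column names what we expect?
col_names <- colnames(df)
print(col_names)

# Does each column have the right data type?
col_types <- sapply(df, class)
print(col_types)

# How many rows and columns did we get?
print(dim(df))
```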
Useful inspection functions:
View(df): open spreadsheet viewer
head(df): show the first rows (6 by default)
glimpse(df): quick overview of a table
useful::corner(mat): show upper left corner of matrix
dim(mat): number of rows and columns
rownames(mat)/colnames(mat): extract row/column names
If you're having trouble with the above data loading strategies
Or you want to load a data file that's not part of your local project directory (and don't want to bother with file paths)
You can import files using the RStudio Import Dataset tool (see here for more info).
By position
By logical conditions
By name
age <- c(15, 22, 45, 52, 73, 81)
age[5]
idx <- c(3,5,6) # create vector of the elements of interest
age[idx]
| Operator | Description |
| :-----------:|:-------------|
| > | greater than |
| >= | greater than or equal to |
| < | less than |
| <= | less than or equal to |
| == | equal to |
| != | not equal to |
| & | and |
| \| | or |
age > 50
log_idx <- age > 50
age[log_idx] # same as age[age > 50]
age == 52
age[age == 52]
age[age != 52]
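The & and | operators from the table can combine several conditions into one logical index. A small sketch using the same age vector (the variable names middle and extremes are only illustrative):

```r
age <- c(15, 22, 45, 52, 73, 81)

# Keep values that satisfy BOTH conditions
middle <- age[age > 20 & age < 60]
print(middle)

# Keep values that satisfy EITHER condition
extremes <- age[age < 20 | age > 80]
print(extremes)

# which() converts a logical vector into the matching positions
positions <- which(age > 50)
print(positions)
```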
Can assign names to each element in a vector
age <- c(Allice = 15, Bob = 22, Charlie = 45, Dan = 52)
age
names(age)
Can also set names on a given vector
age <- c(15, 22, 45, 52)
names(age) <- c('Allice', 'Bob', 'Charlie', 'Dan')
age
Selecting elements by name is generally unambiguous
age[c('Bob', 'Charlie')]
Vectors are often unnamed.
Names are most useful when working with lists (more to come)
R has another special value, NULL, which represents 'the absence of a value'.
For example
my_vec <- c(1,2,3)
names(my_vec)
Note: Subtly different from NA, but don't worry about that now.
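For the curious, here is a minimal sketch of the difference: NULL means nothing is there at all, while NA is a placeholder for one missing value. The variable names are only illustrative.

```r
my_vec <- c(1, 2, 3)

# An unnamed vector has no names attribute, so names() returns NULL
no_names <- names(my_vec)
print(is.null(no_names))

# NULL has length zero; NA is a single (missing) element
print(length(NULL))
print(length(NA))

# NA keeps its place inside a vector
with_missing <- c(1, NA, 3)
print(is.na(with_missing))
```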
x <- c(4, 2, 3, 5, 1)
sort(x)
sort(x, decreasing = TRUE)
Indexing can also be used to reorder data
teaching_team <- c("Mary", "Meeta", "Radhika")
reorder_teach <- teaching_team[c(3, 1, 2)] # Saving the results to a variable
reorder_teach
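Sorting and reordering by index can be combined. The base R function order(), which is not covered in the original material, returns the positions that would sort a vector, so you can use it as an index. A small sketch with an illustrative named vector:

```r
age <- c(Bob = 22, Allice = 15, Dan = 52)

# order() gives the positions of the elements from smallest to largest
idx <- order(age)
print(idx)

# Indexing with those positions sorts the vector and keeps the names
age_sorted <- age[idx]
print(age_sorted)

# The same positions can reorder any other vector of the same length
schools <- c("Harvard", "MIT", "BU")
schools_by_age <- schools[idx]
print(schools_by_age)
```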
Same as subsetting vectors, but can be applied to both rows and columns!
General formula:
matrix[row_set, column_set]
row_set and column_set can be individual elements or vectors
Leave either blank if you want all rows/columns, e.g.:
matrix[row_set, ] # keeps all columns
matrix[, column_set] # keeps all rows
counts_mat <- fread(here('data', 'counts_rpkm.csv'))
counts_mat <- as_tibble(counts_mat)
counts_mat <- column_to_rownames(counts_mat, var = 'Gene')
counts_mat <- as.matrix(counts_mat)
useful::corner(counts_mat)
counts_mat[3, ]
counts_mat[, 3]
counts_mat[1:4, 2:3]
Use colnames(mat) and rownames(mat) to get/set the column and row names of a matrix.
Data for a specified set of rows (genes)
counts_mat[c('ENSMUSG00000000028', 'ENSMUSG00000000037'), ]
counts_mat['ENSMUSG00000000001', c('sample7', 'sample5')]
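If you don't have the counts file at hand, the same indexing patterns can be tried on a toy matrix. The gene and sample names below are made up for illustration.

```r
# A small 3 x 4 matrix with hypothetical row and column names
toy_mat <- matrix(1:12, nrow = 3,
                  dimnames = list(c('geneA', 'geneB', 'geneC'),
                                  c('sample1', 'sample2', 'sample3', 'sample4')))
print(toy_mat)

# One row by name, one column by name
row_b <- toy_mat['geneB', ]
col_3 <- toy_mat[, 'sample3']

# A block by position
block <- toy_mat[1:2, 2:3]

# Rows and columns by name; the result follows the order you ask for
picked <- toy_mat['geneC', c('sample4', 'sample2')]

# drop = FALSE keeps a single row as a 1-row matrix instead of a vector
row_as_mat <- toy_mat[2, , drop = FALSE]
```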
A <- c(1, 2, 3, 4)
B <- c(3, 4, 5, 6)
A %in% B
Useful for restricting to the intersection of elements in two vectors
A <- c(1, 2, 3, 4)
B <- c(3, 4, 5, 6)
A[A %in% B]
Or checking whether any/all elements of A are contained in B
any(A %in% B)
all(A %in% B)
intersect(A, B): return the elements in both A and B
setdiff(A, B): return the elements in A that are not in B
union(A, B): return the elements in either A or B
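Applied to the A and B vectors used above, these functions behave as follows (the result variable names are only illustrative):

```r
A <- c(1, 2, 3, 4)
B <- c(3, 4, 5, 6)

in_both <- intersect(A, B)   # elements found in A and in B
only_in_A <- setdiff(A, B)   # elements of A that are missing from B
in_either <- union(A, B)     # all distinct elements of A and B

print(in_both)
print(only_in_A)
print(in_either)

# Note that setdiff() is not symmetric
only_in_B <- setdiff(B, A)
print(only_in_B)
```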
x <- counts_mat['ENSMUSG00000081010',]
hist(x)

x <- counts_mat['ENSMUSG00000081010',]
y <- counts_mat['ENSMUSG00000000037',]
plot(x,y)
A list of lists
people <- list(
  Allice = list(age = 20, height = 50, school = 'MIT'),
  Bob = list(age = 10, height = 30, school = 'Harvard'),
  Charlie = list(age = 40, height = 60, school = 'BU'),
  Frank = c(age = 10, height = 2)
)
people[[2]]
With lists it's especially useful to access elements by name
people[['Bob']]
Another (equivalent) way is to use the $ symbol. This is nice because it works with 'tab-complete'
people$Bob
people[['Allice']][['school']]
$ makes it easier to read
people$Allice$height
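One detail worth seeing in code: single brackets on a list return a smaller list, while double brackets (or $) return the element itself. A minimal sketch with a shortened version of the people list:

```r
people <- list(
  Allice = list(age = 20, height = 50, school = 'MIT'),
  Bob = list(age = 10, height = 30, school = 'Harvard')
)

# Single brackets: still a list, containing just the 'Bob' entry
sub_list <- people['Bob']
print(names(sub_list))
print(length(sub_list))

# Double brackets: the 'Bob' entry itself
bob <- people[['Bob']]
print(bob$school)

# $ and [[ ]] give the same result
print(identical(people$Bob, people[['Bob']]))
```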
species <- c("ecoli", "human", "corn")
glengths <- c(4.6, 3000, 50000)
Create a tibble (dataframe, table) using the tibble() function.
Input as many columns as you want (must have same length)
df <- tibble(species, glengths)
df
You can name the columns
df <- tibble(animal_species = species, genome_lengths = glengths)
df
data.frame() is the old way, gives a similar table
df <- data.frame(species = species, glengths = glengths)
df
But in R versions before 4.0 it turns strings into 'factors' by default, and it has some other subtle differences.
Use tibble() to avoid some rare but confusing issues
Sometimes we'll use the terms 'dataframe', 'tibble', and 'table' interchangeably though.
df <- tibble(species, glengths)
You can access a column from a tibble as if it were a list of vectors (it is):
df$species
df[['species']]
You can access rows and columns of dataframes like matrix indexing
df[2,]
df[2:3, 'glengths']
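Here is a small sketch of one of the subtle differences between the two table types. It assumes a base data.frame so it runs without the tidyverse: selecting a single column with matrix-style indexing returns a plain vector unless you add drop = FALSE, whereas the same call on a tibble returns a one-column tibble.

```r
species <- c("ecoli", "human", "corn")
glengths <- c(4.6, 3000, 50000)
df_base <- data.frame(species, glengths)

# Selecting a row returns a one-row data frame
row2 <- df_base[2, ]
print(row2)

# Selecting a single column from a data.frame drops down to a vector
col_vec <- df_base[2:3, 'glengths']
print(col_vec)

# drop = FALSE keeps the result as a data frame
col_df <- df_base[2:3, 'glengths', drop = FALSE]
print(col_df)
```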
Writing data is just the inverse of reading it
write_csv to save tables
write.csv to save matrices (keeps row names)
Use the here function to specify the path to save the file
write_csv(df, here('results', 'my_dataframe.csv'))
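For matrices, a round trip with base R shows that the row names survive. This sketch writes to a temporary file rather than a project 'results' folder, and the gene and sample names are made up for illustration.

```r
# A small matrix with hypothetical row and column names
mat <- matrix(1:4, nrow = 2,
              dimnames = list(c('gene1', 'gene2'), c('sample1', 'sample2')))

# write.csv stores the row names as the first column
out_path <- tempfile(fileext = ".csv")
write.csv(mat, out_path)

# When reading back, tell read.csv that column 1 holds the row names
mat_back <- as.matrix(read.csv(out_path, row.names = 1))
print(mat_back)
```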
Reading data in R is kinda a mess, but stick with read_delim for tables, read_excel for Excel files, and data.table::fread for matrices
You can extract data from a vector by numeric index, name, or using logical conditions
Matrices and data frames work the same way, you just specify both rows and columns
Lists are a bit weird, but use $ to pull out elements by name