options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(fig.align = "center", fig.width = 5, fig.height = 5, warning = FALSE, message = FALSE)
library(xaringanthemer)
duo_accent(
  primary_color = "ivory",
  secondary_color = "#310A31",
  header_font_google = google_font("Roboto", "400"),
  text_font_google = google_font("Lato", "300"),
  code_font_family = "Fira Code",
  code_font_url = "https://cdn.rawgit.com/tonsky/FiraCode/1.204/distr/fira_code.css",
  header_color = "#f54278",
  title_slide_text_color = "#354a66"
)
These slides assume your working directory contains the data folder, which holds the following files:

- Loans.csv
- ChicagoEmployees.csv
One way to ensure that you are in that directory is to open the week2.Rproj file (after downloading the project folder week2 from elearning).
You can do this in RStudio via File --> Open Project, then navigating to the location you downloaded the folder to.
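A quick way to confirm that the working directory is right is to check that the two files are visible from it. This is only a sketch: the helper name check_data_files is made up for illustration, and it assumes the files sit in a folder called data, as described above.

```r
# Hypothetical helper (not part of the course materials): reports whether
# each expected file can be found relative to the current working directory.
check_data_files <- function(files = c("data/Loans.csv",
                                       "data/ChicagoEmployees.csv")) {
  found <- file.exists(files)
  names(found) <- files
  found
}

getwd()             # shows the current working directory
check_data_files()  # TRUE for each file that can be found
```

If either entry is FALSE, the working directory is probably not the project folder.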
Here are the first few rows of the data frame (using the head function):
head(iris)
Btw, there is also a tail function that lets you look at the last few rows of the data.
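As a small illustration, both head and tail accept a second argument giving the number of rows to show. The choice of 3 rows here is arbitrary.

```r
# Last 3 rows of the iris data frame
last_rows <- tail(iris, 3)
last_rows

# head() takes the same argument for the first rows
first_rows <- head(iris, 3)
first_rows
```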
summary(iris)
--
It provides summary statistics for the columns (variables) in the data frame.
How would you calculate some of those summary statistics on your own? For example, the mean and median of the sepal widths:
--
First, mean:
mean(iris$Sepal.Width)
--
Then, median:
median(iris$Sepal.Width)
The mean and the median look very close to each other.
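The other entries that summary reports for a numeric column (minimum, maximum and quartiles) can be reproduced in the same way. A minimal sketch using base R functions; the variable names are only for illustration:

```r
x <- iris$Sepal.Width

# Smallest and largest sepal widths
sw_range <- c(min(x), max(x))
sw_range

# First and third quartiles (25th and 75th percentiles)
sw_quartiles <- quantile(x, probs = c(0.25, 0.75))
sw_quartiles
```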
hist(iris$Sepal.Width)
--
hist(iris$Sepal.Width, xlab = "Sepal Width", main = "Iris Data Set", col = "royalblue")
plot(iris$Sepal.Width, iris$Sepal.Length)
cor(iris$Sepal.Width, iris$Sepal.Length)
--
This is a weakly negative correlation, i.e. length tends to decrease slightly as width increases (and vice versa).
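To see where the number comes from, the Pearson correlation (the default method of cor) is the covariance of the two variables divided by the product of their standard deviations. A short sketch that checks this against cor:

```r
x <- iris$Sepal.Width
y <- iris$Sepal.Length

# Correlation computed from its definition
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_manual

# Should agree with the built-in function (up to floating point error)
all.equal(r_manual, cor(x, y))
```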
--
Here's the scatterplot:
plot(iris$Petal.Width, iris$Petal.Length)
--
cor(iris$Petal.Width, iris$Petal.Length)
var(iris$Petal.Length)
--
sd(iris$Petal.Length)
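The two results are related: the standard deviation is the square root of the variance. A quick check (variable names are illustrative):

```r
pl_var <- var(iris$Petal.Length)
pl_sd <- sd(iris$Petal.Length)

# sqrt of the variance should match sd()
all.equal(sqrt(pl_var), pl_sd)
```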
boxplot(Sepal.Width ~ Species, data = iris)
The list.files function can be used to see the files in a directory.
list.files("data")
--
This shows the 2 data sets in the data directory.
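list.files also takes a pattern argument (a regular expression), which helps when a folder holds more than just data files. The sketch below builds a throwaway folder with made-up file names so that it runs anywhere; with the course data you would pass "data" instead.

```r
# Create a temporary folder with three empty, made-up files
demo_dir <- file.path(tempdir(), "list_files_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("a.csv", "b.csv", "notes.txt")))

# Only the files whose names end in .csv
csv_files <- list.files(demo_dir, pattern = "\\.csv$")
csv_files

# full.names = TRUE returns paths that can be passed straight to read.csv
list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
```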
loans <- read.csv("data/Loans.csv", stringsAsFactors = FALSE)
--
The loans data set has 10,000 rows and 17 columns:
dim(loans)
If you are using RStudio, you should see it in the Environment window (and you can browse it in the source pane by clicking on the name loans).
An extremely useful function is table, which gets counts of categorical variables.
--
table(loans$homeownership)
table(loans$app_type, loans$homeownership)
--
addmargins(table(loans$app_type, loans$homeownership))
ownapps <- table(loans$app_type, loans$homeownership)
addmargins(ownapps)
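Counts are often easier to compare as proportions, which prop.table computes from a table. This sketch uses the built-in mtcars data rather than loans so that it runs without the data folder; the same calls should work on a table such as ownapps.

```r
# Two-way table of counts: cylinders (rows) by transmission type (columns)
cyl_by_am <- table(mtcars$cyl, mtcars$am)

# Proportions within each row (each row sums to 1)
row_props <- prop.table(cyl_by_am, margin = 1)
round(row_props, 2)

# Proportions of the grand total (all cells sum to 1)
overall_props <- prop.table(cyl_by_am)
round(overall_props, 2)
```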
First get the counts of the homeownership levels:
ownership_counts <- table(loans$homeownership)
--
barplot(ownership_counts)
--
We can also use the 2D table, ownapps, created previously:
--
barplot(ownapps)
--
Real answer: Yes, you can customise plots as much as you want in R.
--
barplot(ownapps, beside = TRUE)
--
First, we will make a vector of colours. (Most common colour names are available):
--
colours <- c("blue", "yellow")
--
Now to apply it to the plot, using the col argument:
--
barplot(ownapps, beside = TRUE, col = colours)
barplot(ownapps, beside = TRUE, col = colours)
legend("top", legend = rownames(ownapps), fill = colours)
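As an alternative to a separate legend call, barplot can draw the legend itself through its legend.text and args.legend arguments. A sketch on the built-in mtcars data (so it runs without loans); the table and colour names are chosen only for illustration:

```r
# Rows: transmission type (am), columns: number of cylinders
am_by_cyl <- table(mtcars$am, mtcars$cyl)
bar_cols <- c("blue", "yellow")

# legend.text = TRUE uses the row names of the table as legend labels;
# args.legend passes extra settings on to legend()
bar_mids <- barplot(am_by_cyl, beside = TRUE, col = bar_cols,
                    legend.text = TRUE,
                    args.legend = list(x = "top", title = "am"))
```

barplot returns the bar midpoints on the x-axis, which is useful if you later want to add labels above the bars.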
ownapps
--
rownames(ownapps)
--
colnames(ownapps)
new_colours <- c("#569BBD", "#F4DD00")
barplot(ownapps, beside = TRUE, col = new_colours)
legend("top", legend = rownames(ownapps), fill = new_colours)
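The strings in new_colours are hexadecimal RGB codes: two hex digits each for red, green and blue. A short sketch of converting between the two forms with the base functions col2rgb and rgb:

```r
new_colours <- c("#569BBD", "#F4DD00")

# Red, green and blue components (0-255), one column per colour
rgb_values <- col2rgb(new_colours)
rgb_values

# And back again: build a hex code from its components
hex_again <- rgb(86, 155, 189, maxColorValue = 255)
hex_again
```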
There were quite a few commands used in these slides. Some are more important than others and you will find them recurring more frequently in your use of R.
For many mathematical functions, the corresponding R function is usually easy to figure out, e.g. mean, median, sd, min, max.
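For instance, all five of those functions can be applied to a single column and collected in one named vector. The column and the variable name pl_stats are picked only for illustration:

```r
x <- iris$Petal.Length

# Each function takes the numeric vector and returns a single number
pl_stats <- c(mean = mean(x), median = median(x), sd = sd(x),
              min = min(x), max = max(x))
round(pl_stats, 2)
```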