knitr::opts_chunk$set(echo = TRUE)

Tasks

Task 1

getwd()

Task 2

# Holds the entire data set
mpg.df <- read.csv("EPAGAS.csv")
# Holds the first 6 lines of the data set
mpg_6 <- head(mpg.df, 6)
mpg_6

Task 3

Extract the miles-per-gallon values as a vector

mpg <- mpg.df$MPG
head(mpg)

Transform mpg to z

z <- (mpg - mean(mpg))/sd(mpg)
head(z)

Verify that $\bar{z}$ (the mean of z) is zero

print(paste0("z-bar = ", round(mean(z), digits=4)), quote = FALSE)

Verify that $s_z^2$ (the variance of z) is 1

print(paste0("variance = ", var(z)), quote = FALSE)
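
Both checks follow from the definition of z. A short derivation, writing $x_i$ for the mpg values and $\bar{x}$, $s_x$ for their sample mean and sample standard deviation (notation introduced here for illustration):

$$\bar{z} = \frac{1}{n}\sum_{i=1}^{n}\frac{x_i - \bar{x}}{s_x} = \frac{1}{s_x}\left(\frac{1}{n}\sum_{i=1}^{n}x_i - \bar{x}\right) = \frac{\bar{x} - \bar{x}}{s_x} = 0$$

$$s_z^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(z_i - \bar{z}\right)^2 = \frac{1}{n-1}\sum_{i=1}^{n}\frac{(x_i - \bar{x})^2}{s_x^2} = \frac{s_x^2}{s_x^2} = 1$$

In R the computed values can differ from 0 and 1 by floating-point rounding error, which is why rounding the printed result is reasonable.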

Identify possible outliers in mpg (between 2 and 3 standard deviations from the mean, inclusive)

mpg[abs(z) >= 2 & abs(z) <= 3]

Identify outliers in mpg (more than 3 standard deviations from the mean)

mpg[abs(z) > 3]
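
The two filters above can also be combined into a single label per observation. The following is a minimal sketch on a small made-up vector (`x_toy`, `z_toy`, and `status` are hypothetical names, and the values are not taken from EPAGAS.csv), using the same cut-offs as above:

```r
# Toy data for illustration only (not from EPAGAS.csv)
x_toy <- c(30, 35, 36, 36, 37, 37, 37, 38, 38, 39, 44)

# Standardize exactly as was done for mpg
z_toy <- (x_toy - mean(x_toy)) / sd(x_toy)

# Label each value: |z| > 3 is an outlier, 2 <= |z| <= 3 is a possible outlier
status <- ifelse(abs(z_toy) > 3, "outlier",
                 ifelse(abs(z_toy) >= 2, "possible outlier", "typical"))

# Count how many observations fall in each class
table(status)
```

Applying the same `ifelse()` call to `z` would label every mpg value in one pass.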

Use lattice to construct a dot plot

library(lattice)
# Red: outliers (|z| > 3); blue: possible outliers (2 <= |z| <= 3); black: all others
cols <- ifelse(abs(z) > 3, "Red",
               ifelse(abs(z) >= 2, "Blue", "Black"))
lattice::dotplot(mpg, col= cols)

Task 4

Make a boxplot

boxplot(mpg, notch = TRUE, horizontal = TRUE, col = "Black", main = "MPG Boxplot")

Use Chebyshev's theorem to predict the proportion of data within 2 standard deviations of the mean

According to Chebyshev's theorem, if k = 2 then $\frac{N(S_k)}{n} \geq 1 - \frac{1}{k^2} = 1 - \frac{1}{4} = \frac{3}{4}$, so at least $75\%$ of the data will lie within 2 standard deviations of the mean.
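
The same bound can be tabulated for other values of k. This is a small sketch; `chebyshev_bound` and `k_values` are hypothetical names introduced here:

```r
# Chebyshev lower bound on the proportion of data within k standard deviations
chebyshev_bound <- function(k) 1 - 1 / k^2

# A few illustrative choices of k (the bound is only informative for k > 1)
k_values <- c(1.5, 2, 3, 4)
data.frame(k = k_values, lower_bound = chebyshev_bound(k_values))
```

The bound makes no assumption about the shape of the distribution, which is why it is much looser than the empirical rule.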

Use R to calculate the exact proportion within 2 standard deviations of the mean

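# Values within 2 standard deviations of the mean
x <- mpg[abs(z) <= 2]
# Convert the count to a percentage of all observations
print(paste0(round(100 * length(x) / length(mpg), 2), "% of data is within 2 std deviations"))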

Does Chebyshev agree with the data?

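Chebyshev's inequality says that the proportion of observations at least k standard deviations from the mean is at most $\frac{1}{k^2}$. Check this for k = 2: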

k <- 2
# TRUE if the observed tail proportion does not exceed the Chebyshev bound
mean(abs(z - mean(z)) >= k * sd(z)) <= 1/k^2

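Yes. The proportion of the data at least 2 standard deviations from the mean does not exceed $\frac{1}{k^2} = 0.25$.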

Now use the empirical rule. According to the rule, what proportion of the data should be within 2 standard deviations of the mean?

According to the empirical rule, approximately 95% of the data should be within 2 standard deviations of the mean.
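
The empirical rule's percentages are rounded proportions of a normal distribution. A brief sketch of where they come from, assuming an exactly normal distribution (`normal_within` is a hypothetical helper name):

```r
# Proportion of a normal distribution within k standard deviations of the mean
normal_within <- function(k) pnorm(k) - pnorm(-k)

# k = 1, 2, 3 give roughly 68%, 95%, and 99.7%
round(normal_within(1:3), 4)
```

Real data are never exactly normal, so the observed proportion is expected to be close to, not equal to, these values.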

How well does it correspond?

We saw earlier that 96% of the data in EPAGAS.csv is within 2 standard deviations of the mean, which is close to the 95% the rule predicts, so it corresponds well.

Is the Empirical rule valid? Why?

The rule makes two assumptions: 1) the distribution is unimodal, and 2) the distribution is symmetric about the mode.

Yes. There is a slight skew (skewness of 0.0499), but a value this close to zero is generally acceptable for an approximately normal distribution, so I think both assumptions hold.
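
The code that produced the skewness value is not shown above. One common way to compute skewness in base R is the moment-based coefficient sketched below. `sample_skewness` is a hypothetical helper, and because several skewness formulas are in use, it is an assumption that this one matches the value quoted above.

```r
# Moment-based skewness: third central moment over the 1.5 power of the second
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# A symmetric toy vector has skewness 0
sample_skewness(c(1, 2, 3, 4, 5))

# A long right tail gives a positive value
sample_skewness(c(1, 1, 2, 3, 10))
```

Calling `sample_skewness(mpg)` would give a comparable figure for the EPAGAS data, and a histogram of `mpg` would help confirm that the distribution is unimodal.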



agracy2246/MATH4753grac0009 documentation built on April 26, 2020, 9:39 a.m.