knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
To analyze the data and generate interpretable results, the following statistical models can be used:
Understanding the distribution of the data can be done by looking at its shape, for instance:
The 'Normal' Distribution
Normal Distribution Visualized
knitr::include_graphics("pictures/introDA2/ndis.jpg")
quakes dataset:
library(datasets)
data("quakes")
str(quakes)
table(quakes$mag)
hist(quakes$mag, breaks = 24)
Frequency distribution properties include the following statistics:
Skew - The degree of asymmetry of the distribution.
Kurtosis - The 'heaviness' of the tails.
knitr::include_graphics("pictures/introDA2/Skew_PosvsNeg.png")
knitr::include_graphics("pictures/introDA2/Kurtosis.png")
hist(quakes$mag)
library(moments)
skewness(quakes$mag)
library(psych)
describe(quakes$mag)
Bulmer (1979) - a classic - suggests this rule of thumb:
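As a minimal sketch of how such a rule of thumb could be applied, the helper below labels a skewness value using the cut-offs of 0.5 and 1 that are commonly attributed to Bulmer (1979). The function name `classify_skew` and the exact cut-offs are assumptions made for illustration; the skewness itself is computed by hand from central moments using base R only.

```r
# Hypothetical helper (not from any package): labels a skewness value
# using the commonly cited cut-offs of 0.5 and 1 (assumed here).
classify_skew <- function(skew) {
  a <- abs(skew)
  if (a > 1) {
    "highly skewed"
  } else if (a > 0.5) {
    "moderately skewed"
  } else {
    "approximately symmetric"
  }
}

# Moment-based sample skewness: m3 / m2^(3/2),
# where m2 and m3 are central moments with n in the denominator.
x  <- quakes$mag
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)
skew_mag <- m3 / m2^(3/2)

classify_skew(skew_mag)
```

The sign of `skew_mag` indicates the direction of the longer tail (positive for a right tail, negative for a left tail); the label only reflects its magnitude.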
# moments
kurtosis(quakes$mag)
# psych
describe(quakes$mag)
How can you interpret the kurtosis number?
Excess Kurtosis can be interpreted using the following rule of thumb:
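To make the term "excess" concrete, the sketch below computes kurtosis by hand in base R and subtracts 3, the value a normal distribution produces. The labels in the final step follow the conventional reading (positive means heavier tails than normal, negative means lighter tails) and are an assumption about the rule of thumb referred to here. Note that `moments::kurtosis()` is documented as Pearson's measure (normal = 3), whereas `psych::describe()` reports a value centered on 0, so the two numbers are expected to differ by roughly 3.

```r
x  <- quakes$mag
m2 <- mean((x - mean(x))^2)  # second central moment (n in the denominator)
m4 <- mean((x - mean(x))^4)  # fourth central moment

kurt        <- m4 / m2^2     # Pearson kurtosis; a normal distribution gives 3
excess_kurt <- kurt - 3      # excess kurtosis; a normal distribution gives 0

# Conventional labels (assumed interpretation)
if (excess_kurt > 0) {
  "leptokurtic: heavier tails than a normal distribution"
} else if (excess_kurt < 0) {
  "platykurtic: lighter tails than a normal distribution"
} else {
  "mesokurtic: tails similar to a normal distribution"
}
```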
We will cover more on skew and kurtosis in the data screening section.
summary(quakes$mag)
pastecs package.
library(pastecs)
stat.desc(quakes)
Hmisc package.
library(Hmisc)
Hmisc::describe(quakes)
psych package.
library(psych)
psych::describe(quakes)
$$\bar{x} = \frac {\sum_{i=1}^{n}x_{i}} {n}$$
mean(quakes$mag)
median(quakes$mag)
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
getmode(quakes$mag)
Bimodal
Multimodal
Example of Bimodal Distribution:
knitr::include_graphics("pictures/introDA2/bimodal_distribution.png")
knitr::include_graphics("pictures/introDA2/Capture_01.png")
A measure of spread gives us an idea of how well the central tendency represents the data.
We will be looking at the key measures of:
Definition - The smallest score subtracted from the largest.
Calculate the range using the following:
range(quakes$mag)
psych::describe(quakes$mag)
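Note that base R's `range()` returns the minimum and maximum rather than their difference. A short sketch of getting the single number described in the definition (largest minus smallest); the variable name `mag_range` is only illustrative:

```r
mag_range <- range(quakes$mag)     # vector of length 2: c(min, max)
diff(mag_range)                    # largest minus smallest

# Equivalent, spelled out
max(quakes$mag) - min(quakes$mag)
```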
Definition - The difference between the third and first quartiles, Q3 - Q1.
Quartiles Definition - The three values that split the sorted data into four equal parts.
knitr::include_graphics("pictures/introDA2/interquartile_range.png")
quantile(quakes$mag)
summary(quakes$mag)
quantile(quakes$mag, c(0.05,0.50,0.75,0.95))
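The interquartile range itself, Q3 - Q1, can be read off the quartiles or obtained directly with base R's `IQR()`. A small sketch; the names `q` and `iqr_by_hand` are illustrative only:

```r
# First and third quartiles (default quantile algorithm)
q <- quantile(quakes$mag, c(0.25, 0.75))

iqr_by_hand <- unname(q[2] - q[1])  # Q3 - Q1
iqr_by_hand

# Built-in equivalent
IQR(quakes$mag)
```

Both calls use the same default quantile algorithm, so the two results should agree.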
$$SD^2 = \frac {\sum_{i=1}^{n}(x_{i} - \bar{x})^2} {n}$$
var(quakes$mag)
sd(quakes$mag)
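One detail worth checking: the formula above divides by $n$, whereas R's `var()` and `sd()` use the sample version with $n - 1$ in the denominator. The sketch below computes both by hand so the difference is visible; the variable names are illustrative.

```r
x <- quakes$mag
n <- length(x)

ss <- sum((x - mean(x))^2)  # sum of squared deviations from the mean

var_n  <- ss / n            # denominator n, as in the formula above
var_n1 <- ss / (n - 1)      # denominator n - 1, as used by var()

all.equal(var_n1, var(x))       # TRUE
all.equal(sqrt(var_n1), sd(x))  # TRUE

# The two versions differ only by the factor (n - 1) / n
all.equal(var_n, var(x) * (n - 1) / n)  # TRUE
```

With a sample of this size the two values are very close, but the distinction matters for small samples.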
$$Z = \frac{(x_{i} - \bar{x})} {SD}$$
quakes$zscore <- scale(quakes$mag)
head(quakes$zscore)
str(quakes$zscore)
mean(quakes$mag)
sd(quakes$mag)
By dividing by the standard deviation, $SD$, we scale the distance from the mean to express it in units of standard deviations.
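As a quick check of the formula, the z-scores can be computed by hand and compared with `scale()`, which by default centers on the mean and divides by `sd()`. This is a sketch in base R; `z_by_hand` and `z_scale` are illustrative names.

```r
x <- quakes$mag

z_by_hand <- (x - mean(x)) / sd(x)  # direct translation of the formula
z_scale   <- as.numeric(scale(x))   # scale() returns a one-column matrix

all.equal(z_by_hand, z_scale)       # TRUE

# Standardized scores have mean 0 (up to rounding) and standard deviation 1
mean(z_by_hand)
sd(z_by_hand)
```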
Properties:
$$cov_{x,y} = \frac {\sum_{i=1}^{n}(x_{i} - \bar{x})(y_{i} - \bar{y})} {n}$$
cov(quakes$mag, quakes$depth)
cor(quakes$mag, quakes$depth)
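The link between the two functions can be sketched by hand: the correlation is the covariance rescaled by both standard deviations. As with `var()`, R's `cov()` divides by $n - 1$ rather than $n$. Variable names below are illustrative.

```r
x <- quakes$mag
y <- quakes$depth
n <- length(x)

# Covariance by hand, with the n - 1 denominator that cov() uses
cov_by_hand <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
all.equal(cov_by_hand, cov(x, y))  # TRUE

# Pearson correlation: covariance divided by the product of the SDs
r_by_hand <- cov(x, y) / (sd(x) * sd(y))
all.equal(r_by_hand, cor(x, y))    # TRUE
```

Because the standard deviations cancel the units, the correlation always lies between -1 and 1, while the covariance depends on the scales of both variables.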
library(corrplot)
corrplot(cor(quakes), order = "hclust")
In this section, you learned about:
Frequency Distributions: helping us understand shape: skew, kurtosis