
Outliers

In-Class Exercise #50

Analyze exam data. Find outliers.

Read data

setwd("./data")
exams<-read.csv("exams_and_names-2.csv",header=T)
setwd("../")

Using boxplot w/ interquartile range

The key is using a measure that is natural to the data set, e.g., quartiles of the distribution. Recall:

- The box spans the 25th percentile (Q1) to the 75th percentile (Q3)
- The whiskers extend to the most extreme points within 1.5x the Q1-Q3 range (the IQR)
- Points beyond the whiskers are plotted as outliers

with(exams,plot(Exam.1~Exam.2))

[Plot: scatterplot of Exam.1 vs. Exam.2]

par(mfrow=c(1,2))
with(exams,boxplot(Exam.1)) # no outliers
with(exams,boxplot(Exam.2)) # three outliers

[Plot: boxplots of Exam.1 (no outliers) and Exam.2 (three outliers)]

par(mfrow=c(1,1))
boxplot(exams$Exam.1,exams$Exam.2,col="blue",main="Exam Scores",
        names=c("exam1", "exam2"),ylab="Exam Score")

[Plot: side-by-side boxplots of Exam 1 and Exam 2 scores]
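
As a quick cross-check (a sketch, not part of the original exercise), boxplot.stats() returns exactly the points a boxplot would draw beyond the whiskers, using the same 1.5x IQR rule:

# points beyond the whiskers (1.5 * IQR by default)
with(exams,boxplot.stats(Exam.1)$out)  # expect none for Exam 1
with(exams,boxplot.stats(Exam.2)$out)  # expect the three Exam 2 outliers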

Using z score

sd1<-sd(exams$Exam.1,na.rm=T)
sd2<-sd(exams$Exam.2,na.rm=T)

exams$Exam.1.z<-(exams$Exam.1-mean(exams$Exam.1,na.rm=T))/sd1
exams$Exam.2.z<-(exams$Exam.2-mean(exams$Exam.2,na.rm=T))/sd2

# Note: scale() ignores NAs by default (equivalent to na.rm=T)
exams$Exam.1.scale<-scale(exams$Exam.1)
exams$Exam.2.scale<-scale(exams$Exam.2)
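
Quick sanity check (not in the original): the manual z-scores and scale() should agree; scale() just returns a one-column matrix with the centering/scaling values attached as attributes.

head(cbind(manual=exams$Exam.1.z, scaled=as.numeric(exams$Exam.1.scale)))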

head(sort(exams$Exam.2.z))
## [1] -2.6481 -2.5586 -2.2009 -1.2171 -0.8593 -0.8593
head(sort(exams$Exam.2.z,decreasing=T))
## [1] 1.242 1.153 1.153 1.108 1.108 1.019
# No outliers by the z-score rule (no |z| > 3), even though the boxplot flagged three
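
If we did want to flag points from the z-scores, a common rule of thumb (an assumption here, not stated in the exercise) is |z| > 3:

# rows whose Exam 2 z-score exceeds 3 in absolute value -- none in this data
exams[which(abs(exams$Exam.2.z)>3),c("Student","Exam.2","Exam.2.z")]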

In-Class Exercise #52

Analyze exam data. Find outliers using the IQR. Quartile-based measures are often considered better than the mean/sd because the mean and sd are themselves sensitive to outliers.

Calculating / using IQR

# 1. get the 1st and 3rd quartiles
q1<-quantile(exams$Exam.2,0.25,na.rm=T)
q3<-quantile(exams$Exam.2,0.75,na.rm=T)

# 2. calc iqr
iqr<-q3-q1

# 3. find outliers
outliers<-exams[exams$Exam.2>=q3+(1.5*iqr) | exams$Exam.2<=q1-(1.5*iqr),]
outliers
##          Student Exam.1 Exam.2 Exam.1.z Exam.2.z Exam.1.scale Exam.2.scale
## 4     Student #4    136    100   -1.925   -2.648       -1.925       -2.648
## NA          <NA>     NA     NA       NA       NA           NA           NA
## 23   Student #23    125    102   -2.555   -2.559       -2.555       -2.559
## NA.1        <NA>     NA     NA       NA       NA           NA           NA
## NA.2        <NA>     NA     NA       NA       NA           NA           NA
## 30   Student #30    141    110   -1.639   -2.201       -1.639       -2.201
# 3 outliers (the NA rows come from records with a missing Exam.2; see the which() sketch below)
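
Two small variations (a sketch on the same data, not from the original): base R's IQR() computes q3 - q1 directly, and wrapping the condition in which() drops the NA comparison rows.

iqr2<-IQR(exams$Exam.2,na.rm=T)   # same value as q3 - q1 above
outliers2<-exams[which(exams$Exam.2>=q3+(1.5*iqr2) | exams$Exam.2<=q1-(1.5*iqr2)),]
outliers2                         # same 3 students, without the NA rows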

Example of why the IQR is better: outliers skew the mean and sd.

data<-c(1,2,3,4,100)
boxplot(data)

plot of chunk unnamed-chunk-5

# the boxplot flags the value 100 as an outlier

sd3<-sd(data)
mean1<-mean(data)
zscores<-(data-mean1)/sd3
zscores
## [1] -0.4815 -0.4585 -0.4356 -0.4127  1.7883
# no outliers by the z-score rule: the extreme value inflates the sd so much that its own z-score stays small
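
To see why the z-score misses it (a small illustration, not part of the original), compare the mean and sd with and without the extreme value:

mean(data); sd(data)           # ~22 and ~43.6 with the outlier included
mean(data[-5]); sd(data[-5])   # 2.5 and ~1.29 without it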

The IQR is not perfect either: some distributions are naturally broad, so points beyond the whiskers may simply be part of the distribution.

# example: t distribution with low degrees of freedom; the extreme points are just the distribution, not errors
boxplot(rt(50,df=1))

[Plot: boxplot of a sample from a t distribution with df = 1, with many points beyond the whiskers]
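
A rough illustration (hypothetical seed and sample size, not from the original): even with nothing wrong in the data, a heavy-tailed sample can put many points beyond the 1.5x IQR whiskers.

set.seed(1)                               # hypothetical seed, for reproducibility
length(boxplot.stats(rt(500,df=1))$out)   # number of points a boxplot would flag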

Detecting outliers for multiple attributes

In-class #52: Using regression

First, fit a basic lm and plot the regression line for the data.

# load/clean data
setwd("./data")
exams<-read.csv("exams_and_names-2.csv",header=T)
setwd("../")
exams1<-exams[!is.na(exams[,3]),]

# fit model
fit<-lm(exams$Exam.2~exams$Exam.1)
summary(fit)
## 
## Call:
## lm(formula = exams$Exam.2 ~ exams$Exam.1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20.09  -4.95   1.26   4.76  30.08 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -35.9039    16.8035   -2.14     0.04 *  
## exams$Exam.1   1.1470     0.0983   11.67  1.3e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.3 on 35 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.796,  Adjusted R-squared:  0.79 
## F-statistic:  136 on 1 and 35 DF,  p-value: 1.28e-13
# plot results
plot(exams$Exam.1,exams$Exam.2,pch=19,xlab="Exam 1", ylab="Exam 2", xlim=c(100,200),
     ylim=c(100,200))
abline(fit)

[Plot: Exam 2 vs. Exam 1 scatterplot with the fitted regression line]

# inspect residuals
sort(fit$residuals)
##        4       17       30       39       11       13        7       19 
## -20.0930 -16.7389 -15.8282 -14.5041 -11.3270 -10.1507  -9.7096  -7.0329 
##       23       33        2       40       20       15       35       12 
##  -5.4756  -4.9452  -4.6219  -4.1515  -4.1507  -4.0037  -2.7981  -1.2100 
##       24       32       26       18        1       21       22       38 
##  -0.3270   0.4074   1.2611   1.8200   2.6730   2.7608   2.8785   2.9671 
##       14       16       37       31        6       36        3       29 
##   3.1141   4.2904   4.3781   4.7608   5.6430   5.8193   5.9963   7.5252 
##       28       10        9       34        5 
##   7.7015   7.9370   9.8485  25.2019  30.0841
# recall that residuals are the **vertical** distance to the regression (fitted values) line. 
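
One common follow-up (a sketch, not the exercise's method): standardize the residuals and flag observations whose standardized residual exceeds 2 in absolute value.

# rstandard() rescales each residual by its estimated standard deviation
std.res<-rstandard(fit)
which(abs(std.res)>2)   # indices of observations with unusually large residuals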

Now plot the same data, scaling each point by the magnitude of its residual.

# load/clean data
setwd("./data")
exams<-read.csv("exams_and_names-2.csv",header=T)
setwd("../")
exams1<-exams[!is.na(exams[,3]),]

# fit model
fit2<-lm(exams1$Exam.2~exams1$Exam.1)

# plot results
# use exams1 here so each point lines up with its residual from fit2 (the lengths match)
plot(exams1$Exam.1,exams1$Exam.2,pch=19,xlab="Exam 1", ylab="Exam 2", xlim=c(100,200),
     ylim=c(100,200),cex=abs(fit2$residuals)/10)

abline(fit2)

[Plot: Exam 2 vs. Exam 1 with points scaled by residual magnitude and the regression line]

plot(fit2) # plotting a fitted lm object steps through a series of diagnostic graphs

[Plots: the four default lm diagnostics -- residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage]
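
If you only want specific diagnostics (a sketch using plot.lm's which argument), you can request them individually, e.g., residuals vs. fitted (1) and residuals vs. leverage with Cook's distance (5):

par(mfrow=c(1,2))
plot(fit2,which=c(1,5))
par(mfrow=c(1,1))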

In-class #53: Using Clustering w/ Kmeans

First cluster the data

# load data
setwd("./data")
data<-read.csv("exams_and_names-2.csv",header=T)
setwd("../")

# clean data: drop rows with a missing Exam.2 and keep only the two exam-score columns
x<-data[!is.na(data[,3]),2:3]

# plot the data
plot(x,pch=19,xlab="Exam 1", ylab="Exam 2")

# fit model
fit<-kmeans(x,5)

# plot results
points(fit$centers,pch=19,col="blue",cex=2)
points(x,col=fit$cluster,pch=19)

[Plot: exam scores colored by assigned k-means cluster, with cluster centers in blue]

Different methods end up with different outliers. In this case, the lower three points are considered outliers because they form their own cluster when k = 5. The upper two are not, as they fall within the range of the other clusters at k = 5.
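
Another way to rank candidates from the k-means fit (a sketch, not part of the exercise; note that kmeans() starts from random centers, so set a seed if you need reproducible clusters): compute each point's distance to its assigned cluster center and inspect the largest distances.

# distance of each point to the center of its assigned cluster
centers<-fit$centers[fit$cluster,]
dists<-sqrt(rowSums((as.matrix(x)-centers)^2))
head(order(dists,decreasing=T))   # indices of the points farthest from their centers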


