genoutlier: Identification and exclusion of outliers

View source: R/genoutlier.R

Description

The genoutlier function finds and excludes outlying (concentration) values according to the selected method and draws a plot of the outliers.

Usage

genoutlier(x, y=NA, input="openair", output=NA, method="lm3s",
           sides=2, pollutant=NA, plot=TRUE, columns=2, 
           col.points="black", pch=1, xlab="Date", 
           ylab="Concentration", main=NA)

Arguments

x

a vector of concentration values or a data frame of the genasis or openair type. See 'Details' for a description of both data types.

y

a vector of measurement dates. Used only when x is a vector.

input

the type of data frame when x is a data frame. Allowed values are "openair" (default) and "genasis". This argument is ignored for vector input.

output

the type of the output data frame. As with the input argument, both "openair" and "genasis" are available; by default the output type is the same as input.

method

method used to determine the threshold(s). Allowed values are "m2s" and "m3s" for mean +(-) 2 or 3 standard deviations, "lm2s" and "lm3s" for the log-transformed variants, and "iqr2", "iqr4" and "iqr7" for interquartile distances. See 'Details' for a description of the methods.

sides

if sides=2 (default), both the lower and the upper threshold are used. If sides=1, only the upper threshold is applied.

pollutant

name(s) of the pollutant(s) for which outliers are to be found. Not necessary if x contains data for only one pollutant. If not specified, plots for all pollutants are drawn in a multi-plot arrangement.

plot

logical. Indicates whether the plot should be drawn.

columns

number of columns in the multi-plot arrangement.

col.points

color of the non-outlying points in the plot.

pch

plotting 'character', i.e., symbol to use. For more details see points.

xlab

the x label of the plot.

ylab

the y label of the plot.

main

overall title for the plot.

Details

The genoutlier function finds outlying (concentration) values according to the criterion given by the arguments method and sides and replaces them with NAs. The function recognises three different input formats. The option input="openair" expects a data frame in the "openair" format, with a first column named "date" of class "Date", optional columns named "date_end", "temp", "wind" and "note", and further columns of class "numeric" that contain concentration values and are named after the compounds. The option input="genasis" expects a data frame with six columns "valu", "comp", "date_start", "date_end", "temp" and "wind", where the first, fifth and sixth are of class "numeric", the second is of class "character", and the third and fourth may be of class "character" or "Date". The column names for input="genasis" are not fixed; only their order is assumed. It is also possible to specify x and y as two vectors of equal length, the first of class "numeric" containing concentration values and the second of class "character" or "Date" containing measurement dates.

The output argument specifies the type of the result. Both data frame types, output="openair" and output="genasis", are available. The default is the same type as the input, so a vector result is returned only if x is of class "numeric" and output is not specified.

There are seven methods for setting the outlier thresholds. method="m3s" sets the lower threshold to the sample mean - 3 standard deviations and the upper threshold to the sample mean + 3 standard deviations. The variant method="m2s" works the same way with 2 standard deviations. For log-normally distributed data the variant method="lm3s" may work better; it sets the lower threshold to geometric mean / 3 geometric standard deviations and the upper threshold to geometric mean * 3 geometric standard deviations. Analogously, method="lm2s" uses 2 geometric standard deviations. The non-parametric variants "iqr2", "iqr4" and "iqr7" set the lower threshold to the first quartile (25th percentile) - a * interquartile range and the upper threshold to the third quartile (75th percentile) + a * interquartile range, with the parameter a equal to 0.5, 1.5 and 3, respectively (so the whole range between the thresholds is 2, 4 and 7 times the interquartile range).
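
The threshold rules described above can be sketched in base R as follows. This is only an illustration, not the package's source code: the helper name outlier_thresholds is hypothetical, the log-transformed variants are assumed to mean "mean +(-) k standard deviations on the log scale" (which is one reading of the description), and the quartiles are taken from quantile() with its default settings, which may differ from what genoutlier does internally.

```r
# Hypothetical helper, not part of genasis: computes the lower and upper
# thresholds for the seven methods as they are described in the text.
outlier_thresholds <- function(v, method = "lm3s") {
  v <- v[!is.na(v)]
  if (method %in% c("m2s", "m3s")) {
    k <- if (method == "m2s") 2 else 3
    lower <- mean(v) - k * sd(v)
    upper <- mean(v) + k * sd(v)
  } else if (method %in% c("lm2s", "lm3s")) {
    # Assumption: thresholds are mean +(-) k sd of log(v), back-transformed.
    k <- if (method == "lm2s") 2 else 3
    lv <- log(v)
    lower <- exp(mean(lv) - k * sd(lv))
    upper <- exp(mean(lv) + k * sd(lv))
  } else if (method %in% c("iqr2", "iqr4", "iqr7")) {
    a <- c(iqr2 = 0.5, iqr4 = 1.5, iqr7 = 3)[[method]]
    # Assumption: default quantile() definition (type 7).
    q <- quantile(v, c(0.25, 0.75), names = FALSE)
    iqr <- q[2] - q[1]
    lower <- q[1] - a * iqr
    upper <- q[2] + a * iqr
  } else {
    stop("unknown method: ", method)
  }
  c(lower = lower, upper = upper)
}

outlier_thresholds(c(1, 2, 3, 4, 5), "iqr4")  # lower = -1, upper = 7
```

For the "iqr4" call the quartiles are 2 and 4, so the distance between the thresholds is 8, i.e. 4 times the interquartile range of 2, in line with the description.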

The argument sides specifies whether one-sided or two-sided exclusion of outliers is performed. With sides=2 (default), outliers both below the lower threshold and above the upper threshold are excluded; with sides=1, only outliers above the upper threshold are excluded.
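
The effect of the sides argument can be illustrated with a short base R sketch. The function exclude_outliers is a hypothetical helper written for this illustration only, and the use of strict comparisons at the thresholds is an assumption, not something the description states.

```r
# Hypothetical helper, not part of genasis: replaces values outside the
# thresholds with NA, one-sided (upper only) or two-sided.
exclude_outliers <- function(v, lower, upper, sides = 2) {
  out <- v > upper                       # upper threshold is always applied
  if (sides == 2) out <- out | v < lower # lower threshold only for sides = 2
  v[which(out)] <- NA                    # which() skips values already NA
  v
}

exclude_outliers(c(1, 5, 10), lower = 2, upper = 8, sides = 2)  # NA  5 NA
exclude_outliers(c(1, 5, 10), lower = 2, upper = 8, sides = 1)  #  1  5 NA
```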

Value

a list containing:

res

the data frame (or vector), in the format given by the output argument, with outlying values replaced by NAs.

lower

numeric value of lower threshold

upper

numeric value of upper threshold

Author(s)

Jiri Kalina
kalina@mail.muni.cz

See Also

genloq, genhistogram, genpastoact, genanaggr, genplot, genstatistic, gentransform, genwhisker

Examples

library(genasis)

## Definition of simple data sources:
c1<-rnorm(100)+12
c2<-"random compound"
c3<-as.Date(as.Date("2013-01-01"):as.Date("2013-04-10"),
            origin="1970-01-01")
c4<-c3+1

sample_genasis<-data.frame(c1,c2,c3,c4)
sample_openair<-data.frame(c4,c1)
colnames(sample_openair)=c("date",c2)

## Examples of different usages:
genoutlier(sample_openair,input="openair",pollutant="random compound",
           method="m2s")
genoutlier(sample_genasis,input="genasis",method="m3s")

## Use of example data from the package:
data(kosetice.pas.openair)
genoutlier(genpastoact(kosetice.pas.openair[,1:8]),method="lm3s",
           main="Outliers",ylab="Concentration ngm-3")
genoutlier(kosetice.pas.openair[,c(1:4,23:26)],col.points="orange",
           method="lm3s")
data(kosetice.pas.genasis)
genoutlier(kosetice.pas.genasis[625:832,],input="genasis",
           method="lm2s",sides=1)

genasis documentation built on May 1, 2019, 10:16 p.m.