almostComplete: Subset a 'data.frame' by Completeness of Rows or Columns

View source: R/almostComplete.R

Description

An alternative to subsetting with stats::complete.cases() that lets you specify how complete the rows and columns must be, rather than requiring that they contain no NA values at all.

Usage

almostComplete(dataset, rowPct, colPct = rowPct, n = 1)

Arguments

dataset

The input data.frame

rowPct

The maximum proportion of NA values allowed in a row, expressed as a decimal (for example, 0.2 for 20%).

colPct

The maximum proportion of NA values allowed in a column, expressed as a decimal. Defaults to the value of rowPct.

n

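Used only when rowPct and colPct are both NULL. The function then drops at least n rows and n columns, chosen by "rank" of their NA counts, if any contain NA. Rows or columns that tie in rank are dropped together, so more than n may be removed. Defaults to 1. See "Details".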
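The threshold arguments above can be illustrated with a few lines of base R. This is only a sketch, not the package's implementation: dropByNaShare and the toy data frame are hypothetical, and it assumes that NA proportions are computed once on the original data and that a row or column is kept when its proportion is at or below the threshold. almostComplete() itself may treat the boundary differently.

```r
# Hypothetical helper, not part of SOfun: keep the rows and columns whose
# proportion of NA values is at or below the given thresholds (assumption).
dropByNaShare <- function(dataset, rowPct, colPct = rowPct) {
  na_mask <- is.na(dataset)
  row_share <- rowMeans(na_mask)  # proportion of NA values in each row
  col_share <- colMeans(na_mask)  # proportion of NA values in each column
  dataset[row_share <= rowPct, col_share <= colPct, drop = FALSE]
}

# Toy data, invented for illustration
toy <- data.frame(id = 1:4,
                  a = c(NA, 1, 2, NA),
                  b = c(NA, NA, 3, 4),
                  c = c(NA, 5, 6, 7))

# Row 1 is 75% NA and is dropped; columns a and b are 50% NA and are dropped
res <- dropByNaShare(toy, rowPct = 0.5, colPct = 0.3)
res
```

Using rowMeans() and colMeans() on the logical matrix from is.na() gives the proportions directly, which keeps the thresholds on the same 0 to 1 scale as rowPct and colPct.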

Details

When rowPct and colPct are both NULL, the function counts the NA values in each row and each column and ranks those counts. With the default n = 1, it drops the rows and columns with the highest number of missing values. With the dataset in the Examples section (mydf), n = 2 removes rows 1, 3, and 6 (with 3, 3, and 5 NA values) and columns A, B, C, and F (with 5, 3, 3, and 5 NA values), that is, every row and column whose NA count is the highest or second highest. Compare this behavior with the results of rowSums(is.na(mydf)) and colSums(is.na(mydf)).
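The rank-based selection described above can be reproduced by hand in base R. The sketch below is an illustration rather than the function's actual code: topRanked is a hypothetical helper, and it assumes that "rank" means the n highest distinct non-zero NA counts.

```r
mydf <- read.csv(text = "
SampleID,A,B,C,D,E,F
x1,NA,x,NA,x,NA,x
x2,x,x,NA,x,x,NA
x3,NA,NA,x,x,x,NA
x4,x,x,x,NA,x,x
x5,x,x,x,x,x,x
x6,NA,NA,NA,x,NA,NA
x7,x,x,x,NA,x,x
x8,NA,NA,x,x,x,x
x9,x,x,x,x,x,NA
x10,x,x,x,x,x,x
x11,NA,x,x,x,x,NA")

row_na <- rowSums(is.na(mydf))  # NA count per row
col_na <- colSums(is.na(mydf))  # NA count per column

# Hypothetical helper: flag entries whose NA count is among the
# n highest distinct non-zero counts (assumed meaning of "rank").
topRanked <- function(counts, n) {
  cutoffs <- head(sort(unique(counts[counts > 0]), decreasing = TRUE), n)
  counts %in% cutoffs
}

drop_rows <- unname(which(topRanked(row_na, 2)))
drop_cols <- names(col_na)[topRanked(col_na, 2)]

drop_rows  # rows 1, 3, 6
drop_cols  # columns A, B, C, F
```

Because ties share a rank, n = 2 selects three rows and four columns here, which matches the "at least" wording in the description of n.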

Value

A data.frame

Author(s)

Ananda Mahto

References

http://stackoverflow.com/a/20475029/1270695

Examples

mydf <- read.csv(text="
SampleID,A,B,C,D,E,F
x1,NA,x,NA,x,NA,x
x2,x,x,NA,x,x,NA
x3,NA,NA,x,x,x,NA
x4,x,x,x,NA,x,x
x5,x,x,x,x,x,x
x6,NA,NA,NA,x,NA,NA
x7,x,x,x,NA,x,x
x8,NA,NA,x,x,x,x
x9,x,x,x,x,x,NA
x10,x,x,x,x,x,x
x11,NA,x,x,x,x,NA")

## What do the data look like?
## How many NAs are there per column and row?
mydf
colSums(is.na(mydf))
rowSums(is.na(mydf))

## What does complete.cases do?
mydf[complete.cases(mydf), ]

## Drop the rows and columns that have
## the highest number of NA values
almostComplete(mydf, NULL, NULL)

## Drop the rows and columns whose number of NA values
## ranks highest or second highest
almostComplete(mydf, NULL, NULL, n = 2)

## Set one threshold value for both rows and columns.
almostComplete(mydf, .7)

## Specify row and column threshold values separately.
almostComplete(mydf, rowPct = .2, colPct = .5)

mrdwab/SOfun documentation built on June 20, 2020, 6:15 p.m.