con_filter | R Documentation |
Traditional filtering (subsetting) of data is typically performed via some criteria based on the columns of the data.
In contrast, this function performs filtering of data based on the joint rows and columns of a matrix-view of two factors.
Conceptually, the idea is to re-shape two or three columns of a dataframe into a matrix, and then delete entire rows (or columns) of the matrix if there are too many missing cells in a row (or column).
The two most useful applications of two-way filtering are to:
Remove a factor level that has few interactions with another factor. This is especially useful in linear models to remove rare factor combinations.
Remove a factor level that has any missing interactions with another factor. This is especially useful with biplots of a matrix to remove rows or columns that have missing values.
A formula syntax is used to specify the two-way filtering criteria.
Some examples may provide the easiest understanding.
dat <- data.frame(state=c("NE","NE", "IA", "NE", "IA"), year=c(1,2,2,3,3), value=11:15)
When the 'value' column is re-shaped into a matrix it looks like:
state/year | 1 | 2 | 3 |
NE | 11 | 12 | 14 |
IA | | 13 | 15 |
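The matrix view above can be reproduced with base R. This is only an illustrative sketch using tapply(), not the internal code of con_filter; the object name mat is introduced here for illustration.

```r
dat <- data.frame(state = c("NE", "NE", "IA", "NE", "IA"),
                  year = c(1, 2, 2, 3, 3),
                  value = 11:15)

# Matrix view: one row per state, one column per year.
# Combinations that never occur are left as NA.
mat <- with(dat, tapply(value, list(state, year), sum))
mat

# Number of filled cells in each row (state) and each column (year)
rowSums(!is.na(mat))
colSums(!is.na(mat))
```

Rows or columns with too few filled cells are the ones a two-way filter would remove.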
Drop states with too many missing combinations. Keep only states with "at least 3 years per state":
con_filter(dat, ~ 3 * year / state)
NE 1 11
NE 2 12
NE 3 14
Keep only years with "at least 2 states per year":
con_filter(dat, ~ 2 * state / year)
NE 2 12
IA 2 13
NE 3 14
IA 3 15
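The two count-based filters above can be checked by hand with base R. The following is a sketch of the counting logic only, under the assumption that "n * a / b" means "keep levels of b that occur with at least n distinct levels of a"; it is not the package's implementation, and the names kept_states and kept_years are hypothetical.

```r
dat <- data.frame(state = c("NE", "NE", "IA", "NE", "IA"),
                  year = c(1, 2, 2, 3, 3),
                  value = 11:15)

# "At least 3 years per state": count distinct years within each state
n_years <- tapply(dat$year, dat$state, function(x) length(unique(x)))
kept_states <- dat[dat$state %in% names(n_years)[n_years >= 3], ]
kept_states

# "At least 2 states per year": count distinct states within each year
n_states <- tapply(dat$state, dat$year, function(x) length(unique(x)))
kept_years <- dat[as.character(dat$year) %in% names(n_states)[n_states >= 2], ]
kept_years
```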
If the constant in the formula is less than 1.0, it is interpreted as a fraction. Keep only states with "at least 75% of years per state":
con_filter(dat, ~ .75 * year / state)
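A fractional threshold can be sketched the same way. The code below assumes the fraction is taken relative to the number of distinct years in the whole dataset; that denominator is an assumption made for illustration and is not confirmed by this help page.

```r
dat <- data.frame(state = c("NE", "NE", "IA", "NE", "IA"),
                  year = c(1, 2, 2, 3, 3),
                  value = 11:15)

# Assumed denominator: all distinct years present in the data (3 here)
n_total <- length(unique(dat$year))

# Fraction of those years in which each state appears
n_years <- tapply(dat$year, dat$state, function(x) length(unique(x)))
frac <- n_years / n_total

# Keep states that reach the 75% threshold
keep_states <- names(frac)[frac >= 0.75]
kept <- dat[dat$state %in% keep_states, ]
kept
```

Under this assumption IA appears in 2 of 3 years, which is below 75%, so only the NE rows remain.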
It is possible to include another factor on either side of the slash "/". Suppose the data had another factor for political party called "party". Keep only states with "at least 2 combinations of party:year per state":
con_filter(dat, ~ 2 * party:year / state)
If the formula contains a response variable, missing values are dropped first, then the two-way filtering is based on the factor combinations.
con_filter(dat, value ~ 2 * state / year)
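The effect of a response variable can be illustrated with a variant of the example data in which one value is missing. The data frame dat2 and the other object names are hypothetical, and the steps are a base R sketch of the described behavior rather than the function's own code.

```r
# Hypothetical variant of the example data: the IA / year 2 value is missing
dat2 <- data.frame(state = c("NE", "NE", "IA", "NE", "IA"),
                   year = c(1, 2, 2, 3, 3),
                   value = c(11, 12, NA, 14, 15))

# Step 1: drop rows where the response is missing
obs <- dat2[!is.na(dat2$value), ]

# Step 2: count distinct states per year among the remaining rows
n_states <- tapply(obs$state, obs$year, function(x) length(unique(x)))

# Step 3: keep only years with at least 2 states
keep_years <- names(n_states)[n_states >= 2]
kept <- obs[as.character(obs$year) %in% keep_years, ]
kept
```

Without the response in the formula, year 2 would still count two states (NE and IA) and would be kept; with the response, the missing IA value is removed first and only year 3 passes.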
con_filter(data, formula, verbose = TRUE, returndropped = FALSE)
data | A dataframe. |
formula | A formula with two factor names in the dataframe that specifies the criteria for filtering, like |
verbose | If TRUE, print some diagnostic information about what data is being deleted (similar to the 'tidylog' package). |
returndropped | If TRUE, return the dropped rows instead of the kept rows. Default is FALSE. |
The original dataframe is returned, minus the rows that are filtered out. If returndropped is TRUE, the dropped rows are returned instead.
Kevin Wright
None.
dat <- data.frame(
gen = c("G3", "G4", "G1", "G2", "G3", "G4", "G5",
"G1", "G2", "G3", "G4", "G5",
"G1", "G2", "G3", "G4", "G5",
"G1", "G2", "G3", "G4", "G5"),
env = c("E1", "E1", "E1", "E1", "E1", "E1", "E1",
"E2", "E2", "E2", "E2", "E2",
"E3", "E3", "E3", "E3", "E3",
"E4", "E4", "E4", "E4", "E4"),
yield = c(65, 50, NA, NA, 65, 50, 60,
NA, 71, 76, 80, 82,
90, 93, 95, 102, 97,
98, 102, 105, 130, 135))
# How many observations are there for each combination of gen*env?
with( subset(dat, !is.na(yield)) , table(gen,env) )
# Note, if there is no response variable, the two-way filtering is based
# only on the presence of the factor combinations.
dat1 <- con_filter(dat, ~ 4*env / gen)
# If there is a response variable, missing values are dropped first,
# then the two-way filtering is based on the factor combinations.
dat1 <- con_filter(dat, yield ~ 4*env/gen)
dat1 <- con_filter(dat, yield ~ 5*env/ gen)
dat1 <- con_filter(dat, yield ~ 6*gen/ env)
dat1 <- con_filter(dat, yield ~ .8 *env / gen)
dat1 <- con_filter(dat, yield ~ .8* gen / env)
dat1 <- con_filter(dat, yield ~ 7 * env / gen)