count.above: Number or percent of rows (for each col) where value exceeds...

count.above    R Documentation

Number or percent of rows (for each col) where value exceeds cutoff(s)

Description

Count the number or percent of rows (for each col of a data.frame) where the value exceeds some specified cutoff(s)

Usage

count.above(
  df,
  benchmarks = "mean",
  benchnames = "cutoff",
  or.tied = FALSE,
  below = FALSE,
  wts = 1,
  na.rm = TRUE
)

Arguments

df

Data.frame or matrix, required.

benchmarks

Default is 'mean' but otherwise this must be a number or numeric vector of thresholds to compare values to.

benchnames

Default is 'cutoff' and this string is used to create colnames for the results, such as above.cutoff.for.field1

or.tied

Logical, FALSE by default, reporting on those > cutoff. But, if or.tied=TRUE, this reports on those >= cutoff.

below

Logical, FALSE by default, which counts how many are above cutoff (or tied with it, if or.tied). If TRUE, counts how many are below cutoff (or tied with it, if or.tied).

wts

Number or vector, default is 1. Length must be a factor of the number of rows in df, so length(df[,1]) is an integer multiple of length(wts). Applies weights when counting how many.

na.rm

Logical value, optional, TRUE by default. Defines whether NA values should be removed first. Otherwise result will be NA when any NA is in a col.
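
The way or.tied and below combine to select a comparison can be sketched in base R. This is only an illustration of the argument descriptions above: compare_to_cutoff() is a hypothetical helper, not a function in this package, and the actual implementation may differ.

```r
# Hypothetical helper (not part of the package): returns a logical vector
# using the comparison implied by the or.tied and below arguments.
compare_to_cutoff <- function(values, cutoff, or.tied = FALSE, below = FALSE) {
  if (below) {
    if (or.tied) values <= cutoff else values < cutoff
  } else {
    if (or.tied) values >= cutoff else values > cutoff
  }
}

v <- c(1, 5, 5, 9)
sum(compare_to_cutoff(v, 5))                                # strictly above: 1
sum(compare_to_cutoff(v, 5, or.tied = TRUE))                # at or above:    3
sum(compare_to_cutoff(v, 5, below = TRUE))                  # strictly below: 1
sum(compare_to_cutoff(v, 5, or.tied = TRUE, below = TRUE))  # at or below:    3
```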

Details

  • If wts is population counts, for example, this gives the COUNT of people (not rows) for whom the value in df[,x] exceeds the benchmark, for each column x.

  • If below=FALSE (the default), this reports on those above (or tied with, if or.tied) cutoff. But if below=TRUE, this reports on those below (or tied with, if or.tied) cutoff.

  • If df (passed to the function) is a data.frame or matrix, the function returns a vector of length= length(df) or number of cols in matrix.

  • If df is just a vector, it is treated like a 1-column data.frame, so the function returns a single value.

  • If benchmarks (passed to the function) is a data.frame matching df in dimensions, each value is used as the cutoff for the corresponding cell in df.

  • If benchmarks is a vector of length= length(df), each value in benchmarks is the cutoff for the corresponding column in df.

  • If benchmarks is a shorter vector, it is recycled. (e.g., a vector of length 2 would use the first benchmark as the cutoff for all odd columns of df, the second for all even columns of df).

  • If benchmarks is a single numeric value, it is used as the cutoff value in every comparison for all of df.

  • If benchmarks is omitted, the default behavior is to use the arithmetic mean of each column of df as the cutoff for that column of df.

  • If benchnames is omitted, the word "cutoff" is used by default (unless benchmarks is also omitted).

  • If benchnames is specified but benchmarks is not, the benchmarks default to the column means, so benchnames is ignored and "mean" is used instead.

  • If wts is omitted the default is 1 which means no weighting. Just row counts.

  • If wts is a vector of length= length(df[,1]), then each row of df uses the corresponding weight, and the count is the sum of wts, not the count of rows.

  • If wts is shorter than that, it is recycled but # of rows in df must be an integer multiple of length(wts).

  • NA values in df are not counted and are not in the numerator of pct.above(), but the denominator of pct.above() is a count of all rows of df, not just the non-NA ones.
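
The weighting and recycling rules listed above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's code: weighted_count_above() is a hypothetical name, it only handles a benchmarks vector (one cutoff per column, recycled across columns), it always uses a strict > comparison, and it keeps plain column names rather than names built from benchnames.

```r
# Hypothetical re-implementation for illustration only; the real
# count.above() may differ in result names, NA handling, and checks.
weighted_count_above <- function(df, benchmarks, wts = 1) {
  df <- as.data.frame(df)
  n <- nrow(df)
  # number of rows must be an integer multiple of length(wts)
  stopifnot(n %% length(wts) == 0)
  w <- rep(wts, length.out = n)
  # recycle benchmarks across columns, one cutoff per column
  cutoffs <- rep(benchmarks, length.out = ncol(df))
  # for each column, sum the weights of non-NA rows above the cutoff
  mapply(
    function(column, cutoff) sum(w[!is.na(column) & column > cutoff]),
    df, cutoffs
  )
}

x <- data.frame(a = 1:20, b = 10, c = c(1:9, 100:110))
mywts <- c(rep(1, 10), rep(2, 10))

weighted_count_above(x, c(3, 100, 10), wts = mywts)  # a = 27, b = 0,  c = 21
weighted_count_above(x, c(3, 1), wts = mywts)        # cutoffs recycled as 3, 1, 3
weighted_count_above(x, 10)                          # wts = 1: plain row counts
```

With wts = 1 the sum of weights reduces to a count of rows, which is why the default corresponds to no weighting.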

These could be renamed rows.above.count(), rows.above.pct(), rows.above.which() to follow the convention of cols.above.count(), cols.above.pct(), cols.above.which(), and the same for the below versions, like rows.below.pct() etc. Parameter names should also be made consistent, like x not df, cutoff(s) not benchmarks(?), and or.tied not gte. The cols versions and all others should have wts, na.rm, benchmarks as a vector (not just 1 number), and benchnames params, and there should be a "below" version for each variant.

Value

Returns a vector of numbers of length equal to number of columns in df.
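
As a rough picture of the shape of the result, the base R sketch below counts, for each column, the rows above that column's mean (the documented default benchmark). It is an illustration only and does not call count.above(); the names on the real result are built from benchnames, which this sketch does not attempt to reproduce.

```r
x <- data.frame(a = 1:20, b = 10, c = c(1:9, 100:110))

# default benchmark as described in Details: the mean of each column
cutoffs <- colMeans(x, na.rm = TRUE)

# compare every column to its own cutoff, then count TRUE values per column
above <- sweep(as.matrix(x), 2, cutoffs, FUN = ">")
counts <- colSums(above, na.rm = TRUE)

counts                      # one number per column: a = 10, b = 0, c = 11
length(counts) == ncol(x)   # TRUE
```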

Note

Future work: these functions could have wts, na.rm, & allow cutpoints or benchmarks as a vector (not just 1 number), & have benchnames.

See Also

count.above, pct.above, pct.below to see, for each column, the count or percent of rows that have values above or below a cutoff.

cols.above.count, cols.above.which, cols.above.pct to see, for each row, the count or which or fraction of columns with numbers at/above/below cutoff.

colcounter_summary() colcounter_summary_cum() colcounter_summary_pct() colcounter_summary_cum_pct() tablefixed()

Other functions for above and below: cols.above.count(), cols.above.pct(), cols.above.which(), count.below(), pct.above(), pct.below(), rows.above.count(), rows.above.pct(), rows.above.which(), rows.below.count(), rows.below.pct()

Examples

x <- data.frame(a=1:20, b=10, c=c(1:9,100:110))
mywts <- c(rep(1,10), rep(2,10))
mybench <- c(3,100,10)
mynames <- c("HI","USavg","HealthStandard")

count.above(x, 0, wts=mywts)
count.above(x, 100, wts=mywts)
count.above(x, 10, wts=mywts)
count.above(x, mybench, wts=mywts)
cbind(count= count.above(x, mybench, mynames, wts=mywts))
cbind(pct= pct.above(x, benchmarks=mybench, benchnames=mynames, wts=mywts) )
cbind(
  count= count.above(x, mybench, mynames, wts=mywts),
  pct= pct.above(x, benchmarks=mybench, benchnames=mynames, wts=mywts) )
cbind(stat= pct.above(as.matrix(x), mybench, mynames, wts=mywts) )
cbind(stat= pct.above(1:100, 98 , wts=mywts))
 # If only a single vector is passed, not a data.frame, this gives
 # "Warning: df is a vector... converting to data.frame"

# to find how many PLACES are at/above the 95th population-weighted percentile
#  (won't be exactly 5% of places, just 5% of people):
mybench2 <- sapply(x, function(z) Hmisc::wtd.quantile(z, mywts, probs=0.95, na.rm=TRUE))
count.above(x, benchmarks=mybench2, benchnames=paste('pop.95th.', names(x), sep=''), wts=1 )

# to find how many PLACES are at/above the MEDIAN pop-wtd place
#  (won't be exactly half of places, just half of people):
mybench2 <- sapply(x, function(z) Hmisc::wtd.quantile(z, mywts, probs=0.50, na.rm=TRUE))
count.above(x, benchmarks=mybench2, benchnames=paste('pop.median.', names(x), sep=''), wts=1 )

# to find how many PEOPLE are at/above the 95th percentile place
#  (won't be exactly 5% of people, just 5% of places):
mybench2 <- sapply(x, function(z) stats::quantile(z, probs=0.95, na.rm=TRUE))
count.above(x, benchmarks=mybench2, benchnames=paste('95th.', names(x), sep=''), wts=mywts )
#
## Not run: 
# to find how many PEOPLE are at/above the MEDIAN place
#  (won't be exactly 50% of people, just 50% of places):
mybench2 <- sapply(x, function(z) stats::quantile(z, probs=0.50, na.rm=TRUE))
count.above(x, benchmarks=mybench2, benchnames=paste('median.', names(x), sep=''), wts=mywts )
##not run## cbind( pct.above(1:100, wts=mywts) )
# That does not recycle weights in this situation of a single vector argument
count.above(data.frame(a=c(1:10, NA)), 2, wts=mywts)   # does not work if NA values
cbind( pct.above(data.frame(a=c(1:10, NA)), 0 , wts=mywts))
  # Gives "Error: wts must be a vector whose length is a factor of # rows in df,
  # so length(df[,1]) is an integer multiple of length(wts) "
pct.above(data.frame(a=c(NA, NA, NA)), 3, wts=mywts)
  # Gives "Error - df is a single NA value or single column with only NA values"
count.above(x, c(3,1), wts=mywts) # 3,1 is recycled as 3,1,3 since x has 3 cols
pct.above(x, benchnames=mynames, wts=mywts)
  # ignores names since default benchmarks are column means

## End(Not run)

ejanalysis/analyze.stuff documentation built on Feb. 2, 2024, 11:21 p.m.