wtd.colMeans: Weighted Mean of each Column - WORK IN PROGRESS (NA HANDLING...

View source: R/wtd.colMeans.R

wtd.colMeans R Documentation

Weighted Mean of each Column - WORK IN PROGRESS (NA HANDLING NOT YET TESTED)

Description

Returns the weighted mean of each column of a data.frame or matrix, using the specified weights, one weight per row. Relies on weighted.mean() and, unlike wtd.colMeans2(), also uses data.table::data.table().
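The idea of one weight per row applied to every column can be illustrated with a minimal base R sketch. The helper name wtd_col_means_sketch is hypothetical and this is not the package's implementation (which, per the description, also uses data.table); it only shows stats::weighted.mean() applied column by column.

```r
# Hypothetical helper for illustration only; not the code behind wtd.colMeans().
wtd_col_means_sketch <- function(x, wts = rep(1, NROW(x)), na.rm = TRUE) {
  x <- as.data.frame(x)
  # Apply stats::weighted.mean() to each column, reusing the same row weights.
  vapply(x,
         function(col) stats::weighted.mean(col, w = wts, na.rm = na.rm),
         numeric(1))
}

df <- data.frame(a = 1:4, b = c(10, 20, 30, 40))
wtd_col_means_sketch(df)                       # unweighted: a = 2.5, b = 25
wtd_col_means_sketch(df, wts = c(1, 1, 1, 5))  # row 4 counts five times: a = 3.25, b = 32.5
```

The result is a named numeric vector with one element per column, matching the shape described under Value for the ungrouped case.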

Usage

wtd.colMeans(x, wts, by = NULL, na.rm = TRUE, dims = 1)

Arguments

x

Data.frame or matrix, required.

wts

Numeric vector of weights, optional, with length equal to the number of rows of x. Defaults to 1 for every row, which gives the unweighted mean.

by

Optional, default is NULL (no grouping). A single column name (as character) or a character vector of column names specifying what to group by, producing the weighted mean within each group. See help for data.table::data.table()

na.rm

Logical value, optional, TRUE by default. Defines whether NA values should be removed before the result is computed. If FALSE, the result for a column is NA whenever that column contains any NA.

dims

Integer, default is 1. Currently not used. In base::colMeans(), the argument of the same name defines which dimensions are regarded as 'rows' or 'columns' to sum over: for row*, the sum or mean is over dimensions dims+1, ...; for col* it is over dimensions 1:dims.

Details

** Factor and character fields are not yet handled well.

For a given column of data values:
If only some values are NA (and no wts are NA) and na.rm = TRUE (the default), returns the weighted mean of all non-NA values.
If only some values are NA (and no wts are NA) and na.rm = FALSE, returns NA.
If all values are NA (and no wts are NA), returns NaN.
If any weights are NA, it behaves like stats::weighted.mean and returns NA, unless each value corresponding to an NA weight is also NA and is therefore removed.
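Because these cases are described as following stats::weighted.mean, they can be checked directly on a single vector in base R. This sketch exercises stats::weighted.mean() only, not wtd.colMeans() itself, whose NA handling is marked above as not yet tested.

```r
x <- c(NA, 2, 3, 4)
w <- c(1, 1, 1, 1)

# Some values NA, no weights NA
some_na_rm   <- stats::weighted.mean(x, w, na.rm = TRUE)   # 3 (mean of 2, 3, 4)
some_na_keep <- stats::weighted.mean(x, w, na.rm = FALSE)  # NA

# All values NA, no weights NA: nothing is left after removal, so 0/0
all_na <- stats::weighted.mean(rep(NA_real_, 4), w, na.rm = TRUE)  # NaN

# NA weight paired with a non-NA value: the NA propagates
na_wt_kept <- stats::weighted.mean(x, c(1, NA, 1, 1), na.rm = TRUE)  # NA

# NA weight paired with an NA value: both are dropped together
na_wt_dropped <- stats::weighted.mean(x, c(NA, 1, 1, 1), na.rm = TRUE)  # 3
```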

Note that Hmisc::wtd.mean is not exactly the same as stats::weighted.mean, because their na.rm defaults differ:
Hmisc::wtd.mean(x, weights = NULL, normwt = "ignored", na.rm = TRUE)
stats::weighted.mean(x, w, ..., na.rm = FALSE)

Value

If by is not specified, returns a numeric vector with one element per column of x. If by is specified, returns the weighted mean of each column within each subset defined by by.
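The grouped case can be sketched in base R as follows. The helper wtd_col_means_by_sketch is hypothetical, it takes the grouping as a vector rather than as column names, and the shape it returns (a matrix with one row per group) is an assumption made for illustration; the actual function relies on data.table and may return a different structure.

```r
# Hypothetical helper for illustration only; not the code behind wtd.colMeans().
wtd_col_means_by_sketch <- function(x, wts, by) {
  # Row indices belonging to each group, in sorted group order.
  idx <- split(seq_len(nrow(x)), by)
  # For each group, compute the weighted mean of every column,
  # using only that group's rows and weights.
  res <- vapply(idx, function(i) {
    vapply(x[i, , drop = FALSE],
           function(col) stats::weighted.mean(col, w = wts[i], na.rm = TRUE),
           numeric(1))
  }, numeric(ncol(x)))
  t(res)  # one row per group, one column per variable
}

df  <- data.frame(v1 = c(1, 2, 3, 4), v2 = c(10, 20, 30, 40))
grp <- c("R1", "R1", "R2", "R2")
pop <- c(1, 3, 1, 1)
res <- wtd_col_means_by_sketch(df, wts = pop, by = grp)
res  # R1: v1 = 1.75, v2 = 17.5; R2: v1 = 3.5, v2 = 35
```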

Examples

  # library(analyze.stuff)
  wtd.colMeans(data.frame(a = 1:4, b = c(NA, 2, 3, 4)))
  wtd.colMeans(data.frame(a = 1:4, b = c(NA, 2, 3, 4)),  wts = c(1,1,1,1))
  wtd.colMeans(data.frame(a = 1:4, b = c(NA, 2, 3, 4)),  wts = c(NA,1,1,1))
  wtd.colMeans(data.frame(a = 1:4, b = c(NA, 2, 3, 4)),  wts = c(1,NA,1,1))
  wtd.colMeans(data.frame(a = 1:4, b = c(NA, 2, NA, 4)), wts = c(1,1,1,1))
  wtd.colMeans(data.frame(a = 1:4, b = c(NA, NA, NA, NA)), wts = c(1,1,1,1))

  # tests of wtd.colMeans

suppressWarnings({

wtd.colMeans(data.frame(a = 1:4, someNA = c(NA, 2, 3, 4)))

wtd.colMeans(data.frame(a = 1:4,
  someNA = c(NA, 2, 3, 4)),    wts = c(1,1,1,1))
wtd.colMeans(data.frame(a = 1:4,
  someNA = c(NA, 2, NA, 4)),   wts = c(1,1,1,1))
wtd.colMeans(data.frame(a = 1:4,
  someNA = c(NA, NA, NA, NA)), wts = c(1,1,1,1))

wtd.colMeans(data.frame(a = 1:4,
  someNA = c(NA, 2, 3, 4)),    wts = c(NA,1,1,1))
wtd.colMeans(data.frame(a = 1:4,
  someNA = c(NA, 2, 3, 4)),    wts = c(1,NA,1,1))
wtd.colMeans(data.frame(a = 1:4,
  someNA = c(NA, 2, 3, 4)),    wts = c(1,NA,NA,NA))
wtd.colMeans(data.frame(a = 1:4,
  someNA = c(NA, 2, 3, 4)),    wts = c(NA,NA,NA,NA))
wtd.colMeans(data.frame(a = 1:4,
  someNA = c(NA, NA, NA, NA)), wts = c(NA,NA,NA,NA))

wtd.colMeans(data.frame(a = 1:4,
  someNA = c(NA, 2, 3, 4)), na.rm = FALSE)
wtd.colMeans(data.frame(a = 1:4,
  someNA = c(NA, 2, 3, 4)),    wts = c(1,1,1,1), na.rm = FALSE)
wtd.colMeans(data.frame(a = 1:4,
  someNA = c(NA, 2, NA, 4)),   wts = c(1,1,1,1), na.rm = FALSE)
wtd.colMeans(data.frame(a = 1:4,
  someNA = c(NA, NA, NA, NA)), wts = c(1,1,1,1), na.rm = FALSE)

wtd.colMeans(data.frame(a = 1:4,
  someNA = c(NA, 2, 3, 4)),    wts = c(NA,1,1,1), na.rm = FALSE)
wtd.colMeans(data.frame(a = 1:4,
  someNA = c(NA, 2, 3, 4)),    wts = c(1,NA,1,1), na.rm = FALSE)
wtd.colMeans(data.frame(a = 1:4,
  someNA = c(NA, 2, 3, 4)),    wts = c(1,NA,NA,NA), na.rm = FALSE)
wtd.colMeans(data.frame(a = 1:4,
  someNA = c(NA, 2, 3, 4)),    wts = c(NA,NA,NA,NA), na.rm = FALSE)
wtd.colMeans(data.frame(a = 1:4,
  someNA = c(NA, NA, NA, NA)), wts = c(NA,NA,NA,NA), na.rm = FALSE)
})
  n <- 1e6
  mydf <- data.frame(pop = 1000 + abs(rnorm(n, 1000, 200)),
    v1 = runif(n, 0, 1),
    v2 = rnorm(n, 100, 15),
    REGION = c('R1', 'R2', sample(c('R1', 'R2', 'R3'), n - 2,
      replace = TRUE)),
    stringsAsFactors = FALSE)
  mydf$pop[mydf$REGION == 'R2'] <- 4 * mydf$pop[mydf$REGION == 'R2']
  mydf$v1[mydf$REGION == 'R2'] <- 4 * mydf$v1[mydf$REGION == 'R2']
  wtd.colMeans(mydf[ , 1:3])
  wtd.colMeans(mydf[ , 1:3], wts = mydf$pop)
  wtd.colMeans(mydf, by = 'REGION')
  # The following call is commented out because it makes R hang:
  # wtd.colMeans(mydf[1:100, 1:3], by = mydf$REGION,
  #   wts = mydf$pop)
  mydf2 <- data.frame(a = 1:3, b = c(1, 2, NA))
  wtd.colMeans(mydf2)
  wtd.colMeans(mydf2, na.rm = TRUE)

ejanalysis/analyze.stuff documentation built on April 2, 2024, 10:10 a.m.