observedVersusExpected: calculate mutual information between a categorical value (X)...

View source: R/missingValues.R

observedVersusExpected    R Documentation

Calculate mutual information between a categorical value (X) and the absence of a feature in a data set.

Description

This calculates the mutual information between a categorical value (X) and the absence of a feature, for features that are not present in all samples.

Usage

observedVersusExpected(
  df,
  discreteVars,
  sampleVars,
  sampleCount = NULL,
  sampleCountDf = NULL,
  ...
)

Arguments

df

- a dataframe of observations; may be grouped, in which case each group is interpreted as a different type of variable (feature)

discreteVars

- the column(s) of the categorical value (X), quoted by vars(...) (e.g. outcome)

sampleVars

- the column(s) of the sample identifier

sampleCount

- an integer containing the count of all samples per outcome (discreteVars)

sampleCountDf

- a dataframe containing columns for the df grouping (features) and discreteVars (outcomes), plus N and N_x columns with the expected counts of outcomes; see expectSamplesByOutcome(...)

Details

This is relevant for sparse data sets with many features, such as NLP terms, where a term used as a feature may not be present in a given document, and this absence may be asymmetrically distributed between different classes.
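
As a hedged illustration of the quantity involved, the standard mutual information between the outcome X and a binary presence indicator A (A = 1 if the feature occurs in a sample, A = 0 if it is absent) can be written as below. The symbols A, n_{x,1} and the plug-in probability estimates are assumptions introduced here for illustration; N and N_x follow the column names described for sampleCountDf. This page does not state the estimator or logarithm base the function uses, so this is a sketch of the general idea rather than the package's exact formula.

```latex
% Mutual information between outcome X and presence indicator A
I(X; A) = \sum_{x} \sum_{a \in \{0, 1\}} P(x, a) \, \log \frac{P(x, a)}{P(x) \, P(a)}

% Plug-in estimates, assuming n_{x,1} samples with outcome x contain the feature,
% N_x samples have outcome x, and N = \sum_x N_x samples exist in total
P(x, a = 1) = \frac{n_{x,1}}{N}, \qquad
P(x, a = 0) = \frac{N_x - n_{x,1}}{N}, \qquad
P(x) = \frac{N_x}{N}
```

The absent counts N_x - n_{x,1} are never observed directly in sparse data, which is why the total sample counts per outcome have to be supplied or derived separately.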

Value

A dataframe containing the distinct values of the groups of df and, for each group, a mutual information column (I). If df was not grouped, this will be a single entry.
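
Examples

The following is a minimal base R sketch of the idea described above. It does not call observedVersusExpected or reproduce its implementation. The toy data, the helper mi_presence, and the choice of log base 2 are all assumptions made for illustration. Documents that lack a term have no row for it, so absences are inferred from the total document count per outcome.

```r
# Toy long-format data (hypothetical): one row per (document, term) occurrence.
# A document that does not contain a term simply has no row for that term.
obs <- data.frame(
  doc     = c(1, 2, 3, 5, 1, 6),
  outcome = c("pos", "pos", "pos", "neg", "pos", "neg"),
  term    = c("good", "good", "good", "good", "film", "film"),
  stringsAsFactors = FALSE
)

# Total number of documents per outcome, including documents with no rows
# for a given term (assumed known here).
n_x <- c(pos = 4, neg = 4)

# Hypothetical helper: mutual information (in bits) between the outcome and
# the presence/absence of one feature.
#   present: named counts of documents per outcome that contain the feature
#   n_x:     named totals of documents per outcome
mi_presence <- function(present, n_x) {
  # Align observed counts to the outcomes in n_x; unseen outcomes count as 0.
  present <- setNames(as.numeric(present[names(n_x)]), names(n_x))
  present[is.na(present)] <- 0

  # 2 x k contingency table: presence vs absence for each outcome.
  counts <- rbind(present = present, absent = n_x - present)

  p_joint  <- counts / sum(n_x)                      # P(a, x)
  expected <- outer(rowSums(p_joint), colSums(p_joint))  # P(a) * P(x)

  # Cells with zero joint probability contribute nothing to the sum.
  terms <- ifelse(p_joint > 0, p_joint * log2(p_joint / expected), 0)
  sum(terms)
}

# One value per term, analogous to one row per group in a grouped df.
present_docs <- unique(obs[, c("doc", "outcome", "term")])
mi <- sapply(split(present_docs, present_docs$term), function(d) {
  mi_presence(table(d$outcome), n_x)
})
print(mi)
```

In this toy data, "film" occurs equally often in both outcomes, so its absence carries no information about the outcome, whereas "good" is absent mostly from "neg" documents and therefore gets a positive value.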


terminological/tidy-info-stats documentation built on Nov. 19, 2022, 11:23 p.m.