dsldEDFFair: dsldEDFFair Wrappers

dsldEDFFair Wrappers

Description

Explicitly Deweighted Features: control the effect of proxies related to sensitive variables for prediction.

Usage

dsldQeFairKNN(data, yName, sNames, deweightPars = NULL, yesYVal = NULL, k = 25,
  scaleX = TRUE, holdout = floor(min(1000, 0.1 * nrow(data))))
dsldQeFairRF(data, yName, sNames, deweightPars = NULL, nTree = 500,
  minNodeSize = 10, mtry = floor(sqrt(ncol(data))), yesYVal = NULL,
  holdout = floor(min(1000, 0.1 * nrow(data))))
dsldQeFairRidgeLin(data, yName, sNames, deweightPars = NULL,
  holdout = floor(min(1000, 0.1 * nrow(data))))
dsldQeFairRidgeLog(data, yName, sNames, deweightPars = NULL,
  holdout = floor(min(1000, 0.1 * nrow(data))),
  yesYVal = levels(data[, yName])[2])
## S3 method for class 'dsldQeFair'
predict(object, newx, ...)

Arguments

data

Dataframe, training set.

yName

Name of the response variable column.

sNames

Name(s) of the sensitive attribute column(s).

deweightPars

Named list of values for de-emphasizing the proxy variables, e.g. 'list(age=0.2,gender=0.5)'. In the linear case, larger values mean more deweighting, i.e. less influence of the given variable on predictions. For KNN and random forests, smaller values mean more deweighting.

scaleX

Scale the features. Defaults to TRUE.

yesYVal

Y value to be considered "yes," to be coded 1 rather than 0.

k

Number of nearest neighbors. In functions other than dsldQeFairKNN for which this is an argument, it is the number of neighbors to use in finding conditional probabilities via knnCalib.

holdout

Number of rows to use as the holdout/testing set. Can be NULL. The testing set is used to calculate the correlation with the sensitive variables (S) and the test accuracy.

nTree

Number of trees.

minNodeSize

Minimum number of data points in a tree node.

mtry

Number of variables randomly tried at each split.

object

An object of class 'dsldQeFair', as returned by one of the dsldQeFair wrappers.

newx

New data to be predicted. Must be in the same format as original data.

...

Further arguments.

Details

The sensitive variables S are removed entirely, but there is concern that they still affect prediction indirectly, via a set C of proxy variables.

Linear EDF reduces the impact of the proxies through a shrinkage process similar to that of ridge regression. Specifically, instead of minimizing the sum of squared errors SSE with respect to a coefficient vector b, we minimize SSE + the squared norm of Db, where D is a diagonal matrix with nonzero elements corresponding to C. Large values in D penalize the variables in C, thus shrinking their coefficients.
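The penalized objective described above has a closed-form minimizer, b = (X'X + D'D)^{-1} X'y. The following is a minimal base-R sketch of that idea on simulated data, not the package's implementation: the helper edfRidge() and all variable names are hypothetical, and details such as centering or scaling that the wrappers may apply are omitted.

```r
# Hypothetical helper (not a dsld function): minimize SSE + ||D b||^2.
# X: numeric matrix including an intercept column.
# d: one penalty per column of X, zero for columns that are not proxies.
edfRidge <- function(X, y, d) {
  D <- diag(d, nrow = length(d))
  # Normal equations of the penalized problem: (X'X + D'D) b = X'y
  drop(solve(crossprod(X) + crossprod(D), crossprod(X, y)))
}

set.seed(1)
n <- 500
s <- rbinom(n, 1, 0.5)             # sensitive variable, excluded from X
proxy <- s + rnorm(n, sd = 0.5)    # proxy correlated with s
x1 <- rnorm(n)                     # ordinary feature
y <- 1 + 2 * x1 + 1.5 * s + rnorm(n)
X <- cbind(intercept = 1, x1 = x1, proxy = proxy)

bOLS <- edfRidge(X, y, d = c(0, 0, 0))    # no penalty: ordinary least squares
bEDF <- edfRidge(X, y, d = c(0, 0, 50))   # penalize only the proxy column
round(rbind(bOLS, bEDF), 3)
```

With all penalties at zero the fit coincides with ordinary least squares; raising the penalty on the proxy column pulls that coefficient toward zero while leaving the other columns unpenalized.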

KNN EDF reduces the weights in Euclidean distance for variables in C. The random forests version reduces the probabilities that a proxy will be used in splitting a node.
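Both mechanisms can be illustrated in a few lines of base R. This is only a sketch under stated assumptions: weightedNN() is a hypothetical helper, the features are assumed to be already scaled, and the split-sampling step is a simplified stand-in for what a random forest does at each node, not the package's code.

```r
# Hypothetical helper (not a dsld function): indices of the k rows of X
# nearest to newx under a feature-weighted Euclidean distance.
weightedNN <- function(X, newx, w, k = 1) {
  # Each feature difference is multiplied by its weight before squaring,
  # so a weight below 1 shrinks that feature's contribution
  d2 <- colSums(((t(X) - newx) * w)^2)
  order(d2)[seq_len(k)]
}

X <- rbind(A = c(a = 0.2, proxy = 2.0),
           B = c(a = 1.0, proxy = 0.1),
           C = c(a = 3.0, proxy = 3.0))
query <- c(a = 0, proxy = 0)

nnFull <- weightedNN(X, query, w = c(1, 1))     # proxy at full weight
nnEDF  <- weightedNN(X, query, w = c(1, 0.1))   # proxy deweighted
rownames(X)[c(nnFull, nnEDF)]                   # "B" "A"

# Random forest analogue: draw the candidate split variables with unequal
# probabilities, so the proxy is offered to a node less often.
set.seed(1)
feats <- c("age", "priors", "proxy")            # illustrative names only
splitProb <- c(1, 1, 0.3)                       # smaller value: drawn less often
draws <- replicate(2000, sample(feats, 2, prob = splitProb))
proxyShare <- mean(colSums(draws == "proxy") > 0)
proxyShare                                      # below the 2/3 of uniform sampling
```

In the KNN part, deweighting the proxy changes the nearest neighbor from B (close on the proxy) to A (close on the remaining feature), which is consistent with smaller values meaning more deweighting.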

By using various values of the deweighting parameters, the user can choose a desired position in the Fairness-Utility Tradeoff.
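One way to picture the tradeoff is to sweep a deweighting value and record a utility measure next to a fairness measure. The sketch below does this in base R on simulated data, using the linear penalty; fitEDF() and the variable names are hypothetical, and in-sample mean squared error and the absolute correlation between predictions and S are stand-ins chosen for illustration.

```r
# Hypothetical helper (not a dsld function): penalized least squares,
# minimizing SSE + sum((d * b)^2)
fitEDF <- function(X, y, d) {
  drop(solve(crossprod(X) + diag(d^2, nrow = length(d)), crossprod(X, y)))
}

set.seed(2)
n <- 2000
s <- rbinom(n, 1, 0.5)             # sensitive variable, never used as a feature
proxy <- s + rnorm(n, sd = 0.5)    # proxy correlated with s
x1 <- rnorm(n)
y <- 1 + 2 * x1 + 1.5 * s + rnorm(n)
X <- cbind(1, x1, proxy)

dVals <- c(0, 5, 20, 50, 1000)     # penalty applied to the proxy column only
tradeoff <- t(sapply(dVals, function(d) {
  pred <- drop(X %*% fitEDF(X, y, c(0, 0, d)))
  c(d = d,
    mse = mean((y - pred)^2),       # utility: lower is better
    absCorS = abs(cor(pred, s)))    # fairness: lower is better
}))
tradeoff
```

Each row is one candidate position: heavier deweighting raises the error while weakening the link between predictions and S. With the wrappers themselves, the analogous comparison would refit with different deweightPars and inspect the testAcc and corrs components used in the Examples.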

More details can be found in the references.

Value

The EDF functions return objects of class 'dsldQeFair', which include components for test and base accuracy, summaries of inputs and so on.

Author(s)

N. Matloff, A. Mittal, J. Tran

References

https://github.com/matloff/EDFfair

See Also

Matloff, Norman, and Wenxi Zhang. "A novel regularization approach to fair ML."
arXiv preprint arXiv:2208.06557 (2022).

Examples

  

data(compas1) 
data(svcensus) 

# dsldQeFairKNN: deweight "decile score" column with "race" as 
# the sensitive variable
knnOut <- dsldQeFairKNN(compas1, "two_year_recid", "race", 
   list(decile_score=0.1), yesYVal = "Yes")
knnOut$testAcc 
knnOut$corrs 
predict(knnOut, compas1[1,-8]) 

# dsldQeFairRF: deweight "decile score" column with "race" as sensitive variable
rfOut <- dsldQeFairRF(compas1, "two_year_recid", "race", 
   list(decile_score=0.3), yesYVal = "Yes")
rfOut$testAcc
rfOut$corrs 
predict(rfOut, compas1[1,-8]) 

# dsldQeFairRidgeLin: deweight "occupation" and "age" columns
lin <- dsldQeFairRidgeLin(svcensus, "wageinc", "gender", deweightPars = 
  list(occ=.4, age=.2))
lin$testAcc 
lin$corrs 
predict(lin, svcensus[1,-4])

# dsldQeFairRidgeLog: deweight "decile score" column
log <- dsldQeFairRidgeLog(compas1, "two_year_recid", "race", 
  list(decile_score=0.1), yesYVal = "Yes")
log$testAcc 
log$corrs 
predict(log, compas1[1,-8])

dsld documentation built on Sept. 14, 2024, 1:08 a.m.