trainLogisticRegressionWithPenalization: Train a logistic regression model with penalization
In malsch/occupationCoding: Supervised Learning for Occupation Coding

trainLogisticRegressionWithPenalization

R Documentation

Train a logistic regression model with penalization

Description

Preprocesses the answer texts (optional stopword removal, stemming and word counting) and then calls glmnet to fit a penalized logistic regression model.

Usage

trainLogisticRegressionWithPenalization(
  data,
  preprocessing = list(stopwords = character(0), stemming = NULL, countWords = FALSE),
  tuning = list(alpha = 0.05, maxit = 10^5, nlambda = 100, thresh = 1e-07)
)

Arguments

data

a data.table created with removeFaultyAndUncodableAnswers_And_PrepareForAnalysis

preprocessing

a list with elements

stopwords: a character vector of stopwords; use tm::stopwords("de") for German stopwords.
stemming: NULL for no stemming, or "de" for stemming with the German Porter stemmer.
countWords: Set to TRUE if the predictor matrix should contain a column for answer length.

tuning

a list with elements that will be passed to glmnet
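
Because tuning defaults to a single list, R replaces the whole default when a partial list such as tuning = list(alpha = 0) is supplied; it does not merge the two. The following is a minimal base-R sketch of one way to change a single element while keeping the others. The names default_tuning and my_tuning are illustrative and not part of the package.

```r
# Defaults copied from the Usage section of this page.
default_tuning <- list(alpha = 0.05, maxit = 10^5, nlambda = 100, thresh = 1e-07)

# utils::modifyList() replaces only the elements named in the second list
# and leaves all other elements unchanged.
my_tuning <- utils::modifyList(default_tuning, list(alpha = 0))

# Show the resulting list: alpha is now 0, the other three elements are kept.
str(my_tuning)
```

The resulting my_tuning list could then be passed as the tuning argument.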

Details

Setting tuning$alpha = 0 (ridge penalty) appears to give the most stable results.

In our experience, glmnet often returns a warning like "3: from glmnet Fortran code (error code -72); Convergence for 72th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned". To resolve this, either increase maxit to allow more iterations or relax the convergence threshold by increasing thresh, since a larger thresh lets the coordinate descent stop earlier.
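
Both maxit and thresh are ordinary glmnet arguments, so their effect can be tried outside this function. The following is a minimal sketch on simulated data, not the package's internal call: the matrix x, the labels y, the choice family = "multinomial" and the particular values of maxit and thresh are assumptions made for illustration.

```r
library(glmnet)

set.seed(1)
# Simulated stand-in for a document-term matrix:
# 150 answers, 20 binary word indicators.
x <- matrix(rbinom(150 * 20, size = 1, prob = 0.3), nrow = 150)
# Three hypothetical occupation codes.
y <- factor(sample(c("A", "B", "C"), 150, replace = TRUE))

# maxit allows more coordinate-descent iterations per lambda value;
# a larger thresh is a looser convergence criterion, so each fit stops earlier.
fit <- glmnet(x, y, family = "multinomial",
              alpha = 0.05, nlambda = 100,
              maxit = 10^6, thresh = 1e-05)

# Number of lambda values for which a solution was returned (at most nlambda).
length(fit$lambda)
```

If the warning persists on real data, comparing length(fit$lambda) with nlambda shows how much of the path was actually fitted.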

Value

A fitted logistic regression model. Functions from glmnet should work with the returned object.
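
As a rough illustration of what "functions from glmnet" can look like, the sketch below fits a small multinomial glmnet model on simulated data and inspects it with print and coef. It assumes that the object returned by trainLogisticRegressionWithPenalization behaves like such a glmnet fit; the data and the object names toy_x, toy_y and toy_fit are hypothetical.

```r
library(glmnet)

set.seed(2)
# Simulated predictors and three hypothetical classes.
toy_x <- matrix(rnorm(120 * 10), nrow = 120)
toy_y <- factor(sample(c("A", "B", "C"), 120, replace = TRUE))

toy_fit <- glmnet(toy_x, toy_y, family = "multinomial", alpha = 0.05)

# Summary of the regularization path (degrees of freedom, deviance, lambda).
print(toy_fit)

# Coefficients at the smallest fitted lambda: for a multinomial fit this is
# a list with one sparse coefficient matrix per class (intercept included).
toy_coef <- coef(toy_fit, s = min(toy_fit$lambda))
length(toy_coef)
```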

Examples

# set up data
data(occupations)
allowed.codes <- c("71402", "71403", "63302", "83112", "83124", "83131", "83132", "83193", "83194", "-0004", "-0030")
allowed.codes.titles <- c("Office clerks and secretaries (without specialisation)-skilled tasks", "Office clerks and secretaries (without specialisation)-complex tasks", "Gastronomy occupations (without specialisation)-skilled tasks",
 "Occupations in child care and child-rearing-skilled tasks", "Occupations in social work and social pedagogics-highly complex tasks", "Pedagogic specialists in social care work and special needs education-unskilled/semiskilled tasks", "Pedagogic specialists in social care work and special needs education-skilled tasks", "Supervisors in education and social work, and of pedagogic specialists in social care work", "Managers in education and social work, and of pedagogic specialists in social care work",
 "Not precise enough for coding", "Student assistants")
proc.occupations <- removeFaultyAndUncodableAnswers_And_PrepareForAnalysis(occupations, colNames = c("orig_answer", "orig_code"), allowed.codes, allowed.codes.titles)

# Recommended configuration
trainLogisticRegressionWithPenalization(proc.occupations,
                 preprocessing = list(stopwords = character(0), stemming = "de", countWords = FALSE),
                 tuning = list(alpha = 0.05, maxit = 10^6, nlambda = 100, thresh = 1e-7))

# Other possibility
trainLogisticRegressionWithPenalization(proc.occupations,
                 preprocessing = list(stopwords = tm::stopwords("de"), stemming = NULL, countWords = TRUE),
                 tuning = list(alpha = 0.05, maxit = 10^6, nlambda = 100, thresh = 1e-7))

malsch/occupationCoding documentation built on March 14, 2024, 8:09 a.m.