textcleaner: Text Cleaner

Description

An automated cleaning function for spell-checking, de-pluralizing, removing duplicates, and binarizing text data

Usage

textcleaner(
  data = NULL,
  type = c("fluency", "free"),
  miss = 99,
  partBY = c("row", "col"),
  dictionary = NULL,
  spelling = c("UK", "US"),
  add.path = NULL,
  keepStrings = FALSE,
  allowPunctuations,
  allowNumbers = FALSE,
  lowercase = TRUE,
  keepLength = NULL,
  keepCue = FALSE,
  continue = NULL
)

Arguments

data

Matrix or data frame.

For type = "fluency", data are expected to follow wide formatting (IDs are the row names and are not a column in the matrix or data frame):

row.names   Response 1   Response 2   Response n
ID_1        1            2            n
ID_2        1            2            n
ID_n        1            2            n

For type = "free", data are expected to follow long formatting:

ID Cue Response
1 1 1
1 1 2
1 1 n
1 2 1
1 2 2
1 2 n
1 n 1
1 n 2
1 n n
2 1 1
2 1 2
2 1 n
2 2 1
2 2 2
2 2 n
2 n 1
2 n 2
2 n n
n 1 1
n 1 2
n 1 n
n 2 1
n 2 2
n 2 n
n n 1
n n 2
n n n
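
The two layouts above can be sketched in base R as follows. This is only an illustration: the participant IDs, cues, and responses are made up, and the object names fluency_wide and free_long are hypothetical.

```r
# Wide format for type = "fluency": one row per participant,
# with IDs stored as row names rather than in a column
fluency_wide <- data.frame(
  Response_1 = c("dog", "cat", "horse"),
  Response_2 = c("cat", "cow", "dog"),
  Response_3 = c("lion", "pig", "zebra"),
  row.names = c("ID_1", "ID_2", "ID_3"),
  stringsAsFactors = FALSE
)

# Long format for type = "free": one row per ID, cue, and response
free_long <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2),
  Cue = c("animal", "animal", "fruit", "animal", "fruit", "fruit"),
  Response = c("dog", "cat", "apple", "horse", "pear", "plum"),
  stringsAsFactors = FALSE
)
```

In the wide layout the IDs live only in rownames(fluency_wide), while the long layout carries ID and Cue as ordinary columns.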
type

Character vector. Type of task to be preprocessed.

  • "fluency" Verbal fluency data (e.g., categories, phonological, synonyms)

  • "free" Free association data (e.g., cue terms or words)

miss

Numeric or character. Value for missing data. Defaults to 99

partBY

Character. Whether participants are arranged by row or by column. Set to "row" if each row is a participant. Set to "col" if each column is a participant

dictionary

Character vector. Dictionary to be used for more efficient text cleaning. Can be a vector of words from a corpus or any text for comparison. Defaults to NULL, which uses general.dictionary

Use dictionaries() or find.dictionaries() for more options (See SemNetDictionaries for more details)

spelling

Character vector. English spelling to be used.

  • "UK" For British spelling (e.g., colour, grey, programme, theatre)

  • "US" For American spelling (e.g., color, gray, program, theater)

add.path

Character. Path to a directory containing additional dictionaries. The search is NOT recursive (subfolders of the path are not searched), which avoids a time-intensive search. Set to "choose" to open an interactive directory explorer

keepStrings

Boolean. Should strings be retained or separated? Defaults to FALSE. Set to TRUE to retain strings as strings

allowPunctuations

Character vector. Allows punctuation characters to be included in responses. Defaults to "-". Set to "all" to keep all punctuation characters

allowNumbers

Boolean. Defaults to FALSE. Set to TRUE to keep numbers in text

lowercase

Boolean. Should words be converted to lowercase? Defaults to TRUE. Set to FALSE to keep words as they are
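
The effect of the allowPunctuations, allowNumbers, and lowercase defaults can be pictured with base R string functions. The sketch below is not the package's implementation; the example responses and the order of the steps are assumptions made for illustration.

```r
# Made-up responses containing punctuation, digits, and upper-case letters
responses <- c("Polar-Bear!", "cat.", "T-Rex?", "dog2")

# Drop punctuation except the hyphen (mirrors the documented default "-")
no_punct <- gsub("[^[:alnum:][:space:]-]", "", responses)

# Drop digits (mirrors allowNumbers = FALSE)
no_digits <- gsub("[[:digit:]]", "", no_punct)

# Convert to lowercase (mirrors lowercase = TRUE)
cleaned <- tolower(no_digits)
```

With these inputs, cleaned holds "polar-bear", "cat", "t-rex", and "dog".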

keepLength

Numeric. Maximum number of words allowed in a response. Defaults to NULL. Set to a number to keep only responses with that many words or fewer (e.g., 3 will keep responses with three or fewer words)
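
A word-count filter of the kind keepLength describes might look like the sketch below. It assumes words are separated by white space; the variable keep_length and the responses are hypothetical, and the package may count words differently.

```r
# Made-up responses of one, two, three, and five words
responses <- c("dog", "polar bear", "great white shark", "a very large brown bear")

# Hypothetical stand-in for keepLength = 3
keep_length <- 3

# Count white-space-separated words in each response
word_counts <- lengths(strsplit(trimws(responses), "\\s+"))

# Keep responses with keep_length words or fewer
kept <- responses[word_counts <= keep_length]
```

Here the five-word response is dropped and the other three are kept.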

keepCue

Boolean. Should cue words be retained in the responses? Defaults to FALSE. Set to TRUE to allow cue words to be retained

continue

List. A previously unfinished result that still needs to be completed. Allows you to continue manually spell-checking your data after the function was closed or errored out. Defaults to NULL

Value

This function returns a list containing the following objects:

binary

A matrix of responses where each row represents a participant and each column represents a unique response. A response that a participant has provided is a '1' and a response that a participant has not provided is a '0'
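
The structure of such a binary matrix can be sketched in base R. This is an illustration of the participant-by-response layout only, not the function's internal code; the cleaned responses and the object name clean are assumptions.

```r
# Made-up cleaned responses, one character vector per participant
clean <- list(
  ID_1 = c("dog", "cat", "horse"),
  ID_2 = c("cat", "cow"),
  ID_3 = c("dog", "zebra")
)

# Every unique response becomes one column
unique_responses <- sort(unique(unlist(clean)))

# One row per participant: 1 if the response was given, 0 otherwise
binary <- t(sapply(clean, function(x) as.integer(unique_responses %in% x)))
colnames(binary) <- unique_responses
```

Each row sum of binary equals the number of unique responses that participant gave.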

responses

A list containing two objects:

  • clean A response matrix that has been spell-checked and de-pluralized with duplicates removed. This can be used as a final dataset for analyses (e.g., fluency of responses)

  • original The original response matrix with white space before and after responses removed. All upper-case letters are also converted to lower case

spellcheck

A list containing the following objects:

  • full All responses regardless of spell-checking changes

  • auto Only the incorrect responses that were changed during spell-check

removed

A list containing two objects:

  • rows Identifies removed participants by their row (or column) location in the original data file

  • ids Identifies removed participants by their ID (see argument data)

partChanges

A list where each participant is a list index containing each response that has been changed. Participants are identified by their ID (see argument data). This can be used to replicate the cleaning process and to keep track of changes more generally. Participants with NA did not have any changes from their original data, and participants with missing data are removed (see removed$ids)

Author(s)

Alexander Christensen <alexpaulchristensen@gmail.com>

References

Christensen, A. P., & Kenett, Y. N. (in press). Semantic network analysis (SemNA): A tutorial on preprocessing, estimating, and analyzing semantic networks. Psychological Methods.

Hornik, K., & Murdoch, D. (2011). Watch Your Spelling! The R Journal, 3, 22-28.

Examples

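# Load the package that provides textcleaner() and the open.animals data
library(SemNetCleaner)

# Toy example: first 10 participants, dropping the first three columns
raw <- open.animals[c(1:10), -c(1:3)]

# textcleaner() involves manual spell-checking, so run it only interactively
if (interactive()) {
    # Full test: all participants, dropping the first two columns
    clean <- textcleaner(open.animals[, -c(1, 2)], partBY = "row", dictionary = "animals")
}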


AlexChristensen/SemNetCleaner documentation built on June 29, 2022, 6:44 a.m.