fixName: Preparing People's Names


fixName R Documentation

Description

Standardize name notation

Usage

fixName(
  nomes,
  sep.in = c(";", "&", "|", " e ", " y ", " and ", " und ", " et "),
  sep.out = "|",
  bad.comma = TRUE,
  special.char = FALSE
)

Arguments

nomes

a character string or a vector with names.

sep.in

a vector of the symbols separating multiple names. Defaults to: ";", "&", "|", " e ", " y ", " and ", " und ", and " et ".

sep.out

a character string with the symbol separating multiple names in the output string. Defaults to "|". If a character vector of length 2 or more is supplied, the first element is used with a warning.

bad.comma

logical. Should cases in which the source data use commas to separate both last names from first names/initials and multiple people's names be isolated and, where possible, fixed? Defaults to TRUE.

special.char

logical. Should special characters be maintained? Defaults to FALSE.

Details

The function fixes small problems in name notation (e.g. orphan spaces) and standardizes the separation between multiple authors and between initials and/or prepositions within the same name. It also standardizes the notation of some compound names (e.g. Faria Jr. to Faria Junior). In addition, the function removes numbers, some unwanted expressions (e.g. 'et al.') and symbols (e.g. ? or !).
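As a rough illustration of the kind of cleaning described above, the following base R sketch removes 'et al.', digits, the symbols ? and !, and orphan spaces. The helper tidy_name is hypothetical and is not part of plantR; the regular expressions are assumptions chosen for the illustration and are not taken from the source of fixName.

```r
# Hypothetical helper, NOT the plantR implementation: a minimal sketch of
# the cleaning steps described in the Details section.
tidy_name <- function(x) {
  # Drop the expression "et al." (with or without the final period)
  x <- gsub("\\bet\\s+al\\.?", "", x, ignore.case = TRUE, perl = TRUE)
  # Drop digits and the symbols ? and !
  x <- gsub("[0-9?!]", "", x)
  # Collapse runs of whitespace and trim orphan spaces at both ends
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

tidy_name(c("Gert G.  Hatschbach et al.", "Pedro  Moraes 1998?"))
```

The real function does considerably more (initials, prepositions, compound names), so this sketch should only be read as a picture of the general approach.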

The function was created to deal with people's names, so input separators composed only of letters should be surrounded by spaces. If separators are non-alphabetic characters (e.g. semicolons, ampersands), they are matched regardless of the presence of spaces nearby.
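The reason for this rule can be seen in a small base R sketch. The helper normalize_separators is hypothetical and only mimics the behaviour described above; it is not how fixName is implemented. Word separators are matched literally, spaces included, so that letters inside a name (the "and" in "Fernando") are never treated as a separator, while symbol separators absorb any spaces around them.

```r
# Hypothetical helper, NOT part of plantR: illustrates why separators made
# of letters need surrounding spaces while symbols do not.
normalize_separators <- function(x,
                                 sep.in = c(";", "&", "|", " e ", " y ",
                                            " and ", " und ", " et "),
                                 sep.out = "|") {
  for (sep in sep.in) {
    if (grepl("[[:alpha:]]", sep)) {
      # Word separator: literal match, including its surrounding spaces
      x <- gsub(sep, sep.out, x, fixed = TRUE)
    } else {
      # Symbol separator: quote the symbol and absorb nearby spaces
      pattern <- paste0("\\s*\\Q", sep, "\\E\\s*")
      x <- gsub(pattern, sep.out, x, perl = TRUE)
    }
  }
  x
}

normalize_separators(c("Karl Emrich & Balduino Rambo",
                       "Karl Emrich;Balduino Rambo",
                       "Karl Emrich and Balduino Rambo",
                       "Fernando Costa"))
```

In this sketch the first three inputs all become "Karl Emrich|Balduino Rambo", and "Fernando Costa" is left untouched because " and " does not occur in it.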

By default, commas are not among the symbols separating multiple people's names, because commas are often used to separate people's last names from their first names or initials. Some name notations use commas both to separate last names from first names/initials and to separate multiple people's names (which is not at all encouraged). For some of these cases (e.g. "M. Costa, J. Ribeiro"), but not all of them (e.g. "Costa, M., Ribeiro, J."), the function tries to isolate and resolve the separation between multiple people's names. This procedure is currently very preliminary and may introduce noise into the name notation. If that happens, it can be skipped by setting the argument bad.comma to FALSE.
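One way to check whether the comma handling helps or hurts for a given data set is to run the function with and without it and compare the results. The sketch below assumes that the plantR package is installed and that fixName is called as documented in the Usage section; the two inputs are the ones quoted above, and no particular output is claimed here.

```r
# The two comma notations mentioned in the Details section
ambiguous <- c("M. Costa, J. Ribeiro", "Costa, M., Ribeiro, J.")

# Assumption: plantR is installed; the comparison is skipped otherwise
if (requireNamespace("plantR", quietly = TRUE)) {
  # Default behaviour: try to isolate and fix the comma-separated names
  print(plantR::fixName(ambiguous))
  # Skip the comma handling entirely
  print(plantR::fixName(ambiguous, bad.comma = FALSE))
}
```

Inspecting both outputs side by side on a sample of the real data is probably the safest way to decide which setting to keep.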

Due to common encoding problems related to Latin characters, names are returned without accents by default. Users can instead keep accents and special characters in the output by setting the argument special.char to TRUE.

Value

The character string(s) in nomes in the standard notation, to facilitate further data processing.

Author(s)

Renato A. F. de Lima & Hans ter Steege

Examples

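  # Names with typical notation problems: missing spaces, suffixes,
  # multiple people, parentheses and a trailing date
  names <- c("J.E.Q. Faria Jr.",
  "Leitão F°, H.F.", "Gert G. Hatschbach, et al.",
  "Karl Emrich & Balduino Rambo",
  '( Karl) Emrich ;(Balduino ) Rambo', "F.daS.N.Thomé",
  'F. da S.N. Thomé', 'Pedro L.R.de Moraes (30/4/1998)')
  # Declare the strings as Latin-1 encoded
  Encoding(names) <- "latin1"
  names

  # Default output, without accents
  fixName(names)
  # Keep accents and special characters
  fixName(names, special.char = TRUE)
  # Use a different separator between multiple names
  fixName(names, sep.out = " | ")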


LimaRAF/plantR documentation built on Jan. 1, 2023, 10:18 a.m.