cleanNames: Clean up string names.


Description

Quickly cleans up the characters in a vector of names, typically assignee (company) names and inventor names.

If you run into errors with this function, you may need to convert the input to UTF-8 or ASCII first. Use iconv(thisVector, to = "UTF-8") or iconv(thisVector, to = "ASCII"), which should fix the problem. See the examples for the code.
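As a minimal sketch of that conversion, the snippet below uses base R's iconv on a small made-up vector. The vector name rawVector and its contents are hypothetical, and the choice of sub = "" (drop bytes that cannot be converted instead of returning NA) is an assumption about what is convenient here, not something the package prescribes.

```r
# Hypothetical input: one name with accented characters, one plain ASCII name.
rawVector <- c("Soci\u00e9t\u00e9 G\u00e9n\u00e9rale", "Acme Corp")

# Convert to ASCII. Without `sub`, iconv() returns NA for any element it
# cannot convert; sub = "" drops the non-convertible bytes instead.
asciiVector <- iconv(rawVector, from = "UTF-8", to = "ASCII", sub = "")

# tolower() can now run without encoding-related errors.
lowerVector <- tolower(asciiVector)
print(lowerVector)
```

Dropping bytes loses the accented letters entirely, so inspect the result before relying on it for matching.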

This function:

  1. Removes values between parentheses, such as (US)

  2. Changes all names to lower case
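The two steps listed above can be approximated in base R as follows. This is only a sketch, assuming the first step means stripping parenthesised tags such as (US); sketchCleanNames is a hypothetical helper and not the package's actual implementation.

```r
# Hypothetical helper, not part of patentr: approximates the two listed steps.
sketchCleanNames <- function(rawNames) {
  # Step 1: drop parenthesised tags such as "(US)", plus any space before them
  noParens <- gsub("[[:space:]]*\\([^)]*\\)", "", rawNames)
  # Step 2: lower-case everything, then trim stray leading/trailing whitespace
  trimws(tolower(noParens))
}

sketchResult <- sketchCleanNames(c("Google Inc. (US)", "FORD GLOBAL TECH LLC"))
print(sketchResult)
```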

Usage

cleanNames(rawNames, firstAssigneeOnly = TRUE, sep = ";",
  removeStopWords = TRUE, stopWords = patentr::assigneeStopWords)

Arguments

rawNames

The character vector you want to clean up

firstAssigneeOnly

A logical value, default TRUE. If TRUE, keeps only the first assignee when multiple exist.

sep

The separating character for multiple assignees. Defaults to a semicolon (";").

removeStopWords

A logical value, default TRUE. If TRUE, removes the common company stop words given in the stopWords parameter.

stopWords

An optional character vector of words you want to remove. Defaults to patentr::assigneeStopWords.
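To illustrate how firstAssigneeOnly, sep, and stopWords might interact, here is a rough base R sketch. It is not the package's code: the input vector, the exampleStopWords values, and the whole-word matching are assumptions made for illustration, and the real contents of patentr::assigneeStopWords may differ.

```r
# Hypothetical, already lower-cased input with two assignees in the first entry.
exampleNames <- c("google inc;waymo llc", "ford global tech llc")

# firstAssigneeOnly = TRUE with sep = ";": keep the text before the first separator.
firstOnly <- vapply(strsplit(exampleNames, ";", fixed = TRUE),
                    function(parts) parts[1], character(1), USE.NAMES = FALSE)

# Illustrative stop words only; NOT the actual patentr::assigneeStopWords.
exampleStopWords <- c("inc", "llc")

# Remove stop words as whole words, then collapse and trim leftover whitespace.
stopPattern <- paste0("\\b(", paste(exampleStopWords, collapse = "|"), ")\\b")
noStops <- gsub(stopPattern, "", firstOnly, perl = TRUE)
cleanedNames <- trimws(gsub("[[:space:]]+", " ", noStops))
print(cleanedNames)
```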

Value

A character vector of cleaned-up names.

Examples

assigneeNames <- cleanNames(acars$assignee)
# get a feel for the less-messy data
head(sort(table(assigneeNames), decreasing = TRUE))

# for a messier example, convert to ASCII/UTF-8 first to avoid the errors
# that tolower raises on badly encoded strings
rawGoogleData <- system.file("extdata", "google_autonomous_search.csv", package = "patentr")
rawGoogleData <- read.csv(rawGoogleData, stringsAsFactors = FALSE, skip = patentr::skipGoogle)
rawGoogleData <- data.frame(
  lapply(rawGoogleData, function(x) iconv(x, to = "ASCII")),
  stringsAsFactors = FALSE
)
assigneeClean <- cleanNames(rawGoogleData$assignee)
head(sort(table(assigneeClean), decreasing = TRUE))

kamilien1/patentR documentation built on May 20, 2019, 7:19 a.m.