removeDups: Remove duplicate entries in a patent data set

View source: R/cleanPatentData.R

Description

Remove duplicate values in the patent data. Typically you will want to check for repeated document numbers. A document number should be unique within your data set, so duplicates should be removed before analysis. You can optionally specify which document type to keep.

Data sets often contain duplicate patent entries. This function is a wrapper around the duplicated function, applied to a data frame or vector.

For example, given the vector [US123, US123, US456], the function returns TRUE FALSE TRUE. The second US123 is marked FALSE, so it is dropped when you subset your data with the result.
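The vector case can be reproduced with base R alone. This is a minimal sketch of the behavior described above, assuming removeDups with default arguments returns the negation of duplicated; it does not call patentr, and the names docNums, keep, and deduped are illustrative only.

```r
# Illustrative vector of document numbers (made-up values)
docNums <- c("US123", "US123", "US456")

# duplicated() flags the second and later occurrences of a value;
# negating it gives TRUE for the entries to keep
keep <- !duplicated(docNums)
print(keep)

# Subsetting with the logical vector drops the repeated entry
deduped <- docNums[keep]
print(deduped)
```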

You can go further with the optional arguments. For many analyses, we want to exclude the second document of a duplicated pair, typically the application. This function lets you choose which document type to keep; the others are discarded.
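One way to picture the "keep a preferred document type" idea is the base R sketch below. It is an assumption about the general technique, not the package's actual implementation: keepPreferred is a hypothetical helper, and the id and docType values are made up. It keeps one entry per identifier, choosing the entry whose type matches keepType when one exists.

```r
# Hypothetical helper, NOT part of patentr: returns a logical vector with
# TRUE for the single entry to keep per id, preferring docType == keepType.
keepPreferred <- function(id, docType, keepType = "grant") {
  preferred <- docType == keepType
  # Visit preferred-type entries first; order() keeps ties in original order
  ord <- order(!preferred)
  keep <- logical(length(id))
  # The first occurrence of each id in the visiting order is the one kept
  keep[ord] <- !duplicated(id[ord])
  keep
}

# Made-up data: A1 and C3 each appear as both an application and a grant
id      <- c("A1",  "A1",    "B2",  "C3",    "C3")
docType <- c("app", "grant", "app", "grant", "app")

keep <- keepPreferred(id, docType)
print(data.frame(id, docType, keep))
```

In this sketch an identifier that appears only once (B2) is kept regardless of its type, which may or may not match how removeDups treats such rows.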

Usage

removeDups(input, hasDup = NA, docType = NA, keepType = "grant")

Arguments

input

A vector or a data frame from which you wish to remove duplicate values. Passing a data frame lets you be more selective. For example, you may want to remove a patent document only if both its docNum and its country code match another row.

hasDup

A logical vector noting whether a duplicate exists. If NA, ignore. The showDups function helps generate this input.

docType

A character vector of the type of patent document (app, grant, etc.). If NA, ignore.

keepType

A character value denoting which document type to keep. Default is "grant". If NA, ignore.
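The difference between passing a vector and passing a data frame as input can be illustrated with base R's duplicated, which the Description says this function wraps. This is a sketch under that assumption; the patents data frame and its column names are invented for illustration and it does not call patentr.

```r
# Made-up data: the same number "123" appears under two country codes
patents <- data.frame(
  docNum      = c("123", "123", "123", "456"),
  countryCode = c("US",  "US",  "EP",  "US"),
  stringsAsFactors = FALSE
)

# Vector input: any repeated docNum is dropped, regardless of country
keepByNum <- !duplicated(patents$docNum)

# Data frame input: a row is dropped only if every column matches an earlier row
keepByBoth <- !duplicated(patents[, c("docNum", "countryCode")])

print(cbind(patents, keepByNum, keepByBoth))
```

Here the EP row survives the data frame check because its country code differs, even though its number repeats.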

Value

A logical vector used to remove duplicate documents that do not match the chosen type. TRUE marks a document to keep.

See Also

duplicated, showDups

Examples

# simple removal: see how many rows were removed
dim(acars) - dim(acars[removeDups(acars$appNum),])

# specific removal: keep the grant docs
hasDup <- showDups(acars$appNum)
pubNum <- extractPubNumber(acars$docNum)
countryCode <- extractCountryCode(acars$docNum)
officeDocLength <- extractDocLength(countryCode = countryCode, pubNum = pubNum)
kindCode <- extractKindCode(acars$docNum)
countryAndKindCode <- paste0(countryCode, kindCode)
docType <- generateDocType(officeDocLength = officeDocLength,
                           countryAndKindCode = countryAndKindCode,
                           cakcDict = patentr::cakcDict,
                           docLengthTypesDict = patentr::docLengthTypesDict)
keepType <- "grant"
toKeep <- removeDups(acars$appNum, hasDup = hasDup, docType = docType, keepType = keepType)
table(toKeep)
acarsDedup <- acars[toKeep, ]

kamilien1/patentr documentation built on May 20, 2019, 7:19 a.m.