cleanPatentData: Generate a clean data set from the imported raw data.


View source: R/cleanPatentData.R

Description

Generate a clean data set from the imported raw data set. The data available dictates the number of columns of attributes that can be generated.

Sumobrain, Lens.org, and Google Patents have varying levels of data available.

If you import your own data, adhere to the template format, or read this documentation carefully to define your own format.

Usage

cleanPatentData(patentData = NULL, columnsExpected, cleanNames,
  dateFields = NA, dateOrders, deduplicate = TRUE,
  cakcDict = patentr::cakcDict,
  docLengthTypesDict = patentr::docLengthTypesDict, keepType = "grant",
  firstAssigneeOnly = TRUE, assigneeSep = ";",
  stopWords = patentr::assigneeStopWords)

Arguments

patentData

The data frame of initial raw patent data.

columnsExpected

The expected width (number of columns) of the data frame, as a numeric value.

cleanNames

A character vector of length columnsExpected to rename the data frame with.
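The relationship between columnsExpected and cleanNames can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame rawDemo, its contents, and the names in cleanNamesDemo are hypothetical, and the check shown is not necessarily how cleanPatentData validates its input.

```r
# Hypothetical two-column raw import; not patentr data.
rawDemo <- data.frame(Document.Number = c("US1234567B2", "US20150000001A1"),
                      Title.Text = c("Widget", "Gadget"),
                      stringsAsFactors = FALSE)

columnsExpectedDemo <- 2                       # expected width of the import
cleanNamesDemo <- c("docNum", "title")         # one clean name per column

# Both the data frame width and the name vector length must match.
stopifnot(ncol(rawDemo) == columnsExpectedDemo,
          length(cleanNamesDemo) == columnsExpectedDemo)

names(rawDemo) <- cleanNamesDemo
```

If either length differs from the expected width, stopifnot() raises an error before any renaming happens.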

dateFields

A character vector of the date column names which will be converted to 'Date' format.

dateOrders

A character string giving the order of the date components, used to convert string data into 'Date' data. Sumobrain data use "ymd"; Lens.org and Google Patents data use "mdy". Hardcoded values include googleDateOrder, lensDateOrder, and sumobrainDateOrder.
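To illustrate what the "ymd" and "mdy" orders mean, here is a minimal base R sketch. The sample strings and their separators are made up for illustration, and as.Date() with an explicit format is a stand-in: this page does not state which parser cleanPatentData uses internally.

```r
# Hypothetical raw date strings; real exports may use other separators.
ymdRaw <- c("2014-03-07", "2016-11-21")   # year, month, day ("ymd")
mdyRaw <- c("03/07/2014", "11/21/2016")   # month, day, year ("mdy")

# Each order needs its own format so the components are read correctly.
ymdDates <- as.Date(ymdRaw, format = "%Y-%m-%d")
mdyDates <- as.Date(mdyRaw, format = "%m/%d/%Y")

# Both vectors describe the same calendar days once parsed.
sameDates <- identical(ymdDates, mdyDates)
```

Supplying the wrong order for a source would either produce NA values or silently swap day and month, which is why the order is set per data source.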

deduplicate

A logical, default TRUE. Set to TRUE to deduplicate patent documents that have both an application and a grant.

cakcDict

A country and kind code dictionary. Default is cakcDict.

docLengthTypesDict

A document length and type dictionary. Default is docLengthTypesDict.

keepType

A character variable denoting which document type to keep. Default is "grant". If NA, this argument is ignored.
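The interaction of deduplicate and keepType can be pictured with the base R sketch below. It assumes that duplicates are rows sharing an application number and that the preferred type wins when both exist; the column names appNum and docType and the sample rows are hypothetical, and the actual logic inside cleanPatentData may differ.

```r
# Hypothetical documents: A1 has both an application and a grant.
docsDemo <- data.frame(appNum = c("A1", "A1", "B2"),
                       docType = c("app", "grant", "app"),
                       stringsAsFactors = FALSE)
keepTypeDemo <- "grant"

# Sort so that, within each appNum, the preferred type comes first
# (FALSE sorts before TRUE), then keep the first row per appNum.
ord <- order(docsDemo$appNum, docsDemo$docType != keepTypeDemo)
sortedDemo <- docsDemo[ord, ]
dedupedDemo <- sortedDemo[!duplicated(sortedDemo$appNum), ]
```

In this sketch, A1 is reduced to its grant, while B2 keeps its only record even though it is an application.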

firstAssigneeOnly

For cleaning names, use the first assignee only, default TRUE.

assigneeSep

The separator used when there is more than one assignee. Default is a semicolon, ";".

stopWords

The stopword list to remove from assignee names. Default is assigneeStopWords.
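The assignee-related arguments (firstAssigneeOnly, assigneeSep, and stopWords) can be pictured with this base R sketch. The assignee strings and the short stop word list are hypothetical, and lowercasing plus regular-expression removal is an assumed cleaning approach, not a description of the package's implementation.

```r
# Hypothetical raw assignee field with ";" as the separator.
rawAssignee <- c("Acme Corp;Beta LLC", "Gamma Inc")
assigneeSepDemo <- ";"

# Keep only the first assignee of each record.
firstAssignee <- vapply(strsplit(rawAssignee, split = assigneeSepDemo, fixed = TRUE),
                        function(parts) trimws(parts[1]),
                        character(1))

# Hypothetical stop words; patentr::assigneeStopWords is the package default.
stopWordsDemo <- c("inc", "corp", "llc")
pattern <- paste0("\\b(", paste(stopWordsDemo, collapse = "|"), ")\\b")

# Lowercase, strip whole-word stop words, and trim leftover whitespace.
cleanedAssignee <- trimws(gsub(pattern, "", tolower(firstAssignee), perl = TRUE))
```

Using fixed = TRUE in strsplit() treats the separator literally, which matters for multi-character separators such as ";;".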

Value

A data frame of tidy patent data.

See Also

For data formats: acars for Sumobrain, acarsGoogle for Google Patents data, and acarsLens for Lens.org data.

Examples

# Attach the package so its bundled constants (sumobrainColumns, etc.) resolve.
library(patentr)

# Sumobrain: clean the bundled acars data set.
sumo <- cleanPatentData(patentData = patentr::acars,
                        columnsExpected = sumobrainColumns,
                        cleanNames = sumobrainNames,
                        dateFields = sumobrainDateFields,
                        dateOrders = sumobrainDateOrder,
                        deduplicate = TRUE,
                        cakcDict = patentr::cakcDict,
                        docLengthTypesDict = patentr::docLengthTypesDict,
                        keepType = "grant",
                        firstAssigneeOnly = TRUE,
                        assigneeSep = ";",
                        stopWords = patentr::assigneeStopWords)

# use a fresh Google export csv
# in a new csv download, however, it would not be the case


# Google Patents: locate the bundled csv, then read it, skipping skipGoogle lines.
rawGoogleData <- system.file("extdata", "google_autonomous_search.csv",
                             package = "patentr")
rawGoogleData <- read.csv(rawGoogleData,
                          skip = skipGoogle, stringsAsFactors = FALSE)
# Convert every column to ASCII text; values that cannot be converted become NA.
rawGoogleData <- data.frame(lapply(rawGoogleData,
                                   function(x) {iconv(x, to = "ASCII")}),
                            stringsAsFactors = FALSE)
google <- cleanPatentData(patentData = rawGoogleData,
                          columnsExpected = googleColumns,
                          cleanNames = googleNames,
                          dateFields = googleDateFields,
                          dateOrders = googleDateOrder,
                          deduplicate = TRUE,
                          cakcDict = patentr::cakcDict,
                          docLengthTypesDict = patentr::docLengthTypesDict,
                          keepType = "grant",
                          firstAssigneeOnly = TRUE,
                          assigneeSep = ",",
                          stopWords = patentr::assigneeStopWords)


# Lens.org: locate the bundled csv, then read it, skipping skipLens lines.
lensRawData <- system.file("extdata", "lens_autonomous_search.csv",
                           package = "patentr")
lensRawData <- read.csv(lensRawData, stringsAsFactors = FALSE, skip = skipLens)
# Convert every column to ASCII text; values that cannot be converted become NA.
lensRawData <- data.frame(lapply(lensRawData,
                                 function(x) {iconv(x, to = "ASCII")}),
                          stringsAsFactors = FALSE)
lens <- cleanPatentData(patentData = lensRawData,
                        columnsExpected = lensColumns,
                        cleanNames = lensNames,
                        dateFields = lensDateFields,
                        dateOrders = lensDateOrder,
                        deduplicate = TRUE,
                        cakcDict = patentr::cakcDict,
                        docLengthTypesDict = patentr::docLengthTypesDict,
                        keepType = "grant",
                        firstAssigneeOnly = TRUE,
                        assigneeSep = ";;",
                        stopWords = patentr::assigneeStopWords)

kamilien1/patentr documentation built on May 20, 2019, 7:19 a.m.