In crew102/patentsview: An R Client to the PatentsView API

knitr::opts_chunk$set(
  collapse = TRUE, 
  comment = "#>", 
  warning = FALSE,
  message = FALSE
)

The following is a quick analysis of the top organizations patenting in the field of databases.

The first step is to download the relevant data fields from the PatentsView API:

library(patentsview)
library(dplyr)
library(highcharter)
library(DT)
library(knitr)

# We first need to write a query. Our query will look for "database" in either 
# the patent title or abstract...Note, this isn't a terribly good way to ID our 
# patents, but it will work for the purpose of demonstration. Users who are 
# interested in writing higher-quality queries could consult the large body of 
# research that has been done in patent document retrieval.
query <- with_qfuns(
  or(
    text_phrase(patent_abstract = "database"),
    text_phrase(patent_title = "database")
  )
)

query

# Create a list of the fields we'll need for the analysis:
fields <- c(
  "patent_number", "assignee_organization",
  "patent_num_cited_by_us_patents", "app_date", "patent_date",
  "assignee_total_num_patents"
)

# Send an HTTP request to the PatentsView API to get data:
pv_out <- search_pv(query = query, fields = fields, all_pages = TRUE)

Now let's identify who the top assignees are based on how many patents they have in our data set (i.e., calculate db_pats). We'll also calculate how many total patents these assignees have (ttl_pats) and what fraction of their total patents relate to databases (frac_db_pats).

# Unnest the data frames that are stored in the assignee list column:
dl <- unnest_pv_data(data = pv_out$data, pk = "patent_number")
dl

# Create a data frame with the top 75 assignees:
top_asgns <-
  dl$assignees %>%
    filter(!is.na(assignee_organization)) %>% # some patents are not assigned to an org (only to an inventor)
    mutate(ttl_pats = as.numeric(assignee_total_num_patents)) %>%
    group_by(assignee_organization, ttl_pats) %>% # group by ttl_pats so we can retain ttl_pats
    summarise(db_pats = n()) %>% 
    mutate(frac_db_pats = round(db_pats / ttl_pats, 3)) %>%
    ungroup() %>%
    select(c(1, 3, 2, 4)) %>%
    arrange(desc(db_pats)) %>%
    slice(1:75)

# Create datatable:
datatable(
  data = top_asgns,
  rownames = FALSE,
  colnames = c("Assignee", "DB patents","Total patents", 
               "DB patents / total patents"),
  caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: left; font-style: italic;',
    "Table 1: Top assignees in 'databases'"
  ),
  options = list(pageLength = 10)
)

IBM is far and away the biggest player in the field. However, we can see that Oracle and Salesforce.com are relatively more interested in this area, as indicated by the fraction of their patents that relate to databases.

Let's see how these assignees' level of investment in databases has changed have changed over time.

# Create a data frame with patent counts by application year for each assignee:
data <- 
  top_asgns %>%
    select(-contains("pats")) %>%
    slice(1:5) %>%
    inner_join(dl$assignees) %>%
    inner_join(dl$applications) %>%
    mutate(app_yr = as.numeric(substr(app_date, 1, 4))) %>%
    group_by(assignee_organization, app_yr) %>%
    count() 

# Plot the data using highchartr:
hc_add_series_df(
  highchart(), data = data, "line", x = app_yr, y = n, group = assignee_organization
) %>%
  hc_plotOptions(series = list(marker = list(enabled = FALSE))) %>%
  hc_xAxis(title = list(text = "Application year")) %>%
  hc_yAxis(title = list(text = "DB patents")) %>%
  hc_title(text = "Top five assignees in 'databases'") %>%
  hc_subtitle(text = "Yearly patent applications over time")

It's hard to see any clear trends in this graph. What is clear is that the top assignees have all been patenting in the field for many years.

Finally, let's see how the organizations compare in terms of their citation rates. First, we'll need to normalize the raw citation counts by publication year, so that older patents don't have an unfair advantage over younger patents (i.e., because they have had a longer time to accrue citations).

# Write a ranking function that will be used to rank patents by their citation counts:
percent_rank2 <- function(x)
  (rank(x, ties.method = "average", na.last = "keep") - 1) / (sum(!is.na(x)) - 1)

# Create a data frame with normalized citation rates and stats from Step 2: 
asng_p_dat <-
  dl$patents %>%
    mutate(patent_yr = substr(patent_date, 1, 4)) %>%
    group_by(patent_yr) %>%
    mutate(perc_cite = percent_rank2(patent_num_cited_by_us_patents)) %>%
    inner_join(dl$assignees) %>%
    group_by(assignee_organization) %>%
    summarise(mean_perc = mean(perc_cite)) %>%
    inner_join(top_asgns) %>%
    arrange(desc(ttl_pats)) %>%
    filter(!is.na(assignee_organization)) %>%
    slice(1:20) %>%
    mutate(color = "#f1c40f") %>%
    as.data.frame()

kable(head(asng_p_dat), row.names = FALSE)

Now let's visualize the data. Each assignee will be represented by a point/bubble. The x-value of the assignee will represent the total number of patents it has published in the field of databases (on a log scale), while the y-value will represent its average normalized citation rate. The size of the bubble we be proportional to the percent of the assignee's patents that relate to databases.

# Adapted from http://jkunst.com/highcharter/showcase.html
hchart(
  asng_p_dat, "scatter", 
  hcaes(x = db_pats, y = mean_perc, size = frac_db_pats, 
        group = assignee_organization, color = color)
) %>%
  hc_xAxis(title = list(text = "DB patents"), type = "logarithmic",
           allowDecimals = FALSE, endOnTick = TRUE) %>%
  hc_yAxis(title = list(text = "Mean cite perc.")) %>%
  hc_title(text = "Top assignees in 'databases'") %>%
  hc_add_theme(hc_theme_flatdark()) %>%
  hc_tooltip(useHTML = TRUE, pointFormat = tooltip_table(
    x = c("DB patents", "Mean cite percentile", "Fraction DB patents"),
    y = c("{point.db_pats:.0f}","{point.mean_perc:.2f}", "{point.frac_db_pats:.3f}")
  )) %>%
  hc_legend(enabled = FALSE)

It looks like Microsoft has relatively high values across the three metrics (average citation percentile, number of database patents, and percent of total patents that are related to databases). IBM has more patents than Microsoft, but a slightly lower average citation percentile.

crew102/patentsview documentation built on May 14, 2019, 11:33 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com