In crew102/patentsview: An R Client to the PatentsView API

knitr::opts_chunk$set(
  collapse = TRUE, 
  comment = "#>", 
  warning = FALSE,
  message = FALSE
)

The following is a brief foray into patent citation networks. The analysis is done on 3 patents that describe patent citation analysis (PCA) themselves.

The first step is to download the relevant data from the PatentsView API. We can use the CPC code of Y10S707/933 to pull PCA patents.

library(patentsview)
library(dplyr)
library(visNetwork)
library(magrittr)
library(stringr)
library(knitr)

# Write a query to get the patents that are assigned a CPC code of "Y10S707/933": 
query <- qry_funs$begins(cpc_subgroup_id = "Y10S707/933")

# Create a list of fields to pull from the API:
fields <- c(
  "patent_number", 
  "patent_title",
  "cited_patent_number", # Which patents do they cite?
  "citedby_patent_number" # Which patents cite them?
)

# Send a request to the API:
res <- search_pv(query = query, fields = fields, all_pages = TRUE)

# Unnest the data found in the list columns:
res_lst <- unnest_pv_data(res$data, pk = "patent_number")
res_lst

There are only r nrow(res_lst$patents) PCA patents. These patents cite r nrow(res_lst$cited_patents) patents and are cited by r nrow(res_lst$citedby_patents) patents. Let's visualize the citations among the PCA patents. We'll create our visualization using the visNetwork package, which requires us to create a data frame of nodes and a data frame of edges.

pat_title <- function(title, number) {
  temp_title <- str_wrap(title)
  i <- gsub("\\n", "<br>", temp_title)
  paste0('<a href="https://patents.google.com/patent/US', number, '">', i, '</a>')
}

edges <-
  res_lst$cited_patents %>%
    semi_join(x = ., y = ., by = c("cited_patent_number" = "patent_number")) %>%
    set_colnames(c("from", "to"))

nodes <-
  res_lst$patents %>%
    mutate(
      id = patent_number,
      label = patent_number,
      title = pat_title(patent_title, patent_number)
    )

visNetwork(
  nodes = nodes, edges = edges, height = "400px", width = "100%",
  main = "Citations among patent citation analysis (PCA) patents"
) %>%
  visEdges(arrows = list(to = list(enabled = TRUE))) %>%
  visIgraphLayout()

It looks like several of the patents cite patent number 6,499,029, perhaps indicating that patent 6,499,029 describes technology that is foundational to this small field. However, when we hover over the nodes we see that several of the patents have the same title. Clicking on the titles brings us to their full text on Google Patents, which confirms that several of the PCA patents belong to the same patent family.[^1] Let's choose one of the patents in each family to act as the family's representative. This will reduce the size of the subsequent network while hopefully retaining its overall structure.

p3 <- c("7797336", "9075849", "6499026")
res_lst2 <- lapply(res_lst, function(x) x[x$patent_number %in% p3, ])

With only 3 patents, it will probably be possible to visualize how these patents' cited and citing patents are all related to one another. Let's create a list of these "relevant patents" (i.e., the 3 patents plus all of their cited and citing patents)[^2], and then get a list of all of their cited patents. The list of their cited patents will allow us to measure how similar the relevant patents are to one another.

rel_pats <-
  res_lst2$cited_patents %>%
    rbind(setNames(res_lst2$citedby_patents, names(.))) %>% 
    select(-patent_number) %>%
    rename(patent_number = cited_patent_number) %>%
    bind_rows(data.frame(patent_number = p3)) %>% 
    distinct() %>%
    filter(!is.na(patent_number))

# Look up which patents the relevant patents cite:
rel_pats_res <- search_pv(
  query = list(patent_number = rel_pats$patent_number),
  fields =  c("cited_patent_number", "patent_number", "patent_title"), 
  all_pages = TRUE,
  method = "POST"
)

rel_pats_lst <- unnest_pv_data(rel_pats_res$data, pk = "patent_number")

Now we know which patents the r nrow(rel_pats_lst$patents) relevant patents cite. This allows us to measure the similarity between any two patents by seeing how many cited references they share in common (a method known as bibliographic coupling).

cited_pats <-
  rel_pats_lst$cited_patents %>%
    filter(!is.na(cited_patent_number))

full_network <- 
  cited_pats %>%
    do({
      .$ind <- group_by(., patent_number) %>%  
               group_indices()
      group_by(., patent_number) %>%  
              mutate(sqrt_num_cited = sqrt(n()))
    }) %>%
    inner_join(x = ., y = ., by = "cited_patent_number") %>%
    filter(ind.x > ind.y) %>%
    group_by(patent_number.x, patent_number.y) %>% 
    mutate(cosine_sim = n() / (sqrt_num_cited.x * sqrt_num_cited.y)) %>% 
    ungroup() %>%
    select(matches("patent_number\\.|cosine_sim")) %>%
    distinct()

kable(head(full_network))

full_network contains the similarity score (cosine_sim) for all patent pairs that share at least one cited reference in common. This means that it probably contains a lot of patent pairs that have only one or two cited references in common, and thus aren't all that similar. Let's try to identify a natural level of cosine_sim to filter on, so our subsequent network graph is not enormous.

hist(
  full_network$cosine_sim, 
  main = "Similarity scores between patents relevant to PCA",
  xlab = "Cosine similarity", ylab = "Number of patent pairs"
)

There appears to be a smallish group of patent pairs that are very similar to one another (cosine_sim > 0.8), which makes it tempting to choose 0.8 as a cutoff point. However, patent pairs that have reference lists that are this similar to each other are probably just patents in the same patent family. Let's choose 0.1 as a cutoff point instead, as there doesn't appear to be too many pairs above this point.[^3]

edges <- 
  full_network %>%
    filter(cosine_sim >= .1) %>% 
    rename(from = patent_number.x, to = patent_number.y, value = cosine_sim) %>%
    mutate(title = paste("Cosine similarity =", as.character(round(value, 3))))

nodes <-
  rel_pats_lst$patents %>%
    rename(id = patent_number) %>%
    mutate(
      color = ifelse(id %in% p3, "#97C2FC", "#DDCC77"),
      label = id,
      title = pat_title(patent_title, id)
    )

visNetwork(
  nodes = nodes, edges = edges, height = "700px", width = "100%",
  main = "Network of patents relevant to PCA"
) %>%
  visEdges(color = list(color = "#343434")) %>%
  visOptions(highlightNearest = list(enabled = TRUE, degree = 1)) %>%
  visIgraphLayout()

The 3 patents of interest are represented by blue nodes, while their citing and cited patents are represented by yellow nodes. One can see the small clusters of patents in this network, as well as how they relate to another.

[^1]: A patent family is a group of related patents, usually all authored by the same inventor and relating to the same technology. [^2]: Defining the network of patents relevant to PCA as those that cite or are cited by the 3 patents of interest is fairly restrictive (i.e., it doesn't adequately capture all of the patents related to PCA). There are likely patents out there that aren't cited by nor cite any of the 3, but are still relevant to PCA.One would need to measure the similarity between all the patents that are in the general area of PCA (versus just those that cite or are cited by the 3 we chose) to get a more complete picture of the patents in this area. This is a much harder problem, though, and would require more analysis than can fit in a single vignette. [^3]: This is still a pretty arbitrary choice. Take a look at algorithms like the disparity filter for a more systematic way to filter edges.

crew102/patentsview documentation built on May 14, 2019, 11:33 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com