ignore/old_README.md

---
title: "Access the Lens Patent Database using R"
author: "Paul Oldham"
date: "26 October 2016"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
library(tidyverse)
```

<!---[![Travis-CI Build Status](https://travis-ci.org/poldham/lensr.svg?branch=master)](https://travis-ci.org/poldham/lensr)
[![codecov.io](https://codecov.io/github/poldham/lensr/coverage.svg?branch=master)](https://codecov.io/github/poldham/lensr?branch=master)--->

IMPORTANT NOTE: THIS PACKAGE IS DEPRECATED FOLLOWING MAJOR CHANGES TO THE LENS INTERFACE AND THE UPCOMING RELEASE OF THE LENS API. 

The Lens now includes a major Scholar component and is much more powerful. Please use [the main Lens interface](https://www.lens.org/) and I will aim to release a new rewritten version of the package when the dust settles on the changes and the API is available. Meanwhile thanks for the interest.


<!---## Introduction

This R package provides access to the basic functions of the [Lens Patent Database](https://www.lens.org/lens/). 

The Lens provides open access to millions of patent records from around the world and allows for searches of the full text (title, abstract, description and claims) of patent documents. We can also search using applicants, inventors and author names and combine searches across different fields. 

The Lens allows those who register for a free account to save, share and download Collections of up to 10,000 records. If you are seeking access to large amounts of patent data we suggest that you follow this route by registering for the database and creating Collections online. If you are completely new to the Lens we recommend the walkthrough in the [WIPO Manual on Open Source Patent Analytics](https://wipo-analytics.github.io/the-lens-1.html).

`lensr` is intended for lightweight exploratory use of patent data using the Lens. By default a search returns 50 records, and we aim to increase that in future. If you would like more records please use the Lens database and the Collections feature directly. 

If your research involves complex queries you may find it convenient to use the `lens_urls()` function in `lensr` to create complex URLs that you can paste into the Lens as the basis for creating a Collection to download. 

The package was developed as part of a wider initiative to make patent data more accessible for analytics purposes and to support monitoring under the [Nagoya Protocol on Access to Genetic Resources and Benefit-Sharing](https://www.cbd.int/abs/about/) of the [United Nations Convention on Biological Diversity](https://www.cbd.int/). 

For those unfamiliar with patent analytics we suggest reading the [WIPO Manual on Open Source Patent Analytics](https://wipo-analytics.github.io/) and the [repository of related training materials](https://github.com/wipo-analytics). 

## Package Details

Package development follows the [ropensci guide](https://github.com/ropensci/onboarding/blob/master/packaging_guide.md) and may one day make it into the `ropensci` list of packages. At present the package is in early development.

## Getting started

`lensr` is not on CRAN but can be installed using `devtools`. If you need to install `devtools` in RStudio use:

```{r devtools, eval=FALSE}
install.packages("devtools")
```

Then use:

```{r install, eval=FALSE}
devtools::install_github("poldham/lensr")
```


## Using lensr

`lensr` involves two general functions and a set of specific functions.

**General functions**

1. `lens_count()`. Get counts of patent families and publications from different kinds of searches.
2. `lens_search()`. Complex searching and retrieval of data from the Lens.

**Specific functions**

3. `lens_applicants()`. Search on applicant names.
4. `lens_inventors()`. Search on inventor names.
5. `lens_ipcs()`. Search on International Patent Classification codes.
6. `lens_authors()`. Search patent documents for citations containing an author name. 

**Worker functions**

For those interested in contributing to package development, the main worker functions are:

1. `lens_urls()` Generate urls across the different functions (called by `lens_search`).
2. `lens_iterate()` loops over vectors of urls with `lapply` and times each call to minimise pressure on the server. 
3. `lens_parse()` parses the results to a tibble (data.frame) and can be called with `lens_iterate()`.

## Workflow

A typical workflow will begin with `lens_count()` to work out the total number of results for a query, or combination of queries. This will be followed by either the use of `lens_search()` or one of the specific functions (e.g. applicants, inventors, ipcs).

### lens_count

Use `lens_count()` to get an idea of the results for different queries. `lens_count()` presently concentrates on searches using keywords and phrases, as these typically generate large numbers of results that need refining. 

Note that you can use `type = ` to control whether to search the full text (default), title, abstract or claims as separate fields or the `title or abstract or claims` (tac) at the same time. 

```{r count_default}
lensr::lens_count("drones") # searches full text
```

To search titles we would use:

```{r count_title}
lens_count("drones", type = "title")
```


and to search the title or abstract or claims ("tac"):

```{r count_tac}
lens_count("drones", type = "tac")
```

We can return the results of a query for multiple terms as follows. First we construct a vector of search terms. In this case we will use synthetic biology related terms.

```{r mult}
synbio <- c("synthetic biology", "synthetic genomics", "synthetic genome",
            "synthetic genomes", "biological parts", "genetic circuit",
            "genetic circuits")
```


Next we use `lens_count` to fetch the count of results. Note that the timer, in seconds, can be adjusted. In this case we want a count for each individual term to allow us to get a better understanding of the impacts of different search terms. The status of the search will be displayed by dots, with one dot per search term (url).  

```{r synbio_count}
synbio_count <- lens_count(synbio, timer = 10)
synbio_count
```

In other cases, when retrieving results we would want to use a boolean operator ("AND" or "OR") to combine the search terms into one dataset.

```{r synbio_or}
synbio_or <- lens_count(synbio, boolean = "OR", timer = 10)
synbio_or
```


When retrieving results using `lens_search()` below, be sure to use a boolean operator or the data retrieval will not work.  
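To illustrate what combining terms with a boolean operator means, here is a minimal base R sketch that collapses a vector of phrases into a single OR query string. It is only a conceptual illustration: how `lensr` builds and encodes the query internally is not shown in this document, and the variable names are hypothetical.

```r
# Illustration only: this does not reproduce lensr's internal query encoding.
terms <- c("synthetic biology", "synthetic genomics", "genetic circuit")

# Wrap each phrase in double quotes so multi-word terms stay together
quoted <- paste0('"', terms, '"')

# Join the quoted phrases into one query with the boolean operator
or_query <- paste(quoted, collapse = " OR ")
or_query
```

A single combined query of this kind yields one result set, whereas a vector of separate terms yields one result per term.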

By default the Lens returns the number of patent families and the number of publications across the database. In the results above, the number of patent families is almost always lower than the number of publications. The exceptions are cases where the two counts are equal (for very low document counts). Note that where the Lens finds very few documents it does not report separate publication and family counts. To handle this, in relevant cases, `lens_count()` copies the publication count into families and generates a message (see the documentation for `lens_count()`).

If you are new to the concept of patent families, and they are very important, read the brief introduction below. If you are familiar with patent families, note that Lens patent families are of the `simple` (e.g. DOCDB) type.

## A brief introduction to patent families

When a patent application is filed for the first time anywhere in the world it becomes the **"priority"** or **"first" filing**. That application may then be published as an application and as a patent grant or with administrative documents such as search reports. This creates a basic patent family where the original application is the "parent" of any subsequent republications of the application, including administrative documents such as search reports, corrections etc. 

However, the same patent application, and divisions of that application, may also be submitted for potential protection in multiple countries using patent instruments such as the European Patent Convention (EP) and internationally using the Patent Cooperation Treaty (WO). This will result in publications of the application, and any grants, in other countries. These documents are **family members** that link to the original priority application as their parent. 

Counts of patent families have the advantage that they `deduplicate` republications of patent documents to only the original (earliest) filing. In the case of the Lens the patent families are simple families (DOCDB families) rather than INPADOC families. Patent families are generally preferred in innovation studies because the priority (filing date) is closest to the date of investment in research and counts of an invention are made only once. 

In contrast, publications include family members such as publications of applications, grants, search reports, corrections and other administrative documents. `lens_search()` returns the document type and in future an argument will be included to allow the type of document to be controlled during query construction. 
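The difference between counting publications and counting families can be sketched with a toy data frame in base R. The publication numbers and the `family_id` column are hypothetical and are not fields returned by `lensr`; real family construction (e.g. DOCDB) is more involved than this.

```r
# Toy data: five publications belonging to two simple families (made-up values)
pubs <- data.frame(
  publication = c("US_1_A1", "US_1_B2", "EP_2_A1", "WO_3_A1", "JP_4_A"),
  family_id   = c("F1", "F1", "F1", "F1", "F2"),
  stringsAsFactors = FALSE
)

# Every row is a publication
n_publications <- nrow(pubs)

# Each family is counted once, however many times it was republished
n_families <- length(unique(pubs$family_id))

# Keep the first listed document per family (assumes rows are ordered earliest first)
first_per_family <- pubs[!duplicated(pubs$family_id), ]
```

Here the publication count is higher than the family count because one invention was published several times.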

## Count by Year

There are two types of year available in the Lens: `filing_year` and `publication_year`. `lensr` allows you to restrict searches using date ranges. 

### Restricting by year

To restrict counts by publication year (publn) we need to specify the start date (`publn_date_start`) and the end date (`publn_date_end`) of the publication date range:

```{r publn_year}
library(tidyverse)
lens_count("drones", publn_date_start = 19900101, publn_date_end = 20001231) %>% print()
```

As an alternative we could also use the filing dates.

```{r}
lens_count("drones", filing_date_start = 19900101, filing_date_end = 20001231)
```

In considering the use of filing dates or publication dates note that patent documents only become available when they are published. In the example above the number of filings for the period was higher than the number of publications for the same period. Why? This is because of the lag time between the filing of an application and its publication.

The publication of an application typically takes place 24 months after filing but may take considerably longer. So, in the case above it is likely that some of the documents filed in the period 1990-2000 were not published until after the start of 2001. Filing (priority) dates are important when calculating trends in filings for an area of science and technology. However, if you intend to read the documents in practice you will normally want either the earliest or the latest patent publications and will therefore want to use publication (publn) date ranges.

By default `lensr` returns up to 50 results, with a maximum of 500 results. Date ranges are the main tool for dividing the data into chunks to overcome these limitations.
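As a rough sketch of how date ranges can split a query into chunks, the base R snippet below builds one start and end date per calendar year in the `yyyymmdd` form used in the examples above. The helper `year_chunks()` is hypothetical and is not part of `lensr`.

```r
# Hypothetical helper (not part of lensr): one row per calendar year,
# with start and end dates as yyyymmdd integers.
year_chunks <- function(start_year, end_year) {
  years <- seq(start_year, end_year)
  data.frame(
    start = as.integer(years * 10000 + 101),   # 1 January of each year
    end   = as.integer(years * 10000 + 1231)   # 31 December of each year
  )
}

chunks <- year_chunks(1990, 2000)
head(chunks, 3)
```

Each row could then supply the `publn_date_start` and `publn_date_end` (or filing date) arguments for one call, so that each call stays under the retrieval limit. Yearly chunks are an assumption; busy topics may need shorter ranges.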

Because of the limitations on data retrieval from the Lens we recommend that for larger scale data you log in to the Lens and create Collections. You can use `lens_urls()` to create URLs with complex queries that you can simply paste into your browser. If you have logged into the Lens you can then create a Collection for download. We will add some functions at a later date to process the downloaded .csv file in R. Note that when using Collections you gain access to more data fields (such as inventors) than from `lensr`.

### lens_search

`lens_search()` is the main package function and allows you to conduct searches using key words or phrases and combinations of inventor and applicant names and key words or phrases. In future the function will be expanded to include International Patent Classification codes (ipcs) and author names (for cited literature).

### Search using key terms

`lens_search()` will return a tibble (data frame) with 50 results by default. We will normally want to retrieve data deduplicated to families and will set `families = TRUE`. To limit the data to searches that are directly concerned in some way with the subject matter we will search the title or abstract or claims (`"tac"`). Note the quotes if you are new to R.

The timer is set to 20 seconds by default across `lensr` functions. You do not need to include it unless you would like to change it. It is included below for illustration.

```{r timer}
drones <- lens_search("drones", type = "tac", families = TRUE, timer = 20)
drones
```


Note that by default the Lens returns patent numbers with an underscore between the country code (e.g. US) and the number, preserves forward slashes within the number, and uses a second underscore before the kind code at the end. To assist with the use of the numbers in other databases, a `publications_numbers` field is added that concatenates the country code, number and kind code and removes the forward slash. This approach should work in most databases, but if you transfer the numbers as inputs to another database some experimentation may be needed to ensure data capture. 
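The kind of clean-up described above can be sketched in base R as follows. The input values are hypothetical examples in the layout described, and this is not necessarily how `lensr` itself builds the field.

```r
# Hypothetical numbers: country code, number (possibly with a slash), kind code
lens_numbers <- c("US_2014/0123456_A1", "EP_1234567_B1")

# Remove the underscores and forward slashes to get a compact number
clean_numbers <- gsub("[_/]", "", lens_numbers)
clean_numbers
```

Other databases may expect different layouts (for example, dropping the kind code), so the pattern may need adjusting for a given target.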

### Search using key terms, applicants and inventors

You can combine searches with key terms and inventors either as individual terms or as multiple sets. 

In this case we will search for synthetic biology related key terms where the inventors are listed as Craig Venter or his long term collaborator Hamilton Smith. 

In constructing these queries note that where using more than one phrase or name we specify the boolean operator as OR or AND. For search terms the boolean is simply `boolean`. For inventor names it is `inventor_boolean` and for applicants `applicant_boolean`. 

To see the effect of `families = TRUE`, try setting the value to `FALSE` and reviewing the titles. 

<!--- review the count of results and develop a test--->

<!---
```{r inventors}
inventors <- lens_search(
  query = c("synthetic genomics", "synthetic biology"),
  boolean = "OR",
  inventor = c("Venter Craig", "Smith Hamilton"),
  inventor_boolean = "AND",
  families = TRUE
)
inventors
```

Note that inventor names do not appear in the results retrieved from the Lens.

### Search using applicants and key terms

Key term and inventor searches can be combined with applicant searches. In this case we will use a vector of search terms and a single inventor and applicant name. Note that within `lens_search()` `families` is set to `TRUE` to avoid retrieving duplicates of the same document.



poldham/lensr documentation built on May 25, 2019, 11:22 a.m.