citationsr

The R package citationsr provides functions to extract and analyze citation cases. When study A cites study B, study A contains text fragments that refer to study B. We call study A a citing document and those text fragments citation cases. For example, the sentence "Acemoglu et al. (2001) show that institutions shape economic development." appearing in study A would be one citation case for Acemoglu et al. (2001), one of the studies we analyze below.

This README outlines the methods applied in Bauer et al. (2016), with contributions from Paul C. Bauer, Pablo Barberá and Simon Munzert. The idea is to go beyond a simple and primitive analysis of impact as 'times cited'. The code is licensed under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Disclaimer: We currently don't have time to work on this project and can't provide support. But we hope to develop it further at a later stage.

If you have questions please contact us at mail@paulcbauer.eu.


Intro & setup

Background

We are interested in questions that go beyond a study's raw citation count, e.g.: How is a study cited? In what context do citations appear, and what do the citing text passages actually say?

The tutorial illustrates how the code in the package citationsr can be used to investigate the impact of a particular study. The tutorial targets users who are very familiar with R. In principle, one may want to analyze the impact of a single study or of several studies. Most steps are common to both aims.

As described more extensively in Bauer et al. (2016) we need to pursue the following steps:

  1. Collect citation information, i.e. which works cite our study(ies) of interest
    • Documents that cite a study are called citing documents, text passages within those documents that refer to a study are called citation cases.
  2. Collect the full text of citing documents
  3. Convert citing documents into raw text that can be analyzed
  4. Collect metadata on the citing documents
  5. Extract citation cases from citing documents
  6. Analyze citation cases

Below we present code that cycles through these steps. Essentially, these are the steps we pursued in our own study, in which we investigate the impact of six highly cited studies in Political Science, Sociology and Economics.
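
As an overview, here is a compact sketch of the full pipeline, using the package functions that are introduced step by step below. This is only a sketch: all paths are placeholders, and Steps 1 and 2 (collecting citations and full texts) happen largely outside of R.

library(citationsr)

# Step 3: convert the citing documents (PDFs) into raw text
extract_text(from = "analysis/docs",
             to = "analysis/docs_text",
             method = "pdftotext",
             path.pdftotext = "C:/Program Files/pdftotext.exe")

# Step 4: collect metadata on the citing documents
get_metadata(from = "analysis/docs_text",
             file = "analysis/metadata.RData",
             encoding = "ASCII", start = 1, end = 500)

# Step 5: extract the citation cases for one study
extract_citation_cases(from = "analysis/docs_text",
                       to = "analysis",
                       authorname = "Acemoglu, Johnson, Robinson",
                       studyyear = "2001",
                       encoding = "ASCII")

# Step 6: analyze the citation cases
analyze_citations(file = "analysis/AcemogluJohnsonRobinson_2001_citation_cases.csv",
                  article = "Acemoglu, Johnson & Robinson (2001)",
                  output = "analysis/output/acemoglu_2001")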

Importantly, all of the above steps rely on methods that come with error. For instance, Step 2, collecting full texts, is error-prone because we are not able to collect all full texts. Step 5, extracting citation cases, is error-prone because it is technically challenging. Hence, the methods we present here are by no means perfect.



Package installation

Right now you can install the package from GitHub, for which you need to install the devtools package.

install.packages("devtools")
library(devtools)
install_github("paulcbauer/citationsr")



Create working directory + subfolder

Create a working directory in which you store all the files linked to the analysis. We created a folder called analysis within which most of the analysis will take place.

dir.create("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis") # Create folder
setwd("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis") # Set working directory
# Create a folder in which you store the citation information
dir.create("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/citations") # Create folder



Collect citation information/citations

In principle there are several ways to do this. Here we just present one approach relying on the Web of Science. Naturally, the citation information from the Web of Science is biased in various ways, e.g. it mostly contains journal publications and not books.



Web of Science

It is rather easy to program an R function that scrapes websites such as the Web of Science. However, since that is not legal, we present the manual way here. The Web of Science lets you download 500 citation records at a time, which you can obtain by following the steps below. Unfortunately, you need access to the Web of Science (usually through your university).













Once you have saved the exported citation files (plain-text, tab-delimited) in the citations folder, the following code merges them into a single data frame:

files <- dir("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/citations", pattern = "\\.txt$")
citation_data <- NULL
for(i in files){ # loop over files and merge them
  file.name <- paste("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/citations/", i, sep="")
  print(i)
  x <- readLines(file.name)
  x <- stringr::str_replace_all(x, '"', "'") # replace double by single quotes (they break the parser)
  writeLines(x, con = file.name)
  citations <- readr::read_delim(file.name, delim = "\t") # read the tab-delimited file
  print(nrow(citations))
  citations$filename <- i # keep track of the source file
  citation_data <- rbind(citation_data, citations)
}
save(citation_data, # save the data frame with the citations
     file = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/citation_data.RData")
# View(citation_data) # Have a look at the data



Collect full texts of citing documents

Again, there are several ways to do this. We ultimately went with the third approach (Paperpile).

  1. Manual approach: Click & Save on Google Scholar
  2. Automated approach: Fulltext package
  3. Automated approach: Paperpile



Manual approach: Click & Save on Google Scholar

Go to Google Scholar. Search for your article of interest. Click on 'Cited by'. Go through the articles and download those that you can access through the links on the right. Potentially, you will get more articles than by going through the citations from the Web of Science. On the other hand, it is incredibly laborious.



Automated approach: Fulltext package

The fulltext package by Scott Chamberlain contains the function ft_get(), which should be able to fetch the full texts of articles if you feed it a set of DOIs.

# Packages
  library(fulltext)
  library(stringr)

# Load DOIs
  load("dois.RData")

# Use a subset of DOIs (e.g., for testing)
  dois <- dois[1:20]

# GET THE LINKS FOR THE DOIs
  links <- list()

  for(i in seq_along(dois)){ # loop over DOIs

    print(i)                # print counter
    print(dois[i])          # print DOI

    x <- try(ft_links(dois[i]), silent = TRUE) # continue even when an error occurs

    if(!inherits(x, "try-error")){ # if no error, write links to list
      links[[i]] <- c(x[[1]]$data[[1]][[1]], x[[1]]$data[[1]][[2]]) # fill list element with links
    }else{
      links[[i]] <- "empty" # if error, write "empty" to list
    }

  }
  names(links) <- dois # name list elements with DOIs

# GET THE FULL TEXTS FOR THE DOIs
  texts <- list()

  for(i in seq_along(dois)){

    print(i)                # print counter
    print(dois[i])          # print DOI

    x <- try(ft_get(dois[i]), silent = TRUE) # continue even when an error occurs

    if(!inherits(x, "try-error")){ # if no error, write text to list
      texts[[i]] <- x[[1]]$data$data[[1]]
    }else{
      texts[[i]] <- "empty" # if error, write "empty" to list
    }
  }
  names(texts) <- dois # name list elements with DOIs



Automated approach: Paperpile

Paperpile is a commercial reference manager (3 euros/month for academics) that works together with Google Drive. The nice thing is that it includes a very good PDF scraper (as do other reference managers). Once you upload the DOIs of the studies for which you want to collect full texts, Paperpile does a good job of downloading the PDFs to your Google Drive.

To proceed we need two helper functions that are available in a package called paperpiler.

install.packages("devtools")
library(devtools)
install_github("paulcbauer/paperpiler")

library(paperpiler)
dois <- citation_data$DI # Store dois in object

# gen_ris() creates a RIS file named 'records.ris' in the working directory;
# upload it to Paperpile to fetch the full texts.
gen_ris(dois = dois,
        filename = "records.ris")
# More on the RIS file format: https://en.wikipedia.org/wiki/RIS_(file_format)
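
For illustration, a minimal RIS record for one DOI might look like this (the DOI is hypothetical; presumably gen_ris() writes one such record per DOI):

TY  - JOUR
DO  - 10.1234/exampledoi
ER  -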











"You are about to start an automatic bulk download of 9428 papers. If too many papers in your library are from a single publisher, your computer might get temporarily blocked from access to the publisher's website. Also, if many of your papers have incomplete meta-data your computer might get blocked by Google Scholar because Paperpile make too many requests to find the missing data. Although temporary blocks are rare and not a big problem, please consider downloading the PDFs in smaller batches."





load("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/citation_data.RData")

fetch_docs(from = "C:/GoogleDrive/Paperpile", 
           to = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs", 
           citations = citation_data)



Extract citation cases from citing documents

Delete duplicate PDFs

Paperpile allows you to store several versions of an article. Normally, these are marked with "(1)", "(2)", etc. in their file names. Use the code below to delete any duplicate files that you fetched from the Paperpile folder.

folder <- "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs"
file.names <- dir(paste(folder, sep = ""), pattern = "\\(1\\)|\\(2\\)|\\(3\\)|\\(4\\)|\\(5\\)")
file.paths <- paste(paste(folder,"/", sep = ""), file.names, sep="")
    # renaming
for(i in 1:length(file.paths)){file.remove(from = file.paths[i])}

Optional: Rename PDF docs before text extraction

You can usually skip this step. Sometimes, however, it is a good idea to rename the PDF files before you analyze them. The code below simply searches for all PDF documents in the folder and renames them sequentially (doc1.pdf, doc2.pdf, ...).

# Indicate the folder in which you stored the docs, in our case 'docs'
  folder <- "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs"

# Rename files before the first extraction
  file.names <- dir(folder, pattern = "\\.pdf$|\\.PDF$")
  file.paths <- paste(folder, "/", file.names, sep = "")
  # renaming
  for(i in seq_along(file.paths)){
      file.rename(from = file.paths[i],
                  to = paste(folder, "/doc", i, ".pdf", sep=""))
  }



Extract text files from PDF documents

Now we need to extract text from those PDFs in order to analyze them. Here you can use the extract_text() function. The argument from specifies the folder in which the documents are located, to the folder to which the text files are written. number can be omitted or set to limit the number of documents for which text is extracted (e.g., to test the extraction on a few documents, starting with the first). With method = "pdftotext", the external pdftotext program does the extraction, and path.pdftotext must point to its executable.

library(citationsr) # Load citationsr package
extract_text(from = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs",
             to = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs_text",
             number = NULL, 
             method = "pdftotext",
             path.pdftotext = "C:/Program Files/pdftotext.exe")



Collect metadata on citing documents and rename files accordingly

Above we started with citation data from the Web of Science. There might be cases where we just have text documents or PDFs without any further information on them. The function get_metadata() analyzes the text documents in the folder specified by from = and tries to identify them (relying on the DOIs they contain). Crossref does not like it if you query metadata for too many documents at once, so ideally execute the function in batches of files by specifying start = and end =.

library(citationsr) # Load citationsr package
get_metadata(from = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs_text",
             file = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/metadata-1-500.RData",
             encoding = "ASCII", # or "UTF8"
             start = 1,  # first document of this batch
             end = 500)  # last document of this batch
# Combine the batch files into one metadata.RData (see the sketch below), then:
load("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/metadata.RData")
View(metadata)
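
If you scraped the metadata in several batches, you need to combine the batch files into the single metadata.RData loaded above. A minimal sketch, assuming each batch file contains a data frame named metadata (the batch file names are hypothetical):

batch.files <- c("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/metadata-1-500.RData",
                 "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/metadata-501-1000.RData")
metadata.all <- NULL
for(f in batch.files){
  load(f) # loads a data frame named 'metadata' (assumption)
  metadata.all <- rbind(metadata.all, metadata)
}
metadata <- metadata.all
save(metadata,
     file = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/metadata.RData")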



Extract citation cases

Optional: Modify encoding for filenames

If no strange encoding appears in your file names, you can safely skip this step. Sometimes, however, it may be necessary: we worked on both Windows PCs and Macs and repeatedly ran into encoding issues.

# Modify encoding of file names
  folder <- "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs_text"
  file.names <- dir(folder, pattern = "\\.pdf$|\\.txt$")
  file.paths <- paste(folder, "/", file.names, sep = "")
  file.paths2 <- stringr::str_replace_all(file.paths, "“", "") # drop curly quotes
  file.paths2 <- stringr::str_replace_all(file.paths2, "'", "") # drop apostrophes
  file.paths2 <- iconv(file.paths2, "UTF-8", "ASCII", sub = "") # drop remaining non-ASCII characters
  # Rename the files
  for(i in seq_along(file.paths)){
    file.rename(from = file.paths[i],
                to = file.paths2[i])
  }



Clean text files before extraction

The extraction of citation cases works better on .txt files in which the text is not interrupted by running heads, page numbers, etc. Below we provide two functions, delete_refs_n_heads() and clean_text(), that try to clean the text at least to some extent.

  load("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/metadata.RData")
View(metadata)

  delete_refs_n_heads(folder = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs_text",
                      metadata = metadata,
                      encoding = "ASCII")

  clean_text(folder = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs_text",
             encoding = "ASCII")



Extract citation cases

The extract_citation_cases() function cycles through the files ending in _processed.txt in the from folder and extracts the citation cases. It writes both an HTML file (for easy lookup) and a CSV file (for the later analyses) to the to folder, named after the study whose impact we investigate, e.g. AcemogluJohnsonRobinson_2001_citation_cases.html and AcemogluJohnsonRobinson_2001_citation_cases.csv.

extract_citation_cases(from = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs_text",
                       to = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis",
                       authorname = "Acemoglu, Johnson, Robinson",
                       studyyear = "2001",
                       encoding = "ASCII")

Above we only looked at a single study. In our paper we investigate the impact of six studies. We stored the information on those six studies - such as authors and publication year - in a text file called publicationdata.txt and load this information first.

  publicationdata <- read.table("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/publicationdata.txt",
                                sep = ";",
                                header = TRUE,
                                stringsAsFactors = FALSE)
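
The file publicationdata.txt is not shown in this README, so here is a hypothetical example of its structure, inferred from the column indices used in the loop below (column 2 = title, column 3 = authorname, column 4 = studyyear; an ID in column 1 is an assumption):

id;title;authorname;studyyear
1;The Colonial Origins of Comparative Development;Acemoglu, Johnson, Robinson;2001
2;What to Do (and Not to Do) with Time-Series Cross-Section Data;Beck, Katz;1995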

We loop over the information in publicationdata.txt and extract the citation cases below. For each of the six studies, this produces HTML and CSV files with the citation cases.

  for (i in 1:6){ # loop over the six studies

    # Specify the arguments for the extraction function
    study.title <- publicationdata[i, 2] # not needed by extract_citation_cases()
    authorname <- publicationdata[i, 3]
    studyyear <- publicationdata[i, 4]

    # Extract the citation cases
    extract_citation_cases(from = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs_text",
                           to = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis",
                           authorname = authorname,
                           studyyear = studyyear,
                           encoding = "ASCII")
  }



Analysis of citation cases

After the citation case extraction we end up with a data frame in which the rows are citation cases (in our case, 6 data frames for 6 studies). The columns are variables that contain information such as...

From here on you can apply any methods you like to analyze the citation case data.
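
For instance, a quick first look at one of the resulting files might look like this (a sketch, assuming the CSV is comma-separated; the exact column names depend on the extraction):

# Read the citation cases for one study
cases <- read.csv("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/AcemogluJohnsonRobinson_2001_citation_cases.csv",
                  stringsAsFactors = FALSE)
nrow(cases) # number of citation cases extracted
str(cases)  # which variables the data frame contains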

We also programmed some functions to automate the process of analysis.



dir.create("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/output")
dir.create("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/output/acemoglu_2001")



Simple analyses

analyze_citations(file = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/AcemogluJohnsonRobinson_2001_citation_cases.csv", # File with citation cases
                  article = "Acemoglu, Johnson & Robinson (2001)", # Specify name of article
                  output = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/output/acemoglu_2001") # Specify output folder



Topic analyses

topic_analysis(file = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/AcemogluJohnsonRobinson_2001_citation_cases.csv", # File with citation cases
               article = "Acemoglu, Johnson & Robinson (2001)", # Specify name of article
               output = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/output/acemoglu_2001") # Specify output folder



Analyzing several dataframes with citation cases

Just as we extracted the citation cases for several studies with a loop above, we can also apply the two analysis functions within a loop.

files <- paste("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/",
               c("AcemogluJohnsonRobinson_2001_citation_cases.csv",
                 "AudretschFeldman_1996_citation_cases.csv",
                 "BeckKatz_1995_citation_cases.csv",
                 "FearonLaitin_2003_citation_cases.csv",
                 "InglehartBaker_2000_citation_cases.csv",
                 "Uzzi_1996_citation_cases.csv"),
               sep = "")

articles <- c("Acemoglu et al. 2001",
              "Audretsch and Feldman 1996",
              "Beck and Katz 1995",
              "Fearon and Laitin 2003",
              "Inglehart and Baker 2000",
              "Uzzi 1996")

outputs <- c("acemoglu_2001",
             "audretsch_1996",
             "beck_1995",
             "fearon_2003",
             "inglehart_2000",
             "uzzi_1996")

folder <- paste("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/output/",
                outputs,
                sep = "")

# Simple analyses for all six studies
for(i in 1:6){
  dir.create(folder[i]) # create the output folder
  analyze_citations(file = files[i],
                    article = articles[i],
                    output = folder[i])
}

# Topic analyses for all six studies (output folders already exist)
for(i in 1:6){
  topic_analysis(file = files[i],
                 article = articles[i],
                 output = folder[i])
}

References


