The R package citationsr comprises functions that can be used to extract and analyze citation cases. When study A cites study B, it contains text fragments that refer to study B. We call study A a citing document and the text fragments it contains citation cases.
This readme serves to outline the methods applied in Bauer et al. (2016) with contributions from Paul C. Bauer, Pablo Barberá and Simon Munzert. The idea is to go beyond a simple and primitive analysis of impact as 'times cited'. The code is licensed under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
Disclaimer: We currently don't have time to work on this project and can't provide support. But we hope to develop it further at a later stage.
If you have questions please contact us at mail@paulcbauer.eu.
We are interested in questions such as the following:
The tutorial illustrates how the code in the package citationsr can be used to investigate the impact of a particular study. The tutorial targets users who are very familiar with R. In principle, one may want to analyze the impact of a single study or of several studies. Most steps are common to both aims.
As described more extensively in Bauer et al. (2016) we need to pursue the following steps:
Below we present code that cycles through those steps. Essentially, we present the steps pursued in our study in which we investigate the impact of six highly cited studies in the fields of Political Science, Sociology and Economics.
Importantly, all of the above steps rely on methods that come with error. For instance, Step 2, collecting fulltexts, contains error because we are not able to collect all fulltexts. Step 5, extracting citation cases, contains error because it is a technically challenging problem. Hence, the methods we present here are by no means perfect.
Right now you can install the package from GitHub, for which you need to install the devtools package.

    install.packages("devtools")
    library(devtools)
    install_github("paulcbauer/citationsr")
Create a working directory in which you store all the files linked to the analysis. We created a folder called analysis within which most of the analysis will take place.

    dir.create("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis") # Create folder
    setwd("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis") # Set working directory

    # Create a folder in which you store the citation information
    dir.create("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/citations") # Create folder
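The absolute Windows paths above are specific to our machines. As a minimal sketch, assuming you prefer to define the project root once, the same folder structure can be built with `file.path()`. The variable names `project_root` and `citations_dir` are made up for this illustration, and a temporary directory stands in for the real location.

```r
# Hypothetical project root; a temporary directory is used here so that the
# example runs anywhere. Replace it with your own analysis folder.
project_root <- file.path(tempdir(), "analysis")
citations_dir <- file.path(project_root, "citations")

# recursive = TRUE also creates missing parent folders; showWarnings = FALSE
# keeps the call quiet when the folders already exist.
dir.create(citations_dir, recursive = TRUE, showWarnings = FALSE)

dir.exists(project_root)
dir.exists(citations_dir)
```

All later paths can then be derived from `project_root` instead of being typed out in full.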
In principle there are several ways to do this. Here we just present one approach relying on the Web of Science. Naturally, the citation information from the Web of Science is biased in various ways, e.g. it mostly contains journal publications and not books.
It is rather easy to program an R function that scrapes websites such as the Web of Science. However, since that is not legal, we present the manual way here. The Web of Science lets you download 500 citation records at a time and you can obtain them following the steps below. Unfortunately, you need access to the Web of Science (usually through your university).
analysis/citations folder.
analysis/citations folder should look like below.
    files <- dir("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/citations", pattern = ".txt")
    citation_data <- NULL

    for(i in files){ # loop over files and merge them
      file.name <- paste("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/citations/", i, sep="")
      print(i)
      x <- readLines(file.name)
      x <- stringr::str_replace_all(x, '"', "'") # replace double quotes with single quotes
      writeLines(x, con = file.name) # overwrite the file with the modified lines
      citations <- readr::read_delim(file.name, delim = "\t")
      print(nrow(citations))
      citations$filename <- i
      citation_data <- rbind(citation_data, citations)
    }

    save(citation_data, # save the dataframe with the citations
         file = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/citation_data.RData")
    # View(citation_data) # Have a look at the data
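The merging logic can be tried out without real Web of Science exports. The sketch below uses base R only and two made-up tab-delimited files. The column names `AU`, `TI` and `DI`, the file names and the DOIs are invented for illustration, and `read_one()` is a hypothetical helper rather than part of citationsr.

```r
# Two small tab-delimited files standing in for exported citation records.
export_dir <- file.path(tempdir(), "citations_demo")
dir.create(export_dir, showWarnings = FALSE)
writeLines(c("AU\tTI\tDI",
             "Smith, J\tFirst paper\t10.1000/a1"),
           file.path(export_dir, "savedrecs1.txt"))
writeLines(c("AU\tTI\tDI",
             "Lee, K\tSecond paper\t10.1000/b2",
             "Diaz, M\tThird paper\t10.1000/c3"),
           file.path(export_dir, "savedrecs2.txt"))

files <- list.files(export_dir, pattern = "\\.txt$", full.names = TRUE)

# Read one file and remember which file each record came from.
# quote = "" switches off quote handling, so stray quotation marks in titles
# do not need to be replaced in the files beforehand.
read_one <- function(path) {
  records <- read.delim(path, quote = "", stringsAsFactors = FALSE)
  records$filename <- basename(path)
  records
}

# Read all files first, then bind them once instead of growing a data frame.
citation_demo <- do.call(rbind, lapply(files, read_one))
nrow(citation_demo)
```

Unlike the loop above, this variant does not overwrite the export files on disk.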
Again there are several ways to do this. We'll go with the third approach.
Go to Google Scholar. Search for your article of interest. Click on cited by. Go through the articles and download those that you can access through the links on the right. Potentially, you will get more articles than by going through the citations that you get from the Web of Science. On the other hand, it's incredibly laborious.
The fulltext package and the contained function ft_get() by Scott Chamberlain should be able to scrape fulltexts of articles if you feed it a set of DOIs.
    # Packages
    library(fulltext)

    # Load DOIs
    load("dois.RData")

    # Use subset of DOIs
    dois <- dois[1:20]

    # GET THE LINKS FOR DOIs
    links <- list()
    for(i in seq_along(dois)){ # loop over DOIs
      print(i) # print counter
      print(dois[i]) # print DOI
      x <- try(ft_links(dois[i]), silent = TRUE) # continue even when error
      if(!inherits(x, "try-error")){ # IF no error write link to list
        links[[i]] <- c(x[[1]]$data[[1]][[1]], x[[1]]$data[[1]][[2]]) # fill list element with links
      }else{
        links[[i]] <- "empty" # IF error write "empty" to list
      }
    }
    names(links) <- dois # name list elements with DOIs

    # GET THE FULLTEXTS FOR DOIs
    texts <- list()
    for(i in seq_along(dois)){
      print(i) # print counter
      print(dois[i]) # print DOI
      x <- try(ft_get(dois[i]), silent = TRUE) # continue even when error
      if(!inherits(x, "try-error")){ # when no error write text to list
        texts[[i]] <- x[[1]]$data$data[[1]]
      }else{
        texts[[i]] <- "empty" # when error write "empty" to list
      }
    }
    names(texts) <- dois
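The important part of the loops above is that a failed download must not stop the whole run. The sketch below isolates that pattern with `tryCatch()`. `fetch_stub()` is a hypothetical stand-in for a download function such as `ft_get()`, and the DOIs are made up, so nothing is fetched from the network.

```r
# Hypothetical download function: it fails for DOIs ending in "x" so that
# the error path is exercised.
fetch_stub <- function(doi) {
  if (grepl("x$", doi)) stop("no fulltext available for ", doi)
  paste("fulltext of", doi)
}

dois_demo <- c("10.1000/a1", "10.1000/b2x", "10.1000/c3")

# Return NA instead of aborting when a single DOI fails.
texts_demo <- vapply(
  dois_demo,
  function(doi) tryCatch(fetch_stub(doi), error = function(e) NA_character_),
  character(1)
)

# DOIs for which nothing could be retrieved
failed <- names(texts_demo)[is.na(texts_demo)]
failed
```

Storing `NA` rather than the string "empty" makes it easy to count or retry the failures later with `is.na()`.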
Paperpile is a commercial reference manager (3 Euros/month for academics) that works together with Google Drive. The nice thing is that it includes a very good PDF scraper (as do other reference managers). Once you upload DOIs for the studies for which you want to collect fulltexts, Paperpile does a good job of downloading them to your Google Drive.
To proceed we need two helper functions that are available in a package called paperpiler.
paperpiler::gen_ris() below takes a set of DOIs and generates a *.ris file that can be imported into Paperpile.

    install.packages("devtools")
    library(devtools)
    install_github("paulcbauer/paperpiler")
    library(paperpiler)

    dois <- citation_data$DI # Store dois in object

    # gen_ris: creates a ris file named 'records.ris' in the working directory to fetch the fulltexts.
    gen_ris(dois = dois, filename = "records.ris")
    # More on this file format: https://en.wikipedia.org/wiki/RIS_(file_format)
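To give an idea of what such a file contains, here is a minimal sketch that writes one RIS record per DOI using base R. `write_doi_ris()` is a hypothetical helper written for this illustration; it is not the implementation of `gen_ris()`, and the record type `JOUR` is an assumption that every entry is a journal article.

```r
# Write a bare-bones RIS file: each record has a type (TY), a DOI (DO)
# and an end-of-record marker (ER).
write_doi_ris <- function(dois, filename) {
  # Drop missing, empty and duplicated DOIs
  dois <- unique(dois[!is.na(dois) & nzchar(dois)])
  records <- unlist(lapply(dois, function(doi) {
    c("TY  - JOUR", paste0("DO  - ", doi), "ER  - ", "")
  }))
  writeLines(as.character(records), filename)
  invisible(length(dois))
}

ris_file <- file.path(tempdir(), "records_demo.ris")
n_written <- write_doi_ris(c("10.1000/a1", NA, "10.1000/b2", "10.1000/a1"), ris_file)

ris_lines <- readLines(ris_file)
ris_lines
```

The reference manager is then expected to look up the remaining metadata from the DOI alone.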
Once you have created the ris file called, e.g., records.ris, you can import it into Paperpile. See the steps below in which you import the file and the corresponding records into a folder called impactanalysis. You might also want to label all records in that folder with a particular label, e.g. impactanalysis, to keep track of them. Importantly, Paperpile only downloads articles onto your Google Drive that are accessible through your university network.
Choose "Upload Files" in the "Add Papers" menu.
"You are about to start an automatic bulk download of 9428 papers. If too many papers in your library are from a single publisher, your computer might get temporarily blocked from access to the publisher's website. Also, if many of your papers have incomplete meta-data your computer might get blocked by Google Scholar because Paperpile make too many requests to find the missing data. Although temporary blocks are rare and not a big problem, please consider downloading the PDFs in smaller batches."
paperpiler::fetch_docs() relies on the dataframe citation_data that we generated from the Web of Science and searches through your Paperpile directory. It tries to identify files through their title and, if that does not work, through an author/year combination. It's fuzzy, so it may fetch more docs than necessary.

    load("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/citation_data.RData")

    fetch_docs(from = "C:/GoogleDrive/Paperpile",
               to = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs",
               citations = citation_data)
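To illustrate why fuzzy matching is useful here and why it can also return too many files, the sketch below uses base R's `agrepl()` for approximate matching. This is a swapped-in technique for illustration only; we do not claim it is what `fetch_docs()` does internally. The title and file names are made up.

```r
# Title as it might appear in the citation data (made up)
title <- "Institutions and long-run growth in panel data"

# File names as they might appear in a reference manager folder (made up)
file_names <- c(
  "Smith 2005 - Institutions and long run growth in panel data.pdf",
  "Lee 2010 - Trust and networks.pdf"
)

# max.distance = 0.1 tolerates edits of up to roughly 10% of the title length,
# so small differences in punctuation or spelling still count as a match.
is_match <- agrepl(title, file_names, max.distance = 0.1, ignore.case = TRUE)
file_names[is_match]
```

Raising `max.distance` finds more files with slightly different names, at the price of more false positives, which is the trade-off described above.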
Paperpile allows you to store several versions of an article. Normally, these are marked with "(1)", "(2)" etc. in their file names. Use the code below to delete any duplicate files that you fetched from the Paperpile folder.
    folder <- "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs"
    file.names <- dir(folder, pattern = "\\(1\\)|\\(2\\)|\\(3\\)|\\(4\\)|\\(5\\)")
    file.paths <- paste(paste(folder, "/", sep = ""), file.names, sep = "")

    # deleting (file.remove() is vectorised and does nothing if no duplicates were found)
    file.remove(file.paths)
This step can be skipped. Sometimes it's a good idea to rename the PDF files before you analyze them. The code below simply searches for all PDF documents in the folder and renames them sequentially (doc1.pdf, doc2.pdf and so on).
    # Indicate the folder in which you stored the docs.. in our case 'docs'
    folder <- "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs"

    # Rename filenames for first extraction
    file.names <- dir(folder, pattern = ".pdf|.PDF")
    file.paths <- paste(paste(folder, "/", sep = ""), file.names, sep = "")

    # renaming (seq_along() avoids a faulty loop when the folder contains no PDFs)
    for(i in seq_along(file.paths)){
      file.rename(from = file.paths[i],
                  to = paste(folder, "/doc", i, ".pdf", sep = ""))
    }
Now, we need to extract text from those PDFs to analyze them. Here you can use the extract_text() function. The argument from specifies the folder in which the documents are located. number can be omitted or specified to limit the number of documents for which you want to extract text (e.g. for testing extraction for a few documents starting with the first).
With method = "pdftotext", extraction relies on pdftotext.exe. You have to indicate the path to pdftotext.exe in the extract_text() function.

    library(citationsr) # Load citationsr package

    extract_text(from = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs",
                 to = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs_text",
                 number = NULL,
                 method = "pdftotext",
                 path.pdftotext = "C:/Program Files/pdftotext.exe")
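If you are unsure whether pdftotext is installed, or want to test the conversion of a single file outside the package, something along the following lines may help. `convert_pdf()` is a hypothetical helper, not part of citationsr. It assumes the usual command-line form `pdftotext input.pdf output.txt` and does nothing when no executable is found.

```r
# Look for pdftotext on the search path; the result is "" if it is missing.
pdftotext_path <- Sys.which("pdftotext")

# Convert one PDF to text and report whether the call succeeded.
convert_pdf <- function(pdf, txt, exe = pdftotext_path) {
  if (!nzchar(exe)) {
    message("pdftotext not found; skipping ", pdf)
    return(FALSE)
  }
  # shQuote() protects paths that contain spaces
  status <- system2(exe, args = c(shQuote(pdf), shQuote(txt)))
  status == 0
}

# Example call with an explicitly missing executable: nothing is run.
convert_pdf("doc1.pdf", "doc1.txt", exe = "")
```

On Windows you would pass the full path, e.g. the one used for `path.pdftotext` above, as `exe`.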
*.txt that are present in the folder specified in to =. If the folder does not exist it will create it.

Above we started with citation data from the Web of Science. There might be cases where we just have text documents or PDFs but we don't have any more information on them. The function get_metadata() analyzes the text documents in the folder specified by from = and tries to identify them (relying on the DOIs they contain). Crossref does not like it if you scrape metadata for too many docs at once. So ideally execute the function for batches of files by specifying start = and end =.
    library(citationsr) # Load citationsr package

    get_metadata(from = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs_text",
                 file = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/metadata-1-500.RData",
                 encoding = "ASCII", # Or take UTF8
                 start = 1,
                 end = 500)

    load("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/metadata.RData")
    View(metadata)
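When there are more than a few hundred files, the batch boundaries for start = and end = can be computed rather than typed by hand. The sketch below only prepares the ranges and matching file names. The number of files and the batch size are made-up values, and the file name pattern simply follows the `metadata-1-500.RData` example above.

```r
n_files <- 1234    # hypothetical number of text files in the folder
batch_size <- 500  # assumed number of documents per batch

# First and last file index of every batch
starts <- seq(1, n_files, by = batch_size)
ends <- pmin(starts + batch_size - 1, n_files)

batches <- data.frame(
  start = starts,
  end = ends,
  file = sprintf("metadata-%d-%d.RData", starts, ends)
)
batches

# Each row could then be passed to get_metadata(start = ..., end = ..., file = ...),
# ideally with a pause such as Sys.sleep() between batches.
```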
doc_text folder. Simply specify the rank of the first and last file for which you would like to collect metadata through start = and end =.

If no strange encoding appears in your files, you can safely ignore these encoding issues and skip the code below. However, sometimes it may be necessary. We worked on both Windows PCs and Macs and sometimes ran into encoding issues.
    # Modify encoding for filenames
    file.names <- dir("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs_text", pattern = ".pdf|.txt")
    file.paths <- paste(paste("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs_text", "/", sep = ""), file.names, sep = "")
    file.paths2 <- stringr::str_replace(file.paths, "“", "")
    file.paths2 <- stringr::str_replace(file.paths2, "'", '')
    file.paths2 <- iconv(file.paths2, "UTF-8", "ASCII", sub = "")

    # Modify file names
    for(i in 1:length(file.paths)){
      file.rename(from = file.paths[i], to = file.paths2[i])
    }
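Before renaming files on disk, it can be worth checking what the conversion does to the names, because dropping non-ASCII characters may map two different files to the same name. The following sketch works on made-up file names only and touches no files. The exact result of `iconv()` can differ between platforms, so no output is shown.

```r
# Made-up file names; \u escapes are used so the example does not depend on
# the encoding of this script.
file_names <- c("M\u00fcller 2010 \u201cTrust\u201d.txt", "Smith 2005.txt")

# Convert to ASCII, replacing characters that cannot be converted with ""
ascii_names <- iconv(file_names, from = "UTF-8", to = "ASCII", sub = "")
ascii_names

# TRUE would mean that at least two files end up with the same new name
has_collision <- anyDuplicated(ascii_names) > 0
has_collision
```

If `has_collision` is `TRUE`, renaming would overwrite or fail for some files, so those names need to be handled by hand first.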
The extraction of citation cases works better in a .txt file in which the text is not interrupted by running heads, page numbers, etc. Below we provide two functions that try to clean the text at least to some extent.
delete_refs_n_heads() deletes references and running headers (e.g. author names on every page). It relies on the metadata that was collected before through get_metadata(). A file deleted_running_headers.html, which includes all the lines that were deleted from the .txt files, is stored in your working directory.

clean_text() replaces any dots that are not punctuation marks, among other things. For instance, it converts abbreviations, e.g. "No." to "NUMBER". It also produces a text without linebreaks.

    load("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/metadata.RData")
    View(metadata)

    delete_refs_n_heads(folder = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs_text",
                        metadata = metadata,
                        encoding = "ASCII")

    clean_text(folder = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs_text",
               encoding = "ASCII")
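The reason for replacing such dots is that a later split into sentences would otherwise break at every abbreviation. The sketch below shows the idea on a made-up text with base R regular expressions. It is not the implementation of `clean_text()`, and the replacement "EXAMPLE" for "e.g." is an assumption made for illustration (only "No." to "NUMBER" is taken from the description above).

```r
# Made-up text spread over several lines, as it might come out of a PDF
raw_text <- c("See Table No. 3 for details.",
              "Results are robust,",
              "e.g. to outliers.")

# Remove linebreaks by joining the lines
flat <- paste(raw_text, collapse = " ")

# Replace dots that do not end a sentence
flat <- gsub("\\bNo\\.", "NUMBER", flat, perl = TRUE)
flat <- gsub("\\be\\.g\\.", "EXAMPLE", flat, perl = TRUE)

# Collapse repeated whitespace
flat <- gsub("\\s+", " ", flat)

# Now every remaining dot followed by whitespace marks a sentence boundary
sentences <- strsplit(flat, "(?<=\\.)\\s+", perl = TRUE)[[1]]
sentences
```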
The extract_citation_cases() function cycles through files ending in _processed.txt in the from folder and extracts citation cases. It writes both html (for easy lookup) and csv (for analyses later) files to the to folder. The files are named according to the study whose impact we study, e.g. AcemogluJohnsonRobinson_2001_citation_cases.html and AcemogluJohnsonRobinson_2001_citation_cases.csv.
    extract_citation_cases(from = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs_text",
                           to = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis",
                           authorname = "Acemoglu, Johnson, Robinson",
                           studyyear = "2001",
                           encoding = "ASCII")
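To make the notion of a citation case concrete, here is a deliberately simplified sketch in base R: split a text into sentences and keep those that mention the first author shortly before the publication year. The text is made up and the matching rule is an assumption for illustration; the logic inside `extract_citation_cases()` is more involved.

```r
# Made-up passage from a citing document
text <- paste(
  "Institutions matter for growth (Acemoglu, Johnson, and Robinson 2001).",
  "We use panel data.",
  "As Acemoglu et al. (2001) argue, settler mortality is a useful instrument."
)

# Split after sentence-final punctuation that is followed by a capital letter,
# so that "et al. (2001)" does not break a sentence.
sentences <- strsplit(text, "(?<=[.!?])\\s+(?=[A-Z])", perl = TRUE)[[1]]

# A sentence counts as a citation case if the first author's name is followed
# by the year within 60 characters.
is_case <- grepl("Acemoglu.{0,60}2001", sentences)
citation_cases_demo <- sentences[is_case]
citation_cases_demo
```

Rules like these inevitably miss some cases and pick up false ones, which is one source of the error mentioned at the beginning.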
Above we only looked at a single study. In our paper we investigate the impact of six studies. We stored the information on those six studies, such as authors and publication year, in a text file called publicationdata.txt and load this information to start with.
    publicationdata <- read.table("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/publicationdata.txt",
                                  sep = ";",
                                  header = TRUE,
                                  stringsAsFactors = FALSE)
We loop over the info in publicationdata.txt and extract the citation cases below. For each of the six studies it produces html and csv files with the citation cases.
    for (i in 1:6){

      # SPECIFY ARGUMENT FOR SINGLE FUNCTIONS
      study.title <- publicationdata[i,2]
      authorname <- publicationdata[i,3]
      studyyear <- publicationdata[i,4]

      # EXTRACT CITATION CASES
      extract_citation_cases(from = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/docs_text",
                             to = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis",
                             authorname = authorname,
                             studyyear = studyyear,
                             encoding = "ASCII")
    }
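The loop above reads the title, author names and year from columns 2 to 4. A file with the following layout would fit that indexing. Note that the column names, the placeholder titles and the second author string are assumptions made for this sketch; only the column positions and the author string "Acemoglu, Johnson, Robinson" come from the code above.

```r
# Hypothetical content of a semicolon-separated publication file
publication_text <- "id;title;authorname;studyyear
1;Placeholder title A;Acemoglu, Johnson, Robinson;2001
2;Placeholder title B;Beck, Katz;1995"

publication_demo <- read.table(text = publication_text,
                               sep = ";",
                               header = TRUE,
                               quote = "",
                               stringsAsFactors = FALSE)

# Iterate over however many studies the file contains instead of a fixed 1:6
for (i in seq_len(nrow(publication_demo))) {
  message("Study ", i, ": ", publication_demo[i, 3], " (", publication_demo[i, 4], ")")
}
```

Using `seq_len(nrow(...))` keeps the loop correct when studies are added to or removed from the file.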
After citation case extraction we end up with a dataframe in which the rows are citation cases (in our case 6 dataframes for 6 studies). The columns are different variables that contain information such as...
document: Contains the path and name of the file that was subject to extraction
citation.case: Contains the extracted citation case/text fragment
year: Contains the publication year of the citing document
nchar.citation.case: Contains the number of characters of the citation case

From here on you can apply any methods you like to analyze the citation case data.
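As a starting point for your own analyses, the sketch below builds a tiny made-up dataframe with the four columns described above and computes two simple summaries in base R. The values are invented, and a real file would be read from the csv output instead.

```r
# Made-up citation cases with the columns described above
citation_cases <- data.frame(
  document = c("docs_text/doc1_processed.txt",
               "docs_text/doc1_processed.txt",
               "docs_text/doc2_processed.txt"),
  citation.case = c("Institutions matter (Acemoglu et al. 2001).",
                    "We follow Acemoglu et al. (2001).",
                    "Unlike Acemoglu et al. (2001), we find no effect."),
  year = c(2005, 2005, 2010),
  stringsAsFactors = FALSE
)
citation_cases$nchar.citation.case <- nchar(citation_cases$citation.case)

# Number of citation cases per publication year of the citing document
cases_per_year <- table(citation_cases$year)
cases_per_year

# Average length of a citation case per year
mean_length <- tapply(citation_cases$nchar.citation.case, citation_cases$year, mean)
mean_length
```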
We also programmed some functions to automate the process of analysis.
    dir.create("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/output")
    dir.create("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/output/acemoglu_2001")
analyze_citations() produces some simple analyses/graphs of the citation cases.

    analyze_citations(file = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/AcemogluJohnsonRobinson_2001_citation_cases.csv", # File with citation cases
                      article = "Acemoglu, Johnson & Robinson (2001)", # Specify name of article
                      output = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/output/acemoglu_2001") # Specify output folder
topic_analysis() performs some topic analysis on the citation cases.

    topic_analysis(file = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/AcemogluJohnsonRobinson_2001_citation_cases.csv", # File with citation cases
                   article = "Acemoglu, Johnson & Robinson (2001)", # Specify name of article
                   output = "C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/output/acemoglu_2001") # Specify output folder
Just as we extracted the citation cases referring to several studies with a loop above we can also apply the two analysis functions within a loop.
    files <- paste("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/",
                   c("AcemogluJohnsonRobinson_2001_citation_cases.csv",
                     "AudretschFeldman_1996_citation_cases.csv",
                     "BeckKatz_1995_citation_cases.csv",
                     "FearonLaitin_2003_citation_cases.csv",
                     "InglehartBaker_2000_citation_cases.csv",
                     "Uzzi_1996_citation_cases.csv"),
                   sep = "")

    articles <- c("Acemoglu et al. 2001 ",
                  "Audretsch and Feldman 1996 ",
                  "Beck and Katz 1995 ",
                  "Fearon and Laitin 2003 ",
                  "Inglehart and Baker 2000 ",
                  "Uzzi 1996 ")

    outputs <- c("acemoglu_2001", "audretsch_1996", "beck_1995",
                 "fearon_2003", "inglehart_2000", "uzzi_1996")

    folder <- paste("C:/GoogleDrive/1-Research/2017_Quality_of_citations/analysis/output/", outputs, sep = "")

    for(i in 1:6){
      dir.create(folder[i])
      analyze_citations(file = files[i],
                        article = articles[i],
                        output = folder[i])
    }

    for(i in 1:6){
      dir.create(folder[i])
      topic_analysis(file = files[i],
                     article = articles[i],
                     output = folder[i])
    }