paulcbauer/citationsr: A package to analyze citations

Introduction

The package is meant to help in various situations. For instance, you might be interested in the impact of a particular study: not just its quantitative impact ("times cited"), but what people actually write about that study. This could even be your own study, if you want to know why people cite you. To be continued...

Basic Steps

Below are the steps you would normally follow when analyzing a study's citations. We provide functions that facilitate parts of these steps. These functions should be executed in the order given below.

1. Collect citation information, i.e. which works cite your study or studies of interest (called citing documents)
   - Get hold of the citation data (DOIs) of the papers citing your study of interest.
2. Collect the full text of citing documents and store all of them in a folder
   - Use Paperpile or the fulltext package.
3. Convert citing documents into raw text
   1. Rename *.pdf files in the folder to doc*.pdf.
   2. Use extract_text() to generate doc*.txt files from the pdf files.
4. Collect metadata on citing documents and rename files
   - Use get_meta_data_doc_nv() to identify the doc*.txt files (using the DOIs or titles in them) and then rename those txt files with the information you identified. The idea is that you might have a bunch of raw text files without any additional information on them.
5. Clean the text before citation case extraction
   - Use delete_refs_n_heads() and clean_text() to clean the text files, e.g. to delete running titles. This is necessary because citation cases may run across pages.
6. Re-check whether the study or studies were really cited in those documents
   - Use identify_study_doc() to do that. This step is not essential.
7. Extract citation cases from citing documents
   - Use extract_citation_cases() to extract the citation cases. The results are stored as a .csv file in the directory where you keep your data in a docs folder.
   - If necessary, use clean_citation_cases() afterwards.
8. Analyze quality of citation cases (impact)
   - Use analyze_citations() to analyze the .csv files that contain the citation cases.
   - Use topic_analysis() to do the topic analysis on those .csv files.
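
The renaming step (*.pdf to doc*.pdf) can be done by hand, but a few lines of base R will also do it. The following is only a minimal sketch: rename_pdfs_to_doc() is a hypothetical helper that is not part of the package, and it assumes that numbering the files in alphabetical order is acceptable.

```r
# Sketch only: rename every PDF in a folder to doc1.pdf, doc2.pdf, ...
# rename_pdfs_to_doc() is a hypothetical helper, not a citationsr function.
rename_pdfs_to_doc <- function(folder) {
  # Full paths of all PDFs, sorted so the numbering is reproducible
  old_paths <- sort(list.files(folder, pattern = "\\.pdf$",
                               full.names = TRUE, ignore.case = TRUE))
  if (length(old_paths) == 0) {
    return(data.frame(old = character(0), new = character(0),
                      stringsAsFactors = FALSE))
  }

  idx <- seq_along(old_paths)
  new_paths <- file.path(folder, sprintf("doc%d.pdf", idx))

  # Go through temporary names first, so that a file already called
  # doc1.pdf is not overwritten while the others are being renamed.
  # (Assumes no existing file is named tmp_rename_<n>.pdf.)
  tmp_paths <- file.path(folder, sprintf("tmp_rename_%d.pdf", idx))
  file.rename(old_paths, tmp_paths)
  file.rename(tmp_paths, new_paths)

  # Return the mapping so the original file names are not lost
  data.frame(old = basename(old_paths), new = basename(new_paths),
             stringsAsFactors = FALSE)
}
```

Calling rename_pdfs_to_doc("docs") would rename the PDFs in the docs folder and return a data.frame that maps old to new file names. Keeping that mapping is useful, because the original file names often carry author or title information.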

To be continued...

Retrieving meta data from parsed journal articles

By "quality of citations", we mean that citing a reference has more value than just being a numeric event; the context of a citation matters as well. We can take multiple perspectives on this context, e.g., by looking at the semantic context within the text, or at a higher-level context, that is, particular features of the document in which a reference was cited. To get to grips with the latter, we provide the get_metadata_doc() function, a tool that enables you to identify various features of parsed journal articles.

Fundamental logic of `get_metadata_doc()`

Working with get_metadata_doc() is very simple. It takes a full-text journal article as input, which must first be converted to txt format with the package's own extract_text() function or with other tools (e.g., Jeroen Ooms' pdftools package).

The function imports part of the txt file. By default, the first 2,000 lines are imported, but you can change that using the lines.import parameter. The limit keeps the function fast, and the information most useful for extracting meta data tends to appear on the first few pages anyway. After some cleaning, the function looks for a DOI (digital object identifier) that is linked to the paper. If it finds one, it uses the identifier to query the Crossref API via the rcrossref package. If no DOI is found, it still runs a query, using the first 20 lines of the document as the query string. If the query succeeds, the API returns a lot of information on the article, which is stored in a data.frame object. Optionally, you can also download the article reference in bibtex format using bibtex = TRUE.
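
To make the lookup logic described above more concrete, here is a minimal sketch in base R of the two inputs such a query could use: a DOI found in the head of the file, or, failing that, the first 20 lines pasted together. The helpers find_doi() and build_query() are hypothetical and not part of the package, and the regular expression is a common heuristic for modern DOIs, not necessarily the pattern get_metadata_doc() uses internally.

```r
# Sketch only: find_doi() and build_query() are hypothetical helpers,
# not citationsr functions.

# Return the first DOI-like string in the first `lines_import` lines of a
# text file, or NA if none is found.
find_doi <- function(path, lines_import = 2000) {
  txt <- readLines(path, n = lines_import, warn = FALSE)
  # Heuristic: "10.", 4 to 9 digits, a slash, then non-space characters
  pattern <- "10\\.[0-9]{4,9}/[^[:space:]\"<>]+"
  hits <- regmatches(txt, regexpr(pattern, txt))
  if (length(hits) == 0) {
    return(NA_character_)
  }
  # Strip trailing punctuation that often follows a DOI in running text
  sub("[.,;)]+$", "", hits[1])
}

# Fallback: collapse the first `n` lines into one query string
build_query <- function(path, n = 20) {
  txt <- trimws(readLines(path, n = n, warn = FALSE))
  paste(txt[nzchar(txt)], collapse = " ")
}
```

With rcrossref installed, the result of find_doi() could then be passed to rcrossref::cr_works(dois = ...), and the fallback string to rcrossref::cr_works(query = ..., limit = 1). Both calls need an internet connection and are therefore left out of the sketch.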

Once the function has identified the corresponding journal, it also taps a journal rankings database from SJR (SCImago Journal & Country Rank), matches it with the journal and appends information on various journal-level citation statistics.
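
The matching step can be pictured as a join on a normalized journal title. The sketch below uses base R and made-up placeholder data; the column names, the normalize_title() helper and the values are all assumptions for illustration and do not reflect the actual SJR table or the package's internal matching rules.

```r
# Sketch only: all data, column names and normalize_title() are
# made-up placeholders, not taken from SJR or from citationsr.

# Reduce a journal title to a lower-case key without punctuation
normalize_title <- function(x) {
  x <- tolower(x)
  x <- gsub("&", "and", x, fixed = TRUE)
  x <- gsub("[^a-z0-9 ]", " ", x)
  trimws(gsub(" +", " ", x))
}

# Article-level metadata, one row per citing document
articles <- data.frame(
  doc = c("doc1", "doc2"),
  journal = c("Journal of Examples & Tests", "Unlisted Review"),
  stringsAsFactors = FALSE
)

# Journal-level ranking table
rankings <- data.frame(
  title = c("journal of examples and tests", "Another Journal"),
  sjr = c(1.5, 0.7),
  stringsAsFactors = FALSE
)

articles$key <- normalize_title(articles$journal)
rankings$key <- normalize_title(rankings$title)

# Left join: keep every article, add ranking columns where a match exists
merged <- merge(articles, rankings[, c("key", "sjr")],
                by = "key", all.x = TRUE)
```

Using all.x = TRUE keeps articles whose journal is not in the ranking table; their journal-level columns are simply NA.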

Finally, you can decide which variables to store in the output data.frame. An overview of all available variables can be found in the documentation; see ?get_metadata_doc.

Example

Parsing one document

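# Set the working directory to the project's data directory
# (adjust the path for your machine)
setwd("C:\\Users\\Paul\\GDrive\\1-Research\\2017_Quality_of_citations\\data")

### Set folder ###
# Folder inside the working directory that holds the doc*.txt files
folder <- "docs"
setnumber <- 200
get_metadata_doc_nv(folder, number = setnumber)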