Authors: Fernando Meireles, Denisson Silva, and Rogerio Barbosa
rscielo offers functions to easily scrape bibliometric information from scientific journals and articles hosted on the Scientific Electronic Library Online Platform (Scielo.org). The retrieved data include a journal's details and citation counts; an article's contents, footnotes, and bibliographic references; and other information commonly used in bibliometric studies. The package also provides functions to quickly summarize the scraped data.
To install the latest stable release of rscielo from CRAN, use:
install.packages("rscielo")
Alternatively, one may install the latest pre-release version from GitHub via:
# Install remotes first if it is not available
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("meirelesff/rscielo")
At its core, rscielo is a scraper that offers a transparent and reproducible approach to gathering data from the Scientific Electronic Library Online Platform (Scielo.br), one of the largest open repositories for scientific publications in the world. In particular, the package provides functions to automatically extract and parse different types of information from (1) scientific journals (functions with _journal or _journal_ in their names) and (2) articles (functions with _article or _article_ in their names).
To get data from a particular journal, such as citation counts and ISSN, rscielo relies on an ID (or pid) that uniquely identifies each journal within the Scielo repository. As an example, this is the URL of the Brazilian Political Science Review homepage on Scielo:
http://www.scielo.br/scielo.php?script=sci_serial&pid=1981-3821&lng=en&nrm=iso
The journal ID can be found between &pid= and &lng (i.e., 1981-3821). Most of rscielo's functions that retrieve data from journals rely on this information to work. To automatically extract an ID from the URL of a journal, one may use the get_journal_id() function:
get_journal_id("http://www.scielo.br/scielo.php?script=sci_serial&pid=1981-3821&lng=en&nrm=iso")
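To illustrate what this extraction involves, the sketch below pulls the pid query parameter out of a Scielo URL with base R regular expressions. It is not the implementation used by get_journal_id(); the helper name extract_pid() is hypothetical, and the code assumes the ID always appears as the value of a pid= parameter, as in the URL above.

```r
# Hypothetical helper (not part of rscielo): return the value of the
# pid query parameter from a Scielo URL, or NA if the URL has none.
extract_pid <- function(url) {
  # Match "pid=" preceded by "?" or "&", then capture up to the next "&" or "#"
  m <- regmatches(url, regexec("[?&]pid=([^&#]+)", url))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

journal_url <- "http://www.scielo.br/scielo.php?script=sci_serial&pid=1981-3821&lng=en&nrm=iso"
journal_pid <- extract_pid(journal_url)
print(journal_pid)
```

Returning NA for URLs without a pid parameter, rather than raising an error, makes it easier to apply such a helper over a vector of mixed URLs and filter afterwards.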
With a journal ID in hand, use the get_journal() function to scrape meta-data from all articles published in the journal's last issue:
df <- get_journal("1981-3821")
This code returns a tibble in which the observations correspond to the articles that appeared in the selected journal's latest issue. Among the returned variables are authors' names, institutional affiliations, and home countries, as well as articles' abstracts, keywords, and number of pages (run help(get_journal) for a full description of the retrieved data).
For a quick glimpse at the scraped data, one may use the summary method:
summary(df)
get_journal() can also extract data from all articles ever published by a journal. To do that, set the argument last_issue to FALSE:
get_journal("1981-3821", last_issue = FALSE)
rscielo also contains functions to scrape and report publication and citation counts of a journal:
# Get citation metrics
cit <- get_journal_metrics("1981-3821")
# Plot the data for a quick visualization
plot(cit)
get_journal_info() and get_journal_list() scrape a journal's meta-information (publisher, ISSN, and mission) and a list of all journals hosted on Scielo, respectively:
# Get a journal's meta-information
meta_info <- get_journal_info("1981-3821")
# Get a list with all journals' names, URLs, and IDs
journals <- get_journal_list()
Scientific articles stored on Scielo are also identified by a unique ID, which combines their Digital Object Identifier (DOI) with other characters. These IDs can be seen in each article's URL (after &pid= and before &lng=):
# URL of an article
url_article <- "http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1981-38212016000200201&lng=en&nrm=iso&tlng=en"
By design, rscielo accepts full article URLs as inputs, but users may obtain the IDs by using the get_article_id() function:
get_article_id(url_article)
To scrape the content of a single scientific article, rscielo provides the get_article() function:
# Scrape the full text
article <- get_article(url_article)
The function returns the full text of the requested article as a character vector. Users may also pass the article's ID to the function to achieve the same result:
article <- get_article("S1981-38212016000200201")
Or set the argument output_text to FALSE to get a tibble with the article's DOI (which might be useful in bibliometric analysis):
article <- get_article("S1981-38212016000200201", output_text = FALSE)
Similar to the get_journal() function, get_article_meta() returns meta-data of a selected article hosted on Scielo:
url <- "http://www.scielo.br/scielo.php?script=sci_arttext&pid=S1981-38212016000200201&lng=en&nrm=iso&tlng=en"
article_meta <- get_article_meta(url)
To retrieve a list of bibliographic items cited by an article, use get_article_references():
article_references <- get_article_references(url)
The function outputs a tibble in which every bibliographic item corresponds to an observation. get_article_footnotes() returns a similar object, but with footnotes in the rows:
article_foots <- get_article_footnotes(url)
For convenience, here is a description of the rscielo functions.
Functions to extract data from journals:
get_journal_id(): Get a journal's ID from its URL.
get_journal(): Get meta-data of all articles published by a journal.
get_journal_info(): Get a journal's description.
get_journal_list(): Get a list with all journals' names, URLs, and IDs.
get_journal_metrics(): Get publication and citation counts of a journal.
Functions to extract data from articles:
get_article_id(): Get an article's ID from its URL.
get_article(): Get the full text of a single article.
get_article_meta(): Get meta-data of a single article.
get_article_references(): Get the list of bibliographic references cited by a single article.
get_article_footnotes(): Get the list of footnotes of a single article.
Methods:
summary.Scielo(): Summarize the data of a tibble returned by get_journal().
plot.scielo_metrics(): Plot citation counts of a journal retrieved by get_journal_metrics().
rscielo's functions extract data directly from the Scielo online repository. As a result, users might sometimes encounter errors or obtain incomplete information, mainly when using the _article functions to scrape articles' full contents. This happens when journals feed invalid or wrongly formatted information into the Scielo platform. In most situations, a bit of data cleaning solves the issues, but users must be aware that the retrieved data might still be incomplete.
To cite rscielo in publications, use:
citation("rscielo")
We welcome comments and suggestions to improve the package. Feel free to open an issue at our GitHub repository.