Introduction to getLattes
In getLattes: Import and Process Data from the 'Lattes' Curriculum Platform

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

The Lattes platform has been hosting curricula of Brazilian researchers since the late 1990s, containing more than 5 million curricula. The data from the Lattes curricula can be downloaded to XML format, the complexity of this reading process motivated the development of the getLattes package, which imports the information from the XML files to a list in the R software and then tabulates the Lattes data to a data.frame.

The main information contained in XML files, and imported via getLattes, are:

Research Area getAreasAtuacao()
Published Papers getArtigosPublicados()
Profissional Links getAtuacoesProfissionais()
Ph.D. Examination Board's getBancasDoutorado()
Undergraduate Examination Board's getBancasGraduacao()
Master Examination Board's getBancasMestrado()
Books Chapters getCapitulosLivros()
General Data getDadosGerais()
Profissional Address getEnderecoProfissional()
Events and Congresses getEventosCongressos()
Profissional Formation (Ph.D. Thesis) getFormacaoDoutorado()
Profissional Formation (Master Thesis) getFormacaoMestrado()
Profissional Formation (Undergraduation) getFormacaoGraduacao()
Languages getIdiomas()
Research Lines getLinhaPesquisa()
Published Books getLivrosPublicados()
Event's Organization getOrganizacaoEvento()
Academic Advisory (Ph.D. Thesis) getOrientacoesDoutorado()
Academic Advisory (Master Thesis) getOrientacoesMestrado()
Academic Advisory (Post Doctorate) getOrientacoesPosDoutorado()
Other Technical Productions getOutrasProducoesTecnicas()
Participation in Projects getParticipacaoProjeto()
Technical Production getProducaoTecnica()
Patents getPatentes()
Personal Lattes 16 digits identification getId()

From the functionalities presented in this package, the main challenge to work with the Lattes curriculum data is now to download the data, as there are Captchas. To download a lot of curricula I suggest the use of Captchas Negated by Python reQuests - CNPQ. The second barrier to be overcome is the management and processing of a large volume of data, the whole Lattes platform in XML files totals over 200 GB. In this tutorial we will focus on the getLattes package features, being the reader responsible for download and manage the files.

Follow an example of how to search and download data from the Lattes website.

Installation

To install the released version of getLattes from github.

# install and load devtools from CRAN
# install.packages("devtools")
library(devtools)

# install and load getLattes
devtools::install_github("roneyfraga/getLattes")

Load getLattes.

library(getLattes)

# support packages
library(xml2)
library(dplyr)
library(tibble)
library(purrr)

Single curriculum

Import

Using the get* functions to import data from a single curriculum is straightforward. The curriculum need to be imported into R by the read_xml() function from the xml2 package.

# esse executo para testar meu código
# o próximo chunk é só para aparecer o código no html
curriculo <- xml2::read_xml('inst/extdata/4984859173592703.zip')

curriculo <- xml2::read_xml('../inst/extdata/4984859173592703.zip')

`get` functions

getDadosGerais(curriculo)
getArtigosPublicados(curriculo)
getAreasAtuacao(curriculo)
getArtigosPublicados(curriculo)
getAtuacoesProfissionais(curriculo)
getBancasDoutorado(curriculo)
getBancasGraduacao(curriculo)
getBancasMestrado(curriculo)
getCapitulosLivros(curriculo)
getDadosGerais(curriculo)
getEnderecoProfissional(curriculo)
getEventosCongressos(curriculo)
getFormacaoDoutorado(curriculo)
getFormacaoMestrado(curriculo)
getFormacaoGraduacao(curriculo)
getIdiomas(curriculo)
getLinhaPesquisa(curriculo)
getLivrosPublicados(curriculo)
getOrganizacaoEventos(curriculo)
getOrientacoesDoutorado(curriculo)
getOrientacoesMestrado(curriculo)
getOrientacoesPosDoutorado(curriculo)
getOutrasProducoesTecnicas(curriculo)
getParticipacaoProjeto(curriculo)
getProducaoTecnica(curriculo)
getId(curriculo)

Several curricula

Import

To import data from two or more curricula it is easier to use list.files(), a native R function, or dir_ls() from fs package. As xml2::read_xml() allow to read a xml curriculum inside a zip files, we can insert both options in pattern argument.

# esse executo para testar meu código
# o próximo chunk é só para aparecer o código no html
files <- list.files(path = 'inst/extdata/', pattern = '*.xml|*.zip', full.names = T)
system.file("extdata", "4984859173592703.zip", package = "getLattes")

files <- list.files(path = '../inst/extdata/', pattern = '*.xml|*.zip', full.names = T)

Import the listed curricula to R memory as xml2::read_xml object.

curriculos <- lapply(files, read_xml)

The lapply() function is a well-known and widely used alternative in the R world. However, it does not natively handle errors, which makes the map function from the purrr package an excellent alternative.

Adding an extra layer of complexity, I will use pipe |>. Programming using the pipe operator |> allows faster coding and clearer syntax.

curriculos <- 
    purrr::map(files, safely(read_xml)) |> 
    purrr::map(pluck, 'result')

`get` functions

To read data from only one curriculum any function get can be executed singly, but to import data from two or more curricula is easier to use get* functions with lapply() or map().

dados_gerais <- 
    purrr::map(curriculos, safely(getDadosGerais)) |>
    purrr::map(pluck, 'result') 

dados_gerais

Import general data from 2 curricula. The output is a list of data frames, converted by a unique data frame with bind_rows().

dados_gerais <- 
    purrr::map(curriculos, safely(getDadosGerais)) |>
    purrr::map(pluck, 'result') |>
    dplyr::bind_rows() 

glimpse(dados_gerais)

It is worth remembering that all variable names obtained by get* functions are the transcription of the field names in the XML file, the - being replaced with _ and the capital letters replaced with lower case letters.

Publications

artigos_publicados <- 
    purrr::map(curriculos, safely(getArtigosPublicados)) |>
    purrr::map(pluck, 'result') |>
    dplyr::bind_rows() 

artigos_publicados |>
    dplyr::arrange(desc(ano_do_artigo)) |>
    dplyr::select(titulo_do_artigo, ano_do_artigo, titulo_do_periodico_ou_revista) 

livros_publicados <- 
    purrr::map(curriculos, safely(getLivrosPublicados)) |>
    purrr::map(pluck, 'result') |>
    dplyr::bind_rows() 

capitulos_livros <- 
    purrr::map(curriculos, safely(getCapitulosLivros)) |>
    purrr::map(pluck, 'result') |>
    dplyr::bind_rows()

Grouping data

To group the data key variable is id, which is a unique 16 digit code.

artigos_publicados2 <- 
    dplyr::group_by(artigos_publicados, id) |>
    dplyr::tally(name = 'artigos') 

artigos_publicados2

livros_publicados2 <- 
    dplyr::group_by(livros_publicados, id) |>
    dplyr::tally(name = 'livros') 

livros_publicados2

capitulos_livros2 <- 
    dplyr::group_by(capitulos_livros, id) |>
    dplyr::tally(name = 'capitulos') 

capitulos_livros2

Merge data

to join the data from different tables the recommended variable is id, which is a unique 16 digit code.

artigos_publicados2 |>
    dplyr::left_join(livros_publicados2) |>
    dplyr::left_join(capitulos_livros2)

Add information from a different tables.

artigos_publicados2 |>
    dplyr::left_join(livros_publicados2) |>
    dplyr::left_join(capitulos_livros2) |>
    dplyr::left_join(dados_gerais |> dplyr::select(id, nome_completo)) |>
    dplyr::select(nome_completo, artigos, livros, capitulos)

Any scripts or data that you put into this service are public.

getLattes documentation built on June 11, 2021, 9:06 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

getLattes
Import and Process Data from the 'Lattes' Curriculum Platform

Introduction to getLattes
In getLattes: Import and Process Data from the 'Lattes' Curriculum Platform

Installation

Single curriculum

Import

`get` functions

Several curricula

Import

`get` functions

Publications

Grouping data

Merge data

Try the getLattes package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

getLattes Import and Process Data from the 'Lattes' Curriculum Platform

Introduction to getLattes In getLattes: Import and Process Data from the 'Lattes' Curriculum Platform

Installation

Single curriculum

Import

get functions

Several curricula

Import

get functions

Publications

Grouping data

Merge data

Try the getLattes package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

getLattes
Import and Process Data from the 'Lattes' Curriculum Platform

Introduction to getLattes
In getLattes: Import and Process Data from the 'Lattes' Curriculum Platform

`get` functions

`get` functions