The main interface to fetching full text is through ft_get()
.
library("fulltext")
Simple call, pass in a DOI and say where you want to get data from (by default, it's plos)
res <- ft_get('10.1371/journal.pone.0086169', from = 'plos')
The article text and metadata is stored in the output object.
The res
object is a list, with slots for each of the data sources, b/c you can request
data from more than 1 data source.
names(res) #> [1] "plos" "entrez" "elife" "pensoft" #> [5] "arxiv" "biorxiv" "elsevier" "sciencedirect" #> [9] "wiley"
Let's dig into the plos
source object, which is another list, including metadata the
text data itself (in the data
slot).
res$plos #> $found #> [1] 1 #> #> $dois #> [1] "10.1371/journal.pone.0086169" #> #> $data #> $data$backend #> [1] "ext" #> #> $data$cache_path #> [1] "/Users/sckott/Library/Caches/R/fulltext" #> #> $data$path #> $data$path$`10.1371/journal.pone.0086169` #> $data$path$`10.1371/journal.pone.0086169`$path #> [1] "/Users/sckott/Library/Caches/R/fulltext/10_1371_journal_pone_0086169.xml" #> #> $data$path$`10.1371/journal.pone.0086169`$id #> [1] "10.1371/journal.pone.0086169" #> #> $data$path$`10.1371/journal.pone.0086169`$type #> [1] "xml" #> #> $data$path$`10.1371/journal.pone.0086169`$error #> NULL #> #> #> #> $data$data #> NULL #> #> #> $opts #> $opts$doi #> [1] "10.1371/journal.pone.0086169" #> #> $opts$type #> [1] "xml" #> #> $opts$progress #> [1] FALSE #> #> #> $errors #> id error #> 1 10.1371/journal.pone.0086169 <NA>
Indexing to the data
slot takes us to another list with metadata and the article
res$plos$data #> $backend #> [1] "ext" #> #> $cache_path #> [1] "/Users/sckott/Library/Caches/R/fulltext" #> #> $path #> $path$`10.1371/journal.pone.0086169` #> $path$`10.1371/journal.pone.0086169`$path #> [1] "/Users/sckott/Library/Caches/R/fulltext/10_1371_journal_pone_0086169.xml" #> #> $path$`10.1371/journal.pone.0086169`$id #> [1] "10.1371/journal.pone.0086169" #> #> $path$`10.1371/journal.pone.0086169`$type #> [1] "xml" #> #> $path$`10.1371/journal.pone.0086169`$error #> NULL #> #> #> #> $data #> NULL
Going down one more index gets us the data object. There is no actual text as we have to
collect it from the file on disk. See ft_collect()
to get the text.
res$plos$data #> $backend #> [1] "ext" #> #> $cache_path #> [1] "/Users/sckott/Library/Caches/R/fulltext" #> #> $path #> $path$`10.1371/journal.pone.0086169` #> $path$`10.1371/journal.pone.0086169`$path #> [1] "/Users/sckott/Library/Caches/R/fulltext/10_1371_journal_pone_0086169.xml" #> #> $path$`10.1371/journal.pone.0086169`$id #> [1] "10.1371/journal.pone.0086169" #> #> $path$`10.1371/journal.pone.0086169`$type #> [1] "xml" #> #> $path$`10.1371/journal.pone.0086169`$error #> NULL #> #> #> #> $data #> NULL
You can get a bunch of DOIs first, e.g., from PLOS using the rplos
package
library("rplos") (dois <- searchplos(q = "*:*", fl = 'id', fq = list('doc_type:full', "article_type:\"research article\""), limit = 5)$data$id) #> [1] "10.1371/journal.pone.0020843" "10.1371/journal.pone.0022257" #> [3] "10.1371/journal.pone.0023139" "10.1371/journal.pone.0023138" #> [5] "10.1371/journal.pone.0023119" ft_get(dois, from = 'plos') #> <fulltext text> #> [Docs] 5 #> [Source] ext - /Users/sckott/Library/Caches/R/fulltext #> [IDs] 10.1371/journal.pone.0020843 10.1371/journal.pone.0022257 #> 10.1371/journal.pone.0023139 10.1371/journal.pone.0023138 #> 10.1371/journal.pone.0023119 ...
One article
ft_get('10.7554/eLife.04300', from = 'elife') #> <fulltext text> #> [Docs] 1 #> [Source] ext - /Users/sckott/Library/Caches/R/fulltext #> [IDs] 10.7554/eLife.04300 ...
Many articles
ft_get(c('10.7554/eLife.04300','10.7554/eLife.03032'), from = 'elife') #> <fulltext text> #> [Docs] 2 #> [Source] ext - /Users/sckott/Library/Caches/R/fulltext #> [IDs] 10.7554/eLife.04300 10.7554/eLife.03032 ...
doi <- '10.3389/fphar.2014.00109' ft_get(doi, from = "entrez") #> <fulltext text> #> [Docs] 1 #> [Source] ext - /Users/sckott/Library/Caches/R/fulltext #> [IDs] 4050532 ...
For example, search entrez, get some DOIs, then fetch some articles
(res <- ft_search(query = 'ecology', from = 'entrez')) #> Query: #> [ecology] #> Found: #> [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 203062; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0] #> Returned: #> [PLoS: 0; BMC: 0; Crossref: 0; Entrez: 10; arxiv: 0; biorxiv: 0; Europe PMC: 0; Scopus: 0; Microsoft: 0] res$entrez$data$doi #> [1] "10.1111/tesg.12439" "10.1002/tie.22161" #> [3] "10.1182/bloodadvances.2019001144" "10.1186/s13063-020-04513-w" #> [5] "10.1007/s10109-020-00330-6" "10.3390/ma13112563" #> [7] "10.3390/molecules25112618" "10.3390/molecules25112634" #> [9] "10.3390/molecules25112471" "10.1186/s40168-020-00861-6"
Get articles
ft_get(res$entrez$data$doi[1:3], from = 'entrez') #> <fulltext text> #> [Docs] 3 #> [Source] ext - /Users/sckott/Library/Caches/R/fulltext #> [IDs] 7323205 7323122 7322952 ...
When using ft_get()
you write the files to disk, and you have to pull text out of them as a
separate step.
(res <- ft_get('10.1371/journal.pone.0086169', from = 'plos')) #> <fulltext text> #> [Docs] 1 #> [Source] ext - /Users/sckott/Library/Caches/R/fulltext #> [IDs] 10.1371/journal.pone.0086169 ...
One way to do that is with ft_collect()
. Before running ft_collect()
the data
slot is NULL
.
res$plos$data$data #> NULL
Run ft_collect()
res <- res %>% ft_collect
After running ft_collect()
the data
slot has the text. If there's more than one article they are named
by the identifier
res$plos$data$data #> $`10.1371/journal.pone.0086169` #> {xml_document} #> <article article-type="research-article" dtd-version="3.0" lang="en" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> #> [1] <front>\n <journal-meta>\n <journal-id journal-id-type="nlm-ta">PLoS ... #> [2] <body>\n <sec id="s1">\n <title>Introduction</title>\n <p>Since th ... #> [3] <back>\n <ack>\n <p>We thank Joan Silk, Julienne Rutherford, and two ...
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.