ft_get: Get full text

View source: R/ft_get.R

Description

ft_get is a one stop shop for fetching the full text of articles, as either XML or PDF. We have specific support for PLOS via the rplos package, Entrez via the rentrez package, and arXiv via the aRxiv package. For other publishers, ft_get has helpers that work out full text links from user input. See Details for help on how to use this function.

Usage

ft_get(x, from = NULL, type = "xml", try_unknown = TRUE,
  plosopts = list(), bmcopts = list(), entrezopts = list(),
  elifeopts = list(), elsevieropts = list(), wileyopts = list(),
  crossrefopts = list(), ...)

ft_get_ls()

Arguments

x

Either identifiers for papers (DOIs or other IDs) as a list of character strings or a character vector, OR an object of class ft, as returned from ft_search()

from

Source to query. Optional.

type

(character) one of xml (default), pdf, or plain (Elsevier only). xml is the default because it has structure that a machine can reason about, but you are free to request pdf, or plain in the case of Elsevier.

try_unknown

(logical) if no publisher plugin is known for an identifier, we try to fetch a full text link from ftdoi.org or from Crossref. If a link is found at either, we attempt the download; if not, we skip with a warning. Only applicable in the character and list S3 methods. Default: TRUE

plosopts

PLOS options. See rplos::plos_fulltext()

bmcopts

BMC options. parameter DEPRECATED

entrezopts

Entrez options. See rentrez::entrez_search() and entrez_fetch()

elifeopts

eLife options

elsevieropts

Elsevier options

wileyopts

Wiley options

crossrefopts

Crossref options

...

Further args passed on to crul::HttpClient

Details

There are various ways to use ft_get:

Note that some publishers are available via Entrez, but often not recent articles, where "recent" may be a few months to a year or so. In that case, make sure to specify the publisher, or else you'll get back no data.

See Rate Limits and Authentication in fulltext-package for rate limiting and authentication information, respectively.

Value

An S3 object of class ft_data with an element for each publisher. The returned object is split up by publisher because the full text format is the same within a publisher, which should facilitate text mining downstream, as different steps may be needed for each publisher's content.
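
As a rough illustration of working with a result that is split by publisher, the sketch below uses a hand-built mock list rather than a real ft_get() result. Only the dois and data$path elements seen in the Examples are modelled; the file paths and the exact shape of each entry are assumptions made for illustration.

```r
# Mock of the per-publisher layout (NOT a real ft_get() result).
# Only $dois and $data$path are modelled; the paths are made up.
res <- list(
  plos = list(
    dois = c("10.1371/journal.pone.0086169"),
    data = list(path = list(
      "10.1371/journal.pone.0086169" = "cache/10_1371_journal_pone_0086169.xml"
    ))
  ),
  elife = list(
    dois = c("10.7554/eLife.03032", "10.7554/eLife.32763"),
    data = list(path = list(
      "10.7554/eLife.03032" = "cache/10_7554_eLife_03032.xml",
      "10.7554/eLife.32763" = "cache/10_7554_eLife_32763.xml"
    ))
  )
)

# Count documents per publisher, so that each publisher's content
# can be handled with its own downstream steps
n_docs <- vapply(res, function(pub) length(pub$dois), integer(1))
print(n_docs)

# Collect the file paths for a single publisher
elife_paths <- unlist(res$elife$data$path, use.names = FALSE)
print(elife_paths)
```

With a real result, the same pattern (loop over publishers, then over documents) should apply, though the entries under each publisher may carry more fields than this mock does.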

Note that we have a print method for ft_data so you see something like this:

<fulltext text>
[Docs] 4
[Source] ext - /Users/foobar/Library/Caches/R/fulltext
[IDs] 10.2307/1592482 10.2307/1119209 10.1037/11755-024 ...

Within each publisher there is a list with the elements:

Notes on the type parameter

Type is sometimes ignored, sometimes used. Certain data sources accept only one type. By data source/publisher:

How data is stored

ft_get used to have many options for "backends". We have simplified this to one option. That one option is that all full text files are written to disk on your machine. You can choose where these files are stored.

In addition, files are named by their IDs (usually DOIs), followed by the file extension for the full text type (usually pdf or xml). This makes inspecting the files easy.

Data formats

xml full text is stored in .xml files. pdf is stored in .pdf files. And plain text is stored in .txt files.
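
To make the naming scheme concrete, here is a minimal sketch of how an ID and a type could map to a file name. The helper id_to_filename() is hypothetical and not part of fulltext, and the rule of replacing non-alphanumeric characters with underscores is an assumption; the package may sanitize IDs differently.

```r
# Hypothetical helper (not part of fulltext): build a file name from an ID.
# Assumption: runs of non-alphanumeric characters become a single underscore.
id_to_filename <- function(id, type = c("xml", "pdf", "plain")) {
  type <- match.arg(type)
  # Map the requested type to the extension described above
  ext <- switch(type, xml = "xml", pdf = "pdf", plain = "txt")
  paste0(gsub("[^A-Za-z0-9]+", "_", id), ".", ext)
}

id_to_filename("10.7717/peerj.228")
id_to_filename("10.7717/peerj.228", type = "pdf")
id_to_filename("10.1016/j.trac.2016.01.027", type = "plain")
```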

Reusing cached articles

All files are written to disk, and on each request we check for a file matching the given DOI/ID. If one is found, we use it and emit a message saying so.
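
The check-then-reuse behaviour can be sketched in base R as follows. This is not the package's implementation: get_or_reuse() and fake_download() are hypothetical stand-ins, and the file naming rule is an assumption.

```r
# Hypothetical sketch of "check the cache first" logic.
# `download` stands in for whatever actually retrieves the article.
get_or_reuse <- function(id, cache_dir, download, ext = "xml") {
  # Assumed naming rule: sanitized ID plus file extension
  path <- file.path(cache_dir, paste0(gsub("[^A-Za-z0-9]+", "_", id), ".", ext))
  if (file.exists(path)) {
    message("using cached file: ", path)
    return(path)
  }
  writeLines(download(id), path)
  path
}

# Fresh, empty directory so the first call always has to "download"
cache_dir <- tempfile("fulltext-sketch-")
dir.create(cache_dir)

# Fake downloader that counts how often it is called
calls <- 0
fake_download <- function(id) {
  calls <<- calls + 1
  sprintf("<article id='%s'/>", id)
}

p1 <- get_or_reuse("10.7717/peerj.228", cache_dir, fake_download)  # writes the file
p2 <- get_or_reuse("10.7717/peerj.228", cache_dir, fake_download)  # reuses the file
```

The second call returns the same path without invoking the downloader again, which is the behaviour described above.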

Caching

Previously, you could set caching options in each ft_get function call. We've simplified this to only setting caching options through the function cache_options_set() - and you can get your cache options using cache_options_get(). See those docs for help on caching.
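
A minimal sketch of setting and then inspecting cache options is shown below. It assumes that cache_options_set() accepts a path argument naming the cache directory; check the cache_options_set() documentation for the arguments available in your installed version. The directory name "fulltext-demo" is made up.

```r
# Only runs when the fulltext package is installed
has_fulltext <- requireNamespace("fulltext", quietly = TRUE)

if (has_fulltext) {
  # Assumption: `path` names the cache directory for full text files
  fulltext::cache_options_set(path = "fulltext-demo")

  # Review the settings that later ft_get() calls will use
  print(fulltext::cache_options_get())
}
```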

Notes on specific publishers

Warnings

You will see warnings thrown in the R shell or in the resulting object. See ft_get-warnings for more information on what warnings mean.
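
If you want to keep the warnings from a long run for later inspection, one option is to collect them with withCallingHandlers(). The sketch below uses a hypothetical stand-in function, fetch_many(), instead of ft_get itself, so the warning text is invented for illustration.

```r
# Hypothetical stand-in that warns for each ID it cannot handle
# (this is NOT ft_get; the warning message is made up)
fetch_many <- function(ids) {
  for (id in ids) {
    if (!grepl("^10\\.", id)) warning("no link found for ", id, call. = FALSE)
  }
  invisible(ids)
}

# Collect warning messages instead of printing them at the end of the call
collected <- character()
withCallingHandlers(
  fetch_many(c("10.7717/peerj.228", "not-a-doi")),
  warning = function(w) {
    collected <<- c(collected, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
collected
```

The same wrapper can be placed around a real ft_get() call; the messages you collect can then be matched against the explanations in ft_get-warnings.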

Examples

# List publishers included
ft_get_ls()

## Not run: 
# If you just have DOIs and don't know the publisher
## PLOS
ft_get('10.1371/journal.pone.0086169')

## PeerJ
ft_get('10.7717/peerj.228')
ft_get('10.7717/peerj.228', type = "pdf")

## eLife
### xml
ft_get('10.7554/eLife.03032')
res <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.32763'), from = "elife")
res$elife
respdf <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.32763'), 
  from = "elife", type = "pdf")
respdf$elife

elife_xml <- ft_get('10.7554/eLife.03032', from = "elife")
library(magrittr)
elife_xml %<>% collect()
elife_xml$elife
### pdf
elife_pdf <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.32763'), 
  from = "elife", type = "pdf")
elife_pdf$elife
elife_pdf %<>% collect()
elife_pdf %>% ft_extract()

## some BMC DOIs will work, but some may not, who knows
ft_get(c('10.1186/2049-2618-2-7', '10.1186/2193-1801-3-7'), from = "entrez")

## FrontiersIn
res <- ft_get(c('10.3389/fphar.2014.00109', '10.3389/feart.2015.00009'))
res
res$frontiersin

## Hindawi - via Entrez
res <- ft_get(c('10.1155/2014/292109','10.1155/2014/162024', '10.1155/2014/249309'))
res
res$hindawi
res$hindawi$data$path
res$hindawi$data$data
res %>% collect() %>% .$hindawi

## F1000Research - via Entrez
x <- ft_get('10.12688/f1000research.6522.1')
## Two different publishers via Entrez - retains publisher names
res <- ft_get(c('10.1155/2014/292109', '10.12688/f1000research.6522.1'))
res$hindawi
res$f1000research

## Pensoft
ft_get('10.3897/mycokeys.22.12528')
### you'll need to specify the publisher for a DOI from a recent publication
ft_get('10.3897/zookeys.515.9332', from = "pensoft")

## Copernicus
out <- ft_get(c('10.5194/angeo-31-2157-2013', '10.5194/bg-12-4577-2015'))
out$copernicus

## arXiv - only pdf, you have to pass in the from parameter
res <- ft_get(x='cond-mat/9309029', from = "arxiv")
res$arxiv
res %>% ft_extract  %>% .$arxiv

## bioRxiv - only pdf
res <- ft_get(x='10.1101/012476')
res$biorxiv

## Karger Publisher
(x <- ft_get('10.1159/000369331'))
x$karger

## MDPI Publisher
(x <- ft_get('10.3390/nu3010063'))
x$mdpi
ft_get('10.3390/nu7085279')
ft_get(c('10.3390/nu3010063', '10.3390/nu7085279')) # not working, only getting 1

# Scientific Societies
## this is a paywalled article; you may or may not have access
x <- ft_get("10.1094/PHYTO-04-17-0144-R")

# Informa
x <- ft_get("10.1080/03088839.2014.926032")
ft_get("10.1080/03088839.2013.863435")

## CogentOA - part of Informa/Taylor & Francis now
ft_get('10.1080/23311916.2014.938430')

library(rplos)
(dois <- searchplos(q="*:*", fl='id',
   fq=list('doc_type:full',"article_type:\"research article\""), limit=5)$data$id)
ft_get(dois, from='plos')
ft_get(c('10.7717/peerj.228','10.7717/peerj.234'), from='entrez')

# elife
ft_get('10.7554/eLife.04300', from='elife')
ft_get(c('10.7554/eLife.04300', '10.7554/eLife.03032'), from='elife')
## search for elife papers via Entrez
dois <- ft_search("elife[journal]", from = "entrez")
ft_get(dois)

# Frontiers in Pharmacology (publisher: Frontiers)
doi <- '10.3389/fphar.2014.00109'
ft_get(doi, from="entrez")

# Hindawi Journals
ft_get(c('10.1155/2014/292109','10.1155/2014/162024','10.1155/2014/249309'), from='entrez')
res <- ft_search(query='ecology', from='crossref', limit=50,
                 crossrefopts = list(filter=list(has_full_text = TRUE,
                                                 member=98,
                                                 type='journal-article')))

out <- ft_get(res$crossref$data$DOI[1:20], from='entrez')

# Frontiers Publisher - Frontiers in Aging Neuroscience
res <- ft_get("10.3389/fnagi.2014.00130", from='entrez')
res$entrez

# Search entrez, get some DOIs
(res <- ft_search(query='ecology', from='entrez'))
res$entrez$data$doi
ft_get(res$entrez$data$doi[1], from='entrez')
ft_get(res$entrez$data$doi[1:3], from='entrez')

# Search entrez, and pass to ft_get()
(res <- ft_search(query='ecology', from='entrez'))
ft_get(res)

# elsevier, ugh
## set an environment variable like Sys.setenv(CROSSREF_TDM = "your key")
ft_get(x = "10.1016/j.trac.2016.01.027", from = "elsevier")

# wiley, ugh
## Wiley has only PDF, so type parameter doesn't do anything
ft_get(x = "10.1006/asle.2001.0035", from = "wiley")

# IEEE, ugh
ft_get('10.1109/TCSVT.2012.2221191', type = "pdf")

# AIP Publishing
ft_get('10.1063/1.4967823', try_unknown = TRUE)

# PNAS
ft_get('10.1073/pnas.1708584115', try_unknown = TRUE)


# From ft_links output
## Crossref
(res2 <- ft_search(query = 'ecology', from = 'crossref', limit = 3))
(out <- ft_links(res2))
(ress <- ft_get(x = out, type = "pdf"))
ress$crossref

(x <- ft_links("10.1111/2041-210X.12656", "crossref"))
(y <- ft_get(x))

## PLOS
(res2 <- ft_search(query = 'ecology', from = 'plos', limit = 4))
(out <- ft_links(res2))
out$plos
(ress <- ft_get(x = out, type = "pdf"))
ress$plos
ress$plos$dois
ress$plos$data
ress$plos$data$path$`10.1371/journal.pone.0059813`

## No publisher plugin provided yet
# ft_get('10.1037/10740-005')
### but no link available for this DOI
res <- ft_get('10.1037/10740-005', try_unknown = TRUE)
res$crossref
### a link IS available for this DOI
res <- ft_get('10.1037/10740-005', try_unknown = TRUE)
res$crossref

## End(Not run)

fulltext documentation built on Feb. 9, 2018, 6:08 a.m.