ft_get

Description

ft_get is a one-stop shop to fetch full text of articles, as either XML or PDFs. We have specific support for PLOS via the rplos package, Entrez via the rentrez package, and arXiv via the aRxiv package. For other publishers, ft_get has helpers that sort out full text links based on user input. Articles are saved on disk. See Details for help on how to use this function.
Arguments

x: Identifiers for papers: DOIs (or other IDs) as a character vector or a list of character strings, or the output of ft_search() or ft_links() (see Details).

from: Source to query. Optional.

type: (character) One of "xml" (default), "pdf", or "plain" ("plain" applies to Elsevier and ScienceDirect only). We chose xml as the default because it has structure that a machine can reason about, but you are free to request any type a publisher supports.

try_unknown: (logical) If a publisher plugin is not already known, we try to fetch a full text link via the ftdoi package or from Crossref. If no link is found at either, we skip with a warning; if one is found, we attempt to download it.

bmcopts: BMC options. DEPRECATED.

entrezopts: Entrez options, a named list.

elifeopts: eLife options, a named list.

elsevieropts: Elsevier options, a named list. Use retain_non_ft = TRUE to keep whatever Elsevier returns (e.g., just the abstract) when full text is not available (see Examples).

sciencedirectopts: Elsevier ScienceDirect options, a named list.

wileyopts: Wiley options, a named list.

crossrefopts: Crossref options, a named list.

progress: (logical) Whether to show a progress bar. Default: FALSE.

...: curl options passed on to crul::HttpClient; see Examples.
Details

There are various ways to use ft_get:

1. Pass in only DOIs, leaving the from parameter NULL. This route first queries the Crossref API for the publisher of each DOI, then uses the appropriate method to fetch full text from that publisher. If no publisher is found for a DOI, we return a message telling you so.

2. Pass in DOIs (or other publication IDs) and set the from parameter. This route skips the extra Crossref API call to determine the publisher for each DOI (so it is faster); we go straight to fetching full text from the publisher.

3. Use ft_search() to search for articles, then pass its output to this function, which uses the info in that object. This behaves the same as the previous option: each DOI carries publisher info, so we know how to get full text for each one.

Note that some publishers are available via Entrez, but often not their recent articles, where "recent" may mean a few months to a year or so. In that case, make sure to specify the publisher, or else you'll get back no data.
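The three routes above can be sketched as follows; the DOIs and queries are taken from the Examples section below, and every call hits live publisher APIs, so results depend on network access:

```r
library(fulltext)

# Route 1: DOIs only; the publisher is looked up via the Crossref API
res1 <- ft_get('10.1371/journal.pone.0086169')

# Route 2: DOIs plus `from`; skips the Crossref lookup, so it's faster
res2 <- ft_get('10.7554/eLife.03032', from = "elife")

# Route 3: pass ft_search() output straight in; publisher info travels along
res3 <- ft_get(ft_search(query = 'ecology', from = 'entrez'))
```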
Value

An object of class ft_data (an S3 object) with a slot for each publisher. The returned object is split up by publisher because the full text format is consistent within a publisher, which should facilitate text mining downstream, as different steps may be needed for each publisher's content.

Note that ft_data has a print method, so printing shows a compact summary rather than the full object.
Within each publisher slot there is a list with the elements:

- found: the number of full text articles found
- dois: the DOIs given and searched for
- data: a list with:
  - backend: the backend; right now only "ext", for "by file extension" (we may add other backends in the future, thus we retain this element)
  - cache_path: the base directory path for file caching
  - path: if the file was retrieved, the full path to the file; NULL if not retrieved
  - data: if text was extracted (see ft_collect()), the text will be here; until then this is NULL
- opts: the options given, such as article type and DOIs
- errors: a data.frame of errors, with two columns for article ID and error
Important Access Notes

See Rate Limits and Authentication in fulltext-package for rate limiting and authentication information, respectively.

In particular, take note that fetching full text from Wiley is done only through the Crossref Text and Data Mining (TDM) service. See the Authentication section of fulltext-package for all the details.

Fetching articles from Elsevier used to be done only through the Crossref TDM flow. However, Crossref TDM is going away. See Authentication in fulltext-package for details.
Notes on the type parameter

The type parameter is sometimes used and sometimes ignored; certain data sources accept only one type. By data source/publisher:
PLOS: pdf and xml
Entrez: only xml
eLife: pdf and xml
Pensoft: pdf and xml
arXiv: only pdf
BiorXiv: only pdf
Elsevier: xml and plain
Elsevier ScienceDirect: xml and plain
Wiley: pdf and xml
Peerj: pdf and xml
Informa: only pdf
FrontiersIn: pdf and xml
Copernicus: pdf and xml
Scientific Societies: only pdf
Cambridge: only pdf
Crossref: depends on the publisher
other data sources/publishers: there are too many to cover here - will try to make a helper in the future for what is covered by different publishers
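For a publisher that supports both formats, the same DOI can be fetched either way; this uses an eLife DOI from the Examples section:

```r
library(fulltext)

ft_get('10.7554/eLife.03032')                # xml, the default
ft_get('10.7554/eLife.03032', type = "pdf")  # pdf instead
```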
How data is stored

ft_get used to have many options for "backends". We have simplified this to one: all full text files are written to disk on your machine, and you can choose where they are stored. Files are named by their IDs (usually DOIs) plus the file extension for the full text type (usually pdf or xml), which makes inspecting the files easy.
Data formats

XML full text is stored in .xml files, PDFs in .pdf files, and plain text in .txt files.

Reusing cached articles

All files are written to disk, and on each request we check for a file matching the given DOI/ID; if one is found, we use it and throw a message saying so.
Caching

Previously, you could set caching options in each ft_get function call. We've simplified this to setting caching options only through cache_options_set(); you can retrieve your cache options with cache_options_get(). See those docs for help on caching.
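A minimal sketch of the caching workflow; the exact arguments accepted by cache_options_set() (the path argument is shown here as an assumption) may differ across package versions, so check its help page:

```r
library(fulltext)

# inspect the current cache settings (backend, path, etc.)
cache_options_get()

# change where downloaded files are stored; subsequent ft_get() calls
# write to, and reuse cached files from, this location
cache_options_set(path = "myproject")
```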
Notes on specific publishers

- arXiv: the IDs passed are not actually DOIs, though they look similar. Thus, you must pass the from parameter, as we can't determine unambiguously that the IDs passed in are from arXiv.org.
- BMC: has been a hot mess since the Springer acquisition. It has been removed as an officially supported plugin; some BMC DOIs may still work when passed in here, but there are no guarantees.
Warnings

You will see warnings thrown in the R console or in the resulting object. See ft_get-warnings for more information on what warnings mean.
Examples

# List publishers included
ft_get_ls()
## Not run:
# If you just have DOIs and don't know the publisher
## PLOS
ft_get('10.1371/journal.pone.0086169')
# Collect all errors from across papers
# similarly can combine from different publishers as well
res <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.aaaa'), from = "elife")
res$elife$errors
## PeerJ
ft_get('10.7717/peerj.228')
ft_get('10.7717/peerj.228', type = "pdf")
## eLife
### xml
ft_get('10.7554/eLife.03032')
res <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.32763'),
from = "elife")
res$elife
respdf <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.32763'),
from = "elife", type = "pdf")
respdf$elife
elife_xml <- ft_get('10.7554/eLife.03032', from = "elife")
library(magrittr)
elife_xml %<>% ft_collect()
elife_xml$elife
### pdf
elife_pdf <- ft_get(c('10.7554/eLife.03032', '10.7554/eLife.32763'),
from = "elife", type = "pdf")
elife_pdf$elife
elife_pdf %<>% ft_collect()
elife_pdf %>% ft_extract()
## some BMC DOIs will work, but some may not, who knows
ft_get(c('10.1186/2049-2618-2-7', '10.1186/2193-1801-3-7'), from = "entrez")
## FrontiersIn
res <- ft_get(c('10.3389/fphar.2014.00109', '10.3389/feart.2015.00009'))
res
res$frontiersin
## Hindawi - via Entrez
res <- ft_get(c('10.1155/2014/292109','10.1155/2014/162024',
'10.1155/2014/249309'))
res
res$hindawi
res$hindawi$data$path
res %>% ft_collect() %>% .$hindawi
## F1000Research - via Entrez
x <- ft_get('10.12688/f1000research.6522.1')
## Two different publishers via Entrez - retains publisher names
res <- ft_get(c('10.1155/2014/292109', '10.12688/f1000research.6522.1'))
res$hindawi
res$f1000research
## Thieme
### coverage is hit and miss; it's not great
ft_get('10.1055/s-0032-1316462')
## Pensoft
ft_get('10.3897/mycokeys.22.12528')
## Copernicus
out <- ft_get(c('10.5194/angeo-31-2157-2013', '10.5194/bg-12-4577-2015'))
out$copernicus
## arXiv - only pdf, you have to pass in the from parameter
res <- ft_get(x='cond-mat/9309029', from = "arxiv")
res$arxiv
res %>% ft_extract %>% .$arxiv
## bioRxiv - only pdf
res <- ft_get(x='10.1101/012476')
res$biorxiv
## AAAS - only pdf
res <- ft_get(x='10.1126/science.276.5312.548')
res$aaas
# The Royal Society
res <- ft_get("10.1098/rspa.2007.1849")
ft_get(c("10.1098/rspa.2007.1849", "10.1098/rstb.1970.0037",
"10.1098/rsif.2006.0142"))
## Karger Publisher
(x <- ft_get('10.1159/000369331'))
x$karger
## MDPI Publisher
(x <- ft_get('10.3390/nu3010063'))
x$mdpi
ft_get('10.3390/nu7085279')
ft_get(c('10.3390/nu3010063', '10.3390/nu7085279'))
# Scientific Societies
## this is a paywall article, so you may or may not have access
x <- ft_get("10.1094/PHYTO-04-17-0144-R")
x$scientificsocieties
# Informa
x <- ft_get("10.1080/03088839.2014.926032")
ft_get("10.1080/03088839.2013.863435")
## CogentOA - part of Informa/Taylor & Francis now
ft_get('10.1080/23311916.2014.938430')
library(rplos)
(dois <- searchplos(q="*:*", fl='id',
fq=list('doc_type:full',"article_type:\"research article\""),
limit=5)$data$id)
ft_get(dois)
ft_get(c('10.7717/peerj.228','10.7717/peerj.234'))
# elife
ft_get('10.7554/eLife.04300', from='elife')
ft_get(c('10.7554/eLife.04300', '10.7554/eLife.03032'), from='elife')
## search for elife papers via Entrez
dois <- ft_search("elife[journal]", from = "entrez")
ft_get(dois)
# Frontiers in Pharmacology (publisher: Frontiers)
doi <- '10.3389/fphar.2014.00109'
ft_get(doi, from="entrez")
# Hindawi Journals
ft_get(c('10.1155/2014/292109','10.1155/2014/162024','10.1155/2014/249309'),
from='entrez')
# Frontiers Publisher - Frontiers in Aging Neuroscience
res <- ft_get("10.3389/fnagi.2014.00130", from='entrez')
res$entrez
# Search entrez, get some DOIs
(res <- ft_search(query='ecology', from='entrez'))
res$entrez$data$doi
ft_get(res$entrez$data$doi[1], from='entrez')
ft_get(res$entrez$data$doi[1:3], from='entrez')
# Search entrez, and pass to ft_get()
(res <- ft_search(query='ecology', from='entrez'))
ft_get(res)
# elsevier, ugh
## set the environment variable Sys.setenv(ELSEVIER_TDM_KEY = "your key")
### an open access article
ft_get(x = "10.1016/j.trac.2016.01.027", from = "elsevier")
### non open access article
#### If you don't have access, by default you get abstract only, and we
##### treat it as an error as we assume you want full text
ft_get(x = "10.1016/j.trac.2016.05.027", from = "elsevier")
#### If you want to retain whatever Elsevier gives you
##### set "retain_non_ft = TRUE"
ft_get(x = "10.1016/j.trac.2016.05.027", from = "elsevier",
elsevieropts = list(retain_non_ft = TRUE))
# sciencedirect
## set the environment variable Sys.setenv(ELSEVIER_TDM_KEY = "your key")
ft_get(x = "10.1016/S0140-6736(13)62329-6", from = "sciencedirect")
# wiley, ugh
## set the environment variable Sys.setenv(WILEY_TDM_KEY = "your key")
ft_get(x = "10.1006/asle.2001.0035", from = "wiley", type = "pdf")
## xml
ft_get(x = "10.1111/evo.13812", from = "wiley")
## highwire fiasco paper
ft_get(x = "10.3732/ajb.1300053", from = "wiley")
ft_get(x = "10.3732/ajb.1300053", from = "wiley", type = "pdf")
# IEEE, ugh
ft_get('10.1109/TCSVT.2012.2221191', type = "pdf")
# AIP Publishing
ft_get('10.1063/1.4967823', try_unknown = TRUE)
# PNAS
ft_get('10.1073/pnas.1708584115', try_unknown = TRUE)
# American Society for Microbiology
ft_get('10.1128/cvi.00178-17')
# American Society of Clinical Oncology
ft_get('10.1200/JCO.18.00454')
# American Institute of Physics
ft_get('10.1063/1.4895527')
# American Chemical Society
ft_get(c('10.1021/la903074z', '10.1021/jp048806z'))
# Royal Society of Chemistry
ft_get('10.1039/c8cc06410e')
# From ft_links output
## Crossref
(res2 <- ft_search(query = 'ecology', from = 'crossref', limit = 3,
crossrefopts = list(filter = list(has_full_text=TRUE, member=98))))
(out <- ft_links(res2))
(ress <- ft_get(x = out, type = "pdf"))
ress$crossref
(x <- ft_links("10.1111/2041-210X.12656", "crossref"))
(y <- ft_get(x))
## Cambridge
x <- ft_get("10.1017/s0922156598230305")
x$cambridge
z <- ft_get("10.1017/jmo.2019.20")
z$cambridge
m <- ft_get("10.1017/S0266467419000270")
m$cambridge
## No publisher plugin provided yet
ft_get('10.1037/10740-005')
### no link available for this DOI
res <- ft_get('10.1037/10740-005', try_unknown = TRUE)
res[[1]]
# Get a progress bar - off by default
library(rplos)
(dois <- searchplos(q="*:*", fl='id',
fq=list('doc_type:full',"article_type:\"research article\""),
limit=5)$data$id)
## when articles not already downloaded you see the progress bar
b <- ft_get(dois, progress = TRUE)
## if articles are already downloaded/cached, we normally throw messages
## saying so
b <- ft_get(dois, progress = FALSE)
## but if a progress bar is requested, then the messages are suppressed
b <- ft_get(dois, progress = TRUE)
# curl options
ft_get("10.1371/journal.pcbi.1002487", verbose = TRUE)
ft_get('10.3897/mycokeys.22.12528', from = "pensoft", verbose = TRUE)
## End(Not run)