crm_pdf | R Documentation
Get full text PDFs
crm_pdf(url, overwrite = TRUE, read = TRUE, overwrite_unspecified = FALSE, ...)
url: A URL (character) or a full text link object as returned by crm_links().
overwrite: (logical) Overwrite the file if it exists already? Default: TRUE.
read: (logical) If reading a pdf, this toggles whether we extract text from the pdf or simply download it. If TRUE (the default), the text is extracted; if FALSE, the PDF is only downloaded.
overwrite_unspecified: (logical) Sometimes the Crossref API returns mime type 'unspecified' for the full text links (for some Wiley DOIs, for example). This parameter overrides the mime type to be treated as pdf. Default: FALSE.
...: Named curl options passed on to crul::verb-GET; see curl::curl_options() for available options.
Note that this function is not vectorized. To do many requests use a for/while loop or lapply family calls, or similar.
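Since crm_pdf() is not vectorized, a loop over several links can be sketched as below. The URLs are placeholders for illustration, and each call is wrapped in tryCatch() so one failed download does not stop the rest:

```r
library(crminer)

# hypothetical vector of full text links
urls <- c("https://peerj.com/articles/6840.pdf",
          "https://peerj.com/articles/7391.pdf")

# one request per URL; errors are captured instead of aborting the loop
res <- lapply(urls, function(u) {
  tryCatch(crm_pdf(u), error = function(e) e)
})
```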
Note that some links returned will not in fact lead you to full text
content as you would understandably think and expect. That is, if you
use the filter parameter with e.g., rcrossref::cr_works()
and filter to only full text content, some links may actually give back
only metadata for an article. Elsevier is perhaps the worst offender,
for one because they have a lot of entries in Crossref TDM, but most
of the links that are apparently full text are not in fact full text,
but only metadata.
Check out auth for details on authentication.
You can optionally set a user agent string with the curl option useragent,
like crm_text("some doi", "pdf", useragent = "foo bar").
User agent strings are sometimes used by servers to decide whether to
provide a response (in this case, the full text article). Sometimes a
browser-like user agent string will make the server happy. By default, all
requests in this package have a user agent string like
libcurl/7.64.1 r-curl/4.3 crul/0.9.0, which is a string with the names
and versions of the HTTP clients used under the hood. If you supply
a user agent string using the useragent
curl option, we'll use it instead.
For more information on user agents, and examples of user agent strings you
can use here, see
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
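As a sketch, a browser-like user agent string can be passed the same way; the DOI here is a placeholder for illustration only:

```r
library(crminer)

# assumed DOI, used only to get a pdf link for this example
link <- crm_links("10.7717/peerj.6840", "pdf")

# pass a browser-like user agent string via the curl option
crm_pdf(link,
        useragent = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Firefox/115.0")
```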
For at least some PDFs from Elsevier, most likely when you do not have full access to the full text, the server will return a successful response but only the first page of the PDF. It does, however, include a warning message in the response headers, which we look for and pass on to the user; we also delete the PDF, because we assume that if you are using this package you want the whole article, not just the first page. As far as we know, this behavior does not occur with other article types (xml, plain), but let us know if you see it.
By default we use paste0(rappdirs::user_cache_dir(), "/crminer"), but you can
set this directory to something different. Paths are set up under "/crminer"
for each of the file types: "/crminer/pdf", "/crminer/xml", "/crminer/txt",
and "/crminer/html". See crm_cache for caching details.
We cache all file types, as well as the extracted text from the pdf. The text is saved in a text file with the same file name as the pdf, but with the file extension ".txt". On subsequent requests of the same DOI, we first look for a cached .txt file matching the DOI, and return it if it exists. If it does not exist, but the PDF does exist, we skip the PDF download step and move on to reading the PDF to text; we cache that text in a .txt file. If there's no .txt or .pdf file, we download the PDF and read the pdf to text, and both are cached.
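The cache described above can be inspected through the crm_cache client; the method names below are assumed from the hoardr-style interface that crm_cache appears to expose:

```r
library(crminer)

crm_cache$cache_path_get()  # where cached files are stored
crm_cache$list()            # cached .pdf/.txt files so far
# crm_cache$delete_all()    # wipe the cache entirely
```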
## Not run:
# set a temp dir. cache path
crm_cache$cache_path_set(path = "crminer", type = "tempdir")
## you can set the entire path directly via the `full_path` arg
## like crm_cache$cache_path_set(full_path = "your/path")

## peerj
x <- crm_pdf("https://peerj.com/articles/6840.pdf")

## pensoft
data(dois_pensoft)
(links <- crm_links(dois_pensoft[10], "all"))
crm_pdf(links)

## hindawi
data(dois_pensoft)
(links <- crm_links(dois_pensoft[12], "all"))
### pdf
crm_pdf(links, read = FALSE)
crm_pdf(links)

## End(Not run)