crm_pdf: Get full text PDFs

Description

Get full text PDFs

Usage

crm_pdf(url, overwrite = TRUE, read = TRUE, overwrite_unspecified = FALSE, ...)

Arguments

url

A URL (character) or an object of class tdmurl from a call to crm_links(). If you'll be getting text from publishers that use Crossref TDM (which requires authentication), we strongly recommend calling crm_links() first and passing its output here, because crm_links() grabs the publisher's Crossref member ID, which we use for authentication and for other publisher-specific fixes to URLs.

overwrite

(logical) Overwrite file if it exists already? Default: TRUE

read

(logical) If reading a pdf, this toggles whether we extract text from the pdf or simply download. If TRUE, you get the text from the pdf back. If FALSE, you only get back the metadata. Default: TRUE

overwrite_unspecified

(logical) Sometimes the Crossref API returns mime type 'unspecified' for the full text links (for some Wiley DOIs, for example). This parameter overrides the 'unspecified' mime type with the requested type. Default: FALSE

...

Named curl options passed on to crul::verb-GET; see curl::curl_options() for the available curl options. See especially the User-agent section below.

Notes

Note that this function is not vectorized. To do many requests use a for/while loop or lapply family calls, or similar.
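As a rough illustration of the lapply approach, the sketch below calls a single-URL fetcher once per URL and keeps going when one request fails. fetch_many is a hypothetical helper, not part of crminer; it assumes the fetcher signals failures as R errors, and it takes the fetcher as an argument (defaulting to crminer::crm_pdf) so the pattern can be tried without network access.

```r
# Hypothetical helper (not part of crminer): apply a single-URL fetcher
# to each URL in turn, capturing errors instead of stopping at the first.
fetch_many <- function(urls, fetch = crminer::crm_pdf, ...) {
  results <- lapply(urls, function(u) {
    # On failure, store the condition object in place of the result
    tryCatch(fetch(u, ...), error = function(e) e)
  })
  names(results) <- urls
  results
}

# Usage (not run; needs network access and crminer installed):
# res <- fetch_many(c("https://peerj.com/articles/6840.pdf"))
# failed <- vapply(res, inherits, logical(1), what = "error")
```

Capturing the condition object per URL makes it possible to retry only the failed requests afterwards.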

Note that some links returned will not in fact lead you to full text content, as you would understandably expect. That is, if you use the filter parameter with, e.g., rcrossref::cr_works() and filter to only full text content, some links may actually give back only metadata for an article. Elsevier is perhaps the worst offender: they have a lot of entries in Crossref TDM, but most of the links that appear to be full text in fact return only metadata.

Check out auth for details on authentication.

User-agent

You can optionally set a user agent string with the curl option useragent, like crm_text("some doi", "pdf", useragent = "foo bar"). User agent strings are sometimes used by servers to decide whether to provide a response (in this case, the full text article). Sometimes a browser-like user agent string will make the server happy. By default, all requests in this package have a user agent string like libcurl/7.64.1 r-curl/4.3 crul/0.9.0, which is a string with the names and versions of the HTTP clients used under the hood. If you supply a user agent string using the useragent curl option, we'll use it instead. For more information on user agents, and examples of user agent strings you can use here, see https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
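A minimal sketch of passing the useragent curl option through the ... argument follows. fetch_with_ua is a hypothetical wrapper, not part of crminer, and the user agent string used in the commented call is only an example value.

```r
# Hypothetical wrapper (not part of crminer): forward a custom user agent
# as the `useragent` curl option, along with any other curl options in `...`.
fetch_with_ua <- function(url, ua, fetch = crminer::crm_pdf, ...) {
  fetch(url, useragent = ua, ...)
}

# Usage (not run; needs network access and crminer installed):
# ua <- "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
# x <- fetch_with_ua("https://peerj.com/articles/6840.pdf", ua)
```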

Elsevier-partial

For at least some PDFs from Elsevier, most likely when you do not have access to the full text, the server returns a successful response but includes only the first page of the PDF. The response headers do, however, include a warning message, which we look for and pass on to the user. We also delete the PDF, because we assume that if you are using this package you want the whole article and not just the first page. As far as we know, this behavior does not occur with other article types (xml, plain), but let us know if you see it.

Caching

By default we use paste0(rappdirs::user_cache_dir(), "/crminer"), but you can set this directory to something different. Paths are set up under "/crminer" for each of the file types: "/crminer/pdf", "/crminer/xml", "/crminer/txt", and "/crminer/html". See crm_cache for caching details.
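To make the directory layout concrete, the sketch below builds the four per-type paths from a base directory. crminer_cache_dirs is a hypothetical helper written only for illustration; the package itself manages these paths through crm_cache. It uses file.path() rather than paste0(), which gives the same result on systems that use "/" as the separator.

```r
# Hypothetical helper (not part of crminer): list the per-type cache
# directories that would sit under "<base>/crminer".
crminer_cache_dirs <- function(base = rappdirs::user_cache_dir()) {
  root <- file.path(base, "crminer")
  types <- c("pdf", "xml", "txt", "html")
  stats::setNames(file.path(root, types), types)
}

# Usage with an explicit base directory (no rappdirs needed):
# crminer_cache_dirs("/tmp/cache")
```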

We cache all file types, as well as the extracted text from the PDF. The text is saved in a text file with the same file name as the PDF, but with the file extension ".txt". On subsequent requests for the same DOI, we first look for a cached .txt file matching the DOI, and return it if it exists. If it does not exist but the PDF does, we skip the PDF download step and move on to reading the PDF to text; we then cache that text in a .txt file. If there's no .txt or .pdf file, we download the PDF and read it to text, and both are cached.
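The lookup order described above can be sketched as follows. This is not crminer's actual implementation: cached_pdf_text is a hypothetical function, and download and extract are placeholder callbacks supplied by the caller (for example, a function that writes the PDF to the given path and a function that returns its text).

```r
# Hypothetical sketch of the cache lookup order (not crminer's own code).
# Assumes `pdf_path` ends in ".pdf"; `download(path)` writes the PDF to
# `path`; `extract(path)` returns the PDF text as a character vector.
cached_pdf_text <- function(pdf_path, download, extract) {
  txt_path <- sub("\\.pdf$", ".txt", pdf_path)

  # 1. Cached text exists: return it without touching the PDF
  if (file.exists(txt_path)) {
    return(paste(readLines(txt_path, warn = FALSE), collapse = "\n"))
  }

  # 2. No cached text: download the PDF only if it is not cached either
  if (!file.exists(pdf_path)) {
    download(pdf_path)
  }

  # 3. Extract the text and cache it next to the PDF
  text <- paste(extract(pdf_path), collapse = "\n")
  writeLines(text, txt_path)
  text
}
```

With this ordering, a second call for the same path neither downloads nor re-extracts; it only reads the cached .txt file.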

Examples

## Not run: 
# set a temp dir. cache path
crm_cache$cache_path_set(path = "crminer", type = "tempdir")
## you can set the entire path directly via the `full_path` arg
## like crm_cache$cache_path_set(full_path = "your/path")

## peerj
x <- crm_pdf("https://peerj.com/articles/6840.pdf")

## pensoft
data(dois_pensoft)
(links <- crm_links(dois_pensoft[10], "all"))
crm_pdf(links)

## hindawi
data(dois_pensoft)
(links <- crm_links(dois_pensoft[12], "all"))
### pdf
crm_pdf(links, read=FALSE)
crm_pdf(links)

## End(Not run)

ropensci/crminer documentation built on May 18, 2022, 9:50 a.m.