crm_xml: Get full text XML

View source: R/crm_xml.R

crm_xml    R Documentation

Get full text XML

Description

Get full text XML

Usage

crm_xml(url, overwrite_unspecified = FALSE, ...)

Arguments

url

A URL (character) or an object of class tdmurl from a call to crm_links(). If you'll be getting text from publishers that use Crossref TDM (which requires authentication), we strongly recommend calling crm_links() first and passing its output here, because crm_links() grabs the publisher's Crossref member ID, which we use for authentication and for other publisher-specific fixes to URLs.

overwrite_unspecified

(logical) Sometimes the Crossref API returns mime type 'unspecified' for the full text links (for some Wiley DOIs, for example). This parameter overrides the mime type to be the requested type (XML for this function). Default: FALSE

...

Named curl options passed on to crul::verb-GET; see curl::curl_options() for available curl options. See especially the User-agent section below.

Details

Note that this function is not vectorized. To do many requests use a for/while loop or lapply family calls, or similar.
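As a minimal sketch of the lapply approach, the helper below calls a single-URL fetcher once per input and keeps going when one request fails. The name fetch_each and the stub fetcher are hypothetical and not part of crminer; the sketch assumes a character vector of URLs, and the default fetcher crminer::crm_xml is only evaluated if you do not pass your own.

```r
# Hypothetical helper (not part of crminer): apply a single-URL fetcher to
# each URL in turn, returning NULL for any URL whose request errors.
fetch_each <- function(urls, fetch = crminer::crm_xml, ...) {
  results <- lapply(urls, function(u) {
    tryCatch(
      fetch(u, ...),
      error = function(e) {
        message("failed: ", u, " (", conditionMessage(e), ")")
        NULL
      }
    )
  })
  # Name results by URL so failures are easy to spot
  if (is.character(urls)) names(results) <- urls
  results
}

# Offline demonstration with a stub fetcher standing in for crm_xml()
stub <- function(u, ...) {
  if (grepl("bad", u)) stop("no full text")
  paste0("<xml>", u, "</xml>")
}
out <- fetch_each(
  c("https://example.org/a.xml", "https://example.org/bad.xml"),
  fetch = stub
)

# With crminer installed, a real call might look like:
# res <- fetch_each("https://peerj.com/articles/2356.xml")
```

Returning NULL on failure instead of stopping is a design choice for long batches; filter with Filter(Negate(is.null), out) to keep only the successful results.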

Note that some links returned will not in fact lead you to full text content as you would understandably expect. That is, if you use the filter parameter with, e.g., rcrossref::cr_works() and filter to only full text content, some links may actually give back only metadata for an article. Elsevier is perhaps the worst offender: they have a lot of entries in Crossref TDM, but most of the links that appear to be full text are in fact only metadata.

Check out auth for details on authentication.

User-agent

You can optionally set a user agent string with the curl option useragent, like crm_text("some doi", "pdf", useragent = "foo bar"). User agent strings are sometimes used by servers to decide whether to provide a response (in this case, the full text article). Sometimes a browser-like user agent string will make the server happy. By default, all requests in this package have a user agent string like libcurl/7.64.1 r-curl/4.3 crul/0.9.0, which is a string with the names and versions of the HTTP clients used under the hood. If you supply a user agent string using the useragent curl option, we'll use it instead. For more information on user agents, and examples of user agent strings you can use here, see https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
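The same option can be passed to crm_xml() through its ... argument. The sketch below is illustrative only: the browser-like string is an assumed example value, not one recommended by the package, and the request is guarded so that it only runs in an interactive session with crminer installed.

```r
# Assumed, illustrative browser-like User-Agent value; substitute your own.
ua <- "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"

# Only attempt the network request interactively and when crminer is available
if (interactive() && requireNamespace("crminer", quietly = TRUE)) {
  # useragent is forwarded as a curl option via `...`
  x <- crminer::crm_xml("https://peerj.com/articles/2356.xml", useragent = ua)
}
```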

Examples

## Not run: 
## peerj
x <- crm_xml("https://peerj.com/articles/2356.xml")

## pensoft
data(dois_pensoft)
(links <- crm_links(dois_pensoft[1], "all"))
### xml
crm_xml(url = links)

## End(Not run)

ropensci/crminer documentation built on May 18, 2022, 9:50 a.m.