searchplos: Base function to search PLoS Journals

View source: R/searchplos.R

Description

Base function to search PLoS Journals

Usage

searchplos(
  q = NULL,
  fl = "id",
  fq = NULL,
  sort = NULL,
  start = 0,
  limit = 10,
  sleep = 6,
  errors = "simple",
  proxy = NULL,
  callopts = list(),
  progress = NULL,
  ...
)

Arguments

q

Search terms (character). You can search on specific fields by doing 'field:your query'. For example, a real query on a specific field would be 'author:Smith'.

fl

Fields to return from search (character) [e.g., 'id,title'], any combination of search fields (see the dataset plosfields)

fq

List specific fields to filter the query on (if NA, all queried). The options for this parameter are the same as those for the fl parameter. Note that using this parameter doesn't influence the actual query, but is used to filter the results to a subset of those you want returned. For example, if you want full articles only, you can do 'doc_type:full'. In another example, if you want only results from the journal PLOS One, you can do 'journal_key:PLoSONE'. See journalnamekey for journal abbreviations.

sort

Sort results according to a particular field, and specify ascending (asc) or descending (desc) after a space; see examples. For example, to sort the counter_total_all field in descending fashion, do sort='counter_total_all desc'

start

Record to start at (used in combination with limit when you need to cycle through more results than the max allowed=1000). See Pagination below

limit

Number of results to return (integer). Setting limit=0 returns only metadata. See Pagination below

sleep

Number of seconds to wait between requests. There is no need to set this for a single call to searchplos. However, if you call searchplos in a loop or an lapply-type call, the sleep parameter helps prevent your IP address from being blocked. You can only make 10 requests per minute, so one request every 6 seconds is about right.

errors

(character) One of simple or complete. Simple gives http code and error message on an error, while complete gives both http code and error message, and stack trace, if available.

proxy

List of arguments for a proxy connection, including one or more of: url, port, username, password, and auth. See proxy for help, which is used to construct the proxy connection.

callopts

(list) optional curl options passed to HttpClient

progress

A function with logic for printing a progress bar for an HTTP request, ultimately passed down to curl. Only supports httr::progress()

...

Additional Solr arguments

Details

Details:

Value

An object of class "plos", with a list of length two, each element being a list itself.

Faceting

Read more about faceting here: http://wiki.apache.org/solr/SimpleFacetParameters

Website vs. API behavior

Don't be surprised if queries you perform in a scripting language, like using rplos in R, give different results than when searching for articles on the PLOS website. I am not sure what exact defaults they use on their website. There are a few things to consider. You can tweak which types of articles are returned: try using the article_type filter in the fq parameter. To restrict the search to one journal, filter on journal_key, e.g., 'journal_key:PLoSONE'. See journalnamekey() for journal abbreviations.

Phrase searching

To search phrases, e.g., synthetic biology as a single item, rather than separate occurrences of synthetic and biology, simply put double quotes around the phrase. For example, to search for cases of synthetic biology, do searchplos(q = '"synthetic biology"').

You can modify phrase searches as well. For example, searchplos(q = '"synthetic biology" ~ 10') asks for cases of synthetic biology within 10 words of each other. See examples.
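As a minimal sketch of the quoting rules above, the helper below builds a phrase query string. phrase_query is a hypothetical convenience function, not part of rplos; it only assembles the string that would be passed as q, following the 'field:"phrase"~N' form used in the examples.

```r
# Hypothetical helper (not part of rplos): wrap a phrase in double quotes,
# optionally append a proximity operator (~N) and a field prefix.
phrase_query <- function(phrase, within = NULL, field = NULL) {
  q <- paste0('"', phrase, '"')
  if (!is.null(within)) q <- paste0(q, "~", as.integer(within))
  if (!is.null(field)) q <- paste0(field, ":", q)
  q
}

q_exact <- phrase_query("synthetic biology")
q_near  <- phrase_query("sports alcohol", within = 15, field = "everything")
print(q_exact)
print(q_near)

## Not run:
# library(rplos)
# searchplos(q = q_exact, limit = 2)
# searchplos(q = q_near, fl = "title", fq = "doc_type:full")
## End(Not run)
```

Building the string separately makes it easy to inspect the exact query before any request is sent.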

Pagination

The searchplos function and the many functions that wrap searchplos all do pagination internally for you. That is, if you request, for example, 2000 results, the maximum you can get in any one request is 1000, so we'll do two requests for you. And so on for larger requests.

You can always do your own pagination with an lapply-type call or a for loop to cycle through pages of results.
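Manual pagination could be sketched as follows. page_plan is a hypothetical helper, not part of rplos; it only computes the start and limit values for each page, assuming the 1000-record maximum per request stated above. The network calls are left commented out, and the explicit pause between requests is an assumption based on the documented limit of 10 requests per minute.

```r
# Hypothetical helper (not part of rplos): compute the start offset and
# limit of each page needed to fetch `total` records.
page_plan <- function(total, page_size = 1000) {
  stopifnot(total >= 1, page_size >= 1)
  starts <- seq(0, total - 1, by = page_size)
  data.frame(start = starts, limit = pmin(page_size, total - starts))
}

plan <- page_plan(2500)
print(plan)

## Not run:
# library(rplos)
# pages <- lapply(seq_len(nrow(plan)), function(i) {
#   res <- searchplos(q = "ecology", fl = "id",
#                     start = plan$start[i], limit = plan$limit[i])
#   Sys.sleep(6)  # assumed pause to stay under 10 requests per minute
#   res$data
# })
# ids <- do.call(rbind, pages)
## End(Not run)
```

Computing the plan up front keeps the request loop simple and makes the number of requests visible before any are sent.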

Examples

## Not run: 
searchplos(q='ecology', fl=c('id','publication_date'), limit = 2)
searchplos('ecology', fl=c('id','publication_date'), limit = 2)
searchplos('ecology', c('id','title'), limit = 2)

# Get only full article DOIs
out <- searchplos(q="*:*", fl='id', fq='doc_type:full', start=0, limit=250)
head(out$data)

# Get DOIs for only PLoS One articles
out <- searchplos(q="*:*", fl='id', fq='journal_key:PLoSONE', start=0, limit=15)
out$data

# Get DOIs for full article in PLoS One
out <- searchplos(q="*:*", fl='id', fq=list('journal_key:PLoSONE',
   'doc_type:full'), limit=50)
out$data

# Search for many q
q <- c('ecology','evolution','science')
lapply(q, function(x) searchplos(x, limit=2))

# Query to get some PLOS article-level metrics, notice difference between two outputs
out <- searchplos(q="*:*", fl=c('id','counter_total_all','alm_twitterCount'),fq='doc_type:full')
out_sorted <- searchplos(q="*:*", fl=c('id','counter_total_all','alm_twitterCount'),
   fq='doc_type:full', sort='counter_total_all desc')
out$data
out_sorted$data

# Show me all articles that have these two words less than about 15 words apart.
searchplos(q='everything:"sports alcohol"~15', fl='title', fq='doc_type:full')

# Now let's try to narrow our results to 7 words apart. Here I'm changing the ~15 to ~7
searchplos(q='everything:"sports alcohol"~7', fl='title', fq='doc_type:full')

# A list of articles about social networks that are popular on a social network
searchplos(q="*:*",fl=c('id','alm_twitterCount'),
   fq=list('doc_type:full','subject:"Social networks"','alm_twitterCount:[100 TO 10000]'),
   sort='counter_total_month desc')

# Now, let's also only look at articles that have seen some activity on twitter.
# Add "fq=alm_twitterCount:[1 TO *]" as a parameter within the fq argument.
searchplos(q='everything:"sports alcohol"~7', fl=c('alm_twitterCount','title'),
   fq=list('doc_type:full','alm_twitterCount:[1 TO *]'))
searchplos(q='everything:"sports alcohol"~7', fl=c('alm_twitterCount','title'),
   fq=list('doc_type:full','alm_twitterCount:[1 TO *]'),
   sort='counter_total_month desc')

# Return partial doc parts
## Return Abstracts only
out <- searchplos(q='*:*', fl=c('doc_partial_body','doc_partial_parent_id'),
   fq=list('doc_type:partial', 'doc_partial_type:Abstract'), limit=3)
## Return Titles only
out <- searchplos(q='*:*', fl=c('doc_partial_body','doc_partial_parent_id'),
   fq=list('doc_type:partial', 'doc_partial_type:Title'), limit=3)

# Remove DOIs for annotations (i.e., corrections)
searchplos(q='*:*', fl=c('id','article_type'),
   fq='-article_type:correction', limit=100)

# Remove DOIs for annotations (i.e., corrections) and Viewpoints articles
searchplos(q='*:*', fl=c('id','article_type'),
   fq=list('-article_type:correction','-article_type:viewpoints'), limit=100)

# Get eissn codes
searchplos(q='*:*', fl=c('id','journal','eissn','cross_published_journal_eissn'),
   fq="doc_type:full", limit = 60)

searchplos(q='*:*', fl=c('id','journal','eissn','cross_published_journal_eissn'),
   limit = 2000)

## End(Not run)

rplos documentation built on Feb. 24, 2021, 1:06 a.m.