README.md

scrapex

Since web scraping and API tutorials are deemed to be obsolete once they changes it’s structure or go down indefinately, scrapex saves specific web paths of certain websites to make scraping examples reproducibly in the long term. Moreover, it scrapex contains local copies of custom APIs for educational purposes using the plumber package. To make the package as lightweight as possible, for the local copies of websites, some images/figures/maps have been excluded. Aside from that, most examples should contain the rough skeleton of the original website.

Installation

scrapex will probably never see the light of CRAN since it is deemed to grow in size if new examples come along, so you need to install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("cimentadaj/scrapex")

Website Example

scrapex contains websites which have been saved locally for users to use as examples. Each example has only one function which returns the local links to the html websites. For example, for the spanish schools examples (see ?spanish_schools_ex for documentation on the example), we can obtain the links with spanish_schools_ex():

library(scrapex)
library(xml2)

school_websites <- spanish_schools_ex()
school_websites
#>  [1] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_15012626.html"
#>  [2] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_15013308.html"
#>  [3] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_2009122.html" 
#>  [4] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_2009158.html" 
#>  [5] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_28043727.html"
#>  [6] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_28060661.html"
#>  [7] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_28061262.html"
#>  [8] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_28061663.html"
#>  [9] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_28061675.html"
#> [10] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_28062606.html"
#> [11] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_28063003.html"
#> [12] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_30018357.html"
#> [13] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_3006839.html" 
#> [14] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_37013560.html"
#> [15] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_37013638.html"
#> [16] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_37013663.html"
#> [17] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_37013687.html"
#> [18] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_37013705.html"
#> [19] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_43007440.html"
#> [20] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_43007464.html"
#> [21] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_43007491.html"
#> [22] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_43007555.html"
#> [23] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_43007579.html"
#> [24] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_8051276.html" 
#> [25] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_8052190.html" 
#> [26] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_8067961.html" 
#> [27] "/tmp/Rtmph3vN2y/temp_libpath524336b0c86d6/scrapex/extdata/spanish_schools_ex/school_8067971.html"

To browse the actual website of one of these links, use prep_browser with openURL:

browseURL(prep_browser(school_websites)[1])

To actually scrape information from these websites, they work as expected with read_html.

# You can read them with `read_html`
read_html(school_websites[1])
#> {html_document}
#> <html lang="es-es">
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body>\r\n  \r\n      <a href="#contenidoPpal" class="sr-only">Saltar a c ...

API example

All APIs in scrapex export a function that launches the API. These functions start with api_*. For example, loading the local copy of the API for the public bicycle system of Barcelona can be done like this:

library(scrapex)
library(httr2)
#> 
#> Attaching package: 'httr2'
#> The following object is masked from 'package:xml2':
#> 
#>     url_parse
bicing <- api_bicing()
#> [1] "Visit your REST API at http://localhost:3330"
#> [1] "Documentation is at http://localhost:3330/__docs__/"

You’ll see that the API returns a link to the documentation of the API as well as to the main URL of the API. Users can then make requests just as if this was a real API. For example:

rt_bicing <- paste0(bicing$api_web, "/api/v1/real_time_bicycles")
resp_output <- 
  rt_bicing %>%
  request() %>% 
  req_perform() %>%
  resp_body_json() %>%
  lapply(data.frame) %>%
  do.call(rbind, .)

resp_output
#>    type latitude longitude                     streetName streetNumber slots
#> 1  BIKE  41.3980    2.1800       Gran Via Corts Catalanes          760    29
#> 2  BIKE  41.3955    2.1771        Roger de Flor/ Gran Vía          126    27
#> 3  BIKE  41.3941    2.1813                         Nàpols           82    27
#> 4  BIKE  41.3935    2.1816                          Ribes           13    20
#> 5  BIKE  41.3911    2.1802              Pg Lluís Companys           11    39
#> 6  BIKE  41.3913    2.1806              Pg Lluís Companys           18    39
#> 7  BIKE  41.3889    2.1833              Pg Lluís Companys            1    23
#> 8  BIKE  41.3891    2.1836              Pg Lluís Companys            2    26
#> 9  BIKE  41.3845    2.1849         Marquès de l'Argentera           13    26
#> 10 BIKE  41.3817    2.1939                Passeig Marítim           19    21
#> 11 BIKE  41.3845    2.1957         Pg Marítim Barceloneta           23    29
#> 12 BIKE  41.3869    2.1958               Avinguda Litoral           16     3
#> 13 BIKE  41.3847    2.1850 Avinguda del Marques Argentera           15    26
#> 14 BIKE  41.3948    2.1712                         Girona           68    18
#> 15 BIKE  41.3983    2.1867                  Av. Meridiana           47    21
#> 16 BIKE  41.3982    2.1867                  Av. Meridiana           47    21
#> 17 BIKE  41.4061    2.1742                       Rosselló          453    26
#> 18 BIKE  41.4033    2.1707                       Rosselló          354    28
#> 19 BIKE  41.4102    2.1758                      Cartagena          308    20
#> 20 BIKE  41.4109    2.1740       Sant Antoni Maria Claret          214     2
#>           current_time in_use
#> 1  2022-12-23 17:43:33     29
#> 2  2022-12-23 17:43:33     12
#> 3  2022-12-23 17:43:33     13
#> 4  2022-12-23 17:43:33     17
#> 5  2022-12-23 17:43:33     14
#> 6  2022-12-23 17:43:33      8
#> 7  2022-12-23 17:43:33      1
#> 8  2022-12-23 17:43:33     26
#> 9  2022-12-23 17:43:33     26
#> 10 2022-12-23 17:43:33      7
#> 11 2022-12-23 17:43:33     21
#> 12 2022-12-23 17:43:33      3
#> 13 2022-12-23 17:43:33      1
#> 14 2022-12-23 17:43:33     15
#> 15 2022-12-23 17:43:33     14
#> 16 2022-12-23 17:43:33      3
#> 17 2022-12-23 17:43:33     20
#> 18 2022-12-23 17:43:33     21
#> 19 2022-12-23 17:43:33      1
#> 20 2022-12-23 17:43:33      2

Collaboration

This package is a work in progress and welcomes more examples from different backgrounds. If you want to contribute to the package you can send a pull request following these guidelines:

For webscraping, I encourage anyone looking to do a PR to look at the one for spanish_schools_ex which is called download_school.R (note that this is tricky because downloading a path of a website many times results in downloading the complete website. Look at the example in download_school.R to only download single paths). For the API example, I encourage users to look at the file R/api_coveragedb.R and inst/plumber/api_coveragedb/.

All suggestions are very welcome if you feel this is not the most optimal storage strategy! Open an issue request and we’ll discuss it.



cimentadaj/scrapex documentation built on Jan. 5, 2023, 9:49 p.m.