scrape_scores: Webscraping Match Scores


Description

Webscrapers for various types of match score content managers.

Usage

scrape.korrio(url, file="Korrio", url.date.format="%B %Y %a %d", 
    date.format="%Y-%m-%d", append=FALSE, get.surface=FALSE, ...)
scrape.demosphere(url, file="Demosphere", url.date.format="%B %d %Y", 
    date.format="%Y-%m-%d", table.style=1, year=NULL, append=FALSE, 
    get.surface=FALSE, ...)
scrape.demosphere.main(url, div.resolver, name="Demosphere", basedir=".", 
    url.date.format="%B %d %Y", date.format="%Y-%m-%d", U12=2001, 
    table.style=1, append=FALSE, get.surface=FALSE, ...)
scrape.gotsport(url, file="GotSport", tb.num=10, url.date.format="%m/%d/%Y", 
    table.style=1, date.format="%Y-%m-%d", append=FALSE, ...)
scrape.gotsport.main(url, name="test", basedir=".", tb.num=10, 
    url.date.format="%m/%d/%Y", table.style=1, date.format="%Y-%m-%d", 
    U12=2001, append=FALSE, ...)
scrape.sportaffinity(url, file="SportAffinity", url.date.format="%B %d, %Y", 
    date.format="%Y-%m-%d", append=FALSE, ...)
scrape.sportaffinity.brackets(url, file, venue=NULL, ...)
scrape.sportaffinity.main(url, name="SportAffinity", basedir=".", 
    url.date.format="%B %d, %Y", date.format="%Y-%m-%d", U12=2001, 
    append=FALSE, add.to.base.url="tour/public/info/", ..., 
    U.designation="Under ", name.delimiter="Under ", name.skip=3)
scrape.scoreboard(html.file, file="ScoreBoard", url.date.format="%a %m/%d/%Y", 
    date.format="%Y-%m-%d", append=FALSE, get.surface=FALSE, ...)
scrape.custom1(url, file="Custom1", weeks=NULL, first.td.tag=3, last.td.tag=7, 
    td.per.row=5, append=FALSE, ...)
scrape.custom2(url, file="Custom2", year=NULL, date.format="%Y-%m-%d", append=FALSE, ...)
scrape.custom3(url, file="Custom3", year=NULL, date.format="%Y-%m-%d", append=FALSE, ...)
scrape.custom4(url, file="Custom4", year=NULL, date.format="%Y-%m-%d", append=FALSE, ...)
scrape.json1(url, file="Json1", date.format="%Y-%m-%d", append=FALSE, ...)
scrape.usclub(url, file="USClub", url.date.format="%A%m/%d/%Y", 
    date.format="%Y-%m-%d", append=FALSE, ...)

Arguments

url

URL to the webpage with the match information.

html.file

the html file(s) to scrape.

file

file where the match data is to be saved, written as a comma-delimited flat file. Include the directory in the name if the file should not be saved in the working directory.

name

name of the file to be saved. Needed when the file name has to be created dynamically, as in scrapers that scrape all age groups.

basedir

base directory where gender-age files are to be saved. Needed when the file name has to be created dynamically, as in scrapers that scrape all age groups.

U12

The year that a U12 player is associated with. Needed when multiple ages are being scraped and the gender-age must be computed.

append

whether to append the match data to the existing file.

date.format

the date format to be used in the date column in the outputted match file.

get.surface

Some websites have surface (turf, grass) information and this can be scraped if desired.

tb.num

the table number to scrape. Some websites place the table at different positions on the page.

url.date.format

the date format on the webpage.

table.style

the table style. Some content managers use different formats on different pages.

first.td.tag, last.td.tag, td.per.row

custom information for dealing with badly formed table html.

weeks

dates for webpages that show week number instead of a date. Must be in YYYY-mm-dd format.

year

The year to associate with the match dates since the match dates are not shown with the year on the webpage.

venue

For scraping SportAffinity brackets. Use the column name venue if the bracket name should be added to the score data. Otherwise no bracket information is added and only the scores are scraped.

div.resolver

For scrape.demosphere.main. Specifies which division names appear on the Demosphere main page being scraped.

U.designation

For scrape.sportaffinity.main. U.designation is the text right before the age (a 2 digit number); so if the ages are denoted like "Boys Under 11", U.designation = "Under " or "der ". If ages were denoted "BU11", U.designation="U".

name.delimiter

For scrape.sportaffinity.main. name.delimiter and name.skip help form the division name. Say divisions are like "Boys Under 11 Foobar" and you want to use Foobar as the division. Then name.delimiter="Under " and name.skip=3. This says the delimiter is "Under " and that another 3 characters (here "11 ") are skipped after it to reach the division name. If you want 11 Foobar as your division, use name.skip=0. If you want Under 11 Foobar, use name.delimiter="Boys" and name.skip=1.

name.skip

For scrape.sportaffinity.main. See above.

add.to.base.url

For scrape.sportaffinity.main. The scraper constructs the URLs for the individual ages from the age links on the main page. This argument gives the text, if any, to add to the base URL of the main page before the information taken from those links.

...

Other columns to append to the match file. For example, a column denoting the league or venue name.
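The rules described for U.designation, name.delimiter and name.skip can be illustrated with a short sketch. The helpers extract_age and extract_division are hypothetical and are not part of fbRanks; they only mimic the behaviour described in the argument text, assuming plain (non-regex) matching on the first occurrence of the delimiter.

```r
# Hypothetical helpers, not part of fbRanks. They mimic the rules described
# for U.designation, name.delimiter and name.skip.

# Return the two-digit age that follows the first match of U.designation.
extract_age <- function(x, U.designation) {
  pos <- regexpr(U.designation, x, fixed = TRUE)
  if (pos < 0) return(NA_integer_)  # designation not found
  start <- as.integer(pos) + attr(pos, "match.length")
  as.integer(substr(x, start, start + 1))
}

# Return the text starting name.skip characters after name.delimiter.
extract_division <- function(x, name.delimiter, name.skip) {
  pos <- regexpr(name.delimiter, x, fixed = TRUE)
  if (pos < 0) return(NA_character_)  # delimiter not found
  start <- as.integer(pos) + attr(pos, "match.length") + name.skip
  substr(x, start, nchar(x))
}

age1 <- extract_age("Boys Under 11 Foobar", "Under ")
age2 <- extract_age("Boys Under 11 Foobar", "der ")
age3 <- extract_age("BU11", "U")

div1 <- extract_division("Boys Under 11 Foobar", "Under ", 3)
div2 <- extract_division("Boys Under 11 Foobar", "Under ", 0)
div3 <- extract_division("Boys Under 11 Foobar", "Boys", 1)
```

Under these assumptions, the three age calls all give 11, and the three division calls give "Foobar", "11 Foobar" and "Under 11 Foobar", matching the cases listed for name.delimiter and name.skip.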

Details

These webscrapers are customized for various match delivery platforms: Korrio, Demosphere, GotSport, SportAffinity and ScoreBoard. Look at the bottom of the match website to determine the platform and thus which scraper to use. These scrapers will go out of date quickly as the structure of the websites changes, so you will probably need to modify them for your own purposes. The custom1 scraper shows how to scrape a page with improperly formed HTML. Custom2, Custom3 and Custom4 are for scores not provided by a standard content provider. The json1 scraper shows how to scrape JSON data.

Type the scraper name (e.g. scrape.custom1) at the command line to see comments and a url example for each scraper.

Value

The scrapers output a comma-delimited file with the match data. For examples, see the vignettes.
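As a rough illustration of the output step, the sketch below converts dates from a page format (url.date.format) to the file format (date.format), adds an extra column of the kind passed through ..., and writes a comma-delimited file. It uses base R only and does not call the scrapers. The data, the column names, the file name matches.csv and the handling of the header when appending are all assumptions made for the example, not the package's documented behaviour.

```r
# Hypothetical match data as it might be read off a page. The column names
# are assumptions for illustration only.
scraped <- data.frame(
  date = c("03/14/2015", "03/21/2015"),
  home.team = c("Team A", "Team B"),
  home.score = c(2, 0),
  away.team = c("Team C", "Team A"),
  away.score = c(1, 3),
  stringsAsFactors = FALSE
)

url.date.format <- "%m/%d/%Y"  # how dates appear on the page
date.format <- "%Y-%m-%d"      # how dates should appear in the file

# Parse with the page format, then re-format for the output file.
scraped$date <- format(as.Date(scraped$date, format = url.date.format),
                       date.format)

# An extra column, standing in for one supplied through `...`.
scraped$venue <- "Spring League"

# Write a new comma-delimited file (the append=FALSE case).
out <- file.path(tempdir(), "matches.csv")
write.table(scraped, file = out, sep = ",", row.names = FALSE,
            col.names = TRUE, append = FALSE)

# Add the same rows again without a header (the append=TRUE case).
write.table(scraped, file = out, sep = ",", row.names = FALSE,
            col.names = FALSE, append = TRUE)

matches <- read.csv(out, stringsAsFactors = FALSE)
```

Reading the file back with read.csv gives one row per match, with the date column in the date.format layout and the extra venue column at the end.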

Author(s)

Eli Holmes, Seattle, USA. ee(dot)holmes(at)u(dot)washington(dot)edu


fbRanks documentation built on May 1, 2019, 8:01 p.m.