web_crawler: Extract information starting from a given URL


Description

web_crawler is a supplementary function to 'recursive_crawler'. It is more flexible when users need to extract additional contents beyond order, suborder, family, subfamily, tribe, subtribe, genus, subgenus, species, and subspecies. The function handles an unusual situation in which HTML nodes referring to identical content change unexpectedly across pages. For example, nodes on parent and child webpages that refer to exactly the same content may shift from "p:nth-child(4)" to "p:nth-child(6)" from page to page, varying the number within a certain range. In general, this function starts from the top-layer page, follows available URL links down to lower-level pages, and extracts contents only from the lowest-layer pages, which is the main difference from 'recursive_crawler'. Since some useless information may also be grabbed due to the changing-node scenario, further data cleaning by users is strongly recommended.
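As an illustration of the shifting-node problem, the following minimal sketch (not the package internals; it assumes the 'rvest' package and a hypothetical base index of 4) probes nearby nth-child indices until one yields non-empty text:

library(rvest)

page <- read_html("https://species.wikimedia.org/wiki/Pieridae")

## The same content may sit at "p:nth-child(4)" on one page but
## "p:nth-child(6)" on another, so probe indices around a guessed position.
base_index <- 4
for (i in (base_index - 2):(base_index + 2)) {
  txt <- trimws(html_text(html_nodes(page, paste0("p:nth-child(", i, ")"))))
  if (length(txt) > 0 && nzchar(txt[1])) {
    print(txt[1])  # first non-empty match wins
    break
  }
}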

Usage

web_crawler(start_url, crawl_format, pre_postfix_list, colnames = "", search_range = 5)

Arguments

crawl_format

Required. The HTML nodes which contain the URLs that lead to the child pages of each parent page. The format should be as follows:

crawl_format <- list(first_page = "",
                     sec_page = "",
                     third_page = "",
                     fourth_page = "",
                     fifth_page = "")

The last page should be defined as "" since it does not have a child page.
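For instance, a minimal sketch of how a first-page selector can be used to collect child-page links (illustrative only; it assumes the 'rvest' package and appends " a" to target anchor tags, which is not necessarily how the package resolves links internally):

library(rvest)

page  <- read_html("https://species.wikimedia.org/wiki/Pieridae")
hrefs <- html_attr(html_nodes(page, "p:nth-child(5) a"), "href")
head(hrefs)  # relative links such as "/wiki/..."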

pre_postfix_list

Required. The constant parts of the child page URLs that are not shared with the parent page's URL. For example, suppose the parent page is "https://species.wikimedia.org/wiki/Belenois", the child page is "https://species.wikimedia.org/wiki/Belenois_aurota", and the href captured from the source code is "/wiki/Belenois_aurota". Because the child page URL cannot be obtained by concatenating "https://species.wikimedia.org/wiki/Belenois" and "/wiki/Belenois_aurota", the user needs to specify the prefix as "https://species.wikimedia.org" and the postfix as "". The standard format for passing this parameter is as follows:

pre_postfix_list <- list(first_page = c(prefix = "", postfix = ""),
                         sec_page = c(prefix = "", postfix = ""),
                         third_page = c(prefix = "", postfix = ""),
                         fourth_page = c(prefix = "", postfix = ""))

Note that the pages may be named as you like, but the names prefix and postfix cannot be changed.
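A minimal sketch of the resulting URL assembly (plain base R; the href value comes from the example above):

prefix  <- "https://species.wikimedia.org"
postfix <- ""
href    <- "/wiki/Belenois_aurota"

child_url <- paste0(prefix, href, postfix)
child_url
## [1] "https://species.wikimedia.org/wiki/Belenois_aurota"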

colnames

Optional. Set the column names in advance to avoid confusion. The default is the system default column names.

search_range

Optional. The range within which to vary the original number in the nodes. Note that the actual search window is twice the range: for example, if the original number in the node is 4 and the range is 2, there will be a loop from (4 - 2) to (4 + 2). The default is 5.
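A minimal sketch of the candidate selectors this produces (illustrative only, not the package internals):

original_index <- 4
search_range   <- 2
paste0("p:nth-child(", (original_index - search_range):(original_index + search_range), ")")
## [1] "p:nth-child(2)" "p:nth-child(3)" "p:nth-child(4)" "p:nth-child(5)" "p:nth-child(6)"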

start_url

Required. The URL of the starting webpage, which needs to be processed and leads to child webpages via hyperlinks.

Value

A data frame containing the results of the crawl.

Examples

## Starting page: the Pieridae family on Wikispecies
start_url <- "https://species.wikimedia.org/wiki/Pieridae"

## HTML node selectors for each layer of pages
crawl_format <- list(first_page  = "p:nth-child(5)",
                     second_page = "p:nth-child(4)",
                     third_page  = "p:nth-child(6)",
                     fourth_page = "i:nth-child(11) a, i:nth-child(10) a, i:nth-child(9) a",
                     fifth_page  = "div:nth-child(8) , p:nth-child(6)")

## Constant URL parts used to rebuild absolute child-page URLs
pre_postfix_list <- list(first_page  = c(prefix = "https://species.wikimedia.org", postfix = ""),
                         second_page = c(prefix = "https://species.wikimedia.org", postfix = ""),
                         third_page  = c(prefix = "https://species.wikimedia.org", postfix = ""),
                         fourth_page = c(prefix = "https://species.wikimedia.org", postfix = ""))

colnames <- c("sciname", "vernacular_name")
df <- web_crawler(start_url, crawl_format, pre_postfix_list, colnames)
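
Because stray matches can be picked up when node indices shift, a follow-up cleaning pass is advisable. A minimal sketch in base R (the column name comes from the colnames set above):

## Drop rows whose scientific name is missing or blank (illustrative only)
df <- df[!is.na(df$sciname) & nzchar(trimws(df$sciname)), ]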
