web_crawler is a supplementary function to 'recursive_crawler'. It is more flexible when users
need to extract additional contents apart from order, suborder, family, subfamily, tribe, subtribe,
genus, subgenus, species, subspecies if specified. The function handles some unusual situations in which
the html nodes holding identical contents change unexpectedly from page to page. For example, the html node
that refers to exactly the same contents on parent and child webpages may change from "p:nth-child(4)"
to "p:nth-child(6)", with the number varying within a certain range. In general, this function
starts from the top layer page, follows available url links to lower level pages, and crawls
contents only on the lowest layer pages, which is the main difference from 'recursive_crawler'. Since some
useless information may also be grabbed because of the changing html nodes, further data cleaning
by users is strongly recommended.
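The shifting node index described above can be pictured with a small base R sketch. It is only an illustration under stated assumptions, not the package's implementation: `candidate_selectors` is a hypothetical helper, and it assumes the selector contains a single `nth-child(n)` whose number is varied symmetrically, in the way the `search_range` argument is documented.

```r
# Hypothetical helper, not part of the package: expand one selector into the
# candidates obtained by shifting its nth-child index by +/- search_range.
candidate_selectors <- function(selector, search_range = 5) {
  # Extract the first "nth-child(<number>)" occurrence, if there is one.
  hit <- regmatches(selector, regexpr("nth-child\\([0-9]+\\)", selector))
  if (length(hit) == 0) return(selector)  # nothing to vary
  index <- as.integer(gsub("[^0-9]", "", hit))
  # nth-child counts from 1, so never go below 1.
  shifted <- seq(max(1, index - search_range), index + search_range)
  # Substitute each shifted index back into the original selector.
  vapply(
    shifted,
    function(i) sub("nth-child\\([0-9]+\\)", sprintf("nth-child(%d)", i), selector),
    character(1)
  )
}

# With a range of 2, "p:nth-child(4)" expands to indices 2 through 6.
candidates <- candidate_selectors("p:nth-child(4)", search_range = 2)
print(candidates)
```

A crawler could try each candidate in turn on a page, which also suggests why unrelated nodes may be picked up and why cleaning the result afterwards is recommended.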
web_crawler(starturl, crawl_format, pre_postfix_list, colnames = "", search_range = 5)
crawl_format
Required. The html nodes which contain urls that can lead to the child pages of each parent page. The format should be as follows: crawl_format <- list(first_page = '', sec_page = '', third_page = '', fourth_page = '', fifth_page = '') The last page should be defined as '' since it does not have a child page.
pre_postfix_list
Required. The constant parts of the child page urls that cannot be taken from
the parent webpage url. For example, suppose the parent page is "https://species.wikimedia.org/wiki/Belenois"
and the child page is "https://species.wikimedia.org/wiki/Belenois_aurota", and the href part captured
from the source code is "/wiki/Belenois_aurota". Because the subpage url can't be obtained by
concatenating "https://species.wikimedia.org/wiki/Belenois" and "/wiki/Belenois_aurota", the user needs
to specify the prefix and postfix explicitly, as in the example: prefix = "https://species.wikimedia.org", postfix = "".
colnames
Optional. Column names to set in advance to avoid confusion. Defaults to the system default column names.
search_range
Optional. The range over which to vary the original number in the nodes. Note that the actual range is doubled: for example, if the original number in the node is 4 and the range is 2, there will be a for loop from (4-2) to (4+2). Default is 5.
start_url
Required. The starting webpage which needs to be processed and can lead to child webpages via hyperlinks.
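As a rough illustration of how a prefix and postfix turn a captured href into a full child page url, here is a minimal base R sketch. `build_child_url` is a hypothetical helper introduced only for this illustration, and plain concatenation is an assumption about how the two parts are applied, not a statement about the package internals.

```r
# Hypothetical helper, not part of the package: wrap a captured href with the
# prefix and postfix given for one page level.
build_child_url <- function(href, affix) {
  paste0(affix[["prefix"]], href, affix[["postfix"]])
}

# Same shape as one element of pre_postfix_list in the example on this page.
affix <- c(prefix = "https://species.wikimedia.org", postfix = "")
child_url <- build_child_url("/wiki/Belenois_aurota", affix)
print(child_url)
```

Under that assumption, the href "/wiki/Belenois_aurota" resolves against the site root rather than against the parent page url, which is the situation the pre_postfix_list description refers to.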
A data frame containing the results of the crawler.
# Top layer page from which the crawl starts
start_url <- "https://species.wikimedia.org/wiki/Pieridae"
# html nodes to follow on each page level; element names follow the format
# given under crawl_format
crawl_format <- list(first_page = "p:nth-child(5)",
                     sec_page = "p:nth-child(4)",
                     third_page = "p:nth-child(6)",
                     fourth_page = " i:nth-child(11) a, i:nth-child(10) a, i:nth-child(9) a",
                     fifth_page = "div:nth-child(8) , p:nth-child(6)")
# Constant url parts wrapped around the href captured on each page level
pre_postfix_list <- list(first_page = c(prefix = "https://species.wikimedia.org", postfix = ""),
                         sec_page = c(prefix = "https://species.wikimedia.org", postfix = ""),
                         third_page = c(prefix = "https://species.wikimedia.org", postfix = ""),
                         fourth_page = c(prefix = "https://species.wikimedia.org", postfix = ""))
colnames <- c("sciname", "vernacular_name")
df <- web_crawler(start_url, crawl_format, pre_postfix_list, colnames)