recursive_crawler: Crawl and extract full taxonomic names from iterative...


Description

recursive_crawler crawls webpages recursively, locating and extracting full taxonomic names, including order, suborder, family, subfamily, tribe, subtribe, genus, subgenus, species and subspecies where given. Users specify the full crawling structure by indicating the corresponding HTML nodes on each page level. The HTML nodes that contain the URLs leading to child pages must also be passed to the function, and users should specify the prefix or postfix of child-page URLs if there is one.

Usage

recursive_crawler(start_url, crawl_contents, link_urls, pre_postfix_list, output_file)

Arguments

start_url

Required. The starting webpage which needs to be processed and can lead to child webpages via hyperlinks.

crawl_contents

Required. A list describing the full crawling structure. It should specify the HTML nodes to extract on each layer of pages; these nodes can be obtained with SelectorGadget and must point to the exact contents to be extracted. Only the names order_node, suborder_node, family_node, subfamily_node, tribe_node, subtribe_node, genus_node, subgenus_node, species_node and subspecies_node are accepted. The format should be as follows: crawl_contents <- list(first_page = list(order_node = '', suborder_node = '', family_node = ''), sec_page = list(subfamily_node = ''), third_page = list(genus_node = ''), fourth_page = list(species_node = '')) Note that the number of pages is not restricted and the pages can be named as you like.

link_urls

Required. The HTML nodes containing the URLs that lead from each parent page to its child pages. The format should be as follows: link_urls <- list(first_page = '', sec_page = '', third_page = '', fourth_page = '', fifth_page = '') Note that each page must have one and only one HTML node, and the last page should be set to '' since it has no child pages.

pre_postfix_list

Required. The constant parts of child-page URLs that cannot be taken from the parent page URL. For example, suppose the parent page is "https://species.wikimedia.org/wiki/Belenois", the child page is "https://species.wikimedia.org/wiki/Belenois_aurota", and the href captured from the source code is "/wiki/Belenois_aurota". Because the child-page URL cannot be obtained by concatenating "https://species.wikimedia.org/wiki/Belenois" and "/wiki/Belenois_aurota", the user needs to set the prefix to "https://species.wikimedia.org" and the postfix to "" (see the sketch after the argument list). The standard format of this parameter is as follows: pre_postfix_list <- list(first_page = c(prefix = "", postfix = ""), sec_page = c(prefix = "", postfix = ""), third_page = c(prefix = "", postfix = ""), fourth_page = c(prefix = "", postfix = "")) Note that the pages can be named as you like, but the names prefix and postfix cannot be changed.

output_file

Required. The path and name of the file for writing. If it does not contain an absolute path, the file name is relative to the current working directory.
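
A minimal sketch of how a child-page URL is assembled from the prefix, the captured href and the postfix, assuming plain string concatenation as described for pre_postfix_list (build_child_url is a hypothetical helper, not part of the package):

# Hypothetical illustration: concatenate prefix + href + postfix
# to form the child-page URL, as described for pre_postfix_list.
build_child_url <- function(prefix, href, postfix) {
  paste0(prefix, href, postfix)
}

build_child_url("https://species.wikimedia.org", "/wiki/Belenois_aurota", "")
# [1] "https://species.wikimedia.org/wiki/Belenois_aurota"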

Value

A data frame containing the result.

A TXT file written from the above data frame.
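
If the written file needs to be re-read later, a base reader such as read.csv or read.delim should work; this is only a sketch, assuming the file is plain delimited text (the exact delimiter depends on how the package writes it):

# Sketch only: re-read the written result, assuming a delimited text layout.
# Swap read.csv for read.delim if the file turns out to be tab-separated.
result <- read.csv('./Examples/output_data/recursive_crawler_result1.csv',
                   stringsAsFactors = FALSE)
head(result)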

Examples

## Example 1
start_url = "http://www.nic.funet.fi/pub/sci/bio/life/insecta/coleoptera/"
crawl_format = list(first_page = list(order_node = '#Coleoptera i',
                                      suborder_node = '.TN .TN b',
                                      family_node = '.LIST .TN'),
                    sec_page = list(subfamily_node = '.LIST .TN'),
                    third_page = list(genus_node = '.LIST .TN'),
                    fourth_page = list(species_node = '.SP .TN .TN i'))
link_urls = list(first_page = '.LIST .TN',
                 sec_page = '.LIST .TN',
                 third_page = '.LIST .TN',
                 fourth_page = '')
pre_postfix_list = list(first_page = c(prefix = "", postfix = ""),
                        sec_page = c(prefix = "", postfix = ""),
                        third_page = c(prefix = "", postfix = ""),
                        fourth_page = c(prefix = "", postfix = ""))
output_file = './Examples/output_data/recursive_crawler_result1.csv'
df_result = recursive_crawler(start_url, crawl_format, link_urls, pre_postfix_list, output_file)

## Example 2
start_url = "https://species.wikimedia.org/wiki/Pierini"
crawl_contents <- list(first_page = list(family_node = '.mw-collapsed+ p a:nth-child(1)',
                                         subfamily_node = '#mw-content-text a:nth-child(3)',
                                         tribe_node = '.selflink'),
                       sec_page = list(subtribe_node = '.selflink'),
                       third_page = list(genus_node = '.selflink'),
                       fourth_page = list(species_node = '.selflink'),
                       fifth_page = list(subspecies_node = 'h2+ p'))
link_urls = list(first_page = '.selflink~ a',
                 sec_page = 'i a',
                 third_page = 'i~ i a',
                 fourth_page = 'i+ i a , i:nth-child(13) a',
                 fifth_page = '')
pre_postfix_list = list(first_page = c(prefix = "https://species.wikimedia.org", postfix = " "),
                         sec_page = c(prefix = "https://species.wikimedia.org", postfix = ""),
                         third_page = c(prefix = "https://species.wikimedia.org", postfix = ""),
                         fourth_page = c(prefix = "https://species.wikimedia.org", postfix = ""),
                         fifth_page = c(prefix = "https://species.wikimedia.org", postfix = ""))
output_file = './Examples/output_data/recursive_crawler_result2.csv'
df_result = recursive_crawler(start_url, crawl_contents, link_urls, pre_postfix_list, output_file)
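
As a usage note, the returned data frame can be inspected with the usual helpers; the exact columns depend on which rank nodes were requested in crawl_contents:

# Inspect the crawled result; column names depend on the rank nodes
# specified in crawl_contents (e.g. family, subfamily, tribe, ...).
str(df_result)
head(df_result)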
