recursive_crawler

Description
Crawls a website recursively, locating and extracting full taxonomic names, including order, suborder, family, subfamily, tribe, subtribe, genus, subgenus, species, and subspecies where given. Users must specify the full structure of the content to be crawled by indicating the corresponding HTML nodes on each layer of pages. The HTML nodes that contain the URLs leading to child pages must also be passed to the function. In addition, users should specify the prefixes or postfixes of child-page URLs, if any.
Usage

recursive_crawler(start_url, crawl_contents, link_urls, pre_postfix_list, output_file)
Arguments

start_url
Required. The starting webpage, which is processed first and leads to child webpages via hyperlinks.
crawl_contents
Required. A full structure describing the crawling format. It should specify the HTML nodes on each layer of websites. The HTML nodes can be obtained with SelectorGadget and should point to the exact contents to be extracted. Only accepts a nested named list of CSS selectors, as in the examples below.
link_urls
Required. The HTML nodes that contain the URLs leading to the child pages of each parent page. The format should be as follows:

link_urls <- list(first_page = '', sec_page = '', third_page = '', fourth_page = '', fifth_page = '')

Note that each page should have one and only one HTML node, and the last page should be set to '' since it has no child page.
pre_postfix_list
Required. The constant parts of child-page URLs that are not shared with the parent webpage. For example, suppose the parent page is "https://species.wikimedia.org/wiki/Belenois" and the child page is "https://species.wikimedia.org/wiki/Belenois_aurota", and the href captured from the page source is "/wiki/Belenois_aurota". Because the child-page URL cannot be obtained by concatenating "https://species.wikimedia.org/wiki/Belenois" and "/wiki/Belenois_aurota", the user needs to specify the prefix "https://species.wikimedia.org" for that layer.
output_file
Required. The path and name of the output file. If it does not contain an absolute path, the file name is interpreted relative to the current working directory.
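The role of pre_postfix_list described above can be sketched with plain string concatenation (a minimal illustration; the URLs are taken from the Wikimedia example below):

```r
# A relative href captured from the page source cannot simply be appended
# to the parent page's URL, so the site root is supplied as a prefix.
parent <- "https://species.wikimedia.org/wiki/Belenois"
href   <- "/wiki/Belenois_aurota"         # captured from the source code
prefix <- "https://species.wikimedia.org" # constant part for this layer

# Wrong: concatenating parent and href duplicates the "/wiki/..." path.
wrong <- paste0(parent, href)

# Right: prefix + href yields the child page's absolute URL.
child_url <- paste0(prefix, href)
# child_url: "https://species.wikimedia.org/wiki/Belenois_aurota"
```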
Value

A data frame containing the result, and a text file written from that data frame to output_file.
Examples

Example 1:
start_url = "http://www.nic.funet.fi/pub/sci/bio/life/insecta/coleoptera/"
crawl_format = list(first_page = list(order_node = '#Coleoptera i',
suborder_node = '.TN .TN b',
family_node = '.LIST .TN'),
sec_page = list(subfamily_node = '.LIST .TN'),
third_page = list(genus_node = '.LIST .TN'),
fourth_page = list(species_node = '.SP .TN .TN i'))
link_urls = list(first_page = '.LIST .TN',
sec_page = '.LIST .TN',
third_page = '.LIST .TN',
fourth_page = '')
pre_postfix_list = list(first_page = c(prefix = "", postfix = ""),
sec_page = c(prefix = "", postfix = ""),
third_page = c(prefix = "", postfix = ""),
fourth_page = c(prefix = "", postfix = ""))
output_file = './Examples/output_data/recursive_crawler_result1.csv'
df_result = recursive_crawler(start_url, crawl_format, link_urls, pre_postfix_list, output_file)
Example 2:
start_url = "https://species.wikimedia.org/wiki/Pierini"
crawl_contents <- list(first_page = list(family_node = '.mw-collapsed+ p a:nth-child(1)',
subfamily_node = '#mw-content-text a:nth-child(3)',
tribe_node = '.selflink'),
sec_page = list(subtribe_node = '.selflink'),
third_page = list(genus_node = '.selflink'),
fourth_page = list(species_node = '.selflink'),
fifth_page = list(subspecies_node = 'h2+ p'))
link_urls = list(first_page = '.selflink~ a',
sec_page = 'i a',
third_page ='i~ i a' ,
fourth_page = 'i+ i a , i:nth-child(13) a',
fifth_page = '')
pre_postfix_list = list(first_page = c(prefix = "https://species.wikimedia.org", postfix = " "),
sec_page = c(prefix = "https://species.wikimedia.org", postfix = ""),
third_page = c(prefix = "https://species.wikimedia.org", postfix = ""),
fourth_page = c(prefix = "https://species.wikimedia.org", postfix = ""),
fifth_page = c(prefix = "https://species.wikimedia.org", postfix = ""))
output_file = './Examples/output_data/recursive_crawler_result2.csv'
df_result = recursive_crawler(start_url, crawl_contents, link_urls, pre_postfix_list, output_file)