recursive_crawler: Crawl and extract full taxonomic names from iterative...


Description

recursive_crawler crawls webpages recursively, locating and extracting full taxonomic names, including order, suborder, family, subfamily, tribe, subtribe, genus, subgenus, species and subspecies where given. Users specify the full crawling structure by indicating the corresponding HTML nodes on each page level. The HTML nodes that contain the URLs leading to child pages must also be passed to the function, and users should specify the prefix or postfix of child-page URLs if there is one.

Usage

recursive_crawler(start_url, crawl_contents, link_urls, pre_postfix_list, output_file)

Arguments

start_url

Required. The starting webpage which needs to be processed and can lead to child webpages via hyperlinks.

crawl_contents

Required. A list describing the full crawling structure. It should specify the HTML nodes to extract on each layer of pages; these nodes can be obtained with SelectorGadget and must point to the exact contents to be extracted. Only the names order_node, suborder_node, family_node, subfamily_node, tribe_node, subtribe_node, genus_node, subgenus_node, species_node and subspecies_node are accepted. The format should be as follows: crawl_contents <- list(first_page = list(order_node = '', suborder_node = '', family_node = ''), sec_page = list(subfamily_node = ''), third_page = list(genus_node = ''), fourth_page = list(species_node = '')) Note that the number of pages is not restricted and the pages can be named as you like.

link_urls

Required. The HTML nodes containing the URLs that lead from each parent page to its child pages. The format should be as follows: link_urls <- list(first_page = '', sec_page = '', third_page = '', fourth_page = '', fifth_page = '') Note that each page must have one and only one HTML node, and the last page should be set to '' since it has no child pages.

pre_postfix_list

Required. The constant parts of child-page URLs that cannot be taken from the parent page URL. For example, suppose the parent page is "https://species.wikimedia.org/wiki/Belenois", the child page is "https://species.wikimedia.org/wiki/Belenois_aurota", and the href captured from the source code is "/wiki/Belenois_aurota". Because the child-page URL cannot be obtained by concatenating "https://species.wikimedia.org/wiki/Belenois" and "/wiki/Belenois_aurota", the user needs to set the prefix to "https://species.wikimedia.org" and the postfix to "" (see the sketch after the argument list). The standard format of this parameter is as follows: pre_postfix_list <- list(first_page = c(prefix = "", postfix = ""), sec_page = c(prefix = "", postfix = ""), third_page = c(prefix = "", postfix = ""), fourth_page = c(prefix = "", postfix = "")) Note that the pages can be named as you like, but the names prefix and postfix cannot be changed.

output_file

Required. The path and name of the file for writing. If it does not contain an absolute path, the file name is relative to the current working directory.
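
A minimal sketch of how a child-page URL is assembled from the prefix, the captured href and the postfix, assuming plain string concatenation as described for pre_postfix_list (build_child_url is a hypothetical helper, not part of the package):

# Hypothetical illustration: concatenate prefix + href + postfix
# to form the child-page URL, as described for pre_postfix_list.
build_child_url <- function(prefix, href, postfix) {
  paste0(prefix, href, postfix)
}

build_child_url("https://species.wikimedia.org", "/wiki/Belenois_aurota", "")
# [1] "https://species.wikimedia.org/wiki/Belenois_aurota"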

Value

A data frame containing the result.

A TXT file written from the above data frame.
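
If the written file needs to be re-read later, a base reader such as read.csv or read.delim should work; this is only a sketch, assuming the file is plain delimited text (the exact delimiter depends on how the package writes it):

# Sketch only: re-read the written result, assuming a delimited text layout.
# Swap read.csv for read.delim if the file turns out to be tab-separated.
result <- read.csv('./Examples/output_data/recursive_crawler_result1.csv',
                   stringsAsFactors = FALSE)
head(result)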

Examples

## Example 1
start_url = "http://www.nic.funet.fi/pub/sci/bio/life/insecta/coleoptera/"
crawl_format = list(first_page = list(order_node = '#Coleoptera i',
                                      suborder_node = '.TN .TN b',
                                      family_node = '.LIST .TN'),
                    sec_page = list(subfamily_node = '.LIST .TN'),
                    third_page = list(genus_node = '.LIST .TN'),
                    fourth_page = list(species_node = '.SP .TN .TN i'))
link_urls = list(first_page = '.LIST .TN',
                 sec_page = '.LIST .TN',
                 third_page = '.LIST .TN',
                 fourth_page = '')
pre_postfix_list = list(first_page = c(prefix = "", postfix = ""),
                        sec_page = c(prefix = "", postfix = ""),
                        third_page = c(prefix = "", postfix = ""),
                        fourth_page = c(prefix = "", postfix = ""))
output_file = './Examples/output_data/recursive_crawler_result1.csv'
df_result = recursive_crawler(start_url, crawl_format, link_urls, pre_postfix_list, output_file)

## Example 2
start_url = "https://species.wikimedia.org/wiki/Pierini"
crawl_contents <- list(first_page = list(family_node = '.mw-collapsed+ p a:nth-child(1)',
                                         subfamily_node = '#mw-content-text a:nth-child(3)',
                                         tribe_node = '.selflink'),
                       sec_page = list(subtribe_node = '.selflink'),
                       third_page = list(genus_node = '.selflink'),
                       fourth_page = list(species_node = '.selflink'),
                       fifth_page = list(subspecies_node = 'h2+ p'))
link_urls = list(first_page = '.selflink~ a',
                 sec_page = 'i a',
                 third_page = 'i~ i a',
                 fourth_page = 'i+ i a , i:nth-child(13) a',
                 fifth_page = '')
pre_postfix_list = list(first_page = c(prefix = "https://species.wikimedia.org", postfix = " "),
                         sec_page = c(prefix = "https://species.wikimedia.org", postfix = ""),
                         third_page = c(prefix = "https://species.wikimedia.org", postfix = ""),
                         fourth_page = c(prefix = "https://species.wikimedia.org", postfix = ""),
                         fifth_page = c(prefix = "https://species.wikimedia.org", postfix = ""))
output_file = './Examples/output_data/recursive_crawler_result2.csv'
df_result = recursive_crawler(start_url, crawl_contents, link_urls, pre_postfix_list, output_file)
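
As a usage note, the returned data frame can be inspected with the usual helpers; the exact columns depend on which rank nodes were requested in crawl_contents:

# Inspect the crawled result; column names depend on the rank nodes
# specified in crawl_contents (e.g. family, subfamily, tribe, ...).
str(df_result)
head(df_result)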
