View source: R/cas_extract_links.R
cas_extract_links | R Documentation |
Extract direct links to individual content pages from index pages
cas_extract_links(
id = NULL,
batch = "latest",
domain = NULL,
index = TRUE,
index_group = NULL,
output_index = FALSE,
output_index_group = NULL,
include_when = NULL,
exclude_when = NULL,
container = NULL,
container_class = NULL,
container_id = NULL,
custom_xpath = NULL,
custom_css = NULL,
match = NULL,
min_length = NULL,
max_length = NULL,
attribute_type = "href",
append_string = NULL,
remove_string = NULL,
write_to_db = FALSE,
file_format = "html",
keep_only_within_domain = TRUE,
sample = FALSE,
check_previous = TRUE,
check_again = FALSE,
encoding = "UTF-8",
reverse_order = FALSE,
db_connection = NULL,
disconnect_db = TRUE,
...
)
id |
Defaults to NULL. If provided, it should be a vector of integers. Only HTML files corresponding to the given ids are processed. |
domain |
Defaults to NULL. Web domain of the website. It is added at the beginning of each link found. If links in the page already include the full web address, this argument should be left unset. |
output_index |
Defaults to FALSE. If FALSE, new links are added to the
contents table. If TRUE, the links extracted will be stored again as
index, using |
output_index_group |
Defaults to NULL. Relevant only when |
include_when |
Part of the URL found only in links to individual articles to be downloaded. If more than one string is provided, all links that contain any of the given strings are included. |
exclude_when |
If a URL includes this string, it is excluded from the output. One or more strings may be provided. |
container |
Defaults to NULL. Type of html container from where links
are to be extracted, such as "div", "ul", and others. Either
|
container_class |
Defaults to NULL. If provided, also |
container_id |
Defaults to NULL. If provided, also |
custom_xpath |
Defaults to NULL. If given, all other parameters are ignored and the given XPath is used instead. |
match |
Defaults to NULL. Used when extracting links from JSON files. Name of the property from which the URL is to be extracted. N.B. Only partly implemented; please report issues along with the specific example where they emerged. |
min_length |
If a link is shorter than the number of characters given in min_length, it is excluded from the output. |
max_length |
If a link is longer than the number of characters given in max_length, it is excluded from the output. |
attribute_type |
Defaults to "href". Type of attribute to extract from links. |
append_string |
If provided, appends the given string to each extracted link. Typically used to create links to print or mobile versions of the extracted page. |
remove_string |
If provided, removes the given string (or strings) from links. |
write_to_db |
Logical, defaults to FALSE. If TRUE, stores newly extracted links in the database, associates each of them with an id, and records the source of each link. |
keep_only_within_domain |
Logical, defaults to TRUE. If TRUE and domain is given, links to external websites are dropped. |
check_previous |
Defaults to TRUE. If TRUE, checks whether newly found links
are already stored in the database and, if so, discards them. If
FALSE, and |
check_again |
Defaults to FALSE. If FALSE, files from which at least one link has already been extracted are not re-processed. If TRUE, they are processed again. By default, only new links are then actually included in the output or stored in the local database. |
reverse_order |
Logical, defaults to FALSE. If TRUE, index files are
processed in reverse order of |
db_connection |
Defaults to NULL. If NULL, uses the local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
disconnect_db |
Defaults to TRUE. If FALSE, leaves the connection to the database open. |
... |
Passed to |
A data frame.
## Not run:
links <- cas_extract_links(domain = "http://www.example.com/")
## End(Not run)
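The filtering arguments described above (include_when, exclude_when, min_length, max_length, remove_string, append_string) can be illustrated on a plain character vector of links. The following is only a minimal base R sketch of the documented behaviour, not the package's implementation: filter_links() is a hypothetical helper, and both the order in which the filters are applied and the use of literal (non-regex) matching are assumptions made for illustration.

```r
# Hypothetical helper, NOT part of the package: mimics the documented
# filtering arguments on a character vector of links.
# Assumptions: literal string matching, and filters applied in the order below.
filter_links <- function(links,
                         include_when = NULL,
                         exclude_when = NULL,
                         min_length = NULL,
                         max_length = NULL,
                         remove_string = NULL,
                         append_string = NULL) {
  # TRUE where a link contains at least one of the given strings
  contains_any <- function(x, patterns) {
    Reduce(`|`,
           lapply(patterns, function(p) grepl(p, x, fixed = TRUE)),
           rep(FALSE, length(x)))
  }

  # Keep links that contain any of the include_when strings
  if (!is.null(include_when)) {
    links <- links[contains_any(links, include_when)]
  }
  # Drop links that contain any of the exclude_when strings
  if (!is.null(exclude_when)) {
    links <- links[!contains_any(links, exclude_when)]
  }
  # Drop links shorter than min_length or longer than max_length
  if (!is.null(min_length)) {
    links <- links[nchar(links) >= min_length]
  }
  if (!is.null(max_length)) {
    links <- links[nchar(links) <= max_length]
  }
  # Remove each given string from the remaining links
  for (s in remove_string) {
    links <- gsub(s, "", links, fixed = TRUE)
  }
  # Append a suffix, e.g. to point to a print version of the page;
  # the length check avoids paste0() turning an empty vector into the suffix
  if (!is.null(append_string) && length(links) > 0) {
    links <- paste0(links, append_string)
  }
  unique(links)
}

# Made-up example links, for illustration only
links <- c(
  "https://www.example.com/news/2023/first-article",
  "https://www.example.com/news/2023/second-article?utm=feed",
  "https://www.example.com/tag/politics",
  "https://www.example.com/news/"
)

result <- filter_links(
  links,
  include_when = "/news/",
  exclude_when = "/tag/",
  min_length = 35,
  remove_string = "?utm=feed",
  append_string = "?print=1"
)
print(result)
```

In this sketch the tag page is dropped because it does not contain "/news/", and the bare "/news/" section link is dropped by min_length. Checking candidate values of include_when and min_length against a handful of real links in this way, before calling cas_extract_links() with write_to_db = TRUE, may help avoid storing unwanted links.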