Description Usage Arguments Value Author(s) Examples
Fetch and parse a document by URL to extract page info, HTML source and links (internal/external). The fetching process can be done by an HTTP GET request or through a webdriver (phantomjs) which simulates real browser rendering.
url: character, the URL to fetch and parse.
id: numeric, an id to identify a specific web page in a website collection; auto-generated by the Rcrawler function.
lev: numeric, the depth level of the web page; auto-generated by the Rcrawler function.
IndexErrPages: character vector, the HTTP error status codes that can still be processed; by default only successful (200) responses are parsed, so error pages such as 404 must be listed explicitly (see the Examples).
Useragent: character, the name of the request sender; defaults to "Rcrawler". We recommend using a regular browser user-agent to avoid being blocked by some servers.
Timeout: numeric, the request timeout in seconds; defaults to 5.
use_proxy: an object created by the httr::use_proxy() function, if you want to use a proxy to retrieve the web page (does not work with the webdriver).
URLlenlimit: integer, the maximum URL length to process; defaults to 255 characters (useful to avoid spider traps).
urlExtfilter: character vector, the list of file extensions to exclude from parsing; currently only HTML pages are processed (parsed, scraped). Supply your own character vector of extensions to override the default list.
urlregexfilter: character vector, one or more regular expressions used to filter the extracted internal URLs.
encod: character, the web page character encoding.
urlbotfiler: character vector, directories/files restricted by robots.txt.
removeparams: character vector, the list of URL parameters to be removed from the web page's internal links.
removeAllparams: boolean, if TRUE the list of scraped URLs will have no parameters.
ExternalLInks: boolean, default FALSE; if TRUE, external links are also returned.
urlsZoneXpath: XPath pattern of the page section from which links should be exclusively gathered.
Browser: the client object of a remote headless web driver (virtual browser), created by run_browser().
RenderingDelay: the time, in seconds, required for the web page to be fully rendered.
Returns a list of three elements: the first is a list containing the web page details (url, encoding type, content type, content, etc.), the second is a character vector containing the retrieved internal URLs, and the third is a vector of external URLs.
Salim Khalil
## Not run:
###### Fetch a URL using a GET request:
######################################################
##
## Very fast, but can't fetch javascript-rendered pages or sections
# Fetch the page with the default config; returns page info and internal links
page<-LinkExtractor(url="http://www.glofile.com")
# This will also return external links
page<-LinkExtractor(url="http://www.glofile.com", ExternalLInks = TRUE)
# Specify a Useragent to avoid being blocked by some websites' rules
page<-LinkExtractor(url="http://www.glofile.com", ExternalLInks = TRUE,
Useragent = "Mozilla/5.0 (Windows NT 6.3; Win64; x64)",)
# By default, only successful HTTP pages are parsed; to force parsing of
# error pages like 404 you need to specify IndexErrPages.
page<-LinkExtractor(url="http://www.glofile.com/404notfoundpage",
ExternalLInks = TRUE, IndexErrPages = c(200,404))
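# The link-filtering arguments documented above can also be combined in one
# call; the regular expression, parameter names and XPath below are purely
# illustrative values, not taken from the target site:
page<-LinkExtractor(url="http://www.glofile.com",
                    urlregexfilter = "/2017/",
                    removeparams = c("utm_source","utm_medium"),
                    urlsZoneXpath = "//div[@id='content']")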
#### Use GET request with a proxy
#
proxy<-httr::use_proxy("190.90.100.205",41000)
pageinfo<-LinkExtractor(url="http://glofile.com/index.php/2017/06/08/taux-nette-detente/",
use_proxy = proxy)
# Note: the use_proxy argument cannot be configured with the webdriver
###### Fetch a URL using a web driver (virtual browser)
######################################################
##
## Slow, because a headless browser called phantomjs will simulate
## a user session on a website. It's useful for web pages having important
## javascript-rendered sections such as menus.
## We recommend that you first try the normal request above; if the function
## returns a forbidden 403 status code or an empty/incomplete source code body,
## then try to set a regular user agent like
## Useragent = "Mozilla/5.0 (Windows NT 6.3; Win64; x64)";
## if you still have issues then you should try to set up a virtual browser.
#1 Download and install phantomjs headless browser
install_browser()
#2 Start the browser process (takes about 30 seconds usually)
br <-run_browser()
#3 call the function
page<-LinkExtractor(url="http://www.master-maroc.com", Browser = br,
ExternalLInks = TRUE)
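# The RenderingDelay argument (in seconds) gives javascript-heavy pages extra
# time to finish rendering before links are extracted; the delay value below
# is only illustrative:
page<-LinkExtractor(url="http://www.master-maroc.com", Browser = br,
                    RenderingDelay = 5)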
#4 Don't forget to stop the browser at the end of all your work with it
stop_browser(br)
###### Fetch a web page that requires authentication
#########################################################
## In some cases you may need to retrieve content from a web page which
## requires authentication via a login page, like private forums or platforms.
## In this case you need to run the LoginSession function to establish an
## authenticated browser session, then use LinkExtractor to fetch
## the URL using the authenticated session.
## In the example below we will try to fetch a private blog post which
## requires authentication.
# If you retrieve the page using the regular LinkExtractor function or your browser,
page<-LinkExtractor("http://glofile.com/index.php/2017/06/08/jcdecaux/")
# the post is not visible because it's private.
# Now we will try to log in to access this post using the following credentials:
# username: demo and password: rc@pass@r
#1 Download and install phantomjs headless browser (skip if installed)
install_browser()
#2 start browser process
br <-run_browser()
#3 Create an authenticated session
# see LoginSession for more details
LS<-LoginSession(Browser = br, LoginURL = 'http://glofile.com/wp-login.php',
LoginCredentials = c('demo','rc@pass@r'),
cssLoginFields =c('#user_login', '#user_pass'),
cssLoginButton='#wp-submit' )
# Check if the login succeeded
LS$session$getTitle()
#Or
LS$session$getUrl()
#Or
LS$session$takeScreenshot(file = 'sc.png')
#4 Retrieve the target private page using the logged-in session
page<-LinkExtractor(url='http://glofile.com/index.php/2017/06/08/jcdecaux/',Browser = LS)
#5 Don't forget to stop the browser at the end of all your work with it
stop_browser(LS)
################### Returned Values #####################
#########################################################
# The returned 'page' variable includes:
# 1- List of page details,
# 2- Internal links
# 3- External links.
#1 Vector of extracted internal links (in-links)
page$InternalLinks
#2 Vector of extracted external links (out-links)
page$ExternalLinks
#3 List of page details
page$Info
# Requested Url
page$Info$Url
# Total number of extracted links
page$Info$SumLinks
# The HTTP response status code: 200, 401, 300, ...
page$Info$Status_code
# The MIME content type from the HTTP response
page$Info$Content_type
# Page text encoding: UTF-8, ISO-8859-1, ...
page$Info$Encoding
# Page source code
page$Info$Source_page
# Page title
page$Info$Title
# Other returned values (page$Info$Id, page$Info$Crawl_level,
# page$Info$Crawl_status) are only used by the Rcrawler function.
## End(Not run)