LinkExtractor


Description

Fetch and parse a document by URL to extract page info, HTML source and links (internal/external). Fetching can be done either with an HTTP GET request or through a webdriver (PhantomJS), which simulates a real browser rendering the page.
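
A minimal sketch of the two fetching modes, reusing the illustrative URL from the Examples section below:

# Plain HTTP GET request (fast, no JavaScript rendering)
page <- LinkExtractor(url = "http://www.glofile.com")

# PhantomJS webdriver (renders JavaScript; requires install_browser()/run_browser())
br <- run_browser()
page <- LinkExtractor(url = "http://www.glofile.com", Browser = br)
stop_browser(br)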

Usage

LinkExtractor(url, id, lev, IndexErrPages, Useragent, Timeout = 6,
  use_proxy = NULL, URLlenlimit = 255, urlExtfilter, urlregexfilter,
  encod, urlbotfiler, removeparams, removeAllparams = FALSE,
  ExternalLInks = FALSE, urlsZoneXpath = NULL, Browser,
  RenderingDelay = 0)

Arguments

url

character, url to fetch and parse.

id

numeric, an id identifying a specific web page in a website collection; auto-generated by the Rcrawler function.

lev

numeric, the depth level of the web page; auto-generated by the Rcrawler function.

IndexErrPages

character vector, the HTTP status codes that should still be processed. By default IndexErrPages<-c(200), meaning only successful page requests are parsed. E.g., to also parse 404 error pages, use IndexErrPages<-c(200,404).

Useragent

character, the User-Agent name identifying the request sender; defaults to "Rcrawler". We recommend using a regular browser user-agent to avoid being blocked by some servers.

Timeout

numeric, the request timeout in seconds; defaults to 6 as shown in Usage.

use_proxy

object created by the httr::use_proxy() function, if you want to use a proxy to retrieve the web page (does not work with the webdriver).

URLlenlimit

integer, the maximum URL length to process; defaults to 255 characters (useful to avoid spider traps).

urlExtfilter

character vector, the list of file extensions to exclude from parsing. Currently, only HTML pages are processed (parsed, scraped); to define your own list use urlExtfilter<-c(ext1,ext2,ext3).

urlregexfilter

character vector, one or more regular expressions used to filter the extracted internal URLs (a combined filtering sketch appears after this argument list).

encod

character, web page character encoding

urlbotfiler

character vector, directories/files restricted by robots.txt.

removeparams

character vector, a list of URL parameters to be removed from the web page's internal links.

removeAllparams

boolean, if TRUE the list of scraped URLs will have all parameters removed.

ExternalLInks

boolean, default FALSE; if set to TRUE, external links are also returned.

urlsZoneXpath

XPath pattern of the page section from which links should be exclusively collected.

Browser

the client object of a remote headless web driver (virtual browser), created by br<-run_browser(), or a logged-in browser session object created by LoginSession, after installing the web driver agent with install_browser(). See the examples below.

RenderingDelay

numeric, the time required for a web page to be fully rendered, in seconds.
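
A combined sketch of the link-filtering arguments described above; the regular expression, parameter names, file extensions and XPath below are illustrative values, not package defaults:

page <- LinkExtractor(url = "http://www.glofile.com",
          urlregexfilter = "/2017/",                       # filter internal links by this pattern
          removeparams   = c("utm_source", "utm_medium"),  # strip these URL parameters from links
          urlExtfilter   = c("pdf", "jpg", "zip"),         # skip links with these file extensions
          urlsZoneXpath  = "//div[@id='content']",         # collect links only from this page zone
          encod          = "UTF-8")                        # declare the page encoding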

Value

Returns a list of three elements: the first is a list containing the web page details (url, encoding type, content type, content, etc.), the second is a character vector of the retrieved internal URLs, and the third is a vector of external URLs.
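
A quick way to inspect the returned structure (a sketch; the element names in the comments follow the Examples section below):

page <- LinkExtractor(url = "http://www.glofile.com")
str(page, max.level = 1)     # page$Info, page$InternalLinks, page$ExternalLinks
page$Info$Status_code        # HTTP status of the response, e.g. 200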

Author(s)

Salim Khalil

Examples

## Not run: 

###### Fetch a URL using GET request :
######################################################
##
## Very fast, but can't fetch JavaScript-rendered pages or sections

# Fetch the page with the default config; returns page info and internal links

page<-LinkExtractor(url="http://www.glofile.com")

# This will also return external links

page<-LinkExtractor(url="http://www.glofile.com", ExternalLInks = TRUE)

# Specify a Useragent to overcome bot-blocking rules on some websites

page<-LinkExtractor(url="http://www.glofile.com", ExternalLInks = TRUE,
       Useragent = "Mozilla/5.0 (Windows NT 6.3; Win64; x64)")

# By default, only successful HTTP pages are parsed; therefore, to force
# parsing of error pages like 404 you need to specify IndexErrPages:

page<-LinkExtractor(url="http://www.glofile.com/404notfoundpage",
      ExternalLInks = TRUE, IndexErrPages = c(200,404))


#### Use GET request with a proxy
#
proxy<-httr::use_proxy("190.90.100.205",41000)
pageinfo<-LinkExtractor(url="http://glofile.com/index.php/2017/06/08/taux-nette-detente/",
use_proxy = proxy)

# Note: the use_proxy argument cannot be configured with the webdriver

###### Fetch a URL using a web driver (virtual browser)
######################################################
##
## Slow, because a headless browser (PhantomJS) simulates a user session
## on the website. It's useful for web pages with important
## JavaScript-rendered sections such as menus.
## We recommend first trying the normal GET request above; if the function
## returns a forbidden 403 status code or an empty/incomplete source code body,
## try setting a regular useragent like
## Useragent = "Mozilla/5.0 (Windows NT 6.3; Win64; x64)";
## if you still have issues, then try setting up a virtual browser.

#1 Download and install phantomjs headless browser
install_browser()

#2 Start the browser process (usually takes about 30 seconds)
br <- run_browser()

#3 call the function
page<-LinkExtractor(url="http://www.master-maroc.com", Browser = br,
      ExternalLInks = TRUE)
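
# Optionally, add a RenderingDelay (in seconds) so JavaScript-heavy pages have
# time to finish rendering before parsing; the 3-second value here is an
# illustrative choice, not a default:
page<-LinkExtractor(url="http://www.master-maroc.com", Browser = br,
      ExternalLInks = TRUE, RenderingDelay = 3)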

#4 Don't forget to stop the browser at the end of your work with it
stop_browser(br)

###### Fetch a web page that requires authentication
#########################################################
## In some cases you may need to retrieve content from a web page which
## requires authentication via a login page, such as private forums or platforms.
## In this case you need to run the LoginSession function to establish an
## authenticated browser session, then use LinkExtractor to fetch
## the URL using the authenticated session.
## In the example below we will try to fetch a private blog post which
## requires authentication.

## If you retrieve the page with a regular LinkExtractor call (or in your browser),
## the post is not visible because it's private:
page<-LinkExtractor("http://glofile.com/index.php/2017/06/08/jcdecaux/")
## Now we will try to log in to access this post using the following credentials:
## username: demo and password: rc@pass@r

#1 Download and install phantomjs headless browser (skip if installed)
install_browser()

#2 Start the browser process
br <- run_browser()

#3 Create an authenticated session
#  see LoginSession for more details

 LS<-LoginSession(Browser = br, LoginURL = 'http://glofile.com/wp-login.php',
                LoginCredentials = c('demo','rc@pass@r'),
                cssLoginFields =c('#user_login', '#user_pass'),
                cssLoginButton='#wp-submit' )

# Check if the login was successful
LS$session$getTitle()
#Or
LS$session$getUrl()
#Or
LS$session$takeScreenshot(file = 'sc.png')

#4 Retrieve the target private page using the logged-in session
page<-LinkExtractor(url='http://glofile.com/index.php/2017/06/08/jcdecaux/',Browser = LS)

#5 Don't forget to stop the browser at the end of your work with it
stop_browser(LS)


################### Returned Values #####################
#########################################################

# The returned 'page' variable should include:
# 1- a list of page details,
# 2- internal links,
# 3- external links.

#1 Vector of extracted internal links  (in-links)
page$InternalLinks

#2 Vector of extracted external links  (out-links)
page$ExternalLinks

# List of page details
page$Info

# Requested Url
page$Info$Url

# Sum of extracted links
page$Info$SumLinks

# The status code of the HTTP response 200, 401, 300...
page$Info$Status_code

# The MIME type of this content from HTTP response
page$Info$Content_type

# Page text encoding: UTF-8, ISO-8859-1, ...
page$Info$Encoding

# Page source code
page$Info$Source_page

# Page title
page$Info$Title

# Other returned values (page$Info$Id, page$Info$Crawl_level,
# page$Info$Crawl_status) are only used by the Rcrawler function.



## End(Not run)
