curl_parse_url: Normalizing URL parser

View source: R/parser.R

curl_parse_url    R Documentation

Description

Interface to the libcurl URL parser. URLs are automatically normalized where possible, for example by resolving relative paths or normalizing url-encoded queries (see Examples). When parsing hyperlinks from an HTML document, set baseurl to the location of the document itself so that relative links can be resolved.

Usage

curl_parse_url(url, baseurl = NULL, decode = TRUE, params = TRUE)

curl_modify_url(
  url = NULL,
  scheme = NULL,
  host = NULL,
  port = NULL,
  path = NULL,
  query = NULL,
  fragment = NULL,
  user = NULL,
  password = NULL,
  params = NULL
)

Arguments

url

either a URL string or a list returned by curl_parse_url(). In curl_modify_url(), this is the URL to modify using the other parameters.

baseurl

base URL used as the parent when url may be a relative path

decode

automatically url-decode output. Set to FALSE to get output in url-encoded format.

params

named character vector with HTTP GET parameters. This is automatically converted to application/x-www-form-urlencoded and overrides query. (In curl_parse_url() this argument is a logical flag, TRUE by default; see Usage.)

scheme

string with e.g. https. Required if no url parameter was given.

host

string with hostname. Required if no url parameter was given.

port

string or number with port, e.g. "443".

path

piece of the url starting with / up to ? or #

query

piece of the url starting with ? up to #. Only used if params is not given.

fragment

part of the url starting with #.

user

string with username

password

string with password

Details

A valid URL contains at least a scheme and a host; other pieces are optional. If either is missing, the parser raises an error. Otherwise it returns a list with the following elements:

  • url: the normalized input URL

  • scheme: the protocol part before the :// (required)

  • host: name of host without port (required)

  • port: decimal port number between 0 and 65535

  • path: normalized path up to the ? of the url

  • query: search query, the part between the ? and # of the url. Use params below to get individual parameters from the query.

  • fragment: the hash part after the # of the url

  • user: authentication username

  • password: authentication password

  • params: named vector with parameters from query if set

Each element above is either a string or NULL, except for params, which is always a character vector with length equal to the number of parameters.

Note that the params field is only usable if the query is in the usual application/x-www-form-urlencoded format, which is technically not part of the RFC. Some services may use e.g. a JSON blob as the query, in which case the parsed params field here can be ignored. There is no way for the parser to automatically infer or validate the query format; this is up to the caller.

For more details on the URL format see RFC 3986 or the steps explained in the WHATWG basic URL parser.

On platforms that do not have a recent enough curl version (basically only RHEL-8), the Ada URL library is used as a fallback. Results should be identical, though curl has nicer error messages. This is a temporary solution; we plan to remove the fallback when old systems are no longer supported.

You can use curl_modify_url() either to modify an existing URL or to create a new URL from scratch. Arguments are automatically URL-encoded where needed, unless wrapped in I(). If params is given, it is converted into an application/x-www-form-urlencoded string which overrides query. When modifying a URL, use an empty string "" to unset a piece of the URL.

Examples

url <- "https://jerry:secret@google.com:888/foo/bar?test=123#bla"
curl_parse_url(url)

# Resolve relative links from a baseurl
curl_parse_url("/somelink", baseurl = url)
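
# Individual query parameters are returned in the params field.
# A small sketch: example.com and the parameter names are placeholders, and
# this assumes the query uses application/x-www-form-urlencoded format.
parsed <- curl_parse_url("https://example.com/search?q=curl&page=2")
parsed$query
parsed$params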

# Paths get normalized
curl_parse_url("https://foobar.com/foo/bar/../baz/../yolo")$url

# Also normalizes URL-encoding (these URLs are equivalent):
url1 <- "https://ja.wikipedia.org/wiki/\u5bff\u53f8"
url2 <- "https://ja.wikipedia.org/wiki/%e5%af%bf%e5%8f%b8"
curl_parse_url(url1)$path
curl_parse_url(url2)$path
curl_parse_url(url1, decode = FALSE)$path
curl_parse_url(url2, decode = FALSE)$path
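
# Modify or build URLs with curl_modify_url(). This is a sketch based on the
# Details section, assuming an installed curl version that provides
# curl_modify_url(); example.com and all parameter values are placeholders.
base <- "https://jerry:secret@google.com:888/foo/bar?test=123#bla"

# Replace individual pieces of an existing URL
moved <- curl_modify_url(base, host = "example.com", path = "/new/path")
moved

# params is converted to a url-encoded query string and overrides query
searched <- curl_modify_url(base, params = c(q = "hello world", page = "2"))
searched

# An empty string unsets a piece of the URL
curl_modify_url(base, fragment = "")

# The url argument may also be a list returned by curl_parse_url()
curl_modify_url(curl_parse_url(base), port = 443)

# Without a url argument, scheme and host are required
fresh <- curl_modify_url(scheme = "https", host = "example.com", path = "/search")
fresh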

curl documentation built on June 8, 2025, 11:35 a.m.