index2df: Extract data from multiple index pages of a PTT board.


View source: R/scrape-index2df.R

Description

index2df scrapes the index pages of a PTT board ("kanban") and extracts the post information into a data frame.

Usage

index2df(board, newest = 1, pages = NA, search_term = NA,
  search_page = 1)

Arguments

board

Character. Either a URL or a board name, such as "Gossiping", "Baseball", or "LoL". The board name is case-insensitive. See Examples for details. board has different requirements when used with the search_term argument (see below).

newest

Integer. Number of pages to scrape, starting from the most recent page. Defaults to 1, which scrapes only the newest page. If set to 2, the newest and the second-newest pages are scraped, and so forth.

pages

Integer vector. A vector of index page number(s). This parameter lets you scrape specific index pages by number. Be careful not to provide numbers outside the range of the board's current index pages. Defaults to NA.

search_term

Character. A term to search for in the index, such as "lu she". Some advanced search methods are also available:

Post thread

Prepend "thread:" to the search term (post title): "thread:<post-title>".

Posts of an author

Prepend "author:" to the author's ID, e.g., "author:Plumage".

search_page

Integer vector. A vector of index page number(s). With argument search_term set, search_page lets you scrape index pages related to a specific term. Defaults to 1, which scrapes only the newest page.
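
The "thread:" and "author:" prefixes described under search_term can be assembled programmatically. The following is a minimal sketch, not part of pttR: make_search_term is a hypothetical helper name, and the commented-out call assumes that index2df accepts the resulting string exactly as documented above.

```r
# Hypothetical helper (not part of pttR): build a search_term string
# using the "thread:" and "author:" prefixes described above.
make_search_term <- function(value, type = c("plain", "thread", "author")) {
  type <- match.arg(type)
  if (type == "plain") {
    # A plain term is passed through unchanged
    return(value)
  }
  # Prepend the prefix, e.g. "author:Plumage"
  paste0(type, ":", value)
}

term <- make_search_term("Plumage", type = "author")

# Requires the pttR package and network access, so it is left commented out:
# index_df <- index2df("Gossiping", search_term = term, search_page = 1)
```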

Value

A data frame with one row per post.

Warning

Do not request too many pages at one time; doing so places a heavy load on the server.
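
One way to follow this advice is to request pages in small batches with a pause in between. The sketch below is an illustration only: scrape_in_batches, batch_size, and pause are hypothetical names that are not part of pttR. It assumes that calling index2df with the pages argument scrapes exactly those pages, and that the returned data frames share the same columns so they can be combined with rbind.

```r
# Hypothetical helper (not part of pttR): scrape index pages in small
# batches, pausing between batches to reduce load on the server.
# `scraper` defaults to pttR::index2df but can be any function that
# accepts `board` and `pages` and returns a data frame.
scrape_in_batches <- function(board, pages, batch_size = 5, pause = 2,
                              scraper = pttR::index2df) {
  # Split the page numbers into consecutive groups of at most batch_size
  batches <- split(pages, ceiling(seq_along(pages) / batch_size))
  results <- vector("list", length(batches))
  for (i in seq_along(batches)) {
    results[[i]] <- scraper(board, pages = batches[[i]])
    # Wait before the next batch, but not after the last one
    if (i < length(batches)) Sys.sleep(pause)
  }
  # Stack the per-batch data frames into one
  do.call(rbind, results)
}

# Requires the pttR package and network access, so it is left commented out.
# The page numbers here are placeholders:
# index_df <- scrape_in_batches("Gossiping", pages = 1:10)
```

Passing the scraper as an argument keeps the batching logic independent of the network, so it can be tried out with a stand-in function first.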

See Also

get_index_info: extracts data from one index page, while index2df handles several. In addition, index2df offers more functionality for extracting data across multiple pages.

Examples

# Get data from 'Gossiping'
index_df <- index2df("Gossiping")
head(index_df)

## Not run: 
# Or use URL directly
link <- "https://www.ptt.cc/bbs/Gossiping/index"

index_df <- index2df(link)

## End(Not run)

liao961120/pttR documentation built on Dec. 16, 2019, 2:19 a.m.