
# FanficReadeR

A web scraper for gathering public data on AO3.

## The Package

FanficReadeR scrapes data from Archive of Our Own (AO3), one of the world's leading fanfiction websites, with more than 3.7 million registered users and 7.6 million works listed on the platform. The package gathers data in three broad categories: Fandoms, Works, and Users.

### Fandoms

Works are organized into fandoms, which refer to the media each work is a fanfiction of. Fandoms are things like "Harry Potter", "Percy Jackson", or "Stranger Things".

### Works

Works are simply stories written by users. Works contain chapters, and every work has at least one chapter. Works can be either incomplete (chapters are still being posted) or complete (the author has marked the work as finished).

Each work has a series of attributes associated with it, such as its title, completion status, user engagement metrics (kudos, comments, bookmarks, and hits), and romantic pairings (M/M, F/M, F/F, Multi, etc.).

### Users

There are a number of features of user accounts that may prove interesting to an external observer, such as when a user joined, how many works they have posted, and which works they have bookmarked.

## Functions

FanficReadeR provides a small set of functions for gathering this data about individual fandoms, works, and authors.

### Fandom Data

Fandoms can be searched for in one of two ways: `GetFandomIndex()` and `GetSearchIndex()`. `GetFandomIndex()` simply retrieves all fictions within a fandom. For example, if you wanted an index of all fictions in the Harry Potter fandom, this function would return them all, up to a maximum of 5,000 pages. Each page lists 20 fictions, so the most `GetFandomIndex()` can return is 100,000 fictions. This is only a problem for the largest fandoms, and it can be worked around by calling `GetSearchIndex()` with iterated date ranges.

`GetSearchIndex()` is a more advanced tool for gathering information on fictions. Rather than returning all relevant fictions, `GetSearchIndex()` returns only those that meet particular criteria, such as when a fiction was last updated (`date_from` to `date_to`) or its completion status (complete, incomplete, or all fictions). Basic descriptions of each function are listed below:

| Function | Inputs | Description |
|----------|--------|-------------|
| `GetFandomIndex()` | Fandom name, max pages to collect, start page (default = 1) | Gathers an index of fanfiction URLs for a given fandom, output as a data frame. Currently, this function selects the most recently updated fanfictions in the given fandom. You need to specify how many pages of results to gather -- AO3 displays 20 results per page, so to gather 20 fanfiction URLs you would set the number of pages to 1; for 100, set it to 5. |
| `GetSearchIndex()` | Fandom name, date range (`date_from`, `date_to`), completion status | Gathers an index of fanfiction URLs that meet the given search criteria, output as a data frame. Iterating this function over date ranges is the way around the 5,000-page cap of `GetFandomIndex()`. |
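
The sketch below shows how the two functions fit together. This is hypothetical usage, not the package's exact signatures: argument names such as `max_pages`, `start_page`, `fandom`, and `complete`, as well as the exact fandom tag, are assumptions based on the inputs described above.

```r
library(FanficReadeR)

# Hypothetical usage -- argument names are assumptions, not the
# package's exact signatures.

# Index the 50 most recently updated pages (~1,000 works) of a fandom
hp_index <- GetFandomIndex("Harry Potter - J. K. Rowling",
                           max_pages = 50, start_page = 1)

# For fandoms that exceed the 5,000-page cap, iterate GetSearchIndex()
# over date ranges and stack the results
years <- 2015:2020
hp_by_year <- lapply(years, function(y) {
  GetSearchIndex(
    fandom    = "Harry Potter - J. K. Rowling",
    date_from = paste0(y, "-01-01"),
    date_to   = paste0(y, "-12-31"),
    complete  = "all"
  )
})
hp_all <- do.call(rbind, hp_by_year)
```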

### Author Data

| Function | Inputs | Description |
|----------|--------|-------------|
| `GetAuthorInfo()` | Author name OR profile link | Gathers basic biographical data about an author on AO3. Includes data like: date joined, number of stories, number of bookmarks, author ID |
| `GetAuthorWorks()` | Author name OR profile link | Gathers data about all works written by the author. Includes data like: work title, completion status, user engagement (kudos, comments, bookmarks, hits), romantic pairings (M/M, F/M, F/F, Multi, etc.) |
| `GetAuthorBookmarks()` | Author name OR profile link | Gathers data about all works bookmarked by the author. The output is near-identical to that of `GetAuthorWorks()` |
| `GetAuthorAll()` | Author name OR profile link | Applies the above three functions in a single call, output as a list |
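
A minimal sketch of chaining these together (`"some_author"` is a placeholder username, not a real account):

```r
library(FanficReadeR)

# "some_author" is a placeholder username, not a real account
bio       <- GetAuthorInfo("some_author")       # date joined, story count, author ID, ...
works     <- GetAuthorWorks("some_author")      # every work the author has posted
bookmarks <- GetAuthorBookmarks("some_author")  # every work the author has bookmarked

# Or gather all three at once, returned as a list
everything <- GetAuthorAll("some_author")
```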

### Works Data

| Function | Inputs | Description |
|----------|--------|-------------|
| `GetWorksInfo()` | Work link OR chapter link | Gathers basic summary data about the work in question. Includes data like: work title, completion status, user engagement (kudos, comments, bookmarks, hits), romantic pairings (M/M, F/M, F/F, Multi, etc.) |
| `GetChapterIndex()` | Work link OR chapter link | Creates an index of all chapters in the relevant work. Lists their names and chapter order, and provides a URL for each |
| `GetComments()` | Work link OR chapter link | Gathers all comments on the relevant work. For each comment, this function also tells you which user made the comment, whether that user was the author, when the comment was made, and which chapter it was made on. Because this can generate a large amount of data, the function has a few extra options: 1) `keep.text = TRUE` is the default, and preserves the original text of each comment in the output; 2) `excl.author = FALSE` is the default, and keeps comments made by the author on their own work (set `excl.author = TRUE` to exclude them) |
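
A sketch of a single-work scrape (the work URL is a placeholder, and the options shown are the ones described above):

```r
library(FanficReadeR)

work_url <- "https://archiveofourown.org/works/00000000"  # placeholder work link

info     <- GetWorksInfo(work_url)     # summary data about the work
chapters <- GetChapterIndex(work_url)  # chapter names, order, and URLs

# Keep the comment text, but exclude the author's own comments
comments <- GetComments(work_url, keep.text = TRUE, excl.author = TRUE)
```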

## Rate Limits

AO3 limits how many requests can be made to its website. It uses the Rack::Attack middleware on its servers, which throttles requests to 60 per minute (the exact rate limit is actually 300 requests per 300 seconds, per AO3's open-source code on GitHub). In practice, however, AO3's throttle threshold is much lower than this. I've tested the functions here many times, and there is no point at which you will consistently achieve 60 requests per minute without hitting an HTTP 429 ("Too Many Requests") error.

I've experimented with different parameters for a `Sys.sleep()` call attached to every HTML request, and found that a 5.5-second delay per request is the smallest delay that lets these functions run continuously without error. This is obviously not ideal, and frankly I don't know why the OTW archive says it has a 1 request/second limit when in practice the limit is more like 0.2 requests/second.
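
As a minimal sketch of that approach (`polite_read()` is a hypothetical helper for illustration, not part of the package; it assumes the `rvest` package is installed):

```r
library(rvest)

# Pause before every request to stay under AO3's practical limit
# of roughly 0.2 requests per second
polite_read <- function(url, delay = 5.5) {
  Sys.sleep(delay)
  read_html(url)
}

page <- polite_read("https://archiveofourown.org")
```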

Unless you're trying to gather large quantities of data, this shouldn't matter much to you. The scraper works relatively quickly for single-fanfic scrapes, but the delays mean that large-scale scraping efforts can take hours or even days -- at 5.5 seconds per request, ~3,000 requests alone takes roughly 4.5 hours. I'd appreciate any suggestions for improving the speed at which these functions can continuously collect data.

## Examples

See `example_scrape.qmd` for a functioning scraper workflow that gathers data on ~3,000 Harry Potter fanfictions, as well as their associated comment sections and authors.

## Future Roadmap

While this package can gather most of the information from AO3 that a researcher might want, there are still several ways in which it could -- and will -- be improved.

## Installation

You can install FanficReadeR with the following code:

```r
install.packages("devtools") # if you have not already installed the "devtools" package
devtools::install_github("SEthanMilne/FanficReadeR")
```

## Citation

If you use this package for academic purposes, I ask that you cite me using the information below:

Ethan Milne (2024). FanficReadeR. R package version 1.0.

A BibTeX entry for LaTeX users is:

```
@Manual{,
  title = {FanficReadeR},
  author = {Ethan Milne},
  year = {2024},
  note = {R package version 1.0},
}
```

