# ignore this fiddly bit and just leave it as is
knitr::opts_chunk$set(comment=NA)
library("stringr")

Thinking about strings

  1. Try the funny-looking assignment z <- c(). Concatenate z with some other numeric vectors and explain what z is.

YOUR DESCRIPTIVE ANSWER HERE

  1. Now consider Z <- "". Imagine that s is a character vector but you don't know anything about it. What is the value of str_c(Z, s)? How about str_c(s, Z)? Try a few example values of your own for s.
# this code is not executed, so you can write abstract identities
str_c(Z, s) <-      # fill in the blank
str_c(s, Z) <-      # fill in the blank

Explain in words the difference between this and paste(s, Z).

  1. Optional. Specify another R value and another R function that has the same properties as the pairs (z, c) and (Z, str_c).

OPTIONAL ANSWER HERE

  1. What is the default value of the sep parameter to str_c? Demonstrate your answer with an example:
x <- "Let us go then"
y <- "you and I"
DEFAULT <- "?????" # CHANGE THIS
str_c(x, y)
str_c(x, y, sep=DEFAULT) # value should be same as previous

(Bonus. What could the default value for collapse possibly be?)

  1. What is the default value of sep for paste?

EXPLAIN HERE

Jockers's "first foray"

  1. Copy the TextAnalysisWithR/data/plainText/melville.txt file to the same directory as your homework .Rmd file. Then you can replace the longer path with a simple mention of melville.txt.

  2. Copy into the following code block the code for loading the text of the Gutenberg Moby-Dick from Jockers:

# INSERT HERE: code to read the text into a new variable
text.v <- "fix me"
  1. EXPLAIN HERE

  2. Expression using match:

  3. Single expression to strip paratexts:

  4. Go ahead and do the practice problem.

# INSERT: fill in some code from Jockers here

# then:
top_word_counts <- 1:100    # REPLACE THIS with an expression for the
                            # frequencies of the top 100 words in Moby-Dick
plot(top_word_counts)
  1. Download the plain text of The Sheik from Project Gutenberg. Use "Save as" in your browser and put the file in the same directory as this homework, and give it the name sheik-gutenberg.txt.

  2. Load it into a character vector in R. Rather than use scan, use the simpler function readLines:

# uncomment the next line when you have downloaded the file to the right place
# sheik_lines <- readLines("sheik-gutenberg.txt")
  1. Following Jockers's pattern, remove the metadata, tokenize the text, and determine the number of times the words sheik, body, love, and the occur in this novel, ignoring case, and the total number of words in the novel. Assign these as named elements of a vector sheik_counts.

Let's stipulate that the text starts after the title page, with the word "CHAPTER," and ends before "THE END." Insert your code here:

# fill in operations on sheik_lines...



sheik_counts <- numeric()


# fill in assignments to sheik_counts
sheik_counts["sheik"] <- 0
sheik_counts["body"] <- 0
# etc.

# and sheik_total
sheik_total <- 0
  1. Compare the most frequent words in The Sheik to those in Moby-Dick. Make a few remarks on whether the 20 most frequent words distinguish these texts.
# your code here

BRIEF DISCUSSION HERE

A small challenge

  1. It's often easier to work this kind of thing out with a toy example.
story <- "For sale: baby shoes, unworn."
story_words <- unlist(str_split(story, "\\W+"))
story_words <- story_words[story_words != ""]
story_bigrams <- str_c()  # what goes here?
  1. If your expression has any explicit vector constructions using c(), replace them with sequences using :.
story_bigrams <- str_c()  # improved version here
  1. If your expression has any literal numbers other than 1 in it, replace those with expressions using length(story_words).
story_bigrams <- str_c()  # even more improved here
  1. Now construct the vector of all the bigrams in The Sheik.
sheik_bigrams <- str_c() # what goes here?
  1. Compute the word frequency table and sort it, and use it to list the top ten bigrams in the novel.
# your code for constructing the bigrams table and listing the top 10


agoldst/litdata documentation built on May 10, 2019, 7:34 a.m.