Description Usage Arguments Debugging Grouping Output Query principles Examples
Use src_postgres
to connect to an existing postgresql database,
and tbl
to connect to tables within that database.
If you are running a local postgresql database, leave all parameters set as
their defaults to connect. If you're connecting to a remote database,
ask your database administrator for the values of these variables.
1 2 3 4 5 |
dbname |
Database name |
host, port |
Host name and port number of database |
user, password |
User name and password (if needed) |
... |
for the src, other arguments passed on to the underlying
database connector, |
src |
a postgres src created with |
from |
Either a string giving the name of table in database, or
|
To see exactly what SQL is being sent to the database, you see
show_query
and explain
.
Typically you will create a grouped data table is to call the group_by
method on a mysql tbl: this will take care of capturing
the unevalated expressions for you.
For best performance, the database should have an index on the variables
that you are grouping by. Use explain
to check that
the database is using the indexes that you expect.
All data manipulation on SQL tbls are lazy: they will not actually
run the query or retrieve the data unless you ask for it: they all return
a new tbl_sql
object. Use compute
to run the
query and save the results in a temporary in the database, or use
collect
to retrieve the results to R.
Note that do
is not lazy since it must pull the data into R.
It returns a tbl_df
or grouped_df
, with one
column for each grouping variable, and one list column that contains the
results of the operation. do
never simplifies its output.
This section attempts to lay out the principles governing the generation of SQL queries from the manipulation verbs. The basic principle is that a sequence of operations should return the same value (modulo class) regardless of where the data is stored.
arrange(arrange(df, x), y)
should be equivalent to
arrange(df, y, x)
select(select(df, a:x), n:o)
should be equivalent to
select(df, n:o)
mutate(mutate(df, x2 = x * 2), y2 = y * 2)
should be
equivalent to mutate(df, x2 = x * 2, y2 = y * 2)
filter(filter(df, x == 1), y == 2)
should be
equivalent to filter(df, x == 1, y == 2)
summarise
should return the summarised output with
one level of grouping peeled off.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 | ## Not run:
# Connection basics ---------------------------------------------------------
# To connect to a database first create a src:
my_db <- src_postgres(host = "blah.com", user = "hadley",
password = "pass")
# Then reference a tbl within that src
my_tbl <- tbl(my_db, "my_table")
## End(Not run)
# Here we'll use the Lahman database: to create your own local copy,
# create a local database called "lahman", or tell lahman_postgres() how to
# access a database that you can write to
if (has_lahman("postgres")) {
lahman_p <- lahman_postgres()
# Methods -------------------------------------------------------------------
batting <- tbl(lahman_p, "Batting")
dim(batting)
colnames(batting)
head(batting)
# Data manipulation verbs ---------------------------------------------------
filter(batting, yearID > 2005, G > 130)
select(batting, playerID:lgID)
arrange(batting, playerID, desc(yearID))
summarise(batting, G = mean(G), n = n())
mutate(batting, rbi2 = if(is.null(AB)) 1.0 * R / AB else 0)
# note that all operations are lazy: they don't do anything until you
# request the data, either by `print()`ing it (which shows the first ten
# rows), by looking at the `head()`, or `collect()` the results locally.
system.time(recent <- filter(batting, yearID > 2010))
system.time(collect(recent))
# Group by operations -------------------------------------------------------
# To perform operations by group, create a grouped object with group_by
players <- group_by(batting, playerID)
group_size(players)
summarise(players, mean_g = mean(G), best_ab = max(AB))
best_year <- filter(players, AB == max(AB) | G == max(G))
best_year
progress <- mutate(players,
cyear = yearID - min(yearID) + 1,
ab_rank = rank(desc(AB)),
cumulative_ab = order_by(yearID, cumsum(AB)))
# When you group by multiple level, each summarise peels off one level
per_year <- group_by(batting, playerID, yearID)
stints <- summarise(per_year, stints = max(stint))
filter(stints, stints > 3)
summarise(stints, max(stints))
mutate(stints, order_by(yearID, cumsum(stints)))
# Joins ---------------------------------------------------------------------
player_info <- select(tbl(lahman_p, "Master"), playerID, birthYear)
hof <- select(filter(tbl(lahman_p, "HallOfFame"), inducted == "Y"),
playerID, votedBy, category)
# Match players and their hall of fame data
inner_join(player_info, hof)
# Keep all players, match hof data where available
left_join(player_info, hof)
# Find only players in hof
semi_join(player_info, hof)
# Find players not in hof
anti_join(player_info, hof)
# Arbitrary SQL -------------------------------------------------------------
# You can also provide sql as is, using the sql function:
batting2008 <- tbl(lahman_p,
sql('SELECT * FROM "Batting" WHERE "yearID" = 2008'))
batting2008
}
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.