library(learnr)
library(tutorial.helpers)
library(knitr)
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(out.width = '90%')
options(tutorial.exercise.timelimit = 60, tutorial.storage = "local")
This tutorial introduces you to using Python for data science. You will learn how to work with data sets using polars, a DataFrame library with a syntax similar to R's tidyverse. You will learn how to chain operations using method chaining, the Python counterpart of R's pipes, and how to make plots using plotnine, which implements the grammar of graphics just like ggplot2 in R.
This tutorial assumes that you have already completed the "Getting Started" tutorial in the tutorial.helpers package. If you haven't, do so now. It is quick!
To complete this tutorial, you must either install Python on your local machine or work with GitHub Codespaces in the cloud. Because Python installation is a tricky, finicky process, we do not provide any guidance. We recommend you use GitHub Codespaces instead. But all our instructions should work just as well on your machine.
Professionals store their work on GitHub, or a similar source control tool. If your computer blows up, you don't want to lose your work. GitHub is like Google Drive --- both live in the cloud --- but for your computational work rather than your documents.
Create a GitHub account by following the instructions at the GitHub homepage.
Follow this advice when choosing your username.
We recommend using a permanent email address for this account, not one which you lose access to when, for example, you change schools or jobs.
Copy your GitHub account URL in the field below.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 2)
Your answer should look like this:
https://github.com/your-username
Git is "software for tracking changes in any set of files, usually used for coordinating work among programmers collaboratively developing source code during software development."
Once you are logged in to GitHub, go to https://github.com/codespaces. (This is also accessible via the pull-down menu on the left-hand side.)
include_graphics("images/codespaces-1.png")
Press the "Use this template" button for the "Blank" template. This will create a new Codespace which looks like this:
include_graphics("images/codespaces-2.png")
Copy/paste the URL for your Codespace below.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 5)
A new URL is created for each Codespace. Ours was:
https://congenial-computing-machine-65ppr9j7wrhqq9.github.dev/
Yours will look similar.
Place your cursor in the Terminal, type pwd and hit return/enter. Copy/paste the command and the response, an instruction which we will henceforth abbreviate as CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 5)
@davidkane9 ➜ /workspaces/codespaces-blank $ pwd
/workspaces/codespaces-blank
@davidkane9 ➜ /workspaces/codespaces-blank $
Your answer should look similar, except with your GitHub ID in place of ours, which is davidkane9. In this tutorial, we will just work at the Terminal because our goal is to learn Python, not to learn about Codespaces. We just use Codespaces as a handy, free location to use Python.
Type python at the Terminal and hit return/enter.
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 5)
@davidkane9 ➜ /workspaces/codespaces-blank $ python
Python 3.12.1 (main, Nov 27 2025, 10:47:52) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
This is the Python "shell" or "interpreter" or "console." It allows us to work with Python interactively. We will use the term "Console" and capitalize it for clarity.
At the Console, type 2 + 2 and hit enter/return. Remember, the Console is just the Python shell which opens in the Terminal after you type python.
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 5)
>>> 2 + 2
4
>>>
You can exit the Console by typing exit(). Make sure to stop a Codespace when you are done with it so that you don't use up your free credits.
Learn how to explore a data set using methods like describe(), head(), and sample().
Before you start doing data science, you must import the libraries you are going to use. Let's start with the polars library. At the Console, type:
import polars as pl
Going forward, we won't remind you to hit return/enter after every command.
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 5)
You will probably get an error which looks like this:
>>> import polars as pl
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'polars'
>>>
The polars library is not installed by default in Codespaces. So we need to install it ourselves, along with the plotnine and pyarrow libraries. Note that, in Python, the terms "library," "package," and "module" are used (mostly) interchangeably.
Run exit() at the Console. This ends your Python session and dumps you back into the shell. Run:
pip install polars plotnine pyarrow
Note that the command "run" means to type the provided command and then hit the return/enter key.
CP/CR. (But just the first few lines.)
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 5)
>>> exit()
@davidkane9 ➜ /workspaces/codespaces-blank $ pip install polars plotnine
Collecting polars
  Downloading polars-1.36.1-py3-none-any.whl.metadata (10 kB)
Collecting plotnine
  Downloading plotnine-0.15.2-py3-none-any.whl.metadata (9.5 kB)
Collecting polars-runtime-32==1.36.1 (from polars)
  Downloading polars_runtime_32-1.36.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.5 kB)
...
One of the most important parts of data science is keeping track of all the packages which you need and ensuring that they are installed when you need them.
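One hedged way to check what is installed, from inside Python itself, is the standard library's importlib.metadata module. The helper name installed_version below is our own invention for illustration, not something provided by polars or this tutorial.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report on the three packages this tutorial asks you to install.
for name in ["polars", "plotnine", "pyarrow"]:
    print(name, installed_version(name))
```

A result of None for a package means you still need to run pip install for it at the Terminal.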
Run python at the Terminal. This starts the Python Console again. At the Console, type:
import polars as pl
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 5)
@davidkane9 ➜ /workspaces/codespaces-blank $ python
Python 3.12.1 (main, Nov 27 2025, 10:47:52) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import polars as pl
>>>
In Python, we import libraries to access their functions and data. The import statement makes the library available, and using as pl creates a short alias so we can type pl instead of polars every time we use a function from the polars library.
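The alias idea works for any module, not just polars. Here is a minimal sketch using the built-in math module, which needs no installation. The alias m is an arbitrary choice of ours.

```python
import math          # plain import: functions are reached as math.<name>
import math as m     # aliased import: the same module, reached as m.<name>

# Both names refer to the same module object.
same_module = m is math
root_plain = math.sqrt(16)
root_alias = m.sqrt(16)
print(same_module, root_plain, root_alias)
```

The alias changes only what you type, not what the module does.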
Your answer never needs to match ours perfectly. Our goal is just to ensure that you are actually following the instructions.
DataFrames are spreadsheet-like data structures in polars. Let's load the famous iris dataset. We'll read it from a URL.
In the Console, run:
iris = pl.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv")
iris
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 15)
>>> iris = pl.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv")
>>> iris
shape: (150, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
│ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width ┆ species │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
│ 5.1 ┆ 3.5 ┆ 1.4 ┆ 0.2 ┆ setosa │
│ 4.9 ┆ 3.0 ┆ 1.4 ┆ 0.2 ┆ setosa │
│ 4.7 ┆ 3.2 ┆ 1.3 ┆ 0.2 ┆ setosa │
│ 4.6 ┆ 3.1 ┆ 1.5 ┆ 0.2 ┆ setosa │
│ 5.0 ┆ 3.6 ┆ 1.4 ┆ 0.2 ┆ setosa │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 6.7 ┆ 3.0 ┆ 5.2 ┆ 2.3 ┆ virginica │
│ 6.3 ┆ 2.5 ┆ 5.0 ┆ 1.9 ┆ virginica │
│ 6.5 ┆ 3.0 ┆ 5.2 ┆ 2.0 ┆ virginica │
│ 6.2 ┆ 3.4 ┆ 5.4 ┆ 2.3 ┆ virginica │
│ 5.9 ┆ 3.0 ┆ 5.1 ┆ 1.8 ┆ virginica │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
>>>
Whenever we show output like this after a question, we are showing our answer to that question, even if we do not label it as such.
In the Console, run iris.describe(). This provides summary statistics for numerical columns.
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 10)
>>> iris.describe()
shape: (9, 6)
┌────────────┬──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
│ statistic ┆ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width ┆ species │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │
╞════════════╪══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
│ count ┆ 150.0 ┆ 150.0 ┆ 150.0 ┆ 150.0 ┆ 150 │
│ null_count ┆ 0.0 ┆ 0.0 ┆ 0.0 ┆ 0.0 ┆ 0 │
│ mean ┆ 5.843333 ┆ 3.057333 ┆ 3.758 ┆ 1.199333 ┆ null │
│ std ┆ 0.828066 ┆ 0.435866 ┆ 1.765298 ┆ 0.762238 ┆ null │
│ min ┆ 4.3 ┆ 2.0 ┆ 1.0 ┆ 0.1 ┆ setosa │
│ 25% ┆ 5.1 ┆ 2.8 ┆ 1.6 ┆ 0.3 ┆ null │
│ 50% ┆ 5.8 ┆ 3.0 ┆ 4.4 ┆ 1.3 ┆ null │
│ 75% ┆ 6.4 ┆ 3.3 ┆ 5.1 ┆ 1.8 ┆ null │
│ max ┆ 7.9 ┆ 4.4 ┆ 6.9 ┆ 2.5 ┆ virginica │
└────────────┴──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
>>>
This method provides a quick statistical overview of each numerical variable in the dataset. In some cases, the tutorial displays the same object differently from what you were able to copy/paste. And that is OK! Your answer does not need to match our answer.
In the Console, run iris.sample(). This selects a random row from the dataset.
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 7)
>>> iris.sample()
shape: (1, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
│ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width ┆ species │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
│ 6.3 ┆ 2.8 ┆ 5.1 ┆ 1.5 ┆ virginica │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
>>>
Your answer will differ from this answer because of the inherent randomness in methods like sample().
In the Console, hit the Up Arrow to retrieve the previous command. Edit it to add the argument n = 4 to iris.sample(). This will return 4 random rows from the iris dataset.
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 10)
>>> iris.sample(n = 4)
shape: (4, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
│ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width ┆ species │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
│ 5.0 ┆ 3.3 ┆ 1.4 ┆ 0.2 ┆ setosa │
│ 6.3 ┆ 2.9 ┆ 5.6 ┆ 1.8 ┆ virginica │
│ 5.6 ┆ 2.8 ┆ 4.9 ┆ 2.0 ┆ virginica │
│ 5.2 ┆ 3.4 ┆ 1.4 ┆ 0.2 ┆ setosa │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
>>>
Editing code directly in the Console quickly becomes annoying. See the positron.tutorials package for tutorials about using the Positron IDE to write and organize your code.
In the Console, run print(iris). This returns the same result as typing iris.
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 10)
>>> print(iris)
shape: (150, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
│ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width ┆ species │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
│ 5.1 ┆ 3.5 ┆ 1.4 ┆ 0.2 ┆ setosa │
│ 4.9 ┆ 3.0 ┆ 1.4 ┆ 0.2 ┆ setosa │
│ 4.7 ┆ 3.2 ┆ 1.3 ┆ 0.2 ┆ setosa │
│ 4.6 ┆ 3.1 ┆ 1.5 ┆ 0.2 ┆ setosa │
│ 5.0 ┆ 3.6 ┆ 1.4 ┆ 0.2 ┆ setosa │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 6.7 ┆ 3.0 ┆ 5.2 ┆ 2.3 ┆ virginica │
│ 6.3 ┆ 2.5 ┆ 5.0 ┆ 1.9 ┆ virginica │
│ 6.5 ┆ 3.0 ┆ 5.2 ┆ 2.0 ┆ virginica │
│ 6.2 ┆ 3.4 ┆ 5.4 ┆ 2.3 ┆ virginica │
│ 5.9 ┆ 3.0 ┆ 5.1 ┆ 1.8 ┆ virginica │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
>>>
You can control how many rows to display using iris.head(n) for the first n rows or iris.tail(n) for the last n rows.
In the Console, run iris.head(3). This returns the first 3 rows of the iris dataset.
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 8)
>>> iris.head(3)
shape: (3, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width ┆ species │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
│ 5.1 ┆ 3.5 ┆ 1.4 ┆ 0.2 ┆ setosa │
│ 4.9 ┆ 3.0 ┆ 1.4 ┆ 0.2 ┆ setosa │
│ 4.7 ┆ 3.2 ┆ 1.3 ┆ 0.2 ┆ setosa │
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
>>>
head() by default gives the top of the DataFrame, so your answer should match our answer. sample(), on the other hand, picks random rows to return. But, in both cases, the result is a DataFrame.
A central organizing principle of polars is that most methods take a DataFrame and return a DataFrame. This allows us to "chain" commands together, one after the other, creating a pipeline very similar to R's pipe operator |>.
In the Console, run help(iris). CP/CR (but just the first few rows).
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 10)
Help on DataFrame in module polars.dataframe.frame object:

class DataFrame(builtins.object)
 |  DataFrame(data: 'FrameInitTypes | None' = None, schema: 'SchemaDefinition | None' = None, *, schema_overrides: 'SchemaDict | None' = None, strict: 'bool' = True, orient: 'Orientation | None' = None, infer_schema_length: 'int | None' = 100, nan_to_null: 'bool' = False) -> 'None'
...
To exit from the help interface, type q and hit return/enter.
In the Console, run iris.schema. This shows the column names and data types. CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 10)
>>> iris.schema
Schema({'sepal_length': Float64, 'sepal_width': Float64, 'petal_length': Float64, 'petal_width': Float64, 'species': String})
>>>
The schema attribute displays information about the DataFrame's structure including the data types of each column. For example, sepal_length is listed as Float64, meaning it's a 64-bit floating-point number. You can also use iris.dtypes to see just the data types, or iris.columns to see just the column names.
In the Console, run import math then math.sqrt(144).
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 5)
>>> import math
>>> math.sqrt(144)
12.0
>>>
The square root function is one of many built-in functions in Python's math package. Most return their result, which Python then, by default, prints out. We did not need to use pip to install the math module because it is part of the base Python installation.
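A few other members of the math module, shown as a brief sketch. The particular functions chosen here are just examples of ours.

```python
import math  # part of the standard library, so no pip install is needed

print(math.sqrt(144))   # square root
print(math.floor(3.7))  # round down to an integer
print(math.pi)          # a constant rather than a function

# Length of the hypotenuse of a right triangle with sides 3 and 4.
hypotenuse = math.hypot(3, 4)
print(hypotenuse)
```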
In the Console, run x = math.sqrt(144).
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
>>> x = math.sqrt(144)
>>>
The = symbol is the assignment operator in Python. In this case, we are assigning the value of math.sqrt(144) to the variable x. Nothing is printed out because of that assignment.
In the Console, run x or print(x).
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
>>> x
12.0
>>>
Now that x has been defined in the Console, it is available for your use. Above, we just print it out. But we could also use it in other calculations, e.g., x + 5.
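As a small illustration of reusing a variable, the sketch below repeats the assignment and then uses x in a second calculation. The name y is our own choice.

```python
import math

x = math.sqrt(144)  # assignment: nothing is printed
y = x + 5           # x can be reused in later calculations
print(y)
```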
Although polars includes hundreds of methods for data manipulation, the most important are filter(), select(), sort(), with_columns(), and group_by() with agg(). These work very similarly to their R tidyverse equivalents.
Let's warm up by examining the tips dataset. Run:
tips = pl.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")
tips
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 20)
>>> tips = pl.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")
>>> tips
shape: (244, 7)
┌────────────┬──────┬────────┬────────┬──────┬────────┬──────┐
│ total_bill ┆ tip ┆ sex ┆ smoker ┆ day ┆ time ┆ size │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ str ┆ str ┆ str ┆ str ┆ i64 │
╞════════════╪══════╪════════╪════════╪══════╪════════╪══════╡
│ 16.99 ┆ 1.01 ┆ Female ┆ No ┆ Sun ┆ Dinner ┆ 2 │
│ 10.34 ┆ 1.66 ┆ Male ┆ No ┆ Sun ┆ Dinner ┆ 3 │
│ 21.01 ┆ 3.5 ┆ Male ┆ No ┆ Sun ┆ Dinner ┆ 3 │
│ 23.68 ┆ 3.31 ┆ Male ┆ No ┆ Sun ┆ Dinner ┆ 2 │
│ 24.59 ┆ 3.61 ┆ Female ┆ No ┆ Sun ┆ Dinner ┆ 4 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 29.03 ┆ 5.92 ┆ Male ┆ No ┆ Sat ┆ Dinner ┆ 3 │
│ 27.18 ┆ 2.0 ┆ Female ┆ Yes ┆ Sat ┆ Dinner ┆ 2 │
│ 22.67 ┆ 2.0 ┆ Male ┆ Yes ┆ Sat ┆ Dinner ┆ 2 │
│ 17.82 ┆ 1.75 ┆ Male ┆ No ┆ Sat ┆ Dinner ┆ 2 │
│ 18.78 ┆ 3.0 ┆ Female ┆ No ┆ Thur ┆ Dinner ┆ 2 │
└────────────┴──────┴────────┴────────┴──────┴────────┴──────┘
>>>
The tips dataset contains information about restaurant tips, including total bill, tip amount, and other variables.
Run tips.describe() to see summary statistics.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 5)
>>> tips.describe()
shape: (9, 8)
┌────────────┬────────────┬──────────┬────────┬────────┬──────┬────────┬──────────┐
│ statistic ┆ total_bill ┆ tip ┆ sex ┆ smoker ┆ day ┆ time ┆ size │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ str ┆ str ┆ str ┆ str ┆ f64 │
╞════════════╪════════════╪══════════╪════════╪════════╪══════╪════════╪══════════╡
│ count ┆ 244.0 ┆ 244.0 ┆ 244 ┆ 244 ┆ 244 ┆ 244 ┆ 244.0 │
│ null_count ┆ 0.0 ┆ 0.0 ┆ 0 ┆ 0 ┆ 0 ┆ 0 ┆ 0.0 │
│ mean ┆ 19.785943 ┆ 2.998279 ┆ null ┆ null ┆ null ┆ null ┆ 2.569672 │
│ std ┆ 8.902412 ┆ 1.383638 ┆ null ┆ null ┆ null ┆ null ┆ 0.9511 │
│ min ┆ 3.07 ┆ 1.0 ┆ Female ┆ No ┆ Fri ┆ Dinner ┆ 1.0 │
│ 25% ┆ 13.37 ┆ 2.0 ┆ null ┆ null ┆ null ┆ null ┆ 2.0 │
│ 50% ┆ 17.81 ┆ 2.92 ┆ null ┆ null ┆ null ┆ null ┆ 2.0 │
│ 75% ┆ 24.08 ┆ 3.55 ┆ null ┆ null ┆ null ┆ null ┆ 3.0 │
│ max ┆ 50.81 ┆ 10.0 ┆ Male ┆ Yes ┆ Thur ┆ Lunch ┆ 6.0 │
└────────────┴────────────┴──────────┴────────┴────────┴──────┴────────┴──────────┘
>>>
Note that the mean, standard deviation, and percentiles are reported only for the numerical columns. For string columns such as sex and day, describe() shows null in those rows and reports just the counts plus the minimum and maximum values.
Use .drop_nulls() to remove rows with missing values. In polars, we pipe operations by calling methods one after another, just like in R. Try:
tips.drop_nulls()
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 10)
>>> tips.drop_nulls()
shape: (244, 7)
┌────────────┬──────┬────────┬────────┬──────┬────────┬──────┐
│ total_bill ┆ tip ┆ sex ┆ smoker ┆ day ┆ time ┆ size │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ str ┆ str ┆ str ┆ str ┆ i64 │
╞════════════╪══════╪════════╪════════╪══════╪════════╪══════╡
│ 16.99 ┆ 1.01 ┆ Female ┆ No ┆ Sun ┆ Dinner ┆ 2 │
│ 10.34 ┆ 1.66 ┆ Male ┆ No ┆ Sun ┆ Dinner ┆ 3 │
│ 21.01 ┆ 3.5 ┆ Male ┆ No ┆ Sun ┆ Dinner ┆ 3 │
│ 23.68 ┆ 3.31 ┆ Male ┆ No ┆ Sun ┆ Dinner ┆ 2 │
│ 24.59 ┆ 3.61 ┆ Female ┆ No ┆ Sun ┆ Dinner ┆ 4 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 29.03 ┆ 5.92 ┆ Male ┆ No ┆ Sat ┆ Dinner ┆ 3 │
│ 27.18 ┆ 2.0 ┆ Female ┆ Yes ┆ Sat ┆ Dinner ┆ 2 │
│ 22.67 ┆ 2.0 ┆ Male ┆ Yes ┆ Sat ┆ Dinner ┆ 2 │
│ 17.82 ┆ 1.75 ┆ Male ┆ No ┆ Sat ┆ Dinner ┆ 2 │
│ 18.78 ┆ 3.0 ┆ Female ┆ No ┆ Thur ┆ Dinner ┆ 2 │
└────────────┴──────┴────────┴────────┴──────┴────────┴──────┘
>>>
Note the number of rows in the DataFrame after drop_nulls(). This dataset actually has no missing values, so all rows remain.
We can chain methods by writing tips.drop_nulls().head() to first drop rows with null values and then show the first few rows.
Chain .filter() to filter rows. Use pl.col("time") == "Dinner" as the argument. This is very similar to R's filter(). In other words, run:
tips.filter(pl.col("time") == "Dinner")
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 10)
>>> tips.filter(pl.col("time") == "Dinner")
shape: (176, 7)
┌────────────┬──────┬────────┬────────┬──────┬────────┬──────┐
│ total_bill ┆ tip ┆ sex ┆ smoker ┆ day ┆ time ┆ size │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ str ┆ str ┆ str ┆ str ┆ i64 │
╞════════════╪══════╪════════╪════════╪══════╪════════╪══════╡
│ 16.99 ┆ 1.01 ┆ Female ┆ No ┆ Sun ┆ Dinner ┆ 2 │
│ 10.34 ┆ 1.66 ┆ Male ┆ No ┆ Sun ┆ Dinner ┆ 3 │
│ 21.01 ┆ 3.5 ┆ Male ┆ No ┆ Sun ┆ Dinner ┆ 3 │
│ 23.68 ┆ 3.31 ┆ Male ┆ No ┆ Sun ┆ Dinner ┆ 2 │
│ 24.59 ┆ 3.61 ┆ Female ┆ No ┆ Sun ┆ Dinner ┆ 4 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 29.03 ┆ 5.92 ┆ Male ┆ No ┆ Sat ┆ Dinner ┆ 3 │
│ 27.18 ┆ 2.0 ┆ Female ┆ Yes ┆ Sat ┆ Dinner ┆ 2 │
│ 22.67 ┆ 2.0 ┆ Male ┆ Yes ┆ Sat ┆ Dinner ┆ 2 │
│ 17.82 ┆ 1.75 ┆ Male ┆ No ┆ Sat ┆ Dinner ┆ 2 │
│ 18.78 ┆ 3.0 ┆ Female ┆ No ┆ Thur ┆ Dinner ┆ 2 │
└────────────┴──────┴────────┴────────┴──────┴────────┴──────┘
>>>
This workflow --- in which we chain DataFrame methods together --- is very common in polars and very similar to R's pipe workflow.
The resulting DataFrame has the same number of columns as tips because filter() only affects the rows. But there are fewer rows now.
Use the "Up" arrow to retrieve the command from the previous question in the Console. (We will do this for most exercises in this section.) Continue the chain with .select() to choose specific columns. Use ["total_bill", "tip", "sex", "day"] as the argument. That is, run:
tips.filter(pl.col("time") == "Dinner").select(["total_bill", "tip", "sex", "day"])
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 10)
>>> tips.filter(pl.col("time") == "Dinner").select(["total_bill", "tip", "sex", "day"])
shape: (176, 4)
┌────────────┬──────┬────────┬──────┐
│ total_bill ┆ tip ┆ sex ┆ day │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ str ┆ str │
╞════════════╪══════╪════════╪══════╡
│ 16.99 ┆ 1.01 ┆ Female ┆ Sun │
│ 10.34 ┆ 1.66 ┆ Male ┆ Sun │
│ 21.01 ┆ 3.5 ┆ Male ┆ Sun │
│ 23.68 ┆ 3.31 ┆ Male ┆ Sun │
│ 24.59 ┆ 3.61 ┆ Female ┆ Sun │
│ … ┆ … ┆ … ┆ … │
│ 29.03 ┆ 5.92 ┆ Male ┆ Sat │
│ 27.18 ┆ 2.0 ┆ Female ┆ Sat │
│ 22.67 ┆ 2.0 ┆ Male ┆ Sat │
│ 17.82 ┆ 1.75 ┆ Male ┆ Sat │
│ 18.78 ┆ 3.0 ┆ Female ┆ Thur │
└────────────┴──────┴────────┴──────┘
>>>
Because select() doesn't affect rows, we have the same number as after filter(). But we only have 4 columns now.
Use the "Up" arrow to get the previous code. Continue the chain with .describe().
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 10)
>>> tips.filter(pl.col("time") == "Dinner").select(["total_bill", "tip", "sex", "day"]).describe()
shape: (9, 5)
┌────────────┬────────────┬──────────┬────────┬──────┐
│ statistic ┆ total_bill ┆ tip ┆ sex ┆ day │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ str ┆ str │
╞════════════╪════════════╪══════════╪════════╪══════╡
│ count ┆ 176.0 ┆ 176.0 ┆ 176 ┆ 176 │
│ null_count ┆ 0.0 ┆ 0.0 ┆ 0 ┆ 0 │
│ mean ┆ 20.797159 ┆ 3.10267 ┆ null ┆ null │
│ std ┆ 9.142029 ┆ 1.436243 ┆ null ┆ null │
│ min ┆ 3.07 ┆ 1.0 ┆ Female ┆ Fri │
│ 25% ┆ 14.48 ┆ 2.0 ┆ null ┆ null │
│ 50% ┆ 18.43 ┆ 3.0 ┆ null ┆ null │
│ 75% ┆ 25.28 ┆ 3.68 ┆ null ┆ null │
│ max ┆ 50.81 ┆ 10.0 ┆ Male ┆ Thur │
└────────────┴────────────┴──────────┴────────┴──────┘
>>>
This gives us summary statistics for our filtered and selected data.
Use the "Up" arrow to get the previous code. Replace .describe() with .sort("tip") to sort by the tip column.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 10)
>>> tips.filter(pl.col("time") == "Dinner").select(["total_bill", "tip", "sex", "day"]).sort("tip")
shape: (176, 4)
┌────────────┬──────┬────────┬─────┐
│ total_bill ┆ tip ┆ sex ┆ day │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ str ┆ str │
╞════════════╪══════╪════════╪═════╡
│ 3.07 ┆ 1.0 ┆ Female ┆ Sat │
│ 5.75 ┆ 1.0 ┆ Female ┆ Fri │
│ 7.25 ┆ 1.0 ┆ Female ┆ Sat │
│ 12.6 ┆ 1.0 ┆ Male ┆ Sat │
│ 16.99 ┆ 1.01 ┆ Female ┆ Sun │
│ … ┆ … ┆ … ┆ … │
│ 28.17 ┆ 6.5 ┆ Female ┆ Sat │
│ 48.27 ┆ 6.73 ┆ Male ┆ Sat │
│ 39.42 ┆ 7.58 ┆ Male ┆ Sat │
│ 48.33 ┆ 9.0 ┆ Male ┆ Sat │
│ 50.81 ┆ 10.0 ┆ Male ┆ Sat │
└────────────┴──────┴────────┴─────┘
>>>
The sort() method sorts the rows of a DataFrame. By default, it sorts in ascending order.
Before we can create plots, we need to import the plotnine library. Run: from plotnine import *
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 10)
>>> from plotnine import * >>>
plotnine is Python's implementation of the Grammar of Graphics, similar to R's ggplot2.
Note that this import invocation is different from the one we used with polars: import polars as pl. That invocation requires us to use pl as a prefix whenever we call a function from polars. Using import * allows us to just call the functions directly. The danger comes if we have two or more packages with functions which have the same name. The as invocation is very common because it forces us to specify the function we want. However, doing so can get annoying! So, we will be "lazy" for this plot.
Let's create our filtered dataset and assign it to a variable called dinner_data. Use the chain from Exercise 7. Run:
dinner_data = tips.filter(pl.col("time") == "Dinner").select(["total_bill", "tip", "sex", "day"]).sort("tip")
After doing so, print out dinner_data to confirm that it looks correct.
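The name-collision danger is easy to reproduce with two standard-library modules which both define sqrt. This sketch imports the single name explicitly rather than using *, but the effect is the same: the later import silently wins.

```python
import math
from math import sqrt   # sqrt now means math.sqrt
from cmath import sqrt  # this silently replaces it with cmath.sqrt

# The bare name now accepts negative inputs and returns a complex number...
result = sqrt(-1)
print(result)

# ...whereas the module-qualified call still refers to the real-valued version.
try:
    math.sqrt(-1)
except ValueError as err:
    print("math.sqrt failed:", err)
```

With a prefix such as math. or pl., there is never any doubt about which function is being called.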
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 10)
>>> dinner_data = tips.filter(pl.col("time") == "Dinner").select(["total_bill", "tip", "sex", "day"]).sort("tip")
>>> dinner_data
shape: (176, 4)
┌────────────┬──────┬────────┬─────┐
│ total_bill ┆ tip ┆ sex ┆ day │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ str ┆ str │
╞════════════╪══════╪════════╪═════╡
│ 3.07 ┆ 1.0 ┆ Female ┆ Sat │
│ 5.75 ┆ 1.0 ┆ Female ┆ Fri │
│ 7.25 ┆ 1.0 ┆ Female ┆ Sat │
│ 12.6 ┆ 1.0 ┆ Male ┆ Sat │
│ 16.99 ┆ 1.01 ┆ Female ┆ Sun │
│ … ┆ … ┆ … ┆ … │
│ 28.17 ┆ 6.5 ┆ Female ┆ Sat │
│ 48.27 ┆ 6.73 ┆ Male ┆ Sat │
│ 39.42 ┆ 7.58 ┆ Male ┆ Sat │
│ 48.33 ┆ 9.0 ┆ Male ┆ Sat │
│ 50.81 ┆ 10.0 ┆ Male ┆ Sat │
└────────────┴──────┴────────┴─────┘
>>>
Assigning our filtered data to a variable makes it easier to reuse and keeps our plotting code cleaner.
Show AI the dinner_data DataFrame and ask it to use the plotnine package to create a simple but nice looking plot and to save that plot as myplot.png. Run that code at the Console using copy/paste. If it fails, show AI the error and try again until it works. Double-click on myplot.png.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 5)
Your screen should look something like this:
include_graphics("images/plot-1.png")
Our code:
p = (
    ggplot(dinner_data, aes(x = 'total_bill', y = 'tip', color = 'sex')) +
    geom_point() +
    theme_minimal() +
    labs(x = 'Total Bill', y = 'Tip', color = 'Sex')
)
p.save('myplot.png')
Your code is probably different. And that is OK! The important skill is how to use AI to create beautiful graphics.
Generative AI tools (ChatGPT, Grok, Claude, DeepSeek, and so on) are the future of data science and everything else. The more you use these tools, the better off you will be. Unfortunately, the tools are changing so quickly that it is hard for a tutorial like this to stay up-to-date. This section provides some general advice and practice exercises.
Using any AI you like, ask it to write a one-sentence summary about the Python programming language. Copy the answer below.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 5)
Example answer:
Python is a high-level, interpreted programming language known for its simplicity, readability, and versatility, widely used for web development, data analysis, artificial intelligence, and scientific computing.
If you do not want to pay for an AI service, then you will probably need to have free accounts with several different services. That way, if one service cuts you off for the day, you can switch to another.
Run this in the Python Console:
tips.head()
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 10)
>>> tips.head()
shape: (5, 7)
┌────────────┬──────┬────────┬────────┬─────┬────────┬──────┐
│ total_bill ┆ tip ┆ sex ┆ smoker ┆ day ┆ time ┆ size │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ str ┆ str ┆ str ┆ str ┆ i64 │
╞════════════╪══════╪════════╪════════╪═════╪════════╪══════╡
│ 16.99 ┆ 1.01 ┆ Female ┆ No ┆ Sun ┆ Dinner ┆ 2 │
│ 10.34 ┆ 1.66 ┆ Male ┆ No ┆ Sun ┆ Dinner ┆ 3 │
│ 21.01 ┆ 3.5 ┆ Male ┆ No ┆ Sun ┆ Dinner ┆ 3 │
│ 23.68 ┆ 3.31 ┆ Male ┆ No ┆ Sun ┆ Dinner ┆ 2 │
│ 24.59 ┆ 3.61 ┆ Female ┆ No ┆ Sun ┆ Dinner ┆ 4 │
└────────────┴──────┴────────┴────────┴─────┴────────┴──────┘
>>>
When working with AI, you often need to tell it about the dataset. The easiest way to do that is often to just copy/paste the first few rows. That shows the AI what the column names and types are, which is key information for creating plots and data pipelines.
Copy/paste the top of the tips DataFrame into your AI interface and ask it to create a chain of methods using polars that calculates the average tip for each sex. Run the provided code in the Console. If it fails, show the AI the error and ask for better code.
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 10)
Claude gave us this answer:
>>> (
tips
.group_by("sex")
.agg(pl.col("tip").mean())
)
shape: (2, 2)
┌────────┬──────────┐
│ sex ┆ tip │
│ --- ┆ --- │
│ str ┆ f64 │
╞════════╪══════════╡
│ Male ┆ 3.089618 │
│ Female ┆ 2.833448 │
└────────┴──────────┘
>>>
There are two differences between this code and the code we have been writing. First, the entire statement is within parentheses. Second, the methods are placed on separate lines, which improves readability. Using parentheses is what allows us to place methods on separate lines.
This is a great answer! It uses group_by() just like R's tidyverse, and then agg() (short for aggregate) with pl.col() to specify which column to calculate the mean for.
Using AI is good. But intelligent use --- use in which you understand what the AI has done and try to improve/clarify its answer --- is even better.
Ask AI to create a beautiful plot using the tips dataset and the plotnine library and to save the resulting plot as myplot.png. Run the provided code in the Console. If it fails, show the AI the error and ask for better code.
CP/CR.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 20)
Example code from Claude:
p = (
    ggplot(tips, aes(x="total_bill", y="tip", color="sex")) +
    geom_point(alpha=0.7, size=3) +
    geom_smooth(method="lm", se=False) +
    facet_wrap("~time") +
    labs(
        title="Tipping Patterns by Total Bill",
        subtitle="Comparing lunch and dinner service",
        x="Total Bill ($)",
        y="Tip ($)",
        color="Sex"
    ) +
    theme_minimal() +
    theme(
        figure_size=(10, 5),
        plot_title=element_text(size=14, weight="bold"),
        plot_subtitle=element_text(size=10, color="gray")
    )
)
p.save("myplot.png")
It is convenient to save the new plot under the same name as the plot which is already open. In this case, the new plot will just replace the old plot without you having to double-click anything. Of course, you can't use that trick if you want to keep both plots.
This tutorial introduced you to the Python language for data science. You learned how to work with datasets using polars, a fast DataFrame library with syntax very similar to R's tidyverse. You learned how to chain operations using method chaining (just like R's pipe operator |>), and how to make plots using plotnine.
The key advantage of polars is that it feels familiar to R users while providing the speed and ecosystem of Python. Functions like filter(), select(), sort(), and group_by() work almost identically to their R counterparts!