title: "Looping Over Data Sets"
teaching: 5
exercises: 10
questions:
- "How can I process many data sets with a single command?"
objectives:
- "Be able to read and write globbing expressions that match sets of files."
- "Use glob to create lists of files."
- "Write for loops to perform operations on files given their names in a list."
keypoints:
- "Use a for
loop to process files given a list of their names."
- "Use glob.glob
to find sets of files whose names match a pattern."
- "Use glob
and for
to process batches of files."
for
loop to process files given a list of their names.import pandas as pd
for filename in ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']:
data = pd.read_csv(filename, index_col='country')
print(filename, data.min())
{: .language-python}
data/gapminder_gdp_africa.csv gdpPercap_1952 298.846212
gdpPercap_1957 335.997115
gdpPercap_1962 355.203227
gdpPercap_1967 412.977514
⋮ ⋮ ⋮
gdpPercap_1997 312.188423
gdpPercap_2002 241.165877
gdpPercap_2007 277.551859
dtype: float64
data/gapminder_gdp_asia.csv gdpPercap_1952 331
gdpPercap_1957 350
gdpPercap_1962 388
gdpPercap_1967 349
⋮ ⋮ ⋮
gdpPercap_1997 415
gdpPercap_2002 611
gdpPercap_2007 944
dtype: float64
{: .output}
glob.glob
to find sets of files whose names match a pattern.*
meaning "match zero or more characters"?
meaning "match exactly one character"glob
library to provide pattern matching functionalityglob
library contains a function also called glob
to match file patternsglob.glob('*.txt')
matches all files in the current directory
whose names end with .txt
.import glob
print('all csv files in data directory:', glob.glob('data/*.csv'))
{: .language-python}
all csv files in data directory: ['data/gapminder_all.csv', 'data/gapminder_gdp_africa.csv', \
'data/gapminder_gdp_americas.csv', 'data/gapminder_gdp_asia.csv', 'data/gapminder_gdp_europe.csv', \
'data/gapminder_gdp_oceania.csv']
{: .output}
print('all PDB files:', glob.glob('*.pdb'))
{: .language-python}
all PDB files: []
{: .output}
glob
and for
to process batches of files.for filename in glob.glob('data/gapminder_*.csv'):
data = pd.read_csv(filename)
print(filename, data['gdpPercap_1952'].min())
{: .language-python}
data/gapminder_all.csv 298.8462121
data/gapminder_gdp_africa.csv 298.8462121
data/gapminder_gdp_americas.csv 1397.717137
data/gapminder_gdp_asia.csv 331.0
data/gapminder_gdp_europe.csv 973.5331948
data/gapminder_gdp_oceania.csv 10039.59564
{: .output}
Determining Matches
Which of these files is not matched by the expression
glob.glob('data/*as*.csv')
?
data/gapminder_gdp_africa.csv
data/gapminder_gdp_americas.csv
data/gapminder_gdp_asia.csv
- 1 and 2 are not matched.
Solution
1 is not matched by the glob. {: .solution} {: .challenge}
Minimum File Size
Modify this program so that it prints the number of records in the file that has the fewest records.
~~~ import glob import pandas as pd fewest = _ for filename in glob.glob('data/*.csv'): dataframe = pd._(filename) fewest = min(____, dataframe.shape[0]) print('smallest file has', fewest, 'records') ~~~ {: .language-python} Note that the shape method returns a tuple with the number of rows and columns of the data frame.
Solution
~~~ import glob import pandas as pd fewest = float('Inf') for filename in glob.glob('data/*.csv'): dataframe = pd.read_csv(filename) fewest = min(fewest, dataframe.shape[0]) print('smallest file has', fewest, 'records') ~~~ {: .language-python} {: .solution} {: .challenge}
Comparing Data
Write a program that reads in the regional data sets and plots the average GDP per capita for each region over time in a single chart.
Solution
This solution builds a useful legend by using the string
split
method to extract theregion
from the path 'data/gapminder_gdp_a_specific_region.csv'. The [pathlib module
] also provides useful abstractions for file and path manipulation like returning the name of a file without the file extension. ~~~ import glob import pandas as pd import matplotlib.pyplot as plt fig, ax = plt.subplots(1,1) for filename in glob.glob('data/gapminder_gdp*.csv'): dataframe = pd.read_csv(filename) # extract from the filename, expected to be in the format 'data/gapminder_gdp_.csv'. # we will split the string using the split method and_
as our separator, # retrieve the last string in the list that split returns (<region>.csv
), # and then remove the.csv
extension from that string. region = filename.split('_')[-1][:-4] dataframe.mean().plot(ax=ax, label=region) plt.legend() plt.show() ~~~ {: .language-python} {: .solution} {: .challenge}
![Non-working image with jekyll syntax]({{ page.root }}/no-workie.svg)
This text includes a link that isn't parsed correctly by commonmark. The rest of the text should be properly parsed.
{% include links.md %}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.