inst/lesson-fragment/_episodes/14-looping-data-sets.md

---
title: "Looping Over Data Sets"
teaching: 5
exercises: 10
questions:
- "How can I process many data sets with a single command?"
objectives:
- "Be able to read and write globbing expressions that match sets of files."
- "Use glob to create lists of files."
- "Write for loops to perform operations on files given their names in a list."
keypoints:
- "Use a for loop to process files given a list of their names."
- "Use glob.glob to find sets of files whose names match a pattern."
- "Use glob and for to process batches of files."
---

Use a for loop to process files given a list of their names.

~~~
import pandas as pd
for filename in ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']:
    data = pd.read_csv(filename, index_col='country')
    print(filename, data.min())
~~~
{: .language-python}

~~~
data/gapminder_gdp_africa.csv gdpPercap_1952    298.846212
gdpPercap_1957    335.997115
gdpPercap_1962    355.203227
gdpPercap_1967    412.977514
⋮ ⋮ ⋮
gdpPercap_1997    312.188423
gdpPercap_2002    241.165877
gdpPercap_2007    277.551859
dtype: float64
data/gapminder_gdp_asia.csv gdpPercap_1952    331
gdpPercap_1957    350
gdpPercap_1962    388
gdpPercap_1967    349
⋮ ⋮ ⋮
gdpPercap_1997    415
gdpPercap_2002    611
gdpPercap_2007    944
dtype: float64
~~~
{: .output}

Use glob.glob to find sets of files whose names match a pattern.

~~~
import glob
print('all csv files in data directory:', glob.glob('data/*.csv'))
~~~
{: .language-python}

~~~
all csv files in data directory: ['data/gapminder_all.csv', 'data/gapminder_gdp_africa.csv', \
'data/gapminder_gdp_americas.csv', 'data/gapminder_gdp_asia.csv', 'data/gapminder_gdp_europe.csv', \
'data/gapminder_gdp_oceania.csv']
~~~
{: .output}

~~~
print('all PDB files:', glob.glob('*.pdb'))
~~~
{: .language-python}

~~~
all PDB files: []
~~~
{: .output}
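As a rough illustration of how glob patterns behave, the sketch below builds a throwaway directory with a few empty files and matches them with three wildcards: `*`, `?`, and `[...]`. The file names and the temporary directory are made up for this example and are not part of the lesson data. The results are passed through `sorted` because `glob.glob` returns matches in arbitrary order.

```python
import glob
import os
import tempfile

# Hypothetical file names, created only for this demonstration.
names = ['gdp_africa.csv', 'gdp_asia.csv', 'gdp_europe.csv',
         'notes.txt', 'run1.log', 'run2.log']

with tempfile.TemporaryDirectory() as tmp:
    for name in names:
        # Create an empty file for each name.
        open(os.path.join(tmp, name), 'w').close()

    def match(pattern):
        # Return the sorted base names of the files in tmp that match the pattern.
        # glob.escape protects any special characters in the directory path itself.
        paths = glob.glob(os.path.join(glob.escape(tmp), pattern))
        return sorted(os.path.basename(p) for p in paths)

    csv_files = match('*.csv')        # '*' matches any run of characters
    log_files = match('run?.log')     # '?' matches exactly one character
    a_files = match('gdp_[a]*.csv')   # '[...]' matches one character from the set
    pdb_files = match('*.pdb')        # no match gives an empty list, not an error

print(csv_files)
print(log_files)
print(a_files)
print(pdb_files)
```

A pattern that matches nothing, like `*.pdb` here, yields an empty list, which is the same behavior as the PDB example above.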

Use glob and for to process batches of files.

~~~
for filename in glob.glob('data/gapminder_*.csv'):
    data = pd.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())
~~~
{: .language-python}

~~~
data/gapminder_all.csv 298.8462121
data/gapminder_gdp_africa.csv 298.8462121
data/gapminder_gdp_americas.csv 1397.717137
data/gapminder_gdp_asia.csv 331.0
data/gapminder_gdp_europe.csv 973.5331948
data/gapminder_gdp_oceania.csv 10039.59564
~~~
{: .output}
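The same glob-plus-for pattern can be tried without the lesson's data files. The following sketch uses the standard `csv` module instead of pandas so that it runs with a bare Python installation. It writes two tiny made-up CSV files to a temporary directory and then loops over the glob results. The file names, the values, and the helper `min_of_column` are all hypothetical.

```python
import csv
import glob
import os
import tempfile

def min_of_column(filename, column):
    # Read one CSV file and return the smallest value in the named column.
    with open(filename, newline='') as handle:
        return min(float(row[column]) for row in csv.DictReader(handle))

# Made-up sample data: file name -> rows of (country, gdpPercap_1952).
samples = {
    'sample_gdp_north.csv': [('A', 10.5), ('B', 3.25)],
    'sample_gdp_south.csv': [('C', 7.0), ('D', 8.5)],
}

results = {}
with tempfile.TemporaryDirectory() as tmp:
    # Write each sample file with a header row followed by its data rows.
    for name, rows in samples.items():
        with open(os.path.join(tmp, name), 'w', newline='') as handle:
            writer = csv.writer(handle)
            writer.writerow(['country', 'gdpPercap_1952'])
            writer.writerows(rows)

    # sorted() gives a predictable order; glob.glob itself does not guarantee one.
    pattern = os.path.join(glob.escape(tmp), 'sample_gdp_*.csv')
    for filename in sorted(glob.glob(pattern)):
        results[os.path.basename(filename)] = min_of_column(filename, 'gdpPercap_1952')

for name, smallest in results.items():
    print(name, smallest)
```

Putting the per-file work in a small function keeps the loop body short and makes it easy to test that step on a single file before running it over a whole batch.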

Determining Matches

Which of these files is not matched by the expression glob.glob('data/*as*.csv')?

  1. data/gapminder_gdp_africa.csv
  2. data/gapminder_gdp_americas.csv
  3. data/gapminder_gdp_asia.csv
  4. 1 and 2 are not matched.

Solution

Option 1 is not matched by the glob: 'gapminder_gdp_africa.csv' does not contain the substring 'as'.
{: .solution}
{: .challenge}
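One way to check an answer like this without touching the file system is the standard `fnmatch` module, which applies shell-style wildcards to plain strings. This is a sketch rather than a replacement for `glob`: in `fnmatch` the `*` wildcard also matches path separators, which it does not in `glob`, although the difference does not matter for these three names.

```python
from fnmatch import fnmatchcase

pattern = 'data/*as*.csv'
candidates = [
    'data/gapminder_gdp_africa.csv',
    'data/gapminder_gdp_americas.csv',
    'data/gapminder_gdp_asia.csv',
]

# Map each candidate name to True if it matches the pattern, else False.
matches = {name: fnmatchcase(name, pattern) for name in candidates}

for name, matched in matches.items():
    print(name, 'matched' if matched else 'not matched')
```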

Minimum File Size

Modify this program so that it prints the number of records in the file that has the fewest records.

~~~
import glob
import pandas as pd
fewest = ____
for filename in glob.glob('data/*.csv'):
    dataframe = pd.____(filename)
    fewest = min(____, dataframe.shape[0])
print('smallest file has', fewest, 'records')
~~~
{: .language-python}

Note that the shape attribute of a data frame is a tuple holding its number of rows and its number of columns.

Solution

~~~
import glob
import pandas as pd
fewest = float('Inf')
for filename in glob.glob('data/*.csv'):
    dataframe = pd.read_csv(filename)
    fewest = min(fewest, dataframe.shape[0])
print('smallest file has', fewest, 'records')
~~~
{: .language-python}
{: .solution}
{: .challenge}

Comparing Data

Write a program that reads in the regional data sets and plots the average GDP per capita for each region over time in a single chart.

Solution

This solution builds a useful legend by using the string split method to extract the region from the path 'data/gapminder_gdp_a_specific_region.csv'. The [pathlib module] also provides useful abstractions for file and path manipulation, such as returning the name of a file without its extension.

~~~
import glob
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1)
for filename in glob.glob('data/gapminder_gdp*.csv'):
    dataframe = pd.read_csv(filename)
    # extract <region> from the filename, expected to be in the format 'data/gapminder_gdp_<region>.csv'.
    # we will split the string using the split method and `_` as our separator,
    # retrieve the last string in the list that split returns (`<region>.csv`),
    # and then remove the `.csv` extension from that string.
    region = filename.split('_')[-1][:-4]
    # numeric_only=True skips the non-numeric 'country' column,
    # which recent versions of pandas would otherwise raise an error on.
    dataframe.mean(numeric_only=True).plot(ax=ax, label=region)
plt.legend()
plt.show()
~~~
{: .language-python}
{: .solution}
{: .challenge}
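The pathlib approach mentioned in the solution could look roughly like the sketch below. It assumes the same 'data/gapminder_gdp_<region>.csv' naming scheme. `region_from_path` is a hypothetical helper name, not something defined in the lesson.

```python
from pathlib import Path

def region_from_path(filename):
    # Path.stem drops the directory and the final extension,
    # e.g. 'gapminder_gdp_asia'; the region is the part after the last '_'.
    return Path(filename).stem.split('_')[-1]

regions = [region_from_path(name) for name in [
    'data/gapminder_gdp_africa.csv',
    'data/gapminder_gdp_asia.csv',
]]
print(regions)
```

Unlike slicing with `[:-4]`, `stem` does not depend on the extension being exactly four characters long.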

ZNK test links and images

books as clubs


Link to Home and to shell

Carpentries logo

Non-working image

![Non-working image with jekyll syntax]({{ page.root }}/no-workie.svg)

This text includes a link that isn't parsed correctly by commonmark. The rest of the text should be properly parsed.

{% include links.md %}



carpentries/pegboard documentation built on Nov. 13, 2024, 8:53 a.m.