5  Intro to pandas

In this Introduction to pandas we will get to know and become experts in:

  1. Data Frames
  2. Slicing
  3. Counting and Summary Statistics
  4. Handling Files

Relevant DataCamp lessons:

import pandas as pd # hopefully colab has this preinstalled
import numpy as np

Introduction to pandas

While numpy offers a lot of powerful numerical capabilities it lacks some of the necessary convenience and natural of handling data as we encounter them. For example, we would typically like to - mix data types (strings, numbers, categories, Boolean, …) - refer to columns and rows by names - summarize and visualize data in efficient pivot style manners

All of the above (and more) can be achieved easily by extending the concept of an array (or a matrix) to a so called dataframe.

There are many ways to construct a DataFrame, though one of the most common is from a dictionary of equal-length lists or NumPy arrays:

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)# creates a dataframe out of the data given!
frame.head(3)
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
frame[2,1]#too bad
#to get the full row: use the .iloc method
frame.iloc[2]
frame.iloc[2,1]
2002

Subsetting/Slicing

We first need to understand the attributes index (=rownames) and columns (= column names):

frame.index
RangeIndex(start=0, stop=6, step=1)
#We can set a column as an index:
frame2 = frame.set_index("year")
print(frame2)
# 
       state  pop
year             
2000    Ohio  1.5
2001    Ohio  1.7
2002    Ohio  3.6
2001  Nevada  2.4
2002  Nevada  2.9
2003  Nevada  3.2
#it would be nice to access elements in the same fashion as numpy
#frame2[1,1]
frame["pop"]
0    1.5
1    1.7
2    3.6
3    2.4
4    2.9
5    3.2
Name: pop, dtype: float64
frame.pop
<bound method DataFrame.pop of     state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2>

Asking for rows

Unfortunately, we cannot use the simple [row,col] notation that we are used to from numpy arrays. (Try asking for frame[0,1])

Instead, row subsetting can be achieved with either the .loc() or the .iloc() methods. The latter takes integers, the former indices:

frame2.loc[2001] #note that I am not using quotes !!
#at first glance this looks like I am asking for the row number 2001 !!
state pop
year
2001 Ohio 1.7
2001 Nevada 2.4
frame2.loc[2001,"state"]
year
2001      Ohio
2001    Nevada
Name: state, dtype: object
frame.iloc[0]#first row
state    Ohio
year     2000
pop       1.5
Name: 0, dtype: object
frame3 = frame.set_index("state", drop=False)
print(frame3)
         state  year  pop
state                    
Ohio      Ohio  2000  1.5
Ohio      Ohio  2001  1.7
Ohio      Ohio  2002  3.6
Nevada  Nevada  2001  2.4
Nevada  Nevada  2002  2.9
Nevada  Nevada  2003  3.2
frame3.loc["Ohio"]
year pop
state
Ohio 2000 1.5
Ohio 2001 1.7
Ohio 2002 3.6
frame.iloc[2001]# this does not work because we do not have 2001 rows !
frame.iloc[0,1]
2000

Asking for columns

#The columns are also an index:
frame.columns
Index(['state', 'year', 'pop'], dtype='object')

A column in a DataFrame can be retrieved MUCH easier: as a Series either by dictionary-like notation or by using the dot attribute notation:

frame["state"]
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object
frame.year#equivalent to frame["year"]
0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

Summary Stats

Just like in numpy you can compute sums, means, counts and many other summaries along rows and columns, by specifying the axis argument:

height = np.array([1.79, 1.85, 1.95, 1.55])
weight = np.array([70, 80, 85, 65])
hw = np.array([height, weight]).transpose()

hw
array([[ 1.79, 70.  ],
       [ 1.85, 80.  ],
       [ 1.95, 85.  ],
       [ 1.55, 65.  ]])
df = pd.DataFrame(hw, columns = ["height", "weight"]) 
print(df)
   height  weight
0    1.79    70.0
1    1.85    80.0
2    1.95    85.0
3    1.55    65.0
df = pd.DataFrame(hw , columns = ["height", "weight"],
                  index = ["Peter", "Matilda", "Bee", "Tom"]) 
print(df)
         height  weight
Peter      1.79    70.0
Matilda    1.85    80.0
Bee        1.95    85.0
Tom        1.55    65.0

Can you extract:

  1. All weights
  2. Peter’s height
  3. Bee’s full info
  4. the average height
  5. get all persons with height greater than 180cm
#see Lab5
print(df.mean(axis=0))
print(df.mean(axis=1))# are these averages sensible ?
height     1.785
weight    75.000
dtype: float64
Peter      35.895
Matilda    40.925
Bee        43.475
Bee        33.275
dtype: float64

Some methods are neither reductions nor accumulations. describe is one such example, producing multiple summary statistics in one shot:

df.describe()
height weight
count 4.000 4.000000
mean 1.785 75.000000
std 0.170 9.128709
min 1.550 65.000000
25% 1.730 68.750000
50% 1.820 75.000000
75% 1.875 81.250000
max 1.950 85.000000

Built in data sets

Gapminder Data

https://www.gapminder.org/fw/world-health-chart/

https://www.ted.com/talks/hans_rosling_the_best_stats_you_ve_ever_seen#t-241405

You’ve never seen data presented like this. With the drama and urgency of a sportscaster, statistics guru Hans Rosling debunks myths about the so-called “developing world.”

!pip install gapminder
#!conda install gapminder
from gapminder import gapminder
#gapminder.to_csv("../datasets/gapminder.csv")
gapminder
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106
... ... ... ... ... ... ...
1699 Zimbabwe Africa 1987 62.351 9216418 706.157306
1700 Zimbabwe Africa 1992 60.377 10704340 693.420786
1701 Zimbabwe Africa 1997 46.809 11404948 792.449960
1702 Zimbabwe Africa 2002 39.989 11926563 672.038623
1703 Zimbabwe Africa 2007 43.487 12311143 469.709298

1704 rows × 6 columns

#find the unique years

#get the years:
gapminder["year"]
np.unique(gapminder.year)
array([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,
       2007])
#get all rows with year 1952:
#Hint:
#either use Boolean subsetting
gapminder["year"] == 1952
gapminder[gapminder["year"] == 1952]
#or use an index !!
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
12 Albania Europe 1952 55.230 1282697 1601.056136
24 Algeria Africa 1952 43.077 9279525 2449.008185
36 Angola Africa 1952 30.015 4232095 3520.610273
48 Argentina Americas 1952 62.485 17876956 5911.315053
... ... ... ... ... ... ...
1644 Vietnam Asia 1952 40.412 26246839 605.066492
1656 West Bank and Gaza Asia 1952 43.160 1030585 1515.592329
1668 Yemen, Rep. Asia 1952 32.548 4963829 781.717576
1680 Zambia Africa 1952 42.038 2672000 1147.388831
1692 Zimbabwe Africa 1952 48.451 3080907 406.884115

142 rows × 6 columns

Handling Files

Get to know your friends

  • pd.read_csv
  • pd.read_table
  • pd.read_excel
'''url = "https://drive.google.com/file/d/1oIvCdN15UEwt4dCyjkArekHnTrivN43v/view?usp=share_link"
url='https://drive.google.com/uc?id=' + url.split('/')[-2]
gapminder = pd.read_csv(url, index_col=0)
gapminder.head()'''
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106
gapminder.sort_values(by="year").head()
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
528 France Europe 1952 67.410 42459667 7029.809327
540 Gabon Africa 1952 37.003 420702 4293.476475
1656 West Bank and Gaza Asia 1952 43.160 1030585 1515.592329
552 Gambia Africa 1952 30.000 284320 485.230659
#How many countries?
CtryCts = gapminder["country"].value_counts()
CtryCts
#note the similarity with np.unique(..., return_counts=True)
Afghanistan          12
Pakistan             12
New Zealand          12
Nicaragua            12
Niger                12
                     ..
Eritrea              12
Equatorial Guinea    12
El Salvador          12
Egypt                12
Zimbabwe             12
Name: country, Length: 142, dtype: int64
from numpy.random import default_rng
rng = default_rng()
rng.choice(gapminder["country"].unique(),2)
gapminder["year"].unique()
array([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,
       2007])
#How meaningful are the column stats?
print(gapminder.mean(axis=0))
gapminder.describe()
year         1.979500e+03
lifeExp      5.947444e+01
pop          2.960121e+07
gdpPercap    7.215327e+03
dtype: float64
/var/folders/h4/k73g68ds6xj791sf8cpmlxlc0000gn/T/ipykernel_33611/633466148.py:2: FutureWarning: The default value of numeric_only in DataFrame.mean is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.
  print(gapminder.mean(axis=0))
year lifeExp pop gdpPercap
count 1704.00000 1704.000000 1.704000e+03 1704.000000
mean 1979.50000 59.474439 2.960121e+07 7215.327081
std 17.26533 12.917107 1.061579e+08 9857.454543
min 1952.00000 23.599000 6.001100e+04 241.165876
25% 1965.75000 48.198000 2.793664e+06 1202.060309
50% 1979.50000 60.712500 7.023596e+06 3531.846988
75% 1993.25000 70.845500 1.958522e+07 9325.462346
max 2007.00000 82.603000 1.318683e+09 113523.132900

Sort the index before you slice!!

Choose a time range and specific countries

gapminder2 = gapminder.set_index("year").sort_index()
gap1982_92 = gapminder2.loc[1982:1992].reset_index()
gap1982_92 = gap1982_92.set_index("country").sort_index()
gap1982_92.loc["Afghanistan":"Albania"]
year continent lifeExp pop gdpPercap
country
Afghanistan 1982 Asia 39.854 12881816 978.011439
Afghanistan 1987 Asia 40.822 13867957 852.395945
Afghanistan 1992 Asia 41.674 16317921 649.341395
Albania 1992 Europe 71.581 3326498 2497.437901
Albania 1987 Europe 72.000 3075321 3738.932735
Albania 1982 Europe 70.420 2780097 3630.880722
gap1982_92.loc["Afghanistan":"Albania","lifeExp"].mean()
56.0585