5 Intro to pandas

In this Introduction to pandas we will get to know and become experts in:

Data Frames
Slicing
Counting and Summary Statistics
Handling Files

Relevant DataCamp lessons:

Data manipulation with pandas, Chaps 1-4
Matplotlib, Chap 1

import pandas as pd # hopefully colab has this preinstalled
import numpy as np

Introduction to pandas

While numpy offers a lot of powerful numerical capabilities it lacks some of the necessary convenience and natural of handling data as we encounter them. For example, we would typically like to - mix data types (strings, numbers, categories, Boolean, …) - refer to columns and rows by names - summarize and visualize data in efficient pivot style manners

All of the above (and more) can be achieved easily by extending the concept of an array (or a matrix) to a so called dataframe.

There are many ways to construct a DataFrame, though one of the most common is from a dictionary of equal-length lists or NumPy arrays:

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)# creates a dataframe out of the data given!
frame.head(3)

	state	year	pop
0	Ohio	2000	1.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6

frame[2,1]#too bad

#to get the full row: use the .iloc method
frame.iloc[2]
frame.iloc[2,1]

Subsetting/Slicing

We first need to understand the attributes index (=rownames) and columns (= column names):

frame.index

RangeIndex(start=0, stop=6, step=1)

#We can set a column as an index:
frame2 = frame.set_index("year")
print(frame2)
#

       state  pop
year             
2000    Ohio  1.5
2001    Ohio  1.7
2002    Ohio  3.6
2001  Nevada  2.4
2002  Nevada  2.9
2003  Nevada  3.2

#it would be nice to access elements in the same fashion as numpy
#frame2[1,1]
frame["pop"]

0    1.5
1    1.7
2    3.6
3    2.4
4    2.9
5    3.2
Name: pop, dtype: float64

frame.pop

<bound method DataFrame.pop of     state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2>

Asking for rows

Unfortunately, we cannot use the simple [row,col] notation that we are used to from numpy arrays. (Try asking for frame[0,1])

Instead, row subsetting can be achieved with either the .loc() or the .iloc() methods. The latter takes integers, the former indices:

frame2.loc[2001] #note that I am not using quotes !!
#at first glance this looks like I am asking for the row number 2001 !!

	state	pop
year
2001	Ohio	1.7
2001	Nevada	2.4

frame2.loc[2001,"state"]

year
2001      Ohio
2001    Nevada
Name: state, dtype: object

frame.iloc[0]#first row

state    Ohio
year     2000
pop       1.5
Name: 0, dtype: object

frame3 = frame.set_index("state", drop=False)
print(frame3)

         state  year  pop
state                    
Ohio      Ohio  2000  1.5
Ohio      Ohio  2001  1.7
Ohio      Ohio  2002  3.6
Nevada  Nevada  2001  2.4
Nevada  Nevada  2002  2.9
Nevada  Nevada  2003  3.2

frame3.loc["Ohio"]

	year	pop
state
Ohio	2000	1.5
Ohio	2001	1.7
Ohio	2002	3.6

frame.iloc[2001]# this does not work because we do not have 2001 rows !

frame.iloc[0,1]

Asking for columns

#The columns are also an index:
frame.columns

Index(['state', 'year', 'pop'], dtype='object')

A column in a DataFrame can be retrieved MUCH easier: as a Series either by dictionary-like notation or by using the dot attribute notation:

frame["state"]

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

frame.year#equivalent to frame["year"]

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

Summary Stats

Just like in numpy you can compute sums, means, counts and many other summaries along rows and columns, by specifying the axis argument:

height = np.array([1.79, 1.85, 1.95, 1.55])
weight = np.array([70, 80, 85, 65])
hw = np.array([height, weight]).transpose()

hw

array([[ 1.79, 70.  ],
       [ 1.85, 80.  ],
       [ 1.95, 85.  ],
       [ 1.55, 65.  ]])

df = pd.DataFrame(hw, columns = ["height", "weight"]) 
print(df)

   height  weight
0    1.79    70.0
1    1.85    80.0
2    1.95    85.0
3    1.55    65.0

df = pd.DataFrame(hw , columns = ["height", "weight"],
                  index = ["Peter", "Matilda", "Bee", "Tom"]) 
print(df)

         height  weight
Peter      1.79    70.0
Matilda    1.85    80.0
Bee        1.95    85.0
Tom        1.55    65.0

Can you extract:

All weights
Peter’s height
Bee’s full info
the average height
get all persons with height greater than 180cm

#see Lab5

print(df.mean(axis=0))
print(df.mean(axis=1))# are these averages sensible ?

height     1.785
weight    75.000
dtype: float64
Peter      35.895
Matilda    40.925
Bee        43.475
Bee        33.275
dtype: float64

Some methods are neither reductions nor accumulations. describe is one such example, producing multiple summary statistics in one shot:

df.describe()

	height	weight
count	4.000	4.000000
mean	1.785	75.000000
std	0.170	9.128709
min	1.550	65.000000
25%	1.730	68.750000
50%	1.820	75.000000
75%	1.875	81.250000
max	1.950	85.000000

Built in data sets

Gapminder Data

https://www.gapminder.org/fw/world-health-chart/

https://www.ted.com/talks/hans_rosling_the_best_stats_you_ve_ever_seen#t-241405

You’ve never seen data presented like this. With the drama and urgency of a sportscaster, statistics guru Hans Rosling debunks myths about the so-called “developing world.”

!pip install gapminder
#!conda install gapminder
from gapminder import gapminder
#gapminder.to_csv("../datasets/gapminder.csv")

gapminder

	country	continent	year	lifeExp	pop	gdpPercap
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
1	Afghanistan	Asia	1957	30.332	9240934	820.853030
2	Afghanistan	Asia	1962	31.997	10267083	853.100710
3	Afghanistan	Asia	1967	34.020	11537966	836.197138
4	Afghanistan	Asia	1972	36.088	13079460	739.981106
...	...	...	...	...	...	...
1699	Zimbabwe	Africa	1987	62.351	9216418	706.157306
1700	Zimbabwe	Africa	1992	60.377	10704340	693.420786
1701	Zimbabwe	Africa	1997	46.809	11404948	792.449960
1702	Zimbabwe	Africa	2002	39.989	11926563	672.038623
1703	Zimbabwe	Africa	2007	43.487	12311143	469.709298

1704 rows × 6 columns

#find the unique years

#get the years:
gapminder["year"]
np.unique(gapminder.year)

array([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,
       2007])

#get all rows with year 1952:
#Hint:
#either use Boolean subsetting
gapminder["year"] == 1952
gapminder[gapminder["year"] == 1952]
#or use an index !!

	country	continent	year	lifeExp	pop	gdpPercap
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
12	Albania	Europe	1952	55.230	1282697	1601.056136
24	Algeria	Africa	1952	43.077	9279525	2449.008185
36	Angola	Africa	1952	30.015	4232095	3520.610273
48	Argentina	Americas	1952	62.485	17876956	5911.315053
...	...	...	...	...	...	...
1644	Vietnam	Asia	1952	40.412	26246839	605.066492
1656	West Bank and Gaza	Asia	1952	43.160	1030585	1515.592329
1668	Yemen, Rep.	Asia	1952	32.548	4963829	781.717576
1680	Zambia	Africa	1952	42.038	2672000	1147.388831
1692	Zimbabwe	Africa	1952	48.451	3080907	406.884115

142 rows × 6 columns

Handling Files

Get to know your friends

pd.read_csv
pd.read_table
pd.read_excel

'''url = "https://drive.google.com/file/d/1oIvCdN15UEwt4dCyjkArekHnTrivN43v/view?usp=share_link"
url='https://drive.google.com/uc?id=' + url.split('/')[-2]
gapminder = pd.read_csv(url, index_col=0)
gapminder.head()'''

	country	continent	year	lifeExp	pop	gdpPercap
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
1	Afghanistan	Asia	1957	30.332	9240934	820.853030
2	Afghanistan	Asia	1962	31.997	10267083	853.100710
3	Afghanistan	Asia	1967	34.020	11537966	836.197138
4	Afghanistan	Asia	1972	36.088	13079460	739.981106

gapminder.sort_values(by="year").head()

	country	continent	year	lifeExp	pop	gdpPercap
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
528	France	Europe	1952	67.410	42459667	7029.809327
540	Gabon	Africa	1952	37.003	420702	4293.476475
1656	West Bank and Gaza	Asia	1952	43.160	1030585	1515.592329
552	Gambia	Africa	1952	30.000	284320	485.230659

#How many countries?
CtryCts = gapminder["country"].value_counts()
CtryCts
#note the similarity with np.unique(..., return_counts=True)

Afghanistan          12
Pakistan             12
New Zealand          12
Nicaragua            12
Niger                12
                     ..
Eritrea              12
Equatorial Guinea    12
El Salvador          12
Egypt                12
Zimbabwe             12
Name: country, Length: 142, dtype: int64

from numpy.random import default_rng
rng = default_rng()
rng.choice(gapminder["country"].unique(),2)
gapminder["year"].unique()

array([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,
       2007])

#How meaningful are the column stats?
print(gapminder.mean(axis=0))
gapminder.describe()

year         1.979500e+03
lifeExp      5.947444e+01
pop          2.960121e+07
gdpPercap    7.215327e+03
dtype: float64

/var/folders/h4/k73g68ds6xj791sf8cpmlxlc0000gn/T/ipykernel_33611/633466148.py:2: FutureWarning: The default value of numeric_only in DataFrame.mean is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.
  print(gapminder.mean(axis=0))

	year	lifeExp	pop	gdpPercap
count	1704.00000	1704.000000	1.704000e+03	1704.000000
mean	1979.50000	59.474439	2.960121e+07	7215.327081
std	17.26533	12.917107	1.061579e+08	9857.454543
min	1952.00000	23.599000	6.001100e+04	241.165876
25%	1965.75000	48.198000	2.793664e+06	1202.060309
50%	1979.50000	60.712500	7.023596e+06	3531.846988
75%	1993.25000	70.845500	1.958522e+07	9325.462346
max	2007.00000	82.603000	1.318683e+09	113523.132900

Sort the index before you slice!!

Choose a time range and specific countries

gapminder2 = gapminder.set_index("year").sort_index()
gap1982_92 = gapminder2.loc[1982:1992].reset_index()
gap1982_92 = gap1982_92.set_index("country").sort_index()
gap1982_92.loc["Afghanistan":"Albania"]

	year	continent	lifeExp	pop	gdpPercap
country
Afghanistan	1982	Asia	39.854	12881816	978.011439
Afghanistan	1987	Asia	40.822	13867957	852.395945
Afghanistan	1992	Asia	41.674	16317921	649.341395
Albania	1992	Europe	71.581	3326498	2497.437901
Albania	1987	Europe	72.000	3075321	3738.932735
Albania	1982	Europe	70.420	2780097	3630.880722

gap1982_92.loc["Afghanistan":"Albania","lifeExp"].mean()

56.0585