import pandas as pd # hopefully colab has this preinstalled
import numpy as np5 Intro to pandas
In this Introduction to pandas we will get to know and become experts in:
- Data Frames
- Slicing
- Counting and Summary Statistics
- Handling Files
Relevant DataCamp lessons:
- Data manipulation with pandas, Chaps 1-4
- Matplotlib, Chap 1
Introduction to pandas
While numpy offers a lot of powerful numerical capabilities it lacks some of the necessary convenience and natural of handling data as we encounter them. For example, we would typically like to - mix data types (strings, numbers, categories, Boolean, …) - refer to columns and rows by names - summarize and visualize data in efficient pivot style manners
All of the above (and more) can be achieved easily by extending the concept of an array (or a matrix) to a so called dataframe.
There are many ways to construct a DataFrame, though one of the most common is from a dictionary of equal-length lists or NumPy arrays:
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
"year": [2000, 2001, 2002, 2001, 2002, 2003],
"pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)# creates a dataframe out of the data given!
frame.head(3)| state | year | pop | |
|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 |
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
frame[2,1]#too bad#to get the full row: use the .iloc method
frame.iloc[2]
frame.iloc[2,1]2002
Subsetting/Slicing
We first need to understand the attributes index (=rownames) and columns (= column names):
frame.indexRangeIndex(start=0, stop=6, step=1)
#We can set a column as an index:
frame2 = frame.set_index("year")
print(frame2)
# state pop
year
2000 Ohio 1.5
2001 Ohio 1.7
2002 Ohio 3.6
2001 Nevada 2.4
2002 Nevada 2.9
2003 Nevada 3.2
#it would be nice to access elements in the same fashion as numpy
#frame2[1,1]
frame["pop"]0 1.5
1 1.7
2 3.6
3 2.4
4 2.9
5 3.2
Name: pop, dtype: float64
frame.pop<bound method DataFrame.pop of state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2>
Asking for rows
Unfortunately, we cannot use the simple [row,col] notation that we are used to from numpy arrays. (Try asking for frame[0,1])
Instead, row subsetting can be achieved with either the .loc() or the .iloc() methods. The latter takes integers, the former indices:
frame2.loc[2001] #note that I am not using quotes !!
#at first glance this looks like I am asking for the row number 2001 !!| state | pop | |
|---|---|---|
| year | ||
| 2001 | Ohio | 1.7 |
| 2001 | Nevada | 2.4 |
frame2.loc[2001,"state"]year
2001 Ohio
2001 Nevada
Name: state, dtype: object
frame.iloc[0]#first rowstate Ohio
year 2000
pop 1.5
Name: 0, dtype: object
frame3 = frame.set_index("state", drop=False)
print(frame3) state year pop
state
Ohio Ohio 2000 1.5
Ohio Ohio 2001 1.7
Ohio Ohio 2002 3.6
Nevada Nevada 2001 2.4
Nevada Nevada 2002 2.9
Nevada Nevada 2003 3.2
frame3.loc["Ohio"]| year | pop | |
|---|---|---|
| state | ||
| Ohio | 2000 | 1.5 |
| Ohio | 2001 | 1.7 |
| Ohio | 2002 | 3.6 |
frame.iloc[2001]# this does not work because we do not have 2001 rows !frame.iloc[0,1]2000
Asking for columns
#The columns are also an index:
frame.columnsIndex(['state', 'year', 'pop'], dtype='object')
A column in a DataFrame can be retrieved MUCH easier: as a Series either by dictionary-like notation or by using the dot attribute notation:
frame["state"]0 Ohio
1 Ohio
2 Ohio
3 Nevada
4 Nevada
5 Nevada
Name: state, dtype: object
frame.year#equivalent to frame["year"]0 2000
1 2001
2 2002
3 2001
4 2002
5 2003
Name: year, dtype: int64
Summary Stats
Just like in numpy you can compute sums, means, counts and many other summaries along rows and columns, by specifying the axis argument:
height = np.array([1.79, 1.85, 1.95, 1.55])
weight = np.array([70, 80, 85, 65])
hw = np.array([height, weight]).transpose()
hwarray([[ 1.79, 70. ],
[ 1.85, 80. ],
[ 1.95, 85. ],
[ 1.55, 65. ]])
df = pd.DataFrame(hw, columns = ["height", "weight"])
print(df) height weight
0 1.79 70.0
1 1.85 80.0
2 1.95 85.0
3 1.55 65.0
df = pd.DataFrame(hw , columns = ["height", "weight"],
index = ["Peter", "Matilda", "Bee", "Tom"])
print(df) height weight
Peter 1.79 70.0
Matilda 1.85 80.0
Bee 1.95 85.0
Tom 1.55 65.0
Can you extract:
- All weights
- Peter’s height
- Bee’s full info
- the average height
- get all persons with height greater than 180cm
#see Lab5print(df.mean(axis=0))
print(df.mean(axis=1))# are these averages sensible ?height 1.785
weight 75.000
dtype: float64
Peter 35.895
Matilda 40.925
Bee 43.475
Bee 33.275
dtype: float64
Some methods are neither reductions nor accumulations. describe is one such example, producing multiple summary statistics in one shot:
df.describe()| height | weight | |
|---|---|---|
| count | 4.000 | 4.000000 |
| mean | 1.785 | 75.000000 |
| std | 0.170 | 9.128709 |
| min | 1.550 | 65.000000 |
| 25% | 1.730 | 68.750000 |
| 50% | 1.820 | 75.000000 |
| 75% | 1.875 | 81.250000 |
| max | 1.950 | 85.000000 |
Built in data sets
Gapminder Data
https://www.gapminder.org/fw/world-health-chart/
https://www.ted.com/talks/hans_rosling_the_best_stats_you_ve_ever_seen#t-241405
You’ve never seen data presented like this. With the drama and urgency of a sportscaster, statistics guru Hans Rosling debunks myths about the so-called “developing world.”
!pip install gapminder
#!conda install gapminder
from gapminder import gapminder
#gapminder.to_csv("../datasets/gapminder.csv")gapminder| country | continent | year | lifeExp | pop | gdpPercap | |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
| 1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 |
| 2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 |
| 3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 |
| 4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 |
| ... | ... | ... | ... | ... | ... | ... |
| 1699 | Zimbabwe | Africa | 1987 | 62.351 | 9216418 | 706.157306 |
| 1700 | Zimbabwe | Africa | 1992 | 60.377 | 10704340 | 693.420786 |
| 1701 | Zimbabwe | Africa | 1997 | 46.809 | 11404948 | 792.449960 |
| 1702 | Zimbabwe | Africa | 2002 | 39.989 | 11926563 | 672.038623 |
| 1703 | Zimbabwe | Africa | 2007 | 43.487 | 12311143 | 469.709298 |
1704 rows × 6 columns
#find the unique years
#get the years:
gapminder["year"]
np.unique(gapminder.year)array([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,
2007])
#get all rows with year 1952:
#Hint:
#either use Boolean subsetting
gapminder["year"] == 1952
gapminder[gapminder["year"] == 1952]
#or use an index !!| country | continent | year | lifeExp | pop | gdpPercap | |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
| 12 | Albania | Europe | 1952 | 55.230 | 1282697 | 1601.056136 |
| 24 | Algeria | Africa | 1952 | 43.077 | 9279525 | 2449.008185 |
| 36 | Angola | Africa | 1952 | 30.015 | 4232095 | 3520.610273 |
| 48 | Argentina | Americas | 1952 | 62.485 | 17876956 | 5911.315053 |
| ... | ... | ... | ... | ... | ... | ... |
| 1644 | Vietnam | Asia | 1952 | 40.412 | 26246839 | 605.066492 |
| 1656 | West Bank and Gaza | Asia | 1952 | 43.160 | 1030585 | 1515.592329 |
| 1668 | Yemen, Rep. | Asia | 1952 | 32.548 | 4963829 | 781.717576 |
| 1680 | Zambia | Africa | 1952 | 42.038 | 2672000 | 1147.388831 |
| 1692 | Zimbabwe | Africa | 1952 | 48.451 | 3080907 | 406.884115 |
142 rows × 6 columns
Handling Files
Get to know your friends
pd.read_csvpd.read_tablepd.read_excel
'''url = "https://drive.google.com/file/d/1oIvCdN15UEwt4dCyjkArekHnTrivN43v/view?usp=share_link"
url='https://drive.google.com/uc?id=' + url.split('/')[-2]
gapminder = pd.read_csv(url, index_col=0)
gapminder.head()'''| country | continent | year | lifeExp | pop | gdpPercap | |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
| 1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 |
| 2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 |
| 3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 |
| 4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 |
gapminder.sort_values(by="year").head()| country | continent | year | lifeExp | pop | gdpPercap | |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
| 528 | France | Europe | 1952 | 67.410 | 42459667 | 7029.809327 |
| 540 | Gabon | Africa | 1952 | 37.003 | 420702 | 4293.476475 |
| 1656 | West Bank and Gaza | Asia | 1952 | 43.160 | 1030585 | 1515.592329 |
| 552 | Gambia | Africa | 1952 | 30.000 | 284320 | 485.230659 |
#How many countries?
CtryCts = gapminder["country"].value_counts()
CtryCts
#note the similarity with np.unique(..., return_counts=True)Afghanistan 12
Pakistan 12
New Zealand 12
Nicaragua 12
Niger 12
..
Eritrea 12
Equatorial Guinea 12
El Salvador 12
Egypt 12
Zimbabwe 12
Name: country, Length: 142, dtype: int64
from numpy.random import default_rng
rng = default_rng()
rng.choice(gapminder["country"].unique(),2)
gapminder["year"].unique()array([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,
2007])
#How meaningful are the column stats?
print(gapminder.mean(axis=0))
gapminder.describe()year 1.979500e+03
lifeExp 5.947444e+01
pop 2.960121e+07
gdpPercap 7.215327e+03
dtype: float64
/var/folders/h4/k73g68ds6xj791sf8cpmlxlc0000gn/T/ipykernel_33611/633466148.py:2: FutureWarning: The default value of numeric_only in DataFrame.mean is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.
print(gapminder.mean(axis=0))
| year | lifeExp | pop | gdpPercap | |
|---|---|---|---|---|
| count | 1704.00000 | 1704.000000 | 1.704000e+03 | 1704.000000 |
| mean | 1979.50000 | 59.474439 | 2.960121e+07 | 7215.327081 |
| std | 17.26533 | 12.917107 | 1.061579e+08 | 9857.454543 |
| min | 1952.00000 | 23.599000 | 6.001100e+04 | 241.165876 |
| 25% | 1965.75000 | 48.198000 | 2.793664e+06 | 1202.060309 |
| 50% | 1979.50000 | 60.712500 | 7.023596e+06 | 3531.846988 |
| 75% | 1993.25000 | 70.845500 | 1.958522e+07 | 9325.462346 |
| max | 2007.00000 | 82.603000 | 1.318683e+09 | 113523.132900 |
Sort the index before you slice!!
Choose a time range and specific countries
gapminder2 = gapminder.set_index("year").sort_index()
gap1982_92 = gapminder2.loc[1982:1992].reset_index()
gap1982_92 = gap1982_92.set_index("country").sort_index()
gap1982_92.loc["Afghanistan":"Albania"]| year | continent | lifeExp | pop | gdpPercap | |
|---|---|---|---|---|---|
| country | |||||
| Afghanistan | 1982 | Asia | 39.854 | 12881816 | 978.011439 |
| Afghanistan | 1987 | Asia | 40.822 | 13867957 | 852.395945 |
| Afghanistan | 1992 | Asia | 41.674 | 16317921 | 649.341395 |
| Albania | 1992 | Europe | 71.581 | 3326498 | 2497.437901 |
| Albania | 1987 | Europe | 72.000 | 3075321 | 3738.932735 |
| Albania | 1982 | Europe | 70.420 | 2780097 | 3630.880722 |
gap1982_92.loc["Afghanistan":"Albania","lifeExp"].mean()56.0585