import pandas as pd # Standard convention: import pandas as pd -- equivalent to library(tidyverse)
import numpy as np # NumPy often used alongside pandas
# Check version
pd.__version__'2.3.1'
pandasNow that you understand Python basics, let’s look at pandas – Python’s most popular library for data manipulation. If you’re familiar with R’s tidyverse (especially dplyr and tidyr), pandas will feel conceptually familiar, though the syntax differs.
pandas provides two main data structures:
Series: Like an R vector or a single columnDataFrame: Like an R tibble/data.frameYou can create DataFrames in several ways:
name age city
0 Alice 25 New York
1 Bob 30 London
2 Charlie 35 Paris
3 Diana 28 Tokyo
name age city
0 Alice 25 New York
1 Bob 30 Paris
2 Charlie 35 London
# from a list of lists (specify column names)
data = [
['Alice', 25, 'New York'],
['Bob', 30, 'Paris'],
['Charlie', 35, 'London']
]
df_list_of_lists = pd.DataFrame(data, columns=['name', 'age', 'city'])
# from a numpy array (essentially a matrix, specify column names)
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
dataarray([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
A B C
0 1 2 3
1 4 5 6
2 7 8 9
rdf <- tibble( name = c(‘Alice’, ‘Bob’, ‘Charlie’, ‘Diana’), age = c(25, 30, 35, 28), city = c(‘New York’, ‘London’, ‘Paris’, ‘Tokyo’) )
pandas works through methods rather than functions. Methods are attached to specific object types. A string has string methods, a list has list methods, a DataFrame has DataFrame methods.
There are several differences:
5
[1, 1, 3, 4, 5]
['hello', 'world']
'hello world'
[3, 1, 4, 1, 5, 9]
R primarily uses functions - you write head(df), nrow(df), names(df). Python often uses methods - you write df.head(), df.shape, df.columns.
This matters for pandas, since almost everything is a method and hence the syntax differs significantly.
pandas in actionViewing Data (note that objects remain unchanged):
name age city
0 Alice 25 New York
1 Bob 30 London
name age city
2 Charlie 35 Paris
3 Diana 28 Tokyo
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 4 non-null object
1 age 4 non-null int64
2 city 4 non-null object
dtypes: int64(1), object(2)
memory usage: 224.0+ bytes
age
count 4.000000
mean 29.500000
std 4.203173
min 25.000000
25% 27.250000
50% 29.000000
75% 31.250000
max 35.000000
Selecting Columns:
0 Alice
1 Bob
2 Charlie
3 Diana
Name: name, dtype: object
name
0 Alice
1 Bob
2 Charlie
3 Diana
name age
0 Alice 25
1 Bob 30
2 Charlie 35
3 Diana 28
name
0 Alice
1 Bob
2 Charlie
3 Diana
name age
0 Alice 25
1 Bob 30
2 Charlie 35
3 Diana 28
name age
0 Alice 25
1 Bob 30
2 Charlie 35
3 Diana 28
Filtering Rows:
name age city
1 Bob 30 London
2 Charlie 35 Paris
name age city
1 Bob 30 London
name age city
0 Alice 25 New York
1 Bob 30 London
2 Charlie 35 Paris
name age city
1 Bob 30 London
2 Charlie 35 Paris
3 Diana 28 Tokyo
Adding New Columns:
name age city age_in_months
0 Alice 25 New York 300
1 Bob 30 London 360
2 Charlie 35 Paris 420
3 Diana 28 Tokyo 336
name age city age_in_months year_of_birth
0 Alice 25 New York 300 2000
1 Bob 30 London 360 1995
2 Charlie 35 Paris 420 1990
3 Diana 28 Tokyo 336 1997
Sorting:
name age city age_in_months year_of_birth
0 Alice 25 New York 300 2000
3 Diana 28 Tokyo 336 1997
1 Bob 30 London 360 1995
2 Charlie 35 Paris 420 1990
name age city age_in_months year_of_birth
2 Charlie 35 Paris 420 1990
1 Bob 30 London 360 1995
3 Diana 28 Tokyo 336 1997
0 Alice 25 New York 300 2000
Read in Data:
Rank Title ... Revenue (Millions) Metascore
0 1 Guardians of the Galaxy ... 333.13 76.0
1 2 Prometheus ... 126.46 65.0
2 3 Split ... 138.12 62.0
3 4 Sing ... 270.32 59.0
4 5 Suicide Squad ... 325.02 40.0
[5 rows x 12 columns]
publisher city
0 37 ink atria New York City, NY
1 abrams New York City, NY
2 ace New York City, NY
3 alfred a knopf New York City, NY
4 algonquin Chapel Hill, NC
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Quick Data Exploration:
(244, 7)
['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size']
day
Sat 87
Sun 76
Thur 62
Fri 19
Name: count, dtype: int64
day
Fri 17.151579
Sat 20.441379
Sun 21.410000
Thur 17.682742
Name: total_bill, dtype: float64
day total_bill
0 Fri 17.151579
1 Sat 20.441379
2 Sun 21.410000
3 Thur 17.682742
pandas supports method chaining (similar to the pipe |>/%>% in R):
total_bill tip sex smoker day time size tip_pct
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
181 23.33 5.65 Male Yes Sun Dinner 2 0.242177
185 20.69 5.00 Male No Sun Dinner 5 0.241663
88 24.71 5.85 Male No Thur Lunch 2 0.236746
214 28.17 6.50 Female Yes Sat Dinner 3 0.230742
Joining (or merging) DataFrames is a fundamental operation in data analysis. pandas provides several ways to combine DataFrames, similar to SQL joins or R’s dplyr join functions.
# Create sample DataFrames
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'city': ['New York', 'London', 'Paris', 'Tokyo']
})
orders = pd.DataFrame({
'order_id': [101, 102, 103, 104, 105],
'customer_id': [1, 2, 2, 3, 5],
'amount': [100, 150, 200, 75, 300]
})Keep only rows that match in both DataFrames.
customer_id name city order_id amount
0 1 Alice New York 101 100
1 2 Bob London 102 150
2 2 Bob London 103 200
3 3 Charlie Paris 104 75
Keep all rows from the left DataFrame, fill missing values with NaN.
customer_id name city order_id amount
0 1 Alice New York 101.0 100.0
1 2 Bob London 102.0 150.0
2 2 Bob London 103.0 200.0
3 3 Charlie Paris 104.0 75.0
4 4 Diana Tokyo NaN NaN
Keep all rows from the right DataFrame.
customer_id name city order_id amount
0 1 Alice New York 101 100
1 2 Bob London 102 150
2 2 Bob London 103 200
3 3 Charlie Paris 104 75
4 5 NaN NaN 105 300
Keep all rows from both DataFrames.
customer_id name city order_id amount
0 1 Alice New York 101.0 100.0
1 2 Bob London 102.0 150.0
2 2 Bob London 103.0 200.0
3 3 Charlie Paris 104.0 75.0
4 4 Diana Tokyo NaN NaN
5 5 NaN NaN 105.0 300.0
# When join columns have different names
products = pd.DataFrame({
'prod_id': [1, 2, 3],
'product': ['Laptop', 'Phone', 'Tablet']
})
sales = pd.DataFrame({
'product_id': [1, 1, 2, 3],
'quantity': [5, 3, 8, 2]
})
# Specify left_on and right_on
merged = pd.merge(products, sales,
left_on='prod_id',
right_on='product_id',
how='inner')
print(merged) prod_id product product_id quantity
0 1 Laptop 1 5
1 1 Laptop 1 3
2 2 Phone 2 8
3 3 Tablet 3 2
prod_id product quantity
0 1 Laptop 5
1 1 Laptop 3
2 2 Phone 8
3 3 Tablet 2
# Join on multiple columns
df1 = pd.DataFrame({
'year': [2020, 2020, 2021, 2021],
'quarter': [1, 2, 1, 2],
'revenue': [100, 120, 110, 130]
})
df2 = pd.DataFrame({
'year': [2020, 2020, 2021],
'quarter': [1, 2, 1],
'costs': [80, 90, 85]
})
# Join on both year and quarter
merged = pd.merge(df1, df2, on=['year', 'quarter'], how='left')
merged year quarter revenue costs
0 2020 1 100 80.0
1 2020 2 120 90.0
2 2021 1 110 85.0
3 2021 2 130 NaN
# When both DataFrames have columns with same names (besides join key)
df1 = pd.DataFrame({
'id': [1, 2, 3],
'value': [10, 20, 30]
})
df2 = pd.DataFrame({
'id': [1, 2, 3],
'value': [100, 200, 300]
})
# pandas adds suffixes to distinguish columns
merged = pd.merge(df1, df2, on='id', suffixes=('_left', '_right'))
merged id value_left value_right
0 1 10 100
1 2 20 200
2 3 30 300
For simply stacking DataFrames vertically or horizontally without matching keys:
A B
0 1 3
1 2 4
2 5 7
3 6 8
A B C
0 1 3 9.0
1 2 4 NaN
# Real-world scenario: combining customer, order, and product data
customers = pd.DataFrame({
'customer_id': [1, 2, 3],
'customer_name': ['Alice', 'Bob', 'Charlie']
})
orders = pd.DataFrame({
'order_id': [101, 102, 103],
'customer_id': [1, 2, 1],
'product_id': [1, 2, 3]
})
products = pd.DataFrame({
'product_id': [1, 2, 3],
'product_name': ['Laptop', 'Phone', 'Tablet'],
'price': [1000, 800, 500]
})
# Chain multiple joins
result = (orders
.merge(customers, on='customer_id', how='left')
.merge(products, on='product_id', how='left')
.assign(total = lambda x: x['price'])
)
result order_id customer_id product_id customer_name product_name price total
0 101 1 1 Alice Laptop 1000 1000
1 102 2 2 Bob Phone 800 800
2 103 1 3 Alice Tablet 500 500
| pandas | R dplyr | Description |
|---|---|---|
pd.merge(df1, df2, how='inner') |
inner_join(df1, df2) |
Keep only matching rows |
pd.merge(df1, df2, how='left') |
left_join(df1, df2) |
Keep all rows from left |
pd.merge(df1, df2, how='right') |
right_join(df1, df2) |
Keep all rows from right |
pd.merge(df1, df2, how='outer') |
full_join(df1, df2) |
Keep all rows from both |
pd.concat([df1, df2]) |
bind_rows(df1, df2) |
Stack vertically |
pd.concat([df1, df2], axis=1) |
bind_cols(df1, df2) |
Stack horizontally |
pandas and Rpandas uses 0-based indexingpandas uses NaN (from NumPy), not NAThere is one important pitfall you need to pay attention to: pandas modifies in place unless you use .copy()!
A B
0 10 4
1 20 5
2 30 6
A B
0 1 4
1 2 5
2 3 6
A B
0 10 4
1 20 5
2 30 6
Take the dplyr exercises from Chapter 2 and perform them using pandas.
Open the IMDb file.
Rank Title ... Revenue (Millions) Metascore
0 1 Guardians of the Galaxy ... 333.13 76.0
1 2 Prometheus ... 126.46 65.0
2 3 Split ... 138.12 62.0
3 4 Sing ... 270.32 59.0
4 5 Suicide Squad ... 325.02 40.0
.. ... ... ... ... ...
995 996 Secret in Their Eyes ... NaN 45.0
996 997 Hostel: Part II ... 17.54 46.0
997 998 Step Up 2: The Streets ... 58.01 50.0
998 999 Search Party ... NaN 22.0
999 1000 Nine Lives ... 19.64 11.0
[1000 rows x 12 columns]
Title count
0 The Host 2
Title
0 The Host
Rank Title ... Revenue (Millions) Metascore
828 829 Grindhouse ... 25.03 NaN
[1 rows x 12 columns]
Rank Title ... Revenue (Millions) Metascore
0 1 Guardians of the Galaxy ... 333.13 76.0
[1 rows x 12 columns]
Rank Title Genre ... Votes Revenue (Millions) Metascore
478 479 Paint It Black Drama ... 61 NaN 71.0
[1 rows x 12 columns]
Rank Title ... Revenue (Millions) Metascore
12 13 Rogue One ... 532.17 65.0
[1 rows x 12 columns]
Year Revenue (Millions)
0 2006 3624.46
1 2007 4306.23
2 2008 5053.22
3 2009 5292.26
4 2010 5989.65
5 2011 5431.96
6 2012 6910.29
7 2013 7666.72
8 2014 7997.40
9 2015 8854.12
10 2016 11211.65
mutate() before).mode() from the pandas package). Rank Title ... avg_runtime runtime
0 1 Guardians of the Galaxy ... 113.172 121
1 2 Prometheus ... 113.172 124
2 3 Split ... 113.172 117
4 5 Suicide Squad ... 113.172 123
6 7 La La Land ... 113.172 128
.. ... ... ... ... ...
979 980 The Skin I Live In ... 113.172 120
981 982 Annie ... 113.172 118
982 983 Across the Universe ... 113.172 133
989 990 Selma ... 113.172 128
991 992 Taare Zameen Par ... 113.172 165
[433 rows x 14 columns]
Rank ... Metascore
50 51 ... 81.0
140 141 ... 82.0
362 363 ... 72.0
497 498 ... 72.0
869 870 ... 66.0
[5 rows x 12 columns]
Rank Title ... Metascore median_votes
0 1 Guardians of the Galaxy ... 76.0 110799.0
1 2 Prometheus ... 65.0 110799.0
2 3 Split ... 62.0 110799.0
4 5 Suicide Squad ... 40.0 110799.0
6 7 La La Land ... 93.0 110799.0
.. ... ... ... ... ...
971 972 Disturbia ... NaN 110799.0
983 984 Let's Be Cops ... 30.0 110799.0
990 991 Underworld: Rise of the Lycans ... 44.0 110799.0
993 994 Resident Evil: Afterlife ... 37.0 110799.0
994 995 Project X ... 48.0 110799.0
[500 rows x 13 columns]
Rank Title ... Revenue (Millions) Metascore
0 1 Guardians of the Galaxy ... 333.13 76.0
1 2 Prometheus ... 126.46 65.0
2 3 Split ... 138.12 62.0
4 5 Suicide Squad ... 325.02 40.0
6 7 La La Land ... 151.06 93.0
.. ... ... ... ... ...
971 972 Disturbia ... 80.05 NaN
983 984 Let's Be Cops ... 82.39 30.0
990 991 Underworld: Rise of the Lycans ... 45.80 44.0
993 994 Resident Evil: Afterlife ... 60.13 37.0
994 995 Project X ... 54.72 48.0
[500 rows x 12 columns]
Rank ... Metascore
8 9 ... 78.0
32 33 ... 52.0
39 40 ... NaN
48 49 ... 68.0
71 72 ... 60.0
75 76 ... 50.0
93 94 ... 57.0
134 135 ... 61.0
139 140 ... 44.0
212 213 ... 61.0
221 222 ... 73.0
226 227 ... 82.0
248 249 ... 75.0
259 260 ... 76.0
276 277 ... 71.0
282 283 ... NaN
294 295 ... 55.0
298 299 ... 66.0
352 353 ... 47.0
378 379 ... 64.0
385 386 ... 64.0
402 403 ... NaN
435 436 ... NaN
436 437 ... 33.0
446 447 ... 59.0
497 498 ... 72.0
513 514 ... 72.0
515 516 ... 72.0
543 544 ... NaN
551 552 ... 79.0
554 555 ... 33.0
556 557 ... 77.0
571 572 ... 75.0
572 573 ... 51.0
592 593 ... 69.0
664 665 ... 88.0
689 690 ... 85.0
702 703 ... 60.0
706 707 ... NaN
721 722 ... 76.0
737 738 ... 57.0
807 808 ... 82.0
842 843 ... 80.0
843 844 ... 73.0
853 854 ... 70.0
863 864 ... 81.0
892 893 ... 56.0
894 895 ... 46.0
895 896 ... 55.0
896 897 ... NaN
929 930 ... NaN
988 989 ... 89.0
[52 rows x 12 columns]
pandas cheatsheetpandas creator)