import pandas as pd # Standard convention: import pandas as pd -- equivalent to library(tidyverse)
import numpy as np # NumPy often used alongside pandas
# Check version
pd.__version__
'2.3.1'
pandas

Now that you understand Python basics, let’s look at pandas – Python’s most popular library for data manipulation. If you’re familiar with R’s tidyverse (especially dplyr and tidyr), pandas will feel conceptually familiar, though the syntax differs.
pandas provides two main data structures:
- Series: like an R vector or a single column
- DataFrame: like an R tibble/data.frame
You can create DataFrames in several ways:
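Before building DataFrames, a minimal sketch of a Series on its own (the names and values are made up for illustration):

```python
import pandas as pd

# A Series is a one-dimensional labeled array, like a named vector in R
s = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Charlie'], name='age')
print(s['Bob'])   # label-based access, like ages[["Bob"]] in R
print(s.mean())   # Series come with their own methods
```

Each column of a DataFrame is itself a Series, which is why column selections below return Series objects.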
# From a dictionary (like tibble() in R)
df_dic = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'city': ['New York', 'London', 'Paris', 'Tokyo']
})
df_dic
name age city
0 Alice 25 New York
1 Bob 30 London
2 Charlie 35 Paris
3 Diana 28 Tokyo
# from a list of dictionaries
data = [
{'name': 'Alice', 'age': 25, 'city': 'New York'},
{'name': 'Bob', 'age': 30, 'city': 'Paris'},
{'name': 'Charlie', 'age': 35, 'city': 'London'}
]
df_dic_list = pd.DataFrame(data)
df_dic_list
name age city
0 Alice 25 New York
1 Bob 30 Paris
2 Charlie 35 London
# from a list of lists (specify column names)
data = [
['Alice', 25, 'New York'],
['Bob', 30, 'Paris'],
['Charlie', 35, 'London']
]
df_list_of_lists = pd.DataFrame(data, columns=['name', 'age', 'city'])
# from a numpy array (essentially a matrix, specify column names)
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
data
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
pd.DataFrame(data, columns=['A', 'B', 'C']) # supply column names; the default is 0, 1, 2
A B C
0 1 2 3
1 4 5 6
2 7 8 9
The R equivalent would be:
rdf <- tibble(
  name = c('Alice', 'Bob', 'Charlie', 'Diana'),
  age = c(25, 30, 35, 28),
  city = c('New York', 'London', 'Paris', 'Tokyo')
)
pandas works through methods rather than functions. Methods are attached to specific object types. A string has string methods, a list has list methods, a DataFrame has DataFrame methods.
Compare the two call styles:
# Function syntax: function(object, arguments)
my_list = [3, 1, 4, 1, 5]
len(my_list) # len is a function
5
sorted(my_list) # sorted is also a function (returns a new list)
[1, 1, 3, 4, 5]
# Method syntax: object.method(arguments)
my_string = "hello world"
upper_string = my_string.upper() # upper() is a method of strings
my_string.split() # split() is a method of strings
['hello', 'world']
my_string # the original string is unchanged
'hello world'
my_list.append(9) # append() is a method of lists; note that this modifies the list in place
my_list
[3, 1, 4, 1, 5, 9]
R primarily uses functions: you write head(df), nrow(df), names(df). Python often uses methods and attributes: you write df.head(), df.shape, df.columns (the latter two are attributes and take no parentheses).
This matters for pandas, since almost everything is a method and hence the syntax differs significantly.
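To make the function-versus-method contrast concrete, here is the same toy DataFrame inspected both ways (a sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

# Function style: a plain Python function applied to the object
print(len(df))      # number of rows

# Method/attribute style: the pandas norm
print(df.head(1))   # method call with parentheses
print(df.shape)     # attribute access, no parentheses
```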
pandas in action

Viewing Data (note that objects remain unchanged):
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'city': ['New York', 'London', 'Paris', 'Tokyo']
})
# First few rows (like head() in R)
df.head(2)
name age city
0 Alice 25 New York
1 Bob 30 London
# Last rows (like tail() in R)
df.tail(2)
name age city
2 Charlie 35 Paris
3 Diana 28 Tokyo
# Structure overview (like str()/glimpse() in R)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 4 non-null object
1 age 4 non-null int64
2 city 4 non-null object
dtypes: int64(1), object(2)
memory usage: 224.0+ bytes
# Summary statistics for numeric columns (like summary() in R)
df.describe()
age
count 4.000000
mean 29.500000
std 4.203173
min 25.000000
25% 27.250000
50% 29.000000
75% 31.250000
max 35.000000
Selecting Columns:
# Single column -> a Series (like df$name in R)
df['name']
0 Alice
1 Bob
2 Charlie
3 Diana
Name: name, dtype: object
# Double brackets -> a one-column DataFrame
df[['name']]
name
0 Alice
1 Bob
2 Charlie
3 Diana
# Multiple columns (like select() in R)
df[['name', 'age']]
name age
0 Alice 25
1 Bob 30
2 Charlie 35
3 Diana 28
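Bracket selection covers most needs, but pandas also offers .loc (label-based) and .iloc (position-based) indexing; a short sketch using the same example data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['New York', 'London', 'Paris', 'Tokyo']
})

# .loc: select by row label and column name
print(df.loc[0, 'name'])      # 'Alice'

# .iloc: select by integer position (0-based, end-exclusive)
print(df.iloc[1:3, 0:2])      # rows 1-2, first two columns
```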
Filtering Rows:
# Boolean filtering (like filter() in R)
df[df['age'] >= 30]
name age city
1 Bob 30 London
2 Charlie 35 Paris
# Combine conditions with & (and) and | (or); parentheses are required
df[(df['age'] >= 30) & (df['city'] == 'London')]
name age city
1 Bob 30 London
# Filter something depending on its presence in another list using .isin() (R equivalent: %in%)
cities_ive_been_to = ["Paris", "New York", "London"]
df[df['city'].isin(cities_ive_been_to)]
name age city
0 Alice 25 New York
1 Bob 30 London
2 Charlie 35 Paris
# query() takes a string expression, an alternative to boolean indexing
df.query('age > 25')
name age city
1 Bob 30 London
2 Charlie 35 Paris
3 Diana 28 Tokyo
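Filtering on text works through the .str accessor, roughly the counterpart of stringr's str_detect() in R; a sketch using the same example data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'city': ['New York', 'London', 'Paris', 'Tokyo']
})

# Keep rows where the city name contains the letter 'o'
with_o = df[df['city'].str.contains('o')]
print(with_o)     # New York, London, Tokyo

# Negate a boolean condition with ~
without_o = df[~df['city'].str.contains('o')]
print(without_o)  # Paris
```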
Adding New Columns:
# Add a new column (like mutate() in R)
df['age_in_months'] = df['age'] * 12 # in R: df$age_in_months <- df$age * 12
df
name age city age_in_months
0 Alice 25 New York 300
1 Bob 30 London 360
2 Charlie 35 Paris 420
3 Diana 28 Tokyo 336
# Derived column (assuming the current year is 2025)
df['year_of_birth'] = 2025 - df['age']
df
name age city age_in_months year_of_birth
0 Alice 25 New York 300 2000
1 Bob 30 London 360 1995
2 Charlie 35 Paris 420 1990
3 Diana 28 Tokyo 336 1997
Sorting:
# Sort ascending (like arrange() in R)
df.sort_values('age')
name age city age_in_months year_of_birth
0 Alice 25 New York 300 2000
3 Diana 28 Tokyo 336 1997
1 Bob 30 London 360 1995
2 Charlie 35 Paris 420 1990
# Sort descending (like arrange(desc(age)) in R)
df.sort_values('age', ascending=False)
name age city age_in_months year_of_birth
2 Charlie 35 Paris 420 1990
1 Bob 30 London 360 1995
3 Diana 28 Tokyo 336 1997
0 Alice 25 New York 300 2000
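sort_values() also accepts several columns with a direction per column, like arrange(city, desc(age)) in R; a sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Paris', 'London', 'Paris', 'London'],
    'age': [35, 30, 22, 41]
})

# Sort by city ascending, then by age descending within each city
out = df.sort_values(['city', 'age'], ascending=[True, False])
print(out)
```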
Read in Data:
# The next two outputs come from pd.read_csv() on local files (the file paths are not shown here):
Rank Title ... Revenue (Millions) Metascore
0 1 Guardians of the Galaxy ... 333.13 76.0
1 2 Prometheus ... 126.46 65.0
2 3 Split ... 138.12 62.0
3 4 Sing ... 270.32 59.0
4 5 Suicide Squad ... 325.02 40.0
[5 rows x 12 columns]
publisher city
0 37 ink atria New York City, NY
1 abrams New York City, NY
2 ace New York City, NY
3 alfred a knopf New York City, NY
4 algonquin Chapel Hill, NC
# Read from URL
df_example = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')
df_example.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Quick Data Exploration:
df_example.shape # (rows, columns), like dim() in R
(244, 7)
df_example.columns.tolist() # column names, like names() in R
['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size']
df_example['day'].value_counts() # frequency table, like count() in R
day
Sat 87
Sun 76
Thur 62
Fri 19
Name: count, dtype: int64
# Group by and aggregate (yields a Series)
df_example.groupby('day')['total_bill'].mean() # in R: df |> group_by(day) |> summarize(mean_bill = mean(total_bill)) |> pull(mean_bill) |> set_names(day)
day
Fri 17.151579
Sat 20.441379
Sun 21.410000
Thur 17.682742
Name: total_bill, dtype: float64
# Group by and aggregate (yields a DataFrame)
df_example.groupby('day')['total_bill'].mean().reset_index() # in R: df |> group_by(day) |> summarize(mean_bill = mean(total_bill))
day total_bill
0 Fri 17.151579
1 Sat 20.441379
2 Sun 21.410000
3 Thur 17.682742
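groupby() can also compute several aggregates at once via .agg(), like a summarize() with multiple expressions in R; a sketch on made-up data (since the tips dataset requires a download):

```python
import pandas as pd

df = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat', 'Sat', 'Sat'],
    'bill': [10.0, 20.0, 30.0, 10.0, 20.0]
})

# Several aggregations per group in one pass
summary = (df.groupby('day')['bill']
             .agg(['mean', 'sum', 'count'])
             .reset_index())
print(summary)
```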
pandas supports method chaining (similar to the pipe |>/%>% in R):
# Chain multiple operations
result = (df_example
.query('total_bill > 20') # Filter
.assign(tip_pct = lambda x: x['tip'] / x['total_bill']) # New column
.sort_values('tip_pct', ascending=False) # Sort
.head(5) # Top 5
)
result
total_bill tip sex smoker day time size tip_pct
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
181 23.33 5.65 Male Yes Sun Dinner 2 0.242177
185 20.69 5.00 Male No Sun Dinner 5 0.241663
88 24.71 5.85 Male No Thur Lunch 2 0.236746
214 28.17 6.50 Female Yes Sat Dinner 3 0.230742
Joining (or merging) DataFrames is a fundamental operation in data analysis. pandas provides several ways to combine DataFrames, similar to SQL joins or R’s dplyr join functions.
# Create sample DataFrames
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'city': ['New York', 'London', 'Paris', 'Tokyo']
})
orders = pd.DataFrame({
'order_id': [101, 102, 103, 104, 105],
'customer_id': [1, 2, 2, 3, 5],
'amount': [100, 150, 200, 75, 300]
})
The how argument selects among four join types:
- inner: keep only rows that match in both DataFrames
- left: keep all rows from the left DataFrame, fill missing values with NaN
- right: keep all rows from the right DataFrame
- outer: keep all rows from both DataFrames
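For comparison with the outer join shown in this section, here is how the inner and left variants behave on the same customers/orders frames (a sketch):

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'city': ['New York', 'London', 'Paris', 'Tokyo']
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 2, 3, 5],
    'amount': [100, 150, 200, 75, 300]
})

# Inner join: only matching rows (drops Diana and order 105)
inner = pd.merge(customers, orders, on='customer_id', how='inner')
print(inner.shape)  # 4 rows

# Left join: all customers kept, NaN for Diana's missing order
left = pd.merge(customers, orders, on='customer_id', how='left')
print(left.shape)   # 5 rows
```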
# Outer join - all customers and all orders
outer = pd.merge(customers, orders, on='customer_id', how='outer')
outer
customer_id name city order_id amount
0 1 Alice New York 101.0 100.0
1 2 Bob London 102.0 150.0
2 2 Bob London 103.0 200.0
3 3 Charlie Paris 104.0 75.0
4 4 Diana Tokyo NaN NaN
5 5 NaN NaN 105.0 300.0
# When join columns have different names
products = pd.DataFrame({
'prod_id': [1, 2, 3],
'product': ['Laptop', 'Phone', 'Tablet']
})
sales = pd.DataFrame({
'product_id': [1, 1, 2, 3],
'quantity': [5, 3, 8, 2]
})
# Specify left_on and right_on
merged = pd.merge(products, sales,
left_on='prod_id',
right_on='product_id',
how='inner')
print(merged)
prod_id product product_id quantity
0 1 Laptop 1 5
1 1 Laptop 1 3
2 2 Phone 2 8
3 3 Tablet 3 2
# Clean up: drop redundant column
merged = pd.merge(products, sales,
left_on='prod_id',
right_on='product_id',
how='inner').drop('product_id', axis=1) # axis=1 drops a column; axis=0 would look for a row labeled 'product_id' (which doesn't exist)
merged
prod_id product quantity
0 1 Laptop 5
1 1 Laptop 3
2 2 Phone 8
3 3 Tablet 2
# Join on multiple columns
df1 = pd.DataFrame({
'year': [2020, 2020, 2021, 2021],
'quarter': [1, 2, 1, 2],
'revenue': [100, 120, 110, 130]
})
df2 = pd.DataFrame({
'year': [2020, 2020, 2021],
'quarter': [1, 2, 1],
'costs': [80, 90, 85]
})
# Join on both year and quarter
merged = pd.merge(df1, df2, on=['year', 'quarter'], how='left')
merged
year quarter revenue costs
0 2020 1 100 80.0
1 2020 2 120 90.0
2 2021 1 110 85.0
3 2021 2 130 NaN
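When a join should be one-to-one, merge's validate parameter can catch accidental duplicate keys early; a sketch with made-up data:

```python
import pandas as pd

df1 = pd.DataFrame({'year': [2020, 2021], 'revenue': [100, 110]})
df2 = pd.DataFrame({'year': [2020, 2021], 'costs': [80, 85]})

# validate='one_to_one' raises MergeError if either side has duplicate keys
merged = pd.merge(df1, df2, on='year', validate='one_to_one')
print(merged)

# With a duplicated key on the right, validation fails
df2_dup = pd.DataFrame({'year': [2020, 2020], 'costs': [80, 85]})
try:
    pd.merge(df1, df2_dup, on='year', validate='one_to_one')
except pd.errors.MergeError:
    print('duplicate keys detected')
```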
# When both DataFrames have columns with same names (besides join key)
df1 = pd.DataFrame({
'id': [1, 2, 3],
'value': [10, 20, 30]
})
df2 = pd.DataFrame({
'id': [1, 2, 3],
'value': [100, 200, 300]
})
# pandas adds suffixes to distinguish columns
merged = pd.merge(df1, df2, on='id', suffixes=('_left', '_right'))
merged
id value_left value_right
0 1 10 100
1 2 20 200
2 3 30 300
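merge also accepts indicator=True, which adds a _merge column recording where each row came from, useful for auditing outer joins; a sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
df2 = pd.DataFrame({'id': [2, 3, 4], 'other': [200, 300, 400]})

# _merge is one of 'left_only', 'right_only', 'both'
merged = pd.merge(df1, df2, on='id', how='outer', indicator=True)
print(merged[['id', '_merge']])
```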
For simply stacking DataFrames vertically or horizontally without matching keys:
# Vertical concatenation (like rbind/bind_rows() in R)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
vertical = pd.concat([df1, df2], ignore_index=True, axis=0)
vertical
A B
0 1 3
1 2 4
2 5 7
3 6 8
# Horizontal concatenation (like cbind in R)
df3 = pd.DataFrame({'C': [9]})
horizontal = pd.concat([df1, df3], axis=1)
horizontal
A B C
0 1 3 9.0
1 2 4 NaN
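Note that concat aligns columns by name: columns present in only one frame are filled with NaN in the other's rows. A sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5], 'C': [6]})

# Columns are matched by name; missing entries become NaN
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)
```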
# Real-world scenario: combining customer, order, and product data
customers = pd.DataFrame({
'customer_id': [1, 2, 3],
'customer_name': ['Alice', 'Bob', 'Charlie']
})
orders = pd.DataFrame({
'order_id': [101, 102, 103],
'customer_id': [1, 2, 1],
'product_id': [1, 2, 3]
})
products = pd.DataFrame({
'product_id': [1, 2, 3],
'product_name': ['Laptop', 'Phone', 'Tablet'],
'price': [1000, 800, 500]
})
# Chain multiple joins
result = (orders
.merge(customers, on='customer_id', how='left')
.merge(products, on='product_id', how='left')
.assign(total = lambda x: x['price'])
)
result
order_id customer_id product_id customer_name product_name price total
0 101 1 1 Alice Laptop 1000 1000
1 102 2 2 Bob Phone 800 800
2 103 1 3 Alice Tablet 500 500
| pandas | R dplyr | Description |
|---|---|---|
| `pd.merge(df1, df2, how='inner')` | `inner_join(df1, df2)` | Keep only matching rows |
| `pd.merge(df1, df2, how='left')` | `left_join(df1, df2)` | Keep all rows from left |
| `pd.merge(df1, df2, how='right')` | `right_join(df1, df2)` | Keep all rows from right |
| `pd.merge(df1, df2, how='outer')` | `full_join(df1, df2)` | Keep all rows from both |
| `pd.concat([df1, df2])` | `bind_rows(df1, df2)` | Stack vertically |
| `pd.concat([df1, df2], axis=1)` | `bind_cols(df1, df2)` | Stack horizontally |
pandas and R

Two differences from R to keep in mind:
- pandas uses 0-based indexing
- pandas uses NaN (from NumPy), not NA
There is one important pitfall you need to pay attention to: assigning a DataFrame to a new name (df2 = df) creates a reference, not a copy, so modifying one modifies the other. Use .copy() when you need an independent object!
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# This creates a reference, NOT a copy
df2 = df
# Modify df2
df2['A'] = [10, 20, 30]
# Surprise! df is also changed
print(df)
A B
0 10 4
1 20 5
2 30 6
### FIX
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# This creates a copy, NOT a reference
df2 = df.copy()
# Modify df2
df2['A'] = [10, 20, 30]
# thank god! df did not change
print(df)
A B
0 1 4
1 2 5
2 3 6
print(df2) # df2 carries the modified values
A B
0 10 4
1 20 5
2 30 6
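Related to the reference-versus-copy pitfall: most DataFrame methods return a new object and leave the original untouched; the danger is plain assignment. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 1, 2]})

# Methods like sort_values() return a NEW DataFrame
sorted_df = df.sort_values('A')
print(list(df['A']))         # original order is unchanged
print(list(sorted_df['A']))  # sorted copy

# Subsetting plus .copy() gives a safe, independent object
safe = df[df['A'] > 1].copy()
safe['A'] = safe['A'] * 10
print(list(df['A']))         # still unchanged
```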
Take the dplyr exercises from Chapter 2 and perform them using pandas.
Further resources: the official pandas cheatsheet and Python for Data Analysis by Wes McKinney (the pandas creator).