Chapter 5: pandas

Now that you understand Python basics, let's look at pandas, Python's most popular library for data manipulation. If you're familiar with R's tidyverse (especially dplyr and tidyr), pandas will feel conceptually familiar, though the syntax differs.

What is pandas?

pandas provides two main data structures:

  • Series: Like an R vector or a single column
  • DataFrame: Like an R tibble/data.frame
import pandas as pd # standard convention -- roughly equivalent to library(tidyverse) in R
import numpy as np  # NumPy often used alongside pandas

# Check version
pd.__version__
'2.3.1'

Creating DataFrames

You can create DataFrames in several ways:

# From a dictionary (like tibble() in R)
df_dic = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['New York', 'London', 'Paris', 'Tokyo']
})

df_dic
      name  age      city
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Paris
3    Diana   28     Tokyo
# from a list of dictionaries
data = [
    {'name': 'Alice', 'age': 25, 'city': 'New York'},
    {'name': 'Bob', 'age': 30, 'city': 'Paris'},
    {'name': 'Charlie', 'age': 35, 'city': 'London'}
]
df_dic_list = pd.DataFrame(data)
df_dic_list
      name  age      city
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35    London
# from a list of lists (specify column names)
data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Paris'],
    ['Charlie', 35, 'London']
]
df_list_of_lists = pd.DataFrame(data, columns=['name', 'age', 'city'])

# from a numpy array (essentially a matrix, specify column names)
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
data
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
df_array = pd.DataFrame(data, columns=['A', 'B', 'C'])
df_array
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
Compared to R

rdf <- tibble(
  name = c("Alice", "Bob", "Charlie", "Diana"),
  age = c(25, 30, 35, 28),
  city = c("New York", "London", "Paris", "Tokyo")
)
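
Since Series appear throughout this chapter, here is a quick sketch of creating one directly, either from a list or from a dictionary (whose keys become the index, much like a named vector in R):

```python
import pandas as pd

# From a list (index defaults to 0, 1, 2, ...)
ages = pd.Series([25, 30, 35], name='age')

# From a dictionary (keys become the index, like a named vector in R)
ages_named = pd.Series({'Alice': 25, 'Bob': 30, 'Charlie': 35})

print(ages_named['Bob'])  # 30
```

Every DataFrame column you extract with `df['col']` is a Series of exactly this kind.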

Basic DataFrame Operations

pandas works through methods rather than functions. Methods are attached to specific object types. A string has string methods, a list has list methods, a DataFrame has DataFrame methods.

The two calling styles differ:

  • Functions are called by putting the object as an argument:
# Function syntax: function(object, arguments)
my_list = [3, 1, 4, 1, 5]
len(my_list)        # len is a function
5
sorted(my_list)     # sorted is a function
[1, 1, 3, 4, 5]
  • Methods are functions that “belong to” an object and are called using dot notation:
# Method syntax: object.method(arguments)
my_string = "hello world"
upper_string = my_string.upper()      # upper() is a method of strings
my_string.split()      # split() is a method of strings
['hello', 'world']
my_string              # object remains unchanged
'hello world'
my_list.append(9)      # append() is a method of lists; note that this modifies the list in place
my_list
[3, 1, 4, 1, 5, 9]
Compared to R

R primarily uses functions: you write head(df), nrow(df), names(df). Python often uses methods and attributes: you write df.head(), df.shape, df.columns.

This matters for pandas, since almost everything is a method and hence the syntax differs significantly.

pandas in action

Viewing Data (note that objects remain unchanged):

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['New York', 'London', 'Paris', 'Tokyo']
})

# First few rows (like head() in R)
df.head(2)
    name  age      city
0  Alice   25  New York
1    Bob   30    London
# Last few rows (like tail() in R)
df.tail(2)
      name  age   city
2  Charlie   35  Paris
3    Diana   28  Tokyo
# Info about the DataFrame (like glimpse() or str() in R)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    4 non-null      object
 1   age     4 non-null      int64 
 2   city    4 non-null      object
dtypes: int64(1), object(2)
memory usage: 224.0+ bytes
# Summary statistics (like summary() in R)
df.describe()
             age
count   4.000000
mean   29.500000
std     4.203173
min    25.000000
25%    27.250000
50%    29.000000
75%    31.250000
max    35.000000

Selecting Columns:

# Select single column (returns a Series)
df['name']           # in R: df$name
0      Alice
1        Bob
2    Charlie
3      Diana
Name: name, dtype: object
# Select single column (returns a DataFrame)
df[['name']]         # in R: df |> select(name)
      name
0    Alice
1      Bob
2  Charlie
3    Diana
# Select multiple columns (returns a DataFrame)
df[['name', 'age']]  # in R: df |> select(name, age)
      name  age
0    Alice   25
1      Bob   30
2  Charlie   35
3    Diana   28
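
pandas also has helpers that loosely resemble dplyr's select helpers (contains(), where(is.numeric)); a sketch, using a hypothetical frame with an extra age_group column for illustration:

```python
import pandas as pd

# Hypothetical frame with an extra age_group column for illustration
df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': [25, 30],
    'age_group': ['20s', '30s']
})

# Columns whose name contains 'age' (loosely like select(contains("age")) in R)
age_cols = df.filter(like='age')
print(age_cols.columns.tolist())      # ['age', 'age_group']

# Columns by dtype (loosely like select(where(is.numeric)) in R)
numeric_cols = df.select_dtypes(include='number')
print(numeric_cols.columns.tolist())  # ['age']
```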

Filtering Rows:

# Filter rows where age > 28
df[df['age'] > 28]
      name  age    city
1      Bob   30  London
2  Charlie   35   Paris
# Multiple conditions (use & for AND, | for OR)
df[(df['age'] > 25) & (df['city'] == 'London')]
  name  age    city
1  Bob   30  London
# Filter rows by membership in a list using .isin() (R equivalent: %in%)
cities_ive_been_to = ["Paris", "New York", "London"]
df[df['city'].isin(cities_ive_been_to)]
      name  age      city
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Paris
# or using .query()
df.query('age > 25') #R: df |> filter(age > 25)
      name  age    city
1      Bob   30  London
2  Charlie   35   Paris
3    Diana   28   Tokyo
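
.query() can also reference ordinary Python variables with an @ prefix, which keeps filter expressions readable; a sketch reusing the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['New York', 'London', 'Paris', 'Tokyo']
})

min_age = 28
visited = ['Paris', 'New York', 'London']

# Reference local variables inside the query string with @
subset = df.query('age >= @min_age and city in @visited')
print(subset['name'].tolist())  # ['Bob', 'Charlie']
```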

Adding New Columns:

# Add a new column (like mutate() in R)
df['age_in_months'] = df['age'] * 12    # in R: df$age_in_months <- df$age * 12
df
      name  age      city  age_in_months
0    Alice   25  New York            300
1      Bob   30    London            360
2  Charlie   35     Paris            420
3    Diana   28     Tokyo            336
# or using .assign()
df = df.assign(year_of_birth = lambda x: 2025-x['age'])
df
      name  age      city  age_in_months  year_of_birth
0    Alice   25  New York            300           2000
1      Bob   30    London            360           1995
2  Charlie   35     Paris            420           1990
3    Diana   28     Tokyo            336           1997
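
For conditional columns (R's if_else() inside mutate()), np.where() is the common pattern; a sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28]
})

# np.where(condition, value_if_true, value_if_false) -- like if_else() in R
df['age_group'] = np.where(df['age'] >= 30, '30+', 'under 30')
print(df['age_group'].tolist())  # ['under 30', '30+', '30+', 'under 30']
```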

Sorting:

# Sort by age (like arrange() in R)
df.sort_values('age')                    # in R: df |> arrange(age)
      name  age      city  age_in_months  year_of_birth
0    Alice   25  New York            300           2000
3    Diana   28     Tokyo            336           1997
1      Bob   30    London            360           1995
2  Charlie   35     Paris            420           1990
# Sort descending
df.sort_values('age', ascending=False)   # in R: df |> arrange(desc(age))
      name  age      city  age_in_months  year_of_birth
2  Charlie   35     Paris            420           1990
1      Bob   30    London            360           1995
3    Diana   28     Tokyo            336           1997
0    Alice   25  New York            300           2000
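
Sorting by several columns (like arrange(city, desc(age)) in R) takes a list of columns and, optionally, a matching list of sort directions; a sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['London', 'Paris', 'London', 'Paris'],
    'age': [30, 35, 22, 28]
})

# Sort by city ascending, then by age descending within each city
out = df.sort_values(['city', 'age'], ascending=[True, False])
print(out['age'].tolist())  # [30, 22, 35, 28]
```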

Read in Data:

# Read CSV (like read_csv() in R)
imdb_df = pd.read_csv('data/imdb2006-2016.csv')
imdb_df.head()
   Rank                    Title  ... Revenue (Millions) Metascore
0     1  Guardians of the Galaxy  ...             333.13      76.0
1     2               Prometheus  ...             126.46      65.0
2     3                    Split  ...             138.12      62.0
3     4                     Sing  ...             270.32      59.0
4     5            Suicide Squad  ...             325.02      40.0

[5 rows x 12 columns]
# Read Excel
publishers_df = pd.read_excel('data/publishers_with_places.xlsx')
publishers_df.head()
        publisher               city
0    37 ink atria  New York City, NY
1          abrams  New York City, NY
2             ace  New York City, NY
3  alfred a knopf  New York City, NY
4       algonquin    Chapel Hill, NC
# Read from URL
df_example = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')
df_example.head()
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
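
Writing data back out is the mirror image of reading it in; a small sketch with to_csv (the file name and temporary directory here are just for illustration). Note index=False, which drops the row index — a concept R's write_csv() doesn't have:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

# Write without the row index (like write_csv() in R)
path = os.path.join(tempfile.gettempdir(), 'people.csv')
df.to_csv(path, index=False)

# Round-trip: reading the file back reproduces the original DataFrame
df_back = pd.read_csv(path)
print(df_back.equals(df))  # True
```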

Quick Data Exploration:

# Shape (rows, columns)
df_example.shape                                  # in R: dim(df)
(244, 7)
# Column names 
df_example.columns.tolist()                       # in R: colnames(df)
['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size']
# Value counts
df_example['day'].value_counts()                  # in R: df |> count(day)
day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64
# Group by and aggregate (yields a Series)
df_example.groupby('day')['total_bill'].mean()    # in R: df |> group_by(day) |> summarize(mean_bill = mean(total_bill)) |> pull(mean_bill, name = day)
day
Fri     17.151579
Sat     20.441379
Sun     21.410000
Thur    17.682742
Name: total_bill, dtype: float64
# Group by and aggregate (yields a DataFrame)
df_example.groupby('day')['total_bill'].mean().reset_index() # in R: df |> group_by(day) |> summarize(mean_bill = mean(total_bill))
    day  total_bill
0   Fri   17.151579
1   Sat   20.441379
2   Sun   21.410000
3  Thur   17.682742
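
For several summaries at once (like summarize() with multiple expressions in R), .agg() with named aggregations is the usual pattern; a sketch with a toy version of the tips data:

```python
import pandas as pd

df = pd.DataFrame({
    'day': ['Sat', 'Sat', 'Sun', 'Sun', 'Sun'],
    'total_bill': [20.0, 30.0, 10.0, 20.0, 30.0]
})

# Named aggregations: new_name=('column', 'function')
summary = (df
    .groupby('day')
    .agg(mean_bill=('total_bill', 'mean'),
         n=('total_bill', 'size'))
    .reset_index()
)
print(summary)
```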

Method Chaining

pandas supports method chaining (similar to the pipe |>/%>% in R):

# Chain multiple operations
result = (df_example
    .query('total_bill > 20')  # Filter
    .assign(tip_pct = lambda x: x['tip'] / x['total_bill'])  # New column
    .sort_values('tip_pct', ascending=False)  # Sort
    .head(5)  # Top 5
)

result
     total_bill   tip     sex smoker   day    time  size   tip_pct
183       23.17  6.50    Male    Yes   Sun  Dinner     4  0.280535
181       23.33  5.65    Male    Yes   Sun  Dinner     2  0.242177
185       20.69  5.00    Male     No   Sun  Dinner     5  0.241663
88        24.71  5.85    Male     No  Thur   Lunch     2  0.236746
214       28.17  6.50  Female    Yes   Sat  Dinner     3  0.230742
Compared to R
result <- df_example |>
  filter(total_bill > 20) |>
  mutate(tip_pct = tip / total_bill) |>
  arrange(desc(tip_pct)) |>
  slice_head(n = 5)

Joining DataFrames

Joining (or merging) DataFrames is a fundamental operation in data analysis. pandas provides several ways to combine DataFrames, similar to SQL joins or R’s dplyr join functions.

Basic Join Types

# Create sample DataFrames
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'city': ['New York', 'London', 'Paris', 'Tokyo']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 2, 3, 5],
    'amount': [100, 150, 200, 75, 300]
})

Inner Join

Keep only rows that match in both DataFrames.

# Inner join (default) - only matching rows
inner = pd.merge(customers, orders, on='customer_id', how='inner')
inner
   customer_id     name      city  order_id  amount
0            1    Alice  New York       101     100
1            2      Bob    London       102     150
2            2      Bob    London       103     200
3            3  Charlie     Paris       104      75
Compared to R
# R equivalent
inner_join(customers, orders, by = "customer_id")

Left Join

Keep all rows from the left DataFrame; unmatched rows are filled with NaN.

# Left join - all customers, even if no orders
left = pd.merge(customers, orders, on='customer_id', how='left')
left
   customer_id     name      city  order_id  amount
0            1    Alice  New York     101.0   100.0
1            2      Bob    London     102.0   150.0
2            2      Bob    London     103.0   200.0
3            3  Charlie     Paris     104.0    75.0
4            4    Diana     Tokyo       NaN     NaN
Compared to R
# R equivalent
left_join(customers, orders, by = "customer_id")

Right Join

Keep all rows from the right DataFrame.

# Right join - all orders, even if customer doesn't exist
right = pd.merge(customers, orders, on='customer_id', how='right')
right
   customer_id     name      city  order_id  amount
0            1    Alice  New York       101     100
1            2      Bob    London       102     150
2            2      Bob    London       103     200
3            3  Charlie     Paris       104      75
4            5      NaN       NaN       105     300
Compared to R
# R equivalent
right_join(customers, orders, by = "customer_id")

Outer Join

Keep all rows from both DataFrames.

# Outer join - all customers and all orders
outer = pd.merge(customers, orders, on='customer_id', how='outer')
outer
   customer_id     name      city  order_id  amount
0            1    Alice  New York     101.0   100.0
1            2      Bob    London     102.0   150.0
2            2      Bob    London     103.0   200.0
3            3  Charlie     Paris     104.0    75.0
4            4    Diana     Tokyo       NaN     NaN
5            5      NaN       NaN     105.0   300.0
Compared to R
# R equivalent
full_join(customers, orders, by = "customer_id")
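
One merge feature worth knowing when debugging joins: indicator=True adds a _merge column recording where each row came from ('left_only', 'right_only', or 'both'). A sketch with simplified versions of the customers/orders frames:

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana']
})
orders = pd.DataFrame({
    'order_id': [101, 102, 105],
    'customer_id': [1, 2, 5]
})

# indicator=True adds a '_merge' column showing each row's origin
outer = pd.merge(customers, orders, on='customer_id', how='outer', indicator=True)
print(outer['_merge'].value_counts().to_dict())
```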

Joining on Different Column Names

# When join columns have different names
products = pd.DataFrame({
    'prod_id': [1, 2, 3],
    'product': ['Laptop', 'Phone', 'Tablet']
})

sales = pd.DataFrame({
    'product_id': [1, 1, 2, 3],
    'quantity': [5, 3, 8, 2]
})

# Specify left_on and right_on
merged = pd.merge(products, sales, 
                  left_on='prod_id', 
                  right_on='product_id', 
                  how='inner')
print(merged)
   prod_id product  product_id  quantity
0        1  Laptop           1         5
1        1  Laptop           1         3
2        2   Phone           2         8
3        3  Tablet           3         2
# Clean up: drop redundant column
merged = pd.merge(products, sales, 
                  left_on='prod_id', 
                  right_on='product_id', 
                  how='inner').drop('product_id', axis=1)  # axis=1 drops a column; axis=0 would try to drop a row labelled 'product_id'
merged
   prod_id product  quantity
0        1  Laptop         5
1        1  Laptop         3
2        2   Phone         8
3        3  Tablet         2
Compared to R
# R equivalent
inner_join(products, sales, by = join_by("prod_id" == "product_id"))

Joining on Multiple Columns

# Join on multiple columns
df1 = pd.DataFrame({
    'year': [2020, 2020, 2021, 2021],
    'quarter': [1, 2, 1, 2],
    'revenue': [100, 120, 110, 130]
})

df2 = pd.DataFrame({
    'year': [2020, 2020, 2021],
    'quarter': [1, 2, 1],
    'costs': [80, 90, 85]
})

# Join on both year and quarter
merged = pd.merge(df1, df2, on=['year', 'quarter'], how='left')
merged
   year  quarter  revenue  costs
0  2020        1      100   80.0
1  2020        2      120   90.0
2  2021        1      110   85.0
3  2021        2      130    NaN
Compared to R
# R equivalent
left_join(df1, df2, by = join_by("year", "quarter"))

Handling Duplicate Column Names

# When both DataFrames have columns with same names (besides join key)
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'value': [10, 20, 30]
})

df2 = pd.DataFrame({
    'id': [1, 2, 3],
    'value': [100, 200, 300]
})

# pandas adds suffixes to distinguish columns
merged = pd.merge(df1, df2, on='id', suffixes=('_left', '_right'))
merged
   id  value_left  value_right
0   1          10          100
1   2          20          200
2   3          30          300
Compared to R
# R equivalent (default suffixes are .x and .y)
inner_join(df1, df2, by = "id", suffix = c("_left", "_right"))
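
merge can also verify your assumptions about key uniqueness via the validate argument, raising an error instead of silently duplicating rows; a sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'x': [10, 20, 30]})
df2 = pd.DataFrame({'id': [1, 2, 3], 'y': [100, 200, 300]})
df2_dup = pd.DataFrame({'id': [1, 1, 2], 'y': [100, 150, 200]})

# Passes: 'id' is unique on both sides
ok = pd.merge(df1, df2, on='id', validate='one_to_one')

# Raises pandas.errors.MergeError: duplicate keys on the right side
try:
    pd.merge(df1, df2_dup, on='id', validate='one_to_one')
except pd.errors.MergeError:
    print('duplicate keys detected')
```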

Concatenating DataFrames (Stacking)

For simply stacking DataFrames vertically or horizontally without matching keys:

# Vertical concatenation (like rbind/bind_rows() in R)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

vertical = pd.concat([df1, df2], ignore_index=True, axis=0)
vertical
   A  B
0  1  3
1  2  4
2  5  7
3  6  8
# Horizontal concatenation (like cbind in R)
df3 = pd.DataFrame({'C': [9]})
horizontal = pd.concat([df1, df3], axis=1)
horizontal
   A  B    C
0  1  3  9.0
1  2  4  NaN
Compared to R
# R equivalents
bind_rows(df1, df2)    # vertical
bind_cols(df1, df3)    # horizontal

Chaining Joins

# Real-world scenario: combining customer, order, and product data
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'customer_name': ['Alice', 'Bob', 'Charlie']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 2, 1],
    'product_id': [1, 2, 3]
})

products = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['Laptop', 'Phone', 'Tablet'],
    'price': [1000, 800, 500]
})

# Chain multiple joins
result = (orders
    .merge(customers, on='customer_id', how='left')
    .merge(products, on='product_id', how='left')
    .assign(total = lambda x: x['price'])
)

result
   order_id  customer_id  product_id customer_name product_name  price  total
0       101            1           1         Alice       Laptop   1000   1000
1       102            2           2           Bob        Phone    800    800
2       103            1           3         Alice       Tablet    500    500
Compared to R
# R equivalent with pipe
result <- orders |>
  left_join(customers, by = "customer_id") |>
  left_join(products, by = "product_id") |>
  mutate(total = price)

Join Summary Table

pandas                           R dplyr                Description
pd.merge(df1, df2, how='inner')  inner_join(df1, df2)   Keep only matching rows
pd.merge(df1, df2, how='left')   left_join(df1, df2)    Keep all rows from left
pd.merge(df1, df2, how='right')  right_join(df1, df2)   Keep all rows from right
pd.merge(df1, df2, how='outer')  full_join(df1, df2)    Keep all rows from both
pd.concat([df1, df2])            bind_rows(df1, df2)    Stack vertically
pd.concat([df1, df2], axis=1)    bind_cols(df1, df2)    Stack horizontally
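
One dplyr verb absent from this table is anti_join() (rows of the left frame with no match on the right). pandas has no direct equivalent, but a common sketch combines indicator=True with a filter:

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2, 3, 4],
                          'name': ['Alice', 'Bob', 'Charlie', 'Diana']})
orders = pd.DataFrame({'customer_id': [1, 2, 2]})

# Anti-join: customers with no orders (R: anti_join(customers, orders))
anti = (customers
    .merge(orders[['customer_id']].drop_duplicates(),
           on='customer_id', how='left', indicator=True)
    .query("_merge == 'left_only'")
    .drop(columns='_merge')
)
print(anti['name'].tolist())  # ['Charlie', 'Diana']
```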

Key Differences between pandas and R

  • Indexing: pandas uses 0-based indexing
  • Missing values: pandas traditionally uses NumPy's NaN rather than R's NA (recent versions also offer pd.NA)
  • Syntax: Methods instead of functions (e.g., df.head() not head(df))

There is one important pitfall to watch out for: assigning a DataFrame to a new name creates a reference, not a copy. If you want an independent copy, use .copy()!

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# This creates a reference, NOT a copy
df2 = df

# Modify df2
df2['A'] = [10, 20, 30]

# Surprise! df is also changed
print(df)
    A  B
0  10  4
1  20  5
2  30  6
The fix:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# This creates a copy, NOT a reference
df2 = df.copy()

# Modify df2
df2['A'] = [10, 20, 30]

# Good: df did not change
print(df)
   A  B
0  1  4
1  2  5
2  3  6
print(df2)
    A  B
0  10  4
1  20  5
2  30  6
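
A related pitfall is assigning to a filtered subset. Chained indexing like df[df['A'] > 1]['B'] = 0 may write to a temporary copy and leave df untouched (depending on the pandas version, this warns or silently does nothing). The safe pattern selects and assigns in a single .loc call:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# df[df['A'] > 1]['B'] = 0   # risky: may modify a temporary copy, not df

# Safe: row selection and assignment in one .loc call
df.loc[df['A'] > 1, 'B'] = 0
print(df['B'].tolist())  # [4, 0, 0]
```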

Exercises

Take the dplyr exercises from Chapter 2 and perform them using pandas.

Further Resources