Chapter 5: `pandas`

Now that you understand Python basics, let’s look at pandas – Python’s most popular library for data manipulation. If you’re familiar with R’s tidyverse (especially dplyr and tidyr), pandas will feel conceptually familiar, though the syntax differs.

What is Pandas?

pandas provides two main data structures:

Series: Like an R vector or a single column
DataFrame: Like an R tibble/data.frame

import pandas as pd # Standard convention: import pandas as pd -- equivalent to library(tidyverse)
import numpy as np  # NumPy often used alongside pandas

# Check version
pd.__version__

'2.3.1'

Creating DataFrames

You can create DataFrames in several ways:

# From a dictionary (like tibble() in R)
df_dic = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['New York', 'London', 'Paris', 'Tokyo']
})

df_dic

      name  age      city
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Paris
3    Diana   28     Tokyo

# from a list of dictionaries
data = [
    {'name': 'Alice', 'age': 25, 'city': 'New York'},
    {'name': 'Bob', 'age': 30, 'city': 'Paris'},
    {'name': 'Charlie', 'age': 35, 'city': 'London'}
]
df_dic_list = pd.DataFrame(data)
df_dic_list

      name  age      city
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35    London

# from a list of lists (specify column names)
data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Paris'],
    ['Charlie', 35, 'London']
]
df_list_of_lists = pd.DataFrame(data, columns=['name', 'age', 'city'])

# from a numpy array (essentially a matrix, specify column names)
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
data

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

df_array = pd.DataFrame(data, columns=['A', 'B', 'C'])
df_array

Compared to R

rdf <- tibble( name = c(‘Alice’, ‘Bob’, ‘Charlie’, ‘Diana’), age = c(25, 30, 35, 28), city = c(‘New York’, ‘London’, ‘Paris’, ‘Tokyo’) )

Basic DataFrame Operations

pandas works through methods rather than functions. Methods are attached to specific object types. A string has string methods, a list has list methods, a DataFrame has DataFrame methods.

There are several differences:

Functions are called by putting the object as an argument:

# Function syntax: function(object, arguments)
my_list = [3, 1, 4, 1, 5]
len(my_list)        # len is a function

sorted(my_list)     # sorted is a function

[1, 1, 3, 4, 5]

Methods are functions that “belong to” an object and are called using dot notation:

# Method syntax: object.method(arguments)
my_string = "hello world"
upper_string = my_string.upper()      # upper() is a method of strings
my_string.split()      # split() is a method of strings

['hello', 'world']

my_string              # object remains unchanged

'hello world'

my_list.append(9)      # append() is a method of lists; note that this modifies the list in place
my_list

[3, 1, 4, 1, 5, 9]

Compared to R

R primarily uses functions - you write head(df), nrow(df), names(df). Python often uses methods - you write df.head(), df.shape, df.columns.

This matters for pandas, since almost everything is a method and hence the syntax differs significantly.

`pandas` in action

Viewing Data (note that objects remain unchanged):

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['New York', 'London', 'Paris', 'Tokyo']
})

# First few rows (like head() in R)
df.head(2)

    name  age      city
0  Alice   25  New York
1    Bob   30    London

# Last few rows (like tail() in R)
df.tail(2)

      name  age   city
2  Charlie   35  Paris
3    Diana   28  Tokyo

# Info about the DataFrame (like glimpse() or str() in R)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    4 non-null      object
 1   age     4 non-null      int64 
 2   city    4 non-null      object
dtypes: int64(1), object(2)
memory usage: 224.0+ bytes

# Summary statistics (like summary() in R)
df.describe()

             age
count   4.000000
mean   29.500000
std     4.203173
min    25.000000
25%    27.250000
50%    29.000000
75%    31.250000
max    35.000000

Selecting Columns:

# Select single column (returns a Series)
df['name']           # in R: df$name

0      Alice
1        Bob
2    Charlie
3      Diana
Name: name, dtype: object

# Select single column (returns a DataFrame)
df[['name']]         # in R: df |> select(name)

      name
0    Alice
1      Bob
2  Charlie
3    Diana

# Select multiple columns (returns a DataFrame)
df[['name', 'age']]  # in R: df |> select(name, age)

      name  age
0    Alice   25
1      Bob   30
2  Charlie   35
3    Diana   28

## OR: .filter() -- the equivalent of dplyr::select()
df.filter(['name'])

      name
0    Alice
1      Bob
2  Charlie
3    Diana

# Columns containing a string
df.filter(like='a')  # Gets 'Rating', 'Rating_Count', etc.

      name  age
0    Alice   25
1      Bob   30
2  Charlie   35
3    Diana   28

# Regex pattern
df.filter(regex='^(name|age)$')  # Columns matching the pattern

      name  age
0    Alice   25
1      Bob   30
2  Charlie   35
3    Diana   28

Filtering Rows:

# Filter rows where age > 28
df[df['age'] > 28]

      name  age    city
1      Bob   30  London
2  Charlie   35   Paris

# Multiple conditions (use & for AND, | for OR)
df[(df['age'] > 25) & (df['city'] == 'London')]

  name  age    city
1  Bob   30  London

# Filter something depending on its presence in another list using .isin() (R equivalent: %in%)
cities_ive_been_to = ["Paris", "New York", "London"]
df[df['city'].isin(cities_ive_been_to)]

      name  age      city
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Paris

# or using .query()
df.query('age > 25') #R: df |> filter(age > 25)

      name  age    city
1      Bob   30  London
2  Charlie   35   Paris
3    Diana   28   Tokyo

Adding New Columns:

# Add a new column (like mutate() in R)
df['age_in_months'] = df['age'] * 12    # in R: df$age_in_months <- df$age * 12
df

      name  age      city  age_in_months
0    Alice   25  New York            300
1      Bob   30    London            360
2  Charlie   35     Paris            420
3    Diana   28     Tokyo            336

# or using .assign()
df = df.assign(year_of_birth = lambda x: 2025-x['age'])
df

      name  age      city  age_in_months  year_of_birth
0    Alice   25  New York            300           2000
1      Bob   30    London            360           1995
2  Charlie   35     Paris            420           1990
3    Diana   28     Tokyo            336           1997

Sorting:

# Sort by age (like arrange() in R)
df.sort_values('age')                    # in R: df |> arrange(age)

      name  age      city  age_in_months  year_of_birth
0    Alice   25  New York            300           2000
3    Diana   28     Tokyo            336           1997
1      Bob   30    London            360           1995
2  Charlie   35     Paris            420           1990

# Sort descending
df.sort_values('age', ascending=False)   # in R: df |> arrange(desc(age))

      name  age      city  age_in_months  year_of_birth
2  Charlie   35     Paris            420           1990
1      Bob   30    London            360           1995
3    Diana   28     Tokyo            336           1997
0    Alice   25  New York            300           2000

Read in Data:

# Read CSV (like read_csv() in R)
imdb_df = pd.read_csv('data/imdb2006-2016.csv')
imdb_df.head()

   Rank                    Title  ... Revenue (Millions) Metascore
0     1  Guardians of the Galaxy  ...             333.13      76.0
1     2               Prometheus  ...             126.46      65.0
2     3                    Split  ...             138.12      62.0
3     4                     Sing  ...             270.32      59.0
4     5            Suicide Squad  ...             325.02      40.0

[5 rows x 12 columns]

# Read Excel
publishers_df = pd.read_excel('data/publishers_with_places.xlsx')
publishers_df.head()

        publisher               city
0    37 ink atria  New York City, NY
1          abrams  New York City, NY
2             ace  New York City, NY
3  alfred a knopf  New York City, NY
4       algonquin    Chapel Hill, NC

# Read from URL
df_example = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')
df_example.head()

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

Quick Data Exploration:

# Shape (rows, columns)
df_example.shape                                  # in R: dim(df)

(244, 7)

# Column names 
df_example.columns.tolist()                       # in R: colnames(df)

['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size']

# Value counts
df_example['day'].value_counts()                  # in R: df |> count(day)

day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64

# Group by and aggregate (yields a Series)
df_example.groupby('day')['total_bill'].mean()    # in R: df |> group_by(day) |> summarize(mean_bill = mean(total_bill)) |> pull(mean_bill) |> set_names(day)

day
Fri     17.151579
Sat     20.441379
Sun     21.410000
Thur    17.682742
Name: total_bill, dtype: float64

# Group by and aggregate (yields a DataFrame)
df_example.groupby('day')['total_bill'].mean().reset_index() # in R: df |> group_by(day) |> summarize(mean_bill = mean(total_bill))

    day  total_bill
0   Fri   17.151579
1   Sat   20.441379
2   Sun   21.410000
3  Thur   17.682742

Method Chaining

pandas supports method chaining (similar to the pipe |>/%>% in R):

# Chain multiple operations
result = (df_example
    .query('total_bill > 20')  # Filter
    .assign(tip_pct = lambda x: x['tip'] / x['total_bill'])  # New column
    .sort_values('tip_pct', ascending=False)  # Sort
    .head(5)  # Top 5
)

result

     total_bill   tip     sex smoker   day    time  size   tip_pct
183       23.17  6.50    Male    Yes   Sun  Dinner     4  0.280535
181       23.33  5.65    Male    Yes   Sun  Dinner     2  0.242177
185       20.69  5.00    Male     No   Sun  Dinner     5  0.241663
88        24.71  5.85    Male     No  Thur   Lunch     2  0.236746
214       28.17  6.50  Female    Yes   Sat  Dinner     3  0.230742

Compared to R

result <- df_example |>
  filter(total_bill > 20) |>
  mutate(tip_pct = tip / total_bill) |>
  arrange(desc(tip_pct)) |>
  slice(5)

Joining DataFrames

Joining (or merging) DataFrames is a fundamental operation in data analysis. pandas provides several ways to combine DataFrames, similar to SQL joins or R’s dplyr join functions.

Basic Join Types

# Create sample DataFrames
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'city': ['New York', 'London', 'Paris', 'Tokyo']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 2, 3, 5],
    'amount': [100, 150, 200, 75, 300]
})

Inner Join

Keep only rows that match in both DataFrames.

# Inner join (default) - only matching rows
inner = pd.merge(customers, orders, on='customer_id', how='inner')
inner

   customer_id     name      city  order_id  amount
0            1    Alice  New York       101     100
1            2      Bob    London       102     150
2            2      Bob    London       103     200
3            3  Charlie     Paris       104      75

Compared to R

# R equivalent
inner_join(customers, orders, by = "customer_id")

Left Join

Keep all rows from the left DataFrame, fill missing values with NaN.

# Left join - all customers, even if no orders
left = pd.merge(customers, orders, on='customer_id', how='left')
left

   customer_id     name      city  order_id  amount
0            1    Alice  New York     101.0   100.0
1            2      Bob    London     102.0   150.0
2            2      Bob    London     103.0   200.0
3            3  Charlie     Paris     104.0    75.0
4            4    Diana     Tokyo       NaN     NaN

Compared to R

# R equivalent
left_join(customers, orders, by = "customer_id")

Right Join

Keep all rows from the right DataFrame.

# Right join - all orders, even if customer doesn't exist
right = pd.merge(customers, orders, on='customer_id', how='right')
right

   customer_id     name      city  order_id  amount
0            1    Alice  New York       101     100
1            2      Bob    London       102     150
2            2      Bob    London       103     200
3            3  Charlie     Paris       104      75
4            5      NaN       NaN       105     300

Compared to R

# R equivalent
right_join(customers, orders, by = "customer_id")

Outer Join

Keep all rows from both DataFrames.

# Outer join - all customers and all orders
outer = pd.merge(customers, orders, on='customer_id', how='outer')
outer

   customer_id     name      city  order_id  amount
0            1    Alice  New York     101.0   100.0
1            2      Bob    London     102.0   150.0
2            2      Bob    London     103.0   200.0
3            3  Charlie     Paris     104.0    75.0
4            4    Diana     Tokyo       NaN     NaN
5            5      NaN       NaN     105.0   300.0

Compared to R

# R equivalent
full_join(customers, orders, by = "customer_id")

Joining on Different Column Names

# When join columns have different names
products = pd.DataFrame({
    'prod_id': [1, 2, 3],
    'product': ['Laptop', 'Phone', 'Tablet']
})

sales = pd.DataFrame({
    'product_id': [1, 1, 2, 3],
    'quantity': [5, 3, 8, 2]
})

# Specify left_on and right_on
merged = pd.merge(products, sales, 
                  left_on='prod_id', 
                  right_on='product_id', 
                  how='inner')
print(merged)

   prod_id product  product_id  quantity
0        1  Laptop           1         5
1        1  Laptop           1         3
2        2   Phone           2         8
3        3  Tablet           3         2

# Clean up: drop redundant column
merged = pd.merge(products, sales, 
                  left_on='prod_id', 
                  right_on='product_id', 
                  how='inner').drop('product_id', axis=1) #axis=0 would drop the row with id 'product_id' (doesn't exist)
merged

   prod_id product  quantity
0        1  Laptop         5
1        1  Laptop         3
2        2   Phone         8
3        3  Tablet         2

Compared to R

# R equivalent
inner_join(products, sales, by = join_by("prod_id" == "product_id"))

Joining on Multiple Columns

# Join on multiple columns
df1 = pd.DataFrame({
    'year': [2020, 2020, 2021, 2021],
    'quarter': [1, 2, 1, 2],
    'revenue': [100, 120, 110, 130]
})

df2 = pd.DataFrame({
    'year': [2020, 2020, 2021],
    'quarter': [1, 2, 1],
    'costs': [80, 90, 85]
})

# Join on both year and quarter
merged = pd.merge(df1, df2, on=['year', 'quarter'], how='left')
merged

   year  quarter  revenue  costs
0  2020        1      100   80.0
1  2020        2      120   90.0
2  2021        1      110   85.0
3  2021        2      130    NaN

Compared to R

# R equivalent
left_join(df1, df2, by = join_by("year", "quarter"))

Handling Duplicate Column Names

# When both DataFrames have columns with same names (besides join key)
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'value': [10, 20, 30]
})

df2 = pd.DataFrame({
    'id': [1, 2, 3],
    'value': [100, 200, 300]
})

# pandas adds suffixes to distinguish columns
merged = pd.merge(df1, df2, on='id', suffixes=('_left', '_right'))
merged

   id  value_left  value_right
0   1          10          100
1   2          20          200
2   3          30          300

Compared to R

# R equivalent (default suffixes are .x and .y)
inner_join(df1, df2, by = "id", suffix = c("_left", "_right"))

Concatenating DataFrames (Stacking)

For simply stacking DataFrames vertically or horizontally without matching keys:

# Vertical concatenation (like rbind/bind_rows() in R)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

vertical = pd.concat([df1, df2], ignore_index=True, axis=0)
vertical

# Horizontal concatenation (like cbind in R)
df3 = pd.DataFrame({'C': [9]})
horizontal = pd.concat([df1, df3], axis=1)
horizontal

   A  B    C
0  1  3  9.0
1  2  4  NaN

Compared to R

# R equivalents
bind_rows(df1, df2)    # vertical
bind_cols(df1, df3)    # horizontal

Chaining Joins

# Real-world scenario: combining customer, order, and product data
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'customer_name': ['Alice', 'Bob', 'Charlie']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 2, 1],
    'product_id': [1, 2, 3]
})

products = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['Laptop', 'Phone', 'Tablet'],
    'price': [1000, 800, 500]
})

# Chain multiple joins
result = (orders
    .merge(customers, on='customer_id', how='left')
    .merge(products, on='product_id', how='left')
    .assign(total = lambda x: x['price'])
)

result

   order_id  customer_id  product_id customer_name product_name  price  total
0       101            1           1         Alice       Laptop   1000   1000
1       102            2           2           Bob        Phone    800    800
2       103            1           3         Alice       Tablet    500    500

Compared to R

# R equivalent with pipe
result <- orders |>
  left_join(customers, by = "customer_id") |>
  left_join(products, by = "product_id") |>
  mutate(total = price)

Join Summary Table

pandas	R dplyr	Description
`pd.merge(df1, df2, how='inner')`	`inner_join(df1, df2)`	Keep only matching rows
`pd.merge(df1, df2, how='left')`	`left_join(df1, df2)`	Keep all rows from left
`pd.merge(df1, df2, how='right')`	`right_join(df1, df2)`	Keep all rows from right
`pd.merge(df1, df2, how='outer')`	`full_join(df1, df2)`	Keep all rows from both
`pd.concat([df1, df2])`	`bind_rows(df1, df2)`	Stack vertically
`pd.concat([df1, df2], axis=1)`	`bind_cols(df1, df2)`	Stack horizontally

Key Differences between `pandas` and R

Indexing: pandas uses 0-based indexing
Missing values: pandas uses NaN (from NumPy), not NA
Syntax: Methods instead of functions (e.g., df.head() not head(df))

There is one important pitfall you need to pay attention to: pandas modifies in place unless you use .copy()!

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# This creates a reference, NOT a copy
df2 = df

# Modify df2
df2['A'] = [10, 20, 30]

# Surprise! df is also changed
print(df)

### FIX
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# This creates a copy, NOT a reference
df2 = df.copy()

# Modify df2
df2['A'] = [10, 20, 30]

# thank god! df did not change
print(df)

print(df2)

Exercises

Take the dplyr exercises from Chapter 2 and perform them using pandas.

Open the IMDb file.

imdb = pd.read_csv("data/imdb2006-2016.csv")
imdb

     Rank                    Title  ... Revenue (Millions) Metascore
0       1  Guardians of the Galaxy  ...             333.13      76.0
1       2               Prometheus  ...             126.46      65.0
2       3                    Split  ...             138.12      62.0
3       4                     Sing  ...             270.32      59.0
4       5            Suicide Squad  ...             325.02      40.0
..    ...                      ...  ...                ...       ...
995   996     Secret in Their Eyes  ...                NaN      45.0
996   997          Hostel: Part II  ...              17.54      46.0
997   998   Step Up 2: The Streets  ...              58.01      50.0
998   999             Search Party  ...                NaN      22.0
999  1000               Nine Lives  ...              19.64      11.0

[1000 rows x 12 columns]

Find the duplicated movie. How could you go across this?

duplicated_movie = (imdb['Title']
    .value_counts()
    .reset_index()
    .sort_values('count', ascending=False)
    .head(1)
    )
duplicated_movie

      Title  count
0  The Host      2

print(duplicated_movie[['Title']])

      Title
0  The Host

Which director has made the longest movie?

imdb.sort_values('Runtime (Minutes)', ascending=False).head(1)

     Rank       Title  ... Revenue (Millions) Metascore
828   829  Grindhouse  ...              25.03       NaN

[1 rows x 12 columns]

What’s the highest ranked movie?

imdb.sort_values('Rank').head(1)

   Rank                    Title  ... Revenue (Millions) Metascore
0     1  Guardians of the Galaxy  ...             333.13      76.0

[1 rows x 12 columns]

Which movie got the most votes?

imdb.sort_values('Votes').head(1)

     Rank           Title  Genre  ... Votes Revenue (Millions) Metascore
478   479  Paint It Black  Drama  ...    61                NaN      71.0

[1 rows x 12 columns]

Which movie had the biggest revenue in 2016?

imdb.query('Year == 2016').sort_values('Revenue (Millions)', ascending=False).head(1)

    Rank      Title  ... Revenue (Millions) Metascore
12    13  Rogue One  ...             532.17      65.0

[1 rows x 12 columns]

How much revenue did the movies in the data set make each year in total?

imdb.groupby('Year')['Revenue (Millions)'].sum().reset_index()

    Year  Revenue (Millions)
0   2006             3624.46
1   2007             4306.23
2   2008             5053.22
3   2009             5292.26
4   2010             5989.65
5   2011             5431.96
6   2012             6910.29
7   2013             7666.72
8   2014             7997.40
9   2015             8854.12
10  2016            11211.65

Filter movies following some conditions:
1. More runtime than the average runtime (hint: you could also use mutate() before).
2. Movies directed by J.J. Abrams.
3. More votes than the median of all of the votes.
4. The movies which have the most common value (the mode) in terms of rating (you can use mode() from the pandas package).

#a 
imdb.assign(avg_runtime = lambda x: x['Runtime (Minutes)'].mean()).assign(runtime = lambda x: x['Runtime (Minutes)']).query('runtime > avg_runtime')

     Rank                    Title  ... avg_runtime runtime
0       1  Guardians of the Galaxy  ...     113.172     121
1       2               Prometheus  ...     113.172     124
2       3                    Split  ...     113.172     117
4       5            Suicide Squad  ...     113.172     123
6       7               La La Land  ...     113.172     128
..    ...                      ...  ...         ...     ...
979   980       The Skin I Live In  ...     113.172     120
981   982                    Annie  ...     113.172     118
982   983      Across the Universe  ...     113.172     133
989   990                    Selma  ...     113.172     128
991   992         Taare Zameen Par  ...     113.172     165

[433 rows x 14 columns]

#b
imdb[imdb["Director"] == "J.J. Abrams"]

     Rank  ... Metascore
50     51  ...      81.0
140   141  ...      82.0
362   363  ...      72.0
497   498  ...      72.0
869   870  ...      66.0

[5 rows x 12 columns]

#c
imdb.assign(median_votes = lambda x: x['Votes'].median()).query('Votes > median_votes')

     Rank                           Title  ... Metascore median_votes
0       1         Guardians of the Galaxy  ...      76.0     110799.0
1       2                      Prometheus  ...      65.0     110799.0
2       3                           Split  ...      62.0     110799.0
4       5                   Suicide Squad  ...      40.0     110799.0
6       7                      La La Land  ...      93.0     110799.0
..    ...                             ...  ...       ...          ...
971   972                       Disturbia  ...       NaN     110799.0
983   984                   Let's Be Cops  ...      30.0     110799.0
990   991  Underworld: Rise of the Lycans  ...      44.0     110799.0
993   994        Resident Evil: Afterlife  ...      37.0     110799.0
994   995                       Project X  ...      48.0     110799.0

[500 rows x 13 columns]

#or
imdb.query('Votes > @imdb.Votes.median()')

     Rank                           Title  ... Revenue (Millions) Metascore
0       1         Guardians of the Galaxy  ...             333.13      76.0
1       2                      Prometheus  ...             126.46      65.0
2       3                           Split  ...             138.12      62.0
4       5                   Suicide Squad  ...             325.02      40.0
6       7                      La La Land  ...             151.06      93.0
..    ...                             ...  ...                ...       ...
971   972                       Disturbia  ...              80.05       NaN
983   984                   Let's Be Cops  ...              82.39      30.0
990   991  Underworld: Rise of the Lycans  ...              45.80      44.0
993   994        Resident Evil: Afterlife  ...              60.13      37.0
994   995                       Project X  ...              54.72      48.0

[500 rows x 12 columns]

#d
rating_mode = imdb['Rating'].mode()
imdb[imdb['Rating'].isin(rating_mode)]

     Rank  ... Metascore
8       9  ...      78.0
32     33  ...      52.0
39     40  ...       NaN
48     49  ...      68.0
71     72  ...      60.0
75     76  ...      50.0
93     94  ...      57.0
134   135  ...      61.0
139   140  ...      44.0
212   213  ...      61.0
221   222  ...      73.0
226   227  ...      82.0
248   249  ...      75.0
259   260  ...      76.0
276   277  ...      71.0
282   283  ...       NaN
294   295  ...      55.0
298   299  ...      66.0
352   353  ...      47.0
378   379  ...      64.0
385   386  ...      64.0
402   403  ...       NaN
435   436  ...       NaN
436   437  ...      33.0
446   447  ...      59.0
497   498  ...      72.0
513   514  ...      72.0
515   516  ...      72.0
543   544  ...       NaN
551   552  ...      79.0
554   555  ...      33.0
556   557  ...      77.0
571   572  ...      75.0
572   573  ...      51.0
592   593  ...      69.0
664   665  ...      88.0
689   690  ...      85.0
702   703  ...      60.0
706   707  ...       NaN
721   722  ...      76.0
737   738  ...      57.0
807   808  ...      82.0
842   843  ...      80.0
843   844  ...      73.0
853   854  ...      70.0
863   864  ...      81.0
892   893  ...      56.0
894   895  ...      46.0
895   896  ...      55.0
896   897  ...       NaN
929   930  ...       NaN
988   989  ...      89.0

[52 rows x 12 columns]

Further Resources

A very helpful R vs. pandas cheatsheet
Python for Data Analysis by Wes McKinney (pandas creator)

What is Pandas?

Creating DataFrames

Basic DataFrame Operations

pandas in action

Method Chaining

Joining DataFrames

Basic Join Types

Inner Join

Left Join

Right Join

Outer Join

Joining on Different Column Names

Joining on Multiple Columns

Handling Duplicate Column Names

Concatenating DataFrames (Stacking)

Chaining Joins

Join Summary Table

Key Differences between pandas and R

Exercises

Further Resources

`pandas` in action

Key Differences between `pandas` and R