#first, navigate to the folder you want the environment to live in, by using `cd path/to/folder`
python -m venv _pyenv/transformer_env #create environment
#activate
## for mac/linux
source _pyenv/transformer_env/bin/activate
# for windows (the command prompt expects backslashes)
# _pyenv\transformer_env\Scripts\activate
# package installation
pip install --upgrade pip
# Install PyTorch for Silicon Macs (only if you have one)
pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
#otherwise:
pip install torch
# install remaining required packages
pip install transformers
pip install pandas numpy scikit-learn
pip install tqdm
pip install seaborn matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
Chapter 14: Supervised Classification using BERT
Supervised methods that are based on the Bag of Words hypothesis (see chapter 11) work well, but these days we can do better. In this script, I will run you through how to use BERT (Bidirectional Encoder Representations from Transformers) for text classification. We’ll be working with a sentiment analysis task using the IMDb movie reviews data set, where we’ll classify reviews as either positive or negative.
Unlike traditional bag-of-words approaches, BERT takes context into account: it represents each word in light of the words that come before and after it. This allows it to capture more complex patterns in text, which typically leads to better classification performance.
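To make the contrast concrete, here is a minimal sketch in plain Python of what a bag-of-words representation throws away. The helper name bag_of_words and the two example sentences are illustrative assumptions and not part of the chapter's pipeline.

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, ignoring word order (hypothetical helper)."""
    return Counter(text.lower().split())

a = "the dog bit the man"
b = "the man bit the dog"

# Both sentences contain exactly the same words, so their bag-of-words
# representations are identical even though the meaning differs.
print(bag_of_words(a) == bag_of_words(b))  # True
```

A model that only sees these counts cannot tell the two sentences apart, whereas a contextual model receives the words in order.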
Setup
Here, we specifically use a virtual environment named 'transformer_env', which needs to contain all necessary Python packages. This isolation ensures reproducibility and prevents package conflicts. Make sure this virtual environment has all required packages (torch, transformers, etc.) installed – if not, install them via either the terminal or reticulate. While in the chapter on selenium we used reticulate functions to set up a (conda) environment, here we do it in the terminal (so that you can use the code on the server later).
Then we can activate Python and the environment from within R (obviously, you can skip this step when you work in Python, e.g., in JupyterLab):
needs(reticulate)
use_virtualenv("_pyenv/transformer_env")
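Once the environment is active, a quick check can confirm that the packages are actually visible to Python. The following is a minimal sketch using only the standard library; the helper name missing_packages and the REQUIRED list are assumptions made for illustration.

```python
from importlib.util import find_spec

# Import names used in this chapter; note that scikit-learn is imported as "sklearn".
REQUIRED = ["torch", "transformers", "pandas", "sklearn", "tqdm"]

def missing_packages(names):
    """Return the top-level module names that cannot be found in the active environment."""
    return [name for name in names if find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If anything is reported as missing, install it with pip from the terminal while the environment is activated.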
Then, we import all necessary libraries and set up our device configuration. The device setup is particularly important as it allows our code to run efficiently on different hardware configurations – whether that’s a Silicon Mac using MPS (in my case), a machine with a CUDA-enabled GPU, or a regular CPU. Depending on this, you might have to install the respective torch packages. We use pandas for data manipulation, torch (PyTorch) for deep learning operations, and the transformers library for access to pre-trained BERT models. Furthermore, we use an array of sklearn (scikit-learn) functions for train-test split creation and subsequent model evaluation.
import torch
from torch import nn
from transformers import BertTokenizer, BertModel
from torch.utils.data import Dataset, DataLoader
import pandas as pd
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
def get_device():
    if torch.backends.mps.is_available():
        device = torch.device("mps")
    elif torch.cuda.is_available():
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")
    return device

device = get_device()
print(f"Using device: {device}")
Using device: mps
Similar to our previous supervised learning examples, we need to prepare our data in a format suitable for the model. Hence, this chunk defines our custom Dataset class for handling text data preparation. It converts our raw text and labels into BERT’s expected format. It handles tokenization using BERT’s specialized tokenizer, ensures all sequences are of the same length through padding or truncation (controlled by the max_len parameter), and generates attention masks to properly handle variable-length inputs. All this information is converted into PyTorch tensors for model training.
class SentenceDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]

        encoding = self.tokenizer.encode_plus(
            text,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long)
        }
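What padding, truncation, and the attention mask do to a single sequence can be illustrated without the tokenizer. This is a simplified sketch under stated assumptions: the token ids are made up, pad_id=0 is assumed, and the hypothetical helper pad_or_truncate ignores the special tokens that the real BERT tokenizer adds at the start and end of each sequence.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Bring a list of token ids to exactly max_len and build the matching attention mask."""
    ids = list(token_ids[:max_len])   # cut off sequences that are too long
    mask = [1] * len(ids)             # 1 marks a real token
    n_pad = max_len - len(ids)        # number of padding positions still needed
    # 0 in the mask marks padding, which the model should ignore
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([11, 12, 13], max_len=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6], max_len=4)

print(short_ids, short_mask)  # [11, 12, 13, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4] [1, 1, 1, 1]
```

Because every sequence ends up with the same length, the examples can be stacked into one tensor per batch.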
Then we define our model architecture by extending PyTorch’s Module class. The classifier builds upon the pre-trained BERT model (transfer learning). To prevent overfitting, we add a dropout layer for regularization with a default rate of 0.1. The final linear layer performs the actual classification, converting BERT’s 768-dimensional output into our desired number of classes (i.e., 2 here).
class BertClassifier(nn.Module):
    def __init__(self, dropout=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs[1]
        pooled_output = self.dropout(pooled_output)
        return self.classifier(pooled_output)
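To see what the dropout and linear layers do to the tensor shapes, the classification head can be tried in isolation. The sketch below is an assumption-laden stand-in: the class name ClassifierHead is hypothetical, and a random tensor takes the place of BERT's 768-dimensional pooled output, so no pre-trained weights are downloaded.

```python
import torch
from torch import nn

class ClassifierHead(nn.Module):
    """Dropout followed by a linear layer, mirroring the head of the classifier above."""
    def __init__(self, hidden_size=768, num_classes=2, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, pooled_output):
        return self.classifier(self.dropout(pooled_output))

torch.manual_seed(0)
head = ClassifierHead()
head.eval()  # dropout is inactive in evaluation mode

fake_pooled = torch.randn(4, 768)  # stand-in for a batch of 4 pooled BERT outputs
with torch.no_grad():
    logits = head(fake_pooled)

print(logits.shape)  # torch.Size([4, 2]): one raw score per class and review
```

The two numbers per review are unnormalized scores (logits); the loss function and the prediction step work directly on them.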
The training function implements our training loop with both training and validation phases. It handles device placement automatically (supporting CPU, CUDA, or MPS – for Silicon Macs). During training, it performs forward passes through the model, calculates the loss, and updates the model’s parameters. The validation phase tracks the model’s performance on unseen data, and the summary printed after each epoch (losses and validation accuracy) provides feedback during the training process. The tqdm progress bars are switched off here via disable=True.
def train_model(model, train_loader, val_loader, epochs=3, lr=2e-5):
    device = get_device()
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        model.train()
        train_loss = 0

        # Disable the progress bar but keep the iteration
        train_pbar = tqdm(train_loader, desc=f'Training Epoch {epoch+1}/{epochs}', disable=True)
        for batch in train_pbar:
            optimizer.zero_grad()
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)

            outputs = model(input_ids, attention_mask)
            loss = criterion(outputs, labels)

            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        model.eval()
        val_loss = 0
        correct = 0
        total = 0

        # Disable the progress bar but keep the iteration
        val_pbar = tqdm(val_loader, desc=f'Validating Epoch {epoch+1}/{epochs}', disable=True)
        with torch.no_grad():
            for batch in val_pbar:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['label'].to(device)

                outputs = model(input_ids, attention_mask)
                loss = criterion(outputs, labels)
                val_loss += loss.item()

                _, predicted = torch.max(outputs, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        print(f'Epoch {epoch+1}:')
        print(f'Train Loss: {train_loss/len(train_loader):.4f}')
        print(f'Val Loss: {val_loss/len(val_loader):.4f}')
        print(f'Val Accuracy: {100*correct/total:.2f}%\n')
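The loss that the training loop minimizes can be computed by hand for a single example. This is a minimal sketch in plain Python, assuming made-up logits; the helper cross_entropy is hypothetical and mirrors what nn.CrossEntropyLoss computes per example before it averages over the batch.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true class, computed from raw scores (logits)."""
    m = max(logits)  # subtracting the maximum keeps exp() numerically stable
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

# Made-up scores in the order [negative, positive]; the true label is 1 (positive)
confident_right = cross_entropy([-2.0, 3.0], label=1)
undecided = cross_entropy([0.0, 0.0], label=1)
confident_wrong = cross_entropy([3.0, -2.0], label=1)

print(confident_right < undecided < confident_wrong)  # True
```

The loss is small when the model puts a high score on the correct class and grows quickly when it is confidently wrong, which is what drives the parameter updates.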
Finally, the predict function handles inference on new texts. It manages the complete pipeline from raw text to final prediction: tokenizing the new, unseen input text using BERT’s tokenizer, moving the processed input to the appropriate device, running it through the model, and converting the model’s output into a prediction.
def predict(model, text, tokenizer):
    # use the same device selection as in training (MPS, CUDA, or CPU),
    # so the inputs end up on the device the trained model lives on
    device = get_device()
    model.eval()
    encoding = tokenizer.encode_plus(
        text,
        max_length=128,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)

    with torch.no_grad():
        outputs = model(input_ids, attention_mask)
        _, predicted = torch.max(outputs, 1)

    return predicted.item()
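The last step, turning the model's two raw scores into a label, can be sketched in plain Python. The logits below are made up for illustration, the softmax helper is an assumption (the function above only takes the index of the larger score), and the id_to_label dictionary simply inverts the negative/positive mapping used in this chapter.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for one review, in the order [negative, positive]
logits = [-1.2, 2.3]
probs = softmax(logits)

# Index of the largest score, i.e. the class that taking the maximum over the scores selects
predicted = max(range(len(probs)), key=probs.__getitem__)

id_to_label = {0: "negative", 1: "positive"}
print(id_to_label[predicted], round(probs[predicted], 3))
```

Since softmax preserves the ordering of the scores, the predicted class is the same with or without it; the probabilities are only useful if you also want a rough confidence value.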
The Full Process
First, we need to load and prepare our IMDb dataset for training. Here, we use a two-stage split process: we first set aside 90% of the 25,000 reviews as test data (of which we later use only the first 100 instances for speed), then split the remaining 10% into training (80%) and validation (20%) sets. Stratified sampling keeps the class distribution balanced in each split. The data is then processed into our custom Dataset format and wrapped in DataLoader objects, which handle batching and shuffling during training. The label mapping converts our text labels into the numeric format required for classification.
imdb_reviews = pd.read_csv("files/imdb_reviews.csv")

# separate test set (90% of data)
train_val_df, test_df = train_test_split(
    imdb_reviews,
    test_size=0.9,
    stratify=imdb_reviews['sentiment'],
    random_state=42
)

# separate train and validation (80/20 split of remaining data)
train_df, val_df = train_test_split(
    train_val_df,
    test_size=0.2,
    stratify=train_val_df['sentiment'],
    random_state=42
)

# Create feature/label pairs
X_train = train_df['text'].tolist()
y_train = train_df['sentiment'].tolist()

X_val = val_df['text'].tolist()
y_val = val_df['sentiment'].tolist()

## we use a small test set in this example, only the first 100 instances
X_test = test_df['text'][0:100].tolist()
y_test = test_df['sentiment'][0:100].tolist()

# create label mapping to change labels to integers
label_map = {'negative': 0, 'positive': 1}
y_train = [label_map[label] for label in y_train]
y_val = [label_map[label] for label in y_val]
y_test = [label_map[label] for label in y_test]
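It helps to know how many reviews end up in each set. The following sketch does the arithmetic in plain Python, assuming the file holds the 25,000 reviews mentioned above; the helper split_sizes is hypothetical and uses simple rounding, which matches the real split here because both products are whole numbers.

```python
def split_sizes(n_total, test_fraction, val_fraction):
    """Sizes of the train, validation, and test sets after the two-stage split."""
    n_test = round(n_total * test_fraction)      # first split: test set
    n_train_val = n_total - n_test               # what remains for training and validation
    n_val = round(n_train_val * val_fraction)    # second split: validation set
    n_train = n_train_val - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(25_000, test_fraction=0.9, val_fraction=0.2)
print(n_train, n_val, n_test)  # 2000 500 22500
```

So the model is fine-tuned on only 2,000 reviews and validated on 500, which keeps training time manageable on a laptop.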
Once the data preparation is finished, we can initialize the tokenizer and model, prepare the data loaders, and start training our model.
# initialize tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertClassifier()

# load data
train_dataset = SentenceDataset(X_train, y_train, tokenizer)
val_dataset = SentenceDataset(X_val, y_val, tokenizer)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)

train_model(model, train_loader, val_loader)
Epoch 1:
Train Loss: 0.5076
Val Loss: 0.3161
Val Accuracy: 87.80%
Epoch 2:
Train Loss: 0.2887
Val Loss: 0.3226
Val Accuracy: 87.20%
Epoch 3:
Train Loss: 0.1381
Val Loss: 0.3791
Val Accuracy: 85.60%
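The reported losses are averages over batches, so it is worth seeing how the data loaders cut the data into batches of 16. The sketch below is a simplified plain-Python stand-in for that behavior, not PyTorch's implementation; the helper iter_batches is hypothetical, and the sizes 2,000 and 500 assume the 25,000-review file and the splits used above.

```python
import random

def iter_batches(items, batch_size, shuffle=False, seed=None):
    """Yield consecutive batches, optionally after shuffling the order."""
    order = list(items)
    if shuffle:
        random.Random(seed).shuffle(order)  # a new order per epoch in real training
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]

train_batches = list(iter_batches(range(2000), batch_size=16, shuffle=True, seed=42))
val_batches = list(iter_batches(range(500), batch_size=16))

# number of training batches, number of validation batches, size of the last validation batch
print(len(train_batches), len(val_batches), len(val_batches[-1]))  # 125 32 4
```

Under these assumptions each epoch performs 125 parameter updates, and the validation loss is the mean over 32 batch losses, the last of which covers only 4 reviews.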
Once the model has been trained, we can eyeball results.
predict(model, "this is a hell of a movie", tokenizer)
1
predict(model, "this movie is hell", tokenizer)
0
And do a more rigorous evaluation on the held-out test set:
df = pd.DataFrame()
df['text'] = X_test
df['label'] = y_test
df['prediction'] = [predict(model, text, tokenizer) for text in X_test]

metrics = {
    'Accuracy': accuracy_score(df['label'], df['prediction']),
    'Precision': precision_score(df['label'], df['prediction'], average='weighted'),
    'Recall': recall_score(df['label'], df['prediction'], average='weighted'),
    'F1 Score': f1_score(df['label'], df['prediction'], average='weighted')
}
print("\nEvaluation Metrics:")
for metric, value in metrics.items():
    print(f"{metric}: {value:.3f}")
Evaluation Metrics:
Accuracy: 0.830
Precision: 0.832
Recall: 0.830
F1 Score: 0.830
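To see what the weighted averages mean, the four metrics can be recomputed by hand. This is a minimal sketch in plain Python on a made-up toy example; the helper weighted_scores is hypothetical and is meant to mirror the idea behind average='weighted' (per-class scores weighted by the number of true instances of each class), not to replace scikit-learn.

```python
def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        support = sum(t == cls for t in y_true)     # true instances of this class
        predicted = sum(p == cls for p in y_pred)   # instances predicted as this class
        p_cls = tp / predicted if predicted else 0.0
        r_cls = tp / support
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if (p_cls + r_cls) else 0.0
        weight = support / n                        # share of this class in the data
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls
    return accuracy, precision, recall, f1

# Toy labels, not taken from the IMDb data
y_true_toy = [1, 1, 1, 1, 0, 0]
y_pred_toy = [1, 1, 0, 0, 0, 0]
acc, prec, rec, f1 = weighted_scores(y_true_toy, y_pred_toy)
print(f"{acc:.3f} {prec:.3f} {rec:.3f} {f1:.3f}")  # 0.667 0.833 0.667 0.667
```

Note that the support-weighted recall always equals the accuracy, which is why both are reported as 0.830 above.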