needs(tidyverse)
Chapter 19: Agent-based Modeling – Empirically Calibrated
Introduction
Note: this script has been kindly provided by Sascha Grehl. You might notice the difference in coding style (e.g., the old magrittr pipe operator, not necessarily using all of the tidyverse packages). Style preferences differ, but this script impressively showcases that there are different ways of doing things and that they are neither mutually exclusive nor necessarily better. I made some changes where I deemed them necessary, mostly to ensure compatibility with Macs. You can find the original code in this GitHub repository.
Welcome to our session on Agent-Based Modeling (ABM) and parameter calibration in R. In the coming chapters, you will learn how to set up a simple ABM that simulates the diffusion of a superior innovation among agents, then use calibration techniques to estimate model parameters from observed (or synthetic) data.
Agent-Based Modeling (ABM) is a computational approach where a large number of individual entities – the agents – interact within a defined environment. Each agent can follow simple or complex rules, and through these interactions, emergent phenomena arise at the system level. The agents can represent people, firms, animals, or any discrete decision-making units. Rules are pre-specified by the model architect and often include how agents move, communicate, adopt behaviors, or otherwise change over time. Finally, the environment can be spatial (e.g., a grid or network), social (e.g., hierarchy, peer groups), or both.
ABMs are nice for a couple of reasons: they capture heterogeneity, as each agent can have different attributes or decision rules. Furthermore, they model interactions directly – agents can influence each other’s state (e.g., spread of ideas or contagion). Finally, they allow for emergence, thereby explicitly capturing the final micro–macro step of the Coleman boat; this allows us to thoroughly model and investigate how large-scale patterns (e.g., diffusion curves) arise from simple local interactions.
However, a model, no matter how cleverly designed, is most useful when it can be compared to real data or synthetic data that mimics reality. Calibration is the process of adjusting a model’s internal parameters so that its output matches observed data as closely as possible. To this end, we perform parameter calibration, i.e., find the combination of parameters (e.g., contagion probability, contact rate, etc.) that leads to the most realistic model outputs. Then we can validate: once the parameters are estimated, we need to check if the model can predict data in scenarios beyond what it was originally fit to. This is useful since a calibrated model is more reliable for exploring what-if scenarios or interventions.
In today’s session, we will use synthetic historic data generated with hidden (“secret”) parameters. Your goal will be to figure out which parameters best explain the simulated outcomes—mimicking what you would do with real-world data if these parameters were unknown. To this end, we will first build up an ABM. Then, we will build a function that takes two models and compares the goodness of their fit (i.e., how well their outcomes match). Finally, we will systematically alter the parameters of our existing model to find the specification that matches our existing, synthetic data the best.
By the end of this session, you will have practical experience in:
- Building and running an agent-based simulation in R.
- Implementing calibration routines and interpreting the results.
- Visualizing parameter search results with heatmaps or other techniques.
Building the model
In this chapter, we will delve into the mechanics of an agent-based model (ABM) that simulates the diffusion of an innovation over time. Our model starts with a grid-based world of agents, allows them to interact and spread the innovation, and then collects data on how the innovation diffuses over time.
For our diffusion simulation, we use a simple mesh grid of size \(M \times M\). Each cell on this grid is occupied by \(n\) agents. Agents are either innovators (already using the new innovation) or non-innovators.
At each time step, all innovators:

1. Draw a random distance d from a specified distribution (e.g., geometric, Weibull, etc.), using the draw_distance() function that model_implement_distance_function() attaches to the model.
2. Select one random cell exactly d units away (in Manhattan distance).
3. Pick one random agent in that cell (the targeted agent).
4. Attempt to convince the targeted agent to adopt the innovation, succeeding with probability p (defined by agent_adoption_p).

If successful, the targeted agent becomes an innovator as well.

Additionally, we include a seed for reproducibility, so each run can produce consistent results. Here is an example list capturing these model parameters:

model <- list(
  seed = 12345,                          # for reproducibility
  world_size = 13,                       # grid dimension (M)
  agents_per_cell = 6,                   # agents per cell (n)
  agent_distance_function = "geometric", # the specified distribution d is drawn from
  agent_distance_parameter = c(0.8),     # the parameters for the distribution function
  agent_adoption_p = 0.8                 # adoption probability (p)
)
Creating the World
After defining the model parameters, we need a function to initialize our world with agents. Each agent is placed on an \(M \times M\) grid, with \(n\) agents per cell. We also specify which agent starts off with the innovation.
world_init <- function(world_size = 9,
                       agents_per_cell = 3) {
  # Create the grid
  cells <- expand.grid(x = 1:world_size, y = 1:world_size)

  # Assign agents to grid cells
  world <- cells %>%
    slice(rep(1:n(), each = agents_per_cell)) %>%
    mutate(
      id = row_number(),
      cell_id = x + (y - 1) * world_size,
      # Initialize the innovation to 1 in the center for just one agent
      innovation = ifelse(
        x == ceiling(world_size / 2) &
          y == ceiling(world_size / 2) &
          row_number() == ceiling((world_size * world_size * agents_per_cell) / 2),
        1,
        0
      )
    )
  return(world)
}

world <- world_init()
world |> head()
x y id cell_id innovation
1 1 1 1 1 0
2 1 1 2 1 0
3 1 1 3 1 0
4 2 1 4 2 0
5 2 1 5 2 0
6 2 1 6 2 0
Now world is a data frame with one row per agent, containing its x and y coordinates, its cell_id, and its innovation status (1 = innovator, 0 = not).
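As a quick sanity check (a snippet of our own, not part of the original script), we can verify that every cell holds the same number of agents and that exactly one agent starts as an innovator:

world %>% count(cell_id) %>% distinct(n) # every cell holds 3 agents (the default)
sum(world$innovation)                    # exactly one innovator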
Additional Setup: Distance Drawing and Position Lookup
Before we turn to updating the world in each step, we introduce two functions that will help us run the simulation faster.

When we run our ABM, each innovator needs to pick a distance d from a probability distribution (e.g., a geometric or Weibull distribution). To avoid repeatedly switching among these distributions at each step, we define a single helper function that attaches the appropriate draw_distance() function to the model object.
model_implement_distance_function <- function(model){
  model$draw_distance <- switch(
    model$agent_distance_function,
    "weibull" = function(n = 1) {
      round(rweibull(n,
                     shape = model$agent_distance_parameter[1],
                     scale = model$agent_distance_parameter[2]))
    },
    "geometric" = function(n = 1) {
      round(rgeom(n, prob = model$agent_distance_parameter[1]))
    },
    stop("Unsupported agent distance function")
  )
  return(model)
}

# Attach the drawing function (note the assignment, so we can use it later)
model <- model_implement_distance_function(model)
model
$seed
[1] 12345
$world_size
[1] 13
$agents_per_cell
[1] 6
$agent_distance_function
[1] "geometric"
$agent_distance_parameter
[1] 0.8
$agent_adoption_p
[1] 0.8
$draw_distance
function (n = 1)
{
round(rgeom(n, prob = model$agent_distance_parameter[1]))
}
<environment: 0x12f908a70>
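To see the attached sampler in action, we can call it directly (a quick illustration of our own; the exact draws depend on the seed):

set.seed(42)
model$draw_distance(10) # ten draws from the geometric distribution with prob = 0.8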
We also define a position lookup function to quickly find all agents in a specific cell, greatly improving performance when we need to identify potential targets.
create_position_lookup <- function(world) {
  # Group by cell_id, store a list of all IDs in that cell
  # 'deframe()' makes a named list with 'cell_id' as the name
  world %>%
    group_by(cell_id) %>%
    summarize(agent_ids = list(id), .groups = "drop") %>%
    deframe()
}
create_position_lookup(world) |> head()
$`1`
[1] 1 2 3
$`2`
[1] 4 5 6
$`3`
[1] 7 8 9
$`4`
[1] 10 11 12
$`5`
[1] 13 14 15
$`6`
[1] 16 17 18
A single time step
At each time step in our ABM, we want to simulate how existing innovators attempt to convince other agents to adopt the innovation. The process includes:
- Identifying all current innovators.
- Drawing a random distance for each innovator and selecting a target cell using Manhattan distance.
- Picking a random agent in that cell.
- Updating the targeted agents’ status if they have been convinced to adopt the innovation.
However, for performance reasons, our actual implementation filters out would-be unsuccessful innovators before drawing distances. This approach avoids unnecessary computations for innovators who would fail anyway. For example, if we have 1,000 innovators but only a 5% chance of convincing someone, filtering first means we only process about 50 innovators for the subsequent steps – significantly reducing computational overhead.
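To make the gain concrete, here is a small standalone sketch (illustrative numbers only, not part of the model):

# With 1,000 innovators and a 5% success probability, pre-filtering
# leaves roughly 50 agents for the expensive distance/target steps.
set.seed(1)
random_value <- runif(1000)
sum(random_value <= 0.05) # approx. 50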
Below is the world_step() function, highlighting how these steps are organized for both correctness and efficiency.
world_step <- function(
    world,
    model,
    position_lookup = create_position_lookup(world) # if no position_lookup is provided, it is calculated automatically
){
  # 1) Select all agents who are innovators
  innovators <- world %>% filter(innovation == 1)

  # If everyone is already converted, don't run the simulation step
  if (nrow(innovators) == nrow(world)) return(world)

  # 2) Filter out innovators whose random_value is above the threshold.
  #    That is, only keep those who will successfully convince a target agent.
  #    Filtering here saves performance because we skip the following steps
  #    for innovators who wouldn't convert anyone anyway.
  innovators <- innovators %>%
    mutate(random_value = runif(n())) %>%
    filter(random_value <= model$agent_adoption_p)

  # If no innovators remain after filtering, skip the step
  if (nrow(innovators) == 0) return(world)

  # 3) Draw a random distance for each innovator
  innovators <- innovators %>%
    mutate(
      distances = model$draw_distance(nrow(innovators))
    )

  # 4) For each innovator, determine a random target cell using Manhattan distance
  innovators <- innovators %>%
    mutate(
      target_cell = mapply(
        function(x, y, d) {
          get_a_cell_at_manhattan_distance(x, y, d, model$world_size)
        },
        x, y, distances
      )
    )

  # 5) For each innovator, pick a random target agent within the target cell
  innovators <- innovators %>%
    mutate(
      target_id = mapply(
        function(tc, my_id) {
          # If invalid cell, return -1
          if (tc == -1) return(-1)

          # Possible targets in that cell
          targets_id <- position_lookup[[tc]]

          # Exclude the innovator itself
          targets_id <- targets_id[targets_id != my_id]

          # If no one is left in that cell, return -1
          if (length(targets_id) == 0) return(-1)

          # Otherwise pick one at random
          sample(targets_id, 1)
        },
        target_cell, # from innovators
        id           # from innovators
      )
    )

  # Collect all valid targets (target_id != -1)
  targets <- innovators %>% filter(target_id != -1) %>% pull(target_id)

  # 6) Update those target agents to become innovators
  if (length(targets) > 0) {
    world$innovation[world$id %in% targets] <- 1
  }

  return(world)
}

world_step(world, model) |>
  head()
x y id cell_id innovation
1 1 1 1 1 0
2 1 1 2 1 0
3 1 1 3 1 0
4 2 1 4 2 0
5 2 1 5 2 0
6 2 1 6 2 0
One key detail is how we sample a random cell at distance d. We use the Manhattan metric (often referred to as L1 distance), which measures distance by summing the absolute differences in the x and y coordinates. In a grid setting, moving two cells to the left is considered equivalent in “distance cost” to moving one cell left and one cell up – both yield a Manhattan distance of 2. This makes sense for grid-based worlds, where movement is often discrete rather than diagonal or Euclidean.
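As a small worked example (a hypothetical helper of our own; the model itself never needs it explicitly):

# Manhattan (L1) distance between two cells
manhattan_distance <- function(x1, y1, x2, y2) abs(x1 - x2) + abs(y1 - y2)
manhattan_distance(3, 3, 1, 3) # two cells to the left: distance 2
manhattan_distance(3, 3, 2, 4) # one cell left, one cell up: also distance 2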
The function for this is as follows:
get_a_cell_at_manhattan_distance <- function(x,
                                             y,
                                             d,
                                             world_size) {
  # Return the same cell right away if d == 0
  if (d == 0) return(x + (y - 1) * world_size)

  # For these values, we know for sure, there are no valid cells
  if (d > (world_size - 1) * 2) return(-1)

  cells_at_distance <- do.call(
    rbind,
    lapply(0:d, function(k) {
      dist_rem <- d - k
      # Four possible points for each k
      rbind(
        c(x + k, y + dist_rem),
        c(x + k, y - dist_rem),
        c(x - k, y + dist_rem),
        c(x - k, y - dist_rem)
      )
    })
  )

  # Remove duplicates (especially when k=0 or dist_rem=0)
  cells_at_distance <- unique(cells_at_distance)

  # Filter out-of-bounds points
  in_bounds <- cells_at_distance[, 1] >= 1 & cells_at_distance[, 1] <= world_size &
    cells_at_distance[, 2] >= 1 & cells_at_distance[, 2] <= world_size
  cells_at_distance <- cells_at_distance[in_bounds, , drop = FALSE]

  # If no valid cells remain return -1
  if (nrow(cells_at_distance) == 0) return(-1)

  # Pick one cell at random
  chosen_cell <- cells_at_distance[sample(nrow(cells_at_distance), 1), ]

  return(chosen_cell[1] + (chosen_cell[2] - 1) * world_size)
}
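A quick illustrative call (the returned cell_id depends on the random draw):

set.seed(1)
get_a_cell_at_manhattan_distance(x = 7, y = 7, d = 2, world_size = 13)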
Running the Full Simulation
After defining how to initialize the world (world_init()
) and how a single time step unfolds (world_step()
), we now need to combine these pieces to run our ABM across multiple rounds. The function run_simulation()
orchestrates all the steps, producing a list of world states -— one for each time step.
Here is the function:
run_simulation <- function(model,
                           max_steps = 25) {
  set.seed(model$seed)
  worlds <- list()

  for (i in 1:max_steps){
    if (i == 1) {
      # 1) Initialize the world in the first iteration
      worlds[[i]] <- world_init(model$world_size, model$agents_per_cell)

      # 2) Create a lookup table for quick agent lookups
      #    Agents do not move, so building this once is sufficient.
      position_lookup <- create_position_lookup(worlds[[i]])

      # 3) Attach the chosen distance-drawing function to the model
      #    (e.g., "weibull" or "geometric")
      model <- model_implement_distance_function(model)
    } else {
      # 4) Update the world one step at a time
      worlds[[i]] <- world_step(worlds[[i-1]], model, position_lookup)
    }
  }
  return(worlds)
}
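As a minimal usage sketch (assuming the model list defined above), we can run a few steps and count the innovators at each step:

sim_short <- run_simulation(model, max_steps = 5)
sapply(sim_short, function(w) sum(w$innovation)) # number of innovators per step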
Plotting the Simulation
Understanding the dynamics of your agent-based model (ABM) is crucial. Visualizations help you interpret and communicate the spread of innovation across the grid. In this section, we’ll explore two primary methods for visualizing the simulation results: static plots provide snapshot views of the world at specific time steps; animations, on the other hand, provide dynamic visualizations that show the evolution of the innovation over time.
Creating Static Plots
A static plot provides a clear snapshot of the world’s state at a particular time step. The world_print() function uses ggplot2 to generate a color-coded grid that represents the share of innovators in each cell.
world_print <- function(world,
                        step = NA,
                        palette = "RdYlBu") {
  # Summarize innovation per cell
  world_summary <- world %>%
    group_by(x, y) %>%
    summarise(share_innovation = mean(innovation), .groups = 'drop')

  # Create the plot
  plot <- ggplot(world_summary, aes(x = x, y = y, fill = share_innovation)) +
    # Draw a black background for the entire grid
    geom_tile(data = expand.grid(x = seq(min(world$x), max(world$x)),
                                 y = seq(min(world$y), max(world$y))),
              aes(x = x, y = y), fill = "black") +
    # Overlay the tiles for non-empty cells
    geom_tile() +
    scale_x_continuous(breaks = 1:max(world_summary$x)) +
    scale_y_continuous(breaks = 1:max(world_summary$y)) +
    coord_fixed() +
    theme_minimal() +
    scale_fill_distiller(palette = palette, limits = c(0, 1)) + # Fill color for tiles
    labs(title = paste0("World Innovation Map: Round ", step),
         x = "X Coordinate",
         y = "Y Coordinate")

  print(plot)
}
sim_data <- run_simulation(model)
world_print(sim_data[[3]], step = 3)
world_print(sim_data[[25]], step = 25)
Creating Animations
While static plots are useful for snapshots, animations provide a comprehensive view of how innovation spreads over time. Animations can make it easier to identify patterns, trends, and anomalies that might not be apparent from static images.
Due to server restrictions on the University Cluster’s RStudio Server version, certain animation functionalities may not work directly within the environment. To circumvent this, animation functions are excluded from the main script and placed in a separate file: functions_animations.R. This allows you to run animations locally on your machine without interfering with server policies.

(Note Felix: the following function has not been tested on the server. You might consider using Sascha’s original function that can be found in the GitHub repository.)
needs(gganimate)

world_animation <- function(worlds){
  all_worlds <- bind_rows(lapply(seq_along(worlds), function(i) {
    worlds[[i]] %>%
      group_by(x, y) %>%
      summarise(share_innovation = mean(innovation), .groups = 'drop') %>%
      mutate(time = i)
  }))

  anim <- ggplot(all_worlds, aes(x = x, y = y, fill = share_innovation)) +
    geom_tile() +
    scale_x_continuous(breaks = 1:max(all_worlds$x)) +
    scale_y_continuous(breaks = 1:max(all_worlds$y)) +
    coord_fixed() +
    theme_minimal() +
    scale_fill_distiller(palette = "RdYlBu", limits = c(0, 1)) +
    labs(title = "World Innovation Map: Step {frame_time}") +
    transition_time(time) +
    ease_aes('linear')

  animate(anim, nframes = length(worlds), fps = 2)
}

world_animation(sim_data)
Calibration
In this chapter, we delve into calibration, a crucial process in agent-based modeling (ABM) that aligns simulation outputs with observed (or synthetic) data. Calibration ensures that your model accurately reflects the real-world processes it aims to represent by fine-tuning its parameters. We will explore how to measure the closeness of your simulation to historic data and employ techniques to identify the best-fitting parameters.
While our simulation generates a wealth of detailed data, this richness can be overwhelming and, in many cases, unnecessary for calibration purposes. To effectively calibrate our model, we first need to reduce the simulation data to a form that is comparable to our observed (historic) data. This involves extracting the relevant metrics from the simulation that correspond to the data available from the real-world process we aim to model. This ensures that we are comparing like with like, making the calibration process meaningful and effective. We will refer to this data-reduction process as “generating a report.” The purpose of the report is to distill the simulation’s rich data into a concise summary that mirrors the structure and content of our real-world data.
In our script, the report_generate(world_history) function takes a list of world states at different time steps (referred to as world_history) and produces a data frame containing all relevant information necessary for calibration.
Example Report Structure
For our diffusion of innovation model, we assume that we know:
- The global share of innovators in the population.
- The number of cells in the grid that fall into specific share thresholds of innovators.
Here’s an illustrative example of what the report might look like:
report_generate <- function(world_history){
  report <- tibble()

  for (i in seq_along(world_history)) {
    world <- world_history[[i]]

    share_global <- sum(world$innovation, na.rm = TRUE) / nrow(world)

    world_summary <- world %>%
      group_by(x, y) %>%
      summarise(share_innovation = mean(innovation), .groups = 'drop') %>%
      mutate(share_innovation = round(share_innovation, 3) |> as.character()) |>
      group_by(share_innovation) %>%
      summarise(cells = n(), .groups = 'drop') %>%
      mutate(step = i) %>%
      pivot_wider(
        names_from = share_innovation,
        values_from = cells,
        values_fill = 0 # Fill missing values with 0
      ) %>%
      ungroup() %>%
      mutate(share_global = share_global)

    report <- bind_rows(report, world_summary)
  }

  report <- report %>%
    mutate(across(everything(), ~ replace_na(., 0))) %>%
    relocate(step, share_global)

  share_cols <- names(report)[!names(report) %in% c("share_global", "step")] # Exclude original columns
  new_names <- paste0("s_", seq_along(share_cols)) # Generate s_1, s_2, ...
  names(report)[!names(report) %in% c("share_global", "step")] <- new_names

  return(report)
}
report_generate(sim_data) |> head()
# A tibble: 6 × 9
step share_global s_1 s_2 s_3 s_4 s_5 s_6 s_7
<int> <dbl> <int> <int> <int> <int> <int> <int> <int>
1 1 0.000986 168 1 0 0 0 0 0
2 2 0.000986 168 1 0 0 0 0 0
3 3 0.00197 167 2 0 0 0 0 0
4 4 0.00197 167 2 0 0 0 0 0
5 5 0.00296 167 1 1 0 0 0 0
6 6 0.00296 167 1 1 0 0 0 0
The report is structured as follows:
- step: The time step in the simulation.
- share_global: The overall proportion of innovators in the entire population.
- s_1 to s_7: The number of cells that fall into specific thresholds of local innovation shares: s_1 counts the cells with 0% innovators; s_2 to s_7 count cells with 1/6 up to 6/6 (100%) innovators.
For instance, at step 1, the global share of innovators is 0.000986, with 168 cells having 0% innovators and 1 cell with 1/6 innovators. By step 5, the global share has increased to 0.00296, and the distribution of innovators across cells becomes more varied.
Including this summarized information is essential for calibration because it mirrors the structure of the real-world data we aim to fit our model to. In the next section, we will discuss how to quantify the fit between the simulation reports and historic data using a fit function.
Properties of a good fit function
A fit function (also known as an objective function, distance metric, or error measure) quantifies how closely the simulation’s output aligns with the observed (historic) data. Selecting an appropriate fit function is critical for effective calibration. Here are the desirable properties of a good fit function:
- Dimensionless:
- The fit function should be dimensionless to allow the comparison of discrepancies across different data dimensions.
- This enables the aggregation of errors without bias towards any particular scale.
- Weighted:
- The ability to assign weights to different data points ensures that more important aspects of the data have a greater influence on the fit score.
- This flexibility allows prioritizing certain metrics over others based on their relevance.
- Convex:
- The fit function should treat two small discrepancies of size \(a\) as more desirable than one large discrepancy of size \(2a\) and no discrepancy elsewhere.
- This encourages a more uniform fit across all data points rather than allowing a single outlier to dominate the fit score.
- Symmetric:
- The function should treat positive and negative discrepancies equally.
- For example, being twice as high as the observed value should carry the same weight as being half as high.
- Non-Negative:
- Discrepancies should be non-negative to ensure that errors do not cancel each other out.
- This property maintains the integrity of the fit score as a true measure of discrepancy.
- Finite:
- The fit function should produce finite values for finite inputs.
- This ensures stability and reliability in the calibration process.
A commonly used fit function that satisfies these properties is:
\[ \text{Fit} = \sum_{i=1}^{v} w_i \cdot \left( \frac{h_i - m_i}{\frac{h_i + m_i}{2}} \right)^2 \]
With
- \(h_i\) = Historic (observed) value \(i\).
- \(m_i\) = Model (simulation) value \(i\).
- \(w_i\) = Weight assigned to this particular data point \(i\).
- \(v\) = Number of data points fitted.
This function computes the squared relative difference, with the difference taken relative to the mean of the two values, ensuring symmetry, non-negativity, and dimensionlessness. The weight \(w_i\) allows for the prioritization of specific data points. However, for this to work, the minimum values of \(h\) and \(m\) must be 0, which may require some data transformation. A perfect fit then always yields a Fit of 0.
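Translated into R, the formula could be sketched as follows (a minimal standalone version of our own; the measure_fit() function below applies the same idea to whole reports):

# Fit formula for two numeric vectors h (historic) and m (model) with weights w
fit_score <- function(h, m, w = rep(1, length(h))) {
  rel_sq <- ((h - m) / ((h + m) / 2))^2
  rel_sq[is.nan(rel_sq)] <- 0 # 0/0 occurs when both values are 0, i.e., a perfect match
  sum(w * rel_sq)
}

fit_score(h = c(10, 0, 5), m = c(8, 0, 5)) # only the first pair contributes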
Defining the Fit Function
To quantitatively assess the similarity between our simulation output and historic data, we define the measure_fit() function. This function calculates the fit score by comparing the relevant metrics extracted from both datasets.
# Example measure of fit between two reports
measure_fit <- function(historic_report, sim_report) {
  # Join by step (inner join: only compare matching steps)
  df <- dplyr::inner_join(historic_report, sim_report,
                          by = "step",
                          suffix = c("_hist", "_sim"))

  # Identify numeric columns to compare (excluding "step")
  compare_cols <- setdiff(names(df), "step")
  compare_cols_hist <- grep("_hist$", compare_cols, value = TRUE)
  compare_cols_sim <- grep("_sim$", compare_cols, value = TRUE)

  # We'll assume the columns match in order, i.e.
  # for every X_hist there is a corresponding X_sim.
  # For safety, you could match them by sub-string if needed.

  # For each pair, compute squared differences
  diff <- c()
  for (col_hist in compare_cols_hist) {
    col_sim <- sub("_hist$", "_sim", col_hist) # get the matching sim column
    if (col_sim %in% names(df)) {
      # numeric difference
      v_hist <- df[[col_hist]]
      v_sim <- df[[col_sim]]
      # you can use a different method to calculate the difference,
      # but this function has the following property:
      # if the min. value is 0, regardless of the max. value the diff is
      # in the interval [0, 4], which makes it very comparable
      diff <- c(diff, ((v_hist - v_sim) / ((v_hist + v_sim) / 2))^2)
      diff[is.nan(diff)] <- 0
    }
  }
  # Return mean difference (or sum, or RMSE, etc.)
  mean(diff, na.rm = TRUE)
}
As described earlier, we might want to give different weights to certain data points; the following function allows for this. We can either give more weight to particular steps or to particular columns.
# a more complex function that allows for different weights with respect to certain cols or steps
measure_fit_weighted <- function(historic_report,
                                 sim_report,
                                 weight_steps = NULL, # e.g. c(1,2,3,4,5)
                                 weight_cols = NULL   # e.g. c("share_global"=2, "s_1"=1, "s_2"=1, ...)
) {
  # Join by step (inner join: only compare matching steps)
  df <- dplyr::inner_join(
    historic_report, sim_report,
    by = "step", suffix = c("_hist", "_sim")
  )

  # Identify numeric columns to compare (excluding "step")
  compare_cols <- setdiff(names(df), "step")
  compare_cols_hist <- grep("_hist$", compare_cols, value = TRUE)
  compare_cols_sim <- grep("_sim$", compare_cols, value = TRUE)

  # We will store our weighted differences in a vector:
  all_diff <- numeric(0) # empty numeric vector

  for (col_hist in compare_cols_hist) {
    # Find the matching simulated column
    col_sim <- sub("_hist$", "_sim", col_hist)
    if (!(col_sim %in% names(df))) next # skip if missing

    # Extract the underlying variable name (without _hist / _sim)
    var_name <- sub("_hist$", "", col_hist)

    # Grab vectors from the data frame
    v_hist <- df[[col_hist]]
    v_sim <- df[[col_sim]]

    # We'll compute the unweighted difference
    raw_diff <- ((v_hist - v_sim) / ((v_hist + v_sim) / 2))^2

    # Replace NaN with 0
    raw_diff[is.nan(raw_diff)] <- 0

    # ------------------------------------------------------------------
    # Column-based weighting
    # ------------------------------------------------------------------
    # If `weight_cols` is provided, it should be something like:
    #   weight_cols = c("share_global"=2, "s_1"=1, "s_2"=3, ...)
    # If var_name isn't in weight_cols, we'll assume weight = 1
    if (!is.null(weight_cols) && var_name %in% names(weight_cols)) {
      w_col <- weight_cols[[var_name]]
    } else {
      w_col <- 1
    }

    # ------------------------------------------------------------------
    # Step-based weighting
    # ------------------------------------------------------------------
    # If `weight_steps` is provided, e.g. weight_steps = c(1,2,3,4,5),
    # we assume the step values go from 1..N, so we can do:
    #   w_steps[i] = weight_steps[df$step[i]]
    # If no match, default to 1.
    if (!is.null(weight_steps)) {
      # Some caution: step might not start at 1.
      # If your steps are strictly 1..max, we can do:
      w_steps <- weight_steps[df$step]
      # If step is not guaranteed to be an integer index, you can do
      # a named lookup, e.g. weight_steps = c("1"=1, "2"=2, ...)
      #   w_steps <- weight_steps[as.character(df$step)]
      # fallback for missing names
      w_steps[is.na(w_steps)] <- 1
    } else {
      w_steps <- rep(1, length(raw_diff))
    }

    # ------------------------------------------------------------------
    # Combine the weights: multiply raw_diff by (w_col * w_steps)
    # ------------------------------------------------------------------
    weighted_diff <- raw_diff * w_col * w_steps

    # Accumulate in our all_diff vector
    all_diff <- c(all_diff, weighted_diff)
  }

  # Finally, compute your desired summary (mean, sum, RMSE, etc.)
  mean(all_diff, na.rm = TRUE)
}
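For instance, a hypothetical call that doubles the weight of the global share and of the first five steps could look like this (weights chosen arbitrarily; it assumes historic_report and sim_report objects as created further below):

measure_fit_weighted(
  historic_report,
  sim_report,
  weight_cols = c("share_global" = 2),
  weight_steps = rep(c(2, 1), times = c(5, 20)) # steps 1-5 weighted double
)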
The calibrate_model() function tests different parameter combinations in an agent-based model to find settings that best match historical data. It creates a grid of parameter combinations, including grid sizes, agents per cell, distance functions, and adoption probabilities. For each combination, it runs multiple simulations to account for randomness, generates reports comparing to historic data, and calculates an average fit score.

The function works by first building a data frame of all parameter combinations using expand.grid(). It then loops through each combination, modifying the base model with new parameters using modifyList(). For each parameter set, it runs the simulation n_sims times with different random seeds, generates reports using report_generate(), and measures the fit against historic data using measure_fit(). The results are stored in a tibble showing how well each parameter combination performed.
needs(foreach, doParallel)

calibrate_model <- function(
    historic_report,
    model,
    # if a parameter below is NA, use the default provided by `model`,
    # otherwise use the supplied values to generate a grid
    world_size = NA,
    agents_per_cell = NA,
    agent_distance_function = NA,
    agent_distance_parameter = NA,
    agent_adoption_p = NA,
    # number of simulation runs for each combination
    n_sims = 3,
    # number of steps in each run
    max_steps = 25
) {
  # 1) Build a data frame of all combinations
  param_grid <- expand.grid(
    world_size               = if_else(is.na(world_size), model$world_size, world_size),
    agents_per_cell          = if_else(is.na(agents_per_cell), model$agents_per_cell, agents_per_cell),
    agent_distance_function  = if_else(is.na(agent_distance_function), model$agent_distance_function, agent_distance_function),
    agent_distance_parameter = if_else(is.na(agent_distance_parameter), model$agent_distance_parameter, agent_distance_parameter),
    agent_adoption_p         = if_else(is.na(agent_adoption_p), model$agent_adoption_p, agent_adoption_p),
    stringsAsFactors = FALSE
  )

  # We'll store results in a data frame
  results <- tibble()

  # 2) Loop over all combinations in the grid
  for (i in seq_len(nrow(param_grid))) {
    tmp_model <- modifyList(model, as.list(param_grid[i, ]))
    print(param_grid[i, ])

    fit_values <- numeric(n_sims) # holds error score for each replicate

    # 3) Repeat the simulation multiple times to average out randomness
    for (rep_i in seq_len(n_sims)) {
      # Increase seed to vary each run
      tmp_model$seed <- model$seed - 1 + rep_i

      # Run the simulation
      sim_worlds <- run_simulation(tmp_model, max_steps)

      # Generate the simulated report
      sim_report <- report_generate(sim_worlds)

      # Compute error/fit relative to historic data
      fit_values[rep_i] <- measure_fit(historic_report, sim_report)
    }

    # Average fit score across replicates
    avg_fit <- mean(fit_values)

    # 4) Store the results
    results <- results %>%
      bind_rows(tibble(
        world_size = tmp_model$world_size,
        agents_per_cell = tmp_model$agents_per_cell,
        agent_distance_function = tmp_model$agent_distance_function,
        agent_distance_parameter = tmp_model$agent_distance_parameter,
        agent_adoption_p = tmp_model$agent_adoption_p,
        start_seed = model$seed,
        simulations = n_sims,
        fit = avg_fit
      ))
  }
  return(results)
}
Since this is tedious work that might take a long time, the parallel version of the calibration function, calibrate_model_mc(), uses multiple cores to speed up parameter testing. It creates a parallel computing cluster, distributes work across available cores, and combines results into a single dataset. The implementation improves efficiency when testing many parameter combinations.
calibrate_model_mc <- function(
    historic_report,
    model,
    # if a parameter below is NA, use the default provided by `model`,
    # otherwise use the supplied values to generate a grid
    world_size = NA,
    agents_per_cell = NA,
    agent_distance_function = NA,
    agent_distance_parameter = NA,
    agent_adoption_p = NA,
    # number of simulation runs for each combination
    n_sims = 3,
    # number of steps in each run
    max_steps = 25,
    # maximum cores to use
    max_cores = 24
) {
  # 1) Build a data frame of all combinations
  param_grid <- expand.grid(
    world_size               = if_else(is.na(world_size), model$world_size, world_size),
    agents_per_cell          = if_else(is.na(agents_per_cell), model$agents_per_cell, agents_per_cell),
    agent_distance_function  = if_else(is.na(agent_distance_function), model$agent_distance_function, agent_distance_function),
    agent_distance_parameter = if_else(is.na(agent_distance_parameter), model$agent_distance_parameter, agent_distance_parameter),
    agent_adoption_p         = if_else(is.na(agent_adoption_p), model$agent_adoption_p, agent_adoption_p),
    stringsAsFactors = FALSE
  )

  # 2) Set up parallel back-end
  n_cores <- min(parallel::detectCores(), max_cores, nrow(param_grid))
  cat("Using", n_cores, "cores\n")

  # Initialize a cluster and register it
  cl <- parallel::makeCluster(n_cores)
  doParallel::registerDoParallel(cl)

  # 3) Use foreach to iterate in parallel over each row of param_grid
  #    .combine = rbind    => combine results row-by-row
  #    .packages = "dplyr" => loads dplyr on each worker
  #    .export             => exports functions from the global environment
  #                           that the workers need (run_simulation, report_generate, measure_fit)
  results <- foreach(
    i = seq_len(nrow(param_grid)),
    .combine = rbind,
    .packages = c("dplyr"),
    .export = c("run_simulation", "report_generate", "measure_fit")
  ) %dopar% {
    source("files/functions_simulation.R")

    # Copy the base model and override parameters from row i
    tmp_model <- modifyList(model, as.list(param_grid[i, ]))

    # Run multiple simulations for this combination
    fit_values <- numeric(n_sims)
    for (rep_i in seq_len(n_sims)) {
      # Vary the seed slightly for each replicate
      tmp_model$seed <- model$seed - 1 + rep_i

      sim_worlds <- run_simulation(tmp_model, max_steps)
      sim_report <- report_generate(sim_worlds)
      fit_values[rep_i] <- measure_fit(historic_report, sim_report)
    }

    # Average fit score across replicates
    avg_fit <- mean(fit_values)

    # Return a one-row data.frame for foreach to combine
    data.frame(
      world_size = tmp_model$world_size,
      agents_per_cell = tmp_model$agents_per_cell,
      agent_distance_function = tmp_model$agent_distance_function,
      agent_distance_parameter = tmp_model$agent_distance_parameter,
      agent_adoption_p = tmp_model$agent_adoption_p,
      start_seed = model$seed,
      simulations = n_sims,
      fit = avg_fit,
      stringsAsFactors = FALSE
    )
  }

  # 4) Stop the cluster
  parallel::stopCluster(cl)

  # Convert results to a tibble and return
  results <- as_tibble(results)
  return(results)
}
The calibration_heatmap() function creates a ggplot2 visualization showing how model fit varies across parameter combinations. It displays results as a color-coded tile plot where the x and y axes represent different parameters and the color intensity indicates fit quality. The function offers customization of variables, color schemes, and plot styling.
calibration_heatmap <- function(
    data,
    x_var = "agent_distance_parameter", # Name of the column to use for x-axis
    y_var = "agent_adoption_p",         # Name of the column to use for y-axis
    fill_var = "fit",                   # Name of the column to use for the fill (e.g. fit/error)
    palette = "RdYlBu",                 # Brewer palette name (see ?RColorBrewer)
    direction = -1,                     # 1 or -1, flips the palette direction
    plot_title = "Calibration Heatmap"
) {
  ggplot(data = data, aes_string(x = x_var, y = y_var, fill = fill_var)) +
    geom_tile() +
    scale_fill_distiller(palette = palette, direction = direction) +
    labs(
      title = plot_title,
      x = x_var,
      y = y_var,
      fill = fill_var
    ) +
    theme_minimal()
}
So now let’s have a look at our historic data. The next chunk loads our data and animates its innovation process. This serves as our “ground truth.”
load("files/historic_data.RData")
<- report_generate(historic_data)
historic_report world_animation(historic_data)
In the next step, we compare the historic data with our prior run. For this, we measure the fit at each time step. A perfect overall fit – i.e., our model perfectly replicating the patterns in the historic data – would be 0.
source("files/functions_simulation.R")
source("files/functions_calibration.R")
load("files/historic_data.RData")
# run a single model
<- list(
model_sim seed = 1234,
world_size = 13,
world_agents_per_cell = 6,
agent_distance_function = "geometric",
agent_distance_parameter = 0.5,
agent_adoption_p = 0.5
)
<- run_simulation(model_sim)
sim_data <- report_generate(sim_data)
sim_report
measure_fit(historic_report, sim_report)
[1] 0.5774862
Then we can move on to running multiple models (i.e., 50 runs per parameter combination) whose outcomes we subsequently compare to the “real,” observed outcome. For this comparison, we draw a heatmap.
# define the base model
base_model <- list(
  seed = 1234,
  world_size = 13,
  agents_per_cell = 6,
  agent_distance_function = "geometric",
  agent_distance_parameter = NA_real_, # placeholder
  agent_adoption_p = NA_real_          # placeholder
)

# run the calibration
calibration_result <- calibrate_model_mc(
  historic_report = historic_report,
  model = base_model,
  agent_distance_parameter = seq(0.05, 1.0, by = 0.05),
  agent_adoption_p = seq(0.05, 1.0, by = 0.05),
  n_sims = 50,
  max_steps = 25
)

save(calibration_result, file = "files/calibration_result.Rdata")

# generate a heatmap based on calibration
calibration_heatmap(calibration_result)
The bluer the tile, the better the fit of our own model to the “original” data. According to our graph, the true adoption parameter lies around 0.4, and the true distance parameter around 0.6.
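To read the heatmap numerically, we can also sort the calibration results by fit, where lower is better (a small follow-up of our own):

calibration_result %>%
  arrange(fit) %>%
  select(agent_distance_parameter, agent_adoption_p, fit) %>%
  head(5)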
While in this toy example the “ground truth” came from an unspecified simulation run of our own, in real research applications you would compare your simulation outcomes to data that stem from observational, longitudinal sources (e.g., surveys, observational studies).
Further links
- Another handbook chapter on ABMs in the social sciences
- A workshop on ABMs by Marijn Keizer and Jan Lorenz (it’s in NetLogo but gives you some info on how to build and interpret ABMs)