Chapter 6: selenium
Sometimes you will run into the problem of a website not being as “scrape-able” as you might have hoped. This might be due to JavaScript, the site blocking you for not being a “real” person, CAPTCHAs, login forms, you name it. To get around this, you can use a package called selenium. What it basically does is control a web browser. Its original purpose is to automate testing of web-based applications. However, it's also perfectly suited for helping you with your scraping endeavors.
selenium is a Python application. An R wrapper (RSelenium) exists, yet it's rather tedious to use, since you would likely have to run your browser out of a Docker container. And if you're on a recent Mac with a Silicon processor, these containers might not even exist yet. Therefore, in this script you will use the original Python version and run Python in this quarto document from your R session. For this, you can use reticulate, which allows you to switch between the R and the Python world. Hence, you can do your scraping in Python and your data manipulation in R, and store all of it in one notebook.
Install Python using reticulate and miniconda
To get started with selenium, you first need to install reticulate and set up your Python. This is rather straightforward and comparably pain-free. One difference between R and Python lies in the basic workflow. In R, all your packages are available to you all the time once installed. In Python, it is common practice to create so-called environments that contain exactly the packages you need. One reason for this is that some packages rely on exact versions of other packages. Hence, when you set up an environment, you can install all the packages you need, and subsequently they should play well together.
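A quick way to check which interpreter — and therefore which environment — a Python chunk is actually running in is to inspect the standard library's `sys` module; with an active conda environment, `sys.prefix` points into that environment's folder:

```python
import sys

# Path of the running interpreter and the root of its installation;
# with a conda environment active, both point inside that environment
print(sys.executable)
print(sys.prefix)
```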
We start today's script by setting up the Python distribution via miniconda. Miniconda is handy since it only comes with a Python distribution, the conda package manager (so that you can set up and use your environments), and a few packages. If this is your first time using it, make sure to uninstall miniconda first – just in case – and then install miniconda from scratch using reticulate's install_miniconda():
```r
reticulate::miniconda_uninstall() # uninstall first, just in case
reticulate::install_miniconda()
reticulate::miniconda_path()
```
Then you are ready to set up your dedicated environment. This is similar to an RStudio project, except that the packages you install are tied to the environment.
```r
needs(reticulate)

conda_create(envname = "selenium_env") # create empty environment
conda_install(envname = "selenium_env",
              packages = c("pandas", "selenium", "numpy"),
              pip = TRUE) # install packages into a certain environment
```
Once it is created, you need to make sure to activate it whenever you want to work on related things:
```r
use_condaenv(condaenv = "selenium_env")
```
Now we are good to go and can work with Python in this quarto document. If you want to run Python code, make sure to use Python instead of R code chunks, i.e., the first line of the chunk (in Source view) needs to contain {python} instead of {r}.
selenium
Now we're good to go and can start working with selenium. In this tutorial, you will also learn about some basics of coding in Python. Even more so than in R, we will rely on custom-made functions that we need to define, as well as on for loops. The latter differ from R's in so far as we do not need to preallocate space for objects – such as lists – but can rather grow them iteratively (similar to dplyr::bind_rows()/base R's rbind()). What's more, proper indentation of code is key here. In this tutorial, though, we will focus on navigating around websites and then, once we are done, grabbing their raw HTML code, which we can then read in using xml2::read_html() and wrangle using rvest functions. If you want to go full Python, you can get into BeautifulSoup, Python's rvest equivalent.
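As a small taste of these idioms before we touch any website — this sketch is not tied to selenium at all — a Python list can be grown inside a for loop without preallocating, much like stacking rows with dplyr::bind_rows():

```python
results = []                 # start with an empty list ...
for i in range(5):
    results.append(i ** 2)   # ... and grow it one element per iteration

print(results)  # [0, 1, 4, 9, 16]
```

Note that the loop body is defined purely by its indentation; there are no braces.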
Send input to forms and boxes
With selenium we can also send input to web forms in an automated manner by using send_keys(). Let's look at IMDb for this.
Imagine we have a list of movies we are interested in.
= ["top gun", "pirates of the carribean", "fear and loathing in las vegas", "batman"]
movie_list
"https://imdb.com") driver.get(
First, we need to find the search box, for instance using the CSS selector. We also need to locate the button. And finally we can send our input text and click the button.
```python
search_box = driver.find_element(By.CSS_SELECTOR, "#suggestion-search")
search_button = driver.find_element(By.CSS_SELECTOR, "#suggestion-search-button")

search_box.send_keys("top gun")
search_button.click()
```
If you want to do this in a loop, make sure to clear the search box in between searches:
```python
search_box.clear()
```
CAPTCHAs and error handling
Another common problem you might encounter when scraping the web is that you need to solve CAPTCHAs to prove that you are human. Since selenium is simulating a browser, you can just intervene whenever needed and solve them yourself. For this, we write a function that detects when human input is required, rings a little bell, and then, once you have solved the little riddle, allows you to continue.
```python
import time

def play_alert_windows():
    import winsound
    winsound.Beep(1000, 500)  # 1000 Hz for 500 ms

def play_alert_mac():
    import subprocess
    subprocess.run(["afplay", "/System/Library/Sounds/Ping.aiff"])

def play_alert_linux():
    print("\a")  # ASCII bell

def captcha_detected(driver):
    page_source = driver.page_source.lower()
    captcha_keywords = ["captcha", "recaptcha", "verify you're human", "roboter"]
    return any(keyword in page_source for keyword in captcha_keywords)

def solve_captcha_manually():
    play_alert_mac()  # or play_alert_windows() if you're using a Windows machine
    print("CAPTCHA detected! Please solve it manually.")
    input("Press Enter when you've solved the CAPTCHA...")
    print("Resuming scrape...")
    time.sleep(2)
```
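The detection itself is plain string matching on the page source, so you can sanity-check the logic without a browser by passing a stand-in object (FakeDriver is a hypothetical helper for testing, not part of selenium):

```python
class FakeDriver:
    """Stand-in exposing the page_source attribute a real driver has."""
    def __init__(self, page_source):
        self.page_source = page_source

def captcha_detected(driver):
    # Same logic as in the tutorial: case-insensitive keyword search
    page_source = driver.page_source.lower()
    captcha_keywords = ["captcha", "recaptcha", "verify you're human", "roboter"]
    return any(keyword in page_source for keyword in captcha_keywords)

print(captcha_detected(FakeDriver("<p>Please verify you're human</p>")))  # True
print(captcha_detected(FakeDriver("<p>Welcome back!</p>")))               # False
```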
So let’s see how this works in real life using a demo website.
"https://www.google.com/recaptcha/api2/demo")
driver.get(
if captcha_detected(driver):
solve_captcha_manually()= driver.find_element(By.CSS_SELECTOR, "#recaptcha-demo-submit")
submit_button submit_button.click()
This is it for this brief introduction to selenium. Now that we're done, we can just close the browser using

```python
driver.quit()
```
Of course, there is a lot more than this tutorial covers. Find more information – including a 12-hour YouTube tutorial – under the following links.
Further links
Exercises
In general, you could try all the rvest exercises with selenium to see how these things differ. Also, every page is different, so it will probably be best if you just start with your own projects. However, here is a quite tricky example.
- Driving home for Christmas. I want to visit my family over the holidays. Please give me an overview of trains (“https://bahn.de”) that go from Leipzig to Regensburg on December 20 in a tibble format. The tibble should contain: date and time, number of changes, price.
Hint: Provide the initial search request using user input, then save the HTML, scroll down, click “spätere Verbindungen,” save the HTML, etc.
Bonus (very tricky): do it for Dec 20 through 23 and make a visualization of the price (y-axis) over time (x-axis).
Solution:
"https://bahn.de")
driver.get(## enter search things manually
= []
output_paths for i in range(1, 20):
= f"temp/bahn_page_{i}.html"
output_path
output_paths.append(output_path)
for path in output_paths:
"window.scrollTo(0, document.body.scrollHeight);")
driver.execute_script(1, 3)
wait_random(= driver.page_source
html
save_html(html, path)= driver.find_element(By.XPATH, '//button[normalize-space()="Spätere Verbindungen"]')
later_connections
later_connections.click()1, 3) wait_random(
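The helpers wait_random() and save_html() used in this solution are not part of selenium; a minimal sketch of what they could look like, using only the standard library:

```python
import random
import time
from pathlib import Path

def wait_random(min_seconds, max_seconds):
    """Sleep for a random duration to mimic human browsing speed."""
    time.sleep(random.uniform(min_seconds, max_seconds))

def save_html(html, path):
    """Write a page's HTML source to disk, creating parent folders if needed."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    Path(path).write_text(html, encoding="utf-8")
```

With this version of save_html(), the temp/ folder is created on first use, so you don't have to set it up by hand.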
<- dir_ls("temp") %>%
file_names str_detect(., "bahn")] |>
.[enframe(name = NULL, value = "file_name") |>
mutate(number = str_extract(file_name, "[0-9]{1,2}") |> as.numeric()) |>
arrange(number)
<- dir_ls("temp") %>%
bahn_list str_detect(., "bahn")] |>
.[map(read_html) |>
set_names(dir_ls("temp") %>%
str_detect(., "bahn")]) %>%
.[$file_name]
.[file_names
```r
raw_info <- bahn_list |>
  map(\(x) x |>
        html_elements("div.reiseplan__infos") |>
        html_text2())

raw_price <- bahn_list |>
  map(\(x) x |>
        html_elements("span.reise-preis__preis") |>
        html_text2())
```
```r
output_tbl <- tibble(
  start = raw_info |>
    reduce(c) |>
    str_extract("geplante Abfahrt[0-9]{2}\\:[0-9]{2}") |>
    str_remove("geplante Abfahrt"),
  end = raw_info |>
    reduce(c) |>
    str_extract("Ankunft[0-9]{2}\\:[0-9]{2}") |>
    str_remove("Ankunft"),
  changes = raw_info |>
    reduce(c) |>
    str_extract("Umstieg[e]?[0-9]") |>
    str_remove("Umstieg[e]?") |>
    as.numeric(),
  price = raw_price |>
    reduce(c) |>
    str_extract("[0-9,]*") |>
    str_replace(",", "\\.")) |>
  replace_na(list(changes = 0)) |>
  #mutate(est_time_min = as.numeric((end - start))/60) |>
  mutate(former_value = lag(start) |>
           str_extract("[0-9]{2}") |>
           as.numeric(),
         start_hour = str_extract(start, "[0-9]{2}") |>
           as.numeric(),
         diff_start_next = start_hour - former_value,
         day_break = case_when(
           diff_start_next < -18 ~ "break",
           TRUE ~ NA)) |>
  fill(day_break, .direction = "down") |>
  filter(is.na(day_break)) |>
  distinct(start, end, changes, price)
```
```r
## BONUS
dates <- bahn_list |>
  map(\(x) x |>
        html_elements("div.reiseloesung-heading") |>
        html_text2()) |>
  map(\(x) if(length(x) == 0) "Fr. 20. Dez. 2024" else x) |>
  map(\(x) str_c(x, collapse = ";"))
```
```r
output_tbl <- bind_cols(
  start = raw_info |>
    map(\(x) x |>
          str_extract("geplante Abfahrt[0-9]{2}\\:[0-9]{2}") |>
          str_remove("geplante Abfahrt")) |>
    map2(dates, \(x, y) x |>
           enframe(name = NULL, value = "start") |>
           mutate(date = y)) |>
    list_rbind(),
  end = raw_info |>
    map(\(x) x |>
          str_extract("Ankunft[0-9]{2}\\:[0-9]{2}") |>
          str_remove("Ankunft") |>
          enframe(name = NULL, value = "end")) |>
    list_rbind(),
  raw_info |>
    map(\(x) x |>
          str_extract("Umstieg[e]?[0-9]") |>
          str_remove("Umstieg[e]?") |>
          enframe(name = NULL, value = "changes")) |>
    list_rbind() |>
    replace_na(list(changes = "0")),
  raw_price |>
    map(\(x) x |>
          str_extract("[0-9,]*") |>
          str_replace(",", "\\.") |>
          enframe(name = NULL, value = "price")) |>
    list_rbind()
)
```
```r
lengths <- raw_info |>
  map(\(x) x |>
        str_extract("geplante Abfahrt[0-9]{2}\\:[0-9]{2}") |>
        str_remove("geplante Abfahrt")) |>
  map(length) |>
  reduce(c) |>
  map2(1:19, \(x, y) rep(y, x)) |>
  reduce(c)
```
```r
dec_20 <- output_tbl |>
  mutate(page = lengths) |>
  slice(1:66) |>
  mutate(date = "Fr. 20. Dez. 2024") |>
  distinct(date, start, end, changes, price)

dec_21 <- output_tbl |>
  mutate(page = lengths) |>
  filter(page > 5, str_detect(date, "21\\.")) |>
  slice(21:69) |>
  mutate(date = "Sa. 21. Dez. 2024") |>
  distinct(date, start, end, changes, price)

dec_22 <- output_tbl |>
  mutate(page = lengths) |>
  filter(str_detect(date, "22\\.")) |>
  slice(32:78) |>
  mutate(date = "So. 22. Dez. 2024") |>
  distinct(date, start, end, changes, price)

dec_23 <- output_tbl |>
  mutate(page = lengths) |>
  filter(str_detect(date, "23\\.")) |>
  slice(35:100) |>
  mutate(date = "Mo. 23. Dez. 2024") |>
  distinct(date, start, end, changes, price)

bind_rows(dec_20, dec_21, dec_22, dec_23) |>
  mutate(date = str_replace(date, " Dez\\. ", "12.") |>
           str_remove("^[A-Za-z]{2}\\. "),
         date_time = str_c(date, " ", start, ":00") |>
           parse_date_time(orders = "%d.%m.%Y %H:%M:%S"),
         price = as.numeric(price)) |>
  ungroup() |>
  ggplot() +
  geom_line(aes(date_time, price))
```
- Check out all the movies’ pages using a loop and adequate waiting times.
"https://imdb.com")
driver.get(
= ["top gun", "pirates of the carribean", "fear and loathing in las vegas", "batman"] #this is how you create a list to loop over in python
movie_list
# for replacing spaces with underscores
= "fear and loathing"
text = text.replace(" ", "_")
new_text new_text
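Combining the list and the replace() trick, the output path for each movie can be built in one loop (temp/ as the target folder is just an assumption carried over from the hint above):

```python
movie_list = ["top gun", "pirates of the caribbean"]

output_paths = []
for movie in movie_list:
    # Spaces are awkward in filenames, so swap them for underscores
    output_paths.append(f"temp/{movie.replace(' ', '_')}.html")

print(output_paths)  # ['temp/top_gun.html', 'temp/pirates_of_the_caribbean.html']
```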
- Store the results in HTML files.
Solution:
```python
for movie in movie_list:
    # Find elements inside the loop (good practice)
    search_box = driver.find_element(By.CSS_SELECTOR, "#suggestion-search")
    search_button = driver.find_element(By.CSS_SELECTOR, "#suggestion-search-button")

    # Clear previous search text
    search_box.clear()

    # Perform search
    search_box.send_keys(movie)
    wait_random(1, 3)
    search_button.click()

    # Save results
    movie_file = movie.replace(" ", "_")
    output_path = f"temp/{movie_file}.html"
    html = driver.page_source
    save_html(html, output_path)
    wait_random(1, 3)
```
- Use rvest to extract the exact years and titles of the results. Note: if you want to store your data in a nice tibble, the vectors need to be of the same length. Use this function to replace missing elements (NULL) in a list:

```r
replace_null <- function(list){modify(list, \(x) if(is.null(x)) NA else x)}
```
Solution:
<- dir_ls("temp") %>%
htmls !str_detect(., "books")] |>
.[map(read_html)
<- function(list){
replace_null modify(list, \(x) if(is.null(x)) NA else x)
}
<- function(html){
extract_data <- html |>
raw html_elements(".find-title-result .ipc-metadata-list-summary-item__c") |>
html_text2()
tibble(
title = map(str_split(raw, "\\n"), \(x) pluck(x, 1)) |>
replace_null() |>
reduce(c),
year = map(str_split(raw, "\\n"), \(x) pluck(x, 2)) |>
replace_null() |>
reduce(c)
)
}
<- htmls |>
movie_tibble map(extract_data) |>
list_rbind()