Chapter 7: OCR with `tesseract`

This chapter demonstrates how you can read images and PDF documents into R in an automated fashion. Note that OCR is not always perfect, so you may have to do some significant pre- and/or post-processing. I have included some classic pre-processing commands from the magick package; post-processing is usually done with regular expressions.

Install `tesseract` and download language packages

Before we can start OCRing images, we need to install tesseract via the command line. The reason is that the R package merely binds to the engine; the actual OCRing happens “under the hood.” You can find instructions on how to install tesseract for your respective operating system here.

Once tesseract is installed, we can load the R package. To get the best results, we need to specify the language our text is in. Many language models are available (for a list of languages, see this website) and can be downloaded using tesseract::tesseract_download() (for Mac and Windows users; on Linux, language data is typically installed through the system's package manager).

needs(tesseract)
english <- tesseract("eng") # use English model

tesseract_download("deu") # download German language model

Training data already exists. Overwriting /Users/felixlennert/Library/Application Support/tesseract5/tessdata/deu.traineddata

[1] "/Users/felixlennert/Library/Application Support/tesseract5/tessdata/deu.traineddata"

tesseract_info()[["available"]] # check available languages

[1] "deu"  "eng"  "osd"  "snum"
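To avoid re-downloading models that are already present (note the "Overwriting" message above), one could first compare the wanted languages with what tesseract_info() reports. This is a minimal sketch: missing_languages() and ensure_languages() are hypothetical helper names, not part of the tesseract package, and "fra" is only an example code.

```r
# Hypothetical helper: which of the wanted language codes are not installed yet?
missing_languages <- function(wanted, available) {
  setdiff(wanted, available)
}

# Hypothetical helper: download only the models that are missing.
# Assumes the tesseract R package and the engine are installed.
ensure_languages <- function(wanted) {
  available <- tesseract::tesseract_info()[["available"]]
  for (lang in missing_languages(wanted, available)) {
    tesseract::tesseract_download(lang)
  }
  invisible(tesseract::tesseract_info()[["available"]])
}

# Pure example using the language list printed above: only "fra" is missing
missing_languages(c("eng", "deu", "fra"), c("deu", "eng", "osd", "snum"))
```

On the machine shown above, ensure_languages(c("eng", "deu")) would download nothing, because both models are already available.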

OCR 101

Once the package and the language model are installed, you can start OCRing. For illustration purposes, we OCR the first paragraph of the RStudio Wikipedia article:

ocr("figures/rstudio_wiki.png", engine = english)

[1] "RStudio IDE (or RStudio) is an integrated development environment for R, a\nprogramming language for statistical computing and graphics. It is available in two\nformats: RStudio Desktop is a regular desktop application while RStudio Server runs on\na remote server and allows accessing RStudio using a web browser. The RStudio IDE\nis a product of Posit PBC (formerly RStudio PBC, formerly RStudio Inc.).\n"

Note that there are still line breaks in there. We can easily replace them with whitespace using stringr::str_replace_all(). Make sure to remove redundant whitespace afterwards using stringr::str_squish().

needs(tidyverse)
ocr("figures/rstudio_wiki.png", engine = english) |> 
  str_replace_all("\\n", " ") |> 
  str_squish()

[1] "RStudio IDE (or RStudio) is an integrated development environment for R, a programming language for statistical computing and graphics. It is available in two formats: RStudio Desktop is a regular desktop application while RStudio Server runs on a remote server and allows accessing RStudio using a web browser. The RStudio IDE is a product of Posit PBC (formerly RStudio PBC, formerly RStudio Inc.)."
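Longer texts often contain words that were hyphenated at the end of a line, and simply replacing line breaks with spaces would leave them split in two. The sketch below rejoins such words before collapsing whitespace. clean_ocr_text() is a hypothetical helper, written in base R so that it runs without additional packages (gsub() and trimws() play the roles of str_replace_all() and str_squish()), and the input string is invented for illustration.

```r
# Hypothetical helper for post-processing raw OCR output
clean_ocr_text <- function(x) {
  # Rejoin words split across lines: a letter, a hyphen, a line break,
  # then a lower-case letter. Genuine compounds that happen to break at
  # the hyphen (e.g. "well-\nknown") are merged too, so check the results.
  x <- gsub("(?<=[[:alpha:]])-\\n(?=[[:lower:]])", "", x, perl = TRUE)
  # Turn line breaks and repeated whitespace into single spaces
  x <- gsub("\\s+", " ", x)
  # Drop leading and trailing whitespace
  trimws(x)
}

clean_ocr_text("an integrated develop-\nment environment\nfor R\n")
```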

If we want deeper insight into the confidence tesseract has in its word guesses, we can use tesseract::ocr_data().

ocr_data("figures/rstudio_wiki.png", engine = english)

# A tibble: 63 × 3
   word        confidence bbox           
   <chr>            <dbl> <chr>          
 1 RStudio           91.8 14,18,134,43   
 2 IDE               96.6 145,19,196,42  
 3 (or               93.3 207,18,245,49  
 4 RStudio)          92.9 256,18,385,49  
 5 is                96.3 397,19,418,43  
 6 an                96.3 429,24,461,43  
 7 integrated        96.5 474,18,612,49  
 8 development       93.7 624,18,806,49  
 9 environment       96.5 816,19,991,43  
10 for               93.2 1001,18,1038,43
# ℹ 53 more rows
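The confidence column makes it possible to flag words that deserve a manual check, and the bbox string can be split into numeric coordinates. This is a minimal base R sketch using the first five rows of the output above, typed in by hand. flag_uncertain() and split_bbox() are hypothetical helpers, the threshold of 93 is arbitrary, and reading bbox as left, top, right, bottom pixel coordinates is an assumption suggested by the values.

```r
# First five rows of the ocr_data() output shown above, re-typed by hand
words <- data.frame(
  word = c("RStudio", "IDE", "(or", "RStudio)", "is"),
  confidence = c(91.8, 96.6, 93.3, 92.9, 96.3),
  bbox = c("14,18,134,43", "145,19,196,42", "207,18,245,49",
           "256,18,385,49", "397,19,418,43"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only words below a confidence threshold
flag_uncertain <- function(df, threshold = 93) {
  df[df$confidence < threshold, , drop = FALSE]
}

# Hypothetical helper: split the bbox string into four numeric columns
# (assumed order: left, top, right, bottom)
split_bbox <- function(df) {
  coords <- do.call(rbind, lapply(strsplit(df$bbox, ","), as.numeric))
  colnames(coords) <- c("x1", "y1", "x2", "y2")
  cbind(df[, c("word", "confidence")], coords)
}

flag_uncertain(words)  # words to double-check by hand
split_bbox(words)      # coordinates as numbers, e.g. for plotting
```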

Advanced OCR with `magick` preprocessing

This worked quite well. One reason for this is that screenshots from the internet are usually very “clean.” Often this is not the case, especially with book scans: there might be noise or speckles in the image, skewed text, etc. In our next example, we OCR the table of contents of the “Text As Data” book and preprocess it with magick (find instructions for magick here).

Table of contents of Grimmer, Roberts, and Stewart (2022)

ocr("figures/tad_toc.png") |> 
  cat()

Preface i
Prerequisites and Notation xvll
Uses for This Book wl
What This Book Is Not —
PART |! PRELIMINARIES 1
CHAPTER 1 Introduction 3
1.1. How This Book Informs the Social Sciences 5
1.2. How This Book Informs the Digital Humanities 8
1.3 How This Book Informs Data Science in Industry
and Government
1.4  AGuide to This Book 2
1.5 Conclusion *
CHAPTER 2 Social Science Research and Text Analysis 13
2.1 Discovery
2.2 Measurement 18
2.3. Inference ae
2.4 Social Science as an Iterative and W
Cumulative Process
2.5 An Agnostic Approach to Text Analysis
2.6 Discovery, Measurement, and Causal Inference:
How the Chinese Government Censors Social
Media 20
2.7 Six Principles of Text Analysis 22
2.71 Social Science Theories and Substantive
Knowledge are Essential for Research Design 22
2.7.2 Text Analysis does not Replace Humans—lt
Augments Them 24
273 Building, Refining, and Testing Social Science
Theories Requires Iteration and Cumulation 26
2.74 Text Analysis Methods Distill Generalizations
from Language 28
2.75 The Best Method Depends on the Task 29

As we can see, there are a couple of problems: some page numbers are missing or not detected correctly, there are some typos, etc. Perhaps some manual image pre-processing can help here.

needs(magick)

image_read("figures/tad_toc.png") |> 
  image_resize("90%") |> # shrink slightly; play around with this parameter
  image_rotate(degrees = 3) |> # straighten the picture
  image_contrast(sharpen = 100) |> # increase contrast
  image_convert(type = "Grayscale") |> # convert to grayscale
  image_trim() |> # trim the image to remove margins
  ocr() |> 
  cat()

Preface xvii
Prerequisites and Notation xvii
Uses for This Book xviii
What This Book Is Not min
PARTI PRELIMINARIES 1
CHAPTER 1 Introduction 3
1.1. How This Book Informs the Social Sciences 5
1.2 How This Book Informs the Digital Humanities 8
1.3. How This Book Informs Data Science in Industry
and Government 9
1.4 A Guide to This Book 10
1.5 Conclusion n
CHAPTER 2 __ Social Science Research and Text Analysis 13
2.1 Discovery 15
2.2 Measurement 16
2.3 Inference 17
2.4 Social Science as an Iterative and
Cumulative Process 7
2.5 An Agnostic Approach to Text Analysis 18
2.6 Discovery, Measurement, and Causal Inference:
How the Chinese Government Censors Social
Media 20
2.7 Six Principles of Text Analysis 22
2.71 Social Science Theories and Substantive
Knowledge are Essential for Research Design 22)
2.72 Text Analysis does not Replace Humans—It
Augments Them 24
2.73 Building, Refining, and Testing Social Science
Theories Requires Iteration and Cumulation 26
2.74 Text Analysis Methods Distill Generalizations
from Language 28
2.75 The Best Method Depends on the Task 29

Slight improvements! Still not perfect, but OCR hardly ever is.
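Since the resize factor has to be found by trial and error, one option is to try several values and compare the mean word confidence reported by ocr_data(). This is a rough sketch under several assumptions: mean_confidence() and best_setting() are hypothetical helpers, the candidate sizes and the example scores are made up, and mean confidence is only a proxy for accuracy, since tesseract can be confidently wrong.

```r
# Hypothetical helper: mean word confidence for one resize setting.
# Assumes the magick and tesseract packages are installed.
mean_confidence <- function(path, size) {
  tmp <- tempfile(fileext = ".png")
  on.exit(unlink(tmp))
  magick::image_read(path) |>
    magick::image_resize(size) |>
    magick::image_convert(type = "Grayscale") |>
    magick::image_write(path = tmp, format = "png")
  mean(tesseract::ocr_data(tmp)$confidence)
}

# Hypothetical helper: name of the setting with the highest score
best_setting <- function(scores) {
  names(scores)[which.max(scores)]
}

# Made-up scores, only to show how best_setting() behaves
best_setting(c("80%" = 88.1, "90%" = 91.4, "100%" = 90.2))

# With the image from this chapter, one would run something like:
# sizes <- c("80%", "90%", "100%")
# scores <- vapply(sizes, \(s) mean_confidence("figures/tad_toc.png", s), numeric(1))
# best_setting(scores)
```

The preprocessed image is written to a temporary file before OCR so that ocr_data() receives a plain file path. Whichever setting scores best should still be checked against the original page.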

Read PDFs

If we want to read PDFs, we can also harness the power of tesseract in combination with magick and pdftools. In this example, I OCR a multi-page PDF document containing newspaper articles.

needs(pdftools)
german <- tesseract(language = "deu")

texts <- map(1:pdf_info("figures/snippet_dereko.pdf")$pages, 
              \(x) {
                pdf_render_page("figures/snippet_dereko.pdf", page = x, dpi = 300) |> 
                  image_read() |> # Convert raw image to magick image object
                  ocr(engine = german) # OCR
                }) |> 
  reduce(c)

texts |> str_sub(1, 100) |> cat()

© Leibniz-Institut für Deutsche Sprache, Mannheim
COSMAS II-Server, C2API-Version 4.23.8 - 21.11.202 bewusst als Veranstaltungsort gewählt worden. Bundesweit soll auf über 500 Veranstaltungen die Staat Das Auto hält sich an die religiösen Regeln am Shabbat, an dem keine Arbeit getan, kein Lichtschalte (B98/MA1.29532 Berliner Zeitung, 15.05.1998; PALÄSTINA [S. 6])

Israelis und Palästinenser erinnern

Easy!
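The preview above runs the pages together because cat() separates the 100-character snippets only by a single space. The sketch below wraps the loop in a function that returns one string per page and previews the pages one per line. ocr_pdf() and preview_pages() are hypothetical helpers, ocr_pdf() assumes that pdftools, magick, and tesseract are installed, and the two example strings are invented.

```r
# Hypothetical wrapper around the loop above: returns one string per page
ocr_pdf <- function(path, engine, dpi = 300) {
  n_pages <- pdftools::pdf_info(path)$pages
  vapply(seq_len(n_pages), \(p) {
    pdftools::pdf_render_page(path, page = p, dpi = dpi) |>
      magick::image_read() |>          # convert raw bitmap to a magick image
      tesseract::ocr(engine = engine)  # OCR the page
  }, character(1))
}

# Hypothetical helper: label the first n characters of each page
preview_pages <- function(texts, n = 100) {
  paste0("[page ", seq_along(texts), "] ", substr(texts, 1, n))
}

# Invented page texts, only to show the format of the preview
preview_pages(c("first page text", "second page text"), n = 10) |>
  cat(sep = "\n")

# With the document from this chapter, one would run something like:
# texts <- ocr_pdf("figures/snippet_dereko.pdf", engine = tesseract::tesseract("deu"))
# preview_pages(texts) |> cat(sep = "\n")
```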

Further links

Exercises

Every document is different, so it will probably be best if you just start with your own material. Here are two exercises to get you going.

Take a screenshot of a page of your liking and OCR it. Post-process.

Solution.

ocr("figures/rstudio_wiki.png", engine = english) |> 
  str_replace_all("\\n", " ") |> 
  str_squish()

OCR a PDF document you have available (e.g., one of the course readings). If you get the error “Image too small to scale,” you can use magick::image_resize().

Solution.

texts <- map(1:3, 
              \(x) {
                pdf_render_page("figures/Stoltz:Taylor 2020.pdf", page = x, dpi = 300) |> 
                  image_read() |> # Convert raw image to magick image object
                  image_resize("300%") |> 
                  ocr(engine = english) # OCR
                }) |> 
  reduce(c)

References

Grimmer, Justin, Margaret Roberts, and Brandon Stewart. 2022. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton: Princeton University Press.
