Chapter 7: OCR with tesseract
This chapter demonstrates how you can read images and PDF documents into R in an automated fashion. Note that OCR is not always perfect, and you might have to do some significant pre- and/or post-processing. I have included some classic pre-processing commands from the magick package; post-processing will usually be done with regular expressions.
Install tesseract and download language packages
Before we can start OCRing images, we need to install tesseract via the command line. The reason for this is that the R package merely binds to the engine; the actual OCRing happens “under the hood.” You can find instructions on how to install tesseract for your operating system here.
Once it is successfully installed, we can just load the package. To get the best results, we need to define the language our text is in. Multiple language models are available (for a list of languages, see this website) and can be downloaded using tesseract::tesseract_download() (for Mac and Windows users).
needs(tesseract)
english <- tesseract("eng") # use English model
tesseract_download("deu") # download German language model
Training data already exists. Overwriting /Users/felixlennert/Library/Application Support/tesseract5/tessdata/deu.traineddata
[1] "/Users/felixlennert/Library/Application Support/tesseract5/tessdata/deu.traineddata"
tesseract_info()[["available"]] # check available languages
[1] "deu" "eng" "osd" "snum"
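The message above shows that a model is downloaded again even if it already exists. A minimal sketch of how one might check first, using only base R: the helper missing_languages() is hypothetical (it is not part of tesseract), the vector of installed models is copied from the output above, and "fra" is only included for illustration.

```r
# Hypothetical helper (not part of tesseract): which of the wanted
# language models are not among the installed ones?
missing_languages <- function(wanted, available) {
  setdiff(wanted, available)
}

# The installed models as printed by tesseract_info() above
available <- c("deu", "eng", "osd", "snum")

# "fra" (French) is only here for illustration
to_download <- missing_languages(c("eng", "deu", "fra"), available)
print(to_download)

# With tesseract loaded, the same check could drive the download:
# for (lang in missing_languages(c("eng", "deu"), tesseract_info()[["available"]])) {
#   tesseract_download(lang)
# }
```

The download loop is left as a comment because it needs the tesseract engine and an internet connection.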
OCR 101
Once the package and the language model are installed, you can start OCRing. For illustration purposes, we OCR the first paragraph of the RStudio Wikipedia article:
ocr("figures/rstudio_wiki.png", engine = english)
[1] "RStudio IDE (or RStudio) is an integrated development environment for R, a\nprogramming language for statistical computing and graphics. It is available in two\nformats: RStudio Desktop is a regular desktop application while RStudio Server runs on\na remote server and allows accessing RStudio using a web browser. The RStudio IDE\nis a product of Posit PBC (formerly RStudio PBC, formerly RStudio Inc.).\n"
Note that there are still line breaks in there. We can easily replace them with spaces using stringr::str_replace_all(). Afterwards, remove redundant whitespace using stringr::str_squish().
needs(tidyverse)
ocr("figures/rstudio_wiki.png", engine = english) |>
str_replace_all("\\n", " ") |>
str_squish()
[1] "RStudio IDE (or RStudio) is an integrated development environment for R, a programming language for statistical computing and graphics. It is available in two formats: RStudio Desktop is a regular desktop application while RStudio Server runs on a remote server and allows accessing RStudio using a web browser. The RStudio IDE is a product of Posit PBC (formerly RStudio PBC, formerly RStudio Inc.)."
If we want deeper insight into the confidence tesseract has in its word guesses, we can use tesseract::ocr_data().
ocr_data("figures/rstudio_wiki.png", engine = english)
# A tibble: 63 × 3
   word        confidence bbox
   <chr>            <dbl> <chr>
 1 RStudio           91.8 14,18,134,43
 2 IDE               96.6 145,19,196,42
 3 (or               93.3 207,18,245,49
 4 RStudio)          92.9 256,18,385,49
 5 is                96.3 397,19,418,43
 6 an                96.3 429,24,461,43
 7 integrated        96.5 474,18,612,49
 8 development       93.7 624,18,806,49
 9 environment       96.5 816,19,991,43
10 for               93.2 1001,18,1038,43
# ℹ 53 more rows
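The confidence column can be used to flag words that deserve a second look. A minimal sketch in base R, using the first five rows of the output above as a hand-typed data frame: the threshold of 93 is an arbitrary value chosen for illustration, and reading the four bbox numbers as left, top, right, and bottom pixel coordinates is an assumption on my part.

```r
# First five rows of the ocr_data() output above, typed in by hand
words <- data.frame(
  word = c("RStudio", "IDE", "(or", "RStudio)", "is"),
  confidence = c(91.8, 96.6, 93.3, 92.9, 96.3),
  bbox = c("14,18,134,43", "145,19,196,42", "207,18,245,49",
           "256,18,385,49", "397,19,418,43"),
  stringsAsFactors = FALSE
)

# Arbitrary cut-off for illustration; tune it for your own documents
threshold <- 93
low_conf <- words[words$confidence < threshold, ]
print(low_conf$word)

# Split the bbox string into four integer columns
# (assumed order: left, top, right, bottom)
coords <- do.call(rbind, lapply(strsplit(words$bbox, ","), as.integer))
colnames(coords) <- c("x1", "y1", "x2", "y2")
print(coords)
```

With the real tibble, the same filter would work on the result of ocr_data() directly.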
Advanced OCR with magick preprocessing
This worked quite well. One reason for this is that screenshots from the internet are usually very “clean.” Often, however, this is not the case, especially with book scans: there might be noise or speckles in the image, skewed text, etc. In our next example, we OCR the first page of the table of contents of the “Text As Data” book, first without any preprocessing and then after preprocessing it with magick (find instructions for magick here).
ocr("figures/tad_toc.png") |>
cat()
Preface i
Prerequisites and Notation xvll
Uses for This Book wl
What This Book Is Not —
PART |! PRELIMINARIES 1
CHAPTER 1 Introduction 3
1.1. How This Book Informs the Social Sciences 5
1.2. How This Book Informs the Digital Humanities 8
1.3 How This Book Informs Data Science in Industry
and Government
1.4 AGuide to This Book 2
1.5 Conclusion *
CHAPTER 2 Social Science Research and Text Analysis 13
2.1 Discovery
2.2 Measurement 18
2.3. Inference ae
2.4 Social Science as an Iterative and W
Cumulative Process
2.5 An Agnostic Approach to Text Analysis
2.6 Discovery, Measurement, and Causal Inference:
How the Chinese Government Censors Social
Media 20
2.7 Six Principles of Text Analysis 22
2.71 Social Science Theories and Substantive
Knowledge are Essential for Research Design 22
2.7.2 Text Analysis does not Replace Humans—lt
Augments Them 24
273 Building, Refining, and Testing Social Science
Theories Requires Iteration and Cumulation 26
2.74 Text Analysis Methods Distill Generalizations
from Language 28
2.75 The Best Method Depends on the Task 29
As we can see, there are a couple of problems: some page numbers are missing or not detected correctly, and there are some typos. Perhaps some manual image pre-processing can help here.
needs(magick)
image_read("figures/tad_toc.png") |>
image_resize("90%") |> # play around with this parameter
image_rotate(degrees = 3) |> # straighten the picture
image_contrast(sharpen = 100) |> # increase contrast
image_convert(type = "Grayscale") |> # convert to grayscale
image_trim() |> # trim image to remove margins
ocr() |>
cat()
Preface xvii
Prerequisites and Notation xvii
Uses for This Book xviii
What This Book Is Not min
PARTI PRELIMINARIES 1
CHAPTER 1 Introduction 3
1.1. How This Book Informs the Social Sciences 5
1.2 How This Book Informs the Digital Humanities 8
1.3. How This Book Informs Data Science in Industry
and Government 9
1.4 A Guide to This Book 10
1.5 Conclusion n
CHAPTER 2 __ Social Science Research and Text Analysis 13
2.1 Discovery 15
2.2 Measurement 16
2.3 Inference 17
2.4 Social Science as an Iterative and
Cumulative Process 7
2.5 An Agnostic Approach to Text Analysis 18
2.6 Discovery, Measurement, and Causal Inference:
How the Chinese Government Censors Social
Media 20
2.7 Six Principles of Text Analysis 22
2.71 Social Science Theories and Substantive
Knowledge are Essential for Research Design 22)
2.72 Text Analysis does not Replace Humans—It
Augments Them 24
2.73 Building, Refining, and Testing Social Science
Theories Requires Iteration and Cumulation 26
2.74 Text Analysis Methods Distill Generalizations
from Language 28
2.75 The Best Method Depends on the Task 29
Slight improvements! Still not perfect, but OCR hardly ever is.
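What remains is typically handled in post-processing with regular expressions. A minimal sketch in base R, using three lines copied from the output above: it treats a trailing run of digits as the page number and flags lines without one for manual checking. The object names (toc_lines, page_pattern, needs_check) are made up for this example.

```r
# Three lines copied from the OCR output above
toc_lines <- c(
  "CHAPTER 1 Introduction 3",
  "2.2 Measurement 16",
  "1.5 Conclusion n"
)

# A trailing run of digits is treated as the page number
page_pattern <- "\\s+([0-9]+)$"
has_page <- grepl(page_pattern, toc_lines)

# Extract the page number only where one was found
page <- rep(NA_integer_, length(toc_lines))
page[has_page] <- as.integer(
  sub(paste0("^.*", page_pattern), "\\1", toc_lines[has_page])
)

toc <- data.frame(
  title = sub(page_pattern, "", toc_lines), # unchanged if no page number
  page = page,
  needs_check = !has_page, # lines to inspect by hand
  stringsAsFactors = FALSE
)
print(toc)
```

The third line has no recognizable page number (the OCR returned "n"), so it keeps its full text and is marked for a manual look.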
Read PDFs
If we want to read PDFs, we can also harness the power of tesseract in combination with magick and pdftools. In this example, I OCR a multi-page PDF document containing newspaper articles.
needs(pdftools)
german <- tesseract(language = "deu")
texts <- map(1:pdf_info("figures/snippet_dereko.pdf")$pages,
\(x) {
pdf_render_page("figures/snippet_dereko.pdf", page = x, dpi = 300) |>
image_read() |> # convert raw image to magick image object
ocr(engine = german) # OCR
}) |>
reduce(c)
texts |> str_sub(1, 100) |> cat()
© Leibniz-Institut für Deutsche Sprache, Mannheim
COSMAS II-Server, C2API-Version 4.23.8 - 21.11.202 bewusst als Veranstaltungsort gewählt worden. Bundesweit soll auf über 500 Veranstaltungen die Staat Das Auto hält sich an die religiösen Regeln am Shabbat, an dem keine Arbeit getan, kein Lichtschalte (B98/MA1.29532 Berliner Zeitung, 15.05.1998; PALÄSTINA [S. 6])
Israelis und Palästinenser erinnern
Easy!
Exercises
In general, you could try all the rvest exercises with selenium to see how these things differ. Also, every page is different, so it will probably be best if you just start with your own material. However, here are two exercises to start with.
- Take a screenshot of a page of your liking and OCR it. Post-process.
Solution.
ocr("figures/rstudio_wiki.png", engine = english) |>
str_replace_all("\\n", " ") |>
str_squish()
- OCR a PDF document you have available (e.g., one of the course readings). If you get the error “Image too small to scale,” you can use magick::image_resize().
Solution.
texts <- map(1:3,
\(x) {
pdf_render_page("figures/Stoltz:Taylor 2020.pdf", page = x, dpi = 300) |>
image_read() |> # convert raw image to magick image object
image_resize("300%") |> # enlarge to avoid "Image too small to scale"
ocr(engine = english) # OCR
}) |>
reduce(c)