The tesseract package provides R bindings to Tesseract: a powerful optical character recognition (OCR) engine that supports over 100 languages. The engine is highly configurable, so you can tune the detection algorithms to obtain the best possible results.
Keep in mind that OCR (and pattern recognition in general) is a very difficult problem for computers. Results will rarely be perfect, and accuracy drops rapidly as the quality of the input image decreases. But if you can get your input images to reasonable quality, Tesseract can often help to extract most of the text from the image.
OCR is the process of finding and recognizing text inside images, for example in a screenshot or a scanned document. The image below has some example text:
library(tesseract)
eng <- tesseract("eng")
text <- tesseract::ocr("http://jeroen.github.io/images/testocr.png", engine = eng)
cat(text)
This is a lot of 12 point text to test the ocr code and see if it works on all types of file format. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox.
Not bad! The ocr_data() function returns all words in the image along with a bounding box and confidence rate.
results <- tesseract::ocr_data("http://jeroen.github.io/images/testocr.png", engine = eng)
results
# A tibble: 60 × 3
   word  confidence bbox
   <chr>      <dbl> <chr>
 1 This        96.6 36,92,96,116
 2 is          96.9 109,92,129,116
 3 a           96.3 141,98,156,116
 4 lot         96.3 169,92,201,116
 5 of          96.5 212,92,240,116
 6 12          96.5 251,92,282,116
 7 point       96.5 296,92,364,122
 8 text        96.5 374,93,427,116
 9 to          96.9 437,93,463,116
10 test        97.0 474,93,526,116
# … with 50 more rows
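The bbox column is a single comma-separated string, which is awkward to filter or plot. The sketch below shows one way to split it into numeric columns and keep only high-confidence words. It re-enters the first five rows of the output above by hand so that it runs without the tesseract package or network access. The column names x1, y1, x2, y2, the left/top/right/bottom ordering, and the min_confidence threshold are assumptions made for illustration, not part of the package API.

```r
# First rows of the ocr_data() output shown above, re-entered by hand.
results <- data.frame(
  word = c("This", "is", "a", "lot", "of"),
  confidence = c(96.6, 96.9, 96.3, 96.3, 96.5),
  bbox = c("36,92,96,116", "109,92,129,116", "141,98,156,116",
           "169,92,201,116", "212,92,240,116"),
  stringsAsFactors = FALSE
)

# Split each bbox string on commas and stack the pieces into an integer matrix.
# Assumption: the four values are left, top, right, bottom in pixels.
coords <- do.call(
  rbind,
  lapply(strsplit(results$bbox, ",", fixed = TRUE), as.integer)
)
colnames(coords) <- c("x1", "y1", "x2", "y2")

# Combine the words and confidences with the numeric coordinates.
boxes <- cbind(results[c("word", "confidence")], coords)

# Derive the size of each box from its corners.
boxes$width <- boxes$x2 - boxes$x1
boxes$height <- boxes$y2 - boxes$y1

# Keep only words at or above a hypothetical confidence threshold.
min_confidence <- 96.5
confident <- boxes[boxes$confidence >= min_confidence, ]
print(confident)
```

With real output, replace the hand-built data frame with the results object returned by ocr_data(); the rest of the code only relies on the three columns shown above.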
The tesseract OCR engine uses language-specific training data to recognize words. The OCR algorithms are biased towards words and sentences that frequently appear together in a given language, just like the human brain is. Therefore the most accurate results are obtained when using training data for the correct language.
tesseract_info() to list the languages that y