Opened 7 months ago

Closed 7 months ago

#9151 closed defect (fixed)

Missing white space in the white list of tesseract configuration

Reported by: dominic108
Owned by:
Priority: normal
Component: avfilter
Version: git-master
Keywords: libtesseract
Cc:
Blocked By:
Blocking:
Reproduced by developer: no
Analyzed by developer: no

Description

Summary of the bug: I compiled ffmpeg on Ubuntu with libtesseract support enabled:

ffmpeg version N-101412-gb7e7813 Copyright (c) 2000-2021 the FFmpeg developers
built with gcc 9 (Ubuntu 9.3.0-17ubuntu1~20.04)
configuration: --prefix=/home/working/app_download/ffmpeg_build --pkg-config-flags=--static --extra-cflags=-I/home/working/app_download/ffmpeg_build/include --extra-ldflags=-L/home/working/app_download/ffmpeg_build/lib --extra-libs='-lpthread -lm' --ld=g++ --bindir=/home/working/app_download/ffmpeg_build/bin --enable-gpl --enable-gnutls --enable-libaom --enable-libass --enable-libfdk-aac --enable-libfreetype --enable-libmp3lame --enable-libopus --enable-libsvtav1 --enable-libdav1d --enable-libvorbis --enable-libvpx --enable-libx264 --enable-libx265 --enable-nonfree --enable-libtesseract

I tested with the command line:

% ffmpeg -i input -vf "ocr,metadata=mode=print:file=ocr.txt:direct=1" output

Here is an extract from ocr.txt after the above command:

frame:53   pts:212752  pts_time:212.752
lavfi.ocr.text=Transcendingisthedeepsettlingoftheactivityofthemind
whilethemindremainsawake.
lavfi.ocr.confidence=0 0 95

Tesseract did not recognize the white spaces, although it recognized them perfectly when applied directly to the frame image. So I checked libavfilter/vf_ocr.c to see what was going on: the white space was not in the whitelist of characters. Even though it is a strange idea to treat a white space as a character in the context of OCR, I boldly added one to the list. The original code was:

   { "whitelist", "set character whitelist", OFFSET(whitelist), AV_OPT_TYPE_STRING, {.str="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.:;,-+_!?\"'[]{}()<>|/\\=*&%$#@!~"}, 0, 0, FLAGS },

The modified code, with a white space appended to the end of the list, was:

   { "whitelist", "set character whitelist", OFFSET(whitelist), AV_OPT_TYPE_STRING, {.str="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.:;,-+_!?\"'[]{}()<>|/\\=*&%$#@!~ "}, 0, 0, FLAGS },
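To make the effect of this one-character change concrete, here is a minimal C sketch. It is not how Tesseract applies a whitelist internally; it only assumes that a character outside the whitelist never reaches the output text, which is consistent with the ocr.txt extract above. The helper filter_by_whitelist and the two constant names are hypothetical, and line breaks are passed through because they survive in the extract.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Default whitelist as quoted from libavfilter/vf_ocr.c in this ticket (no space). */
static const char WHITELIST_ORIG[] =
    "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    ".:;,-+_!?\"'[]{}()<>|/\\=*&%$#@!~";

/* Modified default: the same characters plus one trailing space. */
static const char WHITELIST_PATCHED[] =
    "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    ".:;,-+_!?\"'[]{}()<>|/\\=*&%$#@!~ ";

/*
 * Hypothetical helper (not part of FFmpeg or Tesseract): copy src into dst,
 * keeping only characters found in whitelist. Newlines are always kept,
 * which is an assumption based on the output shown in the ticket.
 * Output is truncated to fit dst_size. Returns the number of bytes written,
 * not counting the terminating NUL.
 */
size_t filter_by_whitelist(const char *src, const char *whitelist,
                           char *dst, size_t dst_size)
{
    size_t n = 0;

    assert(src && whitelist && dst);
    if (dst_size == 0)
        return 0;

    for (; *src && n + 1 < dst_size; src++) {
        if (*src == '\n' || strchr(whitelist, *src))
            dst[n++] = *src;
    }
    dst[n] = '\0';
    return n;
}
```

Filtering "the mind remains awake." through WHITELIST_ORIG gives "themindremainsawake.", while WHITELIST_PATCHED leaves the spaces in place.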

After recompiling, I tried again and here was the result:

frame:53   pts:212752  pts_time:212.752
lavfi.ocr.text=Transcending is the deep settling of the activity of the mind
while the mind remains awake.
lavfi.ocr.confidence=96 96 97 96 96 96 96 96 96 96 96 96 96 96 96 96 95 
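The lavfi.ocr.confidence value is a space-separated list of integers, so the two runs can be compared programmatically. A small sketch in C, assuming the value string (the part after the '=' sign) has already been extracted from ocr.txt; parse_confidences and min_confidence are hypothetical helper names, not part of FFmpeg.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse up to max_out space-separated integers from s
 * into out. Parsing stops at the first token that is not a number.
 * Returns how many values were stored.
 */
int parse_confidences(const char *s, int *out, int max_out)
{
    int n = 0;

    assert(s && out);
    while (n < max_out) {
        char *end;
        long v = strtol(s, &end, 10);  /* skips leading white space itself */
        if (end == s)                  /* nothing converted: end of the list */
            break;
        out[n++] = (int)v;
        s = end;
    }
    return n;
}

/* Hypothetical helper: smallest value in v[0..n-1], or -1 if n is 0. */
int min_confidence(const int *v, int n)
{
    int i, m;

    if (n <= 0)
        return -1;
    m = v[0];
    for (i = 1; i < n; i++)
        if (v[i] < m)
            m = v[i];
    return m;
}
```

For the first extract ("0 0 95") the minimum is 0, which flags the frame as poorly recognized; for the values shown after the change the minimum is 95.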

Change History (6)

comment:1 by dominic108, 7 months ago

Resolution: wontfix
Status: new → closed

I just realized that this is only the default value of the whitelist option. There might be scenarios where it is better not to detect white spaces, so I will close this.

comment:2 by dominic108, 7 months ago

Component: undetermined → documentation
Resolution: wontfix
Status: closed → reopened
Type: defect → enhancement

Perhaps it could be reopened as a documentation issue.

comment:3 by Carl Eugen Hoyos, 7 months ago

Keywords: libtesseract added; tesseract removed

Or you could send a patch that changes the default, made with git format-patch, to the FFmpeg development mailing list.

comment:4 by dominic108, 7 months ago

Including the white space seems a better default. It is even suggested, somewhat indirectly, in https://github.com/tesseract-ocr/tesseract/issues/2923. So I sent a patch to the mailing list.

comment:5 by dominic108, 7 months ago

But I will leave it as a documentation issue because, if the patch is not applied, the missing white space should at least be documented.

comment:6 by Carl Eugen Hoyos, 7 months ago

Component: documentation → avfilter
Resolution: fixed
Status: reopened → closed
Type: enhancement → defect
Version: unspecified → git-master

Pushed as 626e0dd060042b203c5b49512b695e03d8560da1, thank you for the report and the fix!
