Opened 4 years ago
Closed 4 years ago
#9151 closed defect (fixed)
Missing white space in the white list of tesseract configuration
Reported by: | dominic108 | Owned by: | |
---|---|---|---|
Priority: | normal | Component: | avfilter |
Version: | git-master | Keywords: | libtesseract |
Cc: | | Blocked By: | |
Blocking: | | Reproduced by developer: | no |
Analyzed by developer: | no | | |
Description
Summary of the bug: I compiled ffmpeg on Ubuntu with libtesseract enabled (--enable-libtesseract):
ffmpeg version N-101412-gb7e7813 Copyright (c) 2000-2021 the FFmpeg developers
built with gcc 9 (Ubuntu 9.3.0-17ubuntu1~20.04)
configuration: --prefix=/home/working/app_download/ffmpeg_build --pkg-config-flags=--static --extra-cflags=-I/home/working/app_download/ffmpeg_build/include --extra-ldflags=-L/home/working/app_download/ffmpeg_build/lib --extra-libs='-lpthread -lm' --ld=g++ --bindir=/home/working/app_download/ffmpeg_build/bin --enable-gpl --enable-gnutls --enable-libaom --enable-libass --enable-libfdk-aac --enable-libfreetype --enable-libmp3lame --enable-libopus --enable-libsvtav1 --enable-libdav1d --enable-libvorbis --enable-libvpx --enable-libx264 --enable-libx265 --enable-nonfree --enable-libtesseract
I tested with the command line:
% ffmpeg -i input -vf "ocr,metadata=mode=print:file=ocr.txt:direct=1" output
Here is an extract from ocr.txt after the above command:
frame:53 pts:212752 pts_time:212.752
lavfi.ocr.text=Transcendingisthedeepsettlingoftheactivityofthemind whilethemindremainsawake.
lavfi.ocr.confidence=0 0 95
The white spaces were not recognized in the filter output, although Tesseract recognized them perfectly when I applied it directly to the frame image. So I checked libavfilter/vf_ocr.c to see what was going on: the white space was not in the default character whitelist. Even though it is a strange idea to treat white space as a character in the context of OCR, I boldly added one to the list. The original code was:
{ "whitelist", "set character whitelist", OFFSET(whitelist), AV_OPT_TYPE_STRING, {.str="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.:;,-+_!?\"'[]{}()<>|/\\=*&%$#@!~"}, 0, 0, FLAGS },
The modified code, with a single space appended to the end of the string, was:
{ "whitelist", "set character whitelist", OFFSET(whitelist), AV_OPT_TYPE_STRING, {.str="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.:;,-+_!?\"'[]{}()<>|/\\=*&%$#@!~ "}, 0, 0, FLAGS },
After recompiling, I tried again and here was the result:
frame:53 pts:212752 pts_time:212.752
lavfi.ocr.text=Transcending is the deep settling of the activity of the mind while the mind remains awake.
lavfi.ocr.confidence=96 96 97 96 96 96 96 96 96 96 96 96 96 96 96 96 95
Change History (6)
comment:1 by , 4 years ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
comment:2 by , 4 years ago
Component: | undetermined → documentation |
---|---|
Resolution: | wontfix |
Status: | closed → reopened |
Type: | defect → enhancement |
Perhaps it could be reopened as a documentation issue.
comment:3 by , 4 years ago
Keywords: | libtesseract added; tesseract removed |
---|
Or you could send a patch (made with git format-patch) that changes the default to the FFmpeg development mailing list.
comment:4 by , 4 years ago
It seems like a better default. It is even suggested, in a way, in https://github.com/tesseract-ocr/tesseract/issues/2923. So I sent a patch to the mailing list.
comment:5 by , 4 years ago
But I will leave it as a documentation issue because, if the patch is not applied, the missing white space should at least be documented.
comment:6 by , 4 years ago
Component: | documentation → avfilter |
---|---|
Resolution: | → fixed |
Status: | reopened → closed |
Type: | enhancement → defect |
Version: | unspecified → git-master |
Pushed as 626e0dd060042b203c5b49512b695e03d8560da1, thank you for the report and the fix!
I just realized that it was a default value. There might be scenarios where it is better not to detect white spaces. So, I will close this.