#6021 closed defect (fixed)
tx3g / mov_text subtitles are not encoded correctly in some specific cases
Reported by: | erikbs | Owned by: | |
---|---|---|---|
Priority: | normal | Component: | avcodec |
Version: | git-master | Keywords: | utf8 mov_text |
Cc: | erikbs, rwijnsma@xs4all.nl | Blocked By: | |
Blocking: | Reproduced by developer: | no | |
Analyzed by developer: | no |
Description
Consider the following command:
ffmpeg -i input.mp4 -i input.srt -c:a copy -c:v copy -c:s mov_text output.mp4
for converting SRT subtitles into 3GPP timed text (TTXT) and embedding them inside an MPEG4 container. Until recently, ffmpeg ignored any formatting/styling in the SRT file and just converted the raw text and timestamps instead. That produced files that had no problems.
In SRT files, the start and end points of the text to be formatted are determined by tags, e.g. <i> and </i>. In TTXT subtitles, the start and end points are instead saved as numbers. Currently ffmpeg measures these values in bytes, but it looks like they should be measured in characters instead. For example, the Chinese character 我 consists of three bytes, but is considered a single character.
The problem arises when I try to convert an SRT where a line contains multibyte characters and a formatted string, and there are less than X characters between the formatted string and the end of the line, where X is the difference between the length of the line in bytes and the length in characters. Take for example this SRT line:
1 00:00:01,000 --> 00:00:02,000 The character 我 consists of three bytes <i>this string will cause problems</i>
Measured in characters, the formatted string starts at position 40 and ends at position 71 (i.e. the character at position 71 is not part of the string). Measured in bytes (excluding the tags of course), the string starts at position 42 and ends at position 73, which are the values ffmpeg stores inside the output file. When I open this file in QuickTime Player on Mac OS X, it seems to expect that these numbers be measured in characters. Since 我 is counted as one character, it proceeds to read off the end of the line (i.e. past the 73rd byte), resulting in an instant, brutal crash. VLC, which appears to handle errors better, either tries to correct the error or just ignores the formatting when invalid data is found.
I used MP4Box to convert the SRT to TTXT and to extract the TTXT from the MP4 generated by ffmpeg using
mp4box -ttxt input.srt # convert SRT to TTXT mp4box -ttxt 3 output.mp4 # extract the third stream (subtitles)
When I compared the output files, it immediately became clear that MP4Box counts characters while ffmpeg counts bytes. During testing I was able to confirm that VLC counts in the same way as MP4Box and QuickTime: in characters, not bytes. It should also be mentioned that the standalone TTXT files, which are XML files, contain the properties fromChar and toChar, further indicating that we should count characters and not bytes. When stored inside an MPEG4 container, the TTXT files are “compressed” into some binary format I do not fully understand instead of using XML style tags. By replacing the correct bytes in the file produced by ffmpeg (byte count --> character count) using a hex editor, the file played correctly in VLC and QuickTime (with the correct letters italicized), and I also got MP4Box to extract an SRT that looked correct (without the hex editing, the SRT generated by MP4Box from the file produced by ffmpeg had the tags in wrong place).
The erroneous data seems to be written by the function encode_styl in the file libavcodec/movtextenc.c, at line 108-109 to be precise. Here the raw byte positions are written to the files. These are passed to the function through an MovTextContext struct, which has a member called style_attributes – an array where each element corresponds to a formatted string. Each element in this array is another struct, having members such as style_start and style_end. At the moment I have no idea where these values are produced, but the encode_styl function writes them to the file.
To correct the bug, the code that counts bytes and writes these values to the struct that is eventually passed to encode_styl should be corrected, so that it counts characters instead. I guess ffmpeg already has code for counting utf8 characters, but if it has not, then this article presents various ways of doing it.
Change History (20)
comment:1 by , 8 years ago
Version: | 3.2.1 → git-master |
---|
comment:2 by , 8 years ago
follow-up: 5 comment:3 by , 8 years ago
Please commit your patch locally with git commit
(mention the ticket number please), produce a patch file with git format-patch
and send the patch file to the development mailing list. You can undo your local commit with git reset HEAD^
then.
comment:4 by , 8 years ago
Keywords: | ttxt tx3g subtitles mp4 removed |
---|
comment:5 by , 8 years ago
Thanks, the patch has now been added to the list.
Two things remain:
- Fixing the decoder as well
- UTF-16, as is required by the format specification (only Big Endian UTF-16 is required, UTF-16 is optional)
I will take a look at the first one, but probably not the second, at least not now. According to the specification, UTF-8 is to be assumed unless there is a UTF-16 BOM, and I am afraid I will break something if I try to solve this without knowing the code. Besides, I have never encountered UTF-16 encoded subtitles and UTF-16 support is broken both with and without my patch (in different ways).
PS. Ticket #5429 is about the same problem. I did not realize that my ticket was a duplicate until after I had submitted it.
comment:6 by , 8 years ago
New patch submitted for correctly decoding styles when multibyte UTF-8 characters are involved.
About UTF-16:
Given that the byte length, uint64_t L
, of the string, char *text
, is known, the number of UTF-16 characters in the string can be calculated as follows:
uint64_t utf16_char_len(const char *text, uint64_t L) { uint64_t l = 0; uint16_t c = 0, start = 0; uint16_t m[2] = {0xFC00, 0xDC00}; // Bit masks if (L >= 2) c = ((uint16_t)text[0] << 8) + (uint8_t)text[1]; switch (c) { case 0xFFFE: // Little Endian, swap mask byte order m[0] = 0x00FC; m[1] = 0x00DC; case 0xFEFF: start = 2; // Skip the BOM default: for (uint64_t i = start; i < L; i += 2) if (((((uint16_t)text[i] << 8) | (uint8_t )text[i + 1] ) & m[0]) != m[1]) l++; } return l; }
This code expects to be fed valid UTF-16 data and assumes Big Endian when no BOM is present.
The format specification only requires Big Endian support, but it demands that the BOM be present for UTF-16. Exactly how this is supposed to be encoded I don’t know.
comment:7 by , 8 years ago
Cc: | added |
---|
Sorry if I misunderstand but I believe FFmpeg cannot output UTF-16 subtitles.
In any case, UTF-16 should not be part of your patch fixing UTF-8.
comment:8 by , 8 years ago
Ah, I see. That explains the error messages I received when trying to reencode a UTF-16 SRT into another UTF-16 SRT.
UTF-16 is not part of the two patches I have submitted for UTF-8.
comment:9 by , 8 years ago
When can I expect this patch to be integrated?
I'm asking because I'm having a somewhat similar situation.
Recently I found out that my Samsung Smart TV (2014) doesn't like tx3g subtitle-streams generated by FFMpeg. Whenever I mux subtitles as tx3g with -c:s mov_text
, Media Player Classic - HC on my pc never has any problems rendering them, but my tv does. Each time I start such a video generated by FFMpeg, lines only randomly appear.
If I generate the same video with MP4Box, then all lines appear normally on my tv.
Before I encountered this issue I've tried -fix_sub_duration
and several -movflags
-options, but all to no avail.
I'd like to see if erikbs's suggested patch could also fix my issue.
comment:10 by , 8 years ago
I'm a complete novice when it comes to compiling stuff from source, but in the end, with the help of ffmpeg-windows-build-helpers I've tested your suggested patch, erikbs, but it does nothing for my issue, because my Samsung SMART TV still randomly shows the subtitle lines. Again, the same subtitles muxed with MP4Box and all lines are shown.
I guess I'll mux to Matroska for the time being.
comment:11 by , 8 years ago
There are several attributes that ffmpeg does not care about when writing mov_text data. You can use MP4Box to dump the subtitles stream to XML format using the following command:
mp4box -ttxt 3 FILENAME.mp4
As you can see, attributes such as width and height are all zero in XML files dumped from ffmpeg generated MP4s. The MP4s containing subtitles converted from SRT by MP4Box have non-zero values in these fields instead, as is evident when dumping the subtitles stream to XML/TTXT. It is possible that the missing values are causing your problem. It is also possible that the patch was not applied correctly (the one I posted above is in a format that is slightly different from the one in which files are supposed to be submitted to Patchwork). Feel free to upload a sample video that is affected by the problem. You can create a 30 second sample from a longer video using this command:
ffmpeg -i INPUT_FILE.mp4 -c copy -c:s copy -ss HH:MM:SS -t 00:00:30 OUTPUT_FILE.mp4
where HH:MM:SS is the start time (hours, minutes, seconds).
It could of course also be that my patch is onto something, but is incomplete. Thus other relevant questions are:
- Are the vanishing subtitle lines styled? (look for <b>, <i> or <font in the SRT)
- Do the lines contain special characters? Multibyte characters? Characters missing from the TV’s fonts? Line breaks?
- Is there anything special about the lines that are shown? (Special characters, styling, length, …)
- Do the subtitles vanish randomly or are the same lines displayed/hidden every time?
comment:12 by , 8 years ago
I have submitted new patches by the way:
https://patchwork.ffmpeg.org/patch/2795/ (encoder)
https://patchwork.ffmpeg.org/patch/2796/ (decoder)
These hopefully meet all the requirements (no reindentation etc.)
comment:13 by , 8 years ago
I only saw your posts just now, because I didn't get email-notification.
It is also possible that the patch was not applied correctly (the one I posted above is in a format that is slightly different from the one in which files are supposed to be submitted to Patchwork).
Last time I didn't actually apply your patch the official way. I just manually edited movtextenc.c. I've also done this now, because of line-numbers in your patch not being up-to-date anymore.
Feel free to upload a sample video that is affected by the problem.
See http://rwijnsma.home.xs4all.nl/files/test/
Thus other relevant questions are: ...
Everything is just plain text. The subtitles vanish randomly.
I've successfully compiled FFMpeg with your changes, but after creating another test-video ('VPWON_1269922(FFMpeg-mux_t32)patch.mp4') my TV didn't display subtitles at all anymore.
No problem with MPC-HC on my pc.
commandline used:
ffmpeg.exe -i "full-episode" -i "VPWON_1269922.srt" -t 32 -c copy -c:s mov_text -metadata:s:s language=dut "VPWON_1269922(FFMpeg-mux_t32)patch.mp4"
comment:14 by , 6 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Closing the ticket, since I no longer experience this bug. I have looked at the code and the solution is more complicated than necessary, but it appears to do roughly the same as I suggested.
comment:16 by , 6 years ago
Well, better late than never. Given that it took too full years to fix the problem even though I provided a detailed explanation and submitted a patch 4-5 times (including here), I do not think I will bother trying again.
follow-up: 18 comment:17 by , 6 years ago
Cc: | added |
---|
First time I've been looking at this again in 2 years time. I believe my issue is different than erikbs', because this fix didn't do anything for me.
Whenever I mux subtitles it will always be to a matroska container, because muxing mov_text in a mp4 container was problematic and after testing today I can say it still is.
Ran above command again with a (obviously) newer FFmpeg binary, but now my Samsung Smart TV (2014) doesn't show/render any subtitles at all. My tv has no problems at all with the video generated by MP4Box and shows all subtitle lines.
Could it be that FFmpeg doesn't fully adhere to the ISO/IEC 14496-17 standard? There has to be a reason why my tv doesn't like the FFmpeg generated MP4[tx3g] files afterall.
comment:18 by , 6 years ago
Replying to CoRoNe:
First time I've been looking at this again in 2 years time. I believe my issue is different than erikbs', because this fix didn't do anything for me.
Whenever I mux subtitles it will always be to a matroska container, because muxing mov_text in a mp4 container was problematic and after testing today I can say it still is.
Ran above command again with a (obviously) newer FFmpeg binary, but now my Samsung Smart TV (2014) doesn't show/render any subtitles at all. My tv has no problems at all with the video generated by MP4Box and shows all subtitle lines.
Could it be that FFmpeg doesn't fully adhere to the ISO/IEC 14496-17 standard? There has to be a reason why my tv doesn't like the FFmpeg generated MP4[tx3g] files afterall.
Absolutely possible (actually I am even convinced that it does not, considering how it was overlooked that character and byte count are two different things). Through my investigation, I discovered that MP4Box writes a subtitle box size while ffmpeg does not (or, more precisely, it apparently sets the box size to 0x0). Perhaps that is why the subtitles are not shown on the TV?
Another thing to try is to encode the subtitles into an MP4 file with ffmpeg as usual, then use MP4Box to demux and remux the tx3g subtitles generated by ffmpeg (mp4box -ttxt 3 FILENAME.mp4
etc.). If that works, then ffmpeg somehow messes up the muxing, not the encoding. If it still does not work, then there is something wrong with its tx3g encoder.
comment:19 by , 6 years ago
I discovered that MP4Box writes a subtitle box size while ffmpeg does not (or, more precisely, it apparently sets the box size to 0x0). Perhaps that is why the subtitles are not shown on the TV?
Do you have a patch for this by any chance? I'm not a hero with C/C++.
...demux and remux
This works. My tv renders the subtitles perfectly. So I'm guessing something is going wrong with the muxing process.
Comparing the 2 mp4-files in MediaInfo I noticed the one created by FFmpeg mentioning "Muxing mode: sbtl" and "Alternate group: 3", where as the one created by MP4Box does not.
And for everyone's info:
--- VPWON_1269922-t32_MP4Box.ttxt 2019-04-05 23:33:21.500000000 +0200 +++ VPWON_1269922-t32_FFmpeg_demux-MP4Box.ttxt 2019-04-05 23:16:37.475016000 +0200 @@ -1,12 +1,12 @@ <?xml version="1.0" encoding="UTF-8" ?> <!-- GPAC 3GPP Text Stream --> <TextStream version="1.1"> -<TextStreamHeader width="400" height="60" layer="0" translation_x="0" translation_y="0"> +<TextStreamHeader width="0" height="0" layer="0" translation_x="0" translation_y="0"> <TextSampleDescription horizontalJustification="center" verticalJustification="bottom" backColor="0 0 0 0" verticalText="no" fillTextRegion="no" continuousKaraoke="no" scroll="None"> <FontTable> <FontTableEntry fontName="Serif" fontID="1"/> </FontTable> -<TextBox top="0" left="0" bottom="60" right="400"/> +<TextBox top="0" left="0" bottom="0" right="0"/> <Style styles="Normal" fontID="1" fontSize="18" color="ff ff ff ff"/> </TextSampleDescription> </TextStreamHeader>
This appears to be the only output difference between MP4Box's and FFmpeg's tx3g encoder.
'VPWON_1269922-t32_MP4Box.ttxt': mp4box -ttxt "VPWON_1269922-t32.srt"
'VPWON_1269922-t32_FFmpeg_demux-MP4Box.ttxt': mp4box -ttxt 3 "VPWON_1269922(FFMpeg-mux_t32).mp4"
comment:20 by , 3 years ago
Yep, my TV (LG C9) cannot play Muxing mode: sbtl with more than one subtitle tracks at all!!
mp4box works just fine as it muxes in a different mode, mov style.
Okay, so I took a closer look at the TTXT format documenation, and it clearly states that text lengths are counted in characters (not bytes). (Note that this document only touches the XML form. In binary form, the length of the different blocks (text, style etc.) appear to be counted in bytes.)
I also played around with the code, and think I have found a solution:
My modified code produces files that play correctly in QuickTime and contain subtitle streams equivalent to what MP4Box produces (MP4Box also sets height/width, ffmpeg does not, but this does not seem to matter). I have tested it on subtitles with nested styles too. My modifications may not be the prettiest, but the code works as far as I can see. I have not touched movtextdec.c, so when extracting TTXT subtitles from an MP4, ffmpeg generates SRT files with misplaced style tags.
For movtextenc.c I have created a patch. Any help with submitting it is highly appreciated. Me + git = disaster, and I do not even know if it is in the correct format (I used
diff -u file1 file2
):