Opened 11 months ago

Last modified 8 months ago

#6021 new defect

tx3g / mov_text subtitles are not encoded correctly in some specific cases

Reported by: erikbs Owned by:
Priority: normal Component: avcodec
Version: git-master Keywords: utf8 mov_text
Cc: erikbs Blocked By:
Blocking: Reproduced by developer: no
Analyzed by developer: no

Description

Consider the following command:

ffmpeg -i input.mp4 -i input.srt -c:a copy -c:v copy -c:s mov_text output.mp4

for converting SRT subtitles into 3GPP timed text (TTXT) and embedding them inside an MPEG4 container. Until recently, ffmpeg ignored any formatting/styling in the SRT file and just converted the raw text and timestamps instead. That produced files that had no problems.

In SRT files, the start and end points of the text to be formatted are determined by tags, e.g. <i> and </i>. In TTXT subtitles, the start and end points are instead saved as numbers. Currently ffmpeg measures these values in bytes, but it looks like they should be measured in characters instead. For example, the Chinese character 我 consists of three bytes, but is considered a single character.

The problem arises when I try to convert an SRT where a line contains multibyte characters and a formatted string, and there are less than X characters between the formatted string and the end of the line, where X is the difference between the length of the line in bytes and the length in characters. Take for example this SRT line:

1
00:00:01,000 --> 00:00:02,000
The character 我 consists of three bytes
<i>this string will cause problems</i>

Measured in characters, the formatted string starts at position 40 and ends at position 71 (i.e. the character at position 71 is not part of the string). Measured in bytes (excluding the tags of course), the string starts at position 42 and ends at position 73, which are the values ffmpeg stores inside the output file. When I open this file in QuickTime? Player on Mac OS X, it seems to expect that these numbers be measured in characters. Since 我 is counted as one character, it proceeds to read off the end of the line (i.e. past the 73rd byte), resulting in an instant, brutal crash. VLC, which appears to handle errors better, either tries to correct the error or just ignores the formatting when invalid data is found.

I used MP4Box to convert the SRT to TTXT and to extract the TTXT from the MP4 generated by ffmpeg using

mp4box -ttxt input.srt # convert SRT to TTXT
mp4box -ttxt 3 output.mp4 # extract the third stream (subtitles)

When I compared the output files, it immediately became clear that MP4Box counts characters while ffmpeg counts bytes. During testing I was able to confirm that VLC counts in the same way as MP4Box and QuickTime?: in characters, not bytes. It should also be mentioned that the standalone TTXT files, which are XML files, contain the properties fromChar and toChar, further indicating that we should count characters and not bytes. When stored inside an MPEG4 container, the TTXT files are “compressed” into some binary format I do not fully understand instead of using XML style tags. By replacing the correct bytes in the file produced by ffmpeg (byte count --> character count) using a hex editor, the file played correctly in VLC and QuickTime? (with the correct letters italicized), and I also got MP4Box to extract an SRT that looked correct (without the hex editing, the SRT generated by MP4Box from the file produced by ffmpeg had the tags in wrong place).

The erroneous data seems to be written by the function encode_styl in the file libavcodec/movtextenc.c, at line 108-109 to be precise. Here the raw byte positions are written to the files. These are passed to the function through an MovTextContext? struct, which has a member called style_attributes – an array where each element corresponds to a formatted string. Each element in this array is another struct, having members such as style_start and style_end. At the moment I have no idea where these values are produced, but the encode_styl function writes them to the file.

To correct the bug, the code that counts bytes and writes these values to the struct that is eventually passed to encode_styl should be corrected, so that it counts characters instead. I guess ffmpeg already has code for counting utf8 characters, but if it has not, then this article presents various ways of doing it.

Change History (13)

comment:1 Changed 11 months ago by erikbs

  • Version changed from 3.2.1 to git-master

comment:2 Changed 11 months ago by erikbs

Okay, so I took a closer look at the TTXT format documenation, and it clearly states that text lengths are counted in characters (not bytes). (Note that this document only touches the XML form. In binary form, the length of the different blocks (text, style etc.) appear to be counted in bytes.)

I also played around with the code, and think I have found a solution:

  1. In the MovTextContext? struct, introduce a new member variable for the text position as characters (without replacing the existing text_pos, which is measured in bytes and needed for storing the length of the blocks)
  2. Increase the new variable by counting UTF8 characters instead of bytes (according to the format documentation, only UTF8 is supported anyway). A simple for+if will do (loop through all bytes and only count those where the two most significant bits are not 10)
  3. Replace all instances of text_pos with the new variable in code regarding style.

My modified code produces files that play correctly in QuickTime? and contain subtitle streams equivalent to what MP4Box produces (MP4Box also sets height/width, ffmpeg does not, but this does not seem to matter). I have tested it on subtitles with nested styles too. My modifications may not be the prettiest, but the code works as far as I can see. I have not touched movtextdec.c, so when extracting TTXT subtitles from an MP4, ffmpeg generates SRT files with misplaced style tags.

For movtextenc.c I have created a patch. Any help with submitting it is highly appreciated. Me + git = disaster, and I do not even know if it is in the correct format (I used diff -u file1 file2):

--- movtextenc.c	2016-02-26 21:15:02.000000000 +0000
+++ movtextenc.c	2016-12-15 02:29:03.000000000 +0000
@@ -70,6 +70,7 @@
     uint8_t style_fontsize;
     uint32_t style_color;
     uint16_t text_pos;
+    uint16_t text_pos_char;
 } MovTextContext;
 
 typedef struct {
@@ -216,10 +217,10 @@
             }
 
             s->style_attributes_temp->style_flag = 0;
-            s->style_attributes_temp->style_start = AV_RB16(&s->text_pos);
+            s->style_attributes_temp->style_start = AV_RB16(&s->text_pos_char);
         } else {
             if (s->style_attributes_temp->style_flag) { //break the style record here and start a new one
-                s->style_attributes_temp->style_end = AV_RB16(&s->text_pos);
+                s->style_attributes_temp->style_end = AV_RB16(&s->text_pos_char);
                 av_dynarray_add(&s->style_attributes, &s->count, s->style_attributes_temp);
                 s->style_attributes_temp = av_malloc(sizeof(*s->style_attributes_temp));
                 if (!s->style_attributes_temp) {
@@ -230,10 +231,10 @@
                 }
 
                 s->style_attributes_temp->style_flag = s->style_attributes[s->count - 1]->style_flag;
-                s->style_attributes_temp->style_start = AV_RB16(&s->text_pos);
+                s->style_attributes_temp->style_start = AV_RB16(&s->text_pos_char);
             } else {
                 s->style_attributes_temp->style_flag = 0;
-                s->style_attributes_temp->style_start = AV_RB16(&s->text_pos);
+                s->style_attributes_temp->style_start = AV_RB16(&s->text_pos_char);
             }
         }
         switch (style){
@@ -248,7 +249,7 @@
             break;
         }
     } else {
-        s->style_attributes_temp->style_end = AV_RB16(&s->text_pos);
+        s->style_attributes_temp->style_end = AV_RB16(&s->text_pos_char);
         av_dynarray_add(&s->style_attributes, &s->count, s->style_attributes_temp);
 
         s->style_attributes_temp = av_malloc(sizeof(*s->style_attributes_temp));
@@ -273,7 +274,7 @@
             break;
         }
         if (s->style_attributes_temp->style_flag) { //start of new style record
-            s->style_attributes_temp->style_start = AV_RB16(&s->text_pos);
+            s->style_attributes_temp->style_start = AV_RB16(&s->text_pos_char);
         }
     }
     s->box_flags |= STYL_BOX;
@@ -284,11 +285,11 @@
     MovTextContext *s = priv;
     if (color_id == 2) {    //secondary color changes
         if (s->box_flags & HLIT_BOX) {  //close tag
-            s->hlit.end = AV_RB16(&s->text_pos);
+            s->hlit.end = AV_RB16(&s->text_pos_char);
         } else {
             s->box_flags |= HCLR_BOX;
             s->box_flags |= HLIT_BOX;
-            s->hlit.start = AV_RB16(&s->text_pos);
+            s->hlit.start = AV_RB16(&s->text_pos_char);
             s->hclr.color = color | (0xFF << 24);  //set alpha value to FF
         }
     }
@@ -302,7 +303,10 @@
 {
     MovTextContext *s = priv;
     av_bprint_append_data(&s->buffer, text, len);
-    s->text_pos += len;
+    s->text_pos += len; // length of text in bytes
+    for (int i = 0; i < len; i++) // length of text in characters
+        if ((text[i] & 0xC0) != 0x80)
+            s->text_pos_char ++;
 }
 
 static void mov_text_new_line_cb(void *priv, int forced)
@@ -310,6 +314,7 @@
     MovTextContext *s = priv;
     av_bprint_append_data(&s->buffer, "\n", 1);
     s->text_pos += 1;
+    s->text_pos_char += 1;
 }
 
 static const ASSCodesCallbacks mov_text_callbacks = {
@@ -328,6 +333,7 @@
     size_t j;
 
     s->text_pos = 0;
+    s->text_pos_char = 0;
     s->count = 0;
     s->box_flags = 0;
     s->style_entries = 0;

comment:3 follow-up: Changed 11 months ago by cehoyos

Please commit your patch locally with git commit (mention the ticket number please), produce a patch file with git format-patch and send the patch file to the development mailing list. You can undo your local commit with git reset HEAD^ then.

comment:4 Changed 11 months ago by cehoyos

  • Keywords ttxt tx3g subtitles mp4 removed

comment:5 in reply to: ↑ 3 Changed 11 months ago by erikbs

Thanks, the patch has now been added to the list.

Two things remain:

  1. Fixing the decoder as well
  2. UTF-16, as is required by the format specification (only Big Endian UTF-16 is required, UTF-16 is optional)

I will take a look at the first one, but probably not the second, at least not now. According to the specification, UTF-8 is to be assumed unless there is a UTF-16 BOM, and I am afraid I will break something if I try to solve this without knowing the code. Besides, I have never encountered UTF-16 encoded subtitles and UTF-16 support is broken both with and without my patch (in different ways).

PS. Ticket #5429 is about the same problem. I did not realize that my ticket was a duplicate until after I had submitted it.

Last edited 11 months ago by erikbs (previous) (diff)

comment:6 Changed 11 months ago by erikbs

New patch submitted for correctly decoding styles when multibyte UTF-8 characters are involved.

About UTF-16:
Given that the byte length, uint64_t L, of the string, char *text, is known, the number of UTF-16 characters in the string can be calculated as follows:

uint64_t utf16_char_len(const char *text, uint64_t L) {
    uint64_t l = 0;
    uint16_t c = 0, start = 0;
    uint16_t m[2] = {0xFC00, 0xDC00}; // Bit masks
    
    if (L >= 2) c = ((uint16_t)text[0] << 8) + (uint8_t)text[1];
    switch (c) {
        case 0xFFFE:    // Little Endian, swap mask byte order
            m[0] = 0x00FC; m[1] = 0x00DC;
        case 0xFEFF:
            start = 2;  // Skip the BOM
        default:
            for (uint64_t i = start; i < L; i += 2)
                if (((((uint16_t)text[i] << 8) |
                    (uint8_t )text[i + 1] ) & m[0]) != m[1]) l++;
    }
    return l;
}

This code expects to be fed valid UTF-16 data and assumes Big Endian when no BOM is present.

The format specification only requires Big Endian support, but it demands that the BOM be present for UTF-16. Exactly how this is supposed to be encoded I don’t know.

comment:7 Changed 11 months ago by cehoyos

  • Cc erikbs added

Sorry if I misunderstand but I believe FFmpeg cannot output UTF-16 subtitles.
In any case, UTF-16 should not be part of your patch fixing UTF-8.

comment:8 Changed 11 months ago by erikbs

Ah, I see. That explains the error messages I received when trying to reencode a UTF-16 SRT into another UTF-16 SRT.

UTF-16 is not part of the two patches I have submitted for UTF-8.

comment:9 Changed 11 months ago by CoRoNe

When can I expect this patch to be integrated?
I'm asking because I'm having a somewhat similar situation.

Recently I found out that my Samsung Smart TV (2014) doesn't like tx3g subtitle-streams generated by FFMpeg. Whenever I mux subtitles as tx3g with -c:s mov_text, Media Player Classic - HC on my pc never has any problems rendering them, but my tv does. Each time I start such a video generated by FFMpeg, lines only randomly appear.
If I generate the same video with MP4Box, then all lines appear normally on my tv.
Before I encountered this issue I've tried -fix_sub_duration and several -movflags-options, but all to no avail.

I'd like to see if erikbs's suggested patch could also fix my issue.

comment:10 Changed 11 months ago by CoRoNe

I'm a complete novice when it comes to compiling stuff from source, but in the end, with the help of ffmpeg-windows-build-helpers I've tested your suggested patch, erikbs, but it does nothing for my issue, because my Samsung SMART TV still randomly shows the subtitle lines. Again, the same subtitles muxed with MP4Box and all lines are shown.
I guess I'll mux to Matroska for the time being.

comment:11 Changed 9 months ago by erikbs

There are several attributes that ffmpeg does not care about when writing mov_text data. You can use MP4Box to dump the subtitles stream to XML format using the following command:

mp4box -ttxt 3 FILENAME.mp4

As you can see, attributes such as width and height are all zero in XML files dumped from ffmpeg generated MP4s. The MP4s containing subtitles converted from SRT by MP4Box have non-zero values in these fields instead, as is evident when dumping the subtitles stream to XML/TTXT. It is possible that the missing values are causing your problem. It is also possible that the patch was not applied correctly (the one I posted above is in a format that is slightly different from the one in which files are supposed to be submitted to Patchwork). Feel free to upload a sample video that is affected by the problem. You can create a 30 second sample from a longer video using this command:

ffmpeg -i INPUT_FILE.mp4 -c copy -c:s copy -ss HH:MM:SS -t 00:00:30 OUTPUT_FILE.mp4

where HH:MM:SS is the start time (hours, minutes, seconds).

It could of course also be that my patch is onto something, but is incomplete. Thus other relevant questions are:

  1. Are the vanishing subtitle lines styled? (look for <b>, <i> or <font in the SRT)
  2. Do the lines contain special characters? Multibyte characters? Characters missing from the TV’s fonts? Line breaks?
  3. Is there anything special about the lines that are shown? (Special characters, styling, length, …)
  4. Do the subtitles vanish randomly or are the same lines displayed/hidden every time?

comment:12 Changed 9 months ago by erikbs

I have submitted new patches by the way:

https://patchwork.ffmpeg.org/patch/2795/ (encoder)

https://patchwork.ffmpeg.org/patch/2796/ (decoder)

These hopefully meet all the requirements (no reindentation etc.)

comment:13 Changed 8 months ago by CoRoNe

I only saw your posts just now, because I didn't get email-notification.

It is also possible that the patch was not applied correctly (the one I posted above is in a format that is slightly different from the one in which files are supposed to be submitted to Patchwork).

Last time I didn't actually apply your patch the official way. I just manually edited movtextenc.c. I've also done this now, because of line-numbers in your patch not being up-to-date anymore.

Feel free to upload a sample video that is affected by the problem.

See http://rwijnsma.home.xs4all.nl/files/test/

Thus other relevant questions are: ...

Everything is just plain text. The subtitles vanish randomly.

I've successfully compiled FFMpeg with your changes, but after creating another test-video ('VPWON_1269922(FFMpeg-mux_t32)patch.mp4') my TV didn't display subtitles at all anymore.
No problem with MPC-HC on my pc.

commandline used:

ffmpeg.exe -i "full-episode" -i "VPWON_1269922.srt" -t 32 -c copy -c:s mov_text
-metadata:s:s language=dut "VPWON_1269922(FFMpeg-mux_t32)patch.mp4"
Note: See TracTickets for help on using tickets.