Opened 4 years ago

Closed 2 years ago

Last modified 2 years ago

#1163 closed defect (fixed)

ffprobe can produce invalid XML

Reported by: Ian Owned by: stefano
Priority: normal Component: ffprobe
Version: git-master Keywords: utf8
Cc: n@ndj7.com, eml+ffmpeg@tupil.com Blocked By:
Blocking: Reproduced by developer: yes
Analyzed by developer: yes

Description

ffprobe can output invalid XML as xml_escape_str only handles < > ' " and &. For example most escape characters below 32 are invalid UTF-8.

This replacement version of the function replaces any invalid UTF-8 characters by the inverted question mark.

static const char *xml_escape_str(char **dst, size_t *dst_size, const char *src,
                                  void *log_ctx)
{
    // the unknown character (inverted question mark)
    const unsigned char BAD_CHARACTER_1 = 194, BAD_CHARACTER_2 = 191;

    const char *p;
    char *q;
    int copyAll = 1;
    size_t size = 1;

    /* precompute size */
    for (p = src; *p;) {
        int badChar = 0;
        unsigned char byte;

        ESCAPE_CHECK_SIZE(src, size, SIZE_MAX-10);

        byte = (unsigned char)*p;
        if (byte < 32 && byte != 9 && byte != 10 && byte != 13) {
            badChar = 1;
            ++p;
        } else if (byte < 128) {
            switch (byte) {
                case '&' : size += 5; /* &amp; */  copyAll = 0; break;
                case '<' : size += 4; /* &lt; */   copyAll = 0; break;
                case '>' : size += 4; /* &gt; */   copyAll = 0; break;
                case '\"': size += 6; /* &quot; */ copyAll = 0; break;
                case '\'': size += 6; /* &apos; */ copyAll = 0; break;
                default: size++;
                }
            ++p;
            ++size;
            }
        else if (byte < 0xC0)
            {
            badChar = 1;
            ++p;
            }
        else
            {
            int extra;

            copyAll = 0;
            if (byte < 0xe0)
                extra = 1;
            else if (byte < 0xf0)
                extra = 2;
            else if (byte < 0xf8)
                extra = 3;
            else
                badChar = 1;

            if (badChar)
                ++p;
            else
                {
                ++p;
                for (int i = 0; i < extra && *p; ++i, ++p)
                    {
                    byte = (unsigned char)*p;
                    if ((byte & 0xc0) != 0x80)
                        badChar = 1;
                    }
                if (!badChar)
                    size += extra;
                }
            }
        if (badChar) {
            size += 2;
            copyAll = 0;
            }
        }

    ESCAPE_REALLOC_BUF(dst_size, dst, src, size);

#define COPY_STR(str) {      \
        const char *s = str; \
        while (*s)           \
            *q++ = *s++;     \
    }

    p = src;
    q = *dst;
    if (copyAll)
        COPY_STR(p)
    else {
        while (*p) {
            int badChar = 0;
            unsigned char byte;
    
            byte = (unsigned char)*p;
            if (byte < 32 && byte != 9 && byte != 10 && byte != 13) {
                badChar = 1;
                ++p;
            } else if (byte < 128) {
                switch (byte) {
                    case '&' : COPY_STR("&amp;");  break;
                    case '<' : COPY_STR("&lt;");   break;
                    case '>' : COPY_STR("&gt;");   break;
                    case '\"': COPY_STR("&quot;"); break;
                    case '\'': COPY_STR("&apos;"); break;
                    default: *q++ = *p;
                    }
                ++p;
                ++size;
                }
            else if (byte < 0xC0)
                {
                badChar = 1;
                ++p;
                }
            else
                {
                int extra;
    
                copyAll = 0;
                if (byte < 0xe0)
                    extra = 1;
                else if (byte < 0xf0)
                    extra = 2;
                else if (byte < 0xf8)
                    extra = 3;
                else
                    badChar = 1;
    
                if (badChar)
                    ++p;
                else
                    {
                    const char *startChar = p;
                    int i;
                    ++p;
                    for (i = 0; i < extra && *p; ++i, ++p)
                        {
                        byte = (unsigned char)*p;
                        if ((byte & 0xc0) != 0x80)
                            badChar = 1;
                        }
                    if (!badChar) {
                        for (i = 0; i < extra;)
                            *q++ = *startChar++;
                    }
                }
            }
            if (badChar) {
                *q++ = BAD_CHARACTER_1;
                *q++ = BAD_CHARACTER_2;
            }
        }
    }
    *q = 0;

    return *dst;
}

Change History (23)

comment:1 Changed 4 years ago by cehoyos

Is the problem also reproducible with current git head?

Please send unified patches to ffmpeg-devel and please provide an example (including console) on how to produce the invalid xml.

comment:2 Changed 4 years ago by Đonny

Hi, I can reproduce it with ffprobe version N-42704-g85761ef built on Jul 20 2012 20:39:19

Version 0, edited 4 years ago by Đonny (next)

comment:3 follow-ups: Changed 3 years ago by nathanaeljones

  • Cc n@ndj7.com added

I would be willing to fund development of a patch if it can be completed/integrated soon. Anyone interested?

This is currently blocking our integration of ffmpeg thumbnailing within http://imageresizing.net. A large percentage (~40%) of videos cause ffprobe to produce invalid xml.

comment:4 in reply to: ↑ 3 Changed 3 years ago by cehoyos

Replying to nathanaeljones:

I would be willing to fund development of a patch if it can be completed/integrated soon. Anyone interested?

First step would be to make this a reproducible ticket by providing a failing command line together with complete, uncut console output (current git head).

comment:5 in reply to: ↑ 3 Changed 3 years ago by saste

Replying to nathanaeljones:

I would be willing to fund development of a patch if it can be completed/integrated soon. Anyone interested?

This is currently blocking our integration of ffmpeg thumbnailing within http://imageresizing.net. A large percentage (~40%) of videos cause ffprobe to produce invalid xml.

I'm interested in this, but I need more samples. Also the problem seems more related to invalid input data and/or encoding, so it looks more like a libavformat problem. Data should be probably validated and sanitized at the demuxing level, before rendering is performed at the application level.

comment:6 in reply to: ↑ description Changed 3 years ago by saste

Replying to Ian:

ffprobe can output invalid XML as xml_escape_str only handles < > ' " and &. For example most escape characters below 32 are invalid UTF-8.

Useful samples are in tickets #2955.

comment:7 Changed 3 years ago by eelco

  • Cc eml+ffmpeg@tupil.com added

comment:8 Changed 3 years ago by saste

  • Analyzed by developer set
  • Keywords utf8 added
  • Reproduced by developer set
  • Status changed from new to open
  • Version changed from 0.10.2 to unspecified

Please test the patches in attachment and report.

comment:9 Changed 3 years ago by mente

Applied on latest version: works like a charm. Thanks @saste!

comment:10 follow-up: Changed 3 years ago by mente

Hm, not so good as I thought. On corrupted files I got segmentation fault.

comment:11 in reply to: ↑ 10 Changed 3 years ago by saste

Replying to mente:

Hm, not so good as I thought. On corrupted files I got segmentation fault.

Check new patchset, in case of error please provide the sample.

comment:12 Changed 3 years ago by saste

  • Resolution set to fixed
  • Status changed from open to closed

I believed this should be fixed in:

commit cbba331aa02f29870581ff0b7ded7477b279ae2c
Author: Stefano Sabatini <stefasab@gmail.com>
Date:   Wed Oct 2 16:22:17 2013 +0200

    ffprobe: implement string validation setting
    
    This should fix trac tickets #1163, #2502.

Feel free to test and reopen the ticket in case of issues.

comment:13 follow-up: Changed 2 years ago by eelco

  • Analyzed by developer unset
  • Reproduced by developer unset
  • Resolution fixed deleted
  • Status changed from closed to reopened
  • Version changed from unspecified to git-master

Uploaded 1163-utf8bug.rm to incoming.

ffprobe -print_format xml -show_format 1163-utf8bug.rm > 1163-utf8bug.xml
ffprobe version N-63665-g921d5ae Copyright (c) 2007-2014 the FFmpeg developers
  built on Jun  1 2014 12:14:19 with Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
  configuration: --prefix=/Users/eelco/Projects/Beamer/FFmpeg/build --disable-shared --enable-static --disable-ffserver --disable-doc --enable-gpl --extra-cflags='-DMACOSX_DEPLOYMENT_TARGET=10.7 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk -mmacosx-version-min=10.7 -I/Users/eelco/Projects/Beamer/FFmpeg/build/include' --extra-ldflags='-isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk -mmacosx-version-min=10.7 -L/Users/eelco/Projects/Beamer/FFmpeg/build/lib'
  libavutil      52. 88.100 / 52. 88.100
  libavcodec     55. 66.100 / 55. 66.100
  libavformat    55. 42.100 / 55. 42.100
  libavdevice    55. 13.101 / 55. 13.101
  libavfilter     4.  5.100 /  4.  5.100
  libswscale      2.  6.100 /  2.  6.100
  libswresample   0. 19.100 /  0. 19.100
  libpostproc    52.  3.100 / 52.  3.100
[rm @ 0x7f9ae1015800] Failed to fully read block
Input #0, rm, from '/Users/eelco/Desktop/1163-utf8bug.rm':
  Metadata:
    Audiences       : BatchRealProducer 's Audience;
    audioMode       : music
    Creation Date   : 2/14/2013 1:40:55
    Description     : This RealMedia File Compressed by Batch Real Producer
    Generated By    : Helix Producer SDK 10.0 for Windows, Build 10.0.0.335
    Modification Date: 2/14/2013 1:40:55
    videoMode       : sharp
    Keywords        : 
    File ID         : 31e4c7ec-29b7-4ffd-8ea5-40f11991751a
    title           : tsks
    author          : ????????TSKS??2014??
    copyright       : ??????ʹTSKS???????tskscn.com??
    comment         : 
  Duration: 00:59:01.08, start: 0.000000, bitrate: 0 kb/s
    Stream #0:0: Audio: cook (cook / 0x6B6F6F63), 44100 Hz, stereo, fltp, 44 kb/s
    Stream #0:1: Video: rv40 (RV40 / 0x30345652), yuv420p, 624x352, 536 kb/s, 29.97 fps, 29.97 tbr, 1k tbn, 1k tbc
[xml @ 0x7f9ae1012e00] 10 invalid UTF-8 sequence(s) found in string '????????TSKS??2014??', replaced with '�'
[xml @ 0x7f9ae1012e00] 18 invalid UTF-8 sequence(s) found in string '??????ʹTSKS???????tskscn.com??', replaced with '�'
xmllint 1163-utf8bug.xml 
1163-utf8bug.xml:14: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xC0 0xA1 0xEF 0xBF
        <tag key="author" value="�����??�TSKS��2014��"/>
                                                ^

Changed 2 years ago by cehoyos

comment:14 in reply to: ↑ 13 ; follow-up: Changed 2 years ago by saste

Replying to eelco:

Uploaded 1163-utf8bug.rm to incoming.

[...]

xmllint 1163-utf8bug.xml 
1163-utf8bug.xml:14: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xC0 0xA1 0xEF 0xBF
        <tag key="author" value="�����??�TSKS��2014��"/>
                                                ^

C0A1 is an overlong encoding for character ! (decimal number 33), using two bytes in place of one. Since this is not allowed we should probably fix the decoder to disallow it.

Last edited 2 years ago by saste (previous) (diff)

comment:15 in reply to: ↑ 14 Changed 2 years ago by saste

  • Analyzed by developer set
  • Component changed from ffprobe to avutil
  • Reproduced by developer set
  • Resolution set to fixed
  • Status changed from reopened to closed

Replying to saste:
[...]

C0A1 is an overlong encoding for character ! (decimal number 33), using two bytes in place of one. Since this is not allowed we should probably fix the decoder to disallow it.

Should be fixed in:

commit d4ec07dfe7dbc86e8f6403781c511b9463a526d2
Author: Stefano Sabatini <stefasab@gmail.com>
Date:   Thu Aug 28 17:37:27 2014 +0200

    lavu/avstring: check for overlong encodings in av_utf8_decode()
    
    Fix reopened trac ticket #1163.

comment:16 Changed 2 years ago by saste

  • Component changed from avutil to ffprobe
Note: See TracTickets for help on using tickets.