Opened 23 months ago

Last modified 6 months ago

#6006 reopened defect

FPS filter stutters when converting framerate

Reported by: Misaki
Owned by:
Priority: normal
Component: avfilter
Version: git-master
Keywords: fps
Cc:
Blocked By:
Blocking:
Reproduced by developer: yes
Analyzed by developer: no

Description

I am seeing the fps filter choose the wrong frames. YouTube videos sometimes show this problem in both mp4 (with x264) and webm (with vp9) formats, though usually in only one of the two. It can happen when the input video file is 60 fps and the user is watching a 30 fps version. The usual pattern is that one out of every three frames is wrong. If the source video actually had a duplicate frame every other frame, then when this happens, one out of three frames will be a duplicate of one of the other two, and a frame that was doubled in the 60 fps video is missing entirely from the 30 fps one.

I thought this was because YouTube was using an fps filter that selected the 'next' frame or something similar, because in recent years it also started handling variable-fps videos incorrectly. But I have now found the same behavior in ffmpeg and was able to replicate it.

The manual says that, out of the different frame selection methods, the fps filter chooses the nearest frame by default.

It's safe to say I don't understand exactly why this is happening. My first attempt to replicate it failed.

This command replicates it:

ffmpeg -filter_complex testsrc=rate=30,settb=1/1000,setpts=PTS+18,showinfo,fps=15,showinfo -t 0.2 -hide_banner -f null -

I used these time values only because they are very similar to those in the video where I encountered this problem.

Output from showinfo filters:

[Parsed_showinfo_3 @ 0x3c15980] n: 0 pts: 18 pts_time:0.018 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:88C4D19A plane_checksum:[88C4D19A] mean:[127] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3c15980] n: 1 pts: 51 pts_time:0.051 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:FD48FF60 plane_checksum:[FD48FF60] mean:[127] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3c15980] n: 2 pts: 85 pts_time:0.085 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:5BBC2F63 plane_checksum:[5BBC2F63] mean:[127] stdev:[125.7]
[Parsed_showinfo_5 @ 0x3c16d80] n: 0 pts: 0 pts_time:0 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:FD48FF60 plane_checksum:[FD48FF60] mean:[127] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3c15980] n: 3 pts: 118 pts_time:0.118 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:39A855A7 plane_checksum:[39A855A7] mean:[127] stdev:[125.7]
[Parsed_showinfo_5 @ 0x3c16d80] n: 1 pts: 1 pts_time:0.0666667 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:5BBC2F63 plane_checksum:[5BBC2F63] mean:[127] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3c15980] n: 4 pts: 151 pts_time:0.151 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:6B3D7C29 plane_checksum:[6B3D7C29] mean:[128] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3c15980] n: 5 pts: 185 pts_time:0.185 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:7E969AAD plane_checksum:[7E969AAD] mean:[128] stdev:[125.7]
[Parsed_showinfo_5 @ 0x3c16d80] n: 2 pts: 2 pts_time:0.133333 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:6B3D7C29 plane_checksum:[6B3D7C29] mean:[128] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3c15980] n: 6 pts: 218 pts_time:0.218 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:0930B896 plane_checksum:[0930B896] mean:[128] stdev:[125.6]
[Parsed_showinfo_3 @ 0x3c15980] n: 7 pts: 251 pts_time:0.251 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:CED6CEBF plane_checksum:[CED6CEBF] mean:[128] stdev:[125.6]
[Parsed_showinfo_3 @ 0x3c15980] n: 8 pts: 285 pts_time:0.285 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:3F25E3D1 plane_checksum:[3F25E3D1] mean:[128] stdev:[125.6]
[Parsed_showinfo_5 @ 0x3c16d80] n: 3 pts: 3 pts_time:0.2 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:CED6CEBF plane_checksum:[CED6CEBF] mean:[128] stdev:[125.6]

The result is slightly different from the original file I found this in, but it still shows the problem. The filter selects frames with n= 1, 2, 4, 7, instead of every other frame.
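
The input timestamps in the showinfo output above can be derived by hand, which makes the uneven frame spacing visible. The following is a minimal Python sketch, assuming that settb rescales each pts to the new timebase with round-to-nearest and that setpts then adds the offset in the new timebase units. The helper name `rescale_near` is made up for this illustration.

```python
import math
from fractions import Fraction

SRC_TB = Fraction(1, 30)    # testsrc=rate=30 emits pts 0, 1, 2, ... in 1/30 s units
DST_TB = Fraction(1, 1000)  # settb=1/1000
OFFSET = 18                 # setpts=PTS+18, in units of DST_TB

def rescale_near(pts, src_tb, dst_tb):
    """Rescale a non-negative pts between timebases, rounding to nearest."""
    return math.floor(pts * src_tb / dst_tb + Fraction(1, 2))

# Timestamps of the first nine input frames, as seen by the first showinfo.
input_pts = [rescale_near(n, SRC_TB, DST_TB) + OFFSET for n in range(9)]

# Distance between consecutive frames in milliseconds.
deltas = [b - a for a, b in zip(input_pts, input_pts[1:])]

print(input_pts)  # [18, 51, 85, 118, 151, 185, 218, 251, 285]
print(deltas)     # [33, 34, 33, 33, 34, 33, 33, 34]
```

The frames are 33 or 34 ms apart instead of a constant 33.33 ms, so any decision point that falls close to the midpoint between two frames can tip either way from one output frame to the next.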

This is somehow related to the offset. In the original file, this is apparently because the audio starts sooner. Removing the offset might fix the problem but cause slight desync issues. If YouTube is in fact using ffmpeg to process input, video streams not starting at 0 could be the reason it's sometimes bugged.

In the original file, I tried looking at the detailed output from -v trace, and finding the frame that was nearest to an interval based on the output framerate, but could not understand why a frame was being dropped when it was nearest to the interval's middle.

Change History (5)

comment:1 Changed 23 months ago by cehoyos

  • Keywords fps added

Please provide ffmpeg command line (without the hide_banner option) and complete, uncut console output to make this a valid ticket.

comment:2 Changed 23 months ago by Misaki

 software/ffmpeg/ffmpeg -filter_complex testsrc=rate=30,settb=1/1000,setpts=PTS+18,showinfo,fps=15,showinfo -t 0.2 -f null -
ffmpeg version N-82759-g1f5630a-static http://johnvansickle.com/ffmpeg/  Copyright (c) 2000-2016 the FFmpeg developers
  built with gcc 5.4.1 (Debian 5.4.1-4) 20161202
  configuration: --enable-gpl --enable-version3 --enable-static --disable-debug --disable-ffplay --disable-indev=sndio --disable-outdev=sndio --cc=gcc-5 --enable-fontconfig --enable-frei0r --enable-gnutls --enable-gray --enable-libass --enable-libfreetype --enable-libfribidi --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libopus --enable-librtmp --enable-libsoxr --enable-libspeex --enable-libtheora --enable-libvidstab --enable-libvo-amrwbenc --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxvid --enable-libzimg
  libavutil      55. 41.101 / 55. 41.101
  libavcodec     57. 66.109 / 57. 66.109
  libavformat    57. 58.101 / 57. 58.101
  libavdevice    57.  2.100 / 57.  2.100
  libavfilter     6. 68.100 /  6. 68.100
  libswscale      4.  3.101 /  4.  3.101
  libswresample   2.  4.100 /  2.  4.100
  libpostproc    54.  2.100 / 54.  2.100
[Parsed_showinfo_3 @ 0x3b6e940] config in time_base: 1/1000, frame_rate: 30/1
[Parsed_showinfo_3 @ 0x3b6e940] config out time_base: 0/0, frame_rate: 0/0
[Parsed_showinfo_5 @ 0x3b6fd40] config in time_base: 1/15, frame_rate: 15/1
[Parsed_showinfo_5 @ 0x3b6fd40] config out time_base: 0/0, frame_rate: 0/0
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf57.58.101
    Stream #0:0: Video: wrapped_avframe, rgb24, 320x240 [SAR 1:1 DAR 4:3], q=2-31, 200 kb/s, 15 fps, 15 tbn, 15 tbc (default)
    Metadata:
      encoder         : Lavc57.66.109 wrapped_avframe
Stream mapping:
  showinfo -> Stream #0:0 (wrapped_avframe)
Press [q] to stop, [?] for help
[Parsed_showinfo_3 @ 0x3b6e940] n:   0 pts:     18 pts_time:0.018   pos:       -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:88C4D19A plane_checksum:[88C4D19A] mean:[127] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3b6e940] n:   1 pts:     51 pts_time:0.051   pos:       -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:FD48FF60 plane_checksum:[FD48FF60] mean:[127] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3b6e940] n:   2 pts:     85 pts_time:0.085   pos:       -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:5BBC2F63 plane_checksum:[5BBC2F63] mean:[127] stdev:[125.7]
[Parsed_showinfo_5 @ 0x3b6fd40] n:   0 pts:      0 pts_time:0       pos:       -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:FD48FF60 plane_checksum:[FD48FF60] mean:[127] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3b6e940] n:   3 pts:    118 pts_time:0.118   pos:       -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:39A855A7 plane_checksum:[39A855A7] mean:[127] stdev:[125.7]
[Parsed_showinfo_5 @ 0x3b6fd40] n:   1 pts:      1 pts_time:0.0666667 pos:       -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:5BBC2F63 plane_checksum:[5BBC2F63] mean:[127] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3b6e940] n:   4 pts:    151 pts_time:0.151   pos:       -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:6B3D7C29 plane_checksum:[6B3D7C29] mean:[128] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3b6e940] n:   5 pts:    185 pts_time:0.185   pos:       -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:7E969AAD plane_checksum:[7E969AAD] mean:[128] stdev:[125.7]
[Parsed_showinfo_5 @ 0x3b6fd40] n:   2 pts:      2 pts_time:0.133333 pos:       -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:6B3D7C29 plane_checksum:[6B3D7C29] mean:[128] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3b6e940] n:   6 pts:    218 pts_time:0.218   pos:       -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:0930B896 plane_checksum:[0930B896] mean:[128] stdev:[125.6]
[Parsed_showinfo_3 @ 0x3b6e940] n:   7 pts:    251 pts_time:0.251   pos:       -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:CED6CEBF plane_checksum:[CED6CEBF] mean:[128] stdev:[125.6]
[Parsed_showinfo_3 @ 0x3b6e940] n:   8 pts:    285 pts_time:0.285   pos:       -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:3F25E3D1 plane_checksum:[3F25E3D1] mean:[128] stdev:[125.6]
[Parsed_showinfo_5 @ 0x3b6fd40] n:   3 pts:      3 pts_time:0.2     pos:       -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:CED6CEBF plane_checksum:[CED6CEBF] mean:[128] stdev:[125.6]
frame=    3 fps=0.0 q=-0.0 Lsize=N/A time=00:00:00.20 bitrate=N/A speed=   4x    
video:1kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
Last edited 23 months ago by Misaki

comment:3 Changed 8 months ago by cehoyos

  • Reproduced by developer set
  • Resolution set to fixed
  • Status changed from new to closed

Fixed by Calvin Walton in e4edc567a077d34f579d31ef0bfe164c7abfac4c

comment:4 Changed 6 months ago by Misaki

  • Resolution fixed deleted
  • Status changed from closed to reopened
  • Summary changed from FPS filter not selecting right frames from input to FPS filter stutters when converting framerate

Reproduced on latest git build from https://johnvansickle.com/ffmpeg/.

$  ./ffmpeg -filter_complex color=white:r=30,format=gray,fade=in:0:235,settb=1/1000,fps=15:start_time=0:round=near,showinfo -t 0.2  -f null -

ffmpeg version N-45834-ga12899ad9-static https://johnvansickle.com/ffmpeg/  Copyright (c) 2000-2018 the FFmpeg developers
  built with gcc 6.3.0 (Debian 6.3.0-18+deb9u1) 20170516
  configuration: --enable-gpl --enable-version3 --enable-static --disable-debug --disable-ffplay --disable-indev=sndio --disable-outdev=sndio --cc=gcc-6 --enable-libxml2 --enable-fontconfig --enable-frei0r --enable-gnutls --enable-gray --enable-libaom --enable-libfribidi --enable-libass --enable-libfreetype --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-librubberband --enable-libsoxr --enable-libspeex --enable-libvorbis --enable-libopus --enable-libtheora --enable-libvidstab --enable-libvo-amrwbenc --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg
  libavutil      56. 15.100 / 56. 15.100
  libavcodec     58. 19.100 / 58. 19.100
  libavformat    58. 13.100 / 58. 13.100
  libavdevice    58.  4.100 / 58.  4.100
  libavfilter     7. 19.100 /  7. 19.100
  libswscale      5.  2.100 /  5.  2.100
  libswresample   3.  2.100 /  3.  2.100
  libpostproc    55.  2.100 / 55.  2.100
Stream mapping:
  showinfo -> Stream #0:0 (wrapped_avframe)
Press [q] to stop, [?] for help
[Parsed_showinfo_5 @ 0x4f46940] config in time_base: 1/15, frame_rate: 15/1
[Parsed_showinfo_5 @ 0x4f46940] config out time_base: 0/0, frame_rate: 0/0
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf58.13.100
    Stream #0:0: Video: wrapped_avframe, rgb24, 320x240 [SAR 1:1 DAR 4:3], q=2-31, 200 kb/s, 15 fps, 15 tbn, 15 tbc (default)
    Metadata:
      encoder         : Lavc58.19.100 wrapped_avframe
[Parsed_showinfo_5 @ 0x4f46940] n:   0 pts:      0 pts_time:0       pos:       -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:79FA842D plane_checksum:[79FA842D] mean:[1] stdev:[0.0]
[Parsed_showinfo_5 @ 0x4f46940] n:   1 pts:      1 pts_time:0.0666667 pos:       -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:F3F40869 plane_checksum:[F3F40869] mean:[2] stdev:[0.0]
[Parsed_showinfo_5 @ 0x4f46940] n:   2 pts:      2 pts_time:0.133333 pos:       -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:E7F710D2 plane_checksum:[E7F710D2] mean:[4] stdev:[0.0]
[Parsed_showinfo_5 @ 0x4f46940] n:   3 pts:      3 pts_time:0.2     pos:       -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:56039D68 plane_checksum:[56039D68] mean:[7] stdev:[0.0]
frame=    3 fps=0.0 q=-0.0 Lsize=N/A time=00:00:00.20 bitrate=N/A speed=8.27x    
video:2kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown

Frame number (from 0) is the mean:[x] value. With round=near, it's [1,2,4,7]. With round=up and round=inf, it's [0,1,4,6]. With round=zero and round=down, it's [1,4,5,7]. With start_time=1 and '-t 1.2', it's [31,34,35,37]; however, with 'start_time=0.11:round=near' and '-t 0.31', it's [4,6,8,10].
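
These selections can be reproduced with a very simple model, sketched below in Python. It is only an assumption about the behavior, not the filter's actual code: each input timestamp is rescaled to the output timebase with the chosen rounding, and the last input frame that lands on an output tick is the one emitted. The names `to_tick` and `last_frame_per_tick` are made up for this illustration, and the input timestamps assume 30 fps frames rescaled to a 1/1000 timebase with start_time=0.

```python
import math
from fractions import Fraction

# 30 fps frame times rounded to a 1/1000 timebase (frames 0..8).
IN_PTS_MS = [0, 33, 67, 100, 133, 167, 200, 233, 267]

def to_tick(pts_ms, out_fps, mode):
    """Map an input timestamp in ms to an output frame index."""
    exact = Fraction(pts_ms * out_fps, 1000)
    if mode == "near":
        # Assumption: ties round up, as with rounding half away from zero.
        return math.floor(exact + Fraction(1, 2))
    if mode == "up":    # same as round=inf for non-negative timestamps
        return math.ceil(exact)
    if mode == "down":  # same as round=zero for non-negative timestamps
        return math.floor(exact)
    raise ValueError(mode)

def last_frame_per_tick(in_pts_ms, out_fps, mode, n_ticks):
    """For each output tick, return the last input frame that mapped to it."""
    chosen = {}
    for n, pts in enumerate(in_pts_ms):
        chosen[to_tick(pts, out_fps, mode)] = n  # later frames overwrite earlier ones
    return [chosen.get(k) for k in range(n_ticks)]

for mode in ("near", "up", "down"):
    print(mode, last_frame_per_tick(IN_PTS_MS, 15, mode, 4))
# near [1, 2, 4, 7]
# up [0, 1, 4, 6]
# down [1, 4, 5, 7]
```

Under this model the stutter comes from input timestamps such as 33 ms and 100 ms, which sit at 0.495 and exactly 1.5 output ticks, so the 1 ms timebase rounding decides which output frame they fall into.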

This report would be much better if I offered a patch, but I don't have the expertise.

When framerate is halved, the fps filter uses an interval or point that's midway between the start point and the next input frame. I don't know how it's done mathematically or in the code, but this is how it works. Due to timebase rounding, sometimes the next frame is closer, and sometimes the previous one is.

With input framerate = output framerate, the fps filter will still duplicate and drop frames with some rounding methods; in this case, 'fps=30:start_time=0:round=down' leads to [1,1,2,4,4,5,7].
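
The same-rate result can also be reproduced with a small Python sketch. It assumes, as a guess at the behavior rather than the filter's real code, that each input timestamp is rounded down to an output tick, that the last frame landing on a tick is emitted, and that a tick with no frame repeats the previous output. The function name `select_round_down` is hypothetical.

```python
# 30 fps frame times rounded to a 1/1000 timebase (frames 0..8).
IN_PTS_MS = [0, 33, 67, 100, 133, 167, 200, 233, 267]

def select_round_down(in_pts_ms, out_fps, n_ticks):
    """Return the input frame index shown at each output tick (round=down model)."""
    last_on_tick = {}
    for n, pts in enumerate(in_pts_ms):
        # Integer floor division implements rounding down for non-negative pts.
        last_on_tick[pts * out_fps // 1000] = n

    out, prev = [], None
    for k in range(n_ticks):
        # A tick that received no input frame duplicates the previous output.
        prev = last_on_tick.get(k, prev)
        out.append(prev)
    return out

print(select_round_down(IN_PTS_MS, 30, 7))  # [1, 1, 2, 4, 4, 5, 7]
```

In this model, 33 ms maps to tick 0 (0.99 rounded down) and 67 ms maps to tick 2 (2.01 rounded down), so tick 1 receives no frame at all and frame 0 is overwritten.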

The frame duplication in the first case can be avoided by using a start_time offset that exceeds the variation in frames. But if YouTube can get this wrong, normal users can as well. At least one of the rounding methods should work without adjusting the start time, and it might as well be 'near'.

Conceptually, the interval should be centered on the first frame, not on the average between first and last frames of the input interval. So for input=30fps and output=30fps, the interval is [-1/60,1/60]. Frame at 0 is closest. For output=15fps, interval is [-1/30,1/30]. Frame at 0 is still closest. Second interval for 15 fps is [1/30,3/30], and the frame at pts=67:pts_time=0.067 is closest. There's one very obvious candidate instead of two very similar candidate frames.
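
A minimal Python sketch of this proposal follows. It is an illustration of the idea only, not a patch: for each output time k/fps it picks the input frame whose timestamp is nearest, which is equivalent to centering the interval on the output time. The function name `nearest_frame_per_tick` is made up, and ties go to the earlier frame here as an arbitrary choice.

```python
from fractions import Fraction

# 30 fps frame times rounded to a 1/1000 timebase (frames 0..8).
IN_PTS_MS = [0, 33, 67, 100, 133, 167, 200, 233, 267]

def nearest_frame_per_tick(in_pts_ms, out_fps, n_ticks):
    """For each output tick, choose the input frame closest to the output time."""
    chosen = []
    for k in range(n_ticks):
        target_ms = Fraction(k * 1000, out_fps)  # exact output time in ms
        # min() returns the first of several equally close frames.
        best = min(range(len(in_pts_ms)),
                   key=lambda n: abs(in_pts_ms[n] - target_ms))
        chosen.append(best)
    return chosen

print(nearest_frame_per_tick(IN_PTS_MS, 15, 4))  # [0, 2, 4, 6]
print(nearest_frame_per_tick(IN_PTS_MS, 30, 7))  # [0, 1, 2, 3, 4, 5, 6]
```

With the interval centered this way, the 1 ms timebase rounding is far from any decision boundary, so halving the framerate selects every other frame.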

This can still lead to problems if the input timebase isn't divisible by the framerate and the offset drifts gradually, or jumps suddenly, so that the decision point ends up equidistant from two frames. This could happen with variable framerate video that has been edited; when concatenating two videos with ffmpeg, which leads to odd offsets because the audio has a different length than the video (for example, the second clip being 0.01 sec early or late); or when converting ~59.97 fps video to 30 fps.

The muxers in ffmpeg, or ffmpeg itself, will add or drop a frame based on an acceptable offset, like 0.5 or 1.5 times the distance between frames based on framerate from -r [rate]. The fps filter could try to select the next input frame to use for output based on the previous one used, by attempting to moderate the high-frequency variations introduced by timebase rounding, but this would be a more extensive patch, and just making round=near work for halving framerate would be a good fix by itself.

Conceptually, as decision point goes from 'low' to 'high', no tracking of previous input frame used would lead to low frame used, then high-frequency 'noise' as the filter switches between low and high, then high frame used. With tracking/average done, the 'low' frame would be used a little bit longer until it switches to high. If time goes down for some reason, possibly a video source that has variable lag, the transition will be delayed again. It's easy to set the default to a value that is unnoticeable for these edge cases but exceeds variation from timebase rounding. The choice of a timebase of 1/1000 for .webm videos is based on the assumption that 1/1000 sec is unnoticeable.

Last edited 6 months ago by Misaki

comment:5 Changed 6 months ago by Misaki

Maybe people don't discuss design choices in comments here, and instead use other venues like IRC. But I feel out of place there, particularly since I don't know any programming languages, and have not found it to be useful.

Current details of fps filter, as far as I can gather from limited testing without understanding the code:
Filter gets start point from user or first frame. I don't understand this comment or the code that follows it:

+     * The dance with offsets is required to match the rounding behaviour of the
+     * previous version of the fps filter when using the start_time option. */

But anyway, having timestamps 0.04 sec later seems to cause the same output as using start_time=-0.04.

The input timestamps are rounded up or down to an output timestamp, as opposed to rounding the output timestamps to an input one.

The frame with the last input timestamp corresponding to an output timestamp is used for that output frame.

So with output r=10 and round=near, two input frames at 0.04 and 1.04 (1 fps with offset added), the output frame at 1.0 uses the second input frame. All frames before it duplicate the first frame.

With 100 fps input and 10 fps output, the out_frame at 1.0 uses in_frame at 1.04 for round=near; in_frame at 1.09 for round=down; in_frame at 1.0 for round=up.
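
Both examples are consistent with the behavior described above, as this Python sketch shows. It is a model built from my observations, not taken from the filter's source: every input time is rounded to an output tick, and the last frame on that tick wins. The names `to_tick` and `last_frame_on_tick` are hypothetical.

```python
import math
from fractions import Fraction

def to_tick(t, out_fps, mode):
    """Map an input time in seconds to an output frame index."""
    exact = t * out_fps
    if mode == "near":
        return math.floor(exact + Fraction(1, 2))  # assumption: ties round up
    if mode == "up":
        return math.ceil(exact)
    if mode == "down":
        return math.floor(exact)
    raise ValueError(mode)

def last_frame_on_tick(frame_times, out_fps, mode, tick):
    """Return the time of the last input frame that maps to the given tick."""
    hits = [t for t in frame_times if to_tick(t, out_fps, mode) == tick]
    return hits[-1] if hits else None

# 100 fps input, 10 fps output: which input frame is shown at output time 1.0?
frames_100fps = [Fraction(k, 100) for k in range(120)]
for mode in ("near", "down", "up"):
    print(mode, float(last_frame_on_tick(frames_100fps, 10, mode, 10)))
# near 1.04
# down 1.09
# up 1.0

# 1 fps input with a 0.04 s offset: the second frame lands on output time 1.0.
two_frames = [Fraction(4, 100), Fraction(104, 100)]
print(float(last_frame_on_tick(two_frames, 10, "near", 10)))  # 1.04
```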

One concern for design and use is whether the video remains synchronized. With the default round=near, the frame that's displayed for [1,1.1> is an average of what's happening during that time. At least one media player, totem, shows the upcoming frame if you pause it, which may or may not be a bug. If a 30-fps video is converted to 60 fps by duplicating frames and the user is processing the 60-fps version, then the second of each frame could be slightly higher quality, though I suppose it could be the opposite if the first displayed frame is a B-frame. I would tend to say that it's better for content to be displayed late (or rather, "on time") than early, but this might be my biases like the way low-fps lag works in computer games. Low fps triggers an expectation of what the content should be, even if it isn't being updated.

But more important is whether there are unwanted duplicates or skipping of frames. Suppose a user wants the input frame at 1.0 to be displayed at 1.0 in the output. Currently they could use round=up, which (maybe counterintuitively) does this. But if it causes every third frame to jitter by being one input frame early, this filter option is not useful.

Out of the three options (for positive timestamps), round=near is in the middle. Output frames should be displayed around the input timestamps used for this method. The other two rounding methods can return frames that are either later or earlier than this time.

With the 100 fps to 10 fps example, either start_time=0:round=near would continue to select input frames at 0.04, 0.14, 0.24 etc. but change to display them at those same times by using a timebase lower than 1/fps (timestamp interval higher than 1), or use the frames at 0, 0.1, 0.2 etc. and display them at those times.

Currently, round=near appears to work like the other methods, by processing all input frames up to the next output frame's time interval then outputting the last one. Interval for 10 fps, 1.0 timestamp appears to be [0.95,1.05>, so if input fps is 30.1 with frames at 1.03 and 1.063, the frame at 1.03 will be used even though 1.063 is closer to 1.05. The input frames are associated with the nearest output frame; the output frame does not select the input frame that's closest.

I think this should be changed so the output frame at time X is, in fact, the input frame closest to X, but the other rounding methods also have this bias. If only 'near' is adjusted, then on average it might be slightly above the average of 'round=up' and 'round=down'.

But anyway, as an adjustment to current code, there would be a check to see if each input frame is closer to the output frame, instead of just using the last one. Then adjust the intervals for round={up,down} down by half a frame.

This only fixes jittering or stuttering for the simple case, as discussed above.
