Opened 8 years ago
Last modified 7 years ago
#6006 reopened defect
FPS filter stutters when converting framerate
| Reported by: | Misaki | Owned by: | |
|---|---|---|---|
| Priority: | normal | Component: | avfilter |
| Version: | git-master | Keywords: | fps |
| Cc: | | Blocked By: | |
| Blocking: | | Reproduced by developer: | yes |
| Analyzed by developer: | no | | |
Description
I am seeing the fps filter choose the wrong frames. YouTube videos sometimes demonstrate this problem in both mp4 (with x264) and webm (with vp9) formats, though usually in only one of the two. It can happen when the input video file is 60 fps and the user is watching a 30 fps version. The usual pattern is that one out of every 3 frames is wrong. If the source video actually had a duplicate frame every other frame, then when this happens, one out of three output frames is a duplicate of one of the other two, and a frame that was doubled in the 60 fps video is missing entirely from the 30 fps one.
I thought this was because YouTube was using an fps filter that selected the 'next' frame or something, because in recent years it also started handling variable-fps videos incorrectly. But I just found it, and was able to replicate it, in ffmpeg.
The manual says that, out of the different frame-selection methods, the fps filter chooses the nearest frame by default.
It's safe to say I don't understand exactly why this is happening. My first attempt to replicate it failed.
This command replicates it:
ffmpeg -filter_complex testsrc=rate=30,settb=1/1000,setpts=PTS+18,showinfo,fps=15,showinfo -t 0.2 -hide_banner -f null -
These time values were chosen simply because they're very similar to those in the video where I encountered this problem.
Output from showinfo filters:
[Parsed_showinfo_3 @ 0x3c15980] n: 0 pts: 18 pts_time:0.018 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:88C4D19A plane_checksum:[88C4D19A] mean:[127] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3c15980] n: 1 pts: 51 pts_time:0.051 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:FD48FF60 plane_checksum:[FD48FF60] mean:[127] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3c15980] n: 2 pts: 85 pts_time:0.085 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:5BBC2F63 plane_checksum:[5BBC2F63] mean:[127] stdev:[125.7]
[Parsed_showinfo_5 @ 0x3c16d80] n: 0 pts: 0 pts_time:0 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:FD48FF60 plane_checksum:[FD48FF60] mean:[127] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3c15980] n: 3 pts: 118 pts_time:0.118 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:39A855A7 plane_checksum:[39A855A7] mean:[127] stdev:[125.7]
[Parsed_showinfo_5 @ 0x3c16d80] n: 1 pts: 1 pts_time:0.0666667 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:5BBC2F63 plane_checksum:[5BBC2F63] mean:[127] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3c15980] n: 4 pts: 151 pts_time:0.151 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:6B3D7C29 plane_checksum:[6B3D7C29] mean:[128] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3c15980] n: 5 pts: 185 pts_time:0.185 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:7E969AAD plane_checksum:[7E969AAD] mean:[128] stdev:[125.7]
[Parsed_showinfo_5 @ 0x3c16d80] n: 2 pts: 2 pts_time:0.133333 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:6B3D7C29 plane_checksum:[6B3D7C29] mean:[128] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3c15980] n: 6 pts: 218 pts_time:0.218 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:0930B896 plane_checksum:[0930B896] mean:[128] stdev:[125.6]
[Parsed_showinfo_3 @ 0x3c15980] n: 7 pts: 251 pts_time:0.251 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:CED6CEBF plane_checksum:[CED6CEBF] mean:[128] stdev:[125.6]
[Parsed_showinfo_3 @ 0x3c15980] n: 8 pts: 285 pts_time:0.285 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:3F25E3D1 plane_checksum:[3F25E3D1] mean:[128] stdev:[125.6]
[Parsed_showinfo_5 @ 0x3c16d80] n: 3 pts: 3 pts_time:0.2 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:CED6CEBF plane_checksum:[CED6CEBF] mean:[128] stdev:[125.6]
The result is slightly different from the original file I found this in, but it still shows the problem: the filter selects the frames with n = 1, 2, 4, 7 instead of every other frame (0, 2, 4, 6).
This is somehow related to the offset. In the original file, this is apparently because the audio starts sooner. Removing the offset might fix the problem but cause slight desync issues. If YouTube is in fact using ffmpeg to process input, video streams not starting at 0 could be the reason it's sometimes bugged.
In the original file, I tried looking at the detailed output from -v trace, and finding the frame that was nearest to an interval based on the output framerate, but could not understand why a frame was being dropped when it was nearest to the interval's middle.
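For what it's worth, the bad selection above can be reproduced with a toy model. This is my own sketch of the apparent behaviour, not the actual fps.c code: snap each input pts to the nearest output frame index, and let the last input frame that lands on each index win.

```python
# Toy model (an assumption, not the actual fps.c code): snap each input pts
# to the nearest output frame index; the last frame landing on an index wins.
def pick_frames(pts_ms, out_fps=15, tb_den=1000):
    first = pts_ms[0]
    slots = {}
    for n, p in enumerate(pts_ms):
        # integer round-half-up of (p - first) * out_fps / tb_den
        idx = ((p - first) * out_fps + tb_den // 2) // tb_den
        slots[idx] = n  # a later frame on the same index replaces the earlier one
    return [slots[i] for i in sorted(slots)]

pts = [18, 51, 85, 118, 151, 185, 218, 251, 285]  # from the showinfo log above
print(pick_frames(pts)[:4])  # [1, 2, 4, 7] -- matches the bad output
```

This reproduces the n = 1, 2, 4, 7 selection exactly, which suggests the 18 ms offset pushes the half-way rounding points back and forth across the 33/34 ms input grid.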
Change History (5)
comment:1 by , 8 years ago
| Keywords: | fps added |
|---|---|
comment:2 by , 8 years ago
software/ffmpeg/ffmpeg -filter_complex testsrc=rate=30,settb=1/1000,setpts=PTS+18,showinfo,fps=15,showinfo -t 0.2 -f null -
ffmpeg version N-82759-g1f5630a-static http://johnvansickle.com/ffmpeg/ Copyright (c) 2000-2016 the FFmpeg developers
built with gcc 5.4.1 (Debian 5.4.1-4) 20161202
configuration: --enable-gpl --enable-version3 --enable-static --disable-debug --disable-ffplay --disable-indev=sndio --disable-outdev=sndio --cc=gcc-5 --enable-fontconfig --enable-frei0r --enable-gnutls --enable-gray --enable-libass --enable-libfreetype --enable-libfribidi --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libopus --enable-librtmp --enable-libsoxr --enable-libspeex --enable-libtheora --enable-libvidstab --enable-libvo-amrwbenc --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxvid --enable-libzimg
libavutil 55. 41.101 / 55. 41.101
libavcodec 57. 66.109 / 57. 66.109
libavformat 57. 58.101 / 57. 58.101
libavdevice 57. 2.100 / 57. 2.100
libavfilter 6. 68.100 / 6. 68.100
libswscale 4. 3.101 / 4. 3.101
libswresample 2. 4.100 / 2. 4.100
libpostproc 54. 2.100 / 54. 2.100
[Parsed_showinfo_3 @ 0x3b6e940] config in time_base: 1/1000, frame_rate: 30/1
[Parsed_showinfo_3 @ 0x3b6e940] config out time_base: 0/0, frame_rate: 0/0
[Parsed_showinfo_5 @ 0x3b6fd40] config in time_base: 1/15, frame_rate: 15/1
[Parsed_showinfo_5 @ 0x3b6fd40] config out time_base: 0/0, frame_rate: 0/0
Output #0, null, to 'pipe:':
Metadata:
encoder : Lavf57.58.101
Stream #0:0: Video: wrapped_avframe, rgb24, 320x240 [SAR 1:1 DAR 4:3], q=2-31, 200 kb/s, 15 fps, 15 tbn, 15 tbc (default)
Metadata:
encoder : Lavc57.66.109 wrapped_avframe
Stream mapping:
showinfo -> Stream #0:0 (wrapped_avframe)
Press [q] to stop, [?] for help
[Parsed_showinfo_3 @ 0x3b6e940] n: 0 pts: 18 pts_time:0.018 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:88C4D19A plane_checksum:[88C4D19A] mean:[127] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3b6e940] n: 1 pts: 51 pts_time:0.051 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:FD48FF60 plane_checksum:[FD48FF60] mean:[127] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3b6e940] n: 2 pts: 85 pts_time:0.085 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:5BBC2F63 plane_checksum:[5BBC2F63] mean:[127] stdev:[125.7]
[Parsed_showinfo_5 @ 0x3b6fd40] n: 0 pts: 0 pts_time:0 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:FD48FF60 plane_checksum:[FD48FF60] mean:[127] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3b6e940] n: 3 pts: 118 pts_time:0.118 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:39A855A7 plane_checksum:[39A855A7] mean:[127] stdev:[125.7]
[Parsed_showinfo_5 @ 0x3b6fd40] n: 1 pts: 1 pts_time:0.0666667 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:5BBC2F63 plane_checksum:[5BBC2F63] mean:[127] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3b6e940] n: 4 pts: 151 pts_time:0.151 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:6B3D7C29 plane_checksum:[6B3D7C29] mean:[128] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3b6e940] n: 5 pts: 185 pts_time:0.185 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:7E969AAD plane_checksum:[7E969AAD] mean:[128] stdev:[125.7]
[Parsed_showinfo_5 @ 0x3b6fd40] n: 2 pts: 2 pts_time:0.133333 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:6B3D7C29 plane_checksum:[6B3D7C29] mean:[128] stdev:[125.7]
[Parsed_showinfo_3 @ 0x3b6e940] n: 6 pts: 218 pts_time:0.218 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:0930B896 plane_checksum:[0930B896] mean:[128] stdev:[125.6]
[Parsed_showinfo_3 @ 0x3b6e940] n: 7 pts: 251 pts_time:0.251 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:CED6CEBF plane_checksum:[CED6CEBF] mean:[128] stdev:[125.6]
[Parsed_showinfo_3 @ 0x3b6e940] n: 8 pts: 285 pts_time:0.285 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:3F25E3D1 plane_checksum:[3F25E3D1] mean:[128] stdev:[125.6]
[Parsed_showinfo_5 @ 0x3b6fd40] n: 3 pts: 3 pts_time:0.2 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:CED6CEBF plane_checksum:[CED6CEBF] mean:[128] stdev:[125.6]
frame= 3 fps=0.0 q=-0.0 Lsize=N/A time=00:00:00.20 bitrate=N/A speed= 4x
video:1kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
comment:3 by , 7 years ago
| Reproduced by developer: | set |
|---|---|
| Resolution: | → fixed |
| Status: | new → closed |
Fixed by Calvin Walton in e4edc567a077d34f579d31ef0bfe164c7abfac4c
comment:4 by , 7 years ago
| Resolution: | fixed |
|---|---|
| Status: | closed → reopened |
| Summary: | FPS filter not selecting right frames from input → FPS filter stutters when converting framerate |
Reproduced on latest git build from https://johnvansickle.com/ffmpeg/.
$ ./ffmpeg -filter_complex color=white:r=30,format=gray,fade=in:0:235,settb=1/1000,fps=15:start_time=0:round=near,showinfo -t 0.2 -f null -
ffmpeg version N-45834-ga12899ad9-static https://johnvansickle.com/ffmpeg/ Copyright (c) 2000-2018 the FFmpeg developers
built with gcc 6.3.0 (Debian 6.3.0-18+deb9u1) 20170516
configuration: --enable-gpl --enable-version3 --enable-static --disable-debug --disable-ffplay --disable-indev=sndio --disable-outdev=sndio --cc=gcc-6 --enable-libxml2 --enable-fontconfig --enable-frei0r --enable-gnutls --enable-gray --enable-libaom --enable-libfribidi --enable-libass --enable-libfreetype --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-librubberband --enable-libsoxr --enable-libspeex --enable-libvorbis --enable-libopus --enable-libtheora --enable-libvidstab --enable-libvo-amrwbenc --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg
libavutil 56. 15.100 / 56. 15.100
libavcodec 58. 19.100 / 58. 19.100
libavformat 58. 13.100 / 58. 13.100
libavdevice 58. 4.100 / 58. 4.100
libavfilter 7. 19.100 / 7. 19.100
libswscale 5. 2.100 / 5. 2.100
libswresample 3. 2.100 / 3. 2.100
libpostproc 55. 2.100 / 55. 2.100
Stream mapping:
showinfo -> Stream #0:0 (wrapped_avframe)
Press [q] to stop, [?] for help
[Parsed_showinfo_5 @ 0x4f46940] config in time_base: 1/15, frame_rate: 15/1
[Parsed_showinfo_5 @ 0x4f46940] config out time_base: 0/0, frame_rate: 0/0
Output #0, null, to 'pipe:':
Metadata:
encoder : Lavf58.13.100
Stream #0:0: Video: wrapped_avframe, rgb24, 320x240 [SAR 1:1 DAR 4:3], q=2-31, 200 kb/s, 15 fps, 15 tbn, 15 tbc (default)
Metadata:
encoder : Lavc58.19.100 wrapped_avframe
[Parsed_showinfo_5 @ 0x4f46940] n: 0 pts: 0 pts_time:0 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:79FA842D plane_checksum:[79FA842D] mean:[1] stdev:[0.0]
[Parsed_showinfo_5 @ 0x4f46940] n: 1 pts: 1 pts_time:0.0666667 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:F3F40869 plane_checksum:[F3F40869] mean:[2] stdev:[0.0]
[Parsed_showinfo_5 @ 0x4f46940] n: 2 pts: 2 pts_time:0.133333 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:E7F710D2 plane_checksum:[E7F710D2] mean:[4] stdev:[0.0]
[Parsed_showinfo_5 @ 0x4f46940] n: 3 pts: 3 pts_time:0.2 pos: -1 fmt:rgb24 sar:1/1 s:320x240 i:P iskey:1 type:I checksum:56039D68 plane_checksum:[56039D68] mean:[7] stdev:[0.0]
frame= 3 fps=0.0 q=-0.0 Lsize=N/A time=00:00:00.20 bitrate=N/A speed=8.27x
video:2kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
The frame number (counting from 0) shows up as the mean:[x] value. With round=near, the selected frames are [1,2,4,7]. With round=up and round=inf, they're [0,1,4,6]. With round=zero and round=down, they're [1,4,5,7]. With start_time=1 and '-t 1.2', they're [31,34,35,37]; however, with 'start_time=0.11:round=near' and '-t 0.31', they're [4,6,8,10].
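The start_time=0 results for near/up/down can be reproduced with a small model (again my own sketch of the apparent behaviour, not the real code), assuming the 30 fps source has millisecond timestamps 0, 33, 67, 100, ... after settb=1/1000:

```python
def map_frames(pts_ms, mode, out_fps=15, tb_den=1000):
    """Toy model (not the actual fps.c code): snap input pts to output frame
    indices with start_time=0; the last input frame per index wins."""
    slots = {}
    for n, p in enumerate(pts_ms):
        x = p * out_fps
        if mode == "near":
            idx = (x + tb_den // 2) // tb_den   # round half away from zero
        elif mode == "up":
            idx = -(-x // tb_den)               # ceiling
        else:                                   # "down" / "zero" (positive pts)
            idx = x // tb_den                   # floor
        slots[idx] = n
    return [slots[i] for i in sorted(slots)][:4]

pts = [0, 33, 67, 100, 133, 167, 200, 233, 267]  # 30 fps source after settb=1/1000
print(map_frames(pts, "near"))  # [1, 2, 4, 7]
print(map_frames(pts, "up"))    # [0, 1, 4, 6]
print(map_frames(pts, "down"))  # [1, 4, 5, 7]
```

All three match the observed selections, so the jitter really does come from the 33/34 ms timestamps landing on alternating sides of the rounding boundaries. (The model does not cover the nonzero start_time cases.)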
This report would be much better if I offered a patch, but I don't have the expertise.
When the framerate is halved, the fps filter appears to use an interval or decision point midway between the start point and the next input frame. I don't know how it's done mathematically or in the code, but this is how it behaves. Due to timebase rounding, sometimes the next frame is closer, and sometimes the previous one is.
With input framerate = output framerate, the fps filter will still duplicate and drop frames with some rounding methods; in this case, 'fps=30:start_time=0:round=down' leads to [1,1,2,4,4,5,7].
The frame duplication in the first case can be avoided by using a start_time offset that exceeds the variation in frames. But if YouTube can get this wrong, normal users can as well. At least one of the rounding methods should work without adjusting the start time, and it might as well be 'near'.
Conceptually, the interval should be centered on the first frame, not on the average between first and last frames of the input interval. So for input=30fps and output=30fps, the interval is [-1/60,1/60]. Frame at 0 is closest. For output=15fps, interval is [-1/30,1/30]. Frame at 0 is still closest. Second interval for 15 fps is [1/30,3/30], and the frame at pts=67:pts_time=0.067 is closest. There's one very obvious candidate instead of two very similar candidate frames.
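A sketch of this proposed rule (my suggestion, not existing ffmpeg behaviour): for each output time on a grid centered on the first frame, pick the input frame whose timestamp is nearest.

```python
def nearest_frame(pts_ms, out_fps=15, n_out=4):
    # Proposed rule (not current ffmpeg behaviour): for each output time,
    # pick the input frame whose timestamp is nearest to it.
    first = pts_ms[0]
    picks = []
    for k in range(n_out):
        target = first + k * 1000 / out_fps  # output time in ms
        picks.append(min(range(len(pts_ms)),
                         key=lambda n: abs(pts_ms[n] - target)))
    return picks

pts = [18, 51, 85, 118, 151, 185, 218, 251, 285]
print(nearest_frame(pts))  # [0, 2, 4, 6] -- cleanly every other frame
```

On the same timestamps from the bug report, this picks every other frame with no ambiguity, since each output time sits within 0.5 ms of exactly one input frame.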
This can still lead to problems if the input timebase isn't divisible by the framerate and the offset gradually or suddenly jumps so that the decision point is equidistant from two frames. This could happen with variable-framerate video that has been edited; when concatenating two videos with ffmpeg, which leads to odd offsets because the audio has a different length than the video (like the second clip being 0.01 sec early or late); or when converting ~59.97 fps video to 30 fps.
The muxers in ffmpeg, or ffmpeg itself, will add or drop a frame based on an acceptable offset, like 0.5 or 1.5 times the frame interval implied by -r [rate]. The fps filter could try to select the next input frame based on the previous one used, smoothing out the high-frequency variations introduced by timebase rounding, but that would be a more extensive patch; just making round=near work for halving the framerate would be a good fix by itself.
Conceptually, as the decision point goes from 'low' to 'high', not tracking the previously used input frame leads to the low frame being used, then high-frequency 'noise' as the filter switches between low and high, then the high frame being used. With tracking/averaging, the 'low' frame would be used a little longer before switching to the high one. If time goes backwards for some reason (perhaps a video source with variable lag), the transition would be delayed again. It's easy to set the default threshold to a value that is unnoticeable in these edge cases but still exceeds the variation from timebase rounding. The choice of a 1/1000 timebase for .webm videos is based on the assumption that 1/1000 sec is unnoticeable.
comment:5 by , 7 years ago
Maybe people don't discuss design choices in comments here, and instead use other venues like IRC. But I feel out of place there, particularly since I don't know any programming languages, and have not found it to be useful.
Current details of fps filter, as far as I can gather from limited testing without understanding the code:
Filter gets start point from user or first frame. I don't understand this comment or the code that follows it:
+ * The dance with offsets is required to match the rounding behaviour of the
+ * previous version of the fps filter when using the start_time option. */
But anyway, having timestamps 0.04 sec later seems to cause the same output as using start_time=-0.04.
The input timestamps are rounded up or down to an output timestamp, as opposed to rounding the output timestamps to an input one.
Of the input frames whose timestamps round to a given output timestamp, the last one is used for that output frame.
So with output r=10 and round=near, two input frames at 0.04 and 1.04 (1 fps with offset added), the output frame at 1.0 uses the second input frame. All frames before it duplicate the first frame.
With 100 fps input and 10 fps output, the out_frame at 1.0 uses in_frame at 1.04 for round=near; in_frame at 1.09 for round=down; in_frame at 1.0 for round=up.
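These three cases can be checked with a toy model, assuming the "last input pts that rounds to the output tick wins" rule described above (timestamps in centiseconds to avoid float issues):

```python
def frame_for_tick(mode, tick=10, offset_cs=4, out_fps=10):
    # 100 fps input with a 0.04 s offset: pts at 0.04 s, 0.05 s, ..., 1.33 s.
    pts_cs = [offset_cs + n for n in range(130)]
    last = None
    for p in pts_cs:
        x = p * out_fps
        if mode == "near":
            idx = (x + 50) // 100    # round half away from zero
        elif mode == "up":
            idx = -(-x // 100)       # ceiling
        else:                        # "down"
            idx = x // 100           # floor
        if idx == tick:
            last = p                 # last input frame mapping to this tick wins
    return last / 100                # back to seconds

print(frame_for_tick("near"))  # 1.04
print(frame_for_tick("down"))  # 1.09
print(frame_for_tick("up"))    # 1.0
```

The model agrees with the behaviour reported above for all three rounding modes, assuming start_time is left at 0 so no offset is subtracted.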
One concern for design and use is whether the video remains synchronized. With the default round=near, the frame that's displayed for [1,1.1> is an average of what's happening during that time. At least one media player, totem, shows the upcoming frame when you pause it, which may or may not be a bug. If a 30-fps video is converted to 60 fps by duplicating frames and the user is processing the 60-fps version, then the second copy of each frame could be slightly higher quality, though I suppose it could be the opposite if the first displayed frame is a B-frame. I would tend to say it's better for content to be displayed late (or rather, "on time") than early, but this might be my bias, like the way low-fps lag works in computer games: low fps triggers an expectation of what the content should be, even if it isn't being updated.
But more important is whether there are unwanted duplicates or skipping of frames. Suppose a user wants the input frame at 1.0 to be displayed at 1.0 in the output. Currently they could use round=up, which (maybe counterintuitively) does this. But if it causes every third frame to jitter by being one input frame early, this filter option is not useful.
Out of the three options (for positive timestamps), round=near is in the middle. Output frames should be displayed around the input timestamps used for this method. The other two rounding methods can return frames that are either later or earlier than this time.
With the 100 fps to 10 fps example, either start_time=0:round=near would continue to select input frames at 0.04, 0.14, 0.24 etc. but change to display them at those same times by using a timebase lower than 1/fps (timestamp interval higher than 1), or use the frames at 0, 0.1, 0.2 etc. and display them at those times.
Currently, round=near appears to work like the other methods, by processing all input frames up to the next output frame's time interval then outputting the last one. Interval for 10 fps, 1.0 timestamp appears to be [0.95,1.05>, so if input fps is 30.1 with frames at 1.03 and 1.063, the frame at 1.03 will be used even though 1.063 is closer to 1.05. The input frames are associated with the nearest output frame; the output frame does not select the input frame that's closest.
I think this should be changed so the output frame at time X is, in fact, the input frame closest to X, but the other rounding methods also have this bias. If only 'near' is adjusted, then on average it might be slightly above the average of 'round=up' and 'round=down'.
But anyway, as an adjustment to current code, there would be a check to see if each input frame is closer to the output frame, instead of just using the last one. Then adjust the intervals for round={up,down} down by half a frame.
This only fixes jittering or stuttering for the simple case, as discussed above.
Please provide ffmpeg command line (without the hide_banner option) and complete, uncut console output to make this a valid ticket.